Mastering GPU Management in Kubernetes Using the Operator Pattern - Shiva Krishna Merla & Kevin Klues

  • Published Sep 20, 2024
  • Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon North America in Salt Lake City from November 12 - 15, 2024. Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at kubecon.io
    Mastering GPU Management in Kubernetes Using the Operator Pattern - Shiva Krishna Merla & Kevin Klues, NVIDIA
    Kubernetes is no longer just a tool for running workloads like web applications and microservices; it is the ideal platform for supporting the end-to-end lifecycle of large artificial intelligence (AI) and machine learning (ML) workloads, such as LLMs. GPUs have become the foundation of this workload shift. However, managing GPUs in a Kubernetes cluster requires full-stack knowledge, from the installation of kernel drivers to the setup of container runtimes, device plugins, and a monitoring stack. These activities can be broken down into four phases: (1) installation of the GPU software stack on a small cluster, (2) infrastructure build-out by adding more nodes, (3) lifecycle management and software updates, and (4) monitoring and error recovery. In this talk, we discuss leveraging the operator pattern for the lifecycle management of GPU software in K8s. We demo the NVIDIA GPU Operator to show how the operator pattern can benefit K8s admins, from basic driver installation to managing advanced AI/ML use cases.
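    For context on what the operator ultimately enables (this sketch is not from the talk): once the driver, container runtime hooks, and device plugin are in place, each node advertises its GPUs as the extended resource "nvidia.com/gpu", which pods request and any client can query. A minimal Go sketch, assuming client-go and a reachable kubeconfig at the default path, that lists how many GPUs each node advertises:

        // Sketch: query the "nvidia.com/gpu" extended resource that the
        // device plugin (managed here by the GPU Operator) advertises per node.
        package main

        import (
            "context"
            "fmt"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
        )

        func main() {
            // Load the local kubeconfig (assumes ~/.kube/config exists).
            config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
            if err != nil {
                panic(err)
            }
            clientset, err := kubernetes.NewForConfig(config)
            if err != nil {
                panic(err)
            }

            // List all nodes and print the allocatable GPU count on each.
            nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
            if err != nil {
                panic(err)
            }
            for _, node := range nodes.Items {
                gpus := node.Status.Allocatable["nvidia.com/gpu"]
                fmt.Printf("%s: %s allocatable GPUs\n", node.Name, gpus.String())
            }
        }

    A pod then consumes GPUs by setting a resource limit on "nvidia.com/gpu"; the scheduler places it only on nodes reporting a nonzero allocatable count.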

Comments • 2

  • @travnewmatic · 23 days ago

    So much inspiration here, thank you so much!

  • @fio_mak · 5 months ago

    Hi cncf,
    Where can I download the presentation file? Do you upload it anywhere?