Keynote: Accelerating AI Workloads with GPUs in Kubernetes - Kevin Klues & Sanjay Chatterjee

  • Published on Feb 7, 2025
  • Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon North America in Salt Lake City from November 12 - 15, 2024. Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at kubecon.io
    Keynote: Accelerating AI Workloads with GPUs in Kubernetes - Kevin Klues, Distinguished Engineer & Sanjay Chatterjee, Engineering Manager, NVIDIA
    As AI and machine learning become ubiquitous, GPU acceleration is essential for model training and inference at scale. However, effectively leveraging GPUs in Kubernetes brings challenges around efficiency, configuration, extensibility, and scalability.
    This talk provides an overview of the capabilities needed to address these challenges and to enable seamless support for next-generation AI applications on Kubernetes, including:
    • GPU resource-sharing mechanisms such as MPS (Multi-Process Service), time-slicing, MIG (Multi-Instance GPU), and GPU virtualization
    • Flexible accelerator configuration using the traditional device plugin and the upcoming Dynamic Resource Allocation (DRA) feature
    • Advanced scheduling and resource-management techniques, including gang scheduling, topology awareness, fault tolerance, and more
    • Key learnings (and areas for improvement) needed to scale multi-node AI/ML jobs in large production clusters
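The device-plugin path and time-slicing mentioned above can be sketched as follows. This is a minimal, hedged illustration, not the speakers' code. It assumes NVIDIA's k8s-device-plugin is installed and advertises GPUs under the extended resource name `nvidia.com/gpu`. The time-slicing config layout reflects our reading of that plugin's documented config format and should be checked against the version you deploy. The helper names (`gpu_pod_manifest`, `timeslicing_config`, `advertised_gpus`) and the image name are hypothetical.

```python
import json

# Extended resource name advertised by NVIDIA's k8s-device-plugin (assumed installed).
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Hypothetical helper: build a Pod that requests whole GPUs via the device plugin.

    Extended resources such as nvidia.com/gpu are integer-only and are set
    under resources.limits (requests, if given, must equal limits).
    """
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "resources": {"limits": {GPU_RESOURCE: gpus}},
                }
            ],
        },
    }


def timeslicing_config(replicas: int) -> dict:
    """Hypothetical helper: device-plugin config that time-slices each physical GPU.

    Assumed layout, based on the plugin's documented config file; verify it
    against the plugin version you actually run. This sketch rejects values
    below 2, since a single replica means no sharing at all.
    """
    if replicas < 2:
        raise ValueError("time-slicing needs at least 2 replicas per GPU")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": GPU_RESOURCE, "replicas": replicas}]
            }
        },
    }


def advertised_gpus(physical_gpus: int, replicas: int = 1) -> int:
    """With time-slicing, each physical GPU is advertised `replicas` times.

    Pods sharing a GPU this way get no memory or fault isolation from each
    other, unlike MIG partitions.
    """
    return physical_gpus * replicas


if __name__ == "__main__":
    # Placeholder image name; substitute your own CUDA workload.
    pod = gpu_pod_manifest("cuda-job", "example.com/cuda-app:latest", gpus=1)
    print(json.dumps(pod, indent=2))
    # 8 physical GPUs, each time-sliced 4 ways, appear as 32 schedulable units.
    print(advertised_gpus(8, 4))
```

The printed manifest is plain JSON, so it can be piped into `kubectl apply -f -`. The design point the sketch tries to show is that time-slicing only changes how many `nvidia.com/gpu` units the node advertises. The Pod spec itself is unchanged, which is part of why richer controls such as DRA are discussed in the talk.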
    Some of these capabilities are supported today; others are not. By addressing the remaining challenges, Kubernetes is poised to emerge as the go-to platform for accelerated AI/ML in the cloud, mirroring Linux's pervasive dominance in the datacenter.

Comments • 2

  • @luchen3414 · 9 months ago · +1

    A perfect overview of GPU with Kubernetes today. Thank you, Kevin and Sanjay.

  • @artemZinn · 3 months ago

    Welp, AMD is really behind on this; they urgently need a strong technical leader to execute on K8S integration.
    Great overview talk and capabilities.