Keep HPC Running - an SRE's Guide to Supporting GPUs on Kubernetes - Christopher Dutra, JP Morgan

  • Published on 20 Sep 2024
    Operating a traditional Kubernetes cluster requires specific knowledge of telemetry, observability, and which conditions warrant human intervention to restore service. General (CPU-only) compute paths are well understood, but introducing GPUs into the node fleet adds challenges to "day 2" operational practices, and the way these resource pools are supported needs specific attention. HPC and AI use cases raise the bar further as customer expectations of Generative Pre-trained Transformers (GPTs), machine learning, and quantitative modeling continue to rise. This presentation offers best practices on which metrics SRE teams should incorporate into their operational tooling to support High-Performance Compute workloads on Kubernetes. As a working example, it explores custom plugin monitors for the Kubernetes node-problem-detector daemon that interact with NVIDIA's open-source DCGM and NVML bindings. It also reviews the metrics that NVIDIA's DCGM-Exporter exposes to Prometheus, highlighting their operational importance to the health of both the cluster and the workloads running on it.
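
To make the idea of a custom plugin monitor concrete, here is a minimal Python sketch of a health check that reads DCGM-Exporter output in Prometheus text format and maps it to the exit codes node-problem-detector expects from a custom plugin (0 for OK, 1 for a problem, 2 for unknown). It is not taken from the talk. The metric names `DCGM_FI_DEV_XID_ERRORS` and `DCGM_FI_DEV_GPU_TEMP` are real DCGM fields, but the function names, the 85 °C threshold, and the choice to treat any non-zero XID as a problem are assumptions made for illustration.

```python
import math
import re

# Exit codes follow the node-problem-detector custom plugin convention.
OK, NON_OK, UNKNOWN = 0, 1, 2

# One Prometheus text-format sample with labels: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        if math.isnan(value):
            continue  # no usable reading for this sample
        labels = dict(LABEL_RE.findall(match.group("labels")))
        yield match.group("name"), labels, value


def check_gpus(text, max_temp_c=85.0):
    """Return (exit_code, message). The threshold is an illustrative assumption."""
    problems = []
    seen = False
    for name, labels, value in parse_samples(text):
        gpu = labels.get("gpu", "?")
        if name == "DCGM_FI_DEV_XID_ERRORS":
            seen = True
            if value != 0:  # the gauge holds the last XID code seen
                problems.append(f"gpu{gpu} xid={int(value)}")
        elif name == "DCGM_FI_DEV_GPU_TEMP":
            seen = True
            if value > max_temp_c:
                problems.append(f"gpu{gpu} temp={value:.0f}C")
    if not seen:
        return UNKNOWN, "no DCGM samples found"
    if problems:
        return NON_OK, "; ".join(problems)
    return OK, "all GPUs healthy"
```

In an actual plugin, a wrapper would fetch the text from the exporter's metrics endpoint, print the message to stdout, and exit with the returned code. Keeping the evaluation logic in a pure function like `check_gpus` makes it testable without a GPU on hand. The message is kept short because node-problem-detector limits how much plugin output it records.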
