Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast

  • Published on May 18, 2024
  • Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
    Huge thank you to @nebiusofficial for sponsoring this episode. Nebius AI - nebius.ai/
    MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.
    // Abstract
    The talk provides a gentle introduction to LLM checkpointing: why it is hard and how big the checkpoints are. It covers tips and tricks for saving and loading multi-terabyte checkpoints, as well as how to select cloud storage options for checkpointing.
    // Bio
    Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.
    // MLOps Jobs board
    mlops.pallet.xyz/jobs
    // MLOps Swag/Merch
    mlops-community.myshopify.com/
    // Related Links
    -------------- ✌️Connect With Us ✌️ ------------
    Join our slack community: go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: mlops.community/
    Connect with Demetrios on LinkedIn: / dpbrinkm
    Connect with Simon on LinkedIn: / simon-karasik
    Timestamps:
    [00:00] Simon's preferred beverage
    [01:23] Takeaways
    [04:22] Simon's tech background
    [08:42] Zombie models garbage collection
    [10:52] The road to LLMs
    [15:09] Trained models Simon worked on
    [16:26] LLM Checkpoints
    [20:36] Confidence in AI Training
    [22:07] Different Checkpoints
    [25:06] Checkpoint parts
    [29:05] Slurm vs Kubernetes
    [30:43] Storage choices lessons
    [36:02] Paramount components for setup
    [37:13] Argo workflows
    [39:49] Kubernetes node troubleshooting
    [42:35] Cloud virtual machines have pre-installed monitoring
    [45:41] Fine-tuning
    [48:16] Storage, networking, and complexity in network design
    [50:56] Start simple before advanced; consider model needs.
    [53:58] Join us at our first in-person conference on June 25 all about AI Quality
  • Science & Technology
