Ray Scalability Deep Dive: The Journey to Support 4,000 Nodes

  • Published on 21 Sep 2024
  • In today's dynamic machine learning landscape, Ray has emerged as an essential platform, powering demanding tasks like training ChatGPT at OpenAI and processing terabytes of data every day at Amazon. This talk unveils Ray's pivotal role in addressing the exponential growth of modern ML workloads.
    We will take a deep dive into Ray's internal scalability, covering tasks, actors, objects, and nodes, and offering concrete examples to guide you in developing scalable code that maximizes Ray's potential.
    Furthermore, we will explore the latest post-Ray 2.0 enhancements to health checks, resource broadcasting, and asynchronous actor creation. Join us on this exciting journey as we discuss the challenges and opportunities of building an unprecedented 4,000-node cluster.
    Takeaways
    • Help the audience understand Ray's scalability and improvements after 2.0.
    Find the slide deck here: drive.google.c...
    About Anyscale
    ---
    Anyscale is the AI Application Platform for developing, running, and scaling AI.
    www.anyscale.com/
    If you're interested in a managed Ray service, check out:
    www.anyscale.c...
    About Ray
    ---
    Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
    docs.ray.io/en...
    #llm #machinelearning #ray #deeplearning #distributedsystems #python #genai

Comments • 2

  • @yaxiongzhao6640
    10 months ago +1

    Ray is relearning all the old lessons from the last iteration of distributed computing.
    Hopefully the implementation will be better this time.

  • @Dht1kna
    8 months ago

    Does the pull-based health check only occur if no resource updates have occurred recently? It seems like health checks should be mergeable with the incremental resource updates (with actual health check pulls done only after, say, 100 ms of silence from a node).