Anyscale
Optimizing LLM Inference with AWS Trainium, Ray, vLLM, and Anyscale
Webinar Details
Organizations are deploying LLMs for inference across many workloads. A common challenge that arises is how to scale and productionize these workloads cost effectively.
In this webinar with Anyscale and AWS, you will learn how to leverage AWS accelerator instances, including AWS Inferentia, to reliably serve LLMs at scale using vLLM and Ray, all hosted on Amazon EKS. You’ll also learn about Anyscale's performance and enterprise capabilities to enable your most ambitious LLM and GenAI inference workloads.
Join this session to learn more about:
-How to use AWS Inferentia accelerators for leading price-performance.
-Building a complete LLM inference stack using vLLM and Ray on EKS with AWS Inferentia.
-How to leverage AWS compute instances on Anyscale for optimized LLM inference.
-Anyscale’s managed enterprise LLM Inference offering with advanced cluster management optimizations, including dynamic auto-scaling, scale-to-zero, on-demand to spot, fault tolerance, zero downtime upgrades, and more.
Speakers
-Art Sedighi, Sr. Partner Solutions Architect
-Vara Bonthu, Principal OSS Specialist SA
-Akshay Malik, Engineering Manager
-Matt Connor, Product Manager
Is this webinar right for me?
This technical webinar is especially useful for AI Engineers who want to explore ways to operationalize generative AI models at scale while being cost efficient. It is also useful for Infrastructure Engineers who plan to support GenAI use cases and LLM Inference in their organizations.
Views: 297

Videos

Scalable and Cost Efficient AI Workloads with AWS and Anyscale
268 views · 1 month ago
Organizations are already making significant investments in the GenAI and LLM space. Here at Anyscale, we work closely with leading companies like OpenAI, Canva, and DoorDash to enable their ML workloads. A common challenge that arises is how to scale and productionize GenAI and LLM workloads cost-effectively. In this webinar with Anyscale and AWS, you will learn how to leverage cutting-edge ...
Anyscale Job Queues
147 views · 1 month ago
Newly available, Anyscale Job Queues enable multiple Ray Jobs to be executed on a shared cluster for batch “offline” workloads like data processing, model training, or batch inference. Job Queues make it easier than ever to streamline job scheduling and optimize resource allocation. Get started on Anyscale: console.anyscale.com
The Anyscale Unified Log Viewer
161 views · 2 months ago
With the Unified Log Viewer, access and search logs to debug and optimize Ray applications. The Anyscale Unified Log Viewer gives users continuous persistent access to logs, simplifies the user interface, and integrates a scalable centralized system to reduce complexity and setup time. Enhanced with searchable attributes like instance ID or task / actor ID, simplifying searching and resolving is...
Anyscale Replica Compaction
259 views · 2 months ago
Learn how Anyscale Replica Compaction increases utilization and lowers cost by avoiding resource fragmentation. Resource fragmentation occurs when scaling activities from online model serving and inferencing lead to uneven resource utilization across nodes. As models scale up, new nodes may be launched. When traffic decreases and models scale down, some nodes may become underutilized, increasi...
Fast and Scalable Model Training with PyTorch and Ray
498 views · 2 months ago
Organizations are making substantial investments in GenAI and LLMs, and Anyscale is at the forefront of this innovation. Our Virtual AI Tutorial Series introduces core concepts of modern AI applications, emphasizing large-scale computing, cost-effectiveness, and ML models. In this webinar, we focus on distributed model training with PyTorch and Ray. You'll learn how to migrate your code from pu...
End-to-End LLM Workflows with Anyscale
1.1K views · 2 months ago
Webinar to explore how a modern platform can support every stage of the AI app development lifecycle. Learn to build and scale end-to-end LLM workflows with Anyscale. Gain insight into the complete LLM lifecycle with fully runnable code, covering: (1) data processing, (2) model fine-tuning, (3) LLM evaluations and offline inference, and (4) online inference for production traffic. Blog post instructions to...
Meetup: Evaluating LLMs: Needle in a Haystack
1.3K views · 7 months ago
LLM evaluation is a discipline where confusion reigns and foundation model builders are effectively grading their own homework. Building on the viral threads on X/Twitter, Greg Kamradt, Robert Nishihara, and Jason Lopatecki discuss highlights from Arize AI's ongoing research on how major foundation models - from OpenAI’s GPT-4 to Mistral and Anthropic’s Claude - are stacking up against each ot...
Build a chat assistant fast using Canopy from Pinecone and Anyscale Endpoints
1.1K views · 9 months ago
This webinar will explore the challenges of building a chat assistant and how Canopy and Anyscale Endpoints provide the fastest and easiest way to build your RAG-based applications for free. We will go through the architecture, a real live example, and a guide on how to get started with building your own chat assistant. Canopy is a flexible framework built on top of the Pinecone vector database...
Elevate Your AI Applications with Anyscale and Ray: Simple, Scalable, Secure
1.1K views · 10 months ago
🚀 The AI Challenge: Explore the increasing scale and complexity needs in AI. 🌐 Anyscale Solutions: Introducing Anyscale Endpoints, Anyscale Private Endpoints, and the Anyscale Platform, each designed for different stages of AI adoption. 💡 Starting with Anyscale Endpoints: Learn how this API integrates popular AI models into your applications, offering customization and cost efficiency. 🛡️ Growi...
Ray Train: A Production-Ready Library for Distributed Deep Learning
2.5K views · 10 months ago
With the growing complexity of deep learning models and the emergence of Large Language Models (LLMs) and generative AI, scaling training efficiently and cost-effectively has become an urgent need. Enter Ray Train, a cutting-edge library designed specifically for seamless, production-ready distributed deep learning. In this talk, we will take a deep dive into the architecture of Ray Train, emph...
Gismo for Ray: A Multi-Node Shared Memory Object Store That Accelerates Ray Workloads
808 views · 11 months ago
Ray is a powerful distributed computing framework. However, as data sets grow and computation requirements become more complex, managing memory usage across multiple computing nodes becomes increasingly challenging. Issues that slow down performance include the data copying between the computing nodes, data spilling out of memory into storage, and the data skew among computing nodes. We'll intr...
How to simplify execution of cloud-native model training & validation with CodeFlare: A HandsOn Demo
327 views · 11 months ago
Join us for a hands-on demo of the CodeFlare-SDK, an open-source project that simplifies cloud-native data pre-processing, model training and validation with an intuitive Python interface to Ray, PyTorch/TorchX, and Kubernetes. With the CodeFlare-SDK, you can easily manage your cloud resources, submit jobs, and monitor job status, without worrying about the complexities of DevOps and cloud infr...
Building an Instant-On Serverless Platform for Large-Scale Data Processing Using Ray
376 views · 11 months ago
AWS Glue has been a pioneer in automating ETL processes by providing a fully managed serverless data integration service. This service is a simple and cost-effective way for customers to categorize their data, clean it, enrich it, and move it swiftly and reliably between various data stores. AWS Glue is made up of a Data Catalog (i.e., a metadata store), sophisticated ETL engines wi...
Developing and Serving RAG-Based LLM Applications in Production
20K views · 11 months ago
There are a lot of different moving pieces when it comes to developing and serving LLM applications. This talk will provide a comprehensive guide for developing retrieval augmented generation (RAG) based LLM applications - with a focus on scale (embed, index, serve, etc.), evaluation (component-wise and overall) and production workflows. We’ll also explore more advanced topics such as hybrid ro...
NLP And The Future of Search With You.com
1K views · 11 months ago
From Spark to Ray: An Exabyte-Scale Production Migration Case Study
2.2K views · 11 months ago
Ray Scalability Deep Dive: The Journey to Support 4,000 Nodes
990 views · 11 months ago
Ray Observability 2.0: How to Debug Your Ray Applications with New Observability Tooling
674 views · 11 months ago
Modernizing DoorDash Model Serving Platform with Ray Serve
1.3K views · 11 months ago
Deploying Many Models Efficiently with Ray Serve
4.1K views · 11 months ago
How Spotify Built a Robust Ray Platform with a Frictionless Developer Experience
728 views · 11 months ago
Scaling AI Health Assistants: Challenges and Solutions
250 views · 11 months ago
Forecasting Covid Infections for the UK's National Health Service using Ray and Kubernetes
162 views · 11 months ago
Supercharging self-driving algor dev w/ Ray: scaling sim workloads and democratizing autotuning@Zoox
240 views · 11 months ago
AI Factory Accelerating Solutions with Ray
481 views · 11 months ago
How Ray Empowered Ant Group to Deliver a Large-Scale Online Serverless Platform
253 views · 11 months ago
Python-centric AI Application Building in Minutes with Lepton and Ray
1.4K views · 11 months ago
On-Demand Ray Clusters in ML Workflows via KubeRay & Sematic
547 views · 11 months ago
Parallel inferencing with KServe Ray integration
1.1K views · 11 months ago

Comments

  • @PhilippWillms
    @PhilippWillms 58 minutes ago

    Inspiring talk how to bring RL into industrial practice, thanks for sharing!

  • @FitzGeraldMamie-d6f
    @FitzGeraldMamie-d6f 2 hours ago

    Lenora Isle

  • @MaryTaylor-d8r
    @MaryTaylor-d8r 2 days ago

    Ettie Road

  • @MadgePapiernik-c6d
    @MadgePapiernik-c6d 3 days ago

    Fae Harbors

  • @fenderbender28
    @fenderbender28 6 days ago

    Excellent talk

  • @WyattWayne-g8w
    @WyattWayne-g8w 9 days ago

    Magnus Ridges

  • @hugosonnery
    @hugosonnery 15 days ago

    Thank you very much for this !

  • @jeevanbeniwal3019
    @jeevanbeniwal3019 15 days ago

    this talk can't be more good. Thanks Hao!!!

  • @felicialynch35663
    @felicialynch35663 16 days ago

    This was a really insightful presentation. It made me think about how different tools approach information retrieval. I've been using Myko Assistant recently, and its deep internet search feature really helps me find accurate answers quickly, especially compared to some others like Perplexity.

  • @keithschaub7863
    @keithschaub7863 17 days ago

    is the PDF all text? Does it have images, tables, graphs? And if so, how well does it convert?

  • @gabrielpreciado5699
    @gabrielpreciado5699 19 days ago

    Impressive

  • @keshmesh123
    @keshmesh123 19 days ago

    It was great. Thank you!

  • @talfranji
    @talfranji 1 month ago

    The first code slide contains an error. Profing your code example when aiming at software engineers is important :)

  • @Simeon1337
    @Simeon1337 1 month ago

    Great vid

  • @MrEmbrance
    @MrEmbrance 1 month ago

    no thanks

  • @Mohsenghq
    @Mohsenghq 1 month ago

    Thanks for explaining simple, I migrated to RLlib and it's really efficient.

  • @ndamulelosbg8887
    @ndamulelosbg8887 1 month ago

    Great presentation. Just one question: What is relevance_score in this case? Is it an aggregation of grounding metrics for all reference examples?

  • @sherlockho4613
    @sherlockho4613 1 month ago

    very helpful and distinguish presentation!

  • @elephantum
    @elephantum 2 months ago

    It should be noted, that since this talk, Anyscale deprecated Ray LLM and now recommend vLLM

  • @mattgrant4143
    @mattgrant4143 2 months ago

    Hi! What are the chances we can get the source code for this demo!! Trying to learn about KubeRay myself, have a k8s cluster (with gpu and taints setup, but cant get kuberay to play nicely with them. my job just gets stuck w/ pending)

  • @tunglee4349
    @tunglee4349 2 months ago

    great content! thanks a lot!

  • @Emerson1
    @Emerson1 2 months ago

    great video, that's a lot of useful features !

  • @JavierTorres-st7gt
    @JavierTorres-st7gt 3 months ago

    How to protect a company's information with technology ?

  • @fantasyxpress7966
    @fantasyxpress7966 3 months ago

    Thanks but what about scanned pdfs any way to handle the exceptions

  • @ReflectionOcean
    @ReflectionOcean 3 months ago

    By YouSum Live
    00:00:11 Future of search innovation.
    00:01:17 Linear story vs. reality of innovation.
    00:02:37 Building solutions for personal challenges.
    00:05:28 Importance of personal product usage.
    00:06:02 Learning through personal experiences.
    00:09:19 Iterative improvement and momentum.
    00:10:34 Enhancing search with live knowledge integration.
    00:13:20 Introduction of Copilot for interactive browsing.
    00:18:06 Transition to a comprehensive research platform.
    00:19:49 Importance of orchestration in complex systems.
    00:20:41 Challenges in plugin and API integration reliability.
    00:21:21 Evaluation metrics for generative search engines.
    00:23:01 Continuous iteration and improvement for product success.
    00:23:22 Fusion of Wikipedia and chat for deep topic exploration.
    00:23:55 Development of faster and efficient llama models.
    00:24:42 Customization of models for improved performance.
    00:30:25 Balancing between open-source and proprietary models.
    00:31:19 Control over pricing and business perspective in model development.

  • @simbasrv30
    @simbasrv30 3 months ago

    If my models are unrelated and have no functional requirements to run together in a single application, can I still use Model composition in Ray serve to deploy multiple model in a single application providing a unified API endpoint (with different route for each model) for better resource utilisation and easier deployment? Is it a good practice? What about the security aspects and user authentications?

  • @Inceptionxg
    @Inceptionxg 3 months ago

    I love the way how he shared the story

  • @Karthikprath
    @Karthikprath 3 months ago

    How do we calculate memory used by kv cache in paged attention.Example for input 500 and output 1000

  • @ashrafamad-ds8er
    @ashrafamad-ds8er 4 months ago

    you are reading what is on the screen, do you think that you are "useful" ? or "informative "??! not at all

  • @TheAIEpiphany
    @TheAIEpiphany 4 months ago

    Great talk and amazing work guys!

  • @StoryTimeWithX
    @StoryTimeWithX 4 months ago

    Neat.

  • @vaporeon2822
    @vaporeon2822 4 months ago

    Interesting sharings. Curious about the underlying implementation for KV blocks sharing part you have a copy-on-write mechanism, but how does it avoid dirty-read condition, where both request reads that ref count is 2 and both request copies the block simultaneously.

  • @LiangyueLi
    @LiangyueLi 4 months ago

    great work

  • @ailabinsev
    @ailabinsev 4 months ago

    Great explanation!

  • @antonidabrowski4657
    @antonidabrowski4657 5 months ago

    Good content, thanks for your research

  • @dave_by_day7632
    @dave_by_day7632 5 months ago

    When I use ray.init() it starts both the dashboard and a job at the same time. Is there a way to start the dashboard, then a separate command to start individual jobs?

  • @AnnerdeJong
    @AnnerdeJong 5 months ago

    Considering the 300g ray vs spark comparison (~15m30s-18m30s) - the spark side seems to save all the prediction outputs (`...write...save()`), but I don't see that on the ray side (`for _ in ds.iter_batches(..): pass`). Does ray's `iter_batches()` automatically dump outputs somewhere? (e.g. when specifying `batch_format='pyarrow'` does it get automatically cached or sth in the ray object store, or sth similar?) If not - I'd argue it's not an entirely fair apples-to-apples comparison?

    • @fenderbender28
      @fenderbender28 5 months ago

      In his code, the spark writer is using format(“noop”) which means it’s not also persisting the outputs anywhere

  • @carrocesta
    @carrocesta 5 months ago

    I really like your approach, some libraries like keras are difficult to use

  • @arnony1
    @arnony1 5 months ago

    Excellent, very educating

  • @sennetor
    @sennetor 6 months ago

    First Impressions! So human. :)

  • @yukewang3164
    @yukewang3164 6 months ago

    awesome talk, with useful insights!

  • @rohvir2615
    @rohvir2615 7 months ago

    goated video no cap

    • @mumcarpet109
      @mumcarpet109 6 months ago

      on god, we making out the hood with this one 💯

  • @fabsync
    @fabsync 7 months ago

    Fantastic tutorial! It will be awesome if there is another tutorial on how to set this up locally for local development..

  • @fabsync
    @fabsync 7 months ago

    fantastic video! It will be great to see how can you integrate that for local development...Most tutorials from Anyscale are in the cloud but developers work first locally and then you take things to the cloud...it will be great to see a series on how you do that with docker or kubernetes..etc

  • @fabsync
    @fabsync 7 months ago

    great video and integration with kubernetes! It will be great how can you use that for local developement...

  • @Uofmdoc
    @Uofmdoc 7 months ago

    What a disjointed , rambling, ad disorganized talk. Here’s a tip. Tell them what you’re going to tell them, tell them, then tell them what you just told them.

    • @trenfa4371
      @trenfa4371 6 months ago

      use cc

  • @isaacemanuel4287
    @isaacemanuel4287 7 months ago

    This is exciting! As a software engineer with an interest in machine learning and AI models, I have been looking for a way to start working with the aforementioned models in a simple and more user-friendly/developer-friendly way. This product seems to be the answer to that, I look forward to checking it out and getting started with the software. Thank you for sharing!

  • @hemanthsethuram6740
    @hemanthsethuram6740 7 months ago

    Beautiful adaptation of a fundamental idea of paging, reference counting and copy -on-write.👌
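
Two questions in the comments above concern how much memory the KV cache uses under paged attention (for example, 500 input and 1,000 output tokens) and how shared blocks work with reference counting and copy-on-write. The following is a minimal Python sketch of both ideas, not vLLM's actual implementation. The model dimensions in `ModelShape`, the block size of 16, and the `BlockPool` class are hypothetical and chosen only for illustration. The toy pool is single-threaded, so it does not address concurrent writers.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, used only for illustration."""
    num_layers: int = 32
    num_kv_heads: int = 32
    head_dim: int = 128
    dtype_bytes: int = 2  # 2 bytes per element, as in fp16/bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Each token stores one key and one value vector per layer and KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.dtype_bytes


def paged_kv_bytes(num_tokens: int, shape: ModelShape, block_size: int = 16) -> int:
    # Paged allocation rounds a sequence up to a whole number of blocks.
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * block_size * kv_bytes_per_token(shape)


class BlockPool:
    """Toy reference-counted block pool with copy-on-write (assumed design)."""

    def __init__(self) -> None:
        self.ref_counts: dict[int, int] = {}
        self._next_id = 0

    def allocate(self) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.ref_counts[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        # A second sequence points at the same physical block.
        self.ref_counts[block_id] += 1
        return block_id

    def write(self, block_id: int) -> int:
        """Return the id of a block that is safe to write to."""
        if self.ref_counts[block_id] == 1:
            return block_id  # sole owner, write in place
        # Shared block: drop our reference and take a private block instead.
        # (A real system would also copy the block's contents here.)
        self.ref_counts[block_id] -= 1
        return self.allocate()


shape = ModelShape()
total = paged_kv_bytes(500 + 1000, shape)
print(f"{kv_bytes_per_token(shape)} bytes per token, {total / 2**20:.0f} MiB for 1500 tokens")

pool = BlockPool()
shared = pool.share(pool.allocate())  # two sequences reference block 0
private = pool.write(shared)          # the writer gets its own block
```

With these assumed dimensions, each token needs 524,288 bytes (0.5 MiB), and 1,500 tokens round up to 94 blocks of 16 slots, or 752 MiB for a single sequence. Real figures depend on the model's layer count, number of KV heads, head dimension, and data type.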