Developing and Serving RAG-Based LLM Applications in Production

  • Premiered Oct 11, 2023
  • There are a lot of different moving pieces when it comes to developing and serving LLM applications. This talk will provide a comprehensive guide for developing retrieval augmented generation (RAG) based LLM applications - with a focus on scale (embed, index, serve, etc.), evaluation (component-wise and overall) and production workflows. We’ll also explore more advanced topics such as hybrid routing to close the gap between OSS and closed LLMs.
    Takeaways:
    • Evaluating RAG-based LLM applications is crucial for identifying and productionizing the best configuration.
    • Developing your LLM application with scalable workloads involves minimal changes to existing code.
    • Mixture of Experts (MoE) routing allows you to close the gap between OSS and closed LLMs.
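The embed → index → serve flow these takeaways refer to can be sketched minimally. Everything below is a hypothetical illustration: a toy token-overlap "embedding" stands in for a real embedding model so the pipeline is runnable end to end, and the final LLM call is replaced by returning the assembled prompt.

```python
# Minimal RAG sketch: embed documents, index them, then serve a query.
# All names are hypothetical; token overlap stands in for real embeddings.

def embed(text: str) -> set[str]:
    # Stand-in for an embedding model: a bag of lowercase tokens.
    return set(text.lower().split())

def index(docs: list[str]) -> list[tuple[set[str], str]]:
    # "Vector store": pairs of (embedding, original chunk).
    return [(embed(d), d) for d in docs]

def retrieve(store, query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Rank chunks by token overlap with the query (cosine in a real system).
    ranked = sorted(store, key=lambda e: len(e[0] & q), reverse=True)
    return [doc for _, doc in ranked[:k]]

def serve(store, query: str) -> str:
    context = "\n".join(retrieve(store, query))
    # A real system would send this prompt to an LLM; we return the prompt.
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["Ray scales Python workloads.", "RAG retrieves relevant chunks.",
        "Postgres can store vectors."]
store = index(docs)
print(serve(store, "How does RAG retrieve chunks?"))
```

Each stand-in maps to a production concern from the talk: `embed` to an embedding model, `index` to a vector database, `serve` to the LLM-backed endpoint.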
    Find the slide deck here: drive.google.com/file/d/1ZnE9...
    About Anyscale
    ---
    Anyscale is the AI Application Platform for developing, running, and scaling AI.
    www.anyscale.com/
    If you're interested in a managed Ray service, check out:
    www.anyscale.com/signup/
    About Ray
    ---
    Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
    docs.ray.io/en/latest/
    #llm #machinelearning #ray #deeplearning #distributedsystems #python #genai

Comments • 11

  • @jzziesing
    @jzziesing 7 months ago +4

    I would love to see an hour long presentation on this!

  • @noosfera_It
    @noosfera_It 7 months ago

    amazing work! thank you

  • @TymonVideos
    @TymonVideos 6 months ago +2

    Really enjoyed this talk - found a lot of value in it. Both speakers are clearly so knowledgeable, and I love the extra little details the chap in the blue hoodie gave throughout. Would love to connect & share!

    • @TymonVideos
      @TymonVideos 6 months ago

      ...just realised the "chap in blue" is a co-founder! No disrespect meant :) awesome

  • @rohvir2615
    @rohvir2615 3 months ago +1

    goated video no cap

    • @mumcarpet109
      @mumcarpet109 2 months ago +1

      on god, we making out the hood with this one 💯

  • @junaidiqbal4104
    @junaidiqbal4104 7 months ago +2

    🎯 Key Takeaways for quick navigation:
    00:05 🚀 Initial Motivation and Project Start
    - Started building LLM applications to gain firsthand experience and improve user experience.
    - Developed a RAG application, focusing on making it easier for users to work with products.
    - Emphasized the importance of underlying documents and user questions in building such applications.
    01:31 🌐 Community Engagement and Insights
    - Encouraged sharing insights and experiences on building RAG-based applications.
    - Acknowledged the community's early stage and the value of diverse perspectives.
    - Welcomed external input to enrich the collective understanding of RAG applications.
    03:07 🧩 Experimentation with Data Chunking
    - Explored different strategies for efficient data chunking, moving beyond random chunking.
    - Utilized HTML document sections for precise references and better understanding of content.
    - Aimed for a generalizable template, potentially open-sourcing a solution for various HTML documents.
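The section-aware chunking idea above (splitting on HTML document sections rather than random windows, so each chunk maps back to a precise, linkable reference) can be sketched with a simple regex pass over heading tags. This is a hypothetical illustration; a real pipeline would use a proper HTML parser, and `chunk_by_sections` is an assumed name.

```python
import re

# Hedged sketch: split an HTML page on its <h1>-<h3> headings so each
# chunk carries the section title it came from as a precise reference.

def chunk_by_sections(html: str) -> list[dict]:
    # re.split with one capture group alternates:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r"<h[123][^>]*>(.*?)</h[123]>", html)
    chunks = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        text = re.sub(r"<[^>]+>", " ", body)  # strip remaining tags
        text = " ".join(text.split())         # normalize whitespace
        chunks.append({"section": heading, "text": text})
    return chunks

page = ("<h2>Install</h2><p>pip install ray</p>"
        "<h2>Usage</h2><p>import ray</p>")
print(chunk_by_sections(page))
```

Keeping the section title alongside the text is what makes it possible to cite exact documentation sections in generated answers.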
    05:14 🗃️ Vector Database and Technology Choices
    - Chose Postgres as the vector database, emphasizing familiarity and compatibility.
    - Highlighted the increasing number of specialized vector databases for LLM applications.
    - Advised selecting a database based on team familiarity but exploring new options for specific features.
    06:10 🔄 Retrieval Workflow and Database Query
    - Described the retrieval process, including embedding queries and calculating distances.
    - Discussed pros and cons of building Vector DB on Postgres versus using dedicated solutions.
    - Addressed potential limitations based on document scale and the flexibility of different databases.
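The retrieval workflow described here (embed the query, then rank stored chunk embeddings by distance) can be spelled out in plain Python. With pgvector on Postgres the ranking would typically be a single SQL query, e.g. `ORDER BY embedding <=> $1 LIMIT k` for cosine distance; the pure-Python version below just makes the math explicit. Names and the toy 2-D vectors are illustrative.

```python
import math

# Sketch of the retrieval step: rank stored (embedding, chunk) pairs
# by cosine distance to the query embedding and return the top k.

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest_chunks(query_emb, store, k=2):
    # store: list of (embedding, chunk_text) pairs.
    ranked = sorted(store, key=lambda e: cosine_distance(query_emb, e[0]))
    return [text for _, text in ranked[:k]]

store = [([1.0, 0.0], "chunk about Ray"),
         ([0.0, 1.0], "chunk about Postgres"),
         ([0.7, 0.7], "chunk about RAG")]
print(nearest_chunks([0.9, 0.1], store))
```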
    08:20 📏 Considerations for Context Size and Token Limits
    - Acknowledged token limits in LLM context windows and model-specific variations.
    - Encouraged experimenting with different chunk sizes, possibly using multiple embeddings for longer chunks.
    - Highlighted the importance of adapting to the LLM's limitations and exploring diverse experimental setups.
    09:29 🔍 Evaluation Metrics and Component-wise Assessment
    - Introduced the two major components for evaluation: retrieval workflow and LLM response quality.
    - Explained the evaluation process, including isolating each component for focused assessment.
    - Shared insights into the challenges and considerations of scoring LLM responses.
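The component-wise idea above can be made concrete for the retrieval half: score retrieval in isolation by checking, per evaluation question, whether the chunk that should answer it (its "gold" source) appears among the top-k retrieved chunks. The response-quality half would be scored separately, e.g. by a stronger evaluator model. The function and field names below are hypothetical.

```python
# Hedged sketch of isolated retrieval evaluation: fraction of questions
# whose gold source section shows up in the retrieved top-k.

def retrieval_score(eval_set: list[dict]) -> float:
    # eval_set entries: {"gold": source_id, "retrieved": [source_ids...]}
    hits = sum(1 for ex in eval_set if ex["gold"] in ex["retrieved"])
    return hits / len(eval_set)

eval_set = [
    {"gold": "doc1#install", "retrieved": ["doc1#install", "doc2#usage"]},
    {"gold": "doc3#api",     "retrieved": ["doc2#usage", "doc1#install"]},
    {"gold": "doc2#usage",   "retrieved": ["doc2#usage", "doc3#api"]},
]
print(retrieval_score(eval_set))  # 2 of 3 golds retrieved
```

Because this metric needs no LLM call, it is cheap enough to recompute for every chunking and embedding configuration.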
    11:32 📊 Evaluator Selection and Quality Assessment
    - Used GPT-4 as an evaluator based on empirical comparison and understanding of the application.
    - Discussed the limitations of available LLM models and potential biases in self-evaluation.
    - Advocated for iterative improvement and potential collaboration with external LLM development communities.
    15:13 📈 Iterative Evaluation and System Trust Building
    - Illustrated the iterative evaluation process, starting with trusting an evaluator.
    - Demonstrated the evaluation flow, using different configurations and trusting the chosen LLM's outputs.
    - Emphasized the importance of building trust in each component before assessing the overall system.
    17:04 ❄️ Cold Start Strategy and Bootstrapping
    - Presented a cold start strategy using chunked data to generate initial questions.
    - Addressed noise reduction by refining generated questions and encouraging creativity.
    - Described the bootstrapping cycle from clean slate to using generated data for further annotations.
    18:38 🔄 Continuous Learning and Evaluation Scaling
    - Responded to questions about the number of examples for cold start and overall evaluation.
    - Advocated for a balance of quantity and diversity in examples for comprehensive evaluations.
    - Stressed the importance of continuous learning, adaptation, and leveraging automated pipelines for scaling evaluations.
    19:49 📈 Chunk Size Impact on Retrieval and Quality
    - Retrieval score increases with chunk size but starts tapering off.
    - Quality continues to improve even as chunk sizes increase.
    - Code snippets benefit from longer context or special chunking logic.
    21:30 🧩 Number of Chunks and Context Size
    - Increasing the number of chunks improves retrieval and quality scores.
    - Larger context windows for LLMs show a positive trend.
    - Experimentation with techniques like RoPE for extending context.
    22:30 🛠️ Fixing Hyperparameters During Tuning
    - Fixing hyperparameters sequentially: context size, chunk size, embedding models.
    - Experimentation with spread and fixing parameters once optimized.
    - Illustrates a pragmatic approach to hyperparameter tuning.
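The sequential fixing described above amounts to coordinate-wise tuning: sweep one hyperparameter with the others held fixed, lock in the best value, then move on. The sketch below is hypothetical; `score` is a toy stand-in for running the full retrieval-plus-LLM evaluation, and the parameter values are illustrative.

```python
# Hedged sketch of sequential hyperparameter tuning: one sweep per
# parameter, fixing each best value before tuning the next.

def score(config: dict) -> float:
    # Toy objective; in practice this runs the full evaluation pipeline.
    best = {"chunk_size": 500, "num_chunks": 7, "embedding": "gte-base"}
    return sum(1.0 for k in best if config.get(k) == best[k])

def sequential_tune(search_space: dict) -> dict:
    config = {k: vals[0] for k, vals in search_space.items()}
    for param, values in search_space.items():
        # Sweep this parameter with previously tuned ones held fixed.
        config[param] = max(values, key=lambda v: score({**config, param: v}))
    return config

space = {"chunk_size": [100, 300, 500, 700],
         "num_chunks": [1, 3, 5, 7],
         "embedding": ["text-embedding-ada-002", "gte-base"]}
print(sequential_tune(space))
```

This explores far fewer configurations than a full grid search, at the cost of missing interactions between parameters, which matches the pragmatic trade-off the talk describes.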
    23:12 🏆 Model Selection and Benchmarking
    - The GTE base embedding model outperformed larger models on their use case.
    - Emphasizes the importance of evaluating models based on specific use cases.
    - Benchmarking against OpenAI's text embeddings and choosing a smaller, performant model.
    23:56 💰 Cost Analysis and Hybrid LLM Routing
    - Cost analysis comparing different LLMs.
    - Introduction of a hybrid LLM routing approach for cost-effectiveness.
    - Consideration of performance, cost, and hybrid routing for optimal results.
    25:10 🤖 Classifier vs. Language Model for Routing
    - Classifier used for routing decisions due to speed considerations.
    - Mention of training a classifier using a labeled dataset for routing.
    - Potential transition to LLM-based routing as LLM inference speed improves.
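The routing control flow can be sketched as follows. The talk describes a trained classifier over labeled queries; the keyword scorer below is a hypothetical stand-in for that classifier, chosen only so the example runs without training data. Model names are placeholders.

```python
# Hedged sketch of hybrid routing: a fast per-query decision sends easy
# queries to a cheaper OSS model and harder ones to a closed model.
# A real router would be a trained classifier, not a keyword check.

HARD_HINTS = {"why", "compare", "tradeoff", "design", "architecture"}

def route(query: str) -> str:
    tokens = set(query.lower().split())
    # Open-ended/analytical hints go to the stronger model.
    return "closed-llm" if tokens & HARD_HINTS else "oss-llm"

def answer(query: str) -> tuple[str, str]:
    model = route(query)
    # Placeholder for the actual model call.
    return model, f"[{model}] response to: {query}"

print(answer("What port does Ray use?"))
print(answer("Compare Ray Serve and vanilla Flask deployment tradeoffs"))
```

Because the router runs before any generation, its own latency has to be tiny, which is why the talk prefers a classifier over an LLM for this step.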
    27:17 🔄 Future Developments and System Integration
    - Integration of components into larger systems, citing Anyscale's doctor application.
    - Anticipation of more developments and applications in the future.
    - Acknowledgment of the importance of iteration in building robust systems.
    Made with HARPA AI

  • @victorhenriquecollasanta4740
    @victorhenriquecollasanta4740 4 months ago

    Top gs

  • @noosfera_It
    @noosfera_It 7 months ago +7

    when accents swap.

    • @charlesthompson8938
      @charlesthompson8938 3 months ago

      😂😂 and awesome content though.