🎯 Key Takeaways for quick navigation:
00:05 🚀 Initial Motivation and Project Start
- Started building LM applications to gain firsthand experience and improve user experience.
- Developed a RAG application, focusing on making it easier for users to work with products.
- Emphasized the importance of underlying documents and user questions in building such applications.
01:31 🌐 Community Engagement and Insights
- Encouraged sharing insights and experiences on building RAG-based applications.
- Acknowledged the community's early stage and the value of diverse perspectives.
- Welcomed external input to enrich the collective understanding of RAG applications.
03:07 🧩 Experimentation with Data Chunking
- Explored different strategies for efficient data chunking, moving beyond random chunking.
- Utilized HTML document sections for precise references and better understanding of content.
- Aimed for a generalizable template, potentially open-sourcing a solution for various HTML documents.
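A minimal sketch of section-based HTML chunking along these lines (an illustration, not the speakers' code; it assumes Sphinx-style pages where each <section> carries an id that can serve as an anchor):
```python
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def chunk_html_by_section(html_path: str, base_url: str) -> list[dict]:
    """Split one HTML page into per-section chunks with anchor-level source references."""
    soup = BeautifulSoup(Path(html_path).read_text(), "html.parser")
    chunks = []
    for section in soup.find_all("section"):
        text = section.get_text(separator=" ", strip=True)
        if not text:
            continue
        section_id = section.get("id")
        chunks.append({
            "text": text,
            # The anchor lets the app cite the exact part of the page, not just the page.
            "source": f"{base_url}#{section_id}" if section_id else base_url,
        })
    return chunks
```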
05:14 🗃️ Vector Database and Technology Choices
- Chose Postgres as the Vector database, emphasizing familiarity and compatibility.
- Highlighted the increasing options of specialized databases for LM applications.
- Advised selecting a database based on team familiarity but exploring new options for specific features.
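For reference, a Postgres-backed vector store along these lines can be set up with the pgvector extension; the connection string and schema below are assumptions for illustration, not the schema from the talk:
```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=rag user=postgres")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS document (
            id SERIAL PRIMARY KEY,
            text TEXT NOT NULL,
            source TEXT NOT NULL,
            embedding VECTOR(768)  -- must match the embedding model's output dimension
        );
    """)
```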
06:10 🔄 Retrieval Workflow and Database Query
- Described the retrieval process, including embedding queries and calculating distances.
- Discussed pros and cons of building a vector DB on Postgres versus using dedicated solutions.
- Addressed potential limitations based on document scale and the flexibility of different databases.
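A hedged sketch of that retrieval step, embedding the query and ranking stored chunks by distance (the model choice and SQL are assumptions; pgvector's <=> operator computes cosine distance):
```python
import psycopg2
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("thenlper/gte-base")  # the smaller embedding model mentioned later


def retrieve(query: str, k: int = 5) -> list[tuple[str, str]]:
    """Return the k (text, source) chunks closest to the query embedding."""
    query_embedding = embedder.encode(query)
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    conn = psycopg2.connect("dbname=rag user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT text, source FROM document ORDER BY embedding <=> %s::vector LIMIT %s;",
            (vector_literal, k),
        )
        return cur.fetchall()
```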
08:20 📏 Considerations for Context Size and Token Limits
- Acknowledged token limits in LM context and model-specific variations.
- Encouraged experimenting with different chunk sizes, possibly using multiple embeddings for longer chunks.
- Highlighted the importance of adapting to the LM's limitations and exploring diverse experimental setups.
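One way to realize the "multiple embeddings for longer chunks" idea (an assumption about the mechanics, not necessarily what the speakers did): split oversized chunks into overlapping windows and store one embedding per window, all pointing back to the same source.
```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")


def embed_long_chunk(text: str, window: int = 512, overlap: int = 50) -> list:
    """Embed a long chunk as overlapping windows (window size here is in words, not tokens)."""
    words = text.split()
    step = window - overlap
    windows = [" ".join(words[start:start + window]) for start in range(0, len(words), step)]
    return [embedder.encode(w) for w in windows]
```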
09:29 🔍 Evaluation Metrics and Component-wise Assessment
- Introduced the two major components for evaluation: retrieval workflow and LM response quality.
- Explained the evaluation process, including isolating each component for focused assessment.
- Shared insights into the challenges and considerations of scoring LM responses.
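The retrieval half of that evaluation can be scored on its own, independently of the LM, e.g. as a top-k hit rate over annotated (question, source) pairs; a rough sketch (field names are assumptions, and retrieve() refers to the retrieval sketch above):
```python
def retrieval_score(eval_set: list[dict], k: int = 5) -> float:
    """Fraction of questions whose gold source appears among the top-k retrieved chunks."""
    hits = 0
    for example in eval_set:
        retrieved_sources = [source for _, source in retrieve(example["question"], k=k)]
        if example["source"] in retrieved_sources:
            hits += 1
    return hits / len(eval_set)
```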
11:32 📊 Evaluator Selection and Quality Assessment
- Used GPT-4 as an evaluator based on empirical comparison and understanding of the application.
- Discussed the limitations of available LM models and potential biases in self-evaluation.
- Advocated for iterative improvement and potential collaboration with external LM development communities.
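A minimal LLM-as-judge sketch of what using GPT-4 as the evaluator could look like (the prompt wording and the 1-5 scale are assumptions; the talk only states that GPT-4 was chosen empirically):
```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge(question: str, reference_answer: str, generated_answer: str) -> str:
    """Ask GPT-4 to grade a generated answer against a reference answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Score the generated answer against the reference on a 1-5 scale, "
                        "then give a one-sentence justification."},
            {"role": "user",
             "content": f"Question: {question}\nReference: {reference_answer}\n"
                        f"Generated: {generated_answer}"},
        ],
    )
    return response.choices[0].message.content
```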
15:13 📈 Iterative Evaluation and System Trust Building
- Illustrated the iterative evaluation process, starting with trusting an evaluator.
- Demonstrated the evaluation flow, using different configurations and trusting the chosen LM's outputs.
- Emphasized the importance of building trust in each component before assessing the overall system.
17:04 ❄️ Cold Start Strategy and Bootstrapping
- Presented a cold start strategy using chunked data to generate initial questions.
- Addressed noise reduction by refining generated questions and encouraging creativity.
- Described the bootstrapping cycle from clean slate to using generated data for further annotations.
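A sketch of that cold-start bootstrapping: ask an LLM to write a question each chunk answers, giving (question, source) pairs to evaluate retrieval against before any real user data exists (prompt wording is an assumption; chunks refers to the chunking sketch above):
```python
from openai import OpenAI

client = OpenAI()


def synthesize_question(chunk_text: str) -> str:
    """Generate one question that the given passage answers."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": ("Write one specific question that the following passage answers. "
                        "Return only the question.\n\n" + chunk_text),
        }],
    )
    return response.choices[0].message.content.strip()


eval_set = [
    {"question": synthesize_question(chunk["text"]), "source": chunk["source"]}
    for chunk in chunks
]
```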
18:38 🔄 Continuous Learning and Evaluation Scaling
- Responded to questions about the number of examples for cold start and overall evaluation.
- Advocated for a balance of quantity and diversity in examples for comprehensive evaluations.
- Stressed the importance of continuous learning, adaptation, and leveraging automated pipelines for scaling evaluations.
19:49 📈 Chunk Size Impact on Retrieval and Quality
- Retrieval score increases with chunk size but starts tapering off.
- Quality continues to improve even as chunk sizes increase.
- Code snippets benefit from longer context or special chunking logic.
21:30 🧩 Number of Chunks and Context Size
- Increasing the number of chunks improves retrieval and quality scores.
- Larger context windows for LLMs show a positive trend.
- Experimentation with techniques like RoPE scaling for extending context.
22:30 🛠️ Fixing Hyperparameters During Tuning
- Fixing hyperparameters sequentially: context size, chunk size, embedding models.
- Experimented with a spread of values and fixed each parameter once it was optimized.
- Illustrates a pragmatic approach to hyperparameter tuning.
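That sequential approach might look roughly like the sketch below: sweep one knob at a time, lock in the best value, then move on (evaluate_pipeline is a hypothetical end-to-end scoring helper, and the candidate values are illustrative):
```python
def sweep(param_name: str, values: list, fixed: dict):
    """Score each candidate value with every other parameter held fixed; return the best one."""
    scores = {v: evaluate_pipeline(**{**fixed, param_name: v}) for v in values}
    return max(scores, key=scores.get)


config = {"chunk_size": 300, "num_chunks": 5}  # illustrative starting point
config["chunk_size"] = sweep("chunk_size", [100, 300, 500, 700], config)  # fix chunk size first
config["num_chunks"] = sweep("num_chunks", [1, 3, 5, 7, 9], config)       # then number of chunks
```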
23:12 🏆 Model Selection and Benchmarking
- The GTE-base embedding model outperformed larger embedding models on their use case.
- Emphasizes the importance of evaluating models based on specific use cases.
- Benchmarking against OpenAI's text embeddings and choosing a smaller, performant model.
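A hedged sketch of that benchmarking loop, scoring each candidate embedding model on the team's own eval set rather than a public leaderboard (rebuild_index is a hypothetical helper that re-embeds the corpus; retrieval_score and eval_set come from the earlier sketches):
```python
candidate_models = [
    "thenlper/gte-base",       # smaller open model that won on this use case
    "text-embedding-ada-002",  # OpenAI embedding baseline
]

for model_name in candidate_models:
    rebuild_index(model_name)  # hypothetical: re-embed all chunks and refresh the vector table
    score = retrieval_score(eval_set, k=5)
    print(f"{model_name}: retrieval score = {score:.3f}")
```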
23:56 💰 Cost Analysis and Hybrid LM Routing
- Cost analysis comparing different language models.
- Introduction of a hybrid LM routing approach for cost-effectiveness.
- Consideration of performance, cost, and hybrid routing for optimal results.
25:10 🤖 Classifier vs. Language Model for Routing
- Classifier used for routing decisions due to speed considerations.
- Mention of training a classifier using a labeled dataset for routing.
- Potential transition to LM-based routing as LM inference speed improves.
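An illustrative (not the speakers') version of such a router: a small classifier over query embeddings predicts whether a cheaper model's answer is likely good enough, with everything else routed to the stronger, more expensive model. The labeled examples and model names below are placeholders.
```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression  # pip install scikit-learn

embedder = SentenceTransformer("thenlper/gte-base")

# Hypothetical training data: 1 = the cheaper model's answer scored well enough, 0 = it did not.
labeled_queries = ["How do I start a cluster?", "Why does my actor deadlock under heavy load?"]
labels = [1, 0]

router = LogisticRegression().fit([embedder.encode(q) for q in labeled_queries], labels)


def route(query: str) -> str:
    """Pick which LLM should answer this query."""
    cheap_is_enough = router.predict([embedder.encode(query)])[0] == 1
    return "cheaper-oss-model" if cheap_is_enough else "gpt-4"
```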
27:17 🔄 Future Developments and System Integration
- Integration of components into larger systems, citing Anyscale's doctor application.
- Anticipation of more developments and applications in the future.
- Acknowledgment of the importance of iteration in building robust systems.
Made with HARPA AI
I would love to see an hour long presentation on this!
Really enjoyed this talk - found a lot of value in it. Both speakers are clearly so knowledgeable, and I love the extra little details the chap in the blue hoodie gave throughout. Would love to connect & share!
...just realised the "chap in blue" is a co-founder! No disrespect meant :) awesome
Great presentation. Just one question: What is relevance_score in this case? Is it an aggregation of grounding metrics for all reference examples?
goated video no cap
on god, we making out the hood with this one 💯
amazing work! thank you
How to protect a company's information with technology?
Top gs
when accents swap.
😂😂 and awesome content though.
great content! thanks a lot!