Google Colab Notebook: colab.research.google.com/drive/1C1Epju1lVkXTQi2jBq1njrOrmkfg0eQS?usp=sharing
Slides: www.canva.com/design/DAF8HpUMTLQ/HKyd4ajjIgCR2Y5tjyT8Eg/edit?DAF8HpUMTLQ&
These materials help us understand better before our class. I'm glad I found you guys!
You guys are the best when it comes to the intricacies of working with LLMs. Your presentations are simple and your explanations are cool and very easy to understand.
Great presentation as always, guys! Small note: those of us in the hearing-impaired community are very grateful to folks like Jithin who invest in a decent mic setup. People with normal hearing generally don't notice much difference between a MacBook mic and a dedicated one, but when your hearing is marginal it can make the difference between intelligible speech and mush that's hard for us to process. Thanks!
Glad to hear that we're coming through loud and clear @csmac3144a!
Great tool and commitment to serving GenAI dev community needs... thanks to all involved!
Thanks for the shoutout AI_by_AI! And we agree - the RAGAS team really built a great framework to accelerate RAG app development!
Great presentation
RAGAS starts at 20:15; before that is just an overview of LangChain and the RAG QA pipeline.
Thanks for the timestamp here MrTulufan!
Thanks for the video. It was really helpful. I was looking for how we can automate the RAG evaluation process.
Super nice. Thank you folks! It would be good to also discuss a bit more about the eval data: size, distribution, how closely it should mimic real data, etc., but awesome stuff nevertheless 🎉
Great feedback! We like the idea of deep-diving on the eval data generation piece 🤔... definitely a piece worth adding as we keep iterating on this content!
@@AI-Makerspace For sure! I have learned the hard way that upsampling my data with GPT-4 was great and gave me a highly accurate (according to the stats) model, but the generated quality was too good and completely out of the distribution of my messy data, and in production the model was not so great ;-/
Hi
I have a use case for text-to-SQL with RAG using LangChain. Is there any example or guide for evaluating the SQL result? Are the metrics the same as for regular text RAG? Thanks in advance.
Thank you sooo much, I learned a lot from this channel. I did my experiments and I was wondering how I can evaluate the RAG performance.
You'll want to create a dataset of question/answer/context triplets to evaluate through RAGAS!
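For anyone who wants a concrete starting point, here is a minimal sketch of what that dataset and the evaluate() call might look like. The column names and metric imports follow the RAGAS 0.1.x docs as we recall them, and the example rows are made up for illustration, so double-check against your installed version:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Each row pairs a question with your pipeline's answer, the retrieved
# contexts, and (optionally) a ground-truth reference answer.
eval_rows = {
    "question": ["What are the major changes in v0.1.0?"],
    "answer": ["v0.1.0 stabilizes the core API and splits integrations into partner packages."],
    "contexts": [[
        "LangChain v0.1.0 stabilizes the core abstractions and moves integrations into partner packages."
    ]],
    "ground_truth": ["The release stabilizes the core API and separates integrations into partner packages."],
}

dataset = Dataset.from_dict(eval_rows)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)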
Thank you for the video:)
At 00:24:15 you give a formula for faithfulness; I think it is a bit flawed. It should be (#claims from the answer which exist in the context) / (#claims in the answer). Otherwise the result could be > 1.
Can you be more specific about what the flaw is? Also, why do you choose the word "exist" rather than "inferred from"?
Here's what appears to be true from the documentation:
"To calculate this a set of claims from the generated answer is first identified. Then each one of these claims are cross checked with given context to determine if it can be inferred from given context or not."
Three steps to the calculation:
1. Break generated answer into statements
2. For each statement, verify if it can be inferred
3. Calculate Faithfulness!
It seems that the condition "if (and only if) it can be inferred from the context" will keep the faithfulness calculation from going higher than 1.0
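Spelled out, the formula as we read the docs is:

faithfulness = (number of claims in the generated answer that can be inferred from the given context) / (total number of claims in the generated answer)

Since the numerator counts a subset of the claims counted in the denominator, the score stays between 0 and 1; for example, 2 supported claims out of 3 gives 2/3 ≈ 0.67.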
@AI-Makerspace you might be right, but at the point referenced in the video, it talks about the context, not the generated answer. So a context like "Paris is a bustling French capital and center of culture and art" could contain 2-3 claims, but the answer to "what is the capital of France" may contain one claim, "Paris is the Capital of France". The faithfulness would be 3/1 in that case if they were not related to the golden truth answer. I may be missing something! Great video though, thanks!
Ah I get it now, duh - the element of the formula "number of claims that can be inferred from the given context" I was reading as the number of claims that can be inferred from the context alone. It's really the number of claims in the generated answer which can be inferred from the given context.
"It's really the number of claims in the generated answer which can be inferred from the given context."
Nice follow-up @Mark! Let's gooo!
We find it helpful to look directly at the prompted examples in the src code here: github.com/explodinggradients/ragas/blob/7d051437a1a5d8e9ad5c42252bf1debf51679140/src/ragas/metrics/_faithfulness.py#L52
You can see how FaithfulnessStatements turn into SentencesSimplified with an example and in general via the instruction given in NLIStatementPrompt as "Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context."
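In code terms, the per-statement verdicts then reduce to a simple average. Here is a simplified sketch of that aggregation (not the library's actual implementation; the verdicts would come from the NLI-style LLM judgment quoted above):

def faithfulness_score(verdicts):
    # Each verdict is 1 if the corresponding claim from the generated answer
    # can be directly inferred from the retrieved context, else 0.
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Example: answer broken into 3 claims, 2 of which are supported by the context.
print(faithfulness_score([1, 1, 0]))  # 0.666...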
I am getting an OpenAI key error 😢😢😢
How can I perform batch-wise RAGAS evaluation so that my evaluation time will decrease?
I'm not sure that is currently implemented, but I'll update the notebook when it is!
Does RAGAS work only with OpenAI models? Which models can I use for the test set generator as the critic and generative models? Please help me out.
You can use any LLM that has OpenAI API compatibility. This means most closed-source models, as well as open-source options through certain hosting strategies (NIMs, vLLM, etc.).
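As a quick sketch of what that looks like in practice, you can point the standard LangChain OpenAI client at your own OpenAI-compatible server. The endpoint URL, API key, and model name below are placeholders, not real values:

from langchain_openai import ChatOpenAI

# Hypothetical self-hosted endpoint, e.g. a vLLM or NIM server that speaks the OpenAI API.
local_llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",
    model="your-hosted-model",
)

print(local_llm.invoke("Say hello").content)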
@@AI-Makerspace I don't have an OpenAI API key.
Can I use models from Hugging Face?
Please help me out.
I want to generate the test set data using models other than Hugging Face,
like the generator model and the critic model.
I am using LlamaIndex and tried some example code, but it doesn't work. Is Ragas integrated with LlamaIndex? Ragas seems very promising for evaluation and I would really like to use it. The error I am getting is on the "from ragas.llama_index import evaluate" line; it cannot find the ragas.llamaindex module. I gave up for the time being, assuming that Ragas is not integrated with LlamaIndex.
I believe they are integrated, but they're making adjustments to their library very often (as is LlamaIndex). I would submit an issue on their repo with your specific error traceback!
Just curious: how was the test showing improvements in Faithfulness but a degradation in Correctness? Could you perhaps help me understand how that might be?
I would interpret the results as follows:
We're staying closer to our retrieved context, but we're straying away from the answer provided by the original ground-truth model.
I would want to test further to see why the responses are being marked as less "correct" and see which cases the pipeline failed on to provide more insight.
Hey, how can I generate a test set (and ground truth) with an open-source LLM? I'm using Mixtral. Please assist.
You'd want to just create a loop that generates those responses!
As for the ground truths - you'd need to generate those manually or use a larger language model.
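Here is a rough sketch of that loop, assuming Mixtral is served behind an OpenAI-compatible endpoint. The endpoint, model name, chunks, and prompts are placeholders, and a stronger model can be swapped in for the ground-truth step:

from langchain_openai import ChatOpenAI

generator_llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical Mixtral endpoint
    api_key="none",
    model="mixtral-8x7b-instruct",
)

document_chunks = [
    "LangChain v0.1.0 stabilizes the core abstractions and moves integrations into partner packages.",
]

test_set = []
for chunk in document_chunks:
    # Ask the model for a question grounded in the chunk, then an answer to use as ground truth.
    question = generator_llm.invoke(
        "Write one question that can be answered using only this passage:\n\n" + chunk
    ).content
    ground_truth = generator_llm.invoke(
        "Answer the question using only the passage.\n\nPassage:\n" + chunk + "\n\nQuestion: " + question
    ).content
    test_set.append({"question": question, "ground_truth": ground_truth, "contexts": [chunk]})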
What is the difference between ground truth and response? What does ground truth actually mean with respect to evaluation? @@AI-Makerspace
@@anuvratshukla7061 "Ground Truth" is simply referring to a label on LLM responses that label as the "Truth." In general, it would be ideal to have all of these "Truths" written, verified, and optimized by humans. Since this is hardly ever actually done, what's more common is that a more powerful LLM is used to generate the "Ground Truth" on which we can run these analyses. In our case here, GPT-4 is used to create the Ground Truths and GPT-3.5-turbo is used to generate Responses.
It's important to keep in mind that, at the end of the day, the initial absolute values are much less important than the change in these metrics as you make improvements to your system! In other words, Metrics-Driven Development doesn't require that your Ground Truth data is perfect to begin with!
@@AI-Makerspace Great thanks :)
Hello. It is a great job. Yesterday the code was working; today it gives an error on the line "response = retrieval_chain.invoke({"input": "What are the major changes in v0.1.0?"})". Can you tell me how to fix this one?
Could you provide your notebook so I can troubleshoot? I'm not running into that specific error on my end.
Can we use RAGAS without an OpenAI key?
If you set up LLMs and pass them as the Critic/Generator/etc., yes!
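As a sketch of that wiring (the wrapper class and keyword arguments below match the RAGAS versions we have used, but the library changes quickly, so verify against what you have installed; the endpoint and model name are placeholders):

from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Any LangChain chat model can act as the judge; no OpenAI key is required
# if it points at your own OpenAI-compatible endpoint.
judge_llm = LangchainLLMWrapper(
    ChatOpenAI(base_url="http://localhost:8000/v1", api_key="none", model="your-hosted-model")
)

dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the bustling capital of France and a center of culture and art."]],
})

result = evaluate(dataset, metrics=[faithfulness], llm=judge_llm)
print(result)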