How to evaluate an LLM-powered RAG application automatically.

  • Published Mar 25, 2024
  • Source code of this example:
    github.com/svpino/llm/tree/ma...
    Giskard library: github.com/Giskard-AI/giskard
    I teach a live, interactive program that'll help you build production-ready machine learning systems from the ground up. Check it out here:
    www.ml.school
    To keep up with the content I create:
    • Twitter/X: / svpino
    • LinkedIn: / svpino
  • Science & Technology

Comments • 47

  • @TooyAshy-100
    @TooyAshy-100 3 months ago +4

    THANK YOU
    I greatly appreciate the release of the new videos. The clarity of the explanations and the logical sequence of the content are exceptional.

  • @aleksandarboshevski
    @aleksandarboshevski 3 months ago +9

    Hey Santiago! Just wanted to drop a comment to say that you're absolutely killing it as an instructor. Your way of breaking down the code and the whole process into simple, understandable language is pure gold, making it accessible for newcomers like me. Wishing you all the success and hoping you keep blessing the community with your valuable content!
    Aside from the teaching side, have you tried creating a micro-SaaS based on these technologies? It seems you're halfway there, and it could be a great opportunity to expand your business.

    • @underfitted
      @underfitted  3 months ago +1

      Thanks for taking the time and letting me know! I have not created any micro-SaaS applications, but you are right; that could be a great idea.

  • @mohammed333suliman
    @mohammed333suliman 3 months ago +3

    This is my first time watching your videos. It is great. Thank you.

  • @TheScott10012
    @TheScott10012 3 months ago +6

    FYI, keep an eye on the mic volume levels! Sounds like it was clipping

    • @underfitted
      @underfitted  3 months ago

      Thanks. You are right. Will adjust.

  • @liuyan8066
    @liuyan8066 3 months ago

    Glad to see you brought pytest in at the end; it's like a surprise dessert 🍰 after a great meal.

  • @TPH310
    @TPH310 3 months ago +1

    We appreciate your work a lot, my man.

  • @horyekhunley
    @horyekhunley 2 months ago +3

    Great stuff!
    What are your preferred open-source alternatives to the tools used in this tutorial?

  • @himrajdas4471
    @himrajdas4471 1 month ago

    Nobody makes content like this. You are a GURU in ML :). You're going to hit a million subs very fast, and I mean it, Santiago.

  • @maxnietzsche4843
    @maxnietzsche4843 22 days ago

    Damn, you explained each step really well! Love it!

  • @alextiger548
    @alextiger548 1 month ago

    Super important topic you covered here man!

  • @dikshantgupta5539
    @dikshantgupta5539 1 month ago +1

    Oh man, the way you explained these complex topics is mind-blowing. I just wanted to say thank you for making videos like this.

  • @proterotype
    @proterotype 2 months ago

    This is so well done

  • @arifkarim768
    @arifkarim768 29 days ago

    Explained amazingly.

  • @user-hh9do9fn1o
    @user-hh9do9fn1o 1 month ago +1

    Hello Santiago, your explanation was thorough and I understood it really well. Now I have a question: is there any tool other than Giskard (one that is open source and does not require an OpenAI API key) to evaluate my LLM or RAG model?
    Thank you in advance 😊

  • @theacesystem
    @theacesystem 3 months ago

    That's great. You rock!!!

  • @ergun_kocak
    @ergun_kocak 3 months ago

    This is gold ❤

  • @MohammadEskandari-do6xy
    @MohammadEskandari-do6xy 3 months ago +1

    Amazing! Can you also explain how to do the same type of evaluation on Vision Language Models that use images?

  • @sridharm4254
    @sridharm4254 1 month ago

    Very useful video. Thank you.

  • @CliveFernandesNZ
    @CliveFernandesNZ 3 months ago +1

    Great stuff, Santiago! You've used Giskard to create the test cases. These test cases are themselves created using an LLM. In a real application, would we have to manually vet the test cases to ensure they are 100% accurate?

  • @theacesystem
    @theacesystem 3 months ago +1

    Just awesome instruction, Santiago. I am a beginner, but you make learning digestible and clear! Sorry if this is an ignorant question, but is it possible to substitute FAISS, Postgres, MongoDB, Chroma DB, or another free, open-source option for Pinecone to save money? If so, which would you recommend for ease of implementation with LangChain?

    • @underfitted
      @underfitted  3 months ago

      Yes, you can! Any of them will work fine. FAISS is very popular.
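
      For reference, a rough sketch of what that substitution looks like in a LangChain pipeline. This is not code from the video; it assumes the langchain-community, langchain-openai, langchain-text-splitters, and faiss-cpu packages, and the file name is a placeholder:

      ```python
      from langchain_community.document_loaders import TextLoader
      from langchain_community.vectorstores import FAISS
      from langchain_openai import OpenAIEmbeddings
      from langchain_text_splitters import RecursiveCharacterTextSplitter

      # Load and chunk the source documents (the file name is a placeholder).
      documents = TextLoader("transcript.txt").load()
      splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
      chunks = splitter.split_documents(documents)

      # Build a local, in-memory FAISS index instead of a hosted Pinecone index.
      vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
      retriever = vectorstore.as_retriever()

      # The retriever plugs into the rest of the chain exactly like the Pinecone one.
      relevant_docs = retriever.invoke("What topics does the document cover?")
      ```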

  • @sabujghosh8474
    @sabujghosh8474 3 months ago

    It's awesome. We need more walkthroughs of working with open-source models.

  • @aliassim8774
    @aliassim8774 2 months ago

    Hey Santiago, thank you for this course, in which you explained all the concepts of RAG evaluation very clearly. However, I have a question about the reference answers. How were they generated, and based on what (is it an LLM)? If so, say we have a question that requires specific information that exists only in the knowledge base; how can another LLM generate such an answer? And how do we know the reference questions are correct and are what we are looking for? Thank you in advance.

  • @not_amanullah
    @not_amanullah 3 months ago

    Thanks ❤

  • @dhrroovv
    @dhrroovv 26 days ago +1

    Do we need a paid subscription to the OpenAI APIs to be able to use Giskard?

  • @caesarHQ
    @caesarHQ 3 months ago

    Hi, excellent tutorial; wouldn't anticipate any less. I ran your notebook with an open-source LLM, but generating the test set with giskard.rag still calls the OpenAI API (timestamp 19:11). Any workaround?

    • @underfitted
      @underfitted  3 months ago +1

      Giskard will always use GPT-4 regardless of the model you use in your RAG app.
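
      For reference, newer Giskard releases do expose a way to swap the model used for test-set generation. A hedged sketch, assuming a recent version where giskard.llm.set_llm_model accepts a LiteLLM-style model string and an Ollama server is running locally; check the Giskard docs for the exact configuration in your installed version:

      ```python
      import pandas as pd
      import giskard
      from giskard.rag import KnowledgeBase, generate_testset

      # Assumption: a recent Giskard release; older versions configured the OpenAI
      # client differently, so verify these calls against your installed version.
      giskard.llm.set_llm_model("ollama/mistral")                 # hypothetical local model
      giskard.llm.set_embedding_model("ollama/nomic-embed-text")  # may not exist in older releases

      # Build the knowledge base from the same chunks the RAG app indexes
      # (tiny literal example here; use your real document chunks).
      df = pd.DataFrame({"text": [
          "The course covers building and evaluating RAG applications.",
          "Evaluation relies on an automatically generated test set.",
      ]})
      knowledge_base = KnowledgeBase.from_pandas(df, columns=["text"])

      testset = generate_testset(
          knowledge_base,
          num_questions=10,
          agent_description="A chatbot answering questions about the indexed documents",
      )
      testset.save("testset.jsonl")
      ```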

  • @francescofisica4691
    @francescofisica4691 1 month ago +1

    How can I use Hugging Face LLMs to generate the test set?

  • @utkarshgaikwad2476
    @utkarshgaikwad2476 3 months ago +1

    Is it OK to use generative AI to test generative AI? What about the accuracy of Giskard? I'm not sure about this.

    • @underfitted
      @underfitted  3 months ago

      The accuracy is only as good as the model they use (which is GPT-4). And yes, this is how you can test the output of a model.

  • @gauravpratapsingh8840
    @gauravpratapsingh8840 27 days ago

    Hey, can you make a video that uses an open-source LLM to build a Q&A chatbot for a website page?

  • @maxisqt
    @maxisqt 3 months ago +3

    So the one thing you learn training ML models is that you don't evaluate your model on training data, and you have to be careful of data leaking. Here, you're providing Giskard your embedded documentation, meaning Giskard is likely using its own RAG system to generate test cases, which you then use to evaluate your own RAG system. Can you please explain how this isn't nonsense? Do you evaluate the accuracy of the Giskard test cases beyond the superficial “looks good to me” method that you claim to be replacing? What metrics do you evaluate Giskard's test cases against, since its answers are also subjective? You're just entrusting that subjective evaluation to another LLM.

    • @maxisqt
      @maxisqt 3 months ago

      Perhaps the purpose of testing in software development is different from ML testing: in software engineering you're ensuring that changes made to a system don't break existing functionality, while in ML you test on data your model hasn't trained on to prove it generalises to unseen, novel samples, since that's how it will have to perform in deployment. Maybe the tests you're doing here fit into the software engineering bucket, and therefore LLMs may be perfectly capable of auto-generating test cases; and since we aren't trying to test how well the generated material “generalises” (that doesn't make sense in this context), that's okay… I'm a little confused.

    • @maxisqt
      @maxisqt 3 months ago +1

      I'm new to gen AI (background in ML some years back); apologies if I come off hostile or jaded.

    • @mikaelhuss5080
      @mikaelhuss5080 3 months ago

      @maxisqt I think these are good questions, actually. Maybe the way to think about RAG, at least in the present scenario, is that it is really a type of information retrieval and there is no need to generalise, as you say - we just want to be able to find relevant information in a predefined set of documents.

    • @u4tiwasdead
      @u4tiwasdead 3 months ago

      The way that frameworks like Giskard try to solve the problem of evaluating LLMs/RAG using LLMs that are not necessarily better than the ones being evaluated is through the way the test sets are generated.
      To give one example, the framework might ask an LLM to generate a question-and-answer pair, then ask it to rephrase the question to make it harder to understand without changing its meaning or what the answer will be. It will then ask the LLM under test the harder version of the question and compare the result to the original answer. This can work even though their LLM is not necessarily more powerful than yours, because rephrasing an easy question into a hard one is an easier problem than interpreting the hard question.
      (A good analogy might be that a person can create puzzles that are hard to solve for much smarter people than themselves by starting from the solution and then creating the question.)
      Note that the test data does not need to be perfect; it just needs to be generally better than the outputs we will get from our models/pipelines. The point of these tools is not to evaluate whether the outputs we are getting are actually true, but simply whether they improve when we make changes to the pipeline.
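
      As a toy illustration of that rephrasing idea (not Giskard's actual implementation; the prompts and helper names are hypothetical, and it assumes the openai Python package with OPENAI_API_KEY set):

      ```python
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def ask(prompt: str) -> str:
          """Single-turn completion helper."""
          response = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": prompt}],
          )
          return response.choices[0].message.content

      def make_test_case(chunk: str) -> dict:
          """Generate an easy question/answer pair from a document chunk, then
          rephrase the question into a harder variant with the same answer."""
          question = ask(f"Write one factual question answerable only from this text:\n\n{chunk}")
          answer = ask(f"Answer the question using only this text.\n\nText:\n{chunk}\n\nQuestion:\n{question}")
          hard_question = ask(
              "Rephrase this question with more indirect wording so it is harder to "
              f"interpret, but keep the answer exactly the same:\n\n{question}"
          )
          return {"question": hard_question, "reference_answer": answer}

      def agrees(rag_answer: str, reference_answer: str) -> bool:
          """LLM-as-judge: does the pipeline's answer state the same facts as the reference?"""
          verdict = ask(
              "Do these two answers state the same facts? Reply YES or NO.\n\n"
              f"A: {rag_answer}\n\nB: {reference_answer}"
          )
          return verdict.strip().upper().startswith("YES")
      ```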

    • @trejohnson7677
      @trejohnson7677 3 months ago

      Ouroboros

  • @JonathanLoscalzo
    @JonathanLoscalzo 23 days ago

    I think all the "AI experts" in the wild just "explain" common concepts of AI/LLM systems. It would be nice to dig a bit more into other aspects, like evaluation (good choice). It would be interesting to have some relevant courses on that. I know it is the secret sauce, but it could be useful.
    BTW, are you teaching causal ML in your course?

    • @underfitted
      @underfitted  23 days ago +1

      I'm not teaching causal ML, no. The program focuses on ML engineering.

    • @JonathanLoscalzo
      @JonathanLoscalzo 23 days ago

      @underfitted I want to do it, but I don't have time. I hope there will be more cohorts in the near future.

  • @fintech1378
    @fintech1378 3 months ago +1

    How new is this Giskard?

  • @StoryWorld_Quiz
    @StoryWorld_Quiz 2 months ago

    How does the GPT instance that generates the questions and answers know the validity of those answers? If they are actually accurate, why would you build the RAG in the first place, if you can create a GPT instance that is accurate enough (using one simple prompt: 18:33, agent description)? I don't understand; can someone explain, please? Do you see the paradox here?

    • @mehmetbakideniz
      @mehmetbakideniz 2 months ago

      Because GPT-4 is quite expensive, you wouldn't want to use it in production if GPT-3.5 or any other open-source model does the job correctly. This library uses GPT-4, as the best available LLM, to produce the reference RAG answers. That is why they use it to build the test cases: to see whether, for your specific application, a cheaper or free open-source model is more or less okay.

  • @JTMoustache
    @JTMoustache 3 months ago

    Langchain sucks