Thank you Trelis!!! Awesome video as always, probably one of the best technical channels right now. Best
Definitely THE BEST channel for real technical insights and explanation. Thank you!!!🙏🏻
Started watching this on a Sunday afternoon after church and a heavy lunch . Looking forward to it😅
The fine-tuning code segment was very insightful. Great video, very well explained 😊
Appreciate it!
Really well done.
currently working on improving the simple rag pipeline learned a lot from this video thankyou.
Amazing, this video is a treasure! Thanks a lot for explaining in depth. Very great job!
you're welcome
This is the kind of video I have been looking out for sooo long!! Guess yt recommendations are better than yt search lol
really good tutorial. thank you very much.
You are the best. Cant wait to try this out over the weekend!
Amazing video, really the best explaination for the RAG pipeline I saw on YT. Great job!
Cheers
wow, been waiting on you for a week
Yeah, thanks for the wait. I took a week of holidays and then ended up making a much longer video than I had expected. I hope it helps that it goes all the way from start to finish on retrieval.
ayooo Trelis again. Love from India
Awesome video. Thank you.
best of the best 🤩
Hey Trelis, you may have only about 10k subs but I really do appreciate all your videos. I personally learn and benefit a lot from them, and I always recommend your videos to a friend of mine whenever a detailed explanation is needed. I do have some questions from this video, which is probably my favourite so far and one I'm trying to understand in every possible way.
1) How can we know if a model was trained using dot product or cosine?
2) Can you please explain whether the dot product used when calculating cosine similarity is the same as the dot product you were comparing against earlier. Also, could you give an example of normalizing, and could they standardize instead of normalize? I've always been confused about those terms and I'm not sure if they are related in this case. "In terms of computation power, it's quicker to do dot products because with cosine you're finding the angle between two vectors, which means you'd first do the dot product and then normalize."
3) Regarding the retrieval performance, is there any reason why you picked the top 12 chunks? Also, does that mean if I tried the top 20 chunks I could achieve near 100% accuracy?
Appreciate that.
1) Try to find the model on Hugging Face and look for a train.py script, or read the README, to find out whether cosine or dot product was used. Cosine is often the default if dot product is not mentioned.
2) Similarity can be calculated using either a) the dot product or b) cosine similarity. Cosine is a bit slower to calculate because the formula is equivalent to taking the dot product of the two vectors and then dividing by the product of their lengths (norms). That division by the lengths is the normalizing part; standardizing (subtracting a mean and dividing by a standard deviation) is a different operation and isn't what's being done here.
3) The number of chunks partly depends on context length, but there is also a trade-off with answer quality. Too many chunks and the model will start to hallucinate more because there is too much input information. Actually, I got 100% retrieval accuracy with 12 chunks, BUT you can see answer accuracy is higher (relative to retrieval) for smaller numbers of chunks.
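To make 2) concrete, here's a tiny sketch in plain numpy (just an illustration, not code from the video's repo):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# Dot product similarity: one multiply-add pass over the vectors.
dot = np.dot(a, b)

# Cosine similarity: the same dot product, then divided by the product
# of the two vector lengths (norms) - that's the extra normalisation step.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# If the embeddings are already normalised to unit length,
# the dot product and cosine give identical scores.
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(dot, cosine, np.dot(a_unit, b_unit))
```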
Loved the depth you went into, really enjoyed the video! Quick one: if you were to apply RAG to a dataframe, how would you go about it? Converting to strings, then embedding each row as a chunk feels clunky, but maybe it's the way to go? I guess with the context lengths available at this stage we could almost just convert entire dataframes to strings and feed them in.
Most interesting! Thank you for sharing this video with us. I would be most interested if you could try something like LLMLingua to compress the context. Actually, I was wondering about using that on the chunks to make them more efficient. Also, to have a response that could be checked against the knowledge source of the RAG, I'd be interested in an LLM that can give citations of the relevant source chunks (assigning IDs when chunking, before any compression). Do you have any experience with that? How hard would it be to fine-tune a model for RAG with citations? Thx!
Thanks, I'll give that a read - LLMLingua.
Regarding citing chunks, yes I've been thinking about that and hopefully plan another video.
Thanks for the great video! How does this solution scale? I can see the benefit of fine-tuning the embeddings for smaller data corpora, but does it do as well for large data corpora that have thousands of documents across different domains of knowledge, and does the fine-tuning still help if more documents in a different domain of knowledge are added at a later time?
Howdy, yes this is definitely less scalable but higher quality than doing cosine similarity. That's the key trade-off.
If you have thousands of docs, you may want to make a summary of each and first ask the LLM which docs are relevant, then do a deeper dive.
My guess is that fine-tuning the embeddings won't help much if you add docs later that aren't in the same domain. Fine-tuning is specific to the dataset.
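To illustrate that summary-first routing idea, here's a rough sketch assuming an OpenAI-style chat client; the doc summaries, model name and prompt are placeholders, not anything from the video:

```python
# Stage 1 of a two-stage setup: ask the LLM which documents are relevant,
# then run normal chunk-level retrieval only inside those documents.
from openai import OpenAI

client = OpenAI()

doc_summaries = {
    "doc_001": "Annual report covering 2023 revenue and costs.",
    "doc_002": "Employee handbook: leave policy, expenses, onboarding.",
    "doc_003": "API reference for the payments service.",
}

question = "How many days of annual leave do new hires get?"

listing = "\n".join(f"{doc_id}: {summary}" for doc_id, summary in doc_summaries.items())
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Return only the IDs of documents relevant to the question, comma-separated."},
        {"role": "user",
         "content": f"Documents:\n{listing}\n\nQuestion: {question}"},
    ],
)
relevant_ids = [d.strip() for d in response.choices[0].message.content.split(",")]
print(relevant_ids)  # e.g. ['doc_002'] - then retrieve chunks only from these docs
```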
Hi. Nice video, but I didn't get how to prepare a dataset. How do I get a comprehensive list of questions and answers about my document?
Timestamp 0:50:50. Try that.
@@TrelisResearch seems like it's a different timestamp, but anyway I watched the whole 2h video and I didn't find an answer. You manually prepared these questions and answers. But what if I have, for example, a huge book like LOTR. How can I synthetically prepare a dataset? I assume my question is not directly related to RAG but to training, because all we can do is take the closest vector (sentence) and pass it to the LLM. And 'closest' is just cosine/dot/BM25. Am I missing something?
@@VerdonTrigance See from 1:06:00 onwards.
Have you tried your pipeline on a different dataset for the test data? Maybe something like basketball rules instead.
thanks for this!
What changes would you implement if there is a large number (50+) of PDFs (100+ pages with embedded images and text)?
You may not need to change all that much!
Maybe try more chunks, but even that may not help.
hey bro have you experimented with GraphRAG? Appreciate the video. Learning every day about RAG..
I haven't, but it probably does make sense to look at graph-type techniques. Basically it involves pre-organising your data for better search quality/paths.
- If the reranker is only good for similarity, why not apply it only to the similarity results and then add the BM25 results afterwards?
- Why not also fine-tune the reranker?
1. Yes! That’s a good idea and would have been a better comparison. Still, the basic point here is that the similarity is already very good, so adding the re-ranker doesn’t add much.
2. And yes, fine tuning the reranker makes sense too.
For both Qs - and you make good suggestions - using the reranker boils down to whether your basic similarity is deficient. If not, it’s hard to make the case for re-ranking.
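For anyone who wants to try it, here's a minimal reranking sketch using sentence-transformers' CrossEncoder; the model name is just one common choice, not necessarily what was used in the video:

```python
# Retrieve candidates however you like (similarity, BM25, or both),
# then let a cross-encoder rescore the (query, chunk) pairs.
from sentence_transformers import CrossEncoder

query = "What is the penalty for a false start?"
candidate_chunks = [
    "A false start results in a five-yard penalty against the offense.",
    "The game is played in four quarters of fifteen minutes each.",
    "Holding by the offense is a ten-yard penalty.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

# Keep the highest-scoring chunks. If the original similarity ranking was
# already good, this ordering barely changes - which is the point above.
for chunk, score in sorted(zip(candidate_chunks, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {chunk}")
```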
Any advantage in using BM25 instead of sklearn's TF-IDF?
While TF-IDF is more lightweight and runs faster, BM25 has a saturation term so that repeated occurrences of a subword don't keep increasing its weight beyond a certain point. There's also length normalisation of documents, which I don't think is in TF-IDF.
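As a quick illustration of that difference, here's a sketch using rank_bm25 and scikit-learn (toy documents made up for the example, not code from the video):

```python
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "cat cat cat cat cat cat",   # heavy term repetition
    "a dog slept on the rug",
]

# TF-IDF (unnormalised here so raw term frequency is visible): the weight
# of "cat" keeps growing with every repetition.
tfidf = TfidfVectorizer(norm=None)
tfidf_matrix = tfidf.fit_transform(docs)
query_vec = tfidf.transform(["cat"])
print("TF-IDF scores:", (tfidf_matrix @ query_vec.T).toarray().ravel())

# BM25: the k1 term saturates the benefit of repeating "cat",
# and b applies document-length normalisation.
bm25 = BM25Okapi([d.split() for d in docs], k1=1.5, b=0.75)
print("BM25 scores:  ", bm25.get_scores("cat".split()))
```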
@@TrelisResearch Awesome! Subscribing to the Multi-Repo Bundle very soon! Thanks a lot. Kudos from Brazil!
@@Bragheto thanks Brazil!
Trelis, I have one problem. I am working on an NL-to-SQL problem. I have written column descriptions for each column of each table in my database, then converted those descriptions into embeddings and stored them. Now when a user question comes in, I convert that question into an embedding and multiply it with the embedding of each column description created earlier. Then I select the top 20 columns based on cosine similarity score. But the thing is, I mostly miss one or two columns doing this. The question is one line, meaning it doesn't include many details to pick out the relevant columns, and sometimes irrelevant columns get a higher cosine score and I miss relevant ones. Do you have any idea how I can approach this problem? The only solution I see is increasing the number of columns I am selecting, but that increases the prompt size I give to the LLM, and you know there is a limited context window for LLMs.
Have you considered trying ONLY BM25 and seeing what kind of performance you get with that before you try adding similarity?
I'd also suggest playing around with a) long descriptions and b) short descriptions.
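Here's a sketch of what combining BM25 with the embedding similarity could look like for column selection (hypothetical column descriptions and an example model choice, not from the video's repo):

```python
# Score every column description with both BM25 and embeddings, then take
# the union of the two top-k lists so keyword and semantic matches both survive.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

column_descriptions = {
    "orders.order_date": "date when the order was placed",
    "orders.total_amount": "total value of the order in USD",
    "customers.signup_date": "date the customer created their account",
}
names = list(column_descriptions)
texts = list(column_descriptions.values())

question = "monthly revenue since January"

# Embedding scores (cosine similarity).
model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
col_emb = model.encode(texts, convert_to_tensor=True)
q_emb = model.encode(question, convert_to_tensor=True)
cos_scores = util.cos_sim(q_emb, col_emb)[0].tolist()

# BM25 scores (keyword overlap).
bm25 = BM25Okapi([t.split() for t in texts])
bm25_scores = bm25.get_scores(question.split())

k = 2
top_cos = sorted(range(len(names)), key=lambda i: -cos_scores[i])[:k]
top_bm25 = sorted(range(len(names)), key=lambda i: -bm25_scores[i])[:k]
selected = {names[i] for i in top_cos} | {names[i] for i in top_bm25}
print(selected)
```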
@@TrelisResearch BM25 would fail because the questions have a maximum of 7 to 8 words, and to generate SQL for a question I may need columns that don't have any word matching the question.
For the long and short descriptions, I will definitely give it a try.
@@TrelisResearch Hey, I am thinking about fine-tuning my embedding model. Is it better to fine-tune the model on next-word prediction, or should I also create a dataset of the form: this is the question and these are the columns needed to create the SQL for it?
Can you discuss GraphRAG, recently released by Microsoft?
Will take a look at it