🎯 Key Takeaways for quick navigation:
00:13 📌 The aim is to quantize the 13-billion-parameter Llama 2 model and fit it on a single T4 GPU for free on Colab.
01:06 ⚙️ The Colab runtime has to be set to use GPU as the hardware accelerator and T4 as the GPU type.
01:34 💡 Explains the concept of retrieval augmented generation: giving your LLM access to the 'outside world'. The plan is to work with a subset of the 'outside world' by searching it with natural language.
02:52 🔄 The process involves asking a question, retrieving relevant information about that question, and feeding that information back into the LLM.
03:31 🎛️ Discusses the importance of the embedding model for translating human-readable text into machine-readable vectors.
04:56 🧠 Two documents are created and each is embedded using an embedding model.
06:19 📚 Discusses how to create a Vector database and build a Vector index using a free Pinecone API key.
07:25 ⌨️ Describes the process of initializing the index to store the vectors produced by the embedding model.
09:00 ✍️ Talks about populating the Vector database using a small dataset of chunks of text from the Llama2 paper and related documents.
11:54 🐐 Initializes the LLM (Llama 2) needed for the retrieval QA chain.
14:27 🗝️ Shows how to obtain the Hugging Face Authentication token needed to use Llama 2.
15:47 🥊 Compares the outcome of just asking the LLM a question directly versus using retrieval augmentation; the latter clearly provides much more relevant information.
19:00 🥇 Indicates that Llama 2 performs better on safety and helpfulness benchmarks than other LLMs such as Chinchilla and Bard, and can be on par with closed-source models on certain tasks.
Made with Socialdraft
Thank you for the video. I am gradually transitioning from commercial models to open source. Your videos are very helpful.
For a PoC stage, going through this with the Pinecone vector DB is good for getting my head around the concept and putting the pieces together. However, I think most people will want to move to a local vector DB if they are trying to deliver a use case within a company, as sensitive data should never (as a policy) be stored outside the company domain. Pinecone still has its uses, though.
What can we use as a local vector DB?
@@oguzhanylmaz4586 there's vecra apparently
@@oguzhanylmaz4586 ChromaDB or FAISS
Both are supported by LangChain, or can be used natively
I am currently utilising Linda, my Linguistically Intelligent Networked Digital Assistant as a corporate governance tool with overwhelming success. This tiny aspect of what is now possible has the potential to change the world 🤯🧠
Thanks! Actually your brain diagram thing saved me some time explaining something I built to friends. Lol, though I was expecting you to fill out more parts of the brain. I like those definitions of the type of knowledge in that explanation as well. Good show.
As usual this is so incredible. I tend to believe your comments as you don't get hyper about some features and take a more systematic approach in evaluating and try to state the facts as it is. Thanks!
thanks I appreciate it! I do sometimes get excited about these things but I do really believe that RAG (among a few other things, like guardrails, agents, etc) is the key to genuinely good chatbots / AI assistants
Fantastic tutorial, you deserve 1000000. Subscribers.👍👍👍
Fantastic tutorial, you deserve 1 Mio. subscribers.
Thank you for the video.
Suggestion: next in line could be QA generation evaluation using Llama 2. I have tried using open LLM evaluation and found it hard to implement without using OpenAI.
The challenge with the RAG + LLM approach is that the actual data sources are vast in most real use cases. When these enormous volumes of documents are split and vectorized, they generate an immense quantity of vectors (millions) in the vector database. Consequently, when trying to retrieve an answer to a specific question, the vectors pulled from the database often fail to closely or precisely align with what you're searching for. Literally like looking for a needle in a haystack...
I've got it working well on millions of records, and I'm also aware of companies doing this at the billion scale - it's definitely possible! Some things that can help if you're struggling are hybrid search, or returning a high top_k (like 100) and then reranking with a more powerful (but slower) reranking model - like those available from Cohere or sentence-transformers
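To give a rough idea of that retrieve-then-rerank pattern, here's a minimal sketch (the cross-encoder model, the top_k values, and the rerank helper are illustrative assumptions, not the exact setup from the video):

from sentence_transformers import CrossEncoder

# slower but more accurate than the bi-encoder used for the first-pass search
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, index, embed_model, top_k=100, final_k=5):
    # 1) cheap vector search with a generous top_k
    xq = embed_model.embed_query(query)
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    docs = [m["metadata"]["text"] for m in res["matches"]]
    # 2) rerank the candidates with the cross-encoder, keep only the best few
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(scores, docs), reverse=True)
    return [d for _, d in ranked[:final_k]]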
@@jamesbriggs Any guides / tutorials on this? If not, it would be a good one for a future video.
Thank you James. Your videos are awesome. One odd thing I have noticed is that when I load the Llama 2 7B model in a Colab Pro account and choose an A100 GPU, the model alone takes 26.5 GB of GPU memory, while in your video the model only takes 8.2 GB. I used the exact same quantization and model settings.
Just discovered you from this video, this is amazing thank you so much
Nice one professor
thanks 😁
This video helps me a lot! thank you!
Thanks for your valuable videos
This is the best explanation possible
James Briggs 🔥🔥🔥 🚀
Great content. Why do you use a non-local vector database (Pinecone) for a local LLM?
He works for pinecone bro. No shade tho James is awesome
Yeah I work for Pinecone, but you can use numpy or Faiss too (I did a series on Faiss here in the past if it helps), or the open-source vector DBs
@@jamesbriggs I'm going to view those videos about Faiss
This is awesome and will help me immensely. I'm still going through the code so I can learn to implement a more elaborate sandbox with this. In theory, are we now able to get the LLM to cite the document, or even link to it, as long as it's in the embedded data? My org is scared of generative AI due to the hallucinations and the ethical issues of unknown training data. These types of tech explanations, especially in notebooks, help me explain the value of these open-source models and the techniques to use them with transparency. Having the LLM cite sources and provide links straight to documents is critical to getting over all these legal and political concerns.
Can we store the vectors locally? Always concerned about using 3rd parties to store potentially sensitive data, especially if it can be done locally.
It's possible to use numpy with small datasets like the one we use here; if you're looking at 100K+ vectors, try Faiss - Pinecone is SOC 2 compliant though, if that helps
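For small, fully local experiments, a minimal Faiss sketch (the example texts are placeholders; this assumes the 384-dim all-MiniLM-L6-v2 model and keeps everything in local memory):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = ["Llama 2 is a collection of pretrained and fine-tuned LLMs.",
         "Red teaming was used to probe the model for unsafe behaviour."]
vectors = embed_model.encode(texts, normalize_embeddings=True)

# inner product on normalized vectors == cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])  # 384 for all-MiniLM-L6-v2
index.add(np.asarray(vectors, dtype="float32"))

query = embed_model.encode(["what is red teaming?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
print([texts[i] for i in ids[0]])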
@@jamesbriggs Thanks!
I've used ChromaDB, it works well with that! However, when using Llama 2 in this manner for RAG, I'm assuming there is no risk of Hugging Face or Meta seeing my personal data. I'm assuming this is correct since we're downloading the entire model
Thank you for the great session !
Hi, you've saved the metadata but didn't use it at all. I would improve the code/video by adding SOURCES: to the response. That would also show what text it used to provide the answer, so you can prove that it returned just the relevant text and see how well it summarized what it got. It would also be worth comparing results and explaining when you would increase the number of retrieved chunks (k) and when you would increase the chunk size.
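For reference, a hedged sketch of how that could look, assuming the LangChain RetrievalQA chain from the video (the metadata keys depend on what was stored at upsert time):

from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,  # the Llama 2 pipeline from the video
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,  # keep the retrieved chunks alongside the answer
)

out = rag_pipeline("what is so special about llama 2?")
print(out["result"])
for doc in out["source_documents"]:
    # whatever metadata was stored at upsert time (e.g. title / source URL) comes back here
    print("SOURCE:", doc.metadata.get("source"), "|", doc.page_content[:80])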
Is this model able to read a PDF or CSV file saved on my local machine?
Is this model ok for commercial use?
Thank you for sharing the wonderful video. I am learning a lot from you 😊
yes and yes :)
It would be interesting to see how to implement this with a completely private, local, open-source vector database like Chroma.
If using Chroma, from what I can tell this approach would be considered safe. It's not hard to implement with Chroma!
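A minimal Chroma sketch under that assumption (the chunks, metadata, and persist directory are placeholders; everything stays on local disk):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

texts = ["chunk one of the llama 2 paper...", "chunk two..."]  # placeholder chunks
metadatas = [{"source": "llama2-paper"}, {"source": "llama2-paper"}]

vectordb = Chroma.from_texts(
    texts,
    embedding=embed_model,
    metadatas=metadatas,
    persist_directory="./chroma_db",  # written locally, nothing leaves your machine
)
vectordb.persist()

# later / in another session
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embed_model)
docs = vectordb.similarity_search("what is so special about llama 2?", k=3)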
the chest hair never disappoints!
Great content James. Your dataset seems interesting, do you have more information about how you created it? How much code would need to be changed to run TheBloke/Llama-2-13B-chat-GGML instead? Can that one fit on a T4?
I used a package I created a while back github.com/aurelio-labs/arxiv-bot - mainly using the `Arxiv` class found here github.com/aurelio-labs/arxiv-bot/blob/main/arxiv_bot/knowledge_base/constructors.py
As for the other Llama 2 model, you should just be able to switch the model_id to the one you wrote above, and as it is still a 13B parameter model it will fit on a T4 using the same loading method (with quantization) that I used in this video
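For reference, a hedged sketch of that quantized loading (the exact bitsandbytes settings are an approximation of the notebook, and hf_auth stands in for your Hugging Face access token):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # swap in another 13B chat checkpoint here

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization so a 13B model fits on a 16GB T4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=hf_auth,
)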
Hi James, thanks for the great content. I wanted to know whether it would be possible to retain the memory or context of the conversation as we go further, for example asking in more detail about a previous answer?
Kindly do a fine-tuning or ICL (in-context learning) version of Llama 2 please. Can you differentiate in-context learning and instruction fine-tuning?
Thanks for your tutorials! One suggestion if I may: can we not use the cut scene for transitions (like a corrupted clip)? It's too much sensory feedback all of a sudden and distracts from what you say next. Thank you nonetheless for the content 👏
Sure, I’ll tone that one down, thanks for the feedback :)
How can I see the RAM usage on the left inside Colab?
You can’t, it’s edited - it pops up on the right in my actual colab window
So we used a 13B parameter model in 4-bit precision here?
Cool, thanks! I built one like this based on your tutorial. But what if I have a list of items and I really want it to refer to my Pinecone embeddings when it's answering? How can I make sure it really looks at that context? Would it just be using the right pattern of tokens in the Pinecone embeddings so it's cued to look at the database rather than its own parameters?
I think this might help you, gives you a more deterministic trigger for doing RAG
th-cam.com/video/QMaWfbosR_E/w-d-xo.html
Thank you for the video! It helped me a lot! Is it possible to put the retrieval data in a conversation (dialogue) form?
How can we get the Pinecone environment from the console? There is no environment shown in the Pinecone web UI anymore.
Hi James, could you please explain why you do the "docs" embeddings at the beginning? They are just the 2 sentences you used to test the embedding model, but you don't use them afterwards. Is there a specific reason for them, since you created the embeddings from the larger dataset? Thanks!
The first two sentences are just examples to show the length of the embedded vectors, which is 384.
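Roughly what that check looks like (the two sentences are throwaway examples):

from langchain.embeddings import HuggingFaceEmbeddings

embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

docs = ["this is one document", "and another document"]  # throwaway examples
embeddings = embed_model.embed_documents(docs)

print(len(embeddings), len(embeddings[0]))  # 2 vectors, each 384-dimensional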
@@ashutossahoo7041 thank you!!
James, why did you use the text instead of the embeddings you uploaded to the vector database?
Hi James, I have been following a lot of your RAG videos and they have helped me a lot.
I have a few questions, but here's a little context:
For extremely domain-specific QA tasks, it's important for our embeddings to fully represent the meaning behind the text passages - in which case it would be important to fine-tune them.
I was also looking into dense passage retrieval (DPR) techniques that involve training and fine-tuning a retriever.
In more recent RAG implementations, the vector database provides static vector storage.
So essentially the entire system is very static except for updating the database.
1. Is it possible to improve the system as a whole by jointly fine-tuning the embedding model and also using a DPR model to eventually generalize to our domain-specific use case?
2. Is it possible to provide a kind of relevance feedback to existing retrievers in the form of human feedback / a viable metric?
3. Is there a way to implement these in a non-resource-intensive way?
4. Frankly, I've been running GPT4All LLMs on my CPU with 16GB RAM. It takes around 4 minutes to generate a single answer 😂. I'm not sure why the GPT4All chat app can run smoothly, but local code implementations take so much time?
5. Reaching max token limits and subsequently resizing the window during an answer greatly diminishes the performance of the model and also generates poor answers. Is there any way to clear the tokens / resize it manually before generating each answer?
Thanks for the great content James. My DB has a similar vector size and my hardware has similar specs to what you're using in the video, but my GPU runs out of memory every time this line of code is hit:
retriever=vectorstore.as_retriever()
Any suggestion on how to fix this?
Sorry, but reading the code is much clearer than watching this video, because you basically just read out the comments in the code anyway. The time spent making the video would have been better spent making a diagram of how the different pieces connect. (This is meant as constructive criticism; thanks for the code and for sharing the knowledge.)
No worries, appreciate the feedback
What computing power are you using to accomplish this? I find that "anyone can run this" is kind of subjective. I have a laptop and I've set up other LLMs, and I've gotten the blue screen of death a few times lol. After a certain level of experimenting with LLMs I've decided to hold back from destroying my CPU for ML.
Maybe I'm doing something wrong, but I am getting dimension 768 on your dataset in Colab. This leads to an error, which goes away when I use all-mpnet-base-v2 instead of all-MiniLM-L6-v2.
This is the error: ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Sun, 30 Jul 2023 10:23:13 GMT', 'x-envoy-upstream-service-time': '1', 'content-length': '102', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Vector dimension 384 does not match the dimension of the index 768","details":[]}
I think you are initializing the Pinecone index with dimension=768; you should delete the index via app.pinecone.io and then recreate it with pinecone.create_index(index_name, dimension=384, metric='cosine')
Let me know if that works!
Thanks for that, deleting the index worked. In retrospect, it figures, because I first initialized with all-mpnet-base-v2, which has a different dimension. I tried another solution too: naming the index based on the model. But that fails because I am on the free tier. I can delete it manually, but I wrote a small function to deal with it at the end of the notebook:
if index_name in pinecone.list_indexes():
pinecone.delete_index(index_name)
It's interesting how the different embedding models influence the answer. The all-mpnet-base-v2 model helps pick up on distillation-based training when asking what is so special about Llama 2, but it cannot deal well with the question about red teaming. I added another question to explore the aforementioned differences:
rag_pipeline('Why is llama 2 released for commercial use?')
all-mpnet-base-v2:
The llama 2 model is released for both research and commercial use because it is important to encourage responsible AI innovation by drawing upon the collective wisdom, diversity, and ingenuity of the AI practitioner community. By making the model openly available, the company hopes to foster collaboration and improve the safety and effectiveness of the technology.
all-MiniLM-L6-v2:
The llama 2 model is released for both research and commercial use because it is intended for assistant-like chat and can be adapted for various natural language generation tasks. It is also intended for commercial use in English.
all-MiniLM-L12-v2:
According to the text, llama 2 is released for commercial use because it is intended for "commercial and research use in English." Additionally, the licensing agreement allows for use in any manner that is not prohibited by the acceptable use policy.
Thank you, but after listening for 2 minutes I still don't know what RAG is and why I should use it
It's basically a fancy term for a browser plugin
I have a question. I have fine-tuned the Llama 2 chat HF model on my custom dialogue dataset, and now I want to use RAG with it. Can I do that (it gave me gibberish answers)? Also, can I switch between Llama 2 and the fine-tuned model for RAG queries and general queries respectively?
Thanks for the materials you create, you really can break down and explain brilliantly complex concepts!
A quick question - what app do you use to create those charts?
Hi, I'm getting an error on `if index_name not in pinecone.list_indexes():`
TypeError: expected string or bytes-like object
How do we save the RAG model and the vector DB locally after this, instead of loading the Hugging Face model every time?
SageMaker is a good option for deploying the LLM and embedding model so we don't need to download them every time: th-cam.com/video/0xyXYHMrAP0/w-d-xo.html
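If you do want everything cached on local disk instead, a rough sketch (the paths are hypothetical, and it uses a local FAISS store rather than Pinecone):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# build once from your chunks, then persist to disk
vectorstore = FAISS.from_texts(texts, embedding=embed_model)  # `texts` = your chunks
vectorstore.save_local("faiss_index")

# later sessions: load from disk, no re-embedding needed
vectorstore = FAISS.load_local("faiss_index", embed_model)

The Hugging Face model weights are also cached locally after the first download, so they aren't re-downloaded on every run.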
@@jamesbriggs thank you !!
I cannot see the code; it gives "Unable to render code block" from the GitHub link.
thank you
Hi James, I am working on a bot using RAG with Llama 2 and sentence transformers for the embeddings. I have completed the task, but how do I save the context or conversation, like ChatGPT does, since it is a conversational bot? Can you please help me with this or tell me which video I should watch?
What is your opinion of llama.cpp and llama-cpp-python?
Do you think this would work in Spanish?
Sure, you just need a generative LLM and an embedding model that can work in Spanish - Cohere offer a good multilingual embedding model, and there are also multilingual sentence transformers :)
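For example, a hedged sketch with a multilingual sentence transformer (the model choice is just one option, and the documents are placeholders):

from sentence_transformers import SentenceTransformer

# multilingual model: Spanish queries and documents share the same vector space
embed_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = ["Llama 2 es una colección de modelos de lenguaje preentrenados.",
        "El red teaming se usó para evaluar la seguridad del modelo."]
doc_vecs = embed_model.encode(docs, normalize_embeddings=True)

query_vec = embed_model.encode("¿qué es el red teaming?", normalize_embeddings=True)
scores = doc_vecs @ query_vec  # cosine similarity (vectors are normalized)
print(docs[scores.argmax()])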
How do I make a RAG model for PDFs? Can anyone help, or drop a link to that sort of project?
How can we create our own dataset and use it with this? Please, it would be very useful for my practice.
You look like guy from Max Payne 3 game xD
JK, thanks for helping me out :)
haha he even wears the same shirts 😅 - glad it helped :)
Can I do it with the 7B model?
Very useful
Hi @James
Even after using the same code, my Colab crashes. Any workaround? Has anyone faced a similar issue?
you are awesome!
👏👏
How much will Pinecone charge me after I finish my free calls to its API?
Free up to 100K vectors; you can make as many calls as you like as long as you're not storing more than that
your chest hair reminds me of a llama
True
LMAO chill my guy
How about a video where you don't need any freakin' API keys, just all local? Local model, on a local GPU, using a local database and local files.
What's wrong with using an API key to pull down the model? You are running and using the model on your own compute. Pair this with ChromaDB and you're good to go
Is that what it's for? But aren't the models available without API keys too? I never used an API key for downloading. @@vtambellini