Finally, a complete Fast and Simple end-to-end API using Ollama, Llama3, LangChain, ChromaDB, Flask, and PDF processing for a full RAG system. If you like this one, check out my video on setting up an AWS server with GPU support - th-cam.com/video/dJX9x7bETe8/w-d-xo.html
Amazing tutorial! I know you work with Java, but I would absolutely love an implementation in Golang.
We're about the same age. This is by far the best tutorial on the subject I've seen in a while. Thank you very much for your conscientiousness and dedication to quality! Cheers from Vancouver :)
Glad you enjoyed this.
Definitely the best tutorial I've found on TH-cam. I especially appreciated that you included the problems you ran into while implementing the code, like packages not yet installed, because when someone looks at a tutorial it usually seems like everything always works fine, but that's not what really happens when doing it for the first time. Great job.
Literally just what I was looking for. Thanks so much for the video. It's really difficult to find good information about doing projects with Llama models; every use case is for OpenAI. Again, thanks a lot!
Glad you enjoyed it. I had the same experience, and so many people asked me to do this video. I would normally only do Java, but the tooling is not really ready, so I wanted to get this out. Remember that you need a GPU-based system to really run fast, Linux or Windows; Apple appears to have given up on the GPU support that Ollama needs, with no option that I have found. I have a server running with a couple of NVIDIA cards and 128GB of RAM; it's super fast and makes it production ready for my needs.
Amazing step by step with a complete explanation of what and why. Thanks!
Congratulations, this is really an exhaustive explanation of how to set up the necessary architecture for exposing AI/LLM-based services from a private cloud. Thank you for sharing!
Thanks for the feedback, glad you enjoyed it.
Excellent. I am building a chatbot for a bank, and privacy/security is of utmost importance. I'll surely use the knowledge gained here.
Glad you found this useful; feel free to ask me questions. I've already set up banks and insurance companies using these techniques.
Thank you very much! I learned a lot following your step-by-step guidance, especially how you solved the errors. Cheers!
Glad you enjoyed the video. Thanks!
Very helpful content. Great tutorial on putting it all together in a RAG application. Thank you for taking the time to put this together and sharing it!
Thank you very much. I would add that adjusting the hyperparameters `k`, `score_threshold` and the `PromptTemplate` custom instruction can make a huge difference to the answer - by changing these I got the system to stop producing useless, inaccurate verbiage and give short, accurate, useful answers. It was like comparing an early 1B LLM with GPT-4o.
Thanks for the info, hope it helps others as well
@@fastandsimpledevelopment To be specific, this is what I have found so far to work reasonably well:
PromptTemplate:
"Based on the following context, provide a precise, concise, and accurate answer to the query. Do not give a load of waffle and empty verbiage, just the actual answer. Do not use your own knowledge, only that in the text supplied. The answer is almost certainly in the given text, but it may require some intelligence on your part to piece together information in order to answer the question. Try hard; don't give up easily."
"k": 5,
"score_threshold": 0.5
If anyone is interested, I can give some example answers. My impression is that the LLM is the weak link - it's often not quite smart enough to piece together the right pieces of information where a human would easily work it out. I plan to try out other LLMs and embedding algorithms, but at least the results are looking promising.
would you mind sharing those? the params and the Prompt.
@@manihss I am away from my main desktop, but I will look as soon as I can. The prompt was something like: "keep the answer very short and don't give me a load of empty blather"! I think the score threshold was 5, but that may be erroneous.
@@juliandarley would still be interested in what params you ended up finding optimal if you still have it
Thanks! Was looking for this!
Very useful, criminally underviewed too.
Excellent video. Helped me immensely, Thank you for sharing.
Thank you for making this video. It was so helpful :)
Glad you enjoyed it
I was looking for just this, thank you sir you're a wonderful human being! Thank you so much for this content!
Thanks for your video. Have a question. If I have a very big PDF, will the embedding data take more tokens? And what is the max length of the PDF?
I normally use Ollama for RAG applications with ChromaDB. Ollama is run locally, or at least on your own servers, so there is never a cost for tokens. If you were to use Gemini or OpenAI then yes, you have token costs. Depending on what database you're using for your vector store, there may be small costs to store the data. In general, when you retrieve data and use it in the LLM processing there are token costs. I have used up to 100MB PDF files with ChromaDB. The big thing to watch is the chunk size; you may find that you send 3 chunks to the LLM, so if each one is 1MB then 3 chunks are 3MB, which could be 600,000 tokens (3,000,000 / 5, at roughly 5 bytes per token). That would be expensive. Again, using a local LLM and local vector store resolves these costs real fast, plus it gives you the security of your data not being outside your company.
@@fastandsimpledevelopment So the size of PDF files only affects the cost of the vector db?
@@candyman3537 Correct, if you have a local vector db then there is no cost. The chunk size has more effect on tokens; I normally have 3 chunks returned from the similarity search that are then sent to the LLM.
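To make the chunk-size math above concrete, here is a small sketch of a splitter configuration plus the back-of-the-envelope token estimate. The chunk size, the loader, and the 5-bytes-per-token figure are rough assumptions taken from the discussion above, not exact values:

```python
from langchain_community.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = PDFPlumberLoader("pdf/example.pdf").load()  # hypothetical PDF path

# Smaller chunks mean fewer bytes (and therefore fewer tokens) sent per retrieved chunk.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=80)
chunks = text_splitter.split_documents(docs)

# Rough per-query cost estimate for what the retriever hands to the LLM.
chunks_per_query = 3
approx_tokens = chunks_per_query * 1024 / 5   # ~5 bytes per token, as above
print(f"~{approx_tokens:.0f} tokens per query sent to the LLM")
```

With a local Ollama model the token count only affects latency, not cost; with a hosted API it directly drives the bill.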
Thank You Again. Perhaps a follow-on Vid where you make this a full Flask App with a responsive Web-GUI ?!?!?
I have a video that I'm editing that has a full React UI; I built it so anyone can build a product with it if they wanted. I now have the Python code as a microservice, which makes it much cleaner to deploy in production, along with full logging.
Nice starter tutorial that does not involve OpenAI!
There was so much OpenAI content and almost none that really covers running Ollama locally. I used this for my last company, which has very private data, from HR information to product development, plus Jira and Confluence integration; no way could we use OpenAI and have them "learn" all our IP content :)
Awesome with all of the latest LLM and APIs!
Very informative and timely !
This is awesome! Thank you very much for posting this!👏
That's the way, man, great tutorial. Thank you
I have my Ollama running on an external machine; how do I configure the IP in cached_llm = Ollama(model="llama3")?
Give this a try
cached_llm = Ollama( model="llama3", base_url="OLLAMA_HOST:PORT" )
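A minimal sketch of what that looks like with a concrete address (the host is a hypothetical LAN IP; 11434 is Ollama's default port, and the scheme should be included):

```python
from langchain_community.llms import Ollama

# Point the client at the machine where `ollama serve` is running.
cached_llm = Ollama(
    model="llama3",
    base_url="http://192.168.1.50:11434",  # hypothetical address of the external Ollama server
)

print(cached_llm.invoke("Say hello in one short sentence."))
```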
Hello, I have a problem: the variable context is not declared.
Can you give me more information? I do not see a variable "context" in the code; "context" is returned by the lookup of data from ChromaDB, so if there is no context then I suspect there are no matching results from the search. Maybe tell me the line number or share the code with me.
@@fastandsimpledevelopment line 27 of your github repo. "NameError: name 'context' is not defined"
My mistake. The editor put an f""" in automatically.
@@federicocalo4776 That is part of the PromptTemplate, so it's not a real variable; it is populated by the retriever, so line #70 should create this value for you.
Make sure you have line 27 in triple double quotes and you are using Python 3.9 or greater.
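For anyone hitting the same NameError: {context} is a template placeholder, not a Python variable, so the prompt must be a plain triple-quoted string rather than an f-string. A minimal sketch, with illustrative wording rather than the repo's exact prompt:

```python
from langchain.prompts import PromptTemplate

# Plain string, NOT f""" ... """. An f-string would make Python evaluate
# {context} immediately and raise NameError: name 'context' is not defined.
raw_prompt = PromptTemplate.from_template(
    """Answer the question using only the context below.
If the answer is not in the context, say you do not know.

Context: {context}
Question: {input}
Answer:"""
)
```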
Thank you so much for this video, this is very helpful for me right now.
But I have a question: how can I deploy this Flask app to a server so there's no need to use 'localhost'?
You can just create a venv on your Linux server, activate it, and then pip install -r requirements.txt to install all the required dependencies. Then just start it as a normal Python 3 app (python3 app.py) and it will be running. You may need to open a port in your firewall on the server, and then you can connect externally, so maybe 10.10.10.25:8081/api for a connection. I do this all the time. I break things into smaller services (microservices) and have them run in a Docker container. You can load Ollama on a GPU-based server (see my video on this).
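One detail that matters once you move off localhost: the Flask app has to bind to all interfaces, not just 127.0.0.1. A minimal sketch (8081 follows the example port above; adjust to whatever the app already uses):

```python
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # 0.0.0.0 listens on every interface of the server, so external clients
    # can reach http://<server-ip>:8081/... once the firewall port is open.
    app.run(host="0.0.0.0", port=8081, debug=False)
```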
Why is this giving answers outside the PDFs when I ask questions unrelated to their content, instead of an error response?
This is the typical hallucination problem. In the prompt you need to say "Only use the content provided", and if the results from the retriever are zero length then I normally return a message saying I could not find the content in the PDF.
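A minimal sketch of that empty-result guard, assuming `retriever` and `chain` objects like the ones sketched earlier in the comments (names are illustrative):

```python
def ask_pdf(query: str) -> str:
    # Check whether the vector store has anything relevant before calling the LLM.
    # (On older LangChain versions use retriever.get_relevant_documents(query).)
    docs = retriever.invoke(query)
    if len(docs) == 0:
        return "I could not find that content in the uploaded PDF."

    # Only answer when there is real context to ground the response.
    result = chain.invoke({"input": query})
    return result["answer"]
```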
Hi, I tried this program and it shows an error that fastembed will not import, but I already installed the package, and the same error shows again and again.
About the search_kwargs in the as_retriever method: I can't find all the other options that can be used and what they are for. Can anybody help?
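As far as I know, the useful keys depend on the search_type you pick; here is a sketch of the common combinations (double-check the LangChain docs for your version, since accepted keys change between releases):

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

vector_store = Chroma(persist_directory="db", embedding_function=FastEmbedEmbeddings())

# Plain similarity: just return the top-k chunks.
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})

# Similarity with a minimum relevance score (the variant discussed in the comments above).
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.5},
)

# Maximal Marginal Relevance: trades relevance against diversity of the returned chunks.
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)

# A metadata filter can also be passed, e.g. search_kwargs={"k": 4, "filter": {"source": "example.pdf"}}
```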
Perfection 👏🏻👏🏻👏🏻
The best tutorial
Glad you liked it, thanks!
Nice video tutorial, some questions though. Q1: If you already have your loader and call loader.load_and_split(), doesn't that function already use a text splitter by default (the RecursiveCharacterTextSplitter class)? Is it really necessary to call text_splitter.split_documents again?
Q2: If you have that upload endpoint and call it multiple times with different PDFs, will it overwrite the previous vector store? Since I see you are re-creating Chroma.from_documents each time, should an instance of Chroma be created once and then just call chroma_instance.add_documents()?
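On the second question, a minimal sketch of the incremental approach: keep one persistent Chroma instance and append each new PDF's chunks with add_documents() instead of rebuilding the store with Chroma.from_documents on every upload. The persist directory and function name are illustrative:

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

embedding = FastEmbedEmbeddings()

# Open (or create) a single persistent store once, at startup.
vector_store = Chroma(persist_directory="db", embedding_function=embedding)

def index_pdf_chunks(chunks):
    # Appends the new document's chunks alongside whatever is already indexed.
    # (Older LangChain/Chroma versions may also need vector_store.persist().)
    vector_store.add_documents(chunks)
```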
Thanks for the great tutorial. I would like to get your thoughts on 2 aspects: What is the minimum RAM needed to run this? Can the model be stored on an external hard disk?
You will need at least 8GB of RAM. The model files are stored on a hard disk, but they are also loaded into memory when used. So when there is no activity the memory is free, but after about 5 minutes of no activity the cache is cleared, so the next time you use it the model is loaded into RAM again. If you have a GPU (NVIDIA card) then it gets loaded into the GPU. I like to use an external Linux Ubuntu server with an NVIDIA card and 20GB of RAM; it runs really fast.
@@fastandsimpledevelopment Thanks
Can I use this without any limits or restrictions? Meaning, is it free, with no tokens or anything else needed? Please reply.
FREE FREE FREE - all you have to do is host the service yourself. This is all open source: no fees for Ollama, Llama3, LangChain, ChromaDB, Flask, or any of the Python libraries. Keep watching for my Python microservices that make all this even simpler to set up and use with a React front end.
@@fastandsimpledevelopment thanks for reply
Excellent. Thanks very much for sharing
hi, can this project work without internet?
Yes, 100% offline on local systems.
I am getting this error: ImportError: Could not import 'fastembed' Python package. Please install it with `pip install fastembed` - even though I installed fastembed. It would be appreciated if anyone can help.
Really loved it......😍... can you add delete endpoint to delete from both pdf folder and chroma db
Postman gives this error, can anyone help?
415 Unsupported Media Type
Did not attempt to load JSON data because the request Content-Type was not 'application/json'.
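That error means the request body was sent without the application/json Content-Type header. In Postman, select Body → raw and choose JSON so the header is set for you. For a scripted call, a minimal sketch (the endpoint path and JSON key are illustrative):

```python
import requests

# json= serializes the body and sets Content-Type: application/json
# automatically, which is exactly what the 415 response is complaining about.
response = requests.post(
    "http://localhost:8081/ask_pdf",   # hypothetical query endpoint
    json={"query": "What is this document about?"},
)
print(response.status_code, response.json())
```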
Is this solution scalable? With many concurrent users?
No, it is not. One request is processed in one thread of Python. You can scale this if you have multiple instances of the Python Flask app running behind a load balancer like NGINX, but that still does not make it fast. The bottleneck is Ollama: it can handle multiple requests but may have to reload the LLM each time if it is not the same LLM. If you were to use Llama3.2 every time then you would get a level of support for multiple requests, but in the end it is not as scalable as you would expect; if 1 request takes 10 seconds, 2 requests take 20 seconds, so it does not solve anything. You can always add another GPU, which Ollama supports, but then the first request takes 12 seconds, the second takes 18 seconds, etc., so again not a scalable solution.

But if you have isolated machines fed from a load balancer like NGINX, and each machine has your Flask API and Ollama running with its own GPU (I use the NVIDIA 4090), then yes, the first request takes 10 seconds and the 2nd request takes 10 seconds, so the time is pretty consistent for multiple requests. You will quickly find that you may need 4 machines to create a production-grade system. This is what I have done for a large LLM project and it does work well. I set up the load balancer for round robin and then process each request as it comes in. If I need to support more requests then I will add more servers. I did this on AWS and it cost me about $700 per server per month, but it did work. I now have my own servers, which cost about $2,500 each to build, that form the LLM engine cluster. I connect this into the cloud using ngrok and it is very fast.

As far as I know there is no way to scale up vertically (getting more LLM RAM or processor power) other than replacing your GPU board. Adding boards in parallel gives more memory but does not affect the processing speed; well, it is a bit slower from the overhead, but Ollama will put different sections of the LLM onto different cards, so the memory is scaled but not the processors. Each processor runs its segment of the model in its own instance, so there is no performance increase.
@@fastandsimpledevelopment Clear! Thanks for the reply! So from a performance perspective parallel is the way to go, makes sense. Follow up question, how do you keep the source of truth (the RAG and your docs) in sync?
@@ChigosGames I do two things for RAG. Initially I used PDF input and stored the vectors in ChromaDB; this can be a server, so all the instances use the same database, but the data is only as fresh as the last PDF upload. I have moved to a better solution where I do not source anything from PDF: I query a database, in my case MongoDB, then take that content (which should be the current truth) and feed it into the process as if it came from ChromaDB, so it is what Ollama uses to answer a question/prompt. This works very well, and a lot of the PDF/vector DB issues went away. I have a large set of data; think of airline tickets, so I have routes, times, and destinations as well as the passengers that purchase, and I need to answer questions like "What is a cheaper flight?" or "If I change the day, how much will it cost?", so sometimes it's not as simple as a PDF document with content.
@@fastandsimpledevelopment Ok, I love MongoDB, since it is so JSON friendly (a bit too much sometimes), so with that you already structure the data and let it be enriched by the LLM.
How do you 'steer' the LLM so it's not too creative with amending your (flight) data?
@@ChigosGames I create very specific data, for example "SFO 01/10/2024 10:00AM - JFK 01/10/2024 4:45PM American Flight 1410 $445". This is then used in the LLM, and I do have a filter and use JSON format output, which I then transform as needed.
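A minimal sketch of that pattern: database rows rendered into terse context strings like the example above, handed to the LLM with a strict instruction and JSON-formatted output. The record fields, prompt wording, and question are illustrative, not the actual production code:

```python
import json
from langchain_community.llms import Ollama

llm = Ollama(model="llama3", format="json")  # ask Ollama to emit valid JSON

# Rows as they might come back from MongoDB (hypothetical fields).
flights = [
    {"origin": "SFO", "dest": "JFK", "depart": "01/10/2024 10:00AM",
     "arrive": "01/10/2024 4:45PM", "carrier": "American", "flight": "1410", "price": 445},
]

# Render each row in the same terse line format as the example above.
context = "\n".join(
    f'{f["origin"]} {f["depart"]} - {f["dest"]} {f["arrive"]} {f["carrier"]} Flight {f["flight"]} ${f["price"]}'
    for f in flights
)

prompt = (
    "Answer using ONLY the flight data below. Do not invent flights or prices.\n"
    'Respond as JSON in the form {"answer": "...", "flights_used": [...]}.\n\n'
    f"Flight data:\n{context}\n\n"
    "Question: What is the cheapest flight from SFO to JFK?"
)

result = json.loads(llm.invoke(prompt))
print(result)
```

Constraining the output to JSON and explicitly forbidding invented data is what keeps the model from getting "creative" with the flight records.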
Thank you, thank you, thank you.
For AWS, are you using Amazon SageMaker, or how would you go about deploying the local LLM model to host it? I'm new, so I apologize if this is an incoherent question.
I deploy a full Ollama server on EC2 on a GPU-based instance; no SageMaker needed. This works on AWS or even your own Linux machines.
Amazing video! Your explanation is super insightful and well-presented. I'm curious-do you have any thoughts or experience with using Ollama in a production environment? I'm not sure if Ollama can handle multiple requests at scale.
If I were to implement something like this in production, would you recommend Ollama, or would alternatives like llama.cpp or vllm be better suited? Would love to hear your perspective on its scalability and performance. Thanks again for sharing such awesome content!
Hi, first of all, thank you very much for this project. I will move forward on your project. I would like some support from you, can you integrate data streaming into your existing API project? In other words, the response should not come as the entire text. Like chatgpt, the response comes live word by word.
No, sorry, I do not have time to work on a streaming interface; I'm sure you can get it working if you dive in yourself. Streaming is supported and is detailed in the Ollama API.
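For anyone who wants to try it themselves, here is a minimal sketch of word-by-word streaming from a Flask endpoint using the LangChain Ollama client's stream() method; the route name and request shape are illustrative:

```python
from flask import Flask, Response, request
from langchain_community.llms import Ollama

app = Flask(__name__)
llm = Ollama(model="llama3")

@app.route("/ai_stream", methods=["POST"])
def ai_stream():
    query = request.json["query"]

    def generate():
        # stream() yields the response incrementally instead of waiting for the full text.
        for chunk in llm.stream(query):
            yield chunk

    # The client can read this response as it arrives (fetch with a reader, curl -N, etc.).
    return Response(generate(), mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8081)
```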
Very good video; it helped me a lot with something I was looking for to integrate with a chatbot, but I had to adjust search_kwargs down to 3 because it got dizzy with the results. I would be happy to see how to delete added PDFs from Chroma. Greetings and thank you very much for this content.
Glad it helped you
Great and to the point. 👏
How do I fix this?
C:\Users\zzz>ollama pull llama3
pulling manifest
Error: 403:
C:\Users\zzz>ollama serve
Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
The first issue is that you already have Ollama running. You need to stop that instance; on Windows I think you can do it with the Task Manager. The other issue may then go away; if not, make sure Ollama is running and try "ollama list" - this will show you that it is running and give you a list of the models already loaded (if any). There should not be a 403 error unless you are behind a firewall or corporate security does not allow the download. Sorry, that's all I've got.
Thanks for the tutorial.
I had an error when saving the file on Windows; here is how I solved it:
from pathlib import Path
current_folder = Path(__file__).parent.resolve()  # folder containing the running script
save_file = str(current_folder / "pdf" / file_name)  # builds an absolute, OS-correct path
Thanks for the input, hope it helps someone else as well.
Thanks so much for your comment -- where did you put this in the code?
Very useful video, thanks for sharing!
Hi! Your project worked for me yesterday. I tried to change the PDF loader to another one, then it broke and only gave empty chunks. Today I've tried recreating it in a new venv, but now I keep getting the following error: werkzeug.exceptions.BadRequestKeyError: 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
KeyError: 'file'. I tried searching on Google but couldn't find an answer, and I've never had this error before. Thanks a lot in advance if you are able to help :)
Could it perhaps be that you need to relaunch your virtual environment? (venv)
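That BadRequestKeyError usually means the upload endpoint reads request.files["file"] but the request did not include a multipart form field named file (in Postman: Body → form-data, key "file" with type File). A minimal sketch of a client call that sends it correctly; the endpoint path is illustrative:

```python
import requests

# The form field name ("file") must match what the Flask endpoint reads via
# request.files["file"]; sending the PDF as a raw body or under another
# field name produces exactly this 400 / KeyError: 'file'.
with open("example.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:8081/pdf",   # hypothetical upload endpoint
        files={"file": ("example.pdf", pdf, "application/pdf")},
    )
print(response.status_code, response.json())
```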
Can this be run on a MacBook? Super informative, I just don't want to fry my mac trying it out :/
Yes, this video was recorded on a 2019 Intel MacBook Pro. I've also run this on an M2 MacBook. If you try large queries then it is very slow; you really need a GPU (NVIDIA) for performance.
@@fastandsimpledevelopment awesome thank you very much!!
Can it be run on PC?
*Windows
@@brianclark4639 Yes, Ollama now supports Windows, code should be the same
Where is the calling API?
Thx from Brazil.
Sweet!
WINDOWS USERS INSTALL PROBLEM:
"uvloop" says it doesn't work on Windows.
Remove it from requirements.txt.
It should install and seems to work anyway.
(I haven't tested the RAG part, only that the API call works)
Thanks for the input
Very nice. Thanks.
The code is blurred and too small, hard to read.
Awesome
If you're on Windows, remove uvloop from requirements.txt; otherwise it will break your pip install.
Thanks for the info!
Can you do one for Windows or Linux? 😂 Sorry, I'm a bit lost too. Guess I need more Python knowledge.
gj
Thanks
I keep getting an "langchain_community.llms.ollama.OllamaEndpointNotFoundError: Ollama call failed with status code 404. Maybe your model is not found and you should pull the model with `ollama pull llama3`." error. Any help resolving this would be greatly appreciated. (I am running llama3 on a Mac OS)
Go to the command line and pull the model using `ollama pull llama3`.