Finally, a complete Fast and Simple end-to-end API using Ollama, Llama3, LangChain, ChromaDB, Flask, and PDF processing for a full RAG system. If you like this one, check out my video on setting up an AWS server with GPU support - th-cam.com/video/dJX9x7bETe8/w-d-xo.html
Amazing tutorial! I know you work with Java, but I would absolutely love an implementation in Golang.
We're about the same age. This is by far the best tutorial on the subject I've seen in a while. Thank you very much for your conscientiousness and dedication to quality! Cheers from Vancouver :)
Glad you enjoyed this.
Definitely the best tutorial I've found on TH-cam. I especially appreciated that you included the problems you ran into while implementing the code, like packages not yet installed, because when someone looks at a tutorial it usually seems like everything always works fine, but that's not what really happens when doing it for the first time. Great job.
Literally just what I was looking for. Thanks so much for the video. It's really difficult to find good information about doing projects with Llama models; every use case is for OpenAI. Again, thanks a lot!
Glad you enjoyed it. I had the same experience, and so many people asked me to do this video. I would normally only do Java, but the tooling is not really ready, so I wanted to get this out. Remember that you need a GPU-based system to really run fast, Linux or Windows; Apple appears to have given up on the GPU support that Ollama needs, with no option that I have found. I have a server running with a couple of NVIDIA cards and 128GB of RAM; it's super fast and makes it production ready for my needs.
Amazing step by step with a complete explanation of what and why. Thanks!
Congratulations, this is really an exhaustive explanation of how to set up the necessary architecture for exposing AI/LLM-based services from a private cloud. Thank you for sharing!
Thanks for the feedback, glad you enjoyed it.
Excellent. I am building a chatbot for a bank, and privacy/security is of utmost importance. I'll surely use the knowledge gained here.
Glad you found this useful; feel free to ask me questions. I've already set up banks and insurance companies using these techniques.
Thank you very much! I learned a lot following your step-by-step guidance, especially how you solved the errors. Cheers!
Glad you enjoyed the video. Thanks!
Very helpful content. Great tutorial on putting it all together in a RAG application. Thank you for taking the time to put this together and sharing it!
Thank you very much. I would add that adjusting the hyperparameters `k`, `score_threshold` and the `PromptTemplate` custom instruction can make a huge difference to the answer - by changing these I got the system to stop producing useless, inaccurate verbiage and give short, accurate, useful answers. It was like comparing an early 1B LLM with GPT-4o.
Thanks for the info, hope it helps others as well
@@fastandsimpledevelopment To be specific, this is what I have found so far to work reasonably well:
PromptTemplate:
"Based on the following context, provide a precise, concise, and accurate answer to the query. Do not give a load of waffle and empty verbiage, just the actual answer. Do not use your own knowledge, only that in the text supplied. The answer is almost certainly in the given text, but it may require some intelligence on your part to piece together information in order to answer the question. Try hard; don't give up easily."
"k": 5,
"score_threshold": 0.5
If anyone is interested, I can give some example answers. My impression is that the LLM is the weak link - it's often not quite smart enough to piece together the right pieces of information where a human would easily work it out. I plan to try out other LLMs and embedding algorithms, but at least the results are looking promising.
would you mind sharing those? the params and the Prompt.
@@manihss I am away from my main desktop, but I will look as soon as I can. The prompt was something like: "keep the answer very short and don't give me a load of empty blather"! I think the score threshold was 5, but that may be erroneous.
@@juliandarley would still be interested in what params you ended up finding optimal if you still have it
Thanks! Was looking for this!
Very useful, criminally underviewed too.
Excellent video. Helped me immensely, Thank you for sharing.
Thank you for making this video. It was so helpful :)
Glad you enjoyed it
I was looking for just this, thank you sir you're a wonderful human being! Thank you so much for this content!
Thanks for your video. Have a question. If I have a very big PDF, will the embedding data take more tokens? And what is the max length of the PDF?
I normally use Ollama for RAG applications with ChromaDB. Ollama is run locally, or at least on your own servers, so there is never a cost for tokens. If you were to use Gemini or OpenAI then yes, you have token costs. Depending on what database you're using for your vector store, there may be small costs to store the data. In general, when you retrieve data and use it in the LLM processing there are token costs. I have used up to 100MB PDF files with ChromaDB. The big thing to watch is the chunk size; you may find that you send 3 chunks to the LLM, so if each one is 1MB then 3 chunks are 3MB, which could be 600,000 tokens (3,000,000 / 5, at roughly 5 bytes per token). That would be expensive. Again, using a local LLM and local vector store resolves these costs real fast, plus it gives you the security of your data not being outside your company.
@@fastandsimpledevelopment So the size of PDF files only affects the cost of the vector db?
@@candyman3537 Correct, if you have a local vector db then there is no cost. The chunk size has more effect on tokens; I normally have 3 chunks returned from the similarity search that are then sent to the LLM.
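To make the chunk-size math above concrete, here is a small sketch of a splitter configuration plus the back-of-the-envelope token estimate. The chunk size, the loader, and the 5-bytes-per-token figure are rough assumptions taken from the discussion above, not exact values:

```python
from langchain_community.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = PDFPlumberLoader("pdf/example.pdf").load()  # hypothetical PDF path

# Smaller chunks mean fewer bytes (and therefore fewer tokens) sent per retrieved chunk.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=80)
chunks = text_splitter.split_documents(docs)

# Rough per-query cost estimate for what the retriever hands to the LLM.
chunks_per_query = 3
approx_tokens = chunks_per_query * 1024 / 5   # ~5 bytes per token, as above
print(f"~{approx_tokens:.0f} tokens per query sent to the LLM")
```

With a local Ollama model the token count only affects latency, not cost; with a hosted API it directly drives the bill.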
Thank You Again. Perhaps a follow-on Vid where you make this a full Flask App with a responsive Web-GUI ?!?!?
I have a video that I'm editing that has a full React UI; I built it so anyone can build a product with it if they wanted. I now have the Python code as a microservice, which makes it much cleaner to deploy in production, along with full logging.
Nice starter tutorial that does not involve OpenAI!
There was so much OpenAI content and almost none that really covers running Ollama locally. I used this for my last company, which has very private data, from HR information to product development, plus Jira and Confluence integration; no way could we use OpenAI and have them "learn" all our IP content :)
Awesome with all of the latest LLM and APIs!
Very informative and timely !
This is awesome! Thank you very much for posting this!👏
That's the way, man, great tutorial. Thank you
I have my Ollama running on an external machine; how do I configure the IP in cached_llm = Ollama(model="llama3")?
Give this a try
cached_llm = Ollama( model="llama3", base_url="OLLAMA_HOST:PORT" )
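A minimal sketch of what that looks like with a concrete address (the host is a hypothetical LAN IP; 11434 is Ollama's default port, and the scheme should be included):

```python
from langchain_community.llms import Ollama

# Point the client at the machine where `ollama serve` is running.
cached_llm = Ollama(
    model="llama3",
    base_url="http://192.168.1.50:11434",  # hypothetical address of the external Ollama server
)

print(cached_llm.invoke("Say hello in one short sentence."))
```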
Hello, I have a problem: the variable context is not declared.
Can you give me more information? I do not see a variable "context" in the code; "context" is returned by the lookup of data from ChromaDB, so if there is no context then I suspect there are no matching results from the search. Maybe tell me the line number or share the code with me.
@@fastandsimpledevelopment line 27 of your github repo. "NameError: name 'context' is not defined"
My mistake. The editor put an f""" in automatically.
@@federicocalo4776 That is part of the PromptTemplate, so it's not a real variable; it is populated by the retriever, so line #70 should create this value for you.
Make sure you have line 27 in triple double quotes and you are using Python 3.9 or greater.
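For anyone hitting the same NameError: {context} is a template placeholder, not a Python variable, so the prompt must be a plain triple-quoted string rather than an f-string. A minimal sketch, with illustrative wording rather than the repo's exact prompt:

```python
from langchain.prompts import PromptTemplate

# Plain string, NOT f""" ... """. An f-string would make Python evaluate
# {context} immediately and raise NameError: name 'context' is not defined.
raw_prompt = PromptTemplate.from_template(
    """Answer the question using only the context below.
If the answer is not in the context, say you do not know.

Context: {context}
Question: {input}
Answer:"""
)
```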
Thank you so much for this video, this is very helpful for me right now.
But I have a question: how can I deploy this Flask app to a server so there's no need to use 'localhost'?
You can just create a venv on your Linux server, activate it, and then pip install -r requirements.txt to install all the required dependencies. Then just start it as a normal Python 3 app (python3 app.py) and it will be running. You may need to open a port in your firewall on the server, and then you can connect externally, so maybe 10.10.10.25:8081/api for a connection. I do this all the time. I break things into smaller services (microservices) and have them run in a Docker container. You can load Ollama on a GPU-based server (see my video on this).
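One detail that matters once you move off localhost: the Flask app has to bind to all interfaces, not just 127.0.0.1. A minimal sketch (8081 follows the example port above; adjust to whatever the app already uses):

```python
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # 0.0.0.0 listens on every interface of the server, so external clients
    # can reach http://<server-ip>:8081/... once the firewall port is open.
    app.run(host="0.0.0.0", port=8081, debug=False)
```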
Why is this giving answers outside the PDFs when I ask questions unrelated to their content, instead of an error response?
This is the typical hallucination problem. In the prompt you need to say "Only use the content provided", and if the results from the retriever are zero length then I normally return a message saying I could not find the content in the PDF.
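A minimal sketch of that empty-result guard, assuming `retriever` and `chain` objects like the ones sketched earlier in the comments (names are illustrative):

```python
def ask_pdf(query: str) -> str:
    # Check whether the vector store has anything relevant before calling the LLM.
    # (On older LangChain versions use retriever.get_relevant_documents(query).)
    docs = retriever.invoke(query)
    if len(docs) == 0:
        return "I could not find that content in the uploaded PDF."

    # Only answer when there is real context to ground the response.
    result = chain.invoke({"input": query})
    return result["answer"]
```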
Hi, I tried this program and it shows an error that fastembed will not import, but I already installed the package, and the same error shows again and again.
About the search_kwargs in the as_retriever method: I can't find all the other options that can be used and what they are for. Can anybody help?
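As far as I know, the useful keys depend on the search_type you pick; here is a sketch of the common combinations (double-check the LangChain docs for your version, since accepted keys change between releases):

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

vector_store = Chroma(persist_directory="db", embedding_function=FastEmbedEmbeddings())

# Plain similarity: just return the top-k chunks.
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})

# Similarity with a minimum relevance score (the variant discussed in the comments above).
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.5},
)

# Maximal Marginal Relevance: trades relevance against diversity of the returned chunks.
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)

# A metadata filter can also be passed, e.g. search_kwargs={"k": 4, "filter": {"source": "example.pdf"}}
```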
Perfection 👏🏻👏🏻👏🏻
The best tutorial
Glad you liked it, thanks!
Nice video tutorial, some questions though. Q1: If you already have your loader and call loader.load_and_split(), doesn't that function already use a text splitter by default (the RecursiveCharacterTextSplitter class)? Is it really necessary to call text_splitter.split_documents again?
Q2: If you have that upload endpoint and call it multiple times with different PDFs, will it overwrite the previous vector store? Since I see you are re-creating Chroma.from_documents each time, should an instance of Chroma be created once and then just call chroma_instance.add_documents()?
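On the second question, a minimal sketch of the incremental approach: keep one persistent Chroma instance and append each new PDF's chunks with add_documents() instead of rebuilding the store with Chroma.from_documents on every upload. The persist directory and function name are illustrative:

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

embedding = FastEmbedEmbeddings()

# Open (or create) a single persistent store once, at startup.
vector_store = Chroma(persist_directory="db", embedding_function=embedding)

def index_pdf_chunks(chunks):
    # Appends the new document's chunks alongside whatever is already indexed.
    # (Older LangChain/Chroma versions may also need vector_store.persist().)
    vector_store.add_documents(chunks)
```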
Thanks for the great tutorial. I would like to get your thoughts on 2 aspects: What is the minimum RAM needed to run this? Can the model be stored on an external hard disk?
You will need at least 8GB of RAM. The model files are stored on a hard disk, but they are also loaded into memory when used. So when there is no activity the memory is free, but after about 5 minutes of no activity the cache is cleared, so the next time you use it the model is loaded into RAM again. If you have a GPU (NVIDIA card) then it gets loaded into the GPU. I like to use an external Linux Ubuntu server with an NVIDIA card and 20GB of RAM; it runs really fast.
@@fastandsimpledevelopment Thanks
Can I use this without any limits or restrictions? Meaning, is it free, with no tokens or anything else needed? Please reply.
FREE FREE FREE - all you have to do is host the service yourself. This is all open source: no fees for Ollama, Llama3, LangChain, ChromaDB, Flask, or any of the Python libraries. Keep watching for my Python microservices that make all this even simpler to set up and use with a React front end.
@@fastandsimpledevelopment thanks for reply
Excellent. Thanks very much for sharing
hi, can this project work without internet?
Yes, 100% offline on local systems.
I am getting this error: ImportError: Could not import 'fastembed' Python package. Please install it with `pip install fastembed` - even though I installed fastembed. It would be appreciated if anyone can help.
Really loved it......😍... can you add delete endpoint to delete from both pdf folder and chroma db
Postman gives this error, can anyone help?
415 Unsupported Media Type
Did not attempt to load JSON data because the request Content-Type was not 'application/json'.
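That error means the request body was sent without the application/json Content-Type header. In Postman, select Body → raw and choose JSON so the header is set for you. For a scripted call, a minimal sketch (the endpoint path and JSON key are illustrative):

```python
import requests

# json= serializes the body and sets Content-Type: application/json
# automatically, which is exactly what the 415 response is complaining about.
response = requests.post(
    "http://localhost:8081/ask_pdf",   # hypothetical query endpoint
    json={"query": "What is this document about?"},
)
print(response.status_code, response.json())
```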
Is this solution scalable? With many concurrent users?
No, it is not. One request is processed in one thread of Python. You can scale this if you have multiple instances of the Python Flask app running behind a load balancer like NGINX, but that still does not make it fast. The bottleneck is Ollama: it can handle multiple requests but may have to reload the LLM each time if it is not the same LLM. If you were to use Llama3.2 every time then you would get a level of support for multiple requests, but in the end it is not as scalable as you would expect; if 1 request takes 10 seconds, 2 requests take 20 seconds, so it does not solve anything. You can always add another GPU, which Ollama supports, but then the first request takes 12 seconds, the second takes 18 seconds, etc., so again not a scalable solution.

But if you have isolated machines fed from a load balancer like NGINX, and each machine has your Flask API and Ollama running with its own GPU (I use the NVIDIA 4090), then yes, the first request takes 10 seconds and the 2nd request takes 10 seconds, so the time is pretty consistent for multiple requests. You will quickly find that you may need 4 machines to create a production-grade system. This is what I have done for a large LLM project and it does work well. I set up the load balancer for round robin and then process each request as it comes in. If I need to support more requests then I will add more servers. I did this on AWS and it cost me about $700 per server per month, but it did work. I now have my own servers, which cost about $2,500 each to build, that form the LLM engine cluster. I connect this into the cloud using ngrok and it is very fast.

As far as I know there is no way to scale up vertically (getting more LLM RAM or processor power) other than replacing your GPU board. Adding boards in parallel gives more memory but does not affect the processing speed; well, it is a bit slower from the overhead, but Ollama will put different sections of the LLM onto different cards, so the memory is scaled but not the processors. Each processor runs its segment of the model in its own instance, so there is no performance increase.
@@fastandsimpledevelopment Clear! Thanks for the reply! So from a performance perspective parallel is the way to go, makes sense. Follow up question, how do you keep the source of truth (the RAG and your docs) in sync?
@@ChigosGames I do two things for RAG. Initially I used PDF input and stored the vectors in ChromaDB; this can be a server, so all the instances use the same database, but the data is only as fresh as the last PDF upload. I have moved to a better solution where I do not source anything from PDF: I query a database, in my case MongoDB, then take that content (which should be the current truth) and feed it into the process as if it came from ChromaDB, so it is what Ollama uses to answer a question/prompt. This works very well, and a lot of the PDF/vector DB issues went away. I have a large set of data; think of airline tickets, so I have routes, times, and destinations as well as the passengers that purchase, and I need to answer questions like "What is a cheaper flight?" or "If I change the day, how much will it cost?", so sometimes it's not as simple as a PDF document with content.
@@fastandsimpledevelopment Ok, I love MongoDB, since it is so JSON friendly (a bit too much sometimes), so with that you already structure the data and let it be enriched by the LLM.
How do you 'steer' the LLM so it's not too creative with amending your (flight) data?
@@ChigosGames I create very specific data, for example "SFO 01/10/2024 10:00AM - JFK 01/10/2024 4:45PM American Flight 1410 $445". This is then used in the LLM, and I do have a filter and use JSON format output, which I then transform as needed.
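A minimal sketch of that pattern: database rows rendered into terse context strings like the example above, handed to the LLM with a strict instruction and JSON-formatted output. The record fields, prompt wording, and question are illustrative, not the actual production code:

```python
import json
from langchain_community.llms import Ollama

llm = Ollama(model="llama3", format="json")  # ask Ollama to emit valid JSON

# Rows as they might come back from MongoDB (hypothetical fields).
flights = [
    {"origin": "SFO", "dest": "JFK", "depart": "01/10/2024 10:00AM",
     "arrive": "01/10/2024 4:45PM", "carrier": "American", "flight": "1410", "price": 445},
]

# Render each row in the same terse line format as the example above.
context = "\n".join(
    f'{f["origin"]} {f["depart"]} - {f["dest"]} {f["arrive"]} {f["carrier"]} Flight {f["flight"]} ${f["price"]}'
    for f in flights
)

prompt = (
    "Answer using ONLY the flight data below. Do not invent flights or prices.\n"
    'Respond as JSON in the form {"answer": "...", "flights_used": [...]}.\n\n'
    f"Flight data:\n{context}\n\n"
    "Question: What is the cheapest flight from SFO to JFK?"
)

result = json.loads(llm.invoke(prompt))
print(result)
```

Constraining the output to JSON and explicitly forbidding invented data is what keeps the model from getting "creative" with the flight records.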
Thank you, thank you, thank you.
For AWS, are you using Amazon SageMaker, or how would you go about deploying the local LLM model to host it? I'm new, so I apologize if this is an incoherent question.
I deploy a full Ollama server on EC2 on a GPU-based instance; no SageMaker needed. This works on AWS or even your own Linux machines.
Amazing video! Your explanation is super insightful and well-presented. I'm curious-do you have any thoughts or experience with using Ollama in a production environment? I'm not sure if Ollama can handle multiple requests at scale.
If I were to implement something like this in production, would you recommend Ollama, or would alternatives like llama.cpp or vllm be better suited? Would love to hear your perspective on its scalability and performance. Thanks again for sharing such awesome content!
Hi, first of all, thank you very much for this project. I will move forward on your project. I would like some support from you, can you integrate data streaming into your existing API project? In other words, the response should not come as the entire text. Like chatgpt, the response comes live word by word.
No, sorry, I do not have time to work on a streaming interface; I'm sure you can get it working if you dive in yourself. Streaming is supported and is detailed in the Ollama API.
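For anyone who wants to try it themselves, here is a minimal sketch of word-by-word streaming from a Flask endpoint using the LangChain Ollama client's stream() method; the route name and request shape are illustrative:

```python
from flask import Flask, Response, request
from langchain_community.llms import Ollama

app = Flask(__name__)
llm = Ollama(model="llama3")

@app.route("/ai_stream", methods=["POST"])
def ai_stream():
    query = request.json["query"]

    def generate():
        # stream() yields the response incrementally instead of waiting for the full text.
        for chunk in llm.stream(query):
            yield chunk

    # The client can read this response as it arrives (fetch with a reader, curl -N, etc.).
    return Response(generate(), mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8081)
```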
Very good video; it helped me a lot with something I was looking for to integrate with a chatbot, but I had to adjust search_kwargs down to 3 because it got dizzy with the results. I would be happy to see how to delete added PDFs from Chroma. Greetings and thank you very much for this content.
Glad it helped you
Great and to the point. 👏
How do I fix this?
C:\Users\zzz>ollama pull llama3
pulling manifest
Error: 403:
C:\Users\zzz>ollama serve
Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
The first issue is that you already have Ollama running. You need to stop that instance; on Windows I think you can do it with the Task Manager. The other issue may then go away; if not, make sure Ollama is running and try "ollama list" - this will show you that it is running and give you a list of the models already loaded (if any). There should not be a 403 error unless you are behind a firewall or corporate security does not allow the download. Sorry, that's all I've got.
Thanks for the tutorial.
I had an error when saving the file on Windows; here is how I solved it:
from pathlib import Path
current_folder = Path(__file__).parent.resolve()  # folder containing the running script
save_file = str(current_folder / "pdf" / file_name)  # builds an absolute, OS-correct path
Thanks for the input, hope it helps someone else as well.
Thanks so much for your comment -- where did you put this in the code?
Very useful video, thanks for sharing!
Hi! Your project worked for me yesterday. I tried to change the PDF loader to another one, then it broke and only gave empty chunks. Today I've tried recreating it in a new venv, but now I keep getting the following error: werkzeug.exceptions.BadRequestKeyError: 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
KeyError: 'file'. I tried searching on Google but couldn't find an answer, and I've never had this error before. Thanks a lot in advance if you are able to help :)
Could it perhaps be that you need to relaunch your virtual environment? (venv)
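That BadRequestKeyError usually means the upload endpoint reads request.files["file"] but the request did not include a multipart form field named file (in Postman: Body → form-data, key "file" with type File). A minimal sketch of a client call that sends it correctly; the endpoint path is illustrative:

```python
import requests

# The form field name ("file") must match what the Flask endpoint reads via
# request.files["file"]; sending the PDF as a raw body or under another
# field name produces exactly this 400 / KeyError: 'file'.
with open("example.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:8081/pdf",   # hypothetical upload endpoint
        files={"file": ("example.pdf", pdf, "application/pdf")},
    )
print(response.status_code, response.json())
```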
Can this be run on a MacBook? Super informative, I just don't want to fry my mac trying it out :/
Yes, this video was recorded on a 2019 Intel MacBook Pro. I've also run this on an M2 MacBook. If you try large queries then it is very slow; you really need a GPU (NVIDIA) for performance.
@@fastandsimpledevelopment awesome thank you very much!!
Can it be run on PC?
*Windows
@@brianclark4639 Yes, Ollama now supports Windows, code should be the same
Where is the calling API?
Thx from Brazil.
Sweet!
WINDOWS USERS INSTALL PROBLEM:
"uvloop" says it doesn't work on Windows.
Remove it from requirements.txt.
It should install and seems to work anyway.
(I haven't tested the RAG part, only that the API call works)
Thanks for the input
Very nice. Thanks.
The code is blurred and too small, hard to read.
Awesome
If you're on Windows, remove uvloop from requirements.txt; otherwise it will break your pip install.
Thanks for the info!
Can you do one for Windows or Linux? 😂 Sorry, I'm a bit lost too. Guess I need more Python knowledge.
gj
Thanks
I keep getting an "langchain_community.llms.ollama.OllamaEndpointNotFoundError: Ollama call failed with status code 404. Maybe your model is not found and you should pull the model with `ollama pull llama3`." error. Any help resolving this would be greatly appreciated. (I am running llama3 on a Mac OS)
Go to the command line and pull the model using `ollama pull llama3`.