Open Source RAG Chatbot with Gemma and Langchain | (Deploy LLM on-prem)

  • Published Jul 9, 2024
  • In this video, I show how to serve an open-source LLM and embedding model on-prem to build a Retrieval Augmented Generation (RAG) chatbot. For this purpose, I take the RAG-GPT chatbot and, instead of the GPT model, use Google Gemma 7B as the LLM and, instead of text-embedding-ada-002, use BAAI/bge-large-en from Huggingface. I use Flask to develop a web server that serves the LLM for real-time inference, and I show how to use Postman when developing these types of projects.
    00:00 Intro
    01:22 Demo
    03:12 RAG-GPT schema
    04:39 RAG-Gemma schema
    05:19 Challenges of open-source LLMs for chatbots
    07:15 A possible solution for serving LLMs on-prem
    08:02 Lost in the middle (A tip for context-length)
    09:26 Project structure
    11:18 How to load and interact with Gemma
    13:06 Developing LLM web server with Flask
    20:38 Testing and debugging the LLM web server with Postman
    25:00 Testing the RAG chatbot
    27:22 GPU usage of this chatbot
    27:48 Do we need a web server for the embedding model?
    🚀 GitHub Repository: github.com/Farzad-R/LLM-Zero-...
    🎓 Models that are used in this chatbot:
    BAAI/bge-large-en: huggingface.co/BAAI/bge-large-en
    Google Gemma: huggingface.co/blog/gemma
    📚 Extra Resources:
    RAG-GPT: • RAG-GPT: Chat with any...
    Quantization: huggingface.co/docs/transform...
    Lost in the Middle Paper: arxiv.org/abs/2307.03172
    Download Postman: www.postman.com/downloads/
    #opensource #llm #rag #chatbot #huggingface #google #gemma #gradio #langchain #Flask #python #GUI #postman
  • Science & Technology

Comments • 49

  • @NinVibe
    @NinVibe 3 months ago +1

    I have a question. So once you start using this app and start uploading documents with the 'upload PDF or doc file' button, do the documents stay in the app data so you can use them whenever you want, or will they be deleted? (Sorry if I got the answer in the video, I probably missed it.)

    • @airoundtable
      @airoundtable 3 months ago

      No worries! That is a good question. No, they won't remain there, as the intrinsic characteristic of uploaded docs is to have the flexibility to work with various documents on the fly. In case you want to have access to a document later, you can easily add it to the main vectorDB, where all the documents remain untouched and the users can chat with them whenever they want.

    • @NinVibe
      @NinVibe 3 months ago +1

      @@airoundtable can this whole app be containerized and deployed so it can be used online, or would that require a different approach?

    • @airoundtable
      @airoundtable 3 months ago

      @@NinVibe Containerization can be the right approach in general. But a more precise answer is: the app is probably not fully ready for that at this stage.
      This version of the app is fully functional, but there are some aspects that need to be taken into account for deployment.
      1. Size of the LLM: the LLM used in this chatbot is a 7B-parameter model and, in my opinion, it is not a good choice for a production-grade RAG chatbot. For now, a simple rule of thumb for selecting models is: the bigger, the better. So, my suggestion is to use a bigger model first. If you insist on working with a 7B-parameter model, my suggestion is to use the Mistral 7B model instead of Gemma, as it is a better model.
      1.2. Again, if you would like to work with 7B and 13B parameter models, you can alternatively consider using NVIDIA's recent chatbot (Chat with RTX). I have a video on that on the channel. They provide a chatbot with access to a 7B and a 13B model that you can install on your PC directly. The downside is that it needs to be installed on each system separately.
      2. Serving the model: The model is currently served by a Flask app on the same machine as the chatbot and it needs access to the GPU. So, this needs proper deployment considerations, such as making sure that the model container has access to the necessary GPU computation.
      3. The app currently creates the vectordb locally as well. So, in case you want to deploy it, it is better to allocate proper storage to the system for holding the vector database and other necessary information.
      4. In case you would like to collect users' feedback and other app-related data, this also needs proper storage attached to the app.
      5. It is best practice to implement proper authorization and access permissions to, first, make sure that only internal employees have access to the app, and second, that only the right people with the right permissions have access to the documents (if applicable).
      6. Then there are also security and maintenance concerns which need to be addressed based on your business needs.
      That being said, if you are planning to deploy the app for a couple of users and observe its performance and safety directly, you can probably skip a couple of these steps. But in case it is going to serve many users, the deployment aspects need to be considered carefully.

    • @NinVibe
      @NinVibe 3 months ago +1

      @@airoundtable thanks for the answers!

  • @omidsa8323
    @omidsa8323 4 months ago

    Just another fantastic video! Thanks Farzad

    • @airoundtable
      @airoundtable 4 months ago

      Thanks Omid, I'm glad to hear you liked the video!

  • @ginisksam
    @ginisksam 4 months ago

    Hi. Thanks for your detailed, step-by-step walkthrough of your code. Am enjoying and learning at the same time. 🙏

    • @airoundtable
      @airoundtable 4 months ago

      Hi @ginisksam, thanks! I am very happy to hear that!

  • @venys1388
    @venys1388 4 months ago

    Thank u so much

    • @airoundtable
      @airoundtable 4 months ago

      Thanks for the comment @venys1388! I am glad that you liked the content!

  • @KhanhLe-pu5wx
    @KhanhLe-pu5wx 2 months ago +1

    21:22 Where did you get that URL for the POST request? I don't see any URL like that in the output.

    • @airoundtable
      @airoundtable 2 months ago

      Check out src/llm_serve.py, and you will see that:
      @app.route("/generate_text", methods=["POST"])
      ...
      if __name__ == "__main__":
          app.run(debug=False, port=8888)
      Using these two, you can construct the URL that I showed in the video to interact with this module (the port number plus the route name in the function decorator tell you where the function expects to receive the input).
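      For example, here is a minimal sketch of calling that endpoint from Python (assuming the Flask server is running locally on port 8888 and expects a JSON payload; the exact payload keys depend on llm_serve.py):
      ```
      import requests

      # Hypothetical payload; check llm_serve.py for the exact keys the endpoint expects.
      url = "http://localhost:8888/generate_text"
      payload = {"prompt": "Hello, Gemma!"}

      response = requests.post(url, json=payload)
      print(response.status_code)
      print(response.json())
      ```
      Postman sends the same POST request; this snippet is just the scripted equivalent.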

  • @saeednsp1486
    @saeednsp1486 4 months ago +1

    hi Farzad, how are you?
    so I'm testing RAG platforms and currently I'm using privategpt + ollama {mistral instruct fp16 v0.2} + bge-m3 + bge-reranker-large
    I have a 3090
    the inference is super fast, but the results are not satisfying
    my question is: what's the best RAG platform right now if you want to go fully open source?
    also, please try your RAG with more complex PDF files, not easy text files like a story

    • @airoundtable
      @airoundtable 4 months ago +1

      Hi Saeed, thanks. I hope you are doing well.
      I think you will get the answer to a couple of your questions if you check out this video:
      th-cam.com/video/nze2ZFj7FCk/w-d-xo.htmlsi=jdX8-kOvnHtdqYsB (Langchain vs Llama-index).
      There I test these two frameworks with 40 different questions on various types of documents and analyze their performance with different metrics.
      With a 3090, I first suggest installing Chat-with-RTX and keeping it as a baseline. With that you will have access to a 7B and a 13B parameter model, which can give you a good idea of how models of these sizes perform on your documents. (There is another video on the channel about that as well.) But in general:
      1. The bigger the LLM, the better the results in RAG. So, if you want to go open source with a 3090, I would try to find the biggest model that I can load with 8-bit or 4-bit quantization (see the sketch after this reply).
      2. For the embedding model, the best one out there is BAAI/bge-large-en.
      3. You are already using the best reranker as well: BAAI/bge-reranker-large.
      4. Depending on your LLM context length: start exploring different chunk sizes, chunk overlaps, k (number of retrieved contents), and system roles. (In this video, in order to make Gemma perform well on the documents, I spent around half an hour optimizing just the system role and eventually came up with a very simple one.)
      5. With regard to the RAG technique itself: I recommend Langchain and, for the retrieval step, similarity search and MMR search. (Then start exploring graph-based techniques and see if they can improve the performance.)
      These are just my general suggestions. In the Langchain vs Llama-index video I discuss and show that, depending on your document types, there may be a different technique with a different config that works best. So, you will have to spend some time exploring them in case you are looking for very precise answers.
      Personally, I start with a simple pipeline and gradually improve it by adding different components to see and evaluate the effect of each one step by step. My first go-to would be Chat-with-RTX, and for development I start with a simple pipeline without any specific chunking strategy (just separating the documents by page) and without a reranking model. Then I start developing on top of that.
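      As a rough illustration of the quantized loading mentioned in point 1, here is a minimal sketch using transformers with bitsandbytes (the model name is just an example; pick the biggest model that fits your VRAM):
      ```
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

      model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model

      # 4-bit quantization config so a larger model can fit in 24 GB of VRAM
      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.float16,
      )

      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          quantization_config=bnb_config,
          device_map="auto",
      )
      ```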

  • @AliFarooq-yg6fn
    @AliFarooq-yg6fn 2 months ago +1

    Hi, I have a question. I already implemented RAG with Mistral Dolphin 7B. I also tested some advanced RAG techniques like ensembles, parent-child, and multi-query retrievals. I don't have a GPU, so I run my LLM on an LM Studio server. In your video you say we need a GPU; can we also use a CPU and build the same interface and project? I also want to deploy my app on a server. I looked into your video in which you are using Flask to deploy the app, but it's local; what do I have to do to deploy it on a server? I also came across the Heroku platform for deploying applications with a GPU. I am confused: can I build the application like you did on my local system without a GPU for testing and then deploy it on a server?

    • @airoundtable
      @airoundtable 2 months ago +1

      Hi, you can run the whole app on the CPU easily by changing the device map in this part of the code:
      ```
      model = AutoModelForCausalLM.from_pretrained(
          pretrained_model_name_or_path="google/gemma-7b-it",
          token=APPCFG.gemma_token,
          torch_dtype=torch.float16,
          device_map=APPCFG.device,
      )
      ```
      However, I wouldn't recommend it, as it won't be efficient and it will definitely introduce delay into the system. Besides that, you can use frameworks like llama.cpp and others that are designed for running LLMs on the CPU more effectively. But overall, I don't think switching this project directly from GPU to CPU will make a production-grade RAG chatbot. If you don't want to use a GPU, you can also consider using LLMs through API calls, like the OpenAI models. I have a video on that; you need no GPU for that project.

  • @tk-tt5bw
    @tk-tt5bw 2 months ago +1

    This is great, but I think it would be better if you did your projects on Colab.

    • @airoundtable
      @airoundtable 2 months ago

      Thanks. Colab is nice, but my main goal is to create projects that anyone can pick up and use as a starting point for a practical application. Colab would require a lot of refactoring to reach the point I showed in the video. However, for the next videos, I am creating both the notebooks and a sample project for a more generic use case.

  • @doctorbill37
    @doctorbill37 4 months ago +1

    I appreciate this example of a local LLM RAG deployment as well as your attention to detail. I wish every presenter was this thorough!
    I have the Gradio UI working for queries against the documents that I ingested in the manual ingestion step. However, when I attempt to upload PDFs from the interface, I get an "Error" message on the screen. When I look at the app.py terminal, the error is: "TypeError: isfile: path should be string, bytes, os.PathLike or integer, not _TemporaryFileWrapper"
    Any idea what might be going wrong with the file path and what needs to be changed?
    FYI I had to set Gradio to version 3.48.0 to overcome the error: "cannot import name 'RootModel' from 'pydantic'" and perhaps this caused the problem above?

    • @airoundtable
      @airoundtable 4 months ago +1

      Hi @doctorbill37,
      Thanks a lot for your kind words!
      I have a good idea where the error is coming from, but I'm not certain what's causing it. It might have to do with downgrading gradio. I'm using gradio==4.13.0, and I just tested the project again, it worked fine.
      So, first, I suggest you try this combination:
      pydantic==2.6.2 & gradio==4.13.0
      If you're curious about the problem's source:
      1. When you upload a document in app.py, it triggers line 95, which then calls
      2. line 17 of the process_uploaded_files method in the upload_file.py module.
      In this function, the first argument is "files_dir". It should be a list of strings, each representing the path of a PDF file you selected to chat with. For some reason, in your case, this argument isn't passing the function the proper type (string).
      You can print "files_dir" and check the terminal to see what it contains when you pass the PDF files. It should look something like:
      ["1st pdf path", "2nd pdf path", "3rd pdf path", and so on].
      This way, you can debug it. I hope this explanation helps you solve the problem. Please let me know in case you have any other questions.
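      If it turns out that Gradio is handing the function file objects instead of path strings (which is what the _TemporaryFileWrapper error suggests), a small defensive sketch like the following could normalize them before the rest of process_uploaded_files runs (this is a hypothetical helper, not part of the repo):
      ```
      import os

      def to_file_paths(files_dir):
          """Normalize Gradio upload results to a list of path strings."""
          paths = []
          for f in files_dir:
              # Some Gradio versions pass tempfile wrappers / file objects; others pass strings.
              path = f if isinstance(f, (str, os.PathLike)) else getattr(f, "name", None)
              if path is not None and os.path.isfile(path):
                  paths.append(str(path))
          return paths
      ```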

    • @airoundtable
      @airoundtable 4 months ago +1

      I also just added a requirements.txt file to the project. Feel free to use it for installing the libraries in your environment. (I am using Windows 11 and Python 3.11.7)

    • @doctorbill37
      @doctorbill37 4 months ago

      @@airoundtable OK, I am happy to report that I got it running. I cross-checked with the requirements file and redid the dependencies. That did the trick. Thank you.
      So we have documents that can be uploaded and queried. In terms of maintenance, how do we remove documents from the RAG collection if we want that at a later point in time?

    • @airoundtable
      @airoundtable 4 months ago +1

      Glad to hear it @@doctorbill37! If I understood correctly, you are looking for a way to remove specific vectors in the index which belong to a specific document. If that is the case, please have a look at these links.
      Every index has its own method. In this chatbot I am using Chroma. I have never tried this on Chroma before, but in the links below you can find a couple of suggested solutions. I hope this helps.
      github.com/langchain-ai/langchain/discussions/9495#discussioncomment-7451042
      github.com/langchain-ai/langchain/discussions/9495#discussioncomment-6769802
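      For reference, the approach discussed in those threads boils down to something like the following sketch with LangChain's Chroma wrapper (untested here; the metadata key and persist directory are assumptions based on how the chatbot stores its index):
      ```
      from langchain_community.vectorstores import Chroma
      from langchain_community.embeddings import HuggingFaceEmbeddings

      embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en")
      vectordb = Chroma(persist_directory="data/vectordb", embedding_function=embeddings)

      # Find the vector IDs whose metadata points at the document to remove,
      # then delete just those IDs from the collection.
      target = vectordb.get(where={"source": "docs/old_report.pdf"})
      if target["ids"]:
          vectordb.delete(ids=target["ids"])
      ```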

    • @doctorbill37
      @doctorbill37 4 months ago

      @@airoundtable You are correct, that is exactly what I meant. I have seen other RAG examples and (so far) have not seen how to remove specific entries from a vector db. I appreciate you going the extra mile and pointing me in the right direction with the links you have provided. Perhaps at some point you could show an example of how this is done if you have both the time and the interest. Thanks again for your assistance.

  • @ashwinisivanandan1715
    @ashwinisivanandan1715 3 months ago

    Hey! Thanks for the lovely videos....
    Just a question: how do I work without a GPU? It's asking me for an NVIDIA driver, which is not installed on my system. And I do not intend to install it either... So is there any workaround you may suggest?

    • @airoundtable
      @airoundtable 3 months ago +1

      Thanks! This chatbot requires a GPU with at least 8 GB of VRAM. So, in case you want to run an LLM on the CPU, there are two well-known methods: llama.cpp and Ollama. But to make them work, you have to go through some steps. Another key aspect of running an LLM on the CPU is having enough memory for it, so keep that in mind as well. In case you want to test them, you can start by checking these two websites:
      - towardsdatascience.com/set-up-a-local-llm-on-cpu-with-chat-ui-in-15-minutes-4cdc741408df
      - www.datacamp.com/tutorial/llama-cpp-tutorial
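      As a rough idea of what the llama.cpp route looks like in Python, here is a minimal sketch with the llama-cpp-python bindings (the GGUF file name is just a placeholder; you would download a quantized model first):
      ```
      from llama_cpp import Llama

      # Load a quantized GGUF model for CPU inference (placeholder path).
      llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

      output = llm(
          "Summarize what Retrieval Augmented Generation is in two sentences.",
          max_tokens=128,
      )
      print(output["choices"][0]["text"])
      ```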

  • @user-zf1ur9lh7o
    @user-zf1ur9lh7o 4 months ago

    Thanks for your efforts. I really like your explanation style. I have a question that really kills me :) ..... How do I control the LLM response and be assured that it will come from the RAG context? Are there any specific techniques for that? ......... Also, how do I provide feedback about the response to the LLM? I can see you are showing thumbs up and down here... but where is this feedback saved? And how do I inform the LLM of this feedback? I am sorry if my question is primitive (maybe it is for others) but for me it is very important to understand these questions. Thank you again

    • @airoundtable
      @airoundtable 4 months ago +2

      Thank you for your kind words! I am glad to hear that you liked the video!
      Great questions! Here are the answers.
      Q1&2. How to control the LLM response and be assured that it will be from the RAG? are there any specific techniques for that?
      1. Play around with the system role and make sure that the LLM has received the proper instructions (If you check out the RAG-GPT video or project you can see how I am instructing the GPT model not to use its own knowledge under any circumstances).
      2. Use bigger LLMs. The smaller the LLM, the higher the amount of hallucination. Bigger LLMs (e.g. GPT models) are much better at following instructions, to the extent that the model will barely use its own knowledge.
      3. Use multi-step verification before providing the answer to the user. This is a bit advanced, but in case you want to use open-source LLMs you can use LLM chains to evaluate the answer against the given context in multiple steps before showing it to the user.
      Q2, 3, & 4. How to provide feedback about the response to the LLM? Where is this feedback saved? And how to inform the LLM of this feedback?
      For RAG chatbots it is better not to provide feedback directly to the LLM. Feedback is more of a metric for the development team to understand the weaknesses and strengths of the RAG chatbot so they can improve the system behind the scenes (by improving document processing, RAG techniques, LLM system roles, etc.). In Gradio, when someone selects a thumbs up or thumbs down, you get the value in the code. Check out open-source-RAG-gemma/src/utils/ui_settings.py lines 32 to 36. You will see how I am collecting the feedback through "data.value". Then you can store this value in a table along with whatever other information you think might be useful, and over time start analyzing it and making decisions on how to improve your system (see the sketch after this reply).
      Finally, there are no primitive questions. Every question is valuable. Feel free to ask your questions anytime.
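      To illustrate the feedback-collection idea, here is a minimal, self-contained sketch of wiring a Gradio thumbs up/down event to a log file (this is an assumption-level example, not the repo's ui_settings.py; it just shows the data.liked / data.value pattern):
      ```
      import json
      import gradio as gr

      def log_feedback(data: gr.LikeData):
          # Append each thumbs up/down event to a JSONL file for later analysis.
          with open("feedback_log.jsonl", "a") as f:
              f.write(json.dumps({"liked": data.liked, "message": data.value}) + "\n")

      def respond(message, history):
          # Placeholder bot; in the real app this would be the RAG pipeline's answer.
          return "echo: " + message

      with gr.Blocks() as demo:
          chatbot = gr.Chatbot()
          msg = gr.Textbox()
          msg.submit(lambda m, h: ("", h + [(m, respond(m, h))]), [msg, chatbot], [msg, chatbot])
          chatbot.like(log_feedback, None, None)

      demo.launch()
      ```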

  • @Mike-Denver
    @Mike-Denver 4 months ago

    This is awesome! Thank you! One thing though: by now pretty much everyone has pointed out that Gemma is a poor model. Many experiments on YT show its underperformance. How about making use of another open LLM? Can you experiment with LM Studio?

    • @airoundtable
      @airoundtable 4 months ago

      Thanks Alex! Glad to see you liked the video. You are completely right about Gemma :)). I saw the criticism. However, I'd say testing these models on benchmarks is one thing and designing a RAG chatbot for production is something else. In my opinion, an industry-grade RAG chatbot needs at least a 70B-parameter model unless the use case is very limited. So, my main goal in this video was to show how an open-source RAG chatbot can be developed on-prem. For an industry-grade chatbot I uploaded a video on a project called RAG-GPT, and for low-code solutions I have a video on "Chat-With-RTX". Depending on the level of complexity that you are comfortable with, you may find these two chatbots interesting.
      But if you can tell me what type of use case and performance you have in mind, I would be able to brainstorm better with you.
      In the meantime, I will keep LM Studio in mind for a video down the line. Thanks for the suggestion!

  • @horyekhunley
    @horyekhunley 4 months ago +1

    Great stuff. Instead of using PDFs, can you do a tutorial on using a large CSV doc for RAG?

    • @airoundtable
      @airoundtable 4 months ago +1

      Thanks @horyekhunley! Sure, Langchain recently announced a method for working with SQL and databases. I will keep it in mind to make a video around that.

  • @user-st1br5fe4x
    @user-st1br5fe4x 4 months ago

    Are there any good open-source embedding models? If someone wants to keep their data private, won't ada require you to send your data to OpenAI?

    • @airoundtable
      @airoundtable 4 months ago

      Bge-large from Huggingface is the best open-source embedding model out there right now. It has an embedding dimension of 1024, and I am using it in this video. You are right about ada. However, in this video bge-large was used for creating the index, performing the vector search, and embedding the queries (all the steps that require the embedding model). The performance is not as good as ada, but it is very good for an open-source model.
      In the video on the channel where I explain the concept of vector search ("Vector search explained"), I compare bge-large with ada and analyze their performance.
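      For context, loading bge-large as a local embedding model in LangChain looks roughly like this (a minimal sketch; normalize_embeddings is a common setting for BGE models, but check the model card):
      ```
      from langchain_community.embeddings import HuggingFaceBgeEmbeddings

      embeddings = HuggingFaceBgeEmbeddings(
          model_name="BAAI/bge-large-en",
          model_kwargs={"device": "cpu"},          # or "cuda" if a GPU is available
          encode_kwargs={"normalize_embeddings": True},
      )

      vector = embeddings.embed_query("What is Retrieval Augmented Generation?")
      print(len(vector))  # bge-large-en produces 1024-dimensional vectors
      ```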

    • @user-st1br5fe4x
      @user-st1br5fe4x 4 months ago

      Thanks @@airoundtable

  • @TooyAshy-100
    @TooyAshy-100 3 months ago +1

    Hi,
    How can open-source tools and frameworks be utilized to evaluate the performance of a Retrieval-Augmented Generation (RAG) system that integrates Large Language Models (LLMs) like Google Gemma?

    • @airoundtable
      @airoundtable 3 months ago +2

      Hi, to evaluate RAG systems, we usually use the help of LLMs themselves. But before that, we need to understand what the challenges in a RAG system are. Here is a brief summary of some of the key steps:
      1. On the data preparation side: data quality + chunking strategy + embedding quality
      2. On the retrieval side: user's query quality + search quality + relevance of the retrieved documents' contents to the query
      3. On the synthesis side: context overflow + LLM hallucination + answer relevance
      These are the components that need to be adjusted and evaluated in a RAG system. For the evaluation pipeline itself, you can either use the frameworks that are being developed for this purpose, e.g. TruLens, Langsmith, Galileo (my recommendation: Langsmith),
      or you can design a custom pipeline depending on your goal and use case. I am not aware of any 100% open-source framework for evaluating RAG, but I have a video on the channel called "Langchain vs Llama-index" where I design a custom end-to-end pipeline and evaluate the performance of 5 different RAG techniques. There I go into much more detail about this topic.
      Overall, regardless of the approach, the main goal would be to evaluate the aspects that I mentioned in 1, 2, and 3.

    • @TooyAshy-100
      @TooyAshy-100 3 months ago +1

      @@airoundtable
      Thank you for the detailed explanation of the key components and challenges in evaluating Retrieval-Augmented Generation (RAG) systems. Your breakdown of the data preparation, retrieval, and synthesis aspects provides a clear framework for understanding the evaluation process.
      I appreciate your recommendation of frameworks like TruLens, Langsmith, and Galileo, with Langsmith being your top suggestion. It's also helpful to know that custom evaluation pipelines can be designed based on specific goals and use cases.
      I'm intrigued by your mention of the "Langchain vs Llama-index" video where you demonstrate an end-to-end custom pipeline to evaluate five different RAG techniques. It sounds like a valuable resource for gaining a deeper understanding of the evaluation process.
      Building upon your response, I have a follow-up question:
      When designing a custom evaluation pipeline for a RAG system, what are some key metrics or benchmarks that should be considered to assess the performance of each component (data preparation, retrieval, and synthesis)? Are there any industry-standard metrics or emerging best practices in this area?
      Thank you again for sharing your expertise on this topic. I look forward to learning more about effective evaluation strategies for RAG systems.

    • @airoundtable
      @airoundtable 3 months ago +1

      @@TooyAshy-100 I am glad that the explanation was helpful!
      At 14:05 of the Langchain vs Llama-index video I briefly mention this aspect. There are a couple of well-known metrics that are currently being used for evaluating RAG performance, and there are multiple configs that, together with these metrics, provide good feedback. For example:
      On the data preparation side: chunking strategy configs
      On the embedding side: the vector embedding quality
      On the retrieval side: context relevance is used for evaluation
      On the synthesis side: answer relevance is the metric that can be used for evaluation
      For further reading, I also introduced these two papers in that video, which give a clear explanation of each metric and the aspects of evaluation:
      arxiv.org/abs/2312.10997v1
      arxiv.org/abs/2312.10997

    • @TooyAshy-100
      @TooyAshy-100 3 months ago +1

      @@airoundtable
      I really cannot thank you enough. A professional response with sources.
      Thank you for your effort and valuable time in clarifying and preparing the references.
      I wish you all the best, and I look forward to more of your videos which are considered an inspiration of great benefit.

    • @airoundtable
      @airoundtable 3 months ago

      @@TooyAshy-100 Thanks for your kind words!

  • @musumo1908
    @musumo1908 3 months ago

    These videos are amazing! Can you release a version with the summary doc task added back….?? Subject to LLM model…thanks

    • @airoundtable
      @airoundtable 3 months ago

      Thanks! I will add it to the chatbot, but I will release the next video after a couple of serious updates. In the meantime, in case you want to add it to the chatbot yourself, you can easily use the code from RAG-GPT:
      RAG-GPT/src/utils/summarizer.py
      This file contains the summarizer class. The only change that needs to be applied is replacing the OpenAI chat completion call with the code for the open-source LLM you are using (see the sketch after this reply).
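      As a rough sketch of that swap (assuming the summarizer builds a prompt string and expects text back; the function and model names here are placeholders, not the actual summarizer.py code):
      ```
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
      model = AutoModelForCausalLM.from_pretrained(
          "google/gemma-7b-it", torch_dtype=torch.float16, device_map="auto"
      )

      def summarize(text: str, max_new_tokens: int = 256) -> str:
          # Replaces the OpenAI chat-completion call with a local Gemma generation.
          prompt = f"Summarize the following text:\n\n{text}\n\nSummary:"
          inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
          outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
          return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
      ```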

    • @musumo1908
      @musumo1908 3 months ago +1

      @@airoundtable hey thanks so much! I will try to add this back and add LiteLLM! I’m a v v low coder though. Fantastic channel

    • @airoundtable
      @airoundtable 3 months ago +1

      @@musumo1908 In case you have any doubts about it, you can watch the corresponding part of the RAG-GPT video; I explained that part in detail. It should not be too hard to pull off. I hope you can get it working quickly.

  • @KhanhLe-pu5wx
    @KhanhLe-pu5wx 2 months ago +1

    What do you have in the .env file?

    • @airoundtable
      @airoundtable 2 months ago +1

      I have GEMMA_TOKEN. This is the token that I mentioned we need to generate from Huggingface to access Gemma.

    • @airoundtable
      @airoundtable 2 months ago

      @@KhanhLe-pu5wx Correct. GEMMA_TOKEN is the variable name and the Huggingface token is the value.
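      For anyone unsure how that token gets picked up, a minimal sketch of the usual python-dotenv pattern looks like this (the error handling here is illustrative, not necessarily how the project's config class does it):
      ```
      import os
      from dotenv import load_dotenv

      # The .env file (kept out of version control) contains a line like:
      # GEMMA_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

      load_dotenv()  # reads the .env file into environment variables

      gemma_token = os.getenv("GEMMA_TOKEN")
      if gemma_token is None:
          raise RuntimeError("GEMMA_TOKEN is not set; add it to your .env file.")
      ```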