I really appreciate that most, if not all, of your Colabs don't use LangChain at all. I really like seeing what goes on under the hood to learn from a first-principles perspective.
These videos are such a high-quality collection of content for app developers in the AI space who are building apps and aren't AI experts (nor really care about the AI itself, they just want to use it).
Been anticipating this video since seeing the notebook on your GitHub! Thank you so much for your detailed explanations! Would be keen to see your implementations of NeMo Guardrails' moderation pipelines :)
Really appreciate you putting this together 🙏
Thanks for the intro to NeMo Guardrails! I kept expecting you to say tools like ... 'google' ... but you seemed to pause and then not say it 😂
Thanks, James, very very useful. Will try to include guardrails in our corporate RAG chatbot.
You glowed up like crazy in your video content! It's so cool!!!!
thanks Sajid - means a lot coming from you :)
Would this approach work if the vectorized data was shop inventory, and the question was something like "how many items do you have?" or asked about specifics of a group of items?
I want to build this for my research lab so that we can query information about our protocols, standards, etc. This seems really useful.
I presume it wouldn't be that hard to then embed it into a slack chatbot?
Can you make a video tutorial on creating data from Wikipedia?
Excellent video as always. Thanks for sharing. Is there a way to set up Colang for an "anything but" scenario? So far, I only seem to be able to program what to detect for a workflow. But can I set up a 'default deny' type thing? Anything different from the topic my bot is designed to handle returns an "I'm sorry, Dave. I'm afraid I can't do that"...
awesome video James!
Thanks Shaheer!
Excellent thank you James.
you're welcome!
Great video as always!
Thanks as always!
Brilliant mate, also don't forget this could be a massive cost optimizer along with speed :)
What is the difference in accuracy between reasoning (whether to retrieve) using embedding similarity vs. giving it to an LLM?
Why should I use guardrails? @jamesbriggs
I have Dialogflow, which has all the intents and flows (like in a Colang file) - I check the intent confidence, and if it is high I trigger the corresponding intent flow, and if it is low I retrieve the data from the source using a naive retrieval method?
Thank you for the super video. I wonder how we can do chain-of-thought (CoT) or tree-of-thoughts with Guardrails, without LangChain?
Can't you just do kNN with your embeddings to make sure the query isn't out of distribution? Isn't this a pretty quick Euclidean distance operation? Why bother with guardrails? Thanks for the great video! Keep it up.
1. Not all queries are straightforward. Complex queries might need more nuanced understanding and contextual analysis, which kNN might not handle well.
2. Guardrails can adapt to new rules and policies quickly, while kNN models might need retraining with new data.
3. Guardrails can provide more interpretable reasons for why a query is out-of-distribution or not appropriate, aiding better understanding and transparency.
However, using both of these together might be more robust.
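For context on the original question, the distance check itself is indeed cheap. Below is a minimal sketch (not from the video) of the kNN-style "is this query in scope?" test; it assumes you already have embeddings for a handful of example in-scope queries, and the function name and 0.75 threshold are placeholders that would need tuning.

```python
# Rough sketch of the kNN-style "is this query in scope?" check.
# Assumes `example_embeddings` is a 2D array of vectors for known in-scope
# queries, embedded with the same model as the incoming query; the 0.75
# cosine-similarity threshold is arbitrary.
import numpy as np

def is_in_scope(query_vec: np.ndarray, example_embeddings: np.ndarray,
                threshold: float = 0.75) -> bool:
    # Cosine similarity between the query and every example utterance.
    sims = example_embeddings @ query_vec / (
        np.linalg.norm(example_embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    # Only trigger retrieval if the nearest example is similar enough.
    return float(sims.max()) >= threshold
```

That kind of nearest-neighbour check is roughly what the guardrails layer does with its example utterances; Colang mainly gives you a structured place to define those examples and attach actions to them.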
When should one opt for RAG, fine-tuning, or just a Botpress knowledge base linked to ChatGPT? Thank you!
Great video! Any idea how to deal with screenshots in the documents?
useful work!
Great show, thank you. Question - it seems awfully similar to your (more recent?) videos about Semantic Router, or have I got the wrong end of the stick? I know I should do a similarity search on the text for each, I guess 😉! Thanks again.
Hi James, I'm enjoying your series greatly. A question or suggestion for a future video: I've been seeing a lot of articles on using graph data structures to build knowledge graphs to address issues such as hallucinations and weaknesses in logical reasoning in LLMs. I've only found one person who has actually done this, and they had mixed results as far as addressing these issues goes. I'm wondering what your experience has been in this area? Do you have an opinion? From what I can see, there is not much evidence (yet) that it gives better results than well-crafted semantic search.
I never tried it myself, but everyone I know who tried said it was hard to do and the results were either the same as or worse than using vector search - so I haven't had much reason to look into it.
Maybe at some point, if I see it being useful for a particular use case and it makes sense given the trade-offs, I'll try it out.
🎯 Key points for quick navigation:
00:00 *🔍 Introduction to retrieval augmented generation with guardrails for building chatbots.*
00:27 *📂 Utilizing vector database (Pinecone), embedding model (RoBERTa), and documents for retrieval.*
00:54 *🕸️ Two traditional approaches to RAG: naive approach and agent approach.*
02:25 *⌛ Agent approach is slower but potentially more powerful with multiple thoughts and external tools.*
05:23 *🛡️ Guardrails approach: Directly embedding query, checking similarity with defined guardrails, and triggering retrieval tool if needed.*
07:42 *🧩 Guardrails approach combines query and retrieved context, then passes to language model for answer generation.*
08:23 *⚡ Guardrails approach is significantly faster than agent approach while still allowing tool usage.*
09:03 *📋 Step-by-step implementation details, including data indexing, embedding, and Pinecone setup.*
13:12 *🔄 Defining retrieve and RAG functions as guardrail actions (see the sketch after this list).*
14:46 *🚫 Guardrails config to avoid talking about politics.*
15:15 *🤖 Defining guardrail for user asking about LLMs to trigger RAG pipeline.*
17:10 *🔥 Demonstrating RAG pipeline via guardrails, showing its effectiveness in answering LLM-related queries.*
18:04 *🆚 Comparing guardrails without RAG, which lacks information for LLM-related queries.*
19:55 *💡 Guardrails approach allows agent-like tool usage without slow initial LM call, making it faster for triggered tools.*
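To make the 13:12 step above more concrete, here is a rough sketch of what the retrieve and RAG actions can look like. It assumes the current OpenAI and Pinecone Python clients; the index name, model names, and prompt wording are placeholders rather than the notebook's exact code.

```python
# Rough sketch of the retrieve and RAG actions summarized at 13:12.
# Assumes the current OpenAI and Pinecone Python clients; index name,
# model names, and the prompt are placeholders, not the notebook's code.
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("nemo-guardrails-rag")

async def retrieve(query: str) -> list[str]:
    # Embed the query and pull the most similar document chunks.
    xq = client.embeddings.create(
        model="text-embedding-ada-002", input=[query]
    ).data[0].embedding
    res = index.query(vector=xq, top_k=3, include_metadata=True)
    return [m.metadata["text"] for m in res.matches]

async def rag(query: str, contexts: list[str]) -> str:
    # Combine the retrieved context with the query and ask the LLM to answer.
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n---\n".join(contexts)
        + f"\n\nQuestion: {query}"
    )
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content
```

These two functions are what later get registered as guardrail actions so a Colang flow can call them when a user asks about LLMs.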
Hey! I am researching open-domain question answering - how can I get data for that domain? Thank you
I want to start using RAG, but I want something fully local. What could be an alternative to Pinecone?
Except for Pinecone, almost all of the vector stores are open source. Also, I don't know about Pinecone since it's not free, but the others are mostly similar. I use ChromaDB for my personal projects since I started working on LLMs recently, and it is very user-friendly. You will handle it; the problematic part is the data.
Use Deep Lake; I have been using it for my projects and it is pretty good.
Yeah, if you want fully local there are open-source alternatives like Qdrant or Weaviate - for the comment above, Pinecone is free, they have the free/standard tier :)
Using pgvector here, directly on top of good ol' Postgres. Works like a charm.
Thanks very much for sharing, James. May I seek your advice on how to estimate infrastructure requirements, e.g. the number of GPUs, assuming I need to host an open-source 70B model on premises with at most 1000 concurrent users? Thank you very much.
You can calculate the number of parameters * the bytes required for each parameter's data type - people do keep asking about this, so I think I can go into more detail in a future video
Would really appreciate that! @jamesbriggs
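As a rough illustration of the calculation James describes above, here is a minimal sketch. The assumptions (fp16 weights, 80 GB cards, a crude 1.2x overhead factor) are mine, not from the video, and real serving for 1000 concurrent users needs much more headroom for batching and KV caches.

```python
# Back-of-the-envelope GPU count for hosting a 70B-parameter model.
# Assumptions (mine, not from the video): fp16/bf16 weights (2 bytes per
# parameter), 80 GB cards, and a rough 1.2x multiplier for KV cache and
# activations. Concurrent serving needs far more headroom than this.
import math

params = 70e9
bytes_per_param = 2        # fp16 / bf16
overhead = 1.2             # crude allowance for KV cache + activations
gpu_memory_gb = 80         # e.g. one 80 GB accelerator

weights_gb = params * bytes_per_param / 1e9   # ~140 GB just for the weights
total_gb = weights_gb * overhead              # ~168 GB with overhead
num_gpus = math.ceil(total_gb / gpu_memory_gb)

print(f"weights: {weights_gb:.0f} GB, total: {total_gb:.0f} GB, GPUs: {num_gpus}")
```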
Very informative video, thanks. Is there any chance you know of an open-source LLM that supports the Greek language for retrieval augmented generation?
Cohere has a multilingual embedding model - it probably covers Greek, and there are also multilingual sentence transformers you can use too :)
Thanks for your response @jamesbriggs. For the embedding part I have found a multilingual model which does an excellent job of retrieving the document most relevant to the question asked. What I cannot find is an open-source LLM for the generation part, which would generate the answer to the user's query based on the retrieved document (I am talking about the Greek language). The OpenAI tokenizer is very expensive since, from what I have noticed, it tokenizes Greek words at the character level, so using their model does not fit my task at hand. Anyway, if you ever notice a generative model which supports Greek, please mention it in your upcoming videos, which by the way I have to say have helped me a lot.
@georgekokkinakis7288 what about doing the RAG pipeline in English and then translating to Greek for your users? :P
@ashraymallesh2856 If I am not mistaken (please correct me if I am wrong), applying the RAG pipeline in English would first require translating the documents from Greek to English. As I mentioned in a previous post, the documents contain mathematical definitions and terminology. Using a translation model or the Google Translate API wouldn't work because, for example, Google Translate renders the words παραπληρωματικές and συμπληρωματικές both as "supplementary", which is not correct. On the other hand, translating all the documents by hand would be a tedious task. That's why I am looking for an open-source LLM which supports the Greek language. Any ideas are welcome 😁.
What do you think about accuracy and other related metrics when using guardrails? It really sounds nice, but if you use LLMs in fields with high risk (finance), does it promise accuracy as well, at least similar to standard approaches? Great videos by the way, I guess I've implemented almost all of them. And it's always nice to learn from a professional.
Also (if you are OK with that, since you also work for a company), if you could make a video about the hardware side of LLMs and DBs, that would be great. Because at some point there is enough information about coding and software (of course, not enough yet, but one can implement something somehow), but the hardware side really requires theoretical knowledge. I don't want to just check the tables and go buy some NVIDIA GPU, I want to know why. Thanks in advance.
It's hard to guarantee accuracy; LLMs and the broader field of NLP are generally non-deterministic, so there's always that level of randomness. I'm still figuring out the best way of dealing with it myself - we try to add metrics, or extra LLM analysis steps (like asking "is this answer using information from these sources…") - but it's a difficult problem
I like the GPU hardware idea, would love to jump into it
@James Are there any good "deterministic" ways to check the accuracy of information, e.g. by going through the reply and checking the information in it against that in the context? I've heard of SelfCheckGPT, which takes multiple iterations, but it's not deterministic. It would be great to have such a technique!
@rabomeister Outside of highly specialized and sensitive use cases requiring procurement of a commercial-grade GPU or TPU, and the talent and skill to use it effectively in a business process, there is no real advantage in spending $15-20K or more on the hardware, unless you just have the insatiable desire to do it for the hell of it and because you want to have your own - and that's OK too, my friend. Unfortunately, the cloud giants have structured the market in a way that makes getting compute from them more economically prudent than buying even one of the ASICs they have hundreds of thousands or millions of.
This is what I was searching for! Thanks James, your videos are very informative and easy to follow with Google Colab! My question would be: can we use information extracted from the vector DB for analysis by an LLM to provide insights or compare different documents, using guardrails or an agent? Thanks, keep up the great work!
It depends on what you’re comparing, but I see no reason as to why it couldn’t work! You can select an existing doc at random, perform a semantic search for similar docs and feed them into your LLM with instructions on what you’re comparing - there may be other ways of doing it too - I hope that helps!
Have you tried to set this up with GPT-4? I'm getting some errors switching from davinci to GPT-4.
Hey Andre! I usually avoid generating output with the built-in LLM function; I just use guardrails as a decision layer in the middle and then use actions to call LLMs like GPT-4.
How do I use guardrails and RAG with another LLM, like Falcon or Llama?
You can modify the model provider and name in the config.yaml file - they have docs on it in the guardrails GitHub repo :)
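For reference, the relevant part of config.yaml looks roughly like the sketch below. The models section follows the structure in the NeMo Guardrails docs, but the engine and model values here are placeholders - check the repo for the exact provider names supported for Falcon or Llama.

```yaml
# Minimal sketch of the models section in config.yaml (values are placeholders).
models:
  - type: main
    engine: openai           # swap for the provider hosting your Falcon/Llama model
    model: text-davinci-003  # swap for that provider's model name
```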
This is awesome, thanks James. Out of curiosity, do you know if this can be integrated with LangChain?
Absolutely - LangChain is code, and we can execute code via actions, like we did with our RAG pipeline here.
Does this have message history? Does the context carry over from one input to the next?
In this example no, but you can bring in a few previous interactions for embedding
@jamesbriggs why would you use embeddings on the previous interactions? Can you just use the ChatCompletion endpoint and pass the array of previous messages as `chat_history`?
@eightrice The ChatCompletion endpoint is more effective, and it's what you use for the "agent approach to RAG" - it's just slower.
In real-world use cases I have always used the pure agent approach, but I recently began experimenting with a mix of both: I try to capture obvious queries ("user asks about LLMs") with guardrails and send the single query directly to the RAG pipeline, while more general-purpose queries I direct to the typical agent endpoint (and include conversation history).
I'm still experimenting with the best approach, but so far this system seems to be working well for speeding up a reasonable portion of queries
@jamesbriggs yup, that hybrid architecture seems optimal if you need both normal chatbot functionality and subject matter knowledge with low latency. Thank you so much for this, I feel like I should be paying a lot for your code and tutorials :)
@eightrice yeah so far I've liked this approach - haha no worries, I'm happy it's useful :)
For implementing LangChain agents with NeMo Guardrails, do we need to do the below?
In the Colang file, first define the action which calls the function that runs the agent, like this:
$answer = execute custom_function(query=$last_user_message)
and then register it with the rails like this?
rag_rails.register_action(action=custom_function, name="custom_function")
Am I on the right track?
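That looks like the right track. Below is a minimal sketch of the Python side, assuming a ./config folder containing your config.yaml and Colang files, and an already constructed LangChain AgentExecutor named agent_executor (hypothetical, not from the video):

```python
# Minimal sketch: exposing a LangChain agent as a NeMo Guardrails action.
# Assumes a ./config folder with config.yaml plus Colang files, and an
# already constructed LangChain AgentExecutor named `agent_executor`
# (hypothetical - build it however you normally would).
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rag_rails = LLMRails(config)

async def custom_function(query: str) -> str:
    # Run the agent (or any RAG pipeline) and return its answer so the Colang
    # flow can use it: $answer = execute custom_function(query=$last_user_message)
    return await agent_executor.arun(query)

rag_rails.register_action(action=custom_function, name="custom_function")
```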
This method is very simple to demonstrate with a toy example, but you need a lot of hard work in a real business environment to build it and test whether it's really working or not. Using simple sentences + embedding distance for decision making is not really a reliable solution.
I use it in production; it can at times be more reliable than LLMs if you define the semantic vector space that should trigger an action well - typically I view prompt engineering as the broad stroke, and guardrails as the fine-tuning of your chatbot behavior, so when you have specific RAG workflows like "refer to HR docs", "refer to eng docs", "refer to company Y DB", guardrails can be very helpful
But you're very right, it needs a lot of work, testing, and iterating over the guardrails to get something reliable
Did you smoke something before recording this?
I have a relaxed nature 😂
😂😂😂Hilarious