In the corporate world I always use LLMs with RAG :) - I will check this model - thank you for sharing :)
Nice job! I look forward to the day when we can either make smart models tiny or run huge models on regular hardware.
Would love it if you could showcase a working RAG example with live-changing data, for example an item price change or a policy update. Does it require manually managing chunks and embedding references, or are there better existing solutions? I think this is what really differentiates a fun to-do from actual production systems and applications.
Thanks and all the best! Awesome video ❤
This is often done by having the RAG step return a variable and then just looking up the latest price etc. for that variable at query time. You probably don't want to put info you expect to change into a vector DB.
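Rough sketch of the pattern, purely for illustration (the SKU, the price table, and the helper functions are all made up):

```python
# Keep stable facts in the vector store; resolve volatile values (like price)
# at answer time instead of embedding them.

PRICES = {"SKU-123": 19.99}  # stand-in for a live database or pricing API


def retrieve(query: str) -> dict:
    # Pretend this chunk came back from a vector DB: it stores the SKU,
    # not the price itself.
    return {"text": "The Acme Widget (SKU-123) ships worldwide.", "sku": "SKU-123"}


def build_prompt(query: str) -> str:
    doc = retrieve(query)
    live_price = PRICES[doc["sku"]]  # looked up fresh on every request
    context = f"{doc['text']} Current price: ${live_price}"
    return f"Answer the query with the context below:\n{query}\n{context}"


print(build_prompt("How much is the Acme Widget?"))
```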
Thanks for the video. It would be great to see a video on how to use this with LangChain, using tools and agents.
Thanks, I will put something like that together.
@@samwitteveenai I would love it!!
Thanks for pumping out another one bro!
I love your videos. I hope you can have some community posts soon. In the meantime, I would love to see information about Open Devin, as well as some clever way to get the cheapest Anthropic model (Haiku) to do pre-processing or post-processing of messages: either in parallel, with the results grouped by another layer, or in series, with the context window kept in the smaller model. That smaller model could process queries and explain how a more capable model like Opus or GPT-4 would handle them, without the larger model needing the full context window or the images, which could be pre-processed by Sonnet or Haiku and described in a few words to the more capable agent (Sonnet, Opus, or another)…
I am impressed that all three recent Claude models are equivalent in their image capabilities and long context windows; only the price differs, along with their overall capability. But I have not experimented enough with them to really see where each one has strategic advantages over the others…
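Something like this minimal sketch is what I have in mind, using the Anthropic Python SDK (the model names are the current Claude 3 ones; the prompts and the pre-processing step are just illustrative assumptions, not a tested pipeline):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def cheap_preprocess(user_query: str) -> str:
    # Haiku condenses / cleans the query before the expensive model sees it.
    msg = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Rewrite this as one short, precise question:\n{user_query}",
        }],
    )
    return msg.content[0].text


def answer_with_opus(condensed_query: str) -> str:
    # The more capable (and pricier) model only ever sees the condensed query.
    msg = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": condensed_query}],
    )
    return msg.content[0].text


print(answer_with_opus(cheap_preprocess(
    "so like, what's the deal with returns if my thing broke??"
)))
```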
Nice video, thx. Gotta check this; wondering how it does compared to the Perplexity online models, which performed a bit mixed in my tests.
Good question. I haven't used Perplexity a lot, but my guess is you could build something like Perplexity with this pretty easily.
@@samwitteveenai What I meant is that Perplexity has offered an API with 7B and 70B models similar to this for a while. Though this looks like a cleaner solution.
Very nice video, as always.
I wonder how this works with Raptor retrieval.
Thanks!
Ohh I was playing with Raptor on the weekend, that is very cool. Haven't tried with this model, but my guess is it will do well.
I will try it within Llama Index... and let you know.
My aim is to build summaries based on a predefined document structure. Maybe I will try to "coerce" or influence the clustering, inspired by something like HyDE... Not sure if Pydantic would also be useful, but it's probably less flexible...
Thank you again.
Interesting, although I am a bit confused. Isn't RAG itself just a code implementation? The model itself doesn't do the retrieval. So, with that in mind, what about the model makes it a retrieval model? Is it just the needle-in-a-haystack performance and function calling?
While this is true, the point is that this model was trained with RAG in mind. All the other models are general generation models that can do RAG. The idea here is that this model should be able to do RAG better because it was trained to do so, at least hopefully.
I guess the final query in RAG is always going to look something like:
“Answer the query with the context below:
{query}
{context}”
Where context is a list of paragraphs.
So if a model is trained on that style of prompt then it’s good for RAG.
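Roughly, in code (purely illustrative, not taken from any particular library):

```python
def build_rag_prompt(query: str, retrieved_paragraphs: list[str]) -> str:
    # Join the retrieved paragraphs into a single context block.
    context = "\n\n".join(retrieved_paragraphs)
    return (
        "Answer the query with the context below:\n"
        f"{query}\n"
        f"{context}"
    )


print(build_rag_prompt(
    "What is the return policy?",
    ["Items can be returned within 30 days.",
     "Refunds are issued within 5-7 business days."],
))
```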
@samwitteveenai Great video as always, legend! Was wondering if you've used LangChain with Groq Cloud / Mixtral 8x7B? I'm trying to swap out ChatOpenAI for the Groq Mixtral, but I'm not sure if it can work with 'bind_tools'. Any idea on this?
Thanks. Yes, I have got Groq working fine with LangChain, but I haven't tried the function calling. I don't think .bind works with the open-source models, unless you use it with the OpenAI spec and change the endpoint.
@@samwitteveenai how do you do that??
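Roughly like this, as far as I understand the OpenAI-spec approach - a minimal sketch assuming Groq's OpenAI-compatible endpoint and the langchain_openai package (the URL and model name are from memory, so double-check them):

```python
from langchain_openai import ChatOpenAI

# Point the OpenAI-spec client at Groq's endpoint instead of api.openai.com.
llm = ChatOpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
    model="mixtral-8x7b-32768",
)

print(llm.invoke("Say hello in one sentence.").content)
```

Whether bind_tools actually works over that endpoint is a separate question - this only covers the endpoint swap.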
Is it better than OpenAI's Assistants API with the Retrieval tool? The OAI API did not work well for me.
Yes, in my testing I think it is better, based on the Coral UI they have.
It feels like someone could train a model with the help of this and then use their own model (I don't know how that works, whether people doing this could be caught, or if it would be against the licensing agreement)…😮😮😮😮
😀 Might run into some licensing issues. But in theory this idea can be replicated without them as well.
🚨🚨🚨 NEW MODEL DROP ALERT 🚨🚨🚨
This seems like the perfect use case for Atlassian; they have some AI stuff, but no RAG. Do you happen to know if they plan on doing this? I wonder why they are sleeping on this...
Not sure, I haven't used Atlassian in quite a while.
Rather than dropping $20,000 on an NVIDIA card, just buy a MacBook Pro lol. My M2 laptop with 96GB of VRAM works great with 70b models. Save your five figures, or your time jockeying for GPU rental time. Almost all of us hobbyists just need an Apple laptop.
By “works great” I assume you mean an inference speed of about 5 tokens per second (slightly under the 7 t/s one can achieve with an M3 Ultra)? In other words, at a context size of 10k tokens - easily reached with a RAG-infused context within three or four messages - a wait time of 30 minutes per message? If so, you should probably clarify that, because I suspect 99% of people will disagree with your classification of that as “great”.
@@peterwlodarczyk3987 Sure thing! For a 70b model, llama.cpp reports: `( 55.04 ms per token, 18.17 tokens per second)` That's pretty typical, and _far_ exceeds my reading speed. Keep in mind:
1. For memory bandwidth, M2 > M3. This may, or may not, affect results.
2. The relationship between context length and inference speed is complex. It's highly sensitive to the hardware, model architecture, optimizations (e.g. KV cache scheme, quantization, etc), and integrations (e.g. RAG systems). I've never seen remotely near "30 minutes / message" generation speeds, but I've also never exceeded 32k context window sizes. 🤷♀
3. As I'm sure we all agree, my laptop inference of course won't surpass beastly setups which suck kilowatts through many thousands of dollars in dedicated multi-GPU hardware. But for most of us, that's okay! Apple silicon is an excellent alternative.
Apropos of nothing: I saw someone run the new 120b DBRX model on their M2 Ultra at 14t/s (using Apple's MLX framework), and can't wait to try it myself! Yeah, the context size probably won't be great, but it's something I simply could not afford to do without Apple hardware, full stop. I just want people to know that it's a good option!
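If anyone wants to reproduce those timing numbers, here's a minimal llama-cpp-python sketch (the model path is a placeholder, and it assumes a Metal-enabled build; your speeds will depend entirely on the hardware and quant):

```python
from llama_cpp import Llama

# Load a quantized 70B GGUF and offload all layers to the GPU (Metal on Apple silicon).
llm = Llama(
    model_path="./llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 = offload every layer
    n_ctx=4096,
    verbose=True,      # prints the "ms per token / tokens per second" timings after a run
)

out = llm("Q: What is retrieval augmented generation? A:", max_tokens=128)
print(out["choices"][0]["text"])
```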
I run different models on a 16-inch MacBook Pro with an M3 Max and 48GB of RAM, via Ollama or LangChain. My preference goes to Mixtral q4: speed-wise comparable to GPT-4, quality-wise to GPT-3.5. I haven't pushed the model on large context windows, but if you need to process a series of small to mid-size texts (reviews, emails, blog posts, short reports), the Mac laptops are a good option.
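A minimal sketch of that Ollama + LangChain setup, assuming you've already pulled a Mixtral q4 tag with Ollama (my tag names may differ from yours):

```python
from langchain_community.chat_models import ChatOllama

# Talks to the local Ollama server (default http://localhost:11434).
llm = ChatOllama(model="mixtral")  # whichever local tag you pulled

resp = llm.invoke(
    "Summarise this review in one sentence: great laptop, battery could be better."
)
print(resp.content)
```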