Haha.
Last night I was chatting with someone about how to come up with data to solve this problem. I have a story writing engine that can blow through $10 of tokens in minutes. It is getting really expensive just to develop it.
This morning I was going to look around to see if anybody had something like this.
And the solution to my quest is in the first video I watched this morning.
I hope the rest of my day is this awesome.
You could probably modify the code as well so that you have both a "debug" context and a "production" context so that cheaper LLMs could be used when the final output doesn't require the most expensive tokens.
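A minimal sketch of that debug/production split, assuming the context is chosen through an environment variable. The model names and the `APP_CONTEXT` variable are hypothetical placeholders, not anything from the framework in the video.

```python
import os

# Hypothetical model names; substitute whatever your provider offers.
MODEL_BY_CONTEXT = {
    "debug": "cheap-model",           # fast and cheap while iterating
    "production": "expensive-model",  # used only when the final output matters
}


def pick_model(context=None):
    """Return the model name for a context.

    Falls back to the (hypothetical) APP_CONTEXT environment variable,
    and to "debug" when that is not set.
    """
    context = context or os.environ.get("APP_CONTEXT", "debug")
    if context not in MODEL_BY_CONTEXT:
        raise ValueError(f"unknown context: {context!r}")
    return MODEL_BY_CONTEXT[context]
```

Defaulting to "debug" means an unconfigured development run never touches the expensive model by accident.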
@toadlguy
Thanks. You have pointed me to solutions several times.
We were trying to figure out how to get prompt plus user choice data. There are plenty of sites that run your prompt against 2 LLMs and ask which you prefer. That is some really valuable data...
Now I don't need to - and that would have blocked me.
My current algorithm involves looking at the size and complexity of the prompt. Most of my prompting is highly structured and templatized - I expect RouteLLM will pick ChatGPT 4o most of the time.
But later today I will find out :)
Now I suspect I need to find a way to generalize the use of tokenizers and different embeddings and such... Do you have a quick and easy solution to this, too?
Thanks for all the help.
@toadlguy
It was a nice thought...
The application I am making uses prompts that contain RDF - state/node/conditional edge... names, descriptions, pass messages, fail messages...
The prompt then acts as a state engine outputting the proper messaging based on the non-templatized stuff passed in.
This is ChatGPT 4o all the way.
@JohnBoen Do post how it went once you test it.
@anantgupta3285
Sure.
I get different results - undesired results - when I try using lesser models.
My initial thoughts:
* Templatized agent prompts work differently in different LLMs.
* It would be quite a bit of work to revise things so I could dynamically select the LLM - this is not something I considered when I started playing with LangChain...
Not gonna work for this project.
However - I am writing an auto-tester to execute my standard test suite against new LLMs...I will plug it into that...
However...
ChatGPT 4o $20 plan supports 8k tokens. I assume 750 words for 1000 tokens... 5600 words in the context window.
Realistically this limits my story engine to about 10 275-word pages of text. A couple of chapters before it loses track of the story metadata and rules...
A team account would get me 32k tokens perhaps 50 pages...
I think I might just need to use Google's million token context window...
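A rough sketch of that arithmetic, assuming the 750-words-per-1,000-tokens ratio and the 275-word page from the comment above. These are upper bounds for a window filled entirely with story text; prompt templates, story metadata, rules, and the model's own output all eat into them, so realistic figures are lower.

```python
WORDS_PER_1000_TOKENS = 750  # ratio assumed in the comment
WORDS_PER_PAGE = 275         # page size assumed in the comment


def context_budget(context_tokens):
    """Return (words, pages) that fit in a window of `context_tokens` tokens."""
    words = context_tokens * WORDS_PER_1000_TOKENS // 1000
    pages = words // WORDS_PER_PAGE
    return words, pages


# Window sizes mentioned in the thread: 8k, 32k, and a million tokens.
for label, tokens in [("8k", 8_000), ("32k", 32_000), ("1M", 1_000_000)]:
    words, pages = context_budget(tokens)
    print(f"{label}: ~{words} words, ~{pages} pages")
```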
Would be interesting to see how this and MoA (mixture of agents) could be used together.
Perhaps the route could go to a different model that uses several smaller agents (models) together, medium agents together, and larger agents together and/or mixed with smaller agents
Great one Sam. So, to make this all about me :-), I've been using GPT4x as the router/manager under the theory that it is the smartest (this is a Mixture of Agents). Then the agents are cheaper. I can see this is much better. Thanks!
Wow, this is great that it was released with the entire framework open source, as I believe that this (or something like it) will be part of the interface we will all be using soon. The other component is determining what data is required to respond. For instance, does the query require proprietary or personal data? This would first create a context (through RAG) for that data but also determine which LLMs would be available to that context based on the required security (do you even want to send the proprietary data to a commercial LLM?). Also with Llama3 8B, this could be done locally (at almost no cost). BTW, this is part of the framework that Apple will be implementing, but can be tailored for many other applications now using this framework and LangChain (for instance).
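One way to picture the security step: narrow the list of candidate models before any cost/quality routing happens. This is only a sketch. The catalog, the model names, and the `runs_locally` flag are all hypothetical, and deciding whether a query needs private data is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    runs_locally: bool  # True if prompts never leave your own hardware
    cost_rank: int      # lower means cheaper


# Hypothetical catalog; replace with the models you actually have access to.
CATALOG = [
    ModelInfo("local-8b", runs_locally=True, cost_rank=0),
    ModelInfo("hosted-small", runs_locally=False, cost_rank=1),
    ModelInfo("hosted-frontier", runs_locally=False, cost_rank=2),
]


def allowed_models(needs_private_data, catalog=CATALOG):
    """Return the models a router may choose from for this query."""
    if needs_private_data:
        # Proprietary or personal context stays on local models only.
        return [m for m in catalog if m.runs_locally]
    return list(catalog)
```

A router would then pick among `allowed_models(...)` instead of the full catalog.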
Great opensource release, thanks for the video
Claude 3.5 Haiku with this framework is gonna be insane. Nice video as always !
Great work and excellent explanation. Thank you
Good one. Really it will help enterprises to save cost.
totally agree!
Now that is OPEN 😮 wow. Great work!
I'd like to see more examples of applications of LLMs.
This makes so much sense
This is good stuff 🙌♥️
This is actually very interesting. Concretely, when you use LangChain and have LLMs statically linked to some custom tools, how could we redirect those calls from within LangChain so that the routing is done afterwards?
This might actually work really well in test scenarios, i.e. finding which LLM provides the best accuracy vs speed compromise, for example in RAG / knowledge-graph systems.
Can we have a code example of this using LangChain, since it's the most common framework people use for LLMs, please?
Can this pair a local model to a cloud LLM and be even cheaper? Would love to see with the new generations of phi and Gemma
Very helpful! Thank you! 😎🤖
Wonderful video. Thank you
What about "function calling"? Can you really move between models?
What about the latency impact? Wouldn't this preclude a lot of production use cases?
The router would be a very small and fast model. The cheaper model would also be a smaller model. Since cheaper models are smaller, they respond much faster. Latency would only go up a tiny bit for the responses that get routed to the most expensive model. Overall, you'd see a massive improvement in latency.
Consider cost to be a proxy for compute time spent. They say they save 85% of the cost while maintaining 95% of the benchmark score. So estimate an 85% latency reduction (not counting the fixed networking latency). This doesn't play out exactly, as expensive models are also more parallel, but you get the idea.
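A back-of-the-envelope version of that estimate: every request pays the router's latency, then either the weak or the strong model's latency depending on where it is routed. The function and all the numbers in the usage note are made-up placeholders, not measurements.

```python
def expected_latency(router_s, weak_s, strong_s, strong_fraction):
    """Average seconds per request when routing between two models.

    router_s:        time the router takes to decide (paid on every request)
    weak_s/strong_s: response time of the cheap and the expensive model
    strong_fraction: share of requests sent to the expensive model (0..1)
    """
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    weak_fraction = 1.0 - strong_fraction
    return router_s + strong_fraction * strong_s + weak_fraction * weak_s
```

For example, with a hypothetical 0.05 s router, a 0.5 s weak model, a 3 s strong model, and 20% of traffic going to the strong model, `expected_latency(0.05, 0.5, 3.0, 0.2)` comes to about 1.05 s per request, against 3 s when everything goes to the strong model.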
I would think so. Perhaps that’s another parameter to be optimised for: latency required for the type of query. E.g. conversational voice input would route to a low-latency model like the upcoming 4o voice mode.
Worth comparing how well it performs against the semantic router lib, which is also free to use.
It's a good idea, but not practical in real life apps.
I use multiple LLMs for my app, and I manually test them first to make sure the weaker models are suitable for my task, then I route each of the different tasks to different LLMs based on intensive test results.
I am unsure how or where this router AI would be useful.
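The manual approach described above amounts to a static routing table built from test results. A minimal sketch, assuming you already have a pass rate per task and model from your own test suite and a rough cost per model; the data shapes and the threshold are hypothetical.

```python
def cheapest_passing_model(pass_rates, cost, threshold=0.95):
    """Pick the cheapest model whose pass rate meets the threshold.

    pass_rates: {model_name: fraction of test cases passed}
    cost:       {model_name: relative cost}
    Returns None when no model is good enough.
    """
    passing = [m for m, rate in pass_rates.items() if rate >= threshold]
    if not passing:
        return None
    return min(passing, key=lambda m: cost[m])


def build_routing_table(results, cost, threshold=0.95):
    """Map each task to its cheapest acceptable model.

    results: {task_name: {model_name: pass_rate}}
    """
    return {
        task: cheapest_passing_model(rates, cost, threshold)
        for task, rates in results.items()
    }
```

The table is fixed once the tests have run, so there is no per-request routing model and no added latency, which is the trade-off this comment is arguing for.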
Maybe it can help you iterate faster. Help you manually test your prompt + models quicker and see which of the cheapest models are suitable for each task.
@hqcart1
Initial thoughts...
After poking around with example prompts...
Highly structured and templatized prompts seem to always suggest frontier models.
Not too useful for me either.
@tvwithtiffani no, all it does is add another layer of complexity and latency through a judge (the routing LLM) that is weak in nature, so it will guess which model to use based on unknown criteria. A manual route is way better: no complex systems, and reliable results.
@hqcart1 oh well, don't use it then 🤷🏾♀️ and it's not unknown criteria. This video said it uses its LLM arena data.
@tvwithtiffani exactly, that's the unknown
Interesting 😮
I don’t know whether this data oriented way of evaluating where to direct a query is going to be better than a task based one. For my app, it would be far easier to route for summarisation vs data extraction tasks, vs other tasks
Good insight
Haha how is this new? I started doing this about 20 mins after trying GPT-4. I appreciate the formal framework and improvements they've made tho. That said, I use GPT 3.5 to filter first. And yeah, saved me a ton of money. Not only is it the cheaper model, but it's also using simpler (short) prompts. Like "respond IGNORE if this message is not asking for a response." Then I'll only send the messages to GPT4 with a full prompt for messages that need responses. Use a simple model first. Save tokens. Save 20-50x of LLM costs (my usecase). Profit.
Also worth noting that ChatGPT has something similar. People have long known that some responses get routed to gpt 3.5 vs 4.0+.
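The filter-first pattern from this comment can be sketched as a two-stage cascade. The two LLM calls are passed in as plain callables so no particular client library is assumed; `cheap_llm`, `strong_llm`, and `full_prompt` are hypothetical stand-ins for your own wrappers and prompt.

```python
from typing import Callable, Optional

# The short filter prompt quoted in the comment above.
FILTER_PROMPT = "Respond IGNORE if this message is not asking for a response."

# An LLM call here is any function (system_prompt, message) -> reply text.
LLMCall = Callable[[str, str], str]


def answer_if_needed(
    message: str,
    cheap_llm: LLMCall,
    strong_llm: LLMCall,
    full_prompt: str,
) -> Optional[str]:
    """Ask the cheap model whether a reply is needed before paying for one."""
    verdict = cheap_llm(FILTER_PROMPT, message)
    if verdict.strip().upper().startswith("IGNORE"):
        return None  # nothing to answer, so the expensive model is never called
    return strong_llm(full_prompt, message)
```

The saving comes from two places: most messages stop at the cheap model, and the cheap call uses a very short prompt.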
Haha, nice man, yeah I’ve been doing the same thing too, I also use the cheap guys for filtering. It’s very cool to bump into someone else who does the same thing!
Thanks!
that's basically like mixture of agents or am I wrong?
They used their own data from the arena to produce the framework. So theoretically it should be more versed in which queries are ok for each of the Models you select to MoE(xperts)/MoA(gents) with.
@tvwithtiffani so same concept just a bit more polished :)
Am I right that this could be combined with MoA to enable you to then optimise for cost / performance and accuracy?
@BorisHrzenjak I think so. More like a mixture of models tho, because agents and experts are typically preset: agents with a specific task or set of tasks, and each of the experts in a mixture with a specific set of knowledge.
@BorisHrzenjak SambaNova call it CoE I guess
Does it work for languages other than english?
This is a good question. It may do ok out of the box, but you could certainly train one of their models to handle other languages.
isn't a semantic router easier and faster?
Flash is a better and cheaper reranker than the rerankers in the market (including Cohere)
Requesting a vid on GraphRAG
❤❤❤
Lite llm?
This is different from LiteLLM: it is dynamically changing between the two model choices.
I have a huge problem with the idea that all solutions must be a Framework; at best this is a library or even a function. Not saying you, but companies/developers.
Yeah I do feel like that about a bunch of these things. This I would look at as more of a proxy you go through.
I guess this idea is similar to the CoE that SambaNova use
No code time this time?
I linked their GitHub with the code etc. in the description
this is a super sexy topic
Most people don't need this because you'll know beforehand if you can use the cheaper model.
Not very accurate
I have built a tool for my org that works on a custom built sql agent that could use a cheap model for over half of the questions being asked. I am building a "router" to check complexity and context need.
They could have saved a load of time by just using GPT-3.5 and function calling.
Nice content but too much complexity if you want to build a product for scaling
The good thing is it's out there and open sourced, so others can work on improving it for lots of use cases.
😢
Clearly not for most production cases. But it could be useful in dev as a heuristic
First Comment should get likes 😅
This is like a really old idea.
❤❤❤