THANK YOU. I hadn't realized that my models were "forgetting context" at the 2k mark because of this default value. I always thought it was just because they were overwriting themselves with their own new information and that was "just AI being AI" -- my use cases have me floating around 1.5k to 2.4k, so it was only barely noticeable some of the time and never really worth a deep dive. Thanks again!
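For anyone else who was in the same boat: you don't even need a Modelfile to test this. If I have the interactive commands right (worth double-checking against the docs for your Ollama version), you can bump the value for a single session:

    ollama run llama3.2
    >>> /set parameter num_ctx 8192
    >>> /show parameters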
Fascinating! I have been calling ollama using llama 3.2:3b through python which allows me to manage context with my own memory structure, which only recalls what is necessary to complete the current query. I have found this to be extremely useful, since supplying the whole context simply reduces the response to something less than useful.
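In case it helps anyone, here is the rough shape of what I do, as a very simplified sketch: the MemoryStore class and its word-overlap recall heuristic are purely illustrative (my real structure is more involved), and the only part taken from the actual Ollama Python library is the ollama.chat call with an options dict.

    import ollama

    class MemoryStore:
        """Illustrative memory layer: recall only the turns relevant to the current query."""
        def __init__(self):
            self.turns = []  # every message ever exchanged

        def remember(self, role, content):
            self.turns.append({"role": role, "content": content})

        def recall(self, query, k=4):
            # crude relevance score: how many words a stored turn shares with the query
            words = set(query.lower().split())
            ranked = sorted(self.turns,
                            key=lambda t: len(words & set(t["content"].lower().split())),
                            reverse=True)
            return ranked[:k]

    memory = MemoryStore()

    def ask(question):
        messages = memory.recall(question) + [{"role": "user", "content": question}]
        reply = ollama.chat(model="llama3.2:3b", messages=messages,
                            options={"num_ctx": 4096})
        memory.remember("user", question)
        memory.remember("assistant", reply["message"]["content"])
        return reply["message"]["content"]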
This is the most important channel on Ollama on YouTube.
Excellent content as usual, Matt: kudos! My experience: Ollama's default 2k context is way too small. I tend to increase it in the model file with other options and the results are much better, albeit at the expense of speed. On a machine with 48 GB of VRAM and 192 GB of standard RAM, larger models (30 GB and higher) occupy the VRAM first, leaving larger context inference to the CPU: unless the model size is smaller than 20 GB, setting a 128k context means that most of the work is done by the CPU at slow CPU speed. As always, it is important to strike the right balance for your needs. Thanks for all you do: keep up the great work!
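For anyone who wants to try the same thing, the model-file change is small. A sketch, where the base model name and the numbers are just placeholders to adapt to your own hardware:

    # Modelfile (sketch) - base model and values are placeholders
    FROM llama3.1:70b
    PARAMETER num_ctx 16384
    PARAMETER temperature 0.7

    # then build and run it:
    #   ollama create llama70b-16k -f Modelfile
    #   ollama run llama70b-16k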
Hey Matt, I want to start off by saying I *never* leave comments on YouTube videos. I just don't really ever. That being said, I really wanted to share positive recognition with you for the work you're doing to share knowledge around these bleeding edge tools with the world. It can be very confusing for people to enter technical spaces like these. The way you facilitate & share information, and organize your speech, and the timeliness of your videos, all lead to these technologies becoming more accessible to people -- which is amazing! So kudos to you and keep killing it!!
@Studio.burnside mentions aside, those amazing guayaberas, COOL AF
Thank you! A very simple and elegant explanation of complex topics.
Great video, as usual. With every "obvious" topic you talk about, I learn a lot of "not so obvious" fundamental concepts. Thank you!
I like your videos so much... you explain things so well (especially for those of us who don't know English that well).
Thanks a lot :-)
Thanks for explaining this. I was thinking of context as memory of previous completions - didn’t realize that it is also used for output. I’ve been playing with larger contexts, following one of your videos on adding num_ctx to the model file, and noticed that my chat responses were getting bigger. I’m going to try passing num_predict in the api request to limit this. The notes on your website are very helpful.
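If it helps anyone else, both knobs can be passed per request through the native API's options object; a minimal sketch with the Ollama Python package (the numbers are just examples):

    import ollama

    response = ollama.generate(
        model="llama3.2",
        prompt="Summarize these notes in three bullet points: ...",
        options={
            "num_ctx": 8192,     # total window the model works with (input + output)
            "num_predict": 256,  # cap on generated tokens, to keep replies from ballooning
        },
    )
    print(response["response"])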
Thanks Matt! It's as concise, clear and interesting as always!
Matt, can you make a video on how to approximately calculate the hardware costs and resource usage for running models locally?
really great technical info explained to us dummies, thank you
Thanks, I enjoy your information and appreciate it.
I adjust the num_ctx based on the number of tokens I want to send. It seems to work well in managing memory usage. I have 3 RTX 6000s, so I have a lot of wiggle room. But I do agree that the "hole in the middle" is a problem if you don't do RAG. Thanks again.
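Roughly what that sizing looks like for me, as a sketch: the four-characters-per-token figure is only a rule of thumb (the real count depends on the model's tokenizer), and the headroom value is arbitrary.

    import ollama

    def rough_tokens(text):
        # ~4 characters per token is a crude English-text heuristic, not a tokenizer
        return max(1, len(text) // 4)

    def chat_sized_to_input(messages, headroom=1024):
        sent = sum(rough_tokens(m["content"]) for m in messages)
        return ollama.chat(model="llama3.2",
                           messages=messages,
                           options={"num_ctx": sent + headroom})  # leave room for the reply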
Thanks Matt! I really appreciate your content.
It sounds like, if I create a trained agent, then I would start trying to make the context smaller, not bigger, so as not to use overly large models and context sizes.
I'm still learning. Very green. 😐
I'm learning (slowly) how to build agents that I can sell. All of your help is much appreciated. I will buy you some cups of coffee when I can earn some money.
Cover the topic of flash attention as a measure to reduce the memory footprint when accommodating a large context. I think it is closely related to this topic.
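In the meantime, my understanding is that Ollama already exposes this as a server-side switch; the variable names below are from memory, so verify them against the docs for your version before relying on them:

    # assumed environment variables - check your Ollama version's documentation
    OLLAMA_FLASH_ATTENTION=1 ollama serve

    # newer builds reportedly also let you quantize the KV cache itself to shrink it further
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve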
The largest num_ctx I used on my M2 Ultra was 128k, during experiments of keeping "everything" in the window. I came to the same results, especially above 60,000 tokens. RAG is fine, especially to find similar things, but I cannot imagine a true history in a vector store. I tried to summarize past messages, but honestly, simple summaries are not enough. I have no clue how to handle really huge chats.
Thanks, Matt. Great video.
Where is the context normally stored? VRAM or RAM, or both?
This is what I have been looking for :)
Is it possible to do reranking in our RAG application using Ollama? Your insight is always interesting.
Such great info @Matt. I couldn't find the `num_ctx` param for options via the OpenAI API anywhere in the official Ollama docs. Thanks for sharing!
I don't know if the OpenAI API supports it. There is a lot of stuff that the OpenAI API can't do, which is why Ollama's native API comes first.
@technovangelist you are right, after looking some more and rewatching your video, I was confusing an OpenAI API call with your curl example. It would be cool if Ollama could take advantage of an optional "options" parameter to do stuff like this though. Either way, thanks for the great content 👍
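For reference, the native /api/chat endpoint does take exactly that kind of options object; a minimal curl sketch (the model name and value are just examples):

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.2",
      "messages": [{ "role": "user", "content": "Why is the sky blue?" }],
      "options": { "num_ctx": 8192 },
      "stream": false
    }'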
If you're building anything new, you should never use the OpenAI API. There is no benefit and only downsides. It's mostly there for the lazy dev who has already built something and doesn't want to do the right thing and build a good interface. It saves maybe 30 minutes vs getting it right.
Thanks, in-ter-est-ing and btw nice shirt.
Thanks Matt! The fact that the only visible parameter about context size doesn't tell you the context size is baffling 😮.
Thanks! Learned a lot.
The problem arises mostly with code.
Code fills up a lot of that context, and every revision or answer from the LLM does the same. The solution is to summarize often but keep the original messages for the user or for later retrieval.
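A bare-bones sketch of that idea, keeping every raw message for the user while only a rolling summary plus the latest turns go back to the model; the thresholds and prompt wording here are arbitrary choices, not something prescribed anywhere:

    import ollama

    full_history = []   # everything, verbatim, for display or later retrieval
    recent = []         # last few raw turns that still get sent to the model
    summary = ""        # rolling summary of everything older

    def add_turn(role, content, keep_raw=6):
        global summary
        full_history.append({"role": role, "content": content})
        recent.append({"role": role, "content": content})
        if len(recent) > keep_raw:
            older = recent[:-keep_raw]
            del recent[:-keep_raw]
            prompt = ("Update this running summary of a conversation.\n"
                      f"Summary so far: {summary}\n"
                      f"New messages: {older}")
            summary = ollama.generate(model="llama3.2", prompt=prompt)["response"]

    def context_for_model():
        system = {"role": "system", "content": f"Conversation summary so far: {summary}"}
        return [system] + recent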
Thanks for the video. Two questions: 1. Are tokens the same across foundation models? E.g., is the word "token" tokenized into "to ken" by both OpenAI and Anthropic? Or does one tokenize it to "tok en"? Or even to "k en"?
2. If yes, what is the common origin of tokenization?
The token visualizer I showed had different options for llama vs OpenAI. Not sure how they differ. For the most part it’s an implementation detail most don’t need to know about.
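For anyone who wants to poke at this themselves: OpenAI publishes its tokenizer as the tiktoken package, while Llama-family models ship their own vocabularies, so the splits generally do differ between vendors. A quick look at the OpenAI side (pip install tiktoken):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # one of OpenAI's vocabularies
    ids = enc.encode("token")
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(ids, pieces)  # shows how "token" splits under this vocabulary;
                        # a Llama tokenizer may split the same word differently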
Does it make sense to control the context through the chat client, i.e. by truncating the chat history? Or is that unnecessary if the context window is set on the Ollama side?
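If it's useful, a sketch of the client-side half (the keep-the-system-prompt-plus-last-N policy is just one illustrative choice, and num_ctx still has to be large enough for whatever you actually send):

    import ollama

    history = [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "What does num_ctx control?"},
        # ... grows as the chat goes on ...
    ]

    def truncated(messages, keep_last=10):
        system = [m for m in messages if m["role"] == "system"][:1]
        rest = [m for m in messages if m["role"] != "system"]
        return system + rest[-keep_last:]

    reply = ollama.chat(model="llama3.2",
                        messages=truncated(history),
                        options={"num_ctx": 4096})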
I run a context of 8192 for most models (after making sure that is an acceptable size). I tried bigger context sizes, but they seem to cause strange results, as you mentioned. It is good to know it is memory related. Now, is that system memory (32 GB) or video memory (12 GB)?
great video🎉❤
I recently set Mistral Small to 120k and forgot to reduce the number of parallels back to 1 before executing a query. That was a white-knuckle 10 minutes, I can tell you. Thought the laptop would catch fire 🔥
Wow, great short video. Is it possible for you to provide any good source of information about tests of changing the context size and its effect on performance, like how slow it was, how inconsistent it was, how much memory it took, etc.?
Maybe you could start a group that works on some solutions as a hobby, something that could be beneficial, or at least shows some of the most fun things that can be done... with sections like low-level local things, higher-level things, and a group project, or a web-like construction with multiple models/resources connected via the web, etc. I don't know, I'm just guessing, as I have no idea what can be achieved. Yesterday I tried to get a model to clean up text from a book for RAG and it failed miserably: I wanted it not to change words, but rather to fix print errors, remove any stray numbers like the ISBN, the table of contents, etc., and join any words split across a page break. It ended up producing a monstrous manipulation of the text, and a different one every time. One time in 3-5 it was ok-ish, but then it left the table of contents in, removed what I wanted kept, or just added its own interpretation of the text I wanted it simply to correct.
I believe that with your background and the history you have, you could be a great leader for such a project, even if it were hobby-like at first. :)
Nice content in a nutshell - I would gladly take in any detailed, long video from you about the interesting stuff that can be constructed with local/non-local models... but I believe the algorithms don't like those as much as people do?
If I set num_ctx=4096 in the options parameter to generate() and then set num_ctx=8192 in the next call, but use the same model name, does Ollama reload the model to get a version with the larger context, or does it just use the model already in memory with the larger context size?
It just uses the model already in memory with the larger context size.
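For anyone curious, a quick way to watch the behaviour on your own machine (the `ollama ps` command name is from memory, so double-check it): call generate twice with different options and look at the reported memory between calls.

    import ollama

    for ctx in (4096, 8192):
        r = ollama.generate(model="llama3.2",
                            prompt="Say hi in one word.",
                            options={"num_ctx": ctx})
        print(ctx, r["response"])
        # in another terminal, `ollama ps` lists the loaded model and its
        # memory footprint, which reflects the context actually allocated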
If your PC reboots while it is doing something like recalculating the Python mathlib and fonts, and/or while using a GPU, check your BIOS settings to make sure PCI reset on crash is disabled... -.-
Not really relevant to anything in this video, but interesting for the day I might use a PC or the Python mathlib.
awesome content
So what memory are we talking about, RAM or VRAM or something else?
In general, in this video I am talking about the memory of the model, or the context.
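As a rough mental model (every number below is an illustrative assumption, not a spec for any particular model): the context lives in the KV cache, which is allocated alongside the model's layers, so it sits in VRAM when everything fits and spills toward system RAM when it doesn't, and it grows linearly with num_ctx.

    # back-of-envelope KV-cache size; all architecture numbers are assumed
    n_layers   = 28      # transformer layers (assumed)
    n_kv_heads = 8       # key/value heads with grouped-query attention (assumed)
    head_dim   = 128     # dimension per head (assumed)
    num_ctx    = 8192    # context window being requested
    bytes_each = 2       # fp16 cache entries

    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_each  # keys and values
    print(f"~{kv_bytes / 2**30:.1f} GiB for the KV cache alone")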
🎉
My max context is 5000 for llama3.2, on a laptop running on CPU with 12 GB of RAM (9.94 GB usable).
That's why, ladies and gentlemen, you should stop asking an LLM to count the "s" in strawberry: we see letters, they (LLMs) see tokens.
Another way to determine the context length is to ask the model. I asked “Estimate your context length “. The model responded: “The optimal context length for me is around 2048 tokens, which translates to approximately 16384 characters including spaces. This allows for a detailed conversation with enough historical context to provide relevant and accurate responses.”
The model in most cases doesn't really know. If you get an answer that makes sense, it's luck.
It's just hallucinating based on some general nonsense it was trained on. It has nothing to do with the real capability of this specific model you're asking
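A more dependable check, if I'm remembering the CLI output correctly, is to ask Ollama about the model rather than asking the model about itself:

    ollama show llama3.2
    # the model details in the output include a context length line
    # (alongside parameter count and quantization) - exact wording may vary by version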
Gunna hallucinate bro