Honestly, I tested that 1M token context window model and it doesn't perform very well, but I hope the fine-tuning to large context windows works better in the future.
Are you talking about speed or quality of the response?
Of course it's gonna be different for each use case.
My experience was similar. I limited it to a 300,000-token context window and the response quality was pretty poor. BUT I was using a Q8 model, and I expect the issues from reduced precision escalate significantly with context. From my experience, if you could graph out the deterioration in quality, it could even be a logarithmic deterioration.
Are we talking about ‘lost in the middle’ ?
You can't expect a 1M token model to reason over 100 tokens the same way; they're over-corrected for longer windows.
00:02 Gradient unlocked a million token context window for Llama 3 Model
02:11 Importance of context window in language models
06:28 Enhancing coding capabilities with large language models
08:31 Advantages of the million token context window model
12:39 Extending context length in model training challenges and process
14:51 Challenges in training million token context models
18:49 Needle-in-a-haystack benchmarks for testing model performance
20:35 Examining the performance of large language models in cross-referencing information
24:22 Exploring new ways to serve long context models efficiently
26:13 Algorithm extensions for memory compression and selective opening
Thanks, where are the MoA tests? I'm waiting for them!
You can also use a database of embedded chat interactions with a small local model, continuously searching by calculating similarity and distance to find relevant past interactions that could apply to the current one. This will allow you to use a smaller context window without sacrificing the information available to the AI.
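A minimal sketch of that retrieval idea, assuming a hypothetical embed() callable supplied by whatever small local embedding model you run; only the top-scoring past interactions would be placed into the prompt, which is the point of keeping the window small.

```python
import numpy as np

past_interactions = []  # list of (text, embedding) pairs built up over time

def remember(text, embed):
    """Store a past interaction with its embedding (embed() is your local model)."""
    past_interactions.append((text, np.asarray(embed(text), dtype=float)))

def recall(query, embed, top_k=3):
    """Return the top_k most similar past interactions by cosine similarity."""
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scored = []
    for text, vec in past_interactions:
        score = float(vec @ q / np.linalg.norm(vec))
        scored.append((score, text))
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]

# The recalled snippets get prepended to the prompt, keeping the window small.
```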
This channel is Great. Thanks Matthew.
Wow thank you so much!!
@@matthew_berman Who will make the best Ragflow tutorial?
I was hoping they would serve the 1 million token model on Groq. That would have been incredible, even a 100k context would be nice. Groq and Gradient should talk together :)
Hell, I'd take 32k. 8k is virtually useless for a serious application that requires a lot of instructions
I would be so happy if they give us 128k. I could do so much with that
Thanks & good idea. We're working on it!
@@LeoPekelisGradient thanks a lot. I'm a very big fan of groq 😉
Have you seen the AutoGroq project? It automatically generates agents for Autogen and CrewAI
The Ruler test is such a great insight. This behavior is why I've almost entirely switched to Claude 3 Opus. It performs incredibly well with this!
Can you elaborate pls... where is C3O tested on the RULER test?
@@executivelifehacks6747 It's not, I was referring to the behavior they were testing for - not benchmark results.
I've found just through testing that I'm able to upload several documents (usually around 20-150 pages worth of them) and can continuously ask complex questions about them that require comprehension of data across several locations to generate an effective answer.
The best example I can give is that I created an AI call transcript QA agent with C3O where I upload a 5-page grading rubric with a 20-40 page call transcript, and it gives human-level responses to generalized questions in the rubric, tallies the scores correctly, and draws overall conclusions for complex and interrelated questions... and it does it all in a zero-shot response.
I tested this same workflow with all the other large models and they weren't even close to the quality of analysis C3O provides.
VRAM requirements explode as the context window grows. Have you ever tried filling 120,000 tokens of a 1M total window and asking a tiny question about a fragment of those 120,000 tokens?
@@polarper8165, you don't have to. Everything is ok.
You don't need to use VRAM; you can use DDR RAM to load huge AI models. It's slower than VRAM, but it's not too bad.
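For a rough sense of why memory explodes with context: a back-of-the-envelope KV-cache estimate, assuming the published Llama 3 70B shape (80 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache. The numbers are illustrative, not official.

```python
def kv_cache_gb(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Rough KV-cache size: keys + values for every layer, head and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return n_tokens * per_token / 1024**3

for n in (8_000, 120_000, 1_000_000):
    print(f"{n:>9,} tokens -> ~{kv_cache_gb(n):.1f} GB of KV cache (fp16)")
# roughly 2.4 GB, 37 GB and 305 GB respectively, on top of the weights themselves
```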
Proud of you Matthew! Much love from New Orleans brother❤💪
Love these interviews. Thanks Leo for taking the time to chat with Matt and giving us some context on what you're working on. Love it. Keep up the great work guys.
In my experience the needle in a haystack benchmark, unfortunately, is practically useless...
A better approach is to fill the context partly with a codebase and partly with a story the model doesn't know. Then ask it to rewrite a chapter from the perspective of another character and to write a new function for a certain code class which uses other functions with known outputs. Something like that. Llama 1M can only manage about 2k tokens in that kind of test; many models can do about 4k, and GPT-4o 16k.
Love how you ask him questions in the same format you would ask an AI. He answers, but like a human, has to think about it so he can present it back as informatively as possible for a human to understand. If there is one thing I love about AI, it's how it already knows and will concisely detail the topic at hand. If it was me, I'd have an AI right there to help me answer these questions in realtime. Then I would have AI do everything for me while I go riding.
Outstanding interview.
Very articulate and knowledgeable this Leo chap.
When will there be an open-source 1M context model?
Well, the problem is, as Leo mentioned: say you download the 70B Llama model and you have the hardware to run it... okay, fine... now you start using it, and as you use up more and more context, the model's next compute cycle scales with the square of the number of tokens so far. Fine at 8k squared in terms of compute... not so good at 500k squared!
@@mickelodiansurname9578 Good to know. Thank you.
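To put that quadratic scaling in numbers, a tiny illustration (this is the cost of attention over the whole window; a single incremental token is cheaper when the KV cache is reused, and constant factors are omitted, so only the ratios mean anything):

```python
def attention_cost(n_tokens):
    """Relative full-window self-attention cost; constant factors omitted."""
    return n_tokens ** 2

base = attention_cost(8_000)
for n in (8_000, 100_000, 500_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_cost(n) / base:,.0f}x the compute of an 8k pass")
# 1x, 156x, 3,906x, 15,625x
```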
Matt, you have asked good questions and summarized responses in concise way! Thank you for the content!
I would be happy with a 30k token window tbh. 64k ftw, but I'll take scraps.
Large context only exists in the demos; it has never worked for me on the best models.
Gradient has one of the coolest websites I've seen, full of retro AI art from the '70s.
Glad to see RULER being used.
RULER: What's the Real Context Size of Your Long-Context Language Models?
Simplified answer: a large token context window increases practical memory, helping users recall conversations, code, and text, understand them, and prioritize information more efficiently.
What’s the point of a 1M token context if it forgets 90% of it or fails to follow instructions?
Context length should not be quoted without also including those two metrics.
Marketing is the point.
I regularly run GPT-4o up to 300,000 to 500,000 tokens; it's great, it hardly forgets anything. Just gotta become one with the model.
@@aaronravak1407 What's your use case? Code?
Yes, exactly. The Gradient Llama 3 model is shit compared to the Llama 3 base model. Even on something with very little context, which Llama 3 70B Instruct handles with ease, the Gradient 1M token context model does not do well at any task.
With the Phi-3 128k model it's exactly the same.
I want quality, and if they raise the context window, I want to know that the quality stays the same.
My prof always said: "If you want the one thing, you have to give up another thing."
And as long as there aren't any improvements on the base model, the context extensions of the base model will always degrade quality.
@@aaronravak1407 It has 128k context tho...
Has anyone figured out a way to make context be abstracted into a fractal non-Euclidean "mind palace" the LLM can traverse to find any past information, and always find more places to fit new info?
How would you fetch from this fractal structure? How would it organize its categorizations for fetchability?
@@chrism3440 I'm not sure. For a long time I've had the gut instinct that some sort of convolution-style compression might be applicable to concepts instead of image pixels. Sorta like how, when we don't remember something but it's on the tip of the tongue, we can remember things similar to it, what is around it, what category it is, etc. For humans, remembering seems a lot like following a scent trail: you catch a whiff and move to where it's stronger. Maybe that could partially be how this works: each edge of a node in the graph would have a different smell, representing a convolution of everything that can be reached by going through it, with things requiring fewer steps being more over-represented, but things many nodes down still having a hint of their smell coming through that door.
And perhaps there could be some sort of video-game-style culling of nodes too many hops away, where nodes at the distant surface of the graph's volume get streamed in and out of the GPU as needed, while the LLM itself is always at a moving center position where up to the Nth-level neighbors are already/still loaded in VRAM, and by the time it reaches the old surface new stuff has already had time to load at full "LOD"? Another related intuition that's just at the edge of my knowledge: perhaps it could be something like a non-Euclidean NeRF/Gaussian splatting abstraction, with working mirrors, lenses, wormholes, etc., where instead of visual data the "pixels" from a different perspective compose different concepts. I know there are already some projects like that (for 3D visual data) with more data than fits in VRAM (or that can be rendered fast enough all at once), which instead stream data from disk as needed.
I never looked into the finer details of how vector databases work; maybe it already is something similar to that, dunno. Could perhaps have some elements of it.
A combination of layer-wise attention caching and selective attention computation would definitely make large context workflows more efficient. I'd also like to see whether something like in-context token quantization / vector graphing can be achieved without the need to offload to external DBs... I think this would be a worthwhile area of research.
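For the layer-wise attention (KV) caching part, a minimal single-head numpy sketch of the idea: each layer keeps its past keys and values so a new token only attends over cached state instead of re-encoding the whole prefix. Real implementations are per-layer, per-head, and batched, so treat this purely as an illustration.

```python
import numpy as np

class LayerKVCache:
    """Per-layer key/value cache so past tokens are never re-encoded."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(query, cache):
    """Attention for one new token over everything already in the cache."""
    scores = cache.keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

# One cache per transformer layer; new tokens just append and attend.
cache = LayerKVCache(head_dim=64)
for _ in range(10):
    k, v, q = (np.random.randn(64) for _ in range(3))
    cache.append(k[None, :], v[None, :])
    out = attend(q, cache)
```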
It's the desk space to operate in; that's the context window.
What a great interview! And also a great advancement. So many questions. Leo said it is more like training a model than "pre-training" to create these larger context windows; however, as far as I know, there is no access to the training data from Meta, and the compute necessary would seem prohibitive, so there must be some way that they are modifying or manipulating the weights to recognize the "distance encoding". Would be great to get more clarity on that.
Also would be very interested to know the trade-offs between using a pre-trained model (like with your company's code base) vs a cached large context model. Obviously changing the code base would be easier with the large context model, but what trade-offs would there be? Also, is there any mechanism for caching multiple parts of the context window?
Finally, although this might be interesting for things like video, it would seem the compute necessary would be somewhat prohibitive unless you can cache the "attention", which would make more sense for text. Matt, it would be great if you could do a follow-up on this after you have tried it yourself (and even better if you could get Leo back again). Great stuff!!!
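On the "distance encoding" point: one widely used approach to context extension is scaling the base frequency (theta) of rotary position embeddings so far-apart positions still get smoothly varying encodings. The sketch below illustrates that general technique only; it is an assumption about the family of methods, not a claim about Gradient's exact recipe.

```python
import numpy as np

def rope_angles(positions, head_dim, theta_base=10_000.0):
    """Rotation angles used by rotary position embeddings (RoPE).
    Raising theta_base (e.g. 10k to 500k or higher) stretches the positional
    wavelengths, one common knob for extending context length."""
    inv_freq = 1.0 / (theta_base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(positions, inv_freq)  # shape: (len(positions), head_dim // 2)

def apply_rope(x, positions, theta_base=10_000.0):
    """Rotate query/key vectors x of shape (seq_len, head_dim) by position angles."""
    angles = rope_angles(positions, x.shape[-1], theta_base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Same relative offset, two different theta bases: a larger base gives
# slower-varying angles, so distant positions stay distinguishable.
q = np.random.randn(2, 128)
short = apply_rope(q, np.array([0, 8_000]), theta_base=10_000.0)
long_ = apply_rope(q, np.array([0, 8_000]), theta_base=500_000.0)
```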
Berman, I am still looking forward to the Matt3 (Matt to the third power) ai conference with you, Matt Wolfe and MattVidPro
What happened to that thing with "attention sinks" allowing infinite context sizes, that could be implemented in the interpreter app (dunno what they're called, the stuff that runs the LLMs) without even needing to modify the models? (Sorry, I don't remember which channel talked about it; I just remember the term "attention sinks", it was some months ago I think.)
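That sounds like the StreamingLLM "attention sinks" idea. A rough sketch of the cache-eviction policy as commonly described (an assumption about the general approach, not any particular runtime's actual implementation): keep a few initial "sink" tokens plus a sliding window of recent tokens, and drop everything in between.

```python
from collections import deque

class SinkWindowCache:
    """Keep n_sink initial tokens plus a sliding window of recent tokens."""
    def __init__(self, n_sink=4, window=2048):
        self.n_sink = n_sink
        self.sinks = []                      # the first few tokens, never evicted
        self.window = deque(maxlen=window)   # recent tokens, oldest evicted first

    def add(self, token_kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(token_kv)
        else:
            self.window.append(token_kv)

    def visible(self):
        """What the next attention step actually sees: sinks + recent window."""
        return self.sinks + list(self.window)

cache = SinkWindowCache(n_sink=4, window=8)
for t in range(100):
    cache.add(f"kv_{t}")
print(cache.visible())  # kv_0..kv_3 plus kv_92..kv_99
```

The appeal is exactly what the comment remembers: it is a runtime-side cache policy, so the model weights themselves don't need retraining.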
To reason over an entire codebase, we need to look at tokenization data prep and, ideally, dataflow and explicit graph data... it's a solvable but complex issue.
Would be interested to know if this has really been successfully done for anything other than test cases.
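For what it's worth, the "explicit graph data" piece is easy to prototype on real code, at least for Python sources. A minimal sketch using only the standard-library ast module to pull a function-level call graph that could be fed to a model alongside the raw files; purely illustrative, not what the commenter or Gradient actually use.

```python
import ast
from pathlib import Path

def call_graph(source_dir):
    """Map each function to the bare names it calls, across all .py files."""
    graph = {}
    for path in Path(source_dir).rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                calls = {
                    c.func.id
                    for c in ast.walk(node)
                    if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
                }
                graph[f"{path.stem}.{node.name}"] = sorted(calls)
    return graph

# Emit edges as "caller -> callee" lines, e.g. to prepend to a prompt.
for caller, callees in call_graph(".").items():
    for callee in callees:
        print(f"{caller} -> {callee}")
```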
A larger context window is not always best. The larger the context window, the more the quality of the context matters. This means it's even more important to ensure the information available to the model within the context is specific and verbose. The quality of the output that the LLM adds to the context is very important. A larger context window is not good for smaller variants of models. Also, heavy compression/quantization really affects the quality of the output, adding to the issue of context quality.
Suggested programming test:
Write a function in Ada that takes a pair of numbers and an array of such pairs and returns true if there are an odd number of occurrences of the pair in the array.
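Not the requested Ada, but a quick Python sketch of the intended behavior may make the test concrete (assuming pairs are compared element-wise and order-sensitively, which the prompt doesn't specify):

```python
def odd_occurrences(pair, pairs):
    """True if `pair` appears an odd number of times in `pairs`."""
    return sum(1 for p in pairs if tuple(p) == tuple(pair)) % 2 == 1

assert odd_occurrences((1, 2), [(1, 2), (3, 4), (1, 2), (1, 2)]) is True
assert odd_occurrences((1, 2), [(1, 2), (1, 2)]) is False
```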
Great interview, you asked a lot of questions I was curious about
Thanks for this. What's to stop bad-faith actors from inputting someone else's IP and having the AI reword/recode it to pass it off as their own? Whether that's an author's book, a screenplay, a video, a piece of music, scientific research, etc.
Thanks for mentioning Cursor. I'm going to try it.
Is their million-token context implementation public? Or it's just that the base llama-3 is open-source?
I really want to see that in action. I currently play around with Llama 3 abliterated; that thing is interesting.
The only problem I see with 1M tokens: my computer is already quite slow with 8k tokens, and even bumping the setting to 16k would be painfully slow.
But I guess there are tasks where it totally makes sense to give the task to the LLM in the evening and have the response ready by the next day.
Brilliant, good questions.
Good questions good answers
Now if only they wouldn't censor prompts like pansy snowflakes, them and everyone else. I've never tried this 1 million token thing; I'm still looking for AIs that do basic things without being blocked.
Matt, it is a great video. However, I am skeptical that this is really as fast as the 8,000-token version. It is more like a physical impossibility than a problem with training. It's the simple fact that it requires more time to find information: if you have more information, you will require more time unless you increase computational power. However, it is really remarkable that they can fit a 1 million token context window with relatively high performance. I would love to test it.
Nice, can't wait for 1M tokens to improve my AI personality, add some more stuff to its memory, and let me chat longer without the AI model hallucinating. I just really wish to get an unbiased, uncensored model at some point.
Not convinced that hallucinating LLMs will ever go away.
Thanks for this video!
Unable to register on Gradient. I wonder if Gmail is accepted as a registration email.
Generative AI models are for wisdom mining, they feel like simple systems of Earth from perspectives draughtsman
Have they ACTUALLY used their 1 million token window on any use case successfully? Or is it just a claim that it would be helpful? I have yet to hear that Gemini is creating any killer apps.
The problem is it costs _$7 per prompt._ You can't build a killer app out of something that expensive right now.
@@4.0.4 I started using OpenAI with gpt-pilot and after $92 turned it off. Now I just use chatGPT to produce chunks of code and then slap it together by hand. The latter seems to work reasonably well.
ChatGPT 4o context window is still only a little over 4000, no?
How much VRAM for 1m context, tho?
I really like your channel and I find your videos on artificial intelligence extremely interesting. Thanks to the subtitles, I am able to follow and understand the topic better, which I am very passionate about; I am Italian. However, I have noticed that I often find myself unsubscribed from the channel without reason. It has happened 4 or 5 times already and I can't figure out if it's a technical issue or if I'm being removed by the channel owner. Could you help me understand why this happens and how I can solve the problem? Thank you.
I tried this model, and via Ollama it just produces trash, completely unusable compared to regular Llama 70B. Maybe it's a bug or something.
The latest Gemini model has a 2M context window.
I think it's best to wait for Meta to increase the context length.
lol. He’s a research scientist. This is literally his remit 😅
Great interview and very interesting.
The needle needs to not stick out. Have it, for instance, change the name of a character that is only stated once. But also, this should be done on a text the model was never trained on, so War and Peace should not be used. "What was the name of the character that did X thing?"
Nice.
Dude what happened to your thumbnails?
Omg. This dude has a 1 mil token window too. Talking and talking and talking... something that could be said in 2 words.
Who sees the cigarette?
It's useless for me because they're censored and don't support multiple languages.
❤
It's kind of ironic. The creation he's making will soon outperform him, making him obsolete. The creation will outgrow the creator; ironic. Hopefully you build a good friendship with it.
Probably a next-gen Transformer model could, but not the current models.
Gemini has had 1 million context for months now. Not a single new killer app created, nor any new Shakespeare plays written.
Wait.. the guy with a beard is a He/Him? So glad that he filled that out on LinkedIn. I would have never guessed.
You should consider educating yourself on this matter. What someone appears to be does not determine their gender identity. Additionally, this is how the world is now; by including pronouns, people are helping to normalize this practice and support a more inclusive environment.
No, it is just snowflakes virtue signaling their illness.
@@AI-under-Five Oh right. Gender is a social construct, right? Literally saying that men/women are defined by their behaviors within a society. You are literally defining women based on gender norms and gender roles. I thought we fought to get rid of gender roles? Now you're saying those are the EXACT things we should use to define ourselves?
Imagine a political ideology forcing you to change your definition of what a woman is, and then saying: this is how the world is now. The hubris is staggering.
@@AI-under-Five How can gender be a social construct, but when choosing your gender, it has nothing to do with gender norms which are socially constructed?
And this is not how it is taught in schools. I'll refer to the def of gender via the WHO "Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles"
Yikes 😅. Gender roles are a social construct. This is extensively researched by anthropologists, psychologists, sociologists… a bunch of -ologists 😝.
Gender identity is not a social construct; it’s based on how a person feels innately. There’s a wealth of scientific research supporting this. Please refer back to my point about self-education. Or maybe just try to be a nicer human being 😘.
Political ideology… lolllllll. Feels myopic.
That was brilliant
hmm
You should make two different channels. One with tutorials and alike, then one with shit like this.
I cannot repost what my Ai says (omg) 🔥, but I wish I could 😊. What are tokens 😁
Any time I see a pro tech company with random tubes and inappropriate stuff in the background, I know they won't show up again 🤣
Yeah, Gemini Flash 1.5 has 1M context; its web app utilizes more VRAM as you approach high context.