Making 1 MILLION Token Context LLaMA 3 (Interview)

  • Published Nov 29, 2024

Comments • 122

  • @paelnever
    @paelnever 5 months ago +37

    Honestly, I tested that 1M token context window model and it doesn't perform very well, but I hope that in the future the fine-tuning to large context windows works better.

    • @toadlguy
      @toadlguy 5 months ago

      Are you talking about speed or quality of the response?

    • @Quitcool
      @Quitcool 5 months ago

      Of course it's gonna be different for each use case.

    • @ts757arse
      @ts757arse 5 months ago +4

      My experience was similar. I limited it to 300,000 tokens context window and the response quality was pretty poor. BUT I was using a Q8 model and I expect the issues with reducing precision escalate significantly with context. From my experience, if you could graph out the deterioration in quality, it could even be a logarithmic deterioration.

    • @tollington9414
      @tollington9414 5 months ago +1

      Are we talking about ‘lost in the middle’ ?

    • @daivionosaghae4807
      @daivionosaghae4807 5 months ago +1

      You can't expect a 1M token model to reason over 100 tokens the same way; they're over-corrected for longer windows.

  • @brianlink391
    @brianlink391 5 months ago +2

    00:02 Gradient unlocked a million token context window for Llama 3 Model
    02:11 Importance of context window in language models
    06:28 Enhancing coding capabilities with large language models
    08:31 Advantages of the million token context window model
    12:39 Extending context length in model training challenges and process
    14:51 Challenges in training million token context models
    18:49 Needle-in-a-haystack benchmarks for testing model performance
    20:35 Examining the performance of large language models in cross-referencing information
    24:22 Exploring new ways to serve long context models efficiently
    26:13 Algorithmic extensions for memory compression and selective opening

  • @MagnesRUS
    @MagnesRUS 5 months ago +23

    Thanks, but where are the MoA tests? I'm waiting for them!

  • @CreativeEngineering_
    @CreativeEngineering_ 5 months ago +4

    You can also use a database of embedded chat interactions with a small local model, continuously searching by calculating similarity and distance to find relevant past interactions that could apply to the current interaction. This lets you use a smaller context window without sacrificing the information available to the AI.
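
A minimal sketch of that similarity-search idea (the embedding step and all names here are illustrative, not any particular library's API):

```python
# Sketch: retrieve relevant past interactions by embedding similarity, so a
# small context window can still "see" the right history. The embedding
# vectors are assumed to come from any local embedding model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_relevant(query_vec: np.ndarray,
                   history_vecs: list[np.ndarray],
                   history_texts: list[str],
                   k: int = 3) -> list[str]:
    """Return the k past interactions most similar to the current query."""
    scores = [cosine_similarity(query_vec, v) for v in history_vecs]
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [history_texts[i] for i in best]
```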

  • @matt.stevick
    @matt.stevick 5 months ago +18

    This channel is great. Thanks Matthew.

    • @matthew_berman
      @matthew_berman 5 months ago +4

      Wow thank you so much!!

    • @guillerf10
      @guillerf10 5 months ago

      @@matthew_berman Who will make the best Ragflow tutorial?

  • @kkollsga
    @kkollsga 5 months ago +13

    I was hoping they would serve the 1 million token model on Groq. That would have been incredible, even a 100k context would be nice. Groq and Gradient should talk together :)

    • @justtiredthings
      @justtiredthings 5 months ago +2

      Hell, I'd take 32k. 8k is virtually useless for a serious application that requires a lot of instructions

    • @arjangfarahzadeh
      @arjangfarahzadeh 5 months ago +2

      I would be so happy if they give us 128k. I could do so much with that

    • @LeoPekelisGradient
      @LeoPekelisGradient 5 months ago +4

      Thanks & good idea. We're working on it!

    • @arjangfarahzadeh
      @arjangfarahzadeh 5 months ago

      @@LeoPekelisGradient thanks a lot. I'm a very big fan of groq 😉

    • @jefframpe5075
      @jefframpe5075 5 months ago

      Have you seen the AutoGroq project? It automatically generates agents for Autogen and CrewAI.

  • @skeptiklive
    @skeptiklive 5 months ago +4

    The Ruler test is such a great insight. This behavior is why I've almost entirely switched to Claude 3 Opus. It performs incredibly well with this!

    • @executivelifehacks6747
      @executivelifehacks6747 5 months ago +3

      Can you elaborate pls... where is C3O tested on the RULER test?

    • @skeptiklive
      @skeptiklive 5 months ago

      @@executivelifehacks6747 It's not; I was referring to the behavior they were testing for, not benchmark results.
      I've found just through testing that I'm able to upload several documents (usually around 20-150 pages' worth of them) and can continuously ask complex questions about them that require comprehension of data across several locations to generate an effective answer.
      The best example I can give is that I created an AI call-transcript QA agent with C3O where I upload a 5-page grading rubric with a 20-40 page call transcript, and it gives human-level responses to generalized questions in the rubric, tallies the scores correctly, and draws overall conclusions for complex and interrelated questions... and it does it all in a zero-shot response.
      I tested this same workflow with all the other large models and they weren't even close to the quality of analysis C3O provides.

  • @MeinDeutschkurs
    @MeinDeutschkurs 5 months ago +7

    VRAM requirements explode the larger the context window gets. Have you ever tried filling 120,000 tokens of a total 1M window, then asking a tiny question about a fragment of those 120,000 tokens?

    • @MeinDeutschkurs
      @MeinDeutschkurs 5 months ago

      @@polarper8165, you don't have to. Everything is OK.

    • @stanpikaliri1621
      @stanpikaliri1621 5 months ago +1

      You don't need to use VRAM; you can use DDR RAM to load huge AI models. It's slower than VRAM, but it's not too bad.
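
For scale, a back-of-envelope KV-cache estimate, assuming Llama-3-70B-like shapes (80 layers, 8 grouped-query KV heads, head dim 128, fp16 cache); the figures are illustrative, not Gradient's published numbers:

```python
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    return tokens * 2 * layers * kv_heads * head_dim * dtype_bytes / 2**30

print(f"{kv_cache_gib(8_000):.1f} GiB at 8k tokens")      # ~2.4 GiB
print(f"{kv_cache_gib(120_000):.1f} GiB at 120k tokens")  # ~36.6 GiB
print(f"{kv_cache_gib(1_000_000):.0f} GiB at 1M tokens")  # ~305 GiB
```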

  • @tjchatgptgoat
    @tjchatgptgoat 5 months ago

    Proud of you Matthew! Much love from New Orleans brother❤💪

  • @JustinsOffGridAdventures
    @JustinsOffGridAdventures 5 months ago

    Love these interviews. Thanks Leo for taking the time to chat with Matt and giving us some context on what you're working on. Love it. Keep up the great work guys.

  • @perschistence2651
    @perschistence2651 5 months ago +3

    In my experience, the needle-in-a-haystack benchmark is unfortunately practically useless...
    A better approach is to fill the context partly with a codebase and partly with a story the model doesn't know. Then ask it to rewrite a chapter from the perspective of another character, and to write a new function for a certain code class which uses other functions with known outputs. Something like that. On that kind of test, Llama 1M can only do about 2k tokens, many models can do about 4k, and GPT-4o 16k.

  • @jimrhea5484
    @jimrhea5484 5 months ago

    Love how you ask him questions in the same format you would ask an AI. He answers, but like a human, he has to think about it so he can present it back as informatively as possible for a human to understand. If there is one thing I love about AI, it's how it already knows and will concisely detail the topic at hand. If it were me, I'd have an AI right there to help me answer these questions in real time. Then I would have AI do everything for me while I go riding.

  • @jackflash6377
    @jackflash6377 5 months ago +4

    Outstanding interview.
    Very articulate and knowledgeable, this Leo chap.
    When will there be an open-source 1M context model?

    • @mickelodiansurname9578
      @mickelodiansurname9578 5 months ago +1

      Well, the problem is, as Leo mentioned: say you download the 70B Llama model and you have the hardware to run it... okay, fine... now you start using it, and as you use up more and more context, the model's compute for the next cycle scales with the square of the number of tokens so far... fine at 8k squared in terms of compute... not so good at 500k squared! (See the quick arithmetic after this thread.)

    • @jackflash6377
      @jackflash6377 5 months ago

      @@mickelodiansurname9578 Good to know. Thank you.
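
The quadratic-attention arithmetic from the reply above, checked (this assumes naive full attention; real serving stacks mitigate it with KV caching and other tricks):

```python
# Pairwise token interactions in naive attention scale as n^2.
ops_8k = 8_000 ** 2
ops_500k = 500_000 ** 2
print(ops_500k / ops_8k)  # ~3906x more attention work at 500k than at 8k
```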

  • @elu1
    @elu1 5 months ago

    Matt, you asked good questions and summarized responses in a concise way! Thank you for the content!

  • @robxsiq7744
    @robxsiq7744 5 months ago +2

    I would be happy with a 30k token window tbh. 64k ftw, but I'll take scraps.

  • @hqcart1
    @hqcart1 5 months ago +1

    Large context only exists in the demos; it has never worked for me, even on the best models.

  • @nathanbanks2354
    @nathanbanks2354 5 months ago

    Gradient has one of the coolest websites I've seen, full of retro AI art from the '70s.

  • @jefframpe5075
    @jefframpe5075 5 months ago +1

    Glad to see RULER being used.
    RULER: What's the Real Context Size of Your Long-Context Language Models?

  • @superfliping
    @superfliping 5 months ago

    Simplified answer: increasing practical memory helps users in various ways, recalling conversations, code, and text, understanding them, and prioritizing information more efficiently, thanks to a large token context window.

  • @daniel_tenner
    @daniel_tenner 5 months ago +11

    What’s the point of a 1M token context if it forgets 90% of it or fails to follow instructions?
    Context length should not be quoted without also including those two metrics.

    • @pensiveintrovert4318
      @pensiveintrovert4318 5 months ago +2

      Marketing is the point.

    • @aaronravak1407
      @aaronravak1407 5 months ago

      I regularly run GPT-4o up to 300,000 to 500,000 tokens; it's great, it hardly forgets anything. Just gotta become one with the model.

    • @pensiveintrovert4318
      @pensiveintrovert4318 5 months ago

      @@aaronravak1407 What's your use case? Code?

    • @ollimacp
      @ollimacp 5 months ago +2

      Yes, exactly. The Gradient Llama 3 model is shit compared to the Llama 3 base model. Even on something with very little context, which Llama 3 70B Instruct handles with ease, the Gradient 1M token context model doesn't handle any task well.
      With the Phi-3 128k model it's exactly the same.
      I want quality, and if they raise the context window, I want to know that the quality stays the same.
      My prof always said: "If you want the one thing, you have to give up another thing."
      And as long as there aren't any improvements to the base model, context extensions for the base model will always degrade quality.

    • @susmitdas
      @susmitdas 5 months ago +1

      @@aaronravak1407 It has 128k context tho...

  • @tiagotiagot
    @tiagotiagot 5 months ago +1

    Has anyone figured out a way to make context be abstracted into a fractal non-euclidean "mind palace" the LLM can traverse to find any past information, and always find more places to fit new info?

    • @chrism3440
      @chrism3440 5 months ago

      How would you fetch from this fractal structure? How would it organize its categorizations for fetchability?

    • @tiagotiagot
      @tiagotiagot 5 months ago

      @@chrism3440 I'm not sure. For a long time I've had the gut instinct that some sort of convolution-style compression might be applicable to concepts instead of image pixels, sorta like how, when we don't remember something but it's on the tip of the tongue, we can remember things similar to it, what is around it, what category it is, etc. For humans, remembering seems a lot like following a scent trail: you catch a whiff and move to where it's stronger. Maybe that could partially be how this could work: each edge of a node in the graph would have a different smell, representing a convolution of everything that can be reached going through it, with things requiring fewer steps being more over-represented, but things many nodes down still having a hint of their smell coming through that door.
      And perhaps there could be some sort of video-game-style culling of nodes too many hops away, where nodes at the distant surface of the graph's volume get streamed in and out of the GPU as needed, while the LLM itself is always at a moving center position where neighbors up to the Nth level are already/still loaded in VRAM, and by the time it reaches the old surface the new stuff has already had time to load at full "LOD".
      Another related intuition, just at the edge of my knowledge: perhaps it could be something like a non-euclidean NeRF/gaussian-splatting abstraction, with working mirrors, lenses, wormholes, etc., where instead of visual data the "pixels" from a different perspective compose different concepts. I know there are already some projects like that (for 3D visual data) handling more data than fits in VRAM (or than can be rendered fast enough all at once), streaming data from disk as needed.
      I never looked into the finer details of how vector databases work; maybe they already do something similar, dunno; it could perhaps have some elements of it.

  • @KCM25NJL
    @KCM25NJL 5 months ago

    A combination of layer-wise attention caching and selective attention computation would definitely make large-context workflows more efficient. I'd also like to see whether such a thing as in-context token quantization / vector graphing can be achieved without the need to offload to external DBs... I think this would be a worthwhile area of research.

  • @FunwithBlender
    @FunwithBlender 5 months ago +2

    It's the desk space to operate in, that's the context window.

  • @toadlguy
    @toadlguy 5 months ago +1

    What a great interview! And also a great advancement. So many questions. Leo said it is more like training a model than "pre-training" to create these larger context windows; however, as far as I know, there is no access to the training data from Meta, and the compute necessary would seem prohibitive, so there must be some way they are modifying or manipulating the weights to recognize the "distance encoding". Would be great to get more clarity on that.
    Also, I would be very interested to know the trade-offs between using a pre-trained model (like with your company's code base) vs. a cached large-context model. Obviously changing the code base would be easier in the large-context model, but what trade-offs would there be? Also, is there any mechanism for caching multiple parts of the context window?
    Finally, although this might be interesting for things like video, it would seem the compute necessary would be somewhat prohibitive unless you can cache the "attention", which would make more sense for text.
    Matt, it would be great if you could do a follow-up on this after you have tried it yourself (and even better if you could get Leo back again). Great stuff!!

  • @AssWann
    @AssWann 5 months ago +1

    Berman, I am still looking forward to the Matt3 (Matt to the third power) AI conference with you, Matt Wolfe, and MattVidPro.

  • @tiagotiagot
    @tiagotiagot 5 months ago

    What happened to that thing with "attention sinks" allowing infinite context sizes, which could be implemented in the inference runtime (dunno what they're called, the stuff that runs the LLMs) without even needing to modify the models? (Sorry, I don't remember which channel talked about it; I just remember the term "attention sinks", from some months ago I think.)
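
For reference, the "attention sinks" idea (from the StreamingLLM work) keeps the KV entries of the first few tokens permanently, plus a sliding window of recent ones, evicting the middle. A minimal sketch of that cache policy with illustrative names, not a full runtime:

```python
from collections import deque

class SinkKVCache:
    """Keep the first n_sink tokens' KV entries forever, plus a sliding
    window of recent ones; everything in between gets evicted."""
    def __init__(self, n_sink: int = 4, window: int = 2048):
        self.n_sink = n_sink
        self.sink: list = []                # "sink" tokens, never evicted
        self.recent = deque(maxlen=window)  # deque drops the oldest entry

    def append(self, kv_entry) -> None:
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def visible(self) -> list:
        """The KV entries attention is computed over at this step."""
        return self.sink + list(self.recent)
```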

  • @FunwithBlender
    @FunwithBlender 5 months ago

    To reason over an entire codebase, we need to look at tokenization, data prep, and ideally dataflow and explicit graph data... it's a solvable but complex issue.

    • @toadlguy
      @toadlguy 5 months ago

      Would be interested to know if this has really been successfully done for anything other than test cases.

  • @ManjaroBlack
    @ManjaroBlack 5 months ago

    A larger context window is not always best. The larger the context window the higher the quality of the context matters. This means it’s even more important to ensure the information available to the model within the context is specific and verbose. The quality of the output that the LLM adds to the context is very important. A larger context window is not good for smaller variants of models. Also high compression/quantization really affects the quality of the output adding to the issue of context quality.

  • @johnbollenbacher6715
    @johnbollenbacher6715 3 months ago

    Suggested programming test:
    Write a function in Ada that takes a pair of numbers and an array of such pairs and returns true if there are an odd number of occurrences of the pair in the array.
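
The comment asks for Ada; for anyone wanting to check a model's answer, here is the expected behavior as a short Python reference (illustrative only, not part of the video):

```python
def odd_occurrences(pair: tuple[float, float],
                    pairs: list[tuple[float, float]]) -> bool:
    """True iff `pair` occurs an odd number of times in `pairs`."""
    return pairs.count(pair) % 2 == 1

assert odd_occurrences((1, 2), [(1, 2), (3, 4), (1, 2), (1, 2)])  # 3 times
assert not odd_occurrences((1, 2), [(1, 2), (3, 4), (1, 2)])      # 2 times
```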

  • @countofst.germain6417
    @countofst.germain6417 5 months ago

    Great interview, you asked a lot of questions I was curious about.

  • @Goggleboxing
    @Goggleboxing 5 months ago

    Thanks for this. What's to stop bad-faith actors from inputting someone else's IP and having the AI reword/recode it to pass it off as their own? Whether that's an author's book, a screenplay, a video, a piece of music, scientific research, etc.

  • @dr.mikeybee
    @dr.mikeybee 5 months ago

    Thanks for mentioning Cursor. I'm going to try it.

  • @alexcoventry7580
    @alexcoventry7580 5 months ago

    Is their million-token context implementation public? Or is it just that the base Llama 3 is open-source?

  • @robertheinrich2994
    @robertheinrich2994 5 months ago

    I really want to see that in action. I currently play around with Llama 3 abliterated; that thing is interesting.
    The only problem I see with 1M tokens: my computer is already quite slow with 8k tokens, and even bumping the setting to 16k would be painfully slow.
    But I guess there are tasks where it totally makes sense to give the task to the LLM in the evening and have the response ready by the next day.

  • @ITSupport-q1y
    @ITSupport-q1y 5 months ago

    Brilliant, good questions.

  • @aa-xn5hc
    @aa-xn5hc 5 months ago

    Good questions, good answers.

  • @AssWann
    @AssWann 5 months ago +1

    Now if only they wouldn't censor prompts like pansy snowflakes, them and everyone else. I've never tried this 1 million token thing; I'm still looking for AIs that do basic things without being blocked.

  • @ISK_VAGR
    @ISK_VAGR 5 months ago

    Matt, it is a great video. However, I am skeptical that this is really as fast as the 8,000-token version. That seems more like a physical impossibility than a problem with training. It's the simple fact that it requires more time to find information: if you have more information, you will require more time, unless you increase computational power. However, it is really remarkable that they can fit a 1 million token context window with relatively high performance. I would love to test it.

  • @stanpikaliri1621
    @stanpikaliri1621 5 months ago

    Nice, can't wait for 1M tokens to improve my AI personality and to add some more stuff to its memory; I should also be able to chat longer without the AI model hallucinating. Just really wish to get an unbiased, uncensored model at some point.

    • @Jacstaoisitio
      @Jacstaoisitio 5 months ago

      Not convinced that hallucinating LLMs will ever go away.

  • @arunsammitpandey86
    @arunsammitpandey86 5 months ago

    Thanks for this video!

  • @bamit1979
    @bamit1979 5 months ago

    Unable to register on Gradient. Wonder if Gmail is accepted as a registration email.

  • @mafaromapiye539
    @mafaromapiye539 5 months ago

    Generative AI models are for wisdom mining; they feel like simple systems of Earth from a draughtsman's perspective.

  • @pensiveintrovert4318
    @pensiveintrovert4318 5 months ago +1

    Have they ACTUALLY used their 1 million window on any use case successfully? Or is it just a claim that it would be helpful? I have yet to hear that Gemini is creating any killer apps.

    • @4.0.4
      @4.0.4 5 months ago

      The problem is it costs _$7 per prompt._ You can't build a killer app out of something that expensive right now.

    • @pensiveintrovert4318
      @pensiveintrovert4318 5 months ago

      @@4.0.4 I started using OpenAI with gpt-pilot and turned it off after $92. Now I just use ChatGPT to produce chunks of code and then slap it together by hand. The latter seems to work reasonably well.

  • @leewilliams5828
    @leewilliams5828 5 months ago

    ChatGPT-4o's context window is still only a little over 4,000, no?

  • @spectator59
    @spectator59 4 months ago

    How much VRAM for 1M context, tho?

  • @arcamari1222
    @arcamari1222 5 months ago

    I really like your channel and I find your videos on artificial intelligence extremely interesting. Thanks to the subtitles, I am able to follow and better understand the topic, which I am very passionate about (I am Italian). However, I have noticed that I often find myself unsubscribed from the channel for no reason. It has happened 4 or 5 times already, and I can't figure out if it's a technical issue or if I'm being removed by the channel owner. Could you help me understand why this happens and how I can solve the problem? Thank you.

  • @gileneusz
    @gileneusz 5 months ago

    I tried this model, and via Ollama it just produces trash, completely unusable compared to regular Llama 70B. Maybe it's a bug or something.

  • @tollington9414
    @tollington9414 5 months ago

    The latest Gemini model has a 2M context window.

  • @TheReferrer72
    @TheReferrer72 5 months ago

    I think it's best to wait for Meta to increase the context length.

    • @INTELLIGENCE_Revolution
      @INTELLIGENCE_Revolution 5 months ago

      Lol. He's a research scientist. This is literally his remit 😅

  • @TobiasWeg
    @TobiasWeg 5 months ago

    Great interview and very interesting.

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 5 months ago

    The needle needs to not stick out. Have it, for instance, change the name of a character that is only stated once. This should also be done on a text the model was never trained on, so War and Peace should not be used. Then ask: what was the name of the character that did X?
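
A toy harness for the harder needle test proposed above (hypothetical helper names; the grading is deliberately simple):

```python
def plant_needle(text: str, old_name: str, new_name: str) -> str:
    """Swap a character name that appears exactly once in a text the model
    was never trained on; the swapped name is the 'needle'."""
    assert text.count(old_name) == 1, "needle must appear exactly once"
    return text.replace(old_name, new_name)

def passed(model_answer: str, new_name: str) -> bool:
    # The model should recover the planted name when asked
    # "What was the name of the character that did X?"
    return new_name.lower() in model_answer.lower()
```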

  • @truepilgrimm
    @truepilgrimm 5 months ago +1

    Nice.

  • @brunodangelo1146
    @brunodangelo1146 5 months ago +1

    Dude, what happened to your thumbnails?

  • @penshon7775
    @penshon7775 5 months ago +4

    Omg, this dude has a 1 mil token window too. Talking and talking and talking... something that could be said in two words.

  • @NotU-eg1jf
    @NotU-eg1jf 5 months ago

    Who sees the cigarette?

  • @大支爺
    @大支爺 5 months ago

    It's useless for me because they're censored and don't support multiple languages.

  • @clearmind3022
    @clearmind3022 5 months ago +1

    It's kind of ironic: the creation he's making will soon outperform him, making him obsolete. The creation will outgrow the creator; ironic. Hopefully you build a good friendship with it.

    • @TheRealUsername
      @TheRealUsername 5 months ago

      Probably a next-gen Transformer model could, but not the current models.

    • @pensiveintrovert4318
      @pensiveintrovert4318 5 months ago

      Gemini has had 1 million context for months now. Not a single new killer app created, nor any new Shakespeare plays written.

  • @MikeWoot65
    @MikeWoot65 5 months ago +1

    Wait.. the guy with a beard is a He/Him? So glad that he filled that out on LinkedIn. I would have never guessed.

    • @AI-under-Five
      @AI-under-Five 5 months ago

      You should consider educating yourself on this matter. What someone appears to be does not determine their gender identity. Additionally, this is how the world is now; by including pronouns, people are helping to normalize this practice and support a more inclusive environment.

    • @cesarsantos854
      @cesarsantos854 5 months ago

      No, it is just snowflakes virtue signaling their illness.

    • @MikeWoot65
      @MikeWoot65 5 months ago

      @@AI-under-Five Oh right. Gender is a social construct, right? Literally saying that men/women are defined by their behaviors within a society. You are literally defining women based on gender norms and gender roles. I thought we fought to get rid of gender roles? Now you're saying those are the EXACT things that we should use to define ourselves?
      Imagine a political ideology forcing you to change your definition of what a woman is, and then saying: this is how the world is now. The hubris is staggering.

    • @MikeWoot65
      @MikeWoot65 5 months ago

      @@AI-under-Five How can gender be a social construct, but when choosing your gender, it has nothing to do with gender norms which are socially constructed?
      And this is not how it is taught in schools. I'll refer to the def of gender via the WHO "Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles"

    • @AI-under-Five
      @AI-under-Five 5 months ago

      Yikes 😅. Gender roles are a social construct. This is extensively researched by anthropologists, psychologists, sociologists… a bunch of -ologists 😝.
      Gender identity is not a social construct; it’s based on how a person feels innately. There’s a wealth of scientific research supporting this. Please refer back to my point about self-education. Or maybe just try to be a nicer human being 😘.
      Political ideology… lolllllll. Feels myopic.

  • @BradleyKieser
    @BradleyKieser 5 months ago

    That was brilliant

  • @lunevka
    @lunevka 5 months ago

    hmm

  • @GoysForGiza
    @GoysForGiza 5 months ago

    You should make two different channels: one with tutorials and the like, and one with shit like this.

  • @Ms.Robot.
    @Ms.Robot. 5 months ago

    I cannot repost what my AI says (omg) 🔥, but I wish I could 😊. What are tokens? 😁

  • @tex1297
    @tex1297 5 months ago

    Any time I see a pro tech company with random tubes and inappropriate stuff in the background, I know they won't show up again 🤣

  • @mafaromapiye539
    @mafaromapiye539 5 months ago

    Yeah, Gemini Flash 1.5 has 1M context; its web app uses more VRAM as you approach high context.
