Extending Llama-3 to 1M+ Tokens - Does it Impact the Performance?
- Published Jun 7, 2024
- In this video, we will look at the 1M+ token context version of Llama-3, the best open LLM, built by Gradient AI.
🦾 Discord: / discord
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Patreon: / promptengineering
💼Consulting: calendly.com/engineerprompt/c...
📧 Business Contact: engineerprompt@gmail.com
Become a Member: tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: bit.ly/localGPT (use Code: PromptEngineering for 50% off).
Signup for Advanced RAG:
tally.so/r/3y9bb0
LINKS:
Model: ollama.com/library/llama3-gra...
Ollama tutorial: • Ollama: The Easiest Wa...
TIMESTAMPS:
[00:00] LLAMA-3 1M+
[00:57] Needle in Haystack test
[02:45] How is it trained?
[03:32] Setting Up and Running Llama3 Locally
[05:45] Responsiveness and Censorship
[07:25] Advanced Reasoning and Information Retrieval
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...
CORRECTION: There is a mistake in the long-context test in the video [the haystack test towards the end of the video (the Tim Cook and Apple question)]. If you set the context length in an ollama session and then exit it, you have to set the context length again in the new session. Parameters set in one session do not persist across sessions. An oversight on my end, and thanks to everyone for pointing it out.
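For anyone re-running the test, here is a rough sketch using the ollama Python client. The client and the num_ctx option are real; the file name is a hypothetical stand-in for whatever long document you feed it, and the question mirrors the video's Apple/Tim Cook test:

```python
import ollama

# hypothetical long document used as the haystack (file name is illustrative)
long_document = open('apple_10k.txt').read()

resp = ollama.chat(
    model='llama3-gradient',
    messages=[{'role': 'user', 'content': long_document + '\n\nWho is the CEO of Apple?'}],
    # num_ctx must be passed on every request; it does NOT persist across sessions
    options={'num_ctx': 256000},  # ~256K context needs a lot of RAM/VRAM
)
print(resp['message']['content'])
```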
So did it work correctly after setting the context length?
The AI went from scholarly professor to unintelligible drunk quite quickly.
I've been testing it. It has a hallucination issue when large text is put in. However, the writing is good, so even the hallucinations are interesting.
Have you tried lowering the temperature?
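For anyone who wants to try that, a minimal sketch with the ollama Python client; temperature and top_p are standard ollama sampling options, and the prompt is just a placeholder:

```python
import ollama

# lower temperature (and optionally top_p) to make long-context outputs
# more deterministic and less prone to creative hallucination
resp = ollama.chat(
    model='llama3-gradient',
    messages=[{'role': 'user', 'content': 'Summarize the pasted document.'}],
    options={'temperature': 0.2, 'top_p': 0.9},
)
print(resp['message']['content'])
```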
Need more information about this model. How does it work with AutoGen or other agent systems?
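On the agent question: most frameworks, AutoGen included, can point at any OpenAI-compatible endpoint, and ollama serves one locally at /v1. A hedged sketch using the openai client rather than any specific agent framework; the base URL is ollama's documented default and the prompt is illustrative:

```python
from openai import OpenAI

# ollama exposes an OpenAI-compatible API, which agent frameworks such as
# AutoGen can be configured to use as their LLM backend
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')  # api_key is required but unused

resp = client.chat.completions.create(
    model='llama3-gradient',
    messages=[{'role': 'user', 'content': 'Plan the steps to summarize a 200-page report.'}],
)
print(resp.choices[0].message.content)
```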
Thank you for the video. When you exited and reran the model, shouldn't you also have reset the context window to 256K?
That's a valid point. I mistakenly thought it persists for the LLM, but it seems you actually have to set it for each session.
Thanks a lot for this showcase! That test you've done is fantastic:
"A glass door has 'push' on it in mirror writing. Should you push or pull it?
Please think out loud step by step."
I've tried with several llama3 8b variants, down to the llama3:8b-instruct-q4_1 quantization, which ends quickly with a spot-on: "So, to answer the question: You should pull the glass door"
I'm able to reproduce the infinite output you get with llama3-gradient:8b-instruct-q5_K_M.
So there was indeed something broken in this fine-tune for larger contexts. I was hoping to leverage a larger context in an application with llama3, but I guess this won't be the model for that.
And I've tried with the full fp16 from Ollama too.
It does seem to stop consistently, but it answers wrong:
"Conclusion: Even though the word is written backwards when looking from within your reflection, I should still try opening the glass doors by pushing them like they would say."
There are 16K and 64K versions which are fine-tuned. Might be interesting to look into those.
@@engineerprompt thanks for the suggestion, I will 😌
So I guess this will not run on a GTX 1080 Ti?
Thanks for the update. Has anybody tried doing "real" RAG with multiple documents? Can you access this model through Groq?
You can look into localGPT :)
Due to the autoregressive nature of LLMs, the chance of an error-free output decays exponentially with every passing token. Be careful what you wish for.
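To put a rough number on that claim: if each token is wrong independently with probability p, the chance of a fully error-free generation of n tokens is (1-p)^n, which collapses quickly at long-context lengths. A toy calculation, with a made-up error rate purely for illustration:

```python
# toy arithmetic: per-token error probability p, chance an n-token output is error-free
p = 0.001  # assumed per-token error rate, purely illustrative
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens: P(error-free) = {(1 - p) ** n:.3e}")
```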
Yes, without multiple needle runs, the test is pretty weak.
agree.
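Something like the sketch below would strengthen the test: place the needle at several depths in the haystack and repeat the query each time. The needle, filler text, and context size are arbitrary choices:

```python
import ollama

NEEDLE = "The secret passphrase is 'blue-giraffe-42'."
FILLER = "The quick brown fox jumps over the lazy dog. " * 2000  # ~20-25K tokens of filler

for depth in (0.1, 0.5, 0.9):  # test several needle positions, not just one
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    resp = ollama.chat(
        model='llama3-gradient',
        messages=[{'role': 'user', 'content': haystack + "\n\nWhat is the secret passphrase?"}],
        options={'num_ctx': 32768},  # must be large enough to hold the haystack
    )
    print(f"depth={depth}: {resp['message']['content'][:120]}")
```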
May I ask which system you are using to run this?
I am using M2 Max 96GB to run this.
Can you do a review of AirLLM? It lets you run a 70B model on 4GB of VRAM.
Haven't seen that before. Will explore what it is.
Does it let us run models of that size without a GPU, i.e., on CPU only?
You need to set context size in each session. Not just once. That's why the needle test failed.
The setting isn't persistent? Major oversight in the video if that's the case.
@@JoeBrigAI They are persistent per session. Once you enter /bye, that's it.
that is true. I thought it to be otherwise. Added a pinned comment to highlight this.
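One way to avoid re-setting it every time is to bake the parameter into a derived model with a Modelfile. A sketch that shells out to the real `ollama create` CLI; the derived model name is made up:

```python
import os
import subprocess
import tempfile

# persist num_ctx by deriving a new model; the PARAMETER line makes this
# context length the model's default in every future session
modelfile = "FROM llama3-gradient\nPARAMETER num_ctx 256000\n"

with tempfile.NamedTemporaryFile('w', suffix='.Modelfile', delete=False) as f:
    f.write(modelfile)
    path = f.name
try:
    subprocess.run(['ollama', 'create', 'llama3-gradient-256k', '-f', path], check=True)
finally:
    os.remove(path)
```

After that, `ollama run llama3-gradient-256k` starts with the larger window by default.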
Interesting, this one didn't bring a ladder to the party for the joke haha. About the model not stopping, it's probably related to RoPE (Rotary Position Embeddings). If someone messed with that, things could go on forever. In any case, quantization definitely affects the model's behavior.
haha, that's true. Pleasantly surprised with the joke :) that's actually a good point with RoPE.
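For context on the RoPE point: rotary embeddings encode position as rotation angles applied to query/key pairs, and long-context fine-tunes typically raise the frequency base so angles grow more slowly across positions. A minimal numpy sketch of the idea; the base value shown is Llama-3's published default, not Gradient's exact setting:

```python
import numpy as np

def rope_angles(positions, dim, base=500_000.0):
    # base=500_000 is Llama-3's default; long-context tunes raise it further
    # so rotation angles grow more slowly, stretching the usable position range
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # (seq_len, dim/2) rotation angles

def apply_rope(x, angles):
    # rotate each (even, odd) channel pair by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8, 64)               # 8 positions, head dim 64
q_rot = apply_rope(q, rope_angles(np.arange(8), 64))
```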
you need to pick up an M4 Mac Studio when it comes out ;)
indeed :D
Phi-3 128K got worse compared to the 4K version when I tried to have it analyze a program I gave it.
13:16 it appears the LLM was suffering from schizophrenia at that moment 😅
100+ GB of VRAM for a 4-bit quantized model? 🙄 Are you sure that's for the quantized one?
The issue with all these "benchmarks" is that they are all lies. We need a better, real benchmark for LLMs, because what we get from the people who make these models are all lies. They don't perform well when it comes to doing real work or anything in real production. They all suck compared to closed models. It's like the people who benchmark them show very cherry-picked examples.
Yes, some models are good for daily use and in some cases better than ChatGPT's GPT-3.5, but I've never used one that comes close to GPT-4. And in some use cases GPT-3.5 is still better than Mistral in my own experience. So they really should publish the real benchmarks.
Lmao, 64GB VRAM. I'm using an RTX 4070, which only has 8GB of VRAM.
:)
How did you manage to pick up an 8GB 4070? Even the Founders Edition has 12GB.
@@kecksbelit3300 I'm using an Acer Predator Helios Neo 16 laptop.
Why are you using ollama and not llama.cpp directly?
Just ease of use.
64GB VRAM 😂