Thanks for sharing. It is amazing that you not only create quality videos, but also reply to so many technical problems. You are a great guy.
Thanks! That's actually a really nice compliment, really appreciate it! :-)
Very nice! Since not everyone has 40 GB of VRAM, can you be more specific on how to do this with the Llama 3 8B model? (Because you say we may need to change the data type if we use a different model, and I have no clue how I should know the correct data type 😁)
Great question, thanks for asking! You can see the data type in the config.json inside the Hugging Face repository: search for “torch_dtype”. Bfloat16 is pretty popular but currently does not work for AWQ quantized models, which usually use float16. Hope this is helpful! :-)
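As a concrete example, here is a minimal sketch for the 8B model with vLLM. The repo name and bfloat16 dtype are what the official config.json lists, but double-check them for the exact model you download; the prompt is just a placeholder:

from vllm import LLM, SamplingParams

# Llama 3 8B Instruct lists "torch_dtype": "bfloat16" in its config.json;
# weights alone are roughly 8B params * 2 bytes = ~16 GB, so a 24 GB card is comfortable
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Why is the sky blue?"], sampling_params)
print(outputs[0].outputs[0].text)

If VRAM is still tight, an AWQ-quantized version of the 8B model with dtype="float16" (as in the video) shrinks the weights to roughly a quarter of that.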
Hi, thanks for making super helpful content. I have an RTX 3050, which has 4 GB of VRAM. Is there a way I can use this model locally, and if yes, then how?
I have this error when running pip install autoawq and pip install -r requirements.txt:
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement autoawq (from versions: none)
ERROR: No matching distribution found for autoawq
I came here looking to see nice cute animals. Where are the 3 llamas you advertised I could see on my computer?
thanks for sharing! this was super helpful :D
Hi. Thank you for your video. I want to know one thing. I have multiple CSV files which I want Llama to know about. I have gone through other videos; there is a guy who does the same task I want, but after incorporating the files, Llama cannot respond to other general questions correctly and focuses only on the information in the CSV files. Their method first splits the text into chunks and embeds them using other embedding methods. Can you please provide a solution using only Llama and nothing else? What I want is for Llama to know about my files on top of its already existing knowledge.
You need to embed the knowledge from your CSV files into a vector database properly, and then when you ask about something related to this knowledge, Llama or any smaller model good for vector search should find it (the specific chunk or a couple of them from the DB) and attach it to your question as context. Otherwise you will have to train your model on this data, and that is much more hardware-devouring. AFAIK.
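Roughly, that retrieve-then-prompt flow looks like the sketch below; the embedding model, the chunk strings and the question are all placeholder assumptions, not something from the video:

from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical chunks: each CSV row (or small group of rows) flattened into a short text snippet
chunks = ["sales.csv row 1: 2024-03-01, region=EU, total=1200",
          "sales.csv row 2: 2024-03-02, region=US, total=900"]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model, example choice
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question, k=3):
    # vectors are normalized, so the dot product is the cosine similarity
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(chunk_vecs @ q)[::-1][:k]
    return [chunks[i] for i in top]

question = "What were the EU sales on 2024-03-01?"
context = "\n".join(retrieve(question))
# The retrieved rows are only attached as context, so Llama keeps its general knowledge
prompt = f"Context (use only if relevant):\n{context}\n\nQuestion: {question}"

Once you have many chunks, a real vector database (Chroma, FAISS, etc.) just replaces the numpy search; the important part is that the rows are only added as context, so the model still answers general questions normally.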
@@axelef2344 Hi. Do you have any good video for it? I followed some videos, and yes, my model can answer queries about my specific data, but when I ask other general questions it fails to reply, and it also does not have memory. I want Llama to have knowledge about my data while still being able to answer other general questions, and still have memory.
Is this on Windows or Linux? I can't seem to install the vllm library on Windows.
you are so enthusiastic !! : )
Thanks, really appreciate it! :-)
Hello, just wondering: can you help with doing this on Google Colab?
When’s the next Video?
When I run the pip install command I get the error "Could not find a version that satisfies the requirement flash-attn==2.5.7".
Can you maybe check your pip config? On PyPI you can see that 2.5.7 is actually the most recent version of flash-attn: pypi.org/project/flash-attn/
How do I run Llama 3 70B on 4 x RTX GPUs in Linux?
Asking an LLM questions is fun and everything, but most will want an LLM to act as an "agent base", utilizing a multi-expert foundation: the LLM is tasked with a coding problem, or a finance problem, or rewriting a story, and the LLM base is where the agent front-end goes. How about you front-end something like pythagora;DOT:ai using Llama 3 as a LOCAL backend over API? And (I know I am asking a lot here) provide a training methodology which ingests something like a company's FAQs, help desk, knowledge base, etc.?
Otherwise playing with any AI is more amusement and entertainment than an actual system for productivity.
Yes, definitely a fair call! I'll keep it in mind for future videos to take a look at more advanced use cases, such as autonomous task solving :)
If I wanted to only get the text results and not launch the UI, what should I remove? Thanks!
Basically the entire UI class. You would only load the vLLM engine and then call the generation function directly:
from vllm import SamplingParams  # StreamingLLM is the wrapper class from the video's code

llm = StreamingLLM(model="casperhansen/llama-3-70b-instruct-awq",
                   quantization="AWQ",
                   dtype="float16")
tokenizer = llm.llm_engine.tokenizer.tokenizer
sampling_params = SamplingParams(temperature=0.6,
                                 top_p=0.9,
                                 max_tokens=4096,
                                 # <|eot_id|> is Llama 3's end-of-turn token
                                 stop_token_ids=[tokenizer.eos_token_id,
                                                 tokenizer.convert_tokens_to_ids("<|eot_id|>")])
# history_chat_format is your chat history as a list of {"role": ..., "content": ...} dicts
prompt = tokenizer.apply_chat_template(history_chat_format, tokenize=False)
outputs = llm.generate(prompt, sampling_params)
Can I use it on an i5 12th gen? No GPU?
I need the model to let me upload files for analysis, like ChatGPT's interface.
This would be possible if you further customise the UI. But probably using an existing solution is easier. I think PrivateGPT or Chat with RTX could be helpful for your use case. Is that something you would like me to create a video about? :)
Can I run the 70B with an RTX 3090, which has 24 GB of VRAM? And how would I do it?
Great question! Fitting 70B parameters into 24 GB of VRAM, where the default precision for a parameter is 16 bits (8 bits = 1 byte), is a challenging task. Even if you quantize the model weights to 3-bit precision, fitting the model is difficult. However, there is still hope: you can offload some model weights to RAM and dynamically load the required weights during inference. Of course, this approach is not ideal due to the latency of loading and offloading the weights, but a generation speed of 2 tokens/sec still seems possible, which is not too far off from human reading speed (~7 tokens/sec). The llama.cpp library is probably a good solution here: www.reddit.com/r/LocalLLaMA/comments/17znv35/how_to_run_70b_on_24gb_vram/
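To put rough numbers on it, here is a tiny back-of-the-envelope calculation (weights only, ignoring the KV cache and activations):

params = 70e9                 # 70B parameters
print(params * 16 / 8 / 1e9)  # float16: ~140 GB
print(params * 4 / 8 / 1e9)   # 4-bit (e.g. AWQ): ~35 GB
print(params * 3 / 8 / 1e9)   # 3-bit: ~26 GB, still above 24 GB before the KV cache is added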
Thanks. Great job!
Can I use it with a 3080 GPU?
Yes, you can :)
does it do images
I too have an RTX 6000, but only in my dreams. 🤑
Support 💙😊
Thanks! But lol, Joe Average doesn't have a 10,000+ euro GPU :D
Nice
Sweeet!!
So basically watching all of this was a waste of time for someone with a GTX 1650 and 4 GB VRAM, got u.
Let's be real, it's not GPT-4. I don't know why people insist on making this false equivalency. No open-source model has come even close to GPT-4. They can release all the benchmarks they want and blah blah blah; using the two models, you immediately see that Llama 3 is still quite a bit weaker than GPT-4. We'll see when the 300B version comes out. I'm not holding my breath, though. If it still falls short, it will be at least another year and a half, maybe two, before a Llama 4 comes out that finally surpasses it, but by that time GPT-5 will be out and Llama will again be behind, of course.
OpenAI, despite the name, has gone down the closed-source route. This makes them dependent on their own software engineers. More open LLMs like Llama 3 have the advantage of a huge community of developers. One will be like Windows and the other like Linux. Which is better? It depends on the use case scenario.
@@lucamatteobarbieri2493 My use case is complex coding tasks. Sure, maybe on some RAG stuff Llama 3 can hang with GPT-4, but the advanced reasoning, context handling, and instruction following are still not anywhere near where they need to be for my use case.
Yes agreed, it’s better than ChatGPT (GPT-3.5) but worse than GPT-4. I think the 400B+ model will achieve GPT-4 level performance. Of course, it would be helpful to know how many tokens it has already been trained on and how many more Meta plans to train it on, but the current benchmarks look very promising!
Can I get your email for business inquiries?