Run Mixtral 8x7B MoE in Google Colab
- Published on 27 Jul 2024
- Run the mighty Mixtral 8x7B MoE on free Google Colab. Mixtral is a huge 45B-parameter model, but with offloading you can run it on consumer-grade GPUs.
🦾 Discord: / discord
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Patreon: / promptengineering
💼Consulting: calendly.com/engineerprompt/c...
📧 Business Contact: engineerprompt@gmail.com
Become Member: tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: bit.ly/localGPT (use Code: PromptEngineering for 50% off).
LINKS:
Technical Report: arxiv.org/pdf/2312.17238.pdf
Github Repo: tinyurl.com/msuj2v47
Google Colab: tinyurl.com/2nn5snb4
Huggingface: tinyurl.com/csnapujn
TIMESTAMPS:
[00:00] Intro
[00:30] Understanding the Offloading Paper
[03:00] Running the Model on Google Colab
[04:15] Walking Through the Notebook
[06:26] Running the Model and Generating Responses
[07:24] Examples of Model Outputs
[08:16] Final Thoughts
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu... - Science & Technology
i was Dyingggg for this tutorial. thanks mannnn
Amazing
You are my go to guy for anything open source. Thanks for your work bhai 🙏
Glad it's helpful
Thanks
Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and text, for relation extraction or another specific task? Can you do it using an open-source multimodal LLM and open multimodal datasets, so anyone can take their experiments further with the help of your tutorial? Can you also talk about how we can boost the performance of the fine-tuned model using prompt tuning in the same video?
This guy is right on! Asking the right questions! Although I don't expect him to answer all of that, you are def. in the right direction!
Make a video on fine-tuning Mixtral 8x7B and how to use it in production.
Thank you, my friend. One question: why don't you use the following?
The template used to build a prompt for the Instruct model is defined as follows:
[INST] Instruction [/INST] Model answer [INST] Follow-up instruction [/INST]
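For anyone who wants to try the template quoted above, here is a minimal sketch of a helper that assembles it from a list of turns. The function name `build_instruct_prompt` and the `(instruction, answer)` turn format are made up for illustration and are not from the video's notebook. The sketch follows the template exactly as quoted, so it does not add any begin or end of sequence tokens; check the model card for how those are handled by the tokenizer.

```python
def build_instruct_prompt(turns):
    """Build a prompt in the quoted [INST] ... [/INST] format.

    turns: list of (instruction, answer) pairs. Use None as the answer
    for the final turn that the model is expected to complete.
    (Hypothetical helper, for illustration only.)
    """
    parts = []
    for instruction, answer in turns:
        # Each user instruction is wrapped in [INST] ... [/INST].
        parts.append(f"[INST] {instruction} [/INST]")
        # Earlier model answers are appended as plain text.
        if answer is not None:
            parts.append(answer)
    return " ".join(parts)


prompt = build_instruct_prompt([
    ("Instruction", "Model answer"),
    ("Follow-up instruction", None),
])
print(prompt)
```

Passing the resulting string to the model in place of a bare question is the only change needed on the prompting side.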
Please bring some multilingual (Hindi) TTS voice cloning on colab.
Thank you, this is amazing, I will use it for sure! Could you make a video using this method with free Kaggle? There you can use two 16 GB T4 cards at the same time in the same instance, along with 30 GB of RAM, so this should run a lot faster. Pretty please! I'm also sure that free Kaggle tier videos will bring tons of views to your channel. Best wishes to you and your loved ones, and happy 2024!
Thank you for the wishes, happy new year to you too! Kaggle is a great option. I haven't looked at it in a while but will see what I can do. I didn't know that they now offer two GPUs. Will explore that further.
Great video. Is there a way to upload your own RAG documents to this?
I have a 2060 Super. Only 4% slower than a 3060, but only 8GB VRAM. I have 64 GB of DDR5 RAM and a 14900K CPU (with an NPU). I bet I could run it in 2 bit, but I never thought I'd go below 4 bit. Frankly I just see 8x7B as being a less efficient version of having several models fine tuned to a specific task. A couple 4 bit 7B models can fit in 8GB VRAM.
I think for general tasks it might be a good option. If you are working on a specific application, I would also recommend fine-tuning a smaller model and using that instead. That will probably be a better option.
@engineerprompt Yeah, the total is smaller than its implied parts, so for a general-purpose model it's probably more efficient. 8 7B models at 16 bit would usually take around 112GB instead of 90.
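The arithmetic behind the 112GB versus 90GB figures can be checked with a quick back-of-the-envelope sketch. It counts weights only (no activations or KV cache), takes 1 GB as 10^9 bytes, and uses the 45B parameter count from the video description; the helper name `weights_gb` is made up for illustration.

```python
def weights_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Eight independent 7B models stored at 16 bits per parameter.
eight_separate_models = 8 * weights_gb(7e9, 16)

# Mixtral 8x7B at 16 bits, using the 45B figure from the description.
# The experts share the attention layers, so the total is below 8 x 7B.
mixtral_total = weights_gb(45e9, 16)

print(f"8 separate 7B models: {eight_separate_models:.0f} GB")
print(f"Mixtral 8x7B:         {mixtral_total:.0f} GB")
```

The same helper can be called with a lower `bits_per_param` to estimate quantized sizes.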
How do I get it to write several pages of text? Even though I set the max tokens to 32k and tell it to write 10 pages, it still outputs only 1 page of text.
I imagine it's costly to run LLMs.. is there a limit on how much Google Colab will do for free?
I'm interested in creating a Python application that uses AI.. from what I've read, I could use ChatGPT4 Assistant API and I as the developer would incur the cost whenever the app is used.
Alternatively, I could host a model like Ollama, on my own computer or on the cloud (beam cloud/ Replicate/Streamlit/replit)?
The model can be 30+GB. Not surprising that it takes a while to load.
Can we run an uncensored model?
I think yes but it needs to be converted into HQQ format
Is this better than chatgpt 3.5?
yes
On benchmarks, yes
You can run quantized 4-bit Mixtral on literally any recent computer with 32 GB of RAM, without any GPU at all. I don't understand why you need Google Colab here; memory is ultra cheap these days.
Do you have a reference for a tutorial about how to do it? Thanks
@unkim7085 or do the same in ollama - it just works there
Can it work on 8 GB of RAM?
Is there a way to make this model work in oobabooga Text generation WebUI running in a Google Colab? Thx,