I was dying for this tutorial. Thanks, man!
Thank you, my friend. One question: why don't you use the following?
The template used to build a prompt for the Instruct model is defined as follows:
[INST] Instruction [/INST] Model answer [INST] Follow-up instruction [/INST]
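In case it helps anyone, here is a minimal sketch of using that template in code, assuming the Hugging Face transformers chat-template API and the mistralai/Mixtral-8x7B-Instruct-v0.1 tokenizer (swap in whichever checkpoint you are actually loading):

from transformers import AutoTokenizer

# Assumed model id - replace with the checkpoint you actually load.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Manual construction of the instruct template quoted above.
prompt = "[INST] Instruction [/INST] Model answer [INST] Follow-up instruction [/INST]"

# The tokenizer can also build the same template (including special tokens
# such as <s> and </s>) from a list of messages.
messages = [
    {"role": "user", "content": "Instruction"},
    {"role": "assistant", "content": "Model answer"},
    {"role": "user", "content": "Follow-up instruction"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))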
You are my go-to guy for anything open source. Thanks for your work, bhai 🙏
Glad it's helpful
Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and text for relation extraction or a similar task? Could you do it with an open-source multimodal LLM and open multimodal datasets, like Video-LLaMA, so anyone can build on their own experiments with the help of your tutorial? Could you also cover how to boost the performance of the fine-tuned model using prompt tuning in the same video?
This guy is right on! Asking the right questions! Although I don't expect him to answer all of that, you are definitely headed in the right direction!
Make a video on fine-tuning Mixtral 8x7B and how to use it in production.
I have a 2060 Super. Only 4% slower than a 3060, but only 8GB VRAM. I have 64 GB of DDR5 RAM and a 14900K CPU (with an NPU). I bet I could run it in 2-bit, but I never thought I'd go below 4-bit. Frankly, I just see 8x7B as a less efficient version of having several models fine-tuned for specific tasks. A couple of 4-bit 7B models can fit in 8GB of VRAM.
I think for general tasks it might be a good option. If you are working on a specific application, I would recommend fine-tuning a smaller model and using that instead. It will probably be a better option.
@engineerprompt Yeah, the total is smaller than the sum of its implied parts, so for a general-purpose model it's probably more efficient. Eight 7B models at 16-bit would usually take around 112GB, instead of the roughly 90GB Mixtral needs.
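For anyone checking the arithmetic, here's a rough back-of-the-envelope sketch, assuming 2 bytes per parameter at fp16 and roughly 47B total parameters for Mixtral (the experts share the attention layers, so it's less than 8 x 7B):

# Rough memory footprints in GB (2 bytes per parameter at fp16).
bytes_per_param = 2
eight_separate_7b = 8 * 7e9 * bytes_per_param / 1e9  # ~112 GB for eight standalone 7B models
mixtral_fp16 = 47e9 * bytes_per_param / 1e9           # ~94 GB for Mixtral 8x7B
mixtral_4bit = 47e9 * 0.5 / 1e9                       # ~24 GB at 4-bit, before runtime overhead
print(eight_separate_7b, mixtral_fp16, mixtral_4bit)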
Please bring some multilingual (Hindi) TTS voice cloning on Colab.
Thank you, this is amazing, I will use it for sure! Could you make a video using this method with free Kaggle? Since you can use two 16GB T4 cards at the same time in the same instance, along with 30 GB of RAM, this should run a lot faster. Pretty please! Also, I'm sure that free-tier Kaggle videos will get you tons of views for your channel. Best wishes to you and your loved ones, and happy 2024!
Thank you for the wishes, happy new year to you too! Kaggle is a great option. I haven't looked at it in a while but will see what I can do. I didn't know that they now offer two GPUs. Will explore that further.
Amazing
I imagine it's costly to run LLMs... is there a limit on how much Google Colab will do for free?
I'm interested in creating a Python application that uses AI. From what I've read, I could use the OpenAI Assistants API (with GPT-4), and I as the developer would incur the cost whenever the app is used.
Alternatively, I could host a model myself with something like Ollama, on my own computer or in the cloud (Beam Cloud / Replicate / Streamlit / Replit)?
Great video! Is there a way to upload your own RAG documents to this?
The model can be 30+GB. Not surprising that it takes a while to load.
Thanks
How can I get it to write several pages of text? Even though I set max tokens to 32k and tell it to write 10 pages, it still outputs only one page of text.
Can we run an uncensored model?
I think so, but it needs to be converted into HQQ format first.
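For reference, the conversion looks roughly like this with the hqq package. This is a minimal sketch assuming the HQQModelForCausalLM wrapper and BaseQuantizeConfig from the HQQ repo; the model id is a placeholder for whichever checkpoint you want to try:

from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

# Placeholder model id - substitute the checkpoint you actually want to quantize.
model_id = "some-org/some-uncensored-model"

# 2-bit HQQ config, similar in spirit to the Mixtral setup shown in the video.
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)

model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)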
Is this better than chatgpt 3.5?
yes
On benchmarks, yes
You can run a quantized 4-bit Mixtral literally on any recent computer with 32 GB of RAM, without any GPU at all. I don't understand why you need Google Colab here; memory is ultra-cheap these days.
Do you have a reference for a tutorial about how to do it? Thanks
@unkim7085 Or do the same in Ollama - it just works there.
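Not a full tutorial, but the short version is to grab a 4-bit GGUF build and run it with llama.cpp or Ollama. Here is a minimal sketch with llama-cpp-python; the model path is just an example filename, point it at whichever quantized GGUF you download:

from llama_cpp import Llama

# Example path - point this at the 4-bit GGUF file you downloaded.
llm = Llama(model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", n_ctx=4096)

output = llm("[INST] Explain mixture-of-experts in two sentences. [/INST]", max_tokens=256)
print(output["choices"][0]["text"])

With Ollama installed, "ollama run mixtral" should pull a quantized build for you automatically.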
Can it work with 8 GB of RAM?
Is there a way to make this model work in the oobabooga Text Generation WebUI running in Google Colab? Thanks!