So this seems like the basis for a business: offer to train a custom model for product documentation, FAQs, etc., with a specific product or company focus. Cool!
Or closed-domain semantic search with summarization.
@E Marrero Can you explain that a bit more? I'm new to the machine learning space.
@@handsanitizer2457 It's a bit too much to explain here, but search YouTube for "openai embeddings" or "embedding searches" and you will get a general idea of how models can be used for search, not only OpenAI's but other open-source models as well. Fine-tuning a model on a closed domain will help it understand your company's data better. You can also fine-tune it to reply in a certain way, which opens the door to many options. ChatGPT was trained this way, but more for conversational outputs.
Does Chatbase use this technique? It does the training on website or file data very fast.
@@ArjunKrishnaUserProfile I am pretty sure it doesn't train the model; that would be way more expensive than embeddings.
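To make the embedding-search idea above concrete, here's a minimal sketch using the sentence-transformers library (the model name and documents are just illustrative choices, not from the video):

```python
# Minimal closed-domain semantic search: embed the docs once,
# then rank them by cosine similarity against each query.
from sentence_transformers import SentenceTransformer, util

docs = [
    "Our product supports SSO via SAML and OAuth2.",
    "Refunds are processed within 5 business days.",
    "The API rate limit is 100 requests per minute.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(docs, convert_to_tensor=True)

query = "How long do refunds take?"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```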
This is great. There are not many channels on YT that do this kind of stuff. Would appreciate more like this: other frameworks like DeepSpeed, useful datasets, training-parameter experiments, etc. So much interesting stuff that is not covered on YT.
Perfect balance of theory and hands-on, with a Colab attached to most of your videos. Much, much appreciated. I recommend this channel to everyone who wants to follow this crazy trend of LLM releases; it's the best path to keep all of us up to date! I learn so much thanks to you, Sam. Thanks a ton. Keep moving forward.
Awesome! Been waiting for your take on this topic.
You continue to make videos on exactly the things I'm trying to understand more deeply! Fantastic! There are a lot of detailed parameters in this video that you could certainly continue to elaborate on for those of us who aren't programmers...yet :) Looking forward to more of your vids!
Many have said it but I'll reiterate -- your LLM videos are really great to watch, both the pace and the way you go from high level overviews to the detailed info.
I also appreciate that it's not just focused on ChatGPT/GPT-4/hosted-models all the time and talks more about local training/finetuning/inferencing.
Sam, thanks for giving your audience what they asked for! The Alpaca training video you made makes much more sense now.
Thanks for the awesome explanation. Going to binge your videos.
These fine-tuning-related topics are especially relevant to me right now. Currently training llama-30b variants at 4-bit. I’m very interested in how to roll adapters/checkpoints back into base models to keep VRAM usage down during inference (under 24GB)
Hi, I am also interested. Can we connect via email?
Hey, I am also facing the same issue. Did you find any update, and could you help me out please?
Very underrated channel. You deserve more viewers and subs.
Thanks for the kind words.
Great video on fine-chewning 😄
I would love a vid covering examples of the different dataset formats that can be used to train a LoRA, and the abilities that each kind of dataset training enables. Or, put another way: what kinds of behavioral changes can we use LoRA to fine-tune for in a model, and how do we then know what type of data formatting to use to get a chosen outcome? :D
Awesome explanation, this is exactly what I was looking for. Thank you!
Thank you! It'd be great to see more on the data section; everyone always seems to gloss over that part, despite it clearly being the most important part. I've seen a lot of 20-40 min vids (from different YouTubers) on the configuration that barely mention the actual use of the data.
Can you create more videos on instruction-prompt-tuning as well, as a further extension to this video? Amazing work!
Hi Sam, thank you for the video. I'm getting
RuntimeError: expected scalar type Half but found Float running in Colab with GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-a971aa0c-5408-727a-3b72-48b1926b5f66)
On the training loop. What am I missing?
It was a GPU issue; switching GPUs fixed it.
Yeah, I don't think bitsandbytes fully supports V100 GPUs. I have had issues with it in the past.
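A workaround some people have reported (an assumption here, not something from the notebook) is to run the training step under autocast so fp16/fp32 mismatches get cast automatically:

```python
import torch

# Hypothetical workaround for "expected scalar type Half but found Float":
# autocast handles mixed fp16/fp32 ops during the forward/backward passes.
with torch.autocast("cuda"):
    trainer.train()  # `trainer` is the Trainer built earlier in the notebook
```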
Nice work. Thank you.
I would love to see more videos about this, showing people how we could adapt this to our own projects, and maybe even a video about 4-bit tuning.
I’d love a quick video like this on how to use checkpoints from PEFT training to do inference.
When I’m training, I’m never sure how much is too much, and I can easily save checkpoints automatically to resume in case training stops.
What I need to learn is how to use these checkpoints with the base model to do inference so I can test output quality against several checkpoints.
Ideally I’d like to be able to do inference on a base model plus checkpoint, and then once I find a good result, merge the checkpoint into the base model so I can use it in production and keep VRAM low. (I am assuming inference on base model + checkpoint will use more vram)
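For anyone wanting to try that workflow, here is a rough sketch with PEFT (the model ID and checkpoint paths are placeholders, and it assumes your checkpoints contain the adapter files):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "bigscience/bloom-7b1"  # placeholder base model
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach a training checkpoint as an adapter; the base weights stay untouched,
# so only the small adapter tensors add to VRAM while you compare checkpoints.
model = PeftModel.from_pretrained(base, "outputs/checkpoint-200")

inputs = tokenizer("Tell me a story about a dragon.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

# Once a checkpoint looks good, fold it into the base for production use:
merged = model.merge_and_unload()
merged.save_pretrained("outputs/merged-model")
```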
Videos on this topic are very rare on YouTube.
This is the first of a few on the topic.
Great quick tutorial. This is good for English-only pretraining/fine-tuning, but what about non-English? What steps should we take to (1) extend the vocab, (2) pretrain (with or without LoRA) on a free-form unstructured text corpus, and (3) fine-tune with LoRA for each task? Would love to have your tutorial down this road; it would be great. Thanks, Steve.
Awesome stuff Sam. I’m in the process of using LangChain to build a vector store and, whilst it’s fine for now, would be really interested in understanding the best way to then take this and use it to generate a LoRA. Feels like the logical next step.
Great lectures.
This is a great way to understand how we can fine-tune an LLM for a text classification task. I want to know if there is a method through which we can make the LLM learn from data in JSON format, where there are multiple labels, for information retrieval or conversational recommendation tasks.
I fine-tuned BART, but the model output was exactly the same as the input IDs. What's possibly wrong?
Did you merge weights?
This is gold! Thank you!
I'm looking to train the Wizard-Vicuna models but run into `ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids']`
This could be because they have already folded a LoRA in there or the base model setup is different.
I have two sample datasets like below:
1) [{ "en": "Hello, how are you today?", "fr": "Bonjour, comment ça va aujourd'hui ?" },...]
2) [ { "text": "Ravi is a young man from India who loves panipuri." },... ]
So how can I fine-tune on the above datasets using a Falcon LLM model? Please help me.
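In case it helps: nothing here is Falcon-specific, but one common approach (a sketch, with an assumed prompt template) is to flatten both shapes into plain text strings for causal-LM fine-tuning:

```python
from datasets import Dataset

translation_rows = [{"en": "Hello, how are you today?",
                     "fr": "Bonjour, comment ça va aujourd'hui ?"}]
text_rows = [{"text": "Ravi is a young man from India who loves panipuri."}]

def to_text(row):
    # Turn each translation pair into an instruction-style training string.
    return {"text": f"Translate English to French:\n{row['en']}\n### Response:\n{row['fr']}"}

ds1 = Dataset.from_list(translation_rows).map(to_text)
ds2 = Dataset.from_list(text_rows)  # already plain text

# Both datasets now have a "text" column you can tokenize and train on as usual.
```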
Hey, sorry I'm late to the party. I tried to load my LoRA model, but when I checked the weights, they are the same as the original model's. Is it supposed to do that? I already checked my post-training model, and yes, its weights are different.
@samwitteveenai I encounter RuntimeError: expected scalar type Half but found Float while running the training script in the Colab notebook. Can you please help me with pointers to solve the error? I am running in Colab (GPU 0: Tesla V100-SXM2-16GB).
OK, V100s have had some problems with the 8-bit part in the past, so it could be that.
@@samwitteveenai Thanks for the acknowledgement
@@nayakdonkey Were you able to solve the above RuntimeError? I am facing the same with V100 machine
How does LoRA fine-tuning track changes by creating the two decomposition matrices? How is the ΔW determined?
This is really useful, thank you!
I have 3 PCs: one with a 4080 and two with 3080s. Do you have a recommendation for multi-node PEFT/LoRA training?
model.to(device) is never called; how do we make sure this task is running on the GPU?
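For anyone else wondering: if the notebook loads the model with device_map="auto" (my assumption), placement is handled at load time, which is why model.to(device) never appears. A quick sanity check:

```python
import torch

# Verify where the weights actually landed and how much memory they use.
print(next(model.parameters()).device)            # e.g. cuda:0
print(torch.cuda.memory_allocated() / 1e9, "GB")  # allocated GPU memory
```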
So badass. Thanks!
Hey Sam, thanks for the great informative video as always!
Do you know of a way to see which neurons get activated during training? I ask because I was thinking of ways to shrink the big models, and the most obvious approach I could think of would be to view which neurons get activated during training. Especially with Falcon 170B; even 32B is too big for me, and considering I don't need multiple languages, I was hoping this would be a good approach to reducing model size.
It would be cool to see a brain-surgeon-type debugger for LLMs. It would be good to run different training datasets through different LLMs to see which neurons get activated and which do not, and ideally have a way to disable them during inference to test and measure the differences in the output.
Why freeze the model weights? I guess when you use a LoraConfig and get_peft_model, it automatically freezes the model weights and keeps the LoRA weights trainable.
Yes, you freeze the weights to make the training easier and quicker. If you weren't freezing the model's weights there would be no need for LoRA; you would just do a full fine-tune, which also takes a lot more compute etc.
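As a quick illustration (config values are examples, not necessarily the video's exact settings), get_peft_model does indeed freeze the base and leave only the adapters trainable:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["query_key_value"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)  # `model` is the loaded base model

# Prints a summary showing the tiny fraction of weights left trainable,
# typically well under 1% of the total parameters.
model.print_trainable_parameters()
```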
If the only difference is in these added-on weights, is it possible to run multiple distinct finetuned models at the same time without duplicating the shared base pretrained model in memory?
Yes, this is a trick we are working on for production: you have multiple LoRA weights for different tasks etc. Very much beyond the scope of here though.
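A rough sketch of that trick with PEFT's named-adapter API (adapter paths are placeholders, and `base_model` is assumed already loaded):

```python
from peft import PeftModel

# Load two task-specific LoRA adapters on top of one shared base model.
model = PeftModel.from_pretrained(base_model, "adapters/summarize",
                                  adapter_name="summarize")
model.load_adapter("adapters/classify", adapter_name="classify")

model.set_adapter("summarize")  # route requests through one set of LoRA weights
# ... generate ...
model.set_adapter("classify")   # switch tasks without reloading the base model
```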
How does LoRA differ from transfer learning? If I understand correctly, TL means adding additional layers onto a frozen pre-trained network and training them on a new dataset, right?
Wow, thank you for your work.
In the past, I tried fine-tuning some GPT models, but the results weren't good. Maybe this new technique will give me a better outcome
Fine-tuning comes down a lot to what you are tuning on, how much, etc. LoRA has a lot of advantages and is certainly worth a try.
LoRA is not adding additional weights. Although it might seem so while training, at inference there are no additional parameters. It acts more like diff and patch (though in vector space).
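As a sketch of that diff-and-patch view (shapes and values illustrative): the learned low-rank factors are folded back into the original matrix at inference, so the merged weight has exactly the original shape:

```python
import torch

d, r, alpha = 1024, 16, 32
W = torch.randn(d, d)         # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # low-rank factors learned during training
B = torch.zeros(d, r)         # LoRA initialises B to zero, so the diff starts at 0

delta_W = (alpha / r) * (B @ A)  # the "diff": full-size but rank <= r
W_merged = W + delta_W           # the "patch": same shape as W, zero extra params
```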
Hi Sam, thanks for the great video. I've got a general question you might know the answer to. If I freeze pre-trained model weights (for example, BERT) and then train a classifier on top of its embeddings, is that called fine-tuning? If the weights are unfrozen, I know it can be called fine-tuning.
You can freeze some of the weights and tune the top layer etc., and yes, that is still fine-tuning.
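A minimal sketch of that setup (model name illustrative):

```python
from transformers import AutoModelForSequenceClassification

# Freeze BERT's encoder and train only the classification head on top --
# still commonly called fine-tuning (or, more narrowly, linear probing).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
for param in model.bert.parameters():
    param.requires_grad = False  # freeze the pretrained encoder

# model.classifier stays trainable; train with your usual loop or Trainer.
```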
very useful!! thanks a ton
Great tutorial! With the saved pretrained model, how do we make predictions for classification problems?
You can do that with a much simpler model like BERT or a T5, or structure the data to do it with the causal LM.
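A rough sketch of the "structure the data" route (the prompt format is an assumption, and `model`/`tokenizer` come from the notebook):

```python
# Train the causal LM on prompt+label strings like this:
def format_example(review, label):
    return f"Review: {review}\nSentiment: {label}"

# At inference, stop the prompt at "Sentiment:" and let the model complete it:
prompt = "Review: The screen cracked after a week.\nSentiment:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(output[0]))  # should end in the predicted label
```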
Hi, I have a question for you. When will you be uploading the video about seq2seq models? I would like to see that one as well!
Yes I promised this and I will get to it. Will try to do it this week. Please remind me if I don't. Too many new LLMs and cool papers being released :D
@@samwitteveenai Okay, thank you so much.
How do you handle "CUDA out of memory" error in free Colab notebook?
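Not a definitive answer, but the usual levers are a smaller micro-batch with gradient accumulation, fp16, 8-bit loading, and gradient checkpointing; a sketch (values illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,   # smallest possible micro-batch
    gradient_accumulation_steps=16,  # keeps the effective batch size at 16
    fp16=True,                       # half-precision activations and gradients
)
# Combine with load_in_8bit=True when loading the model, and
# model.gradient_checkpointing_enable() to trade compute for memory.
```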
Very informative!!!! Does fine-tuning with QLoRA/LoRA support this kind of dataset? If not, what changes should I make to my output dataset?
Review (col 1):
Nice cell phone, big screen, plenty of storage. Stylus pen works well.
Analysis (col 2):
[{"segment": "Nice cell phone", "Aspect": "Cell phone", "Aspect Category": "Overall satisfaction", "sentiment": "positive"}, {"segment": "big screen", "Aspect": "Screen", "Aspect Category": "Design", "sentiment": "positive"}, {"segment": "plenty of storage", "Aspect": "Storage", "Aspect Category": "Features", "sentiment": "positive"}, {"segment": "Stylus pen works well", "Aspect": "Stylus pen", "Aspect Category": "Features", "sentiment": "positive"}]
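In case it helps: LoRA/QLoRA fine-tuning of a causal LM just consumes text, so one option (a sketch; the template is an assumption) is to serialise each row into a single training string:

```python
import json

review = "Nice cell phone, big screen, plenty of storage. Stylus pen works well."
analysis = [{"segment": "Nice cell phone", "Aspect": "Cell phone",
             "Aspect Category": "Overall satisfaction", "sentiment": "positive"}]

# One training example per row: review in the prompt, JSON as the target.
text = f"### Review:\n{review}\n### Analysis:\n{json.dumps(analysis)}"

# At inference, prompt up to "### Analysis:" and parse the generated
# completion back out with json.loads().
```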
How many examples are necessary in the dataset for it to learn a certain pattern? With OpenAI you are fine with just 200 examples, which I don't think would work here.
Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and texts, for relation extraction or a specific task? Could you do it using an open-source multimodal LLM and multimodal datasets, like Video-LLaMA, so anyone can further their experiments with the help of your tutorial? Can you also talk about how we can boost the performance of the fine-tuned model using prompt tuning in the same video?
Hi Sam,
Can you provide more videos on fine-tuning? Especially with the Mistral-Orca model.
I like your videos very much. Thanks for sharing them.
Yeah, I have been meaning to do this for a while. Next week I will do some new ones.
@@samwitteveenai Thanks so much Sam.
5:23 Is it possible to train this on an Nvidia RTX 4090 FE (24GB VRAM)?
How do we re-train it with additional data? Great video!
Hey man, great video. Had a question: do you think a 500M or 1B model could give good results similar to Alpaca? What would be the smallest size at which a model can follow instructions?
It's a really interesting question and something I am currently doing research on. 500M is probably too small; at 1.5B things get a bit more interesting. The big challenge with smaller models is you can't expect them to know facts correctly, so you want to use them more as retrieval-generation models. They can do language but need to have the facts and context fed in at generation time etc.
The Cerebras-GPT models are really fast in inference compared to GPT-2 and GPT-Neo; a Cerebras 2.7B's inference speed is almost equal to GPT-2 1.5B and GPT-Neo 1.3B.
What is the minimum GPU memory needed to run this code?
I think I need a new GPU to run this on my local machine
Does anyone know the proper settings for generation with the story model? Mine tends to start OK and then becomes word spew halfway through.
bitsandbytes seems to have lots of compatibility issues with various CUDA versions and outright doesn't support Windows directly.
Yes they don't support the older GPUs that well either
In the LoraConfig, r is not the number of attention heads; it is the rank of the matrices you are decomposing into, going from high rank to low rank. Here the rank is 16.
11:43 max_steps here is 200; how high do we usually set max_steps for proper fine-tuning? Please, someone help me.
Set it to full epochs rather than steps, or calculate the number of steps in an epoch etc.
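For example (a sketch; values illustrative, `train_dataset` assumed):

```python
from transformers import TrainingArguments

# Train by epochs instead of guessing max_steps:
args = TrainingArguments(output_dir="outputs", num_train_epochs=3,
                         per_device_train_batch_size=4)

# Or work out how many steps one epoch actually is for your dataset:
steps_per_epoch = len(train_dataset) // (4 * args.gradient_accumulation_steps)
```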
Awesome video!
12:10 The loss was not going down though, brother... Try to update the video with the model training converging; this one clearly did not.
Excellent!
Maybe a tutorial on integrating LangChain with Flan, but accessing a REST API to query data.
Can you edit LoRa to LoRA in the title? I was really confused for a second, asking myself what long-range radio has to do with LLMs.
lol done, thanks for pointing it out.
What is the difference between LoRA and embeddings?
Would it be practical to train a small model on a 1660 Super 6GB? I just want to add a personality for a home voice assistant.
Probably not to train it; you might be able to do some inference with that, but train it on something with more VRAM etc.
Great video! I actually was hitting an error while trying to fine-tune the Dolly 2.0 model:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
This was fixed by commenting out: model.gradient_checkpointing_enable()
Do you know why that might be the issue?
That video is quite old now; I think they have updated the library. I will try to take a look at it at some point. I am currently making some new fine-tuning vids, so they should be out within a week.
Can I train any LLM from Hugging Face, like a LLaMA model?
Yes, with most of them.
Hello! Can you fine-tune T5?
Love It.
This should be a seq2seq model, because you are tagging (classifying) text. Actually sequence-to-tag (sequence classification).
Hey Sam, is there a chance I can reach out to you personally?
Just reaching out on LinkedIn is easiest.
How do you customize your dataset?
I am planning to make some vids on fine-tuning LLaMA 2, so I will go more into that there. Basically you just want to feed it strings.
Thanks a lot~
Great! Can we merge the PEFT weights with the actual weights and use them for inference? Any downside to it other than the size? Also, wouldn't the weights get tampered with if we save them locally instead and use them for inference?
Yes, you can do that. I might show that in a future video; there's no big downside for most use cases. As for saving the LoRA weights locally: when you load them, they will load the original weights as well. Not sure what you mean by tampered.
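A sketch of both paths (paths and model ID are placeholders):

```python
# Save just the small adapter files -- typically a few MB:
model.save_pretrained("my-lora-adapter")

# Later: reloading the adapter pulls the original base weights back in.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")
reloaded = PeftModel.from_pretrained(base, "my-lora-adapter")

# Or bake the adapter in permanently (full-size checkpoint, no PEFT at inference):
merged = reloaded.merge_and_unload()
merged.save_pretrained("my-merged-model")
```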
Why is it called causal?
👍
Finally something real...
1:02 LoRa which stands for Low Rank ADAPTION xddddd
LoRa is a low-power wireless data transmission...