the amount of knowledge in this video is just mind blowing! thanks for making this available 🙏
cheers!
Have been looking for a tutorial for this for a while. Thanks for doing this
Fascinating video, I have been searching for something similar for a while. I am working on fine-tuning open-source LLMs for Arabic. While Arabic isn't exactly a low-resource language due to its widespread use globally, there is a scarcity of labeled data. I will refer to this video during the fine-tuning process. Thank you very much for this valuable content.
you're welcome
Excellent video - always love the in-depth discussion.
appreciate it, thanks
Is it true that Llama 3 8B degrades a lot at 4-bit quantization? Makes me wonder if I should fine-tune some other model.
It does seem to degrade quite a bit in GGUF or AWQ quantization. QLoRA uses nf4-type quantization, which is not so bad, so I wouldn't be too concerned.
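For reference, here is a minimal sketch of loading a model with the nf4 quantization that QLoRA uses, via bitsandbytes through transformers. The model id and compute dtype are illustrative, not specific recommendations:

```python
# Sketch: QLoRA-style nf4 quantization at load time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit, as used in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 despite 4-bit weights
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative model id
    quantization_config=bnb_config,
)
```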
if I might suggest a topic for a video, I think it would be very interesting to understand how benchmarks work under the hood!
For example, I have now embarked on the journey to make the best-performing open LLM for Italian (the lack of competition actually makes this possible), but I don't understand how benchmarks work... Like, how does the benchmark know if the model's answer is similar to the right answer in the benchmarking dataset, and how are instruct vs foundation models treated differently? If I have an instruct model, the instructions need different formatting (e.g. [INST]), so how does the benchmark know that?
In many cases, when LLMs are benchmarked, they are benchmarked using the same prompt format, but with few-shot examples. Have a look at the example in Appendix A of the latest Phi-3 paper: arxiv.org/abs/2404.14219.
But yes, you're right, prompt format DOES affect performance a lot, so one might argue benchmarking should be done using the native chat format that was used for instruction fine-tuning. I believe this is the case for the Arena type approaches where models are compared based on responses.
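To make the few-shot idea concrete, here is a rough sketch of what a benchmark harness does: render the question with few-shot examples in one fixed plain-text format (no chat template), then compare the model's answer to the reference. The function names and scoring rule are illustrative, not from any specific harness:

```python
# Sketch of few-shot benchmark prompting and simple exact-match scoring.

def build_few_shot_prompt(examples, question):
    """Render few-shot (question, answer) pairs plus the target question in one fixed format."""
    parts = []
    for q, a in examples:
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")  # model completes after "Answer:"
    return "\n\n".join(parts)

def exact_match(prediction, reference):
    """Simplest scoring rule: normalized exact match on the answer string."""
    return prediction.strip().lower() == reference.strip().lower()

shots = [("2+2?", "4"), ("Capital of Italy?", "Rome")]
prompt = build_few_shot_prompt(shots, "Capital of France?")
```

Real harnesses often use log-likelihood over answer choices rather than string matching, but the fixed-format prompting is the same idea.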
Great video! On the annealing, you did it as a separate run. Is it possible to integrate it into the first fine-tuning run, much like how the warmup is part of the fine-tune training process in that run? That way it can all be done in one script and run of the training.
Yes! You can do that by passing in a custom learning rate scheduler. You have to be a little careful so as not to mess up the learning rates for the LoRA adapters. In my video on LoRA+ I go through a related example.
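As a rough sketch of folding the annealing phase into a single run: one LambdaLR-style multiplier that does linear warmup, a constant plateau, then a linear anneal to zero. The phase lengths below are illustrative; if you use LoRA+-style per-group learning rates, check that the multiplier is applied per parameter group:

```python
# Sketch: warmup -> constant -> anneal in one schedule, expressed as an
# LR multiplier suitable for torch.optim.lr_scheduler.LambdaLR.

def lr_multiplier(step, warmup_steps=100, constant_steps=800, anneal_steps=100):
    total = warmup_steps + constant_steps + anneal_steps
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup 0 -> 1
    if step < warmup_steps + constant_steps:
        return 1.0                                    # hold at peak LR
    remaining = total - step
    return max(0.0, remaining / max(1, anneal_steps))  # linear anneal 1 -> 0
```

With a Hugging Face Trainer you could then pass `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)` in via the `optimizers` argument so the whole schedule runs in one script.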
Thanks for the great video as usual. I have to admit I get a bit confused when talking about unsupervised training in this context (through no fault of your explanation, I've just always struggled with this).
My initial intuition was that unsupervised would be when you train on "unstructured" text, where you simply mask the next tokens and predict / evaluate loss one by one (as you normally do in pretraining). In this scenario, you won't need your dataset to be in QA/Instruct/Chat format, and you can (typically) procure huge amounts of data more easily.
In this video, you present unsupervised fine-tuning as "fine-tuning on all of the data" (29:01), which I understand, since you mix formatted and unformatted text data. My question is: would you still consider it unsupervised fine-tuning if you trained on an instruct dataset but set train-on-completions-only to false? And in what circumstances would you decide to fine-tune on an instruct dataset with train-on-completions-only set to false? Thanks!
Yeah, it's murky!
Strictly speaking, unsupervised means that the data is not parsed into instruction or chat format.
In practice though, the more organised your data, the better the results, so it's beneficial to have your data be as clean as possible.
Training on completions only is specifically useful if you want to train on a short dataset, because it focuses the model on the answers only - thus removing any noise from the content.
In longer pre-trainings and unsupervised trainings I think you would train on all of the text. Actually, it's kind of inefficient to train on completions only because you are still forward passing through the full sequence. But doing completions only is less noisy, so it can achieve a specific effect with a smaller amount of data.
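To illustrate what "completions only" means mechanically: the loss labels for the prompt tokens are set to -100 (the ignore index used by PyTorch's cross-entropy), so only the answer tokens contribute to the loss. The token ids below are made up; note the full sequence is still forward-passed, which is the inefficiency mentioned above:

```python
# Sketch: mask prompt tokens out of the loss for completions-only training.
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, ignoring the first prompt_len positions."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

labels = mask_prompt_labels([101, 42, 7, 9, 12], prompt_len=3)
# labels -> [-100, -100, -100, 9, 12]: only the last two tokens are trained on
```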
@@TrelisResearch Thanks for the explanation!
I made the same face as in the thumbnail once I learned it'd be fine-tuning for another language based on the Wikipedia data. Go raibh maith agat!
I realize that I don't know Irish. Excellent presentation. I'm toying with training Llama3 myself using unsloth.
Incredible, I had no idea that Irish was a language in its own right, and so different from English too!
You'll need to visit Ireland so, but you may not understand the local English either ;)
Hello sir, I'd like to know your opinion on new LLM architectures like omni. I saw some papers covered in a video on the channel hu-po about LLM multimodality, where text, audio and images are all converted into tokens directly in one multimodal LLM. How would that affect fine-tuning? Will it be totally different? I'd really like to know your thoughts on it.
I like the idea of using the same model for multiple input forms. And yes, you can then fine-tune on multiple input form data.
I've made videos about combined text + image and hope to do one taking in sound soon.
@@TrelisResearch awesome!
Hello, it has been clearly established that, to retain the base capabilities of a model, LoRA is much better than full fine-tuning. I would like to know which you personally prefer; function calling is something I would like my fine-tuned models to be able to do.
Yes, LoRA can yield better results when the training data is much smaller than the pre-training data. So yes, I often use LoRA, including for function-calling fine-tuning.
Thanks for another great video!
If we use the ORPO method to fine-tune a model on a chat-template dataset, do we have to set completions-only to true?
Nope, won’t have an effect I believe
@18:34, can anyone direct me to resources regarding this topic (preferably on YouTube)? I am new to LLMs; I've seen this in an Unsloth Colab notebook, but I'm unsure of what purpose it serves.
If your question is about LoRA - see this vid: th-cam.com/video/SL2nZpv7dtY/w-d-xo.html
Thank you for your excellent video! Would you give me some rationale for setting an embedding layer (and lm_head) to be trainable? I would also appreciate any references on this topic. Thank you in advance.
Typically it's needed if:
a) you are adding new tokens or new uses of the same tokens (e.g. a new chat template), or
b) if you are using a pretty different distribution of tokens than the base model (which could be the case for a new language, as the token usage will be different).
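In peft, making those layers fully trainable alongside LoRA adapters is typically done with `modules_to_save` (those modules are trained and saved full-rank, not low-rank adapted). The module names below assume a Llama-style model; check your model's actual names with `model.named_modules()`:

```python
# Sketch: LoRA config that also fully trains the embeddings and lm_head.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LoRA-adapted
    modules_to_save=["embed_tokens", "lm_head"],  # full-rank, fully trainable
    task_type="CAUSAL_LM",
)
```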
Thank you, Ronan! I really enjoy your video. Keep it up!
Yo, Mistral introduced an official fine-tuning guide. Can we get a video on this? This is what I have been waiting for; maybe we'll get an official guide from Meta too.
cool, thanks for the tip, will dig into that
Not sure your graph of the learning rate for cosine is right...
Yeah we say cosine but it’s really a (1-cos) learning rate decay. Does that answer your point?
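Concretely, the multiplier follows 0.5 * (1 + cos(pi * t / T)), decaying smoothly from 1 at step 0 to 0 at step T, which is why it's really a (1 - cos)-shaped decay when you look at how much has been subtracted from the peak:

```python
# Sketch: the "cosine" learning rate decay as a multiplier on the peak LR.
import math

def cosine_decay(step, total_steps):
    """Multiplier going from 1.0 (step 0) to 0.0 (step total_steps)."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```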
Both ChatGPT and Gemini fail to capture the logical nuances (the logical meaning of the target language) words in English, and even more in other languages. Discussing in details the meaning with ChatGPT usually fails with ChatGPT reasoning that it can only capture the usage patterns, not their logical meaning.
yeah, I agree that having a chat with chatGPT about the reasoning is hard. At some level, it's possible to fake this by having the model be trained on some discussions about reasoning. The most advanced models now include that kind of data (I assume, based on my interactions with the models).
@@TrelisResearch I'm not sure you are following the meaning. My point is that there is a hidden logic behind the words and their meaning, something like a logical definition for each word. ChatGPT can't capture that even for most English words. At the same time (because I tried to figure out why), ChatGPT has powerful self-explanation abilities, which you can use to find the faults in its logic.
Wouldn't "Learning a new language with Wikipedia data" be a better title? As you stated, the model did not learn the underlying information.
Yeah I was considering something like that. Always a balance between being too broad and too specific
Fine-tuning on Wikipedia is one huge mistake.
Can you expand on that?
@@TrelisResearch Because IMO it's laced with political opinions, and I reckon we already have enough of that in the native pre-trained models.
Imagine my system prompt is something like this (because it knows epistemology inside out but doesn't apply it): "Only abide by Karl Popper's epistemological rules: consensus, opinions and/or testimonial statements, even when authoritative, are not scientific evidence; only empirical evidence is scientific evidence. Lack of consensus or opinions does not disprove a theory, nor negate empirical evidence. Statistical power only gives a marginal certainty and has a lower status than empirical evidence; it can prove correlation but not causation."
@@BoominGame yeah, take your point that there can be biases alright.
I'll be making a continuation pulling in some common crawl data too.
@@TrelisResearch Hey I mean for the sake of demonstration it's fair enough, don't think I was criticising you in any shape or form, I am big fan of what you do. Just my 2 cents.
@@TrelisResearch I think that if the cement of the AI is semantics and you force illogical alignments, it's bound to deteriorate the quality of some answers.
Imagine real science says things that are contradicted, big time, by "BBC Science"; where do you consolidate material that contradicts itself?
Also, with everybody using the same datasets, the models all end up looking and talking similar.