the amount of knowledge in this video is just mind blowing! thanks for making this available 🙏
cheers!
Have been looking for a tutorial for this for a while. Thanks for doing this
Fascinating video, I have been searching for something similar for a while. I am working on fine-tuning open-source LLMs for Arabic. While Arabic isn't exactly a low-resource language due to its widespread use globally, there is a scarcity of labeled data. I will refer to this video during the fine-tuning process. Thank you very much for this valuable content.
you're welcome
Excellent video - always love the in-depth discussion.
appreciate it, thanks
Is it true that Llama 3 8B degrades a lot at 4-bit quantization? Makes me wonder if I should fine-tune some other model.
It does seem to degrade quite a bit in GGUF or AWQ quantization. QLoRA uses nf4-type quantization, which is not so bad, so I wouldn't be too concerned.
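For reference, here is a minimal sketch of loading a model with the nf4 quantization that QLoRA uses, via bitsandbytes through transformers. The model id and compute dtype are illustrative, not specific recommendations:

```python
# Sketch: QLoRA-style nf4 quantization at load time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit, as used in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 despite 4-bit weights
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative model id
    quantization_config=bnb_config,
)
```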
if I might suggest a topic for a video, I think it would be very interesting to understand how benchmarks work under the hood!
For example, I have now embarked on the journey to make the best-performing open LLM for Italian (the lack of competition actually makes this possible), but I don't understand how benchmarks work... Like, how does the benchmark know if the model's answer is similar to the right answer in the benchmarking dataset, and how are instruct vs foundation models treated differently? If I have an instruct model, the instructions need different formatting (e.g. [INST]), so how does the benchmark know that?
In many cases, when LLMs are benchmarked, they are benchmarked using the same prompt format, but with few-shot examples. Have a look at the example in Appendix A of the latest Phi-3 paper: arxiv.org/abs/2404.14219.
But yes, you're right, prompt format DOES affect performance a lot, so one might argue benchmarking should be done using the native chat format that was used for instruction fine-tuning. I believe this is the case for the Arena type approaches where models are compared based on responses.
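To make the few-shot idea concrete, here is a rough sketch of what a benchmark harness does: render the question with few-shot examples in one fixed plain-text format (no chat template), then compare the model's answer to the reference. The function names and scoring rule are illustrative, not from any specific harness:

```python
# Sketch of few-shot benchmark prompting and simple exact-match scoring.

def build_few_shot_prompt(examples, question):
    """Render few-shot (question, answer) pairs plus the target question in one fixed format."""
    parts = []
    for q, a in examples:
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")  # model completes after "Answer:"
    return "\n\n".join(parts)

def exact_match(prediction, reference):
    """Simplest scoring rule: normalized exact match on the answer string."""
    return prediction.strip().lower() == reference.strip().lower()

shots = [("2+2?", "4"), ("Capital of Italy?", "Rome")]
prompt = build_few_shot_prompt(shots, "Capital of France?")
```

Real harnesses often use log-likelihood over answer choices rather than string matching, but the fixed-format prompting is the same idea.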
Great video! On the annealing, you did it as a separate run. Is it possible to integrate it into the first fine-tuning run, much like how the warmup is part of the fine-tune training process in that run? That way it can all be done in one script and run of the training.
Yes! You can do that by passing in a custom learning rate scheduler. You have to be a little careful so as not to mess up the learning rates for the LoRA adapters. In my video on LoRA+ I go through a related example.
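As a rough sketch of folding the annealing phase into a single run: one LambdaLR-style multiplier that does linear warmup, a constant plateau, then a linear anneal to zero. The phase lengths below are illustrative; if you use LoRA+-style per-group learning rates, check that the multiplier is applied per parameter group:

```python
# Sketch: warmup -> constant -> anneal in one schedule, expressed as an
# LR multiplier suitable for torch.optim.lr_scheduler.LambdaLR.

def lr_multiplier(step, warmup_steps=100, constant_steps=800, anneal_steps=100):
    total = warmup_steps + constant_steps + anneal_steps
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup 0 -> 1
    if step < warmup_steps + constant_steps:
        return 1.0                                    # hold at peak LR
    remaining = total - step
    return max(0.0, remaining / max(1, anneal_steps))  # linear anneal 1 -> 0
```

With a Hugging Face Trainer you could then pass `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)` in via the `optimizers` argument so the whole schedule runs in one script.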
Thanks for the great video as usual. I have to admit I get a bit confused when talking about unsupervised training in this context (through no fault of your explanation, I've just always struggled with this).
My initial intuition was that unsupervised would be when you train on "unstructured" text, where you simply mask the next tokens and predict / evaluate loss one by one (as you normally do in pretraining). In this scenario, you won't need your dataset to be in QA/Instruct/Chat format, and you can (typically) procure huge amounts of data more easily.
In this video, you present unsupervised fine-tuning as "fine-tuning on all of the data" (29:01), which I understand, since you mix formatted and unformatted text data. My question is: would you still consider it unsupervised fine-tuning if you trained on an instruct dataset but set train-on-completions-only to false? And in what circumstances would you decide to fine-tune on an instruct dataset with train-on-completions-only set to false? Thanks!
Yeah, it's murky!
Strictly speaking, unsupervised means that the data is not parsed into instruction or chat format.
In practice though, the more organised your data, the better the results, so it's beneficial to have your data be as clean as possible.
Training on completions only is specifically useful if you want to train on a short dataset, because it focuses the model on the answers only - thus removing any noise from the content.
In longer pre-trainings and unsupervised trainings I think you would train on all of the text. Actually, it's kind of inefficient to train on completions only because you are still forward passing through the full sequence. But doing completions only is less noisy, so it can achieve a specific effect with a smaller amount of data.
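To illustrate what "completions only" means mechanically: the loss labels for the prompt tokens are set to -100 (the ignore index used by PyTorch's cross-entropy), so only the answer tokens contribute to the loss. The token ids below are made up; note the full sequence is still forward-passed, which is the inefficiency mentioned above:

```python
# Sketch: mask prompt tokens out of the loss for completions-only training.
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, ignoring the first prompt_len positions."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

labels = mask_prompt_labels([101, 42, 7, 9, 12], prompt_len=3)
# labels -> [-100, -100, -100, 9, 12]: only the last two tokens are trained on
```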
@@TrelisResearch Thanks for the explanation!
I made the same face as in the thumbnail once I learned it'd be fine-tuning for another language based on the Wikipedia data. Go raibh maith agat!
I realize that I don't know Irish. Excellent presentation. I'm toying with training Llama3 myself using unsloth.
Incredible, I had no idea that Irish was a language in its own right, and so different from English too!
You'll need to visit Ireland so, but you may not understand the local English either ;)
Hello sir, I'd like to know your opinion on new LLM architectures like omni. I saw some papers covered in a video on the channel hu-po about LLM multimodality, where text, audio and images are all converted into tokens directly in one multimodal LLM. How would that affect fine-tuning? Will it be totally different? I'd really like to know your thoughts on it.
I like the idea of using the same model for multiple input forms. And yes, you can then fine-tune on multiple input form data.
I've made videos about combined text + image and hope to do one taking in sound soon.
@@TrelisResearch awesome!
Hello, it has been clearly established that, to retain the base capabilities of a model, LoRA is much better than full fine-tuning. I would like to know which you personally prefer; function calling is something I would like my fine-tuned models to be able to do.
Yes, LoRA can yield better results when the training data is much smaller than the pre-training data. So yes, I often use LoRA, including for function-calling fine-tuning.
Thanks for another great video!
If we use the ORPO method to fine-tune a model on a chat-template dataset, do we have to set completions-only to true?
Nope, won’t have an effect I believe
@18:34, can anyone direct me to resources regarding this topic (preferably on YouTube)? I am new to LLMs; I've seen this in an Unsloth Colab notebook, but I'm unsure of what purpose it serves.
If your question is about LoRA - see this vid: th-cam.com/video/SL2nZpv7dtY/w-d-xo.html
Thank you for your excellent video! Would you give me some rationale for setting an embedding layer (and lm_head) to be trainable? I would also appreciate any references on this topic. Thank you in advance.
Typically it's needed if:
a) you are adding new tokens or new uses of the same tokens (e.g. a new chat template), or
b) if you are using a pretty different distribution of tokens than the base model (which could be the case for a new language, as the token usage will be different).
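In peft, making those layers fully trainable alongside LoRA adapters is typically done with `modules_to_save` (those modules are trained and saved full-rank, not low-rank adapted). The module names below assume a Llama-style model; check your model's actual names with `model.named_modules()`:

```python
# Sketch: LoRA config that also fully trains the embeddings and lm_head.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LoRA-adapted
    modules_to_save=["embed_tokens", "lm_head"],  # full-rank, fully trainable
    task_type="CAUSAL_LM",
)
```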
Thank you, Ronan! I really enjoy your video. Keep it up!
Yo, Mistral introduced an official fine-tuning guide. Can we get a video on this? This is what I have been waiting for; maybe we'll get an official guide from Meta too.
cool, thanks for the tip, will dig into that
Not sure your graph of the learning rate for cosine is right...
Yeah we say cosine but it’s really a (1-cos) learning rate decay. Does that answer your point?
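Concretely, the multiplier follows 0.5 * (1 + cos(pi * t / T)), decaying smoothly from 1 at step 0 to 0 at step T, which is why it's really a (1 - cos)-shaped decay when you look at how much has been subtracted from the peak:

```python
# Sketch: the "cosine" learning rate decay as a multiplier on the peak LR.
import math

def cosine_decay(step, total_steps):
    """Multiplier going from 1.0 (step 0) to 0.0 (step total_steps)."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```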
Both ChatGPT and Gemini fail to capture the logical nuances (the logical meaning of the target language) words in English, and even more in other languages. Discussing in details the meaning with ChatGPT usually fails with ChatGPT reasoning that it can only capture the usage patterns, not their logical meaning.
yeah, I agree that having a chat with chatGPT about the reasoning is hard. At some level, it's possible to fake this by having the model be trained on some discussions about reasoning. The most advanced models now include that kind of data (I assume, based on my interactions with the models).
@@TrelisResearch I'm not sure you are following the meaning. My point is that there is a hidden logic behind the words and their meaning, something like a logical definition for each word. ChatGPT can't capture that even for most English words. At the same time (because I tried to figure out why), ChatGPT has powerful self-explanation abilities, which you can use to find the faults in its logic.
Wouldn't "Learning a new language with Wikipedia data" be a better title? As you stated, the model did not learn the underlying information.
Yeah I was considering something like that. Always a balance between being too broad and too specific
Fine-tuning on Wikipedia is one huge mistake.
Can you expand on that?
@@TrelisResearch Because IMO it's laced with political opinions, and I reckon we already have enough of that in the native pre-trained models.
Imagine my system prompt is something like this (because it knows epistemology inside out but doesn't apply it): "Only abide by Karl Popper's epistemological rules: consensus, opinions and/or testimonial statements, even when authoritative, are not scientific evidence; only empirical evidence is scientific evidence. Lack of consensus or opinions does not disprove a theory, nor negate empirical evidence. Statistical power only gives a marginal certainty and has a lower status than empirical evidence; it can prove correlation but not causation."
@@BoominGame yeah, take your point that there can be biases alright.
I'll be making a continuation pulling in some common crawl data too.
@@TrelisResearch Hey I mean for the sake of demonstration it's fair enough, don't think I was criticising you in any shape or form, I am big fan of what you do. Just my 2 cents.
@@TrelisResearch I think that if the cement of the AI is semantics and you force illogical alignments, it's bound to deteriorate the quality of some answers.
Imagine real science says things that are contradicted, big time, by "BBC Science"; where do you consolidate material that contradicts itself?
Also, with everybody using the same datasets, the models all end up looking and talking similar.