Direct Preference Optimization (DPO)

  • Published on 24 Jul 2024
  • Get the Dataset: huggingface.co/datasets/Treli...
    Get the DPO Script + Dataset: buy.stripe.com/cN2cNyg8t0zp2g...
    Get the full Advanced Fine Tuning Repo: trelis.com/advanced-fine-tuni...
    Resources:
    - Google Slides Presentation: tinyurl.com/mtd2ehnp
    - Anthropic Helpful and Harmless Dataset: huggingface.co/datasets/Anthr...
    - Ultrachat dataset: huggingface.co/datasets/Huggi...
    - DPO Trainer: huggingface.co/docs/trl/dpo_t...
    - Runpod Affiliate link (helps support the channel): runpod.io?ref=jmfkcdio
    Chapters:
    0:00 Direct Preference Optimisation
    0:37 Video Overview
    1:37 How does “normal” fine-tuning work?
    3:41 How does DPO work?
    8:31 DPO Datasets: UltraChat
    10:59 DPO Datasets: Helpful and Harmless
    14:00 DPO vs RLHF
    15:25 Required datasets and SFT models
    18:26 DPO Notebook Run through
    28:22 DPO Evaluation Results
    31:15 Weights and Biases Results Interpretation
    35:16 Runpod Setup for 1 epoch Training Run
    41:58 Resources
  • Science & Technology

Comments • 30

  • @palashjyotiborah9888 · 8 months ago · +2

    Wow. Learning so much from you

  • @brandon1902 · 7 months ago · +1

    I'm a novice LLM user and am confused by how a limited set of specific DPO pairs can align a model to a near infinite number of diverse user prompts.
    I know there's generalization to some degree, because when I use aligned LLMs I keep running into misfiring alignment. For example, I get warned against engaging in celebrity gossip when asking about fictional characters in a TV show, simply because I used actor names to help keep the LLM on track (to avoid hallucinations). The model sees real celebrity names and questions about ex-wives (even though it's about fictional characters in a show) and triggers an alignment response, despite there being no specific example of that in the DPO, RLHF... training data.

    • @TrelisResearch · 7 months ago · +2

      This post might help: ronanmcgovern.com/what-makes-a-great-language-model/
      The short answer is that DPO drags the statistical distribution towards certain parts of the training data.
      DPO is done at the end of training, so these are the last adjustments the weights undergo, which also makes them particularly "fresh" and potent - earlier training adjustments have all been smoothed over by then.
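      For intuition, here's a minimal sketch of the DPO objective from the paper (arxiv.org/abs/2305.18290); the beta value and tensor names below are just illustrative:

      import torch.nn.functional as F

      def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
          # Push the policy's log-prob margin on (chosen - rejected) pairs
          # above the frozen reference model's margin; beta scales how strongly
          # the policy is pulled away from the reference distribution.
          policy_margin = policy_chosen_logps - policy_rejected_logps
          ref_margin = ref_chosen_logps - ref_rejected_logps
          return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

      The chosen/rejected log-probs here are each summed over the response tokens of one preference pair.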

    • @brandon1902 · 7 months ago

      @@TrelisResearch Thanks. That link actually answered a lot of other questions I had, such as why it can be a good thing that perplexity goes up after DPO, assuming you aren't doing it for things like moralizing or censorship, which redirect you away from the truth and towards a wrong answer or no answer at all.

  • @imranullah3097 · 7 months ago · +1

    I ran my own script, and it runs for a few steps and then the session crashes for an unknown reason; I don't know why.
    The second thing: do we need to quantize the LoRA adapter? It also loads the base model. I just merged the adapter with the base model. Is that a good approach?

    • @TrelisResearch · 7 months ago

      Howdy! I can't say much without having code to reproduce the error.
      LoRA adapters are not quantized; they stay in 16-bit regardless of whether the base model is quantized or not.
      Yes, as of just a week or so ago, it's possible to merge the adapter with the base model - whether the base model is quantized or not.
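      For reference, a minimal merge sketch with the peft library - the model ID and adapter path here are placeholders, not the exact ones from the video:

      from transformers import AutoModelForCausalLM, AutoTokenizer
      from peft import PeftModel

      # Load the base model and attach the trained LoRA adapter.
      base = AutoModelForCausalLM.from_pretrained("your-base-model-id", torch_dtype="auto")
      model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

      # Fold the 16-bit adapter weights into the base weights and save a standalone model.
      merged = model.merge_and_unload()
      merged.save_pretrained("merged-model")
      AutoTokenizer.from_pretrained("your-base-model-id").save_pretrained("merged-model")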

  • @stephanembatchou5300 · 7 months ago · +6

    I think DPO is not RL but PPO is RL

    • @TrelisResearch · 7 months ago

      Not quite sure what you mean here. Reinforcement learning is about increasing the log probs of one action versus another. RLHF, PPO and DPO all do that.

    • @vivekpadman5248 · 5 months ago

      @@TrelisResearch Not really - RL is a lot more than log probs of actions; it's also about exploration to collect data and exploiting the right data. Basically, both RLHF and DPO aren't RL, but PPO is an RL architecture.

    • @TrelisResearch · 5 months ago

      @@vivekpadman5248 I think I see what you're saying. Did you read the DPO paper, btw? It's worth a read.

    • @vivekpadman5248 · 5 months ago

      @@TrelisResearch yup just went through it briefly, have to recheck the math 😅

  • @StevenPack-nh9ns · 1 month ago

    Although the DPO algorithm borrows some elements of reinforcement learning, it does not fully conform to the framework of traditional reinforcement learning algorithms, right?

    • @TrelisResearch · 1 month ago

      I suppose it depends on what you mean by "traditional". If you mean having an explicit reward model (rather than just the underlying preference data), then it wouldn't.

  • @tomiwaibrahim6198 · 8 months ago

    Hey! Did you make a video/have a video link about what the results of TinyLlama mean? I read the README and understood nothing. Thank you!

    • @TrelisResearch · 8 months ago · +1

      Howdy! You mean you read the TinyLlama README?
      In principle, it's a great idea: train a 1B model on 3T tokens. So far, the quality of responses is poor, and I'm not quite sure why. There was an issue with the training, but I believe that was fixed; still, I haven't seen good performance.
      The best small model I've used is DeepSeek 1.3B - there's a vid for that. But it's for coding and has a narrow knowledge base.

    • @tomiwaibrahim6198 · 8 months ago · +1

      @@TrelisResearch Thank you for your reply! I couldn't wrap my head around the information.
      Good luck!

  • @firsfnamelastname8490 · 2 months ago

    Why are you doing SFT first? Can't we apply the DPOTrainer directly to Llama 2?

    • @TrelisResearch · 1 month ago

      Yes, you can apply the DPO trainer to an instruction fine-tuned model.
      In this video, I'm doing DPO on TinyLlama - and there wasn't an SFT version available at the time, so I made an SFT version to do DPO on.
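      Roughly, applying the TRL DPO trainer to an instruct model looks like the sketch below - model and dataset IDs are placeholders, and argument names shift a bit between trl versions, so check the DPO trainer docs linked in the description:

      from datasets import load_dataset
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from trl import DPOConfig, DPOTrainer

      model_id = "your-instruct-or-sft-model"   # already instruction fine-tuned
      model = AutoModelForCausalLM.from_pretrained(model_id)
      tokenizer = AutoTokenizer.from_pretrained(model_id)

      # Preference data with "prompt", "chosen" and "rejected" columns.
      dataset = load_dataset("your-preference-dataset", split="train")

      args = DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=2)
      trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
                           tokenizer=tokenizer)  # newer trl versions call this processing_class
      trainer.train()
      # If no ref_model is passed, DPOTrainer keeps a frozen copy of the model as the reference.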

  • @timetravellingtoad · 3 months ago

    So if the quality of your fine-tuning data set is already very high, do you even need to do these types of reinforcement learning?

    • @TrelisResearch · 3 months ago · +1

      Preference fine-tuning does bring you to higher performance than SFT alone.
      So, maybe not essential, but it does boost model performance.
      DPO is not very reliable at improving quality over SFT, but ORPO combines SFT and preference fine-tuning and seems to reliably outperform it, so you may enjoy the ORPO video I made.

    • @timetravellingtoad · 3 months ago

      @@TrelisResearch Thanks!

  • @user-bw5np7zz5m · 5 months ago

    I'm guessing here. Perhaps the DPO experiment (for TinyLlama) didn't produce the final results you wanted? Would you consider another DPO tutorial where you get good results that are worth the effort (and longer compute time) of using DPO? Thanks.

    • @TrelisResearch · 5 months ago

      It's a good idea. I'll see if I can find the time. I see it as a higher priority to do an updated SFT video for better memorization of custom data. DPO is still a cherry on top - mostly SFT gets the big performance gain.

  • @user-bw5np7zz5m · 5 months ago

    Consider putting a high pass filter on your audio :) There are some low frequency computer noises you can easily filter out.

  • @imranullah3097 · 7 months ago

    Kindly also explain the maths behind every topic in future; it will be more helpful... 💜 Do you have a book?

    • @TrelisResearch · 7 months ago

      Howdy! It's a balance, if I do all the maths, it makes what is already a long video even longer. I don't have a book, but here is the paper: arxiv.org/abs/2305.18290

  • @terryzhenningtan9590 · 5 months ago

    Purchase???!!!!

  • @vivekpadman5248 · 5 months ago · +1

    Loved the video, thanks for giving out such gr8 content for free 🫶