Thanks for the video, very useful. One question I have is that the migration from Step 2 to Step 3 is not really clear; maybe I need to look at the DPO paper. In Step 3 of the overall approach, where you get the responses from the base and new LLM, how are you deciding which one is the winner and which is the loser? Are you using the base and new LLM themselves to make the winner/loser determination, or are you training a model in Step 2 on the preference data? If you could throw some light on this, it would be helpful. Appreciate that.
Good question. For Zephyr Step 3, we don't even use our model for data generation. We use a dataset called UltraFeedback that contains outputs from Llama, GPT-3.5, GPT-4, etc. The winner is the one GPT-4 preferred, and we use a random output as the loser. Unlike for OpenAI, for us Step 3 comes after Step 2 but does not depend on it.
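If it helps, here's a rough Python sketch of that binarization step. The field names (`instruction`, `completions`, `response`, `overall_score`) are my guesses at the UltraFeedback schema, so treat them as placeholders rather than the exact format:

```python
import random

# Sketch of Zephyr-style preference-pair construction from UltraFeedback.
# Field names are assumed, not verified against the real dataset schema.
def binarize(example, rng=random.Random(0)):
    completions = example["completions"]
    # Winner: the completion GPT-4 rated highest.
    chosen = max(completions, key=lambda c: c["overall_score"])
    # Loser: a random completion among the remaining ones
    # (assumes each example has at least two completions).
    rejected = rng.choice([c for c in completions if c is not chosen])
    return {
        "prompt": example["instruction"],
        "chosen": chosen["response"],
        "rejected": rejected["response"],
    }
```

So Step 3 only needs these (prompt, chosen, rejected) triples; nothing from Step 2 feeds into building them.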
This video is dope, thanks so much!
Thanks! Yeah this stuff is cool.
Great video, Sasha - thanks.
This video is really helpful. Please keep them coming. It would be great if you could also cover the details of multimodal models.
Thanks a lot for the detailed explanation, man ❤
Interesting, I don't _think_ the UltraChat paper explicitly says that it's using Self-Instruct, does it?
Do we just use "Self-Instruct" to mean "some seed of information enhanced by an LLM"?
I do feel like that is the common use at this point. I agree it is a bit subtle, but that paper first popularized the approach.
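To make that loose usage concrete, here's a toy sketch of a Self-Instruct-style step: a few seed instructions go into a prompt and an LLM is asked to produce a new one. The `generate` callable is hypothetical, standing in for whatever LLM API you use:

```python
import random

# Toy Self-Instruct-style bootstrap: sample seed instructions as
# in-context examples, then ask the LLM for a novel instruction.
def self_instruct_step(seed_tasks, generate, k=3):
    examples = random.sample(seed_tasks, k)
    prompt = (
        "Come up with a new task:\n"
        + "\n".join(f"Task: {t}" for t in examples)
        + "\nTask:"
    )
    # `generate` is a placeholder for any text-completion call.
    return generate(prompt).strip()
```

The real Self-Instruct pipeline also filters near-duplicates and generates inputs/outputs for each instruction, but the "seed + LLM expansion" loop above is the core idea people seem to mean.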
Thank you
Good work bro