Awesomely close game! So much fun to watch such adrenaline-filled drifts!
Models, datasets, etc: huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618
Hey Nathan, your research seems to defend PPO over DPO, but the most recent large models, Llama 3.1 and Nemotron-4, don't make use of PPO. They just use DPO with rejection sampling. In fact, the Llama 3.1 paper chooses DPO only because of ease of compute.
What are your thoughts on this?
Is PPO more relevant for small- to medium-sized LLMs?
Can the scale of large LLMs with DPO (and clever rejection sampling) be enough?
@sumanthbalaji1768 I'll write an update on this soon on www.interconnects.ai/ :)
@natolambert lovely, thanks
THX! :D
"White Rice Research" 🍚🔍👁