I just binged this playlist at 1 am. Absolutely worth it. You deserve more views.
agreed
PLEASE COME BACK!! You are an amazing teacher!
Please come back, your videos are great!
All of your videos are amazing, please upload more
Welcome back!
Hope to see more of these videos.
Joel, excellent explanation and talk! Thank you!
Amazing content! Please keep them coming!
Helped me a lot, can't wait to see more
Super helpful - thank you for this series!
🎯 Key Takeaways for quick navigation:
00:00 🤖 Reinforcement learning improves large language models like ChatGPT.
00:25 🃏 Large language models face issues like bias, errors, and quality.
01:11 📊 Training data quality impacts results; removing bad jokes might help.
01:55 🧩 Training on both good and bad jokes improves language models.
02:38 🔄 Language models can be treated as policies, so reinforcement learning can refine them with policy gradients (see the policy-gradient sketch below).
03:08 🎯 Acquiring training data is a key challenge for Reinforcement Learning from Human Feedback (RLHF).
03:35 🤔 RLHF intuition: the language model may already know the boundary between good and bad jokes.
04:18 🏆 A reward network is trained to predict human ratings of the model's output.
04:47 🔄 The reward network is a modified language model that predicts ratings.
05:14 📝 Approach: Humans write text, train reward network, refine model with RL.
05:57 ⚖️ Systems convert pairwise comparisons into ratings for reward-network training (see the reward-model sketch below).
06:11 😄 RLHF successfully improves language models, including humor.
Made with HARPA AI
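Not from the video, just a minimal PyTorch sketch of the comparisons-to-training-signal step (04:18 / 05:57): pairwise human preferences train a scalar reward head with a Bradley-Terry-style loss. Everything here (RewardHead, hidden_dim, the random tensors standing in for language-model hidden states) is an illustrative assumption, not the presenter's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scalar scoring head stacked on a language model's pooled hidden state."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) -> (batch,) scalar rewards
        return self.score(hidden).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # "A was rated better than B" becomes: maximize sigmoid(r_A - r_B)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random features stand in for the hidden states of two candidate responses
head = RewardHead(hidden_dim=16)
h_chosen, h_rejected = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(head(h_chosen), head(h_rejected))
loss.backward()  # gradients push the head toward scoring chosen > rejected
```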
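And a second hedged sketch for the "language model as a policy" idea (02:38): one REINFORCE-style policy-gradient update that raises the log-probability of a sampled output in proportion to its reward. The tiny linear "policy" and the hard-coded reward are stand-ins I chose for the language model and the reward network's score, not the video's implementation.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 100, 16
policy = nn.Linear(hidden_dim, vocab_size)        # stand-in for an LM's output head
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, hidden_dim)                # stand-in for an encoded prompt
dist = torch.distributions.Categorical(logits=policy(state))
token = dist.sample()                             # the "action": a sampled token

reward = 1.0                                      # e.g. the reward network's score for the output
loss = -(dist.log_prob(token) * reward).mean()    # REINFORCE: raise log-prob of rewarded outputs
optimizer.zero_grad()
loss.backward()
optimizer.step()
```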
ok everything makes sense now, thx
Good teaching.
You are the Best
How long does it take to train a reward network? And how reliable would it be?
Great content!!
Who is this guy? He made all the complexity so simple with his words. Anyone know this gentleman's name?
come back :(