Reinforcement Learning through Human Feedback - EXPLAINED! | RLHF
- Published on 3 Feb 2025
- We talk about Reinforcement Learning through Human Feedback (RLHF), which ChatGPT, among other applications, makes use of.
ABOUT ME
⭕ Subscribe: www.youtube.co...
📚 Medium Blog: / dataemporium
💻 Github: github.com/ajh...
👔 LinkedIn: / ajay-halthor-477974bb
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: • Reinforcement Learning...
⭕ Natural Language Processing: • Natural Language Proce...
⭕ Transformers from Scratch: • Natural Language Proce...
⭕ ChatGPT Playlist: • ChatGPT
⭕ Convolutional Neural Networks: • Convolution Neural Net...
⭕ The Math You Should Know : • The Math You Should Know
⭕ Probability Theory for Machine Learning: • Probability Theory for...
⭕ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: imp.i384100.ne...
📕 Calculus: imp.i384100.ne...
📕 Statistics for Data Science: imp.i384100.ne...
📕 Bayesian Statistics: imp.i384100.ne...
📕 Linear Algebra: imp.i384100.ne...
📕 Probability: imp.i384100.ne...
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: imp.i384100.ne...
📕 Python for Everybody: imp.i384100.ne...
📕 MLOps Course: imp.i384100.ne...
📕 Natural Language Processing (NLP): imp.i384100.ne...
📕 Machine Learning in Production: imp.i384100.ne...
📕 Data Science Specialization: imp.i384100.ne...
📕 Tensorflow: imp.i384100.ne...
Did I get this right? Quiz: 1) D, all of the above; 2) B, accelerates learning; 3) C, assess and score. I'm learning AI model training professionally.
Enjoyed your class Prabhu!
Brilliant, bro 👌. Excellent explanation. I never understood RLHF after reading so many books and notes, but your examples are GREAT and simple to understand 👌
I am new to your channel and subscribed.
Great video! I have a few questions:
1) Why do we need to manually train the reward model with human feedback if the point is to evaluate responses of another pretrained model? Can't we just cut out the reward model altogether, rate the responses directly using human feedback to generate a loss value for each response, then backpropagate on that? Does it require less human input to train the reward model than to train the GPT model directly?
2) When backpropagating the loss, do you need to do recurrent backpropagation for a number of steps that is the same as the length of the token output?
3) Does the loss value apply equally to every token that is output? It seems like this would overly punish some words, e.g. if the question starts with "why", the response is likely to start with "because" regardless of what comes after. Does RLHF only work with sentence embeddings rather than word embeddings?
1) I think the point is to minimize the volume of human feedback, so humans give just enough responses to train a model that handles all future feedback. That way humans don't have to keep giving feedback; instead they lay the basis and probably come back later to re-evaluate what the reward model is doing, so it still acts human-like.
(2) and (3) seem more specific to the architecture of ChatGPT than to PPO or RLHF. I would look into the other GPT-specific videos he made.
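To make the reward-model point in the reply above concrete, here is a minimal, hypothetical PyTorch sketch of training a reward model on human preference pairs with a Bradley-Terry style pairwise loss. The class, function, and tensor names are placeholders I made up for illustration, not anything from the video; a real reward model would score pooled states from the language model rather than random embeddings.

```python
# Hypothetical sketch: reward-model training on human preference pairs.
import torch
import torch.nn as nn

class PreferenceRewardModel(nn.Module):
    """Maps a pooled response embedding to a scalar reward score."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        # response_embedding: (batch, hidden_dim) -> (batch,) scalar scores
        return self.scorer(response_embedding).squeeze(-1)

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style loss: push the score of the human-preferred
    # response above the score of the rejected response.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random embeddings stand in for pooled language-model states.
model = PreferenceRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

chosen = torch.randn(4, 768)    # responses humans preferred
rejected = torch.randn(4, 768)  # responses humans rejected

loss = pairwise_preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

Once trained, this small scorer stands in for the human raters, which is exactly why only a limited amount of human preference data is needed up front.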
At 6:58, you have an error: PPO is not used to build the reward model.
That's correct. The PPO algorithm is used to fine-tune the SFT model against the reward model scores, in order to prevent the model from "cheating" and generating outputs that maximize the reward score but are no longer normal human-like text.
PPO ensures the final RLHF model's outputs remain close to the original SFT model's outputs.
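As a rough illustration of the "stay close to the SFT model" idea in the replies above, here is a short, hypothetical sketch of the KL-penalized reward that is typically fed to PPO during RLHF fine-tuning. The function name, argument names, and the beta value are my own placeholders, not from the video.

```python
# Hypothetical sketch: KL-penalized reward used when PPO fine-tunes the
# SFT policy against the reward model's scores.
import torch

def rlhf_reward(reward_model_score: torch.Tensor,
                logprob_policy: torch.Tensor,
                logprob_sft: torch.Tensor,
                beta: float = 0.02) -> torch.Tensor:
    """Per-sequence reward: r = RM(x, y) - beta * KL(policy || SFT).

    The KL term penalizes the policy for drifting far from the original
    SFT model, which is what discourages 'reward hacking' outputs that
    score well but no longer read like normal human text.
    """
    # Summing token-level log-prob differences over the response gives a
    # simple approximation of the KL divergence between policy and SFT model.
    approx_kl = (logprob_policy - logprob_sft).sum(dim=-1)
    return reward_model_score - beta * approx_kl
```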
What about the generation of rewards? Will there be another model to check the relevance and precision of the answers, since we have a lot of data?
Sir, please make a video on function approximation in RL.
you are the best
Acts as a randomizing factor depending on whom you are getting feedback from
Can you explain (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model?
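For readers with the same question, here is a hypothetical outline of how those three stages fit together. Every function below is a placeholder stub I wrote for illustration, not a real library API or anything shown in the video.

```python
# Hypothetical outline of the three RLHF stages; all functions are stubs.

def supervised_finetune(base_model, demonstrations):
    # (1) SFT: train the base LM to imitate human-written example answers.
    return base_model  # placeholder

def train_reward_model(sft_model, preference_pairs):
    # (2) RM: train a scorer so human-preferred answers get higher scores.
    return lambda prompt, response: 0.0  # placeholder scorer

def ppo_finetune(sft_model, reward_model, prompts):
    # (3) PPO: adjust the SFT policy to maximize the reward model's score,
    #     with a KL penalty keeping outputs close to the SFT model.
    return sft_model  # placeholder

def rlhf_pipeline(base_model, demonstrations, preference_pairs, prompts):
    sft_model = supervised_finetune(base_model, demonstrations)
    reward_model = train_reward_model(sft_model, preference_pairs)
    return ppo_finetune(sft_model, reward_model, prompts)
```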
Nice
Where do I get your slides?
haha quiz time again:
0) when the person knows me well
1)D
2)B if proper human feedback
3)C
Aren't we users the humans in the feedback loop for OpenAI?
Yeah, though OpenAI has the final say on what feedback goes through.
Quiz 2: B
You look Indian but your accent sounds British; where are you from, bro?
The video is informative and good, but stop saying "quiz time" in such an annoying way.
I have to disagree, precisely because of how annoying it is. The best mnemonics are the ones that become annoying earworms. That super obnoxious voice is meant to trigger something in our brain: if your brain doesn't like something, it flips a switch in case you need to go fight-or-flight; focus sharpens, circulation increases, and your whole body gets ready to respond. It basically puts you in testing/high-stress mode, which is perfect for a practice quiz.
D. all of the above
all the above
Gerlach Springs
B
C
D
RLHF? ROTFL