If you like this video and you think I deserve it, please consider giving this video a like. Subscribe for more!
I second this statement
quiz 1=A
q2=B
q3=C
Where does the target network come from, and if it's the ideal "conscience" why not just use that? If we already have the ideal network, why bother training a second one?
Good question. Maybe my rhetoric was not super clear here. Essentially, without that target network, the Q network would compute the loss by comparing to itself. In practice, this can lead to unstable values as it is chasing a moving target. Hence a slightly delayed network is introduced to stabilize training.
Note this target network isn't the final iteration of the ideal conscience. It is rather an iteration in the direction of the ideal conscience. I say "ideal conscience" in this context to illustrate that the loss is computed based on this target network's value. But this target network also gets better over time.
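In case a concrete sketch helps, here's a minimal PyTorch-style version of that idea (the names q_net, target_net, SYNC_EVERY and the hyperparameters are mine, not from the video): the loss compares the Q network against a frozen, slightly older copy of itself, and that copy is refreshed every few steps so it also improves over time.

```python
import copy
import torch
import torch.nn as nn

# Assumed setup: any network mapping a state to one Q value per action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)           # the "slightly delayed" copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
GAMMA, SYNC_EVERY = 0.99, 100               # illustrative hyperparameters

def train_step(step, s1, a1, r1, s2, done):
    # Prediction: Q(s1, a1) from the network being trained.
    q_pred = q_net(s1).gather(1, a1.unsqueeze(1)).squeeze(1)

    # Target: r1 + gamma * max_a Q_target(s2, a), held fixed (no gradients).
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values
        q_target = r1 + GAMMA * q_next * (1 - done)

    loss = loss_fn(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every few steps, refresh the delayed copy with the current weights.
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```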
The target network should be called the "snapshot network". It's simply an older version of the Q-network, over which you improve.
Great video, this was helpful for me. The only thing that I found pretty confusing was the target network explanation, which I saw you address in another comment. You described it as the ideal conscience, which really made it seem like it's the optimal Q-network that we're comparing to (which would defeat the purpose of training if we had that). In fact, since it gets updated every few batches, it's less ideal than the Q-network.
Quiz 2:
B. It stores Q for future reference
Sorry Ajay, I'm not sure I'm getting it. What do you mean by an idealised network (you say Frank's idealised conscience)? Where does it come from? Looks like you say that's the actual solution (idealised conscience), but what's its origin?
A scenario where a computer could benefit from learning on its own: I remember Google reporting research on a model that used RL and was able to find more efficient assembly code for a sorting algorithm
It was AlphaDev
Awesome explanation, thanks. Except the quizzes.
Is the target network also randomly initialized? Is it initialized with the same parameters as the Q-network?
From what I gather, the Q-network is acting as our behavior policy, and the target network is acting as our target policy. The way you describe it here makes it seem like the target network is already learned, but that would defeat the purpose of the algorithm in the first place.
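For what it's worth, in typical DQN implementations the target network simply starts as an exact copy of the randomly initialized Q-network, so neither network is "already learned" at the start. A hedged sketch of that setup (architecture and names are mine):

```python
import copy
import torch.nn as nn

# Any small network works for illustration; weights start out random.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# The target network begins as an identical copy of those random weights;
# it only lags behind later, once q_net trains between synchronizations.
target_net = copy.deepcopy(q_net)
```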
It would be really ideal if the quizzes' answers were presented in the video instead of answered in the comments, since those might be inaccurate and there's a huge time loss waiting for a reply.
The quiz was a cool idea for understanding though, really helps.
Great Video!! Thank you for the explanation. My question is, why not use the current state in the target network, instead of the next state?
Brilliant explanation, well done
This was a good video, but I would love to see a deeper dive into your transformer series; that was the best, but I am still missing clarity on some of the steps. Your explanations are the best and I would love to see more.
I have re-watched your videos at least 10 times and have many questions, we need more of your explanations. Keep it up.
Thanks! Yea I am trying to get core concept videos out first and would love to soon dive into a series where I implement this system too :)
@CodeEmporium Hey! A decision transformer video would be really appreciated
Thank you! This is super helpful.
GOOD EXPLANATION
I don't think a DQN outputs actions; that would make it a policy gradient method.
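Right: the DQN head outputs one Q value per action, and the action itself comes from those values (greedily or epsilon-greedily), not from the network directly. A tiny sketch with made-up dimensions:

```python
import torch
import torch.nn as nn

n_actions = 2
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))

state = torch.randn(1, 4)         # a single made-up state
q_values = q_net(state)           # shape [1, n_actions]: one Q value per action
action = q_values.argmax(dim=1)   # action selection happens outside the network
```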
It uses MC to collect Q values and uses them for supervised training, right?
Love your videos mehn. They’ve really helped me understand the concepts
Amazing video and explanation! I have a question: can I use SGD instead of MSE?
Thanks! SGD is an optimizer (an algorithm that describes HOW a model learns) while MSE is a loss function (a function that describes WHAT to minimize). They serve different purposes. But in general, you can replace loss functions with appropriate counterparts. They may not work exactly as described, but they can work in general.
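To make the split concrete, here's a hedged PyTorch sketch (toy model and data): the loss function scores the predictions, the optimizer decides how the weights move, and you can swap either piece independently (SGD vs. Adam, MSE vs. Huber, and so on).

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                                   # toy model
loss_fn = nn.MSELoss()                                    # WHAT to minimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # HOW to update weights

x, y = torch.randn(8, 4), torch.randn(8, 1)               # made-up data
loss = loss_fn(model(x), y)   # score the current predictions
optimizer.zero_grad()
loss.backward()               # gradients of the loss w.r.t. the parameters
optimizer.step()              # SGD uses those gradients to move the weights
```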
Where does the target network come from? Thanks
Amazing explanation 🎉🎉🎉🎉
Thanks for the explanation.
this video is great
I did not understand where the target network comes from. And if it already exists, why should a new one be trained?
Good question. From my understanding the answer is more practical than theoretical.
The target network ensures the Q network isn't chasing a moving target. If the network were compared against itself on every iteration, training would not be stable. Hence another slightly delayed network is introduced to ensure this stability.
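A tiny sketch of the "moving target" point (made-up transition, names are mine): the only difference between the two versions is which network produces the target.

```python
import copy
import torch
import torch.nn as nn

GAMMA = 0.99
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)                # delayed copy, synced only occasionally

r1, s2 = torch.tensor([1.0]), torch.randn(1, 4)  # made-up reward and next state

# Unstable: comparing q_net against itself means this target shifts
# after every single gradient step.
moving_target = r1 + GAMMA * q_net(s2).max(dim=1).values

# Stable: the target comes from the delayed copy, so it stays put
# between synchronizations.
with torch.no_grad():
    fixed_target = r1 + GAMMA * target_net(s2).max(dim=1).values
```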
QT-1: Option A (by definition)
Umm I have a doubt: if we already have the target network, then why do we need to evaluate the Q network? Can't we directly use the target network?
2:40 A
That’s right! Nice!
✨Quiiizz Timmmeeee✨
The answer to quiz time 3 is C. I'm right, aren't I, CodeEmporium bro?
You're a doll.
q1=A
Engagement Comment
why is it called a Q network and not just a neural network?
Because the network replaces a Q-table. Q stands for quality, because the table stores how good a decision was.
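To picture the table being replaced: a small hedged sketch where the Q-table is just a lookup from (state, action) to how good that decision was, and the network plays the same role when there are too many states to enumerate.

```python
import torch
import torch.nn as nn

# Tabular version: the quality of each (state, action) pair, stored explicitly.
q_table = {
    ("state_0", "left"): 0.1,
    ("state_0", "right"): 0.7,   # "right" looks like the better decision here
}

# Network version: the same idea, but the qualities are computed rather than
# stored, so it scales to states you could never fit in a table.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
q_values = q_net(torch.randn(1, 4))   # one quality score per action
```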
B
On my life, I've got no idea what you're on about, at any stage of the video.
It feels like you're jumping between concepts and not explaining how they're linked.
I'd rather you say what a QN is, describe how it works, then give the Frank example
The explanation for phase 2 is very poor. Why do you use s1 for the Q network and s2 for the target network? The logic of the subsequent calculations is not clear either.
A
A! Yep that’s right for Quiz 1
If you'd like to see your channel perform better, you might consider that your audience is composed of intelligent adults.
QAnon Network.
teach him how to use a pencil
I am a pencil
Q3.A
Very cringey but good video nonetheless 👍
SHUDDAP
HE'S GIVING A GOOD VIBE
Nah the vibe is like m******* *******