If you like this video and you think I deserve it, please consider giving this video a like. Subscribe for more!
I second this statement
quiz 1=A
q2=B
q3=C
Where does the target network come from, and if it's the ideal "conscience" why not just use that? If we already have the ideal network, why bother training a second one?
Good question. Maybe my rhetoric was not super clear here. Essentially, without that target network, the Q network would compute the loss by comparing to itself. In practice, this can lead to unstable values as it is chasing a moving target. Hence a slightly delayed network is introduced to stabilize training.
Note this target network isn't the final iteration of the ideal conscience. It is rather an iteration in the direction of the ideal conscience. I say "ideal conscience" in this context to illustrate that the loss is computed based on this target network's value. But this target network also gets better over time.
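In case a concrete sketch helps, here's a minimal PyTorch-style version of that idea (the names q_net, target_net, SYNC_EVERY and the hyperparameters are mine, not from the video): the loss compares the Q network against a frozen, slightly older copy of itself, and that copy is refreshed every few steps so it also improves over time.

```python
import copy
import torch
import torch.nn as nn

# Assumed setup: any network mapping a state to one Q value per action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)           # the "slightly delayed" copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
GAMMA, SYNC_EVERY = 0.99, 100               # illustrative hyperparameters

def train_step(step, s1, a1, r1, s2, done):
    # Prediction: Q(s1, a1) from the network being trained.
    q_pred = q_net(s1).gather(1, a1.unsqueeze(1)).squeeze(1)

    # Target: r1 + gamma * max_a Q_target(s2, a), held fixed (no gradients).
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values
        q_target = r1 + GAMMA * q_next * (1 - done)

    loss = loss_fn(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every few steps, refresh the delayed copy with the current weights.
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```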
The target network should be called the "snapshot network". It's simply an older version of the Q-network, over which you improve.
Great video, this was helpful for me. The only thing that I found pretty confusing was the target network explanation, which I saw you address in another comment. You described it as the ideal conscience, which really made it seem like it's the optimal Q-network that we're comparing to (which would defeat the purpose of training if we had that). In fact, since it gets updated every few batches, it's less ideal than the Q-network.
Quiz 2:
B. It stores Q for future reference
Sorry Ajay, I'm not sure I'm getting it. What do you mean by an idealised network (you say Frank's idealised conscience)? Where does it come from? Looks like you say that's the actual solution (idealised conscience), but what's its origin?
A scenario where a computer could benefit from learning on its own: I remember Google reporting research on a model that used RL and was able to find more efficient assembly code for a sorting algorithm
It was AlphaDev
Awesome explanation, thanks. Except the quizzes.
Is the target network also randomly initialized? Is it initialized with the same parameters as the Q-network?
From what I gather, the Q-network is acting as our behavior policy, and the target network is acting as our target policy. The way you describe it here makes it seem like the target network is already learned, but that would defeat the purpose of the algorithm in the first place.
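For what it's worth, in typical DQN implementations the target network simply starts as an exact copy of the randomly initialized Q-network, so neither network is "already learned" at the start. A hedged sketch of that setup (architecture and names are mine):

```python
import copy
import torch.nn as nn

# Any small network works for illustration; weights start out random.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# The target network begins as an identical copy of those random weights;
# it only lags behind later, once q_net trains between synchronizations.
target_net = copy.deepcopy(q_net)
```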
It would be really ideal if the quizzes' answers were presented in the video instead of answered in the comments, since those might be inaccurate and there's a huge time loss waiting for a reply.
The quiz was a cool idea for understanding though, really helps.
Great Video!! Thank you for the explanation. My question is, why not use the current state in the target network, instead of the next state?
Brilliant explanation, well done
This was a good video, but I would love to see a deeper dive into your transformer series; that was the best, but I am still missing clarity on some of the steps. Your explanations are the best and I would love to see more.
I have re-watched your videos at least 10 times and have many questions, we need more of your explanations. Keep it up.
Thanks! Yea I am trying to get core concept videos out first and would love to soon dive into a series where I implement this system too :)
@CodeEmporium Hey! A decision transformer video would be really appreciated
Thank you! This is super helpful.
GOOD EXPLANATION
I don't think a DQN outputs actions; that would make it a policy gradient method.
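Right: the DQN head outputs one Q value per action, and the action itself comes from those values (greedily or epsilon-greedily), not from the network directly. A tiny sketch with made-up dimensions:

```python
import torch
import torch.nn as nn

n_actions = 2
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))

state = torch.randn(1, 4)         # a single made-up state
q_values = q_net(state)           # shape [1, n_actions]: one Q value per action
action = q_values.argmax(dim=1)   # action selection happens outside the network
```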
It uses MC to collect Q values and uses them for supervised training, right?
Love your videos mehn. They’ve really helped me understand the concepts
Amazing video and explanation! I have a question: can I use SGD instead of MSE?
Thanks! SGD is an optimizer (an algorithm that describes HOW a model learns) while MSE is a loss function (a function that describes WHAT to minimize). They serve different purposes. But in general, you can replace loss functions with appropriate counterparts. They may not work exactly as described, but they can work in general.
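To make the split concrete, here's a hedged PyTorch sketch (toy model and data): the loss function scores the predictions, the optimizer decides how the weights move, and you can swap either piece independently (SGD vs. Adam, MSE vs. Huber, and so on).

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                                   # toy model
loss_fn = nn.MSELoss()                                    # WHAT to minimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # HOW to update weights

x, y = torch.randn(8, 4), torch.randn(8, 1)               # made-up data
loss = loss_fn(model(x), y)   # score the current predictions
optimizer.zero_grad()
loss.backward()               # gradients of the loss w.r.t. the parameters
optimizer.step()              # SGD uses those gradients to move the weights
```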
Where does the target network come from? Thanks
Amazing explanation 🎉🎉🎉🎉
Thanks for the explanation.
this video is great
I did not understand where the target network comes from. And if it already exists, why should a new one be trained?
Good question. From my understanding the answer is more practical than theoretical.
The target network ensures the Q network isn't chasing a moving target. If the network were compared against itself on every iteration, training would not be stable. Hence another slightly delayed network is introduced to ensure this stability.
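A tiny sketch of the "moving target" point (made-up transition, names are mine): the only difference between the two versions is which network produces the target.

```python
import copy
import torch
import torch.nn as nn

GAMMA = 0.99
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)                # delayed copy, synced only occasionally

r1, s2 = torch.tensor([1.0]), torch.randn(1, 4)  # made-up reward and next state

# Unstable: comparing q_net against itself means this target shifts
# after every single gradient step.
moving_target = r1 + GAMMA * q_net(s2).max(dim=1).values

# Stable: the target comes from the delayed copy, so it stays put
# between synchronizations.
with torch.no_grad():
    fixed_target = r1 + GAMMA * target_net(s2).max(dim=1).values
```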
QT-1: Option A (by definition)
Umm I have a doubt: if we already have the target network, then why do we need to evaluate the Q network? Can't we directly use the target network?
2:40 A
That’s right! Nice!
✨Quiiizz Timmmeeee✨
The answer to quiz time 3 is C. I'm right, aren't I, CodeEmporium bro?
You're a doll.
q1=A
Engagement Comment
why is it called a Q network and not just a neural network?
Because the network replaces a Q-table. Q stands for quality, because the table stores how good a decision was.
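To picture the table being replaced: a small hedged sketch where the Q-table is just a lookup from (state, action) to how good that decision was, and the network plays the same role when there are too many states to enumerate.

```python
import torch
import torch.nn as nn

# Tabular version: the quality of each (state, action) pair, stored explicitly.
q_table = {
    ("state_0", "left"): 0.1,
    ("state_0", "right"): 0.7,   # "right" looks like the better decision here
}

# Network version: the same idea, but the qualities are computed rather than
# stored, so it scales to states you could never fit in a table.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
q_values = q_net(torch.randn(1, 4))   # one quality score per action
```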
B
On my life, I've got no idea what you're on about, at any stage of the video.
It feels like you're jumping between concepts and not explaining how they're linked.
I'd rather you say what a QN is, describe how it works, then give the Frank example
The explanation for phase 2 is very poor. Why do you use s1 for the Q network and s2 for the target network? The logic of the subsequent calculations is not clear either.
A
A! Yep that’s right for Quiz 1
If you'd like to see your channel perform better, you might consider that your audience is composed of intelligent adults.
QAnon Network.
teach him how to use a pencil
I am a pencil
Q3.A
Very cringey but good video nonetheless 👍
SHUDDAP
HE'S GIVING A GOOD VIBE
Nah the vibe is like m******* *******