Deep Q-Learning/Deep Q-Network (DQN) Explained | Python Pytorch Deep Reinforcement Learning

  • Published Sep 20, 2024

Comments • 102

  • @johnnycode
    @johnnycode  8 months ago +9

    Please like and subscribe if this video is helpful for you 😀
    Check out my DQN PyTorch for Beginners series, where I explain DQN in much greater detail and show the whole process of implementing DQN to train Flappy Bird: th-cam.com/video/arR7KzlYs4w/w-d-xo.html

    • @hrishikeshh
      @hrishikeshh 7 months ago

      Subscribed. You have really understood the concept of DQN. I was trying to implement DQN for a hangman game (guessing characters to finally guess the word in fewer than 6 attempts), and your explanation helped drastically.
      Although I need your opinion regarding the input size and dimensions for the hangman game, since in this Frozen Lake game the grid is predefined.
      What could we do in the case of a word-guessing game where each word has a different length? Thanks for the help in advance.

    • @johnnycode
      @johnnycode  7 months ago +1

      Hi, not sure if this will work, but try this:
      - Use a vector of size 26 to represent the guessed/unguessed letters: 0 = available to use for guessing, 1 = correct guess, -1 = wrong guess.
      - As you'd mentioned, words are variable length, but maybe set a fixed max length of say 15, so use a vector of size 15 to represent the word: 0 = unrevealed letter, 1 = revealed letter, -1 = unused position.
      Concatenate the 2 vectors into 1 for the input layer. Try training with really short words and a fixed max length of maybe 5, so you can quickly see if it works or not.
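      A minimal sketch of that encoding (hypothetical helper names, assuming a 26-letter alphabet and a fixed max word length of 15):
      import numpy as np

      ALPHABET = "abcdefghijklmnopqrstuvwxyz"
      MAX_LEN = 15  # assumed fixed maximum word length

      def encode_state(correct_letters, wrong_letters, revealed_positions, word_len):
          # letters vector: 0 = available to guess, 1 = correct guess, -1 = wrong guess
          letters = np.zeros(26, dtype=np.float32)
          for c in correct_letters:
              letters[ALPHABET.index(c)] = 1.0
          for c in wrong_letters:
              letters[ALPHABET.index(c)] = -1.0
          # word vector: 0 = unrevealed letter, 1 = revealed letter, -1 = unused position
          word = np.full(MAX_LEN, -1.0, dtype=np.float32)
          word[:word_len] = 0.0
          for i in revealed_positions:
              word[i] = 1.0
          return np.concatenate([letters, word])  # input layer size = 26 + 15 = 41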

    • @hrishikeshh
      @hrishikeshh 7 months ago +2

      @@johnnycode Thanks man, the concatenation advice worked. You are really good.

    • @johnnycode
      @johnnycode  7 months ago

      @@hrishikeshh Great job getting your game to train! 🎉

  • @bagumamartin
    @bagumamartin 4 months ago +4

    Johnny, you explained what I've been trying to wrap my head around for 9 months in a few minutes. Keep up the good work.

  • @user-zw7pd5io3e
    @user-zw7pd5io3e 2 months ago +3

    The best teachers are those who teach a difficult lesson simply. thank you

  • @thefall0190
    @thefall0190 8 months ago +3

    Thank you for making this video. Your explanations were clear👍 , and I learned a lot. Also, I find your voice very pleasant to listen to.

  • @johngrigoriadis
    @johngrigoriadis 8 months ago +3

    These videos on the new Gymnasium version of Gym are great. ❤ Could you do a video about the bipedal walker environment?

  • @faisalshaikh2252
    @faisalshaikh2252 2 days ago

    Thanks! Learned a lot from you...hoping to learn more

    • @johnnycode
      @johnnycode  2 days ago

      Thank you!!!

  • @kimiochang
    @kimiochang 4 months ago +1

    Thanks again for your good work to help me understand reinforcement learning better.

    • @johnnycode
      @johnnycode  4 months ago

      You’re welcome, thanks for all the donations😊😊😊

  • @johndowling1861
    @johndowling1861 2 months ago +1

    This was a really well explained example

  • @carringoosen1177
    @carringoosen1177 25 days ago

    That makes so much sense, thanks Johnny! I just have a few questions. Why does the performance change when I define the environment to be randomly generated [i.e. env = gym.make('FrozenLake-v1', desc=generate_random_map(size=4), is_slippery=is_slippery, render_mode='human' if render else None)]? It still learns, but it falls into the holes most of the time. Also, when I set size=10, it doesn't seem to be learning at all. I'm really stuck, so any input into why I can't change the environment would be much appreciated. I've tried increasing the number of episodes, and I've played around with the values for the discount factor and learning rate, but I can't seem to get it working.

    • @johnnycode
      @johnnycode  25 days ago

      The network is learning the output (best action) based on the input (state). The state that you are capturing only contains the location of the agent. That means if you train the agent on 1 map, the network only learns that 1 map, and it can't learn more than 1 map. In order to handle more than 1 map, you must encode the map information into the state. For example, use 0 to represent solid ground, 0.5 to represent holes, and 1 to represent the location of the agent. With this state representation, you could train the network on random maps, but training will take a lot longer, because the agent needs to see as many different configurations as possible. Just think of the network as a lookup table: what action should the agent take if the state looks like this? Also, turn slippery off when experimenting.
      When you change the size of the map to 10, you must also change the state input to match the size.
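      A rough sketch of that encoding (hypothetical helper, assuming desc is the list of map rows returned by generate_random_map):
      import numpy as np

      def encode_state(desc, agent_pos):
          # desc example for a 4x4 map: ["SFFF", "FHFH", "FFFH", "HFFG"]
          # 0 = solid ground / start / goal, 0.5 = hole, 1 = current agent location
          flat = "".join(desc)
          state = np.array([0.5 if c == "H" else 0.0 for c in flat], dtype=np.float32)
          state[agent_pos] = 1.0
          return state  # length = size * size, so a 10x10 map needs a 100-node input layer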

  • @Persian-boy
    @Persian-boy 1 month ago

    13:25 If we don't have an end state, how is our target value determined? And finally, how does our neural network (policy) learn?

  • @drm8164
    @drm8164 8 months ago

    You are a great teacher, thank you so much and Merry Christmas 2023

  • @HarshVardhan-m7e
    @HarshVardhan-m7e 8 months ago

    Thanks for this tutorial. It helped me to understand DQN.

  • @Persian-boy
    @Persian-boy 1 month ago +1

    13:24 If we don't have an end state, how is our target value determined? And finally, how does our neural network (policy) learn?

    • @johnnycode
      @johnnycode  1 month ago

      Answer for question 1: Target is estimated using the formula reward + discount_factor * max(q[new_state]). Basically, the Target value is mainly driven by the best q-value of the next state. Q-values change throughout training, but they should get better and better, so the Target also gets better.
      Answer for question 2: We are providing the policy network the Inputs and expected Outputs. This is basically Supervised Learning, a machine learning technique.
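      A minimal PyTorch sketch of that target calculation (assuming target_net is the target network, new_state is already a tensor, and reward/terminated come from env.step):
      import torch

      with torch.no_grad():
          if terminated:   # end state: the target is just the reward
              target = torch.tensor(float(reward))
          else:            # otherwise bootstrap from the best q-value of the next state
              target = reward + discount_factor * target_net(new_state).max()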

    • @Persian-boy
      @Persian-boy 1 month ago

      @@johnnycode Let me ask my question more precisely. For training, we started from the state just before the last one. Because its q-value is equal to the reward, it does not depend on the next state. Once that q-value is determined, the other q-values in the other states are determined. But in my problem, which is related to pricing, we never have an end state. So where should the network start its training, and how is the first correct q-value (in the target network) determined?

    • @johnnycode
      @johnnycode  1 month ago

      Ok, I understand your question now. Take a look at my DQN on Flappy Bird series. The Flappy Bird environment does not have an end state either, so pay attention to how the rewards are set up to encourage the bird to fly forward.
      th-cam.com/play/PL58zEckBH8fCMIVzQCRSZVPUp3ZAVagWi.html&si=cxJojVxxvGVTvAJL

    • @Persian-boy
      @Persian-boy 1 month ago +1

      @@johnnycode I looked at this valuable tutorial. In this tutorial, the fifth part explains the target network. And it is written like in this video:
      DQN target = reward, if the new state is terminal
      else: reward + discount_factor * max(q[new_state])
      What I understand is that in every episode, wherever the game ends (the bird dies or hits a pipe) there is an end state. That is, in environments like
      Flappy Bird we also have an end state. Did I understand that right?
      In my case, which is related to pricing, do I need to define an episode that includes multiple customers, where the decision on the last customer becomes the end state?

    • @johnnycode
      @johnnycode  1 month ago

      Yes, your understanding is correct. The goal of RL is to maximize total reward within an episode, so you need to define something that ends your episode. Your episode must have an end, otherwise you are collecting rewards forever.
      For your problem, you can define an arbitrary stopping point. After 100 actions, for example. In the Flappy Bird series, I tested the code against Cart Pole. When training Cart Pole, it is possible to train to a point where the pole never falls. In one of the videos in that series, I talked about defining the episode termination condition of Cart Pole as: "end episode if pole falls OR actions >= 100,000". 100,000 is just an arbitrary number that I selected.
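      A hedged sketch of such an arbitrary cutoff inside the episode loop (select_action and step_limit are made-up names):
      step_limit = 100_000   # arbitrary cap, like the Cart Pole example above
      steps = 0
      done = False
      state, _ = env.reset()
      while not done:
          action = select_action(state)   # hypothetical epsilon-greedy helper
          state, reward, terminated, truncated, _ = env.step(action)
          steps += 1
          done = terminated or truncated or steps >= step_limit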

  • @fawadkhan8905
    @fawadkhan8905 8 months ago

    Wonderful Explanation!

  • @user-ks2kc9qz3d
    @user-ks2kc9qz3d 4 months ago +1

    Please, could you give me the source code of this DQN algorithm? Thank you for this explanation.

    • @johnnycode
      @johnnycode  4 months ago +1

      github.com/johnnycode8/gym_solutions

    • @user-ks2kc9qz3d
      @user-ks2kc9qz3d 4 months ago

      @@johnnycode Thank you very much for your answer. I work on Colab (Python online) since I have hardware constraints, and I don't understand how to use the source code you sent me to build other implementations on top of it. Thank you for explaining how to do this, if possible in a short video. Thanks a million once again.

    • @johnnycode
      @johnnycode  4 months ago

      Here are some suggestions:
      How to modify this code for MountainCar: th-cam.com/video/oceguqZxjn4/w-d-xo.html
      You might be interested in using a library like Stable Baselines3: th-cam.com/video/OqvXHi_QtT0/w-d-xo.html

  • @koka-lf8ui
    @koka-lf8ui 3 months ago

    Thank you so much. Can you please implement any env with DQN showing the forgetting problem?

  • @sosukeyuto2199
    @sosukeyuto2199 15 days ago

    Hello there, me again. I am not sure why you used the Q-value update rule reward + gamma * max_next_q_value. Are you assuming a deterministic environment? What if I am training in Pitfall: should I use the non-deterministic formula for the Q-value update, (1 - alpha) * current_q + alpha * (reward + discount_factor * max_next_q_value)?
    And lastly, what is alpha in this last formula? Is it the learning rate or something else?
    I am asking because I am still training the agent in Pitfall, but I am really confused about which rule I should use since the results are very poor, and my professor said that I should train using just the target network and using the update rule for the non-deterministic case.

    • @johnnycode
      @johnnycode  15 days ago +1

      Hey, for DQN, there is only 1 target formula: reward + gamma * max_next_q_value. There is no deterministic vs non-deterministic formula. The other formula you mentioned is for (tabular) Q-Learning only, and alpha is the learning rate. For Pitfall, you should really use an RL library like Stable Baselines3 (check out my tutorial here: th-cam.com/video/OqvXHi_QtT0/w-d-xo.html). Pitfall probably takes many, many days of training, so you should use a proven and efficient implementation of DQN. My code is for learning and definitely not efficient.
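      Side by side, roughly (q here is a tabular NumPy array; in DQN the learning rate lives in the optimizer, not in the target):
      import numpy as np

      def q_learning_update(q, state, action, reward, new_state, alpha, gamma):
          # tabular rule: blend the old value toward the bootstrapped target, weighted by alpha
          q[state, action] = (1 - alpha) * q[state, action] + alpha * (reward + gamma * np.max(q[new_state]))

      # DQN has no alpha in its target formula; the network is simply regressed
      # (e.g. MSE loss + Adam) toward  reward + gamma * max(q_target_net(new_state)).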

    • @sosukeyuto2199
      @sosukeyuto2199 15 days ago

      @@johnnycode Thank you so much. I will check out your video and see if I can manage to improve my approach.

  • @rickyolal
    @rickyolal 3 months ago

    Hey Johnny! Thanks so much for these videos! I have a question: is it possible to apply this algorithm to a continuous action space? For example, selecting a number in the range [0, 120] as an action, or should I investigate other algorithms?

    • @johnnycode
      @johnnycode  3 months ago

      Hi, DQN only works on discrete actions. Try a policy gradient type algorithm. My other video talks about choosing an algorithm: th-cam.com/video/2AFl-iWGQzc/w-d-xo.html

  • @user-ks2kc9qz3d
    @user-ks2kc9qz3d 4 months ago +1

    Thank you very much for your answer

  • @clashwithdheeraj1599
    @clashwithdheeraj1599 7 months ago +1

    for i in range(1000):
        print("thankyou")

    • @johnnycode
      @johnnycode  7 months ago

      😁

  • @ProjectOfTheWeek
    @ProjectOfTheWeek 8 months ago +1

    I don't quite understand, because if you change the position of the puddles, the trained model will no longer be able to find the reward, right? What is the purpose of Q-learning then?

    • @johnnycode
      @johnnycode  8 months ago

      Q-Learning provides a general way to find the "most-likely-to-succeed" set of actions to the goal. You are correct that the trained model only works on a specific map. In order for the model to solve (almost) any map, the agent has to be trained on as many map layouts as possible. The input to the neural network will probably need to include the map layout.

    • @ProjectOfTheWeek
      @ProjectOfTheWeek 8 months ago

      @@johnnycode Do you know of other AI approaches besides Q-learning? I wouldn't like to pass in the layout, since that would be like 'cheating' (talking about the Battleship board game, for example)

    • @johnnycode
      @johnnycode  8 months ago

      @TutorialesHTML5 I'm not sure what other learning algorithm would work for Battleship, but Deep Q-Learning should work. Your "input" to the neural network would be your Target Grid, i.e. the shots fired/missed/hit. When there is a hit, the model should guess that the next highest chance of hitting is one of the adjacent squares. This video is for understanding the underlying algorithm; you might want to use a Reinforcement Learning library like what I show in this video: th-cam.com/video/OqvXHi_QtT0/w-d-xo.html
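      One possible (untested) way to encode that Target Grid as the network input:
      import numpy as np

      # 0 = square not fired at, -1 = miss, 1 = hit; flatten the 10x10 grid into 100 inputs
      target_grid = np.zeros((10, 10), dtype=np.float32)
      target_grid[3, 4] = 1.0    # example hit
      target_grid[7, 2] = -1.0   # example miss
      state = target_grid.flatten()   # input layer size = 100, one output per square to fire at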

    • @ProjectOfTheWeek
      @ProjectOfTheWeek 8 months ago

      @@johnnycode Yes, but every game changes the boats' positions... will it work? If you like, make a video... 😂😊

    • @johnnycode
      @johnnycode  8 months ago

      I might give it a shot, but no guarantees 😁

  • @peterhpchen
    @peterhpchen 3 months ago

    Excellent!

  • @World-Of-Mr-Motivater
    @World-Of-Mr-Motivater 2 months ago

    Sir, I have one doubt.
    We are training the policy network first and then copying it as the target network.
    Then again we are letting our agent go through the policy network and updating it based on the target network.
    But again, you mentioned the target network uses the DQN formula.
    I am totally confused, sir. Can you give it in crisp steps?

    • @johnnycode
      @johnnycode  2 months ago +1

      My video series on implementing DQN to train Flappy Bird has much more detailed explanations of the end-to-end process, check it out: th-cam.com/video/arR7KzlYs4w/w-d-xo.html

    • @World-Of-Mr-Motivater
      @World-Of-Mr-Motivater 2 months ago

      @@johnnycode ok sir thanks a lot

    • @World-Of-Mr-Motivater
      @World-Of-Mr-Motivater 2 months ago

      @@johnnycode Sir, can I use the DQN to generate the pixel positions of stones in a snake game, where the stones act as obstacles?
      Please guide me, sir.

    • @johnnycode
      @johnnycode  2 months ago

      @World-Of-Mr-Motivater See if this video answers your question: th-cam.com/video/AoGRjPt-vms/w-d-xo.html

  • @DEVRAJ-np2og
    @DEVRAJ-np2og 2 months ago

    How do I start learning reinforcement learning? I know pandas, NumPy, Matplotlib, and basic ML algorithms.

  • @Shootamcgav
    @Shootamcgav 9 months ago

    Just curious how you learned all of this? Did you just read the documentation or watch other videos?

    • @johnnycode
      @johnnycode  9 months ago +2

      These 2 resources were helpful to me:
      huggingface.co/learn/deep-rl-course/unit3/deep-q-algorithm
      pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

  • @ProjectOfTheWeek
    @ProjectOfTheWeek 8 months ago +1

    Thanks for the video! Can you make an example with a Battleship game? I'm trying, but the action (e.g. position 12) is the same as the new state (12) 😢

    • @johnnycode
      @johnnycode  8 months ago

      I've actually never played Battleship (I'm assuming the board game?) before :D
      Do you have a link to the environment?

    • @ProjectOfTheWeek
      @ProjectOfTheWeek 8 months ago

      @@johnnycode I don't have the environment finished yet

  • @rverm1000
    @rverm1000 24 days ago

    How can I apply it to an autonomous car? I can put the code on a Raspberry Pi Pico. By adding sensors?

    • @johnnycode
      @johnnycode  24 days ago

      The way Reinforcement Learning works is that you train the model on a software simulation of your car, because this could take millions of trials. After training, you can put the model onto your raspberry pi. Having a simulation of the car and its surroundings is a tough problem.

  • @JJGhostHunters
    @JJGhostHunters 3 months ago

    Hi Johnny...Do you know where I can find an example of applying a DQN to the Taxi-V3 environment?

    • @johnnycode
      @johnnycode  3 months ago

      The Taxi env can be solved with regular q-learning. If you take my code from the Frozen Lake video and swap in Taxi, it should work.

    • @JJGhostHunters
      @JJGhostHunters 3 months ago

      @@johnnycode Hi Johnny... I made it work with Q-learning; however, I was wanting to replace the Q-table with a simple DQN. It seems like this should be possible. I tried searching and even asked ChatGPT to help, but I cannot quite get it to work.

    • @johnnycode
      @johnnycode  3 months ago

      How did you encode the input to the policy/target network?

    • @JJGhostHunters
      @JJGhostHunters 3 months ago

      @@johnnycode Can I send you my code? It is a very short script that attempts to use a CNN to solve the Taxi-V3 problem.

    • @johnnycode
      @johnnycode  3 months ago

      Sorry, I can’t review your code. If you have specific questions, I can try to answer them.

  • @AfitQADirectorate
    @AfitQADirectorate 6 months ago

    Great video, thanks. I am from a cyber security background. Do you have any idea about the Network Attack Simulator (NASim), which also uses Deep Q-learning and OpenAI Gym? If not, can you guide me on where to find tutorials on it? I have checked YouTube for weeks but couldn't find any. THANKS

    • @johnnycode
      @johnnycode  6 months ago

      Are you referring to networkattacksimulator.readthedocs.io/ ?
      Did you try the tutorial in the documentation? What are you looking to do?

  • @Shootamcgav
    @Shootamcgav 9 months ago

    awesome video!

  • @nimo9503
    @nimo9503 7 months ago

    Thank you for making this video. May I ask a question about the Q-network: why did you set the input size for the network to 16 inputs at 5:19, rather than 1 input that represents the state index only?
    I think 1 input is enough

    • @johnnycode
      @johnnycode  7 months ago +1

      Hi, good observation. You can change the input to 1 node of 0-15, rather than 16 nodes of 0 or 1, and the training will work.
      Currently, the trained Q-values will not work if we were to reconfigure the locations of the ice holes. That is because we are not passing in the map configuration to the network and reconfiguring the map during training. I was thinking of encoding the map configuration into the 16 nodes, but I left that out of the video to keep things simple. I hope this answers why I had 16 nodes instead of 1.
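      For reference, a small sketch of both input options (the 16-node one-hot, as in the video, vs. a single 0-15 index):
      import torch

      def one_hot_input(state: int, num_states: int = 16) -> torch.Tensor:
          x = torch.zeros(num_states)   # 16 nodes, all 0
          x[state] = 1.0                # 1 at the agent's current position
          return x

      def index_input(state: int) -> torch.Tensor:
          return torch.tensor([float(state)])   # 1 node holding the state index 0-15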

    • @nimo9503
      @nimo9503 7 months ago

      @@johnnycode Thank you for this quick reply, which I wasn't expecting.
      This makes sense to me

  • @sosukeyuto2199
    @sosukeyuto2199 2 months ago

    Hi there. Thanks for the video guide. Super interesting and useful for better understanding the topic. I'm trying to implement tabular Q-learning and deep Q-learning in the context of an Atari environment (specifically Pitfall), but I'm struggling to understand how to handle the number of possible states. I must use the RAM observation space, but in this observation space a state is represented by an array of 128 cells, each of type uint8 with values that range from 0 to 255. So I cannot know a priori what the exact number of states is, since changing just one value of a cell will result in a different new state. Have you got any suggestions, or do you know of some guide to better understand how to manage this environment?

    • @johnnycode
      @johnnycode  2 months ago

      Tabular Q-learning cannot solve Atari games. You have to use deep Q-learning (DQN) or another type of advanced algorithm.
      For Pitfall, you should use RGB or grayscale observations instead of RAM. Watch my video that explains some basics of how to use DQN on Atari games: th-cam.com/video/qKePPepISiA/w-d-xo.html
      However, Pitfall is probably too difficult to start with. You should try training non-Atari environments first. For example, my series on training Flappy Bird: th-cam.com/video/arR7KzlYs4w/w-d-xo.html

    • @sosukeyuto2199
      @sosukeyuto2199 2 months ago

      @@johnnycode I wish I could change to another environment, but I must do this because it is a project for university, so I need to stick with this one.
      For the tabular approach I thought the same thing, but my professor suggested that I could do it using a map instead of a matrix, which accepts arrays or tuples as the index, and whenever there is a state that was not initialized before, initialize it to the default.
      As for DQN, I will happily check out your video to see if I can figure this out. Thank you.

    • @sosukeyuto2199
      @sosukeyuto2199 2 months ago

      And RAM is another condition I should maintain for the observation space

    • @johnnycode
      @johnnycode  2 months ago +1

      I don't mean to challenge your professor, but it is impossible to solve Pitfall with tabular Q-learning. 128 cells that can each take 256 values gives 256 to the power of 128 states, which no computer memory can hold. You need to use a neural-network-based algorithm like DQN.

    • @sosukeyuto2199
      @sosukeyuto2199 2 months ago

      @@johnnycode I agree with you. I had the same thought about it, and I even spent the last two days trying to train it with the tabular approach. I managed to make it work (in terms of the structure of the table and writing the q-values), but of course I had just some episodes in which it maintained a reward of 0 and a lot of other episodes where the reward was negative (and it could never manage to collect a single treasure).
      So I don't know, maybe he wants me to do it anyway and demonstrate that it can't be implemented like that. I will ask for a meeting with him and see what comes out of it.
      Thank you so much for your time and your considerations. When I finish with the tabular version I will start the deep Q-learning with a CNN. If I have some doubts, would you mind if I still ask you some things? I don't mean to disturb you further, but I found this discussion and your videos about the topic really useful for better approaching the problem.

  • @ElisaFerrari-q5i
    @ElisaFerrari-q5i 2 months ago

    How can we solve the problem using a single DQN instead of 2?

    • @johnnycode
      @johnnycode  2 months ago

      You can use the same policy network in place of the target network. Training results may be worse than using 2 networks.
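      Roughly, the difference in code (policy_net and target_net are the two PyTorch networks; the sync point is arbitrary):
      # Two networks: periodically copy the policy network's weights into the target network
      target_net.load_state_dict(policy_net.state_dict())

      # Single network: compute the target from the policy network itself (results may be worse, as noted above)
      with torch.no_grad():
          target = reward + gamma * policy_net(new_state).max()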

    • @ElisaFerrari-q5i
      @ElisaFerrari-q5i 2 months ago

      And which is the best solution between the single DQN and the Q-learning method?

    • @johnnycode
      @johnnycode  2 months ago

      @@ElisaFerrari-q5i DQN is the more advanced version of Q-learning.

  • @envelopepiano2453
    @envelopepiano2453 8 months ago

    Sorry, may I ask how I can find the max step count in the training of every episode? How do I know that the max number of actions is 200?

    • @johnnycode
      @johnnycode  8 months ago +1

      You actually have to add a few lines of code to enable the max step truncation. Like this:
      import gymnasium as gym
      from gymnasium.wrappers import TimeLimit
      env = gym.make("FrozenLake-v1", map_name="8x8")
      env = TimeLimit(env, max_episode_steps=200)

    • @envelopepiano2453
      @envelopepiano2453 8 months ago

      @@johnnycode Thank you very much! Sorry, may I ask another question: if my agent now has a health stat = 16 and every step costs one point of health, how can I let the agent know this? I mean, let the agent also consider the health state in the DQN training. Can you give me some thoughts? Sorry, I'm a newbie, and this may be a stupid question... thanks

    • @johnnycode
      @johnnycode  8 months ago

      You are talking about changing the reward/penalty scheme; that is something you have to change in the environment. You have to make a copy of the FrozenLake environment and then make your code changes. If you are on Windows, you can find the file here: C:\Users\\.conda\envs\gymenv\Lib\site-packages\gymnasium\envs\toy_text\, and find frozen_lake.py
      For example, I made some visual changes to the FrozenLake environment in the following video; however, I did not modify the reward scheme: th-cam.com/video/1W_LOB-0IEY/w-d-xo.html

  • @fernandomaroli8481
    @fernandomaroli8481 7 months ago

    Any chance you can show us how to use Keras and RL on Tetris?

    • @johnnycode
      @johnnycode  7 months ago

      Thanks for the interest, but I'm no RL expert and totally not qualified to give private lessons on this subject. I'm just learning myself and making the material easier to understand for others. As for Tetris, I have not tried it but may attempt it in the future.

  • @user-ks2kc9qz3d
    @user-ks2kc9qz3d 4 months ago

    Thank you for telling me how to create the environment in the first place, since I would like to create an environment like your 4x4 but with other images.

  • @dylan-652
    @dylan-652 9 months ago

    the goat

  • @henriquefantato9804
    @henriquefantato9804 8 months ago

    Great video! Maybe the next env to try is Mario?

    • @johnnycode
      @johnnycode  8 months ago

      Thanks, I’ll consider it.

  • @user-ks2kc9qz3d
    @user-ks2kc9qz3d 4 months ago