Please like and subscribe if this video is helpful for you 😀
Check out my DQN PyTorch for Beginners series where I explain DQN in much greater detail and show the whole process of implementing DQN to train Flappy Bird: th-cam.com/video/arR7KzlYs4w/w-d-xo.html
Subscribed. You have really understood the concept of DQN. I was trying to implement DQN for a hangman game (Guessing characters to finally guess the word in less than 6 attempts), and your explanation helped drastically.
Although I need your opinion regarding the input size and dimensions for the hangman game, since in this Frozen Lake game the grid is predefined.
What could we do in the case of a word-guessing game where each word has a different length? Thanks for the help in advance.
Hi, not sure if this will work, but try this:
- Use a vector of size 26 to represent the guessed/unguessed letters: 0 = available to use for guessing, 1 = correct guess, -1 = wrong guess.
- As you'd mentioned, words are variable length, but maybe set a fixed max length of say 15, so use a vector of size 15 to represent the word: 0 = unrevealed letter, 1 = revealed letter, -1 = unused position.
Concatenate the 2 vectors into 1 for the input layer. Try training with really short words and a fixed max length of maybe 5, so you can quickly see if it works or not.
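The encoding suggested above could be sketched like this (all names are illustrative, not from the video's code):

```python
# Hedged sketch of the suggested hangman state encoding.
MAX_LEN = 15  # fixed maximum word length

def encode_state(word, guessed):
    # 26-vector: 0 = available to guess, 1 = correct guess, -1 = wrong guess
    letters = [0] * 26
    for ch in guessed:
        letters[ord(ch) - ord('a')] = 1 if ch in word else -1
    # MAX_LEN-vector: 0 = unrevealed letter, 1 = revealed letter, -1 = unused position
    slots = [-1] * MAX_LEN
    for i, ch in enumerate(word[:MAX_LEN]):
        slots[i] = 1 if ch in guessed else 0
    return letters + slots  # concatenated input vector, size 26 + MAX_LEN
```

The returned list would be the fixed-size input layer regardless of word length.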
@@johnnycode Thanks man, the concatenation advice worked. You are really good.
@hrishikeshh Great job getting your game to train! 🎉
Johnny, you explained what I've been trying to wrap my head around for 9 months in a few minutes. Keep up the good work.
The best teachers are those who teach a difficult lesson simply. thank you
Thank you for making this video. Your explanations were clear👍 , and I learned a lot. Also, I find your voice very pleasant to listen to.
These videos on the new gymnasium version of gym are great. ❤ Could you do a video about the bipedal walker environment?
Thanks! Learned a lot from you...hoping to learn more
Thank you!!!
Thanks again for your good work to help me understand reinforcement learning better.
You’re welcome, thanks for all the donations😊😊😊
This was a really well explained example
That makes so much sense, thanks Johnny! I just have a few questions. Why does the performance change when I define the environment to be randomly generated [i.e. env = gym.make('FrozenLake-v1', desc=generate_random_map(size=4), is_slippery=is_slippery, render_mode='human' if render else None)]? It still learns, but it falls into the holes most of the time. Also, when I set size=10, it doesn't seem to be learning at all. I'm really stuck, so any input into why I can't change the environment would be much appreciated. I've tried increasing the number of episodes, and I've played around with the values for the discount factor and learning rate, but I can't seem to get it working.
The network is learning the output (best action) based on the input (state). The state that you are capturing only contains the location of the agent. That means, if you train the agent on 1 map, the network only learns that 1 map. It also means the network can't learn more than 1 map. In order to handle more than 1 map, you must encode the map information into the state. For example, use 0 to represent solid ground, 0.5 to represent holes, and 1 to represent the location of the agent. With this state representation, you could train the network on random maps, but training will take a lot longer, because the agent needs to see as many different configurations as possible. Just think of the network as a lookup table: what action should the agent take if the state looks like this? Also, turn slippery off when experimenting.
When you change the size of the map to 10, you must also change the state input to match the size.
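A rough sketch of the map-in-the-state idea above, with 0 = solid ground, 0.5 = hole, 1 = agent location (the `desc` strings mimic FrozenLake's map description; the function name is illustrative):

```python
# Encode the map layout plus agent position into one flat state vector.
def encode_map_state(desc, agent_pos):
    flat = "".join(desc)
    state = [0.5 if tile == "H" else 0.0 for tile in flat]
    state[agent_pos] = 1.0  # agent marker overwrites its tile
    return state  # length rows * cols; must grow with map size (e.g. 100 for 10x10)

desc = ["SFFF", "FHFH", "FFFH", "HFFG"]  # FrozenLake-style 4x4 layout
```

This is also why the input layer must change when you move to a size-10 map: the vector grows with the grid.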
You are a great teacher, thank you so much and Merry Christmas 2023
Thanks for this tutorial. It helped me to understand DQN .
13:24 If we don't have an end state, how is our target value determined? And finally, how does our neural network (the policy) learn?
Answer for question 1: The Target is estimated using the formula: reward + discount_factor * max(q[new_state]). Basically, the Target value is mainly driven by the best q-value of the next state. Q-values change throughout training, but they should get better and better, so the Target also gets better.
Answer for question 2: We are providing the policy network the Inputs and expected Outputs. This is basically supervised learning, a machine learning technique.
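The target formula from answer 1 can be written as a tiny framework-free sketch:

```python
# DQN target: just the reward at an end state, otherwise reward plus the
# discounted best q-value of the next state.
def dqn_target(reward, next_q_values, terminal, discount_factor=0.9):
    if terminal:
        return reward  # no future value beyond an end state
    return reward + discount_factor * max(next_q_values)
```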
@johnnycode Let me ask my question more precisely. For training, we start from the end state and work backwards to the first one. Because the q-value of the end state is equal to the reward, it does not depend on the next state. Once that q-value is determined, the q-values of the other states are determined. But in my problem, which is related to pricing, we never have an end state. So where should the network start its training, and how is the first correct q-value (in the target network) determined?
Ok, I understand your question now. Take a look at my DQN on Flappy Bird series. The Flappy Bird environment does not have an end state either, so pay attention to how the rewards are set up to encourage the bird to fly forward.
th-cam.com/play/PL58zEckBH8fCMIVzQCRSZVPUp3ZAVagWi.html&si=cxJojVxxvGVTvAJL
@johnnycode I looked at this valuable tutorial. The fifth part explains the target network, and it is written like in this video:
DQN target = reward, if new state is terminal;
else: reward + discount_factor * max(q[new_state])
What I understand is that in every episode, wherever the game ends (the bird dies or hits a pipe), there is an end state. That is, in environments like
Flappy Bird we also have an end state. Is that right?
In my case, which is related to pricing, I need to define an episode that includes multiple customers, and the decision on the last customer becomes the end state?
Yes, your understanding is correct. The goal of RL is to maximize total reward within an episode, so you need to define something that ends your episode. Your episode must have an end, otherwise you are collecting rewards forever.
For your problem, you can define an arbitrary stopping point. After 100 actions, for example. In the Flappy Bird series, I tested the code against Cart Pole. When training Cart Pole, it is possible to train to a point where the pole never falls. In one of the videos in that series, I talked about defining the episode termination condition of Cart Pole as: "end episode if pole falls OR actions >= 100,000". 100,000 is just an arbitrary number that I selected.
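The arbitrary cut-off described above can be sketched as a simple loop (the env object and its step() signature are minimal stand-ins, not Gymnasium's actual API):

```python
MAX_ACTIONS = 100_000  # arbitrary cut-off, like the Cart Pole example

def run_episode(env):
    # "end episode if done OR actions >= MAX_ACTIONS"
    steps, total_reward, done = 0, 0.0, False
    while not done and steps < MAX_ACTIONS:
        reward, done = env.step()  # hypothetical minimal step API
        total_reward += reward
        steps += 1
    return steps, total_reward
```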
Wonderful Explanation!
please , give me a code source of this algorithm DQN , and thank you for this explanation
github.com/johnnycode8/gym_solutions
@johnnycode Thank you very much for your reply. I work on Colab (online Python) since I have hardware constraints, and I don't understand how to use the source code you sent me to build other implementations on top of it. Thank you for explaining how to do this operation; would it be possible in a short video? Thanks again.
Here are some suggestions:
How to modify this code for MountainCar: th-cam.com/video/oceguqZxjn4/w-d-xo.html
You might be interested in using a library like Stable Baselines3: th-cam.com/video/OqvXHi_QtT0/w-d-xo.html
Thank you so much. Can you please implement an environment with DQN that demonstrates the forgetting problem?
Hello there, me again. I am not sure why you used the q-value update rule "reward + gamma * max_next_q_value". Are you assuming a deterministic environment? If I am training in Pitfall, should I use the non-deterministic formula for the q-value update ("(1 - alpha) * current_q + alpha * (reward + disc_factor * max_next_q_value)")?
And lastly, what is alpha in this last formula? Is it the learning rate or something else?
I am asking because I am still training the agent in Pitfall, but I am really confused about which rule I should use since the results are very poor, and my professor said that I should train using just the target network and the update rule for the non-deterministic case.
Hey, for DQN, there is only 1 target formula "reward+gamma*max_next_q_value". There is no deterministic vs non-deterministic formula. The other formula you mentioned is for Q-Learning only and alpha is the learning rate. For Pitfall, you should really use a RL library like StableBaselines3 (check out my tutorial here th-cam.com/video/OqvXHi_QtT0/w-d-xo.html). Pitfall probably takes many many days of training, so you should use a proven and efficient implementation of DQN. My code is for learning and definitely not efficient.
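To make the distinction concrete, here are the two rules side by side as plain Python (illustrative only):

```python
# Tabular Q-learning: blend the old value toward the target using learning rate alpha.
def q_learning_update(current_q, reward, next_q_values, alpha=0.1, gamma=0.9):
    target = reward + gamma * max(next_q_values)
    return (1 - alpha) * current_q + alpha * target

# DQN: only the target formula exists; the network optimizer plays alpha's role.
def dqn_target(reward, next_q_values, gamma=0.9):
    return reward + gamma * max(next_q_values)
```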
@@johnnycode Thank you so much. I will check out your video and see if I can manage to improve my approach.
Hey Johnny! Thanks so much for these videos! I have a question, is it possible to apply this algorithm to a continuous action space? For example, select a number in a range between [0, 120] as an action, or should I investigate other algorithms?
Hi, DQN only works on discrete actions. Try a policy gradient type algorithm. My other video talks about choosing an algorithm: th-cam.com/video/2AFl-iWGQzc/w-d-xo.html
Thank you very much for your reply.
for i in range(1000):
    print("thankyou")
😁
I don't quite understand, because if you change the position of the puddles, the trained model will no longer be able to find the reward, right? What is the purpose of Qlearning then?
Q-Learning provides a general way to find the "most-likely-to-succeed" set of actions to the goal. You are correct that the trained model only works on a specific map. In order for the model to solve (almost) any map, the agent has to be trained on as many map layouts as possible. The input to the neural network will probably need to include the map layout.
@@johnnycode Do you know other AI systems other than QLearning? I wouldn't like to pass on the layout since that would be like 'cheating' (talking about the battle ships board game for example)
@TutorialesHTML5 I'm not sure what other learning algorithm would work for Battleship, but Deep Q-Learning should work. Your "input" to the neural network would be your Target Grid, i.e. the shots fired/missed/hit. When there is a hit, the model should guess that the next highest chance of hitting is one of the adjacent squares. This video is for understanding the underlying algorithm, you might want to use a Reinforcement Library like what I show in this video: th-cam.com/video/OqvXHi_QtT0/w-d-xo.html
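The Target Grid input described above could be encoded like this (a hedged sketch; the function name and encoding values are my own choices):

```python
# 0 = not fired at, -1 = miss, 1 = hit, flattened for the network input layer.
def encode_target_grid(shots, size=10):
    grid = [0] * (size * size)  # 10x10 board flattened to 100 inputs
    for (row, col), hit in shots.items():
        grid[row * size + col] = 1 if hit else -1
    return grid
```

Because the ships move every game, only the shot history goes into the state; the layout itself stays hidden, so it isn't "cheating".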
@johnnycode Yes, but every game changes the boats' positions... will it work? If you like, make a video 😂😊
I might give it a shot, but no guarantees 😁
Excellent!
Sir, I have one doubt.
We are training the policy network first and then copying it as the target network.
Then we let our agent go through the policy network again and update it based on the target network.
But you also mentioned that the target network uses the DQN formula.
I am totally confused, sir. Can you give it in crisp steps?
My video series on implementing DQN to train Flappy Bird has much more detailed explanations of the end-to-end process, check it out: th-cam.com/video/arR7KzlYs4w/w-d-xo.html
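In crisp steps, the loop is: 1) the target network supplies the DQN target values, 2) the policy network is trained toward those targets, 3) the policy network is periodically copied into the target network. A compressed framework-free sketch (dicts stand in for the networks, a simple nudge stands in for gradient descent):

```python
def train_step(policy_q, target_q, batch, gamma=0.9, lr=0.1):
    for state, action, reward, new_state, terminal in batch:
        # 1. target network supplies the DQN target value
        target = reward if terminal else reward + gamma * max(target_q[new_state])
        # 2. policy network is nudged toward that target for the taken action
        policy_q[state][action] += lr * (target - policy_q[state][action])

def sync(policy_q):
    # 3. periodic copy of the policy network into the target network
    return {s: list(v) for s, v in policy_q.items()}
```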
@@johnnycode ok sir thanks a lot
@johnnycode Sir, can I use deep Q-learning to generate the pixel positions of stones in a snake game, where the stones act as obstacles?
Please guide me, sir.
@World-Of-Mr-Motivater See if this video answers your question: th-cam.com/video/AoGRjPt-vms/w-d-xo.html
How do I start learning reinforcement learning? I know pandas, NumPy, Matplotlib, and basic ML algorithms.
Just curious how you learned all of this? Did you just read the documentation or watch other videos?
These 2 resources were helpful to me:
huggingface.co/learn/deep-rl-course/unit3/deep-q-algorithm
pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
Thanks for the video! Can you make an example with a Battleship game? I'm trying, but the action (e.g. position 12) is the same as the new state (12) 😢
I actually never played Battle Ship (I'm assuming the board game?) before :D
Do you have a link to the environment?
@johnnycode I don't have the environment finished yet.
How can I apply it to an autonomous car? I can put the code on a Raspberry Pi Pico. By adding sensors?
The way Reinforcement Learning works is that you train the model on a software simulation of your car, because this could take millions of trials. After training, you can put the model onto your raspberry pi. Having a simulation of the car and its surroundings is a tough problem.
Hi Johnny...Do you know where I can find an example of applying a DQN to the Taxi-V3 environment?
The Taxi env can be solved with regular q-learning. If you take my code from the Frozen Lake video and swap in Taxi, it should work.
@johnnycode Hi Johnny... I made it work with Q-learning; however, I want to replace the Q-table with a simple DQN. It seems like this should be possible. I tried searching and even asked ChatGPT for help, but I cannot quite get it to work.
How did you encode the input to the policy/target network?
@@johnnycode Can I send you my code? It is a very short script that attempts to use a CNN to solve the Taxi-V3 problem.
Sorry, I can’t review your code. If you have specific questions, I can try to answer them.
Great video, thanks. I am from a cyber security background. Do you have any experience with the Network Attack Simulator (NASim), which also uses deep Q-learning and OpenAI Gym? If you don't, can you guide me on where to find tutorials on it? I have checked YouTube for weeks but couldn't find any. Thanks!
Are you referring to networkattacksimulator.readthedocs.io/ ?
Did you try the tutorial in the documentation? What are you looking to do?
awesome video!
Thank you for making this video. May I ask a question about the Q-network: why did you set the input size of the network to 16 inputs at 5:19, rather than 1 input that represents the state index only?
I think 1 input is enough.
Hi, good observation. You can change the input to 1 node of 0-15, rather than 16 nodes of 0 or 1, and the training will work.
Currently, the trained Q-values will not work if we were to reconfigure the locations of the ice holes. That is because we are not passing in the map configuration to the network and reconfiguring the map during training. I was thinking of encoding the map configuration into the 16 nodes, but I left that out of the video to keep things simple. I hope this answers why I had 16 nodes instead of 1.
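The two input options can be sketched like this (tiny illustrative helpers, not the video's code):

```python
# Option used in the video: 16 one-hot nodes, one per grid cell.
def one_hot(state_index, num_states=16):
    v = [0.0] * num_states
    v[state_index] = 1.0
    return v

# The alternative: a single node holding the state index, optionally scaled to 0-1.
def single_node(state_index, num_states=16):
    return [state_index / (num_states - 1)]
```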
@johnnycode Thank you for this quick reply, which I wasn't expecting.
This makes sense to me.
Hi there. Thanks for the video guide. Super interesting and useful for better understanding the topic. I'm trying to implement tabular Q-learning and deep Q-learning in the context of an Atari environment (specifically Pitfall), but I'm struggling to understand how to handle the number of possible states. I must use the ram observation space, where a state is represented by an array of 128 cells, each of type uint8 with values ranging from 0 to 255. So I cannot know a priori the exact number of states, since changing just one cell's value results in a different state. Have you got any suggestions, or do you know a guide to better understand how to manage this environment?
Tabular q-learning can not solve Atari games. You have to use deep q-learning (DQN) or another type of advanced algorithm.
For Pitfall, you should use rgb or grayscale instead of ram. Watch my video that explains some basics on how to use DQN on atari games: th-cam.com/video/qKePPepISiA/w-d-xo.html
However, Pitfall is probably too difficult to start with. You should try training non-atari environments first. For example, my series on training Flappy Bird: th-cam.com/video/arR7KzlYs4w/w-d-xo.html
@johnnycode I wish I could change to another environment, but I must do this because it is a project for university, so I need to stick with this one.
For tabular I thought the same thing, but my professor suggested that I could do it using a map instead of a matrix, which accepts arrays or tuples as indices, and whenever there is a state that was not initialized before, initialize it to the default.
As for DQN, I will happily check out your video to see if I can figure this out. Thank you.
And ram is another condition I must maintain on the observation space.
I don't mean to challenge your professor, but it is impossible to solve Pitfall with tabular Q-learning. 128 cells with 256 possible values each gives 256 to the power of 128 states, which no computer memory can hold. You need to use a neural network-based algorithm like DQN.
@johnnycode I agree with you. I had the same thought about it, and I even spent the last two days trying to train it with the tabular approach. I managed to make it work (in terms of the structure of the table and writing the q-values), but of course I had just some episodes in which it maintained a reward of 0 and a lot of other episodes where the reward was negative (and it could never collect a single treasure).
So I don't know, maybe he wants me to do it anyway and demonstrate that it can't be implemented like that. I will ask for a meeting with him and see what comes out.
Thank you so much for your time and your considerations. When I finish with the tabular version I will start the DQN with a CNN. If I have some doubts, would you mind if I still ask you some things? I don't mean to disturb you further, but I found this discussion and your videos about the topic really useful for better approaching the problem.
how can we solve the problem using a single DQN instead of 2?
You can use the same policy network in place of the target network. Training results may be worse than using 2 networks.
And which is the best solution between the single DQN and the Q-learning method?
@ElisaFerrari-q5i DQN is the advanced version of Q-learning.
Sorry, may I ask how I can find the max_step in the training of every episode? How do I know that the max number of actions is 200?
You actually have to add a few lines of code to enable the max step truncation. Like this:
from gymnasium.wrappers import TimeLimit
env = gym.make("FrozenLake-v1", map_name="8x8")
env = TimeLimit(env, max_episode_steps=200)
@johnnycode Thank you very much. Sorry, may I ask another question: if my agent now has a health state = 16 and every step costs one blood, how can I let the agent know this? I mean, let the agent also consider the health state in DQN training. Can you give me some thoughts? Sorry, I'm a newbie, and this may be stupid to ask. Thanks.
You are talking about changing the reward/penalty scheme; that is something you have to change in the environment. You have to make a copy of the FrozenLake environment and then make your code changes. If you are on Windows, you can find the file here: C:\Users\\.conda\envs\gymenv\Lib\site-packages\gymnasium\envs\toy_text\, in frozen_lake.py
For example, I made some visual changes in the FrozenLake environment in the following video, however, I did not modify the reward scheme: th-cam.com/video/1W_LOB-0IEY/w-d-xo.html
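As a lighter-weight alternative to copying the environment source, a generic wrapper can track the health idea from the question. This is only a sketch with a minimal stand-in step() signature (obs, reward, done), not Gymnasium's actual API:

```python
# Deduct 1 health per step, end the episode at 0, and expose health in the
# observation so the agent can factor it into its decisions.
class HealthWrapper:
    def __init__(self, env, max_health=16):
        self.env = env
        self.max_health = max_health
        self.health = max_health

    def reset(self):
        self.health = self.max_health
        return (self.env.reset(), self.health)

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.health -= 1  # each step costs one "blood"
        if self.health <= 0:
            done = True  # running out of health ends the episode
        return (obs, self.health), reward, done
```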
Any chance you can show us how to use Keras and RL on Tetris?
Thanks for the interest, but I'm no RL expert and totally not qualified to give private lessons on this subject. I'm just learning myself and making the material easier to understand for others. As for Tetris, I have not tried it but may attempt it in the future.
Thank you for telling me how to create the environment in the first place, since I would like to create an environment like your 4x4 but with other images.
the goat
Great video! Maybe the next env to try is Mario?
Thanks, I’ll consider it.