Using MuZero's Tree Search To Find Optimal Tic-Tac-Toe Strategy in a Spreadsheet

  • Published May 28, 2024
  • A video that explores how the MuZero algorithm combines aspects of Reinforcement Learning and Monte Carlo Tree Search to play efficiently. The animations in the video make use of a spreadsheet which acts as a worked example of the calculations involved in the algorithm. Formulas discussed include Upper Confidence Bounds (for Trees, also called UCB or UCT).
    MuZero Tree Search Spreadsheet:
    - bit.ly/MuZeroSheetCopy (if you want to mess with the values yourself)
    - bit.ly/MuZeroSheetView (if you just want to inspect)
    Playlist implementing other neural networks in a spreadsheet:
    - • Neural Networks in Spr...
    Reinforcement Learning explained by MIT:
    - • MIT 6.S191 (2022): Rei...
    MuZero Paper and pseudocode:
    - arxiv.org/pdf/1911.08265
    - arxiv.org/src/1911.08265v2/an...
    MuZero talk from one of the authors, Julian Schrittwieser:
    - • MuZero - ICAPS 2020
    - Slides drive.google.com/file/d/1nwRR...
    Monte Carlo Tree Search Visualization by Vinícius Garcia
    - • Monte Carlo Tree Searc...
    - vgarciasc.github.io/mcts-viz/
    Other Articles on Reinforcement Learning
    - / how-to-train-ai-agents...
    - dev.to/satwikkansal/a-gentle-...
    Other Articles/Videos on Monte Carlo Tree Search (MCTS)
    - towardsdatascience.com/monte-...
    - jeffbradberry.com/posts/2015/0...
    - / monte-carlo-tree-searc...
    - Animation by J O • Monte Carlo Tree Searc...
    Reinforcement Learning and RTS games by Edan Meyer
    - • Reinforcement Learning...
    Image Credits:
    - / 1
    - www.deepmind.com/research/hig...
    0:00 - Why is MuZero important?
    1:05 - Outline/Overview
    1:50 - MuZero's three neural networks
    3:11 - Key Vocabulary Terms - state, reward, policy, action, value
    4:50 - Assumptions for Tree Search Example
    5:18 - What will Tree Search find? (The best action)
    5:38 - Tree Search Setup
    6:55 - Tree Search Begins: Selection, Expansion & Evaluation, Tree Update
    11:37 - Upper Confidence Bound (UCB) Formula Explained
    14:12 - First Winning Move Discovered!
    15:03 - Tree Search Value Explained
    18:45 - Using Tree Search Results to Form Policy and Select Action
    20:30 - Collecting Training Data
    22:16 - Unrolling Representation, Dynamics, and Prediction Networks for Training
    24:26 - Playing Better with trained Neural Networks
    26:13 - Recap
    Special Thanks to
    - Robbie Close and Pat Berard for excellent feedback on initial drafts
    P.S. I was wondering where the UCB constant 19652 came from. I emailed the primary author and got the following response:
    "19652 was found using automatic hyperparameter optimization. The exact value is not critical i.e. 19000 or 20000 would work just as well."
    I looked more closely at Appendix C (Hyperparameters) and saw that they did this tuning for AlphaZero and simply reused the same UCB constants for MuZero.
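For reference, here is a rough Python sketch (my own variable names, not the paper's pseudocode) of the pUCT selection score from the MuZero paper, which the video walks through around 11:37. It shows where the tuned constant c2 = 19652 and its companion c1 = 1.25 enter the formula:

```python
import math

# pUCT score from the MuZero paper (Appendix B). c2 = 19652 is the constant
# discussed above; c1 = 1.25 is the paper's other exploration constant.
C1 = 1.25
C2 = 19652

def ucb_score(parent_visits, child_visits, child_value, child_prior):
    """Score one child action: exploitation (value) plus an exploration bonus."""
    # The bonus shrinks as this child is visited and grows with parent visits;
    # the log term barely moves until parent_visits approaches c2.
    exploration = child_prior * math.sqrt(parent_visits) / (1 + child_visits)
    exploration *= C1 + math.log((parent_visits + C2 + 1) / C2)
    return child_value + exploration

# Example: a child with prior 0.3, visited twice, average value 0.5,
# whose parent node has been visited 10 times in total.
print(ucb_score(parent_visits=10, child_visits=2, child_value=0.5, child_prior=0.3))
```

At small visit counts the log term is nearly zero, so c1 dominates the exploration scaling; the log term only starts to matter once a node has been visited on the order of c2 times.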

Comments • 9

  • @TimJSwan • 9 months ago • +10

    Aww, man, now I can no longer win all the games in “set the video compression settings more optimally than your opponent,” anymore. 😔

  • @walkastray007 • 9 months ago

    I know this video is a year old, but I have to say this was extremely well crafted. At no point did I ever feel lost while watching the video. I actually came here from watching two other recent videos, both related to #SOME. I really enjoy how you don't just use a ton of buzzwords, read textbook definitions, and pass them off as "explanations".

  • @khoakirokun217 • 4 months ago • +1

    This video is gold.

  • @souvikbhattacharyya2480 • 23 days ago

    Are you sure that the input to the dynamics network is the actual game state and not the representation network output?

  • @cyberkraken1606 • 8 months ago • +1

    Ok, so I know this video is old, but how do you train a network like this to know what valid states are, or what winning, losing, or draw states are?

    • @KevinLubick • 8 months ago

      At 20:40 or so, the video shows the process of turning gameplay into training data. In the very first round of training, the goal is to get the agent to know what the valid moves are, so the goal policy is all legal moves (the game engine told us these at each step) with more or less equal probability (the tree search has no idea what is good or bad). We also want to start teaching the representation and dynamics networks "how good" a state or action is. For board games (with a win or loss at the end only), we take the reward from the end of the game (again, provided by the game engine) and feed it to earlier states; this effectively pretends that each play along the way was "winning" or "losing".
      Let's suppose we played randomly for 1000 games and generated that training data. We train the agent; it starts to learn what moves are "good", and we then use *that* agent to play some more. This time, the agent's tree search will be better (it has learned a little bit about how to win), so the policies that are produced for training are a bit more focused towards the good moves. Play another 1000 games, and then re-train the agent on that data (I think the MuZero authors would use a random sample of training data skewed towards later generations [when the agent was better], but I could be misremembering).

    • @KevinLubick • 8 months ago

      I realize that was a long response, but in short: The game engine tells us about wins/losses and legal moves and we use some clever tricks to turn that into training data to teach the model how to look ahead better.
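A minimal sketch of the first-generation training-data step described in the replies above, assuming a two-player zero-sum game like tic-tac-toe; the function and variable names are mine, not the authors' code. Each finished game becomes (state, policy, value) examples: the policy target is uniform over the legal moves the game engine reported, and the final result is fed back to every earlier position, flipping sign each ply because the player to move alternates:

```python
# Sketch: turn one finished game into training examples (first generation,
# before the agent has learned anything). Names are illustrative, not MuZero's.

def game_to_training_data(states, legal_moves_per_state, final_reward):
    """states[i] is the position before move i; legal_moves_per_state[i] lists the
    legal moves there; final_reward is +1, -1, or 0 from the first player's view."""
    examples = []
    for i, (state, legal_moves) in enumerate(zip(states, legal_moves_per_state)):
        # Policy target: every legal move equally likely, because an untrained
        # tree search has no opinion yet about which moves are good.
        policy = {move: 1.0 / len(legal_moves) for move in legal_moves}
        # Value target: pretend every position along the way was as good (or bad)
        # as the final result; the sign alternates with the player to move.
        value = final_reward if i % 2 == 0 else -final_reward
        examples.append((state, policy, value))
    return examples
```

In later generations, the uniform policy target would be replaced by the visit-count distribution the tree search actually produced, which, as the reply says, becomes increasingly focused on good moves as the agent improves.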

  • @felipeolivos8934 • 1 year ago • +4

    What amazing content, thank you so much for sharing it!!! 👏