What is Q-Learning (back to basics)

  • Published on May 14, 2024
  • #qlearning #qstar #rlhf
    What is Q-Learning and how does it work? A brief tour through the background of Q-Learning, Markov Decision Processes, Deep Q-Networks, and other basics necessary to understand Q* ;)
    OUTLINE:
    0:00 - Introduction
    2:00 - Reinforcement Learning
    7:00 - Q-Functions
    19:00 - The Bellman Equation
    26:00 - How to learn the Q-Function?
    38:00 - Deep Q-Learning
    42:30 - Summary
    Paper: arxiv.org/abs/1312.5602
    My old video on DQN: • [Classic] Playing Atar...
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 180

  • @raunaquepatra3966
    @raunaquepatra3966 5 months ago +291

    this is how you leverage the hype like a true gentleman 😎

    • @MideoKuze
      @MideoKuze 5 months ago +4

      Yes! Jumping on the hype train like a sir. One upvote from me, you're welcome for the gold.

  • @EdFormer
    @EdFormer 5 months ago +29

    I have no time for the hype, but I have all the time in the world for a classic Yannic Kilcher paper explanation video

  • @guidaditi
    @guidaditi 5 months ago +30

    Thank Q!

  • @changtimwu
    @changtimwu 5 months ago +39

    Thanks for such a solid, fundamental introduction to Q-learning, especially at a time when many are really excited about Q-star but few seem to try to understand its basic principles.

  • @Alilinpow2
    @Alilinpow2 5 months ago +9

    Thank you Yannic, your style of surfing the hype is the best!!!

  • @K.F-R
    @K.F-R 5 months ago +6

    This was very informative. Thank you so much for sharing.

  • @qwerty123443wifi
    @qwerty123443wifi 5 months ago +3

    Love these paper videos, the reason I subscribed to the channel :)

  • @travisporco
    @travisporco 5 months ago +3

    thanks for posting this; good to see some real content

  • @OrdniformicRhetoric
    @OrdniformicRhetoric 5 months ago +9

    I would be very interested in seeing a series of paper/concept reviews such as this focusing on the state of the art in RL

  • @agenticmark
    @agenticmark 5 months ago

    Another awesome video from you Yannic! Gold material on this channel.

  • @Dron008
    @Dron008 5 months ago

    Thank you, great explanation!

  • @cezary_dmowski
    @cezary_dmowski 5 months ago

    perfect for my sunday. appreciated!

  • @Alberto_Cavalcante
    @Alberto_Cavalcante 5 months ago

    Thanks for this explanation!

  • @maxbretschneider6521
    @maxbretschneider6521 2 months ago

    By far the best video on the topic

  • @nickd717
    @nickd717 5 months ago

    This is great. You’re a true wizard in explaining Q, and I love the anonymous look with the sunglasses. You’re a regular Q-anon shaman.

  • @drdca8263
    @drdca8263 5 months ago +2

    I will make sure to stay hydrated, thank you

  • @matskjr5425
    @matskjr5425 5 months ago

    By far the most effective way of learning. Hacking at the essence, in a chain of thought manner.

  • @ceezar
    @ceezar 5 months ago +5

    I did deep q learning for my cs bachelors thesis way back. Thank you so much for reminding me of those memories.

    • @clray123
      @clray123 5 months ago +1

      of those terrible memories ;)

    • @abnormal010
      @abnormal010 5 months ago

      What are you currently doing?

  • @user9924
    @user9924 5 months ago

    Thanks man for the explanation

  • @JuanColonna
    @JuanColonna 5 months ago +2

    Great explanation

  • @neocrz
    @neocrz 5 months ago +3

    Very informative.

  • @vorushin
    @vorushin 5 months ago +3

    18:00 In chess terms, 'Reason 1' can be likened to: 1) Choosing a1 means you won't capture any of your opponent's pieces. 2) Opting for a2 allows you to swiftly capture a substantial piece.

  • @drhilm
    @drhilm 5 months ago +2

    Old paper review - yeh! we missed that.

  • @draken5379
    @draken5379 5 months ago

    A good example for what you were talking about just before the Bellman equation would be that Move B (10 reward) will help take a chess piece in the future, whereas Move A moves away from that outcome, or might even let the opponent take the piece, making the 'next move' the policy would want impossible.

  • @jurischaber6935
    @jurischaber6935 5 months ago

    Thanks again.😊

  • @dr.mikeybee
    @dr.mikeybee 5 months ago +3

    Good job!

  • @hilmiterzi3847
    @hilmiterzi3847 5 months ago

    Nice explanation G

  • @sultanzabu
    @sultanzabu 5 months ago

    great explanation

  • @AncientSlugThrower
    @AncientSlugThrower 5 months ago +1

    Great video.

  • @MichaelScharf
    @MichaelScharf 5 months ago +1

    Great video

  • @alexd.3905
    @alexd.3905 4 months ago

    very good explanation video

  • @michaelbondarenko4650
    @michaelbondarenko4650 5 months ago +33

    Will this be a series?

  • @jackschultz23
    @jackschultz23 5 months ago +1

    My dude, that point you mention at 45:05, right at the end, about having the state and actions as the input is exactly the question I've been trying to find an answer to. Seeing and hearing it mentioned twice, but each time with you saying you're not going to talk about it, felt like a knife in the heart. If you don't do a video on it, do you have papers that talk through how this has been done? Great stuff either way, I was able to learn a bunch.

  • @EnricoGolfettoMasella
    @EnricoGolfettoMasella 5 months ago +9

    During your explanation, Dijkstra's algorithm came to mind. They say this Q* can increase the processing needs some 1000 times: you check all the paths in your graph and choose the ideal one.

    • @ericbabich
      @ericbabich 5 months ago

      Maybe not, if you consider that a non-reward end means you have to run the whole process again, and perhaps the only way to reach a satisfactory answer is to employ a checking mechanism that reduces the chance of failure for some questions.

    • @notu483
      @notu483 5 months ago +1

      Yes, and A* is even better than Dijkstra for pathfinding.

    • @clray123
      @clray123 5 months ago +1

      And what pray tell might be the heuristic or reward function when it comes to next token generation? It seems it all hinges on the most important issue of first having to solve the problem which you are aiming to solve by your wonderful search algorithm.

  • @user-tg6lv6hv4r
    @user-tg6lv6hv4r 5 months ago +4

    I realize that I read this paper ten years ago. Now I'm ten years older omg.

  • @dreamphoenix
    @dreamphoenix 5 months ago

    Thank you.

  • @tchlux
    @tchlux 5 months ago +20

    Yeah I guess that Q-star will run multiple completions for each prompt with the large language model and then model the cumulative probability of the next token over the different completions. To trim the search space they probably do one full response at a temperature of 0 (only pick highest likelihood next tokens), then pick the few places in the response where it was closest to picking a different token and explore the graph of alternative responses that way, similar to the greedy A-star search for a best path. Alternatively they could just generate a few responses with a small temperature.
    If they generate a bunch of completions that way then they could create a Q estimator to improve the selection of tokens at the current step of the response for longer time horizon "correctness". At runtime they could use that Q estimator and an A* approach (greedy after adding in heuristic) to pick next tokens, which encourages the model to "think ahead" better than current approaches.
    Without the ability to reassess final responses, and "go back and change its mind" (which would be a lot more computationally expensive), I suspect we'll still see lots of examples of the large language models being confidently wrong, but I guess we'll find out soon!

    • @yohanhamilton7149
      @yohanhamilton7149 5 months ago +1

      Quote: 'I suspect we'll still see lots of examples of the large language models being confidently wrong.' It's like how we humans think twice about what we are going to say before speaking; it's always a good habit (though more computationally expensive) to do so. The same goes for an LLM: Q* might be mimicking that human habit by forecasting the reward of saying A instead of B using some "smart" heuristics (just like A*'s distance heuristic when deciding which state to explore next).

    • @clray123
      @clray123 5 months ago +2

      The problem with your "approach" is that you "forgot" to define the reward function.

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q 5 months ago +1

    With regards to that future discounting, it's not just that you'd "like" to have it right now. It's that it's more useful right now and so $100 now is more useful than $100 tomorrow. If only because I could invest that $100 I got today, and have $100 + interest tomorrow. Economics formally defines these things with stuff like net present value.
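
    A tiny numeric sketch of that point (the 5% rate below is an arbitrary assumption for illustration, not a figure from the video or the comment):

        # $100 today vs. a promised $100 tomorrow, assuming a 5% annual interest rate.
        annual_rate = 0.05
        daily_rate = (1 + annual_rate) ** (1 / 365) - 1

        # Net present value of $100 received tomorrow, discounted back by one day:
        npv_of_tomorrow = 100.0 / (1 + daily_rate)

        print("$100 today:                  100.0000")
        print(f"$100 tomorrow, valued today: {npv_of_tomorrow:.4f}")
        # The gap plays the same role as the discount factor gamma in RL.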

  • @luckyrand66
    @luckyrand66 5 months ago

    nice video!

  • @Seehart
    @Seehart 5 months ago +6

    15:30 You still tend to want a discount < 1 in things like chess with a terminal reward. All other things being equal, you want to win sooner than later. Otherwise, with discount=1, you might forego a mate in one if a different mate is within your horizon, and that could go on forever (or perhaps for 49 unnecessary moves). I use 0.9999 for that kind of scenario, which is sufficient.
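
    A quick numeric sketch of why a discount just below 1 breaks ties toward the faster win (using the 0.9999 from the comment; the reward of 1 for checkmate is an illustrative assumption):

        gamma = 0.9999            # discount factor suggested in the comment
        terminal_reward = 1.0     # assumed reward for delivering checkmate

        def discounted_win(moves_to_mate: int) -> float:
            # The reward arrives after moves_to_mate steps, so it is discounted that many times.
            return (gamma ** moves_to_mate) * terminal_reward

        print(discounted_win(1))   # mate in 1  -> 0.9999
        print(discounted_win(50))  # mate in 50 -> ~0.9950
        # With gamma = 1 both values would be exactly 1.0, so the agent
        # would have no reason to prefer the immediate mate.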

    • @therainman7777
      @therainman7777 5 months ago +1

      Good point, thanks for the info.

  • @NelsLindahl
    @NelsLindahl 5 months ago

    My favorite videos are the ones where Yannic draws everywhere... please build that into the future bot Yannic constitution...

  • @Rizhiy13
    @Rizhiy13 5 months ago +1

    40:50 Why limit the possible actions only to best and random? Why not sample according to something like softmax of Q?
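
    For reference, a minimal sketch of the softmax (Boltzmann) action selection the comment suggests, next to epsilon-greedy; the temperature and epsilon values are arbitrary illustrative choices:

        import numpy as np

        def epsilon_greedy(q_values: np.ndarray, epsilon: float = 0.1) -> int:
            # With probability epsilon take a uniformly random action, otherwise the best one.
            if np.random.rand() < epsilon:
                return int(np.random.randint(len(q_values)))
            return int(np.argmax(q_values))

        def softmax_action(q_values: np.ndarray, temperature: float = 1.0) -> int:
            # Sample actions in proportion to exp(Q / temperature) (Boltzmann exploration).
            prefs = (q_values - q_values.max()) / temperature   # subtract max for numerical stability
            probs = np.exp(prefs) / np.exp(prefs).sum()
            return int(np.random.choice(len(q_values), p=probs))

        q = np.array([1.0, 0.5, -0.2])
        print(epsilon_greedy(q), softmax_action(q))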

  • @ProblematicBitch
    @ProblematicBitch 5 months ago +10

    I need someone to upload the Q function to my brain so my life choices start making sense

    • @banknote501
      @banknote501 5 months ago +7

      Maybe just try to lower the epsilon to make less random choices in life?

    • @2ndfloorsongs
      @2ndfloorsongs 5 months ago +1

      Your brain comes preloaded with a Q function and it's following it. Make some popcorn and enjoy the show.

    • @user-dt7px5xp6z
      @user-dt7px5xp6z 5 months ago

      @@2ndfloorsongs Some entities don't seem to have it

    • @DeltafangEX
      @DeltafangEX 5 months ago

      It's possible they'll make sense in retrospect as the most optimal path far in the future.
      Or it could be that your most optimal path will always suck from your perspective but in fact provides the least amount of suffering possible.
      So...look on the bright side?

    • @33markiss
      @33markiss 5 months ago +1

      @@user-dt7px5xp6z
      That’s called Natural Selection, another algorithm of “nature”.

  • @visuality2541
    @visuality2541 5 months ago

    Could you also explain the A* algorithm in detail?

  • @jimmy21584
    @jimmy21584 5 months ago

    Reminds me of the minmax algorithm that I used for a Game Boy Advance board game back in the day.

  • @JonathanYankovich
    @JonathanYankovich 5 months ago +1

    This is great, thank you. Have some engagement.

  • @oraz.
    @oraz. 5 months ago

    Is the update done at each step, or do you actually have to recurse to the end to get R?

  • @fixfaxerify
    @fixfaxerify 5 months ago

    Hmm... in classic chess algos you have board-eval / positional-analysis functions; those should be usable as a reward function, no?

  • @skipintro9988
    @skipintro9988 5 months ago

    Yannic is the best

  • @2ndfloorsongs
    @2ndfloorsongs 5 months ago +2

    When things get hard for me to understand, I find myself blankly staring at your sunglasses. I like to tell myself it's some sort of behavioral adaptation that provides a survival advantage of some sort. After staring at your sunglasses for a few minutes, I find I can detach myself enough to get up and make popcorn. All this evolution stuff usually comes down to food.

    • @nisenobody8273
      @nisenobody8273 5 months ago +1

      same

    • @clray123
      @clray123 5 months ago

      It seems you've already mastered your Q function, what else is there to learn?

  • @EdanMeyer
    @EdanMeyer 5 months ago

    The timing on this is too good lmao

  • @JohnSmith-he5xg
    @JohnSmith-he5xg 5 months ago +2

    I'm a big fan. I'm impressed by you being able to speak/draw this off the cuff, but it might be better next time to refer to the printed out equations. It got a bit messy.

  • @Eric-eo1dp
    @Eric-eo1dp 5 months ago

    It's primal-dual optimization on neural networks. Currently, researchers use infinite network theory to approach the global solution in neural networks. Primal-dual achieves the same goal by transforming the space of the neural network.

  • @pi5549
    @pi5549 5 months ago +1

    Yannic, when you invent time-travel can you go back to 2015 and re-upload? This will save me from struggling through Sutton's book.

  • @Halopend
    @Halopend 5 months ago

    Nice, very clear explanation. It's only at the end with gradient descent where I got lost as to what the comparison is between. Total reward vs ____________? I normally think of gradient descent as y vs y', actual vs measured, or current vs next, updating knowledge based on the diff.
    Not seeing the y' here. If I need a different analogy here, let me know.

  • @davidbell304
    @davidbell304 5 months ago

    Hey Yannick. Thought you might like to know, Juergen Schmidhuber is claiming Q* as his invention 😀

  • @lincolt
    @lincolt 5 months ago

    10:12 my favorite type of pie

  • @warsin8641
    @warsin8641 5 months ago

    Ty

  • @KolTregaskes
    @KolTregaskes 5 months ago

    36:00 Yannic goes Geordie on us and starts repeating way aye over and over, hehe.

  • @visuality2541
    @visuality2541 5 months ago

    Lovely

  • @adamrak7560
    @adamrak7560 5 months ago

    it is shockingly simple, compared to how powerful it is at solving problems.

  • @gianpierocea
    @gianpierocea 5 months ago

    Being overly pedantic, but in terms of notation at 24:27 you want argmax, not max: this is a policy so it should spit out an action a, right? Very clear exposition, nice video :)

  • @idiomaxiom
    @idiomaxiom 5 months ago

    So ChatGPT would try to guess what my next question will be after its response and optimize for that, effectively learning to read my mind?

  • @wolffischer3597
    @wolffischer3597 5 months ago +3

    Thanks a lot, really good video! What I am wondering: how exactly could that translate to an LLM? You said that the possible actions would depend on the token space. So how would the Q function then know that it had reached the final state with the highest reward possible? I understand how that works for chess (checkmate), but for a natural language prompt that states some ambiguous problem or statement? Maybe that is why the rumours said this works for grade-school-level math problems, which is a very specific subset of the whole space out there (and it is still large). I can't yet imagine making this work properly for something like GPT-4 or Bard or..., especially not for a customer-grade solution.

    • @therainman7777
      @therainman7777 5 months ago +2

      I think the idea is that you seed the whole process with human-provided feedback, that is given at the step-by-step level after instructing the model to “reason step by step.” That’s what the “Let’s Verify Step by Step” paper is all about. Rather than simply checking whether the model got the final answer right, humans grade every individual step of the model’s reasoning. A reward model is then trained on this human feedback data, and learns to mimic a human grader at assessing the quality of a reasoning step.
      Once you have this reward model, you can grade any string that is meant to be an expression of reasoning, and you can do this one token at a time, so that you can test out the search space in a tree-like fashion, and choose the optimal next token in terms of maximizing the expected value of the finished string.
      Does that make sense? You’re starting with (probably expert) human feedback as a seed, training a model to emulate such human feedback, and then using reinforcement learning to search the space and improve both the token selection AND the reward model that you initially trained on human feedback. The fact that you’re improving both the next-token prediction (the “policy”) AND the reward model (the “Q table”) at the same time is critical, as this is what makes this approach similar to self-play systems such as AlphaGo. Both sides are being optimized simultaneously, so that the model is teaching itself. This is how these systems can ultimately achieve superhuman performance, even though they were initially seeded with human data.
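
      A rough sketch of the search loop described above, with a hypothetical reward_model() and propose_next_steps() standing in for the trained components (both names and the scoring are assumptions for illustration, not anything confirmed about Q*):

          import heapq

          def reward_model(partial_solution: str) -> float:
              # Stand-in for a process reward model trained on step-level human feedback;
              # here, as a dummy, longer partial solutions simply score higher.
              return float(len(partial_solution))

          def propose_next_steps(partial_solution: str) -> list:
              # Stand-in for the policy (the LLM) proposing candidate next reasoning steps.
              return [partial_solution + " step-a", partial_solution + " step-b"]

          def best_first_search(prompt: str, max_expansions: int = 10) -> str:
              # Expand the highest-scoring partial solution first, in a greedy / A*-like fashion.
              frontier = [(-reward_model(prompt), prompt)]
              best = prompt
              for _ in range(max_expansions):
                  if not frontier:
                      break
                  neg_score, node = heapq.heappop(frontier)
                  if -neg_score > reward_model(best):
                      best = node
                  for child in propose_next_steps(node):
                      heapq.heappush(frontier, (-reward_model(child), child))
              return best

          print(best_first_search("Let's think step by step."))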

    • @clray123
      @clray123 5 months ago +1

      The answer is nobody knows, probably not even OpenAI's marketing team.

    • @wolffischer3597
      @wolffischer3597 5 months ago

      @@therainman7777 Hm, I sadly am not an expert on AI or LLMs, just an enthusiastic amateur. From my point of view it could make sense, but I still can't picture how you would scale that to large and unknown domains. And in the end it probably still is not "reasoning" but "just" recall of trained associations/patterns in a known domain, and as soon as you deviate from that domain your results start to get worse and worse... But let's see. I am not too unhappy if it takes more time until the singularity ;)

    • @therainman7777
      @therainman7777 5 months ago +1

      @@wolffischer3597 No, the point of reinforcement learning is specifically that it is NOT just recalling patterns or learned associations that it saw in the training data. With reinforcement learning, the model is free to explore the _entire_ search space, meaning it can (and usually does) stumble onto solutions and methods that no human being has ever thought of before. This happened with AlphaGo, along with countless other RL systems.

    • @wolffischer3597
      @wolffischer3597 5 months ago

      @@therainman7777 hm yes, but it is still go, right? And it is still pattern matching, although highly complex patterns, combined with tree search and some temperature probably for some randomisation..?

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q 5 months ago

    I think what's missing here is model regression and simulated playback (dreaming).

  • @SLAM2977
    @SLAM2977 5 months ago +12

    Teaching AGI already :)

  • @LostMekkaSoft
    @LostMekkaSoft 5 months ago

    Wow, turns out I accidentally invented Q-learning myself when I started university. (sry for the Schmidhuber vibes lol)
    I didn't have the mathematical background, but I knew how neural networks work in theory. The way I thought about it was this: suppose you have the complete state tree of a game, so it starts at the starting state and the tree contains all actions and all resulting states, and therefore also all the terminal states. I only know the reward for the terminal states, but I can play a random (or semi-random) game and this will give me a path from the starting state to one of the terminal states. Then I can take the value of that terminal state and kinda "smear" it backwards, with the reward value of every state in the path getting nudged a bit in the direction of the known reward of the terminal state. And I imagined that a neural net could be trained in a way that organically "smears" all the useful known reward values from the terminal states backwards, so that after a bit of training time it would have a good reward value on any given state.
    When I came up with this I was super excited and started to implement this thing, but I wasn't that experienced yet, so every attempt to build my own neural network stuff just resulted in the weights diverging xDD But I'm super happy to know now that my idea was spot on at least ^^
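
    A minimal tabular sketch of that "smearing backwards" idea on a toy chain of states (the environment, learning rate, and discount below are made up for illustration):

        import random

        # Toy game: states 0..4 on a line; actions move -1 or +1. Only reaching
        # state 4 ends the game, with reward 1. All other rewards are 0.
        TERMINAL, GAMMA, ALPHA = 4, 0.9, 0.5
        Q = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}

        for _ in range(500):                     # play many (semi-)random games
            s = 0
            while s != TERMINAL:
                a = random.choice((-1, +1))
                s_next = min(max(s + a, 0), TERMINAL)
                r = 1.0 if s_next == TERMINAL else 0.0
                best_next = 0.0 if s_next == TERMINAL else max(Q[(s_next, b)] for b in (-1, +1))
                # TD update: nudge Q(s, a) toward reward + discounted best future value.
                Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
                s = s_next

        # The terminal reward has been "smeared" backwards: values shrink with distance from state 4.
        print([round(max(Q[(s, b)] for b in (-1, +1)), 3) for s in range(4)])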

  • @user-hw2bb9jy5l
    @user-hw2bb9jy5l 5 months ago

    Based simply on the name alone, without watching the video, I would assume the Q stands for quantum and the algorithm is a positive-reward reinforcement, also based on a tree of thoughts that calculates the highest probability for the best outcome, and it would probably decide finally based on a principle like Occam's razor.

  • @ohadgivaty2366
    @ohadgivaty2366 4 months ago

    Such an algorithm could turn an LLM from something that simply answers questions into something with a clear goal, like a salesman, or something that convinces people to vote for a certain candidate.

  • @sagetmaster4
    @sagetmaster4 5 months ago +1

    I get your skepticism, but names mean something; programmer types especially tend to pick sensible names. Whether this thing is just conceptually similar or actually shares architectural similarities, I think at least one of these lines of speculation is pretty close to the real Q*.

  • @therealjezzyc6209
    @therealjezzyc6209 2 months ago

    This looks a lot like dynamic programming and the Bellman Equation

  • @JinKee
    @JinKee 5 months ago +1

    I wonder if the star in Q* is a reference to A* pathfinding

    • @awillingham
      @awillingham 5 months ago +3

      You can build out a graph of states with actions connecting them, and then use the Q function as the heuristic you need as input to A*, and you can effectively search the problem space to find an optimal path to the solution (for your input Q function).
      I think this technique would let you efficiently search complex problem spaces
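
      A compact sketch of that combination on a toy state graph, with a hand-written heuristic standing in for a learned Q/value estimate (the graph, costs, and heuristic values are invented for illustration):

          import heapq

          # Toy state graph: each edge carries a step cost; "G" is the goal.
          GRAPH = {
              "S": [("A", 1.0), ("B", 4.0)],
              "A": [("G", 5.0)],
              "B": [("G", 1.0)],
              "G": [],
          }

          def heuristic(state: str) -> float:
              # Stand-in for a learned estimate of the remaining cost to the goal.
              return {"S": 4.0, "A": 4.5, "B": 1.0, "G": 0.0}[state]

          def a_star(start: str, goal: str) -> list:
              frontier = [(heuristic(start), 0.0, start, [start])]   # (f = g + h, g, state, path)
              best_g = {start: 0.0}
              while frontier:
                  f, g, state, path = heapq.heappop(frontier)
                  if state == goal:
                      return path
                  for nxt, cost in GRAPH[state]:
                      g_next = g + cost
                      if g_next < best_g.get(nxt, float("inf")):
                          best_g[nxt] = g_next
                          heapq.heappush(frontier, (g_next + heuristic(nxt), g_next, nxt, path + [nxt]))
              return []

          print(a_star("S", "G"))   # expected: ['S', 'B', 'G'] with total cost 5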

    • @oraz.
      @oraz. 5 months ago

      Many people are saying this!

    • @JinKee
      @JinKee 5 months ago

      @@awillingham winston churchill once said "you can always count on the united states to do the right thing, once they've tried everything else." Sounds like our government needs to implement Q*

  • @bebeperrunocanino2337
    @bebeperrunocanino2337 3 months ago

    I learned the Q-Learning algorithm with the help of an AI; the AI taught me how, and I was able to do it.

  • @clray123
    @clray123 5 months ago

    14:44 Actually, the lack of certainty about the future is the only reason why we've evolved to be impatient and greedy about getting our rewards. If there were a guarantee for every future promise to be fulfilled, and also for our life (and health and other circumstances related to the success of consuming the reward) to be extended so as to be exactly the same later as it is now, there would be no reason to hurry at all.
    P.S. This is also why "today or tomorrow" is an unconvincing example for time preference. Make it "today or in a hundred years" and everyone will understand and agree.

  • @kinwong8618
    @kinwong8618 5 months ago +1

    I thought the discount factor is a constant.

  • @pensiveintrovert4318
    @pensiveintrovert4318 5 months ago +3

    I am speculating that it is named after James Bond nerd sidekick Q. As good as your speculation.

    • @EdFormer
      @EdFormer 5 months ago +1

      What speculation?

    • @2ndfloorsongs
      @2ndfloorsongs 5 months ago +1

      @@EdFormer Speculation as to what OpenAI meant by Q star.

    • @EdFormer
      @EdFormer 5 months ago +1

      @@2ndfloorsongs when did he speculate about what OpenAI meant by Q*?

    • @2ndfloorsongs
      @2ndfloorsongs 5 months ago

      @@EdFormer He alluded to it in the first few minutes. It was a continuation of his last video updating the AI debacle. That's pretty much the whole reason he was giving this tutorial on Q.

    • @EdFormer
      @EdFormer 5 months ago

      @@2ndfloorsongs You didn't sense the sarcasm, even after conditioning yourself on his views in the previous video? I.e. the one where he ridiculed those speculating that Q* is an AGI that combines Q-learning with A*, joked that it could just have come from someone holding the shift key and mashing the left hand side of the keyboard, and called everything going on with OpenAI a clown car?

  • @jaymee_
    @jaymee_ 5 months ago

    Ok so effectively it's just comparing two different policies, that might actually be the same policy but with a single check before taking the following step? Like playing Tetris knowing what the next piece is going to be?

    • @thorcook
      @thorcook 5 months ago

      sort of... but i think it's actually _recursively_ comparing policies to 'itself' (the 'composed' or embedded [prior] policy) to iterate through steps. at least that's how Q [reinforcement] learning works

  • @keypey8256
    @keypey8256 5 months ago +2

    I'm just at 26:09 and so far it has been just min-max for single-player.
    Edit:
    It's funny that I don't know much about machine learning but have already seen all of those ideas in other fields. This kind of shows that the ideas from papers also make their way into other fields.

  • @alleycatsphinx
    @alleycatsphinx 5 months ago

    You should really go the other direction with this video - instead of starting at succession, ask what a number is (binary enumeration) and then delve into how and why binary succession works.
    From there you could go into binary addition (perhaps looking into how carry works,) multiplication, division, exponentiation, etc...
    It isn't trivial that a shift in binary is the equivalent of multiplication by two - there's a good video in all this. : )

  • @indikom
    @indikom 5 months ago

    A policy is a strategy for which action to choose; for example, you can choose to be more greedy.

    • @washedtoohot
      @washedtoohot 5 months ago

      Is this true? Iirc greediness is a parameter

    • @clray123
      @clray123 5 months ago

      Policy/strategy is a really stupid word for a function which produces an action given a state. But since the term was chosen so multiple decades ago, we have to suffer and live with that. "Action function" might have been less pompous and misleading, but in case you haven't noticed yet, AI people really like to pretend their inventions are smarter than actual, and this has remained true throughout history of the field.

    • @indikom
      @indikom 5 months ago

      @clray123 I don't agree. The term "policy" concisely describes a fundamental concept: the decision-making strategy of an agent.

    • @clray123
      @clray123 5 months ago

      @@indikom The problem is that "policy" / "strategy" in colloquial use are both much broader concepts, and are understood as some abstract considerations that GUIDE decision-making, not a definite mapping of which action to take given a particular state of affairs. Lawmakers who devise policies or managers who devise strategies do not continuously stalk every citizen/employee to prescribe them what to do at every possible decision point. But this is what the "policy" in AI accomplishes. So I repeat, this is a bad term which sows confusion through analogy to real-life uses of the same word. But theoretical sciences, including mathematics, are full of such weird misnomers, perhaps stemming from the fact that researchers who work in them have no clue about how real life operates beside them.

  • @mikebarnacle1469
    @mikebarnacle1469 5 months ago

    The 'call Magnus Carlsen' example is funny. I'm just imagining ChatGPT figured out long ago that it can call humans and write back what they say, and we had no idea this was happening all along and we were actually talking to people through an intermediary. Would explain why I have been getting so many random calls asking for medical advice lately. I always just say go see a doctor.

  • @serta5727
    @serta5727 5 months ago +1

    Q-ute star 🌟😊

  • @dullyvampir83
    @dullyvampir83 5 months ago +1

    So this wouldn't work well if there is no immediate reward for a move, like in chess?

    • @clray123
      @clray123 5 months ago

      No, the whole point is that it works even with no immediate reward like in chess (because of the discounted future reward component which kinda transports information about the final reward back across all the time steps preceding it). But having (additional and correct) immediate rewards/penalties along the path helps guide the algorithm to converge on the optimal solution faster.

    • @dullyvampir83
      @dullyvampir83 5 months ago

      @@clray123 And how do you get these immediate rewards for chess?

    • @clray123
      @clray123 5 months ago

      @@dullyvampir83 Arbitrarily - for example, you could introduce an immediate penalty on each move, to limit the number of moves per game. Or you could get statistics from records of games of successful real-world players, providing pressure to avoid certain configurations of pieces that occur more often in losers' games than in those of the winners.

  • @abdelkaioumbouaicha
    @abdelkaioumbouaicha 5 months ago +6

    📝 Summary of Key Points:
    📌 Q-learning is a concept in reinforcement learning where an agent interacts with an environment, receiving observations and taking actions based on those observations.
    🧐 The Q function is used in Q-learning to predict the total reward that would be obtained if a proposed action is taken in a given state. It helps the agent make decisions about which actions to take.
    🚀 The Markov decision process assumes that observations are equivalent to states, and discounting future rewards is important in reinforcement learning.
    🚀 The Bellman equation describes the relationship between the Q value of a state-action pair and the immediate reward plus the discounted future reward.
    🚀 Q-learning can be used to estimate the Q function by iteratively updating the Q values based on observed rewards and future Q values.
    🚀 Neural networks, particularly in Deep Q-learning for playing Atari games, can be used in Q-learning. Experience replay, where transitions are stored and sampled to train the Q function, is also mentioned.
    💡 Additional Insights and Observations:
    💬 "The Q function predicts the total reward that would be obtained if a proposed action is taken in a given state."
    📊 No specific data or statistics were mentioned in the video.
    🌐 No specific references or sources were mentioned in the video.
    📣 Concluding Remarks:
    This video provides a clear introduction to Q-learning and its application in reinforcement learning. It explains the concept of the Q function, the Markov decision process, the Bellman equation, and the iterative process of updating Q values. The video also touches on the use of neural networks and experience replay in Q-learning. Overall, it provides a solid foundation for understanding Q-learning and its role in decision-making and learning optimal policies.
    Generated using Talkbud (Browser Extension)
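
    Since experience replay gets only a one-line mention above, here is a bare-bones sketch of the replay mechanics (the buffer size, batch size, and the fake environment are arbitrary illustrative choices; the Q-network itself is omitted):

        import random
        from collections import deque

        BUFFER_SIZE, BATCH_SIZE, GAMMA = 10_000, 32, 0.99
        replay_buffer = deque(maxlen=BUFFER_SIZE)

        def store(state, action, reward, next_state, done):
            # Each interaction step becomes one transition tuple in the buffer.
            replay_buffer.append((state, action, reward, next_state, done))

        def sample_batch():
            # Uniformly sampled, decorrelated transitions to train the Q-network on.
            return random.sample(replay_buffer, min(BATCH_SIZE, len(replay_buffer)))

        def td_targets(batch, q_next_fn):
            # q_next_fn(next_state) would be the (target) network's Q-values for that state.
            return [r if done else r + GAMMA * max(q_next_fn(ns))
                    for (_, _, r, ns, done) in batch]

        # Tiny usage example with a fake environment and a constant Q estimate:
        for step in range(100):
            store(state=step, action=step % 2, reward=1.0, next_state=step + 1, done=(step == 99))
        print(td_targets(sample_batch(), q_next_fn=lambda s: [0.0, 0.5])[:5])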

  • @Summersault666
    @Summersault666 5 months ago +2

    Q* = Bellman + A* search ?

    • @watcher8582
      @watcher8582 5 months ago +1

      I've seen people mention A* search, but is there a hint for that? Already in Q-learning you've got objects named Q*. I mean, in optimization, adding a star usually just means "the solution".

    • @Summersault666
      @Summersault666 5 months ago

      @@watcher8582 maybe changing beam search to A* search using reward as distance in a vector knowledge graph database instead of full reinforcement learning?

  • @TheDukeGreat
    @TheDukeGreat 5 months ago

    Ah shit, here we go again

  • @OperationDarkside
    @OperationDarkside 5 months ago

    I think I got like 30%-40%. I definitely lack pre-existing knowledge.

  • @clray123
    @clray123 5 months ago

    Reinforcement learning is how most businesses are run. Investors to management: make us munnnnniiies, and up to you to figure out how. Management to employees: make us munnnniesssss and you go figure out how. Employees: .

  • @seidtgeist
    @seidtgeist 5 months ago +1

    🧐What if Q* is a very believable Star (*!) Trek reference and, basically, the biggest and most Q-esque diversion troll ever? 🧐

  • @barni_7762
    @barni_7762 5 months ago

    Nice

  • @garyyakamoto2648
    @garyyakamoto2648 a month ago +1

    Thanks. I wish you didn't have to go to such a low bass in your voice; it's like continuous drilling in the brain.

  • @Henry_Okinawa
    @Henry_Okinawa 5 months ago +1

    I really think that all this drama around Altman is fictional, meant to advertise a new product they have prepared.

    • @clray123
      @clray123 5 months ago

      You got it wrong, the drama is real, the product is fictional.

  • @amansinghal5908
    @amansinghal5908 2 months ago

    Amazing! Are you open to constructive feedback?

  • @rogerc7960
    @rogerc7960 5 months ago

    Elon: great, I'll use my time machine to go back to before Google bought DeepMind and invest in the start-up, and fork a copy of human reinforcement learning. Perfect for Tesla to learn how to drive...

  • @14types
    @14types 5 months ago +1

    Is this a madman from a mental hospital who makes up formulas on the fly?

    • @drdca8263
      @drdca8263 5 months ago +1

      He’s summarizing a well-known technique.

  • @Pierluigi_Di_Lorenzo
    @Pierluigi_Di_Lorenzo 5 months ago

    Q*anon. Doesn't exist, including the letter about it.

  • @broli123
    @broli123 5 months ago

    Q-star, is that QAnon's long-lost brother?

  • @scorber23
    @scorber23 5 months ago +1

    Q + .. Quantum Intelligence changes everything 💫 things are lining up / stacking up / leveling up

  • @alan2here
    @alan2here 5 months ago +2

    While I love GPT-4, remember that Open AI thinks very highly of Open AI, never underestimate this. "we are amazing everything's changing, major but entirely vague advancements, give us research money!111"

    • @therainman7777
      @therainman7777 5 months ago +1

      Their opinion of themselves is accurate and justified by their output. No one has shipped more groundbreaking AI developments than they have over the past few years, despite the fact that much larger and much richer companies (Google, Meta, etc) are trying as hard as they can. They deserve every penny of research money they’ve gotten, and have clearly spent it prudently given the results they’ve gotten are far better than other companies who have spent far more than them.
      It’s so easy to criticize and try to impugn the motives or reputation of people who are actually productive and creative; it’s much harder to produce and create yourself.

    • @clray123
      @clray123 5 months ago +1

      @@therainman7777 In case of OpenAI it's really hard to say whether the "ground-breaking developments" are based on brilliance or whether they have to do with having the first mover advantage (most pertinent training data to their application because of capturing the user base). In any case, we know that OpenAI did NOT invent the fundamental LLM algorithm (the transformer architecture) - and the famously cited transformer paper did not either. The cornerstone "attention" mechanism was proposed by Bahdanau from University of Bremen. So (as usual) you have to be careful where you assign credit for any ground-breaking inventions...

    • @therainman7777
      @therainman7777 5 months ago

      @@clray123 No, you really don’t understand. I don’t mean to be condescending. But I’m an AI researcher and have been in this field for nearly 20 years. I did not ever claim OpenAI invented the Transformer architecture. Everyone always throws that out as a strawman/red herring. Literally no one is saying they invented the Transformer. What we are saying is that the whole world has known about Transformers for six years now, yet no one has managed to do what OpenAI has done. They got there first, both with GPT-4 and several of their other best-in-class models, and they consistently ship groundbreaking PRODUCTS, ahead of everyone else, while many of their competitors are still trying to ship a single useful product. Including companies like Google who have orders of magnitude more resources. I never said anything about who invented the Transformer. I said they’ve invented an incredibly useful PRODUCT that no one else has been able to match, and actually they’ve pulled that off multiple times now, despite being a relatively small firm for much of their existence. They also have invented a number of additional techniques and components, and figured out clever ways to assemble others together, to get the result that they got, which to this day no one else has come close to in terms of performance on benchmarks, consumer adoption, business sector adoption, or anything else.

  • @DanFrederiksen
    @DanFrederiksen 5 months ago +1

    If you have the formulas in print beforehand it's much faster to convey. This classic blackboard approach of writing it out during lecture is very time inefficient.

    • @clray123
      @clray123 5 months ago

      But you have to incorporate entertainment value into your reward function.

    • @DanFrederiksen
      @DanFrederiksen 5 months ago

      @@clray123 that's a little meta :)