Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained)

  • Published 14 Oct 2024

Comments • 83

  •  3 years ago +7

    Thank you for analyzing this awesome paper Yannic, much appreciated.

  • @sebastianmestre8971
    @sebastianmestre8971 3 years ago +5

    If I understand correctly, you first do a forward pass to make some guesses, then you do a backward pass to find better guesses, then you do a parallel pass to improve weights. (though you can kick off the weight refinement on a separate thread as soon as you find the improved guess)
    The cool thing is that we can refine weights on multiple layers at once, instead of going one at a time, even if there are a few sequential steps before that.
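
The three phases described above can be sketched for a tiny two-layer linear network (a minimal sketch in numpy; the variable names, step sizes, and iteration counts are illustrative, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.5 * rng.normal(size=(4, 3))
W2 = 0.5 * rng.normal(size=(2, 4))
x, target = rng.normal(size=3), rng.normal(size=2)

def energy(v1, v2):
    # sum of squared local prediction errors
    return 0.5 * np.sum((v1 - W1 @ x) ** 2) + 0.5 * np.sum((v2 - W2 @ v1) ** 2)

# Phase 1: forward pass -- initial guesses for each layer's activity.
v1 = W1 @ x
v2 = W2 @ v1

# Phase 2: clamp the output to the target, then refine the hidden guess with
# many small steps, each using only errors local to adjacent layers.
v2 = target.copy()
energy_before = energy(v1, v2)
for _ in range(200):
    e1 = v1 - W1 @ x               # error local to layer 1
    e2 = v2 - W2 @ v1              # error local to layer 2
    v1 = v1 - 0.05 * (e1 - W2.T @ e2)   # talks only to neighbouring layers
energy_after = energy(v1, v2)

# Phase 3: update all weights in parallel, each layer purely from its own
# local error and presynaptic activity (can run concurrently per layer).
e1, e2 = v1 - W1 @ x, v2 - W2 @ v1
W1 = W1 + 0.01 * np.outer(e1, x)
W2 = W2 + 0.01 * np.outer(e2, v1)
```

Note how phase 3 needs no sequential sweep: once the guesses have settled, every layer's update depends only on quantities that layer already holds.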

    • @1234dbk
      @1234dbk 1 year ago +2

      This might be a silly question, but if you're doing a backward pass to refine your guesses, doesn't that still fail to solve the main issue these people created this for in the first place -- the lack of bidirectionality in biological circuits (for example, an RNN of synapses along neural pathways)? To generalize: if the graph of nodes is purely one-directional, how would information about the error be sent backwards after calculating it?

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago +21

    I think the big deal (other than plausibility) only makes sense in the context of hardware. With this scheme you can build local hardware neurons which only compute locally. In software it seems like a "backward pass" because a central processor goes around computing for all the neurons. Instead, imagine a CUDA core per neuron which never needs to load memory from anywhere except the other cores it's physically connected to.

    • @fimbulvntr
      @fimbulvntr 3 years ago +4

      Also, again thinking about hardware, this would enable "dynamic scaling" of a network, where you simply throw more neurons into the mix (since they're all clones and independent). I.e. imagine a GPU where you can bolt on extra CUDA cores, ad infinitum.
      The current model needs to know the entire topology before it can work (maybe I'm wrong and misunderstood, I'm a layman).

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 years ago +1

      Exactly; where this becomes relevant is with hardware that is explicitly simplified to take advantage of this compute structure that presumably does not need any global connections.

    • @ssssssstssssssss
      @ssssssstssssssss 3 years ago +2

      I am doubtful about the "plausibility" argument, but the realization of such a learning mechanism in hardware seems to me a very powerful argument. I imagine we could get analog processors to carry out this learning algorithm incredibly fast.

    • @23kl104
      @23kl104 3 years ago +3

      Can't you just as well make the same case for backpropagation?
      Imagine a bunch of backprop neurons only receiving information from their neighboring nodes (last hidden state for forward pass / gradient of next node for backward pass).

  • @leylakhenissi6641
    @leylakhenissi6641 3 years ago +6

    Thank you for the paper presentation, it's really well done and provides a useful overview of the topic and the paper. May I kindly ask that in the future you refrain from poking fun at other people's code though? It may keep others, especially in scientific computing, from making their code open/public, which would be a shame for everyone. Cheers.

  • @JTMoustache
    @JTMoustache 3 years ago +31

    Predictive coding is a red herring, it is really a dynamic programming version of a variational gradient descent.

    • @skdx1000
      @skdx1000 3 years ago +5

      Yeah, it seems analogous to using a Taylor series to approximate a function, where in this case the error term corresponds to the nth-derivative multiplier and the function is the evaluation of the original LSTM cell.

    • @jordyvanlandeghem3457
      @jordyvanlandeghem3457 3 years ago +1

      @@skdx1000 oomph what resources should I check to understand this reply? :)

    • @skdx1000
      @skdx1000 3 years ago +4

      @@jordyvanlandeghem3457 This link explains what a Taylor series is: brilliant.org/wiki/taylor-series/. From there, you can check the derivation formula against the techniques used in the paper Yannic explained, and compare how the error-term technique in this paper corresponds to how a Taylor series approximates error using the derivative.

    • @AbeDillon
      @AbeDillon 3 years ago +6

      I don't see anything wrong with giving "a dynamic programming version of variational gradient descent" a shorter name, like "predictive coding". What makes it a red herring?

    • @peterfireflylund
      @peterfireflylund 3 years ago +1

      @@jordyvanlandeghem3457 Take a look at 3Blue1Brown. He has a series of videos that explain Taylor series intuitively. To REALLY understand them, you need to understand calculus and do lots of homework exercises, of course. But maybe the videos are enough for you? Or maybe just the Brilliant link was enough? Up to you :)

  • @ssssssstssssssss
    @ssssssstssssssss 3 years ago +7

    Interesting paper. But this still does not seem biologically plausible to me, which they stated as the purpose. Not to mention, from what I see, so-called predictive coding is a variant of backpropagation (implementing dynamic programming), so saying it approximates backpropagation is misleading. They should qualify the title: "Predictive Coding Approximates Backpropagation with Gradient Descent".

  • @rockapedra1130
    @rockapedra1130 3 years ago +4

    This was super helpful! Thanks! I love this channel !!

    • @cedricvillani8502
      @cedricvillani8502 3 years ago +1

      Which part exactly was helpful?

    • @rockapedra1130
      @rockapedra1130 3 years ago +1

      @@cedricvillani8502 Well... all of it! He goes from the abstract and motivation, to describing the general idea with simplified drawings, to analyzing each equation, to commenting on the figures, to dissecting the code, and finally to his considered opinion of the whole thing.
      For me, this level of comprehension would take weeks (at least). Plus there are tons of papers out there, and he filters and reviews the "what's hot" papers -- another huge time saver!
      This channel is awesome!!!

  • @gruffdavies
    @gruffdavies 3 years ago

    This could be a gamechanger. Thanks for the analysis!

  • @subarashii1368
    @subarashii1368 3 years ago +5

    I feel it just keeps the input/target fixed, then back-propagates one layer per iteration. In real life, you don't keep the input fixed until your brain forms an equilibrium.

    • @Yash-vm4uk
      @Yash-vm4uk 3 years ago +2

      It is still using back-propagation, which he said is not possible in the brain, just done by looping -- so how is this biologically possible?

  • @herp_derpingson
    @herp_derpingson 3 years ago +11

    21:30 I wonder how skip connections would look in this system.
    34:20 I wonder if we should run it to convergence, or whether that would cause instability as it overfits to the batch.
    I am not sold on this. We are still sending information backward. How is this biologically feasible?

    • @linminhtoo
      @linminhtoo 3 years ago +3

      Looks like it happens through the local 'feedback' connections between neurons
      So the main difference from backprop is that the gradient doesn't need to be computed exactly all the way from the loss value back to the very first neurons that received the input, in one pass, like in backprop. We can just do it locally and it approximates backprop (which makes sense, since the errors are being sent backwards anyway)

    • @herp_derpingson
      @herp_derpingson 3 years ago +1

      @@linminhtoo Regardless of whether it's done in one pass or multiple, bidirectional propagation is not feasible.

    • @ibrax1
      @ibrax1 3 years ago +7

      @@herp_derpingson Biological neurons do have local feedback dendrites.

    • @wunkewldewd
      @wunkewldewd 3 years ago +5

      I was confused by this too! I have two qualms: A) it seems like this still requires sending info backwards like you said, so I don't see how it solves the problem... and B) backprop could be considered local IMO: even though the gradient at some much earlier layer is dL/dw_1 or whatever, the chain rule decomposition has the effect of breaking it down into a local gradient, da/dw_1, times the error signal from later in the network (the dL/dy * dy/dw * ... etc).
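
A quick numeric check of point (B) -- the chain rule already factors an early layer's gradient into a purely local term times the error signal arriving from later layers. A toy scalar network with made-up numbers:

```python
# Two-layer scalar "network": y1 = w1*x, y2 = w2*y1, loss L = 0.5*(y2 - t)^2
w1, w2, x, t = 0.5, -1.5, 2.0, 1.0
y1 = w1 * x          # layer 1
y2 = w2 * y1         # layer 2 (output)

# Error signal arriving at layer 1 from later in the network:
delta1 = (y2 - t) * w2        # dL/dy1 = dL/dy2 * dy2/dy1
# Purely local factor at layer 1:
local = x                     # dy1/dw1
grad_w1_factored = delta1 * local

# Full end-to-end gradient for comparison:
grad_w1_direct = (y2 - t) * w2 * x
```

Both expressions compute the same number; the factored form is exactly the "upstream error times local gradient" structure the comment describes.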

    • @danielbrennan5942
      @danielbrennan5942 3 years ago +2

      Long-term potentiation and long-term depression (loosely) follow Hebbian learning rules. If this algorithm also follows those Hebbian rules, it should be biologically plausible.

  • @raunaquepatra3966
    @raunaquepatra3966 3 years ago +4

    If, in the inner loop (where they update the guesses with 100 iterations or so), we only run it once and, instead of updating the predictions in small steps, just add the whole error -- then doesn't it become normal backprop? 🤨
    Please correct me if I am wrong.
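
For linear layers this intuition checks out: a single full-step propagation of the error, with the guesses held at their forward-pass values, reproduces the exact backprop deltas. A small sketch (my own notation, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, t = rng.normal(size=3), rng.normal(size=2)

v1 = W1 @ x                 # forward-pass hidden guess
y = W2 @ v1                 # forward-pass output

# Single "full-step" inner iteration: clamp the output error and propagate
# it back once, with the guesses held at their forward-pass values.
e2 = y - t
e1 = W2.T @ e2

# Exact backprop deltas for the loss 0.5 * ||y - t||^2:
delta2 = y - t
delta1 = W2.T @ delta2
```

The local errors `e1`, `e2` coincide with the backprop deltas, so the corresponding weight updates (outer products with the presynaptic activities) would match a normal backprop step too.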

  • @TheIvanIvanich
    @TheIvanIvanich 3 years ago +11

    More papers about predictive coding!

    • @boss91ssod
      @boss91ssod 3 years ago +2

      -> ... please!

  • @woolfel
    @woolfel 3 years ago +1

    This paper makes me ask this question. After you've trained a base model, could the local errors reduce the need to backprop during re-training? If that's possible, would it actually reduce the cost of retraining base models?

  • @Zantorc
    @Zantorc 3 years ago +7

    This is interesting, but I'm not sure it's applicable. The brain doesn't use point neurons, nor can it be replicated using them. You'll be lucky to get 2 bits of accuracy out of most neurons. Beyond sensory-motor inputs, the idea that a neuron could output a value which could be compared to some other value is a non-starter. Most connections in the brain are feedback, not feed-forward. The more you know about the brain, the less like the idealised NN it seems.

    • @semjuel3077
      @semjuel3077 3 years ago +1

      @Zantorc Could you explain what you mean by "Most connections are feedback, not feed-forward"?

    • @Zantorc
      @Zantorc 3 years ago +1

      @@semjuel3077 th-cam.com/video/iccd86vOz3w/w-d-xo.html
      Explains it quite well.

  • @probbob947
    @probbob947 3 years ago +2

    The structure of the update rule resembles a graph Laplacian.

  • @DavidSaintloth
    @DavidSaintloth 3 years ago +3

    This looks a lot like the mechanism I presented as salience modulation in a 2013 post:
    sent2null.blogspot.com/2013/11/salience-theory-of-dynamic-cognition.html?m=1
    The back propagation happens through a salience-driven remapping of stored information in any given sensory dimension, with inference happening continuously as data maps into the networks.
    Tangent: there is some evidence that real neurons do have feedback sub-signals along the firing path, which would make this paper more biologically similar than you asserted.

    • @lemurpotatoes7988
      @lemurpotatoes7988 3 years ago

      Link to evidence of feedback subsignals, please?

  • @v.gedace1519
    @v.gedace1519 3 years ago +1

    I am pretty sure that the linearity of the decomposition is the issue. I.e. dL/dh2 * dh2/dw2 -> ...h3... -> ...h4...
    Nature does it differently:
    dL/dh2 * dh2/dw2 -> F[h3](L, h3w'3 ... h0w'0) -> F[h2](L, h2w'2 ... h0w'0), where the w'... are weights, aka "feedback connections". Hard to explain using text only, but you get the idea ;-)

    • @23kl104
      @23kl104 3 years ago +3

      no lol, I don't

  • @v.gedace1519
    @v.gedace1519 3 years ago +1

    WOW! The paper is great. Your explanations are greater!

  • @dm_grant
    @dm_grant 3 years ago +3

    Neurons are not bidirectional. Exactly!

  • @quebono100
    @quebono100 3 years ago +13

    Nice Paper :) tanh-k you

    • @jonatan01i
      @jonatan01i 3 years ago +4

      tanh-q

    • @quebono100
      @quebono100 3 years ago +3

      @@jonatan01i even nicer :D tanh-q

  • @diegofcm6201
    @diegofcm6201 3 years ago

    Like Jeff Hawkins says: neurons CANNOT be assigned numerical precision whatsoever. So even if there weren't any backward pass, just assuming that much stability in the input/output is flawed from the POV of biological plausibility.

    • @diegofcm6201
      @diegofcm6201 3 years ago

      It’s much more likely that it’s something more discrete, with Hebbian learning happening through information sent by neurotransmitters

    • @Hukkinen
      @Hukkinen 3 years ago

      Why can't neurons be approximated by numerical representations? I'd say this is just a trade-off between the realism and abstraction of the model. Why am I wrong here?

    • @diegofcm6201
      @diegofcm6201 3 years ago

      @@Hukkinen
      TL;DR: It's naive to pick just a single part of bio neural networks (local update rules) and try to graft it onto artificial ones, expecting similar/better performance, without considering most of the other computational aspects of the real thing.
      The idea is that neuronal connections in the actual brain are maintained by STDP (spike-timing-dependent plasticity), which is a rule that depends not so much on the actual voltage as on long-term behaviour (potentiation or depression). There are no static weights; they're a dynamical property, evolving over time.
      There are also lots of other things we are neglecting, like the fact that memories are in the connections (somehow) and computation is done in the time domain (tied to the latency between an input neuron's activity and spiking in the outputs -- and, just a "small" detail, in bio neural networks the output neuron can spike before the inputs).

  • @charleshong1196
    @charleshong1196 3 years ago +2

    I just don't get it. What's the difference? It still needs to backpropagate... the temporal and spatial dependences have not changed...

    • @YannicKilcher
      @YannicKilcher  3 years ago

      the algorithm is biologically plausible

  • @lucidraisin
    @lucidraisin 3 years ago +2

    Thank you!!

    • @lucidraisin
      @lucidraisin 3 years ago +2

      Nobody could have explained it as well as you did!

    • @kimanthony1667
      @kimanthony1667 3 years ago +2

      Next project ==> lucidrains/predictive-coding-backprop-pytorch

  • @blacklistnr1
    @blacklistnr1 3 years ago +8

    I'd like to say that I appreciate how you handled discussing this paper. Perhaps this is my biased, incomplete view, but damn, some research is this overly pompous explanation of a really basic idea that makes you facepalm: "Is that it?". I imagine these guys chuckling with pipes:
    - What should we research next?
    - Well, I'd love to do something useful, but all the money seems to go to A.I. these days.. *scratches beard*
    - Oh... these primal monkeys, will they ever understand the beauty of exploring math?
    - I truly don't know, but let's give them what they want: deep networks and backprop.
    - Hasn't that been done like 10000 times already?
    - No no no, we don't do backprop, we break the chain with local variables and call it predictive coding.
    - You're mad! *loud laugh* So you want to do 100 LOCAL iterations to propagate what could be done in one pass?
    - You wouldn't say it like that.. use flashy words: neuromorphic, LSTM, etc.
    - Neuromorphic Machine Learning? Isn't that like what we've been calling what we're doing since the 1970s? Have a little dignity, at least call it "Hebbian plasticity".
    - *drinks the whole glass and slams it on the table* Fine with me. Let's get this over with.

    • @gruffdavies
      @gruffdavies 3 years ago +2

      The paper's purpose was to address "biological plausibility" so "Hebbian plasticity" is perfectly appropriate.

  • @hoaxuan7074
    @hoaxuan7074 3 years ago +1

    Well, almost anything will train a neural net, and there is no point in being too clever about it. A dot product is a statistical summary measure and a filter. It will respond to the statistics of the neurons in the prior layers. No neuron can be too exceptional, because its output will be shared by many forward dot products. And any realistic optimisation algorithm will be able to search only a small space of statistical solutions. And is that a bad thing? You exclude many brittle, overfitted solutions.

    • @hoaxuan7074
      @hoaxuan7074 3 years ago

      I guess one way to test that is to delete a weight and see how badly it affects the net, or delete one neuron.
      Do you only ever get a small statistical effect, or does such an action sometimes dramatically impact the net?
      Evolutionary algorithms like Continuous Gray Code Optimization can actually train large nets, and can have low network-bandwidth requirements relative to BP for federated learning. Each compute device is given the full network model and part of the training set. The same short sparse list of mutations to make to the model is sent to each device, and it returns the cost for its part of the training set. The costs are summed; if there is an improvement, an accept-mutations message is sent to each device, else a reject message.
      Anyway, there is some kind of related chat at 'discourse numenta' under sparse numenta nets.
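
Not Continuous Gray Code Optimization itself, but the broadcast/accept/reject pattern described above can be sketched as follows (made-up data and cost function; `partial_cost` stands in for each device's evaluation on its own shard):

```python
import numpy as np

rng = np.random.default_rng(2)
weights = rng.normal(size=10)            # the full model, replicated on every device

# Each "device" holds one shard of the training data.
shards = [rng.normal(size=(5, 10)) for _ in range(3)]

def partial_cost(w, data):
    # a device's cost on its shard (toy quadratic objective)
    return float(np.sum((data @ w) ** 2))

best = sum(partial_cost(weights, s) for s in shards)
initial = best

for _ in range(300):
    idx = rng.integers(0, weights.size, size=2)   # short sparse mutation list
    delta = 0.1 * rng.normal(size=2)
    weights[idx] += delta                         # broadcast + apply mutations
    total = sum(partial_cost(weights, s) for s in shards)  # summed device costs
    if total < best:
        best = total                              # "accept mutations" message
    else:
        weights[idx] -= delta                     # "reject" message: revert everywhere
```

Only the sparse mutation list and the scalar costs cross the network, which is the low-bandwidth property the comment points out.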

  • @amitkumarsingh406
    @amitkumarsingh406 3 years ago +4

    How about the papers in dark mode?

  • @keeperofthelight9681
    @keeperofthelight9681 1 year ago

    It doesn’t for Reinforcement learning though

  • @kascesar
    @kascesar 3 years ago +1

    Which program did you use to read the papers?

  • @Yash-vm4uk
    @Yash-vm4uk 3 years ago

    It is still using back-propagation, which he said is not possible in the brain, just done by looping -- so how is this biologically possible?

    • @SianaGearz
      @SianaGearz 3 years ago

      Back propagation as defined is a global mechanism that makes use of the computer implementation of neural networks. However, in the brain, there can be no explicit metadata describing the connections, and no direct connections spanning all the way across the brain!
      Two-way communication for the purpose of reinforcement occurs biologically, but it is local, spanning just every pair of adjacent neurons. There are many mysteries regarding function of biological neural tissue.
      So this paper presents a mechanism which it shows to be identical in result to back-propagation, but which is local only, not global, and appears biologically plausible. It helps come one step closer to understanding the function of biological neural tissue.

  • @albertwang5974
    @albertwang5974 3 years ago +2

    The brain does backpropagation by generating connections from activating cells to the confirmed result.

  • @minecraftermad
    @minecraftermad 3 years ago

    I hope I can understand this, cuz those graphs sure didn't look promising

  • @yasurikressh8325
    @yasurikressh8325 3 years ago +1

    Doesn't look hideous to me. If it can be mapped, then it is a beauteous model

  • @Rizhiy13
    @Rizhiy13 3 years ago +2

    Not very convincing so far; distributing the errors doesn't seem to offer any advantages over backprop.

    • @AirmailMRCOOL
      @AirmailMRCOOL 3 years ago +3

      "Advantages" aren't really what they were looking for. They were looking for a biologically possible training method. Your brain doesn't use backprop, so they're just theorizing about what it does use.

  • @444haluk
    @444haluk 3 years ago +1

    This is the smartest thing I have ever heard. I have always hated backprop because at each step it assumes it finds the temporarily perfect solution. This approach fixes that monstrosity.

    • @23kl104
      @23kl104 3 years ago +3

      No, it doesn't. It finds the locally steepest direction.

  • @Prince-sf5en
    @Prince-sf5en 3 years ago +9

    Can't believe I'm first here

    • @bassr3hab
      @bassr3hab 3 years ago +2

      haha same here

    • @herp_derpingson
      @herp_derpingson 3 years ago +2

      Can't believe its not butter

    • @andreassyren329
      @andreassyren329 3 years ago +1

      Oh I had no idea this just premiered.

    • @notgabby604
      @notgabby604 3 years ago +1

      Naw, it's trans-fat margarine. Which certainly was a terrible thing.

    • @quebono100
      @quebono100 3 years ago +1

      @@bassr3hab same here on your post xD (recursion?)

  • @quAdxify
    @quAdxify 2 years ago

    This is a bit difficult to understand. I think it just needs a bit more theoretical context for all the people not familiar with variational inference. For interested viewers, here is an excellent review (including predictive coding and VI) by the authors of the discussed paper (I believe): arxiv.org/pdf/2107.12979.pdf