Assuming the Hebbian network is topologically the same as the RL network, it might be interesting to see if there is an improvement if you:
- train the RL-way to solve all the tasks as well as possible with a single policy
- train update rules the Hebbian way to solve all tasks as well as possible given a random initial network
- feed the RL-learned solution to the Hebbian solver
to see if that somehow gives you the best of both worlds.
If that doesn't work, then a way to jointly train the starting architecture and the Hebbian learning rules on top might help.
One way to do that might be to:
- have a population of RL agents (could also just be a single one)
- randomly mix them plus noise-initialized networks (i.e. induce noise)
- apply Hebbian learning to this new distribution which will have more structure to it than *just* entirely random networks
Then you could iterate this to use Hebbian networks as they were at the end of an episode as a starting point for further RL optimization, which may expand or replace the original RL population.
Presumably this would be very expensive though.
And the thing where they effectively zero out a large chunk of neurons for a bit could be used in training as some sort of robustness regularization as well, although clearly it's *already* quite robust to such rather insane changes.
I'd also love to know what the diagonals are about. It's certainly interesting.
I feel like you're *quite* Hebbian given the rate at which you learn new papers lol
Without their code, I'm guessing the diagonal structure comes about from the structures used to express the state and action sequences. Rather than just the instantaneous observations, a history of observations is normally stacked together into a feature matrix which is sensor × time delay. Depending on the ordering used for the delays (running up or down), this will form a (partial) Hankel matrix or a (partial) Toeplitz matrix. These feature matrices are then vectorized (if they were ever explicitly constructed in the first place), and each vector forms a single "observation" vector. This certainly would explain why input matrices, if displayed in the same format, would have a (partial) Hankel structure. A somewhat similar construction might explain the output matrix, if again delays are used to provide information about rates to the effectors.
However, it is interesting that this structure appears in the WEIGHT matrices, and not (just) the (unillustrated) input and output feature matrices. I would guess the two partial Hankel structures on the input and output impose their structures on the adjacent weight matrices, and work their way to the middle, but how is certainly mysterious at this point. From an engineering perspective, it would seem like the "erasure" experiment, where a broad swath of weights is set to zero, might be retrained in ONE step, simply by taking a ruler along those antidiagonals and filling in the erased portion with the median non-zero values from the non-erased portion.
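In case it helps make the delay-stacking construction above concrete, here is a tiny toy sketch (my own code, written from the description above, not from the paper); which structure you get depends only on the ordering of the delays:

```python
import numpy as np

# One sensor, signal o_0..o_9, a history of k = 4 delays kept per time step.
# o[t] = t, so the pattern is easy to read off.
o = np.arange(10.0)
k = 4

# entry (t, d) = o[t - d]: constant along diagonals -> (partial) Toeplitz
toeplitz_like = np.array([[o[t - d] for d in range(k)] for t in range(k - 1, len(o))])

# reverse the delay ordering, entry (t, d) = o[t - (k - 1) + d]:
# constant along anti-diagonals -> (partial) Hankel
hankel_like = toeplitz_like[:, ::-1]

print(toeplitz_like)
print(hankel_like)
# In practice each row (the vectorized sensor x delay block for that step) would
# be fed to the policy as one "observation" vector.
```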
Very cool! Probably because of your inquiry, Yannic, they found that the anti-diagonal pattern in the weight matrices was wrong. The anti-diagonal patterns no longer exist in any of the figures in the revised version 3 submitted on Oct. 22. At first I wondered whether the anti-diagonal pattern was an aliasing artifact from your screen recording, because I didn't see it in my PDF (v3).
Thanks for writing this comment! ❤
Very nice video! Really enjoyed the comparisons with RL.
Now instead of using random weights at test time, initialise with the RL weight result, that should be fun :)
Thanks for the video! I wonder if subnetworks could be learned with traditional RL then a Hebbian rule could tweak the weights to choose which subnetworks to send signals to. I bet you could get a network to learn a broader set of tasks this way.
Getting closer to my research field of Neuromorphic and Spiking Neural Networks
Hey Yannic, this is offtopic w.r.t this video, but could you take a look at Tsetlin Automata? It looks like a neat new mechanism to use.
Kinda feeling the same here. I'd like to hear his opinion about the tsetlin machine
never heard of it, will check it out, thanks :)
What is a Tsetlin automaton? Even Googling does not help
Re: diagonal pattern, see the wiki on recurrence plots. Same thing. This is a dynamical system.
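For anyone who hasn't seen a recurrence plot before, a minimal toy sketch of the idea (my own illustration, not from the paper): mark (i, j) whenever the states at times i and j are close; for periodic dynamics the marks line up along diagonals, which is the pattern being referred to.

```python
import numpy as np

t = np.linspace(0, 6 * np.pi, 60)
states = np.stack([np.sin(t), np.cos(t)], axis=1)             # toy periodic trajectory
dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
recurrence = dist < 0.2                                       # threshold epsilon

for row in recurrence:                                        # crude ASCII rendering
    print("".join("#" if close else "." for close in row))
```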
@32:00 the way the weights can be recovered resembles the way a Hopfield network can recover a memory from a partial or corrupted state
Very nicely explained!!!
I like hearing about alternatives to classic RL.
We've said before that papers shouldn't need state-of-the-art results to be important... but since we've gotten used to all papers being SOTA, these quadruped results felt lackluster.
This is similar to "Growing Neural Cellular Automata", which also learns a rule set. It amounts to data-dependent program synthesis.
This is such a nice way to combine ontogenic ("classic") RL with evolutionary RL :) I wonder, if the weight initialization or network topology were to be evolved as well, would that invoke the Baldwin effect? Might throw in some Lamarckian inheritance for good measure :)
The updated version of this paper on arXiv doesn't have the diagonals (arXiv:2007.02686v3 [cs.NE] 22 Oct 2020)
A Hebbian network is like cutting a creature in half, adding some other feature there, and thereby building another level of abstraction. If this is the case, you can cut at every point - like a god creating an algorithm to create more gods - so you could have a fully Hebbian network without any convolutional layers in between. It would probably have many more weights, but who counts these days (with networks reaching 175G weights).
Rather than learning specific ABCD values for each connection, I think it could have great potential to define different types of neurons & connections, and then learn different Hebbian rules for each type of connection.
Much smaller search space, more generalization, and more biologically plausible.
The geometric relationship between neurons should also be considered when designing a meta Hebbian rule.
For different kinds of update rules, see arxiv.org/pdf/1610.06258.pdf
@@snippletrap that paper is the fast-weights one from Hinton...
FW: learn weights by BP, run with zero-initialized fast weights
This paper: learn the update rule by evolution, run with randomly initialized weights
In that sense they do have some similarity... And my point is that if we learn more general update rules for TYPES of connections, it could make a huge difference.
I'm curious, how would having different types of neurons & connections reduce the search space? Thanks in advance
@@revimfadli4666 If every neuron has its own learning rule, then you have to search over every neuron. If groups of neurons share learning rules, then you only have to search over the number of groups. Imagine there are 1,024 neurons in a network. Then the search space has size m^(1024p). If every neuron falls into one of two groups, then the search space is m^(2p), where p is the number of parameters per neuron and m is the degrees of freedom per parameter.
@@revimfadli4666 In the brain there are a lot of types of neurons, and even more types of synapses, depending on how you classify them.
The simplest way is to have excitatory and inhibitory neurons and E-I, I-I, I-E, E-E synapses; you can have a predefined distribution of these elements and search for Hebbian rules for only 4 types of connections.
In this simplified case, the search space is just 4 Hebbian rules, plus, if you want, the spatial distribution of the neurons and their connection distribution.
Information about each individual connection is very unlikely to be encoded in the genes, but each Hebbian rule could be.
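To make the type-sharing idea concrete, here is a rough sketch of what it could look like (entirely speculative, my own code, not the paper's method; names such as `hebbian_step` are made up for illustration):

```python
import numpy as np

# Type-shared Hebbian rules: every neuron is excitatory (E) or inhibitory (I),
# and each synapse uses the (A, B, C, D, eta) rule of its (pre-type, post-type)
# pair, so only 2*2 rules x 5 coefficients = 20 parameters need to be evolved
# instead of 5 per individual weight.
rng = np.random.default_rng(0)
n_pre, n_post = 8, 6
pre_type = rng.integers(0, 2, n_pre)        # 0 = E, 1 = I
post_type = rng.integers(0, 2, n_post)
rules = rng.normal(size=(2, 2, 5))          # (pre_type, post_type) -> (A, B, C, D, eta)
W = 0.1 * rng.normal(size=(n_pre, n_post))

def hebbian_step(W, o_pre, o_post):
    # look up each connection's rule from the types of its two endpoint neurons
    per_conn = rules[pre_type[:, None], post_type[None, :]]     # (n_pre, n_post, 5)
    A, B, C, D, eta = np.moveaxis(per_conn, -1, 0)
    dW = eta * (A * np.outer(o_pre, o_post) + B * o_pre[:, None] + C * o_post[None, :] + D)
    return W + dW

o_pre = np.tanh(rng.normal(size=n_pre))     # pre-synaptic activations
o_post = np.tanh(W.T @ o_pre)               # post-synaptic activations
W = hebbian_step(W, o_pre, o_post)
```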
Really interesting... I worked with spiking neural networks for a while, I should get back into biologically inspired models myself :D
Wow that's pretty dang cool, but why don't they test something like mini-batches for the Hebbian updates? Seems like a logical step to take in terms of progressing to dynamically training NN models.
Oh wait, perhaps it is because of the evolutionary approach...
perhaps batches would trade more memory for fewer update computations?
nice work
I love this channel
Interesting video. But I wasn't able to understand how we adjust the weights if we haven't got target values. In RL I would use as targets the values given by the formula reward + gamma * max(value).
How can I do this in Hebbian learning? What do I use as a metric to optimize?
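Not an answer from the paper's code, but a minimal sketch of how the evolutionary approach discussed elsewhere in this thread sidesteps per-step targets: the episode's total reward is used directly as a fitness, and an evolution-strategies estimator nudges the rule coefficients toward higher fitness. `run_episode` is a hypothetical helper that rolls out one episode with the Hebbian rules encoded by `theta` and returns its cumulative reward.

```python
import numpy as np

def es_step(theta, run_episode, sigma=0.05, lr=0.01, pop=32,
            rng=np.random.default_rng(0)):
    eps = rng.normal(size=(pop, theta.size))                      # population of perturbations
    fitness = np.array([run_episode(theta + sigma * e) for e in eps])
    advantage = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad_est = (advantage[:, None] * eps).mean(axis=0) / sigma    # NES-style gradient estimate
    return theta + lr * grad_est                                  # move rules toward higher return
```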
Making robots that can function even when damaged definitely sounds useful in high-risk areas such as search-and-rescue
perhaps the diagonal pattern is a programming error, either in the visualization or in the actual network.
Are the initial weights set to zero? That would explain a pattern of horizontal rising.
Great paper review! I would like to replicate this, but the Hebbian weight update part (27:10) is not completely clear to me. 1) How is F given here? Is it normalized somehow? 2) In order to have a weighted average, shouldn't the summation be divided by the sum of all Fs (1 to i)? I checked the paper directly - also the newer version, where the update equation is slightly different - and the references, but I cannot wrap my head around it. Any help would be much appreciated!
Second..lovely video..thanks for sharing
This is a type of reinforcement learning so it is a little confusing to keep comparing it to RL as if it is not RL. This is a quite interesting thread of research, though.
In the equation:
dw_ij/dt = η_w · (A_w·o_i·o_j + B_w·o_i + C_w·o_j + D_w)
o_i is basically x_i and o_j is y_j, as in a standard linear layer. Now I'm curious: would it be possible to take dy/dA (also dB, dC and dD) and update the Hebbian learning matrices with respect to the backpropagated gradient?
The fact that the authors are using NES tells me that it is not possible, but I somehow can't see why.
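For what it's worth, here is a minimal numpy sketch of that update vectorized over one layer (my own reading of the equation, not the authors' code):

```python
import numpy as np

# A, B, C, D, eta are per-weight matrices with the same shape as W;
# o_pre and o_post are the layer's input and output activations
# (the o_i and o_j in the equation above).
def hebbian_update(W, o_pre, o_post, A, B, C, D, eta):
    dW = eta * (A * np.outer(o_pre, o_post)   # A_w * o_i * o_j
                + B * o_pre[:, None]          # B_w * o_i
                + C * o_post[None, :]         # C_w * o_j
                + D)                          # D_w
    return W + dW
```

Since the update is linear in A, B, C and D, the unrolled sequence of updates is differentiable in them, which is why backpropagating through time (as the reply below suggests) seems possible in principle, if tedious.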
It should be possible, but you'll have to backpropagate through time, which might be tedious
Could you perhaps take a look at the paper about e-prop? It's published in Nature, so I think it might be worth checking out.
I know you've covered meta-learning in a video, but I want to know the difference between meta-learning, transfer learning and multi-task learning.
They are all adjacent, but meta-learning is “learning to learn”: given a new task, we want to be able to adapt to it quickly and learn it “shortly” after seeing it. Transfer learning isn't as concerned with getting a good starting point and being able to adapt quickly; it's more focused on using what you have learnt in the past to better learn something new. Multi-task learning is learning multiple tasks at once, hopefully using what you learn in one task in another, all while you are learning.
@@Stwinky So meta learning is more like transfer learning but with a different objective function?
How well does this optimize on the GPU? I imagine you'd only want to check a local field around a neuron, then move that neuron closer to others that fire together, then wire them together. 'Hebbian motion'.
perhaps they only 'wire together' weights between a neuron and its input nodes, instead of neurons within the same layer?
I think anything that can be vectorized will benefit from the GPU just the same.
I don't think it "wires together" neurons of the same layer, but rather makes each particular _weight_ more positive if its incoming signal and output node's activation correlate, and more negative if they negatively correlate, thus "wiring together" said weight's incoming & outgoing nodes?
In that case, the correlation is just the vector outer product between a layer's activation and its previous layer's, which is highly parallelizable
CMIIW?
I love how the RL network got much better results when its right front leg was damaged than when all the legs were intact. Like who needs extra limbs?
Rewards are given per time step, and RL is a very classic approach to online learning. Perhaps you mean (Monte Carlo) returns in the video? Overall, I think the premise of this paper is severely flawed. The idea is cool, but I don't buy the motivation at all.
Perhaps by 'reward' he meant cumulative episodic reward?
Why would that make the premise flawed?
@@revimfadli4666 Those are two separate points I was making, but I can see how it's confusing the way I wrote it. What I meant is that the motivation in the abstract ("RL approaches ... are typically static and incapable of adapting to new information or perturbations") is flawed, since RL can be, and often is, performed online.
@@Alex-rh5jo that might be the case, unless what they called "static" was the policy that the RL algorithm generated(which is frozen/not trained during testing to differentiate it from training phase). Sure the _RL training_ algorithm is capable of adapting to new information, but the _trained policy?_
On the other hand, this Hebbian plasticity method's policy can adapt after training
Is there an existing Git-repo that holds the code of this paper?
Holy Trifecta LOL : D
I’d like to apply HTM to this
They don't use evolutionary methods, they use local search. I swear, people these days think that "evolution" means "any change with random numbers".
"Technology is good, technology is bad, technology is biased" That is the best joke ever.
Is there any relation between the input and the random weights used to initialize?
No, I don't think so
this seems like a proxy for meta-learning. It's not exactly meta-learning but kinda goes by the idea.
Also, is it similar to the HSIC bottleneck from here? towardsdatascience.com/hsic-bottleneck-an-alternative-to-back-propagation-36e951d4582c
First