12 seconds in and I'm already impressed by your French pronunciation skills
Loved it! I cannot explain how easy you made the whole paper seem. Please make more videos like this!
Fascinating work, thanks for reviewing it Yannic! The random matrices reminded me of that alchemy in some metric learning models where you project triples through a random matrix and it still kind of works :)
This time not one but two related papers; you found an effective way to double the output! It feels a bit like magic that DFA works. Great to see basic research in the spotlight (after so many hype-bandwagon number-pushing papers). Interesting that, due to the random projections, the weights of all neurons can be initialized to the same value (e.g. all zero when tanh is used), whereas BP needs some asymmetry/randomness in the initial weights for two neurons to become different. My first thought was that DFA needs layer normalization or some form of weight normalization, but it seems to work without it for simple tasks like MNIST.
Yes, and it's entirely possible that if we find the correct normalizations etc. like we did for BP, we can push this pretty far.
I watch all of these videos and I'm not even a data scientist, just a programmer. I think I can learn from them.
Your theory seems really intuitive to me: the idea is that we just need "a coordinate system" with which to communicate the errors. I've been trying to think about normal backprop in these terms, and if what you say is true, it basically means that in backprop we may be unnecessarily preserving the coordinate system of the next layer just to communicate the error between the current layer and the next. As long as the error is ultimately preserved with respect to the final task, the whole system will learn. Super fascinating idea. I wonder if we can use a sparse coordinate system that is more biologically plausible.
TLDR: Applying a random matrix transformation to a layer vector removes degeneracy of solution by being non-linear and having outputs separate more quickly than in linear transformations or low param count transformations.
The idea that DFA induces clustering sounds good. However, the last layer really needs to capitalize on the clustered input, and it's surprising that it apparently does, since it itself just gets "random" feedback.
I would love videos on topics/threads/ideas that are shared across papers. Also, a mention in your descriptions of the software you use to create content, drawing over papers as you read.
My intuitions on how random matrices can replace the weight transport in BP:
1. Two random vectors in high-dimensional space are almost always (nearly) perpendicular (quick numerical sketch below).
2. Using the forward weights for BP gives you an absolute measure (current output is 7 and the correct output is 5, so we need to reduce by 2 units).
Because of 1), replacing the weight matrices with random matrices gives you a relative measure (I don't care where we're at, but we need to reduce by 2).
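A quick numerical sketch of point 1 (just my own illustration, with arbitrary dimensions): the average |cosine similarity| between independent random vectors shrinks roughly like 1/sqrt(dim).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    """Average |cosine similarity| between pairs of independent Gaussian vectors."""
    u = rng.standard_normal((n_pairs, dim))
    v = rng.standard_normal((n_pairs, dim))
    cos = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return np.abs(cos).mean()

for dim in (2, 10, 100, 1000, 10000):
    print(dim, round(mean_abs_cosine(dim), 4))  # shrinks roughly like 1/sqrt(dim)
```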
Yes, but in half the cases that reduction of 2 would actually be an increase of 2, because you're rotated.
Thanks Yannic! Question: Wouldn't DFA be a solution for the vanishing/exploding gradient problem? Is this mentioned anywhere?
I know relu doesn't suffer from this, but what about e.g. sigmoid/tanh?
It's not mentioned, but yes, it could be a remedy. I mean, the problem doesn't even arise with this, so yes :)
Thanks for this! The paper, explained!
There is a way to test your hypothesis. There are many ways to approximate random matrices B drawn from a Gaussian distribution, including sparse binary matrices! The key idea is that many other distributions, like the binomial or Poisson, approximate Gaussian distributions (shifted and scaled appropriately) if enough samples are drawn. So, if we replace the B matrices in DFA with sparse binary matrices sampled from binomial distributions and still get reasonable performance, it will strongly support your hypothesis. With sparse matrices, the weight updates will look like jagged coordinate-descent-type updates, with nothing in common with BP or DFA updates except the locality-preserving property of the matrix. We did some work in locality-sensitive hashing and showed that such sparse matrices provide amazing computational efficiency while preserving distances and angles: arxiv.org/abs/1812.01844
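To make the suggestion concrete, here is a rough sketch (arbitrary sizes and synthetic data, not from either paper) showing that an Achlioptas-style sparse sign matrix approximately preserves pairwise distances, much like a dense Gaussian one would.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n, s = 2000, 256, 50, 3   # arbitrary sizes; s controls sparsity

# Stand-ins for the vectors that B would project in DFA.
X = rng.standard_normal((n, d_in))

# Sparse sign matrix: entries are +1/-1 with probability 1/(2s) each, 0 otherwise,
# scaled so that the projection preserves squared norms in expectation.
B = rng.choice([-1.0, 0.0, 1.0], size=(d_in, d_out), p=[1/(2*s), 1 - 1/s, 1/(2*s)])
B *= np.sqrt(s / d_out)
Y = X @ B

def pairwise_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))[np.triu_indices(len(Z), k=1)]

ratio = pairwise_dists(Y) / pairwise_dists(X)
print(ratio.mean(), ratio.std())  # mean close to 1 with a small spread
```

Running the same check with a dense Gaussian B gives essentially the same numbers, which is the point of the comparison.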
Nice, thanks for the reference and the suggestion :)
I wish they showed some benchmarks of improved GPU performance. This certainly looks a lot more GPU friendly than backprop.
Absolutely, this is a great point. In fact, I bet it's an order of magnitude or more, since the random B_i matrices are fixed at the start and we only need the gradient of the loss function with respect to the output.
Lighton.ai is working on some kind of optical hardware; maybe this algorithm is more suitable for their hardware than for a GPU, so they left it out.
I thought so too, but then there's still forward prop, so it can't all be parallel.
@@YannicKilcher could it be that you can do a sort of ensemble of backward passes and somehow approximate better, since that can be parallelized?
Hey, author here. Showing benchmarks of improved GPU performance is complicated, because our implementation of DFA is not meant to be super-duper-fast, but rather easy to use and to extract information from (like alignment angles). The main issue is that the backward process is ingrained deep within ML libraries like Torch and TensorFlow. Hacking into it to make it do something completely different (such as DFA) is challenging. If you don't want to bother with rewriting a lot of the autograd system in C++/CUDA, you have to accept that your implementation will be less efficient and will still do things "the BP way" sometimes.
@Phúc Lê is also right, we are developing at LightOn optical hardware that performs very large random projections much more rapidly than a GPU. You can put two and two together and guess why we are interested in DFA :).
Also @Theodoros Galanos and @Yannic Kilcher, there are ways to actually enable more parallelization with DFA. Because in classification problems the sign of the error is known in advance, you can directly update a layer with a modified binarized DFA as soon as it has done its forward. See 1909.01311 for reference.
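For anyone curious what the update looks like outside of an autograd framework, here is a minimal NumPy sketch of a DFA step on a toy two-hidden-layer MLP (layer sizes, activation, and the softmax/cross-entropy loss are arbitrary illustrative choices; this is just a sketch of the update rule, not the paper's implementation).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, lr = 784, 256, 10, 0.05   # arbitrary toy sizes

# Trainable forward weights (biases omitted for brevity).
W1 = 0.01 * rng.standard_normal((n_h, n_in))
W2 = 0.01 * rng.standard_normal((n_h, n_h))
W3 = 0.01 * rng.standard_normal((n_out, n_h))

# Fixed random feedback matrices: drawn once, never trained.
B1 = rng.standard_normal((n_h, n_out)) / np.sqrt(n_out)
B2 = rng.standard_normal((n_h, n_out)) / np.sqrt(n_out)

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def dfa_step(x, t):
    """One DFA update. x: (n_in, batch) inputs, t: (n_out, batch) one-hot targets."""
    global W1, W2, W3  # globals just to keep the sketch short

    # Forward pass.
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    y = softmax(W3 @ h2)

    # Global error at the output (softmax + cross-entropy).
    e = y - t

    # Each hidden layer receives the *same* output error, projected through its own
    # fixed B_i, instead of the error backpropagated through the layers above it.
    d2 = (B2 @ e) * (1.0 - h2 ** 2)   # tanh'(a) = 1 - tanh(a)^2
    d1 = (B1 @ e) * (1.0 - h1 ** 2)

    batch = x.shape[1]
    W3 -= lr * (e @ h2.T) / batch
    W2 -= lr * (d2 @ h1.T) / batch
    W1 -= lr * (d1 @ x.T) / batch

# Smoke test with random data.
x = rng.standard_normal((n_in, 32))
t = np.eye(n_out)[:, rng.integers(0, n_out, size=32)]
dfa_step(x, t)
```

In the binarized variant mentioned above (arXiv:1909.01311), e is replaced by its sign, which is known in advance in classification, so each layer can be updated as soon as its own forward pass is done.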
Much love and gratitude
ε>
I have been reading about the fact that a single soma can communicate with dual/multiple neurotransmitters. (In fact they may never act alone) This has been theorised to allow the necessary negative feedback at the synaptic junction. I found the subject to be called 'co-transmission' following on from a revised view of something called Dale's principle. :)
Nice, thanks for the reference!
Great review. Which software do you use for it?
OneNote
setup a patreon man keep up the good work! definitely helps
At 4:45, if you haven't already looked at Blake Richards' work, you should do so. He claims to have found a biologically plausible way in which (clusters of) neurons could do credit assignment, by "multiplexing" their signals (i.e. sending both signals on the same channel) to simultaneously propagate the forward signal and the backward error signal using different compartments of the neuron.
multiplexing: th-cam.com/video/AfrU2wHQnrs/w-d-xo.html and credit assignment th-cam.com/video/Q18ahll-mRE/w-d-xo.html (modeling a cluster of neurons as the fundamental unit)
At 21:00: I think Geoffrey Hinton has an explanation somewhere for how the forward weights adapt to the random backward weights, when talking about stacked autoencoders and temporal derivatives.
Thanks a lot for these references, very much appreciated :)
sorry for the noob question: what do you mean precisely when you say random matrices almost preserve angles and distances? what kind of stuff should I read to understand that?
Think of those random matrices as reference points: if your reference point stays constant, you can use it to understand something else.
Take that with a grain of salt :)
There is a class of random linear transformations that approximately preserve these quantities with high probability. Search for the Johnson-Lindenstrauss lemma.
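A tiny demo of the flavour of guarantee the lemma gives (arbitrary sizes, just a sketch): pairwise angles between random points barely shift under a scaled Gaussian random projection.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 5000, 300, 20   # original dim, projected dim, number of points (all arbitrary)

X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled Gaussian random projection
Y = X @ R

def pairwise_angles_deg(Z):
    """Pairwise angles (degrees) between the row vectors of Z."""
    G = Z @ Z.T
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    return np.degrees(np.arccos(cos))[np.triu_indices(len(Z), k=1)]

shift = np.abs(pairwise_angles_deg(X) - pairwise_angles_deg(Y))
print(shift.mean(), shift.max())  # small compared to the ~90 degrees typical in high dim
```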
For weight normalization, it has occurred to me that perhaps something very loosely aligned with the notion of conservation of energy could be applied to the weights. Something like: the input values summed at a neuron (usually weight × output of the previous neuron) are proportional to the synaptic weight, divided by the square root of the sum of squares of the weights connected to that input, times the output. Do you know of any papers along this line of thinking?
It's not a very sound science as to why I think it would be useful, except that in my mind it forces a feature to "become relevant" for reasons too long to explain in a youtube comment.
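To make that concrete, here is one possible reading of the idea as a rough sketch (I may have the fan-in/fan-out direction backwards, and the function name is just made up for illustration): each weight is divided by the L2 norm of all the weights that share its input before the usual weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_preactivation(W, x, eps=1e-8):
    """a[j] = sum_i (W[j, i] / ||W[:, i]||) * x[i]
    Each weight is scaled by the L2 norm of the weights sharing the same input i."""
    col_norms = np.linalg.norm(W, axis=0) + eps   # one norm per input line
    return (W / col_norms) @ x

W = rng.standard_normal((64, 128))   # arbitrary layer shape: 64 neurons, 128 inputs
x = rng.standard_normal(128)
a = normalized_preactivation(W, x)
print(a.shape)   # (64,)
```

Normalizing over each neuron's incoming weights instead (rows of W rather than columns) would give something closer to weight normalization / weight standardization, which is roughly what the reply below points to.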
There is "weight standardization" which goes on that direction. Also many initialization methods have properties like that and otherwise there are things like unitary neural networks. Not sure if any of those are exactly what you're describing
@@YannicKilcher I will investigate. thanks
Man, do you have patreon? I would like to donate, so you never stop! :)
I will stop, with or without patreon 😁 but thanks
@@YannicKilcher : 0
This is an amazing channel. The rate at which you describe the latest papers on a daily basis is incredible.
I also have one request: could you keep the background of the research-paper screen and your notes screen dark, with white letters for the research papers and white/colorful letters for your notes? You could also use dark mode for your browser if you want, e.g. with the Nighteye extension.
Sadly, research papers are still on white PDF pages. I like dark mode too, but I don't think it would work
@@YannicKilcher Foxit Reader allows changing the text and background colors of a PDF.
Wouldn't this technique also avoid the O(N^2) of transformers, and thus be way way faster? Could it be used to pre-train the network and have backprop fine tune it...
Yes, I mean technically, this could bypass anything, the question is just how well it performs :)
I don't know why the angles from different layers should be preserved, since there are non-linear activation functions between the layers.
I don't think there are nonlinearities in the backward projection
Could we use both learning signals at the same time (in some way)? (E.g. sum the backpropagated signal from the last layer and the DFA signal of the current layer.)
Sure, why not.
Can they really claim biological plausibility while using random matrices? Just because the technique doesn't follow BP's biological implausibilities doesn't mean that it is itself biologically plausible, no?
Not per se, but it's more plausible than backprop
It's a bit like building a mechanical horse... One of nature's criteria for evolution is continuously valid offspring; this heavily restricts the architecture.
Random matrix, hmmmm.... any possible combination with Bayesian methods (approximating the effect of the following layers)? It will be hard for an entire deep network, but for a pack of a few layers it's maybe doable (e.g. arxiv.org/pdf/1310.1867.pdf)
(p.s, I'm just an undergrad student so plz forgive my ignorance)
I'm not sure there's a connection here, but maybe
Is the reason why the B matrices can preserve lengths and angles that they are Gaussian?
You might have a look at en.wikipedia.org/wiki/Restricted_isometry_property and en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma ...
Or if you prefer a paper: "Database-friendly Random Projections" by Dimitris Achlioptas people.ee.duke.edu/~lcarin/p93.pdf also see: scikit-learn.org/stable/modules/random_projection.html
Yeah, the elements being Gaussian is sufficient but not necessary. The Johnson-Lindenstrauss lemma and random projections are probably the keywords to search for if you want to know more.
Thanks. Andreas is right. Gaussians are one of an entire class of random projections that preserve these quantities.
I see, thanks!
I have been really hoping you would cover this, this content is very sexy, thank you!
Hebbian learning in spiking networks is biologically plausible (Numenta). Layer updates are not. It doesn't matter whether the layers are updated in series or parallel, since the architecture is wrong from the start.
But so what? Backprop is just one way to implement the search for global minima. However the brain does it, the results are probably the same.
uffff, the video only gets to DFA at minute 10 :/
That's why there's time stamps
@@YannicKilcher oh! thanks for that!!!
this is quite similar to policy gradient