I was crossing my fingers for this video. Thank you.
Crazy how Yannic has a video on every topic I want to research
Thank you for your explanation! You always make complex things easy to understand, so great!
Thanks so much, Yannic!
While I can see how one could rationalize the results otherwise, it seems to me that the scaling differences between dense and Switch (or other MoE) models on downstream tasks, relative to their scaling on perplexity, are further evidence against the idea that these are just memorize-interpolators. One would, I think, expect such memorization and interpolation to be more robust on average to MoE-style partitioning than if the models were also learning more general reasoning. Yet while Switch-Base outperforms T5-Large on perplexity, it underperforms on every downstream task except CB Trivia QA. In other words, this looks like what you would get if the parameter scaling were delivering its benefits predominantly through better memorization, and it is of a distinctly different character.
Man, you have a way of telling the story that I find very easy to understand. It is easy for me to learn from you :) thanks
I think the main takeaway for Switch-C is that it outperforms T5-XXL using 1/10th of the FLOPS (although blowing past 1T params), while the smaller Switch models get the best performance while matching T5's compute. They haven't tried combining equal compute with 1T params.
Re: "model parallelism has high communication costs."
Yes and no. Standard data-parallelism (aka layer sequential execution) incurs the overhead of synchronizing all accelerators, reducing all gradients, doing the weight update and distributing the updated parameters again. Model parallel (aka layer parallel aka layer pipelined) execution incurs the overhead of moving the hidden activations, but the weights are not moved. If moving weights is more expensive than moving activations then you probably want to run using model parallel execution. There are many cases where pipelining a model incurs the penalty of moving weights, but avoids a lot of overheads present in layer sequential execution.
From Pipelined Backpropagation at Scale: Training Large Models without Batches (Kosson et al 2020, arxiv.org/abs/2003.11666)
"Zhang et al. (2019c) find that fine-grained pipelining can enable speedups of up to 3.5x in their setting. Li & Pedram (2017) and Chen et al. (2016) both report energy savings of up to 3x. Fine-grained pipelining can also enable efficient sparse processing which Chen et al. (2019) show can result in up to a 42.5x and an 11.3x improvement in throughput and energy efficiency, respectively."
In a recent white paper Sambanova shows how they plan to pipeline models. See Figure 4 here: sambanova.ai/wp-content/uploads/2020/12/RDA-Whitepaper.pdf
Cerebras has also talked about the benefits of pipelining models: www.cerebras.net/data-model-pipeline-parallel-training-neural-networks
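To make the weights-vs-activations trade-off above concrete, here is a rough back-of-envelope in Python. All the numbers are made up for illustration (none come from the paper), and which term dominates flips depending on batch size, model size and how fine-grained the pipelining is.

```python
# Rough back-of-envelope with made-up numbers (not from the paper):
# data parallelism moves gradients for every weight each step, while
# pipelined/model-parallel execution moves activations at stage boundaries.
params = 750e6              # hypothetical dense model size
batch_tokens = 512 * 512    # sequences per step * tokens per sequence
d_model = 1024              # hidden size
boundaries = 7              # 8 pipeline stages -> 7 activation hand-offs
bytes_per_value = 2         # bf16

grad_bytes = params * bytes_per_value
act_bytes = batch_tokens * d_model * boundaries * bytes_per_value
print(f"gradients moved per step:   {grad_bytes / 1e9:.2f} GB")
print(f"activations moved per step: {act_bytes / 1e9:.2f} GB")
# Whichever is smaller hints at which strategy is cheaper to communicate.
```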
Very wrong models. Use Fast Transform neural nets with an O(n log n) compute cost per layer and 2n parameters. You can either go "wow, 1 trillion parameters", or you can slap your palm to your forehead. It depends on whether you like watching Monster Trucks or reading a chemistry book.
Impressively well explained! Thank you Yannic!
"We are not going to gave trillion parameters anytime soon". It took just 2 years to reach that soon.
Max pooling, locality sensitive hashing parameter switching, ReLU (f(x)=x connect, f(x)=0 disconnect) are all switching. Convolution, weighted sums, fast transforms (FFT Hadamard) are all dot products.
Locality sensitive hash = Random projection followed by binarization.
Random projection = fixed pattern of randomly chosen sign flips followed by Hadamard transform. Repeat for better quality.
3 Elements: Dot product, switching, predicate for switch state (EG. x
3 Element theory gives you Fast Transform fixed filter bank neural nets.
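If it helps anyone follow the recipe above, here is a tiny toy sketch of my own (not from any paper): a locality-sensitive hash built from a fixed random sign-flip pattern, a fast Walsh-Hadamard transform (O(n log n)), and a sign binarization, repeated a few rounds for better quality.

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n, rounds = 16, 2
sign_patterns = rng.choice([-1.0, 1.0], size=(rounds, n))  # fixed random sign flips

def lsh_bits(v):
    v = v.astype(float)
    for signs in sign_patterns:
        v = fwht(v * signs)      # random projection: sign flips + Hadamard transform
    return (v > 0).astype(int)   # binarize -> locality-sensitive hash code

print(lsh_bits(rng.normal(size=n)))
```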
GPT-5 Password re-rememberer: [Complete the following text]
"I forgot my password, it is..."
Yannic: "Clearly memorized by scraping hacked databases"
@@shaxosYT Oooh. Good one. Unintended successful consequences of AI.
... very hard to remember.
... forgotten
Thank you for the summary, this was very informative. I was just wondering how they managed to train the router weights if they are only sending examples to a single expert?
Maybe that's where the high expert dropout comes into play.
There is still a gradient through the selected expert. Therefore, the router can effectively up- or down-weight that expert relative to the others (perhaps akin to a policy gradient).
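A minimal sketch of how I understand this (toy code with my own variable names, not the paper's implementation): the top-1 choice itself is non-differentiable, but the selected expert's output is multiplied by the router's softmax probability for that expert, and gradients flow through that factor back into the router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy top-1 routing for a single token; `router` and `experts` are stand-ins.
d_model, n_experts = 16, 4
router = nn.Linear(d_model, n_experts)
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

x = torch.randn(1, d_model)
probs = F.softmax(router(x), dim=-1)   # differentiable gate probabilities
idx = int(probs.argmax(dim=-1))        # hard top-1 selection (no gradient here)
gate = probs[0, idx]                   # ...but this scalar IS differentiable
y = gate * experts[idx](x)             # expert output scaled by its gate probability

y.sum().backward()
print(router.weight.grad.abs().sum())  # non-zero: the router still gets a signal
```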
Uh oh, this is getting out of hand! Transformers are crazy and I can't imagine what they can do with that many params...
This is also amazing because it potentially gives common folk like me some hope to actually be able to run a reasonably sized transformer on local hardware.
Welp, I'm still happy with my "lonely one Colab in the corner", LOL.
* plays a small lonely violin in the corner *
>tfw in my MocapNET work I use a classifier that decides which ensemble trained on a subset of the problem to use (basically poor man's routing), and it was one of the reviewers' complaints.. This is a fundamentally good idea, divide and conquer!
Thanks for the great explanation 😄
So HN has a comment: news.ycombinator.com/item?id=26174038 (sorry if @thesz saw this, I did not ask for permission)
The context is that one comment suggested that Switch Transformer is parameter-inefficient, i.e., it uses too many parameters to achieve performance that some other architecture could achieve with far fewer.
To that, someone asked what the basis for this conclusion is. This comment provides the reasoning (it is actually from a different user than the one who made the original inefficiency claim).
The gist is that TensorFlow does not provide the APIs for experimenting with different algorithms, quote:
"researchers at Google cannot do IRLS (search provides IRLS only for logistic regression in Tensorflow), they cannot do Hessian-free optimization ([4], closed due lack of activity - notice the "we can't support RNN due to the WHILE loop" bonanza), etc. All due to the fact they have to use Tensorflow - it just does not support these things."
Any comments? I actually cannot comment on TensorFlow's capability at all...
Thank you for the high quality of your videos :)
whats next, routing is all you need?
Lol, don't give em ideas
Thanks Yannic!
sparsity is all you need
Thanks for another awesome video!!!
When every slightly unique concept gets its own distinct vector. Everything a person could say with the current state of the language, culture, reality is encoded in a state.
Great explanation.
Here after Mixtral 8x7B release!
Thanks for the video! Unfortunately your explanation of model parallelism is inaccurate. The way you explained it requires sequential execution of the layers (unless pipelining is used). Switch Transformer, T5, etc. split each layer into N shards and process them in parallel.
I think the feedforward is a fully connected linear layer, not a disjoint linear layer
This is just like how the brain works. There are different parts of the brain that specialize in different layers of information processing. They should be able to give some of the FFNs the ability to handle visual data, audio data, etc., so it has more than just one form of perception. The road to consciousness is through the combination of multiple forms of perception of the world in the same network. Until now it's all been done on separate networks. But that's just like separate people, not like the brain, which is processing multiple dimensions of input, and it's that multi-dimensional (sight, sound, touch, social communication, etc.) processing which combines to form a concept (as opposed to a concept composed of strictly natural language). You can be told what a chair is your whole life, but until you can touch a chair, see a chair, sit in a chair, make a chair, etc., you don't really know a chair - just the word chair and how it relates to other words. Consciousness is knowing the word chair AND seeing it AND having other forms of measurement - and then the combined concept of the chair, and the concept OF the concept OF the concept, is sent through a feedback loop for self reflection - and only then do the conditions emerge from which consciousness arises. And that's all that consciousness is - it's nothing more than multi-FFN feedback loops.
At 6:20, check out the y axis! It definitely is flattening out...
16:04 I ran into a similar problem while implementing a similar thing for one of my projects. Here, how will the router know that it should have routed to FFN1 instead of FFN2? If we do hard routing, there is no "push" from any gradients to change the routing decision.
31:00 I would recommend the TensorFlow mixed precision video from the official TensorFlow YouTube channel. It's pretty good.
Good question, I think the fact that they have a uniform distribution regularizer in there means that every now and then a token is routed to a different expert, from which it might get better gradients. A bit like RL
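For what it's worth, here is a sketch of the auxiliary load-balancing loss as I read the paper (alpha * N * sum_i f_i * P_i, with f_i the fraction of tokens sent to expert i and P_i the mean router probability for expert i); treat the details as my interpretation rather than the reference code.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts)."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    assignment = probs.argmax(dim=-1)                              # hard top-1 routing
    f = torch.bincount(assignment, minlength=num_experts).float() / num_tokens
    p = probs.mean(dim=0)                                          # mean gate prob per expert
    return alpha * num_experts * torch.sum(f * p)                  # minimized by a uniform split

print(load_balance_loss(torch.randn(1024, 8)))
```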
The usual way to do huge transformer model parallelism isn't layer-by-layer, but vertical (e.g. splitting a trainable variable across different machines). The layer-by-layer approach leads to high TPU idle time. Another framework, GPipe (arxiv.org/abs/1811.06965), described this and proposed a way to alleviate it. Of course, the performance of the vertical model split relies on fast communication between TPUs.
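A toy illustration of what splitting a single trainable variable means (my own sketch, not GPipe's or Mesh-TensorFlow's API): shard a linear layer's output dimension in two, run the shards in parallel (conceptually on different devices), and concatenate the results.

```python
import torch
import torch.nn as nn

d_in, d_out = 512, 1024
full = nn.Linear(d_in, d_out, bias=False)

# Each shard holds half of the weight rows and could live on its own device.
shard_a = nn.Linear(d_in, d_out // 2, bias=False)
shard_b = nn.Linear(d_in, d_out // 2, bias=False)
with torch.no_grad():
    shard_a.weight.copy_(full.weight[: d_out // 2])
    shard_b.weight.copy_(full.weight[d_out // 2 :])

x = torch.randn(4, d_in)
y = torch.cat([shard_a(x), shard_b(x)], dim=-1)  # the two halves can run concurrently
print(torch.allclose(y, full(x), atol=1e-5))     # True: same result as the unsplit layer
```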
11:17 a feed-forward layer still relates each token to the other tokens. It's just not computed based on the sample the way self-attention is.
I guess each expert must have a capacity >= 2 to make sense. Unless they are applying the FF to the token vector only.
you have such a lovely voice
What I thought when I watched this was "ok, so it's a CapsNet except instead of feeding the revisions to a deeper expert each expert has the final say". Is that accurate?
This is a model-parallelization paper.
maybe the expert dropout works because each expert is trained on fewer data, effectively, so regularizing it more helps
As I watched I kinda felt like I was witnessing the first strokes of AGI. I suspect that when we learn to make networks of these models that collaborate with each other to solve general problems, we will have AGI.
yas
You mean like neurons?
More like a society of mind
Yeah. We still have a way to go, but it feels like we are a lot closer than 6 years ago. The refinement of the techniques, and the way AI researchers now think about neural processes as second nature, are, imo, accelerating everything. With these huge models there is a legitimate interest in creating powerful hardware to run them.
I can imagine a transformer being used as a "memory tracker" that given a few data points can "predict" (remember) what happened before or in between.
Trillion parameters... okay, that's approaching the number of synapses in the human brain. ~800 trillion though, so it will probably still take a bit of time. Also, still needs better design, which is out there, but not all put together.
Edit: I think a further improvement would be to have multiple of these switch-transformers switch between running on some dataset, or different types of data, and I think having it combined at the end with info of which transformer ran would help too.
Yep, the first goal is to get to the 'scale' of data processing that the brain is doing, mostly to rule out magic "if it were scaled" claims/hopes/thoughts; then the goal can shift to architecture for learning/acting. I'd say 20 years till a brain-scale computer could run on a robot like Boston Dynamics', though maybe in a Tesla in about 15 years (hi KIT). Current tech can scale to brain scale in a data center, but new tech will distribute the computation at the layer level in a more neuromorphic style (1 billion tiny processors physically connected in a pattern 'learned' during data-center training).
Finally an architecture that feels more biologically plausible..
Why do you think?
@@Daniel-ih4zh Information is routed in the brain carefully: you don't need your complete brain to make sense of this sentence. In normal transformers that's not the case, and all information goes through all the computation. With this routing mechanism it finally only spends compute where it has pre-calculated that it's necessary. At least a bit ;) and at least this is my intuition.
Sounds like GShard with top-1 expert routing. What's the novelty?
Google brain published it, there's your novelty
@@christospapadopoulos7894 Say no more, you had me at Google
A specific useful application, or, as mentioned in the first few minutes of the video, it being "stable"?
Novelty is in the domain
At what point can a model start being able to make sense, i.e. start reasoning? How do we give a model reasoning power?
So, the FF layer is essentially a one-dimensional convolution. In this case, what is its kernel size? 1? Still don't quite understand that part. Also, when you say "token", you mean a 768-dimensional vector (or whatever the embedding dimensionality is), right?
Yes, true: a token is represented by its vector, and the FF is a 1d convolution with k=1.
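A quick sanity check of that equivalence (toy code, assuming the 768-dim embedding mentioned above):

```python
import torch
import torch.nn as nn

d_model, d_ff, seq_len = 768, 3072, 10
linear = nn.Linear(d_model, d_ff, bias=False)
conv = nn.Conv1d(d_model, d_ff, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))      # same weights, shape (d_ff, d_model, 1)

x = torch.randn(1, seq_len, d_model)                    # (batch, tokens, features)
out_linear = linear(x)                                  # position-wise FF, applied per token
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)      # Conv1d wants (batch, channels, tokens)
print(torch.allclose(out_linear, out_conv, atol=1e-4))  # True: FF layer == 1-D conv with k=1
```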
Is Yannic mainly concerned with NLP application?
no I'm just interested in whatever I think is cool :)
Doesn't the hard routing screw with differentiation?
For automatic differentiation I believe it is analogous to a max pool operation
@@anshul5243 I think it is some sort of hard attention, but on the parameters not the input. It must use argmax, which has a derivative of 0 almost everywhere. Except where 2 or more arguments have the same maximum value, then it is undefined. It is not useful for gradient descent. Maybe they are doing something with random sampling to estimate the optimization step to take. I have not read the paper.
4:00 "We are not going to have trillion parameters models around anytime soon just yet"
Don't you think OpenAI will release GPT-4 with more than 1 trillion parameters in 6-9 months?
I think they will.
Closed AI
They will, but my prediction is near the end of 2021
DeBERTa explanation please.
Reminds me of lambda layers
no examples/use cases?
how do they back-prop the error if using argmax to switch expert?
a bit like random exploration in RL
The sparsity concept is like the neural network being able to figure out by itself which two (or many) parameters can form a concept, which it cannot do while training, and after training you can't really figure out which parameter learned what, because they are just numbers. The only way this can work is by matching values from a lower concept to a higher concept - the whole-part theory Mr. Hinton was talking about. There is no precise way of doing it, because energy manifestations in our world are probabilistic and not absolute. But this idea is worthwhile to explore.. at least for solid objects, though simply impossible for thought processes.
Experts or exberts?
nice one
The paper is somewhat unfriendly to read.
It's starting to get ridiculous, no?
The rate of advancement in AI is astounding. It wasn't long ago people thought Go would never be solved by computers, and now that's ez work.
I believe I will see at least the path to AGI in my lifetime.
@@brandonwickstead9159 meh, not sure what is hype and iteration on more power and parameters, and what is really ground-breaking work.
How is a token forwarded to each of the different FF layers? Deep fakes.
Pytorch model for those playing at home - github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/switch/experiment.py