OUTLINE:
0:00 - Intro
0:30 - What are sparse expert models?
4:25 - Start of Interview
5:55 - What do you mean by sparse experts?
8:10 - How does routing work in these models?
12:10 - What is the history of sparse experts?
14:45 - What does an individual expert learn?
19:25 - When are these models appropriate?
22:30 - How comparable are sparse to dense models?
26:30 - How does the pathways system connect to this?
28:45 - What improvements did GLaM make?
31:30 - The "designing sparse experts" paper
37:45 - Can experts be frozen during training?
41:20 - Can the routing function be improved?
47:15 - Can experts be distributed beyond data centers?
50:20 - Are there sparse experts for other domains than NLP?
52:15 - Are sparse and dense models in competition?
53:35 - Where do we go from here?
56:30 - How can people get started with this?
Papers:
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (arxiv.org/abs/2101.03961)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (arxiv.org/abs/2112.06905)
Designing Effective Sparse Expert Models (arxiv.org/abs/2202.08906)
we are not going to talk about the attention layer today. all you have to know is that attention is all you need.
was kinda expecting this :D
I've been hearing about a fractal/hierarchy of groups/clusters of neurones approach for years, it's nice to see it actually happen
Glad to hear the guests comment that the primary motivation is engineering considerations. It really does seem like the whole concept of experts (in the light discussed) is to better route latent representations to compute. But they also make a great point that this must be the smart way forward in the longer term, instead of always shoving things through dense networks.
I wonder if there is more opportunity than acknowledged in the experts’ token tendencies for interpretability. In fact I’m surprised they expected anything other than basic switching behavior, which would seem to be the obvious first optimum to train towards. (No?) Overall it seems like a really ripe area for research and maybe architecture innovations. Dense networks must be doing something similar, so why not make it explicit with experts and leverage that? (A minimal sketch of that switching behavior follows below.)
Anyway great interview as always thanks Yannic and guests!
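For anyone wondering what that "basic switching behavior" looks like concretely, here is a minimal, hedged sketch of top-1 (switch-style) routing in the spirit of the Switch Transformers paper. It's PyTorch with made-up dimensions, and it omits the load-balancing auxiliary loss and capacity factor that the real implementations use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Minimal top-1 router: each token is processed by exactly one expert FFN."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # one logit per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                 # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)         # routing probabilities per token
        expert_idx = probs.argmax(dim=-1)                 # top-1: a single expert per token
        gate = probs.gather(1, expert_idx.unsqueeze(1))   # probability of the chosen expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = expert_idx == e                         # tokens routed to expert e
            if sel.any():
                out[sel] = expert(x[sel])
        return gate * out                                 # gate scaling lets gradients reach the router
```

The router is just a linear layer; scaling the chosen expert's output by its routing probability is what gives the router a gradient signal despite the hard argmax.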
I've been waiting for this video for about a year!
Great to see OpenAI’s secret sauce explained so clearly
this was invented by google
Expert model experts, eh?
They probably have expertise about expert models... experts on expert models with expertise about expert models...!
EXPERTCEPTION
Yannic's the only dude who can throw out an "ergo" like he's playing frisbee at the park. And the catch is effortless.
how can it be that there's literally no one talking about the KAT?
on the other hand there was a DeepMind paper mentioned wrt formalizing Sparse-vs-Dense comparisons, is there any pointer to that by any chance?
Is anyone else reminded of the cores from Portal 2 (2011)?
"wanna go to space, did you know that mars is really big? space, SPACE"
What about group convolutions? They have been used to spread groups of filters (e.g. 4) across different machines.
They didn't have a router function, but otherwise they seem related? (See the sketch below.)
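They do look related. As a rough, hedged illustration (PyTorch, made-up shapes): a grouped convolution partitions the channels into fixed groups, so the "routing" is hard-coded, whereas a sparse-expert layer learns the assignment per input.

```python
import torch
import torch.nn as nn

# Grouped convolution: the 64 input channels are split into 4 fixed groups,
# and each group of filters only ever sees its own slice of channels.
grouped = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1, groups=4)

x = torch.randn(2, 64, 32, 32)
y = grouped(x)   # same output shape as an ordinary conv, but ~4x fewer weights and FLOPs

# The "routing" here is static: channels 0-15 -> group 0, 16-31 -> group 1, and so on.
# A sparse-expert layer replaces that fixed assignment with a learned router that
# decides, per token or example, which group (expert) should do the work.
```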
Did anyone else feel they kinda talked around the questions? Barret especially?
There's something confused about the use of the word "sparse" in the context of expanded numbers of parameters for the same quantity of data. Overfitting is an obvious consequence: A sequential memory can be viewed as a bunch of "experts", each of which knows only about one bit in the data. Routing is memory address decoding. No compression of anything, no generalization/prediction hence no validation. When I think of "sparse models" it is relative to the quantity of observations up to the present point in time, hence it has a natural correspondence to Kolmogorov Complexity of the data -- the polar opposite of overfitting.
Maybe one way of approaching this is to explicitly represent the lack of connections as 0-weight connections, so it becomes a degenerate case of a stupendously huge dense network model, and then ask how one would train that stupendously huge dense model to zero out those connections down to a merely huge sparse one (see the sketch below).
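One way to make that "degenerate dense network" view explicit, purely as an illustrative sketch and not anything from the papers, is a dense layer whose weight matrix is multiplied by a block-diagonal mask: each block is one "expert" and the missing cross-connections are literal zero weights.

```python
import torch
import torch.nn as nn

class BlockMaskedDense(nn.Module):
    """A huge dense layer constrained to a block-diagonal (expert-partitioned) structure.
    Connections outside the diagonal blocks are represented explicitly as 0-weight entries."""
    def __init__(self, d=512, num_experts=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d) * 0.02)
        block = d // num_experts
        mask = torch.zeros(d, d)
        for e in range(num_experts):
            s = e * block
            mask[s:s + block, s:s + block] = 1.0   # keep only within-expert connections
        self.register_buffer("mask", mask)

    def forward(self, x):                          # x: (batch, d)
        return x @ (self.weight * self.mask).T     # zeroed connections never contribute
```

Note this is not what the MoE papers actually implement: they route each whole token through a single block instead of multiplying a giant masked matrix, which is what makes the compute (and not just the connectivity) sparse. But the masked view makes the "train a huge dense model to zero out connections" framing concrete.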
I feel this: experts are like a major shortcut for backprop to zero out all those weights. Or, seen another way, it pins all the gradient on the previous layer, tuning the latent representation to conform to the arbitrary expert partition of downstream parameter buckets.
There’s something about it that feels right in conjunction with lottery ticket ideas.
What do you think of mixture-of-experts models for vision tasks?
I'm quite confused; that's a lot of evolution in a small amount of time. But when I look at Reformers, they were never used like Transformers for NLP tasks, and when I search for pre-trained Reformer models I can't find any useful ones. Will Switch Transformers have the same destiny?
This is very Greg Egan, for example "Diaspora".
I've said that before, but it's even more so now.
Dr. Ashish Vaswani is a pioneer and nobody is talking about him. He is a scientist from Google Brain and the first author of the paper that introduced Transformers, which are the backbone of all the other recent models.
2017, wow, that is ridiculously old. Really really old indeed. (13:10). This concept predated the dinosaurs.
an expert in knowing everyones name
Expert automata
"Translate to English (United Kingdom)" come on TH-cam, this comment is already in English.
experts seem like neurones
Bruh
I am first!