Very cool paper! Especially the insight that the local optima are very different in their predictions, and not just all failing on the same datapoints! 25 percent disagreement with 60 percent accuracy is not a giant effect though. With 40% mislabeled, there is a lot of space for disagreement between two wrong solutions.
And cos similarity in parameter space seems like a useless measure to me, since just by reordering neurons you can permute weights without changing the function's behavior. This permutation is enough to make the parameters very cos-dissimilar. Consider e.g. a network with a 1x2 layer with weights (1,-3) and a 2x1 layer with weights (2,1)^T (and no bias for simplicity). Thus the parameter vector is (1,-3,2,1). By switching the two hidden neurons we get (-3,1,1,2), resulting in a cos similarity of -2/15 = -.133 while the networks are equivalent!
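The permutation example above can be checked numerically. This is a minimal sketch in plain Python; the ReLU activation and the helper names `cosine` and `tiny_net` are assumptions for illustration, since the comment does not name an activation function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tiny_net(x, w1, w2):
    """One input -> two hidden units -> one output, no biases.
    ReLU is an assumed activation; the argument holds for any
    activation applied per hidden unit."""
    hidden = [max(0.0, w * x) for w in w1]
    return sum(h * w for h, w in zip(hidden, w2))

# Weights from the comment, and the same network with the two
# hidden neurons swapped.
w1, w2 = [1.0, -3.0], [2.0, 1.0]
w1_perm, w2_perm = [-3.0, 1.0], [1.0, 2.0]

theta = w1 + w2                  # (1, -3, 2, 1)
theta_perm = w1_perm + w2_perm   # (-3, 1, 1, 2)

print("cosine similarity:", cosine(theta, theta_perm))  # about -0.133

# Both parameter vectors compute the same function on these inputs.
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert tiny_net(x, w1, w2) == tiny_net(x, w1_perm, w2_perm)
```

So two parameter vectors can be nearly orthogonal while computing the same function, which is why low cosine similarity in weight space alone does not establish a functional difference.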
@dermitdembrot3091 yeah the cosine similarity between models doesn't say much; but the difference in classification does indicate that those differences are meaningful; clearly these are not mere permutations, or something very close to it.
This time a "classic" from Dec-2019 ;-)... The consistent amount of disagreement in Fig. 3 right is very interesting. But the 15% prediction similarity in Fig. 5 middle and right seems very low to me (e.g. if they disagree on 85% of the predictions, how can the accuracy of those models be 80% at the optima as shown in the left plot?). Would be great if somebody could give me a hint... Also I am not convinced that different t-SNE projections prove a functional difference (e.g. symmetrical weight solutions may also have high distance). And just a thought regarding weight orthogonality: even normally distributed random vectors are nearly orthogonal (e.g. probably also two different initializations). I will take a closer look at the paper, probably there are some details explained...
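The arithmetic behind the question above: whenever two classifiers disagree on an example, at least one of them is wrong, so their disagreement rate on a fixed dataset cannot exceed the sum of their error rates. A small sketch follows; the function names and the toy predictions are made up for illustration.

```python
def max_disagreement(acc_a, acc_b):
    """Upper bound on the fraction of examples where two classifiers
    evaluated on the same data can differ: error_a + error_b."""
    return min(1.0, (1.0 - acc_a) + (1.0 - acc_b))

def disagreement(preds_a, preds_b):
    """Fraction of examples on which the two prediction lists differ."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Hypothetical toy data: both models are 80% accurate and make their
# errors on different examples, which maximises disagreement.
labels  = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
preds_a = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]   # wrong on examples 0 and 1
preds_b = [0, 1, 3, 4, 4, 5, 6, 7, 8, 9]   # wrong on examples 2 and 3

print("observed disagreement:", disagreement(preds_a, preds_b))
print("upper bound:", max_disagreement(0.8, 0.8))
```

Two models that are each 80% accurate can therefore disagree on at most 40% of the examples, so a similarity of 15% cannot describe two such models compared directly on the same data.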
I guess they might have taken validation examples with very high noise or a very different set of vectors... One thing I always observed and wondered is that for classifiers to be accurate they need not be same or similar. Like humans would agree on images of cats and dogs but probably will never ever agree if they are given images of random patterns or pixels and have to label them as cats and dogs. This paper very very much strengthens my belief but will read it first.
Regarding the t-sne plots: I thought they were trajectories of the network weights... but they are mapped predictions of the networks, so the plot indeed shows functional differences...
@eelcohoogendoorn8044 Thanks for the attempt, but in the paper it says "show function space similarity (defined as the fraction of points on which they agree on the class prediction) of the parameters along the path to optima 1 and 2" ... this is also consistent with values approaching 1 in the direct neighborhood around the target optimum. In the supplementary material there are similar plots with a different color legend ranging from 0.4 to 1.04, with the description in the text "shows the similarity of these functions to their respective optima (in particular the fraction of labels predicted on which they differ divided by their error rate)". Maybe these are standard plots that can be found in other literature too ... I have not yet had time to go through the references. If somebody is familiar with this topic it would be great if she/he could give a short explanation or reference.
Is it impossible to design networks with a globally convex loss space? The paper seems to say these networks have multiple local minima, but does this necessarily have to be so? I'm curious if a small enough network on a small enough problem can have a globally convex loss space; then we could build larger networks out of these smaller networks, with some sort of Hebbian process controlling the connections between them.
Is there any kind of gauge invariance for the neural network? In the sense that we have to look not at particular assignments of weights of neurons, but at equivalence classes up to some transformations of the neurons?
It seems to be a really good paper that views training from a totally different lens. Also, I have seen a few top Kaggle solutions with 8+ Fold cross validation ensembles, so that seems to work. A much needed break from SoTA papers indicating "I beat your model" :D
One thing is that adding more and more models to a deep ensemble usually ends up having diminishing or no returns after a certain point, indicating that adding a new model after the 20th one might not bring any more disagreement in predictions. Any thoughts on why that is? Maybe it's limited by the architecture + training method somehow.
@YannicKilcher I'm not sure I agree with this completely. There may be a large number of minima in weight space due to symmetries, but I'm not sure there are a large number of disagreements in the output space given a particular test set. The disagreements in the output space are what give better accuracy on the test set as more models are added.
If each network covers a subset of the data, a preprocessing network could classify a sample and choose which sub network will do the best job "actually" classifying it? I'm sure this isn't novel, what's it called?
A multiplexor network? Probably better to learn a function that takes a weighted average of each sub network's output, since it uses all of them together rather than choosing just one. Or better yet -- learn an ensemble of such functions! Ha.
@snippletrap Thanks! Looking at the MUXConv paper, that's spatial multiplexing, not quite what I was wondering. I'll keep searching multiplexing though. However, rather than putting all samples through all subnetworks, I'm imagining e.g. 10 small networks each trained on 1/10 of the dataset. The partitioning of which samples end up in which 1/10th would also be learned, and then a controller network would also be learned! Overfitting strategies would be needed.
This paper basically confirms the well-known technique of ensembling from *actual* independent runs, rather than the many papers describing tricks to ensemble from a single run, such as SWA / SWAG. They didn't say it explicitly, but I guess this might imply a fundamental tradeoff in ensembling: you can either save some computation by single-run averaging, or reap all the benefits of an ensemble from multiple runs.
Is it not costly to train the same network multiple times if the dataset is large? I read one paper titled “Snapshot Ensembles: Train 1, get M for free”, where the authors propose training a network just once with an LR schedule and checkpointing the minima, then treating the checkpointed models as ensemble candidates. I see both of them talking along the same lines. Is there any comparison done with the results from that paper as well?
Thanks for the video. As always, great breakdown. I wonder why the authors also consider perturbing the weights slightly as a strategy, since that does not get you out of the current local minimum, right? So for ensembles to be effective one would need learners that are as uncorrelated as possible (kinda captured by the independent optima strategy). Maybe I need to read the paper first hand too, to understand the details more :P
I think it's a good question. The benefits provided by ensembles apply to both over- and underparameterized models to some extent, though more so for the typically overparameterized case, I would imagine. Would be good to address that question head on experimentally. Still, the equi-parameter single big model isn't going to outcompete the ensemble, unless we are talking about a severely underparameterized scenario. These benefits are qualitatively different from simply having more parameters.
Great video and paper. A lot of insight! However I was a bit disappointed by the 5% improvement of combining 10+ ensembles relative to the original one when the different solutions were so different (were they?). Makes me think the way ensembles were combined is not optimal.
Dood, you're on FIRE! I've used XGBoost ensembles to compete in Kaggle - there was no way to compete without them in the competitions I'm thinking of. But wasn't dropout supposed to make ensembles redundant? Isn't a network with dropout, in effect, training continuously shifting ensembles?
I think dropout is for regularizing NNs only. It does NOT have the effect of an ensemble. That's because it is still a single set of weights converging to one local minimum. Dropout just helps the convergence. Please correct me if I am wrong.
There are a lot of quasi-informed statements going on around dropout; it making ensembles redundant never made it past the wishful-thinking stage, as far as I can tell from my own experiences.
Is there any known pattern in the distances between these loss space minima? I’m curious if this could provide a way to adjust learning rate to jump to minima quicker
Very cool paper, I kind of want to do that plane plot from 2 solutions with some networks now :) On fig 3 I am not super happy about the measure they used for the disagreement. If your baseline accuracy is 64%, the fact that the solutions disagree on 25% of labels shows that the functions are different, but does not convince me that they are in different modes. It definitely does not suggest that picture with the first 10% errors vs last 10% errors. The diversity measure in fig 6 seems a bit better to me, but still not ideal. I would be more interested in something like a full confusion matrix for the two solutions, or parts of it. To me, disagreement on examples that both networks get wrong is not interesting; the cases where at least one of the networks gets it right are a lot more interesting, because those are the cases that at least have the potential to boost the ensemble performance.
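One way to make the distinction above concrete is to split the disagreement rate by who is right. This is only a sketch of that bookkeeping, not the measure used in the paper; the function name `disagreement_breakdown` and the toy labels are hypothetical.

```python
def disagreement_breakdown(labels, preds_a, preds_b):
    """Split examples into: models agree, only A right, only B right,
    and both wrong with different guesses. Returns fractions."""
    counts = {
        "agree": 0,
        "only_a_right": 0,
        "only_b_right": 0,
        "both_wrong_differ": 0,
    }
    for y, a, b in zip(labels, preds_a, preds_b):
        if a == b:
            counts["agree"] += 1
        elif a == y:
            counts["only_a_right"] += 1
        elif b == y:
            counts["only_b_right"] += 1
        else:
            counts["both_wrong_differ"] += 1
    n = len(labels)
    return {name: count / n for name, count in counts.items()}

# Hypothetical toy predictions for ten examples.
labels  = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
preds_a = [0, 1, 2, 3, 4, 5, 6, 0, 0, 1]
preds_b = [0, 1, 2, 3, 4, 5, 0, 7, 1, 1]

breakdown = disagreement_breakdown(labels, preds_a, preds_b)
for name, fraction in breakdown.items():
    print(f"{name}: {fraction:.1f}")
```

Only the `only_a_right` and `only_b_right` buckets contain examples that averaging could in principle rescue; `both_wrong_differ` counts toward the raw disagreement rate without offering any such upside.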
I'm not really convinced of the criticism of Bayesian NNs. The paper itself seems to indicate that issues stem from using a Gaussian distribution, not from the need to approximate some prior, or whatever. It's unclear to me why approximating a prior should restrict us to one local minimum, while it's clear why using a Gaussian would do that. Intuitively, replacing the Gaussian with some multimodal distribution with N modes should perform similarly to an ensemble of N Gaussian NNs; in fact, the latter seems like it would be basically the same thing as a Bayesian NN which used normal mixture distributions instead of Gaussian ones. Though non-Gaussian methods are common in Bayesian machine learning, I can't think of any work that does this in the context of NNs; maybe this work provides adequate motivation to push in that direction.
Are the predictions of the ensemble models simply averaged? I would try to weight them in a way that maximizes the signal to noise ratio (or more likely the logarithm of it). I'm guessing it won't make a big difference with large ensembles, but might help if you can only afford to train an ensemble of a few different initializations. Also, I'm slightly damp.
Trimmed mean is where it is at; pretty much best of both worlds of a median and mean. Yes it looks simple and hacky; but just like training 10 models from scratch, show me something that works better :).
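For reference, a trimmed mean over ensemble members can be sketched as below. This is an illustration of the suggestion in this reply, not the combination rule used in the paper; `trimmed_mean`, `combine` and the toy probabilities are assumptions.

```python
def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    if 2 * trim >= len(values):
        raise ValueError("trim removes every value")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)

def combine(member_probs, trim=1):
    """Per-class trimmed mean across ensemble members for one example,
    renormalised so the result sums to 1."""
    n_classes = len(member_probs[0])
    scores = [
        trimmed_mean([probs[c] for probs in member_probs], trim)
        for c in range(n_classes)
    ]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical class probabilities from five models for one image;
# the last model is a confident outlier.
member_probs = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.80, 0.10, 0.10],
    [0.65, 0.25, 0.10],
    [0.00, 0.00, 1.00],
]

plain_mean = [sum(p[c] for p in member_probs) / len(member_probs) for c in range(3)]
combined = combine(member_probs, trim=1)
print("plain mean:  ", [round(p, 3) for p in plain_mean])
print("trimmed mean:", [round(p, 3) for p in combined])
```

With `trim=0` this reduces to the ordinary average; trimming as many values as possible from each side approaches the median, which is the "best of both worlds" trade-off mentioned above.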
22:31 that is not strong evidence, since the weight space is very, very high dimensional, and most pairs of vectors in high dimensions are orthogonal to each other.
Hey so, great content as usual, however I disagree with your interpretation. You are saying that since the parameters of the models are so different, the models must perform differently on different examples. As far as I have understood, you take that as evidence that our intuition of easy and hard examples is incorrect. However, I don't think these two things are related. I think this big discovery that the models' parameters are so different is just a fact of the strangeness of high dimensions. If you pick any 2 random vectors in high dimensional space, they will be almost orthogonal to each other. So any difference in initial conditions will lead to quite different outcomes.
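The near-orthogonality claim is easy to check empirically. A rough sketch in plain Python, assuming i.i.d. standard normal entries as a stand-in for random weight vectors (the helper names and the chosen dimensions are arbitrary):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine| over random pairs of Gaussian vectors in `dim` dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs

for dim in (2, 10, 100, 1000):
    print(dim, round(mean_abs_cosine(dim, pairs=100), 3))
```

The average absolute cosine shrinks roughly like 1/sqrt(dim), so for networks with millions of parameters a cosine similarity near zero between two solutions is what random vectors would give as well.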
Does the brain also use this trick? Creating an ensemble of thousands of not-so-deep learners, all of them trying to solve a similar problem, so that it can easily generalize between distant tasks and gets a better sampling of the function space? This is similar to the 'Thousand Brains Theory' of Numenta, isn't it? What do you think?
I was thinking the same thing. Neurons in the brain are only connected to nearby neurons. Thus, if we take clusters of far off neurons, we can say that it acts like an ensemble.
Bayesian... I hate this marchening pseudo-duonomial filtering networks using only the VGE small-memory encoder. Would you call it TensorFlow libraries it's too low for these 2D refractions?
reminds me of Kazanova and his stacking -- analyticsweek.com/content/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
@23:44 the chart says that they disagree on 20% of the labels, but it doesn't say that those 20% are different elements from the dataset. For example, it could be the same 20% of the dataset, but training 1 guesses they're cats, and training 2 says they're dogs. Of course, this has some diminishing returns because there are only 10 classes in CIFAR-10, but I think the point still holds. Also, I agree with the paper that this supports the idea that they're different functions, but I don't think it supports that the 20% are totally different elements from the dataset. What do you think, @Yannic Kilcher?
0:00 Your pronunciation for names of all cultures is remarkably good. In fact, I would say that you are the best amongst all the people I have seen so far. A Russian, Chinese and an Indian walked into a bar, Yannic greeted all of them and they had a drink together. There is no joke here, go away.
.
24:00 It is interesting to see that ResNet has lower disagreement and higher accuracy. I wonder if disagreement is inversely correlated with accuracy.
.
32:50 I think one way of interpreting this is that both 3 * 5 and 5 * 3 give the same result 15. So, although they are different in weight space, they are not in solution space. Thus, they have the same loss/accuracy. This is difficult to prove for neural networks with millions of parameters, but I would wager that something similar happens. I think this problem may disappear completely if we manage to find a way to make a neural network whose parameters are order independent.
.
44:00 I wonder what would happen if we slice a ResNet lengthwise into maybe 2-5 trunks so that the neurons in each layer are only connected to its own trunk. All trunks would have a common start and end. Would that outperform the regular ResNet? Technically, it is an ensemble. Right?
.
I think authors should also start publishing disagreement matrix in the future.
24:00 I guess residual connections make the loss landscape more convex, that's why ResNet and DenseNet seem more similar. A similar idea was shown in a paper on visualizing the loss landscape of NNs. I guess it relates.
It must be a Swiss thing -- as a small country in the middle of Europe they need to speak with all their neighbors.
Thanks, that's a big compliment :)
Yea I think there's definitely a link between disagreement and accuracy, but also the disagreement metric among different architectures is very shaky, because it's entirely unclear how to normalize it.
In the paper, they do mention weight space symmetry and acknowledge it, but the effect here goes beyond that, otherwise there would be no disagreement, if I understand you correctly.
Your idea is interesting, sounds a bit like AlexNet. The question here is how much of the ensemble-ness comes from the fact that you actually train the different parts separately from each other. I have no idea what would turn out, but it's an interesting question :)
this channel is life changing! Please keep up the amazing work!
What a paper! Some really fascinating information there, and immensely thought provoking. Thanks a lot for the talk-through!
I loved this paper explanation. And the paper is so interesting. Will try it out. Thanks for the explanation.
I'm not convinced the final decision vectors are so different, since they can all be permuted. I am intrigued by the fact that they disagree, but we know from adversarial methods that it doesn't take much for a network to change its decision.
Very Intrigued, but not fully convinced 🤔
TL;DR [45:14]
Multiple random initializations lead to functionally different but equally accurate modes of the solution space that can be combined into ensembles to combine their competence.
(This works far better than building an ensemble from a single mode in solution space that has been perturbed multiple times to capture parts of its local neighborhood.)
Nice when other people prove your theory on your behalf :) Thanks for the clear breakdown of this paper
Love this paper and your explanation. It almost seems that if you train the model from different random initializations, on say a dataset of cat and dog pictures for example, one will try to figure out the labels based on one set of features like eye shape, and another initialization will try to figure it out based on other features like hair texture. Completely different types of approaches to the problem, but I would expect different accuracies at the end. Incredible results, makes the cogs turn in my head as to what's going on.
Since the Lottery Ticket Hypothesis, I have been expecting a paper which shows that those ensembles can be joined into one neural network.
Mean, mode, median: oh yeah it's all coming together
You are correct. See "Training independent subnetworks for robust prediction", under review at ICLR 2021.
I wonder how similar the weights of the first n layers (maybe n=5) of the different networks are. Do they capture the same low level features?🤔
Good question.
the cosine comparison is just a scalar projection of all final parameters. When 2 networks have similar weights, then the scalar projection is big.
It's even possible that they capture the same low level features, but extract/represent these in different ways.
Perhaps a bit late, but there is a paper on exactly this question: arxiv.org/pdf/1905.00414.pdf
19:43 correction -- the t-SNE plot is not for optimization iterates but rather for the prediction sets (train/test, idk). So the conclusion from the plot is *not* that the optimization enters a specific valley in the *parameter space* but instead that it enters a valley in the *prediction space* (e.g. R^N if we have N data points with scalar outputs).
This is so cool! I love learning about this stuff
What an awesome paper and video
Super interesting video. I wonder whether the differently initialised models learn to “specialise” on a specific subset of classes, given 1) that each independent model performs about equally well, 2) the non-overlapping cosine similarities across the models, and 3) that the average/sum of multiple models improves accuracy. Could it be that one model specialises in predicting airplanes and ships, while another model specialises in predicting deer and dogs? This might explain why they perform about equally well on the CIFAR10 test set, don’t overlap in terms of cosine similarity, and why accuracy improves by simply averaging their predictions. Or stated differently, whether the number of modes in the loss landscape corresponds to the number of ways of combining the classes.
This is also somewhat related to your comment at 34:21, regarding under-parameterisation (“that no single model can look at both features at the same time”) vs. an over-specified model (“too simple of a task” and that there are 500 different ways of solving it).
With many classes this is a difficult hypothesis to prove/disprove, since the combinatorics of N classes/labels gives 2^N-1 modes in the loss landscape. With CIFAR10, although they check 25 independent solutions and find no overlapping cosine similarity, this might simply be due to there being 2^10-1 = 1023 different possible combinations of functions. If I was to test it, I would try to investigate a dataset with three classes: airplane, bird and cat (hereafter “A”, “B”, and “C”), which only gives 2^3-1 = 7 combinations. Then you could test whether you find a model that was good at predicting class A, one that was good at predicting class B, good at C, reasonable at A+B, reasonable at A+C, reasonable at B+C, and ok at predicting A+B+C. It would be disproven by checking 10-25 independent solutions and finding no overlap in the cosine similarity. On the other hand, if you find overlap, then this would indicate that the models during training become class-specific models (or class-subset-specific models), and this might also explain why with ensembling we see decreasing marginal improvements in accuracy as ensemble size grows.
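The counts in this comment are the numbers of non-empty subsets of the class set. A short enumeration confirms the arithmetic; whether each subset really corresponds to a mode in the loss landscape is the commenter's hypothesis, and the helper name `class_subsets` is made up here.

```python
from itertools import combinations

def class_subsets(classes):
    """All non-empty subsets of `classes`, smallest subsets first."""
    classes = list(classes)
    subsets = []
    for size in range(1, len(classes) + 1):
        subsets.extend(combinations(classes, size))
    return subsets

# Three-class experiment proposed above: airplane, bird, cat.
three = class_subsets(["A", "B", "C"])
print(["+".join(s) for s in three])   # ['A', 'B', 'C', 'A+B', 'A+C', 'B+C', 'A+B+C']
print(len(three))                     # 7

# CIFAR-10 has ten classes.
print(len(class_subsets(range(10))))  # 1023
```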
Very nice thoughts, it sounds like an awesome project to look at! It gets further complicated by the numerous symmetries that exist in neural networks, it's very unclear how much cosine similarity is an actual accurate measure here.
Interesting. I wonder if people have taken various task types, explored the problem space and figured out if the landscape is different. Would the landscape (of hills and valleys) of CIFAR vs MNIST be different? Are there characteristics or patterns that we can derive other than just figuring out the gradient and trying to go down into the local minimum? For instance, are all minima shaped the same? Can you derive the depth of a local minimum from that shape? Could you take advantage of that during training to abandon that training run and start from a new random spot? If you knew how large a valley (or the average valley in the landscape) was, could you use that to inform where you put a new starting point for your training? If the thesis holds that all local minima are generally at about the same level of loss, couldn't one randomly sample points until one is found within some error of the global minimum, such that you can save training time? This would be especially fruitful if you knew some characteristics about the average size of the valleys of local minima, and if you were going to train ensembles anyway.
As always, great work. Thank you for your insight. It's become a daily habit, checking out the Kilcher paper of the day.
I don't think so. Non-convex optimization is NP-hard after all. If there was an easy way, it would break all of computer science.
Yeah it is somewhat problem dependent. Ensembles are particularly valuable in regression tasks with a known large nullspace. Image reconstruction tasks usually fall into this category; there being a family of interpretations consistent with the data, most variations being physically indistinguishable rather than just having similar loss, and we don't want to be fooled into believing just one of them.
Neural networks, being typically overparameterized, intrinsically have some of this null-spaceness within them it seems though, regardless of what problem you are trying to solve with them.
Look into this project if you haven't:
losslandscape.com
They want to build a database of loss landscape visualizations.
This is the kind of research that makes it all possible - makes me wonder about nested hierarchies of ensembles, a la Hinton's capsules?
I would like to see the difference per layer. If the weights diverge at higher levels rather than lower ones, this could be a good insight into how to optimize: basically train a network once, extract the stable layers, then retrain with divergence on the unstable layers. I always felt something akin to frequency separation would work.
I'm just spitballing here, but what if towards the end of the NN each of the nodes was treated a bit differently during training in an attempt to stimulate multi-modal solutions; perhaps only backpropagate on the node with the currently highest signal for the given training example.
nice idea!
How do you get the initialization points to be so close in the t-SNE plot? I have been trying to implement it in PyTorch, but with no success.
He said it wrong. The t-SNE is not of the parameters, but rather of the predictions (maybe on the whole training set).
Fascinating insight into ensemble methods, and by extension lottery tickets etc. It rather begs the question though, how many basins/modes are there, and what do they most correspond to? It feels like we are exploring the surface of a distant and unknown planet, exciting times 😀
Very cool paper! Especially the insight that the local maxima are very different in their predictions. And not just all failing with the same datapoints!
25 percent disagreement with 60 percent accuracy is not a giant effect though. With 40% mislabeled, there is a lot of space for disagreement between two wrong solutions.
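To put rough numbers on how much room 40% errors leave, here is a minimal sketch. It is not from the paper: `disagreement_range` and `independent_disagreement` are hypothetical helpers, and the "independent baseline" assumes independent errors with wrong guesses spread uniformly over the incorrect classes, which real networks will not satisfy exactly.

```python
def disagreement_range(acc_a, acc_b):
    """Bounds on the disagreement rate between two classifiers with the
    given accuracies on the same test set."""
    err_a, err_b = 1.0 - acc_a, 1.0 - acc_b
    # They must disagree wherever exactly one of them is right,
    # so disagreement is at least the gap between the error rates.
    lower = abs(err_a - err_b)
    # They can only disagree where at least one of them is wrong.
    upper = min(1.0, err_a + err_b)
    return lower, upper


def independent_disagreement(acc, n_classes):
    """Expected disagreement under two ASSUMPTIONS: independent errors, and
    wrong guesses spread uniformly over the n_classes - 1 incorrect labels."""
    both_right = acc * acc
    same_wrong_label = (1.0 - acc) ** 2 / (n_classes - 1)
    return 1.0 - (both_right + same_wrong_label)


low, high = disagreement_range(0.6, 0.6)
baseline = independent_disagreement(0.6, n_classes=10)
print(f"possible range: {low:.2f} to {high:.2f}, independent baseline: {baseline:.3f}")
```

Under these assumptions two 60%-accurate models could disagree on anything from 0% to 80% of the labels, and fully independent ones would disagree on about 62%, so 25% on its own says the functions differ but not by how much relative to chance.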
And cos similarity in parameter space seems like a useless measure to me, since just by reordering neurons you can permute weights without changing the function's behavior. This permutation is enough to make the parameters very cos-dissimilar. Consider e.g. a network with a 1x2 layer with weights (1,-3) and a 2x1 layer with weights (2,1)^T (and no bias for simplicity).
Thus the parameter vector is (1,-3,2,1)
By switching the two hidden neurons we get (-3,1,1,2) resulting in cos similarity of -2/15 = -.133 while the networks are equivalent!
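The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch: the comment does not name an activation, so the ReLU in `forward` is an assumption (any elementwise nonlinearity gives the same conclusion), and `forward` and `cosine` are helper names made up for illustration.

```python
import math


def forward(x, w1, w2):
    # Scalar input, two hidden units, scalar output, no biases.
    # ReLU is an assumption; the original example does not specify one.
    hidden = [max(0.0, w * x) for w in w1]
    return sum(h * v for h, v in zip(hidden, w2))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Original network and the same network with its two hidden neurons swapped.
w1, w2 = [1.0, -3.0], [2.0, 1.0]
w1_perm, w2_perm = [-3.0, 1.0], [1.0, 2.0]

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
same_function = all(
    math.isclose(forward(x, w1, w2), forward(x, w1_perm, w2_perm)) for x in inputs
)
cos_sim = cosine(w1 + w2, w1_perm + w2_perm)
print(same_function, round(cos_sim, 3))
```

The two parameter vectors give identical outputs on every test input while their cosine similarity is -2/15, matching the number above.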
@@dermitdembrot3091 I wonder if we can make a neural network which is permutation invariant.
@@dermitdembrot3091 yeah, the cosine similarity between models doesn't say much; but the difference in classification does indicate that those differences are meaningful; clearly these are not mere permutations, or something very close to them.
thank you, great efforts ,great explanation.
I wonder how good that measure of weight similarity really is
This time a "classic" from Dec-2019 ;-)... The consistent amount of disagreement in Fig. 3 right is very interesting. But the 15% prediction similarity in Fig. 5 middle and right seems very low to me (e.g. if they disagree on 85% of the predictions, how can the accuracy of those models be 80% at the optima, as shown in the left plot?). Would be great if somebody could give me a hint... Also, I am not convinced that different t-SNE projections prove a functional difference (e.g. symmetrical weight solutions may also have a high distance). And just a thought regarding weight orthogonality: even normally distributed random vectors are nearly orthogonal (e.g. probably also two different initializations). I will take a closer look at the paper, probably some details are explained there...
I guess they might have taken validation examples with very high noise or a very different set of vectors... One thing I always observed and wondered is that for classifiers to be accurate they need not be same or similar. Like humans would agree on images of cats and dogs but probably will never ever agree if they are given images of random patterns or pixels and have to label them as cats and dogs. This paper very very much strengthens my belief but will read it first.
Regarding the t-sne plots: I thought they were trajectories of the network weights... but they are mapped predictions of the networks, so the plot indeed shows functional differences...
Please update us if you figure out the accuracy thing, it's bothering me too!
Seems to me the axis is the amount of disagreement, given that 0 is on the diagonal. So they disagree in 15%, which is consistent with 80% accuracy.
@@eelcohoogendoorn8044 Thanks for the attempt, but the paper says "show function space similarity (defined as the fraction of points on which they agree on the class prediction) of the parameters along the path to optima 1 and 2"... this is also consistent with values approaching 1 in the direct neighborhood of the target optimum. The supplementary material has similar plots with a different color legend ranging from 0.4 to 1.04, with the description in the text "shows the similarity of these functions to their respective optima (in particular the fraction of labels predicted on which they differ divided by their error rate)". Maybe these are standard plots that can be found in other literature too... I have not yet had time to go through the references. If somebody is familiar with this topic, it would be great if she/he could give a short explanation or reference.
Is it impossible to design networks with a globally convex loss space? The paper seems to say these networks have multiple local minima, but does this necessarily have to be so? I'm curious whether a small enough network on a small enough problem can have a globally convex loss space; then we could build larger networks out of these smaller networks, with some sort of Hebbian process controlling the connections between them.
as soon as you add nonlinearities, the non-convexity appears, unfortunately
25:26 That caught me off guard! LMAO!!!
Is there any kind of gauge invariance for neural networks? In the sense that we have to look not at particular assignments of weights to neurons, but at equivalence classes up to some transformations of the neurons?
Good questions, there are definitely symmetries, but no good way so far of capturing all of them.
It seems to be a really good paper that views training from a totally different lens. Also, I have seen a few top Kaggle solutions with 8+ Fold cross validation ensembles, so that seems to work. A much needed break from SoTA papers indicating "I beat your model" :D
I almost lost hope in the SOTA ocean, then this paper kicks in...Thanks Google...Oh! Wait...
One thing is that adding more and more models to a deep ensemble usually ends up having diminishing or no returns after a certain point, indicating that adding a new model after the 20th one might not contribute any new disagreement in predictions. Any thoughts on why that is? Maybe it's limited by the architecture + training method somehow.
I think they will still have disagreements, but these are already covered by the other networks in the ensemble.
@@YannicKilcher I'm not sure I agree with this completely. There may be a large number of minima in weight space due to symmetries, but I'm not sure there are a large number of disagreements in the output space given a particular test set. The disagreements in the output space are what give better accuracy on the test set as more models are added.
If each network covers a subset of the data, a preprocessing network could classify a sample and choose which sub network will do the best job "actually" classifying it? I'm sure this isn't novel, what's it called?
A multiplexor network? Probably better to learn a function that takes a weighted average of each sub network's output, since it uses all of them together rather than choosing just one. Or better yet -- learn an ensemble of such functions! Ha.
@@snippletrap Thanks! Looking at the MUXConv paper, that's spatial multiplexing, not quite what I was wondering. I'll keep searching multiplexing though. However, rather than putting all samples through all subnetworks, I'm imagining e.g. 10 small networks each trained on 1/10 of the dataset. The partitioning of which samples end up in which 1/10th would also be learned, and then a controller network would also be learned! Overfitting strategies would be needed.
"Run-time Deep Model Multiplexing" is what I'm reading now!
PathNet: Evolution Channels Gradient Descent in Super Neural Networks arxiv.org/abs/1701.08734 seems relevant too!
You should also search for “Mixture of experts”. It’s a general principle which has been applied to neural networks in the way you’re proposing.
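For anyone following this thread, here is a minimal sketch of the two options discussed: a gate that takes a weighted average of all sub-networks, and a "controller" that routes to just one. Everything here is illustrative: `mixture_predict` is a hypothetical helper, and the gate scores are hard-coded numbers standing in for the output of a learned gating network.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_predict(gate_scores, expert_probs, top_k=None):
    """Combine per-expert class probabilities using gate weights.

    top_k=None averages over all experts; top_k=1 routes to a single expert.
    """
    weights = softmax(gate_scores)
    if top_k is not None:
        # Keep only the top_k highest-weighted experts and renormalize.
        keep = sorted(range(len(weights)), key=weights.__getitem__, reverse=True)[:top_k]
        kept_total = sum(weights[i] for i in keep)
        weights = [w / kept_total if i in keep else 0.0 for i, w in enumerate(weights)]
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]


# Three hypothetical experts' class probabilities for one input.
experts = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
gate_scores = [2.0, 0.0, 0.0]  # assumed output of a gating network for this input

soft = mixture_predict(gate_scores, experts)           # weighted average of all experts
hard = mixture_predict(gate_scores, experts, top_k=1)  # route to the single best expert
print(soft, hard)
```

Routing with `top_k=1` only needs to run one sub-network per sample, which is the compute saving being asked about; the soft version uses all of them.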
looks like ensemble is still a good way to squeeze out a little bit more accuracy, when a single model is highly optimized.
This paper basically confirms the well-known technique of ensembling from *actual* independent runs, as opposed to the many papers describing tricks to build an ensemble from a single run, such as SWA / SWAG.
They didn't say it explicitly but I guess this might imply a fundamental tradeoff in ensembling, you can either save some computation by single-run averaging, or reap all the benefit of ensemble from multiple runs.
Is it not costly to train the same network multiple times if the dataset is large? I read a paper titled “Snapshot Ensembles: Train 1, Get M for Free” in which the authors propose training a network just once with an LR schedule and checkpointing the minima, later treating the checkpointed models as ensemble candidates.
I see both of them talking in same lines. Is there any comparison done with the results from that paper as well?
I don't think there is an explicit comparison, but I'd expect the snapshot ensembles to generally fall into the same mode.
Thanks for the video. As always great breakdown.
It is interesting why the authors are considering perturbing the weights slightly as a strategy too? Since that does not get you out of the current local minima right? So for ensembles to be effective one would need learners that are as uncorrelated as possible (kinda captured by the independent optima strategy). Maybe I need to read the paper first hand too, to understand the details more :P
Tobias BO
It would be interesting to compare the accuracy of a single network with the parameter size of all ensembles
I think it's a good question. The benefits provided by ensembles apply to both over- and underparameterized models to some extent, though more so for the typically overparameterized case, I would imagine. Would be good to address that question head on experimentally.
Still, the equi-parameter single big model isn't going to outcompete the ensemble, unless we are talking about a severely underparameterized scenario. These benefits are quantitatively different from simply having more parameters.
N part disjunction efficiency neuron clustering?
This is a good case for the wisdom of crowds and democracy
Great video and paper. A lot of insight!
However I was a bit disappointed by the 5% improvement of combining 10+ ensembles relative to the original one when the different solutions were so different (were they?). Makes me think the way ensembles were combined is not optimal.
Dood, you're on FIRE! I've used XGBoost ensembles to compete in Kaggle - there was no way to compete without them in the competitions I'm thinking of. But wasn't dropout supposed to make ensembles redundant? A network with dropout is, in effect, training continuously shifting ensembles?
Ah, they mention dropout - good, at least I fed the algorithm!
Dropout can be seen as a Bayesian method if applied in training and testing. Probably suffers from the same single-mode problem
I think dropout is only for regularizing the NN. It does NOT act as an ensemble. That's because it is still a single set of weights converging to one local minimum. Dropout just helps the convergence. Please correct me if I am wrong.
There are a lot of quasi-informed statements going on around dropout; it making ensembles redundant never made it past the wishful-thinking stage, as far as I can tell from my own experiences.
Thank you for the great explanation. I wonder if work has been done on ensembles for other tasks as well, e.g. segmentation, GANs, etc.?
Will you do a live video of a paper explanation?
Do we have a combination of this and the "Grokking: generalisation beyond overfitting" paper?
Is there any known pattern in the distances between these loss space minima? I’m curious if this could provide a way to adjust learning rate to jump to minima quicker
Great paper and great explanation! Can I ask what software you used for reading and writing on this paper file?
Very cool paper, i kind of want to do that plane plot from 2 solutions with some networks now :)
On Fig. 3, I am not super happy about the measure they used for the disagreement. If your baseline accuracy is 64%, the fact that the solutions disagree on 25% of labels shows that the functions are different, but it does not convince me that they are in different modes. It definitely does not suggest that picture with the first 10% errors vs. last 10% errors. The diversity measure in Fig. 6 seems a bit better to me, but still not ideal. I would be more interested in something like a full confusion matrix for the two solutions, or parts of it. To me, disagreement on examples that both networks get wrong is not interesting; disagreement in cases where at least one of the networks gets it right is a lot more interesting, because those are the cases that at least have the potential to boost the ensemble performance.
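A minimal sketch of the split suggested above, separating disagreements where at least one network is right from disagreements where both are wrong. The function name `disagreement_breakdown` and the toy predictions are made up for illustration and are not from the paper.

```python
def disagreement_breakdown(preds_a, preds_b, labels):
    """Fraction of examples where the two models agree, disagree with at
    least one of them correct, or disagree with both of them wrong."""
    counts = {"agree": 0, "disagree_one_correct": 0, "disagree_both_wrong": 0}
    for a, b, y in zip(preds_a, preds_b, labels):
        if a == b:
            counts["agree"] += 1
        elif a == y or b == y:
            # These are the cases that could actually help an ensemble.
            counts["disagree_one_correct"] += 1
        else:
            counts["disagree_both_wrong"] += 1
    n = len(labels)
    return {name: count / n for name, count in counts.items()}


# Toy 3-class data, purely illustrative.
labels  = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
preds_a = [0, 1, 2, 0, 1, 0, 1, 1, 2, 2]
preds_b = [0, 1, 2, 0, 2, 2, 2, 1, 1, 1]

result = disagreement_breakdown(preds_a, preds_b, labels)
print(result)
```

On the toy data the overall disagreement is 50%, but only 30% of examples are disagreements where one model is right; the remaining 20% are two different wrong answers.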
True, I agree there is a lot of room to improve on the sensibility of these measurements.
I didn't catch, is there a specific name for multi, as opposed to single, maxima ensembles?
I'm not really convinced by the criticism of Bayesian NNs. The paper itself seems to indicate that the issues stem from using a Gaussian distribution, not from the need to approximate some prior, or whatever. It's unclear to me why approximating a prior should restrict us to one local minimum, while it's clear why using a Gaussian would do that. Intuitively, replacing the Gaussian with some multimodal distribution with N modes should perform similarly to an ensemble of N Gaussian NNs; in fact, the latter seems like it would be basically the same thing as a Bayesian NN which used normal mixture distributions instead of Gaussian ones. Though non-Gaussian methods are common in Bayesian machine learning, I can't think of any work that does this in the context of NNs; maybe this work provides adequate motivation to push in that direction.
True, the problem here is that most of the bayesian methods actually use gaussians, because these are the only models that are computable in practice.
Are the predictions of the ensemble models simply averaged? I would try to weight them in a way that maximizes the signal-to-noise ratio (or more likely the logarithm of it). I'm guessing it won't make a big difference with large ensembles, but it might help if you can only afford to train an ensemble of a few different initializations. Also, I'm slightly damp.
Trimmed mean is where it is at; pretty much best of both worlds of a median and mean. Yes it looks simple and hacky; but just like training 10 models from scratch, show me something that works better :).
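For reference, a trimmed-mean combiner fits in a few lines. This is a sketch with made-up probabilities, not what the paper does; `trimmed_mean` and `combine` are hypothetical helper names, and trimming per class is one of several reasonable choices.

```python
def trimmed_mean(values, trim_count=1):
    """Mean after dropping the trim_count smallest and trim_count largest values."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return sum(kept) / len(kept)


def combine(member_probs, trim_count=1):
    """Per-class trimmed mean over ensemble members, then argmax."""
    n_classes = len(member_probs[0])
    scores = [
        trimmed_mean([probs[c] for probs in member_probs], trim_count)
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=scores.__getitem__)
    return label, scores


# Five hypothetical members; the last one is confidently off.
member_probs = [
    [0.60, 0.40],
    [0.55, 0.45],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.00, 1.00],
]

mean_label, _ = combine(member_probs, trim_count=0)     # plain average
trimmed_label, _ = combine(member_probs, trim_count=1)  # drop min and max per class
print(mean_label, trimmed_label)
```

In this toy case the plain average is dragged to class 1 by the single outlier, while the trimmed mean sticks with class 0. Note that trimmed scores no longer sum to one across classes, which is fine for an argmax but matters if calibrated probabilities are needed.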
22:31 That is not strong evidence, since the weight space is very high dimensional, and most pairs of vectors in high dimensions are nearly orthogonal to each other.
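This is easy to check numerically. A minimal sketch, assuming i.i.d. standard normal entries as a stand-in for random weight vectors (real initializations and trained weights are not exactly this); `mean_abs_cosine` is a helper name made up here.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_abs_cosine(dim, trials=20, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(a, b))
    return total / trials


low_dim = mean_abs_cosine(3)
high_dim = mean_abs_cosine(10_000)
print(f"dim=3: {low_dim:.3f}   dim=10000: {high_dim:.4f}")
```

The typical |cosine| shrinks roughly like 1/sqrt(dim), so with millions of parameters two unrelated vectors look orthogonal by default, and near-zero cosine similarity between trained networks cannot by itself show that they found different modes.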
Which app do you use to annotate papers? Thanks!
Hey, so, great content as usual, however I disagree with your interpretation. You are saying that since the parameters of the models are so different, the models must perform differently on different examples. As far as I understand, you take that as evidence that our intuition of easy and hard examples is incorrect.
However, I don't think these two things are related. I think this big discovery that the models' parameters are so different is just a consequence of the strangeness of high dimensions. If you pick any 2 random vectors in a high-dimensional space, they will be almost orthogonal to each other. So any difference in initial conditions will lead to quite different outcomes.
Does the brain also use this trick? - Creating an ensemble of thousands of not-so-deep learners, all of them trying to solve a similar problem, so that it can easily generalize between distant tasks, with a better sampling of the function space? This is similar to Numenta's 'thousand brains theory', isn't it? What do you think?
I was thinking the same thing. Neurons in the brain are only connected to nearby neurons. Thus, if we take clusters of far off neurons, we can say that it acts like an ensemble.
Bayesian... I hate this marchening pseudo-duonomial filtering networks using only the VGE small-memory encoder.
Would you call it TensorFlow libraries it's too low for these 2D refractions?
You sound like a GPT
Like no two humans are the same, no two randomly initialised classifiers are the same. (Generally)
won't it hugely depend on the data?
sure, I guess it always does
25:20 😂😂
reminds me of Kazanova and his stacking -- analyticsweek.com/content/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
Talk about machines taking our jobs XD
Here's a talk by the author: th-cam.com/video/stTzg8iUaXM/w-d-xo.html.. highly recommend along with Yannic's explanation here