The optimiser's curse

  • Published 20 Jan 2025

Comments • 34

  • @minerscale
    @minerscale months ago +35

    I'd be pretty suspicious if the results of my random tests looked like a bell curve over some random hyperparameters. I'd start to think that probably any hyperparameter would do basically the same thing, down to some natural variability in the score. I guess we can do hypothesis testing to determine whether our results are significant.

    • @probabl_ai
      @probabl_ai months ago +5

      A uniform distribution might also be suspicious. But the main lesson is that you can indeed apply a healthy amount of doubt when looking at these stats. They can give you an optimistic version of reality and it can be very tempting to get fooled by good numbers.
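
To make that doubt concrete, here is a minimal sketch of looking at the whole distribution of scores rather than only the best one. The scores below are synthetic, standing in for the output of a real hyperparameter search.

```python
# Inspect the full distribution of scores over many random configurations
# before trusting the single best one. Synthetic scores are used here purely
# for illustration.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.80, scale=0.02, size=400)  # pretend: 400 random configs

plt.hist(scores, bins=30)
plt.axvline(scores.max(), color="red", label=f"best = {scores.max():.3f}")
plt.xlabel("cross-validated score")
plt.ylabel("count")
plt.legend()
plt.show()
```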

  • @Sunnywastakentoo
    @Sunnywastakentoo months ago +3

    Thought this was about DnD. Came for the dragon, stayed for the interesting science stuff.

    • @probabl_ai
      @probabl_ai months ago

      How so? Is the optimiser's curse a DnD thing?

    • @WindsorMason
      @WindsorMason months ago +1

      @probabl_ai I am not aware of one, but the name certainly sounds like it could be referring to a D&D optimizer. As someone with a lot of science and D&D content showing up in my feed, I honestly half thought it was related too. Haha.

  • @Frdyan
    @Frdyan months ago +14

    I know it's not practical, but this is one of the reasons I get really crosswise with introducing the concept of scoring to any decision makers higher up the pay ladder. It's SOOOOOO easy for these things to become THE measure of a model instead of A measure. Although that problem goes the other way as well: I've seen a professor fit a battery of models and just pick the highest score... which sort of defeats the purpose of the statistician. I don't know how much value is lost in just withholding scores from anyone not trained up in the stats behind them.

    • @probabl_ai
      @probabl_ai months ago +5

      Goodhart's law is indeed something to look out for. When a metric becomes a target, it is not a great metric anymore.

  • @andytroo
    @andytroo months ago +1

    I ran into this recently - ran 400 models with a hyperparameter search, then discarded the top 2 (by validation %) because they were super lucky and failed to do anything special with the holdout test set ... ultimately I settled on the 4th "best" model out of 400; its parameters were "nice" in a particular way.

    • @probabl_ai
      @probabl_ai months ago

      Out of curiosity, did you look at the hyperparameters and did these also show regions of better performance?
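
A rough sketch of the workflow @andytroo describes above: rank configurations by validation score, then sanity-check the top few on a holdout set before committing to one. The dataset, model, and parameter ranges here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_search, X_holdout, y_search, y_holdout = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": randint(2, 12), "n_estimators": randint(50, 300)},
    n_iter=20,
    cv=5,
    random_state=0,
).fit(X_search, y_search)

# Compare validation score vs. holdout score for the top candidates.
top = np.argsort(search.cv_results_["mean_test_score"])[::-1][:5]
for rank, i in enumerate(top, start=1):
    params = search.cv_results_["params"][i]
    model = RandomForestClassifier(random_state=0, **params).fit(X_search, y_search)
    print(rank, params,
          f"validation={search.cv_results_['mean_test_score'][i]:.3f}",
          f"holdout={model.score(X_holdout, y_holdout):.3f}")
```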

  • @seedmole
    @seedmole months ago

    This exposes how data with periodic components can cause resonances in analytical processes that likewise make use of periodic components. Randomness and complexity can appear the same on the surface, but if something appearing random was constructed by combining multiple relatively prime roots (for example), those roots can then stick out in analysis. Using a modulo component like that can be a good way to do that... ultimately this kinda edges into the territory covered by Fourier transforms, in a way. Cool stuff!

  • @timonix2
    @timonix2 months ago +2

    It seems like running a bog-standard factor analysis after the tests would reveal this. It's basically what you are doing in your visualizer, except it can run on thousands of parameters more than you can visualize, and it feels more formal than "hey, this graph looks like it has a correlation".

    • @probabl_ai
      @probabl_ai months ago +4

      Statistical tests can certainly be a good idea, but confirming the results with a visual is a nice insurance policy either way.

    • @joshuaspeckman7075
      @joshuaspeckman7075 months ago +2

      Factor analysis finds linear relationships, which is good, but there are important nonlinear relationships between hyperparameters, especially for complex models and/or datasets (learning rate vs batch size for neural networks is one common example of this).

    • @probabl_ai
      @probabl_ai months ago +1

      @joshuaspeckman7075 Good point!
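
For reference, a minimal sketch of the factor-analysis idea from @timonix2, run on the table of tried configurations and their scores. The data here is random and purely illustrative, and, as @joshuaspeckman7075 notes, a linear method like this can miss nonlinear interactions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# One row per tried configuration: hyperparameters plus the resulting score.
# Random numbers stand in for a real search's cv_results_ here.
rng = np.random.default_rng(0)
results = pd.DataFrame({
    "max_depth": rng.integers(2, 12, size=400),
    "n_estimators": rng.integers(50, 300, size=400),
    "score": rng.normal(0.80, 0.02, size=400),
})

X = StandardScaler().fit_transform(results)
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

# Loadings indicate which hyperparameters move together with the score.
loadings = pd.DataFrame(fa.components_.T, index=results.columns,
                        columns=["factor_1", "factor_2"])
print(loadings)
```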

  • @Mayur7Garg
    @Mayur7Garg months ago +5

    Can this be used to define a sort of reliability measure for the model?
    For a given RF with fixed hyperparameters, calculate the scores for various random states. Then use the standard deviation to depict the spread in score due only to randomness.
    The idea being that for a given model and data, a lower spread means that the model is able to model the data similarly in all instances, irrespective of any randomness in it. If the spread is high, the model might be too sensitive, or the data needs more engineering.

    • @probabl_ai
      @probabl_ai months ago +3

      I would be a bit careful there. By using a random seed you sample from a distribution that is defined by the hyperparameters. Change the hyperparameters and you get another distribution, and I don't know if you can claim anything general about the difference between these two distributions upfront.

    • @Mayur7Garg
      @Mayur7Garg months ago +2

      @probabl_ai As I said, the hyperparameters stay fixed; I am only updating the random seed. The idea is to establish some sort of likelihood for the model's score distribution just due to chance. So instead of saying that the model score is 0.7 for some specific random seed value, I can say something like the model score is 0.7±0.1, where the latter is the std of the scores, or that the model scores over 0.6 for 95% of random seed values.

    • @oreo-sy2rc
      @oreo-sy2rc months ago

      Is a random state similar to an initial starting point? So with a bad random state you end up in a local minimum?
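
A minimal sketch of the idea discussed above: keep the hyperparameters fixed, vary only the seed, and report the spread of scores due to randomness alone. The dataset and hyperparameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

scores = []
for seed in range(30):  # fixed hyperparameters, only the seed changes
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=5).mean())

scores = np.array(scores)
print(f"score = {scores.mean():.3f} ± {scores.std():.3f}")
print(f"5th percentile over seeds: {np.percentile(scores, 5):.3f}")
```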

  • @victorlee6129
    @victorlee6129 months ago

    Would nested cross validation help mitigate the effects of the optimizer's curse? Maybe I'm not understanding the material well - but this also reads like an issue of overfitting to the validation set.

    • @probabl_ai
      @probabl_ai months ago

      It's more that you are battling the fact that luck may work against you.
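
For reference, a minimal sketch of nested cross-validation as @victorlee6129 describes it, where an outer loop scores the entire tuning procedure on data the inner search never saw. The dataset and grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Inner loop: hyperparameter tuning. Outer loop: an unbiased score of that tuning.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [3, 5, 10]},
    cv=5,
)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV score: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```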

  • @LuisCruz-pq1oy
    @LuisCruz-pq1oy months ago

    What do you think about purposefully varying the random seed to verify your model's sensitivity to randomness in formal experiments? I've been discussing this a lot with my colleagues recently, and I have been doing this type of analysis especially with neural network experiments, but some people advised me not to do this, to avoid the temptation of hand-picking the "best" random state... however, some other people have been saying that a random seed is as much a hyperparameter as any other, so it would be fine to hand-pick it...

    • @probabl_ai
      @probabl_ai months ago +1

      You might enjoy this TIL on the topic:
      koaning.io/til/optimal-seeds/
      That said, let's think for a moment about what the `random_seed` is meant for. The reason it exists is to allow for repeatability of the experiment. When we set the seed, we hope to ensure that future runs give us the same result.
      That's a useful mechanism, but notice that it is not meant as a control lever for an algorithm. That's why I find there is something dirty about finding an "optimal" algorithm by changing the seed value. It's meant to control consistency, not model performance.
      There is a middle path though: use randomized search. Adding a random seed there is "free" in the sense that it does not blow up the search space, but it might allow you to measure the effect of randomness in hindsight.
      Does this help?

    • @LuisCruz-pq1oy
      @LuisCruz-pq1oy 24 days ago

      @probabl_ai Makes sense to me! Thanks.
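
A minimal sketch of the "middle path" from the reply above: include the seed in a randomized search so its effect can be inspected afterwards rather than optimised for. The parameter ranges and dataset are assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 12),
    "random_state": randint(0, 10_000),  # sampled along, not tuned deliberately
}

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    random_state=0,
).fit(X, y)

# Afterwards, cv_results_ lets you check whether the seed explains more of the
# score variation than the "real" hyperparameters do.
print(search.cv_results_["param_random_state"][:5])
print(search.cv_results_["mean_test_score"][:5])
```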

  • @johnlashlee1315
    @johnlashlee1315 months ago +3

    @ 4:50 "And this phenomenon actually has a name" I was 100% certain you were going to say null hypothesis significance testing, because that's what it's called.

    • @probabl_ai
      @probabl_ai months ago +3

      It's not a wrong perspective, but the phenomenon being described is the counterintuitive situation where adding more hyperparameters may make the "best performance" statistic less reliable. Hypothesis testing also tends to be a bit tricky in the domain of hyperparameters, mainly around the question of what underlying distribution/test you can assume.
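
A tiny simulation of that phenomenon: the "best" of many noisy score estimates is optimistically biased, and the bias grows with the number of candidates. The true score and noise level are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
true_score, noise = 0.80, 0.02  # assumed true performance and evaluation noise

for n_candidates in (1, 10, 100, 1000):
    # Repeatedly draw n_candidates noisy scores and record the best one.
    best = [rng.normal(true_score, noise, size=n_candidates).max()
            for _ in range(2000)]
    print(f"{n_candidates:>5} candidates -> average 'best' score {np.mean(best):.3f}")
```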

  • @punchster289
    @punchster289 months ago

    Is it possible to get the gradients of the hyperparameters?

    • @probabl_ai
      @probabl_ai months ago +1

      Typically the hyperparameters do not have "gradients". To use PyTorch as an analogy, the weights of a neural network might be differentiable, but the amount of dropout isn't. Not to mention that a lot of algorithms in scikit-learn aren't gradient-based at all.
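
A small illustration of that analogy in PyTorch: the layer weights are tensors that receive gradients, while the dropout rate is just a plain number outside the computation graph.

```python
import torch.nn as nn

layer = nn.Linear(10, 1)
dropout = nn.Dropout(p=0.5)

print(layer.weight.requires_grad)  # True: the weights are differentiable parameters
print(type(dropout.p))             # <class 'float'>: the dropout rate is not
```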

  • @Mr_Hassell
    @Mr_Hassell months ago +1

    Quality video

  • @marksonson260
    @marksonson260 months ago +1

    Or someone might say that you fool yourself, since you look at the trained models and assume that the best model is part of that set, when your optimization problem is most likely non-convex.

    • @probabl_ai
      @probabl_ai months ago

      I've even seen this happen with convex models, actually. Granted, when it happened it was related to the `max_iter` variable being too low, so the model wasn't converging properly. Bit of an edge case, but the devil is in the details.
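
A small sketch of that edge case: a convex model (logistic regression) that is stopped before it converges because `max_iter` is too low. The dataset is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# max_iter=5 typically triggers a ConvergenceWarning: the (convex) problem is
# fine, the solver just was not given enough iterations.
underfit = LogisticRegression(max_iter=5).fit(X, y)
converged = LogisticRegression(max_iter=1000).fit(X, y)

print(underfit.score(X, y), converged.score(X, y))
```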

  • @PepegaOverlord
    @PepegaOverlord months ago

    This is a common problem I think everyone experiences at some point, and understanding the model as well as having metrics that cover a wide variety of edge cases both seem to resolve this quite well. There are also plenty of strategies to circumvent the issue, such as the cross-validation you showcased, but also more "stratified" approaches such as genetic algorithms or particle swarm optimization.
    My issue, however, is how to deal with this when you have a limited amount of compute on hand and wish to obtain a good result without having to spend a lot of time testing until you isolate the good hyperparameters from the more noisy ones? Obviously I don't expect a one-size-fits-all solution, but I'd love to hear what solutions or workarounds people use, especially nowadays when models are getting bigger and bigger.

    • @probabl_ai
      @probabl_ai months ago +2

      There is certainly a balance there, yeah. Not everyone has unlimited compute. The simplest answer is to try and remain pragmatic. Make sure you invest a little bit in visuals and don't trust your numbers blindly. Really think about which metric matters and what problem needs to be solved in reality.

    • @TheJDen
      @TheJDen months ago +1

      For deep models, initializing weights sampled from a relatively small-variance Gaussian distribution has been shown to give faster convergence. Andrej Karpathy doesn't touch on it in his GPT-from-scratch video, but if you go to the GitHub code you can see the change. Also, adding a weight-size penalty to the loss can encourage the model to come up with more general parameters, but this effect can be very delayed (grokking). I have seen several gradient and double-descent methods that basically "pick up the signal" early, though. Remember that for nontrivial tasks and a good architecture this is more icing on the cake.
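
A minimal sketch of the two practices mentioned above, with assumed values: small-variance Gaussian weight initialisation and a weight-size penalty applied through the optimiser's weight decay.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
nn.init.normal_(model.weight, mean=0.0, std=0.02)  # small-variance Gaussian init
nn.init.zeros_(model.bias)

# Weight decay penalises large weights, nudging the model toward more general
# parameters (the effect can show up late in training).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```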

  • @justfoundit
    @justfoundit months ago +1

    It would have hurt if you didn't choose 42, so genuinely Thank You!