I've done only first-semester math at my uni, and I can understand this lecturer almost perfectly. He's definitely not overfitting :). Amazing teacher, truly.
Thanks, Caltech!
Man! What a fantastic professor.
This guy has amazing teaching skills. Thanks for making such high-quality content available.
Very clear, informative lecture! I especially like the part from 54:30 to 58:00, since I have never found such a clear description of the bias-variance trade-off. Thank you very much!
Why did I miss this series until 2021? 😮😮 Professor is explaining things really well. 💙💙
Great lecturer. Very easy and user-friendly for beginners to understand the fundamental concepts of machine learning. Thank you.
The analogy for deterministic noise is enlightening... so 对牛弹琴 ("playing the lute to a cow") can be interpreted as: the complexity of the music is deterministic noise to the cow... lol...
Amazing lecture! WHAT an explanation of deterministic noise!
Hi,
At 42:16: "Stochastic noise increases => overfitting increases; deterministic noise increases => overfitting increases".
At 1:13:16: "[The values of deterministic noise and stochastic noise have] nothing to do with the overfitting aspect".
Can someone explain this to me? Thank you.
I got an answer to my question here: book.caltech.edu/bookforum/showthread.php?t=503
My second quotation is wrong.
The point is that the exact values themselves do not reflect the severity of overfitting. You cannot say that a smaller value of bias leads to less overfitting. A model may get a smaller bias by overfitting to the samples. At the same time, the bias value does not reflect the deterministic noise level. The latter is reflected by the complexity of the target (when it is above the model complexity): the higher the complexity, the more noise, hence more overfitting.
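To make that concrete, here's a minimal numpy sketch (my own toy example, not the course's code) that measures deterministic noise directly: the residual between a 10th-order target and the best 2nd-order approximation to it, with no sampling and no stochastic noise involved. The Legendre-polynomial target and all constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10th-order target built from Legendre polynomials
# (the lecture's experiment uses Legendre targets; the coefficients
# here are just random for illustration).
x = np.linspace(-1, 1, 1000)
f = np.polynomial.legendre.legval(x, rng.standard_normal(11))

# Best 2nd-order fit over the whole domain: no finite sample, no
# stochastic noise. The residual is purely deterministic noise, the
# part of f that a 2nd-order H cannot capture no matter what.
h_star = np.polyval(np.polyfit(x, f, deg=2), x)
det_noise = f - h_star
print("mean squared deterministic noise:", np.mean(det_noise ** 2))
```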
Interesting logo for the course!
"Welcome back" :D
Sort of a brand.
Nice work, sir, and thanks for this video.
I'm much clearer about overfitting now.
I learned something new today: that machines may hallucinate! LOL.
Great lecture and poignant examples.
If stochastic noise is the "normal" noise from a noisy target function, and deterministic noise is the noise produced by the target function being more complex than your hypothesis set, then you still don't have an explanation of why the 10th-order polynomial overfits when trying to fit a noiseless 10th-order polynomial (it has no stochastic noise by definition, and it can't have deterministic noise because the hypothesis set is equal in complexity to the target function). Did I miss something?
He explained it at 26:48. A 10th-order polynomial is supposed to fit a 10th-order target well, but only when there are enough samples. When you don't have enough samples, you can get a good enough E_in, but you get a really bad generalization error because of the VC inequality. Compared to the 2nd-order model, the 10th-order one is then overfitting. (Note that the definition of overfitting is comparative.) In other words, deterministic noise is produced by the mismatch of model/target complexity, and also by the finite number of samples.
To also clarify: the 'noiseless' target function in his example was a 50th-order polynomial, so the overfitting there was due to deterministic noise (16:41). All the 10th-order polynomials he used in his explanation were 'noisy', which is why the learning curves had an expected error > 0 even at large N (26:41).
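For anyone who wants to see this numerically, below is a rough sketch of the experiment (my own reconstruction, not the professor's code): fit 2nd- and 10th-order polynomials to a handful of noisy samples from a 10th-order target and compare E_in with an out-of-sample proxy. The sample size, noise level, and Legendre-coefficient target are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_trial(N=15, sigma=0.5, target_order=10):
    # Noisy 10th-order target: a Legendre series plus stochastic noise.
    coeffs = rng.standard_normal(target_order + 1)
    x_tr = rng.uniform(-1, 1, N)
    y_tr = np.polynomial.legendre.legval(x_tr, coeffs) + sigma * rng.standard_normal(N)

    # Dense grid of noiseless target values as an E_out proxy.
    x_te = np.linspace(-1, 1, 2000)
    y_te = np.polynomial.legendre.legval(x_te, coeffs)

    for deg in (2, 10):
        w = np.polyfit(x_tr, y_tr, deg)   # least-squares polynomial fit
        e_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
        e_out = np.mean((np.polyval(w, x_te) - y_te) ** 2)
        print(f"order {deg:2d}: E_in = {e_in:8.3f}, E_out = {e_out:10.3f}")

run_trial()
```

With N this small, the 10th-order fit typically drives E_in near zero while its E_out blows up relative to the 2nd-order fit, which is exactly the comparative overfitting the lecture describes.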
Sir, you are a god of machine learning. Thank you, sir, for the nice explanations.
The goal of this lecture is stated at 1:06:30. 💥
When I draw learning curves of decision trees or support vector machines, I have so far never seen overfitting in this sense. If I take my decision tree of depth 500, it will still be better on the test set than a depth-200 model, say. I get no added value on the validation set, of course. Can anyone show me a decision tree that overfits?!
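One way to see a decision tree overfit (a hedged sketch using scikit-learn; the dataset, noise rate, and depths are all made up for illustration) is to keep the sample small and flip a fraction of the labels, so a fully grown tree memorizes noise that a shallow tree ignores:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Small, noisy dataset: the true boundary depends on two features,
# and 20% of the labels are flipped at random.
X = rng.standard_normal((300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(300) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (2, 5, None):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train acc = {tree.score(X_tr, y_tr):.2f}, "
          f"test acc = {tree.score(X_te, y_te):.2f}")
```

On data like this, the unrestricted tree typically reaches perfect training accuracy yet loses to the depth-2 tree on the test set; with large, clean datasets the effect can disappear, which may be why your learning curves never show it.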
Brilliant lecture!
Does anyone happen to know what graph-creating tool he is using or a tool with similar capabilities?
Programs such as MATLAB, Mathematica, Maple, and perhaps Octave could get you similar results
Fantastic prof.
I couldn't stop laughing when I heard that the machine was hallucinating.
Early stopping seems to make the unstated assumption that E_out will not come back down.
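To make that assumption explicit, here is a minimal, self-contained early-stopping sketch (a toy example of mine, not anything from the lecture): gradient descent on deliberately overparameterized polynomial features, where we snapshot the best weights and stop once the validation error has failed to improve for a "patience" number of epochs, i.e. once we assume E_out will not come back down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values assumed): noisy linear target, overparameterized
# 10th-order polynomial features, plain gradient descent.
x = rng.uniform(-1, 1, 60)
y = x + 0.3 * rng.standard_normal(60)
Phi = np.vander(x, 11)                    # columns x^10 ... x^0
tr, va = slice(0, 30), slice(30, 60)      # train / validation halves

w = np.zeros(11)
best_val, best_w, patience, since_best = np.inf, w.copy(), 20, 0
for epoch in range(5000):
    grad = Phi[tr].T @ (Phi[tr] @ w - y[tr]) / 30
    w -= 0.05 * grad                      # one gradient-descent step
    e_val = np.mean((Phi[va] @ w - y[va]) ** 2)
    if e_val < best_val:
        best_val, best_w, since_best = e_val, w.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:        # here is the assumption: E_val
            break                         # will not come back down later
w = best_w                                # revert to the best snapshot
print(f"stopped after epoch {epoch}, best validation MSE = {best_val:.4f}")
```

The patience counter only hedges against short-lived bumps in E_val; if the error really would come back down much later, this loop still misses it, which is the unstated assumption above.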
Amazing lesson
It's funny that he has the same facial expression after answering each question in every video: a sort of head shake to signal that he's finished the answer. Great lecture, by the way. I still don't quite understand the notion of deterministic noise, since if we choose a richer model we make this noise smaller, but we end up with a worse result if we have small N. He explained it during the lecture and during the Q&A, but I still don't get why he calls a bias a noise.
I also struggled with the same question, and here is my understanding. First of all, what is bias? Bias can be interpreted as the distance between the best h of a given H (= the best H can do) and the target function, which is also the definition of deterministic noise. Then why call it a noise? According to the example given by the professor, if what you want to learn (the target function) is far beyond your learning ability (the model complexity of H), it will mislead you, like a child trying to learn complex numbers while knowing only the real numbers, which happens to be the definition of noise.
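In the course's notation (as best I recall it; treat the exact symbols as my paraphrase), the two quantities being identified are:

```latex
% h^* : best hypothesis in H;  \bar{g} : average hypothesis over data sets D.
\[
  \text{deterministic noise at } \mathbf{x}:\; f(\mathbf{x}) - h^*(\mathbf{x}),
  \qquad
  \text{bias} = \mathbb{E}_{\mathbf{x}}\!\left[\big(\bar{g}(\mathbf{x}) - f(\mathbf{x})\big)^2\right],
  \quad
  \bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\!\left[g^{(\mathcal{D})}(\mathbf{x})\right].
\]
```

Both expressions measure a gap between H and the target f, which is why the bias acts as a proxy for the energy of the deterministic noise.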
What a great class!