Professor, I am at a loss for words at how clearly you explain fairly complex concepts. I understand the bias-variance tradeoff might seem rudimentary to people who only skim the surface of ML, but it is much more complex than that. As someone working in data science at a fairly large company, I find myself with many problems in the models I build... what do I do? Turn to Prof. Weinberger, of course. Absolutely the right balance between capturing the complexity of the topic and keeping it concise. And amazing that he makes it all available for free.
Does your company take ML interns, especially ones who have also learnt ML concepts from Professor Kilian?
This lecture is as good as some episodes of GOT (with a mesmerizing ending)!
hopefully less violent ... :-)
Much better than the whole of GOT season 8. Thank you so much, professor. Whenever I think I'm about to give up on my ML self-study, I find an amazing teacher like you.
are you implying how terrible it will become?
@kilianweinberger698 Savvy
This is absolutely fantastic that you provide this high quality course for free, for everyone. I hope this trend continues and expands in all fields of science. Thank you
This channel definitely deserves more subs, the depth and coverage of this course are the best I have ever seen.
Best lectures in machine learning on the internet
Professor, Your lectures help students from bachelors to PhD. Truly amazing way of explaining things. Thank you for helping out students all over the world.
Currently taking machine learning at Columbia in person and taking machine learning at Cornell online at the same time! The reason I'm doing this is that Professor Weinberger is soooooooo amazing.
The series is seriously impressive. It would be nice if we could have a subsequent lecture series on deep learning and one on core machine learning theory.
I have watched 3 lectures in the playlist so far; all are high-quality content with clear explanations.
+1 for the awesome demo for kernelization!
This is an amazing thing you have given us. It makes me REALLY UNDERSTAND THE "GOOD STUFF" and encourages me to learn more.
I hope I get the chance to meet you someday, professor. Lots of love from India 💕
The part where he talks about moving your stuff to the window is so humorous.
What a brilliant man!
The simulation at the end was very impressive. Thank you for that :)
I thought I had studied these concepts many times, but I watched till the end and learned different ways to look at the same things.
This is all amazingly good stuff. Thanks Professor!
lecture starts at 5:06
Oh yeah, it's good stuff... Prof, it's amazing... Thank you very much... (Thank you)*Kilian...
Kilian is a very large number which tends to infinity...
Thank you for this lecture, I learned a lot! I do have one question though. The course shows how to diagnose an ML model to balance the bias/variance trade-off, but what about noise? How can we tell whether the model's error comes from significant noise in the dataset?
Kernel @34:00
Thank you my man!!! I was looking for this comment.
Kernel 34:00
@Kilian Weinberger At 31:31, what will be the effect of increasing the number of iterations on variance and bias?
More iterations will primarily increase variance (because you specialize more towards the particular data set you are optimizing over) and might decrease bias a little (e.g., if you go further from a fixed initialization).
Why not x1^2 as an added feature, prof? Timestamp 42:05!
Kernel starts at 32:40
How do we calculate the standard deviation when using K-fold cross-validation? I didn't understand that.
Thank you very much!
If you perform K-fold cross-validation, you will receive a validation error for each of the K folds. From these K validation errors, you can compute the average and the standard deviation. The latter is very important, because it tells you how much your val-error varies if you simply make different train/val splits. If this variance is high, your train/val sets are probably too small for this type of data. Also expect them to be bad estimates of the test error (i.e., if you do hyper-parameter search based on val, your choice may be pretty bad).
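For anyone who wants to see this computed concretely, here is a minimal sketch; the Ridge model, toy data, and fold count are stand-ins, not anything prescribed in the lecture:

```python
# Minimal sketch: mean and std of the K validation errors from K-fold CV.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = np.random.randn(200, 5), np.random.randn(200)  # toy placeholder data

val_errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    val_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print(f"mean val error: {np.mean(val_errors):.4f}")
print(f"std across folds: {np.std(val_errors):.4f}")  # high std => splits likely too small
```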
Awesome stuff, thank you!
Hey Kilian,
You said that doing regularization by changing the value of lambda is more expensive because you have to retrain from scratch for every lambda.
Why not set the regularization to a very high number and train the model, then slowly decrease lambda (which means increasing the size of the ball), and for each new lambda, initialize the weight vector to the weight vector found in the last iteration (and not to 0)? This would be about as fast as early stopping.
Btw, amazing lectures. Thanks for the great work.
Aayush Chhabra
Computer Engineering, University of Washington
Yes, those two methods are essentially (roughly) equivalent. If you have a small enough learning rate you will increase the norm of the weight vector a tiny bit with each gradient step - similar to lowering your regularization constant.
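To make the warm-start idea from the question concrete, here is a rough sketch; it uses plain gradient descent on ridge regression with made-up data and step sizes, as an illustration rather than a prescribed recipe:

```python
# Sweep lambda from large to small, reusing the previous solution as the
# starting point (warm start). The norm of w grows as lambda shrinks.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)  # toy data

def grad(w, lam):
    # Gradient of (1/n)||Xw - y||^2 + lam * ||w||^2
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w

w = np.zeros(X.shape[1])                 # start near the heavily regularized solution
for lam in [10.0, 3.0, 1.0, 0.3, 0.1]:   # decreasing lambda = growing ball
    for _ in range(200):                 # few steps suffice thanks to the warm start
        w -= 0.01 * grad(w, lam)
    print(f"lambda={lam:5.2f}  ||w||={np.linalg.norm(w):.3f}")
```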
Hi Kilian,
I have a small confusion: why does adding more data increase the training error? We're talking about the average error on the dataset, right?
How does adding more data affect the average error on the training data?
Yes, we are always talking about the average training error. Adding more training data makes it harder to get a low (average) error. Think about the extreme case: If you only have 1 training sample, it is super easy to get zero training error (just always predict the label of that sample). The moment you get more samples, the fitting becomes a lot trickier.
Another extreme case is if you have completely random data with random labels. Then you can see that doubling your training data would require you to double your capacity to memorize all those samples. Hope this helps.
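A quick illustrative sketch of this effect, using random labels as in the extreme case above (the data and model are made up purely for illustration):

```python
# Average training error of a fixed-capacity model rises as the training set grows.
import numpy as np

rng = np.random.default_rng(0)
for n in [1, 10, 100, 1000]:
    X = rng.standard_normal((n, 2))
    y = rng.standard_normal(n)                 # completely random labels
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # best linear fit
    print(f"n={n:5d}  average training error: {np.mean((X @ w - y) ** 2):.3f}")
```

With n=1 the fit is exact (zero training error); as n grows, the average error climbs toward the label variance.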
Sir, in early stopping, are we just varying the number of iterations? So there is no lambda at all in this algorithm? If there is no lambda, how can we say our model generalizes well? Here we only calculate the weights for the model and minimize the loss function without considering regularization?
Happy Teacher's Day, sir 🎉
Amazing demo
In k-fold cross-validation, how is the test error we're calculating an unbiased estimator of the generalization error E[(h_D(x) - y)^2]? To my understanding, the generalization error should be computed as follows: draw D training points, learn a model h_D, draw a test point (x, y), make a prediction using h_D, and compute the error. Then we repeat this step (drawing another D points from the same distribution -> learning h_D and making a prediction on a newly drawn test point) millions of times and average the errors. However, in k-fold validation, the D training points we have in each of the k iterations are going to be very similar and not independent at all (since there is an overlap in the data used to train the model in each iteration).
You are right that the classifiers (and error estimates) are not independent. But that will only affect the *variance* of the error (because some of the training data is shared, the variance will be lower than if you had taken truly independent samples.) The mean is still unbiased, because each classifier has never seen any of the data it is tested on. Hope this makes sense.
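For the curious, here is a small Monte Carlo sanity check of that claim; the linear model and data distribution are invented for illustration, and note that each fold trains on (k-1)/k of the data, so the fair comparison is against models trained on that many points:

```python
# Compare the mean k-fold error (over many datasets) with the "true"
# generalization error of models trained on (k-1)/k * n points.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def draw(n):
    X = rng.standard_normal((n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)  # noisy linear data
    return X, y

n, k = 100, 5

kfold_means = []                              # k-fold estimate per dataset D
for _ in range(200):
    X, y = draw(n)
    errs = []
    for tr, va in KFold(n_splits=k).split(X):
        h = LinearRegression().fit(X[tr], y[tr])
        errs.append(mean_squared_error(y[va], h.predict(X[va])))
    kfold_means.append(np.mean(errs))

X_test, y_test = draw(100_000)                # large held-out test set
true_errs = []
for _ in range(200):
    X, y = draw(n * (k - 1) // k)             # same training-set size as one fold
    h = LinearRegression().fit(X, y)
    true_errs.append(mean_squared_error(y_test, h.predict(X_test)))

print(f"k-fold estimate: {np.mean(kfold_means):.4f}")
print(f"true error:      {np.mean(true_errs):.4f}")
```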
If I asked as many doubts as these students ask this wonderful professor, my professor would beat the sh*t out of us and get irritated.
Kilian is gold on YouTube.
Thank you, professor.
Is Machine Learning Lecture 22 missing?
No, it's not missing. Here it is:
youtube.com/watch?v=FgTQG2IozlM
I am curious about the questions on this exam.
Again, great lecture! Btw, professor, you kinda look like Dirk Nowitzki.
Haha, thanks! He is about 1 foot taller, though. :-)
Kernels start from: youtube.com/watch?v=a7cofmFgwIk
1:35 oh boy
Kernels 34:10
Coolest guy ever!!
Starts from: youtube.com/watch?v=a7cofmFgwIk
Kernels start from: youtube.com/watch?v=a7cofmFgwIk
What an ending!
Sir, is there any way I can mail you? I have questions for further study. I am looking to do NLP or be an NLP engineer. Do I have to learn these algorithms, or can I learn only the algorithms which deal with text, like RNNs and transformers?
Well, this course covers the underlying principles of machine learning, something all ML algorithms (including RNNs and transformers) are based upon. In general, I would recommend understanding the principles if you intend to use ML algorithms.
Moral of the story: don't trust your cousin! :P