I am pretty convinced that this guy could explain the theory of relativity to a toddler. Pure respect!!
He seems to be a great prof. I love him.
The subscript of x_i implicitly switches from referring to the i-th feature dimension to the i-th data sample around the 7:40 mark, for the discussion on kernels. Just a note to prevent potential confusion.
Studying at the best uni in India and still completely dependent on these lectures. Amazing explanations! Hope you visit IIT Delhi sometime!
Your lectures, for me, are like being a kid in a candy shop. I love them.
Kernel trick on the last data set blew my mind... literally !!
The climax at the end is worth the 50-minute lecture!
I can't even explain how good these explanations are!! Thank you, sir!
Love your lectures! Especially the demos!
How I start my day everyday these days : Welcome everybody! Please put away your laptops.
You should make a t-shirt or something with that line.
Mad respect to this guy. I might consider being his lifelong disciple or some shit. Noicce!!
One more very good lecture from the playlist.
Extra claps for the demos; they are so cool.
One question: around 17:33 you say "this is only true for linear classifiers", but from the induction proof it can also apply to linear regression. Why do we never see linear regression written as w = sum(alpha_i * x_i)? Thank you so much for the great, great lecture!
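(Not from the lecture — just a small numerical sanity check I put together, with a made-up tiny dataset and a small l2 penalty lam: the closed-form ridge regression solution really is X^T alpha, i.e. a linear combination of the inputs, so the same representation does hold for regression.)

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1                 # 5 inputs in 3 dimensions, small l2 penalty
X = rng.normal(size=(n, d))           # row i is the input x_i
y = rng.normal(size=n)

# primal ridge solution: w = (X^T X + lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# dual solution: alpha = (X X^T + lam*I)^{-1} y, then w = sum_i alpha_i * x_i
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
print(np.allclose(w, X.T @ alpha))    # True: w lies in the span of the x_i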
Best machine learning course I have ever watched. Amazing... :)
I really like your course, very interesting.
About the inductive proof around 15:14: I think we should also specify that, because the linear space generated by the inputs is closed, the sequence of w_i converges to a w that lies in that same linear space. Otherwise we are merely saying that w is in the closure of that linear space.
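(If it helps: the span of finitely many x_i is a finite-dimensional subspace of R^d, and finite-dimensional subspaces are automatically closed, so the limit of the w_i does stay inside the span. The induction argument plus that observation covers the limit point as well — at least that is my understanding.)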
I wonder how lucky the grad students are whose adviser is this professor.
Great lecture! But I had a question: in the induction proof at 15:14, since, as you said, we can initialise any way we like, what if I initialise w to a value that cannot be written as a linear combination of the x's? Then in every iteration I add a linear combination of x's, but the total still won't be a linear combination of the x's. Will this not disprove the induction for some initializations?
Yes, good catch. If the x’s do not span the full space (i.e. n < d), this can indeed happen. In practice these scenarios are avoided by adding a little bit of l2-regularization.
@@kilianweinberger698 Thank you for taking the time to reply. When you say "in practice these scenarios are avoided by adding a little bit of l2-regularization", how does l2-regularization make the weight vector a linear combination of the input data when the inputs do not span the space?
@@rahuldeora1120 Do reply
@@rahuldeora1120 It doesn't make the data span the space, but since we are enclosed in a ball, the regularized solution is the best estimate among all the global minima present, which kind of makes it seem as though the data spans the regularized space. This is what I understood; I hope I'm not wrong, professor.
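(The standard argument for why the l2 term pins w into the span — my own sketch, not a quote from the lecture: decompose w = w_par + w_perp, where w_par lies in the span of the x_i and w_perp is orthogonal to every x_i. The loss only sees w^T x_i = w_par^T x_i, while the regularizer splits as ||w||^2 = ||w_par||^2 + ||w_perp||^2. Any nonzero w_perp therefore adds penalty without changing the fit, so the minimizer has w_perp = 0 and w = \sum_i alpha_i x_i for some alpha.)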
Hi Prof. Kilian, around 34:53 in the Q&A you said that we could just set zero weights for the features we don't care about. I was a bit confused about how you could potentially do this, since you only have one alpha_i for the i-th observation. If you set that alpha_i to zero, it would zero out all the features of x_i at once. Am I wrong?
Amazing! Good explanation! Very helpful! Great thanks from China.
Intuition about kernels: a good kernel says two different points are "similar" in the attribute space when their labels are "similar" in the label space.
Great lecture! I just don't understand why we are using linear regression for classification. Can we use a sigmoid instead? It also has the w^T x component, so we could kernelize that as well.
Just because it is simple. It is not ideal, but also not terrible. But yes, typically you would use the logistic loss, which makes more sense for classification.
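(Small follow-up sketch, if I am not mistaken: the sigmoid does kernelize the same way. Plugging w = \sum_j alpha_j x_j into sigma(w^T x) gives sigma(\sum_j alpha_j k(x_j, x)), so kernel logistic regression simply runs gradient descent on the alphas with the logistic loss instead of the squared loss.)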
Though the proof by induction showing that the w vector can be expressed as a linear combination of all input vectors makes sense, another point of view is confusing me. Say I have 2-d training data with only 2 training points, and I map them to three-dimensional space using some feature map. Since I now have only two data points (vectors) in three dimensions, expressing w as a linear combination of these vectors means w is confined to the plane spanned by those two vectors, which seems to reduce/defeat the purpose of mapping to the higher-dimensional space. The same example becomes practical when, say, we have 10k training points and map each of them to a million-dimensional space.
Yes, good point. That’s why e.g. SVM with RBF kernel are often referred to as non-parametric. They become more powerful (the number of parameters and their expressive power increases) as you obtain more training data.
@@kilianweinberger698 Thanks Professor. This brings me to the next question: In my example (2 training points in 3-d space), does this mean there might be a better solution (in terms of lower loss) that is not in the plane spanned by those 2 training points?
I should have taken this course and intro to wine before graduating...
How can you say the gradient is a linear combination of inputs??
Not for every loss - but for many of them. E.g. for the squared loss, take a look at the gradient derivation in the notes. The gradient is a sum of the form \sum_i gamma_i * x_i.
@@kilianweinberger698 I am sorry if it's a silly doubt. I thought that a linear combination means the coefficients of x_i should be constants, independent of x_i. When gamma_i itself depends on x_i, isn't it then a non-linear combination?
No, it is still linear. The gradient being a linear combination of the inputs just means that the gradient always lies in the space spanned by the input vectors. Whether the coefficients are a function of x_i or not doesn't matter in this particular context. Hope this helps. (Btw, you are not alone, a lot of students find that confusing ...)
Thank you for your patient replies. Just one more intriguing question: is x_i a vector, so that I can write x_i as (x_i1, x_i2, ..., x_in)?
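(For anyone else following this thread — my own working-out, not a quote from the notes: x_i is indeed the i-th input vector. Writing the squared loss as loss(w) = \sum_i (w^T x_i - y_i)^2 gives \nabla_w loss = \sum_i 2(w^T x_i - y_i) x_i, so gamma_i = 2(w^T x_i - y_i). Each coefficient depends on w and x_i, but every term is still a scalar times x_i, which is all that "linear combination of the inputs" means here.)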
Infinite dimensions made me remember Dr. Strange!!
Whenever there is a breakthrough in ML, there is always that exp(x) sneaking around somehow (boosting, t-SNE, RBF...).
Good point ... maybe in the future we should start all papers with “Take exp(x) ...”. ;-)
Started with Gauss and his bell shaped distribution, I guess? 😏😏
@Kilian Weinberger the 2 hd in North Korea feel very alone
(Question I ask myself after 11 minutes): In this video a linear *classifier* is used as the example, yet the squared loss is used. The squared loss does not make much sense for classification, or am I wrong?
Well, it is not totally crazy. In practice people still use the squared loss often for classification, just because it is so easy to implement and comes with a closed form solution.
But you are right, if you want top performance a logistic loss makes more sense - simply because if the classifier is very confident about a sample it can give it a very large or very negative inner-product, whereas with a squared loss it is trying to hit the label exactly (e.g. +1 or -1, and a +5 would actually be penalized).
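(Concrete numbers for that last point, just my own arithmetic: with label y = +1 and a confident score w^T x = 5, the squared loss is (5 - 1)^2 = 16, while the logistic loss log(1 + exp(-y * w^T x)) = log(1 + exp(-5)) is about 0.007. The squared loss punishes exactly the confidence that the logistic loss rewards.)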
Are these lectures for the undergraduate or the graduate program?
Both, but the class typically has more undergrads than graduate students.
respect