43:01
The reason for using I[0][0] = 0 is that the first term in our matrix W is the intercept, not a slope, and we have to multiply lambda with the slopes only.
That's why the first term becomes zero and lambda is multiplied only with the slopes.
Yes, exactly, this is the reason for it. I was about to comment the same, but found the answer in your comment.
Thank you bro. Good job.
No bro, the reason is not correct, because in the W matrix all entries are coefficients only. I mean they are all slopes, not the intercept, so the reason is not correct.
Yes
But we are adding that one column of 1's, no? Before training: [ X_train = np.insert(X_train,0,1,axis=1) ]
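A minimal sketch of what the thread above is discussing, assuming the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy with a column of ones prepended to X. The function name `ridge_closed_form` and the synthetic data are made up for illustration; this is not the code from the video.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y with I[0][0] = 0 (intercept unpenalized)."""
    X1 = np.insert(X, 0, 1, axis=1)   # prepend the column of ones, as in the comment above
    I = np.identity(X1.shape[1])
    I[0][0] = 0                       # w[0] is the intercept, so it gets no penalty
    return np.linalg.solve(X1.T @ X1 + lam * I, X1.T @ y)

# Synthetic data (an assumption for this sketch): intercept 4, slopes 2 and -1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 4.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

w_small = ridge_closed_form(X, y, lam=0.0)   # plain least squares
w_large = ridge_closed_form(X, y, lam=1e6)   # very strong penalty on the slopes only
print(w_small)
print(w_large)
```

Because the column of ones is inserted at position 0, w[0] plays the role of the intercept, and zeroing I[0][0] leaves it free. With a very large lambda the slopes shrink toward zero while the intercept moves toward the mean of y.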
Luckily stumbled across your videos a few days back. Found them to be better than most on this platform.
I like the fact that your explanation involves both the intuition and the maths (wherever necessary) behind these algorithms. It gives greater justification for their usage. Many tutorials fail to do that and rely only on intuition.
Really appreciate the effort in trying to find out the difference between the custom code and the library code and sharing that with us as well.
You've definitely earned a sub!
Can you please help? I am unable to understand [XW - Y]ᵀ[XW - Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.
Don't know why I paid so much to study the same material that is already available on your channel. Great work sir, cheers.
true
More content than a paid course
@@Noob31219 Very true, even in paid courses they don't go into that much detail
@@mohitkushwaha8974 Have you taken the course??
Us bro us, I paid 1,50,000 just for a roadmap at some institute 😂😂
43:00
I guess the reason for using I[0][0] = 0 is:
the bias term should not be heavily regularized, because it represents the baseline value of the target variable when all input features are zero.
As in the simple regression case, we added lambda*(m^2) and did not include the b term, but in the end b is adjusted automatically because b = y.mean() - m*x.mean(), i.e. b is a function of m. Similarly, in the N-dimensional case the intercept is a function of the other N weights, and a change in those weights alters the intercept, so we do not need to apply the penalty to the intercept explicitly.
42:58 Including b0 in the regularisation is optional, because b0 doesn't change the shape of the fitted curve or hyperplane; it just shifts it. We use regularisation to make a complex model simpler, so regularising b0 will not help with that.
If you go through Nitish sir's playlists for machine learning and deep learning, you will easily get hired for a data scientist role with a good package.
Did you get selected?
@@mihirsrivastava2668 What about you??
Can you please help? I am unable to understand [XW - Y]ᵀ[XW - Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.
@RamandeepSingh_04 I think it should be (Y - XW), because e = (Y - Ŷ), not (Ŷ - Y).
Sir, please upload videos on the DBSCAN and XGBoost algorithms 🙏 Your teaching style is awesome.
The statement I[0][0] = 0 is likely setting the regularization strength for the bias term to zero. Regularization is a technique used to prevent overfitting in a model by penalizing large coefficients. However, the comment suggests that the bias term, which represents the baseline value of the target variable when all input features are zero, should not be heavily regularized.
In simpler terms, the bias term is essential for capturing the inherent value of the target variable when there's no influence from the input features. By setting its regularization strength to zero, the model is allowed to keep this baseline value without being penalized too much, as it's crucial for accurate predictions.
Thanks for the comment, I understood.
By the way, where can I learn ML in depth, so that I can solve this type of problem?
Only on this channel is everything explained so thoroughly. Your concepts will become clear. This channel has that magic, and the magic is Nitish sir. Thank you so much sir for working so hard every time.
The reason they are replacing the first value in the I matrix with 0 is that in lambda * (W squared) the first value is w0 (i.e. the intercept). They don't want to include that first value of the weights vector, because lambda is concerned only with the coefficients, as per the regularization term. This could be the reason, but I'm not sure. Correct me if I'm wrong. Thanks.
oh...could be right!
Hi sir, great explanation.
I have a doubt: at 30:13 the formula given for the derivative of 'xTBx' is '2Bx', but the formula you gave in the Multiple Regression video is '2BxT'.
I think we have to take '2wTyTx' instead of '2wTxTy'.
As W0 is the intercept, alpha can only work with W1 to Wm, not with the intercept. Hence they kept I[0][0] = 0.
At 8:30 you are multiplying all the terms by minus (-), but there is an error in the sign of the resulting expression.
No, the equation is okay! When you take the minus sign inside the summation, it becomes +m{x - x(mean)}^2...
In overfitting, bias is low and variance is high; underfitting means high bias and low variance.
@ 25:04. I think the loss function should be: loss of linear regression + lambda*(W.W' - W0*W0), as we only need to add the squares of coef_, not of intercept_.
(W' is the transpose of W)
Someone please help...
Because lambda acts on the slope values, i.e. coef_. The first column is the intercept_ (added initially as required), so they didn't want the identity matrix entry in the intercept_ column.
I do not agree with the idea that if m is too high it can lead to overfitting, and if m is too low it can lead to underfitting. The word overfitting only makes sense when performance on the training data is really good but performance on the test data is poor, whereas a model will perform poorly on both datasets if 'm' is either too high or too low.
Please confirm whether that understanding is right or wrong.
Great point! I was thinking the same. I think Nitish meant this only for the example data points he showed on the whiteboard. In general, a higher m might give an optimal model or a bad one, and a low m might underfit or even overfit the data; it should depend completely on the data.
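import numpy as np

class MyRidge:
    def __init__(self, alpha=0.1):
        self.intercept = None
        self.coef = None
        self.alpha = alpha

    def fit(self, X, y):
        # slope: sum((y - y_mean) * (x - x_mean)) / (sum((x - x_mean)^2) + alpha)
        num = np.sum(np.dot(y - y.mean(), X - X.mean()))
        denom = np.sum((X - X.mean()) * (X - X.mean())) + self.alpha
        self.coef = num / denom
        # the intercept follows from the slope and is not penalized
        self.intercept = y.mean() - self.coef * X.mean()
        print(self.coef, self.intercept)

    def predict(self, X):
        # coef is the slope (m), intercept is b
        y_pred = self.coef * X + self.intercept
        return y_pred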
Is the above class correct???
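One way to check the single-feature formulas used in the class above is to compare them with the matrix form of ridge in which the intercept entry of the identity matrix is zeroed. The sketch below assumes 1-D NumPy arrays for x and y; the data and alpha value are invented for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 3.0 * x + 5.0 + rng.normal(scale=0.2, size=30)
alpha = 10.0

# Simple-regression ridge formulas: slope is penalized, intercept is not
m = np.sum((y - y.mean()) * (x - x.mean())) / (np.sum((x - x.mean()) ** 2) + alpha)
b = y.mean() - m * x.mean()

# Matrix form: column of ones plus the feature, with I[0, 0] = 0
X1 = np.column_stack([np.ones_like(x), x])
I = np.identity(2)
I[0, 0] = 0
w = np.linalg.solve(X1.T @ X1 + alpha * I, X1.T @ y)

print(b, m)   # intercept and slope from the scalar formulas
print(w)      # [intercept, slope] from the matrix form
```

Both routes minimize the same objective, sum((y - b - m*x)^2) + alpha * m^2, so the two printed results should agree up to floating-point error.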
regularizing the intercept could affect baseline prediction as the intercept represents the expected mean value of the target when all features are zero
Thanks for such a detailed explanation
Just curious whether we really need to learn how to derive the model from scratch. I mean, we already have scikit-learn for that, no? These formulations are a little complex! Any comments would be appreciated.
No, it's not mandatory.
Well, it's not mandatory, but companies can ask you to code it or explain the mathematics. It happened to me in the Microsoft DS test, so don't miss out.
@@Vinayworks666 Hello bhaiya, can we please talk? I need guidance.
If yes, please provide a way by which I can contact you.
@@flakky626 Check my channel
@@flakky626 Well, I could help you out, but YouTube is not allowing me to do it
can someone tell me the difference between the lambda and c which is a hyperparameter for logistic regression? Both are used for regularization
They are regularizing only the coefficients and not the intercept, as per the definition of regularisation, but we would also be regularising the intercept if we kept the 1 in the first row of the identity matrix, which handles the intercept part.
I guess I am right.
yes
Thank You Sir.
4:04 Why not lambda(m^2 + b^2)?
Can anyone provide a source for verification?
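One way to explore the lambda(m^2 + b^2) question is a small experiment rather than a citation. The sketch below, with a hypothetical helper `fit` and made-up data, shifts every target value by a constant and compares the fitted slope when the intercept is penalized and when it is not.

```python
import numpy as np

def fit(X, y, lam, penalize_intercept):
    """Closed-form ridge; returns [intercept, slope...]."""
    X1 = np.insert(X, 0, 1, axis=1)
    P = np.identity(X1.shape[1])
    if not penalize_intercept:
        P[0, 0] = 0                       # the I[0][0] = 0 trick
    return np.linalg.solve(X1.T @ X1 + lam * P, X1.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(loc=3.0, size=(40, 1))     # feature that is not centered at zero
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=40)
shift = 100.0                             # add a constant to every target value

free_a = fit(X, y, 5.0, penalize_intercept=False)
free_b = fit(X, y + shift, 5.0, penalize_intercept=False)
pen_a = fit(X, y, 5.0, penalize_intercept=True)
pen_b = fit(X, y + shift, 5.0, penalize_intercept=True)

print("unpenalized intercept, slope before/after shift:", free_a[1], free_b[1])
print("penalized intercept,   slope before/after shift:", pen_a[1], pen_b[1])
```

With the intercept left out of the penalty, shifting y only moves the intercept and the slope stays the same. When b is penalized too, the slope changes just because the targets were moved, which is usually not what we want from a regularizer.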
32:54
n+1 by n+1
Sir, I am unable to understand [XW - Y]ᵀ[XW - Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.
In the previous video the coefficients matrix was written as beta which in this video's derivation is W, so first thing to keep in mind is that W(for this video)=beta(in the previous video).
Moreover y_hat = X*beta for the previous video
So for this video Y_hat becomes Y_hat = XW
E(error for multiple linear reg) = e transpose * e
and e = (Y - Y_hat) = (Y - XW). Since (-e) transpose * (-e) = e transpose * e, this can also be written as [XW-Y] transpose [XW-Y], which is the final expression here.
I hope this helps
@@adenmukhtar9804 If e = (Y - Y_hat), the final expression becomes [Y-XW] transpose [Y-XW]. How does it become [XW-Y] transpose [XW-Y]?
I am confused.
@@AlphaEthic Yes, I have the same question.
why the identity matrix in ridge regression typically has zero in its main diagonal for the first term ?
we add a penalty term λ||β||² to the ordinary least squares objective function, where λ is the regularization parameter and β contains the coefficients. The penalty term helps prevent overfitting by shrinking coefficients toward zero.
However, we usually don't want to penalize (shrink) the intercept term β₀, because:
The intercept represents the baseline prediction when all features are zero
Shrinking the intercept doesn't help with reducing model complexity or overfitting
It could introduce unnecessary bias into the model
Therefore, in the penalty matrix (often written as λI where I is the identity matrix), we put a 0 in the first diagonal position corresponding to the intercept, while keeping 1s for all other diagonal entries. This effectively excludes β₀ from regularization.
credit : Claude AI :)
Why do overfit models tend to have large slope values?
best teacher ever!
I[0,0] would regularize the intercept term; that is why we set it to 0.
Sir, the main reason behind optimisation techniques like gradient descent is to find the appropriate parameters that do not overfit or underfit, right? Then why do we use regularisation again?
The parameters that you find using gradient descent are the best parameters on the given dataset (which can be considered sample data). How would you know these parameters will be suitable for all of the population data (read: testing data)?
Also remember that a big advantage of gradient descent over the closed-form solution is speed on large problems, since computing the matrix inverse increases the time complexity.
The closed-form approach is also limited to cases where such a formula exists, and it cannot work efficiently on very big data.
@@campusx-official Can you please help? I am unable to understand [XW - Y]ᵀ[XW - Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.
Sir, I think it should be (Y - XW), because e = (Y - Ŷ), not (Ŷ - Y).
Reason why intercept is not regularized:
Intercept acts as a receiver of reduction in coefficients thus regularisation of both will not improve the model or in other words you are regularising the curve, not shifting it.
I found a similar reason too. When regularizing, we are trying to reduce variance of our model, not bias. The first term m0 (or theta0) is a bias term, regularizing it will not reduce the variance but shift the whole curve.
Thanks for this!
Am I right in assuming that, instead of changing the individual values of m, you just add one m at the end, and tuning that will affect the entire equation?
Shouldn't the formula for the loss be (Y-XW)^T (Y-XW)?
I think it got reversed.
Yes buddy, it's a mistake.
Then how did sir get the right answer 😶
After doing this, can I now use logistic regression??
I got an error saying it cannot convert string to numeric. Should I encode??
@28:37 The two terms are not the same; one term is the transpose of the other.
22:31,
(Y^trans - (X.Beta)^trans) . (Y - (X.Beta))
By the way, in the end it doesn't matter; even though bhaiya has taken it in reverse, it still fits.
I am having the same doubt. Can you please tell me how it is correct?
Because -1 is taken common from both terms of the equation we got in multiple linear regression.
Sir, I am a little bit confused here. Last time, in multiple regression, you took (y - y^) (actual - predicted), so our equation was (y - XB),
but in this video/lecture you took (y^ - y) (predicted - actual), and the equation here is (XB - y).
So why this difference?
Take the minus sign common from both factors and you will get the same answer as above; it doesn't matter. He took it the other way round by mistake.
@@ankurlohiya Can we even take a minus common out of transposed brackets?
Also, thank you so much brother!!
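On the question of pulling a minus sign out of transposed brackets: (-A)ᵀ = -(Aᵀ), so (XW - Y)ᵀ(XW - Y) = (-(Y - XW))ᵀ(-(Y - XW)) = (Y - XW)ᵀ(Y - XW). A quick numerical check on random matrices (the shapes and seed are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
W = rng.normal(size=(3, 1))
Y = rng.normal(size=(6, 1))

e = Y - X @ W                     # actual - predicted
r = X @ W - Y                     # predicted - actual, i.e. -e

loss_a = (e.T @ e).item()         # (Y - XW)^T (Y - XW)
loss_b = (r.T @ r).item()         # (XW - Y)^T (XW - Y)
neg_transpose_ok = np.allclose(r.T, -(e.T))   # (-e)^T equals -(e^T)

print(loss_a, loss_b, neg_transpose_ok)
```

The two losses come out equal, which is why the derivation reaches the same final answer whichever order is used.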
22:45
L = Σ ( yi - ŷi )²
In matrix form:
L = ( y - Xw )ᵀ ( y - Xw )
Adding the ridge penalty:
L = ( y - Xw )ᵀ ( y - Xw ) + λ ||w||²
L = ( y - Xw )ᵀ ( y - Xw ) + λ wᵀw
L = ( yᵀ - wᵀXᵀ )( y - Xw ) + λ wᵀw
L = yᵀy - wᵀXᵀy - yᵀXw + wᵀXᵀXw + λ wᵀw
As he said, wᵀXᵀy and yᵀXw are the same (each is a scalar, and one is the transpose of the other), so
L = yᵀy - 2 wᵀXᵀy + wᵀXᵀXw + λ wᵀw
This is the same equation he got.
For example:
(A-B)(C-D) = AC - BC - AD + BD ----- 1
(B-A)(D-C) = BD - AD - BC + AC, which rearranges to AC - BC - AD + BD ----- 2
Both equations turn out to be the same, so don't get confused 😵💫
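For completeness, here is a sketch of the remaining step from the expanded loss to the closed-form weights, using the standard matrix-calculus identities (the derivative of wᵀAw is 2Aw for symmetric A, and the derivative of wᵀc is c). I is the identity matrix; when X carries a leading column of ones, the code discussed in this thread sets I[0][0] = 0 so the intercept is left out of the penalty.

```latex
\begin{aligned}
L(w) &= y^{\top}y - 2\,w^{\top}X^{\top}y + w^{\top}X^{\top}Xw + \lambda\,w^{\top}w \\
\frac{\partial L}{\partial w} &= -2\,X^{\top}y + 2\,X^{\top}Xw + 2\lambda\,w = 0 \\
(X^{\top}X + \lambda I)\,w &= X^{\top}y \\
w &= (X^{\top}X + \lambda I)^{-1}X^{\top}y
\end{aligned}
```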
@@shreyasmhatre9393 thanks brother
Why is there no learning rate hyperparameter in scikit-learn's Ridge/Lasso/ElasticNet? They have a hyperparameter called max_iter, which suggests they use gradient descent, but there is still no learning rate among the hyperparameters. If anyone knows, please help me out.
Dear Sir, I am from Bangladesh. We have learned the error formula as (Y - Y_predict) whole squared. So I need to put XW in place of Y_predict. Why did you apply XW in place of Y? Kindly advise: (XW - Y) or (Y - XW)?
This calculation is taught in the Linear Regression (N-Dimensional) video.
I think they did not want to change the intercept because the intercept does not represent the weight of any feature, so changing the intercept will have no effect on overfitting.
I don't know why your views are low. Sir, your teaching style is too good.
God bless you sir!!
Sir, now that we are adding the regularisation term to the loss function, won't it change the parabolic nature of the earlier function?? And won't it affect our solution??
This is called the bias-variance trade-off, i.e. we should only increase the bias (regularization) in our model if the variance (total loss on test data - total loss on train data) is reduced. Lambda is a tuning factor, as you know; tune it to get the best result, better than normal linear regression.
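On the "parabolic nature" part of the question: the Hessian of the ridge loss is 2(XᵀX + λI), so adding λwᵀw keeps the loss a bowl and, if anything, makes it more strictly curved. A small numerical sketch on invented data with deliberately collinear columns:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
X[:, 2] = X[:, 0] + X[:, 1]       # collinear columns, so X^T X is singular
lam = 0.5

# Eigenvalues of the Hessian without and with the ridge term
eig_ols = np.linalg.eigvalsh(2 * X.T @ X)
eig_ridge = np.linalg.eigvalsh(2 * (X.T @ X + lam * np.identity(3)))

print(eig_ols)     # smallest eigenvalue is ~0: a flat direction in the loss
print(eig_ridge)   # every eigenvalue is raised by 2 * lam
```

Every eigenvalue goes up by 2λ, so the flat direction disappears and the minimum becomes unique. The solution does change (the slopes shrink), which is the intended effect of the penalty.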
@@aadarshbhalerao8507 Can you please help? I am unable to understand [XW - Y]ᵀ[XW - Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.
Thank you very much!!!
Ref 22:45
L = Σ ( yi - ŷi )²
In matrix form:
L = ( y - Xw )ᵀ ( y - Xw )
Adding the ridge penalty:
L = ( y - Xw )ᵀ ( y - Xw ) + λ ||w||²
L = ( y - Xw )ᵀ ( y - Xw ) + λ wᵀw
L = ( yᵀ - wᵀXᵀ )( y - Xw ) + λ wᵀw
L = yᵀy - wᵀXᵀy - yᵀXw + wᵀXᵀXw + λ wᵀw
As he said, wᵀXᵀy and yᵀXw are the same (each is a scalar, and one is the transpose of the other), so
L = yᵀy - 2 wᵀXᵀy + wᵀXᵀXw + λ wᵀw
This is the same equation he got.
For example:
(A-B)(C-D) = AC - BC - AD + BD ----- 1
(B-A)(D-C) = BD - AD - BC + AC, which rearranges to AC - BC - AD + BD ----- 2
Both equations turn out to be the same.
Thank you so much man. Even I thought something was odd, as he replaced y with wx instead of y hat.
Can we connect on LinkedIn?
Thank you so much for the explanation.
OMG Explanation
you are great
two ways for doing these
linear reg the OLS way
linear reg the Gradient descent way
polynomial reg. the OLS way
poly. reg. the GD way
ridge reg the OLS way
ridge reg. the GD way
my rizz