Ridge Regression Part 2 | Mathematical Formulation & Code from scratch | Regularized Linear Models

  • Published on Jan 27, 2025

Comments • 114

  • @nihonium_01
    @nihonium_01 1 year ago +55

    43:01
    The reason for using I[0][0] = 0 is that in our matrix W the first term is the intercept, not a slope, and lambda has to multiply only the slopes.
    That's why the first entry becomes zero and lambda is applied only to the slopes (see the sketch after this thread).

    • @rijanpokhrel9281
      @rijanpokhrel9281 1 year ago +5

      Yes, exactly, this is the reason for it... I was about to comment the same but found the answer in your comment.

    • @princeagrawal9565
      @princeagrawal9565 8 months ago +1

      Thank you bro........Good Job.....

    • @anilkumarreddykorivi
      @anilkumarreddykorivi 7 months ago +2

      No bro, the reason is not correct, because in the W matrix all of them are coefficients; I mean they are all slopes, not the intercept, so the reason is not correct, bro.

    • @korivianilkumarreddy3273
      @korivianilkumarreddy3273 7 months ago

      Yes

    • @atharvkazarid2-354
      @atharvkazarid2-354 7 months ago +2

      But we are adding that one column of 1's, right? Before training: [ X_train = np.insert(X_train, 0, 1, axis=1) ]
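
  A minimal NumPy sketch of the idea discussed in this thread (the function name and variables are mine, and it assumes X already carries the leading column of 1's from the np.insert step mentioned above):

      import numpy as np

      def ridge_closed_form(X, y, lam=0.1):
          # identity matrix sized to the (intercept + slopes) weight vector
          I = np.identity(X.shape[1])
          I[0][0] = 0                                 # leave the intercept unpenalized
          W = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y
          return W[0], W[1:]                          # intercept, slopes

  With I[0][0] left at 1, the intercept would be shrunk as well, which is exactly what the replies above are debating.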

  • @arnabroy9782
    @arnabroy9782 3 years ago +57

    Luckily stumbled across your videos few days back. Found them to be better than most on this platform.
    Like the fact that your explanation involves both the intuition and the maths (wherever necessary) behind these algorithms. It gives greater justification for its usage. Many tutorials fail to do that and only rely on intuition.
    Really appreciate the efforts in trying to find out the difference between the custom code and library code and sharing that with us also.
    You've definitely earned a sub!

    • @RamandeepSingh_04
      @RamandeepSingh_04 1 year ago

      Can you please help? I am unable to understand [XW-Y]ᵀ[XW-Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.

  • @abhinavkale4632
    @abhinavkale4632 3 years ago +49

    Don't know why I paid so much to study the same material that is already available on your channel... great work sir... cheers

    • @mr.deep.
      @mr.deep. 2 years ago +1

      true

    • @Noob31219
      @Noob31219 2 years ago +1

      more content than paid course

    • @mohitkushwaha8974
      @mohitkushwaha8974 2 years ago +4

      @@Noob31219 Very true; even paid courses don't go into this much detail.

    • @hritikroshanmishra3630
      @hritikroshanmishra3630 1 year ago

      @@mohitkushwaha8974 Have you taken the course??

    • @mihirthakkare504
      @mihirthakkare504 1 year ago

      Same here, bro. I paid ₹1,50,000 just for a roadmap at some institute 😂😂

  • @animeshsingh4645
    @animeshsingh4645 1 year ago +17

    43:00
    I guess the reason for using I[0][0] = 0 is:
    the bias term should not be heavily regularized, because it represents the baseline value of the target variable when all input features are zero.

  • @satyabratanayak2264
    @satyabratanayak2264 7 months ago +3

    As in the simple regression case, we added lambda*(m^2) and did not include the b term, but in the end b adjusts automatically because b = y.mean() - m*x.mean(), i.e. b is a function of m. Similarly, in the n-D case the intercept is a function of the other n weights, and a change in those weights alters the intercept, so we do not need to apply the penalty to the intercept explicitly.
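
  A small sketch of that 1-D case (the function name is mine; it assumes plain NumPy arrays x and y): the penalty lam enters only the slope, and b then follows from b = y.mean() - m*x.mean().

      import numpy as np

      def simple_ridge(x, y, lam=0.1):
          # ridge slope for a single feature: the penalty appears only in the denominator
          m = np.sum((x - x.mean()) * (y - y.mean())) / (np.sum((x - x.mean()) ** 2) + lam)
          b = y.mean() - m * x.mean()     # intercept follows from m; no explicit penalty on b
          return m, b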

  • @Ishant875
    @Ishant875 1 year ago

    42:58 — b0 is optional in regularisation because it doesn't change the shape of the fitted curve or hyperplane; b0 just shifts it. We use regularisation to make a complex model simpler, so regularising b0 won't help with that.

  • @sudarshansutar1463
    @sudarshansutar1463 1 year ago +18

    If you go through Nitish sir's machine learning and deep learning playlists, you can easily get hired for a data scientist role with a good package.

    • @mihirsrivastava2668
      @mihirsrivastava2668 1 year ago

      Did you get selected?

    • @hritikroshanmishra3630
      @hritikroshanmishra3630 1 year ago

      @@mihirsrivastava2668 What about you??

    • @RamandeepSingh_04
      @RamandeepSingh_04 1 year ago

      Can you please help? I am unable to understand [XW-Y]ᵀ[XW-Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.

    • @shibakarmakar7030
      @shibakarmakar7030 2 months ago

      I think it should be (Y-XW), because e = (Y - Ŷ), not (Ŷ - Y). @RamandeepSingh_04

    • @shibakarmakar7030
      @shibakarmakar7030 2 months ago

      I think it should be (Y-XW), because e = (Y - Ŷ), not (Ŷ - Y).

  • @piyushpathak7311
    @piyushpathak7311 3 years ago +8

    Sir please upload video on DBSCAN and xgboost algorithm plz 🙏 sir your teaching style is awesome sir

  • @krutikashimpi626
    @krutikashimpi626 1 year ago +4

    The statement I[0][0] = 0 is likely setting the regularization strength for the bias term to zero. Regularization is a technique used to prevent overfitting in a model by penalizing large coefficients. However, the comment suggests that the bias term, which represents the baseline value of the target variable when all input features are zero, should not be heavily regularized.
    In simpler terms, the bias term is essential for capturing the inherent value of the target variable when there's no influence from the input features. By setting its regularization strength to zero, the model is allowed to keep this baseline value without being penalized too much, as it's crucial for accurate predictions.

    • @janardhan1853
      @janardhan1853 6 months ago

      Thanks for the comment, I understood.
      By the way, where can I learn ML in this much depth, so I can work through this type of problem myself?

  • @siyays1868
    @siyays1868 2 years ago

    Only on this channel is everything explained so thoroughly. You'll get real conceptual clarity. This channel has that magic, and the magic is Nitish sir. Thank you so much, sir, for working so hard every time.

    • @RamandeepSingh_04
      @RamandeepSingh_04 1 year ago

      Can you please help? I am unable to understand [XW-Y]ᵀ[XW-Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.

  • @FarhanAhmed-xq3zx
    @FarhanAhmed-xq3zx 3 years ago +7

    The reason they are replacing the first value in the I matrix with 0 is that in lambda*(W squared) the first value is w0 (i.e. the intercept), and they don't want to include that first entry of the weight vector, because as per the regularization term lambda is concerned only with the coefficients. This could be the reason, I think, but I'm not sure. Correct me if I'm wrong. Thanks.

  • @sujithsaikalakonda4863
    @sujithsaikalakonda4863 2 years ago +4

    Hi sir, great explanation.
    I have a doubt: at 30:13 the formula for the derivative of xᵀBx is given as 2Bx, but the formula you stated in Multiple Regression was 2Bxᵀ.
    I think we should have taken 2wᵀyᵀx instead of 2wᵀxᵀy.
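
  For the derivative doubt above, a quick numerical check (my own toy example) may help: for a symmetric matrix A such as XᵀX, the gradient of wᵀAw with respect to w is the column vector 2Aw, which is the form used at that point in the video.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(20, 3))
      A = X.T @ X                              # symmetric, like X^T X in the derivation
      w = rng.normal(size=3)

      f = lambda v: v @ A @ v                  # scalar v^T A v
      eps = 1e-6
      numeric = np.array([(f(w + eps * np.eye(3)[i]) - f(w - eps * np.eye(3)[i])) / (2 * eps)
                          for i in range(3)])  # central-difference gradient
      print(np.allclose(numeric, 2 * A @ w))   # True: the gradient matches 2Aw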

  • @piyushkumar0i0
    @piyushkumar0i0 1 year ago +1

    As W0 is the intercept, alpha should act only on W1 to Wm and not on the intercept; hence they kept I[0][0] = 0.

  • @saptarshisanyal6738
    @saptarshisanyal6738 2 years ago +3

    At 8:30 you multiply all the terms by minus (-), but there is an error in the sign of the resulting expression.

    • @ritwikdubey5331
      @ritwikdubey5331 9 months ago

      No, the equation is okay! When you take the minus sign inside the summation, it becomes +m*{x - x(mean)}^2...

  • @sudduswetu8912
    @sudduswetu8912 2 years ago

    In overfitting, bias is low and variance is high... underfitting means high bias and low variance.

  • @HarshKumar-ni5hx
    @HarshKumar-ni5hx 5 months ago

    @25:04 — I think the loss function should be: loss of linear regression + lambda*(W.W' - Wo*Wo), since we only need to add the squares of coef_, not of intercept_.
    (W' is the transpose of W.)
    Someone please help...

  • @MohammadZeeshan-zg6ek
    @MohammadZeeshan-zg6ek 2 years ago

    Because lambda is meant to penalize only the changing slope values, i.e. coef_; the first column is intercept_ (added as required at the start), so they didn't want the identity matrix to act on the intercept_ column, as we can also observe.

  • @varunahlawat9013
    @varunahlawat9013 2 years ago +2

    I do not agree with the idea that if m is too high it leads to overfitting and if m is too low it leads to underfitting. The word overfitting only means something when performance on the training data is really good but performance on the testing data is poor, whereas m will perform poorly on both datasets if it is either too high or too low.
    Please confirm whether that understanding is right or wrong.

    • @email4ady
      @email4ady 10 months ago

      Great point! I was thinking the same... I think Nitish meant this only for the example data points he showed on the whiteboard. In general, a higher m might give an optimal model or a bad one, and a low m might underfit or even overfit; it depends entirely on the data.

  • @univer_se1306
    @univer_se1306 1 year ago +1

    import numpy as np

    class MyRidge:
        def __init__(self, alpha=0.1):
            self.intercept = None
            self.coef = None
            self.alpha = alpha

        def fit(self, X, y):
            num = np.sum(np.dot(y - y.mean(), X - X.mean()))
            denom = np.sum((X - X.mean()) * (X - X.mean())) + self.alpha
            self.intercept = num / denom                      # note: this value is the slope
            self.coef = y.mean() - self.intercept * X.mean()  # and this one is the intercept
            print(self.intercept, self.coef)

        def predict(self, X):
            y_pred = self.intercept * X + self.coef
            return y_pred

    is the above class correct???
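
  One way to check (a sketch, assuming the class above and a single input feature; note the attribute names look swapped — num/denom is really the slope and the second value is really the intercept): compare against scikit-learn's Ridge at the same alpha, which should give essentially the same two numbers on 1-D data.

      import numpy as np
      from sklearn.linear_model import Ridge

      rng = np.random.default_rng(1)
      x = rng.uniform(0, 10, size=50)
      y = 3 * x + 5 + rng.normal(scale=2, size=50)

      MyRidge(alpha=0.1).fit(x, y)              # prints slope, then intercept

      sk = Ridge(alpha=0.1).fit(x.reshape(-1, 1), y)
      print(sk.coef_, sk.intercept_)            # should be close to the values printed above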

  • @shahbazali4141
    @shahbazali4141 5 months ago

    Regularizing the intercept could affect the baseline prediction, as the intercept represents the expected mean value of the target when all features are zero.

  • @ujjalroy1442
    @ujjalroy1442 7 months ago

    Thanks for such a detailed explanation

  • @maheshwaroli653
    @maheshwaroli653 3 years ago +6

    Just curious whether we really need to learn how to derive the model from scratch. I mean, we already have scikit-learn for that, no? These formulations are a little complex! Any comments would be appreciated.

    • @campusx-official
      @campusx-official 3 years ago +4

      No, it's not mandatory.

    • @Vinayworks666
      @Vinayworks666 1 year ago +2

      Well, it's not mandatory, but companies can ask you to code it or explain the mathematics; it happened to me in the Microsoft DS test, so don't skip it.

    • @flakky626
      @flakky626 1 year ago

      @@Vinayworks666 Hello bhaiya, can we please talk? I need guidance.
      If yes, please share a way I can contact you.

    • @vinayrathore560
      @vinayrathore560 1 year ago

      @@flakky626 Check my channel

    • @Vinayworks666
      @Vinayworks666 1 year ago

      @@flakky626 Well, I could help you out, but YouTube is not allowing me to do it.

  • @Prachi_Gupta_zeal
    @Prachi_Gupta_zeal 2 months ago

    Can someone tell me the difference between lambda and C, the hyperparameter for logistic regression? Both are used for regularization.
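
  As far as I know (worth confirming in the scikit-learn docs), they control the same thing in opposite directions: lambda (called alpha in Ridge) multiplies the penalty, while C in LogisticRegression is the inverse of the regularization strength, so a small C behaves like a large lambda. A tiny illustration:

      from sklearn.linear_model import LogisticRegression, Ridge

      strong_logreg = LogisticRegression(C=0.01)   # small C    -> strong regularization
      weak_logreg   = LogisticRegression(C=100.0)  # large C    -> weak regularization

      strong_ridge = Ridge(alpha=100.0)            # large alpha -> strong regularization
      weak_ridge   = Ridge(alpha=0.01)             # small alpha -> weak regularization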

  • @krishnakanthmacherla4431
    @krishnakanthmacherla4431 2 years ago +2

    As per the definition of regularisation, they regularize only the coefficients and not the intercept; keeping a 1 in the first row of the identity matrix (the entry that handles the intercept) would regularise the intercept as well.
    I guess I am right.

  • @ParthivShah
    @ParthivShah 10 months ago +1

    Thank You Sir.

  • @RaushanKumar-y1i9k
    @RaushanKumar-y1i9k 3 months ago

    4:04 — why not lambda*(m^2 + b^2)?
    Can anyone provide a source for verification?

  • @biswajitgorai627
    @biswajitgorai627 1 month ago

    32:54
    n+1 by n+1

  • @RamandeepSingh_04
    @RamandeepSingh_04 1 year ago +1

    Sir, I am unable to understand [XW-Y]ᵀ[XW-Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.

    • @adenmukhtar9804
      @adenmukhtar9804 6 months ago

      In the previous video the coefficient matrix was written as beta, which in this video's derivation is W, so the first thing to keep in mind is that W (this video) = beta (previous video).
      Moreover, y_hat = X*beta in the previous video, so here Y_hat = XW.
      E (error for multiple linear reg) = e transpose * e, and e = (Y - Y_hat) = (Y - XW), so eᵀe = [Y-XW] transpose [Y-XW]; since (-v) transpose (-v) = v transpose v, this is equal to [XW-Y] transpose [XW-Y], which is the form used in the video (see the numeric check after this thread).
      I hope this helps.

    • @AlphaEthic
      @AlphaEthic 1 month ago

      @@adenmukhtar9804 If e = (Y - Y_hat), then the final expression should be [Y-XW] transpose [Y-XW]; how does it become [XW-Y] transpose [XW-Y]?
      I am confused.

    • @hunterking8194
      @hunterking8194 1 month ago

      @@AlphaEthic Yes, I have the same question.
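
  A quick numeric check for the sign question in this thread (toy data of my own): because (-v)ᵀ(-v) = vᵀv, the two orderings give exactly the same scalar loss.

      import numpy as np

      rng = np.random.default_rng(42)
      X = rng.normal(size=(10, 3))
      W = rng.normal(size=3)
      Y = rng.normal(size=10)

      loss1 = (X @ W - Y).T @ (X @ W - Y)    # [XW - Y]^T [XW - Y]
      loss2 = (Y - X @ W).T @ (Y - X @ W)    # [Y - XW]^T [Y - XW]
      print(np.isclose(loss1, loss2))        # True: both forms give the same loss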

  • @DataTalesByMuskan
    @DataTalesByMuskan 2 months ago

    why the identity matrix in ridge regression typically has zero in its main diagonal for the first term ?
    we add a penalty term λ||β||² to the ordinary least squares objective function, where λ is the regularization parameter and β contains the coefficients. The penalty term helps prevent overfitting by shrinking coefficients toward zero.
    However, we usually don't want to penalize (shrink) the intercept term β₀, because:
    The intercept represents the baseline prediction when all features are zero
    Shrinking the intercept doesn't help with reducing model complexity or overfitting
    It could introduce unnecessary bias into the model
    Therefore, in the penalty matrix (often written as λI where I is the identity matrix), we put a 0 in the first diagonal position corresponding to the intercept, while keeping 1s for all other diagonal entries. This effectively excludes β₀ from regularization.
    credit : Claude AI :)

  • @usmanriaz6157
    @usmanriaz6157 6 months ago

    Why do overfit models tend to have large slope values?

  • @balrajprajesh6473
    @balrajprajesh6473 2 years ago

    best teacher ever!

  • @vinayupadhyay1090
    @vinayupadhyay1090 24 days ago

    I[0,0] would regularize the intercept term; that is why we set it to 0.

  • @ganeshreddy1808
    @ganeshreddy1808 1 year ago

    Sir, the main purpose of optimisation techniques like gradient descent is to find appropriate parameters that do not overfit or underfit, right? Then why do we use regularisation on top of that?

    • @campusx-official
      @campusx-official 1 year ago +1

      The parameters you find using gradient descent are the best parameters on the given dataset (which can be considered sample data); how would you know those parameters will be suitable for the whole population data (read: testing data)?

    • @animeshsingh4645
      @animeshsingh4645 1 year ago +1

      Also remember that a big advantage of using gradient descent over the closed form is faster convergence, since computing the inverse increases the time complexity.

    • @animeshsingh4645
      @animeshsingh4645 1 year ago

      And also, closed-form linear regression relies on the loss being convex and can't work on very big data efficiently.

    • @RamandeepSingh_04
      @RamandeepSingh_04 1 year ago

      @@campusx-official Can you please help? I am unable to understand [XW-Y]ᵀ[XW-Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.

  • @shibakarmakar7030
    @shibakarmakar7030 2 months ago

    Sir, I think it should be (Y-XW), because e = (Y - Ŷ), not (Ŷ - Y).

  • @RohitKumar-iw3tt
    @RohitKumar-iw3tt 2 years ago +3

    Reason why the intercept is not regularized:
    the intercept acts as a receiver of the reduction in the coefficients, so regularising both will not improve the model; in other words, you are regularising the curve, not shifting it.

    • @barryallen3051
      @barryallen3051 2 years ago

      I found a similar reason too. When regularizing, we are trying to reduce the variance of our model, not the bias. The first term m0 (or theta0) is the bias term; regularizing it will not reduce the variance, it will only shift the whole curve.

    • @tanmaythaker2905
      @tanmaythaker2905 1 year ago

      Thanks for this!

  • @a1x45h
    @a1x45h 3 years ago

    Am I right in assuming that, instead of changing the individual values of m, you just add one term at the end, and tuning that affects the entire equation?

  • @TheVicky888
    @TheVicky888 3 years ago +3

    Shouldn't the formula for the loss be (Y-XW)ᵀ(Y-XW)?
    I think it got reversed.

  • @hossain9410
    @hossain9410 6 months ago

    After doing this, can I now use logistic regression?

  • @hossain9410
    @hossain9410 6 months ago

    I got an error saying it cannot convert string to numeric... should I encode?

  • @uditjec8587
    @uditjec8587 1 year ago

    @28:37 — the two terms are not the same; one term is the transpose of the other.

  • @anshulsharma7080
    @anshulsharma7080 2 years ago +2

    22:31 —
    (Yᵀ - (X·Beta)ᵀ) · (Y - (X·Beta))
    By the way, in the end it doesn't matter; even though bhaiya took it the other way round, it still works out.

    • @tafiquehossainkhan3740
      @tafiquehossainkhan3740 1 year ago

      I have the same doubt; can you please tell me how it's correct?

    • @rajpurohitpravin9606
      @rajpurohitpravin9606 2 months ago

      Because -1 is taken common from both terms of the equation we got in multiple linear regression.

  • @flakky626
    @flakky626 1 year ago

    Sir, I'm a little confused here: last time, in multiple regression, you took (y - ŷ) (actual - predicted), so our equation was (y - XB),
    but in this video/lecture you took (ŷ - y) (predicted - actual), and the equation here is (XB - y).
    So why this difference?

    • @ankurlohiya
      @ankurlohiya 1 year ago

      Take the minus sign common from both factors and you will get the same answer as above; it doesn't matter. He just took it the other way round by mistake.

    • @flakky626
      @flakky626 1 year ago

      @@ankurlohiya Can we even take a minus sign common out of a transposed bracket?
      Also, thank you so much, brother!!

    • @shreyasmhatre9393
      @shreyasmhatre9393 1 year ago +2

      22:45
      L = Σ ( yi - ŷi ) ²
      In matrix form:
      L = ( y - Xw )ᵀ ( y - Xw )
      L = ( y - Xw )ᵀ ( y - Xw ) + λ || w || ²
      L = ( y - Xw )ᵀ ( y - Xw ) + λ wᵀw
      L = ( yᵀ - wᵀXᵀ )( y - Xw ) + λ wᵀw
      L = yᵀy - wᵀXᵀy - yᵀXw + wᵀXᵀXw + λ wᵀw
      As he said, wᵀXᵀy and yᵀXw are the same (each is a scalar and one is the transpose of the other), so
      L = yᵀy - 2(wᵀXᵀy) + wᵀXᵀXw + λ wᵀw
      which is the same equation he got.

      E.g.:
      (A-B)(C-D) = AC - BC - AD + BD ----- 1
      (B-A)(D-C) = BD - AD - BC + AC, which re-arranges to (AC - BC - AD + BD) ----- 2
      Both equations turn out to be the same, so don't get confused 😵‍💫

    • @hunterking8194
      @hunterking8194 1 month ago

      @@shreyasmhatre9393 thanks brother

  • @rohitdahiya6697
    @rohitdahiya6697 2 years ago

    Why is there no learning-rate hyperparameter in scikit-learn's Ridge/Lasso/ElasticNet? They have a hyperparameter called max_iter, which suggests an iterative solver, yet there is no learning rate among the hyperparameters. If anyone knows, please help me out.
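
  Not an authoritative answer, but as far as I know scikit-learn's Ridge uses direct or specialised solvers (svd, cholesky, lsqr, sag, ...) rather than plain gradient descent, so there is no learning rate to expose; max_iter only caps the iterative solvers. If you specifically want ridge behaviour with a learning rate, SGDRegressor with an L2 penalty is the usual route:

      from sklearn.linear_model import SGDRegressor

      # gradient-descent-style ridge: squared-error loss (the default) + L2 penalty
      model = SGDRegressor(penalty='l2', alpha=0.1,
                           learning_rate='constant', eta0=0.01, max_iter=1000)
      # model.fit(X_train, y_train)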

  • @symonhalder391
    @symonhalder391 2 years ago

    Dear Sir, I am from Bangladesh. We learned (Y - Y_predicted) whole square as the error formula, so I would substitute XW for Y_predicted. Why have you applied XW in place of Y? Kindly advise on this please: (XW - Y) or (Y - XW)?

    • @YogaNarasimhaEpuri
      @YogaNarasimhaEpuri 2 years ago

      This calculation is covered in the Linear Regression (N-dimensional) video.

  • @602rohitkumar8
    @602rohitkumar8 1 year ago

    I think they did not want to change the intercept because the intercept does not represent any feature's weight, so changing the intercept has no effect on overfitting.

  • @arslanahamd7742
    @arslanahamd7742 2 years ago +1

    I don't know why your views are low. Sir, your teaching style is too good.

  • @pramodshaw2997
    @pramodshaw2997 2 years ago

    God bless you sir!!

  • @krishnakanthmacherla4431
    @krishnakanthmacherla4431 2 years ago

    Sir, now that we are adding the regularisation term to the loss function, won't it change the parabolic nature of the earlier function? And won't that affect our solution?

    • @aadarshbhalerao8507
      @aadarshbhalerao8507 2 years ago

      This is called the bias-variance trade-off, i.e. we should only increase the bias (regularization) in our model if the variance (total loss on test data minus total loss on train data) is reduced. Lambda is a tuning factor; as you know, tune it to get the best result, better than plain linear regression.

    • @RamandeepSingh_04
      @RamandeepSingh_04 1 year ago

      @@aadarshbhalerao8507 Can you please help? I am unable to understand [XW-Y]ᵀ[XW-Y], even though I have revised and understood that E (the error for multiple linear regression) = eᵀe.

  • @StartGenAI
    @StartGenAI 2 years ago

    Thank you very much!!!

  • @shreyasmhatre9393
    @shreyasmhatre9393 1 year ago

    Ref 22:45
    L = Σ ( yi - ŷi ) ²
    In matrix form:
    L = ( y - Xw )ᵀ ( y - Xw )
    L = ( y - Xw )ᵀ ( y - Xw ) + λ || w || ²
    L = ( y - Xw )ᵀ ( y - Xw ) + λ wᵀw
    L = ( yᵀ - wᵀXᵀ )( y - Xw ) + λ wᵀw
    L = yᵀy - wᵀXᵀy - yᵀXw + wᵀXᵀXw + λ wᵀw
    As he said, wᵀXᵀy and yᵀXw are the same (each is a scalar and one is the transpose of the other), so
    L = yᵀy - 2(wᵀXᵀy) + wᵀXᵀXw + λ wᵀw
    which is the same equation he got (the closed form that follows from here is sketched after this thread).

    E.g.:
    (A-B)(C-D) = AC - BC - AD + BD ----- 1
    (B-A)(D-C) = BD - AD - BC + AC, which re-arranges to (AC - BC - AD + BD) ----- 2
    Both equations turn out to be the same.

    • @ali75988
      @ali75988 1 year ago

      Thank you so much, man. I also thought something was odd when he wrote Xw in place of ŷ.

    • @RamandeepSingh_04
      @RamandeepSingh_04 1 year ago

      Can we connect on LinkedIn?
      Thank you so much for the explanation.
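
  Completing the algebra above (a short sketch in the same notation; this last step is the standard ridge closed form rather than something quoted from the comments): setting the gradient of L with respect to w to zero gives

      ∂L/∂w = -2Xᵀy + 2XᵀXw + 2λw = 0
      ⟹ (XᵀX + λI) w = Xᵀy
      ⟹ w = (XᵀX + λI)⁻¹ Xᵀy

  with I[0][0] set to 0 when the intercept is left unpenalized, as discussed in the earlier threads.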

  • @princekhunt1
    @princekhunt1 9 months ago

    OMG Explanation

  • @Noob31219
    @Noob31219 2 years ago

    you are great

  • @QuantanalystHA
    @QuantanalystHA 4 months ago

    Two ways of doing each of these:
    linear reg the OLS way
    linear reg the Gradient descent way
    polynomial reg. the OLS way
    poly. reg. the GD way
    ridge reg the OLS way
    ridge reg. the GD way
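
  For the "ridge reg. the GD way" item, here is a rough from-scratch sketch (my own class; it assumes X without the column of 1's and keeps the intercept out of the penalty, as in the earlier discussion):

      import numpy as np

      class GDRidge:
          def __init__(self, alpha=0.1, lr=0.01, epochs=1000):
              self.alpha, self.lr, self.epochs = alpha, lr, epochs

          def fit(self, X, y):
              n, d = X.shape
              self.coef_ = np.zeros(d)
              self.intercept_ = 0.0
              for _ in range(self.epochs):
                  y_hat = X @ self.coef_ + self.intercept_
                  # gradients of (1/n)*||y - y_hat||^2 + alpha*||coef||^2
                  grad_coef = (-2 / n) * X.T @ (y - y_hat) + 2 * self.alpha * self.coef_
                  grad_intercept = (-2 / n) * np.sum(y - y_hat)   # intercept not penalized
                  self.coef_ -= self.lr * grad_coef
                  self.intercept_ -= self.lr * grad_intercept
              return self

          def predict(self, X):
              return X @ self.coef_ + self.intercept_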

  • @beluga8956
    @beluga8956 1 month ago

    my rizz