Data Science Interview Questions - Multicollinearity in Linear and Logistic Regression

  • Published on Oct 14, 2024
  • Please join as a member of my channel to get additional benefits like Data Science materials, live streaming for members, and many more
    / @krishnaik06
    Please do subscribe to my other channel too
    / @krishnaikhindi
    Connect with me here:
    Twitter: / krishnaik06
    Facebook: / krishnaik06
    Instagram: / krishnaik06

Comments • 72

  • @harshstrum
    @harshstrum 4 years ago +53

    Diff 1: Gradient descent takes all the data points into consideration when updating the weights during backpropagation to minimize the loss function, whereas stochastic gradient descent considers only one data point at a time for the weight update.
    Diff 2: In gradient descent, convergence towards the minima is fast, whereas in stochastic gradient descent convergence is slow.
    Diff 3: Since in gradient descent all data points are loaded and used for the calculation, computation gets slow, whereas stochastic gradient descent is comparatively fast.
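
    A rough sketch of those two update rules in code (hand-rolled linear regression with NumPy; the data and step counts are purely illustrative):

      import numpy as np

      # Toy data: y = 3x + noise
      rng = np.random.default_rng(0)
      x = rng.normal(size=100)
      y = 3 * x + rng.normal(scale=0.1, size=100)

      lr = 0.1

      # Gradient descent: every point contributes to every weight update
      w = 0.0
      for _ in range(100):
          grad = np.mean(2 * (w * x - y) * x)        # full-batch gradient of the MSE loss
          w -= lr * grad

      # Stochastic gradient descent: one randomly picked point per update
      w_sgd = 0.0
      for _ in range(100):
          i = rng.integers(len(y))
          grad_i = 2 * (w_sgd * x[i] - y[i]) * x[i]  # gradient from a single point
          w_sgd -= lr * grad_i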

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago +9

      @Ashwini Verma it's 1 data point at a time only for SGD; a batch is for mini-batch gradient descent. But it's not your fault, because many people use mini-batch gradient descent and stochastic gradient descent interchangeably, and in fact most library implementations of SGD use a mini-batch.
      But the thing is: all data points at a time is GD/vanilla GD/batch GD, 1 data point at a time is SGD, and a batch at a time is mini-batch GD.
      And when the batch size is 1, mini-batch GD is just SGD, and when it is equal to the entire dataset, it becomes vanilla GD.
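
      A small sketch of that batch-size knob (illustrative NumPy, assuming a linear model with an MSE loss):

        import numpy as np

        def run_epoch(X, y, w, lr=0.1, batch_size=1):
            """One pass over the data: batch_size=1 -> SGD, batch_size=len(X) -> vanilla GD."""
            idx = np.random.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                residual = X[batch] @ w - y[batch]
                grad = np.mean(2 * residual[:, None] * X[batch], axis=0)  # MSE gradient on the batch
                w = w - lr * grad
            return w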

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago +5

      Just one thing to add: don't put backpropagation in the definition of GD; it has nothing to do with GD in general. It's just a clever way of doing GD optimally and efficiently in the case of neural nets. When we apply these optimizers to non-neural models like linear regression, there is no notion of backpropagation.
      So, long story short, backprop is not an intrinsic part of GD but just a clever way of computing the gradients efficiently when applied to neural nets.

    • @abhaypratapsingh4225
      @abhaypratapsingh4225 4 years ago +2

      @@GauravSharma-ui4yd Great explanation! You seem to have a strong conceptual understanding. Thanks again.

    • @vishprab
      @vishprab 4 years ago

      @@GauravSharma-ui4yd Hi, aren't the weights in the linear regression model also determined using backpropagation? Isn't it the same idea applied in a neural net, namely updating the parameters after each step of GD? Don't we go back and forth to determine the weights in any regression, for that matter? So backprop is not a wrong term to use per se. Please let me know how these two are different in a conventional sense. Does having an activation function change the definition?
      medium.com/@Aj.Cheng/linear-regression-by-gradient-decent-bb198724eb2c#:~:text=linear%20regression%20formulation%20is%20very,some%20detail%20of%20it%20later.&text=the%20purpose%20of%20backpropagation%20is,side%20the%20one%20update%20step.

    • @DS_AIML
      @DS_AIML 3 years ago

      That's why people prefer mini-batch stochastic gradient descent.

  • @sushilchauhan2586
    @sushilchauhan2586 4 years ago +4

    Thanks Krish! I'm one of those people who like your videos first and then watch them. I asked you about this yesterday and you explained it, thank you.
    So you are saying that after applying regularization there will be no multicollinearity?
    Stochastic gradient descent:
    Stochastic gradient descent is almost similar to gradient descent; the only difference is:
    if I have "n" points in the training data, then it will randomly pick "k" points, where k < n.

  • @ShivShankarDutta1
    @ShivShankarDutta1 4 years ago +10

    GD: Runs all samples in the training set to do a single update of all parameters in a specific iteration.
    SGD: Uses only one (or a subset of) training samples from the training set to update the parameters in a specific iteration.
    GD: If the number of samples/features is large, it takes much more time to update the values.
    SGD: It is faster because there is only one training sample.
    SGD converges faster than GD.

    • @rahuldey6369
      @rahuldey6369 3 years ago

      Yes. SGD converges faster than GD, but unlike GD it fails to find the exact global minimum; instead it oscillates around a value close to the global minimum, which people say is a good approximation to go with.

  • @bharathjc4700
    @bharathjc4700 4 years ago +7

    Multicollinearity may not be a problem every time. The need to fix multicollinearity depends primarily on the reasons below:
    When you care more about how much each individual feature, rather than a group of features, affects the target variable, then removing multicollinearity may be a good option.
    If multicollinearity is not present in the features you are interested in, then multicollinearity may not be a problem.

  • @bharathjc4700
    @bharathjc4700 4 years ago +2

    In batch gradient descent, you compute the gradient over the entire dataset, averaging over a potentially vast amount of information.
    It takes lots of memory to do that. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point).
    In pure SGD, on the other hand, you update your parameters by adding (with a minus sign) the gradient computed on a single instance of the dataset.
    Since it's based on one random data point, it's very noisy and may go off in a direction far from the batch gradient.
    However, the noisiness is exactly what you want in non-convex optimization, because it helps you escape from saddle points or local minima.
    GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large.
    That means GD is preferable for small datasets while SGD is preferable for larger ones.

  • @mahender1440
    @mahender1440 4 years ago +3

    Hi, Krish
    Gradient descent: on a big volume of data it takes a larger number of iterations, and each iteration works with the entire data, so it causes high latency and needs more computing power.
    Solution: batch gradient descent.
    Batch gradient descent: the data is split into multiple batches, and gradient descent is applied to each batch separately; for each batch a separate minimum loss is achieved, and finally the weight matrix of the global minimum loss is taken.
    Problem with batch gradient descent: each batch contains only a few patterns of the entire data, which means other patterns are missed and the model couldn't learn all the patterns from the data.

  • @DionysusEleutherios
    @DionysusEleutherios 4 years ago +2

    If you have a large feature space that contains multicollinearity, you could also try running a PCA and using only the first n components in your model (where n is the number of components that collectively explain at least 80% of the variance), since they are by definition orthogonal to each other.
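
    A minimal sketch of that idea with scikit-learn (the 0.80 variance cutoff matches the comment; the random data here is just a stand-in for your own feature matrix):

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      # X: the original (possibly multicollinear) feature matrix
      X = np.random.default_rng(0).normal(size=(500, 20))

      # Scale first, then keep enough components to explain at least 80% of the variance
      X_scaled = StandardScaler().fit_transform(X)
      pca = PCA(n_components=0.80)               # float in (0, 1) = target explained-variance ratio
      X_pca = pca.fit_transform(X_scaled)

      print(pca.n_components_, pca.explained_variance_ratio_.sum())
      # the resulting components are orthogonal, so there is no multicollinearity among them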

    • @akashsaha3921
      @akashsaha3921 4 years ago +1

      No... PCA reduces dimensionality but doesn't consider the class labels while doing that. In logistic classification, doing PCA means you are dropping features without considering the class labels.
      So in practice, if our aim is just to find the accuracy of the model, then PCA is good, but if you want an interpretation of which features are important for classification in the model, then PCA is not recommended. Moreover, PCA preserves variance by creating new features. In practice, you might generally be asked not to create new features.
      That's why lasso and ridge regression are used: to penalize.

    • @hokapokas
      @hokapokas 4 years ago

      Any response on the VIF approach???

    • @akashsaha3921
      @akashsaha3921 4 years ago

      @@hokapokas Look, L1 and L2 are a must to control over- and under-fitting. But using L1 when features are multicollinear will create more sparsity, so in that case we do L2.
      Next, to detect multicollinearity you can do a perturbation test.
      You can also do forward feature selection to pick the important features.
      At the end, you should try every possible way to make your model better. I prefer to use perturbation to check collinearity and then, based on that, select L1, L2, or elastic net.
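
      A quick sketch of that L1-vs-L2 behaviour on deliberately collinear features (toy data and scikit-learn defaults, purely illustrative):

        import numpy as np
        from sklearn.linear_model import Lasso, Ridge

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=200)
        x2 = x1 + rng.normal(scale=0.01, size=200)   # x2 is almost a copy of x1
        X = np.column_stack([x1, x2])
        y = 2 * x1 + rng.normal(scale=0.1, size=200)

        # L1 tends to zero out one of the twins; L2 spreads the weight across both
        print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)
        print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)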

    • @hokapokas
      @hokapokas 4 years ago

      @@akashsaha3921 I agree on regularisation to put constraints on a model, in terms of feature selection or reducing the magnitudes of coefficients, but I was also suggesting the VIF approach to select and reject features. We should explore this approach as well for multicollinearity.
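
      A minimal sketch of that VIF-based screening (statsmodels' variance_inflation_factor; the cutoff of 5 is the common benchmark mentioned in another comment below, not a hard rule):

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        rng = np.random.default_rng(0)
        X = pd.DataFrame({"a": rng.normal(size=300)})
        X["b"] = X["a"] + rng.normal(scale=0.05, size=300)   # nearly a copy of "a"
        X["c"] = rng.normal(size=300)

        Xc = sm.add_constant(X)                    # VIF is usually computed with an intercept column
        vif = pd.Series(
            [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
            index=X.columns,
        )
        print(vif)
        # features whose VIF exceeds the chosen cutoff (e.g. 5) are candidates to drop and re-check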

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago

      @@hokapokas A lot of people are actually using this approach.

  • @brahimaksasse2577
    @brahimaksasse2577 4 years ago +1

    The GD algorithm uses all the data to update the weights when optimising the loss function in the BP algorithm, whereas SGD uses a sample of the data at each iteration.

  • @K-mk6pc
    @K-mk6pc 2 years ago

    Stochastic gradient descent is a type where the feature values are taken randomly, unlike the other type of gradient descent where the global minimum is found after training on the entire dataset.

  • @sathwickreddymora8767
    @sathwickreddymora8767 4 years ago

    Let's assume we are using an MSE cost function.
    Gradient descent -> It takes all the points into account when computing the derivatives of the cost function w.r.t. each feature, which gives the right direction to move in. It is not productive if we have a large number of data points.
    SGD -> It computes the derivatives of the cost function w.r.t. each feature based on a single data point (or some subset of data points) and moves in that direction, pretending it was the right direction. So it reduces much of the computational complexity.
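
    Written out for a linear model with the MSE loss, the two update rules look like this (a sketch, with eta as the learning rate):

      \text{GD:}\quad w \leftarrow w - \eta \cdot \frac{1}{n}\sum_{i=1}^{n} \nabla_w \,(y_i - w^\top x_i)^2
      \qquad
      \text{SGD:}\quad w \leftarrow w - \eta \cdot \nabla_w \,(y_i - w^\top x_i)^2 \quad \text{for one randomly chosen } i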

  • @sridhar6358
    @sridhar6358 3 years ago

    Lasso and ridge regression: a precondition is that there should not be multicollinearity. If we see a linear relationship between the independent variables, like the one we see between the dependent and independent variables, we call it multicollinearity, which is not the same as correlation.

  • @cutyoopsmoments2800
    @cutyoopsmoments2800 4 years ago +2

    Sir, I am a great fan of yours.

  • @charlottedsouza274
    @charlottedsouza274 4 years ago +4

    Hi Krish... Thanks for such a clear explanation. For large datasets, for regression problems we have ridge and lasso. What about classification problems? How do we deal with multicollinearity in large datasets?

    • @SC-hp5dn
      @SC-hp5dn 2 years ago

      That's what I came here to find.

  • @dragonhead48
    @dragonhead48 4 years ago +2

    @krishNaik you could add the links for the lasso and ridge regularization techniques to this current video. That would be helpful and beneficial for both parties as well, I think.

    • @Trendz-w5d
      @Trendz-w5d 3 years ago +1

      Here it is: th-cam.com/video/9lRv01HDU0s/w-d-xo.html

  • @cutyoopsmoments2800
    @cutyoopsmoments2800 4 years ago +5

    Sir, kindly make all the videos on feature engineering and feature selection which are present in your GitHub link, please.

  • @venkivtz9961
    @venkivtz9961 2 years ago

    PCA is the best in some cases of multicollinearity problems

  • @sarveshmankar7272
    @sarveshmankar7272 4 years ago +2

    Sir, can we use PCA to reduce multicollinearity if we have, suppose, more than 200 columns?

  • @ganeshprabhakaran9316
    @ganeshprabhakaran9316 4 years ago

    That was a clear explanation. Thanks Krish. Small request: can you make a video on feature selection using at least 15-20 variables based on multicollinearity, for better understanding through practice?

  • @charlottedsouza274
    @charlottedsouza274 4 years ago +2

    In addition, can you create a separate playlist for interview questions so that they are all in one place?

  • @rahuldey6369
    @rahuldey6369 3 years ago +1

    @2:26 could you please explain what disadvantage it can cause to model performance? I mean, if I remove correlated features, will my model performance increase or stay the same?

  • @AnotherproblemOn
    @AnotherproblemOn 3 years ago

    You're simply the best, love you

  • @sahubiswajit1996
    @sahubiswajit1996 4 years ago

    Stochastic Gradient Descent (SGD): It means we are sending ONLY ONE DATA POINT (only one row) for the training phase.
    Gradient Descent (GD): It means we are sending ALL DATA POINTS (all rows) for the training phase.
    Mini-Batch SGD: It means we are sending SOME PORTION OF THE DATA POINTS (let us consider batches of 50 data points out of 50K data points; that means 1,000 epochs) for the training phase.
    Sir, is this correct?

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago +2

      The only correct distinction so far. Just one thing: it's not 1,000 epochs, it's 1,000 steps/updates in a single epoch.

    • @sahubiswajit1996
      @sahubiswajit1996 4 years ago

      @@GauravSharma-ui4yd I am not understanding the last sentence. Please explain it to me again.

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago +1

      @@sahubiswajit1996 1 epoch means 1 pass over the entire dataset. So if 50k points are there and you are batching them in groups of 50, then you have 1,000 such batches, and for each batch you calculate the gradients and update the weights, in short one step of GD. You do that 1,000 times until your dataset is exhausted. But after doing this you have made just one pass over the entire data, and hence 1 epoch.
      If you do so for 10 epochs, then there will be 1,000 * 10 steps/updates.
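
      A tiny sketch of that bookkeeping (numbers from the thread; the loop bodies are placeholders):

        n_points, batch_size, n_epochs = 50_000, 50, 10

        steps_per_epoch = n_points // batch_size   # 1,000 weight updates per epoch
        total_steps = steps_per_epoch * n_epochs   # 10,000 updates over 10 epochs

        for epoch in range(n_epochs):              # one epoch = one full pass over the data
            for step in range(steps_per_epoch):    # one step = one mini-batch gradient update
                pass                               # compute the batch gradient, update the weights

        print(steps_per_epoch, total_steps)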

  • @swatisawant7632
    @swatisawant7632 4 years ago

    Very nicely explained!!!

  • @swethakulkarni3563
    @swethakulkarni3563 4 years ago +1

    Can you make a video for Naive Bayes in detail?

  • @gaugogoi
    @gaugogoi 4 years ago +1

    Along with a correlation heatmap and lasso & ridge regression, can VIF be another option to figure out multicollinearity?

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago +1

      Yes

    • @HBG-SHIVANI
      @HBG-SHIVANI 4 years ago

      Set the standard benchmark that if VIF > 5 then multicollinearity exists.

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @priyankabanda7562
    @priyankabanda7562 5 months ago

    Exact question asked in an interview.

  • @haneulkim4902
    @haneulkim4902 3 years ago

    When you are using a small dataset and x1, x2 are highly correlated, which one do you drop?

  • @BhanudaySharma506
    @BhanudaySharma506 3 years ago

    In the case of multicollinearity, why is there no mention of PCA?

  • @Arjun147gtk
    @Arjun147gtk 2 years ago

    Is it recommended to remove highly negatively correlated features?

  • @surajshivakumar5124
    @surajshivakumar5124 3 years ago

    We can just use the variance inflation factor, right?

  • @thunder440v3
    @thunder440v3 4 years ago

    Such a helpful video 🙏☺️

  • @adityay525125
    @adityay525125 4 years ago

    SGD uses a variable learning rate and hence is better, in my opinion. I do not know the answer though, still a noob.

  • @ManishKumar-qs1fm
    @ManishKumar-qs1fm 4 years ago

    Sir, please upload the eigenvalues & eigenvectors video.

  • @Sunilgayakawad
    @Sunilgayakawad 4 years ago

    Why does multicollinearity reduce after standardization?

  • @ManishKumar-qs1fm
    @ManishKumar-qs1fm 4 years ago

    Nice Sir

  • @ashum6612
    @ashum6612 4 years ago

    Multicollinearity can be completely removed from the model. (True/False). Give reasons.

    • @rohankavari8612
      @rohankavari8612 3 years ago

      False. Features with zero multicollinearity are an ideal situation; there will always be some multicollinearity present. Our role is to deal with the variables which have high multicollinearity.

  • @ShubhanshuAnand
    @ShubhanshuAnand 4 years ago

    SGD: Picks k random samples from the n samples in each iteration, whereas GD considers all n samples.

  • @alipaloda9571
    @alipaloda9571 4 years ago

    How about using the Box-Cox technique?

  • @prathmeshbusa2195
    @prathmeshbusa2195 4 years ago

    Hello, I am not able to make the payment of 59 rupees to join your channel. I tried all the possible bank cards but it always fails.

  • @harvinnation3027
    @harvinnation3027 2 years ago

    I didn't know Post Malone is into data analysis!!

  • @jagannadhareddykalagotla624
    @jagannadhareddykalagotla624 4 years ago

    Stochastic gradient descent for the global minimum.
    Gradient descent sometimes shows a local minimum.

    • @GauravSharma-ui4yd
      @GauravSharma-ui4yd 4 years ago

      Nope :(
      If the function is convex then GD always converges to the global minimum, but SGD doesn't ensure that.
      When the problem is non-convex, nothing ensures convergence to the global minimum, but SGD has better chances compared to GD in the case of non-convex functions.

  • @shoraygoel
    @shoraygoel 4 years ago

    Why can't we remove features using correlation when there are many features?

    • @rohankavari8612
      @rohankavari8612 3 years ago

      As there are many variables, we might find many variables that are highly correlated, and removing that many variables will lead to a loss of information.

    • @shoraygoel
      @shoraygoel 3 years ago

      @@rohankavari8612 If they are correlated, then there is no loss of information, right?

  • @yajingli4384
    @yajingli4384 1 year ago

    Ridge & Lasso regression tutorial: th-cam.com/video/9lRv01HDU0s/w-d-xo.html&ab_channel=KrishNaik

  • @ElonTusk889
    @ElonTusk889 4 years ago

    Actually
