Why multicollinearity is a problem | Why is multicollinearity bad | What is multicollinearity

  • Published on 5 Oct 2024

Comments • 162

  • @swatikute219
    @swatikute219 3 years ago +10

    If x1 and x2 are strongly correlated, we should check each one's correlation with the target and keep the variable that is more highly correlated with it. We can also check the p-values of the variables.

  • @sanjeevkmr5749
    @sanjeevkmr5749 3 years ago +17

    Thanks a lot for the detailed discussion on this topic. For the question asked in the video (which feature to remove in case of high correlation), I guess that of the two, we have to remove the one that contributes least to (is less correlated with) the target variable. That way, we preserve the feature with the higher contribution.

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +2

      Thanks Sanjeev. True.

    • @babareddy44
      @babareddy44 3 years ago +1

      How do we know which contributes least, help?

    • @arslanshahid3454
      @arslanshahid3454 2 years ago

      @babareddy44 from R², F-value or p-value?

    • @beautyisinmind2163
      @beautyisinmind2163 1 year ago

      @babareddy44 you can use a random forest model to see which features contribute the most
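
A minimal sketch of the approach several commenters suggest in this thread: compare each correlated feature's absolute Pearson correlation with the target and drop the weaker one. The toy data and the helper names `pearson` and `feature_to_drop` are made up for illustration; they are not from the video. Absolute values are used so that a strong negative correlation counts as strong.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def feature_to_drop(features, target):
    """Name of the feature with the weakest |correlation| to the target."""
    strength = {name: abs(pearson(col, target)) for name, col in features.items()}
    return min(strength, key=strength.get)


# Hypothetical toy data: x1 and x2 move together, x1 tracks y more closely.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 5, 9, 9, 13]
y = [3, 5, 7, 9, 11, 13]

print("corr(x1, x2):", round(pearson(x1, x2), 3))
print("drop:", feature_to_drop({"x1": x1, "x2": x2}, y))
```

This only looks at one feature at a time, so it is a rough first check rather than a full answer to "which contributes least".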

  • @koustavdutta1176
    @koustavdutta1176 3 years ago +16

    Firstly, great explanation!! Now coming to your question: we have to check the bivariate strength between the dependent variable and each independent variable. The independent variable with the weakest strength should be removed from the model.

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +4

      Awesome. Thank you. :)

    • @jamiainaga5853
      @jamiainaga5853 2 years ago +1

      What is bivariate?

    • @sowsow5199
      @sowsow5199 2 years ago

      @jamiainaga5853 the two variables that have been found to be highly correlated with each other

    • @kavankomer3048
      @kavankomer3048 1 year ago

      How to find this bi-variate strength?

  • @samruddhideshmukh5928
    @samruddhideshmukh5928 3 years ago +4

    Simple, clear and amazing explanation!!!
    I think we can remove one of the columns by looking at the p-value. If p > 0.05, we fail to reject the null hypothesis that the coefficient of that variable is 0. Hence that variable does not contribute significantly.
    Sir, please do make a video on how to use Ridge/Lasso regression to handle multicollinearity.

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +1

      Thanks Samruddhi,
      Videos you asked for:
      th-cam.com/video/7XvBwQeT9OI/w-d-xo.html
      th-cam.com/video/21TgKhy1GY4/w-d-xo.html

  • @umeshrawat8827
    @umeshrawat8827 11 months ago +2

    To omit either X1 or X2, we can use PCA and remove the variable with low variance.

  • @swatikute219
    @swatikute219 3 years ago +6

    Amazing pace, crisp word selection and good examples, thank you Aman for great videos !!

  • @dariakrupnova6245
    @dariakrupnova6245 2 years ago +2

    Wow, I think I owe you my mark on the Econometrics final, you blew my mind, I had no idea it was so simple. Thank you!

  • @sangeethasaga
    @sangeethasaga 7 months ago

    Never seen someone with such a clear understandable explanation...thank you so much!

  • @Bididudy_
    @Bididudy_ 1 year ago +1

    Thank you for the detailed explanation. I tried to learn this concept from other channels but it was a bit difficult to get. Your way of explaining terms is very simple, which helps to understand the subject. Really glad that I visited your channel.👍

  • @shadow82000
    @shadow82000 3 years ago +8

    If X1, X2 have high correlation, can I choose to drop the X with lower correlation to Y? Based on the correlation matrix

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +2

      Yes Right.

    • @shadow82000
      @shadow82000 3 years ago

      @UnfoldDataScience Thank you kind sir. High quality content as always!

    • @carlmemes9763
      @carlmemes9763 3 years ago

      👍❤️

  • @KastijitBabar
    @KastijitBabar 4 months ago

    The best explanation on the whole of YouTube! Thank you.

  • @arshiyasaba2259
    @arshiyasaba2259 2 years ago +1

    If the value is less than the threshold value (0.5 or 0.7, as the references suggest), then we can remove those values.

  • @datafuturelab_ssb4433
    @datafuturelab_ssb4433 3 years ago +2

    Remove the variable which has low impact on the target variable...
    Sir, I have 2 questions
    1. If there is multicollinearity in a classification problem, how do we handle that?
    2. What is VIF, and how is standardization done?
    3. Can we use a standard scaler in a regression problem?

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +1

      There are three questions; I will cover them in a separate video. Thanks for asking.

  • @jhonatangilromero2311
    @jhonatangilromero2311 1 year ago

    It is evident that a lot of work goes into developing these very informative videos. Thank you!

  • @datafuturelab_ssb4433
    @datafuturelab_ssb4433 3 years ago +2

    Great explanation, sir. Thanks for sharing and making my fundamentals strong.

  • @ChenLiangrui
    @ChenLiangrui 3 months ago

    awesome video! very clear and beginner friendly, no broken train of thought, very problem-focused

  • @csprusty
    @csprusty 2 years ago

    We can create and compare two models based on choosing each of the correlated explanatory variables one at a time and select the model having better R-squared value.

  • @abdulhaseebshah9109
    @abdulhaseebshah9109 1 year ago +1

    Amazing explanation, Aman. I have a question: are VIF and auxiliary regression both used to detect multicollinearity?
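
On how the two relate: the VIF of predictor j is defined as 1 / (1 - R_j²), where R_j² comes from the auxiliary regression of predictor j on the other predictors. The sketch below covers only the simplest case, assuming a model with exactly two predictors, where R_j² equals the squared Pearson correlation between them. The toy data and the helper names are made up for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them.

    With two predictors, the auxiliary regression of one on the other has
    R^2 = r^2, so VIF = 1 / (1 - r^2).
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical toy columns, for illustration only.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 5, 9, 9, 13]

vif = vif_two_predictors(x1, x2)
print(f"r = {pearson(x1, x2):.3f}, VIF = {vif:.2f}")
```

Other comments on this page mention cut-offs of 5 and 10 for VIF; those are rules of thumb, and with more than two predictors the auxiliary regression has to use all the remaining predictors.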

  • @shivamthakur4079
    @shivamthakur4079 3 years ago +1

    Really loved what you said, sir. I can say that you have a great way of explaining concepts. I can blindly follow you, sir.

  • @smegala3815
    @smegala3815 1 year ago +1

    Thank you sir... Best explanation

  • @shahbazkhalilli8593
    @shahbazkhalilli8593 5 months ago +1

    I don't know which one should I take. By the way video is great

  • @faozanindresputra3096
    @faozanindresputra3096 1 year ago +1

    Will multicollinearity also be a problem when we only look at correlations? I mean when the focus is on finding which variables correlate, not on regression, like in PCA.

  • @allaboutstat1103
    @allaboutstat1103 3 years ago +1

    thanks for clear explanation and God bless!

  • @sudhirnanaware1944
    @sudhirnanaware1944 3 years ago +1

    Hi Aman,
    As far as I know, we can use the VIF (Variance Inflation Factor) function, a heatmap, and the Corr() function to remove multicollinearity. Please confirm any other techniques.

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +1

      Yes Sudhir, apart from these, some other regression techniques can be used.

    • @sudhirnanaware1944
      @sudhirnanaware1944 3 years ago

      Thanks Aman. May I know which regression techniques remove multicollinearity? I will definitely learn them and it will be helpful for me.

  • @roshinidhinesh5490
    @roshinidhinesh5490 3 years ago +1

    Such a great explanation sir.. Thanks a lot!

  • @atomicbreath4360
    @atomicbreath4360 3 years ago +1

    Sir, can you give some ideas on how to know which types of ML models are affected by multicollinearity?

  • @ugwukelechi9476
    @ugwukelechi9476 2 years ago

    You are a great teacher! I learnt something new today.

  • @kunalchakraborty3037
    @kunalchakraborty3037 3 years ago

    My questions:
    1. Is multicollinearity a concern for predictive modeling? I mean, is the prediction altered by neglecting this phenomenon or not?
    2. In the case of a GAM, do we have to worry about multicollinearity?
    3. How does collinearity inflate the variance?

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago

      Thanks Kunal for asking. The answer to the first question is that the prediction will not be impacted much; however, the coefficients will be impacted.
      The 2nd and 3rd I will cover in another video.

    • @kunalchakraborty3037
      @kunalchakraborty3037 3 years ago

      @UnfoldDataScience thanks 👍. Really appreciate your videos.
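
On question 3 in this thread (how collinearity inflates the variance), the textbook result for ordinary least squares with an intercept may help. The notation is assumed here and is not from the video: $\sigma^2$ is the error variance, $n$ the number of observations, $s_j^2$ the sample variance of predictor $x_j$, and $R_j^2$ the R-squared from regressing $x_j$ on the other predictors.

```latex
\operatorname{Var}\bigl(\hat{\beta}_j\bigr)
  = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

As $x_j$ becomes better explained by the other predictors, $R_j^2$ approaches 1 and the second factor grows without bound, which is why the coefficient estimates become unstable.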

  • @shanmukhchandrayama8508
    @shanmukhchandrayama8508 3 years ago +1

    Aman, your videos are great. But many of the videos are connected with each other, so can you please make a video that says in which order to follow the playlists to learn machine learning from the basics? It would be really helpful😅

  • @DataScience111
    @DataScience111 3 years ago +1

    best explanation....keep the good work up.

  • @ashulohar8948
    @ashulohar8948 1 year ago

    Please, please make a video on how to select the drivers in linear regression which drive the sales.

  • @nurlanimanov9503
    @nurlanimanov9503 3 years ago +1

    Hello sir! Firstly, thank you for the video!
    I have 2 questions; I will be glad if you answer them:
    1) Can we say that we don't need to be concerned about correlated features in, for example, decision-tree-based models? I mean, do we need this concept only in linear models?
    2) Is it true that we don't need to touch correlated features when we use Lasso or Ridge regression? Will the model handle that by itself in that case?

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +1

      1. This is a problem with regression-based models, where coefficients come into the picture.
      2. You still need to take care.

    • @hemanthkumar42
      @hemanthkumar42 3 years ago

      @UnfoldDataScience going by your first answer, why is multicollinearity not a problem in neural networks? Please make a video about this, sir...

    • @saurabhagrawal9874
      @saurabhagrawal9874 2 years ago +3

      @hemanthkumar42 Note that multicollinearity does not affect the prediction accuracy of linear regression; it only makes the coefficients harder to interpret, and we mostly choose linear regression for its interpretability. When we go to a neural network, we already know it is a kind of black box: we don't want to interpret it, we just want good prediction results. That's why we don't bother about multicollinearity in neural networks.
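
A small sketch of the points made in this thread, under made-up assumptions: two nearly identical, centred predictors, a target built as x1 + x2 plus a small disturbance, and a hypothetical helper `fit_two_predictors` that solves the 2x2 normal equations directly (no intercept, since everything is centred). Setting `ridge` above zero adds the Ridge penalty to the diagonal. None of this is from the video.

```python
def fit_two_predictors(x1, x2, y, ridge=0.0):
    """Solve (X'X + ridge * I) b = X'y for two centred predictors."""
    s11 = sum(a * a for a in x1) + ridge
    s22 = sum(b * b for b in x2) + ridge
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def predict(coefs, a, b):
    """Prediction for one observation with predictor values a and b."""
    return coefs[0] * a + coefs[1] * b


# Hypothetical toy data: x2 is almost a copy of x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.01, -0.99, 0.0, 1.01, 1.99]
noise = [0.1, -0.1, 0.0, -0.1, 0.1]

# Two targets that differ only in the sign of a small disturbance.
y_a = [a + b + e for a, b, e in zip(x1, x2, noise)]
y_b = [a + b - e for a, b, e in zip(x1, x2, noise)]

ols_a = fit_two_predictors(x1, x2, y_a)
ols_b = fit_two_predictors(x1, x2, y_b)
ridge_a = fit_two_predictors(x1, x2, y_a, ridge=1.0)
ridge_b = fit_two_predictors(x1, x2, y_b, ridge=1.0)

print("OLS coefficients:  ", ols_a, ols_b)
print("Ridge coefficients:", ridge_a, ridge_b)
print("OLS predictions at (1.0, 1.01):",
      predict(ols_a, 1.0, 1.01), predict(ols_b, 1.0, 1.01))
```

In this toy setup the plain least-squares coefficients swing wildly between the two targets while the predictions stay close together, and the penalised fit keeps both coefficients stable. It illustrates the thread's claims on one hand-built example and is not a general proof.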

  • @datapointpune6216
    @datapointpune6216 3 years ago +1

    Very Informative aman

  • @sriadityab4794
    @sriadityab4794 3 years ago +1

    Do we need to remove multicollinearity when building a time series model?

  • @YourRandomVariable
    @YourRandomVariable 3 years ago +1

    Hi Aman, What should we do when the constant term p-value is high? Mostly I see that people keep it without worrying about it. Could you please give an explanation for this?

  • @MuhammadImran-o4c
    @MuhammadImran-o4c 3 years ago

    Sir, whatever answer anyone has given you, all of them are correct; you are saying yes to everyone.

  • @bhavanichatrathi7435
    @bhavanichatrathi7435 3 years ago

    Hi Aman, it's a very good explanation... Please do a video on penalised regression like lasso, ridge and elastic net. There is too much mathematics in those; please explain them in a simple way. Thank you

  • @zakiaa7464
    @zakiaa7464 11 months ago

    You are a genius. Thanks

  • @bijaynayak6473
    @bijaynayak6473 2 years ago

    Which one to eliminate? Check the VIF of each feature and set the threshold at > 5.

  • @manavgora
    @manavgora 8 months ago

    great, easily understandable

  • @nivednambiar6845
    @nivednambiar6845 7 months ago

    Hi Aman, hope you are doing well!
    I want to ask one thing: the regression models you mention are the linear models, not the tree-based regression models, am I correct?
    Does multicollinearity affect tree-based models?

  • @hemanthkumar42
    @hemanthkumar42 3 years ago +1

    Is multicollinearity a problem for neural networks?

  • @anmolpardeshi3138
    @anmolpardeshi3138 3 years ago

    regarding the question- which variable to remove out of a set of highly correlated variables? Can this be answered by PCA (principal component analysis)? or will the PCA weight them the same because they are highly correlated?

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago

      Hi Anmol, not in terms of PCA; I asked in general.

  • @KumarHemjeet
    @KumarHemjeet 3 years ago

    Remove the feature that is less correlated with the target.

  • @harshadbobade2200
    @harshadbobade2200 2 years ago

    Simple and to-the-point explanation 🤘

  • @ShubhamSharma-zb9uh
    @ShubhamSharma-zb9uh 3 years ago

    09:11 We have to keep the variable with the larger coefficient value for the analysis.

  • @trushnamayeenanda5431
    @trushnamayeenanda5431 2 years ago

    The independent variable with higher correlation among the similar factors should be removed

  • @squadgang1678
    @squadgang1678 2 years ago

    I will find the correlation between x1 and y, and between x2 and y, individually, see which one is lower, and delete the one with the lower correlation.

  • @muhammadaliabid5793
    @muhammadaliabid5793 3 years ago

    Thank you for the excellent explanation. I have a few questions, please:
    1. I used the polynomial features method in sklearn and it significantly improved the accuracy of my linear regression prediction model, but I found that the newly created features are correlated with the existing features, since I created squares and cubes! I understand from your explanation that this will lead to a multicollinearity problem, so the coefficients do not give the true picture. However, can I use this type of model for predictions?
    2. What threshold correlation value would you suggest for multicollinearity?
    Thanks
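
A short sketch of why squared features end up correlated with the originals, and of one common mitigation that is not mentioned in the video: centring the feature before squaring it. The data (the integers 1 to 10) and all names are made up for illustration, and the correlation drops to zero here only because the toy values are symmetric around their mean.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical feature: the integers 1..10 and their squares.
x = list(range(1, 11))
x_sq = [v ** 2 for v in x]

# Centre first, then square.
mean_x = sum(x) / len(x)
xc = [v - mean_x for v in x]
xc_sq = [v ** 2 for v in xc]

raw = pearson(x, x_sq)
centred = pearson(xc, xc_sq)
print(f"corr(x, x^2) = {raw:.3f}")
print(f"corr(centred x, its square) = {centred:.3f}")
```

With skewed real data, centring usually lowers the correlation rather than removing it.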

  • @prateeksachdeva1611
    @prateeksachdeva1611 2 years ago

    We will drop from the model the feature whose correlation with the dependent variable is lower than the other one's.

  • @MuhammadImran-o4c
    @MuhammadImran-o4c 3 years ago +1

    Thanks sir, I think the unnecessary variable should be removed.

  • @mariapramiladcosta1972
    @mariapramiladcosta1972 3 years ago

    Sir, if there are 3 predictors and one dependent variable, and all three independent variables are highly correlated, which type of regression model can be used? Multiple regression cannot be used, right? Can we use linear regression? Are a tolerance of 0.1 and a VIF of less than 10 not good enough to indicate that there is no multicollinearity?
    For your question, I think the one with the weaker correlation should be removed.

  • @RamanKumar-ss2ro
    @RamanKumar-ss2ro 3 years ago +1

    Great content.

  • @AMVSAGOs
    @AMVSAGOs 3 years ago

    Great Explanation...
    At 7.50 you said "that's why we should not have multicollinearity in regression" . So, Is it okay if we have multicollinearity in classification?? Could you please make it clear..

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +1

      When I say that, I mean the regression family of algorithms, including logistic regression.

    • @AMVSAGOs
      @AMVSAGOs 3 years ago

      @UnfoldDataScience Thank you Aman Sir

  • @rafibasha4145
    @rafibasha4145 2 years ago

    Multicollinearity is a problem in classification as well, right? @3:57

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago +1

      Yes, if it's a linear model like logistic regression.

  • @kar2194
    @kar2194 3 years ago

    Sorry, so does it mean that when there is multicollinearity between, for example, x2 and x3, if I increase x2 then x3 will automatically increase? Great video by the way!

  • @nurlanimanov9503
    @nurlanimanov9503 3 years ago

    Hello sir, after reading the comments I saw the answer to your question. They said we have to remove the one which has the lower correlation coefficient with the target variable according to the correlation matrix. It confused me on one point: can we say that the coefficient in front of each feature that we get after running the regression model indicates the impact of that feature on the target? I mean, can I use these coefficients, instead of the correlation-matrix value with the target variable, when I decide which of two correlated features to remove? Can we say that the coefficients in front of each feature say the same thing as the values in the correlation matrix with the target variable in this context?

  • @salajmondal3437
    @salajmondal3437 6 months ago

    Should I check multicollinearity for a classification problem?

    • @UnfoldDataScience
      @UnfoldDataScience 6 months ago +1

      For logistic regression - yes.

    • @salajmondal3437
      @salajmondal3437 6 months ago

      @UnfoldDataScience Is it necessary to check multicollinearity between categorical features or numerical and categorical features??

  • @jaheerkalanthar816
    @jaheerkalanthar816 2 years ago

    I think we keep whichever variable is more highly correlated with the target variable.

  • @shafeeqaabdussalam6195
    @shafeeqaabdussalam6195 3 years ago +1

    Thank you

  • @sidrahms7458
    @sidrahms7458 3 years ago

    Awesome explanation, I have a question: if I have nominal,ordinal and continuous variables how can I find multicollinearity among them?

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago

      Hi Sidrah, answered.

    • @sidrahms7458
      @sidrahms7458 3 years ago

      I can't find your answer, I understand that we should use vif for continuous variables but what if I need to see correlation among all ordinal, numeric and nominal?

  • @ameerrace2284
    @ameerrace2284 3 years ago

    Great video. Please create a video on the Python implementation of Lasso and Ridge regression.

  • @RAJANKUMAR-mi1ib
    @RAJANKUMAR-mi1ib 3 years ago

    Hi... Thanks for the nice explanation. I have a question: is multicollinearity a problem for linear regression only? If not, how is it a problem for non-linear regression?

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago

      For regression based models like linear/logistic etc

  • @hakimandishmand1068
    @hakimandishmand1068 2 years ago

    Good and perfect

  • @prateeksachdeva1611
    @prateeksachdeva1611 2 years ago

    excellent explanation

  • @omkarlokhande3692
    @omkarlokhande3692 11 months ago

    Sir, what should we do if multicollinearity is affecting a binary classification problem?

    • @UnfoldDataScience
      @UnfoldDataScience 11 months ago

      There are many ways to take care of it. I have discussed them in the classification videos.

  • @sharadpkumar
    @sharadpkumar 2 years ago

    Hi Aman, nice work, keep it up..... I have a doubt: why is the normal distribution so important? Why should our independent variables show a normal distribution for a good model? I am not finding a satisfying answer. Can you please help?

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago +2

      Hi Sharad, in simple language, it's easy for the model to learn the pattern if you give it examples from a wide range of values (that is your normal distribution).
      Take an example:
      Predict the salary of an individual (Y, the target) based on his/her expenses (the X variable).
      Scenario 1 - in your training set you have Y values like 10LPA, 15LPA, 20LPA. Here the model won't be able to learn the pattern for 3LPA guys; maybe there is a difference in the income/expense pattern for junior guys.
      Scenario 2 - you give many values of Y from all over, like 2LPA, 4LPA, 5LPA, 100LPA, all values as if they are normally distributed.
      Here it's easy for the model to learn the pattern, as it sees a range of values, and the resulting model will be more reliable.
      Hope it's clear now.

    • @sharadpkumar
      @sharadpkumar 2 years ago

      @UnfoldDataScience thanks for the clarification. Does a huge dataset always show a normal distribution?

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago +1

      No, not always...it depends on data

  • @beautyisinmind2163
    @beautyisinmind2163 1 year ago

    can we remove highly negatively correlated features also or not? someone reply, please

  • @anirudhchandnani9917
    @anirudhchandnani9917 3 years ago

    Hi Aman,
    Could you please make a detailed video explaining the difference between Gradient Boost, AdaBoost and ExtremeGradientBoosting?
    Why is AdaBoost called adaptive? Is it only because it edits the weights of the misclassified instances? XGBoost and GradientBoost are also adaptive in that way, aren't they?
    Also, why are XGBoost and Gboost more robust to outliers than AdaBoost despite all of them having a term of log in their loss functions?
    Would really appreciate your reply.
    Thanks

  • @rohitnalage6366
    @rohitnalage6366 1 year ago

    Sir, please explain Lasso and Ridge; if you have already made a video, please share the link.

    • @UnfoldDataScience
      @UnfoldDataScience 1 year ago

      th-cam.com/video/7XvBwQeT9OI/w-d-xo.html
      th-cam.com/video/21TgKhy1GY4/w-d-xo.html

  • @sreejadas4417
    @sreejadas4417 2 years ago

    I want to be a data analyst and I would like sequential courses from you. Please guide me.

  • @bezagetnigatu1173
    @bezagetnigatu1173 2 years ago

    Thank you!

  • @sujithreddy1599
    @sujithreddy1599 3 years ago

    It depends on feature importance. The feature with less importance will be dropped.
    Correct me if I am wrong :0

  • @karthikganesh4679
    @karthikganesh4679 3 years ago

    Sir, please do a video on post-pruning of decision trees.

  • @suryadhakal3608
    @suryadhakal3608 3 years ago

    Great.

  • @akhileshgandhe5934
    @akhileshgandhe5934 3 years ago

    Hi Aman, I have 9 categorical and 6 numerical columns and it's a regression problem.
    So I can find the correlation between numerical using correlation heatmap but how to find the relation between categorical..??
    Can I use chi square test..??
    If I use I am getting all 9 categorical are dependent on each other. So what should be my next step..??
    Please guide me.
    Thanks

    • @UnfoldDataScience
      @UnfoldDataScience 3 years ago +1

      Yes, chi square can be used, I have a dedicated video for the same topic.
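
A minimal sketch of the chi-square idea from this thread, assuming the two categorical columns have already been cross-tabulated into a table of counts. It computes the chi-square statistic by hand and adds Cramér's V, a 0-to-1 measure of association strength that the thread does not mention but which is handy when every pair turns out to be "dependent". The tables and function names are made up for illustration, and the code assumes no row or column total is zero.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V: 0 means no association, 1 means perfect association."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k))


# Hypothetical 2x2 tables of counts for two categorical features.
perfectly_linked = [[10, 0], [0, 10]]
unrelated = [[5, 5], [5, 5]]

print(chi_square(perfectly_linked), cramers_v(perfectly_linked))
print(chi_square(unrelated), cramers_v(unrelated))
```

With large samples the chi-square test flags even weak associations as significant, so ranking the pairs by an effect size such as Cramér's V may be more useful for deciding what to drop.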

  • @squadgang1678
    @squadgang1678 2 years ago

    Is Machine learning better than deep learning or deep learning better than machine learning

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago

      Depends on the problem statement, data availability, infra availability, etc. Can't say one is better than the other.

    • @squadgang1678
      @squadgang1678 2 years ago

      @UnfoldDataScience oh ok got it ✌️

  • @sandipansarkar9211
    @sandipansarkar9211 2 years ago

    finished watching

  • @sudheeshe1384
    @sudheeshe1384 3 years ago +1

    You always rock :)

  • @naziakhatoon3058
    @naziakhatoon3058 3 years ago

    The one that is less correlated should be removed.

  • @khoaanh7375
    @khoaanh7375 6 months ago +1

    this shit is pure gold

  • @ahmad3823
    @ahmad3823 6 months ago

    at least two variables!

  • @tesfayesime9434
    @tesfayesime9434 1 year ago

    Neither x1 or x2

  • @ethiodiversity-1184
    @ethiodiversity-1184 2 years ago

    great explanation