Why multicollinearity is a problem | Why is multicollinearity bad | What is multicollinearity

  • Published: Jan 8, 2025

Comments • 165

  • @swatikute219
    @swatikute219 3 years ago +13

    If x1 and x2 are strongly correlated, then we should check their individual correlation with the target and select the variable that is more highly correlated with the target; we can also check the p-values for the variables.

  • @sanjeevkmr5749
    @sanjeevkmr5749 3 years ago +18

    Thanks a lot for the detailed discussion on this topic. For the question asked in the video (which feature should be removed in case of high correlation), I guess that among the two, we have to remove the one that contributes least (is less correlated) to the target variable. That way, we preserve the feature with the higher contribution.
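    A minimal sketch of this heuristic (assuming a pandas DataFrame df holding the correlated features x1, x2 and the target y; all names are illustrative):

        import pandas as pd

        # absolute correlation of each correlated feature with the target
        corr_with_target = df[["x1", "x2"]].corrwith(df["y"]).abs()

        # drop the feature that contributes least (lower correlation with y)
        weaker = corr_with_target.idxmin()
        df = df.drop(columns=[weaker])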

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +2

      Thanks Sanjeev. True.

    • @babareddy44
      @babareddy44 3 years ago +1

      How do we know which contributes least, help?

    • @arslanshahid3454
      @arslanshahid3454 2 years ago

      @@babareddy44 from R², F-value, or p-value?

    • @beautyisinmind2163
      @beautyisinmind2163 2 years ago

      @@babareddy44 you can use a random forest model to see the importance of the features that contribute the most
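      A quick sketch of that check (hypothetical names; X is a feature DataFrame, y the target):

          from sklearn.ensemble import RandomForestRegressor
          import pandas as pd

          rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

          # impurity-based importances; higher means the feature contributes more
          importances = pd.Series(rf.feature_importances_, index=X.columns)
          print(importances.sort_values(ascending=False))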

  • @koustavdutta1176
    @koustavdutta1176 3 years ago +16

    Firstly, great explanation!! Now coming to your question: we have to check the bi-variate strength between the dependent variable and each independent variable. The independent variable with the weakest strength should be chosen for removal from the model.

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +4

      Awesome. Thank you. :)

    • @jamiainaga5853
      @jamiainaga5853 3 years ago +1

      what is bi-variate?

    • @sowsow5199
      @sowsow5199 3 years ago

      @@jamiainaga5853 the two variables that have been found to be highly correlated with each other

    • @kavankomer3048
      @kavankomer3048 1 year ago

      How to find this bi-variate strength?

  • @sangeethasaga
    @sangeethasaga 10 months ago

    Never seen someone with such a clear, understandable explanation... thank you so much!

  • @swatikute219
    @swatikute219 3 years ago +6

    Amazing pace, crisp word selection, and good examples. Thank you, Aman, for the great videos!!

  • @dariakrupnova6245
    @dariakrupnova6245 3 years ago +2

    Wow, I think I owe you my mark on the Econometrics final, you blew my mind, I had no idea it was so simple. Thank you!

  • @Bididudy_
    @Bididudy_ 1 year ago +1

    Thank you for the detailed explanation. I tried to learn this concept from other channels, but it was a bit difficult to get. Your way of explaining terms is very simple, which helps in understanding the subject. Really glad that I visited your channel.👍

  • @samruddhideshmukh5928
    @samruddhideshmukh5928 3 years ago +4

    Simple, Clear and Amazing explanation!!!
    I think we can remove one of the columns by looking at the p-value. If p > 0.05, then we fail to reject the null hypothesis for that variable, so its coefficient is effectively 0 and that variable will not contribute significantly.
    Sir, please do make a video on how to use Ridge/Lasso regression to handle multicollinearity.
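    A minimal sketch of the p-value check (assuming statsmodels and a DataFrame df with features x1, x2 and target y; names are illustrative):

        import statsmodels.api as sm

        X = sm.add_constant(df[["x1", "x2"]])  # add the intercept term
        model = sm.OLS(df["y"], X).fit()

        # p > 0.05 -> fail to reject H0 that the coefficient is 0
        print(model.pvalues)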

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +1

      Thanks Samruddhi,
      Videos you asked for:
      ruclips.net/video/7XvBwQeT9OI/видео.html
      ruclips.net/video/21TgKhy1GY4/видео.html
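      For reference, a minimal Ridge/Lasso sketch (illustrative X and y; both penalties shrink coefficients, which stabilizes them under multicollinearity):

          from sklearn.linear_model import Ridge, Lasso

          ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks correlated coefficients
          lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can zero out redundant features
          print(ridge.coef_, lasso.coef_)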

  • @umeshrawat8827
    @umeshrawat8827 1 year ago +2

    To omit either X1 or X2, we can use PCA and remove the variable with low variance.
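    A sketch of the PCA route (note that PCA replaces the correlated columns with uncorrelated components rather than dropping a raw variable; names are illustrative):

        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        X_scaled = StandardScaler().fit_transform(df[["x1", "x2"]])
        pca = PCA(n_components=1)                # keep the high-variance component
        component = pca.fit_transform(X_scaled)  # an uncorrelated replacement feature
        print(pca.explained_variance_ratio_)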

  • @jhonatangilromero2311
    @jhonatangilromero2311 1 year ago

    It is evident that a lot of work goes into developing these very informative videos. Thank you!

  • @KastijitBabar
    @KastijitBabar 7 months ago

    The best explanation on the whole of RUclips! Thank you.

  • @ChenLiangrui
    @ChenLiangrui 6 months ago

    awesome video! very clear and beginner friendly, no broken train of thought, very problem-focused

  • @ShubhamSharma-zb9uh
    @ShubhamSharma-zb9uh 3 years ago

    09:11 The variable with the higher coefficient value is the one we have to consider for the analysis.

  • @shahbazkhalilli8593
    @shahbazkhalilli8593 9 months ago +1

    I don't know which one I should take. By the way, the video is great.

  • @shadow82000
    @shadow82000 3 years ago +8

    If X1 and X2 have a high correlation, can I choose to drop the X with the lower correlation to Y, based on the correlation matrix?

  • @datafuturelab_ssb4433
    @datafuturelab_ssb4433 3 years ago +2

    Great explanation, sir. Thanks for sharing and making my fundamentals strong.

  • @albertma1
    @albertma1 1 month ago

    Thank you so much for the explanation!

  • @smegala3815
    @smegala3815 2 years ago +1

    Thank you sir... Best explanation

  • @BrainBlink-111
    @BrainBlink-111 3 years ago +1

    Best explanation... keep up the good work.

  • @csprusty
    @csprusty 3 years ago

    We can create and compare two models, choosing each of the correlated explanatory variables one at a time, and select the model with the better R-squared value.
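    A minimal sketch of that comparison, fitting one candidate at a time (illustrative df with the correlated pair x1, x2 and target y):

        from sklearn.linear_model import LinearRegression

        r2_x1 = LinearRegression().fit(df[["x1"]], df["y"]).score(df[["x1"]], df["y"])
        r2_x2 = LinearRegression().fit(df[["x2"]], df["y"]).score(df[["x2"]], df["y"])

        # keep the variable whose model explains more variance
        print(r2_x1, r2_x2)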

  • @abdulhaseebshah9109
    @abdulhaseebshah9109 2 years ago +1

    Amazing explanation, Aman. I have a question: are VIF and auxiliary regression both used to detect multicollinearity?

  • @albertma1
    @albertma1 1 month ago

    this channel is so underrated

  • @datapointpune6216
    @datapointpune6216 3 years ago +1

    Very informative, Aman.

  • @arshiyasaba2259
    @arshiyasaba2259 2 years ago +1

    If the value is less than the threshold value of 0.5/0.7 that the reference suggests, then we can remove those variables.
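    One common variant of this filter drops one of each feature pair whose pairwise correlation exceeds the cutoff (a hedged sketch; X is an illustrative numeric DataFrame and 0.7 the illustrative threshold):

        import numpy as np

        corr = X.corr().abs()
        # keep only the upper triangle so each pair is checked once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
        X_reduced = X.drop(columns=to_drop)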

  • @roshinidhinesh5490
    @roshinidhinesh5490 3 years ago +1

    Such a great explanation sir.. Thanks a lot!

  • @shivamthakur4079
    @shivamthakur4079 3 years ago +1

    Really loved what you said, sir. I can say that you have a great way of explaining concepts. I can blindly follow you, sir.

  • @allaboutstat1103
    @allaboutstat1103 3 years ago +1

    Thanks for the clear explanation, and God bless!

  • @faozanindresputra3096
    @faozanindresputra3096 1 year ago +1

    Will multicollinearity also be a problem in correlation analysis, where the focus is just on finding which variables correlate rather than on regression, like in PCA?

  • @zakiaa7464
    @zakiaa7464 1 year ago

    You are a genius. Thanks

  • @ugwukelechi9476
    @ugwukelechi9476 2 years ago

    You are a great teacher! I learnt something new today.

  • @manavgora
    @manavgora 11 months ago

    great, easily understandable

  • @YourRandomVariable
    @YourRandomVariable 3 years ago +1

    Hi Aman, what should we do when the constant term's p-value is high? Mostly I see that people keep it without worrying about it. Could you please give an explanation for this?

  • @shanmukhchandrayama8508
    @shanmukhchandrayama8508 3 years ago +1

    Aman, your videos are great, but many of them are connected to one another. Could you please make a video saying in which order to follow the playlists to learn machine learning from the basics? It would be really helpful😅

  • @sudhirnanaware1944
    @sudhirnanaware1944 3 years ago +1

    Hi Aman,
    As far as I know, we can use the VIF (Variance Inflation Factor), a heatmap, or the corr() function to detect and remove multicollinearity. Please confirm other techniques.
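    A minimal VIF sketch with statsmodels (assuming a numeric feature DataFrame X; names are illustrative):

        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        X_const = sm.add_constant(X)  # VIFs are computed against an intercept
        for i, col in enumerate(X_const.columns):
            print(col, variance_inflation_factor(X_const.values, i))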

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +1

      Yes Sudhir, apart from those, some other regression techniques can be used.

    • @sudhirnanaware1944
      @sudhirnanaware1944 3 years ago

      Thanks Aman, may I know the regression techniques to remove multicollinearity? I will definitely learn them, and they will be helpful for me.

  • @prateeksachdeva1611
    @prateeksachdeva1611 2 years ago

    We will drop from the model the feature whose correlation with the dependent variable is lower compared to the other one.

  • @MuhammadImran-o4c
    @MuhammadImran-o4c 3 years ago

    Sir, whatever answer anyone has given you, you say every answer is correct; you are saying yes to everyone.

  • @rafibasha4145
    @rafibasha4145 2 years ago

    Multicollinearity is a problem in classification as well, right? @3:57

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago +1

      Yes, if it's a linear model like logistic regression.

  • @harshadbobade2200
    @harshadbobade2200 2 years ago

    Simple and to-the-point explanation 🤘

  • @mariapramiladcosta1972
    @mariapramiladcosta1972 3 years ago

    Sir, if there are 3 predictors and one dependent variable, and all three independent variables are highly correlated, then which type of regression model can be used? Multiple regression cannot be used, right? Can we use linear regression? Aren't a tolerance of .1 and a VIF of less than 10 good enough to indicate that there is no multicollinearity?
    For your question, I think the more weakly correlated one should be removed.

  • @kunalchakraborty3037
    @kunalchakraborty3037 3 years ago

    My questions:
    1. Is multicollinearity a concern for predictive modeling? I mean, is the prediction altered by neglecting this phenomenon or not?
    2. In the case of a GAM, do we have to worry about multicollinearity?
    3. How does collinearity inflate the variance?

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago

      Thanks, Kunal, for asking. The answer to the first question is that the prediction will not be impacted much; however, the coefficients will be impacted.
      The 2nd and 3rd I will cover in another video.

    • @kunalchakraborty3037
      @kunalchakraborty3037 3 years ago

      @@UnfoldDataScience thanks 👍. Really appreciate your videos.

  • @bijaynayak6473
    @bijaynayak6473 2 years ago

    Which one should we eliminate? Take the VIF of each feature and set the threshold at >5.
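    One common way to act on that threshold is to drop the worst feature repeatedly until every VIF is at or below 5 (a hedged sketch; X is an illustrative numeric DataFrame):

        from statsmodels.stats.outliers_influence import variance_inflation_factor

        def drop_high_vif(X, threshold=5.0):
            X = X.copy()
            while True:
                vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
                worst = max(range(len(vifs)), key=lambda i: vifs[i])
                if vifs[worst] <= threshold:
                    return X
                X = X.drop(columns=[X.columns[worst]])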

  • @sriadityab4794
    @sriadityab4794 3 years ago +1

    Do we need to remove multicollinearity while building a time series model?

  • @atomicbreath4360
    @atomicbreath4360 3 years ago +1

    Sir, can you give some ideas on how to know which types of ML models are affected by multicollinearity?

  • @bhavanichatrathi7435
    @bhavanichatrathi7435 3 years ago

    Hi Aman, it's a very good explanation... please do a video on penalised regression like lasso, ridge, and elastic net. There is too much mathematics in those; please explain them in a simple way. Thank you.

  • @RamanKumar-ss2ro
    @RamanKumar-ss2ro 3 years ago +1

    Great content.

  • @nurlanimanov9503
    @nurlanimanov9503 3 years ago

    Hello sir, after reading the comments I saw the answer to your question: they said we have to remove the one with the lower correlation coefficient with the target variable, per the correlation matrix. It confused me at one point. Can we say that the coefficients in front of each feature, which we get after running the regression model, indicate the impact of each feature on the target? I mean, can I use these coefficients when deciding which of two correlated features to remove, instead of the correlation-matrix value with the target? Do the coefficients in front of each feature actually say the same thing as the values in the correlation matrix with the target variable in this context?

  • @anmolpardeshi3138
    @anmolpardeshi3138 3 years ago

    Regarding the question of which variable to remove out of a set of highly correlated variables: can this be answered by PCA (principal component analysis)? Or will PCA weight them the same because they are highly correlated?

  • @nivednambiar6845
    @nivednambiar6845 10 months ago

    Hi Aman, hope you are doing well!
    I want to ask one thing: the regression models you are mentioning are linear models, right, not tree-based regression models? Am I correct?
    Does multicollinearity affect tree-based models?

  • @ashulohar8948
    @ashulohar8948 2 years ago

    Please, please make a video on how to select the drivers in linear regression that drive sales.

  • @beautyisinmind2163
    @beautyisinmind2163 2 years ago

    Can we also remove highly negatively correlated features, or not? Someone reply, please.

  • @nurlanimanov9503
    @nurlanimanov9503 3 years ago +1

    Hello sir! Firstly, thank you for the video!
    I have 2 questions; if you answer them, I will be glad:
    1) Can we say that we don't need to be concerned about correlated features in, for example, decision-tree-based models? I mean, do we need this concept only in linear models?
    2) Is it true that we don't need to touch correlated features when we use Lasso or Ridge regression? Will the model take care of that by itself in that case?

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +1

      1. This is a problem with regression-based models, where coefficients come into the picture.
      2. You still need to take care of it.

    • @hemanthkumar42
      @hemanthkumar42 3 years ago

      @@UnfoldDataScience from your first answer, then why is multicollinearity not a problem in a neural network? Please make a video regarding this, sir...

    • @saurabhagrawal9874
      @saurabhagrawal9874 2 years ago +3

      @@hemanthkumar42 Note that multicollinearity does not affect the prediction accuracy of linear regression; it only makes the interpretation harder, and interpretation is mostly why we go to linear regression. When we go to a neural network, we already know it is a type of black box and we don't want to interpret it, but want good prediction results. That's why we don't bother about multicollinearity in neural networks.

  • @muhammadaliabid5793
    @muhammadaliabid5793 3 years ago

    Thank you for the excellent explanation. I have a few questions, please:
    1. I used the PolynomialFeatures method in sklearn and it significantly improved the accuracy of my linear regression prediction model, but I found that the newly created features are correlated with the existing features, since I created squares and cubes! I understand from your explanation that this will lead to a multicollinearity problem, so the coefficients are not the true picture. However, can I use this type of model for predictions?
    2. What threshold correlation value would you suggest for multicollinearity?
    Thanks
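    To see the collinearity that polynomial terms introduce, a quick check (illustrative; x1 is one original feature of df):

        import pandas as pd

        x = df["x1"]
        poly = pd.DataFrame({"x": x, "x2": x**2, "x3": x**3})
        print(poly.corr())  # x, x^2 and x^3 are typically strongly correlated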

  • @hemanthkumar42
    @hemanthkumar42 3 years ago +1

    Is multicollinearity a problem for neural networks?

  • @hakimandishmand1068
    @hakimandishmand1068 2 years ago

    Good and perfect

  • @datafuturelab_ssb4433
    @datafuturelab_ssb4433 3 years ago +2

    Remove the variable which has low impact on the target variable...
    Sir, I have a few questions:
    1. If there is multicollinearity in a classification problem, how do we handle it?
    2. What is VIF, and how is standardization done?
    3. Can we use a standard scaler in a regression problem?

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +1

      There are three questions; I will cover them in a separate video. Thanks for asking.

  • @AMVSAGOs
    @AMVSAGOs 3 years ago

    Great Explanation...
    At 7:50 you said, "that's why we should not have multicollinearity in regression". So, is it okay if we have multicollinearity in classification?? Could you please make it clear...

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +1

      When I say that, I mean the regression family of algorithms, logistic regression included.

    • @AMVSAGOs
      @AMVSAGOs 3 years ago

      @@UnfoldDataScience Thank you Aman Sir

  • @suryadhakal3608
    @suryadhakal3608 3 years ago

    Great.

  • @sudheeshe1384
    @sudheeshe1384 3 years ago +1

    You always rock :)

  • @ethiodiversity-1184
    @ethiodiversity-1184 2 years ago

    great explanation

  • @jaheerkalanthar816
    @jaheerkalanthar816 2 years ago

    I think we keep the variable which correlates more highly with the target variable.

  • @akhileshgandhe5934
    @akhileshgandhe5934 3 years ago

    Hi Aman, I have 9 categorical and 6 numerical columns, and it's a regression problem.
    I can find the correlation between the numerical ones using a correlation heatmap, but how do I find the relation between the categorical ones..??
    Can I use the chi-square test..??
    When I use it, I find that all 9 categorical columns are dependent on each other. So what should my next step be..??
    Please guide me.
    Thanks

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago +1

      Yes, the chi-square test can be used; I have a dedicated video on that topic.
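      A minimal chi-square sketch with scipy (illustrative column names cat_a, cat_b):

          import pandas as pd
          from scipy.stats import chi2_contingency

          table = pd.crosstab(df["cat_a"], df["cat_b"])
          chi2, p, dof, expected = chi2_contingency(table)
          print(p)  # a small p-value suggests the two categorical features are associated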

  • @salajmondal3437
    @salajmondal3437 9 months ago

    Should I check multicollinearity for a classification problem?

    • @UnfoldDataScience
      @UnfoldDataScience  9 months ago +1

      For logistic regression - yes.

    • @salajmondal3437
      @salajmondal3437 9 months ago

      @@UnfoldDataScience Is it necessary to check multicollinearity between categorical features, or between numerical and categorical features??

  • @kar2194
    @kar2194 3 years ago

    Sorry, so it means that when there is multicollinearity between, for example, x2 and x3, if I increase x2, x3 will automatically increase? Great video by the way!

  • @RAJANKUMAR-mi1ib
    @RAJANKUMAR-mi1ib 3 years ago

    Hi... thanks for the nice explanation. I have a question: is multicollinearity a problem for linear regression only? If not, then how is it a problem for non-linear regression?

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago

      For regression-based models like linear/logistic, etc.

  • @sharadpkumar
    @sharadpkumar 2 years ago

    Hi Aman, nice work, keep it up..... I have a doubt: why is the normal distribution so important? Why do we need our independent variable to be normally distributed for a good model? I am not finding a satisfying answer. Can you please help?

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago +2

      Hi Sharad, in simple language, it's easy for the model to learn the pattern if you give it examples from a large range of values (that is your normal distribution).
      Take the example below:
      Predict the salary of an individual (Y, the target) based on his/her expenses (X variable).
      Scenario 1 - in your training set you have Y as 10LPA, 15LPA, 20LPA, and so on. Here the model won't be able to learn the pattern for the 3LPA people; there may be a difference in the income/expense pattern for junior people.
      Scenario 2 - you give many values of Y from all over, like 2LPA, 4LPA, 5LPA, 100LPA, as if they were normally distributed.
      Here it's easy for the model to learn the pattern, as it sees a whole range of values, and the resulting model will be more reliable.
      Hope it's clear now.

    • @sharadpkumar
      @sharadpkumar 2 years ago

      @@UnfoldDataScience thanks for the clarification. Does a huge dataset always show a normal distribution?

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago +1

      No, not always... it depends on the data.

  • @trushnamayeenanda5431
    @trushnamayeenanda5431 2 years ago

    The independent variable with the higher correlation among the similar factors should be removed.

  • @sidrahms7458
    @sidrahms7458 3 years ago

    Awesome explanation. I have a question: if I have nominal, ordinal, and continuous variables, how can I find multicollinearity among them?

    • @UnfoldDataScience
      @UnfoldDataScience  3 years ago

      Hi Sidrah, answered.

    • @sidrahms7458
      @sidrahms7458 3 years ago

      I can't find your answer. I understand that we should use VIF for continuous variables, but what if I need to see the correlation among all of them: ordinal, numeric, and nominal?

  • @KumarHemjeet
    @KumarHemjeet 3 years ago

    Remove the feature which has the lower correlation with the target.

  • @bezagetnigatu1173
    @bezagetnigatu1173 2 years ago

    Thank you!

  • @omkarlokhande3692
    @omkarlokhande3692 1 year ago

    Sir, what should we do if multicollinearity is affecting a binary classification problem?

    • @UnfoldDataScience
      @UnfoldDataScience  1 year ago

      There are many ways to take care of it; I have discussed them in the classification videos.

  • @ameerrace2284
    @ameerrace2284 3 years ago

    Great video. Please create a video on the Python implementation of Lasso and Ridge regression.

  • @shafeeqaabdussalam6195
    @shafeeqaabdussalam6195 3 years ago +1

    Thank you

  • @MuhammadImran-o4c
    @MuhammadImran-o4c 3 years ago +1

    Thanks, sir. I think we should remove the unnecessary variable.

  • @squadgang1678
    @squadgang1678 2 years ago

    I will find the correlation between x1 and y and between x2 and y individually, see which one is smaller, and delete the one with the lower correlation.

  • @karthikganesh4679
    @karthikganesh4679 3 years ago

    Sir, please do a video on post-pruning a decision tree.

  • @rohitnalage6366
    @rohitnalage6366 2 years ago

    Sir, please explain Lasso and Ridge; if you have made that video, please share the link.

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago

      ruclips.net/video/7XvBwQeT9OI/видео.html
      ruclips.net/video/21TgKhy1GY4/видео.html

  • @Hinchey613
    @Hinchey613 1 month ago

    Thanks

  • @sandipansarkar9211
    @sandipansarkar9211 3 years ago

    finished watching

  • @squadgang1678
    @squadgang1678 2 years ago

    Is machine learning better than deep learning, or is deep learning better than machine learning?

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago

      It depends on the problem statement, data availability, infra availability, etc.; we can't say one is better than the other.

    • @squadgang1678
      @squadgang1678 2 years ago

      @@UnfoldDataScience oh ok got it ✌️

  • @sreejadas4417
    @sreejadas4417 2 years ago

    I want to be a data analyst, but I want sequential courses from you. Please guide me.

  • @anirudhchandnani9917
    @anirudhchandnani9917 3 years ago

    Hi Aman,
    Could you please make a detailed video explaining the difference between Gradient Boost, AdaBoost, and Extreme Gradient Boosting?
    Why is AdaBoost called adaptive? Is it only because it edits the weights of the misclassified instances? XGBoost and Gradient Boost are also adaptive in that way, aren't they?
    Also, why are XGBoost and GBoost more robust to outliers than AdaBoost, despite all of them having a log term in their loss functions?
    Would really appreciate your reply.
    Thanks

  • @sujithreddy1599
    @sujithreddy1599 3 years ago

    It depends on feature importance; the feature with less importance will be dropped.
    Correct me if I am wrong :0

  • @naziakhatoon3058
    @naziakhatoon3058 3 years ago

    Remove the one that is less correlated.

  • @ahmad3823
    @ahmad3823 9 months ago

    at least two variables!

  • @khoaanh7375
    @khoaanh7375 10 months ago +1

    this shit is pure gold

  • @tesfayesime9434
    @tesfayesime9434 1 year ago

    Neither x1 nor x2.

  • @prateeksachdeva1611
    @prateeksachdeva1611 2 years ago

    excellent explanation