Python Feature Selection: Remove Multicollinearity from Machine Learning Model in Python

  • Published: 27 Oct 2024

Comments • 39

  • @Nomar_7 · 2 years ago +2

    Nice video. What about checking where the high correlation is coming from, comparing each of the two correlated columns with the target column, and dropping only the one with the lower correlation to the target?

    • @StatsWire · 2 years ago

      We can set the threshold ourselves and then see which columns have a high correlation.
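
For readers who want to try the refinement suggested in this thread: for each pair of features correlated above the threshold, keep the member more correlated with the target. A minimal sketch, with all names hypothetical (X is a feature DataFrame, y the target Series):

```python
import pandas as pd

def drop_less_predictive(X: pd.DataFrame, y: pd.Series, threshold: float = 0.9) -> set:
    """For each feature pair correlated above `threshold`, mark for
    dropping the member that is less correlated with the target."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()  # |correlation| of each feature with y
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i):  # lower triangle: each pair is checked once
            if corr.iloc[i, j] > threshold:
                # Keep whichever feature carries more signal about y.
                weaker = cols[i] if target_corr[cols[i]] < target_corr[cols[j]] else cols[j]
                to_drop.add(weaker)
    return to_drop
```

Calling X.drop(columns=drop_less_predictive(X, y)) then yields the reduced feature set.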

  • @oomraden · 2 years ago +1

    Hi, thanks for the video!
    Wouldn't that remove all highly correlated columns instead of leaving just one column for every relationship?

    • @StatsWire · 2 years ago +2

      It will leave one column for every relationship.

    • @SaifTreks · 2 years ago

      @StatsWire Great video! I don't quite get what you mean here. Isn't the list returning every column that has a high correlation based on the threshold? And then we proceed to remove all of those columns. Shouldn't we intentionally keep just one instead of removing all of them? How is it automatically keeping one, if that's what you are saying?
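
To untangle this exchange: the helper is typically written to walk only the lower triangle of the correlation matrix, so for each correlated pair only the later column is added to the drop set and its partner survives. A sketch of that common pattern (not necessarily the video's exact code):

```python
def correlation(df, threshold):
    """Return the set of column names to drop."""
    col_corr = set()
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # j < i: each pair is visited exactly once
            if abs(corr_matrix.iloc[i, j]) > threshold:
                # Only column i, the later of the pair, is marked to drop;
                # its partner, column j, stays in the feature set.
                col_corr.add(corr_matrix.columns[i])
    return col_corr
```

Dropping the returned set therefore leaves one representative of each correlated group, which is what the reply above is saying.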

  • @anishdeshpande395 · 1 year ago

    Is this method better than the variance inflation factor?

  • @AnanyaJoshi-g2x · 1 year ago

    What about a scenario where the order of the columns changes? Since we check pairs of columns for correlations above the threshold and then remove the first of the two whenever the threshold is met or exceeded, changing the order of the columns gives a different result. Is that still going to be a correct list of features?

    • @StatsWire · 1 year ago

      That is completely OK. You can change the order.

    • @AnanyaJoshi-g2x · 1 year ago

      @StatsWire Thanks for the reply. I did change the order and got a different set of features. I built an XGBoost model with both sets of features and got very different forecasts and accuracies in the two cases. How do I decide which is correct?

    • @StatsWire · 1 year ago

      Yes, that is going to be a correct feature list. You can change column positions, no problem at all.
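
The commenter's observation is easy to reproduce: because the sketch above always drops the later column of a pair, reordering the columns flips which member survives. Both reduced sets remove the same redundancy, but downstream models can still differ, and cross-validating each candidate set is the usual way to choose. A quick demonstration on hypothetical data, reusing the correlation() sketch from earlier in this thread:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x,
    "x2": x + rng.normal(scale=0.01, size=100),  # near-duplicate of x1
    "x3": rng.normal(size=100),
})

print(correlation(df, 0.9))                      # {'x2'}: the later column is dropped
print(correlation(df[["x2", "x1", "x3"]], 0.9))  # {'x1'}: reordering flips the choice
```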

  • @baburamchaudhary159 · 1 year ago

    I have been following your feature selection series and have covered forward, backward, exhaustive, variance threshold, chi2, etc.
    You have not shared the dataset in those videos.
    Why don't you share the dataset so we can follow along?

    • @StatsWire · 1 year ago

      Please find the dataset link: github.com/siddiquiamir/Data

  • @d1pranjal · 1 year ago

    How are diagonal elements handled in the user-defined function correlation(df, threshold)?

    • @StatsWire · 1 year ago

      I did not get your question.

    • @d1pranjal · 1 year ago

      @StatsWire At the diagonal elements the value is 1 > threshold, so all columns would show up in the output.

    • @StatsWire · 1 year ago

      @d1pranjal OK, I hope you found the solution yourself.
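
For anyone stuck on the same question: in the sketch above the inner loop runs for j in range(i), so the diagonal cell (i, i) is never compared to the threshold at all. An equivalent pandas idiom makes this explicit by masking everything except the strict upper triangle (toy data, hypothetical column names):

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is a perfect multiple of 'a'.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})
corr_matrix = df.corr().abs()

# k=1 excludes the diagonal, so the 1.0 self-correlations can never match.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']: only one member of the perfectly correlated pair
```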

  • @protapmaitra5049 · 2 years ago

    This video was really helpful, thanks a ton.

  • @michaelsagols8295 · 2 years ago

    Thank you for the video! Very well explained! Keep it up!

    • @StatsWire · 2 years ago

      Thank you for your kind words.

  • @jorge1869 · 2 years ago

    Excellent, thank you very much.

    • @StatsWire · 2 years ago

      I'm glad you liked it. You're welcome!

  • @farahamirah2091 · 2 years ago

    Hi, how do I get this dataset?

    • @StatsWire · 2 years ago +1

      Hi, please find the dataset: github.com/siddiquiamir/Feature-Selection

    • @farahamirah2091 · 2 years ago

      @StatsWire Thank you!

    • @StatsWire · 2 years ago

      @farahamirah2091 You're welcome!

  • @akiwhitesoyo918 · 1 year ago

    Nice! Would it be the same if we used PCA to avoid multicollinearity?

    • @StatsWire · 1 year ago

      Thank you! There would be some minor differences.
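
As the reply notes, the two are related but not identical: correlation-based dropping keeps a subset of the original, interpretable columns, while PCA rotates all features into uncorrelated components, removing multicollinearity by construction at the cost of interpretability. A minimal scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)  # inject collinearity

# Standardize first: PCA directions are sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape, pca.explained_variance_ratio_)
```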

  • @maskman9630 · 2 years ago

    How do you find collinearity for categorical features?

    • @StatsWire · 2 years ago

      You can use the chi-square test.

    • @maskman9630 · 2 years ago

      @StatsWire Thanks, brother. Suppose I have done a chi2 test between the independent variables and the dependent variable and obtained the test statistics and p-values; how can I select features based on those values? Will you please clarify this?

    • @StatsWire · 2 years ago

      @maskman9630 Select the variables whose p-values are lower compared to the other variables.
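
To make the chi-square suggestion concrete: a chi-square test of independence on a contingency table measures association between two categorical variables, and smaller p-values mean stronger evidence of association, so you keep the features with the smallest p-values against the target. A sketch with scipy on hypothetical data:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical features and a binary target.
df = pd.DataFrame({
    "color":  ["red", "blue", "red", "blue", "red", "blue"] * 20,
    "size":   ["S", "M", "L", "S", "M", "L"] * 20,
    "target": [1, 0, 1, 0, 1, 0] * 20,
})

# Test each feature against the target.
for col in ["color", "size"]:
    table = pd.crosstab(df[col], df["target"])
    stat, p, dof, _ = chi2_contingency(table)
    print(f"{col}: chi2={stat:.2f}, p={p:.3g}")
```

The same test applied to a crosstab of two features (rather than feature vs. target) is how you would check for association, i.e. "collinearity", between the categorical features themselves.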

  • @mazharalamsiddiqui6904 · 2 years ago

    Very nice

  • @naveedullah390 · 1 year ago

    When I enter the code line corrmatrix = X_train.corr(),
    it gives the error AttributeError: 'numpy.ndarray' object has no attribute 'corr'.

    • @StatsWire · 1 year ago

      You need to make sure your data is in the correct format.
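
More concretely, that error means X_train is a NumPy array (for example, the output of train_test_split on arrays, or of a scaler), and .corr() is a pandas DataFrame method. Wrapping the array back into a DataFrame fixes it; the column names below are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical: X_train came back from a transformer as a plain ndarray.
X_train = np.random.default_rng(0).normal(size=(100, 3))
feature_names = ["f1", "f2", "f3"]  # substitute your real column names

# .corr() exists on DataFrames, not on ndarrays, so convert first.
X_train = pd.DataFrame(X_train, columns=feature_names)
corrmatrix = X_train.corr()
print(corrmatrix)
```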

  • @gisflow406 · 2 years ago

    This wasn't helpful at all. You just picked one of the correlated variables at random, without additional criteria. Anyway, a correlation matrix can't do much; it's much more reliable to use VIF or hierarchical clustering for feature selection.

    • @StatsWire · 2 years ago

      Hi, this is for demonstration purposes. You can dive deeper and pick the variables based on your own selection criteria. :)
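
On the VIF point (also raised by @anishdeshpande395 above): a pairwise correlation matrix only sees two-variable relationships, while the variance inflation factor regresses each feature on all the others, so it also catches a column that is a linear combination of several features. A sketch with statsmodels on synthetic data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X["d"] = X["a"] + X["b"] + rng.normal(scale=0.1, size=200)  # hidden combination

# Include an intercept so each auxiliary regression is well specified.
# Rule of thumb: VIF above roughly 5-10 flags problematic multicollinearity.
Xc = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))  # 'a', 'b', 'd' are all inflated
```

Here corr('a', 'd') is only about 0.7, so a 0.9 correlation threshold would miss the problem while the VIFs flag it.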