Drop the first category from binary features (only) with OneHotEncoder

  • Published: Oct 27, 2024

Comments • 9

  • @dataschool
    @dataschool  3 years ago +2

    Big news! I just launched a free, 3-hour course that contains all 50 scikit-learn tips! Join here: courses.dataschool.io/scikit-learn-tips

  • @timfwater
    @timfwater 1 year ago +1

    I think the binary justification is that a binary feature is either yes or no, so a single binary column can record it.
    If you just remove a column for one of your 3 shapes, say 'square', haven't you lost that information from your dataset?
    I guess you could infer it: since there are only 3 discrete categories, a '0' value for both circle and oval implies that it must be a square. But then how would the presence of 'square' be returned as a predictive value in a later model if square isn't an explicitly listed option?
    In the case of 2 separate "Pink" and "Yellow" columns, the two would be perfectly correlated with one another, since the dichotomy is either/or. They are perfect opposites, and the absence of one of the 2 options lets you infer the other 100% of the time.
    In the case of 3 categories, the columns do not have a similar symmetric, binary relationship: the absence of "square" doesn't let you directly infer which of circle or oval is present, the way the absence of Pink lets you infer Yellow. Having 2 alternatives instead of 1 introduces ambiguity that is not present in a binary relationship.
    Anyway, just my thoughts. Thank you for the great content!

    • @dataschool
      @dataschool  1 year ago +1

      Great question! The information is not lost when you drop the first column, because the original categories are stored in the categories_ attribute of the OneHotEncoder (ohe.categories_). Hope that helps!
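
Editorial illustration of the reply above (a minimal sketch, not code from the video): the toy shape data and variable names are assumptions. Note that `drop="first"` drops the first category in sorted order, which here is 'circle' rather than the 'square' used in the question; the reasoning is the same.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one feature with three shape categories.
X = np.array([["circle"], ["oval"], ["square"], ["circle"]])

# drop="first" removes the column for the first (sorted) category.
ohe = OneHotEncoder(drop="first")
encoded = ohe.fit_transform(X).toarray()

# All three categories are still stored on the fitted encoder.
print(ohe.categories_)

# Remaining columns are 'oval' and 'square'; a row of all zeros means 'circle'.
print(encoded)

# The dropped category is recoverable: all-zero rows map back to it.
recovered = ohe.inverse_transform(encoded)
print(recovered.ravel())
```

A row with zeros in every remaining column can only be the dropped category, so a downstream model still distinguishes all three shapes.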

  • @sachink9102
    @sachink9102 1 year ago

    Explained very well. May I know what the multicollinearity problem is?

  • @Atulmishra-hs8ch
    @Atulmishra-hs8ch 3 years ago

    Well, my understanding is that "a binary feature, when one-hot encoded, will always give a 2*2 matrix and a non-binary one is always an n*2 matrix".
    This could be the supporting argument for using "if_binary", as it removes redundancy from what is very nearly an identity matrix.

    • @dataschool
      @dataschool  3 years ago +1

      Thanks for your comment! I still don't quite understand, because regardless of whether the feature has 2 categories or 10 categories, there is still always 1 column (after one-hot encoding) that is redundant.

    • @sv1562
      @sv1562 2 years ago

      @dataschool Because in the case of a binary feature it will always be negative collinearity?!
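
Editorial illustration of the points in this thread (a rough sketch under assumed toy data, not the video's own code): with full one-hot encoding, each feature's columns sum to 1 regardless of how many categories it has, so one column per feature is always redundant. For a binary feature the two columns are also perfectly negatively correlated. `drop="if_binary"` removes a column only from features with exactly two categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a binary feature (color) and a 3-category feature (shape).
X = np.array([
    ["pink", "circle"],
    ["yellow", "oval"],
    ["pink", "square"],
    ["yellow", "circle"],
])

# Full encoding: columns are pink, yellow, circle, oval, square.
full = OneHotEncoder().fit_transform(X).toarray()

# Within each feature the columns sum to 1 in every row,
# so any one of them can be derived from the others.
color_sums = full[:, :2].sum(axis=1)
shape_sums = full[:, 2:].sum(axis=1)

# For the binary feature, the two columns are exact opposites.
corr = np.corrcoef(full[:, 0], full[:, 1])[0, 1]

# drop="if_binary" drops the first column of binary features only.
ohe = OneHotEncoder(drop="if_binary")
reduced = ohe.fit_transform(X).toarray()

print(full.shape, reduced.shape)
print(ohe.drop_idx_)  # index of the dropped category per feature, None if kept whole
```

Here the color feature shrinks to a single 'yellow' column while the shape feature keeps all three columns.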

  • @johnanih56
    @johnanih56 3 years ago

    YOU ARE AWESOME!