Big news! I just launched a free, 3-hour course that contains all 50 scikit-learn tips! Join here: courses.dataschool.io/scikit-learn-tips
I think the justification for the binary case is that with a binary feature, it's either yes or no, so 1 column can record that.
If you just remove a random column for one of your 3 shapes -- say 'square' -- then haven't you just lost that information from your dataset?
I guess you could infer it, since there are only 3 discrete categories: a '0' in both the circle and oval columns implies that the row must be a square. But then how would the presence of 'square' be returned as a predictive value in a later model, if square isn't an explicitly listed option?
In the case with 2 separate "Pink" and "Yellow" values, the two columns would be perfectly (negatively) correlated with one another, since the dichotomy is either/or. They are exact opposites, and the absence of one of the 2 options lets you infer the other 100% of the time.
In the case of 3 categories, the columns don't share that same symmetric/binary relationship: the absence of "square" doesn't let you directly infer whether the row is a circle or an oval, the way the absence of Pink does for Yellow. Having 2 alternatives instead of 1 introduces ambiguity that isn't present in a binary relationship.
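To make this concrete, here's a toy sketch with made-up 'shape' data (OneHotEncoder sorts categories alphabetically, so drop='first' drops 'circle'):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up data: 3 discrete shape categories
X = pd.DataFrame({'shape': ['circle', 'oval', 'square', 'square']})

ohe = OneHotEncoder(drop='first')  # drops 'circle' (alphabetically first)
print(ohe.fit_transform(X).toarray())
# [[0. 0.]   <- all zeros: 'circle' can only be inferred, not read directly
#  [1. 0.]   <- 'oval'
#  [0. 1.]   <- 'square'
#  [0. 1.]]
print(ohe.get_feature_names_out())  # ['shape_oval' 'shape_square']
```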
Anyways just my thought. Thank you for the great content!
Great question! The information is not lost when you drop the first column, because the original categories are stored in the categories_ attribute of the OneHotEncoder (ohe.categories_). Hope that helps!
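As a minimal sketch (using a made-up 'shape' column):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'shape': ['circle', 'oval', 'square']})
ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform(X).toarray()

# all original categories are preserved, including the dropped one
print(ohe.categories_)  # [array(['circle', 'oval', 'square'], dtype=object)]

# and the all-zeros row decodes back to the dropped category
print(ohe.inverse_transform(encoded))  # [['circle'] ['oval'] ['square']]
```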
Explained very well! May I ask what the multicollinearity problem is?
Well, my understanding is that a binary feature, when one-hot encoded, always gives a 2x2 matrix of unique rows, whereas a feature with n categories gives an nxn matrix.
This could be the supporting pillar for using "if_binary", as it removes the redundancy from what is essentially an identity matrix.
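Something like this sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

binary = np.array([['Pink'], ['Yellow']])
three = np.array([['circle'], ['oval'], ['square']])

print(OneHotEncoder().fit_transform(binary).toarray())
# [[1. 0.]
#  [0. 1.]]   <- 2x2 identity: each column is the exact complement of the other

print(OneHotEncoder().fit_transform(three).toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]  <- 3x3 identity
```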
Thanks for your comment! I still don't quite understand, because regardless of whether the feature has 2 categories or 10 categories, there is still always 1 column (after one-hot encoding) that is redundant.
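To illustrate with a made-up example: whether there are 2 categories or 10, any one column equals 1 minus the sum of the rest, so it is always redundant:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['circle'], ['oval'], ['square'], ['oval']])
enc = OneHotEncoder().fit_transform(X).toarray()

# the last column is fully determined by the others -> redundant
print(np.allclose(enc[:, -1], 1 - enc[:, :-1].sum(axis=1)))  # True
```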
@dataschool Because in the case of a binary feature, it will always be negative collinearity?!
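e.g. a quick sketch with made-up colors:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

color = np.array([['Pink'], ['Yellow'], ['Pink'], ['Pink']])
encoded = OneHotEncoder().fit_transform(color).toarray()

# the two columns of a one-hot encoded binary feature are perfectly
# negatively correlated
print(np.corrcoef(encoded[:, 0], encoded[:, 1])[0, 1])  # -1.0
```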
YOU ARE AWESOME!
Thank you! 🙏