Nice video! What about checking where the high correlation is coming from, comparing each correlated column's correlation with the target column, and dropping only the one with the lower correlation?
We can decide the threshold and see which columns have a high correlation.
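A minimal sketch of that idea, on made-up data (the names X_train, y_train and the threshold are assumptions, not the video's code):

```python
import numpy as np
import pandas as pd

# Made-up data: f2 is almost a copy of f1, f3 is independent
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"f1": rng.normal(size=200)})
X_train["f2"] = X_train["f1"] + rng.normal(scale=0.05, size=200)
X_train["f3"] = rng.normal(size=200)
y_train = 2 * X_train["f1"] + rng.normal(size=200)

threshold = 0.8
corr = X_train.corr().abs()
target_corr = X_train.corrwith(y_train).abs()

to_drop = set()
for i in range(len(corr.columns)):
    for j in range(i):  # lower triangle only, so each pair is seen once
        if corr.iloc[i, j] > threshold:
            a, b = corr.columns[i], corr.columns[j]
            # Drop whichever of the pair correlates less with the target
            to_drop.add(a if target_corr[a] < target_corr[b] else b)

print(to_drop)  # typically {'f2'}: f1 tracks the target slightly better
```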
Hi, thanks for the video!
Wouldn't that remove all highly correlated columns instead of leaving just one column for every relationship?
It will leave one column for every relationship.
@@StatsWire Great video! I don't quite get what you mean here. Isn't the list returning every column that has a high correlation based on the threshold? And then we're proceeding to remove all of those columns. Shouldn't we intentionally keep one instead of removing all of them? How is it automatically keeping one, if that's what you are saying?
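To clarify the confusion above: the video's exact code isn't reproduced here, but the usual version of this function scans only the lower triangle of the correlation matrix and flags just one column of each highly correlated pair, so the other column survives. A sketch:

```python
import pandas as pd

def correlation(df: pd.DataFrame, threshold: float) -> set:
    """Return a set of column names to drop, one per highly correlated pair."""
    col_corr = set()
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # j < i: lower triangle, each pair visited once
            if abs(corr_matrix.iloc[i, j]) > threshold:
                # Only column i (the later of the pair) is flagged; column j stays
                col_corr.add(corr_matrix.columns[i])
    return col_corr
```

Because only the later column of each pair goes into the set, dropping everything in the set still leaves one representative per relationship.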
Is this method better than variance inflation factor?
Both of them are good.
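For comparison, a minimal VIF sketch using statsmodels' variance_inflation_factor on made-up data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
X = pd.DataFrame({"a": rng.normal(size=300), "c": rng.normal(size=300)})
X["b"] = X["a"] + rng.normal(scale=0.1, size=300)  # nearly collinear with a

Xc = add_constant(X)  # VIF should be computed with an intercept column present
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
).drop("const")
print(vif)  # a and b get large VIFs; a common rule of thumb flags VIF > 5-10
```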
What about a scenario where the order of the columns changes? Since we're checking column pairs for correlations above the threshold and removing the first of the two whenever the threshold is met or exceeded, if I change the order of the columns I get a different result. Is that still a correct list of features?
That is completely ok. You can change the order.
@@StatsWire Thanks for the reply. I did change the order and got a different set of features. I built an XGBoost model with both sets of features and got extremely different forecasts and accuracies in the two cases. How do I decide which is correct?
Yes, that is still a correct feature list. You can change column positions, no problem at all.
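One way to decide is to cross-validate both feature sets and keep whichever scores better. A sketch on made-up data (the candidate sets below stand in for the two lists you got):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor  # assumes the xgboost package is installed

rng = np.random.default_rng(2)
X_train = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
X_train["f2"] = X_train["f1"] + rng.normal(scale=0.05, size=300)  # correlated pair
y_train = X_train["f1"] + X_train["f3"] + rng.normal(size=300)

# The two candidate sets produced by the two column orders
candidates = {"set A": ["f1", "f3", "f4"], "set B": ["f2", "f3", "f4"]}
for name, cols in candidates.items():
    scores = cross_val_score(
        XGBRegressor(n_estimators=100, random_state=0),
        X_train[cols], y_train, cv=5, scoring="neg_mean_absolute_error",
    )
    print(name, round(scores.mean(), 4))  # keep whichever set scores better
```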
I have been following your feature selection videos and have covered forward, backward, exhaustive, variance threshold, chi2, etc. You have not shared the dataset in them. Why don't you share the dataset so we can follow along?
Please find the dataset link: github.com/siddiquiamir/Data
How are diagonal elements handled in the user-defined function correlation(df, threshold)?
I did not get your question.
@@StatsWire At the diagonal elements the value is 1, which is greater than the threshold, so every column would show up in the output.
@@d1pranjal OK, I hope you found the solution yourself.
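For anyone else wondering: if the loop only visits j < i (the strict lower triangle), the diagonal is never checked. If you scan the full matrix instead, mask the diagonal and upper triangle explicitly; a sketch on made-up data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x", "y", "z"])
df["y"] = df["x"] + rng.normal(scale=0.1, size=200)

corr = df.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool))  # diagonal + upper triangle
pairs = corr.where(~mask).stack()                # strictly lower triangle only
print(pairs[pairs > 0.8])                        # the 1.0 diagonal never appears
```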
This video was really helpful, thanks a ton.
You're welcome
Thank you for the video! Very well explained, keep it up!
Thank you for your kind words.
Excellent, thank you very much.
I'm glad you liked it. You're welcome
Hi, how do I get this dataset?
Hi, please find the dataset: github.com/siddiquiamir/Feature-Selection
@@StatsWire thank you
@@farahamirah2091 You're welcome!
Nice! Would it be the same if we used PCA to avoid multicollinearity?
Thank you! There would be some minor differences.
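The main difference: correlation-based selection keeps a subset of the original features, while PCA replaces them with new, mutually uncorrelated components. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
X["f2"] = X["f1"] + rng.normal(scale=0.1, size=200)  # f1 and f2 are collinear

# Scale first: PCA is sensitive to feature variances
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(np.corrcoef(Z, rowvar=False).round(3))  # off-diagonals ~0: uncorrelated
```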
How do I find collinearity for categorical features?
You can use the chi-square test.
@@StatsWire Thanks, brother. Suppose I have done a chi2 test between the independent variables and the dependent variable, and then I got the F and p values. How can I select features based on those F and p values? Will you please clarify this, brother?
@@maskman9630 Select the variables whose p-values are smaller compared to the other variables.
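A minimal sketch with scikit-learn, using the built-in iris data as a stand-in (sklearn's chi2 returns the chi-square statistics and p-values, and it requires non-negative features):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True, as_frame=True)  # all features non-negative
scores, pvalues = chi2(X, y)
report = pd.DataFrame({"chi2": scores, "p_value": pvalues}, index=X.columns)
print(report.sort_values("p_value"))  # smaller p-value = stronger dependence on y

selected = SelectKBest(chi2, k=2).fit(X, y)       # keep the 2 best features
print(list(X.columns[selected.get_support()]))
```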
Very nice
Thank you
When I enter the code line corrmatrix = X_train.corr(), it gives the error: AttributeError: 'numpy.ndarray' object has no attribute 'corr'.
You need to make sure your data is in the correct format: .corr() is a pandas DataFrame method, and that error means your X_train is a NumPy array.
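Concretely, wrap the array in a DataFrame first (the column names below are made up), or use np.corrcoef on the raw array:

```python
import numpy as np
import pandas as pd

X_train = np.random.rand(100, 4)             # a plain ndarray has no .corr()
X_train = pd.DataFrame(X_train, columns=["f1", "f2", "f3", "f4"])
corrmatrix = X_train.corr()                  # works on a DataFrame
print(corrmatrix.round(3))
# Equivalent for a raw array: np.corrcoef(arr, rowvar=False)
```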
This wasn't helpful at all. You just picked one of the correlated variables at random, without any additional criteria. Anyway, a correlation matrix can't do much; it's much more reliable to use VIF or hierarchical clustering for feature selection.
Hi, this is for demonstration purposes. You can dive deeper and pick the variables based on your own selection criteria. :)