Did you know that the code for all of these tips is on GitHub? Check it out: github.com/justmarkham/scikit-learn-tips
I recently discovered your channel, and the amount of excellent information you provide is incredible!
Thank you!
I like all your well-explained videos! In the future, will you consider guiding a hands-on Kaggle project from beginning to end?
Thanks for your suggestion!
I love your content because it's very well explained and I can practice my English with your pronunciation. Cheers!
Thank you! That's awesome to hear!
Cheers to Feature Transformer, thanks for sharing this Kevin
You're welcome!
Thank you so much, this was a super clear and simple explanation.
Thanks so much for your kind words!
This is an excellent explanation!
Thank you!
Learned something new. Thanks heaps
Great to hear!
Please upload more videos like this. Thanks for this great content!! 🙏
Glad you like it! I will be uploading 2 more tips every week (Tuesdays and Thursdays) until I reach 50 tips. You can find all of them in this playlist: ruclips.net/p/PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6
Hey man, if I have a function that does a bunch of regex operations (.str.extract, etc.), can I put that into a FunctionTransformer?
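For anyone with the same question, here is a minimal sketch of what wrapping a regex step might look like. The `extract_title` helper, the `Name` column, and the sample rows are all made up for illustration; the only scikit-learn piece is `FunctionTransformer`, which by default passes a DataFrame through to the wrapped function unchanged.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def extract_title(df):
    # Hypothetical helper: pull a title such as "Mr" or "Mrs" out of a "Name" column
    out = df.copy()
    out["Title"] = out["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)
    return out[["Title"]]


# Wrap the plain function so it can be used like any other transformer
title_extractor = FunctionTransformer(extract_title)

# Made-up sample data
X = pd.DataFrame(
    {"Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"]}
)
titles = title_extractor.fit_transform(X)
print(titles)
```

This seems to fit FunctionTransformer well because the regex step does not learn anything from the data, so there is nothing that needs to be stored during fit.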
thanks for uploading such great videos...
Thank you!
Helpful..!!
Thanks Atul!
how is this different from TransformerMixin?
thanks
FunctionTransformer is simpler to use, but TransformerMixin is more flexible. Hope that helps!
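To make the difference concrete, here is a rough sketch. One common reason to write a class with TransformerMixin is that the transformer needs to learn something during fit, which a plain wrapped function does not do. `PercentileClipper` and the sample arrays are hypothetical and only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

# Stateless: FunctionTransformer just wraps a function, nothing is learned in fit
log_transformer = FunctionTransformer(np.log1p)


# Stateful: a hypothetical custom class that learns clipping bounds during fit
class PercentileClipper(TransformerMixin, BaseEstimator):
    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Bounds are learned from whatever data is passed to fit
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # New data is clipped to the bounds learned during fit
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


# Made-up sample data
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[0.0], [500.0]])

logged = log_transformer.fit_transform(X_train)
clipper = PercentileClipper().fit(X_train)
clipped_test = clipper.transform(X_test)
```

The custom class takes more code, but the bounds it applies to `X_test` come only from `X_train`, which is the kind of fit/transform separation a wrapped function does not give you.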
Hi sir, can you provide an example of when using pandas instead of sklearn leads to data leakage?
Sure! If you do missing value imputation on the whole dataset (before splitting the dataset as part of your model evaluation procedure), data leakage will result.
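A small sketch of that scenario, using made-up data with a single hypothetical `age` column. The first version fills missing values with a mean computed from every row, so the rows that later land in the test set influence the fill value. The second version splits first and learns the fill value from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up data: one numeric column with two missing values
X = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0, 50.0, 60.0, np.nan, 1000.0]})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

# Leaky: the fill value is computed from every row, including future test rows
leaky_fill = X["age"].mean()
X_leaky = X.fillna(leaky_fill)

# Leak-free: split first, then learn the fill value from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learns the training mean
X_test_imp = imputer.transform(X_test)        # reuses the training mean
```

After fitting, `imputer.statistics_` holds the mean learned from `X_train`, and that same value is what gets applied to `X_test`.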
Thank you, sir. Another question, if I may: the data leakage you described isn't caused by using pandas instead of sklearn, but by imputing before splitting the data. Can I say that I can use either pandas or sklearn for preprocessing, as long as I first split the data into train, validation, and test sets? Thank you in advance.
That's technically true, but it misses the bigger picture. pandas lacks separate fit and transform steps, and so your code will quickly become overly complex if you want to do multiple different transformations within pandas without data leakage. And if there are any transformations you need to do that pandas doesn't offer, it's a pain to combine transformations from pandas with transformations from scikit-learn. Finally, it's completely impractical to do cross-validation (without data leakage) if your transformations are done in pandas (depending on the exact nature of the transformation). And if you can't use cross-validation, you also can't do hyperparameter tuning with GridSearchCV. Thus what you are saying is not technically incorrect, but it also means you are not going to be able to use some of the most important parts of scikit-learn. Hope that helps!
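As a rough illustration of the point about cross-validation and GridSearchCV, here is a minimal sketch on synthetic data. The dataset, the choice of LogisticRegression, and the parameter grid are assumptions made for the example. Because the imputer and scaler live inside the pipeline, they are re-fit on the training portion of each fold, so the held-out fold never influences the preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 100 rows, 3 features, roughly 10% missing values
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Preprocessing and model in one object, so each CV fold fits them on its own training rows
pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())

# Tune a preprocessing setting and a model setting together
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

Doing the same search with pandas-based imputation would mean recomputing the fill values by hand inside every fold for every parameter combination.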
Thank you very much for that very comprehensive explanation, Mr. Kevin. I guess I expected to get away with just using pandas, but that turns out to be inefficient. Time to use the power of sklearn. You make very good content. Appreciate it.