Cross Validation in Scikit Learn

  • Published: 4 Nov 2024

Comments • 51

  • @TD-ph3wb
    @TD-ph3wb 3 years ago

    your voice is extremely soothing

  • @ninjaduck3534
    @ninjaduck3534 3 years ago +2

    Super good, thank you for that clear run through!

  • @alejandrodecoud7319
    @alejandrodecoud7319 1 year ago

    Thanks, it was very useful! A masterclass by young Hugh Grant.

  • @samuelpradhan1899
    @samuelpradhan1899 3 months ago

    Is it required to train the model on the entire data after cross-validation?

  • @ericsuda4143
    @ericsuda4143 3 years ago

    Hey man, first of all great vid! One doubt tho: if I need to normalize or scale my data, should I do it beforehand on my whole training dataset, or should I normalize/scale within each fold, using only the subset of the training data being extracted?

    • @DataTalks
      @DataTalks 3 years ago +1

      You should normalize within each fold if you are doing cross validation! You should do your full training procedure on each fold of the cross validation, with normalization, feature selection, etc. included.
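
      For example, a minimal sketch of that idea (not code from the video), using a scikit-learn Pipeline so the scaler is re-fit on each fold's training split only:

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      X, y = load_iris(return_X_y=True)

      # The pipeline is cloned per fold: StandardScaler is fit on that fold's
      # training portion and then applied to its held-out portion.
      pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
      scores = cross_val_score(pipe, X, y, cv=5)
      print(scores.mean(), scores.std())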

    • @ericsuda4143
      @ericsuda4143 3 years ago

      @@DataTalks got it! Thanks

  • @armelsokoudjou8696
    @armelsokoudjou8696 3 years ago

    I used the following code:
    X, y = np.arange(20).reshape((10, 2)), np.arange(10)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=111)
    kf = KFold(n_splits=4, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X_train):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X_train[train_index], X_train[test_index]
        y_train, y_test = y_train[train_index], y_train[test_index]
    But I got the following message: index 7 is out of bounds for axis 0 with size 6
    What could be the reason?
    Thanks in advance
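
    (For reference, a sketch of the usual fix: give the fold slices new names so X_train and y_train are not overwritten inside the loop. After the first iteration they shrink to 6 rows, which is exactly what the error reports.)

    import numpy as np
    from sklearn.model_selection import KFold, train_test_split

    X, y = np.arange(20).reshape((10, 2)), np.arange(10)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=111)
    kf = KFold(n_splits=4, random_state=1, shuffle=True)
    for train_index, val_index in kf.split(X_train):
        print("TRAIN:", train_index, "VAL:", val_index)
        # Keep X_train / y_train intact so later folds can still index into them
        X_tr, X_val = X_train[train_index], X_train[val_index]
        y_tr, y_val = y_train[train_index], y_train[val_index]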

  • @manojachari3682
    @manojachari3682 2 years ago

    In cv scores(), does it give training accuracy? If yes, can we get the training accuracy at each split?

  • @liuchengyu5420
    @liuchengyu5420 5 years ago +1

    Hi,
    A question for you! We run CV, which tells us the performance of the model. This by itself doesn't give us a model; we eventually still need to use the .fit() function to train our model on the training set. What's the point of doing CV in this case?
    Moreover, does it matter which kind of CV we use, since the model is not actually built by CV?

    • @DataTalks
      @DataTalks 5 years ago

      Great question! CV is used to get a more stable estimate of your model's performance, so it is not really needed to fit a model (unless you want to ensemble the models trained during CV, which is another topic).
      The type of CV matters for different datasets, so that the estimate of model performance stays unbiased.
      All this being said, if you have a ton of homogeneous data, CV is probably overkill (but you can always use a bootstrap power calculation to figure that out too -- shameless plug -- ruclips.net/video/G77qfPVO6x8/видео.html)
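
      As a rough sketch of that split of roles (illustrative estimator, not the notebook from the video): CV gives the performance estimate, and a single .fit() gives the model you keep.

      from sklearn.datasets import load_iris
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      X, y = load_iris(return_X_y=True)
      clf = DecisionTreeClassifier(random_state=0)

      # 1. Cross validation: an estimate of how this model will perform
      scores = cross_val_score(clf, X, y, cv=5)
      print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

      # 2. The model you actually use: fit once on all the training data
      clf.fit(X, y)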

  • @backgroundnoiselistener3599
    @backgroundnoiselistener3599 5 years ago +1

    Here's my question.
    When we train our model using StratifiedKFold we actually get "k" models in return and can calculate the accuracy of each of them. But how do we get one final model instead of these "k" models?
    I've read that we take the average of these models, but how do we take the average of a model?
    To put it more simply, how can we use StratifiedKFold to make a final model?

    • @DataTalks
      @DataTalks 5 years ago

      Great question! Ultimately you will find the hyperparameters that you like best from the StratifiedKFold runs and then retrain the model with those hyperparameters on the full data. Hope that helps!
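
      A minimal sketch of that idea (illustrative model and grid): GridSearchCV scores each setting with StratifiedKFold, and with refit=True (the default) the best setting is retrained on the full data for you.

      from sklearn.datasets import load_iris
      from sklearn.model_selection import GridSearchCV, StratifiedKFold
      from sklearn.svm import SVC

      X, y = load_iris(return_X_y=True)

      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
      search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv)  # refit=True by default
      search.fit(X, y)

      print(search.best_params_)
      final_model = search.best_estimator_  # already retrained on all of X, y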

  • @marinovik7954
    @marinovik7954 4 years ago

    Really nice content. Thanks a lot!

  • @hangxu7967
    @hangxu7967 6 years ago +1

    Thanks for your video, it helps me a lot. By the way, could you zoom in on your code? It is not easy to read on an 11-inch laptop. Thanks.

    • @DataTalks
      @DataTalks 6 years ago

      Absolutely, you are definitely not alone. I'll try to make the text bigger in subsequent vids!

  • @dikshyasurvi6869
    @dikshyasurvi6869 3 years ago

    Do you know anything about forecast.baseline?

  • @GWebcob
    @GWebcob 3 years ago

    Thank you!

  • @VladyVeselinov
    @VladyVeselinov 7 years ago

    Lovely presentation, would love to see more ;)

  • @rafaelaraujo5988
    @rafaelaraujo5988 2 years ago

    Thanks for the amazing video, I used it for a project and found a peculiar "problem", and came back to notice that it happens in your video too. When you use mean + 2*std the value is bigger than 1; is that normal?

    • @DataTalks
      @DataTalks 2 years ago

      Great question! The interval we calculate has a max greater than 1, which is a bit silly because the score can't be greater than 1. This is because we assume a symmetric distribution (a normal distribution centered around the mean of the scores). You don't need to do this, however. My favorite confidence intervals are bootstrap confidence intervals, which don't have this type of behavior. (Check out my series here for the full course: ruclips.net/video/uWLMtCtsHmc/видео.html&ab_channel=DataTalks)
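
      For example, a sketch of a percentile bootstrap over the fold scores (illustrative, not the exact code from the linked series); because it only resamples observed scores, the interval cannot exceed 1:

      import numpy as np
      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      X, y = load_iris(return_X_y=True)
      scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

      # Resample the fold scores with replacement and keep the middle 95%
      # of the resampled means.
      rng = np.random.default_rng(0)
      boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
                    for _ in range(10_000)]
      low, high = np.percentile(boot_means, [2.5, 97.5])
      print("95%% bootstrap CI: [%.3f, %.3f]" % (low, high))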

    • @umasharma6119
      @umasharma6119 2 years ago

      Can you please tell me which cross validation technique is used in cross_val_score?

    • @DataTalks
      @DataTalks 2 years ago

      @@umasharma6119 Without specifying, it uses 5-fold. However, you can choose the CV technique via the cv parameter:
      scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
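
      A quick sketch of both options (illustrative data; per those docs, an int sets the number of folds and a CV splitter object can be passed directly):

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import KFold, cross_val_score

      X, y = load_iris(return_X_y=True)
      clf = LogisticRegression(max_iter=1000)

      scores_default = cross_val_score(clf, X, y)       # 5-fold (stratified for classifiers)
      scores_ten = cross_val_score(clf, X, y, cv=10)    # 10-fold
      scores_custom = cross_val_score(clf, X, y,
                                      cv=KFold(n_splits=5, shuffle=True, random_state=0))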

  • @DrJohnnyStalker
    @DrJohnnyStalker 6 years ago

    If you do model selection or hyperparameter tuning, the CV estimate isn't unbiased for the selected model. Should we hold out a separate test set to evaluate the best model on, to get a truly unbiased performance estimate?

  • @alice9737
    @alice9737 10 months ago

    So if we use train_test_split, do we also need to use cross-validation?

    • @DataTalks
      @DataTalks 9 months ago +1

      Great question! You will always need a test set - that's what's going to tell you how well your model will do in production. Cross validation is a way to get a validation set when you have a smaller amount of data, where your validation set is what you use to optimize hyperparameters.
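
      A sketch of that division of labor (illustrative model and parameter values): hold out a test set once, use cross validation on the training portion to pick hyperparameters, and only touch the test set at the very end.

      from sklearn.datasets import load_iris
      from sklearn.model_selection import cross_val_score, train_test_split
      from sklearn.neighbors import KNeighborsClassifier

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                          random_state=0)

      # Cross validation on the training data plays the role of the validation set
      cv_results = {}
      for k in (1, 5, 15):
          model = KNeighborsClassifier(n_neighbors=k)
          cv_results[k] = cross_val_score(model, X_train, y_train, cv=5).mean()
      best_k = max(cv_results, key=cv_results.get)

      # One final, one-time check on the held-out test set
      final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
      print("test accuracy:", final.score(X_test, y_test))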

    • @alice9737
      @alice9737 9 months ago

      @@DataTalks thank you for clearing that up for me!

  • @MasterofPlay7
    @MasterofPlay7 4 years ago

    If you pass the cross_val_predict result as y_pred to classification_report(y_pred, y), it outputs 3 classes: 0, 1, 2. Why does it output 3 classes instead of 2, since the iris dataset is binary classification?

    • @DataTalks
      @DataTalks 4 years ago

      Iris has three classes: each is a species of plant :)

    • @MasterofPlay7
      @MasterofPlay7 4 years ago

      @@DataTalks yeah I got mind fk just realized the fact it has 3 classes....

    • @DataTalks
      @DataTalks 4 years ago

      @@MasterofPlay7 No problem! You'd be totally right if there were two!

    • @MasterofPlay7
      @MasterofPlay7 4 years ago

      @@DataTalks thx for the help! Yeah, I was totally confused that the confusion matrix is 3x3... But for cross_val_predict, is the output the average of the n folds' predictions? How come it only outputs 1 set of predictions, whereas cross_val_score outputs multiple scores (i.e. accuracy)?

    • @MasterofPlay7
      @MasterofPlay7 4 years ago

      @@DataTalks Shouldn't it output the metrics for each iteration of the k folds? Hence if I have cv=3 folds, shouldn't it output 3 classification summaries and confusion matrices?
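
      (For what it's worth, a sketch of what cross_val_predict returns: one out-of-fold prediction per sample, each made by the fold model that did not train on that sample, so there is a single set of predictions and a single report/confusion matrix rather than one per fold.)

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import classification_report, confusion_matrix
      from sklearn.model_selection import cross_val_predict

      X, y = load_iris(return_X_y=True)

      # Each of the 150 samples is predicted exactly once, by the split that held it out
      y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=3)
      print(y_pred.shape)                 # (150,)
      print(confusion_matrix(y, y_pred))  # one 3x3 matrix for the whole dataset
      print(classification_report(y, y_pred))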

  • @guillermoabrahamgonzaleztr4852
    @guillermoabrahamgonzaleztr4852 3 years ago

    I went through the complete video, and I still don't know how to perform cross validation using stratified training sets...?

    • @TaiHeardPostAudio
      @TaiHeardPostAudio 3 years ago

      'cross_val_score' in the model selection module estimates the accuracy of a classifier using stratified k-fold by default. The 'cv' parameter adjusts the number of folds; the default is 5. If you need to compare different classifiers, most likely this is the way. The actual StratifiedKFold class seems most useful for making charts, etc. Same with the standard KFold. This is all mentioned very briefly near the end of the video; blink and you miss it.
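
      A small sketch of both uses (illustrative data and model): cross_val_score for comparing classifiers, and an explicit StratifiedKFold loop when you want per-fold output such as charts.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import StratifiedKFold, cross_val_score

      X, y = load_iris(return_X_y=True)
      clf = LogisticRegression(max_iter=1000)

      # Comparing classifiers: stratification happens under the hood
      print(cross_val_score(clf, X, y, cv=5))

      # Explicit StratifiedKFold when you want to work with each fold yourself
      skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
      for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
          clf.fit(X[train_idx], y[train_idx])
          print("fold", i, "accuracy:", clf.score(X[test_idx], y[test_idx]))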

  • @fanwu281
    @fanwu281 6 years ago

    I just had a question: what's the difference between using clf.predict and clf.score?

    • @DataTalks
      @DataTalks 6 years ago +2

      Great question. Predict will return the prediction itself (if you are predicting house value, it will return the predicted values). Score will take the predictions one step further and compute how well you have done on a set of labeled data (generally your accuracy or R squared).
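
      A short sketch of the difference (illustrative data):

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression

      X, y = load_iris(return_X_y=True)
      clf = LogisticRegression(max_iter=1000).fit(X, y)

      print(clf.predict(X[:5]))  # the predicted labels themselves
      print(clf.score(X, y))     # mean accuracy of the predictions against y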

    • @fanwu281
      @fanwu281 6 years ago

      Thank you sooo much! BTW, I just tried an example which gave me a predict() score of 94% and a score() score of 81%. What information can I get from these two scores? Which score should I use to test the model? To be specific, I used grid search to tune parameters first, then got the predict() scores of all classifications, and also used score() to get scores. Lots of questions, thank you in advance! :-)

    • @DataTalks
      @DataTalks 6 years ago +1

      It will be hard for me to debug without seeing the full code. But predicting and then scoring should be the same as just scoring (as long as you have the same data and score measure). So do double check. In the future just go ahead and use score. It's the method that is used behind the scenes in the GridSearchCV method and is built for getting the model score.
      Hope that helps. If you want to chat more feel free to message me through YT :)

  • @shankrukulkarni3234
    @shankrukulkarni3234 4 years ago

    Nice video, can you plz share the code with me?

  • @syyamnoor9792
    @syyamnoor9792 5 years ago +2

    I am not gay but I have to say that you are one attractive personality

  • @rasu84
    @rasu84 6 years ago +1

    Good video....but why make the video in the kitchen :D:D

  • @desertrose00
    @desertrose00 2 years ago

    Hard to focus on what he is saying because the teacher is too cute LOL
    Focus... focus .. focus!!!

  • @cyberdudecyber
    @cyberdudecyber 5 years ago

    Thank you!

  • @kokkiniarkouda1
    @kokkiniarkouda1 7 years ago

    In k-fold, when using the function kf.split(X), how do we separate the data from the target (the X's and y's)?
    I mean, the function splits the X array, but where do we define our y's?

    • @DataTalks
      @DataTalks 7 years ago

      Great question! kf.split will return indices for the training and the test set, and you get to use those to index into the arrays that you split previously. Generally you will use pandas to do that splitting beforehand.
      That being said, there are particular types of data that need different kinds of splits (like class-imbalanced or time series data).
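
      For instance, a small sketch (illustrative arrays): the index arrays from kf.split(X) slice X and y in parallel.

      import numpy as np
      from sklearn.model_selection import KFold

      X = np.arange(20).reshape(10, 2)
      y = np.arange(10)

      kf = KFold(n_splits=5)
      for train_index, test_index in kf.split(X):
          # The same index arrays select rows from the features and the target
          X_tr, y_tr = X[train_index], y[train_index]
          X_te, y_te = X[test_index], y[test_index]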

    • @kokkiniarkouda1
      @kokkiniarkouda1 7 years ago

      thanks