Complete Guide to Cross Validation

  • Published: 1 Jul 2024
  • In this video Rob Mulla discusses the essential skill that every machine learning practitioner needs to know: cross validation. We go through examples of scikit-learn cross validation in Python code; sklearn has many cross validation splitters built in. Without cross validation it's easy to overfit your model and overstate its predictive power. This video is a must watch for anyone trying to learn machine learning.
    Timeline:
    00:00 Intro
    01:37 Setup
    03:41 The Dataset
    06:56 The wrong way
    10:20 Holdout check and baseline
    12:50 Train/Test Split
    15:25 Cross Validation
    24:14 Applying Cross Validation
    Notebook: www.kaggle.com/robikscube/cro...
    Follow me on twitch for live coding streams: / medallionstallion_
    Speed Up Your Pandas Code: • Make Your Pandas Code ...
    Working with Audio data in Python: • Audio Data Processing ...
    * YouTube: youtube.com/@robmulla?sub_con...
    * Discord: / discord
    * Twitch: / medallionstallion_
    * Twitter: / rob_mulla
    * Kaggle: www.kaggle.com/robikscube
    #python #machinelearning #datascience

Comments • 81

  • @casualgamer91
    @casualgamer91 2 years ago +19

    Hi Rob, thanks for the nice explanation of cross-validation. After you run the cv method, what do you use as your final model to predict new data? My 1st thought is to use the best performing fold, but that seems to defeat the purpose of cross-validation and would be prone to overfitting. Would you use all 5 folds then average the predictions similar to how you calculated the out of fold AUC score? Or perhaps just train on all your data since you have an idea that your model will perform at around 0.83 AUC with unseen data?

    • @robmulla
      @robmulla 2 years ago +14

      So glad you liked it. Your intuition is exactly correct. You would not want to use only the model from the best fold, because that model is overfit to just that one split. Many times people average across folds, which works well, especially when stacking. However, most GMs have perfected the art of re-training the final model on all the data. This sometimes requires additional parameter tuning and is done much more on intuition, for instance training for fewer epochs because you have more data to train on. Hope that helps!
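A minimal sketch of the two options mentioned in this reply (averaging the fold models versus re-training once on all the data). The synthetic dataset, the `LogisticRegression` model and every variable name are assumptions for illustration; this is not the code from the video.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption; the video uses a different dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_new = X[:5]  # pretend these rows are new, unseen data

base_model = LogisticRegression(max_iter=1000)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_models = []
oof = np.zeros(len(y))  # out-of-fold predicted probabilities
for train_idx, val_idx in skf.split(X, y):
    model = clone(base_model).fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    fold_models.append(model)

# The out-of-fold score is the estimate of performance on unseen data.
print(f"Out-of-fold AUC: {roc_auc_score(y, oof):.3f}")

# Option 1: average the predictions of the five fold models.
avg_pred = np.mean([m.predict_proba(X_new)[:, 1] for m in fold_models], axis=0)

# Option 2: re-train a single final model on all the data.
final_model = clone(base_model).fit(X, y)
final_pred = final_model.predict_proba(X_new)[:, 1]
```

Either way, the cross-validation score is used only to judge the modelling approach; the model for the best single fold is never picked out.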

    • @casualgamer91
      @casualgamer91 2 years ago +4

      Thanks for the reply! It'd be great if you could make a future video on retraining the final model on all the data with hyperparameter tuning with perhaps a concrete example. Appreciate your hard work on educating us!

    • @user-nu2vd6qf6y
      @user-nu2vd6qf6y 6 months ago +4

      Can you please shed more light on this, Rob? I now understand that CV is a robust method of validating model performance. But inside the CV loop you fit the model 5 times (according to the number of folds) using 5 different train/validation datasets. Does that mean you're training only 1 model 5 consecutive times, so that each time the model has knowledge of more data? Or do you actually make 5 different models using those 5 different datasets?
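As a hedged illustration of this question: in a typical scikit-learn cross-validation loop a new, unfitted estimator is created for every fold, so the result is five independent models rather than one model that keeps accumulating data. The sketch below uses `sklearn.base.clone` for that; the data, the `DecisionTreeClassifier` and the variable names are assumptions, not the notebook from the video.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
template = DecisionTreeClassifier(random_state=0)

models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # clone() returns a new, unfitted estimator with the same parameters,
    # so nothing learned in an earlier fold carries over to the next one.
    models.append(clone(template).fit(X[train_idx], y[train_idx]))

# Number of distinct fitted model objects.
print(len({id(m) for m in models}))
```

Even without `clone`, calling `fit` again on most scikit-learn estimators discards the previous fit unless an option such as `warm_start` is enabled.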

    • @gunasekharvenkatachennaiah1033
      @gunasekharvenkatachennaiah1033 4 months ago

      @casualgamer91
      Here is the sample code for that:
      ```
      import numpy as np
      import pandas as pd
      from sklearn.base import clone
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import KFold
      from sklearn.pipeline import Pipeline
      from sklearn.tree import DecisionTreeClassifier

      # Assumes `df` (a DataFrame with a 'target' column) and `preprocessing`
      # (a scikit-learn transformer) are already defined.
      x = df.drop(columns=['target'])
      y = df['target']

      kf = KFold(n_splits=5, shuffle=True, random_state=0)
      models = []
      tr_acc = []
      ts_acc = []
      for train_index, test_index in kf.split(x, y):
          # Build a fresh pipeline for every fold; clone() keeps the stored
          # models from sharing one preprocessing object.
          pipe = Pipeline(steps=[
              ('preprocessing', clone(preprocessing)),
              ('classification model', DecisionTreeClassifier())
          ])
          x_train, x_test = x.iloc[train_index], x.iloc[test_index]
          y_train, y_test = y.iloc[train_index], y.iloc[test_index]
          pipe.fit(x_train, y_train)
          y_train_preds = pipe.predict(x_train)
          y_test_preds = pipe.predict(x_test)
          tr_acc.append(accuracy_score(y_train, y_train_preds))
          ts_acc.append(accuracy_score(y_test, y_test_preds))
          # Keep the fitted model of this fold.
          models.append(pipe)

      def predict_avg(models, new_data):
          # Average the predictions of all fold models for a single new sample.
          # Note: averaging predicted labels is only meaningful for regression or
          # binary targets; for multiclass problems average predict_proba instead.
          new_data_df = pd.DataFrame([new_data], columns=x.columns)
          predictions = [model.predict(new_data_df) for model in models]
          return np.mean(predictions)

      new_data = [5.9, 3.0, 5.1, 1.8]
      average_prediction = predict_avg(models, new_data)
      print(average_prediction)
      # Output reported by the commenter:
      # Ground Truth:- 2.0
      # Predicted Truth:- 2.0
      ```

  • @eduardomanotas7403
    @eduardomanotas7403 11 months ago

    Rob, you are so great! I can come back to your videos many times and never get bored. I need more teachers like you lol. You are a guy who really improves every day. Thanks for supporting the community.

  • @ye-ym5jo
    @ye-ym5jo 1 year ago

    This is the best CV explanation I've ever watched, and it finally cleared up my confusion. Thanks a lot, sir.

  • @robinsonrios3199
    @robinsonrios3199 8 months ago

    Man, I really love your coding performance and all your explanations. You helped me a lot.

  • @MOAMA82
    @MOAMA82 7 months ago

    The best explanation of CV on YouTube. Rob is an ML beast, thank you.

  • @kalianeeboodoo4750
    @kalianeeboodoo4750 1 year ago +2

    Really great tutorial, so thorough and simple to understand. You're a natural tutor

  • @srimantamukherjee7090
    @srimantamukherjee7090 1 month ago

    Excellent and elegant flow of concepts and implementation.

  • @davidm6624
    @davidm6624 2 years ago +5

    Wow, thanks. I'm going thru ML on the theoretical side rn and it's refreshing to see such applied content! It's a long road ahead, but I believe that if you keep posting vids of such a) relevancy and b) quality with a good c) frequency in a couple of months you will be much bigger. Thanks again!

    • @robmulla
      @robmulla 2 years ago +1

      Thanks so much David! Glad you found it helpful.

  • @user-es3wr6uf2l
    @user-es3wr6uf2l 1 year ago +2

    This is amazing. It was so helpful. Thank you Rob!

    • @robmulla
      @robmulla 1 year ago +1

      Glad you found it helpful! Cross validation is one of the most important things to master.

  • @ronbzalen
    @ronbzalen 1 year ago +3

    this is pure gold! thanks for this awesome content !!

    • @robmulla
      @robmulla 1 year ago +1

      Glad you enjoyed it!

  • @berkguney8992
    @berkguney8992 2 years ago +2

    Great content as always Rob.

    • @robmulla
      @robmulla 2 years ago

      Thanks Berk! Glad you liked it. Tell your friends. :)

  • @gauravmalik3911
    @gauravmalik3911 1 year ago +4

    This is the only video I've found on the net that explains not only the cross validation part, of course, but also how to separate and divide our data into different sets. Most of the tutorials and articles I've found divide the data into just two parts and perform evaluation only on the test set, which is wrong, but here I've found a proper explanation of the whole process. Awesome video.
    In the future I would love to know how we can apply different classification metrics when we need high recall (for example in the case of cancer predictions) or high precision, etc.
    Again, thank you for the detailed explanation.

    • @robmulla
      @robmulla 1 year ago +1

      Thanks so much for this thoughtful comment. Even though this is one of my less popular videos I think the topic is really important. Cross validation is super crucial if you want your model to be robust and work on unseen data. A video about classification metrics is a great idea. Thanks again for watching.

  • @KingJadi
    @KingJadi 2 years ago +1

    Great video as always!

    • @robmulla
      @robmulla 2 years ago

      Thanks again! I appreciate the feedback.

  • @SuhasKM-tl1rg
    @SuhasKM-tl1rg 5 months ago

    My God Rob, you are a blessing man

  • @ashraf_isb
    @ashraf_isb 2 months ago +1

    lol truly a master class, thanks Rob xD

  • @ramoda13
    @ramoda13 11 months ago

    Thanks, great video.

  • @JaswinderSingh-wn6lc
    @JaswinderSingh-wn6lc 11 months ago

    Hey Rob! Awesome video as always! Can you please explain why did you pick the probabilities for the positive class only? I have been making myself crazy over this ..xD
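One likely reason, sketched under assumptions (synthetic data, a `LogisticRegression` stand-in, illustrative variable names): for a binary target, `roc_auc_score` expects a single score per sample for the class with the greater label, while `predict_proba` returns one column per class, so only the positive-class column is passed in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)  # shape (n_samples, 2); columns follow clf.classes_
pos_scores = proba[:, 1]          # probability of the positive class (label 1)

auc = roc_auc_score(y_val, pos_scores)

# The class-0 column carries the same information (it is 1 - pos_scores) but
# ranks the samples in the opposite order, so it gives the mirrored score.
auc_flipped = roc_auc_score(y_val, proba[:, 0])
print(f"AUC: {auc:.3f}, AUC using the class-0 column: {auc_flipped:.3f}")
```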

  • @hussainsalih3520
    @hussainsalih3520 1 year ago +1

    Amazing, keep doing awesome videos.

    • @robmulla
      @robmulla 1 year ago +1

      Thank you! Will do! Cross validation is really important to understand in ML!

  • @user-es2np7gb4f
    @user-es2np7gb4f 1 year ago +1

    ❤❤❤ You are one of my best professors

    • @robmulla
      @robmulla 1 year ago

      Thanks! Not a professor. Just a guy.

  • @raheemnasirudeen6394
    @raheemnasirudeen6394 1 year ago +1

    This is superb, I wish to be like Rob one day.

    • @robmulla
      @robmulla 1 year ago

      Thanks. I’m nothing special. Just sharing what I’ve learned.

  • @vivekpadman5248
    @vivekpadman5248 1 year ago +1

    Thanks for this detailed walkthrough of cross validation, I really learnt a lot. Can you, if possible, make a video on ensembling techniques for optimizing machine learning models, and other general score-increasing techniques?

    • @robmulla
      @robmulla 1 year ago +1

      Glad you liked it. Optimization for blending and ensembling is a good video idea. I’ll keep it in mind for future videos.

  • @islamibrahim4382
    @islamibrahim4382 1 year ago

    Hi Rob really great content, I am wondering if you can make something for retail ecommerce as I want to be predicting sessions, orders ,revenue and conversion rate based on previous data history which is going to help in spending wisely and get the most out of it

  • @ismailonurylmaz192
    @ismailonurylmaz192 6 months ago

    Hi Rob, thanks for this great video. It is highly didactic! I can not clarify one thing, at the end of any cross validation process, do we have to report only one metric which is the averaging of 5 different clf like in this video?

  • @poojagoyal3647
    @poojagoyal3647 1 year ago +2

    Hey rob ...great explanation on the cross fold technique....can you do a further video on how to apply these techniques in case of deep learning model for image classification problem...it will be super helpful

    • @robmulla
      @robmulla 1 year ago

      Thanks for the suggestion I’ll add it to the list of ideas for future videos! Thanks for watching.

  • @FilippoGronchi
    @FilippoGronchi 2 years ago +1

    Always awesome: the clarity, the depth of the explanation, the examples. Just one question: what is `.sample(frac=1)` for when you prepare the holdout set at the beginning? Thanks a lot.

    • @robmulla
      @robmulla 2 years ago

      Thanks for the feedback! `.sample(frac=1)` is just a way of randomly shuffling a dataframe.
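A small sketch of what `.sample(frac=1)` does, using a toy DataFrame that is assumed here purely for illustration (the `random_state=529` value is the one quoted elsewhere in this thread):

```python
import pandas as pd

# Toy data (assumption); any DataFrame behaves the same way.
df = pd.DataFrame({"id": range(10), "value": list("abcdefghij")})

# frac=1 samples 100% of the rows without replacement,
# which is a random permutation of the rows.
shuffled = df.sample(frac=1, random_state=529)

# The original index labels travel with the rows; reset them if a clean
# 0..n-1 index is wanted after shuffling.
shuffled_reset = shuffled.reset_index(drop=True)
print(shuffled_reset.head())
```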

  • @MrMeap12345
    @MrMeap12345 1 year ago +1

    Thank you 🙌🏻 this is great. What steps might be next in regard to producing a model you would deploy in a production environment?

    • @robmulla
      @robmulla 1 year ago

      Thanks for watching, feature engineering is #1. Then parameter tuning and pre/post processing. Good luck!

  • @TheKekko16
    @TheKekko16 1 year ago

    Hi, thank you for the video. The concept of cross validation is explained really well. I'm writing my bachelor's thesis on support vector machines for classification. To implement my models I used the Python docplex library, but now I can't perform the cross validation because I don't know how to apply the scikit-learn methods (for example the fit method) to customized models. Do you know what I should do?

  • @hasanovmaqsud
    @hasanovmaqsud 1 year ago +1

    Hello Rob! Thank you very much for such a valuable tutorial! Can you please elaborate more on cross validation used for Time Series forecasting? Thank you very much!

    • @robmulla
      @robmulla 1 year ago +1

      Thanks Maqsud. I’m thinking about making a video that goes into detail about it following up on my forecasting video. Thanks for the encouraging comment.

  • @JordiRosell
    @JordiRosell 1 year ago +1

    GroupTimeSeriesSplit!
    It's implemented in mlxtend and sklego libraries and seen in some kaggle notebooks and stackoverflow answers.

    • @robmulla
      @robmulla 1 year ago +1

      Interesting. I’ll have to give it a look.

  • @kalianeeboodoo4750
    @kalianeeboodoo4750 1 year ago

    Hi Rob, when the data is highly imbalanced (e.g. 920 datapoints for class 0 and 80 datapoints for class 1), do we not have to balance the dataset using a technique such as SMOTE and generate synthetic data to equalize both classes, or does CV do the job here?

  • @koleshjr
    @koleshjr 2 years ago +1

    Amazing video. You should do a time series one next.

    • @robmulla
      @robmulla 2 years ago +2

      Glad you liked it. I'll put time series on the list for future videos. If you haven't already check out Konrad's videos on Abhishek's channel: ruclips.net/p/PL98nY_tJQXZmT9ZB59T0lsx0ZzzLrYdX4

  • @minister1005
    @minister1005 8 months ago

    Hi, thanks for such a thorough video explaining cross validation! I really enjoyed it ^^
    I have a question though. While defining the get_prep_data function, isn't '.sample(frac=1, random_state=529)' irrelevant since you already did 'holdout_ids = data.sample(500, random_state=529).index' ?

  • @risabb
    @risabb 1 year ago +1

    Thanks a lot Rob for making this video. It's really insightful. I have a quick question - At @11:39 you created a baseline (which is really helpful!) for binary classification. Would you recommend techniques for imbalanced multiclass classification as well? Thank you in advance :)

    • @robmulla
      @robmulla 1 year ago

      Great question. There are a lot of resources out there with regard to imbalanced datasets. It's a really common problem, but there isn't a one-size-fits-all solution. Generally just having more training data or simpler models is best, otherwise you will find yourself overfitting to the few positive samples you have.

  • @chrismiles9019
    @chrismiles9019 2 years ago +1

    Thank you very much. The visualizations are really great. At 22:30 you say that there is an even distribution of each class in each of the folds, but it appears to me that all the positive samples are in fold 3, maybe because the class was only positive in the first group of that dataset? Am I missing something? Thanks again, I’m really loving the videos.

    • @robmulla
      @robmulla 2 years ago +1

      You are exactly right. That is because it prioritizes not having overlapping groups. The very next thing I do in the video is shuffle the target class to give a better example of how StratifiedGroupKFold works in practice.

  • @devnull711
    @devnull711 1 year ago +1

    very good content

    • @robmulla
      @robmulla 1 year ago

      Thanks for watching. Glad you like it.

  • @maxscheijen
    @maxscheijen 2 years ago +2

    Great video Rob! A question: when dealing with regression problems, is the best approach simply to use KFold cross-validation? Or are there other CV methods that are also useful for regression problems?

    • @robmulla
      @robmulla 2 years ago

      Great question! If your data size is large, then typically KFold with shuffle=True is enough. If you are concerned about it you can always create a binary feature and use StratifiedKFold on that feature - or there are some other more advanced stratification packages out there: pypi.org/project/iterative-stratification/ - I could go over those in a future video if you think it might be helpful.

    • @maxscheijen
      @maxscheijen 2 years ago +1

      @robmulla Thanks for the response! I usually use KFold with a shuffle. At work I train a lot of regression models, and I was just curious whether there are general cross-validation best practices for those kinds of problems. So a future video on that would definitely be helpful! Another idea for a future video is how to identify underperformance of a model on slices/subsections of the data. No hurry though, I really like the videos and live streams. Keep it up!

    • @amanthinks374
      @amanthinks374 8 months ago

      @maxscheijen I would use pd.cut() to bin my y variable and use stratified k-fold with the binned variable as a category (a temporary categorical y variable), so that each validation fold has the different levels of the y variable represented in it. Obviously, discard this binned (cut) version of the y variable before you run fit().
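The binning idea from this sub-thread can be sketched as follows. Note the swap: this example uses `pd.qcut` (quantile bins) rather than `pd.cut` (equal-width bins) so that every bin is guaranteed enough rows for 5 folds; the skewed synthetic target and all names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
y = pd.Series(rng.lognormal(size=300), name="target")  # skewed continuous target

# Temporary categorical version of y, used only to build the splits.
y_binned = pd.qcut(y, q=5, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_means = []
val_bin_counts = []
for train_idx, val_idx in skf.split(X, y_binned):
    # Split on the bins, but train and evaluate on the original continuous y.
    y_val = y.iloc[val_idx]
    fold_means.append(y_val.mean())
    val_bin_counts.append(y_binned.iloc[val_idx].value_counts().sort_index().tolist())

print(val_bin_counts)  # rows per target bin in each validation fold
```

With a plain shuffled `KFold`, the folds of a small, skewed dataset can end up with noticeably different target distributions; stratifying on bins is one way to even that out.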

  • @maxidiazbattan
    @maxidiazbattan 2 years ago +1

    Amazing tutorial Rob, I usually split the data at the beginning of the whole process into folds, do you think that approach is fine? I don’t see that quite often in kaggle.

    • @robmulla
      @robmulla 2 years ago +1

      It's funny you say that. I typically do the splits at the beginning and when working in teams we use the same splits so we can share results for the exact same folds. I think this is actually really common on kaggle. With really small data you eventually want to change your random seed and try new folds at some point to make sure you aren't overfitting to a specific seed of folds. Hope that helps!
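One common way to fix the splits up front, sketched under assumptions (toy data, a `fold` column name and a `SEED` constant chosen here for illustration): assign every row a fold number once, then reuse or share that column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

SEED = 42  # on small datasets, try other seeds now and then to check the folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

df["fold"] = -1
for fold, (_, val_idx) in enumerate(skf.split(df, df["target"])):
    # Each row is in exactly one validation fold; record which one.
    df.loc[df.index[val_idx], "fold"] = fold

# The frame can now be saved and shared so that everyone evaluates on the
# same folds, e.g. validation rows for fold 0 are df[df["fold"] == 0].
print(df["fold"].value_counts().sort_index())
```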

    • @maxidiazbattan
      @maxidiazbattan 2 years ago

      @robmulla Thank you very much for the clarification Rob, yes, it helped me a lot.

  • @ifeanyinwobodo8530
    @ifeanyinwobodo8530 1 year ago +1

    Thanks for this video!
    Really insightful!
    Is it possible to use an ensemble of cross validated models for a single problem

    • @robmulla
      @robmulla 1 year ago

      Thanks for commenting. I’ve never heard of doing ensembles of cross validation but I don’t see why you couldn’t.

    • @ifeanyinwobodo8530
      @ifeanyinwobodo8530 1 year ago

      @robmulla ok thanks. I'd experiment with it

  • @user-kl5nx9qy8o
    @user-kl5nx9qy8o 8 days ago

    What about using these cross validation objects within GridSearchCV? How would you treat the creation of new features for each fold? Would you have to sacrifice GridSearchCV to be able to apply feature engineering with aggregate data on each fold? Your approach allows feature engineering on each fold, but it does not give the benefit of hyperparameter tuning as GridSearchCV does. Would the best advice be to use your approach to cross validation and apply feature engineering inside each fold loop, with the drawback of relying on manual hyperparameter tuning instead of methods such as GridSearchCV or RandomizedSearchCV?
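A hedged sketch of one way to get both: `GridSearchCV` accepts a splitter object through its `cv` parameter, and any transformer placed inside a `Pipeline` is re-fit on the training part of each fold. Here `StandardScaler` stands in for fold-level feature engineering; aggregate features would need a custom transformer, which is not shown. The data and the parameter grid are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Steps in the pipeline are fit on the training rows of each fold only,
# so statistics from the validation rows never leak into the features.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # step name + "__" + parameter
    cv=cv,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```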

  • @thespace7371
    @thespace7371 5 months ago

    Finally found a video series on data science that I can understand. Thank you!

  • @oilgas1016
    @oilgas1016 2 years ago +1

    State of the art for cross validation.

    • @robmulla
      @robmulla 2 years ago

      It definitely is important to do cross validation correctly!

  • @user-sv7tr4bt8l
    @user-sv7tr4bt8l 4 months ago

    Hey! I just noticed you didn't encode the categorical values or specify them in the model training. Is that possible with this algorithm?

  • @TylerMacClane
    @TylerMacClane 1 year ago +1

    Hello Rob
    What if the model you train at 9 min. in the video is overfit?
    You predict on the same data that you trained on,
    so you have a leak and therefore the predictions are very accurate.
    I haven't watched more than 10 minutes yet.
    I'll continue.
    👾
    Understood: at 11 min. everything fell into place))

    • @robmulla
      @robmulla 1 year ago +1

      Awesome! Glad the video was able to answer your question before I could.

  • @qpellidomombre
    @qpellidomombre 11 months ago

    Why don't you keep track of the error metrics on the training sets for each fold iteration? I was under the impression that this is necessary in order to notice overfitting.
    For example, if you have the following results for auc and K=3
    K=1: validation, 0.9 (training, 0.95)
    K=2: validation, 0.92 (training, 0.99)
    K=3: validation, 0.88 (training, 0.92)
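A sketch of how the training and validation scores in this kind of table could be collected, using `cross_validate` with `return_train_score=True`. The synthetic data and the `DecisionTreeClassifier` are assumptions for illustration, and the numbers it prints will differ from the example figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = cross_validate(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X, y,
    cv=cv,
    scoring="roc_auc",
    return_train_score=True,  # also score each model on its own training rows
)

# A large gap between training and validation AUC hints at overfitting.
for k, (tr, val) in enumerate(zip(scores["train_score"], scores["test_score"]), start=1):
    print(f"K={k}: validation, {val:.2f} (training, {tr:.2f})")
```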

  • @watcher8582
    @watcher8582 4 months ago

    Good video, but a bit sus that you say "I'm a dada scientist" 3 times (literally) in the first 70 seconds.

    • @robmulla
      @robmulla 4 months ago

      😒

    • @watcher8582
      @watcher8582 4 months ago

      @robmulla Ungrateful viewers eh

  • @muchammadfahd-a1985
    @muchammadfahd-a1985 1 year ago

    What happens if you use `from sklearn.model_selection import *`?