Should You Stop Splitting Your Data Like This?

  • Published: Sep 11, 2024
  • Randomly splitting your dataset is one of those recommendations that sound great in Machine Learning books and tutorials.
    But does it actually work?
    🔔 Subscribe for more stories: www.youtube.co...
    📚 My 3 favorite Machine Learning books:
    • Deep Learning With Python, Second Edition - amzn.to/3xA3bVI
    • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow - amzn.to/3BOX3LP
    • Machine Learning with PyTorch and Scikit-Learn - amzn.to/3f7dAC8
    Twitter: / svpino
    Disclaimer: Some of the links included in this description are affiliate links where I'll earn a small commission if you purchase something. There's no cost to you.

Comments • 39

  • @PritishMishra 1 year ago +15

    When I was doing image captioning, I ran into this exact problem (my latest video is on the same topic). Each image has five different captions, which is why the image column contained similar images, and, you guessed it, I did a random split. That put similar images in both the training and testing sets (a leak), and when I tested the model, it performed exceptionally well on the testing set (because the model had already seen the testing images during training). I was both happy and suspicious. Then it occurred to me that random splitting was not what I needed, so I grouped similar images, split the groups, and trained the model again. The model performed slightly worse than before. What a relief!

    • @underfitted 1 year ago +1

      Thanks for telling the story!

    • @user-dx1hh6dg7o 1 year ago +1

      The model performing worse is better in this case? How?

    • @3bdo3id 1 year ago

      @user-dx1hh6dg7o It was leaking information that it shouldn't have, so it was cheating [the model, of course, not the man 😁]
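The grouped split described in this thread can be sketched in a few lines. This is a minimal illustration rather than the commenter's actual code: it assumes each row carries an identifier for the image it belongs to, and the names `group_split`, `image_id`, and `caption` are hypothetical.

```python
import random
from collections import defaultdict


def group_split(rows, group_key, test_fraction=0.2, seed=0):
    """Split rows so that every row sharing a group lands in the same set."""
    # Collect rows by group (here: one group per image).
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Shuffle the group IDs, not the individual rows.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Reserve a fraction of the groups (at least one) for testing.
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test


# Hypothetical toy data: 10 images with 5 captions each.
rows = [
    {"image_id": f"img_{i}", "caption": f"caption {j} for image {i}"}
    for i in range(10)
    for j in range(5)
]

train, test = group_split(rows, group_key="image_id", test_fraction=0.2, seed=42)
train_images = {r["image_id"] for r in train}
test_images = {r["image_id"] for r in test}
```

Because the shuffle happens at the group level, no image can appear on both sides of the split, which is why the test score drops to a more honest value.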

  • @dannybee9068 1 year ago +2

    I'm a beginner, and I participated in a competition on a tabular dataset for a regression problem. The top solutions used KFold splitting to keep their train and test data separate, so that the test data would be more representative of the private dataset used to compute the leaderboard scores. That way, there is some correlation between the test set used for evaluation and the private dataset. I've never seen anything like that done before, and if someone has more information or links where I can read more about it, it would be greatly appreciated.
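For readers with the same question, here is a rough sketch of how K-fold cross-validation partitions a dataset, written with the standard library only. The function name `k_fold_indices` is hypothetical and this is not the competitors' code. Note that plain K-fold still assigns rows at random, so on its own it does not prevent the kind of leak discussed in the video when rows are related.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Distribute the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Example: 23 samples split into 5 folds.
folds = list(k_fold_indices(n_samples=23, k=5, seed=42))
```

A model is trained once per fold on `train_idx` and scored on `val_idx`; averaging the scores usually gives a steadier estimate than a single split.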

  • @santiagoprada4827 1 year ago +5

    You inspired me to make content for YouTube again. I'm a software developer, and some months ago I stopped making content because I felt I was wasting my time (not many views, no good ideas). But learning in such a fun way with a format like yours is something I also want other people to feel. Thanks, man.

    • @underfitted 1 year ago +4

      Oh man, thanks for saying this! You just made my day!

  • @JordiRosell 1 year ago +3

    I think this is the most important video I've ever seen in machine learning. Congratulations. ❤️

  • @peoplepeople335 1 year ago +2

    It happens a lot with medical image data. Since that type of data is very hard to collect, we sometimes end up with multiple images from the same person in our dataset.

    • @Offiziersmesser 5 months ago

      yup. Definitely facing this problem right now and I suspected random splitting was the culprit, this video just explained why.

  • @jasdeepsinghgrover2470 1 year ago +1

    I think in many cases leakage is also a very important feature. As long as the same information can actually be given to the model consistently during application, it can turn out to be a very strong predictor.
    For example, in time series models, autocorrelation is essentially leakage with one step back. In NLP models, we use prompting and provide context, which is a lot like leakage.
    As long as we can get the leaked information consistently and its relevance to the task persists, it is a feature.

    • @underfitted 1 year ago

      Right, in that case it is not a leak anymore, but a feature.

  • @joseinsfran3807 1 year ago +1

    I think that all your videos are amazing! Thank you so much for all the content! What about a video where you show all the books on Machine Learning and Data Science you have, or at least one with the best books you've ever read?

  • @mar79379 1 year ago +1

    Perhaps we should use pseudo randomisation?

  • @michaelduffy5309 1 year ago

    I'm making a point of watching one video in this series every day. Great content and presentation. Short and to the point. Thank you.

  • @fikriansyahadzaka6647 1 year ago +1

    Just found your channel. Your video is well edited and easy to follow. Keep up the good work!

  • @thevoyager7675 3 months ago

    Keep up the great content!

  • @curiousmind7967 1 year ago

    I think the data overall should be pre-processed more. Probably use weekdays instead of specific dates. Maybe, instead of using the data from only one flight, add the history of the past 20 flights, etc.
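The two ideas in this comment, replacing the raw date with a weekday and adding a summary of previous flights, might look roughly like the sketch below. The record layout is an assumption: the `delay` field, the `add_features` helper, and the use of a mean over past flights are hypothetical choices for illustration, since the dataset's actual columns are not given here.

```python
from datetime import date


def add_features(flights, history=20):
    """Return new rows with a weekday and the mean delay of preceding flights."""
    ordered = sorted(flights, key=lambda f: f["date"])
    enriched = []
    for i, flight in enumerate(ordered):
        # Only flights that come earlier in the sorted order are used,
        # so the current row's own value never feeds its own feature.
        past = [f["delay"] for f in ordered[max(0, i - history):i]]
        enriched.append({
            **flight,
            "weekday": flight["date"].weekday(),  # Monday is 0, Sunday is 6
            "past_mean_delay": sum(past) / len(past) if past else None,
        })
    return enriched


# Hypothetical records; the "delay" field (in minutes) is an assumption.
flights = [
    {"date": date(2024, 1, 1), "delay": 10},
    {"date": date(2024, 1, 2), "delay": 20},
    {"date": date(2024, 1, 3), "delay": 30},
    {"date": date(2024, 1, 4), "delay": 40},
]
features = add_features(flights, history=2)
```

The first flight has no history, so its `past_mean_delay` is `None` and would need to be imputed or dropped before training.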

  • @Offiziersmesser 5 months ago

    This is wisdom!

  • @knutjagersberg381 1 year ago

    Thanks for the tip!

  • @fdkaix9091 1 year ago

    I appreciate the effort you put into your videos. Great content!

  • @usmanmuhammad196 1 year ago

    Thanks a lot Sir

  • @eduardoabreu78 1 year ago

    Awesome channel!

  • @austinefeak3794 1 year ago

    Nice insight to take home from this video and watch out for from now on. However, do you quote Wikipedia in your research?

    • @underfitted 1 year ago +1

      Many times, yes

    • @austinefeak3794 1 year ago

      @underfitted Well, in my research methodology class, we were told it's a bad idea to quote Wikipedia unless the research subject is Wikipedia itself. Quoting a published journal or article, like the ones you showed, is always recommended.
      Nice video editing skills too, I must say.

  • @mgreek31 1 year ago

    Exponential Growth 💪
    Exponential Knowledge 💪
    Exponential Channel 💪

    • @underfitted 1 year ago

      Thanks!

    • @3bdo3id 1 year ago

      It should have been Exponential Thanks 💪

  • @iftik 1 year ago

    Why did I find this channel so late 💔

    • @underfitted 1 year ago +1

      No worries! You are very early. I’m just getting started!

  • @javierHuertay 1 year ago

    But you are only talking about time series here. I think the name of the video is inaccurate.
    Also, why are you using the date as a variable in your model? I don't think it is an explanatory one, and it causes a lot of trouble, as you mention.

    • @underfitted 1 year ago

      The date in the model is there to illustrate a specific point. The same happens with any other feature that could cause a leak. For example, in a dataset of x-rays, you should always make sure that images from the same patient go into the same split. Splitting one patient's images across sets will give you a leaky validation strategy, just like the one I mention in this video.
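For the time-dependent case raised in this thread, one common alternative to a random split is to split chronologically, training on older rows and testing on newer ones. The sketch below is an illustration under assumptions, not the approach shown in the video; `chronological_split` and the toy records are hypothetical.

```python
from datetime import date


def chronological_split(rows, date_key, test_fraction=0.2):
    """Train on the earliest rows and test on the most recent ones."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    # Hold out the most recent fraction of rows (at least one row).
    n_test = max(1, round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical daily records for January 2024.
rows = [{"date": date(2024, 1, day), "value": day} for day in range(1, 31)]
train, test = chronological_split(rows, date_key="date", test_fraction=0.2)

latest_train = max(r["date"] for r in train)
earliest_test = min(r["date"] for r in test)
```

If several rows share the date at the cutoff, they could end up on both sides of the split, so real data may need the cutoff moved to a date boundary.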

  • @user-uz5qz7ik8w 1 year ago

    Do you have to shout!!!! It gives me a headache.

    • @underfitted 1 year ago

      Getting there, Brett. Getting there.