Credit Card Fraud Detection - Dealing with Imbalanced Datasets in Machine Learning

  • Published: Oct 6, 2024
  • Error: The neural net predictions function uses shallow_nn every time instead of the model passed in, sorry about that! This changes the results a bit, but the main point is choosing and creating a model, which this doesn't impact.
    The Code: colab.research...
    Kaggle dataset (ensure you make an account!): www.kaggle.com...
    Learn Python, SQL, & Data Science for free at mlnow.ai/ :)
    Subscribe if you enjoyed the video!
    Best Courses for Analytics:
    ---------------------------------------------------------------------------------------------------------
    IBM Data Science (Python): bit.ly/3Rn00ZA
    Google Analytics (R): bit.ly/3cPikLQ
    SQL Basics: bit.ly/3Bd9nFu
    Best Courses for Programming:
    ---------------------------------------------------------------------------------------------------------
    Data Science in R: bit.ly/3RhvfFp
    Python for Everybody: bit.ly/3ARQ1Ei
    Data Structures & Algorithms: bit.ly/3CYR6wR
    Best Courses for Machine Learning:
    ---------------------------------------------------------------------------------------------------------
    Math Prerequisites: bit.ly/3ASUtTi
    Machine Learning: bit.ly/3d1QATT
    Deep Learning: bit.ly/3KPfint
    ML Ops: bit.ly/3AWRrxE
    Best Courses for Statistics:
    ---------------------------------------------------------------------------------------------------------
    Introduction to Statistics: bit.ly/3QkEgvM
    Statistics with Python: bit.ly/3BfwejF
    Statistics with R: bit.ly/3QkicBJ
    Best Courses for Big Data:
    ---------------------------------------------------------------------------------------------------------
    Google Cloud Data Engineering: bit.ly/3RjHJw6
    AWS Data Science: bit.ly/3TKnoBS
    Big Data Specialization: bit.ly/3ANqSut
    More Courses:
    ---------------------------------------------------------------------------------------------------------
    Tableau: bit.ly/3q966AN
    Excel: bit.ly/3RBxind
    Computer Vision: bit.ly/3esxVS5
    Natural Language Processing: bit.ly/3edXAgW
    IBM Dev Ops: bit.ly/3RlVKt2
    IBM Full Stack Cloud: bit.ly/3x0pOm6
    Object Oriented Programming (Java): bit.ly/3Bfjn0K
    TensorFlow Advanced Techniques: bit.ly/3BePQV2
    TensorFlow Data and Deployment: bit.ly/3BbC5Xb
    Generative Adversarial Networks / GANs (PyTorch): bit.ly/3RHQiRj

Comments • 55

  • @GregHogg 1 year ago

    Take my courses at mlnow.ai/!

  • @vishnusunil9610 10 months ago +1

    Stunning, bro. A clear-cut explanation that doesn't waste a single minute; it's a gold mine of information.
    Best video on a project explained step by step.

    • @GregHogg 10 months ago

      Thank you for the very kind words! Glad it was helpful 😀

  • @prathameshmore1402 3 years ago +3

    Thank you for your amazing efforts! I don't have much experience in building different models, so this video helped me a lot! Btw, I tried increasing max_depth to 6 in the random forest model, and it improved the model's performance more than I expected. Thanks again!

    • @GregHogg 3 years ago +1

      Interesting! Yeah it's surprisingly easy to mess around with models. That's great about the max_depth! And you're very welcome :)

  • @somechad3682 1 year ago

    One thing worth mentioning would be the data wrangling part. It's often a good idea to check for feature relevance and feature importance. Funnily enough, the transaction amount and time turned out not to be among the features with a substantial impact on the model's decision about whether a transaction was fraudulent.
    Dropping such low-importance features not only reduces bias in our data frame, but can also substantially increase the computation speed of the model! (Mine had a 36% boost in speed while losing only 0.01 points in F1 score and 0.02 in precision.)
    Another thing would be to write a function that fits the training data and scores the validation data for each of the models automatically. It would substantially help with the cleanliness and readability of the project.
    I would also consider hyperparameter tuning and pipelining everything together to make it a robust project. However, great video and a great demonstration of how to check each model and measure its suitability for the problem at hand.

    • @kimchi6284 6 months ago

      Please, I have a project on this topic. Could you please help me? I don't know what to do.
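
The helper suggested in the comment above (one function that fits every model and scores it on the validation set) could look roughly like the sketch below. Everything here is illustrative: `MajorityClassifier` and `ThresholdClassifier` are hypothetical stand-ins that only mimic the fit/predict interface of scikit-learn estimators, and `evaluate_models` and `f1` are made-up names, not code from the video.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClassifier:
    """Hypothetical stand-in: flags a row when its first feature exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if row[0] > self.cutoff else 0 for row in X]


def evaluate_models(models, X_train, y_train, X_val, y_val):
    """Fit each model on the training data and return its validation F1 by name."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1(y_val, model.predict(X_val))
    return scores


# Tiny made-up dataset, one feature per row.
X_train, y_train = [[0.1], [0.2], [0.9], [0.3]], [0, 0, 1, 0]
X_val, y_val = [[0.8], [0.1], [0.7], [0.2]], [1, 0, 1, 0]

scores = evaluate_models(
    {"majority": MajorityClassifier(), "threshold": ThresholdClassifier(0.5)},
    X_train, y_train, X_val, y_val,
)
print(scores)
```

Because each model is passed in explicitly and looked up by name, a helper like this also makes it harder to score the wrong model by accident.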

  • @mellowftw 3 years ago +2

    I'll be trying this soon, thanks Greg

    • @GregHogg 3 years ago

      No problem Krish! 😊😊

  • @petarganev4256 3 years ago +1

    Great video on classification. Good luck with the channel!

    • @GregHogg 3 years ago +1

      Thanks so much Petar! I appreciate that 😊

  • @motilalmeher7666 7 months ago +1

    After training the model on the balanced population, please evaluate its performance on the original, imbalanced population.

  • @sivanujansivakumar5907 3 years ago +1

    Thanks man. I'm going to try this one. It's really helpful. 🙏😍

    • @GregHogg 3 years ago

      Enjoy! You're very welcome 😊

  • @mahelvson 1 year ago +2

    Great video. I was just wondering if taking a slice from the original dataset to use as a test set is a more consistent way to evaluate the resampling procedure. Because in production, the model still has to deal with imbalanced data.

    • @Hash9211 7 months ago

      Yes, I agree. I tried using a slice of the original data as the test set, and the results look completely different.
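
One way to do what this thread describes is to split first and undersample only the training portion, so the test slice keeps the original class ratio. The sketch below uses only the standard library and synthetic labels; `stratified_split` and `undersample` are hypothetical helper names, and the 2% fraud rate is a made-up number, not the Kaggle dataset's.

```python
import random


def stratified_split(labels, test_fraction, seed=0):
    """Split row indices per class so the test set keeps the original class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def undersample(indices, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


# Synthetic labels: 1,000 transactions, 2% fraud (made-up numbers).
labels = [1] * 20 + [0] * 980

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
balanced_train_idx = undersample(train_idx, labels)  # only the training rows are resampled

train_fraud_rate = sum(labels[i] for i in balanced_train_idx) / len(balanced_train_idx)
test_fraud_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train fraud rate: {train_fraud_rate:.2f}, test fraud rate: {test_fraud_rate:.2f}")
```

The model would then be trained on `balanced_train_idx` and scored on `test_idx`, which still looks like production traffic.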

  • @machinelearning3602 3 years ago

    Hope to see more of this kind in the coming days!!

    • @GregHogg 3 years ago +1

      With an account name of "Machine Learning" I would expect nothing less! 😂 And absolutely ☺️

  • @garlicman2778 2 months ago

    Really like your video!
    One thing though: when you downsample the data, shouldn't you still validate and test on data with the original class ratio?
    In your case, you are basically assuming the testing data also has a 50/50 split, which in reality will never be the case.

  • @aguspe532 2 years ago

    Great video and explanation! Thanks!

    • @GregHogg 2 years ago

      You're very welcome!

  • @joxa6119 6 months ago +1

    What is your opinion on doing oversampling (SMOTE) on the minority class?

    • @GregHogg 6 months ago +1

      Definitely a solid option.

  • @ottomaggio2725 1 year ago +2

    Nice video, however, it is not completely clear to me how the undersampling relates to the overall problem. In the end, you have to provide the client (the bank) with a model capable of detecting fraud. Let's suppose we give them the model trained on the rebalanced dataset. Since frauds are unbalanced by nature, then they will end up using the model trained on a balanced dataset on a test set that is actually unbalanced. Isn't this causing issues? Isn't the prediction biased toward the fraud? Aren't we predicting way too many frauds?

    • @ottomaggio2725 1 year ago +1

      To be more specific, I think you can try balancing the training set but you cannot balance the test set because, in the end, in the real scenario, the new data to be predicted will be always unbalanced.

    • @luqmanhrizal 1 year ago

      It's not practical to evaluate the model on a balanced evaluation/test set, since it ignores the real fraud representation. Data representation is sacred.
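
The worry in this thread about predicting "way too many frauds" can be made concrete with a little arithmetic. For a fixed true-positive rate (TPR) and false-positive rate (FPR), precision depends on how common fraud is in the data being scored. The numbers below are hypothetical (a classifier with 90% TPR and 5% FPR, and an assumed 0.2% fraud rate), chosen only to illustrate the effect.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision = TP / (TP + FP), written in terms of rates and class prevalence."""
    true_pos = tpr * prevalence             # share of all rows that are caught fraud
    false_pos = fpr * (1.0 - prevalence)    # share of all rows that are false alarms
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: catches 90% of fraud, wrongly flags 5% of legitimate rows.
TPR, FPR = 0.90, 0.05

balanced = precision_at_prevalence(TPR, FPR, prevalence=0.5)     # 50/50 test set
realistic = precision_at_prevalence(TPR, FPR, prevalence=0.002)  # assumed 0.2% fraud

print(f"precision on a balanced test set: {balanced:.3f}")
print(f"precision at an assumed 0.2% fraud rate: {realistic:.3f}")
```

Recall equals the TPR and does not change with prevalence, so under these assumptions a balanced test set mainly flatters precision (and therefore F1), which is consistent with the concern raised above.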

  • @saitejatangudu6320 3 years ago

    Great video ❤❤ Looking forward to more videos like this.

    • @GregHogg 3 years ago

      Thank you!! Absolutely 😊

  • @vinsanargeese4384 1 year ago

    I just want to know whether it only reports accuracy details or actually detects whether a card is fraudulent.

  • @sakshirathi7950 3 years ago +1

    Thanks Greg!!
    Is it okay to do projects by following tutorial videos? When is the right time to start doing them on our own?

    • @GregHogg 3 years ago

      Absolutely! Go ahead. You can do it on your own when you feel like you've got the general hang of things, if that makes sense.

  • @srijanshovit844 2 years ago

    That's amaaazzzing!!

  • @KeKuHauPiOx 1 year ago

    I'm getting errors on the rest train and val run for the numpy

  • @sushantpargaonkar5188 2 months ago

    How do you balance the test set when you don't have labels in real life?

  • @devjain7076 3 years ago +1

    12:51 Shouldn't the shape of y_train be (240000, 1) since it consists of exactly one column?

    • @GregHogg 3 years ago +1

      (240000,) and (240000,1) are very close to the same thing. I'm not sure if they both work or not
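
For readers wondering about the difference between the two shapes in this exchange, here is a small NumPy sketch with a four-element array standing in for y_train. A shape of (n,) is a flat vector and (n, 1) is a single-column matrix; they hold the same values but broadcast differently. As far as I know, scikit-learn estimators generally expect the flat form for targets and warn when given a column vector, though that is outside what the video shows.

```python
import numpy as np

y_1d = np.array([0, 1, 0, 1])   # flat vector, shape (4,)
y_2d = y_1d.reshape(-1, 1)      # single column, shape (4, 1)
print(y_1d.shape, y_2d.shape)

# Broadcasting pitfall: comparing a column against a flat vector
# produces a full 4x4 grid instead of an element-wise comparison.
pairwise = (y_2d == y_1d)
print(pairwise.shape)

# ravel() flattens the column back into a vector when a 1-D target is needed.
flat_again = y_2d.ravel()
print(flat_again.shape)
```

So the two are interchangeable only after an explicit reshape or ravel; mixing them in arithmetic or comparisons can silently produce an n-by-n result.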

  • @emrecoban3895 1 year ago

    Aren't we supposed to test on the original data instead of the balanced data?

    • @Mwme2000 11 months ago

      Well, I have the same question, but every notebook I saw for this dataset with a high F1 score did it the same way he did. After a lot of research, I found that with highly imbalanced data like this, it is considered okay to test on the undersampled data. If you know anything else, please share it.

  • @unlucky-777 9 months ago

    Hey Greg, thank you for the video, but I have a question. At first, we had a dataset with 280000 rows and 30 columns, but towards the end of the video we reduced it to only 984 rows. Doesn't this make the model worse because it's trained on less data?
    Or was the real problem that we were getting bad results at first because we had so many not_fraud rows compared to fraud ones?

  • @amannagarkar 1 month ago

    In the predict function, you take model as an input argument but return predictions from shallow_nn. Is that correct, or should it be model.predict()? (28:31)

    • @amannagarkar 1 month ago

      That's probably why the values are exactly the same at 51:51.
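
The bug described here (and acknowledged in the video description) can be reproduced in a few lines. This is a sketch, not the notebook's actual code: `ConstantModel` is a hypothetical stand-in for a trained network, `buggy_predictions` is an illustrative name, and the 0.5 threshold is an assumption about how probabilities are turned into class labels.

```python
class ConstantModel:
    """Hypothetical stand-in for a trained network: one probability per input row."""

    def __init__(self, probability):
        self.probability = probability

    def predict(self, x):
        return [[self.probability] for _ in x]


shallow_nn = ConstantModel(0.9)  # plays the role of the global model in the notebook
other_nn = ConstantModel(0.1)    # a second model that should give different labels


def buggy_predictions(model, x):
    # Bug: the `model` argument is ignored and the global shallow_nn is used instead.
    return [int(row[0] > 0.5) for row in shallow_nn.predict(x)]


def neural_net_predictions(model, x):
    # Fix: call predict() on the model that was actually passed in.
    return [int(row[0] > 0.5) for row in model.predict(x)]


x = [[0.0], [1.0], [2.0]]
buggy = buggy_predictions(other_nn, x)
fixed = neural_net_predictions(other_nn, x)
print(buggy)
print(fixed)
```

With the buggy version every model passed in yields shallow_nn's predictions, which would explain identical metrics across models.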

  • @ArtistrystoriesUnleashed45 1 year ago

    Can I use the train_test_split function from sklearn to split the data into train and test sets?

  • @mubshali7489 3 years ago

    Sweet. This is going to my github!!

  • @arsheyajain7055 3 years ago

    Awesome 👏🥳

  • @ashwanirathi948 4 days ago

    Awesome

  • @MatTheBene 2 years ago

    Are you not leaking targets if you normalize before splitting the data?

    • @GregHogg 2 years ago

      If I am, it isn't really a big deal

    • @MatTheBene 2 years ago

      @GregHogg It probably isn't a big deal in most cases, but with time series data you are leaking future information that the model will not have during inference, such as changes in trend 📈 in future data points.

    • @GregHogg 2 years ago

      @MatTheBene For time series it would be more concerning, yes.
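
To illustrate the point discussed in this thread: what leaks when you normalize before splitting is the validation rows' statistics (their mean and spread), not the labels themselves. A leak-free version computes the scaling parameters on the training rows only and reuses them for validation. The sketch below uses made-up transaction amounts and hypothetical helper names (`fit_standardizer`, `transform`).

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Return (mean, std) computed from the given values only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant columns
    return mu, sigma


def transform(values, mu, sigma):
    """Standardize values with previously fitted parameters."""
    return [(v - mu) / sigma for v in values]


# Made-up amounts in time order; the last row is an outlier that only
# shows up in the validation period.
amounts = [10.0, 12.0, 11.0, 13.0, 9.0, 500.0]
train, val = amounts[:4], amounts[4:]

# Leak-free: parameters come from the training rows only.
mu, sigma = fit_standardizer(train)
train_scaled = transform(train, mu, sigma)
val_scaled = transform(val, mu, sigma)

# Leaky: parameters computed on all rows, so the future outlier shifts them.
leaky_mu, leaky_sigma = fit_standardizer(amounts)

print(mu, leaky_mu)
```

In this toy example the training mean changes a lot once the future outlier is included, which is the kind of shift the time series comment is warning about.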

  • @allaboardthegravytrain5987 6 months ago

    thanks

  • @j_ckitchai 9 months ago +1

    Hi, thank you a lot for making this video; I learned a lot from it. I have a question: at 52:05, shouldn't the line that prints rf.predict(x_val_b) use rf_b.predict(x_val_b) instead? Likewise, the GBC later on should use gbc_b.predict, right?

    • @jeremyklauber7535 3 months ago

      I thought that as well. I'm not entirely sure why he didn't change those, given that he called the neural_net_predictions function with shallow_nn_b.

    • @83Dunes 2 months ago

      I have the same question. The inference and final choice of model may differ with that change.