We just talked about this in my machine learning course this week!! Great timing! This video is very helpful.
Thank you so much for all you do.
Great video, this practical content is gold. Thank you :)
Well presented!
ritvikmath coming with a video of one of my favorite topics - instant like!
Hi! Great video. Would you consider creating a full in-depth CatBoost tutorial on some random data? It would be super useful.
Okay, but with oversampling, how do you use cross-validation? Because if you use it on the oversampled dataset, you'll have a data leak.
I think you'd want to define the folds on the original data and then oversample while holding the validation fold fixed. Example: 3-fold CV.
- split the original data into 3 folds (A, B, C)
- treat (A, B) as training data -> oversample that data -> validate using C
- repeat, using A and then B as the validation fold
- note that there is no data leak in this case
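A minimal sketch of that fold-wise setup, assuming scikit-learn and using simple random oversampling as a stand-in for SMOTE (the dataset here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset: roughly 5% positives.
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)
X[y == 1] += 1.5  # give the minority class some separation

def oversample(X_tr, y_tr, rng):
    # Random oversampling of the minority class (a simple stand-in for SMOTE).
    minority = np.flatnonzero(y_tr == 1)
    majority = np.flatnonzero(y_tr == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X_tr[idx], y_tr[idx]

scores = []
for tr_idx, va_idx in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y):
    # Oversample ONLY the training folds; the validation fold keeps its
    # original class distribution, so nothing leaks across the split.
    X_tr, y_tr = oversample(X[tr_idx], y[tr_idx], rng)
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(f1_score(y[va_idx], model.predict(X[va_idx])))

print(np.mean(scores))
```

The key point is that oversampling happens inside the loop, after the split, so no synthetic copy of a validation point can ever land in the training set.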
Very interesting. AdTech modeling of conversions as caused by advertising always suffers from imbalance (conversion rates are usually in the low-to-mid single digits).
It should be "imbalanced data" instead of "unbalanced data"
Lol 😂
you are seriously so underrated
Excellent video!
One question though: are certain classification models immune from class imbalance? Thanks!
To my knowledge, no classifier is immune to imbalanced datasets, because they are all data-driven. However, you can still get very good accuracy on an imbalanced dataset when inter-class separability is very high; for example, detection of water bodies (often a minority class) over a large area is often quite accurate.
Great video!
But don’t you think that with such an unbalanced dataset it would be better to go for an anomaly detection algorithm instead of a classification algorithm?
Great video. For other ML algorithms like logistic regression, SVM, KNN, etc., can we implement the first method (upweighting the minority class)? Or is this only applicable to decision trees?
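In scikit-learn, yes for most of these: logistic regression, SVM, and tree models accept a class_weight argument (KNN doesn't, though distance weighting gives a related effect). A quick sketch on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Synthetic imbalanced data: ~5% minority class.
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# class_weight="balanced" upweights each class inversely to its frequency.
lr = LogisticRegression(class_weight="balanced").fit(X, y)
svm = SVC(class_weight="balanced").fit(X, y)

print(recall_score(y, lr.predict(X)))  # minority-class recall on the training data
```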
Can we customise the loss function? For example, more weight for misclassifying the true minority class and less weight for the other kind of error?
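You can get that effect in scikit-learn without redefining the loss itself, by passing per-example weights to fit via sample_weight (the 10x cost below is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: ~5% minority class.
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# Asymmetric costs: misclassifying a minority example counts 10x more.
weights = np.where(y == 1, 10.0, 1.0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=weights)
```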
Great demo!
Just one thought: why didn't you talk about downsampling the majority class and what its impact would be?
This is something I am wondering about too!
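For what it's worth, random undersampling is easy to try by hand; here's a sketch on synthetic data (plain NumPy, no special library assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% minority class

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
# Keep only as many (randomly chosen) majority examples as minority ones.
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([minority, keep])
X_down, y_down = X[idx], y[idx]
```

The obvious trade-off: you throw away majority-class information, which can hurt when the dataset is small to begin with.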
Hi, just wondering: is SMOTE applicable to image data? I've only seen one article on it online, so I'm not sure it even works, since generating synthetic images is likely much harder.
That's where image augmentation comes into play. You can create different variations of an image by rotating, flipping, and applying various other transformations.
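A tiny sketch of what those transformations look like with plain NumPy (the image here is just random pixels, standing in for a real minority-class example):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))  # stand-in for a minority-class image

augmented = [
    np.fliplr(img),  # horizontal flip
    np.flipud(img),  # vertical flip
    np.rot90(img),   # 90-degree rotation
]
```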
You could predict that aircraft engines NEVER fail and almost always be right.
Are you familiar with Latent vectors in network analysis?
s/o from South Africa
Hi,
when people have problems with unbalanced data, it's proof they don't understand what they are doing.
When I was young (a long time ago), our teachers made us do things step by step so we were (nearly) sure we knew what we were calculating.
That's not the case anymore; yes, people don't get the methodology and the maths, yet they practice data science, which is sad.
Oops, Nuance wrote 'yes'!! Trusting the LSTM, I did not check my post, sorry! ;-)