I was actually gonna watch this while having some pasta, but 2 minutes later I realised I need to get my notebook and pen ASAP! Golden Content my guy, pure GOLD.
First 3 mins and I added this to my liked section... pure gem
As an amateur data analyst/scientist, I think this is insanely useful information. Thanks for sharing
Seriously good refresher. I like this type of video. Quick and to the point. Good job
I am just a new hobbyist, this content is awesome, I find it very helpful.
This is a good video, way better than others I’ve seen.
Thank you. Very helpful.
You have high-quality videos.
If you keep up with those you will be very successful.
Keep up the good work.
I'll bet you'll achieve 100k subscribers within 6 months.
Your video is top notch as always, just diving into the world of ML
Great and informative video! I'm sure I'll rewatch it many times.
I started learning programming just a month and a half ago (through Udemy courses), and I'm already building my first dataset on EV chargers installed in Europe (I have a dataframe with over a million parameters!). Once I clean it up, I'll move on to running ML algorithms on it.
Thank you for the effort you're investing in future generations, Mr. Nobody!
During my university years (I'm finishing my degree in Engineering Management this semester), I studied statistics, linear algebra, calculus (ended somewhere around Hessian matrices), and optimization over the past 2-3 years. It feels like a dream come true to now apply these concepts in a programming environment, which I previously only worked on theoretically with pen and paper.
Common sense, but very good refresher. Thanks!
Great summary!
Would you say that shuffling is needed for the testing and validation sets?
2:20 ahaaa, I was asked this in an interview as well
6:07 read about annealing learning rates, will try to implement that as well
hey, could you explain the data leakage part to me?
Amazing content, thank you for helping out a noob
Wow man! This is gold!
Thanks for video!
Great content, thanks
Very good content!
THIS IS AWESOME!
Actually, could you explain when to do feature scaling? Is it needed only for distance-based algorithms, and how do you deploy a model if you did the scaling?
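A minimal sketch of one common pattern for the deployment part, assuming scikit-learn and joblib; the file name and data are placeholders, not anything from the video:

```python
# Minimal sketch: fit the scaler on the training split only, persist it,
# and reuse the identical transformation at inference time.
import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)  # placeholder feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)      # fit on training data only (avoids leakage)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test set reuses the training statistics

joblib.dump(scaler, "scaler.joblib")        # placeholder path; ship this artifact with the model

# At deployment: load the saved scaler and apply it to incoming data.
scaler = joblib.load("scaler.joblib")
X_new_scaled = scaler.transform(np.random.rand(3, 5))
```

As a rule of thumb, distance- and gradient-based models (k-NN, SVMs, linear models, neural networks) tend to benefit from scaling, while tree-based models generally don't need it.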
What about data created dishonestly? Basically, I’m not an IT programmer, but I’m learning data science. As a practitioner, I’ve occasionally created or reported dishonest data. I think, as a human, others might do the same. Can this affect the accuracy of the model in general?
Yeah definitely. If the data is wrong, no model can save it.
This is where metrics can't reach. As mentioned in the video, domain knowledge is essential because it will most likely tell you why your model performs well in training and testing but fails in the real world. Somewhere, somehow, there is always an answer to lies.
Absolutely fantastic
This video is insane. It's so good that it should be included as a synopsis in any ML-focused academic book.
Version control and docs are very good. I never shuffle data.
It's a lot to take in from one session. I believe you have more detailed videos on your channel; I'll check them out later. Thanks
Hey, thank you. Your videos are very informative. However, I have recently started studying ML and have a few questions.
1. Can you tell me what "sample" means?
2. What is the minimum number of samples needed to consider a DL approach? Is there any criterion for this? I remember you showing the scikit-learn ML chart, like a decision tree, for which model to select based on the number of samples we have.
Thanks in advance.
Hey, please do the same kind of videos for statistics and other concepts which are foundational ...
Will you create similar content for statistics concepts, like your previous videos?
Can we have a similar video for deep learning as well?
I don't understand anything but I'm still hooked because my brain tells me it's helpful
I'm submitting an abstract on the 7th featuring my first-ever ML work in my field. I've been very nervous about making simple errors or not presenting the research in a way that ML people would find satisfying. This was super helpful, thank you
Well, I would love to do hyperparameter optimization and use cross-validation, but each epoch takes 16 hours and we need to publish a paper, so :(
Having recently retired after working as a data scientist for over three decades, I can say this is a very, very good summary of the issues and fixes, not just for ML but for any predictive modeling project.
This is good stuff
Damn this is good
Hey Tim. I love you
Very good
Great video! All the lessons I had to learn the hard way in my first 2 years!
Now I understand why Splunk makes sense in the AI world: instead of just creating AI, you must first make sure you have a better dataset.
instant subscribe
Thanks! I try to avoid most of these
This is great. It's something I wish I had when I was banging my head against the wall because my models weren't behaving how I thought they would. This video mentions all the issues I spent weeks working on, and more! Great tool.
This is years of experience and lessons learnt for a "beginner"
I just love this type of organized information; it's the data scientist's way
This video is a perfect checklist😂. Thanks🙏
Thank you for creating this video! Quick question re: not shuffling data (9:58). For time series data, wouldn't shuffling introduce train/test set contamination? Also, isn't the order important for time series data, and wouldn't shuffling it ruin the arrow of time? Thank you!
Ohhh, there are such things as model validation.
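A minimal sketch of the difference, assuming scikit-learn; the series here is a placeholder:

```python
# Minimal sketch: for time series, split chronologically instead of shuffling,
# so the model is always evaluated on data that comes after its training window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

y = np.arange(100)  # placeholder series, ordered in time

# Shuffled split: fine for i.i.d. tabular data, but for time series it mixes
# future observations into the training set (contamination).
train_shuffled, test_shuffled = train_test_split(y, test_size=0.2, shuffle=True, random_state=0)

# Chronological splits: each test fold lies strictly after its training fold.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(y):
    print(f"train up to t={train_idx[-1]}, test from t={test_idx[0]} to t={test_idx[-1]}")
```

Shuffling before splitting is exactly the contamination described above; chronological splits avoid it.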
Ignoring domain knowledge is the worst of these by far. If you don't understand the domain, you will generate trivial, weak or useless solutions even if you do everything else right.
🔥
Non-stationary data is being missed
What do you mean by this? Can you explain?
@acasualviewer5861 Essentially, unless you have a degree in stats or maths, avoid time series data
@acasualviewer5861 Stationarity is a property of some time series data. It essentially means that the distribution out of which the time series data is generated does not change over time (that is strict stationarity; for weak stationarity, only the first two moments and the autocovariance need to stay the same when analysing two time points that are h time steps apart). But yeah, unless you know what you are doing, stay away from time series
@FedeAlbertini You mean like how temperatures tend to be cooler in the winter vs the summer? So you could say it's non-stationary?
@acasualviewer5861 Yes, temperatures are non-stationary. They have a trend (global warming) and seasonal components. Most processes are in fact non-stationary, but pretty much all time series modelling techniques assume stationarity. Therefore, to model correctly, you need to know how to turn non-stationary processes into stationary ones. Common techniques are differencing, detrending, and log transformations.
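A minimal sketch of those transformations, assuming pandas, NumPy, and statsmodels are available; the series is a synthetic toy example:

```python
# Minimal sketch: turning a trending, multiplicative toy series into something
# closer to stationary via a log transform followed by differencing.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
t = np.arange(200)
series = pd.Series(np.exp(0.02 * t) * (1 + 0.1 * rng.standard_normal(200)))  # trending toy series

log_series = np.log(series)           # stabilises multiplicative growth
diffed = log_series.diff().dropna()   # differencing removes the (linear-in-log) trend

# Augmented Dickey-Fuller test: a low p-value rejects the unit-root null,
# i.e. evidence that the differenced series is closer to stationary.
p_before = adfuller(series)[1]
p_after = adfuller(diffed)[1]
print(f"ADF p-value before: {p_before:.3f}, after log-diff: {p_after:.3f}")
```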
I stopped the video right away when SMOTE was suggested as a solution to class imbalance.
Same!
Why? Is it not the actual solution?
@somnath3986 SMOTE is more likely to create synthetic samples of the majority class instead of the minority. To address this issue, under-sampling is preferred to oversampling. However, class imbalance is often not a problem in itself but simply the nature of the data, so the best solution would be to use a loss function that penalizes the majority class.
That's not a very good approach, neither for yourself nor for aspiring data scientists. SMOTE is more likely to fail at creating synthetic samples in the minority class if the minority class already contains noisy or imbalanced data; that is not the fault of the technique. Instead, what you should suggest are variants of SMOTE that have been developed to address its limitations. There simply can't be one preferred sampling method for every class imbalance problem, and you should know that if you understand how broad the application of ML is across industries.
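A minimal sketch of the two approaches debated in this thread, assuming scikit-learn and the optional imbalanced-learn package; the dataset is synthetic and the model choice is arbitrary:

```python
# Minimal sketch: class-weighted loss vs. a SMOTE variant on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: keep the data as-is and weight the loss inversely to class frequency.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Option 2: oversample only the training split with a SMOTE variant
# (requires the imbalanced-learn package).
from imblearn.over_sampling import BorderlineSMOTE
X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X_train, y_train)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, clf_smote.predict(X_test)))
```

Either way, any resampling belongs only inside the training split, so the test set keeps the original class distribution.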
Duo
Hmm, this video is too hard for me. Someone reply to me in a year reminding me to come back here, please. Thanks
Using SMOTE is a beginner mistake
pls elaborate
First 😂
Makes me feel like a rock star :-D
You are to me🎉
Amazing, thanks!