Handling Missing Data Easily Explained | Machine Learning
- Published: 8 Sep 2024
- Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
In this tutorial, you will discover how to handle missing data for machine learning with Python.
Specifically, after completing this tutorial you will know:
How to mark invalid or corrupt values as missing in your dataset.
How to remove rows with missing data from your dataset.
How to impute missing values with mean values in your dataset.
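The three steps listed above can be sketched with pandas roughly as follows. This is only a minimal illustration, not the code from the video: the toy DataFrame, its column names, and the choice of 0 as the "invalid" placeholder are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an age of 0 stands in for "not recorded".
df = pd.DataFrame({
    "age": [22.0, 0.0, 35.0, 0.0, 41.0],
    "fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# 1. Mark invalid values as missing (NaN).
marked = df.copy()
marked["age"] = marked["age"].replace(0.0, np.nan)

# 2. Remove rows that contain missing data.
dropped = marked.dropna()

# 3. Alternatively, impute missing values with the column mean.
imputed = marked.fillna({"age": marked["age"].mean()})

print(marked["age"].isna().sum())  # number of values marked as missing
print(len(dropped))                # rows left after dropping
print(imputed)
```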
Github link: github.com/kri...
You can buy my book, in which I explain in detail how to use machine learning and deep learning in finance with Python.
url: www.amazon.in/...
Today I started working on the Titanic data. I tried to predict the missing age values but failed and was very tense. So I started watching your video, hoping for a way. When you opened the notebook I felt such relief: 'now it will surely get done'. Thank you for making this video.
What did you do for the Cabin column? We can't simply remove it just because it has many missing values.
I think there is a quantitative justification for filling the NaN values in 'Age' with the median grouped by 'Sex' and 'Pclass'. In the EDA step, we can print or visualize a heatmap of the correlations between the columns (dataset.corr().abs()). We can see that the 'Age' column has a relatively high correlation with the 'Sex' and 'Pclass' columns.
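A grouped-median fill like the one described in this comment might look like the sketch below. The column names follow the Titanic dataset, but the rows are made up for illustration, and the `Age_filled` column is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up rows using Titanic-style column names.
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 1, 3],
    "Age": [40.0, np.nan, 20.0, 24.0, 50.0, np.nan],
})

# Median age of each (Sex, Pclass) group, aligned row by row with df.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
df["Age_filled"] = df["Age"].fillna(group_median)
print(df)
```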
Hi Krish,
Your videos are quite useful and simple to understand. My request: if you could create a video on how to deploy an ML model with Flask, that would be very useful.
Sure I will do that
Thank you Krish sir. I was following the Kaggle Learn course on machine learning but couldn't understand this topic even after so much hard work. Now it's all clear. Keep it up.
Thank you Krish, you have explained the second option very well. I am wondering how we do this for categorical columns and when values are missing from multiple fields.
Your channel is awesome, please keep going! Can't tell you how valuable your videos are when starting to learn!
Krish, I have one doubt. You are saying that we need to compute the null values by considering the other related columns. Then tell me how we can implement the same as a pipeline (from sklearn.pipeline import Pipeline) so that the pipeline can be used to compute the missing values of the test dataset.
Please clear my doubt if anybody knows! It will be helpful for me.....
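One possible way to approach this question is a small custom transformer that learns group medians on the training data and reuses them on the test data. This is a sketch under assumptions, not the method from the video: `GroupMedianImputer` is a hypothetical helper, and the column names and toy data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill `target` with the median of its `group`."""

    def __init__(self, group="Pclass", target="Age"):
        self.group = group
        self.target = target

    def fit(self, X, y=None):
        # Learn the per-group medians from the training data only.
        self.medians_ = X.groupby(self.group)[self.target].median()
        # Overall median, used for groups never seen during fit.
        self.fallback_ = X[self.target].median()
        return self

    def transform(self, X):
        X = X.copy()
        fill = X[self.group].map(self.medians_).fillna(self.fallback_)
        X[self.target] = X[self.target].fillna(fill)
        return X


train = pd.DataFrame({"Pclass": [1, 1, 3, 3], "Age": [40.0, 50.0, 20.0, np.nan]})
test = pd.DataFrame({"Pclass": [1, 3, 2], "Age": [np.nan, np.nan, np.nan]})

pipe = Pipeline([("impute_age", GroupMedianImputer())])
pipe.fit(train)                     # statistics come from the training set
test_filled = pipe.transform(test)  # the same statistics are applied to test
print(test_filled)
```

Because the medians are stored in `fit`, the test set is filled with values learned from the training set rather than from its own rows.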
Well, I appreciate the video that Mr. Krish Naik made, and I love watching his videos. I really want to discuss how we can handle missing values. Using a separate model to learn the relation between variables from the complete part of the dataset is not great, though, because the value is generated by machine learning: it is not real data and may be statistically far from the center of the population data because it comes from another equation. I would love to use a statistical method like the mean, median or mode and (I don't know whether this will work) check the range of the population mean to make sure the value does not go far from it.
Starting data science after a career gap of so many years. Your videos are god level.
All good with the imputation of null values related to Age. However, for the Cabin feature, instead of deleting roughly 70% of the records, if we aren't able to find any way to impute via domain knowledge, why can't we tag them as "Undetected" and keep the records for model training?
Deleting 70% of the records just for 1 feature will surely not be the best solution if we want to improve the model performance.
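The tagging idea from this comment could be sketched as below. The "Undetected" label comes from the comment itself; the toy rows and the extra `Cabin_missing` indicator column are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows using Titanic-style column names.
df = pd.DataFrame({
    "Cabin": ["C85", np.nan, "E46", np.nan],
    "Survived": [1, 0, 1, 0],
})

# Optional indicator so a model can still see that the value was missing.
df["Cabin_missing"] = df["Cabin"].isna().astype(int)

# Keep every row and give missing cabins their own category.
df["Cabin"] = df["Cabin"].fillna("Undetected")
print(df)
```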
Thank you for making life so much easier for us!
Cleared all my doubts! Great..Thank you so much!!
Nice explanation. The conclusion depends on your end goal and on whether dropping or replacing with the mean will affect your analysis; in his example he needed the age but didn't need the cabin.
beautifully explained with the detailing!
Great way of explaining things. I like it very much.
Thanks a lot for sharing your knowledge with us. Kindly address one confusion: do we need to impute missing values in the test dataset the same way you have taught in the video?
Honestly, I really love your videos, simple and easy to understand. Always answering my machine learning and data science questions! I do have one though. I watched your video on standardisation and normalisation. I am trying to build a benchmark/index, would it be okay to make the data standardized before creating it or?
Thanks Krish. I can't think of an easier explanation of a tricky topic!!! Simply superb!!!👍
Your explanation is amazing and perfect as usual.
I thought you would also implement a regression model for synthetic imputation. But the content is great!!
But anyway, thank you for sharing. It helped me a lot to learn how to handle missing values. Nice work!
A really good idea to create a separate model. Thanks for sharing.
great! thanks for explanation
Thanks Kris, very helpful.
Hi Krish, I find your videos very useful for beginners like me. Here you have shown how to handle missing values for number and string fields. We also need to handle date and time columns. Please guide us through this.
Hi, how do I deal with year values like 2006 and 0 in the same column?
Hi Krish, I want to understand why you chose Pclass to replace null values for Age.
Why not any of the other attributes?
Thanks sir, I was confused in this part only, about nan values and why we take sum of those nan values..
If you're using the isnull() function, it will turn all your missing values into True (or 1) and not-null values into False (or 0). After that you can just sum() all of the 1's to find out how many NaN values are in your dataset.
@BretskoD: Thank you sir.
Thanks for the video. You said that option 2 (model-based imputation) is less preferred for huge datasets. Does that mean that in general it is better to go with statistical imputation over model-based imputation for real-world datasets, since we get a lot of data in the real world? I am working on Home-Credit-Default-Risk (Kaggle competition dataset). I request your comment on which imputation method to use.
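As a small illustration of the isnull() and sum() combination described above (the toy DataFrame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", None, None, None],
})

mask = df.isnull()   # True where a value is missing, False otherwise
counts = mask.sum()  # True counts as 1, so this is the missing count per column
total = counts.sum() # missing values in the whole DataFrame

print(counts)
print(total)
```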
Very helpful video
Thanks Krish
QUESTION: why did you choose the imputed value of age with respect to Pclass and not with respect to male or female?
Could you please make a video on missing value imputation using decision trees?
Hi Krish, could you please tell us what to do when there are missing values in the dependent variable?
Hi Krish,
I have a doubt. How would you treat a variable that has around 30% missing values and is important to consider? Overall there are around 550K records.
Thanks a lot for detailed explanation. It really helps
Thank you so much ..
Using Flask, you could cover deployment as well. Please, Mr. Krish.
Hi Krish,
I think the age column in the distplot is right skewed. I do not think that it has a normal distribution.
What's the recommended rule for deciding whether to use data imputation techniques or simply drop the rows that have missing values? The missing values can follow different patterns, like Missing at Random, Missing Not at Random and so on. So what should we do in that case?
Sir, can you please tell us about the role of ROC and CAP curve analysis in improving model performance?
Hi Krish, I have one doubt on this case study.
Why have you imputed on the basis of the class column? We could also do it on the basis of the Gender column: the median/mean of male passengers and the median/mean of female passengers.
Also, if we have normally distributed age data, can we apply the mean instead of the median?
Are these only for numerical data? What methods can I use for characters/names or years? Please suggest, thanks!
Please sir, what do I do if I have 80% missing values in my target variable?
I'm trying to predict the gross of movies, but the target variable to train my model with has 80% missing values.
How can we decide whether to use the mean, median or mode to replace a missing value?
You have to decide based on your data.
Our first priority is the mean. If we have large outliers, we go for either the mode or the median depending on the situation, as these two figures are least affected by outliers.
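To see why an outlier can matter for this choice, here is a tiny made-up example comparing the two fill values:

```python
import numpy as np
import pandas as pd

# Made-up ages with one large outlier (95) and one missing value.
ages = pd.Series([22, 24, 26, 28, 30, 95, np.nan])

mean_value = ages.mean()      # NaN is skipped; the outlier pulls the mean up
median_value = ages.median()  # NaN is skipped; the median stays near the bulk

filled_with_mean = ages.fillna(mean_value)
filled_with_median = ages.fillna(median_value)

print(mean_value, median_value)
```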
First, I want to thank "Krish" for all the content. I have a NumPy array of continuous values obtained from a regression model, but I don't know how to fill the null values using that array. Can anyone help me out?
Thank you so much
Sir, can you please make a video on named entity recognition using TensorFlow Keras?
Hi, I want to know why you used the box plot median to replace missing values in the age column and not the mean or mode. Can you please tell me the reason?
Sir, please give a suggestion regarding the Cabin feature: if it has a low number of missing values, how do we deal with that type? It is a combination of categorical and numerical.
Same doubt, bro. Do you know the answer?
nice video
How do we handle missing (NaN) values in a column with binary values, i.e. just 0 or 1?
Sir, if we have missing values in the output column, then how can a separate model be used?
Loved it
Great !
I am thinking of a Netflix problem-solution type of filling in missing values, a kind of minimizing a cost function.
Awesome
Why write and not just make text appear (as in: pre-typed so people can read it and use transition? )
At 3:45 you said that we delete the record, but what if that variable/feature is significant?
We can replace the null values with the mean of each column.
Sir, I have a doubt about this. What if we have 50 Pclass values? It becomes really tedious to write all of them. Is there any way we can use a list of such Pclass values together with a list of the potential ages while defining the function? For example: if Pclass == list1:
return age == list2
Just use a dictionary where key is pclass and value is the mean.
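The dictionary suggestion could look roughly like this. It is a sketch with made-up rows; the dictionary is built from the data with a groupby, so it scales to any number of Pclass values without writing them out by hand.

```python
import numpy as np
import pandas as pd

# Made-up rows using Titanic-style column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, np.nan, 30.0, np.nan, 24.0, np.nan],
})

# Key is the Pclass, value is the mean age observed for that class.
age_by_class = df.groupby("Pclass")["Age"].mean().to_dict()

# Look up each row's class in the dictionary and fill only the missing ages.
df["Age"] = df["Age"].fillna(df["Pclass"].map(age_by_class))
print(age_by_class)
print(df)
```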
Sir, I was unable to understand the programming part on Udemy. That is why I searched on YouTube, but here I can see both of them are exactly the same. You should at least change the digits. With all due respect, you copied it from Udemy.
In which scenario can we delete rows with missing values?
When there is a very large data set.
Hi sir, how do we fill missing values using linear regression?
👍👍👍
Terrible presentation, can't really understand the red scribble. Maybe try typing.