Handling Missing Data Easily Explained | Machine Learning

Krish Naik

182K views

Published: Sep 8, 2024
Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
In this tutorial, you will discover how to handle missing data for machine learning with Python.
Specifically, after completing this tutorial you will know:
How to mark invalid or corrupt values as missing in your dataset.
How to remove rows with missing data from your dataset.
How to impute missing values with mean values in your dataset.
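The three steps listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook from the video: the toy DataFrame, its column names, and the assumption that 0 is an invalid reading are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: assume 0 is not a valid reading in either column.
df = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "bmi": [33.6, 26.6, 0.0, 28.1],
})

# 1. Mark invalid values as missing.
marked = df.replace(0, np.nan)

# 2. Remove every row that contains a missing value.
dropped = marked.dropna()

# 3. Alternatively, impute each column with its own mean
#    (mean() skips NaN by default).
imputed = marked.fillna(marked.mean())

print(marked.isnull().sum())
print(dropped)
print(imputed)
```

Dropping is the simplest option but discards whole rows; mean imputation keeps every row at the cost of shrinking the column's variance.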
Github link: github.com/kri...
You can buy my book, where I give a detailed explanation of how to use Machine Learning and Deep Learning in finance with Python.
url: www.amazon.in/...

Comments • 78

@himalayasinghsheoran1255 4 years ago ⁺¹⁵
Today I started working on the Titanic data. I tried to predict the missing age values but failed and was very tense. So I started watching your video in the hope of finding a way. When you opened the notebook I felt such relief: 'now it will surely get done'. Thank you for making this video.
@rajagopal.g3533 3 years ago
What did you do for the Cabin column? We can't simply remove it just because it has many missing values.
@radifantaufik8085 3 years ago ⁺⁵
I think there is a quantitative justification for filling the NaN values in 'Age' with the median grouped by 'Sex' and 'Pclass'. In the EDA step, we can print or visualize a heatmap of the correlations between the columns (dataset.corr().abs()). We can see that the 'Age' column has a relatively high correlation with the 'Sex' and 'Pclass' columns.
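The grouped-median idea in the comment above could look roughly like this. It is only a sketch on a made-up six-row frame that borrows the Titanic column names; the 0/1 encoding of 'Sex' is an assumption added so that the column can take part in corr(), which works on numeric data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with Titanic-style column names.
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Age": [40.0, 35.0, 22.0, np.nan, np.nan, 39.0],
})

# Absolute correlations; 'Sex' is encoded as 0/1 first (an assumed encoding).
encoded = df.assign(Sex=df["Sex"].map({"male": 0, "female": 1}))
corr = encoded.corr().abs()

# Median age of each (Sex, Pclass) group, aligned to the original rows.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# A group with no observed age stays NaN, so fall back to the overall median.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(corr)
print(df)
```

The fallback line matters in practice: a small group may have no known ages at all, in which case the grouped median is itself NaN.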
@dilipgawade9686 5 years ago ⁺¹³
Hi Krish,
Your videos are quite useful and simple to understand. My request: if you could create a video on how to deploy an ML model with Flask, that would be very useful.
@krishnaik06 5 years ago ⁺¹⁰
Sure, I will do that.
@raunasur9710 2 years ago
Thank you Krish sir. I was following the Kaggle Learn course on machine learning but couldn't understand this topic even after so much hard work. Now it's all clear. Keep it up.
@amarendrakolukula4592 4 months ago
Thank you Krish, you have explained the second option very well. I'm wondering how we do this for categorical columns and when values are missing from multiple fields.
@stevechops3226 3 years ago ⁺²
Your channel is awesome, please keep going! Can't tell you how valuable your videos are when starting to learn!
@Ro45256 6 months ago ⁺¹
Krish, I have one doubt. You say that we need to compute the null values by considering the other related columns. How can we implement that as a pipeline (from sklearn.pipeline import Pipeline) so that the pipeline can be used to fill the missing values of the test dataset?
Please clear my doubt if anybody knows! It would be helpful for me.
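One possible way to approach the pipeline question above is a small custom transformer that learns the group medians on the training data and reuses them on the test data. This is a sketch under assumptions, not something shown in the video: GroupMedianImputer is a hypothetical helper written for this example, and the train/test frames are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class GroupMedianImputer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fill `target` with the per-`group` median seen at fit time."""

    def __init__(self, group="Pclass", target="Age"):
        self.group = group
        self.target = target

    def fit(self, X, y=None):
        # Learn the medians from the training data only.
        self.medians_ = X.groupby(self.group)[self.target].median().to_dict()
        self.fallback_ = X[self.target].median()
        return self

    def transform(self, X):
        X = X.copy()
        # Look up each row's group median; unseen groups get the overall median.
        fill = X[self.group].map(self.medians_).fillna(self.fallback_)
        X[self.target] = X[self.target].fillna(fill)
        return X


# Made-up train and test frames.
train = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3, 3],
                      "Age": [40.0, 36.0, 30.0, np.nan, 20.0, 24.0]})
test = pd.DataFrame({"Pclass": [1, 3, 2],
                     "Age": [np.nan, np.nan, 50.0]})

pipe = Pipeline([("impute_age", GroupMedianImputer())])
pipe.fit(train)
test_filled = pipe.transform(test)
print(test_filled)
```

Because the medians are stored in fit(), the test set is filled with statistics from the training set, which avoids leaking test information into the imputation.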
@channelfisikaasik1124 2 years ago
Well, I appreciate the video that Mr. Krish Naik made, I love watching his videos, and I really want to discuss how we can handle missing values. Using a separate model on the variables that have complete data is not great, though, because the imputed value is generated by machine learning: it is not real data and may be statistically far from the center of the population, since it comes from another equation. I would rather use a statistical method like the mean, median, or mode and (I don't know whether this will work) check the range of the population mean to make sure the value does not stray far from it.
@cutecreature_san 3 years ago
Starting data science after a career gap of so many years. Your videos are god-level.
@uddipanmitra399 9 days ago
All good with the imputation of null values for Age. However, for the Cabin feature, instead of deleting roughly 70% of the records, if we can't find any way to impute via domain knowledge, why can't we tag them as "Undetected" and keep the records for model training?
Deleting 70% of the records just for one feature will surely not be the best solution if we want to improve model performance.
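The tagging suggestion in the comment above might be sketched like this. The four-row frame is made up, and the extra Cabin_missing indicator column is an optional addition that is not part of the comment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; most Cabin values are missing, as in the Titanic data.
df = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Survived": [1, 0, 1, 0],
})

# Optional: keep a 0/1 flag recording which rows had no cabin.
df["Cabin_missing"] = df["Cabin"].isnull().astype(int)

# Treat "missing" as its own category instead of dropping rows.
df["Cabin"] = df["Cabin"].fillna("Undetected")

print(df)
```

All rows survive, and a model can still learn from the fact that a cabin was not recorded.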
@aimenbaig6201 3 years ago ⁺¹
Thank you for making life so much easier for us!
@bhaktibailurkar1936 1 year ago
Cleared all my doubts! Great. Thank you so much!!
@strangereview2414 4 years ago
Nice explanation. The conclusion depends on your end goal and on whether dropping or replacing with the mean will affect your analysis; in his example he needed the age but not the cabin.
@kukulaarohi9850 3 months ago
Beautifully explained, with detail!
@konradpyrz8559 3 years ago
Great way of explaining things. I like it very much.
@gaziya8815 2 years ago
Thanks a lot for sharing your knowledge with us. Kindly address one confusion: do we need to impute missing values in the test dataset the same way you have taught in the video?
@dishydez 3 years ago ⁺²
Honestly, I really love your videos, simple and easy to understand. Always answering my machine learning and data science questions! I do have one though. I watched your video on standardisation and normalisation. I am trying to build a benchmark/index; would it be okay to standardize the data before creating it?
@equiwave80 3 years ago
Thanks Krish. I can't think of an easier explanation of a tricky topic!!! Simply superb!!!👍
@hanman5195 4 years ago ⁺¹
Your explanation is amazing and perfect, as usual.
@finance_tamil 5 years ago ⁺⁴
I thought you would also implement a regression model for synthetic imputation. But the content is great!!
@radifantaufik8085 3 years ago
But anyway, thank you for sharing. It helped me a lot in learning how to handle missing values. Nice work!
@saurabhtripathi62 4 years ago
A really good idea, creating a separate model. Thanks for sharing.
@ayselceferzade8587 2 years ago
Great! Thanks for the explanation.
@jobihara 3 years ago
Thanks Krish, very helpful.
@vishwa021094 3 years ago
Hi Krish, I find your videos very useful for beginners like me. Here you have shown how to handle missing values for numeric and string fields. We also need to handle date and time columns. Please guide us through this.
@eshaal2525 1 year ago
Hi, how do we deal with year values like 2006 and 0 in the same column?
@kajalkapasiya193 5 years ago ⁺¹
Hi Krish, I want to understand why you chose Pclass to replace the null values for Age.
Why not any of the other attributes?
@aikagyan999 4 years ago
Thanks sir, I was confused about exactly this part: NaN values and why we take the sum of them.
@BretskoD 3 years ago
If you use the isnull() function, it turns all your missing values into True (or 1) and non-null values into False (or 0). After that you can just sum() all of the 1's to find out how many NaN values are in your dataset.
@aikagyan999 3 years ago ⁺¹
@@BretskoD : Thank you sir.
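The isnull() and sum() counting described in this thread can be illustrated as follows. The small frame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isnull() gives a boolean frame: True where a value is missing.
mask = df.isnull()

# Summing booleans counts the True values, one count per column.
per_column = mask.sum()

# Summing again gives the total number of missing cells.
total = int(per_column.sum())

print(per_column)
print(total)
```

True counts as 1 and False as 0 when summed, which is why the sum of the mask equals the number of missing values.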
@gopi3e 4 years ago ⁺²
Thanks for the video. You said that option 2 (model-based imputation) is less preferred for huge datasets. Does that mean it is generally better to go with statistical imputation over model-based imputation on real-world datasets, since we get a lot of data in the real world? I am working on Home-Credit-Default-Risk (a Kaggle competition dataset); I request your comment on which imputation method to use.
@ahmedelsabagh6990 2 years ago
Very helpful video
@louerleseigneur4532 3 years ago
Thanks Krish
@aimenbaig6201 3 years ago ⁺¹
QUESTION: why did you impute age with respect to Pclass and not with respect to male or female?
@scifimoviesinparts3837 3 years ago ⁺¹
Could you please make a video on missing value imputation using decision trees?
@PriyaSharma-xb3ju 2 years ago ⁺¹
Hi Krish, could you please tell us what to do when there are missing values in the dependent variable?
@dineshbaisla951 5 years ago ⁺²
Hi Krish,
I have a doubt. How would you treat a variable that has around 30% missing values and is important to consider? There are around 550K records overall.
@coolsun-lifestyle 5 years ago
Thanks a lot for the detailed explanation. It really helps.
@mihirkamble9095 2 years ago
Thank you so much.
@alphonseinbaraj7602 4 years ago
You could do some more deployment using Flask. Please, Mr. Krish.
@rapchhos 3 years ago
Hi Krish,
I think the Age column in the distplot is right-skewed. I do not think it has a normal distribution.
@marsrover2754 4 years ago
What's the recommended rule for deciding whether to use data imputation techniques or simply drop the rows that have missing values? The missing values can follow patterns such as Missing at Random, Missing Not at Random, and so on. What should we do in that case?
@karndeepsingh 5 years ago ⁺¹
Sir, can you please tell us about the role of ROC and CAP curve analysis in improving model performance?
@priyaranjanswainjitu 4 years ago
Hi Krish, I have one doubt about this case study.
Why did you impute on the basis of the class column? We could also do it on the basis of the Gender column: the median/mean of male passengers and the median/mean of female passengers.
Also, we have normally distributed age data. Can we apply the mean instead of the median?
@rachanakotha6059 8 months ago
Are these only for numerical data? What methods can I use for characters/names or years? Please suggest. Thanks!
@isaiahdickinson9039 8 months ago
Please sir, what do I do if 80% of the values in my target variable are missing?
I'm trying to predict the gross of movies, but the target variable I need to train my model has 80% missing values.
@ezbitz23 4 years ago ⁺²
How can we decide whether to use the mean, median or mode to replace a missing value?
@adityan8536 4 years ago
You have to decide based on your data.
@amalsunil4722 4 years ago
Our first priority is the mean. If we have large outliers, we go for either the mode or the median, depending on the situation, as these two are least affected by outliers.
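The effect of outliers mentioned in the reply above can be seen on a tiny made-up series. The value 300.0 is a deliberately unrealistic age used as the outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme outlier and one missing value.
ages = pd.Series([22.0, 24.0, 26.0, 28.0, 300.0, np.nan])

# The mean is pulled toward the outlier; the median is not.
mean_fill = ages.fillna(ages.mean())
median_fill = ages.fillna(ages.median())

print(mean_fill.iloc[-1])    # value imputed with the mean
print(median_fill.iloc[-1])  # value imputed with the median
```

Here the mean-imputed value is far above every typical age in the series, while the median-imputed value sits in the middle of them.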
@sam45330 2 years ago
First, I want to thank Krish for all the content. I have a NumPy array of continuous values obtained from a regression model, but I don't know how to fill the null values using that array. Can anyone help me out?
@Abhishekpandey-dl7me 5 years ago
Thank you so much
@gauravsalwatkar8324 4 years ago
Sir, can you please make a video on named entity recognition using TensorFlow Keras?
@vineethp8925 3 years ago
Hi, I want to know: you used the box plot median to replace missing values in the age column, but why not the mean or mode? Can you please tell me the reason?
@RAJI11000 4 years ago
Sir, please give a suggestion regarding the Cabin feature: if it has a low number of missing values, how do we deal with it? It is a combination of categorical and numerical.
@rajagopal.g3533 3 years ago
Same doubt, bro. Do you know the answer?
@mvcutube 3 years ago
Nice video
@jaiminshah143 3 years ago
How do we handle missing (NaN) values in a column with binary values, i.e. just 0 or 1?
@malleswararaomaguluri6344 2 years ago
Sir, if we have missing values in the output column, how can the separate model be used?
@aimenbaig6201 3 years ago
Loved it
@NA-by7rv 4 years ago
Great!
@samirelzein1095 2 years ago
I am thinking of a Netflix-style problem/solution approach to filling missing values, a kind of cost-function minimization.
@pratikrandad1990 4 years ago
Awesome
@nikosterizakis 2 years ago
Why write by hand instead of just making text appear (as in: pre-typed, so people can read it, with transitions)?
@simonelgarrad 4 years ago
At 3:45 you said that we delete the record, but what if that variable/feature is significant?
@siyabongazungu1640 4 years ago ⁺¹
We can replace the null values with the mean of each column.
@ankitvarma2319 4 years ago
Sir, I have a doubt. What if we have 50 Pclass values? It becomes really tedious to write all of them. Is there a way to use a list of such Pclass values together with a list of the corresponding ages when defining the function? For example: if Pclass == list1:
return age == list2
@waynelai9312 4 years ago
Just use a dictionary where the key is Pclass and the value is the mean.
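The dictionary suggestion above could be written roughly as follows, with the dictionary built from the data instead of typed by hand. The six-row frame is made up; the column names follow the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with Titanic-style column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, np.nan, 30.0, 28.0, np.nan, 24.0],
})

# Build {Pclass: mean age} automatically, however many classes there are.
age_by_class = df.groupby("Pclass")["Age"].mean().to_dict()

# Look up each row's class mean and use it only where Age is missing.
df["Age"] = df["Age"].fillna(df["Pclass"].map(age_by_class))

print(age_by_class)
print(df)
```

This replaces a chain of if/else branches with one lookup, so 50 classes need no more code than 3.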
@subhajitdutta1443 3 years ago
Sir, I was unable to understand the programming part on Udemy. That is why I searched on YouTube, but here I can see that both are exactly the same. You should at least change the digits. With all due respect, you copied it from Udemy.
@gopalakrishna9510 4 years ago
In which scenario can we delete data with missing values?
@khusheekapoor 3 years ago ⁺¹
When there is a very large data set.
@sachinborgave8094 5 years ago
Hi sir, how do we fill missing values using linear regression?
@PuneethSaiBhaskar 5 years ago
👍👍👍
@simonnguyen237 2 years ago ⁺¹
Terrible presentation, can't really understand the red scribble. Maybe try typing.
