How do I handle missing values in pandas?

  • Published: Sep 11, 2024
  • Most datasets contain "missing values", meaning that the data is incomplete. Deciding how to handle missing values can be challenging! In this video, I'll cover all of the basics: how missing values are represented in pandas, how to locate them, and options for how to drop them or fill them in.
    SUBSCRIBE to learn data science with Python:
    www.youtube.co...
    JOIN the "Data School Insiders" community and receive exclusive rewards:
    / dataschool
    == RESOURCES ==
    GitHub repository for the series: github.com/jus...
    "read_csv" documentation: pandas.pydata.o...
    "isnull" documentation: pandas.pydata.o...
    "notnull" documentation: pandas.pydata.o...
    "dropna" documentation: pandas.pydata.o...
    "value_counts" documentation: pandas.pydata.o...
    "fillna" documentation: pandas.pydata.o...
    Working with missing data: pandas.pydata.o...
    == LET'S CONNECT! ==
    Newsletter: www.dataschool...
    Twitter: / justmarkham
    Facebook: / datascienceschool
    LinkedIn: / justmarkham
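
The basics listed in the description (how missing values are represented, how to locate them, and how to drop or fill them) can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the small `ufo` DataFrame is a hypothetical stand-in for the dataset, and only the column names are taken from the comments below.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the UFO dataset used in the video.
ufo = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke", "Abilene", np.nan],
    "Colors Reported": [np.nan, np.nan, "RED", np.nan, np.nan],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL", np.nan, np.nan],
})

# Missing values are represented as NaN; isnull() marks them with True,
# so summing the mask counts the missing values in each column.
missing_per_column = ufo.isnull().sum()

# Drop a row if ANY of its values is missing ...
dropped_any = ufo.dropna(how="any")
# ... or only if ALL of its values are missing.
dropped_all = ufo.dropna(how="all")

# Fill the missing values of a single column with a placeholder.
filled_shape = ufo["Shape Reported"].fillna("VARIOUS")
```

None of these calls modify `ufo` itself; each returns a new object, which makes it easy to compare the options before committing to one.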

Comments • 376

  • @dataschool
    @dataschool 6 years ago +20

    In pandas version 0.21 (released October 2017), they added 'isna' and 'notna' as aliases for 'isnull' and 'notnull'. Learn more in my latest video, "5 new changes in pandas you need to know about": ruclips.net/video/te5JrSCW-LY/видео.html
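
A quick way to see that the aliases mentioned above behave identically. This is a small sketch on a hypothetical Series, assuming pandas 0.21 or later:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values.
s = pd.Series(["TRIANGLE", np.nan, "DISK", np.nan])

# isna/notna return the same boolean masks as isnull/notnull.
same_null_mask = s.isna().equals(s.isnull())
same_notnull_mask = s.notna().equals(s.notnull())
```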

    • @bragattemas
      @bragattemas 4 years ago +2

      Even at the end of 2019, your material from 2016 is still incredibly helpful.
      I am certain that Data School will keep being a success and helping people.
      Excellent job, Kevin Markham. Thanks.

    • @Taranggpt6
      @Taranggpt6 4 years ago

      Why is the count different after replacing NA with *various*?
      The count of VARIOUS should equal the earlier number of NA values, which was 2644.

    • @EdgardThreat
      @EdgardThreat 4 years ago

      @@Taranggpt6 Hi, that's because there is already a category named "VARIOUS" in the dataset, so the newly filled-in values get added to the existing count of "VARIOUS".

    • @vigneshpadmanabhan
      @vigneshpadmanabhan 3 years ago

      Can we get a video on how to handle missing values in date/time-related datasets, such as sensor values or other sensitive values? Covering multiple ways of handling missing values would be very useful.

    • @nadyamoscow2461
      @nadyamoscow2461 2 years ago +1

      @@bragattemas I must say even in 2021 it is still completely up to date

  • @339059331
    @339059331 3 years ago +33

    I like his way of teaching; he doesn't assume that the audience already knows the material. He breaks down the explanation piece by piece. It is a great learning experience, with concise and clearly stated lectures as always! Thanks!

    • @dataschool
      @dataschool 3 years ago +1

      You're very welcome! Thanks for your kind words!

    • @depokboy
      @depokboy 3 years ago +2

      @@dataschool First time watching... those positive comments are true. Thanks a lot!

  • @codesandroads
    @codesandroads 3 years ago +14

    I never leave this place unsatisfied or without answers, total treasure.

  • @nasserabachi9625
    @nasserabachi9625 7 years ago +10

    Now i am in love with Pandas just by seeing a couple of your videos, Shukran Jazeelan !

    • @dataschool
      @dataschool 7 years ago +2

      That's awesome! Thanks for sharing!

  • @BAIBHAVPATHYBEE
    @BAIBHAVPATHYBEE 1 year ago

    It has been 6 years since this video was released and I'm watching it now, and it still made me fall in love with the series... every concept is beautifully explained in detail.

  • @kuldipchauhan524
    @kuldipchauhan524 6 years ago +20

    Awesome- you are gifted.... -- your explanation and content are clean and effective.

    • @dataschool
      @dataschool 6 years ago

      Thanks very much for your kind words!

  • @luqikong283
    @luqikong283 4 years ago +3

    The most amazing python tutorial I've watched so far. Fell in love with python.

  • @kiranachanta9741
    @kiranachanta9741 5 years ago +4

    I have been watching Kevin Videos, needless to say he is an Awesome Instructor. His explanation in all of his videos is Conceptual, In-depth and breaking down any complex topic into the easiest way.
    Thanks Kevin for your great Work!!!
    It would be great if you could make videos on visualization using Matplotlib & Seaborn.

    • @dataschool
      @dataschool 5 years ago

      Thanks for your kind words, and for your suggestions! :)

  • @astroinceptor
    @astroinceptor 2 years ago +1

    You saved my life twice today, your videos are great and the way you explain is really good. Thank you!

  • @rahuldeepdraws8699
    @rahuldeepdraws8699 3 years ago +1

    This is actually the most clearly explained video on DataFrames that I have ever come across. Glad I found you. Thank you so much.

  • @gabrielreilly7010
    @gabrielreilly7010 3 years ago +1

    Great videos covering the basics. I enjoy how the additional values within the functions are covered, i.e. axis, etc.

  • @FULLCOUNSEL
    @FULLCOUNSEL 7 years ago +23

    You are doing an excellent job. You are called to do this for sure. Cheers

    • @dataschool
      @dataschool 7 years ago

      Wow, thank you so much for your comment! I really appreciate it.

  • @saachishivhare4836
    @saachishivhare4836 4 years ago

    I am really loving your videos. Explored your channel just 2 days back!! Earlier I had no idea about pandas but after watching your video, I feel that I will be able to work on my assignment. Great Work! Thank you!

  • @TrevorHigbee
    @TrevorHigbee 4 years ago +1

    Great videos. I love how all the CSVs are available online.

  • @stevechops3226
    @stevechops3226 3 years ago

    I cannot tell you how much you have helped me, with all sorts of problems! You have the clearest way of explaining things, thank you so much!

    • @dataschool
      @dataschool 3 years ago

      You're so very welcome, thanks for your kind words! 🙏

  • @indreshkumar2002
    @indreshkumar2002 7 years ago +1

    You are superb. I took a paid course, but they were not able to explain these things to me the way you did, in such an easy way. Thanks a lot.

    • @dataschool
      @dataschool 7 years ago

      You are very welcome! Thanks so much for your kind comment!

  • @sinabaghaei3504
    @sinabaghaei3504 3 years ago

    Your way of teaching makes learning Data Analysis very interesting to me. I really appreciate and wish you success.

  • @Person_Not_Known
    @Person_Not_Known 6 years ago +2

    Thanks for your videos. Most of the online Python courses I took, I just couldn't get into. Something about your cadence, data sets, and/or approach just clicks with me. Thanks for the content.

    • @dataschool
      @dataschool 6 years ago

      That's awesome! Thanks so much for sharing!

  • @fredcalo
    @fredcalo 7 years ago

    This was awesome. Clear, concise, incredibly easy to follow. Your explanations (and bonus) were exactly what I was looking for.

    • @dataschool
      @dataschool 7 years ago

      Excellent! I'm glad the video was helpful to you!

  • @grumpyae86
    @grumpyae86 3 years ago +1

    Exceptional would be a single word to describe your tutorial. Looking forward to binging on your videos lol. Thank you for such clear explanation.

  • @TR3NDSETR
    @TR3NDSETR 4 years ago

    Thanks so much for making this video. You spoke slowly, clearly, and very concisely. With other videos I have to rewind and watch again, but I don't have to do that here. Looking forward to watching others.

    • @dataschool
      @dataschool 4 years ago

      That's awesome to hear! Thanks for watching my videos 👍

  • @elilavi7514
    @elilavi7514 8 years ago +14

    Awesome as usual !

  • @sammy0722
    @sammy0722 4 years ago +1

    Good video. Learnt a lot in short and crisp way

  • @sapnasinha804
    @sapnasinha804 5 years ago

    Fantastic explanation , however at the end would be good to mention that there are more ways to fill with value_counts , eg. With the mean of all other values etc and not just merging null column with any other column. Cheers!

  • @jiaxufan7050
    @jiaxufan7050 7 years ago

    The best pandas tutorial ever. Hands down.

    • @dataschool
      @dataschool 7 years ago

      Wow! Thank you so much for your kind comment!

  • @rahulbhusari1478
    @rahulbhusari1478 2 years ago +1

    Really clear and amazing tutorial

  • @mustafabohra2070
    @mustafabohra2070 5 years ago +1

    The content you shared is Gold!!

  • @firdharamadhani5162
    @firdharamadhani5162 4 years ago

    i rarely leave youtube comment but thank you!! if it werent for your video i wouldn't understand how to do my assignment at all, you did a great job at explaining!

  • @konradpyrz8559
    @konradpyrz8559 3 years ago +1

    This young gentleman is simply amazing.

    • @dataschool
      @dataschool 3 years ago

      Thank you! I'm actually 40 years old now 😊

  • @wesleypgurira7142
    @wesleypgurira7142 2 years ago +1

    and by the way i love the way you teach , its just perfect

  • @nadyamoscow2461
    @nadyamoscow2461 2 years ago +1

    Thanks a lot, your course is really helpful and very detailed. You are a great teacher!

  • @-MinhazulFerdous
    @-MinhazulFerdous 6 years ago

    you are a life saver man...... i was fucked up with errors for only 2 missing values in a row of 1000 data

    • @dataschool
      @dataschool 6 years ago

      Glad to hear I could be of help!

  • @SachinGairola
    @SachinGairola 6 years ago

    great video series, I always fall back here whenever I'm stuck..Thanx for making them so informative...cheers

    • @dataschool
      @dataschool 6 years ago

      Thanks very much for your kind words!

  • @msctube45
    @msctube45 4 years ago

    Excellent video Data School, very helpful, your explanations are clear and objective. Thank you !

  • @ashishsahu2925
    @ashishsahu2925 2 years ago

    Really helpful. This means if one needs to figure out number of rows with 1 or more Null values, the code should look like dataframe[dataframe.isnull().sum(axis=1) > 0].
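
The expression in the comment above selects the rows that contain at least one null; to get the number of such rows, the boolean mask itself can be summed. A minimal sketch on a hypothetical DataFrame (the column names `A` and `B` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1, 2 and 3 each contain at least one null.
df = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan],
    "B": ["x", "y", None, None],
})

# True for every row that has one or more missing values.
row_has_null = df.isnull().any(axis=1)

# The rows themselves (equivalent to the expression in the comment) ...
rows_with_null = df[row_has_null]

# ... and how many of them there are.
n_rows_with_null = row_has_null.sum()
```

Using `any(axis=1)` instead of `sum(axis=1) > 0` is only a matter of readability; both produce the same mask.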

  • @nimesharya909
    @nimesharya909 8 years ago +3

    Amazing, clear, precise and I got it working as well :)

  • @mingqian813
    @mingqian813 3 years ago +2

    Thanks for all your well-made videos! I got to know you and your classes from Datacamp. As a beginner in the ML field, please allow me to ask a silly question. So if we have categorical features with missing values, do we need to handle missing values first then do categorical feature transformation using encoders? Or the order doesn't matter? Thanks!

    • @dataschool
      @dataschool 3 years ago

      Great question! Previous to scikit-learn 0.24, missing values need to be handled first if you are going to one-hot encode them. Starting in 0.24, OneHotEncoder can handle missing values itself. Hope that helps!

  • @rishimusicprods
    @rishimusicprods 3 years ago

    This video is quite helpful and easy to understand. Thanks a lot!

  • @tyl9680
    @tyl9680 4 months ago

    In the last part of the video, why doesn't the number of "VARIOUS" values produced by fillna match the previous number of NA values?

  • @BrokenLightPole
    @BrokenLightPole 5 years ago +1

    Great video and explanation as always!

  • @MrsRimouch
    @MrsRimouch 2 years ago

    Thank you so much. It is always clearer to listen to you!!

  • @ExcelTutorials1
    @ExcelTutorials1 2 years ago +1

    This is super helpful, thank you!!!!!

  • @tommonks2490
    @tommonks2490 4 years ago +1

    Great explanation. This was a huge help. Thanks so much!

  • @vishwanathg8083
    @vishwanathg8083 6 years ago

    Thank you , You made learning pandas a cake walk.

    • @dataschool
      @dataschool 6 years ago

      Awesome, that's great to hear!

  • @bagushari1886
    @bagushari1886 1 year ago +1

    Could you please make a video on how to handle missing values in multiple sheets in pandas? Or can you recommend any source where I can read about it?
    Thanks in advance

  • @LonglongFeng
    @LonglongFeng 7 years ago

    very nice tutorial, your style of teaching is awesome like an amazing opera singer

    • @dataschool
      @dataschool 7 years ago

      What a compliment, thanks! :)

  • @azbas
    @azbas 7 years ago

    Thank you for simple and detailed explanation including the use of features.

  • @nasser_omar
    @nasser_omar 3 years ago

    What about displaying the rows where columns 'A' and 'B' both have missing values?

  • @yashugarg1815
    @yashugarg1815 5 years ago +2

    Question: Sir, suppose I want to replace a value such as 5 with NA, so that wherever 5 is present in a DataFrame it is replaced by NA. How should I proceed?
    Thanks

    • @mountainscott5274
      @mountainscott5274 4 years ago +2

      df.column_name.replace(5, np.nan, inplace = True)
      check to make sure values are replaced with df.info()

    • @dataschool
      @dataschool 4 years ago

      Nice!
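
A runnable variant of the suggestion in this thread, sketched on a hypothetical DataFrame. It assigns the result back instead of calling `replace(..., inplace=True)` on a selected column, because recent pandas versions warn about in-place methods applied through chained access; `column_name` is the placeholder name from the reply, and `other` is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which the value 5 should be treated as missing.
df = pd.DataFrame({"column_name": [5, 2, 5, 7], "other": [5, 1, 1, 1]})

# Replace 5 with NaN in one column and assign the result back.
df["column_name"] = df["column_name"].replace(5, np.nan)

# Or replace 5 with NaN everywhere in the DataFrame.
df_all = df.replace(5, np.nan)

# Check that the values were replaced by counting the missing values.
n_missing = df["column_name"].isnull().sum()
```

Note that an integer column becomes a float column once it contains NaN.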

  • @jieqi6341
    @jieqi6341 5 years ago

    Thank you so much! you are amazing as always ! I really appreciate it ! Please don't stop making these videos !

  • @jaden_vdb
    @jaden_vdb 6 months ago

    How can we count each time we drop a row and not count the amount of NaN values?

  • @eniisy
    @eniisy 2 years ago

    Dude, it's just an awesome video. Forgive me for saying this, but setting the playback speed to 1.25 feels more normal, haha. Love ya, and I appreciate your effort in teaching piece by piece!

  • @rehmanullahkhan7389
    @rehmanullahkhan7389 7 years ago

    Thank you so much. Very useful. I have no words to express my appreciation. I liked your way of teaching very much. You have become my ideal in teaching.

    • @dataschool
      @dataschool 7 years ago

      You're very welcome! Thanks so much for your kind words!

  • @ItsWithinYou
    @ItsWithinYou 2 years ago

    I have 1 column with 100 rows. After dropping 4 rows with null values, new column has 96 rows. How to write a code that can tell me which 4 rows were dropped

  • @adelabdallah3833
    @adelabdallah3833 1 year ago

    I actually have question, I have a dataframe grouped by month and country. Some of those countries don't have a value for a certain month which is causing anomalies in the visualization. I want to generate a record for the month and the country with zero if no record is found, how can I achieve that?
    Thanks in advance

  • @user-ic1bb3tv9d
    @user-ic1bb3tv9d 5 years ago

    all the content in the video are presented clear!!! thanks very much!! we love you!!

  • @amishbhat3560
    @amishbhat3560 3 years ago +1

    You told how to handle NaN values but if there are some other values such as "Not Provided" then what to do?
    How to ignore them?

  • @paula805
    @paula805 6 years ago +1

    What inspires a down vote on any of these videos?? Always great content!

  • @RahulKumar-bh9hb
    @RahulKumar-bh9hb 4 years ago

    Explanation techniques is great........want to thank you for sharing your knowledge......Grt videos

  • @gouravkushwaha4488
    @gouravkushwaha4488 4 years ago

    You are good. Your explanation really made it simple.

  • @HasanSuper
    @HasanSuper 3 years ago

    What a beautiful video and such great explanation. Beautiful. Keep it up

  • @rajpaul1501
    @rajpaul1501 3 years ago

    Truly amazing videos. Can you do a series on Matplotlib and Seaborn

    • @dataschool
      @dataschool 3 years ago

      Thanks for your kind words and suggestion!

  • @delmaregals
    @delmaregals 11 months ago +1

    Hi, let's say I accidentally changed the values, like in line 19 where NaN is changed to VARIOUS. Can I reverse the change?

    • @dataschool
      @dataschool 10 months ago

      No, changes made through assignment (or inplace operations) are permanent!

  • @mdfaiz4583
    @mdfaiz4583 5 years ago

    great tutor...great way of making us understand.... so easy and intuitive

  • @PointlessVanessa
    @PointlessVanessa 6 years ago

    Great video! I learned a lot! I just wished you talked about non discrete values as well. I'm having some trouble to replace missing numerical data and I don't want to replace them with zero because that would make my dataset biased. My goal was to replace that missing data with the mean of the data that I have.

    • @PointlessVanessa
      @PointlessVanessa 6 years ago

      The only problem is that I don't know how to do that (yet).

    • @dataschool
      @dataschool 6 years ago

      Glad you liked the videos! You can do something like this: df.fillna(df.mean())
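
The `df.fillna(df.mean())` suggestion can be sketched as follows on a hypothetical DataFrame. One assumption is added here: `numeric_only=True` is passed to `mean()` so that the non-numeric `name` column (made up for this example) is skipped rather than causing an error in newer pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing numeric and non-numeric values.
df = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "weight": [10.0, 20.0, np.nan],
    "name": ["a", "b", None],
})

# Column means of the numeric columns only; NaN values are ignored.
column_means = df.mean(numeric_only=True)

# Each numeric column is filled with its own mean; "name" is left untouched.
filled = df.fillna(column_means)
```

Filling with the mean keeps each column's average unchanged, but it does shrink the spread of the data, so it is worth checking whether that matters for the analysis at hand.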

  • @NR_Tutorials
    @NR_Tutorials 5 years ago +3

    Thanks for the nice lecture, we love you, sir

  • @eshaal2525
    @eshaal2525 1 year ago

    Hi, when I change NA to NaN in my data frame, all integers become floats...
    And at first no null values were shown, but now there are null values.

  • @vinodkumarjodu4062
    @vinodkumarjodu4062 5 years ago +1

    In some scenarios there will be zeros instead of NaN. How do you handle those, or how do you count the number of zeros?

  • @saranyan4123
    @saranyan4123 7 years ago

    Thank you for the clear and quick explanation. Very helpful !!

  • @oysteijo
    @oysteijo 8 years ago +3

    Hi Kevin!
    How can I fill na based on a condition? Say I want to fill NA for all missing cities, but only if the color is red.

    • @dataschool
      @dataschool 8 years ago +7

      Great question! ufo.loc[(ufo.City.isnull()) & (ufo['Colors Reported']=='RED'), 'City'] = 'New value'

    • @ashishkhuraishy
      @ashishkhuraishy 6 years ago

      Man thx btw😁
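
The conditional fill from this thread, written out as a small runnable sketch. The DataFrame is a hypothetical stand-in for the UFO data; the column names and the `'New value'` placeholder come from the reply above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the UFO dataset.
ufo = pd.DataFrame({
    "City": [np.nan, np.nan, "Ithaca", np.nan],
    "Colors Reported": ["RED", "BLUE", "RED", np.nan],
})

# Rows where the city is missing AND the reported color is RED.
mask = ufo["City"].isnull() & (ufo["Colors Reported"] == "RED")

# .loc with a boolean mask and a column label assigns only to those cells.
ufo.loc[mask, "City"] = "New value"
```

Only the first row is changed here: the second row has a different color, the third already has a city, and the fourth has no color at all.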

  • @keshavkashyap2012
    @keshavkashyap2012 4 years ago

    In this Tutorial for finding the missing city name we used syntax ufo[ufo.City.isnull()] but what if i have to find the missing "Shape Reported" the syntax ufo[ufo.Shape Reported.isnull()] is not working? how to specify the space?
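
Dot notation only works for column names that are valid Python identifiers, so a name containing a space has to be selected with bracket notation (the same form used for `'Colors Reported'` in the thread above). A minimal sketch on a hypothetical stand-in for the UFO data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the UFO dataset.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", np.nan],
    "Shape Reported": [np.nan, "DISK", "OVAL"],
})

# Bracket notation handles the space in the column name.
missing_shape = ufo[ufo["Shape Reported"].isnull()]
```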

  • @uttamkumarpatra7616
    @uttamkumarpatra7616 5 years ago +1

    You are simply awesome :) .thank you for making such wonderful videos

    • @dataschool
      @dataschool 5 years ago

      That's so nice of you to say - thank you!

  • @rohitsinghal2972
    @rohitsinghal2972 4 years ago +1

    Sir, what you showed in the video applies only to numbers. What should be done for string values?

    • @dataschool
      @dataschool 4 years ago +1

      If you like, you can impute missing string values with the most common values using scikit-learn's SimpleImputer.

    • @SkillTop
      @SkillTop 4 years ago

      ROHIT SINGHAL Hi programmer🔌🤩 pleaaaase see my channel🌹

  • @alexhenning7086
    @alexhenning7086 4 years ago +1

    Superb video! Thanks a lot it helps alot !

  • @causap
    @causap 5 years ago

    Markham, make America Great Again...You're the Boss..

  • @deepakshisharma2660
    @deepakshisharma2660 3 years ago

    I used the drop command to drop a column which has 10,000 identical entries out of 50,000, but it is deleting all rows when I use df.dropna(how='any').shape. What should I do?

  • @ekehopenkiruka
    @ekehopenkiruka 6 years ago

    Awesome, simple and straight to the point with code, which is what I have been looking for for weeks. Thank you so much. Do you have any video where you have used the NSL KDD or KDD 99 data set to demonstrate data pre-processing? This is driving me nuts.

    • @dataschool
      @dataschool 6 years ago

      I'm sorry, I don't have a video like that... good luck!

  • @njoy2075
    @njoy2075 4 years ago

    Can you please post a complete project from scratch, including pandas, matplotlib, scikit-learn, and seaborn?

  • @Andrew6James
    @Andrew6James 4 years ago

    I thought we said that no values were missing from the City or Shape reported columns. Why do we see rows dropped at @11:07?

    • @dataschool
      @dataschool 4 years ago

      Values are missing from both the City and Shape Reported columns.

  • @nyashagracenhandara7757
    @nyashagracenhandara7757 2 years ago +1

    thank you so much the explanation is very clear

  • @samiagharib3796
    @samiagharib3796 4 years ago

    Do you handle missing data before splitting the data set (training set and test set) ?

  • @akashjoshi6826
    @akashjoshi6826 7 years ago

    Fantastic video Sir.Your work is really commendable.It would be great if you can make a video about imputing the missing values in Python.

    • @dataschool
      @dataschool 7 years ago

      Thanks for your suggestion, and your kind comments!

  • @tresortshimbombo3133
    @tresortshimbombo3133 5 years ago +1

    That's exactly what I was looking for!

  • @usmanshaikh1115
    @usmanshaikh1115 5 years ago

    Very useful and easily explained.

  • @joehopewell
    @joehopewell 6 years ago

    Thank you!!!! All the good stuff, all in the same place...love it!

  • @wesleypgurira7142
    @wesleypgurira7142 2 years ago

    Hey, how can we replace a NaN value with the previous value in a dataset? For example, in the UFO data (shapes), instead of VARIOUS you would fill in, say, RECTANGLE if that was the value just before the NaN.

  • @dishydez
    @dishydez 3 years ago

    Great video btw. Just a quick question. I am trying to build a benchmark, would it be okay to make the data standardized before creating it or?

  • @jonathanfriz4410
    @jonathanfriz4410 3 years ago

    Hi, how can you handle "ValueError: arrays must all be same length" when df.transpose() is not an option?

  • @nataliaagudelo8635
    @nataliaagudelo8635 5 years ago

    As always, your videos are very helpful!

    • @dataschool
      @dataschool 5 years ago

      Thanks very much for your kind words!

  • @rayrivera1830
    @rayrivera1830 4 years ago

    what happens if you have missing values while training the model, e.g. xgboost?

  • @MrJioYoung
    @MrJioYoung 4 years ago +1

    Thank you! Great instructions!

  • @asutoshnayak1391
    @asutoshnayak1391 3 years ago

    Bro how to do data cleaning in pandas ? What are the methods used for it ? Please reply

  • @nataliyakunderevych1211
    @nataliyakunderevych1211 6 years ago +1

    Super. I understood everything. Nice explanation

  • @pegasoos
    @pegasoos 5 years ago

    I watched your first video, you are legend!

  • @amitvajpeyee3890
    @amitvajpeyee3890 3 years ago +1

    Thank you so much for explaining in such a simple language! You are doing a great job. God bless you!

  • @saubhagya594
    @saubhagya594 4 years ago

    Lots of thanks from NEPAL✌✌✌

  • @carolinasantoslages5604
    @carolinasantoslages5604 4 years ago

    Excellent! I still have one question: how do I create a third column (dummy variable) based on two other columns (dummy variables), considering that they have missing values? I don't want to lose information; in other words, I want to treat the pair (NaN, 1) or (0, NaN) as 1 or 0.

  • @salamatburj9502
    @salamatburj9502 6 years ago

    I think it would be great if you could make a lecture about handling missing values for machine learning.

    • @dataschool
      @dataschool 6 years ago

      Thanks for your suggestion!

  • @saiftazir
    @saiftazir 7 years ago

    Dear Sir, a small question: I want to replace "..." in a specific column named "Energy Supply". What I am doing is
    en1['Energy Supply']=en1['Energy Supply'].str.replace("...", "NaN")
    but this turns all the other, correct values into NaN as well.
    My objective here is to replace "..." with NaN.

    • @dataschool
      @dataschool 7 years ago

      I would not advise using the text "NaN" to denote missing values. Rather, you should be setting those values to "nan" from the NumPy library. An example is shown in this video: ruclips.net/video/4R4WsDJ-KVc/видео.html
      Does that help?
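
One likely cause of the behavior described in the question: in older pandas versions `.str.replace` treated its pattern as a regular expression by default, and in a regular expression `...` matches any three characters, so the correct values were rewritten too. A minimal sketch of an alternative, assuming the column holds strings with `"..."` as a placeholder (the sample values are made up): `Series.replace` matches whole cell values exactly, and `np.nan` gives a real missing value rather than the text "NaN".

```python
import numpy as np
import pandas as pd

# Hypothetical data: "..." marks a missing measurement.
en1 = pd.DataFrame({"Energy Supply": ["321", "...", "102", "..."]})

# Replace cells that are exactly "..." with a real missing value.
en1["Energy Supply"] = en1["Energy Supply"].replace("...", np.nan)

# Convert the remaining strings to numbers; NaN stays NaN.
en1["Energy Supply"] = pd.to_numeric(en1["Energy Supply"])
```

If the data comes from a file, the `na_values` parameter of `read_csv` may be a simpler option, since it marks such placeholders as missing while the file is being read.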

  • @TheBeltranito
    @TheBeltranito 3 years ago

    Hey, first of all thanks a lot for your videos!
    One question regarding the fillna() method you use. At the end of the video, when you check the NAs in Shape Reported, it says that there are 2644 NaN values. However, after you use the fillna() method, there are 2977 VARIOUS values. Why are there more VARIOUS than NaN?
    Thanks in advance

    • @TheBeltranito
      @TheBeltranito 3 years ago +1

      Okay nvm, there was already a group called various with 333 observations
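
The arithmetic behind this answer (2644 filled values plus the 333 existing ones gives 2977) can be reproduced on a tiny hypothetical Series in which "VARIOUS" already appears before filling:

```python
import numpy as np
import pandas as pd

# Hypothetical shapes: "VARIOUS" is already a category, and three values are missing.
shapes = pd.Series(["VARIOUS", "DISK", np.nan, np.nan, "VARIOUS", np.nan])

n_missing = shapes.isnull().sum()
n_various_before = (shapes == "VARIOUS").sum()

# After filling, the new "VARIOUS" values are added to the existing ones.
filled = shapes.fillna("VARIOUS")
n_various_after = filled.value_counts()["VARIOUS"]
```

Choosing a fill value that does not already occur in the column keeps the filled-in rows distinguishable from the original ones.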