How to Do Data Cleaning (step-by-step tutorial on real-life dataset)

  • Published: Jul 27, 2024
  • 🐼 All you need to know about Pandas in one place! Download my Pandas Cheat Sheet (free) - misraturp.gumroad.com/l/pandascs
    👇Learn how to complete your first real-world data science project
    Hands-on Data Science course
    www.misraturp.com/hods
    Data exploration video: • How to Do Data Explora...
    If you’d like to follow along, find the data here: data.cityofnewyork.us/Environ...
    00:00 Welcome and some words on data cleaning
    01:21 Cleaning irrelevant columns
    01:43 Cleaning categorical column values
    02:27 Cleaning missing values
    13:08 Dealing with outliers
    17:17 Data has a surprise for us
    17:57 Dealing with unexpected data issues
    21:10 Going forward
    👋 Keep in touch?
    ==========================
    🐥 Twitter - / misraturp
    🔗 LinkedIn - / misraturp
    📹 YouTube - / @misraturp
    🌎 Website - misraturp.com/
    Courses & resources
    ============================
    👩‍💻 Hands-on Data Science: Complete your first portfolio project
    www.misraturp.com/hods
    📙 Fundamentals of Deep Learning in 25 pages
    misraturp.gumroad.com/l/fdl
    📥 Streamlit template
    misraturp.gumroad.com/l/stemp
    🤖 Deep Learning 101 with Python and Keras (FREE)
    • 50 Days of Deep Learning
    🏃‍♀️ Data Science Kick-starter mini-course (FREE)
    misraturp.gumroad.com/l/kick-...
    🐼 Pandas cheat sheet (FREE)
    misraturp.gumroad.com/l/pandascs
    📝 NNs hyperparameters cheat sheet (FREE)
    misraturp.gumroad.com/l/hcs

Comments • 127

  • @cybiryan
    @cybiryan 9 months ago +1

    Thank you for showing us these techniques. And I love the pace, thank you!

  • @animelover5093
    @animelover5093 1 year ago

    Great content! This step-by-step guide has been extremely helpful in my journey.

  • @simenandreasknudsen9272
    @simenandreasknudsen9272 3 years ago +5

    This is great Misra, thanks! :) Please make more of these!

  • @iqraasif3783
    @iqraasif3783 8 months ago

    This is gold. Learned a lot from your videos. Thanks!

  • @antonioarana8002
    @antonioarana8002 2 years ago +1

    PERFECTION all this explanation and process, thanks Misra! you are a master, thanks a lot!

    • @misraturp
      @misraturp 2 years ago

      You're very welcome :) and thank you for your nice words Antonio :)

  • @ahmedkaram-ws3hs
    @ahmedkaram-ws3hs 2 years ago +2

    Thank you, Misra, for this great content. Please keep making videos; they're very helpful and your explanations are great.

  • @prekshagampa5889
    @prekshagampa5889 2 years ago +4

    Hey Misra! I am new here and I love your content.
    And thank you for this wonderful explanation. It's really helpful to get a clear view of how things work.

    • @misraturp
      @misraturp 2 years ago +1

      That’s great to hear. Thank you!

  • @aaratigeorge2904
    @aaratigeorge2904 2 years ago

    This was a wonderful tutorial in which you also gave insights into what you consider while cleaning data. Thank you.

    • @misraturp
      @misraturp 2 years ago

      You're very welcome!

  • @fehmidatahir2509
    @fehmidatahir2509 2 years ago

    Thank you SO much! I am totally in love with the content you share!

    • @misraturp
      @misraturp 2 years ago +1

      You're very welcome :)

  • @atharvasavdekar7613
    @atharvasavdekar7613 3 years ago +1

    You deserve more appreciation. Great Content!!!

    • @misraturp
      @misraturp 3 years ago +1

      Thank you so much 😀

  • @abdelrhmanrhyaseen6194
    @abdelrhmanrhyaseen6194 1 year ago

    This was really amazing, Thank you.

  • @fernandoraposo5038
    @fernandoraposo5038 3 years ago +1

    That was the video that i was searching for. Thanks :)

    • @misraturp
      @misraturp 2 years ago

      Awesome! You are very welcome.

  • @deandu3414
    @deandu3414 3 months ago

    Thank you for the amazing lecture.

  • @raminsadeghnasab9310
    @raminsadeghnasab9310 1 year ago

    That was amazing. Thanks for your time.

  • @LightHouse31073
    @LightHouse31073 1 year ago

    I would highly recommend that new data analysists follow along with this data cleaning process you present. The level of detail is impeccable 👌

    • @misraturp
      @misraturp 1 year ago

      Great to hear Loyiso, thank you!

    • @LightHouse31073
      @LightHouse31073 1 year ago

      @@misraturp My pleasure. You're so good. (*meant to say Data Analysts and not the gibberish above🤭)

  • @harikishan437
    @harikishan437 2 years ago +1

    It's really amazing, I learned a lot of new lines of code here as well as the concepts... Thanks a lot @Misra Turp😇😇😇😇😇

    • @misraturp
      @misraturp 2 years ago

      That's great to hear, thank you!

  • @la-vieborde9825
    @la-vieborde9825 1 year ago

    thank you so much. I am a new data scientist and these videos are very helpful.

  • @judyostroot8682
    @judyostroot8682 2 years ago +3

    Wonderful tutorial! This is exactly the type of content I was looking for. When you are cleaning data in a work environment, do you document all the changes you make?

    • @misraturp
      @misraturp 2 years ago +2

      Hey Judy, I’m happy to hear that! I did not document the cleaning steps most of the time. But having clear comments on the code itself is very useful to people who will maintain the code after you to understand why you have done something.

  • @onyinyeobijiofor7075
    @onyinyeobijiofor7075 2 years ago

    This is so nice and clean 👌

  • @misraturp
    @misraturp 2 years ago +3

    👉 Get real world data science experience by doing hands-on work
    www.misraturp.com/hods

  • @KonradTamas
    @KonradTamas 11 months ago +1

    Smart bute :)
    and great content ! Thanks

  • @mohammedansar8023
    @mohammedansar8023 2 years ago

    Amazing... My search has ended here... Please continue uploading more videos...

    • @misraturp
      @misraturp 2 years ago

      That's awesome to hear! Thank you.

  • @8CountLife
    @8CountLife 3 years ago

    I'm very appreciative for your channel

    • @misraturp
      @misraturp 3 years ago

      Thank you 8CountLife!

  • @hassanmahamat-pz8fx
    @hassanmahamat-pz8fx 2 months ago

    Good explanation.

  • @hamidhussain5488
    @hamidhussain5488 11 months ago

    Grateful for the awesome tutorials. How should I split a dataset into training and testing sets? I am working on EMG signal classification using an SVM classifier, and I am confused about how to split the data for the classification task.

  • @FRANKWHITE1996
    @FRANKWHITE1996 2 years ago +1

    Thanks for sharing! 👍

  • @KhaliDALKhafaji
    @KhaliDALKhafaji 1 year ago

    That was a great illustration, thanks a lot.

  • @NotFog1
    @NotFog1 2 years ago

    So good, thanks a lot!

    • @misraturp
      @misraturp 2 years ago

      You're very welcome!

  • @diegomartins7214
    @diegomartins7214 9 months ago

    Thank you!

  • @haliltezel8106
    @haliltezel8106 2 years ago

    Thank you for the content, Mısra. Data is not always as clean as in tutorials :) Good job, keep going, my friend.

  • @CresentX
    @CresentX 3 years ago

    Good work Misra

  • @ExtraKanin
    @ExtraKanin 1 year ago +4

    As someone with short attention span these days, I'd like to pat myself at the back for being able to sit through the entirety of the video. Data is really interesting! Thanks, Misra :)

    • @misraturp
      @misraturp 1 year ago +1

      Great to hear that you enjoyed it! :)

  • @rwejolandacademy
    @rwejolandacademy 1 year ago

    I loved you from the time I see you. TYJ

  • @paragandozdroch3791
    @paragandozdroch3791 1 year ago

    Thank you, Misra, for the detailed video, very helpful. I am wondering what the shortcut is for the search drop-down bar at 29:18 of your video. Thanks.

  • @shameermasroor7375
    @shameermasroor7375 1 year ago

    Hello, Misra! great video!
    I had a question. When I run the tree_census_subset['steward'].value_counts() statement, pandas does not return the None count, or the count of the trees which do not have a steward. Has there been some change in the value_counts() function or am i doing something wrong?
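
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the steward values are being read as missing values instead of the string "None", and `value_counts()` leaves missing values out by default. The small DataFrame below is a hypothetical stand-in for `tree_census_subset`; the `dropna=False` argument asks pandas to count missing entries as well.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's tree_census_subset DataFrame.
tree_census_subset = pd.DataFrame(
    {"steward": ["1or2", None, "3or4", None, None, "1or2"]}
)

# Default behaviour: missing values are left out of the counts.
counts_default = tree_census_subset["steward"].value_counts()

# dropna=False adds a separate entry for the missing values.
counts_with_missing = tree_census_subset["steward"].value_counts(dropna=False)

print(counts_default)
print(counts_with_missing)
```

If the second output shows a count for the missing entries, filling them with an explicit label such as `fillna("None")` would bring them back into the default counts.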

  • @asadghnaim2332
    @asadghnaim2332 3 years ago

    Thank you a lot Mirsa :D

    • @misraturp
      @misraturp 2 years ago +1

      You're welcome 😊

  • @parkuuu
    @parkuuu 2 years ago

    Hello Misra,
    For the last part (substituted the diameter with Q1 and Q3 values), wouldn't it be better to retain the original column data and then just add a new one based on a condition (like np.where actual dia < Q1, use Q1 else dia), just to be able to compare it side by side
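
A minimal sketch of this suggestion, assuming a hypothetical diameter column named `tree_dbh` and made-up values: the original column is left untouched and the capped values go into a new column, so the two can be compared side by side. `Series.clip` is shown as an equivalent shortcut to the nested `np.where`.

```python
import numpy as np
import pandas as pd

# Made-up diameters; "tree_dbh" is an assumed column name.
df = pd.DataFrame({"tree_dbh": [0, 2, 5, 9, 14, 30, 59]})

q1 = df["tree_dbh"].quantile(0.25)
q3 = df["tree_dbh"].quantile(0.75)

# Keep the original column and write the capped values to a new one.
df["tree_dbh_capped"] = np.where(
    df["tree_dbh"] < q1,
    q1,
    np.where(df["tree_dbh"] > q3, q3, df["tree_dbh"]),
)

# Equivalent result with Series.clip.
df["tree_dbh_clipped"] = df["tree_dbh"].clip(lower=q1, upper=q3)

print(df)
```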

  • @ashikinfodu
    @ashikinfodu 6 months ago

    Hi Misra, a quick question: what do you do with missing values created by logical/skip questions in a survey? Thanks!

  • @searchbug
    @searchbug 2 years ago

    Wow! Not everyone gets to share this kind of in-depth walkthrough. Thanks for sharing this, Misra! For our dear friends who are not really techy or have no experience dealing with code, you may learn the basics or find other ways to clean data. With bulk validation, for example, the only thing you need to do is make sure that your data sets are well organized, with proper headings. You can then simply upload them to third-party software that will do the verification for you and let you know which data sets are outdated, invalid, or inactive.

    • @misraturp
      @misraturp 2 years ago +1

      You are welcome! Thank you for the addition.

    • @searchbug
      @searchbug 2 years ago

      @@misraturp My pleasure! Thank you for considering it :)

  • @govindant8360
    @govindant8360 3 years ago

    Very useful for me Mam.

  • @okotpascal1239
    @okotpascal1239 1 year ago

    The playlist doesn't seem to be arranged for one to easily follow up on the videos. otherwise great content and thank you for the tutorials.

  • @kunjalsahu3504
    @kunjalsahu3504 2 years ago

    Great content mam for fresher keep up the good work , new subbie

  • @HasanKarakus
    @HasanKarakus 2 years ago

    Great explanation!!

  • @ecitahpi385
    @ecitahpi385 1 year ago

    Dear Misra, thank you for your tutorial. From 19:40 onward, the merge code and the code that follows unfortunately do not work on my side. Is it possible to share your source code on GitHub, too?

  • @rangabharath4253
    @rangabharath4253 3 years ago

    Awesome 👍😎

  • @md.alamintalukder3261
    @md.alamintalukder3261 1 year ago

    Great ❤

  • @shuklajitechnicals2907
    @shuklajitechnicals2907 1 year ago

    hey, what font( code font ) you are using ??

  • @ankitsrivastava06
    @ankitsrivastava06 10 months ago

    Please provide datasets for practice csv file

  • @alifiaz7792
    @alifiaz7792 3 years ago +3

    Very well explained. During the cleaning you arbitrarily took the 25th and 75th percentiles as limits for cleaning tree diameter. Can you recommend a more systematic approach to selecting these lower and upper quantile values, so that appropriate treatment can be applied to the values below and above?

    • @misraturp
      @misraturp 3 years ago +4

      Hey Ali, great question. My main goal was to not drag the video out for very long, so I didn't go very deep into that decision. You can read a good description on this page (machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/) under the "Interquartile Range Method".

    • @alifiaz7792
      @alifiaz7792 3 years ago +2

      ​@@misraturp Thanks Misra for sharing the link

    • @nicolaslpf
      @nicolaslpf 1 year ago

      Most people use IQR
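
The interquartile range (IQR) method mentioned in this thread can be sketched as follows. This is an illustration with made-up diameters, not the code from the video; the 1.5 multiplier is the common convention for the fences and can be adjusted.

```python
import pandas as pd

# Made-up tree diameters with one suspicious value.
diameters = pd.Series([3, 4, 5, 5, 6, 7, 8, 9, 10, 120])

q1 = diameters.quantile(0.25)
q3 = diameters.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = diameters[(diameters < lower_fence) | (diameters > upper_fence)]
print(lower_fence, upper_fence)
print(outliers)
```

Flagged values can then be inspected, capped at the fences, or dropped, depending on what the analysis needs.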

  • @BillusTinnus
    @BillusTinnus 1 year ago

    So this is what data scientists do... Nice video ! :)

  • @asadghnaim2332
    @asadghnaim2332 3 years ago +1

    Please more on data cleaning

  • @tejaswaroopdasari1495
    @tejaswaroopdasari1495 2 years ago +1

    Your look Great by subject wise Content

  • @dominicatuahene7303
    @dominicatuahene7303 2 years ago

    Hi Misra, could you please share the link to the video before this one?

    • @misraturp
      @misraturp 2 years ago

      Here it is: ruclips.net/video/OY4eQrekQvs/видео.html

  • @shivamagarwal587
    @shivamagarwal587 2 years ago +1

    mask=((tree_census_subset['status']=="Stump") | (tree_census_subset['status']=="Dead"))
    This line gives me an error. Can you explain what the problem is?
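
The mask line as posted is valid pandas syntax, so the error probably comes from elsewhere, for example a `NameError` if `tree_census_subset` was not created earlier, or a `KeyError` if the `status` column is missing. That is a guess, since the error message is not given. The sketch below uses a hypothetical stand-in DataFrame and shows the same mask next to an equivalent `isin` form.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's tree_census_subset DataFrame.
tree_census_subset = pd.DataFrame(
    {
        "status": ["Alive", "Stump", "Dead", "Alive"],
        "health": ["Good", None, None, "Fair"],
    }
)

# The mask from the comment: each comparison sits in its own parentheses
# because | binds more tightly than ==.
mask = (tree_census_subset["status"] == "Stump") | (
    tree_census_subset["status"] == "Dead"
)

# Equivalent and shorter: test membership in a list of values.
mask_isin = tree_census_subset["status"].isin(["Stump", "Dead"])

print(tree_census_subset[mask])
```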

  • @mkhex87
    @mkhex87 2 years ago

    isn't it often useful to leave missing values as np.NaN so that your aggregation functions skip over them but still compute?

    • @misraturp
      @misraturp 2 years ago +1

      Depending on your goal, that would be helpful. When preparing data for a machine learning task, we would like to get rid of or correct as many missing values as possible so they don't crash the training process.
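
Both points in this exchange can be seen in a short sketch with made-up values: pandas aggregations skip `NaN` by default, and filling the gaps (here with the median, one option among several) produces a column without missing values for a model to train on.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
s = pd.Series([4.0, np.nan, 10.0])

# Aggregations skip NaN by default, so the mean is still computed.
mean_skipping = s.mean()

# With skipna=False the missing value propagates into the result.
mean_strict = s.mean(skipna=False)

# For modeling, one common choice is to fill gaps with the median.
filled = s.fillna(s.median())

print(mean_skipping, mean_strict)
print(filled)
```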

  • @idongessien2245
    @idongessien2245 1 year ago

    Not only are you pretty but freaking intelligent. (Forgive my choice of words, but I'm always straight forward). Was almost frustrated at some point...Thank God I ran into your channel

  • @hilloldasfisheuphoria
    @hilloldasfisheuphoria 2 years ago

    Hallo Misra, while enrolling myself to the course "deep learning 101 with python and keras" a "coupon code" is asked, how do I get it ???? where from I get the coupon code so that I can fill the "Add Coupon Code" field ???? ,....

    • @misraturp
      @misraturp 2 years ago

      Hey Hillol, that field is optional. I occasionally have campaigns, then I create a coupon code. There are no eligible coupons right now.

  • @samirmendhe7387
    @samirmendhe7387 2 years ago

    You made this so beautiful

  • @comptegmail273
    @comptegmail273 2 years ago

    Hello Misra, thank you so much for the tutorial. I'm actually stuck, since my source is a CSV file. Sadly, the file I'm working with is extremely complex, with an indefinite number of columns, since my main columns are repeated every day based on the date. I've been stuck on this problem for over a week. Is there a way I could reach out to you via email, to maybe help solve this problem? Thanks a lot in advance.

    • @misraturp
      @misraturp 2 years ago

      You're very welcome! My suggestion would be to first take a small subset of this dataset and work with that. After you create the code, it could be easier to work with it.

    • @nmanduley
      @nmanduley 1 year ago

      Probably a bit late, but I worked with a dataset like that once, and yeah, it's pretty unorthodox to have that structure where you have one column for each date instead of one row.
      What I ended up doing was to just label each column with the date in datetime format, so that I could automatically generate an array containing the date intervals I wanted for each case and slice the dataframe with it. It worked perfectly for plots and visualizations.
      You could also use the same approach to transpose the dataset, so that each date would be a row instead, and the columns would be your daily features. I didn't do it at the time because it would've worked the same, but it certainly would look better when visualizing the dataframe and seeing all the features in their own columns.
      Hope this helped.
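
A minimal sketch of reshaping such a file, assuming a hypothetical layout with one identifier column (`sensor`) and one column per date. It uses `DataFrame.melt` rather than a plain transpose, which is a swapped-in technique and not necessarily what the commenter did; the result has one row per identifier and date.

```python
import pandas as pd

# Hypothetical wide layout: one column per date.
wide = pd.DataFrame(
    {
        "sensor": ["a", "b"],
        "2024-01-01": [1.0, 2.0],
        "2024-01-02": [1.5, 2.5],
    }
)

# Turn the date columns into rows: one row per (sensor, date) pair.
long = wide.melt(id_vars="sensor", var_name="date", value_name="value")

# Parse the former column labels as real datetimes for slicing and plotting.
long["date"] = pd.to_datetime(long["date"])

print(long)
```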

  • @antukhan5592
    @antukhan5592 11 months ago

    can u share github code?

  • @DataSet
    @DataSet 1 year ago

    Im ready to be cleaned

  • @user-qg1rv6jm5m
    @user-qg1rv6jm5m 2 months ago

    I thought cleaning should be before exploration?

  • @minhhapham3010
    @minhhapham3010 10 months ago

    You are so beautiful, and the content of the video is very useful for new learners. Many thanks.

  • @um1541
    @um1541 1 year ago

    Would it be better to find 75% and 25% from the max value, instead of using percentiles? 15 doesn't look like 75% of 59, so I assume, we are editing 50% of data.

    • @mad1337nes
      @mad1337nes 1 year ago

      Those are the quartile markers, not percentages of the max value. 25% of the values fall at or below the first number (Q1) and 75% fall at or below the second (Q3).
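
The difference between a quartile and a percentage of the maximum shows up clearly on a small made-up series with one large value:

```python
import pandas as pd

# Made-up values with one large maximum.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Third quartile: the value at or below which about 75% of observations fall.
q3 = s.quantile(0.75)

# 75% of the maximum is a different quantity, driven by the single largest value.
pct_of_max = 0.75 * s.max()

# Share of observations at or below Q3.
share_at_or_below_q3 = (s <= q3).mean()

print(q3, pct_of_max, share_at_or_below_q3)
```

Capping at Q1 and Q3, as in the video, therefore changes roughly the lowest and highest quarters of the rows, regardless of how large the maximum is.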

  • @ekaterinakorneeva4792
    @ekaterinakorneeva4792 5 months ago

    Please make the links stay on the screen for more than 1 second, it would be much more convenient. Thank you.

  • @MOHANAMona-yq3dl
    @MOHANAMona-yq3dl 3 months ago

    Where can i get the data set?

    • @misraturp
      @misraturp 3 months ago +1

      I believe this is the dataset I'm using. www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

    • @MOHANAMona-yq3dl
      @MOHANAMona-yq3dl 3 months ago

      @@misraturp Thank you ✨

  • @hilloldasfisheuphoria
    @hilloldasfisheuphoria 2 years ago

    Misra I had submitted my "first name" and "email" to get "Pandas Cheat Sheet (free)" three times, but I have not received any mail yet !!!! ,....

    • @misraturp
      @misraturp 2 years ago

      Hey Hillol, could you check your spam folder? It looks like the email was sent.

  • @vinayakdixit2855
    @vinayakdixit2855 2 years ago +1

    @19:40

  • @ttffan658
    @ttffan658 2 years ago

    Cute smile

  • @keidran_r3
    @keidran_r3 8 months ago

    data wrangler, a vs code extension. you're welcome.

  • @salimayad2151
    @salimayad2151 1 year ago

    She can fix me

  • @rohitbuddabathina9225
    @rohitbuddabathina9225 2 years ago

    Did anyone tell you that you resemble Angelina Jolie ?😀

    • @misraturp
      @misraturp 2 years ago +1

      Not until now. I'm flattered. 😅

    • @rohitbuddabathina9225
      @rohitbuddabathina9225 2 years ago

      @@misraturp I hope you run a correlation test on your face and Angelina's face. I really wanna see the results 😀

    • @misraturp
      @misraturp 2 years ago +1

      @@rohitbuddabathina9225 Haha sure. :D

  • @msumode4493
    @msumode4493 1 year ago

    Visiting for your face.

  • @theamithsingh
    @theamithsingh 1 year ago

    Great video series @misraturp