How to Do Data Cleaning (step-by-step tutorial on real-life dataset)
- Published: 27 Jul 2024
- 🐼 All you need to know about Pandas in one place! Download my Pandas Cheat Sheet (free) - misraturp.gumroad.com/l/pandascs
👇Learn how to complete your first real-world data science project
Hands-on Data Science course
www.misraturp.com/hods
Data exploration video: • How to Do Data Explora...
If you’d like to follow along, find the data here: data.cityofnewyork.us/Environ...
00:00 Welcome and some words on data cleaning
01:21 Cleaning irrelevant columns
01:43 Cleaning categorical column values
02:27 Cleaning missing values
13:08 Dealing with outliers
17:17 Data has a surprise for us
17:57 Dealing with unexpected data issues
21:10 Going forward
👋 Keep in touch?
==========================
🐥 Twitter - / misraturp
🔗 LinkedIn - / misraturp
📹 RUclips - / @misraturp
🌎 Website - misraturp.com/
Courses & resources
============================
👩💻 Hands-on Data Science: Complete your first portfolio project
www.misraturp.com/hods
📙 Fundamentals of Deep Learning in 25 pages
misraturp.gumroad.com/l/fdl
📥 Streamlit template
misraturp.gumroad.com/l/stemp
🤖 Deep Learning 101 with Python and Keras (FREE)
• 50 Days of Deep Learning
🏃♀️ Data Science Kick-starter mini-course (FREE)
misraturp.gumroad.com/l/kick-...
🐼 Pandas cheat sheet (FREE)
misraturp.gumroad.com/l/pandascs
📝 NNs hyperparameters cheat sheet (FREE)
misraturp.gumroad.com/l/hcs
Thank you for showing us these techniques. And I love the pace, thank you!
Great content! This step-by-step guide has been extremely helpful in my journey.
This is great Misra, thanks! :) Please make more of these!
Thank you!
This is gold. Learned a lot from your videos. Thanks!
PERFECTION, all this explanation and process. Thanks Misra! You are a master, thanks a lot!
You're very welcome :) and thank you for your nice words Antonio :)
Thank you Misra for this great content. Please keep making videos, it's very helpful and your explanations are great.
Thank you Ahmed!
Hey Misra! I am new here and I love your content.
And thank you for this wonderful explanation, it's really helpful for getting a clear view of how things work.
That’s great to hear. Thank you!
This was a wonderful tutorial wherein you also gave insights into what you consider while data cleaning, thank you.
You're very welcome!
Thank you SO much! I am totally in love with the content you share!
You're very welcome :)
You deserve more appreciation. Great Content!!!
Thank you so much 😀
This was really amazing, Thank you.
That was the video that i was searching for. Thanks :)
Awesome! You are very welcome.
Thank you for the amazing lecture.
That was amazing. Thanks for your time.
You're very welcome :)
I would highly recommend new data analysists participate in this data cleaning process you present. The level of detail is impeccable 👌
Great to hear Loyiso, thank you!
@@misraturp My pleasure. You're so good. (*meant to say Data Analysts and not the gibberish above🤭)
It's really amazing, I learned a lot of new lines of code here as well as the concepts... Thanks a lot @Misra Turp😇😇😇😇😇
That's great to hear, thank you!
Thank you so much. I am a new data scientist and these videos are very helpful.
Great to hear!
Wonderful tutorial! This is exactly the type of content I was looking for. When you are cleaning data in a work environment, do you document all the changes you make?
Hey Judy, I’m happy to hear that! I did not document the cleaning steps most of the time. But having clear comments on the code itself is very useful to people who will maintain the code after you to understand why you have done something.
This is so nice and clean 👌
Thank you! Cheers!
👉 Get real world data science experience by doing hands-on work
www.misraturp.com/hods
You look very beautiful
Smart bute :)
and great content ! Thanks
Amazing... My search has ended here... Please continue uploading more videos...
That's awesome to hear! Thank you.
I'm very appreciative for your channel
Thank you 8CountLife!
Good explanation.
Grateful for the awesome tutorials. How do you split a dataset into training and testing sets? I am working on EMG signal classification with an SVM classifier and I am confused about how to split the data for the classification task.
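(Not from the video, but for anyone with the same question: the usual tool is scikit-learn's train_test_split. A minimal sketch, with toy arrays standing in for the EMG features and labels — stratify keeps the class ratio the same in both splits, which matters for classification:)

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 100 samples with 8 features each, and binary class labels
X = np.random.rand(100, 8)
y = np.array([0, 1] * 50)

# Hold out 20% for testing; stratify=y preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

With real EMG data you would typically split window-by-window per subject, but the call itself stays the same.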
Thanks for sharing! 👍
Of course!
That was a great illustration, thanks a lot!
You're very welcome!
So good, thanks a lot!
You're very welcome!
Thank you!
Thank you for the content Mısra, data is not always as clean as in the tutorial :) Good job, keep going my friend!
Thank you!
Good work Misra
Thank you :)
As someone with a short attention span these days, I'd like to pat myself on the back for being able to sit through the entirety of the video. Data is really interesting! Thanks, Misra :)
Great to hear that you enjoyed it! :)
I loved you from the first time I saw you. TYJ
Thank you Misra for the detailed video, very helpful. I am wondering what the shortcut is for the search dropdown bar at 29:18 in your video. Thanks
Hello, Misra! great video!
I had a question. When I run the tree_census_subset['steward'].value_counts() statement, pandas does not return the None count, i.e. the count of trees which do not have a steward. Has there been some change in the value_counts() function or am I doing something wrong?
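(For anyone hitting the same thing: value_counts() drops missing values by default; pass dropna=False to include them. A minimal sketch with a toy stand-in for the 'steward' column:)

```python
import pandas as pd

# Toy stand-in for the 'steward' column: None marks trees with no steward
steward = pd.Series(["1or2", None, "3or4", None, "1or2"])

# By default value_counts() silently excludes missing values
with_default = steward.value_counts()

# Pass dropna=False to get the missing-value count as well
with_nan = steward.value_counts(dropna=False)
```

Note this only applies to real missing values (NaN/None); if the column contains the literal string "None", it is counted like any other value.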
Thank you a lot Misra :D
You're welcome 😊
Hello Misra,
For the last part (substituting the diameter with the Q1 and Q3 values), wouldn't it be better to retain the original column and just add a new one based on a condition (like np.where(dia < Q1, Q1, dia)), just to be able to compare them side by side?
Hi Misra, a quick question: what do you do with missing values created by logical/skip questions in a survey? Thanks!
Wow! Not everyone gets to share this kind of in-depth walkthrough. Thanks for sharing this, Misra! For our dear friends who are not really techy or have no experience dealing with code: you can learn the basics or find other ways to clean data. With bulk validation, for example, the only thing you need to do is make sure your datasets are well organized, with proper headings. You can then simply upload them to third-party software that does the verification for you and lets you know which records are outdated, invalid, or inactive.
You are welcome! Thank you for the addition.
@@misraturp My pleasure! Thank you for considering it :)
Very useful for me, ma'am.
Thanks a lot
The playlist doesn't seem to be arranged for one to easily follow the videos in order. Otherwise, great content, and thank you for the tutorials.
Great content ma'am for freshers, keep up the good work. New subbie!
Welcome!
Wonderful explanation!!
Thank you!
Dear Misra, thank you for your tutorial. From 19:40 on, the merge code and some of the code that follows unfortunately isn't working on my side. Would it be possible to share your source code on GitHub, too?
Awesome 👍😎
Thanks ✌️
Great ❤
Hey, what font (code font) are you using?
Please provide the dataset as a CSV file for practice.
Very well explained. During the cleaning you arbitrarily took the 25th and 75th percentiles as limits for cleaning the tree diameter. Can you recommend a more systematic approach to selecting these lower and upper quantile values, so that appropriate treatment can be applied to the values below and above them?
Hey Ali, great question. My main goal was to not drag the video out for too long, so I didn't go very deep into that decision. You can find a good description on this page (machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/) under the "Interquartile Range Method".
@@misraturp Thanks Misra for sharing the link
Most people use IQR
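(For reference, the standard 1.5 × IQR rule from the article linked above, sketched on toy data:)

```python
import pandas as pd

dia = pd.Series([3, 4, 5, 5, 6, 7, 8, 50])  # toy diameters, one obvious outlier

q1, q3 = dia.quantile(0.25), dia.quantile(0.75)
iqr = q3 - q1

# IQR rule: flag anything more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = dia[(dia < lower) | (dia > upper)]
```

This is more forgiving than capping at Q1/Q3 directly, since it only treats values well outside the middle 50% as outliers.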
So this is what data scientists do... Nice video ! :)
Yes, thanks!
Please more on data cleaning
Noted!
Your content looks great, subject-wise.
Thank you.
Hi Misra, please share the link to the previous video, the one before this one.
Here it is: ruclips.net/video/OY4eQrekQvs/видео.html
mask = ((tree_census_subset['status'] == "Stump") | (tree_census_subset['status'] == "Dead"))
This line gives me an error, can you explain what the problem is here?
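(The line itself is valid pandas, so the error likely comes from something else, e.g. the dataframe or column not existing at that point. One common pitfall with this pattern is missing parentheses around each comparison when combining conditions with | or &. A runnable sketch on a toy dataframe:)

```python
import pandas as pd

df = pd.DataFrame({"status": ["Alive", "Stump", "Dead", "Alive"]})

# Each comparison needs its own parentheses before combining with | or &
mask = (df["status"] == "Stump") | (df["status"] == "Dead")

# isin() expresses the same condition more compactly
mask_isin = df["status"].isin(["Stump", "Dead"])
```

If this toy version works but your line doesn't, check the exact error message: a KeyError points at a column name, a NameError at the dataframe variable.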
Isn't it often useful to leave missing values as np.nan so that your aggregation functions skip over them but still compute?
Depending on your goal that would be helpful. When preparing data for a machine learning task, we would like to get rid of or correct as many missing values as possible so they don't crash the training process.
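(A quick illustration of both behaviours on a toy series — pandas aggregations skip NaN by default, and skipna=False makes missing values propagate instead:)

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])

# Default: NaN is skipped, so the mean is (10 + 30) / 2
m = s.mean()

# skipna=False propagates the missing value, yielding NaN
m_strict = s.mean(skipna=False)
```

So for exploration, leaving NaN in place is often fine; the problems start when a downstream model can't accept NaN inputs.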
Not only are you pretty but freaking intelligent. (Forgive my choice of words, but I'm always straightforward.) I was almost frustrated at some point... Thank God I ran into your channel.
Hello Misra, while enrolling in the course "Deep Learning 101 with Python and Keras" I am asked for a coupon code. How do I get one? Where do I get the coupon code to fill in the "Add Coupon Code" field?
Hey Hillol, that field is optional. I occasionally have campaigns, then I create a coupon code. There are no eligible coupons right now.
You made this so beautiful
Thank you!
Hello Misra, thank you so much for the tutorial. I'm actually stuck since my source is a CSV file. Sadly, the file I'm working with is extremely complex, with an indefinite number of columns, since my main columns are repeated every day based on the date. I've been stuck on this problem for over a week. Is there a way I could reach out to you via email to maybe help solve this? Thanks a lot in advance.
You're very welcome! My suggestion would be to first take a small subset of this dataset and work with that. After you create the code, it could be easier to work with it.
Probably a bit late, but I worked with a dataset like that once, and yeah, it's pretty unorthodox to have that structure where you have one column for each date instead of one row.
What I ended up doing was to just label each column with the date in datetime format, so that I could automatically generate an array containing the date intervals I wanted for each case and slice the dataframe with it. It worked perfectly for plots and visualizations.
You could also use the same approach to transpose the dataset, so that each date would be a row instead, and the columns would be your daily features. I didn't do it at the time because it would've worked the same, but it certainly would look better when visualizing the dataframe and seeing all the features in their own columns.
Hope this helped.
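(A minimal sketch of the transpose approach described above, with hypothetical column and feature names standing in for the real dataset:)

```python
import pandas as pd

# Toy wide layout: one column per date, one row per feature
wide = pd.DataFrame(
    {"2024-01-01": [1, 4], "2024-01-02": [2, 5], "2024-01-03": [3, 6]},
    index=["feature_a", "feature_b"],
)

# Parse column labels as datetimes so they can be sliced by date range
wide.columns = pd.to_datetime(wide.columns)

# Transpose: dates become rows, daily features become columns
tidy = wide.T
first_two_days = tidy.loc["2024-01-01":"2024-01-02"]
```

Once the dates are the index, slicing, resampling, and plotting all work the way pandas expects.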
Can you share the GitHub code?
I'm ready to be cleaned
I thought cleaning should be before exploration?
You are so beautiful, and the content of the video is very useful for new learners. Many thanks
Would it be better to find 75% and 25% of the max value, instead of using percentiles? 15 doesn't look like 75% of 59, so I assume we are editing 50% of the data.
Those are the quartile cut-offs, not percentages of the max. The 75th percentile is the value below which 75% of the observations fall; it is not 75% of the maximum value.
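(A toy series makes the difference concrete: on skewed data, where most trees are small and one is huge, the 75th percentile and 75%-of-max are very different numbers:)

```python
import pandas as pd

# Skewed toy diameters: mostly small trees, one huge one (max = 59)
dia = pd.Series([1, 2, 2, 3, 3, 4, 5, 59])

q3 = dia.quantile(0.75)        # value below which 75% of the trees fall
pct_of_max = 0.75 * dia.max()  # 75% of the maximum value
```

Here q3 is 4.25 while 75% of the max is 44.25, which is why percentiles, not fractions of the max, are used for outlier limits.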
Please make the links stay on the screen for more than 1 second, it would be much more convenient. Thank you.
Where can i get the data set?
I believe this is the dataset I'm using. www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
@@misraturp Thank you ✨
Misra, I had submitted my first name and email to get the Pandas Cheat Sheet (free) three times, but I have not received any mail yet!
Hey Hillol, could you check your spam folder? It looks like the email was sent.
@19:40
Cute smile
Data Wrangler, a VS Code extension. You're welcome.
She can fix me
Did anyone tell you that you resemble Angelina Jolie ?😀
Not until now. I'm flattered. 😅
@@misraturp I hope you run a correlation test on your face and Angelina's face. I really wanna see the results 😀
@@rohitbuddabathina9225 Haha sure. :D
Visiting for your face.
Great video series @misraturp
Thank you!