Real World Data Cleaning in Python Pandas (Step By Step)

Ryan & Matt Data Science

92K views

Published: Dec 25, 2024

Comments • 105

@RyanAndMattDataScience 3 months ago ⁺³
Hey guys I hope you enjoyed the video! If you did please subscribe to the channel!
Join our Data Science Discord Here: discord.com/invite/F7dxbvHUhg
If you want to watch a full course on Machine Learning check out Datacamp: datacamp.pxf.io/XYD7Qg
Want to solve Python data interview questions: stratascratch.com/?via=ryan
I'm also open to freelance data projects. Hit me up at ryannolandata@gmail.com
*Both Datacamp and Stratascratch are affiliate links.
@bluhblubah8464 3 months ago
The star indicates that the player scored those runs without getting out, for example "his highest score is 269 not out".
@passportbro904 3 months ago
Timestamps?
@ArmanKHAN-bj9iv 1 year ago ⁺¹⁰
Fantastic tutorial! Your step-by-step guide on data cleaning in Python Pandas was excellent. Clear explanations and practical examples made it easy to follow along. Looking forward to more of your uploads. Keep up the great work!
@RyanAndMattDataScience 1 year ago
Thank you! I’ll have another Python video up this week as well as more coming soon!
@koo5867 9 months ago ⁺⁴
Now that’s some cool content. This is exactly what I wanted. Thanks bro🙏🏼 keep helping the poor students like us! 😌
@RyanAndMattDataScience 9 months ago ⁺¹
No problem
@nickdaboss03 1 year ago ⁺¹⁰
You work super hard and put out really good content. Keep it up man, I'm looking forward to watching you grow!
@RyanAndMattDataScience 1 year ago
Thank you! I have another video ready to go later this week, and I'm 90% done with another Python interview question video.
@AmaRan31 3 months ago ⁺²
It was very good training. Thank you for making this video. I have implemented the project myself and I am even thinking about moving forward.
@DataJunkieTX 6 months ago ⁺¹
Mistakes always help me learn because they force me to recall new and old knowledge. Once I've made a mistake often enough (more than 3 times), I retain it and check for it automatically, so I rarely see that mistake again.
@RoleJohn 1 year ago ⁺²
I guess by watching your videos while preparing my own portfolio, I am halfway there. Thanks a lot
@RyanAndMattDataScience 1 year ago
No problem. My first batch of classification vids is done; working on regression now
@AJAY7509 8 months ago ⁺¹
This video really helped me man. I was trying to learn about pandas and it popped up in my notifications. Thanks for the video.
@RyanAndMattDataScience 8 months ago
No problem, check out my other pandas vids. I have a full playlist
@prathmesh_jadhav8930 8 months ago ⁺¹
Brother, you're doing awesome… Upload more videos related to data analysis
@RyanAndMattDataScience 8 months ago ⁺¹
I have a full playlist of 70ish vids! Working on more though
2 months ago
This video saved me a ton of hours, thanks for sharing your knowledge.
@RyanAndMattDataScience 2 months ago
No problem, if you want to learn more check out our Discord
@Al-Ahdal 8 months ago
@Ryan Nolan: Excellent Video. Very clearly explained. I'm looking forward to watching you grow!
@RyanAndMattDataScience 8 months ago
Much appreciated!
@pradeeppadeliya 1 year ago ⁺¹
This is the best tutorial .... 👍👍👍👍👍👍👍👍👍👍👍👍👍👍
@RyanAndMattDataScience 1 year ago
Means a ton, thank you
@davideschreiber2821 1 year ago
Lots of good stuff here, but I finally gave up at 31:24. If you're confused about what's happening, imagine how confused we learners are as you bounce around from cell to cell copying-pasting-deleting-trying again, trying to figure things out.
@RyanAndMattDataScience 1 year ago
Bugs are part of programming and no one is perfect. I show how it’s solved and why it happens
@far3582 11 months ago
I am trying to move away from R, and this is a great video. Thanks Ryan!
@RyanAndMattDataScience 11 months ago
No problem, best of luck
@DivyataShri 16 days ago
Why make changing the data type so long at 29:00? Can't we just use the same method we used for changing the data types of rookie year and final year?
@tianbowen721 7 months ago
Pretty amazing :) and I'd say it's some dense content to fit in 40 mins. I learned a lot
@RyanAndMattDataScience 7 months ago
Awesome
@Al-Ahdal 8 months ago
@Ryan Nolan: Your videos are great indeed. Could you do a comprehensive series on "Data Analytics & Visualization"? Thanks
@RyanAndMattDataScience 8 months ago
I have a full data analyst playlist, check it out
@Al-Ahdal 8 months ago
@RyanAndMattDataScience, could you please link to it? Thanks
@RyanAndMattDataScience 8 months ago
@Al-Ahdal ruclips.net/p/PLcQVY5V2UY4JrrKi2bW7DdOD08shTs4QQ
@tapspasi2319 8 months ago
Amazing! Very good presentation
@RyanAndMattDataScience 8 months ago
Thank you
@indiecasmlive1917 5 months ago
The star on the highest score means that the player was NOT OUT till the end of the match. Players without a star were OUT right after reaching that score.
@mindacid3274 5 months ago ⁺¹
A star at the end of the score means those runs were scored NOT OUT, with the player still batting
@RyanAndMattDataScience 5 months ago
Yup, learned from a few people. I don’t see cricket statistics often in the US
@VladislavShishkin11 1 year ago ⁺⁵
I completed the project, but when I reopened it today all the code was still there, yet typing df showed the old uncleaned table. How do I make sure this doesn't happen again?
@RyanAndMattDataScience 1 year ago
I'll add my code to GitHub this weekend
@marcus.the.younger 7 months ago
Save the cleaned data
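A minimal sketch of that advice. Reopening a notebook restores the code but not the variables in memory, so df only reflects the cells that have been re-run. Writing the cleaned frame to disk lets a later session start from the cleaned version. The frame contents and the file name cleaned_cricket.csv are placeholders, not taken from the video.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame built in the notebook.
df = pd.DataFrame({"Player": ["Player A", "Player B"], "Runs": [6996, 8032]})

with tempfile.TemporaryDirectory() as folder:
    # A temporary folder keeps the sketch self-contained; in a real project
    # this would be a permanent path such as "cleaned_cricket.csv".
    path = Path(folder) / "cleaned_cricket.csv"

    # index=False avoids writing the row index as an extra column.
    df.to_csv(path, index=False)

    # A later session can load the saved file instead of re-running every cell.
    restored = pd.read_csv(path)
```

Re-running all cells from the top (Run All) also rebuilds the cleaned df, at the cost of repeating every step.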
@ReadyF0RHeady 22 days ago
What if the country name or the player is written as a synonym or nickname? I currently want to merge index data from various countries, but the names are written differently across the dataframes (United States, United States of America). How do I handle synonyms so there is only one spelling per name? It's over 180 country names, so the dataset is too big to compare manually.
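One possible approach, sketched under assumptions: map every variant spelling to a single canonical name before merging, then use the merge indicator to list the names that still fail to match, so only those need manual review. The column names, the index values and the aliases dictionary are made up for illustration.

```python
import pandas as pd

# Two hypothetical index tables that spell the same countries differently.
left = pd.DataFrame({"Country": ["United States", "UK "], "index_a": [1.0, 2.0]})
right = pd.DataFrame(
    {"Country": ["United States of America", "United Kingdom"], "index_b": [10.0, 20.0]}
)

# Hypothetical alias table: variant spelling -> canonical name.
aliases = {"United States of America": "United States", "UK": "United Kingdom"}

for frame in (left, right):
    # Trim stray whitespace first, then swap known variants for the canonical name.
    frame["Country"] = frame["Country"].str.strip().replace(aliases)

# indicator=True adds a _merge column showing which side each row came from.
merged = left.merge(right, on="Country", how="outer", indicator=True)

# Names that appear on one side only are the remaining candidates for the alias table.
unmatched = merged.loc[merged["_merge"] != "both", "Country"].tolist()
```

With 180 names, the usual loop is to merge, inspect unmatched, extend aliases, and repeat until the list is empty.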
@satishharijan7280 1 year ago ⁺¹
Nice lecture bro, thanks for this. It is a useful video for me
@RyanAndMattDataScience 1 year ago
No problem
@suvanshgaurav8996 4 months ago
Great learning we're getting here! Everything explained precisely 👍 Just one doubt: why use (axis=1) and (axis=0)?
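A short illustration of the axis argument, using a made-up two-column frame. In pandas, axis=0 refers to the row direction (labels in the index) and axis=1 to the column direction (labels in the columns).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axis=0 aggregates down the rows, giving one result per column.
col_sums = df.sum(axis=0)

# axis=1 aggregates across the columns, giving one result per row.
row_sums = df.sum(axis=1)

# For drop, axis says where to look for the label.
without_b = df.drop("b", axis=1)        # "b" is a column label
without_first = df.drop(0, axis=0)      # 0 is a row (index) label
```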
@khan07700 7 months ago
Sir, when we import data from the site into a table, I'm not getting the Table 0 option shown at 1:54. What's the solution for that?
@henry-o8i 9 months ago
Thanks, I appreciate this tutorial. Just one question on Q5: why is it already a data frame, while we have to use to_frame for Q4? Thanks
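The comment does not say what Q4 and Q5 computed, so the following is only a guess at the likely cause. In pandas, an operation that returns a single column gives a Series, which needs to_frame to become a DataFrame, while selecting several columns returns a DataFrame directly. The data and column names here are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {"Country": ["AUS", "AUS", "ENG"], "Runs": [10, 20, 5], "Inns": [1, 2, 1]}
)

# Selecting one column after groupby returns a Series.
one_col = df.groupby("Country")["Runs"].mean()

# to_frame converts that Series into a one-column DataFrame.
as_frame = one_col.to_frame("avg_runs")

# Selecting a list of columns returns a DataFrame without any conversion.
many_cols = df.groupby("Country")[["Runs", "Inns"]].mean()
```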
@maheshvaka-h3n 10 months ago
Overall it was a great effort, and your hard work is much appreciated. I would like to know how to remove or drop null values from the columns.
Thanks in advance
@RyanAndMattDataScience 9 months ago ⁺¹
Look up dropna
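For reference, a small sketch of DataFrame.dropna on an invented frame, showing the three variants that come up most often.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "Runs": [100.0, np.nan, 50.0],
        "Avg": [np.nan, np.nan, 25.0],
    }
)

# Drop every row that has a null in any column.
no_nulls = df.dropna()

# Drop rows only when a specific column is null.
runs_required = df.dropna(subset=["Runs"])

# Drop columns instead of rows: keep columns with at least 2 non-null values.
dense_columns = df.dropna(axis=1, thresh=2)
```

dropna returns a new DataFrame, so the result has to be assigned back (df = df.dropna()) for the change to stick.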
@maheshvaka-h3n 9 months ago
@RyanAndMattDataScience Cheers man... any advice on how to remove a year from a column? For instance, if a column has both numeric and year values and I want to remove only the year (in 2004 format).
@Patrick-l5h7r 2 months ago
Thanks Ryan, great tutorial. I was pleasantly surprised that you knew the name of the great WI batsman, Sir Garry Sobers. Are you from the West Indies?
@RyanAndMattDataScience 2 months ago
Nope, I just collect cards and have a few of Sobers
@Patrick-l5h7r 2 months ago
That’s great. Card collecting can be financially rewarding!
@lewismurigi3623 10 months ago
This was so helpful, thanks man
@RyanAndMattDataScience 10 months ago
No problem
@nagamanickam6604 1 year ago ⁺¹
Thank you Ryan Nolan
@RyanAndMattDataScience 1 year ago
No problem
@ArhamZaiem 7 months ago
For the highest innings score, why didn't you use rstrip to remove the * instead of split?
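Both approaches give the same result when the star only ever appears at the end of the value. A small comparison on made-up scores:

```python
import pandas as pd

# Hypothetical highest-score values; a trailing * marks a not-out innings.
scores = pd.Series(["269*", "334", "400*"])

# split keeps whatever comes before the first "*".
via_split = scores.str.split("*").str[0]

# rstrip removes any trailing "*" characters and leaves the rest untouched.
via_rstrip = scores.str.rstrip("*")

# Either result can then be converted to integers.
as_int = via_rstrip.astype(int)
```

rstrip states the intent more directly. split would also discard anything after a star in the middle of a value, which may or may not be what you want.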
@lenkapang-ek4fe 1 month ago
Hi, what is meant by "hundreds, fifties, ducks(0) AVG by country"?
@ShehneelKhan-p3x 6 months ago
The * in Highest_Inns_Score means the player was not out in that innings.
@loydteds3944 7 months ago
Your video is very helpful! One question though: how do you remove duplicates in high-dimensional data, let's say with 500 duplicates? Thanks
@mindacid3274 5 months ago
df.drop_duplicates(), or if you just want to check a subset of columns for repeats use df.drop_duplicates(subset=[ ], keep="first"), setting keep to "first" or "last" depending on which copy you want to keep when dropping
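A runnable version of that reply on an invented frame. Counting duplicates with duplicated() first shows how many rows will be removed.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Player": ["A", "A", "B", "B"],
        "Country": ["AUS", "AUS", "ENG", "ENG"],
        "Runs": [10, 10, 5, 7],
    }
)

# Number of rows that exactly repeat an earlier row.
n_exact_dupes = int(df.duplicated().sum())

# Remove rows that are identical across every column (keeps the first copy).
exact = df.drop_duplicates()

# Treat rows as duplicates when Player matches, keeping the last occurrence.
by_player = df.drop_duplicates(subset=["Player"], keep="last")
```

The same calls work regardless of how many columns or duplicate rows the frame has.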
@Uknowme-e9b 7 months ago
Bro... you should have used the replace method with regex to clean the *, + etc. characters from the columns
@RyanAndMattDataScience 7 months ago
Funny enough, I used regex in my latest project and have a video coming out on it soon
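A sketch of what the regex suggestion could look like. This is not the method used in the video, and the column names and values are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {"HS": ["269*", "334"], "Balls": ["9802+", "1023"], "Avg": ["-", "50.7"]}
)

# A character class removes every * or + in one pass.
for column in ["HS", "Balls"]:
    df[column] = df[column].str.replace(r"[*+]", "", regex=True).astype(int)

# A lone "-" stands for a missing value; errors="coerce" turns it into NaN.
df["Avg"] = pd.to_numeric(df["Avg"], errors="coerce")
```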
@SuccessGossips 7 months ago ⁺¹
Star means not out with the highest score; you don't need to remove it
@viralhunt8637 1 month ago
Very nice buddy. Sometimes it feels very tough and sometimes easy; maybe that happens because I lose confidence
@RyanAndMattDataScience 1 month ago
Just keep going man. Stuff I’ve found confusing I look back at and think it’s not that bad
@inaaahsings 3 months ago
Thank you!!
@salfrat55 1 year ago
"FS Jackson played for Cambridge University, Yorkshire and England. He spotted the talent of Ranjitsinhji when the latter, owing to his unorthodox batting and his race, was struggling to find a place for himself in the university side, and as captain was responsible for Ranji's inclusion in the Cambridge First XI and the awarding of his Blue. According to Alan Gibson this was "a much more controversial thing to do than would seem possible to us now". He was named a Wisden Cricketer of the Year in 1894.
He captained England in five Test matches in 1905, winning two and drawing three to retain The Ashes. Captaining England for the first time, he won all five tosses and topped the batting and bowling averages for both sides, with 492 runs at 70.28 and 13 wickets at 15.46. These were the last of his 20 Test matches, all played at home as he could not spare the time to tour."
@RyanAndMattDataScience 1 year ago ⁺¹
Didn’t know this, it's a really cool story. Like Branch Rickey in baseball
@nessim.liamani 6 months ago
Hi Ryan,
I'd like to understand how you would have handled a file with millions or tens of millions of lines to spot those "*", "-" and "+" characters.
You spotted them here manually by eye.
Can anyone help me figure that out?
Thanks
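One way to do this without reading the rows by eye, sketched on invented data: extract every character that is not a digit or decimal point from a column and count the occurrences. The helper name stray_characters is hypothetical.

```python
import pandas as pd

# Hypothetical columns that should be numeric but were read as text.
df = pd.DataFrame({"HS": ["269*", "334", "-"], "Balls": ["9802+", "1023", "77"]})


def stray_characters(column: pd.Series) -> dict:
    """Count every character in the column that is not a digit or a decimal point."""
    found = column.astype(str).str.findall(r"[^0-9.]")
    # explode gives one row per character; clean values become NaN and are dropped.
    return found.explode().dropna().value_counts().to_dict()


report = {name: stray_characters(df[name]) for name in df.columns}
```

The report lists each unexpected character and how often it occurs, however many rows the file has. For files too large for memory, the same function could be applied per chunk with pd.read_csv(..., chunksize=...) and the counts summed.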
@MrFravallec 10 months ago
Great tutorial, got this issue on the data types: AttributeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 df['Inns']= df["Inns"].str.split(pat = '*').str[0]
File ~\anaconda3\Lib\site-packages\pandas\core\generic.py:5902, in NDFrame.__getattr__(self, name)
5895 if (
5896 name not in self._internal_names_set
5897 and name not in self._metadata
5898 and name not in self._accessors
5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5900 ):
5901 return self[name]
-> 5902 return object.__getattribute__(self, name)
File ~\anaconda3\Lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
179 if obj is None:
180 # we're accessing the attribute of the class, i.e., Dataset.geo
181 return self._accessor
--> 182 accessor_obj = self._accessor(obj)
183 # Replace the property with the accessor object. Inspired by:
184 # www.pydanny.com/cached-property.html
185 # We need to use object.__setattr__ because we overwrite __setattr__ on
186 # NDFrame
187 object.__setattr__(obj, self._name, accessor_obj)
File ~\anaconda3\Lib\site-packages\pandas\core\strings\accessor.py:181, in StringMethods.__init__(self, data)
178 def __init__(self, data) -> None:
179 from pandas.core.arrays.string_ import StringDtype
--> 181 self._inferred_dtype = self._validate(data)
182 self._is_categorical = is_categorical_dtype(data.dtype)
183 self._is_string = isinstance(data.dtype, StringDtype)
File ~\anaconda3\Lib\site-packages\pandas\core\strings\accessor.py:235, in StringMethods._validate(data)
232 inferred_dtype = lib.infer_dtype(values, skipna=True)
234 if inferred_dtype not in allowed_types:
--> 235 raise AttributeError("Can only use .str accessor with string values!")
236 return inferred_dtype
AttributeError: Can only use .str accessor with string values!
@Muhammad.Kashif31 9 months ago ⁺¹
Your column may contain integer data; that's why you are getting the error
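A small reproduction of that diagnosis, assuming the Inns column was already read as integers (or the cell was re-run after conversion). The .str accessor only works on text, so casting to str first avoids the AttributeError. The values are invented.

```python
import pandas as pd

# Hypothetical column that pandas already parsed as integers.
df = pd.DataFrame({"Inns": [80, 52, 140]})

# Calling .str on a numeric column raises the error from the traceback.
try:
    df["Inns"].str.split("*")
    raised = False
except AttributeError:
    raised = True

# Casting to str first makes the cleaning step safe to run on either dtype.
df["Inns"] = df["Inns"].astype(str).str.split("*").str[0].astype(int)
```

Checking df.dtypes before cleaning shows which columns are text and actually need the string step.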
@RAM_JAN22 5 months ago
The * in the HS column means that the player was not out in that match
@yvonnemukhono3566 7 months ago
Very helpful.
@RyanAndMattDataScience 7 months ago
No problem
@kadircalloglu2848 6 months ago
Why didn't we use SQL after the types were changed?
@benayawilly6536 1 year ago
Good work. Keep it up
@RyanAndMattDataScience 1 year ago
Thank you! I just uploaded a new video
@adedayojoseph9775 3 months ago
How can I download this data?
@ri5habh 5 months ago
Thank you sir.
@RyanAndMattDataScience 5 months ago
Np
@pavankalyan_297 1 year ago
The star in the Highest score column means they were not out till the end of the match. Great tutorial Ryan. Would it be possible for you to attach the notebook file here?
@RyanAndMattDataScience 1 year ago ⁺¹
Thank you, and I can look at adding the code to GitHub this weekend
@hemantsharma-xf3ub 9 months ago
Where can I get the notes?
@taha5754 6 months ago
Can you share the notebook used in this tutorial? @RyanNolanData
@RyanAndMattDataScience 6 months ago
I need to make a website article on this. It’ll have the code in there
@tasmisa6778 6 months ago
How am I supposed to know that all the letters stand for the names you just gave them???
@Sylvestre555 4 months ago
link
@sachinnambiar 10 months ago ⁺¹
It's a dictionary, right? Not a list.
#rename multiple columns in a dictionary
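The commenter is right that DataFrame.rename takes a dictionary mapping old names to new ones. A minimal sketch follows. Highest_Inns_Score appears elsewhere in this thread, while the other names and the values are placeholders.

```python
import pandas as pd

# Hypothetical abbreviated column names and made-up values.
df = pd.DataFrame({"NO": [10], "HS": ["269*"], "BF": [9802]})

# rename multiple columns with a dictionary: {old_name: new_name}
df = df.rename(
    columns={"NO": "Not_Outs", "HS": "Highest_Inns_Score", "BF": "Balls_Faced"}
)
```

Columns not listed in the dictionary keep their original names.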
@rajareddyraju6773 9 months ago
19:09
@dogzrgood 1 year ago
Star * means the batsman was not out 😊
@RyanAndMattDataScience 1 year ago
I appreciate it. Didn’t know
@Al-Ahdal 8 months ago
@RyanAndMattDataScience, yes, * means the batsman was not out, but it won't affect any calculations. Great work indeed.
@mehrantavakoli6816 6 months ago
👏👏👏❤❤
@salfrat55 1 year ago
Headley @4 min mark 😂😁
@RyanAndMattDataScience 1 year ago ⁺¹
Haha one day I’ll buy your dup
@yutomidorya459 6 months ago
lol Excel is better lol
@RyanAndMattDataScience 6 months ago ⁺²
Nope
