Real World Data Cleaning in Python Pandas (Step By Step)

  • Published: Dec 25, 2024

Comments • 105

  • @RyanAndMattDataScience
    @RyanAndMattDataScience  3 months ago +3

    Hey guys I hope you enjoyed the video! If you did please subscribe to the channel!
    Join our Data Science Discord Here: discord.com/invite/F7dxbvHUhg
    If you want to watch a full course on Machine Learning check out Datacamp: datacamp.pxf.io/XYD7Qg
    Want to solve Python data interview questions: stratascratch.com/?via=ryan
    I'm also open to freelance data projects. Hit me up at ryannolandata@gmail.com
    *Both Datacamp and Stratascratch are affiliate links.

    • @bluhblubah8464
      @bluhblubah8464 3 months ago

      The star indicates that the player scored those runs without getting out, for example "his highest score is 269 not out".

    • @passportbro904
      @passportbro904 3 months ago

      Timestamps?

  • @ArmanKHAN-bj9iv
    @ArmanKHAN-bj9iv 1 year ago +10

    Fantastic tutorial! Your step-by-step guide on data cleaning in Python Pandas was excellent. Clear explanations and practical examples made it easy to follow along. Looking forward to more of your uploads. Keep up the great work!

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago

      Thank you! I’ll have another Python video up this week as well as more coming soon!

  • @koo5867
    @koo5867 9 months ago +4

    Now that’s some cool content. This is exactly what I wanted. Thanks bro 🙏🏼 keep helping poor students like us! 😌

  • @nickdaboss03
    @nickdaboss03 1 year ago +10

    you work super hard and put out really good content. Keep it up man, I'm looking forward to watching you grow!

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago

      Thank you! Have another video ready to go later this week as well as 90% done with another Python interview question video.

  • @AmaRan31
    @AmaRan31 3 months ago +2

    it was a very good training. Thank you for making this video. I have implemented the project myself and I am even thinking about moving forward.

  • @DataJunkieTX
    @DataJunkieTX 6 months ago +1

    Mistakes always help me learn because they force me to recall new and old knowledge. Depending on how common the mistake was (>3 times), I end up retaining it and checking for it automatically; I rarely see that mistake again.

  • @RoleJohn
    @RoleJohn 1 year ago +2

    I guess by watching your videos while preparing my own portfolio, I am halfway there. Thanks a lot

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago

      No problem. My first batch of classification vids is done; working on regression now

  • @AJAY7509
    @AJAY7509 8 months ago +1

    This video really helped me, man. I was trying to learn about pandas and it popped up in my notifications. Thanks for the video.

  • @prathmesh_jadhav8930
    @prathmesh_jadhav8930 8 months ago +1

    Brother, you're doing awesome… Upload more videos related to data analysis

  •  2 months ago

    This video saved me tons of hours, thanks for sharing your knowledge.

  • @Al-Ahdal
    @Al-Ahdal 8 months ago

    @Ryan Nolan: Excellent Video. Very clearly explained. I'm looking forward to watching you grow!

  • @pradeeppadeliya
    @pradeeppadeliya 1 year ago +1

    This is the best tutorial .... 👍👍👍👍👍👍👍👍👍👍👍👍👍👍

  • @davideschreiber2821
    @davideschreiber2821 1 year ago

    Lots of good stuff here, but I finally gave up at 31:24. If you're confused about what's happening, imagine how confused we learners are as you bounce around from cell to cell copying-pasting-deleting-trying again, trying to figure things out.

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago

      Bugs are part of programming and no one is perfect. I show how it’s solved and why it happens

  • @far3582
    @far3582 11 months ago

    I am trying to move away from R, and this is a great video. Thanks Ryan!

  • @DivyataShri
    @DivyataShri 16 days ago

    Why make changing the data type so long at 29:00? Can't we just use the same method we used to change the data types for rookie year and final year?

  • @tianbowen721
    @tianbowen721 7 months ago

    Pretty Amazing :) and I'd say it's some dense content to fit in 40 mins ~~I learned a lot

  • @Al-Ahdal
    @Al-Ahdal 8 months ago

    @Ryan Nolan: Your videos are great indeed. It is requested to have a comprehensive series on "Data Analytics & Visualization". Thanks

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  8 months ago

      I have a full data analyst playlist, check it out

    • @Al-Ahdal
      @Al-Ahdal 8 months ago

      @@RyanAndMattDataScience, could you please link it or point me to it? Thanks

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  8 months ago

      @@Al-Ahdal ruclips.net/p/PLcQVY5V2UY4JrrKi2bW7DdOD08shTs4QQ

  • @tapspasi2319
    @tapspasi2319 8 months ago

    Amazing! Very good presentation

  • @indiecasmlive1917
    @indiecasmlive1917 5 months ago

    The star on the highest score means that the player was NOT OUT at the end of the innings. Players without a star were OUT on the score shown.

  • @mindacid3274
    @mindacid3274 5 months ago +1

    A star at the end of the score means those runs were scored NOT OUT, with the player still batting

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  5 months ago

      Yup learned from a few people. I don’t see Cricket statistics often in the US

  • @VladislavShishkin11
    @VladislavShishkin11 1 year ago +5

    I completed the project, but when I reopened it today all the code was still there, yet typing df showed the old, uncleaned table. How do I make sure this doesn't happen again?

  • @ReadyF0RHeady
    @ReadyF0RHeady 22 days ago

    What if the country name or the player is written as a synonym or nickname? I currently want to merge index data from various countries, but the names are written differently in the dataframes (United States, United States of America). How do I handle synonyms here so that there is only one spelling? It's over 180 country names, so it's a fairly big dataset to compare manually.
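
One possible way to approach the synonym problem raised above is an alias table applied before the merge. This is only a sketch: the `ALIASES` mapping, the `normalize` helper, and the column names `Index_A` and `Index_B` are hypothetical and not from the video.

```python
import pandas as pd

# Hypothetical alias table: every known spelling maps to one canonical name.
ALIASES = {
    "United States of America": "United States",
    "USA": "United States",
}

# Two made-up frames that spell the same country differently.
left = pd.DataFrame({"Country": ["United States", "France"], "Index_A": [1.0, 2.0]})
right = pd.DataFrame({"Country": ["United States of America ", "France"], "Index_B": [3.0, 4.0]})


def normalize(names: pd.Series) -> pd.Series:
    """Trim whitespace, then swap known aliases for the canonical name."""
    return names.str.strip().replace(ALIASES)


left["Country"] = normalize(left["Country"])
right["Country"] = normalize(right["Country"])

# indicator=True adds a "_merge" column, so leftover mismatches are easy to list.
merged = left.merge(right, on="Country", how="outer", indicator=True)
unmatched = merged.loc[merged["_merge"] != "both", "Country"].tolist()
```

Anything left in `unmatched` is a spelling the alias table does not cover yet, so the manual work shrinks to the names that actually fail to match.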

  • @satishharijan7280
    @satishharijan7280 1 year ago +1

    Nice lecture bro, thanks for this. It is a useful video for me

  • @suvanshgaurav8996
    @suvanshgaurav8996 4 months ago

    Great learning we're getting here! Everything explained precisely 👍 Just one doubt: why use (axis=1) and (axis=0)?

  • @khan07700
    @khan07700 7 months ago

    Sir, when we import data from the site into a table at 1:54, I'm not getting the option of table 0. What's the solution for that?

  • @henry-o8i
    @henry-o8i 9 months ago

    Thanks, I appreciate this tutorial. Just one question on Q5: why is it already a data frame, while we have to use to_frame for Q4? Thanks

  • @maheshvaka-h3n
    @maheshvaka-h3n 10 months ago

    Totally it was a great effort and much appreciated for your hard work. I would like to know how to remove or drop null values from the columns.
    Thanks in advance

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  9 months ago +1

      Look up dropna

    • @maheshvaka-h3n
      @maheshvaka-h3n 9 months ago

      Cheers man... any advice on how to remove years from a column? For instance, if a column has both numeric and year values and I want to remove only the years (in a format like 2004). @@RyanAndMattDataScience
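
The dropna suggestion in this thread can be sketched roughly as follows. The frame and its column names are made up for illustration; only `DataFrame.dropna` itself comes from the reply.

```python
import numpy as np
import pandas as pd

# Made-up sample with a missing value in two different columns.
df = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "Runs": [100.0, np.nan, 250.0],
        "Avg": [50.0, 40.0, np.nan],
    }
)

# Drop every row that has a null in any column.
any_null_dropped = df.dropna()

# Drop rows only when the "Runs" column is null; nulls elsewhere are kept.
runs_required = df.dropna(subset=["Runs"])
```

`dropna` returns a new frame, so the result has to be assigned back (for example `df = df.dropna()`) for the change to stick.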

  • @Patrick-l5h7r
    @Patrick-l5h7r 2 months ago

    Thanks Ryan, great tutorial. I was pleasantly surprised that you knew the name of the great WI batsman, Sir Garry Sobers. Are you from the West Indies ?

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  2 months ago

      Nope, I just collect cards and have a few of Sobers

    • @Patrick-l5h7r
      @Patrick-l5h7r 2 months ago

      That’s great. Card collecting can be financially rewarding !

  • @lewismurigi3623
    @lewismurigi3623 10 months ago

    This was so helpful, thanks man

  • @nagamanickam6604
    @nagamanickam6604 1 year ago +1

    Thank you Ryan Nolan

  • @ArhamZaiem
    @ArhamZaiem 7 months ago

    For the highest innings score, why didn't you use rstrip to remove the * instead of split?
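
For comparison, here is a small sketch of both approaches mentioned in this comment, assuming a string column named `Highest_Inns_Score` (the name used elsewhere in the comments) with made-up sample values.

```python
import pandas as pd

# Made-up sample values; a trailing "*" marks a not-out score.
df = pd.DataFrame({"Highest_Inns_Score": ["334", "380", "400*", "365*"]})

# Split on "*" and keep the first piece.
via_split = df["Highest_Inns_Score"].str.split(pat="*").str[0]

# Alternative raised in the comment: strip trailing "*" characters.
via_rstrip = df["Highest_Inns_Score"].str.rstrip("*")

# Both leave only the digits, so either result can be cast to integers.
df["Highest_Inns_Score"] = via_rstrip.astype(int)
```

On data like this the two give the same result; `rstrip` only touches the end of the string, while `split` would also cut at a "*" appearing in the middle of a value.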

  • @lenkapang-ek4fe
    @lenkapang-ek4fe 1 month ago

    Hi, what is the meaning of "hundreds, fifties, ducks (0) AVG by country"?

  • @ShehneelKhan-p3x
    @ShehneelKhan-p3x 6 months ago

    The * in Highest_Inns_Score means the player was not out in that inning.

  • @loydteds3944
    @loydteds3944 7 months ago

    Your video is very helpful! One question though: how do you remove duplicates in high-dimensional data, let's say with 500 duplicates? Thanks

    • @mindacid3274
      @mindacid3274 5 months ago

      df.drop_duplicates(), or if you only want to check a subset of columns for repeats, use df.drop_duplicates(subset=[ ] , keep="") and specify whether you want to keep the first or the last occurrence when dropping
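
A minimal sketch of the `drop_duplicates` calls described in this reply, using a made-up frame. The column names are illustrative only.

```python
import pandas as pd

# Made-up sample: rows 0 and 1 are exact duplicates; rows 2 and 3 share a player.
df = pd.DataFrame(
    {
        "Player": ["A", "A", "B", "B"],
        "Country": ["X", "X", "Y", "Y"],
        "Runs": [10, 10, 20, 30],
    }
)

# Count fully duplicated rows before removing anything.
n_dupes = int(df.duplicated().sum())

# Remove rows that are identical across every column (keeps the first copy).
deduped = df.drop_duplicates()

# Treat rows as duplicates when "Player" repeats, keeping the last occurrence.
by_player = df.drop_duplicates(subset=["Player"], keep="last")
```

The number of columns does not change the call; `subset` is what limits which columns are compared.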

  • @Uknowme-e9b
    @Uknowme-e9b 7 months ago

    Bro... you should have used the replace method with regex to clean characters like * and + from the columns

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  7 months ago

      I used regex in my latest project and have a video coming out on it soon, funny enough
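
The regex idea suggested in this thread might look something like the sketch below. It is not the code from the video; `Balls_Faced` is a hypothetical column name and the values are made up.

```python
import pandas as pd

# Made-up sample with stray "*" and "+" markers in otherwise numeric columns.
df = pd.DataFrame(
    {
        "Highest_Inns_Score": ["400*", "334"],
        "Balls_Faced": ["9867+", "1234"],
    }
)

for col in ["Highest_Inns_Score", "Balls_Faced"]:
    # The character class [*+] matches either symbol anywhere in the string.
    df[col] = df[col].str.replace(r"[*+]", "", regex=True).astype(int)
```

One pattern handles every marker at once, which replaces a separate split step per symbol.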

  • @SuccessGossips
    @SuccessGossips 7 months ago +1

    star means not out with highest score, you don't need to remove it

  • @viralhunt8637
    @viralhunt8637 1 month ago

    Very nice buddy. Sometimes it feels very tough and sometimes easy; maybe that happens because of a loss of confidence

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 month ago

      Just keep going man. Stuff I’ve found confusing I look back at and think it’s not that bad

  • @inaaahsings
    @inaaahsings 3 months ago

    Thank you!!

  • @salfrat55
    @salfrat55 1 year ago

    "FS Jackson played for Cambridge University, Yorkshire and England. He spotted the talent of Ranjitsinhji when the latter, owing to his unorthodox batting and his race, was struggling to find a place for himself in the university side, and as captain was responsible for Ranji's inclusion in the Cambridge First XI and the awarding of his Blue. According to Alan Gibson this was "a much more controversial thing to do than would seem possible to us now". He was named a Wisden Cricketer of the Year in 1894.
    He captained England in five Test matches in 1905, winning two and drawing three to retain The Ashes. Captaining England for the first time, he won all five tosses and topped the batting and bowling averages for both sides, with 492 runs at 70.28 and 13 wickets at 15.46. These were the last of his 20 Test matches, all played at home as he could not spare the time to tour."

  • @nessim.liamani
    @nessim.liamani 6 months ago

    Hi Ryan,
    I'd like to understand how you would have treated a file with millions or tens of millions of lines to spot those "*", "-" and "+" characters.
    You spotted them here manually, by eye.
    Can anyone help me figure that out?
    Thanks
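
One way to answer the question above without scanning by eye is to let pandas flag every value that contains a non-digit character. This is a sketch with a made-up `Runs` series, not something shown in the video.

```python
import pandas as pd

# Made-up sample; in practice this would be one column of the large file.
s = pd.Series(["120", "45*", "-", "9867+", "300"], name="Runs")

# True wherever the value contains at least one non-digit character.
mask = s.str.contains(r"\D", regex=True)
offenders = s[mask]

# Tally each distinct non-digit character across the whole column.
char_counts = s.str.findall(r"\D").explode().value_counts()
```

`char_counts` lists every stray symbol and how often it occurs, so the cleaning rules can be written from that summary rather than from inspecting rows.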

  • @MrFravallec
    @MrFravallec 10 months ago

    Great tutorial, got this issue on the data types: AttributeError Traceback (most recent call last)
    Cell In[11], line 1
    ----> 1 df['Inns']= df["Inns"].str.split(pat = '*').str[0]
    File ~\anaconda3\Lib\site-packages\pandas\core\generic.py:5902, in NDFrame.__getattr__(self, name)
    5895 if (
    5896 name not in self._internal_names_set
    5897 and name not in self._metadata
    5898 and name not in self._accessors
    5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
    5900 ):
    5901 return self[name]
    -> 5902 return object.__getattribute__(self, name)
    File ~\anaconda3\Lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180 # we're accessing the attribute of the class, i.e., Dataset.geo
    181 return self._accessor
    --> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)
    File ~\anaconda3\Lib\site-packages\pandas\core\strings\accessor.py:181, in StringMethods.__init__(self, data)
    178 def __init__(self, data) -> None:
    179 from pandas.core.arrays.string_ import StringDtype
    --> 181 self._inferred_dtype = self._validate(data)
    182 self._is_categorical = is_categorical_dtype(data.dtype)
    183 self._is_string = isinstance(data.dtype, StringDtype)
    File ~\anaconda3\Lib\site-packages\pandas\core\strings\accessor.py:235, in StringMethods._validate(data)
    232 inferred_dtype = lib.infer_dtype(values, skipna=True)
    234 if inferred_dtype not in allowed_types:
    --> 235 raise AttributeError("Can only use .str accessor with string values!")
    236 return inferred_dtype
    AttributeError: Can only use .str accessor with string values!

    • @Muhammad.Kashif31
      @Muhammad.Kashif31 9 months ago +1

      Your column may contain integer data; that's why you are getting the error
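
The explanation in this reply can be reproduced with a small sketch: if `Inns` was already parsed as integers, the `.str` accessor raises the error from the traceback, and casting to string first avoids it. The sample values are made up.

```python
import pandas as pd

# Made-up sample where "Inns" was already read in as integers.
df = pd.DataFrame({"Inns": [120, 95, 87]})

# Using .str on an integer column raises the AttributeError from the traceback.
raised = False
try:
    df["Inns"].str.split(pat="*")
except AttributeError:
    raised = True

# Casting to string first makes the same cleaning step safe for either dtype.
df["Inns"] = df["Inns"].astype(str).str.split(pat="*").str[0].astype(int)
```

Checking `df.dtypes` before cleaning shows which columns are still text and actually need the split.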

  • @RAM_JAN22
    @RAM_JAN22 5 months ago

    The * in the HS column means that the player was not out in that match

  • @yvonnemukhono3566
    @yvonnemukhono3566 7 months ago

    Very helpful.

  • @kadircalloglu2848
    @kadircalloglu2848 6 months ago

    Why didn't we use SQL after the types were changed?

  • @benayawilly6536
    @benayawilly6536 1 year ago

    good work. keep it up

  • @adedayojoseph9775
    @adedayojoseph9775 3 months ago

    How can I download this data?

  • @ri5habh
    @ri5habh 5 months ago

    thank you sir..

  • @pavankalyan_297
    @pavankalyan_297 1 year ago

    The star in the Highest score column means they were not out at the end of the innings. Great tutorial Ryan. Would it be possible for you to attach the notebook file here?

  • @hemantsharma-xf3ub
    @hemantsharma-xf3ub 9 months ago

    Where can I get the notes?

  • @taha5754
    @taha5754 6 months ago

    Can you share the notebook used in this tutorial? @RyanNolanData

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  6 months ago

      I need to make a website article on this. It’ll have the code in there

  • @tasmisa6778
    @tasmisa6778 6 months ago

    How am I supposed to know that all the abbreviations stand for the names you just gave them???

  • @Sylvestre555
    @Sylvestre555 4 months ago

    link

  • @sachinnambiar
    @sachinnambiar 10 months ago +1

    It's a dictionary, right? Not a list.
    #rename multiple columns in a dictionary
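
As the comment notes, `rename` takes a dictionary that maps old names to new ones. A short sketch, with hypothetical abbreviations and replacements rather than the exact ones from the video:

```python
import pandas as pd

# Hypothetical abbreviated headers, similar in spirit to a scraped stats table.
df = pd.DataFrame({"Mat": [52], "Inn": [80], "NO": [10]})

# Keys are the existing column names, values are the new names.
df = df.rename(columns={"Mat": "Matches", "Inn": "Innings", "NO": "Not_Outs"})
```

Columns not listed in the dictionary keep their current names.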

  • @rajareddyraju6773
    @rajareddyraju6773 9 months ago

    19:09

  • @dogzrgood
    @dogzrgood 1 year ago

    Star * means the batsman was not out 😊

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago

      I appreciate it. Didn’t know

    • @Al-Ahdal
      @Al-Ahdal 8 months ago

      @@RyanAndMattDataScience, yes, * means the batsman was not out, but it won't affect any calculations. Great work indeed.

  • @mehrantavakoli6816
    @mehrantavakoli6816 6 months ago

    👏👏👏❤❤

  • @salfrat55
    @salfrat55 1 year ago

    Headley @4 min mark 😂😁

  • @yutomidorya459
    @yutomidorya459 6 months ago

    lol on excel is better lol