Data Cleaning Tutorial | Cleaning Data With Python and Pandas

  • Published: Feb 3, 2025
  • Science

Comments • 138

  • @anmol_seth_xx 2 years ago +29

    First of all thanks to you,
    I learned ffill, bfill and interpolate functions from here.
    But it's recommended from many professionals that missing values should be imputed with mean, median & mode.
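
A minimal sketch of the mean, median and mode imputation mentioned here, assuming a small hypothetical DataFrame (the column names `temperature` and `city` are illustrative and not taken from the video's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "temperature": [20.0, np.nan, 24.0, 28.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Numeric column: fill gaps with the mean (swap in .median() for skewed data).
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Categorical column: fill gaps with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Which statistic fits best depends on the column: the median is usually safer when outliers are present, and the mode is the usual choice for text categories.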

  • @kaustubhjain7066 4 years ago +28

    the chill moment when he codes and sips up coffee, cool man great work !!!

  • @YOGIT_Singh 2 years ago +3

    Bro, you are Doing such great work..
    The way you are explaining things is excellent and easy to understand.
    Kudos to you .

  • @kdausu90 3 years ago +2

    This guy is chill af

  • @alishalbayev264 4 years ago +5

    I just started watching but I see that you are really good man, thanks

  • @kurosaki2510 2 years ago +17

    Could you make a tutorial on Big Data as well, for situations with e.g. 500k rows and 200 columns where you don't see all of your data and don't know what kinds of Nan values to expect and therefore can't name them textually? Big thanks in advance :)

    • @oktafajarandrian7352 2 years ago

      Up

    • @Alexander-ms2ct 2 years ago +2

      It’s a keyword argument (kwarg) called “chunksize=”; the integer passed to it is the number of rows per chunk. So if you select 1000 and have a df of 500k rows, it would load in 500 pieces.
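
A rough sketch of the `chunksize` idea from this reply, assuming a tiny in-memory CSV as a stand-in for a large file (the data and the chunk size of 2 are made up for illustration). It counts missing values per column without loading everything at once:

```python
import io

import pandas as pd

# Small in-memory stand-in for a large CSV file (hypothetical data).
csv_text = "a,b\n1,x\n,y\n3,\n4,z\n,\n6,w\n"

missing_per_column = None
n_chunks = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = chunk.isna().sum()
    if missing_per_column is None:
        missing_per_column = counts
    else:
        missing_per_column = missing_per_column + counts
    n_chunks += 1

print(n_chunks)
print(missing_per_column)
```

For a real file, the `io.StringIO(...)` argument would be replaced by the file path and the chunk size by something like 1000 or larger.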

  • @sheetalurankar4660 4 years ago +6

    Explain the more concepts like standardizing, matching, consolidation so we get all idea about data cleaning.

  • @sifar1857 2 years ago

    Great tutorial! No need to hesitate on referring to the code snippets btw… I don’t think any sane person watching this has the expectation for you to memorise a to z what you want to articulate…

  • @harshvardhansahay3864 2 years ago

    This video has really helped me understand the data cleaning process. Thanks Man.

  • @ravitejapavuluri945 4 years ago +5

    You covered everything but missed the one I was waiting for: filling NaN with mean or median values.

  • @RavikaUniverse. 1 year ago

    awesome ,good and precise data cleaning
    please load some more stuff related to pandas

  • @Young-Prof 7 months ago

    This is amazing. I learned a lot. I want to come to India to study Data Analytics

  • @santhoshbharath2910 3 years ago

    hats off to you man, you really made my day, you gave me a good confidence today, once again thankyou so much sir

  • @SheetalMuragunde 6 days ago

    Brother, I have one question: after we clean the data, how can we convert it into a csv/excel file?

  • @kousumichaudhuri3793 1 year ago

    Thanks a lot bro. The way you explained the steps are really helpful.

  • @rommel23nb 7 months ago

    Thanks Mr. Shah--- I used these commands to prepare a cheat sheet for data cleaning--- regards

  • @sammy0722 5 years ago +2

    Nice way of explaining. One request, do make one video on Outlier removal.
    Thanks.

    • @SoumilShah 5 years ago +1

      you got it !! will add on my to do List

  • @RaviPrakash-ml8qb 1 year ago

    Very useful video keep it up champ

  • @aliaitazaz7040 8 months ago

    Learnt something new, THANK YOU!

  • @thewhiskybottle4641 1 year ago

    Take the ads off please, thanks, great tutorial by the way.

  • @poojakumarirollno9880 1 year ago

    Great job sir . Thank you so much for great explaining . Can u make more videos on data analytics

  • @eshaal2525 2 years ago +1

    Hi...I have na values but tha boxplot is even not showing it null...

  • @laibakhan1835 2 years ago

    Great job well done n thanx a lot ...I explained v well

  • @slayergaming1852 2 years ago +1

    This video really helped me a lot, but I still got more to understand. I've zero basic knowledge on this. I'm working on a thesis which needs some coding to complete. I've few questions to what you've explained in this video;
    1. What if there are lot of dataset and how do you define the missing value for each?
    2. What was that in the missing value you defined "np.nan" ?
    As I said earlier, I'm working on a project which is about human-in-the-loop code. Initially I'll be given the dataset and have to figure out a code to include human for feedback from system (Reinforcement Learning). I would like get your response, and if possible any helpful idea or suggestions on the project mentioned above.
    Thank you

    • @debanjangg 2 years ago +1

      1. You can use separate lists (with diff variable names,) or a single list as a master list for all the datasets. Depends on the said datasets and the data they contain.
      2. np.nan is the "NaN" value in the dataset, which means Not a Number. So basically the np.nan returns a float object whose value is NaN.
      Hope this helps.
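
To make the two answers above concrete, here is a small sketch, assuming a made-up CSV and a hypothetical `missing_values` list (the markers `--`, `missing` and `n.a.` are illustrative, not the ones used in the video):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical file contents with several ad-hoc missing-value markers.
csv_text = "temperature,humidity\n21,40\n--,45\nmissing,n.a.\n25,50\n"

# Every string in this list is read as NaN by read_csv.
missing_values = ["--", "missing", "n.a."]
df = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

# np.nan is an ordinary Python float. It never compares equal to itself,
# so isna() is the reliable way to detect it.
is_float = isinstance(np.nan, float)
nan_equals_itself = np.nan == np.nan
missing_counts = df.isna().sum()

print(df)
print(is_float, nan_equals_itself)
print(missing_counts)
```

If different files need different markers, a separate list per file (or a dict mapping column names to markers, which `na_values` also accepts) keeps the rules apart.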

    • @slayergaming1852 2 years ago +1

      @@debanjangg That was helpful. Thanks mate!🤩

    • @luckycreative7418 1 year ago

      1) u can also use unique function to get only unique values
      Example
      df['Customers'].unique()
      in the example above u will get all the unique values in the column 'Customers'
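
A short sketch of this suggestion, assuming a hypothetical `Customers` column with a few odd placeholder strings. `unique` is a method, so it needs call parentheses, and `value_counts(dropna=False)` is one way to spot markers that should be treated as missing when the file is too big to read by eye:

```python
import pandas as pd

# Hypothetical column; the name "Customers" follows the comment above.
df = pd.DataFrame({"Customers": ["Ann", "Bob", "n/a", "Ann", "--", None]})

# unique() returns each distinct value once, including the missing one.
unique_values = df["Customers"].unique()

# value_counts(dropna=False) also shows how often each value occurs,
# which makes odd placeholders such as "n/a" or "--" easy to notice.
value_counts = df["Customers"].value_counts(dropna=False)

print(unique_values)
print(value_counts)
```

Any suspicious strings found this way can then be passed to `na_values` when the file is read again.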

  • @Ayanshedipelly2312 7 months ago

    How to do interpolation for categorical variable

  • @greeshmatejamyna 2 years ago

    i can use excel for it right ? my work will be more easy kindly persuade me y i have to use python instead of excel here

  • @gideonopoku-gyamfi1114 1 year ago

    So please which do you think is more efficient to be used without changing the accuracy of the data

  • @crazystuff5854 3 years ago

    Superb broo Rock on

  • @ogunoyeadebamigbe1715 1 year ago

    Good job

  • @minapatil185 5 months ago

    New learned null values like nan, na Nan something new thought us. Thank u...

  • @michaelchapisa3709 6 months ago

    I got lost at the very beginning, from the "print(os.listdir())" what i got as my output is very different from what you got

  • @sayantikabiswas8739 1 year ago +2

    thanks for teaching ffill, bfill, interpolate and fillna ...it was a great session

  • @AyaanKhan-rh5vx 7 months ago

    I have a csv file, and when I use the concat function it automatically names columns Unnamed 1, 2, 3... Also the alignment gets messy with a single line of code.
    How to fix it?

  • @enricomendiola9952 1 year ago

    This is a great video😊

  • @एकोनारायणा

    what is the use of ffill? Isn't it data corruption? filling nulls with values from above rows

  • @shyamkumar-fh2fh 4 years ago +3

    Can u tell about yourself... The company u are working and can u give some tips to get a job in data science field as a fresher

  • @spyder5204 2 years ago

    Good and nice explaination

  • @alaberedaisy8171 10 months ago

    Thank you so much. This was really helpful

  • @ShresthBhakta 1 year ago

    Thanks Soumil, really helpful !!

  • @bloom6874 2 years ago

    here values in missing_values list are case-sensitive or case-insensitive?
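
As far as I know, strings passed to `na_values` are matched exactly, so the match is case-sensitive. A quick check, assuming a made-up one-column CSV and the hypothetical marker `missing`:

```python
import io

import pandas as pd

# Hypothetical data: the same marker in two different capitalisations.
csv_text = "status\nmissing\nMissing\nok\n"

# Only the exact lowercase string is listed as a missing-value marker.
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

is_missing = df["status"].isna().tolist()
print(df)
print(is_missing)
```

If both spellings should count as missing, the simplest fix is to list every variant, for example `["missing", "Missing", "MISSING"]`.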

  • @sonalikoli384 2 years ago

    i have question i wrote print(os.listdir()). but i got many files that is inside my jupyter. may i know how can i import my csv file that i have clean.

  • @world52love 6 months ago

    how to handle zero values in csv file and how to fill those values
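
One possible approach, sketched under the assumption that a zero in this column is a placeholder for "no reading" rather than a real measurement (the `humidity` column is hypothetical): turn the zeros into NaN first, then fill them like any other missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where 0 stands for "not recorded".
df = pd.DataFrame({"humidity": [40.0, 0.0, 50.0, 0.0, 60.0]})

# Step 1: mark the placeholder zeros as missing.
df["humidity"] = df["humidity"].replace(0, np.nan)

# Step 2: fill the gaps, here with the mean of the remaining values.
df["humidity"] = df["humidity"].fillna(df["humidity"].mean())

print(df)
```

The same pattern works for other sentinel values such as -1. If zero is a legitimate value in the data, this step should be skipped.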

  • @harrymary100 3 years ago

    Nice tutorial keep it up

  • @bhawitbalodi4324 3 years ago

    Pls can you tell me that from where you bought that LAPTOP stand? Pls attach link in comment

  • @rashigupta5611 2 years ago

    How to get how many type different type of value is there to put in na_values? I mean to say the value you have mentioned for missing_value.. how you are getting that.. we cant check the file if that has huge data

  • @ganeshkumarpatel 4 years ago +2

    Please explain to fiilna or replace zero with mean value by groupby... Means you have 3 groups in data frame and you want to fillna with respective to group mean
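
A minimal sketch of filling NaN with each group's own mean, assuming a hypothetical DataFrame with a `group` column and a `value` column. `groupby(...).transform("mean")` returns a Series aligned with the original rows, so it can be handed straight to `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two groups and one gap in each.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# Per-row mean of the group that row belongs to (NaN is ignored in the mean).
group_means = df.groupby("group")["value"].transform("mean")

# Fill each gap with the mean of its own group.
df["value"] = df["value"].fillna(group_means)

print(df)
```

To treat zeros the same way, they can be replaced with NaN before this step.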

  • @nazeer9933 1 year ago

    The dataset which you have is having fewer instances
    what if we have thousands of rows of data how to find Nan, and Na there in the dataset ...?
    if you see this please respond ASAP

  • @Karthik-m9j 1 year ago

    essentially you explained it very well

  • @SONALIKUMARI-is9jc 4 years ago +1

    What if the file is not there on which we want to work?

  • @ArchieSharma-x4r 1 year ago

    Hii Soumil, right now I'm working on language translation project for that I have collected the data, but I'm facing preprocessing data could you please help me with that.

  • @kunalkishore5260 2 years ago

    what to mention in na_values if we dont know the missing vlaues or there are hundreds of different missing values

  • @hamdansiddiqui3294 2 years ago

    Great video very helpful

  • @littlecreator4838 2 years ago +1

    Hi,very nice explanation. I am totally new to python. Can you pls make a tutorial on how to install jupyter and all the other required libraries to perform forecasting.

  • @Nitswits007 4 years ago +1

    Would you be able to provide mentorship. I have started learning DS . Want to keep moving in a direction .jyst don't stop due to coding lag.

  • @vandanasharma.sharma33 4 years ago +1

    Can't we add any value to these NaN?

  • @machinelearning1357 2 years ago

    really great

  • @ehteshamali2893 3 years ago

    interpolation is basically like average? right?
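
Roughly, yes, for a single gap: default linear interpolation gives the average of the two neighbours. For a longer gap the values are spaced evenly between the neighbours. A small sketch on a made-up Series, with `ffill` and `bfill` shown alongside for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one single gap and one double gap.
s = pd.Series([10.0, np.nan, 20.0, np.nan, np.nan, 50.0])

# Linear interpolation (the default): a straight line between known points.
interpolated = s.interpolate()

# Forward fill copies the last known value down into the gap.
forward_filled = s.ffill()

# Backward fill copies the next known value up into the gap.
backward_filled = s.bfill()

print(interpolated.tolist())     # [10.0, 15.0, 20.0, 30.0, 40.0, 50.0]
print(forward_filled.tolist())   # [10.0, 10.0, 20.0, 20.0, 20.0, 50.0]
print(backward_filled.tolist())  # [10.0, 20.0, 20.0, 50.0, 50.0, 50.0]
```

Forward and backward fill tend to make sense for ordered data such as time series, where the neighbouring reading is a reasonable stand-in. On unordered rows they simply copy an unrelated value.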

  • @jaiprakashlic484 2 years ago +1

    BRO YOU SHOULD HAVE UPLOADED CSV FILE AS WELL.

  • @kamalkantmahour9641 4 years ago +1

    Sir, please tell me the book from where I can learn all these concepts and programming skills required for this.

  • @IndianHacker-hisBest 1 year ago

    bro, could you please share the dataset in the description ?

  • @akashme-ek3vc 2 years ago

    please donot stop uploading such videos

  • @NaveenKumarsoma 4 years ago +1

    why you have used np.nan in the mission_values?

  • @ranamahrous7814 4 years ago

    i want messy dataset for practicing do u know where can i find one? or do u have one ?

  • @alinajaved2165 3 years ago

    I need your help please please...how it work in automatically cleaning data?

  • @deutschvalley3574 3 years ago

    How i can handle float values 2.o or 0.04576 kindly let me i am doing a research using a big datasets

  • @balatechtvm1438 2 years ago

    Thank you sir. I'm very satisfied

  • @kiranpawar8798 11 months ago

    What is noisy data

  • @jayakhanal1720 4 years ago

    how to cleaning data that is TZAN dataset from kaggle for,music genre classification using cnn?

  • @husnarazool9866 4 years ago +1

    How to save the dataset after cleaning process in python?

  • @shilpachowdhury8860 2 years ago

    Sir, by running the data cleaning code in jupyter notebook by following the same code instruction given by u, when i run the code in the output it is not showing unnamed:0 temperature humidity & in my jupyter system it is showing such as v1 &v2 in the output.Why it is so?can u plz explain.

    • @deepthi5970 1 year ago

      I also faced the same problem. When we create a new csv file there is no "Unnamed: 0" column. But if we save the DataFrame as csv into a folder, the index is written as an extra column, which is read back as "Unnamed: 0".
      If we read this data, the output will be like in this video. If repeated, one extra column is added each time. To avoid it, use index=False while saving. It will work.
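
A small sketch of the behaviour described in this reply, using an in-memory round trip instead of a file on disk (the column names are borrowed from the thread, the values are made up):

```python
import io

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({"temperature": [21, 25], "humidity": [40, 50]})

# to_csv() with no path returns the CSV text instead of writing a file.
with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index left out

reloaded_with_index = pd.read_csv(io.StringIO(with_index))
reloaded_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

To save the cleaned data to disk, the same call takes a path, for example `df.to_csv("cleaned.csv", index=False)` (the filename is only an example). `df.to_excel(...)` works similarly but needs an Excel writer package such as openpyxl installed.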

  • @rinkubaria3900 2 years ago

    So helpful 👌

  • @siddheshbhalerao3152 2 years ago

    sir, facing a issue in a code to convert the variable from object to integer in jupyter notebook:- it shows the error:-invalid literal for int() with base 10: '-'

    • @bloom6874 2 years ago

      you can typecast object into int
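
A plain cast fails here because the column contains the string '-', which is not a number. One hedged way around it, assuming '-' is a placeholder for a missing value, is `pd.to_numeric` with `errors="coerce"` (the Series below is made up):

```python
import pandas as pd

# Hypothetical object column where "-" marks a missing entry.
s = pd.Series(["12", "-", "7"])

# Anything that cannot be parsed as a number becomes NaN instead of raising.
as_numbers = pd.to_numeric(s, errors="coerce")

# Plain int cannot hold NaN; the nullable "Int64" dtype can.
as_int = as_numbers.astype("Int64")

print(as_numbers.tolist())
print(as_int)
```

Alternatively, '-' can be listed in `na_values` when reading the file, so the column arrives as numeric in the first place.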

  • @kelikisbiyantoro2518 4 years ago

    thankyou very much soumilshah, its help for me

  • @samimerk5313 2 years ago

    Thank you for the explanation!

  • @rakshansadhu2073 3 years ago

    Loved it man

  • @sudeepjayaprakash9224 2 years ago

    Thank you sir helped me a lot

  • @mahidhar9787 4 years ago

    how can we replace using mean, median & mode

  • @rohitsanam8829 4 years ago

    Please make a video outlier treatment and detection

  • @angshumanbardhan3729 4 years ago +1

    Thank you for making this video.

  • @minhaaj 4 years ago

    good job

  • @enricomendiola9952 1 year ago

    Hello can you include in this video in cleaning special characters using pandas regular expression?
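
A minimal sketch of stripping special characters with a regular expression, assuming a hypothetical `name` column and that "special" means anything other than letters, digits and spaces:

```python
import pandas as pd

# Hypothetical messy text column.
df = pd.DataFrame({"name": ["A#nn!", "B@ob", "C$arl  "]})

# Remove every character that is not a letter, digit or space,
# then trim leftover whitespace at both ends.
df["name"] = (
    df["name"]
    .str.replace(r"[^A-Za-z0-9 ]", "", regex=True)
    .str.strip()
)

print(df)
```

The character class would need adjusting for data that legitimately contains accents, hyphens or other punctuation.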

  • @laithdarras6389 2 years ago

    Very helpful

  • @sunilkhandale9232 4 years ago

    I want to replace 4wd with fwd in particular column please help
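
One way to do this, sketched with a made-up column (the name `drive-wheels` is an assumption, since the comment does not name the column): `Series.replace` swaps whole values that match exactly.

```python
import pandas as pd

# Hypothetical column; the real column name is not given in the comment.
df = pd.DataFrame({"drive-wheels": ["4wd", "fwd", "rwd", "4wd"]})

# Replace every cell equal to "4wd" with "fwd" in this one column only.
df["drive-wheels"] = df["drive-wheels"].replace("4wd", "fwd")

print(df)
```

Other columns are untouched because the replacement is applied to the selected column only.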

  • @kyleevalencia1827 4 years ago

    Sir, i'm still new in python and this data cleaning thing. And i want to ask what is 11 in df11 ?? is it some kind of function ?? and i also don't understand the snippet concept

    • @kashyapsantoki4889 3 years ago +1

      df11 is just variable you can take anything df11a,df12 df15 anything and snippet is basically a piece of code which he already have

  • @jeevankumarreddyravuru8710 4 years ago

    Do you have any pandas cheat scheet

  • @leticiavillatima1844 8 months ago

    AttributeError Traceback (most recent call last)
    Cell In[7], line 1
    ----> 1 pd_cleaned = pd.dropna()
    AttributeError: module 'pandas' has no attribute 'dropna'
    i can't find drop na
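
The traceback above comes from calling `dropna` on the pandas module (`pd`) itself. `dropna` is a method of a DataFrame (or Series), so it has to be called on the data. A small sketch with a made-up DataFrame named `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# Call dropna on the DataFrame, not on pd.
df_cleaned = df.dropna()

# Optionally drop rows only when specific columns are missing.
df_cleaned_a = df.dropna(subset=["a"])

print(df_cleaned)
print(df_cleaned_a)
```

In the notebook from the comment, `pd.dropna()` would become `your_dataframe.dropna()`, where `your_dataframe` is whatever variable holds the result of `pd.read_csv(...)`.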

  • @thekras177 2 years ago

    I refer to using mean but thank you it was helpful

  • @mohammedkaifmirza7585 2 years ago

    overall good tutorial, the only thing that is missing is source code (jupyter notebook)

  • @aditisinghxd 3 years ago +1

    This was really useful. Thankyou!

  • @srujangowda8490 3 years ago

    5:09 OP

  • @swetathakur8144 5 years ago

    Hello sir, I got an assignment , I am having difficulty in understanding the question , can u help , plz ... Plz reply

    • @SoumilShah 5 years ago

      Sweta Thakur what’s the question?

    • @swetathakur8144 5 years ago

      @@SoumilShah it's a long question based on unsupervised problem .. can I have your email id, so that I can contact you

    • @SoumilShah 5 years ago +3

      Sweta Thakur I won’t solve your assignment I will only tell you what to do
      Usually I’m free on sat and Sunday
      Shahsoumil519@gmail.com

  • @shafiqahmed1976 4 years ago

    data.fillna({"P11,":NEGATIVE}) NameError: name 'NEGATIVE' is not defined
    ?? Please help me ?what can I can do??
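
The NameError happens because NEGATIVE without quotes is read as a variable name. If the intent is to fill the gaps with the text "NEGATIVE", it needs to be a string. A sketch with made-up data; the column is assumed to be called `P11` (the trailing comma inside the quotes in the comment looks like a typo, but that is a guess):

```python
import pandas as pd

# Hypothetical column with one missing result.
data = pd.DataFrame({"P11": ["POSITIVE", None, "NEGATIVE"]})

# The fill value is quoted, so Python treats it as text, not a variable.
data = data.fillna({"P11": "NEGATIVE"})

print(data)
```

The dict key has to match the column name exactly, otherwise nothing in that column gets filled.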

  • @darshan7673 3 years ago

    from where i can download this dataset can anyone provide me link

  • @aiswaryacd7419 2 years ago

    How install Android phone Jupiter notebook

  • @chaitanyaparmar8639 2 years ago

    bro can u give this data set? asap

  • @Goal_Huntter_16 2 years ago

    I have to say. "Very Helpful". or print("very helpful") 😂😂

  • @maniteja2167 4 years ago

    nan palce -1 how to clean that

  • @Lejhand10 4 years ago

    well explained .

  • @ehteshamali2893 3 years ago

    Brother Great video. I need the tea as well. ;)

  • @abhinavsingh4208 4 years ago

    Thanks for this video !