Turning multiple CSV files into a single pandas data frame

  • Published: Oct 15, 2024
  • Data sets often come as multiple CSV files. How can you read those into a single pandas data frame? In this video, I show you how to use a Python "for" loop, or even a list comprehension, along with the "glob" module and the pd.concat function, to accomplish this.
    The Jupyter notebook for this video is at github.com/reu..., and my free, weekly newsletter about Python and software engineering is at BetterDevelope....
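
    A minimal sketch of the technique from the video, assuming the CSV files live in a made-up "mydata" directory:

    import glob
    import pandas as pd

    # find every CSV file matching the pattern (the path is a stand-in)
    filenames = glob.glob('mydata/*.csv')

    # read each file into its own data frame, collecting them in a list
    all_dfs = [pd.read_csv(one_filename) for one_filename in filenames]

    # stack the list into a single data frame, renumbering the index
    df = pd.concat(all_dfs, ignore_index=True)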

Comments • 58

  • @sloughpacman
    @sloughpacman 2 years ago +3

    Your material has the habit of provoking thought, which often leads me off on exploratory tangents for a couple of hours. Thanks for wasting my day, Reuven 😀

  • @ManjeeriG
    @ManjeeriG 6 months ago +1

    Thank you so much for this tutorial. I have gone through multiple tutorials, but this one is the easiest and (of course) smartest way to combine multiple CSVs into one. 😊😊 You got a new subscriber.

    • @ReuvenLerner
      @ReuvenLerner  6 months ago +1

      I'm delighted that it helped!

  • @learner7273
    @learner7273 1 year ago

    I never knew you could have a list of dataframes. Thank you!

  • @awaisraza2285
    @awaisraza2285 1 year ago +1

    I have a main folder, and in that main folder there are 35 subfolders with names like S01, S02... S35. Each folder has a dataset with the same structure. How can I concatenate that data into one dataframe?

    • @ReuvenLerner
      @ReuvenLerner  1 year ago

      You can! Look up the "recursive" parameter in glob.glob, and you'll see that you can get all of the files that match patterns across multiple subdirectories.
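
      For example, a sketch using glob's "recursive" option (assuming the subfolders sit under a made-up "main" directory):

      import glob
      import pandas as pd

      # '**' matches any depth of subdirectory when recursive=True,
      # so this picks up the CSVs inside S01, S02, ... S35
      filenames = glob.glob('main/**/*.csv', recursive=True)
      df = pd.concat([pd.read_csv(f) for f in filenames],
                     ignore_index=True)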

  • @mikej62
    @mikej62 1 year ago +1

    Nice!!! I've seen several ways to solve this problem, and this is the most efficient I've personally come across!
    Question for you about In [13]: there will be times I want to use this approach because I want to verify the number of rows in each file... how would you do it? TIA!

    • @ReuvenLerner
      @ReuvenLerner  1 year ago +1

      Glad to hear it helped! If you want to verify the number of rows in a data frame, the fastest/easiest way is to ask for len(df.index). You can have an "if" statement in your "for" loop checking how many rows there are in the data frame, and thus know how many you retrieved.
      I don't believe that there's a way to check the number of rows in a file without reading it into a data frame first, at least with read_csv.
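
      For example, a sketch of such a loop (the "mydata" pattern is a stand-in):

      import glob
      import pandas as pd

      all_dfs = []
      for one_filename in glob.glob('mydata/*.csv'):
          new_df = pd.read_csv(one_filename)
          print(f'{one_filename}: {len(new_df.index)} rows')  # row count per file
          if len(new_df.index) > 0:   # only keep non-empty data frames
              all_dfs.append(new_df)
      df = pd.concat(all_dfs, ignore_index=True)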

    • @mikej62
      @mikej62 1 year ago +1

      ​@@ReuvenLerner Gotcha!
      What I did was add another print statement, print(len(new_df.index)), before "all_dfs.append(new_df)". I wanted to see how you would approach it.
      All for the purpose of documenting what I started with.
      Cheers!!

  • @tomtaetz20
    @tomtaetz20 2 years ago +3

    How would you go about this when the CSV files don't all contain the same column names (some different, some the same)?

    • @ReuvenLerner
      @ReuvenLerner  2 years ago +2

      Ooh, that makes it more interesting!
      If all of the input data frames are guaranteed to contain the columns you want, then you can just select columns with double square brackets + a list of column names in the top line of the comprehension. That is, you can say pd.read_csv(one_filename)[['email', 'phone']].
      But if they have different columns, and/or you have to do transformations, then this technique gets much messier. Perhaps you could call a function in the top line of the comprehension. And the function could find the columns it wants and output them. But it won't be quite as elegant.
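
      In code, that first suggestion might look like this (using the example "email" and "phone" columns, with a stand-in path):

      import glob
      import pandas as pd

      # select just the wanted columns from each file as it's read
      df = pd.concat([pd.read_csv(one_filename)[['email', 'phone']]
                      for one_filename in glob.glob('mydata/*.csv')],
                     ignore_index=True)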

    • @tomtaetz20
      @tomtaetz20 2 years ago +1

      Awesome! Thank you, I will give it a try

    • @anushkajoshi8427
      @anushkajoshi8427 6 months ago

      I was literally looking for this exact question in the comments before posting mine.

  • @regal7548
    @regal7548 5 months ago +1

    What if the datasets don't have anything in common? Like one is geological data, one is survey data, one is market analysis, and each of them has a massive number of null values. Also, the unique IDs are different: for example, one table has SLT20284903 and others have just numbers. What do we do then?

    • @ReuvenLerner
      @ReuvenLerner  5 months ago +1

      Then you shouldn't be combining them, in this way or any other way! My assumption for this video was that you have a data set broken up across a number of CSV files, each with the same column names and dtypes. You want to take these multiple files and turn them into a single data frame. Pandas provides us with the "pd.concat" method, which is good for such things, but the problem is how you read them into Pandas quickly and easily.
      If you have geological data, survey data, and market analysis, then *perhaps* they have some factors in common. But you don't want them in the same data frame. Rather, read each into its own data frame, and use "join" or "merge" to combine them.
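
      A sketch of that merge approach, assuming the hypothetical data frames share a "region" key column (the filenames here are made up):

      import pandas as pd

      geo_df = pd.read_csv('geological.csv')   # hypothetical filenames
      survey_df = pd.read_csv('survey.csv')

      # combine on a shared key column rather than stacking rows
      combined = geo_df.merge(survey_df, on='region', how='outer')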

    • @regal7548
      @regal7548 5 months ago +1

      @@ReuvenLerner ok.. thank you

  • @Merqos
    @Merqos 2 years ago +2

    Thank you so much for this example! This made me save so much time!

    • @ReuvenLerner
      @ReuvenLerner  2 years ago +1

      I'm so happy to hear it!

    • @Merqos
      @Merqos 2 years ago +1

      @@ReuvenLerner yeah, really! I was told to extract data from a vessel where all the data was given in 10-minute-interval Excel documents. That means for a single day I had to open 144 files and collect them into one sheet. I love you and the internet, man, thanks again!

  • @varunsharma3485
    @varunsharma3485 2 years ago +1

    Worked for me. Earlier I was using Excel Power Query to join multiple CSVs and then importing them to pandas, but Excel has a limit of roughly 1 million rows per sheet. This tutorial is very helpful. Thanks, and subscribed. One issue I'm facing with this code: I want to add my individual CSV file names as a column in the concatenated dataframe, which is missing here. Any tips?

    • @ReuvenLerner
      @ReuvenLerner  2 years ago

      I'm delighted to know that the video helped!
      BTW, the maximum number of rows in Excel is (in)famous. It even led to the loss of a bunch of COVID data in the UK, silently dropping a whole batch of inputs.
      Thanks for joining me here, and I'll see you around...
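
      For the filename-as-column question above, one common approach (not shown in the video) is to tag each data frame with its source file as it's read, using assign (stand-in path):

      import glob
      import pandas as pd

      # add each row's source file as a new "filename" column
      df = pd.concat([pd.read_csv(f).assign(filename=f)
                      for f in glob.glob('mydata/*.csv')],
                     ignore_index=True)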

  • @koosschutter1675
    @koosschutter1675 11 months ago +1

    Where can I download some of these large files for testing? I want to split or combine some CSV files, but with my basic laptop I can't load everything in memory, and I need all the columns. ChatGPT and I came up with appending files, which is slow. I'm not a data scientist; I just want to split and combine CSV files and test my and my PC's capabilities.

    • @ReuvenLerner
      @ReuvenLerner  11 months ago +1

      You can download the (large!) files from Pandas Workout from here: files.lerner.co.il/pandas-workout-data.zip

    • @koosschutter1675
      @koosschutter1675 11 months ago +1

      @@ReuvenLerner Thank you very much.

  • @nvduk3
    @nvduk3 1 year ago +1

    If one of the CSV files in between is blank, it breaks the for loop. How do I avoid that?

    • @ReuvenLerner
      @ReuvenLerner  1 year ago +1

      My best suggestion is to skip a data frame with zero rows from the output list. There might well be better solutions, though!

    • @nvduk3
      @nvduk3 1 year ago +1

      @@ReuvenLerner Yes, the video was of immense help. I got the concept but was facing an issue: I'm trying to run it over 2,300 .csv files, and many of them are blank, so the loop can't find the column names it expects and stops there.
      Manually deleting them is time-consuming and kinda stupid lol. Sorry, my Python is intermediate level, but I will try to skip the ones where it can't find a matching column name, as you mentioned. Thanks a lot again!

    • @ReuvenLerner
      @ReuvenLerner  1 year ago +1

      @@nvduk3 Oh, right - if the file is completely blank, then you can't select columns. That is a problem! Unfortunately, that's a tough one to solve. Maybe you could write a function that loads the first 5 lines, and checks to see if there are any columns. If such a function returns True/False, then you can use it in the "if" of your comprehension, and only truly process those that have the columns you want. Yes, you'll end up reading (at least part of) each file twice, but that might still be best. I'm not sure.
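
      A sketch of that helper idea (the column names and path here are stand-ins):

      import glob
      import pandas as pd

      def has_wanted_columns(filename, wanted=('email', 'phone')):
          try:
              sample = pd.read_csv(filename, nrows=5)   # read only the first rows
          except pd.errors.EmptyDataError:              # completely blank file
              return False
          return set(wanted).issubset(sample.columns)

      df = pd.concat([pd.read_csv(f)
                      for f in glob.glob('mydata/*.csv')
                      if has_wanted_columns(f)],
                     ignore_index=True)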

    • @nvduk3
      @nvduk3 1 year ago +1

      @@ReuvenLerner Skipping the column names didn't work, but skipping the blank sheets entirely worked after a few rounds of trial and error. Thanks a lot, I really appreciate it 👍🏽

  • @matthewhsu7299
    @matthewhsu7299 2 years ago +1

    Learn something new today. Thank you so much!

  • @guocity
    @guocity 6 months ago +1

    That's really helpful. What about reading multiple CSV files and, when reading a new CSV file, ignoring repeated rows?

    • @ReuvenLerner
      @ReuvenLerner  6 months ago

      To read multiple CSV files, you need to run read_csv multiple times. And I think that if you want to ignore repeated rows, that's something you have to do after creating the data frame, not during the loading process.
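
      That post-load step might look like this (stand-in path):

      import glob
      import pandas as pd

      df = pd.concat([pd.read_csv(f) for f in glob.glob('mydata/*.csv')],
                     ignore_index=True)
      df = df.drop_duplicates()   # remove rows repeated across files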

  • @silvajonatan
    @silvajonatan 9 months ago +1

    Hi, fantastic information. Very didactic. Thanks a lot.

  • @FletchersHeating
    @FletchersHeating 1 year ago

    Thanks for this video!
    Is there a tutorial or information on how to do the same but for multiple data frames? i.e., one CSV = one dataframe
    Many thanks

    • @ReuvenLerner
      @ReuvenLerner  1 year ago

      Glad it helped! The same code would basically work for what you want, if you don't then concatenate the data frames together at the end. You'll end up with a list of data frames, each from a separate CSV file, with which you can do whatever you want.

  • @33samogo
    @33samogo 1 year ago +1

    Excellent, worked fine for me also, thank you!

  • @traal
    @traal 2 years ago +1

    Would this also work with a generator expression passed into pd.concat? That would look nicer and probably save some memory. 😊

    • @ReuvenLerner
      @ReuvenLerner  2 years ago +1

      It would definitely work. But I'm not sure how much memory it would save, because at the end of the day, we still get all of the data from the three smaller data frames. And the way that behind-the-scenes memory is managed, I'm going to guess (and might be wrong about this) that the usage would be the same. But it would probably be wise to experiment with this; I'll try to find some time to do that, and if I have interesting results, I'll report them on this channel.
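
      For reference, a generator-expression version, since pd.concat accepts any iterable of data frames (stand-in path):

      import glob
      import pandas as pd

      # no intermediate list is built; pd.concat consumes the generator
      df = pd.concat((pd.read_csv(f) for f in glob.glob('mydata/*.csv')),
                     ignore_index=True)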

  • @gabbyf2906
    @gabbyf2906 1 year ago +1

    Great video helped a lot!

  • @israaaljowder9751
    @israaaljowder9751 1 year ago +1

    very informative video, Thanks

  • @KingofDiamonds117
    @KingofDiamonds117 2 years ago +1

    I can't get this to work for me. I need help.

    • @ReuvenLerner
      @ReuvenLerner  2 years ago

      What are you doing, and what error do you get?

    • @KingofDiamonds117
      @KingofDiamonds117 2 years ago

      @@ReuvenLerner I tried doing this:
      import glob
      data = glob.glob['channel1/*.csv']
      and got:
      TypeError Traceback (most recent call last)
      ~\AppData\Local\Temp/ipykernel_13700/502398532.py in
      1 import glob
      ----> 2 data = glob.glob['channel1/*.csv']
      TypeError: 'function' object is not subscriptable

    • @ReuvenLerner
      @ReuvenLerner  2 years ago

      @@KingofDiamonds117 "function object is not subscriptable" means: you're using [] on a function, and you really should be using (). Try that!
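
      That is, the corrected call is:

      import glob
      data = glob.glob('channel1/*.csv')   # () calls the function; [] tried to index it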

    • @KingofDiamonds117
      @KingofDiamonds117 2 years ago +1

      @@ReuvenLerner I'm not getting any errors this time, thanks. I have poor eyesight so it's difficult for me to see properly.

  • @gam3rman85
    @gam3rman85 10 months ago +1

    helpful. thanks!

  • @tedfitzgerald4202
    @tedfitzgerald4202 1 year ago +1

    Really super video.

  • @Tshediso1
    @Tshediso1 2 years ago +1

    Thanks very much

  • @pankajgoikar4158
    @pankajgoikar4158 2 years ago +1

    Just awesome.... :)