How to work with big data files (5gb+) in Python Pandas!

  • Published: 11 Sep 2024

Comments • 50

  • @Hossein118
    @Hossein118 2 years ago +6

    The end of the video was so fascinating to see how that huge amount of data was compressed to such a manageable size.

  • @CaribouDataScience
    @CaribouDataScience 1 year ago +10

    Since you are working with Python, another approach would be to import the data into SQLite db. Then create some aggregate tables and views ...
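
    A minimal sketch of that SQLite route (file, table, and chunk size here are just placeholders for illustration):

      import sqlite3
      import pandas as pd

      conn = sqlite3.connect("events.db")

      # Stream the CSV into SQLite in chunks so it never has to fit in RAM
      for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
          chunk.to_sql("events", conn, if_exists="append", index=False)

      # Build a small aggregate table to analyze instead of the raw rows
      conn.execute("""
          CREATE TABLE IF NOT EXISTS event_counts AS
          SELECT brand, category_code, event_type, COUNT(*) AS count
          FROM events
          GROUP BY brand, category_code, event_type
      """)
      conn.commit()
      conn.close()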

  • @dhssb999
    @dhssb999 2 years ago +10

    Never used chunk in read_csv before, it helps a lot! Great tip, thanks
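
    For anyone new to it, this is roughly what the chunked read looks like (file name and chunk size are arbitrary here):

      import pandas as pd

      # chunksize makes read_csv return an iterator of DataFrames
      # instead of loading the entire file into memory at once
      for chunk in pd.read_csv("events.csv", chunksize=100_000):
          print(chunk.shape)
          break  # inspect just the first chunk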

  • @mjacfardk
    @mjacfardk 2 years ago +4

    During my 3 years in the field of data science, this is the best course I've ever watched.
    Thank you brother, keep going.

  • @fruitfcker5351
    @fruitfcker5351 1 year ago +5

    If (and only if) you want to read just a few columns, specify the columns you want to process from the CSV by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* function. Took about 38 seconds to read on an M1 MacBook Air.
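
    A small sketch of that column-pruning idea (the file name is hypothetical; the columns are the ones mentioned above):

      import pandas as pd

      # Only the listed columns are parsed, which cuts both load time and memory
      df = pd.read_csv(
          "events.csv",
          usecols=["brand", "category_code", "event_type"],
      )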

  • @michaelhaag3367
    @michaelhaag3367 2 years ago +6

    Glad you are back, my man. I am currently in a data science bootcamp and you are way better than some of my teachers ;)

  • @jacktrainer4387
    @jacktrainer4387 2 years ago +2

    No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +2

      Definitely, "big" very much means different things to different people and circumstances.

    • @Nevir202
      @Nevir202 2 years ago

      Ya, I've been trying to process a book in Sheets; processing 100k words, so just a few MB, in the way I'm trying to is already too much lol.

  • @AshishSingh-753
    @AshishSingh-753 2 years ago +2

    Pandas has capabilities I didn't know about - secret Keith knows everything

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +1

      Lol I love the nickname "secret keith". Glad this video was helpful!

  • @JADanso
    @JADanso 2 years ago +1

    Very timely info, thanks Keith!!

  • @abhaytiwari5991
    @abhaytiwari5991 2 years ago +2

    Well-done Keith 👍👍👍

  • @ahmetsenol6104
    @ahmetsenol6104 1 year ago

    It was quick and straight to the point. Very good one, thanks.

  • @lesibasepuru8521
    @lesibasepuru8521 1 year ago +1

    You are a star my man... thank you

  • @andydataguy
    @andydataguy 1 year ago +1

    Great video! Hope you start making more soon

  • @elu1
    @elu1 2 years ago +1

    great short video! nice job and thanks!

  • @firasinuraya7065
    @firasinuraya7065 2 years ago +1

    OMG..this is gold..thank you for sharing

  • @rishigupta2342
    @rishigupta2342 1 year ago +1

    Thanks Keith. Please do more videos on EDA in Python.

  • @agnesmunee9406
    @agnesmunee9406 1 year ago +2

    How would I go about it if it was a JSON Lines (jsonl) data file?
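
    Not covered in the video, but pandas can stream JSON Lines similarly; a possible sketch (file name and chunk size are placeholders):

      import pandas as pd

      # lines=True treats each line as one JSON record;
      # chunksize makes read_json return an iterator of DataFrames
      total_rows = 0
      for chunk in pd.read_json("events.jsonl", lines=True, chunksize=100_000):
          total_rows += len(chunk)  # replace with per-chunk aggregation as needed
      print(total_rows)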

  • @manyes7577
    @manyes7577 2 years ago +2

    I have an error message on this one. It says 'DataFrame' object is not callable. Why is that and how do I solve it? Thanks
    for chunk in df:
        details = chunk[['brand', 'category_code', 'event_type']]
        display(details.head())
        break

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +1

      How did you define "df"? I think that's where your issue lies.
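
      For that loop to work, df has to be the chunked reader returned by read_csv, not a plain DataFrame; something along these lines (file name and chunk size are placeholders):

        import pandas as pd

        # read_csv with chunksize returns an iterator of DataFrames
        df = pd.read_csv("events.csv", chunksize=100_000)

        for chunk in df:
            details = chunk[["brand", "category_code", "event_type"]]
            print(details.head())
            break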

  • @spicytuna08
    @spicytuna08 2 years ago +2

    Thanks for the great lesson! Wondering what the performance difference would be between output = pd.concat([output, summary]) vs output.append(summary)?

  • @oscararmandocisnerosruvalc8503
    @oscararmandocisnerosruvalc8503 1 year ago +1

    Cool videos bro.
    Can you address load and dump for JSON please :)?

  • @machinelearning1822
    @machinelearning1822 1 year ago +1

    I have tried and followed each step, however it gives this error:
    OverflowError: signed integer is greater than maximum

  • @DataAnalystVictoria
    @DataAnalystVictoria 10 months ago

    Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you.
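
    Likely explanation: DataFrame.append was removed in pandas 2.0, so the list-then-concat pattern is the way to go now; a rough sketch (file name, chunk size, and columns are illustrative):

      import pandas as pd

      chunk_summaries = []  # collect per-chunk results, then concatenate once
      for chunk in pd.read_csv("events.csv", chunksize=100_000):
          summary = (
              chunk.groupby(["brand", "category_code", "event_type"])
              .size()
              .reset_index(name="count")
          )
          chunk_summaries.append(summary)

      output = pd.concat(chunk_summaries, ignore_index=True)
      # Re-aggregate so groups that were split across chunks are combined
      output = output.groupby(["brand", "category_code", "event_type"], as_index=False)["count"].sum()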

  • @rokaskarabevicius
    @rokaskarabevicius 3 months ago

    This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating it makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.
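
    One hedged way to handle cross-chunk duplicates, assuming the data has a unique key column (here a hypothetical "event_id") and that the set of keys fits in memory:

      import pandas as pd

      seen_ids = set()
      summaries = []
      for chunk in pd.read_csv("events.csv", chunksize=100_000):
          # Drop duplicates within the chunk, then rows already seen in earlier chunks
          chunk = chunk.drop_duplicates(subset="event_id")
          chunk = chunk[~chunk["event_id"].isin(seen_ids)]
          seen_ids.update(chunk["event_id"])
          summaries.append(
              chunk.groupby(["brand", "category_code", "event_type"])
              .size()
              .reset_index(name="count")
          )

      result = (
          pd.concat(summaries, ignore_index=True)
          .groupby(["brand", "category_code", "event_type"], as_index=False)["count"]
          .sum()
      )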

    • @rodemire
      @rodemire 27 days ago

      What method can we use if there are possible duplicates?

  • @lukaschumchal6676
    @lukaschumchal6676 2 years ago +1

    Thank you for the video, it was really helpful. But I am still a little confused. Do I have to process every big file with chunks because it's necessary, or is it just a quicker way of working with large files?

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +1

      The answer really depends on the amount of RAM that you have on your machine.
      For example, I have 16gb of RAM on my laptop. No matter what, I would never be able to load a 16gb+ file all at once because I don't have enough RAM (memory) to do that. Realistically, my machine is probably using about half the RAM for miscellaneous tasks at all times, so I wouldn't even be able to open an 8gb file all at once. If you are on Windows, you can open up your task manager --> Performance to see details on how much memory is available.
      You could technically open up a file as long as you have enough memory available for it, but performance will decrease as you get closer to your total memory limit. As a result, my general recommendation would be to load in files in chunks basically any time the file is greater than 1-2gb in size.
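
      A rough way to put that heuristic into code, assuming the third-party psutil package is installed (pip install psutil); the file name and threshold are arbitrary:

        import os
        import psutil
        import pandas as pd

        path = "events.csv"
        file_size = os.path.getsize(path)              # size on disk, in bytes
        available = psutil.virtual_memory().available  # RAM free right now

        # The in-memory DataFrame is usually larger than the CSV on disk,
        # so only load everything at once when there is plenty of headroom
        if file_size < available * 0.25:
            df = pd.read_csv(path)
        else:
            reader = pd.read_csv(path, chunksize=100_000)  # iterate chunk by chunk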

    • @lukaschumchal6676
      @lukaschumchal6676 2 years ago

      @@TechTrekbyKeithGalli Thank you very much. I cannot even describe how helpful this is to me :).

  • @konstantinpluzhnikov4862
    @konstantinpluzhnikov4862 2 years ago +1

    Nice video! Working with big files when the hardware is not at its best means there is plenty of time to make a cup of coffee and discuss the latest news...

  • @dicloniusN35
    @dicloniusN35 7 months ago

    But the new file has only 100000 rows, not all the info. Do you ignore the other data?

  • @CS_n00b
    @CS_n00b 11 months ago

    Why not groupby.size() instead of groupby.sum() on the column of 1's?
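
    They should give the same counts; a tiny sketch with made-up data:

      import pandas as pd

      df = pd.DataFrame({
          "brand": ["apple", "apple", "samsung"],
          "event_type": ["view", "view", "cart"],
      })

      # Counting group sizes directly
      by_size = df.groupby(["brand", "event_type"]).size().reset_index(name="count")

      # Summing a helper column of 1's, as done in the video
      df["count"] = 1
      by_sum = df.groupby(["brand", "event_type"], as_index=False)["count"].sum()

      print(by_size)
      print(by_sum)  # same counts either way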

  • @vickkyphogat
    @vickkyphogat 1 year ago

    What about .SAV files?
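
    Not covered in the video, but pandas can read SPSS .sav files through the pyreadstat package; note this loads the whole file at once, so it suits files that fit in memory (file and column names here are hypothetical):

      import pandas as pd

      # Requires: pip install pyreadstat
      df = pd.read_spss("survey.sav", usecols=["brand", "event_type"])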

  • @oscararmandocisnerosruvalc8503

    Why did you use the count there?

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  1 year ago

      If you want to aggregate data (make it smaller), counting the number of occurrences of events is a common way to do that.
      If you are wondering why I added an additional 'count' column and summed it, instead of just doing something like value_counts(), that's just my personal preferred method of doing it. Both can work correctly.
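
      For illustration, the two equivalent approaches mentioned above, on a tiny made-up chunk:

        import pandas as pd

        chunk = pd.DataFrame({
            "brand": ["apple", "apple", "samsung"],
            "event_type": ["view", "cart", "view"],
        })

        # Video's approach: add a 'count' column of 1's, then sum while grouping
        chunk["count"] = 1
        agg = chunk.groupby(["brand", "event_type"], as_index=False)["count"].sum()

        # Equivalent alternative: value_counts on the grouping columns
        vc = chunk.value_counts(["brand", "event_type"]).reset_index(name="count")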

    • @oscararmandocisnerosruvalc8503
      @oscararmandocisnerosruvalc8503 1 year ago

      @@TechTrekbyKeithGalli Thanks a lot for your videos, bro !!!!