This INCREDIBLE trick will speed up your data processes.

  • Published: 10 May 2024
  • In this video we discuss the best ways to save data to files using Python and pandas. When you are working with large datasets, there comes a time when you need to store your data. Most people turn to CSV files because they are easy to share and universally supported, but there are much better options out there! Watch as Rob Mulla, Kaggle grandmaster, discusses some alternative ways of saving data files: pickle, parquet and feather. I run some benchmarks to show that you can save time and space, and keep the important metadata about your files in the process! A short code sketch of these formats is included right after the description below.
    Timeline
    00:00 Intro
    00:49 Creating our Data
    02:08 CSVs
    04:39 Setting dtypes for CSVs
    06:15 Pickle Files
    07:16 Parquet ❤️
    09:07 Feather
    10:31 Other Options
    11:02 Benchmarking
    12:19 Takeaways
    12:43 Outro
    Code Gist: gist.github.com/RobMulla/7384...
    Follow me on twitch for live coding streams: / medallionstallion_
    Other Videos:
    Speed up Pandas: • Make Your Pandas Code ...
    Efficient Pandas Dataframes: • Speed Up Your Pandas D...
    Introduction to Pandas: • A Gentle Introduction ...
    Exploratory Data Analysis Video: • Exploratory Data Analy...
    Audio Data in Python: • Audio Data Processing ...
    Image Data in Python: • Image Processing with ...
    * YouTube: youtube.com/@robmulla?sub_con...
    * Discord: / discord
    * Twitch: / medallionstallion_
    * Twitter: / rob_mulla
    * Kaggle: www.kaggle.com/robikscube
    #python #code #datascience #pandas
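
    A minimal sketch of the four formats covered (not the code from the linked gist; file names are illustrative, and parquet/feather assume pyarrow is installed):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "id": np.arange(1_000_000),
        "value": np.random.rand(1_000_000),
        "category": np.random.choice(["a", "b", "c"], size=1_000_000),
    })
    df["category"] = df["category"].astype("category")   # a dtype worth preserving

    # CSV: universal, but slow, large on disk, and dtypes are lost on round trip
    df.to_csv("data.csv", index=False)
    df_csv = pd.read_csv("data.csv", dtype={"category": "category"})

    # Pickle: keeps dtypes, but Python-only
    df.to_pickle("data.pkl")
    df_pkl = pd.read_pickle("data.pkl")

    # Parquet / feather: fast, compact, keep dtypes, readable outside pandas
    df.to_parquet("data.parquet")
    df_parquet = pd.read_parquet("data.parquet")

    df.to_feather("data.feather")
    df_feather = pd.read_feather("data.feather")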

Comments • 380

  • @miaandgingerthememebunnyme3397
    @miaandgingerthememebunnyme3397 2 года назад +329

    First post! That’s my husband, he knows about data…

    • @LuisRomaUSA
      @LuisRomaUSA 2 года назад +12

      He knows a lot of good stuff about data 😁. He's the first non-introductory Python YouTuber I have found so far 🎉

    • @venvanman
      @venvanman 2 года назад +8

      aww this is cute

    • @sketch1625
      @sketch1625 2 года назад +11

      Guess he's really in a "pickle" now.

    • @foobarAlgorithm
      @foobarAlgorithm Год назад +3

      Awww now you guys need a The DataCouple channel if you both do data science! Love your content

    • @Arpan_Gupta
      @Arpan_Gupta Год назад

      Nice work Mr. ROB

  • @mschuer100
    @mschuer100 Год назад +28

    As always, awesome video...a real eye opener on the most efficient file formats. I have only used pickle with compression, but will now investigate feather and parquet. Thanks for putting this together for all of us.

    • @robmulla
      @robmulla  Год назад +4

      Glad it was helpful! I use parquet all the time now and will never go back.

  • @lashlarue7924
    @lashlarue7924 Год назад +2

    You are my new favorite YouTuber, Sir. I'm learning more from you than anyone else, by a country mile!

  • @alfredoch3811
    @alfredoch3811 Год назад +1

    Rob, you did it again...keep'em coming, good job!

  • @holgerbirne1845
    @holgerbirne1845 2 года назад +34

    Very good video :). One note: pickle files can be compressed. If you compress them, they become much smaller but reading and writing become slower. Overall, parquet and feather are still much better.
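
    A minimal sketch of that trade-off, using pandas' standard to_pickle compression options (sizes and timings will vary with the data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

    df.to_pickle("data.pkl")                         # no compression: biggest file, fastest I/O
    df.to_pickle("data.pkl.gz", compression="gzip")  # smaller, slower to write and read
    df.to_pickle("data.pkl.xz", compression="xz")    # usually smallest, slowest

    df_back = pd.read_pickle("data.pkl.gz")          # compression inferred from the extension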

    • @robmulla
      @robmulla  2 года назад +5

      Good point! There are many ways to save/compress that I probably didn't cover. Thanks for watching the video.

  • @pablodelucchi353
    @pablodelucchi353 Год назад +2

    Thanks Rob, awesome information! Learning a lot from your channel. Keep it up!

    • @robmulla
      @robmulla  Год назад +1

      Isn’t learning fun?! Thanks for watching.

  • @gustavoadolfosanchezhurtad1412
    @gustavoadolfosanchezhurtad1412 Год назад +2

    Very clear and insightful explanation, thanks Rob, keep it up!

    • @robmulla
      @robmulla  Год назад

      Thanks Gustavo. I’ll try my best.

  • @nancyzhang6790
    @nancyzhang6790 Год назад +1

    I saw people mention feather on Kaggle sometimes, but had no clue what they were talking about. Finally, I got answers to many questions in my mind. Thank you!

    • @robmulla
      @robmulla  Год назад

      Yes. Feather and parquet formats are awesome for when you want to quickly read and write data to disk. Glad the video helped you learn!

  • @arielspalter7425
    @arielspalter7425 Год назад +1

    Excellent tutorial Rob. Subscribed!

    • @robmulla
      @robmulla  Год назад

      Thanks so much for the feedback. Thanks for subscribing!

  • @Jvinniec
    @Jvinniec Год назад +26

    One really cool feature of .read_parquet() is that it passes through additional parameters for whichever backend you're using. For example the filters parameter in pyarrow allows you to filter data at read, potentially making it even faster:
    df = pd.read_parquet("myfile.parquet", filters=[('col_name', '<', threshold)])
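
    A self-contained version of that pattern, as a sketch (the column name, cutoff and file name are invented; assumes the pyarrow engine):

    import pandas as pd

    # write a small parquet file to filter against
    df = pd.DataFrame({"col_name": range(10), "other": list("abcdefghij")})
    df.to_parquet("myfile.parquet")

    # pyarrow applies the predicate while reading, so non-matching rows are
    # skipped before they ever reach pandas (with a recent pyarrow this is
    # row-level filtering, not just row-group pruning)
    subset = pd.read_parquet(
        "myfile.parquet",
        engine="pyarrow",
        filters=[("col_name", "<", 5)],
    )
    print(subset)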

    • @robmulla
      @robmulla  Год назад +9

      Whoa. That is really cool. I didn't realize you could do that. I've used athena which allows you to query parquet files using standard SQL and it's really nice.

    • @juanm555
      @juanm555 Год назад +4

      Athena is amazing when backed with parquet files, I've used it in order to be able to read through 600M+ records that were in those parquets easily

    • @incremental_failure
      @incremental_failure 10 месяцев назад +2

      That's the real use case for parquet. Feather doesn't have this.

  • @Banefane
    @Banefane 2 месяца назад

    Very clear, very structured, and the details are intuitive to understand!

  • @FilippoGronchi
    @FilippoGronchi 2 года назад +2

    Excellent as usual Rob...very very useful indeed

  • @rrestituti
    @rrestituti Год назад +1

    Amazing! Got one new member. Thanks, Rob! 😉

    • @robmulla
      @robmulla  Год назад

      Glad you liked it. Thanks for commenting!

  • @walterpark8824
    @walterpark8824 Год назад +4

    Exactly what I needed to know, and to the point. Thanks.
    As Einstein said, 'Everything should be as simple as possible, and no simpler!'

    • @robmulla
      @robmulla  Год назад

      That’s a great quote. Glad you found this helpful.

  • @nascentnaga
    @nascentnaga 2 месяца назад

    as someone moving into datascience this is such a great explainer! thank you

  • @anoopbhagat13
    @anoopbhagat13 2 года назад +1

    learnt something new today. Thank you Rob for this useful & informative video.

    • @robmulla
      @robmulla  2 года назад

      Learn something new every day and before long you will be teaching others!

  • @KirowOnet
    @KirowOnet Год назад +1

    This was the first video from the channel that randomly appeared in my feed. I clicked, I watched - I liked and subscribed :D. This video planted a seed in my mind, and some others inspired me to try. So a few days later I had a playground environment running in Docker. I'm not a data scientist, but the tips and tricks from your videos can be useful for any developer. I used to write code to check some datasets, but with pandas and Jupyter notebooks it's way faster. Thank you for sharing your experience!

    • @robmulla
      @robmulla  Год назад

      Wow, I really appreciate this feedback. Glad you found it helpful and got some code working yourself. Share with friends and keep an eye out for new videos dropping soon!

  • @marcosoliveira8731
    @marcosoliveira8731 Год назад +2

    I've learned a great deal with this video. Thank you!

    • @robmulla
      @robmulla  Год назад

      Thanks so much for the feedback. Glad you learned from it!

  • @bendirval3612
    @bendirval3612 Год назад +21

    A major design objective of feather is to be readable from R. If you are doing pandas-type data science stuff, this is a significant advantage.

    • @robmulla
      @robmulla  Год назад +8

      Great point. The R package called "arrow" can read in both parquet and feather files.

  • @cristianmendozamaldonado3241
    @cristianmendozamaldonado3241 Год назад +1

    I really love it man, thank you. You saved a life

    • @robmulla
      @robmulla  Год назад

      Thanks! Maybe not saved a life, but saved a few minutes of compute time!

  • @rafaelnegreiros_analyst
    @rafaelnegreiros_analyst Год назад

    Amazing.
    Congrats for the video

    • @robmulla
      @robmulla  Год назад

      Glad you like the video. Thanks for watching.

  • @chrisogonas
    @chrisogonas Год назад +1

    Great stuff! Thanks for sharing.

  • @MrWyYu
    @MrWyYu Год назад +1

    Great summary of data types. Thanks

    • @robmulla
      @robmulla  Год назад +1

      Thanks for the feedback! Glad you found it helpful.

  • @javiercmh
    @javiercmh Год назад +1

    Very engaging and clear. Thanks!

    • @robmulla
      @robmulla  Год назад +1

      Thanks for watching. 🙌

  • @SergioBerlottoJr
    @SergioBerlottoJr 2 года назад +1

    Awesome information! Thank you for this.

    • @robmulla
      @robmulla  2 года назад

      Glad you liked it!

  • @olucasharp
    @olucasharp Год назад +1

    Huge thanks for sharing 🍀

    • @robmulla
      @robmulla  Год назад

      Glad you liked it! Thanks for the comment.

  • @niflungv1098
    @niflungv1098 Год назад +4

    This is good to know. I'm going into web development now, so I usually use JSON format for serialization... I'm still new to python so I didn't know about parquet and feather. Thank you!

    • @robmulla
      @robmulla  Год назад

      Glad you found it helpful. Share it with anyone else you think would benefit!

  • @safsaf2k
    @safsaf2k Год назад

    This is excellent, thank you man

  • @arpanpatel9191
    @arpanpatel9191 Год назад +2

    Great video!! Small things matter the most. Thanks

  • @user-hy1lm2rd9q
    @user-hy1lm2rd9q 9 месяцев назад

    really good video! thank you

  • @reasonableguy6706
    @reasonableguy6706 Год назад +17

    Rob, You're a natural communicator (or you worked really hard at acquiring that skill) - most effective. I follow you on twitch and I'm currently going through your youtube content to come up to speed. Thanks for sharing your time and experience. Have you thought about aggregating your content into a book as a companion to your content - something like "Data Analysis Using Python/Pandas - No BS, Just Good Stuff" ?

    • @robmulla
      @robmulla  Год назад +6

      Hey. Thanks for the kind words. I’ve never considered myself a naturally good communicator and it’s a skill I’m still working on, but I appreciate your positive feedback. The book idea is great, maybe sometime in the future….

  • @danieleingredy6108
    @danieleingredy6108 Год назад +1

    This blew my mind, duuude

    • @robmulla
      @robmulla  Год назад

      Happy to hear that! Share with others so their minds can be blown too!

  • @CalSticks
    @CalSticks Год назад

    Really useful video - thanks.
    I was just searching for some Pandas videos for some light upskilling on the weekend, so this was a great find.

    • @robmulla
      @robmulla  Год назад

      Glad I could help! Check out my other videos on pandas too if you liked this one.

  • @humbertoluzoliveira
    @humbertoluzoliveira Год назад +1

    Hey Guy, nice job. Congratulations! Thanks for video.

    • @robmulla
      @robmulla  Год назад

      Thanks for watching Humberto.

  • @JohnMitchellCalif
    @JohnMitchellCalif Год назад +1

    super clear and useful! Subscribed

  • @baharehbehrooziasl9517
    @baharehbehrooziasl9517 8 месяцев назад +1

    Great! Thank you for this very helpful video.

    • @robmulla
      @robmulla  8 месяцев назад +1

      Glad it was helpful!

  • @yogiananta9674
    @yogiananta9674 Год назад

    awesome ! thank you for this tutorial

    • @robmulla
      @robmulla  Год назад

      You're very welcome! Share with a friend.

  • @user-qe7uw4ry7q
    @user-qe7uw4ry7q 6 месяцев назад

    Hi Rob. I'm from Argentina, you are the best!!!

  • @againstthegrain5914
    @againstthegrain5914 Год назад +2

    Hey this was very useful to me thank you for sharing!!

    • @robmulla
      @robmulla  Год назад

      So glad you found it useful.

  • @69k_gold
    @69k_gold 2 месяца назад

    I looked this up, and it's a pretty cool format. I kind of guessed that it could be a column-based storage strategy when you said that we can efficiently read only select columns, and after I looked it up and found it to be true, it felt very exciting.
    Anyways, hats off to Google's engineers for thinking outside the box on this; the number of things we can do just by storing data column-wise rather than row-wise is huge. Of course, the trade-off is that it's very expensive to modify column-wise data, so this is more useful for static datasets that require multi-dimensional analysis.
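
    A small sketch of what that columnar layout buys you in pandas (file and column names are illustrative):

    import pandas as pd

    # Parquet is columnar, so only the requested columns are read from disk
    df = pd.read_parquet("big_file.parquet", columns=["user_id", "timestamp"])

    # The CSV equivalent still has to scan every full row before dropping columns
    df_csv = pd.read_csv("big_file.csv", usecols=["user_id", "timestamp"])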

  • @truthgaming2296
    @truthgaming2296 4 месяца назад

    thanks rob, it helps a lot for a beginner like me to realize there are weaknesses in the csv format 😉

  • @casey7411
    @casey7411 Год назад +1

    Very informative video! Subscribed :)

    • @robmulla
      @robmulla  Год назад +1

      Glad it helped! 🙏

  • @danilzubarev2952
    @danilzubarev2952 3 месяца назад

    Lol this video changed my life :D Thank you so much.

  • @ozymet
    @ozymet Год назад +1

    Very good stuff. The essence of information.

    • @robmulla
      @robmulla  Год назад

      Glad you liked it!

    • @ozymet
      @ozymet Год назад

      @@robmulla I saw a few more videos, insta sub. Thank you. Glad I found you.

  • @JoeMcMullin
    @JoeMcMullin Месяц назад

    Great video and content.

  • @DainiusKirsnauskas
    @DainiusKirsnauskas 17 дней назад

    Man, I thought this video was clickbait, but it was awesome. Thank you!

  • @DAN_1992
    @DAN_1992 Год назад +2

    Thanks a lot, just brought down my database backup size to MBs.

    • @robmulla
      @robmulla  Год назад +1

      Glad it helped. That’s a huge improvement!

  • @huuquannguyen6688
    @huuquannguyen6688 2 года назад +1

    I really hope you make a video about Data Cleaning in Python soon. Thanks a lot for all your awesome tutorials

    • @robmulla
      @robmulla  2 года назад

      I'll try my best. Thanks for the feedback!

  • @vigneshwarselva9276
    @vigneshwarselva9276 Год назад +1

    Was very useful, thanks much

    • @robmulla
      @robmulla  Год назад

      Thanks! Glad you learned something new.

  • @Patrick-hl1wp
    @Patrick-hl1wp Год назад

    super awesome tricks, thank you

    • @robmulla
      @robmulla  Год назад

      Glad you like them! Thanks for watching.

  • @predstavitel
    @predstavitel Год назад +1

    It's useful for me, thanks a lot!

  • @hugoy1184
    @hugoy1184 Год назад +1

    Thank u very much for sharing such useful skills! 😉Subscribed!

    • @robmulla
      @robmulla  Год назад

      Anytime! Glad you liked it.

  • @gsm7490
    @gsm7490 17 дней назад

    Parquet really saved me )
    Around one year of data; each day is approx. 2 GB (csv format). Parquet is both compact and fast.
    But I have to use filtering and load only the necessary columns “on demand”.

  • @pawarasiriwardhane3260
    @pawarasiriwardhane3260 Год назад +1

    This content is really awesome

  • @steven7639
    @steven7639 Год назад +1

    Fantastic video

    • @robmulla
      @robmulla  Год назад

      Fantastic comment. 😎

  • @krishnapullak
    @krishnapullak Год назад +1

    Good tips on speeding up large file read and write

    • @robmulla
      @robmulla  Год назад

      Glad you liked it! Thanks for the feedback.

  • @Extremesarova
    @Extremesarova 2 года назад +16

    Informative video! I've heard about feather and pickle, but never used them. I think I should give feather and parquet a try!
    I'd like to get some materials on machine learning and data science that are not introductory - something for middle and senior engineers :)

    • @robmulla
      @robmulla  2 года назад +3

      Glad you found it useful. I’ll try to make some more ML videos in the near future.

  • @ChrisHalden007
    @ChrisHalden007 Год назад +1

    Great video. Thanks

  • @Schmelon
    @Schmelon Год назад +1

    interesting to learn the existence of parquet and feather files. nothing beats csv for portability and ease of use

    • @robmulla
      @robmulla  Год назад

      Yea, for small/medium files CSV gets the job done.

  • @emjizone
    @emjizone 4 месяца назад

    Useful. Thanks.

  • @MarcBenkert001
    @MarcBenkert001 Год назад +10

    Thanks, great comparison. One thing about Parquet - it has some limitations on which characters column names can contain. I spent quite some time renaming column names a year ago - perhaps that has been fixed by now.

    • @robmulla
      @robmulla  Год назад +3

      Good point! I've noticed this too. Definitely a limitation that makes it sometimes unusable. Thanks for watching!

  • @Zoltag00
    @Zoltag00 Год назад +7

    Great video - It would have been good to at least mention the downsides to pickle and also the built in compatibility with zip files. Haven't come across feather before, will try it out

    • @robmulla
      @robmulla  Год назад +5

      Great point! I did forget to mention that pandas will auto-unzip. I still like parquet the best.

    • @Zoltag00
      @Zoltag00 Год назад +3

      @@robmulla - Agreed, parquet has some serious benefits
      You know it also supports a compression option? Use it with gzip to see your parquet file get even smaller (and you only need to use it on write)

  • @wonderland860
    @wonderland860 Год назад +2

    This video greatly helped me. I didn't know there were so many ways to dump a DataFrame. I then did a further test and found that the compression option plays a big role:
    df.to_pickle(FILE_NAME, compression='xz') -> 288M
    df.to_pickle(FILE_NAME, compression='bz2') -> 322M
    df.to_pickle(FILE_NAME, compression='gzip') -> 346M
    df.to_pickle(FILE_NAME, compression='zip') -> 348M
    df.to_pickle(FILE_NAME, compression='infer') -> 679M # default compression
    df.to_parquet(FILE_NAME, compression='brotli') -> 334M
    df.to_parquet(FILE_NAME, compression='gzip') -> 355M
    df.to_parquet(FILE_NAME, compression='snappy') -> 423M # default compression
    df.to_feather(FILE_NAME) -> 500M

    • @robmulla
      @robmulla  Год назад

      Nice findings! Thanks for sharing. Funny that compressing parquet still works. I didn't know that.

    • @DeathorGloryNow
      @DeathorGloryNow Год назад

      @@robmulla Actually if you check the docs parquet files are snappy compressed by default. You have to explicitly say `compression=None` to not compress it.
      Snappy is the default because it adds very little time to read/write with modest compression and low CPU usage while still maintaining the very nice columnar properties (as you showed in the video). It is also the default for Spark.
      Other compressions like gzip get it smaller but at a much more significant cost to speed. I'm not sure this is still the case but in the past they also broke some of the nice properties because it is compressing the entire object.

  • @user-ld5dn3fv4m
    @user-ld5dn3fv4m Год назад +1

    Nice video. I'm going to rewrite my storage to use parquet.

    • @robmulla
      @robmulla  Год назад

      You should! Parquet is awesome.

  • @MatthiasBussonnier
    @MatthiasBussonnier 2 года назад +3

    On the first pass, when you timeit the CSV writing, you time both writing the CSV and generating the dataset, so you likely get biased results since you only time the writing for the other formats. (Sure, it does not change the final message, just wanted to point it out.)
    Also, with timeit you can use the -o flag to output the result to a variable, which can help you, for example, make a plot of the times.
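
    A sketch of both points in a notebook cell (IPython magics, so it only runs in Jupyter/IPython; file names are illustrative):

    import numpy as np
    import pandas as pd

    # build the DataFrame once, outside the timed statement
    df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

    # -o returns a TimeitResult object as well as printing the summary
    res_csv = %timeit -o df.to_csv("test.csv")
    res_parquet = %timeit -o df.to_parquet("test.parquet")

    print(res_csv.average, res_parquet.average)   # seconds per loop, handy for plotting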

    • @robmulla
      @robmulla  2 года назад

      Good point about timing the dataframe generation. It should be negligible but fair to note. Also great tip on using -o, I didn't know about that! It looks like from the docs it writes the entire stdout, so it would need to be parsed. ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit Still a handy tip. Thanks!

  • @EVL624
    @EVL624 Год назад +1

    Very good and informative video

    • @robmulla
      @robmulla  Год назад

      So nice of you. Thanks for the feedback.

  • @DevenMistry367
    @DevenMistry367 Год назад +2

    Hey Rob, this was a really nice video! Can you please make a tutorial where you try to write this data to a database? Maybe sqlite or postgres? And explain bottlenecks? (Optional: with or without using an ORM).

    • @robmulla
      @robmulla  Год назад +1

      I was actually working on just this type of video and even looking at stuff like duckdb where you can write SQL on parquet files.
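
      For anyone curious, a rough sketch of the duckdb-on-parquet pattern (file and column names are illustrative; requires the duckdb package):

      import duckdb

      # DuckDB queries the parquet file in place; nothing is loaded up front
      con = duckdb.connect()
      result = con.execute(
          "SELECT category, COUNT(*) AS n FROM 'data.parquet' GROUP BY category"
      ).df()   # .df() returns a pandas DataFrame
      print(result)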

  • @yosefasefaw4207
    @yosefasefaw4207 Год назад +1

    thanks very helpful

  • @crazymunna2397
    @crazymunna2397 Год назад +1

    amazing info

  • @dist321
    @dist321 Год назад +2

    Great videos! Thank you for posting them. I wonder if feather is faster to read a >2G file.tsv than csv in chunks.

    • @robmulla
      @robmulla  Год назад

      Thanks for watching Ondina! I think it would depend on the data types within the >2G file. I think the only difference between tsv and csv is a comma ',' vs tab '\t' separator between values. Hope that helps.

  • @Levince36
    @Levince36 Год назад +1

    Thank you very much 😂, I got something totally new to me.

  • @sangrampattnaik744
    @sangrampattnaik744 8 месяцев назад +1

    Very nice explanation. Can you compare Dask and PySpark ?

  • @meme_eternity
    @meme_eternity Год назад

    Great Video!!!!!!!!!!!

  • @nirbhay_raghav
    @nirbhay_raghav Год назад +1

    Another awesome video. It has become my favorite channel. Only regret is that I found it too late.
    Small correction: it should be 0.3s and 0.08s for parquet files. You mistakenly wrote 0.3ms and 0.08ms while converting.
    Thanks.

    • @robmulla
      @robmulla  Год назад

      I appreciate that you're finding my videos helpful. Good catch on finding that typo!

    • @Jay-og6nj
      @Jay-og6nj Год назад

      I was going to comment that, but decided to check first - at least it got caught. Good video.

  • @riptorforever2
    @riptorforever2 Год назад +1

    Thanks!

    • @robmulla
      @robmulla  Год назад

      Thanks for watching!

  • @getolvid5468
    @getolvid5468 8 месяцев назад

    Great comparison, thanks. Not sure if the feather/pickle files I'm creating from a Julia script use any compression - none that I'm specifying out of the box - but it happens that the pickle files always end up being about half the size of the feather ones.
    (Haven't compared those two to a parquet file.)

  • @FranciscoPMatosJr
    @FranciscoPMatosJr Год назад +2

    Experiment with adding "Brotli" compression when creating the file. The file size is reduced considerably and reads are a lot faster.
    Example:
    To save the file:
    from pyarrow import csv, parquet
    parse_options = csv.ParseOptions(delimiter=delimiter)
    data_arrow = csv.read_csv(temp_file, parse_options=parse_options, read_options=csv.ReadOptions(autogenerate_column_names=autogenerate_column_names, encoding=encoding))
    parquet.write_table(data_arrow, parquet_file + '.brotli', compression='BROTLI')
    To read the file: pd.read_parquet(file, engine='pyarrow')

    • @robmulla
      @robmulla  Год назад

      Oh. Very cool I need to check that out.

  • @abhisekrana903
    @abhisekrana903 Месяц назад +1

    Stumbled onto this awesome video and absolutely loved it. Just out of curiosity - what are you using to theme your Jupyter notebook, especially the dark theme?

    • @robmulla
      @robmulla  Месяц назад

      Glad you enjoyed the video. I have a different video that covers my jupyter setup including theme: ruclips.net/video/5pf0_bpNbkw/видео.html

  • @yoyokoko5153
    @yoyokoko5153 2 года назад +1

    great stuff

  • @YeWangRDFZ
    @YeWangRDFZ Год назад +2

    I'm really interested in the comparison against hdf file. My guess is that it's gonna be the fastest to read, however it prolly takes up more space.

    • @robmulla
      @robmulla  Год назад

      I’m not sure. But I think feather files are pretty fast.

    • @YeWangRDFZ
      @YeWangRDFZ Год назад

      @@robmulla Hey Rob, thanks for the reply. I had the impression that hdf maps the data as it sits in RAM, so there won't be much conversion once it's read into RAM, but I could be wrong. Also it would be interesting to investigate how feather works. I'll do some benchmarking on my M1 Mac and maybe get back to you.

  • @melanp4698
    @melanp4698 Год назад +1

    12:28 "When your data set gets very large." - Me working with 800GB json files: :)
    Good video regardless, i might give them a test sometime.

    • @robmulla
      @robmulla  Год назад

      Haha. It’s all relative. When your data can’t fit in local ram you need to start using things like spark.

  • @alexanderf1795
    @alexanderf1795 Год назад +2

    Cool. Would be nice to compare with storing data to an sql base (Postgres for example).

    • @robmulla
      @robmulla  Год назад

      Great suggestion! This video only covers storing to flat files, but comparison of different relational databases is a great idea for a future video.

  • @manigowdas7781
    @manigowdas7781 Год назад +1

    Just wow!!!!

  • @mr_easy
    @mr_easy 2 месяца назад +1

    great comparison. What about the HDF5 format? Is it in any way better?

  • @rdubitsk
    @rdubitsk Год назад +1

    Fantastic video as always. What are the disadvantages of json? I use json because it can easily be passed to the front end.

    • @robmulla
      @robmulla  Год назад

      Great question. I don’t use json much. It isn’t common for tabular/relational data; it's more for unstructured web-based stuff, I believe. It's probably pretty slow to read/write for large datasets, I’m guessing.

  • @TzviKD
    @TzviKD Год назад +1

    Thank you

  • @jonathanhody3622
    @jonathanhody3622 Год назад +3

    Thank you for the video! I've basically never heard of parquet or feather and don't really know what type of files those are. I assume it's not an easy format to share with stakeholders, for example. Is there a way to link those types of files to a database, or perhaps import them into a data visualization tool (such as PowerBI or Tableau)?

    • @robmulla
      @robmulla  Год назад +1

      Thanks for watching Jonathan. Glad you found the video useful. You are correct, these file formats are more common for storage within systems that read the data via code, not for sharing with stakeholders. CSV and Excel still dominate for that type of thing.

  • @sabagx
    @sabagx Год назад +1

    keep uploading videos please!!

    • @robmulla
      @robmulla  Год назад

      Thanks Sbg! I'm planning on it!

  • @baharehbehrooziasl9517
    @baharehbehrooziasl9517 8 месяцев назад

    When we create a parquet dataset, can we dummycode the columns?

  • @lorenzowinchler1743
    @lorenzowinchler1743 Год назад +1

    Nice video! Thank you. What about hdf5 format? Thanks!

    • @robmulla
      @robmulla  Год назад

      Thanks! I haven’t used Hdf5 much but I’d be interested to hear how it compares.

  • @leonjbr
    @leonjbr Год назад +3

    Hi Rob! I love your channel. It is very helpful. I would like to ask you a question: is HDF5 any better than all the options you showed in the video?

    • @robmulla
      @robmulla  Год назад

      Good question. I didn't cover it because I thought it's an older, lesser used format.

    • @leonjbr
      @leonjbr Год назад +1

      @@robmulla so the answer is no?

    • @robmulla
      @robmulla  Год назад +1

      @@leonjbr The answer is - I don't know but probably not. 😁

    • @leonjbr
      @leonjbr Год назад

      @@robmulla ok thanks.

    • @CoolerQ
      @CoolerQ Год назад +1

      I don't know about "better" but HDF5 is a very popular data format in science.

  • @pele512
    @pele512 Год назад +5

    Thanks for the great benchmark. In R / Python hybrid environment I sometimes use `csv.gz` or `tsv.gz` to address the size issue with CSV but retain the ability to quickly pipe these through line based processors. It would be interesting to see how gzipped flat files perform. I do agree that parquet/feather is a better way to go for many reasons, they are superior especially from the data engineering point of view.
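
    A quick sketch of that csv.gz round trip in pandas, where compression is inferred from the file extension (file name is illustrative):

    import pandas as pd

    df = pd.DataFrame({"a": range(1_000), "b": ["x"] * 1_000})

    # pandas infers gzip from the .gz suffix on both write and read
    df.to_csv("data.csv.gz", index=False)
    df_back = pd.read_csv("data.csv.gz")

    # the payload is still plain CSV, so line-based tools (zcat, gzip -dc | head) keep working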

    • @robmulla
      @robmulla  Год назад +2

      I do the same with gzipped CSV files. Good idea about making a comparison. I’ll add it to the list of potential future videos.

  • @scottybridwell
    @scottybridwell Год назад +3

    Nice video. How does the performance and storage size of parquet, feather compare to hdf/pytables?

    • @robmulla
      @robmulla  Год назад +1

      Great question. I have no idea! I need to learn more about how they compare.

  • @paarthurnax4561
    @paarthurnax4561 Год назад +2

    Parquet is genuinely a revolution...
    The best data format: it drastically reduces storage size and drastically speeds up loading the data back later.

    • @robmulla
      @robmulla  Год назад +1

      I agree. Parquet is great!

  • @alemao182
    @alemao182 2 года назад +2

    Hey! Thanks a lot for the video!
    I'm having some issues with the .parquet format in JupyterLab. It's shifting the index back by 1 hour.
    I have an index (datetime format) that starts each day at 10am and ends at 4pm. When I read it as .csv, or as .parquet in Colab, it works fine as it should.
    But when I pd.read_parquet it in JupyterLab, it changes the original index to start at 9am and end at 3pm (basically -1 hour).
    Do you have any idea why this is happening?
    Thanks!

    • @robmulla
      @robmulla  2 года назад +1

      Thanks for watching. I don’t know what the problem could be. Could it be a difference in the time zone set in your jupyterlab instance of python?

    • @alemao182
      @alemao182 2 года назад

      @@robmulla Oh, that could be it, since I'm from Brazil and Colab is probably executed in the US. I didn't know that was a thing and could change the index of my data. I'll look into it. Thanks for the help!
      edit: That's not the case (I think).
      Both indexes (colab and jupyter) have tz='America/Sao_Paulo'
      .index[0] in colab >Timestamp('2021-01-04 10:00:00-0200', tz='America/Sao_Paulo')
      .index[0] in jupyter >Timestamp('2021-01-04 09:00:00-0300', tz='America/Sao_Paulo')
      Somehow jupyter is still showing one hour less, even with the same timezone. I think it's better to convert the index in colab into a string, then read it in jupyter and convert it back into datetime.

  • @vladvol855
    @vladvol855 11 месяцев назад

    Hello! Very interesting! Thank you! Can you please tell me if there is any limitation on the number of columns when saving a DF to parquet? Excel allows around 16-17k columns. Thank you for the answer!

  • @codewithvishal91
    @codewithvishal91 Год назад

    Very nice bro

    • @robmulla
      @robmulla  Год назад +1

      Thanks. Hope you learned something!

  • @riessm
    @riessm Год назад +1

    In addition to everything else, parquet is the native file format for Spark and fully supports Spark's lazy computation (Spark will only ever read the columns and rows that are needed for the desired output). If you ever prep really big data for Spark, parquet is the way to go.
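
    A rough sketch of what that looks like on the Spark side (assumes a local pyspark installation; path and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    # Nothing is read yet: Spark only records the plan
    df = spark.read.parquet("events.parquet")
    daily = df.filter(F.col("status") == "ok").groupBy("day").count()

    # Only now does Spark touch the file, and only the columns the plan needs
    daily.show()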

    • @robmulla
      @robmulla  Год назад +1

      That’s a great point. Same with polars!

    • @riessm
      @riessm Год назад

      @@robmulla Need to have a closer look at polars then! 🙂

  • @johnd0e
    @johnd0e Год назад +3

    Really awesome video and always well explained. One naive question - a problem I am facing: can you share with the community how you would approach writing a, let's say, 50 MB structured file (CSV) into a remote SQL database in a fast manner? I am asking since my approach takes too much time and asking Stack Overflow was not a big help. Having a video on this would be awesome :)

    • @robmulla
      @robmulla  Год назад +3

      Thanks for watching and asking the question. Working with databases in Python can be its own art and depends on the database type. 50 MB isn't too big, but it also depends on indexing and existing data in the database. You can use pandas to_sql, but if that's too slow you may need to try sqlalchemy.
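
      A minimal sketch of the pandas route (connection string, file and table names are placeholders; assumes sqlalchemy and a Postgres driver such as psycopg2 are installed):

      import pandas as pd
      from sqlalchemy import create_engine

      engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")

      df = pd.read_csv("my_file.csv")
      df.to_sql(
          "my_table",
          engine,
          if_exists="append",
          index=False,
          chunksize=10_000,   # send rows in batches
          method="multi",     # multi-row INSERT per batch instead of row-by-row
      )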

    • @prathameshjoshi007
      @prathameshjoshi007 Год назад

      SQL*Loader for Oracle. Disable constraints, indexes and triggers before, then enable them again later. It even works for millions of rows / 10s of GBs of data. Or external tables, if you've got server access.

    • @alexaneals8194
      @alexaneals8194 Год назад

      I would use the native bulk data importers for the database in question. They are optimized to import GBs, even terabytes, of data. Just remember to disable the constraints, indexes and triggers before the import and then re-enable them. I know SQL Server, Sybase and Oracle all have bulk data loaders. I imagine DB2 and the other big databases out there also have similar functionality. You can script the disabling and re-enabling of the constraints, indexes, etc. You can also set up the database to trigger the import when a file is dropped in a certain location, or run it on a schedule.

  • @Nite_coder
    @Nite_coder Год назад +1

    First time watching, insta subbed, great stuff. Could you do an add-on to this video and go over some fast ways to read/write to a database, SQL/Mongo?

    • @robmulla
      @robmulla  Год назад

      Happy to have you as a new sub. Yes, I really want to make a video about databases and how to store/read/write with python to them.