Do these Pandas Alternatives actually work?

  • Published: 22 May 2024
  • In this video we benchmark several Python alternatives to pandas, measuring their speed on a large dataset. We look at four libraries: Dask, Modin, Ray and Vaex. Pandas is a very popular library among data scientists who code in Python, and other libraries claim to be faster. We put them to the test to see which is the fastest!
    Timeline:
    00:00 Intro
    00:30 Setup
    03:05 Pandas
    05:54 Ray
    10:24 Dask
    13:30 Modin
    15:45 Vaex
    18:45 Summary
    Follow me on twitch for live coding streams: / medallionstallion_
    My other videos:
    Speed Up Your Pandas Code: • Make Your Pandas Code ...
    Speed up Pandas Code: • Make Your Pandas Code ...
    Intro to Pandas video: • A Gentle Introduction ...
    Exploratory Data Analysis Video: • Exploratory Data Analy...
    Working with Audio data in Python: • Audio Data Processing ...
    Efficient Pandas Dataframes: • Speed Up Your Pandas D...
    * YouTube: youtube.com/@robmulla?sub_con...
    * Discord: / discord
    * Twitch: / medallionstallion_
    * Twitter: / rob_mulla
    * Kaggle: www.kaggle.com/robikscube
    #python #pandas #datascience #dataengineering
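A minimal sketch of the kind of timing comparison described above. The synthetic data, the column names (`user`, `value`) and the two timed operations are assumptions for illustration; they are not the dataset or the operations used in the video.

```python
import time

import numpy as np
import pandas as pd


def best_time(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def make_frame(n_rows=100_000, seed=0):
    """Build a synthetic frame (a stand-in; the real dataset is not reproduced here)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "user": rng.integers(0, 1_000, size=n_rows),
            "value": rng.random(n_rows),
        }
    )


df = make_frame()

# Time each operation separately so the results can be compared per task.
results = {
    "groupby_mean": best_time(lambda: df.groupby("user")["value"].mean()),
    "sort_values": best_time(lambda: df.sort_values("value")),
}

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the best of several runs reduces noise from caching and background load. To compare another library, the same `best_time` helper can wrap the equivalent call in that library.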

Comments • 86

  • @robmulla
    @robmulla  1 year ago +5

    If you enjoyed this video you should also check out my video about Polars, a pandas alternative that I didn't cover in this video: ruclips.net/video/VHqn7ufiilE/видео.html&feature=shares

  • @joaomurilopalonefauvel942
    @joaomurilopalonefauvel942 1 year ago +29

    I wonder how polars performs. It seems like the fastest pandas alternative from my research.

    • @robmulla
      @robmulla  1 year ago +13

      Thanks for mentioning polars! I haven’t heard of it before but just read the GitHub page and it looks promising. Maybe next video I’ll cover it!

    • @robmulla
      @robmulla  1 year ago +2

      @Charles I made a video about polars! Check it out here! ruclips.net/video/VHqn7ufiilE/видео.html&feature=shares

  • @gingerjiang666
    @gingerjiang666 1 year ago +2

    Great video. Thank you very much. Quick question though: how did you change your JupyterLab theme? It looks so great.

  • @N147185
    @N147185 1 year ago +8

    There is an important subtlety that is being missed about Vaex: it lazily reads the parquet file each time you do an operation. That means you add the time to read (stream) the data to the time it takes to actually do the math. That makes it all the more impressive (to me anyways). Other libraries read and hold the data in memory, so it is ready to be used. Vaex is more "memory safe", which is especially useful if you work with datasets that are much larger than RAM.
    Anyway, very nice video - keep up the good work!

    • @robmulla
      @robmulla  1 year ago

      Great point. I didn’t know much about Vaex going into this. Interesting that it’s memory safe.

  • @CedricDeBoom
    @CedricDeBoom 1 year ago +7

    Would have liked more focus on memory and cpu usage. Especially in contexts with big datasets but limited resources, this is crucial, and it would have been nice to see and compare the effects of lazy evaluation here.

    • @glitchaddict99
      @glitchaddict99 1 month ago

      yeah this barely touched on the real reason I use dask, to do out of core data operations when I can’t use pandas anymore

  • @zhenliu6596
    @zhenliu6596 1 year ago +3

    Just want to say thank you for saving us the time.

    • @robmulla
      @robmulla  1 year ago

      I’m happy to! Thanks for watching 😀

  • @DarthJarJar10
    @DarthJarJar10 1 year ago +4

    Was a tad surprised you didn't explore a Dask Delayed object, or try out some of Dask's concurrency features but love your videos! Would have loved to have seen Polars in the mix. For the missing cumsum in Vaex, I want to check if a reduce plus lambda combo would not have worked...
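For reference, a cumulative sum can be written as a fold with `reduce` and a lambda. The sketch below works on plain Python sequences only; whether the same idea can be applied to a Vaex column is not established here, and the function names are made up for illustration.

```python
from functools import reduce
from itertools import accumulate


def cumsum_reduce(values):
    """Running total as a fold: each step appends (previous total + x)."""
    return reduce(
        lambda acc, x: acc + [(acc[-1] if acc else 0) + x],
        values,
        [],
    )


def cumsum_accumulate(values):
    """Same result using the standard-library helper built for this purpose."""
    return list(accumulate(values))


print(cumsum_reduce([1, 2, 3, 4]))      # [1, 3, 6, 10]
print(cumsum_accumulate([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

The `reduce` version copies the accumulator list at every step, so it is quadratic in the input length; `itertools.accumulate` is the more practical choice for anything large.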

    • @robmulla
      @robmulla  1 year ago +3

      Thanks for the feedback. I realize that this video only scratches the surface of what’s possible with each library. I felt like it wouldn’t be fair to use more than just the base API, but you make a fair point. Maybe I need to make a follow-up video.

    • @DarthJarJar10
      @DarthJarJar10 1 year ago +2

      @@robmulla, it was a pleasure! Your videos are great regardless!
      I'm immersing myself in this stuff bit by bit but having your content whilst working from home has been amazing!

    • @bennri
      @bennri 1 year ago

      Doesn't dask.dataframe run on multiple cores concurrently by default?

    • @bennri
      @bennri 1 year ago

      @Javis_Lumu Doesn't dask.dataframe run on multiple cores concurrently by default?

    • @DarthJarJar10
      @DarthJarJar10 1 year ago

      @@bennri, I speak under correction. It may be that dask.dataframe runs concurrently by default, but my understanding was that whether it does, and how many threads it uses, is actually a setting, and moreover that Dask utilises concurrency most efficiently through dask.delayed.
      You're likely correct.

  • @terusensei_japones
    @terusensei_japones 1 year ago +4

    Very interesting. It seems that I will keep mostly using pandas 🤣🤣 thanks for sharing the experiment!

    • @robmulla
      @robmulla  1 year ago +1

      When I started making this video I thought that each library would outperform pandas in a different way. I was surprised by the results. I'm sure there are situations where they are better alternatives to pandas, but for the time being I too will be mostly sticking with pandas.

    • @igormriegel
      @igormriegel 1 year ago +1

      Try Polars, it is awesome and has way better results than what was shown in the video.

  • @RockieYang
    @RockieYang 1 year ago +1

    Did you by chance test the Arrow format with Vaex as well? Vaex uses memory mapping, but with a parquet file it still needs to load the whole thing, while with Arrow it might avoid the full load.

    • @robmulla
      @robmulla  1 year ago

      That's a good point. I just ran each package using the default with no modifications. If there is a way to change the backend format let me know how it might be done.

  • @jorgetimes2
    @jorgetimes2 1 year ago +1

    Hi, @Rob Mulla, would you please share the link to the data parquet, so that we could replicate your results and dig a bit deeper to tell where the actual problems lie? Thanks for your video!

    • @robmulla
      @robmulla  1 year ago

      The data is a combination of the parquet files in this dataset: www.kaggle.com/datasets/robikscube/reddit-place-2022-official-canvas-history
      Good luck!

  • @androiduser457
    @androiduser457 1 year ago +3

    I love Modin the most because it is backed by Dask, Ray or OmniSci and is compatible with the pandas API. If they did not support processing big data, PySpark it is.

    • @robmulla
      @robmulla  1 year ago

      Thanks for sharing. Is there a backend for Modin you typically use more?

    • @bcak611
      @bcak611 1 year ago

      Try Vaex for Big Data!

  • @rafaeel731
    @rafaeel731 1 year ago +4

    It would be useful to know the specs of your machine. I think Dask makes sense when you have much larger data and clusters of executors, more like a Spark situation.

    • @robmulla
      @robmulla  1 year ago

      My machine has a 32-thread Ryzen CPU. There may be situations where it performs better, but my main goal was to show how it performs on a single machine. Most of the time pandas alone is the best.

    • @rafaeel731
      @rafaeel731 1 year ago +1

      @@robmulla which is expected as Pandas can parallelise on a single machine and other options try to build on top of it, or some of them do. Thanks anyway!

  • @FabioRBelotto
    @FabioRBelotto 1 year ago +1

    You should share the notebook and the data (if it's available somewhere). It would be interesting to explore why tools such as Modin or Dask get such bad results.

    • @robmulla
      @robmulla  1 year ago +1

      Thanks! I did provide the code to the people at modin and they were looking into how to speed it up, but I haven't heard anything about it lately.

  • @kv1kv
    @kv1kv 1 year ago +1

    vaex is the only package that provides working out-of-core functionality
    you can process and explore the data that just does not fit into the memory at all on your desktop or laptop
    this is its purpose, it works when pandas just does not work at all
    it is an awesome package that I use on almost everyday basis
    it can be a little slower sometimes cause it does not load full dataset into memory and tries to use multiple cores so there is some expected overhead
    and it really misses some functionality so you sometimes need to convert data pieces into pandas which can be done easily

    • @robmulla
      @robmulla  1 year ago +1

      Really cool. I haven't used vaex much outside of in this video. Seems similar to polars, which I made a different video on.

  • @585ghz
    @585ghz 1 year ago +1

    In Dask, you can partition the data by index, so it can aggregate by the index much faster

    • @robmulla
      @robmulla  1 year ago

      That’s true. I just wanted to compare as a drop in replacement

  • @jti107
    @jti107 1 year ago +1

    nice! the big thing with pandas is the amount of resources when learning and debugging. any thoughts on Julia?

    • @robmulla
      @robmulla  1 year ago

      Agreed, the resources and documentation surrounding pandas make it hard to beat. I don’t have any experience with Julia. Do you recommend it?

    • @jti107
      @jti107 1 year ago +1

      @@robmulla i work in aerospace so a lot of my colleagues started using it, but i love python too much so i haven't used it yet. i started with matlab and it took a lot of effort to transition to python so the switching cost is pretty high. when i have some time, i'd like to at least try some tutorials to see what the hype is about. love your channel by the way, i've learned so much!!

  • @gokulakrishnanm
    @gokulakrishnanm 1 year ago +2

    Which processor are you using? Is it an Intel processor? From what I heard, Modin runs well on Intel CPUs. Please share your system specs.

    • @robmulla
      @robmulla  1 year ago +1

      I’m using a ryzen chip. Maybe that’s the issue.

    • @gokulakrishnanm
      @gokulakrishnanm 1 year ago +1

      @@robmulla share your test code with the data. I have an 11th gen i5, I'll benchmark and share the result.

  • @kayderl
    @kayderl 1 year ago +1

    Your notebook looks really nice. Is that jupyter notebook with a theme or something else?

    • @robmulla
      @robmulla  1 year ago +1

      Thanks! Jupyterlab with the solarized dark theme.

  • @nishantkumar-lw6ce
    @nishantkumar-lw6ce 1 year ago +1

    How do we add existing list comprehension functions in pyspark?

    • @robmulla
      @robmulla  1 year ago

      I'm not sure. I haven't used pyspark in a long time :D

  • @soren-1184
    @soren-1184 1 year ago

    Pandas 2 with arrow backend would be interesting here as well.

  • @CNW21
    @CNW21 1 year ago +4

    This is interesting because from my understanding pandas can only use 1 CPU core, whereas some or all of those alternatives *should* be able to use some or all of your Threadripper cores, which theoretically would drastically improve performance. Either way, from the looks of it I'd rather spend a few seconds/minutes waiting in pandas than reading the documentation for pandas alternatives.

    • @robmulla
      @robmulla  1 year ago +3

      You have the same thought that I did! While Python inherently uses only a single core, the vectorized NumPy and pandas functions are written in a lower-level language that can do multithreading. So that's why straight pandas is hard to beat when the data can fit into memory.

    • @bennri
      @bennri 1 year ago +2

      @@robmulla yes, but when I look at the task manager, I don't see multiple cores running.

  • @sawekb8102
    @sawekb8102 1 year ago +3

    In order to speed up Dask you can configure a client to use all cores or pass .compute(scheduler="processes")

    • @robmulla
      @robmulla  1 year ago

      Good to know. I didn’t want to add too much difference to the packages because I wanted to compare apples to apples.

  • @wayneh7067
    @wayneh7067 2 months ago

    Tbh if you have that much CPU memory, there’s really no need to consider any Pandas alternative. Maybe do some memory heavy tasks like joining large dataframes, which I usually use Dask for.

  • @GiasoneP
    @GiasoneP 1 year ago +2

    Interesting video. I’ve been working on a problem to break up a 10 GB CSV file into multiple parquet files grouped by date. I’ve attempted to do it via Pandas chunksize= and Dask. Using Dask to read into a Pandas data frame (compute()) has yielded the fastest method. However, I think there are better methods. Moving on to pyspark next. The research continues…

    • @joaomurilopalonefauvel942
      @joaomurilopalonefauvel942 1 year ago +2

      Have you taken a look at polars?

    • @GiasoneP
      @GiasoneP 1 year ago

      @@joaomurilopalonefauvel942 i have not, but will check it out 👍

    • @robmulla
      @robmulla  1 year ago +1

      Thanks for the feedback. As I mentioned in the video every dataset may respond differently. I didn’t know how fast each would perform going into the video - and was a bit surprised by the results.

    • @igormriegel
      @igormriegel 1 year ago +2

      @@GiasoneP I'm sure Polars will shine for you. I'm using it on 40 GB datasets and it is pretty snappy.

    • @GiasoneP
      @GiasoneP 1 year ago +1

      @@igormriegel I’ll check it out this weekend. Thanks for sharing.
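For the problem that opened this thread (splitting a large CSV into files grouped by date), a pandas-only version using `chunksize=` might look like the sketch below. The function name, the `date=` directory layout and the `fmt` switch are assumptions for illustration, not the commenter's code; writing parquet also requires pyarrow or fastparquet to be installed.

```python
from pathlib import Path

import pandas as pd


def split_csv_by_date(csv_path, out_dir, date_col, chunksize=1_000_000, fmt="parquet"):
    """Stream a CSV in chunks and write one file per (date, chunk) pair.

    Only one chunk is held in memory at a time. Returns the written paths.
    """
    out_dir = Path(out_dir)
    written = []
    reader = pd.read_csv(csv_path, parse_dates=[date_col], chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Group the rows of this chunk by calendar date.
        for day, part in chunk.groupby(chunk[date_col].dt.date):
            day_dir = out_dir / f"date={day}"
            day_dir.mkdir(parents=True, exist_ok=True)
            path = day_dir / f"part-{i:05d}.{fmt}"
            if fmt == "parquet":
                part.to_parquet(path, index=False)  # needs pyarrow or fastparquet
            else:
                part.to_csv(path, index=False)
            written.append(path)
    return written
```

A date that spans several chunks ends up as several part files in the same directory, which avoids appending to an existing parquet file. Readers that accept a directory of parquet files can treat each `date=` folder as one dataset.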

  • @jmoz
    @jmoz 1 year ago +2

    I spent days testing Dask and couldn’t find any benefits, or even get it to work for what I was trying. I needed to pivot a large 450M-row dataset and it simply wouldn’t work. Maxed out memory and HDD space. Had to use standard pandas and some clever iterating.

    • @robmulla
      @robmulla  1 year ago +1

      I’ve been in the exact same situation a bunch of times before too! That’s partly why I wanted to make this video. Thanks for watching.
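One possible reading of "standard pandas and some clever iterating" is to pivot chunk by chunk and add up the partial results. The sketch below is an assumption about the approach, not the commenter's code, and it only works for additive aggregations such as sums or counts.

```python
import pandas as pd


def chunked_pivot_sum(chunks, index, columns, values):
    """Pivot an iterable of DataFrame chunks and sum the partial pivots.

    `chunks` can be any iterable of frames, for example the iterator returned
    by pd.read_csv(..., chunksize=...). Returns None if there are no chunks.
    """
    total = None
    for chunk in chunks:
        part = chunk.pivot_table(
            index=index, columns=columns, values=values, aggfunc="sum"
        )
        # add() aligns on row and column labels; fill_value=0 treats a cell
        # missing on one side as zero, while cells missing on both stay NaN.
        total = part if total is None else total.add(part, fill_value=0)
    return total
```

Only one chunk and the running pivot are in memory at any time, so the peak footprint is driven by the size of the pivoted result rather than the raw row count.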

  • @riptorforever2
    @riptorforever2 1 year ago +1

    A suggestion: add the pyarrow lib if you do an update video about this :) The presentation 'PyArrow and the future of data analytics' (id: 6aWX9bZizu4) by EuroPython Conference impressed me

    • @robmulla
      @robmulla  1 year ago +1

      Oh wow. I need to check that out. Doing a review of polars soon. It’s really good!

  • @lucasbraesch805
    @lucasbraesch805 1 year ago +2

    What about Polars? This one beats everything else hands down in all the benchmarks that I have seen.

    • @robmulla
      @robmulla  1 year ago +1

      I've heard a lot of good things about polars and need to check it out.

  • @LeandroGessner
    @LeandroGessner 2 months ago +1

    I missed DuckDB
    In my tests, it is, by far, faster than these in the video (not sure about pandas)

  • @Maric18
    @Maric18 11 months ago +3

    hm i am not that happy with this comparison, as it doesn't add anything that just naively trying these things out doesn't already do
    pandas (to my knowledge) already uses numpy under the hood, so it runs parallel on your local machine
    so everything else doing the exact same thing as pandas will be less efficient
    most of these are optimizing for datasets that cannot fit in ram, dask is (to my knowledge) for clustering, at least that's how i am using it
    so a bit more research, trying to get ray to work for example, actually using dask features, doing some applys and so on would have been nice
    otherwise a title for this video, still clickbait compatible, could be something like "Can these libraries be a direct drop in improvement over pandas?" or something

  • @p.v.h.8659
    @p.v.h.8659 4 months ago

    Tbh the comparison is like comparing pears and apples. You start off with a pandas df which fits in your memory, so you are obviously faster running it in pandas because you have less overhead. But when you have datasets which just can't fit into your RAM, pandas starts to get useless and one has to switch to alternatives, for example Dask, especially when you run computationally heavy stuff like bootstrapping on a cluster, where Dask supports the proper allocation of resources while pandas normally lacks this support.

  • @fizipcfx
    @fizipcfx 1 year ago +1

    How about cudf?

    • @robmulla
      @robmulla  1 year ago

      I didn’t cover it in this video but maybe in the future.

  • @ajaypranav1390
    @ajaypranav1390 1 year ago +1

    Try with polars

    • @robmulla
      @robmulla  1 year ago

      I did! Check out my channel I have two new videos about it

  • @CaribouDataScience
    @CaribouDataScience 1 year ago +2

    You misspelled "Tidyverse" 😂

  • @barelmishal9668
    @barelmishal9668 1 year ago +1

    Hi, try Polars. It is the best, much better than pandas by far

    • @robmulla
      @robmulla  1 year ago

      Absolutely! I made a whole video about it. Check it out here. ruclips.net/video/VHqn7ufiilE/видео.html&feature=shares

  • @FabioRBelotto
    @FabioRBelotto 1 year ago +1

    If you are an experienced user and have issues with some libs, imagine what happens to a beginner lol

    • @robmulla
      @robmulla  1 year ago +1

      true! But this is a good thing to learn as a beginner too.

  • @MichaelMantion
    @MichaelMantion 1 year ago +1

    My butt puckers whenever I see people use "dd". It is not an urban legend that people have lost a lot of data misusing dd in bash.

    • @robmulla
      @robmulla  1 year ago

      lol. That thought has never crossed my mind but it’s pretty funny.

  • @tashfinbashar1943
    @tashfinbashar1943 6 months ago

    Great video. Can you do one on Polars? @robmulla