Polars: The Next Big Python Data Science Library... written in RUST?

  • Published: 4 Jun 2024
  • In this video tutorial I explain everything you need to get started coding with Polars. Polars is a multi-threaded DataFrame library, meaning it can use all of a computer's cores at once to reach its full processing potential. It has been shown to deliver large performance gains over pandas.
    Timeline:
    00:00 Intro
    01:00 What is Polars?
    02:43 Getting Started
    06:32 Filtering
    07:15 New Columns
    08:10 Groupby
    08:55 Combining Dataframes
    10:17 Multithreaded Approach
    11:21 Speed Test
    12:50 Takeaways
    Follow me on twitch for live coding streams: / medallionstallion_
    My other videos:
    Speed Up Your Pandas Code: • Make Your Pandas Code ...
    Intro to Pandas video: • A Gentle Introduction ...
    Exploratory Data Analysis Video: • Exploratory Data Analy...
    Working with Audio data in Python: • Audio Data Processing ...
    Efficient Pandas Dataframes: • Speed Up Your Pandas D...
    * RUclips: youtube.com/@robmulla?sub_con...
    * Discord: / discord
    * Twitch: / medallionstallion_
    * Twitter: / rob_mulla
    * Kaggle: www.kaggle.com/robikscube
    #python #polars #datascience

Comments • 242

  • @rahuldev2380
    @rahuldev2380 1 year ago +323

    Polars is built on top of Apache Arrow which pandas supports. So you can easily convert your polars dataframe to pandas with almost zero overhead. I use polars to do the hard part and jump back to pandas for the visualization stuff

    • @cryptoworkdonkey
      @cryptoworkdonkey 1 year ago +11

      Only if you use pyarrow first. Pandas converts Arrow into its internal representation (NumPy arrays managed by the BlockManager) and back again. It's not zero cost.

    • @rahuldev2380
      @rahuldev2380 1 year ago +2

      @@cryptoworkdonkey Ah my bad. I thought they had updated their internals from numpy

    • @jakobullmann7586
      @jakobullmann7586 1 year ago +4

      Same here. There are some things where Pandas is more convenient, but for most stuff I strongly prefer Polars. It’s not just execution performance, but also the speed of writing the code.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago +5

      ​@@cryptoworkdonkey Good news: the Pandas 2.0 release candidate can now use pyarrow as the backend, so Polars/Pandas conversions will be zero cost.

  • @bigphab7205
    @bigphab7205 1 year ago +26

    10000 points for printing the version. Every tutorial video should do that.

    • @robmulla
      @robmulla  1 year ago +5

      Thanks! I forget to do it on all of my videos but your comment is going to remind me to do it in the future.

  • @brd5548
    @brd5548 1 year ago +168

    Our team tried to integrate polars into our analytics pipeline last year, and the result was kinda on and off. To be honest, the performance of pandas is not that bad; we spent some time on several fine-tunings, like rewriting key bottlenecks with our native modules or with vectorized pandas methods, and the result turned out just ok. On the other hand, the integration work for polars did require some major revamping and refactoring, due to API gaps and implementation differences between the two. However, the performance gains didn't seem to justify the effort. What's worse, while pandas does come with pitfalls and caveats here and there, polars is a relatively young project and it comes with bugs in basic text-manipulation operations.
    But don't get me wrong, that was my experience last year. I do think polars has the potential. It has a much more robust and modern architecture than pandas in my opinion. Its API style is cleaner and more consistent. And it comes with a query optimization engine, which many users will appreciate if they are familiar with tools like Apache Spark or some databases. Given time, I think polars should become another powerful player. So, definitely give it a try if you're building something new!

    • @robmulla
      @robmulla  1 year ago +12

      Thanks for sharing! I haven't used polars in production yet, so it's interesting to hear about your experience. I guess there are limitations I didn't consider in this video. I totally agree it's worth giving a try.

    • @BiologyIsHot
      @BiologyIsHot 1 year ago +5

      This is the major bit. Who is bottlenecked by Pandas? I think the bottlenecks happen in ML or other modeling libraries, which work with the data in the form of NumPy arrays.

    • @leventelajos5078
      @leventelajos5078 1 year ago +2

      "Its API style is cleaner" Really? I think Pandas is much more pythonic.

    • @incremental_failure
      @incremental_failure 1 year ago

      @@leventelajos5078 Agree. Column assignment in Pandas seems more pythonic.

    • @konstagold
      @konstagold 1 year ago +3

      @@BiologyIsHot When you're working with large data sizes, you will be bottlenecked by pandas in no time. Typically at that point you switch to spark, which has its advantages, but also downsides. Polars looks to be a good middle ground between the two, which dask was trying to achieve.

  • @jakobullmann7586
    @jakobullmann7586 1 year ago +22

    13:20 Regarding learning the syntax… It’s worth mentioning that Polars syntax is very similar to PySpark, so it’s really two birds with one stone.

    • @robmulla
      @robmulla  1 year ago +8

      That’s a good point. Thanks for pointing it out. I really need to do a spark vs polars comparison video.

  • @Joselias156216
    @Joselias156216 1 year ago +15

    Nice video. Very interesting to see how polars works; I hope to see it more frequently in your future streams to learn more about its practical use.

    • @robmulla
      @robmulla  1 year ago +3

      Thanks Jose! I appreciate the feedback. I'm definitely going to give it a try in a future stream. I just need to find a good dataset for it.

  • @scraps7624
    @scraps7624 1 year ago +2

    I saw some tweets about Polars but seeing it in action is something else
    Also, I can't believe it took me this long to find your channel, subbed!

    • @robmulla
      @robmulla  1 year ago

      That’s awesome! Glad you found my channel. Feel free to share with others!

  • @santiagoperman3804
    @santiagoperman3804 1 year ago +7

    Great timing, I was looking to start playing with Polars since Mark Tenenholtz mentioned it some days ago. I went back to Pandas because I couldn't find the assign() and astype() equivalents in Polars. I thought they were lacking, but they seem to be with_columns() and cast(). Now I will resume more persistently.

    • @robmulla
      @robmulla  1 year ago +1

      Glad you found this video helpful. It does seem like polars may be worth the time investment now that it's becoming more established.

  • @jcbritobr
    @jcbritobr 1 year ago +3

    Nice stuff. Polars seems like a killer tool. Thank you for sharing.

    • @robmulla
      @robmulla  1 year ago

      Thanks for watching. It does seem promising.

  • @juan.o.p.
    @juan.o.p. 1 year ago +2

    Thanks for the recommendation, I will definitely give it a try 😊

    • @robmulla
      @robmulla  1 year ago +1

      Please do and let me know what you think. There might be negatives about it that I'm not aware of.

  • @rohitnair4268
    @rohitnair4268 1 year ago

    As usual, Rob, nice video. I have learned a lot from you.

  • @calum.macleod
    @calum.macleod 1 year ago +16

    Thanks for a good explanation of how Polars could benefit people who use Pandas and need more speed. In my project we already have a heavy emphasis on multi processing and fast inter process communication, so I am especially interested to see a Pandas vs Polar single core performance comparison for group and join. I hope that someone does the comparison and posts it to RUclips.

    • @robmulla
      @robmulla  1 year ago +2

      Glad it was helpful! If you look in the polars repo they have some queries that they benchmark. H2o also has a benchmark comparison of a few different libraries.

    • @calum.macleod
      @calum.macleod 1 year ago +1

      @@robmulla Thanks for the reply. I will look into the benchmarks and h2o.

  • @rackstar2
    @rackstar2 5 months ago +1

    I recently decided to fully transition from pandas to polars for a data pipeline project.
    The primary reason I'm liking polars over pandas is not just the speed (the speed is nice, don't get me wrong) but the memory usage!
    Almost all of my operations entail working with data larger than memory.
    One of the operations I have to do is pivoting a dataframe; my end result has thousands of columns!
    My kernel never seems to hold steady when doing this with pandas, but polars is really doing the trick for me.
    One small problem I did face, though, is exporting the results of the pipeline.
    I still have to resort to something like pyarrow and use its writer to do the export in chunks.
    This might just be because of how low my system memory is. Regardless, polars seems to be an excellent option for data processing and manipulation, and if you do want to showcase your data, you can always convert back and forth with pandas!

  • @sonnix31
    @sonnix31 1 year ago +2

    This is fantastic. Thank you

    • @robmulla
      @robmulla  1 year ago

      You're very welcome!

  • @gregharvey8574
    @gregharvey8574 1 year ago +34

    Thanks for bringing this to my attention; I think I might include polars in some production processes. For data exploration, I typically only use parts of dataframes for plotting or investigation. Given that you can convert a polars dataframe to pandas, a good approach seems to be keeping the full dataset in polars and then filtering into a pandas dataframe to plot.

    • @robmulla
      @robmulla  1 year ago +7

      That's a good point about how you can convert the dataframe to pandas when you need to do exploration. I'll have to think about how to use this in my EDA pipelines.

    • @headbangingidiot
      @headbangingidiot 1 year ago +1

      ​@@robmulla you can pass polars columns into plotting libs like plotly

    • @BiologyIsHot
      @BiologyIsHot 1 year ago

      The question though is do you save much time when doing this? Instantiation of Numpy arrays and Pandas dataframes themselves isn't the fastest. I guess if you have multiple "slow" actions to perform on the data you might have some benefits? Or if you really are working at such a massive scale with many many users that saving compute time is really valuable.

  • @gabrielperfumo1122
    @gabrielperfumo1122 1 year ago +1

    Great channel!! Thanks for sharing. I'll check it out for sure!

  • @tmb8807
    @tmb8807 7 months ago

    I'm blown away by how fast this is. Sure there are some things it can't do, but man, even for just reading large data sets it's absolutely blazing.

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 1 year ago +4

    DataTable is also pretty legendary, you might also find it super awesome.
    Thanks again for your amazing videos, I have watched and learned from every one of them. I hope I'll interview you about your 100k celebration sometime next year 🙏

    • @robmulla
      @robmulla  1 year ago +1

      Thanks Sanyam! I need to check it out. Hopefully 100k will come next year, but maybe 2024! Talk soon.

  • @bryanwilly4086
    @bryanwilly4086 10 months ago

    Perfect, thank you!

  • @user-fb3lg6ys5s
    @user-fb3lg6ys5s 1 year ago +1

    OMG.. thank you!!

  • @aminehadjmeliani72
    @aminehadjmeliani72 1 year ago +1

    Hi @rob, I think it's a good approach to diversify our tools these days, especially when it comes to dealing with memory (sometimes I find myself running out of time with pandas).

    • @robmulla
      @robmulla  1 year ago +2

      Absolutely! Well said.

  • @nikjs
    @nikjs 1 year ago +4

    For the python library developers: please create a wrapper lib that does the job of converting regular pandas syntax into the wee bit more complicated polars syntax. I can see that not all ops would be readily convertible, but there's definitely some low-hanging fruit here, which would cover a lot of simple use cases.

    • @robmulla
      @robmulla  1 year ago

      That would be nice. But I also think it's nice that it's different, to make it clear it's not the same thing.

  • @CaribouDataScience
    @CaribouDataScience 1 year ago

    Good stuff!!

  • @akhil-menon
    @akhil-menon 1 year ago

    Hi Rob, thank you for this super informative video! In one of your takeaways, you mentioned that Polars is a good fit if we have some really heavy data processing work. Would you be able to share some insight on how Polars would stack up against Pandas when having to perform heavy NumPy-specific computations? (Think linear and vector algebra, trigonometry, matrix operations.)
    I read on SO that it is imperative to not kill the parallelization that Polars provides by using Python specific code, so it is my intuition that applying NumPy operations on Polars columns could result in a loss of parallelization. It would be great if you could share your thoughts on this. Thank you again for the amazing content you produce!

  • @patrickonodje1428
    @patrickonodje1428 1 year ago +2

    I love your work. You should have a course on data science... for folks like us who are just learning.

    • @robmulla
      @robmulla  1 year ago +3

      Maybe one day! Thanks for watching Patrick!

    • @patrickonodje1428
      @patrickonodje1428 1 year ago +1

      @@robmulla Looking forward

  • @GiasoneP
    @GiasoneP 1 year ago +30

    Like PySpark AND Pandas. The second half mirrors PySpark. Due to the speed and out-of-the-box parallelization, I wonder how it stacks up against Spark and how its functionality compares to a cluster of machines. Take AWS, for example: can it be applied to an EMR cluster? As a side note, I'm super excited about Rust and its future in data.

    • @cryptoworkdonkey
      @cryptoworkdonkey 1 year ago +6

      There are some Apache Arrow-based Spark competitors (still young), like Ballista (a distributed DataFusion, written in Rust).
      We "buy" Spark for the Resilient in the RDD acronym. Polars can process 50 GB on one machine where Spark manages 35 GB, because of the less effective row-based abstraction that comes with the "distributed" trade-off, Scala case-class memory blow-up, etc., versus the skinny Rayon runtime in Polars.
      The Ray platform has the same Arrow format backend and is more effective than Spark, but can't do streaming (yet).
      In the Polars repo, the polars-dask integration is empty.

    • @pabtorre
      @pabtorre 1 year ago +1

      Yeah the syntax is very similar to pyspark
      Wonder how well it'll run on a spark cluster...

    • @robmulla
      @robmulla  1 year ago +3

      Good question. I don't think polars is meant as a replacement for pyspark because, from what I can tell, it doesn't distribute computation across nodes.

    • @AWest-ns3dl
      @AWest-ns3dl 1 year ago +5

      I can confirm, Polars does not use nodes.

    • @RyanApplegatePhD
      @RyanApplegatePhD 1 year ago +2

      @@robmulla With the ever-improving compute, I think Polars could be in a sweet spot between Spark and Pandas. I know when I was parsing very raw, large datasets in pandas I did sometimes feel constrained and moved to Spark; however, there is a lot of overhead in using Spark effectively, and this might split the difference.

  • @bubbathemaster
    @bubbathemaster 1 year ago +7

    Extremely interesting. It’ll be hard to dethrone pandas due to the huge community support but I really like the lib.

    • @robmulla
      @robmulla  1 year ago +2

      I agree pandas is too entrenched at this point to be easily dethroned.

  • @Mari_Selalu_Berbuat_Kebaikan
    @Mari_Selalu_Berbuat_Kebaikan 1 year ago +2

    Let's always do good and encourage more people to do the same 🙏

  • @ApeWithPants
    @ApeWithPants 1 year ago +4

    Pandas has some strange quirks that always bothered me. Strange syntax or unintuitive copy/not copy behavior. Glad to see more competitors

    • @robmulla
      @robmulla  1 year ago

      I’m a big fan. But also think polars and others like it have good potential. Thanks for watching! Are you a kraken fan? Go Caps!

  • @jackychan4640
    @jackychan4640 1 year ago +1

    Happy New Year 2023

    • @robmulla
      @robmulla  1 year ago

      Same to you Jacky! 🎆

  • @tonik2558
    @tonik2558 1 year ago +3

    The usage in Python seems to mirror a lot of the standard Rust iterator API. Looks like it would be even better if used directly in Rust. Thanks for making a video about this.

    • @brainsniffer
      @brainsniffer 1 year ago +1

      I think that there is so much for data that is built in python that it’s easier to use an abstraction like this than to do things in rust, especially for interactions. It’s an interesting idea.

    • @robmulla
      @robmulla  1 year ago +1

      I have learning Rust on my to-do list. Will you teach me? 😝

    • @tonik2558
      @tonik2558 1 year ago +3

      @@robmulla The Book is an amazing starting resource. It's how I learned Rust, and it's probably the fastest way to get started with the language

    • @shadowangel8005
      @shadowangel8005 1 year ago

      @@robmulla Google just posted a small course a week or so back.

  • @michaelnorthrup2
    @michaelnorthrup2 1 year ago +3

    This is my little trick for hyper optimizing data processing haha. Pivots are insanely fast in polars

    • @robmulla
      @robmulla  1 year ago

      Ohh. Never tried pivots in it.

  • @BiologyIsHot
    @BiologyIsHot 1 year ago +16

    I think the big problem is that it isn't interoperable with NumPy-based libraries. I'm honestly struggling to think of many cases where Pandas is too slow. Some of the features like a lazy/eager API could be nice, but I think most of the slow computations people are doing are within libraries that are going to require conversion to NumPy arrays anyway.

    • @robmulla
      @robmulla  1 year ago +3

      Yea, I guess it really depends on your use case. I've run across a few recently where polars was helpful.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago +2

      You can convert to and from Pandas very easily. Now that Pandas 2.0 will use pyarrow as the backend, that conversion will be truly zero cost.

  • @pimziengs2900
    @pimziengs2900 1 year ago +1

    Thanks for this video! I am a data scientist always looking for some new techniques xD.
    Cheers from the Netherlands!
    PS: There is some background noise in your video around 3:30.

    • @robmulla
      @robmulla  1 year ago

      Welcome! Glad to have a viewer from the Netherlands. Sorry about the noise at 3:30 - I didn't notice it until after I was done editing and then it was too late.

  • @mutley11
    @mutley11 1 year ago +2

    Very compelling presentation; many thanks. I would have liked to see an example of how user-friendly the error messages are. Rust error messages are surprisingly good in general and I was wondering if that is true of polars. You missed at least one opportunity to illustrate a typo. 😊

    • @robmulla
      @robmulla  1 year ago +1

      Glad it was helpful! Next time I'll try to throw more errors :D

  • @hensonjhensonjesse
    @hensonjhensonjesse 1 year ago +2

    It looks surprisingly similar to pyspark. Especially the lazy implementation. Pretty cool stuff!

    • @robmulla
      @robmulla  1 year ago +1

      Yea, a lot of similarities to pyspark!

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 1 year ago +6

    The rust community really produces brilliant stuff. Very impressive!
    Did you find any areas where polars is lacking vs pandas?
    Btw, have you checked out nu-shell? It's essentially a new shell language designed to follow the Unix philosophy but with dataframes for data flow, at least as far as I understand it. Written in rust, of course.
    It's in pretty early development, but it feels pretty great to play around with and can probably produce some nice workflows.

    • @robmulla
      @robmulla  1 year ago +1

      Never heard of nu-shell but I'll check it out. I am not too familiar with the Rust community, but this package is pretty solid. As people have mentioned, the syntax is much more verbose and it lacks some of the built-in pandas features.

  • @AlexanderHyll
    @AlexanderHyll 1 year ago +4

    As a btw: if you want to plot something quick, converting to pandas is super fast (if, of course, a bit memory inefficient). You can also just pass columns to plt. Just my 2 cents.

    • @robmulla
      @robmulla  1 year ago

      Good point, I do use df.plot() a lot though so it would take some getting used to.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago

      Now that Pandas 2.0 uses pyarrow as backend, conversions will be truly zero cost.

  • @The-KP
    @The-KP 1 year ago +2

    @Rob Mulla Nice that Polars can perform RDBMS-like ops, but what about the computation libs that bind to Pandas dataframes, like numpy, scipy, and scikit-learn? If it can be used with those, or somehow replaces them, I'm in! Hopefully Polars is not an island.

    • @robmulla
      @robmulla  1 year ago

      I know you can easily convert from polars back to a pandas dataframe, and they both use Apache Arrow.

  • @PlatinumDragonProductions999
    @PlatinumDragonProductions999 1 year ago +4

    I love Pandas, but I prefer Spark. This looks very Spark-like to me; I'm eager to make it my go-to dataframe processor. :-)

    • @robmulla
      @robmulla  1 year ago

      If you prefer spark I’m guessing this will be a great package for you.

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 11 months ago +1

    Valeu!

    • @robmulla
      @robmulla  11 months ago

      Thanks so much 🙌

  • @aliyektaie9123
    @aliyektaie9123 1 year ago +1

    Hi, thanks for this great video! It looks like polars is very similar to spark; do you know how they compare?

    • @robmulla
      @robmulla  1 year ago

      Thanks for the comment. They are very similar. Check out my most recent video where I compare the two.

  • @grabani
    @grabani 1 year ago +1

    Interesting.

  • @K-mk6pc
    @K-mk6pc 1 year ago +1

    I am working on large data in pandas, but it's not a problem for me. Pandas does fine in a few minutes.

  • @samstanton-cook1419
    @samstanton-cook1419 1 year ago +6

    Great video, thanks Rob! Our data science teams use polars a lot. For long time-series aggregation queries (100M+ rows) we use the pykx python package to access the q/kdb+ language, for higher performance still over pandas and polars. Have you seen it?
    kx.q.qsql.select(qtab, columns={'minCol2': 'min col2', 'medCol3': 'med col3'},
                     by={'groupCol1': 'col1'},
                     where=['col30.7']
                     )

    • @robmulla
      @robmulla  1 year ago

      I need to check that out. Pykx… first time hearing of it. Sounds cool though. Thanks for watching.

  • @Pedro_Israel
    @Pedro_Israel 1 year ago +1

    Hey Rob, can you do a video about automatic EDA libraries? I used them and they blew my mind. I am amazed I didn't know about them earlier.

    • @robmulla
      @robmulla  1 year ago

      That's a good suggestion. What libraries have you used that you like? The main one I've seen is pandas profiling.

  • @JustinGarza
    @JustinGarza 1 year ago +1

    I like this, but I wish it covered graphs. Does this use matplotlib or something else to make graphs and charts?

    • @robmulla
      @robmulla  1 year ago

      It doesn’t. But you can always convert it back to a pandas data frame to plot.

    • @JustinGarza
      @JustinGarza 1 year ago

      @@robmulla umm maybe I’ll wait til it gets more graphic/chart support or until pandas gets updated

  • @chris_kouts
    @chris_kouts 1 year ago +1

    You should do a benchmarking video. I was waiting for you to tell me if I should start using it.

    • @robmulla
      @robmulla  1 year ago

      I made a video about it just yesterday! Check it out on my channel.

  • @cryptoworkdonkey
    @cryptoworkdonkey 1 year ago +4

    I think Polars could replace Pandas in ETL tasks, but it has some struggles around comfortable Expr construction.
    And in the Arrow universe there is the DataFusion project as an alternative.

    • @robmulla
      @robmulla  1 year ago

      I agree. I haven't fully tested out the expressions to notice what I use in pandas that polars is missing. What is the DataFusion project? I'm not familiar with that.

    • @cryptoworkdonkey
      @cryptoworkdonkey 1 year ago

      @@robmulla DataFusion is a more "arrow-society" oriented project (part of the Apache Arrow project), positioned as a Spark/Hive/MapReduce challenger. It is designed to be more modular, with SQL and DataFrame APIs, and can be used as a library (it positions itself as a query engine for Arrow) by higher-level projects.
      Polars positions itself as a challenger to the classical DataFrame libraries, but both can be used as a SQL CLI. Both have plan optimizers, Rayon parallelism, SIMD optimizations, etc.
      Both are cool. I don't know about the larger-than-memory capabilities of DataFusion. DataFusion is the foundation of the Blaze/Ballista distributed computing engines. The Polars Dask integration repo is currently not active.

  • @neronjp9909
    @neronjp9909 1 year ago

    How come every time you click the column name, the column name gets copied into your typing code? Is there a hotkey for that? My company's raw-data column names are so long, with _ / spaces / dots... I always get slowed down typing code across the column names. May I know how you do that at 8:07? Thx

  • @MaavBR
    @MaavBR 1 year ago +1

    7:10 Quick correction, SAN is San Diego, not San Francisco
    San Francisco airport's code is SFO

  • @Matias-eh2pn
    @Matias-eh2pn 1 year ago +1

    How did you configure that theme in Jupyter?

    • @robmulla
      @robmulla  1 year ago

      I have a whole video on my setup. Check it out here: ruclips.net/video/TdbeymTcYYE/видео.html

  • @JordiRosell
    @JordiRosell 1 year ago +2

    For plotting polars, I think plotnine is a good option.

    • @robmulla
      @robmulla  1 year ago +1

      I have a video all about my favorite plotting libraries (including plotnine): ruclips.net/video/4O_o53ag3ag/видео.html&feature=shares

  • @user-fv1576
    @user-fv1576 3 months ago

    Looks a bit like SQL with the select. Newbie question: why not just use the pandasql library?

  • @chintansawla
    @chintansawla 1 year ago +3

    The library feels like it's based off the syntax/methods of pyspark. A lot of the methods used are similar to how RDDs are converted to DataFrames in pyspark

    • @robmulla
      @robmulla  1 year ago +2

      Yes, definitely a lot of similarities between pyspark and polars. Pyspark has always been much slower for me when running on a single node.

    • @chintansawla
      @chintansawla 1 year ago

      @@robmulla that's a bit shocking! Both seem to be performing in a similar fashion theoretically (lazy evaluation, parallel computing). Going to try and compare polars soon. Thanks

    • @jordanfox470
      @jordanfox470 1 year ago +1

      @@robmulla have you tried pandas on spark? Databricks has that running.

    • @robmulla
      @robmulla  1 year ago

      @@jordanfox470 no. Have you? How does it compare?

  • @simplemanideas4719
    @simplemanideas4719 1 year ago +1

    Speed is always a priority, because it equals resource optimization. However, this leads to the question: how efficient are both libs per core?

    • @robmulla
      @robmulla  1 year ago

      Good question. I'd guess polars is faster on all fronts but it would depend on a lot of things.

  • @AaronWoodrow1
    @AaronWoodrow1 1 year ago +2

    I don't fully get why it's geared more toward data pipelining rather than data exploration (as mentioned @ 13:33) if the data needs to be contained to a single host. Even with parallelization across multiple CPUs, there's still a data size cap limited by available memory. A tool such as PySpark (or Dask) seems better suited for pipelining, which ultimately consumes larger amounts of data.

    • @robmulla
      @robmulla  1 year ago +1

      Yea, I see your point. Sometimes you have data in between, or you just want a faster pipeline for a small job you run on a regular basis. Either way, if it were identical to pandas and faster, then people would use it for sure!

    • @AaronWoodrow1
      @AaronWoodrow1 1 year ago +1

      @@robmulla True, just a minor nit. Great video btw!

  • @bazoo513
    @bazoo513 1 year ago +1

    "Split, apply, combine" approach sounds like it could employ massively parallel processing of graphics cards. Is there a CUDA implementation?

    • @robmulla
      @robmulla  1 year ago +1

      Yes! It’s called rapids. I need to make a video about it.

    • @bazoo513
      @bazoo513 1 year ago

      @@robmulla Thanks!

  • @suvidani
    @suvidani 1 year ago +1

    How does the performance compare to pyspark? The syntax is very similar to pyspark.

    • @robmulla
      @robmulla  1 year ago

      Good question. I might need to test it out. Haven’t used spark in years and had some bad experiences but it’s probably gotten better since then.

  • @akshaydushyanth9720
    @akshaydushyanth9720 1 year ago +1

    Is it similar to pyspark? Whats the difference between both?

    • @robmulla
      @robmulla  1 year ago +1

      It only runs on a single node, but it's much faster than pyspark when working with data that can fit in memory.

  • @georgiyveter6391
    @georgiyveter6391 1 year ago +1

    Using Python 3.10.
    Created a dictionary:
    d = {'a': [1, 2, 3], 'b': [4, -5, 6]}
    Created a dataframe:
    df = pl.DataFrame(d)
    print(type(df))
    print(df)
    It all works. But if I change any number in the dictionary to a float, for example 6.8, then print(type(df)) still shows it's a dataframe, but the next print silently does nothing, like 'pass', and the script ends. Why?

    • @robmulla
      @robmulla  1 year ago

      That’s a great question. Is it only with 3.10?

  • @ankan650
    @ankan650 1 year ago +1

    Wow. It looks like Apache Spark might be obsolete soon. Can you also compare the Ray package with Polars? I think Ray is not exactly for data processing but more for compute-intensive tasks. Thanks.

    • @robmulla
      @robmulla  1 year ago +1

      I benchmark ray in a different video if you want to check it out.

  • @rhard007
    @rhard007 1 year ago

    Is it not possible to use Matplotlib or Seaborn with Polars?

    • @robmulla
      @robmulla  1 year ago

      It probably is possible. It's just not built into the dataframe as methods like it is in pandas. Just one additional step or you can convert the final data to pandas after processing.

  • @rahulrjb
    @rahulrjb 4 months ago

    Very PySpark-like syntax.

  • @praveenmogilipuri4524
    @praveenmogilipuri4524 1 year ago +1

    Hi, can anyone help me with how to connect polars to Snowflake? Through pandas I can, but I don't want to use pandas.

    • @robmulla
      @robmulla  1 year ago

      I’ve never done anything like that before but maybe others will know how.

  • @donnillorussia
    @donnillorussia 1 year ago +1

    Isn't this "split-apply-combine" approach similar to map-reduce? Just curious 😉

    • @robmulla
      @robmulla  1 year ago +1

      Yes! Exactly. Map-reduce (like in spark) is very similar. Polars only runs on a single node, while map-reduce, I believe, can be done across nodes.

  • @bazoo513
    @bazoo513 1 year ago +1

    I wonder why the authors of these tabular data manipulation libraries didn't adopt relational algebra terminology (or even SQL as a, if not the, manipulation language). For example, why isn't choosing only some columns called "projection"?
    Subtle syntax (and _especially_ semantics) differences between libraries designed to do essentially the same tasks make the lives of users unnecessarily difficult.

    • @robmulla
      @robmulla  1 year ago +1

      That’s a good point. Some libraries (like spark) do have the ability to write SQL directly on flat files like this.

  • @ArnabAnimeshDas
    @ArnabAnimeshDas 1 year ago +1

    I would import another plotting library which produces a better plot anyways.

    • @robmulla
      @robmulla  1 year ago

      Yep, that's totally reasonable. Thanks for watching.

    • @ArnabAnimeshDas
      @ArnabAnimeshDas 1 year ago +1

      @@robmulla also you can convert polars dataframe to pandas if you want to

  • @yayasssamminna
    @yayasssamminna 3 months ago

    Please make a tutorial on Dask!!!

  • @user-ck3hp8cj4h
    @user-ck3hp8cj4h 1 year ago +2

    Somehow it's very similar to Spark on AWS Glue?

    • @robmulla
      @robmulla  1 year ago +1

      Yes, very similar but I think polars is intended for a single machine vs. spark which can be distributed across nodes.

  • @michaeldeleted
    @michaeldeleted 1 year ago +1

    OMG I just completely replaced pandas with polars and all the regular pandas commands worked

    • @robmulla
      @robmulla  1 year ago

      Wait, what? I think the syntax should be very different. Unless they released a new version that I don't know about. Can you show an example?

    • @michaeldeleted
      @michaeldeleted 1 year ago +1

      Oops, didn't change all my pd to pl. LOL was still using pandas

    • @robmulla
      @robmulla  1 year ago

      @@michaeldeleted oh! That explains it.

  • @EircWong
    @EircWong 1 year ago +1

    Noise at 3:29, lasting about 10 seconds

    • @robmulla
      @robmulla  1 year ago +1

      Yes! I noticed that. I forgot to put my phone further away from the mic. I tried to edit it out as much as possible. Hopefully it wasn't too distracting.

  • @ibekweobinna3514
    @ibekweobinna3514 1 year ago +1

    Rob, can I add you to my website as one of the best tutors of data science? Man, you are good. But funny enough, I am still learning pandas, and then boom came polars.

    • @robmulla
      @robmulla  1 year ago

      Thanks Ibekwe. Never stop learning!

  • @JohannPetrak
    @JohannPetrak 1 year ago +2

    Your timeit presentation includes the time to read the data, which might not be such a good idea.

    • @robmulla
      @robmulla  1 year ago

      Nice catch, but I actually did that intentionally because data I/O is one area where polars can be much faster.

    • @JohannPetrak
      @JohannPetrak 1 year ago +1

      @@robmulla it is just very bad practice to do this, and there are other issues which may totally distort the measurements, like the OS caching read data in buffers from a previous read.

    • @robmulla
      @robmulla  1 year ago

      @@JohannPetrak that’s a good point. Any idea how I could properly compare the read time in a way that wouldn’t be messed up by the caching?

    • @JohannPetrak
      @JohannPetrak 1 year ago

      @@robmulla I think there is no way to avoid it, but it may be possible to reduce the effect by loading files that are much larger than what the OS might use for caching, and also by loading a sequence of many different files in a single benchmarking run, then repeating this several times and taking the average (and stdev). Also, maybe check how much the external storage is the bottleneck by also loading from SSDs or memcached files.
      With HDDs this will be A LOT slower than the CPU-based benchmarks, so I would argue for separating these benchmarks from each other.
      But even with the CPU-based ones, running on larger data structures (on a computer that has even larger RAM) may give better results, as the impact of other OS, memory management, (JIT) interpreter, etc. optimizations gets reduced.
      Sorry, I do not want to claim I know how to do proper benchmarks, but I do know (from experience) that it is easy to not do it properly :)
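
      Those suggestions (a warm-up read to populate the OS cache, several timed runs, then mean and stdev) can be sketched with nothing but the standard library; the file and row count below are invented for illustration:

```python
import csv
import statistics
import tempfile
import time
from pathlib import Path

def time_reads(path, n_runs=5):
    """Time repeated full reads of a CSV; the first (warm-up) read is
    discarded because the OS caches the file, making later reads faster."""
    timings = []
    rows = 0
    for run in range(n_runs + 1):
        start = time.perf_counter()
        with open(path, newline="") as f:
            rows = sum(1 for _ in csv.reader(f))  # count includes the header row
        elapsed = time.perf_counter() - start
        if run > 0:  # drop the warm-up run
            timings.append(elapsed)
    return rows, statistics.mean(timings), statistics.stdev(timings)

# Build a throwaway CSV to benchmark against.
csv_path = Path(tempfile.mkdtemp()) / "bench.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows((i, i * 2) for i in range(100_000))

rows, mean_s, stdev_s = time_reads(csv_path)
print(f"{rows} rows, mean {mean_s:.4f}s, stdev {stdev_s:.4f}s")
```

      The same harness works for comparing pandas vs. polars readers — just swap the body of the timed section — though, as noted above, every run after the first is measuring the cached file, not the disk.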

  • @nikjs
    @nikjs 1 year ago +1

    3:35 - some audio interference starts from around this point, pls check the video

    • @robmulla
      @robmulla  1 year ago

      Thanks for the heads up. I noticed that when editing. Sorry about it.

  • @AyahuascaDataScientist
    @AyahuascaDataScientist 10 months ago

    Polars doesn’t have a .info() method? I can’t use it…

  • @rolandheinze7182
    @rolandheinze7182 1 year ago

    Polars syntax seems very similar to PySpark, and in my opinion that hurts readability vs pandas

  • @JayRodge
    @JayRodge 1 year ago +1

    Have you tried RAPIDS cuDF?

    • @robmulla
      @robmulla  1 year ago

      A little bit. It can be really fast but requires that your data is small enough to fit into your GPU memory.

  • @valuetraveler2026
    @valuetraveler2026 1 year ago +1

    URLError:

    • @robmulla
      @robmulla  1 year ago

      Strange. Did you get this error when trying to pip install? Otherwise polars shouldn't be using anything to connect to the internet.

  • @fredgavin
    @fredgavin 1 year ago +2

    Tried Polars multiple times, and felt that it was too verbose. Just cannot give up R's data.table, which is the best data manipulation package in the data science world, no competitor at all.

    • @robmulla
      @robmulla  1 year ago

      Yea. Definitely more verbose than pandas. I haven’t used R in years but don’t remember it ever being the fastest.

  • @XavierSoriaPoma
    @XavierSoriaPoma 1 year ago +1

    So why should we use polars instead of pandas?

    • @robmulla
      @robmulla  1 year ago +1

      Did you watch the video? 😂 Speed is the main reason.

    • @XavierSoriaPoma
      @XavierSoriaPoma 1 year ago +1

      @@robmulla yeah, but I'm still not convinced. It's like TensorFlow or PyTorch: they are not as fast as Flux, but we still use them in Python

  • @hanabimock5193
    @hanabimock5193 1 year ago +1

    I already see books and videos about polars, the same as with pandas. It's like, come on, who needs a book for pandas? Are you kidding me?

    • @robmulla
      @robmulla  1 year ago

      Why do you dislike the fact that there are books about it? Honestly curious. Thanks for watching!

  • @mishmohd
    @mishmohd 1 year ago +1

    Can we suggest they change the name to Polaris

    • @robmulla
      @robmulla  1 year ago

      Why do you suggest that?

  • @hawrezangana8240
    @hawrezangana8240 1 year ago +1

    Unless you need to run your scripts over and over, I believe Polars cannot replace Pandas, as it takes more effort to write a simple aggregation. 2 seconds of faster execution is not worth 20 seconds of writing a line for every aggregation column and giving it an alias.

    • @robmulla
      @robmulla  1 year ago +1

      Yea. For quick scripts on small data and EDA, I’m sticking with pandas.

  • @leonidgrishenkov4183
    @leonidgrishenkov4183 1 year ago +1

    In some cases Polars syntax seems like PySpark

    • @robmulla
      @robmulla  1 year ago

      I've been hearing that a lot :D

    • @leonidgrishenkov4183
      @leonidgrishenkov4183 1 year ago +1

      @@robmulla ahaha sorry, I’m just a captain obvious 😂

    • @robmulla
      @robmulla  1 year ago

      @@leonidgrishenkov4183 No it's a good point that I didn't realize until people pointed it out. I personally don't use pyspark a ton. Thanks for watching.

  • @vzmaster
    @vzmaster 1 year ago

    I'm running into a different problem when I try to speed up pandas (or dask): they eat up memory really fast.
    In a JupyterLab environment, I load ~3-5 MB of data, use pandas' .extractall() function on a string field, and then compare the results with int fields (count of matches).
    In a single thread it takes several weeks to calculate. If I use multiprocessing, then when comparing results with df.loc it eats up 200 GB+ of memory.

    • @robmulla
      @robmulla  1 year ago

      That doesn't sound right. If your data is 3-5Mb I can't imagine any sort of processing needing to take a week to calculate. I'm thinking it's probably something in your code and not an issue with pandas or dask.

    • @vzmaster
      @vzmaster 1 year ago

      @@robmulla I actually have 2 tables, a big one and a small one. The big one has the data (~2M entries, ~150 MB), the small one has the patterns (~10k entries, ~2 MB). I need to run each pattern on all the data; that's why it may take long.
      But the primary problem is not speed, it's memory consumption. That's why I take small chunks of the big table, ~5-30 MB. But even with 5 MB chunks I get a memory overflow of 200 GB.
      Here is the code:
      import numpy as np
      import pandas as pd
      import sqlalchemy as sql

      sql_engine = sql.create_engine('mysql+mysqlconnector://.......................')
      df = pd.read_sql_query("..........................", sql_engine)        # big table
      patterns = pd.read_sql_query("..........................", sql_engine)  # small table

      # -------------------------------- jupyterlab block separation --------------------------------
      def findoccurancesofpatterns(pat):
          (idx, row) = pat
          res = df.summarynormalized.str.extractall(row.pattern)
          numvaluesstats = pd.DataFrame(columns=['pattern', 'numvalueorder', 'param1', 'param2', 'param3'])
          if len(res) > 0:
              numvaluecount = len(res.columns)
              res = pd.merge(res.reset_index(), df[[, 'param1', 'param2', 'param3']], how='left', left_on='id', right_index=True)
              for i in range(numvaluecount):
                  numvaluesstats.loc[len(numvaluesstats)] = [
                      row.pattern, i,
                      (res['param1'] == res[i].astype('Int64')).sum(),
                      (res['param2'] == res[i].astype('Int64')).sum(),
                      (res['param3'] == res[i].astype('Int64')).sum(),
                  ]
          return numvaluesstats

      from multiprocessing.pool import Pool
      pool = Pool(50)
      allnumvaluesstats = []
      for numvaluesstats in pool.imap_unordered(findoccurancesofpatterns, patterns.iterrows()):
          allnumvaluesstats.append(numvaluesstats)

  • @cradleofrelaxation6473
    @cradleofrelaxation6473 1 year ago +1

    Is it just me, or is the syntax a bit more complicated than pandas whenever they differ?!

    • @robmulla
      @robmulla  1 year ago

      Yes. I agree, it ends up being more verbose.

  • @commonsense1019
    @commonsense1019 1 year ago +1

    Well, the core of pandas could also be rewritten in Rust, no big deal

    • @robmulla
      @robmulla  1 year ago +1

      It can. But will it?

  • @AWest-ns3dl
    @AWest-ns3dl 1 year ago +1

    Polars syntax is similar to spark

    • @robmulla
      @robmulla  1 year ago +1

      I’ve been hearing that 😃

  • @whitebai6367
    @whitebai6367 1 year ago +1

    Okay, I'd like to use rust directly.

    • @robmulla
      @robmulla  1 year ago

      You can do it! Polars has a rust API too. Try it out and let me know what you think.

  • @richardbennett4365
    @richardbennett4365 1 year ago +1

    It is the problem with people who use pandas: they don't, by and large, know about polars. But why? Is it the Polars creator's fault for not promoting his product, or laziness by pandas users who just don't look for something better?
    Also, if one writes import polars as pd, then one doesn't need to rewrite code written for pandas. Or one can import polars as po. I never understood why people import this package as pl. That would be for a package called plank, like the dock replacement.

    • @robmulla
      @robmulla  1 year ago

      Importing as pl makes the most sense to me and it’s what their docs recommend.

  • @NickWindham
    @NickWindham 1 year ago +1

    Just use Julia instead of Python. Then you can do all this with speed similar to Rust, in one language that has even simpler syntax than Python.

    • @robmulla
      @robmulla  1 year ago

      Oh really? I haven't had a need to use Julia yet, but I know it's popular to use with Spark.

  • @ErikS-
    @ErikS- 1 year ago +1

    Just take a huge amount of RAM.
    I did that also...

    • @robmulla
      @robmulla  1 year ago

      I used polars on a live stream and crashed my computer during it because it ate all my memory. There is a way to limit the amount it uses, I think.

  • @richardbennett4365
    @richardbennett4365 1 year ago +1

    He said 15, but he wrote 10 at 7:05.

  • @BillyT83
    @BillyT83 1 year ago +1

    So... Pandas + Dask = Polars?

    • @robmulla
      @robmulla  1 year ago

      Kinda… but it's really just its own thing.

  • @go64bit
    @go64bit 1 year ago +1

    Panda bear vs Polar bear 😅

  • @nitinkumar29
    @nitinkumar29 1 year ago +1

    I will let it mature before dealing with this.

    • @robmulla
      @robmulla  1 year ago

      That’s a fair approach. Adopting things too early can be problematic.

  • @richardbennett4365
    @richardbennett4365 1 year ago +1

    What??? Polars is supposed to give the same result as pandas. Duh. Polars is a pandas replacement.

  • @xuantungnguyen9719
    @xuantungnguyen9719 1 year ago +1

    Very similar to pyspark

    • @robmulla
      @robmulla  1 year ago

      It seems so. Just not distributed.

  • @jaqo92
    @jaqo92 1 year ago +1

    Looks like pyspark

    • @robmulla
      @robmulla  1 year ago

      Indeed! I just released a whole video comparing the two.

  • @dunteesilver6607
    @dunteesilver6607 11 days ago

    😂😂