Make Your Pandas Code Lightning Fast

  • Published: Jan 29, 2025

Comments • 333

  • @hasijasanskar
    @hasijasanskar 2 years ago +112

    Whoa.. 3500 times difference. Vectorised is even faster than apply, will give it a try next time for sure. Awesome video as always.

    • @robmulla
      @robmulla  2 years ago +5

      Thanks Sanskar. Yes, using vectorized functions is always much faster. In some cases it's not possible but then there are other ways to speed it up. I might show that in another video if this one is popular.

    • @amazingdude9042
      @amazingdude9042 11 months ago

      @@robmulla can you make a video on how to make pandas resample faster?
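A minimal sketch of the difference this thread is about: `apply(axis=1)` makes one Python-level function call per row, while the vectorized form is a single boolean expression over whole columns. Column names (`pct_sleeping`, `time_in_bed`) are borrowed from the video as referenced in later comments; the data here is made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pct_sleeping": rng.random(10_000),
    "time_in_bed": rng.integers(0, 13, 10_000),
})

# Row-by-row: one Python function call per row (slow).
slow = df.apply(lambda row: row["pct_sleeping"] > 0.5 and row["time_in_bed"] > 5,
                axis=1)

# Vectorized: one boolean expression evaluated over whole columns at once (fast).
fast = (df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5)

assert (slow == fast).all()
```

Both produce the same boolean mask; only where the loop runs (Python vs. compiled NumPy code) changes.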

  • @miaandgingerthememebunnyme3397
    @miaandgingerthememebunnyme3397 2 years ago +289

    That’s my husband! He’s so cool.

    • @robmulla
      @robmulla  2 years ago +35

      Love you boo. 😘

    • @FilippoGronchi
      @FilippoGronchi 2 years ago +6

      Fully agree!

    • @sauloncall
      @sauloncall 2 years ago +6

      Aww! This is wholesome!

    • @rahulchoudhary1024
      @rahulchoudhary1024 2 years ago

      I've been watching your videos non-stop for the last week! And enjoy comments from your SO!!! Lovely!

    • @mohammedgt8102
      @mohammedgt8102 2 years ago

      He is awesome. Taking time out of his day to share knowledge 👏

  • @nathanielbonini8951
    @nathanielbonini8951 1 year ago +4

    This is spot on. I had a filter running that was going to take 2 days to complete on a 12M line CSV file using iteration - clearly not good. Now it takes 6 seconds.

  • @kip1272
    @kip1272 2 years ago +33

    Also, a way to speed it up is to not use & and | for 'and' and 'or', but just use the words 'and' and 'or'. These words are made for boolean expressions and thus work faster. & and | are bitwise operators made for integers; using them forces Python to convert the booleans to integers, do the bitwise operation, and then cast the result back to a boolean. This doesn't take much time if you do it once, but in a test scenario inspired by this video it was roughly 45% slower.

    • @robmulla
      @robmulla  2 years ago +6

      Nice tip! I didn’t know that.

    • @kazmkazm9676
      @kazmkazm9676 2 years ago +2

      I made the experiment. It is ready to run. What you have suggested is coded below. It is approximately 20 percent faster.
      import timeit
      setup = 'import random; random_list = random.sample(range(1,101),100)'
      # with or
      first_code = '''result_1 = [rand for rand in random_list if (rand > 75) or (rand < 25)]'''
      # with |
      second_code = '''result_2 = [rand for rand in random_list if (rand > 75) | (rand < 25)]'''

    • @kip1272
      @kip1272 1 year ago

      @@kazmkazm9676 the difference was even bigger between & and 'and', if I remember correctly.

    • @A372575
      @A372575 1 year ago

      Great, never realized that. Will start using 'and' and 'or' from now on.
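One caveat worth adding to this thread: whatever `and`/`or` buy you on plain Python scalars, they cannot replace `&`/`|` on pandas Series. Element-wise boolean logic requires the bitwise operators; using `or` on a Series asks for a single truth value of the whole column and raises. A quick sketch:

```python
import pandas as pd

s = pd.Series([10, 60, 90])

# Element-wise boolean logic on a Series needs the bitwise operators:
mask = (s > 75) | (s < 25)

# Plain `or` tries to reduce each Series to one bool and raises ValueError
# ("The truth value of a Series is ambiguous"):
try:
    (s > 75) or (s < 25)
    raised = False
except ValueError:
    raised = True

print(mask.tolist(), raised)  # [True, False, True] True
```

So the `and`/`or` tip applies to scalar comparisons inside plain Python loops, not to the vectorized expressions shown in the video.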

  • @Zenoandturtle
    @Zenoandturtle 1 year ago +4

    That is unbelievable. Astounding time difference. I was recently watching a presentation on a candlestick algorithm, and the presenter used the vectorised method and I was confused (I am new to Python), but this video made it all too clear. Fantastic presentation.

    • @robmulla
      @robmulla  1 year ago

      Glad you found it interesting. Thanks for watching!

  • @LimitlesslyUnlimited
    @LimitlesslyUnlimited 2 years ago +4

    Haha coincidentally I'd been raving about vectorized to my friends the last few months. It's soo good. The moment I saw your title I figured you're probably talking about vectorize too haha. Awesome video and great content!!

    • @robmulla
      @robmulla  2 years ago +1

      You called it! Thanks for the positive feedback. Hope to create more videos like it soon.

    • @robertnolte519
      @robertnolte519 2 years ago

      Same! Still hasn't worked on picking up chicks at the bar, but I'm not giving up.

  • @jti107
    @jti107 2 years ago +90

    I didn’t realize you could write 10k as 10_000. I work with astronomical units so makes variables more readable. Great video!

    • @robmulla
      @robmulla  2 years ago +11

      Thanks! Yes, they introduced underscores in numeric literals in Python 3.6 - it really helps make numbers more readable.

    • @kailashlate6348
      @kailashlate6348 11 months ago

      😊😊😊

    • @kailashlate6348
      @kailashlate6348 11 months ago

      😊
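The underscore feature discussed above (PEP 515, Python 3.6+) is purely visual: the parser ignores the underscores, so these are ordinary integers.

```python
# PEP 515: underscores in numeric literals are ignored by the parser,
# so 10_000 and 10000 are the same int.
n = 10_000
big = 1_000_000_000

print(n == 10000, big == 10**9)  # True True
```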

  • @robertjordan114
    @robertjordan114 2 years ago +2

    Man where have you been all my Python-Life!?!? Thank you so much for this! Outstanding!!!

    • @robmulla
      @robmulla  2 years ago +1

      Thanks Robert for watching. Glad you found it helpful!

    • @robertjordan114
      @robertjordan114 2 years ago +1

      The problem I'm dealing with is that I am looping through some poorly designed tables and building a sql statement to be applied, then appending the output to a list. Not sure if a vectorized approach will work since I have that sql call, but the apply might save me from needing to recreate the df prior to appending every time.

    • @robmulla
      @robmulla  2 years ago +1

      @@robertjordan114 Interesting. Not sure what your data is like- but it can be better a lot of the times to write a nice SQL statement that puts the data in the correct formatting first. That way you put the processing demands on the SQL server and it can usually optimize really well.

    • @robertjordan114
      @robertjordan114 2 years ago +1

      Oh you have no idea, my source table has one column with the name of the column in my lookup table and another with the value that I need to filter on in that lookup table. The loop creates the where clause based on the number of related rows in the initial dataset, and then I'm executing that sql statement to return the values to a python data frame, which I then convert to a pandas data frame and append. Like I said, amateur hour! 🤣

  • @nirbhay_raghav
    @nirbhay_raghav 2 years ago +3

    My man made a df out of the time diff to plot them!! Really useful video. Will definitely keep this in mind from now.

    • @robmulla
      @robmulla  2 years ago

      Haha. Thanks Nirbhay!

  • @OktatOnline
    @OktatOnline 2 years ago +2

    I'm over here as a newbie data scientist, copying the logic step-by-step in order to have good coding habits in the future lmao. Thanks for the video, really valuable!

    • @robmulla
      @robmulla  2 years ago

      Glad you found it helpful!

  • @i-Mik
    @i-Mik 1 year ago +1

    Thanks for the great video! I have a project with some calculations. They take some minutes through the loops. I'm going to use the vectorized way, so I'll write another comment with a comparison later. Some days later... I rewrote a significant part of my code. Made it vectorized, and I got fantastic results. For example: old code - 1m 3s, new code - 6s. One more: old code - 14m 58s, new code - 11s. Awesome!

    • @robmulla
      @robmulla  1 year ago

      So awesome! It's really satisfying when you are able to improve the speed of code by orders of magnitude.

  • @deepakramani05
    @deepakramani05 2 years ago +48

    As I work with Pandas and large datasets, I often come across code that uses iterrows. Most developers just don't care about time, or come from programming backgrounds that kept them from learning efficient methods. I wish more people used vectorization.

    • @robmulla
      @robmulla  2 years ago +5

      Thanks. That’s exactly why I wanted to make this video. Hopefully people will find it helpful.

    • @pr0skis
      @pr0skis 2 years ago +5

      Some of the biggest bottlenecks are from IO... especially when trying to read then concat multiple large Excel files. Shaving a few seconds in the algos just isn't gonna make much of a difference

    • @allenklingsporn6993
      @allenklingsporn6993 2 years ago +5

      @@pr0skis Hard to say that definitively, though, right? You have no idea how anyone is using pandas. If they have slow algos running iteratively, it can very easily become much slower than I/O functions. I've seen some pretty wild pandas use in my business, and a lot of it is really terrible at runtime, especially anything that is wrapped in a GUI (sometimes even with multiprocessing...).

    • @nitinkumar29
      @nitinkumar29 2 years ago +2

      @@pr0skis you can convert the Excel files to CSV and use those, because CSV file IO is faster.

    • @jaimeduncan6167
      @jaimeduncan6167 2 years ago +5

      It's the same with a relational database, we call them the cursor kids. They loop and loop and loop when they can use a set operation to go hundreds of times faster and often with less code.

  • @craftydoeseverything
    @craftydoeseverything 1 year ago

    Hey, I just thought I'd mention, I really appreciate that you use really huge test datasets, since a lot of the time, test datasets used in tutorials are quite small and don't show how code will scale. This video does it perfectly, though!

  • @alexandremachado1014
    @alexandremachado1014 2 years ago +3

    Hey man, nice video! Kudos from reddit!

    • @robmulla
      @robmulla  2 years ago +1

      Glad you enjoyed it. So cool that the reddit community liked this video so much. Hopefully my next one will be as popular.

  • @OPPACHblu_channel
    @OPPACHblu_channel 1 year ago +1

    Somehow I came across the vectorized method first, at the beginning of my Python and pandas journey. Thanks for sharing your experience, lightning fast

    • @robmulla
      @robmulla  1 year ago

      It’s a great thing to learn early!

  • @FilippoGronchi
    @FilippoGronchi 2 years ago +8

    That's another awesome video....extremely useful in the real world work. Thanks again Rob

    • @robmulla
      @robmulla  2 years ago

      Thanks for watching Filippo!

  • @LaHoraMaker
    @LaHoraMaker 2 years ago +1

    I loved that you used Madrid Python user group for the pandas logo :)

    • @robmulla
      @robmulla  2 years ago

      I did?! I didn't even realize. What's the timestamp where I show that logo?

  • @colmduffy2272
    @colmduffy2272 2 years ago +2

    There are several videos on pandas vectorization. This is the best.

    • @robmulla
      @robmulla  2 years ago +1

      I appreciate you saying that! Thanks for watching.

  • @gabriel-mckee
    @gabriel-mckee 2 years ago +7

    Great video! I wish I had known not to loop over my array for my machine learning project... going to go improve my code now!

    • @robmulla
      @robmulla  2 years ago +1

      Glad you learned something new!

  • @hussamcheema
    @hussamcheema 2 years ago +4

    Wow amazing. Please keep making more videos like this.

    • @robmulla
      @robmulla  2 years ago +1

      Thanks for the feedback. I’ll try my best.

  • @anoopbhagat13
    @anoopbhagat13 2 years ago +5

    Wow! That's an excellent way to speed up the code.

    • @robmulla
      @robmulla  2 years ago +1

      Thanks Anoop. Hope your future pandas code is a bit faster because of this video :D

  • @sphericalintegration
    @sphericalintegration 2 years ago +1

    Thank you for this, Rob. This video made me subscribe because in 10 minutes you solved one of my biggest problems.
    And your Boo is right - you are pretty cool. Thanks again, sir.

    • @robmulla
      @robmulla  2 years ago

      That's awesome that I was able to help you out. Check my other videos where I go over similar tips! Glad you agree with my Boo

  • @thebreath6159
    @thebreath6159 2 years ago +1

    Ok this channel is great for data science, I’ll follow

    • @robmulla
      @robmulla  2 years ago

      Thanks for subbing!

  • @artemqqq7153
    @artemqqq7153 2 years ago +1

    Dude, that row[column] thing was a shock to me, thanks!

    • @robmulla
      @robmulla  2 years ago

      Glad you learned something!

  • @ajaybalakrishnan5208
    @ajaybalakrishnan5208 1 year ago +1

    Awesome. Thanks Rob for introducing this concept to me.

  • @Vonbucko
    @Vonbucko 2 years ago +2

    Awesome video man! Appreciate the tips, I'll definitely be subscribing!

    • @robmulla
      @robmulla  2 years ago

      I appreciate that a ton. Share with a friend too!

  • @balajikrishnamoorthy5464
    @balajikrishnamoorthy5464 2 years ago +1

    I am a beginner; I admire your sound knowledge of Pandas

    • @robmulla
      @robmulla  2 years ago

      Thanks for watching. Hope you learned some helpful stuff.

  • @DataScienceconMilton
    @DataScienceconMilton 9 months ago

    Thank you very much for this video Rob. It is very helpful for beginners like me. Have a great day.

  • @mosama22
    @mosama22 2 months ago +1

    Thank you so much for the beautiful video, but two questions please:
    1 - What if we changed the engine in apply to numba:
    df.apply(func, engine = 'numba', raw = True, engine_kwargs = {'nopython': True})
    2 - Do you mind making a video just on vectorization please?
    Thank you again 🙂

  • @pietraderdetective8953
    @pietraderdetective8953 2 years ago +8

    I had always struggled to understand how vectorization works... this video of yours is the one that made it crystal clear for me.
    What a great video!
    Can you please do more of these efficient pandas videos and use some stock market data? Thanks!

    • @robmulla
      @robmulla  2 years ago +1

      Thanks for the feedback. I’m so happy you found this useful. I’ll try my best to do a future video related to stock market data.

  • @sweealamak628
    @sweealamak628 2 years ago +1

    I'm kicking myself now for not finding your video 10 months ago. I'm near the completion of my code and resorted to a mix of iterating For loops and small scale vectorisation by declaring new columns after applying some logic. I seriously need to adopt your methods and redo my code because mine is just not fast enough!

    • @robmulla
      @robmulla  2 years ago

      I totally feel you. It took me years before I understood how important it is to avoid iterating over rows. Once you learn it, all your pandas code will be much faster.

    • @sweealamak628
      @sweealamak628 2 years ago

      @@robmulla I just altered one of my For loops and used your Vectorized approach! Not only is it faster, I did it in just 3 lines of code and the syntax is much easier to read! I feel so embarrassed cos it's much more straightforward than I thought!
      Now the tricky thing is, I work on a time series dataset where I compare previous rows of data to the current row to get the "result". I assume I can use the "shift" method to look back at a previous row of data. If it works, I'm gonna Vectorize everything! THANKS SO MUCH!
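The `shift` idea raised here does keep row-vs-previous-row logic fully vectorized. A sketch with a hypothetical `price` column: `shift(1)` aligns each row with the previous row's value, so the comparison is a single column expression.

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 102, 101, 105]})

# shift(1) moves values down one row, putting the previous row's value
# alongside the current row (the first row gets NaN).
df["prev_price"] = df["price"].shift(1)
df["went_up"] = df["price"] > df["prev_price"]

print(df["went_up"].tolist())  # [False, True, False, True]
```

The first row compares against NaN, which evaluates to False; handle that case explicitly if it matters.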

  • @prodmanaiml9317
    @prodmanaiml9317 2 years ago +8

    More video tips for pandas would be excellent!

    • @robmulla
      @robmulla  2 years ago

      Great suggestion. I'll try to keep the pandas videos coming.

  • @FabioRBelotto
    @FabioRBelotto 2 years ago +2

    Great video. I am working on a df with millions of rows and pandas apply was struggling. I solved it using a vectorized solution as shown. Much much better.
    Could you imagine a situation where vectorization would not be possible?

    • @robmulla
      @robmulla  2 years ago +1

      Glad this helped! As far as examples where vectorization is not possible:
      For example, if you need to perform an operation that requires branching, such as selecting different values based on some condition, vectorization may not be possible. In this case, you would need to use a loop or some other non-vectorized approach to perform the operation.
      Another example where vectorization may not be possible is when working with datasets that have varying lengths or shapes. In this case, it may not be possible to perform operations on the entire dataset using vectorized methods.
      Hope that helps.

  • @bigoper
    @bigoper 7 days ago

    NEW SUB here, this is MAGIC!
    Thank you!

  • @Graham_Wideman
    @Graham_Wideman 1 year ago +1

    1:19 "a random integer between one and 100." I believe that should be from 0 to 99 (ie: inclusive at both ends). In case nobody else mentioned it.

    • @robmulla
      @robmulla  1 year ago

      Good catch! I think you are the first to point that out.
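The off-by-one Graham points out comes from NumPy's half-open convention: `np.random.randint(low, high)` samples from `[low, high)`, so `high` itself is never drawn. (The exact call used in the video isn't shown here, so the bounds below are illustrative.)

```python
import numpy as np

np.random.seed(0)
# randint samples from the half-open interval [low, high): here 1..99, never 100.
vals = np.random.randint(1, 100, size=100_000)

print(vals.min(), vals.max())
```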

  • @Sinke_100
    @Sinke_100 1 year ago +1

    Cool. For a really large dataset, when the conditions aren't too complicated, that vectorized method is amazing. Apply is a nice alternative because you can write a function. There should be a module that converts normal functions into this vectorized syntax, because it's quite complicated to write

    • @robmulla
      @robmulla  1 year ago

      Glad it was helpful! There are some packages that compile functions (numba's jit), and there is also np.vectorize

    • @Sinke_100
      @Sinke_100 1 year ago

      @@robmulla I tried playing with it a bit. Pandas is similar to numpy, and I've worked with numpy quite a bit. I tried putting the bool calculation in a function with 3 distinct conditions for age, pct_sleeping and time in bed; the final return value was the combined condition. df.loc supports putting a function directly in its statement, so I did that. Finally I compared the DataFrames created with both methods, and they are the same.
      My suggestion is that you should explain that more complex stuff in more depth.

  • @mic9657
    @mic9657 1 year ago +1

    great tips! and very well presented

    • @robmulla
      @robmulla  1 year ago

      Glad you like it. Thanks for watching.

  • @GregZoppos
    @GregZoppos 2 years ago +1

    Wow, thanks! I'm a beginner in data science, this is really interesting to me.

    • @robmulla
      @robmulla  2 years ago

      Great to hear! Good luck in your data science journey.

  • @blogmaster7920
    @blogmaster7920 2 years ago +1

    This can be really helpful when moving data from one source to another over the Internet.

    • @robmulla
      @robmulla  2 years ago

      Absolutely, compressing can make any data transfer faster.

  • @kingj5983
    @kingj5983 8 months ago

    Wow, awesome video, thanks! Although it takes time to figure out how to turn my limit conditions into a logical calculation that returns a bool dataframe

  • @incremental_failure
    @incremental_failure 1 year ago

    Vectorization is the whole point of Pandas. But there are cases where vectorization is impossible and you need to process row-by-row, in that case it's best to switch to numba for a precompiled function.

  • @kanishkpareek6650
    @kanishkpareek6650 1 year ago

    Your teaching style is awesome. Where can I find your videos in a structured manner?

  • @kennethstephani692
    @kennethstephani692 2 years ago +2

    Great video!!!

  • @rockwellshabani5180
    @rockwellshabani5180 2 years ago +4

    Would vectorization also be faster than an np.where statement with multiple conditions?

    • @robmulla
      @robmulla  2 years ago

      Great question! I think someone tested it out in the reddit thread where I posted it and found maybe a slight speed increase over the vectorized version.

    • @georgebrandon7696
      @georgebrandon7696 2 years ago +1

      np.where() is what I use almost exclusively. However, it tends to be a little unreadable if you need to use additional if statements to go from binary (either or) to 3 or more possible values. Of course, one could also nest np.where() statements too. :)

  • @RichieStockholm
    @RichieStockholm 2 years ago +2

    I expect a video about moped gangs in the future, Rob.

    • @robmulla
      @robmulla  2 years ago

      That’s a great idea Richie! I practically majored in moped gangs in college. 😂

  • @andrew3068
    @andrew3068 2 years ago +1

    Super awesome video.

    • @robmulla
      @robmulla  2 years ago

      I appreciate that. Thanks for commenting!

  • @johnidouglasmarangon
    @johnidouglasmarangon 2 years ago +1

    Great video Bob, thanks.
    I'm curious, which interface for Jupyter Notebook are you using?

    • @robmulla
      @robmulla  2 years ago

      Glad you liked it. This is jupyterlab with the solarized dark theme. Check out my full video on jupyter where I go into detail about it.

    • @johnidouglasmarangon
      @johnidouglasmarangon 2 years ago

      @@robmulla Tks Bob ✌️

    • @robmulla
      @robmulla  2 years ago

      @@johnidouglasmarangon no problem. Jane!

  • @mltamarlin
    @mltamarlin 3 months ago

    You should plot the results on a log plot, which would give the fold speedup for each.

  • @bm647
    @bm647 2 years ago +1

    Great video! very useful

    • @robmulla
      @robmulla  2 years ago

      Glad you found it useful!

  • @spicytuna08
    @spicytuna08 2 years ago +2

    oh my!!! awesome. thanks!!!

  • @ledestonilo7274
    @ledestonilo7274 2 years ago +1

    Interesting. Thank you, will try it.

    • @robmulla
      @robmulla  2 years ago

      Awesome! Let me know how it goes.

  • @vinitjha_
    @vinitjha_ 1 year ago

    Which font do you use? That's an awesome font and color scheme

  • @cbritton27
    @cbritton27 2 years ago +18

    I had a similar situation creating a new column based on conditions. My data set has 520,000 records so the apply was very slow. I got good results with using the select function from numpy. I'm curious how that would compare to the vectorization in your case.
    Edit: in my case, the numpy select is slightly faster than the vectorization.

    • @robmulla
      @robmulla  2 years ago +2

      Thanks for sharing. It would be cool to see an example code snippet similar to what I used in this video for comparison.

    • @linkernick5379
      @linkernick5379 2 years ago

      The Polars lib is quite fast with my 1-million-row dataset, I recommend trying it.
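A sketch of the `np.select` approach cbritton27 describes, with toy data and the column names used in the video treated as assumptions. Conditions are checked in order, and `default` covers the "else" branch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [95, 30, 40],
    "pct_sleeping": [0.1, 0.9, 0.2],
    "time_in_bed": [2, 8, 3],
    "favorite_food": ["pizza", "tacos", "sushi"],
    "hate_food": ["eggs", "beets", "kale"],
})

conditions = [
    df["age"] > 90,
    (df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5),
]
choices = [df["favorite_food"], df["favorite_food"]]

# np.select evaluates the conditions in order; rows matching none get `default`.
df["reward"] = np.select(conditions, choices, default=df["hate_food"])

print(df["reward"].tolist())  # ['pizza', 'tacos', 'kale']
```

Compared with the `df.loc[mask, ...] = ...` pattern, `np.select` scales more readably to three or more branches.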

  • @djangoworldwide7925
    @djangoworldwide7925 1 year ago

    As an R user we use vectorization using mutate without even thinking about the other methods for such task. R is so much more suitable for data science and wrangling

  • @dreamdeckup
    @dreamdeckup 2 years ago +6

    I had to do the same thing in my first internship lol. The script went from 4 hours to like 10 minutes to run

    • @robmulla
      @robmulla  2 years ago

      Yea, when I learned this it 100% changed the way I write pandas code.

  • @justsayin...1158
    @justsayin...1158 1 year ago

    It's a great tip, but I don't feel like I understood what vectorized means, or how I make a function vectorized. Is it just creating the boolean array by applying the conditions to the whole data frame in this way, or are there other ways to vectorize as well?

  • @Reftsquabble
    @Reftsquabble 2 years ago +1

    i _am_ excited! show the solution in machine code next pls thx

    • @robmulla
      @robmulla  2 years ago

      Working on it…

  • @ersineser7610
    @ersineser7610 2 years ago +3

    Thank you very much for great video.

    • @robmulla
      @robmulla  2 years ago

      Glad you liked it! Thanks for the feedback.

  • @FF-ct5dr
    @FF-ct5dr 2 years ago +2

    The Pandas doc literally tells you that iterrows is slow and should be avoided lol. As for vectorization, Pandas uses (slightly tweaked so to hold different types) numpy arrays which are hosted in continuous memory blocks... So ofc vectorization will be faster than apply/map.

    • @robmulla
      @robmulla  2 years ago

      Yep. This is obvious to a seasoned veteran, but as I mentioned in the video, for many newbies who haven't read the docs and aren't fully aware of the backend, they don't know that iterrows is a bad idea.

    • @Geza_Molnar_
      @Geza_Molnar_ 2 years ago

      @@robmulla Maybe, when you have time for that, you could publish a video that describes to newbies what "RTFM" means, and what is the benefit of that. You are popular, a role model for some 🙂
      (in this case "M" -> docs)

  • @blakete
    @blakete 2 years ago +1

    What theme are you using to get your Jupyter Notebook to look like that?

    • @robmulla
      @robmulla  2 years ago

      Solarized dark theme. I have a whole video about my jupyter setup

  • @A372575
    @A372575 1 year ago

    Thanks, one query: in the case of vectorize, which one would be faster - np.where or the method you mentioned?

  • @diegoalmeida2221
    @diegoalmeida2221 2 years ago +2

    Nice video, though in some cases we want to use a specific complex function from a library. The apply method works fine for that case. But is there a way to use it with vectorization?

    • @robmulla
      @robmulla  2 years ago +1

      You can try to vectorize using something like numba. But it depends on the complexity of the function.

  • @alysmtech3683
    @alysmtech3683 2 years ago

    Jesus, I'm over here blowing up my laptop. Had no idea, thank you!

    • @robmulla
      @robmulla  2 years ago +1

      Hah. My name is Rob. But glad you learned something new.

  • @mbcebrix
    @mbcebrix 2 years ago +1

    Is vectorization applicable for huge datasets? Like millions of rows, for example.

    • @robmulla
      @robmulla  2 years ago

      If it can fit in your computer’s memory then yes!

  • @andrewcoyne9768
    @andrewcoyne9768 1 year ago +1

    Love the video, thanks Rob! Is there a vectorized way to create a column that is the sum of several columns?
    I tried df['total'] = df.iloc[:, 5:13].sum(), which was way faster but returned all NaN values.
    Any help would be appreciated.

    • @robmulla
      @robmulla  1 year ago +1

      So close! I think all you need to do is change it to `sum(axis=1)` and it should work!

    • @andrewcoyne9768
      @andrewcoyne9768 1 year ago

      @@robmulla Brilliant! Works perfect now. Thanks for the quick reply
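The fix in action on a toy frame: plain `.sum()` reduces each column to a single value indexed by column label (which is why the original assignment came back all NaN), while `axis=1` sums across columns for each row.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 sums across the selected columns within each row.
df["total"] = df.iloc[:, 0:2].sum(axis=1)

print(df["total"].tolist())  # [4, 6]
```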

  • @abdulkadirguven1173
    @abdulkadirguven1173 1 year ago

    Thank you very much Rob.

  • @bgotura
    @bgotura 1 year ago

    I love how that Pandas logo has cannibalized the city of Madrid (Spain) logo

  • @danielbrett247
    @danielbrett247 2 years ago +1

    Not everything can be vectorized, commonly when processing time series data. For those cases, a great tool to know about is numba's njit.

    • @robmulla
      @robmulla  2 years ago +1

      Agreed. njit / numba can be great when you need pseudo-compiled python code.

  • @moodiiie
    @moodiiie 2 years ago +1

    That’s all I do at work, vectorize is the way to go. I was able to do some complex logic with them.

  • @Hitdouble
    @Hitdouble 2 years ago +1

    What theme do you use in your notebook?

    • @robmulla
      @robmulla  2 years ago

      I have a whole video on my jupyter lab setup. But it’s just the solarized dark theme.

  • @alanhouston5874
    @alanhouston5874 1 year ago +1

    Can you save lists using Parquet?
    Or is it only applicable to dataframes?

  • @dh00mketu
    @dh00mketu 2 years ago +1

    Why didn't you remove get rewards function from other run times?

    • @robmulla
      @robmulla  2 years ago

      Oops. Did I do it incorrectly? Can you share the timestamp?

  • @PeterSeres
    @PeterSeres 2 years ago +12

    Nice video! Thanks for detailed explanation. My only problem with this is that I often have to apply functions that depend on sequential time data and a loop setup makes the most sense since the next time step depends on the previous time steps.
    Are there some advanced methods on how to set up more complex vectorized functions that don't fit into a one-liner expression?

    • @robmulla
      @robmulla  2 years ago +5

      Yes there are! I think I'll probably make a few more videos on the topic considering how interested people seem in this. But I'd suggest if you can do any of your processing that goes across rows in groups - first do a `groupby()` and then you can multiprocess the processing of each group on a different CPU thread. If you have 8 or 16 CPU threads you can speed things up a lot!

    • @DrewLevitt
      @DrewLevitt 2 years ago +2

      Pandas has a lot of useful time series methods, but without knowing exactly what you're trying to do, it'd be hard to suggest any specific functions. But if you only need to refer to step (n-1) when processing step n, you can use df.shift() to store step n-1 IN the row for step n. Hope this helps!
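The two replies combine nicely: `groupby` plus `shift` gives each row its group-local "previous row" with no Python-level loop. A sketch with hypothetical sensor data:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["a", "a", "a", "b", "b"],
    "reading": [10, 12, 11, 5, 7],
})

# shift(1) inside a groupby never crosses group boundaries, so each group's
# first row gets NaN rather than the previous group's last reading.
df["prev"] = df.groupby("sensor")["reading"].shift(1)
df["delta"] = df["reading"] - df["prev"]

print(df["delta"].tolist())  # [nan, 2.0, -1.0, nan, 2.0]
```

When a step genuinely needs the *computed* previous output (not just the previous input), this no longer works and the numba route mentioned elsewhere in the comments applies.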

  • @Chris_87BC
    @Chris_87BC 1 year ago

    Great video! I am currently looping through a data frame column for each customer and print the data to PDF. Is there a vectorized version that can be much faster?

  • @Atlas92936
    @Atlas92936 2 years ago +1

    Nice glasses! where are they from?

    • @robmulla
      @robmulla  2 years ago

      Haha. Thanks! 🤓 - They are from warby parker, but I accidentally broke this pair :(

  • @rahulkmail
    @rahulkmail 7 months ago

    Thanks for sharing this nice information

  • @chndrl5649
    @chndrl5649 2 years ago +1

    Could also use query instead of loc

    • @robmulla
      @robmulla  2 years ago

      Not sure that would work for this case because we aren’t straight filtering.

  • @adamleon8504
    @adamleon8504 1 year ago

    In these cases it is easy to vectorize, but how can you vectorize when the process, or the function that needs the df as input, is more complex? For example, can you vectorize a procedure that selects specific rows (not one column) based on a condition, and then uses those elements to perform calculations with a row offset rather than within the same row, for example df.loc[i,"A"] - df.loc[i-1,"B"]?

  • @scientistgeospatial
    @scientistgeospatial 2 years ago +1

    Thanks Rob

    • @robmulla
      @robmulla  2 years ago +1

      Thanks for watching!

  • @MrJak3d
    @MrJak3d 2 years ago +2

    Damn, I knew lvl 2 but lvl 3 was awesome!

    • @robmulla
      @robmulla  2 years ago +1

      Thanks Jake! Yea, vectorized functions are super fast. If you can't vectorize then there are other ways to make it faster (like chunking and multiprocessing)... I might make a video about that next!

  • @Pranavshashi
    @Pranavshashi 2 years ago +1

    I'm new so this might be a silly question. I am using an API to get additional data for each row in my dataset. Can I use vectorized approach while making API calls as well?

    • @robmulla
      @robmulla  2 years ago +1

      That's kind of different. You just want to gather the results from the api as fast as possible. Check out something like async calls to the api. This might help: stackoverflow.com/questions/71232879/how-to-speed-up-async-requests-in-python

    • @Pranavshashi
      @Pranavshashi 2 years ago

      @@robmulla thanks!

  • @kathrynpanger2289
    @kathrynpanger2289 2 years ago +2

    What if I want to apply a more complicated or non-numeric test, like instead of df['pct-sleeping'] > 0.5 I was looking at whether "teeth" was in df['dream-themes'], (a list of the tags concerning the things the sleeper dreamt about e.g. [teeth, whale, dog, slide, school]). Is the only way to do this by with .apply or can this still be vectorized?

    • @robmulla
      @robmulla  2 years ago +1

      This is a good question. I think it depends on the dtype of the dream-themes column. Would it only contain a single value or potentially multiple ones? Check the 'isin' function in the pandas docs; it's a vectorized way of doing this.

    • @DrewLevitt
      @DrewLevitt 2 years ago

      I haven't tested this but try df['dream-themes'].str.contains('teeth'). If df['dream-themes'] is a bunch of comma-delimited strings, you should be good to go (but watch out for partial matches e.g. "teeth whitening" contains "teeth"); not sure whether this will work if df['dream-themes'] contains a bunch of proper lists. Try it and let me know!
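Both suggestions from this thread sketched on hypothetical data: `str.contains` works when the column holds delimited strings, `isin` when each cell holds a single value.

```python
import pandas as pd

# Delimited strings: str.contains is a vectorized substring test.
df = pd.DataFrame({"dream_themes": ["teeth,whale,dog", "slide,school", "dog,teeth"]})
has_teeth = df["dream_themes"].str.contains("teeth")
print(has_teeth.tolist())  # [True, False, True]

# Single values per cell: isin tests membership against a set of options.
s = pd.Series(["teeth", "whale", "dog"])
print(s.isin(["teeth", "dog"]).tolist())  # [True, False, True]
```

As noted above, `str.contains` matches substrings, so "teeth whitening" would also match "teeth"; anchor the pattern or split the string first if that matters.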

  • @GroWithUmar
    @GroWithUmar 2 years ago +2

    Amazing video

    • @robmulla
      @robmulla  2 years ago

      Thanks for the feedback!

  • @elgoogffokcuf
    @elgoogffokcuf 2 years ago +1

    What about Numba? If it can bring some more optimization, it would be nice if you made a video on it.

    • @robmulla
      @robmulla  1 year ago +1

      Numba/jit is great for speeding up more complex operations. I've had limited experience with it, but in every case it really sped things up. Doing it as a video is a good idea.

  • @agoodwin-8127
    @agoodwin-8127 2 years ago +6

    I typically use numpy where in this situation (mainly because I like the syntax better!), so I was curious about its speed vs. the level 3 solution. np.where ran a little faster (~15-20% for datasets sized 10K-50M records).
    # level 3 - vectorized
    %%timeit
    df = get_data(10_000)
    df['reward'] = df['hate_food']
    df.loc[((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5))
           | (df['age'] > 90), 'reward'] = df['favorite_food']
    # 3.74 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # level 4 - where
    %%timeit
    df = get_data(10_000)
    df['reward'] = np.where(((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5))
                            | (df['age'] > 90),
                            df['favorite_food'],
                            df['hate_food'])
    # 3.15 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    • @robmulla
      @robmulla  2 years ago +1

      I love that you ran that experiment! I'm actually surprised the numpy version isn't even faster. Thanks for sharing.

    • @georgebrandon7696
      @georgebrandon7696 2 years ago +1

      @@robmulla I'll throw another wrench into this. df.at[] vs df.loc[]. df.at[] is considerably faster than df.loc[]. But I've never ran conditionals with df.at[]. I'm also an np.where() user. :)
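      For what it's worth, .at addresses a single scalar cell by label, which is why it can't take a boolean mask like .loc; a minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# .at reads/writes exactly one cell by row label and column name;
# no masks or slices, which is part of why it is faster than .loc
df.at[1, "x"] = 99
value = df.at[1, "x"]
print(value)  # 99
```

      So .at is great inside tight per-cell code, but conditional updates still need .loc or np.where.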

  • @YuanYuan-uk8sz
    @YuanYuan-uk8sz 1 year ago

    Thank you for your extremely helpful video, it was perfect for me, love it so much

    • @robmulla
      @robmulla  1 year ago

      I'm so glad! Share it with a friend or two who you think might also appreciate it.

  • @jrwkc
    @jrwkc 2 years ago +1

    When you vectorize with loc, don't you have to mask the right side of the assignment too? df['favorite_food'] is not masked; it's the whole array, right? So aren't you setting the reward to the first N values of df['favorite_food'], where N is the length of the mask?

    • @robmulla
      @robmulla  2 years ago

      I don't think so because pandas will use the index when populating. But I'm also not 100% sure.

    • @jrwkc
      @jrwkc 2 years ago

      @@robmulla Make GitHub repos so we can test! That would be great.
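      A quick experiment (hypothetical values) suggests pandas does align the right-hand side by index rather than taking the first N values:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 95, 40],
    "favorite_food": ["pizza", "tacos", "sushi"],
    "reward": ["-", "-", "-"],
})

# Only row 1 matches the mask; it receives its OWN favorite_food
# ("tacos"), not the first element of the unmasked column ("pizza")
df.loc[df["age"] > 90, "reward"] = df["favorite_food"]
print(df["reward"].tolist())  # ['-', 'tacos', '-']
```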

  • @rururu_en
    @rururu_en 4 months ago

    Now, instead of the two steps plus .loc in the third example, use np.where for a one-liner and it will be even faster.
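    For instance, a one-line version of the same reward logic with np.where (same hypothetical columns as the video):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pct_sleeping": [0.9, 0.2, 0.3],
    "time_in_bed": [8, 2, 4],
    "age": [30, 95, 40],
    "favorite_food": ["pizza", "tacos", "sushi"],
    "hate_food": ["eggs", "kale", "liver"],
})

# One pass: pick favorite_food where the condition holds, else hate_food
mask = ((df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5)) | (df["age"] > 90)
df["reward"] = np.where(mask, df["favorite_food"], df["hate_food"])
print(df["reward"].tolist())  # ['pizza', 'tacos', 'liver']
```

    This avoids both the default-value assignment and the separate .loc overwrite.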

  • @DK-rl1sf
    @DK-rl1sf 2 years ago +2

    Is there a way to use np.vectorize() instead of df.loc so things are tidier?

    • @robmulla
      @robmulla  2 years ago

      That's a great point. I've used np.vectorize before but not too frequently. I agree the current solution isn't very clean to read and could be much tidier.

    • @DK-rl1sf
      @DK-rl1sf 2 years ago

      @@robmulla No, this is not so much a suggestion and more a question. I'm new to this and literally don't know. I had previously read about np.vectorize(). I tried doing your vectorize method but using np.vectorize but couldn't figure out the syntax.

    • @robmulla
      @robmulla  2 years ago

      ​@@DK-rl1sf Yea, there is a lot of overhead when using pandas instead of numpy - but you get the benefit of named columns, easy filtering and sorting. In my experience np.vectorize worked but i was working with just numpy arrays not pandas dataframes.

  • @leejunzhao
    @leejunzhao 2 years ago +1

    Question: I followed your code, and an error came out saying "'reward_calc' is not defined".

    • @wenbozhao4325
      @wenbozhao4325 2 years ago +2

      In the video the function name has no letter "c" at the end.

    • @robmulla
      @robmulla  2 years ago

      Oh. Good catch. Sorry it was confusing

  • @alexisdebrand6209
    @alexisdebrand6209 2 years ago +1

    So useful, thank you!!!!

    • @robmulla
      @robmulla  2 years ago

      You're welcome! Thanks for commenting.

  • @saxegothaea_conspicua
    @saxegothaea_conspicua 1 year ago

    Can you vectorize using query? I suppose you can't.
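    query evaluates its condition in a vectorized way, but it returns a copy, so you can't assign through it directly; one workaround (toy data) is to assign via the filtered index:

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 95], "reward": ["-", "-"]})

# .query filters vectorized, but the result is a copy of the frame
idx = df.query("age > 90").index

# Write back through .loc using the matching index labels
df.loc[idx, "reward"] = "favorite_food"
print(df["reward"].tolist())  # ['-', 'favorite_food']
```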

  • @amitamola2014
    @amitamola2014 2 years ago +1

    So what about the scenario where we want to perform the same operation but only on one column?
    Such as, if pct_sleep

    • @robmulla
      @robmulla  2 years ago

      You can use something like qcut in this case, or a vectorized statement with and/or conditions.
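      A tiny sketch of the qcut route (hypothetical sleep fractions), which bins a single column by quantiles in one vectorized call:

```python
import pandas as pd

pct_sleeping = pd.Series([0.1, 0.4, 0.6, 0.9])

# Split at the median into two labeled buckets
buckets = pd.qcut(pct_sleeping, q=2, labels=["light", "heavy"])
print(buckets.tolist())  # ['light', 'light', 'heavy', 'heavy']
```

      pd.cut works the same way when you want fixed bin edges instead of quantiles.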

  • @SillyLittleMe
    @SillyLittleMe 2 years ago +2

    Hey, this is a great video and truly shows the benefit of vectorisation.
    I would like to point out that always remembering the vectorized way of writing is hard. Fortunately, the NumPy module provides a neat method called "vectorize" that vectorizes your non-vectorized function.
    An example (from the docs):
    ## this is the function
    def myfunc(a, b):
        "Return a-b if a>b, otherwise return a+b"
        if a > b:
            return a - b
        else:
            return a + b
    ## vectorising the function and then applying it
    vfunc = np.vectorize(myfunc)
    vfunc([1, 2, 3, 4], 2)
    array([3, 4, 1, 2])
    This works on DataFrames as well.
    Do note, though, that this is not true vectorisation; because of that, in some cases, it performs similarly to functions like "apply". However, for the most part it does a tremendous job and has significantly increased the speed of my functions.
    The reasons why it is not "true vectorisation" are mentioned in this thread: stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c

    • @robmulla
      @robmulla  2 years ago +1

      Thanks! I've used that before and it does come in handy. Also using things like jit/numba can compile numpy operations.

  • @TeXiCiTy
    @TeXiCiTy 2 years ago +1

    For looping over big datasets I switch to polars when speed becomes an issue.

    • @robmulla
      @robmulla  2 years ago

      I have an entire video on my channel about polars. It’s great! Check it out.

  • @lucienjaegers2028
    @lucienjaegers2028 2 years ago +1

    Nice trick, but what if you code it completely in C / C++ / Rust? Literature says those are 50 - 80 times faster?

    • @robmulla
      @robmulla  2 years ago +1

      I have a whole video on polars, which is written in rust. It’s faster for sure. But keep in mind pandas backend is just C code.

  • @JoseStev
    @JoseStev 1 year ago

    Use polars?

  • @onganxiety8719
    @onganxiety8719 5 months ago

    This is the basic way all pandas code should be written, but if you really need it to scale for heavy memory consumption and lots of calculations, use njit for numerics, or, even easier, just use Polars for everything.

  • @pr0skis
    @pr0skis 2 years ago +2

    I wonder how it compares to np.where

    • @robmulla
      @robmulla  2 years ago

      I guess it depends.

  • @ErikS-
    @ErikS- 2 years ago +1

    3.5 seconds for a for loop with only 10k rows...
    Is this done in a Docker container or another VM(-like) environment?

    • @robmulla
      @robmulla  2 года назад

      Just done locally on my fairly beefy machine.