Whoa.. 3500 times difference. Vectorised is even faster than apply, will give it a try next time for sure. Awesome video as always.
Thanks Sanskar. Yes, using vectorized functions is always much faster. In some cases it's not possible but then there are other ways to speed it up. I might show that in another video if this one is popular.
@@robmulla can you make a video on how to make pandas resample faster ?
That’s my husband! He’s so cool.
Love you boo. 😘
Fully agree!
Aww! This is wholesome!
I've been watching your videos non stop for the last week! And I enjoy the comments from your SO!!! Lovely!
He is awesome. Taking time out of his day to share knowledge 👏
This is spot on. I had a filter running that was going to take 2 days to complete on a 12M line CSV file using iteration - clearly not good. Now it takes 6 seconds.
Also, a way to speed it up is to not use & and | for 'and' and 'or' but just use the words 'and' and 'or'. These words are made for boolean expressions and thus work faster; & and | are bitwise operators and are made for integers. Using them forces Python to turn the booleans into integers, do the bitwise operation, and then cast the result back to a boolean. That doesn't take much time if you do it once, but in a test scenario inspired by this video it was roughly 45% slower.
Nice tip! I didn’t know that.
I made the experiment. It is ready to run. What you have suggested is coded below. It is approximately 20 percent faster.
import timeit
setup = 'import random; random_list = random.sample(range(1,101),100)'
# with or
first_code = '''\
result_1 = [rand for rand in random_list if (rand >75) or (rand 75) | (rand
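For reference, here is a complete, runnable sketch of that timing comparison. The second condition in the comment above was cut off by the comment formatting, so `rand < 25` is assumed here purely for illustration:

import timeit

setup = 'import random; random_list = random.sample(range(1, 101), 100)'

# logical operators (short-circuit on plain Python booleans)
or_code = 'result = [rand for rand in random_list if (rand > 75) or (rand < 25)]'

# bitwise operators on the same booleans
pipe_code = 'result = [rand for rand in random_list if (rand > 75) | (rand < 25)]'

print('or :', timeit.timeit(or_code, setup=setup, number=100_000))
print('|  :', timeit.timeit(pipe_code, setup=setup, number=100_000))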
@@kazmkazm9676 The difference was even bigger between & and 'and', if I remember correctly.
Great, never realized that. Will start using 'and' and 'or' now onwards.
That is unbelievable. Astounding time difference. I was recently watching a presentation on a candlestick algorithm, and the presenter used the vectorised method and I was confused (I am new to Python), but this video made it all too clear. Fantastic presentation.
Glad you found it interesting. Thanks for watching!
Haha coincidentally I'd been raving about vectorized to my friends the last few months. It's soo good. The moment I saw your title I figured you're probably talking about vectorize too haha. Awesome video and great content!!
You called it! Thanks for the positive feedback. Hope to create more videos like it soon.
Same! Still hasn't worked on picking up chicks at the bar, but I'm not giving up.
I didn't realize you could write 10k as 10_000. I work with astronomical units, so this makes variables more readable. Great video!
Thanks! Yes, they introduced that functionality with underscores in numbers with python 3.6 - it really helps make numbers more readable.
😊😊😊
😊
Man where have you been all my Python-Life!?!? Thank you so much for this! Outstanding!!!
Thanks Robert for watching. Glad you found it helpful!
The problem I'm dealing with is that I am looping through some poorly designed tables, building a SQL statement to be applied, and then appending the output to a list. Not sure if a vectorized approach will work since I have that SQL call, but apply might save me from needing to recreate the df prior to appending every time.
@@robertjordan114 Interesting. Not sure what your data is like, but a lot of the time it can be better to write a nice SQL statement that puts the data in the correct format first. That way you put the processing demands on the SQL server, and it can usually optimize really well.
Oh you have no idea, my source table has one column with the name of the column in my lookup table and another with the value that I need to filter on in that lookup table. The loop creates the where clause based on the number of related rows in the initial dataset, and then I'm executing that SQL statement to return the values to a Python data frame, which I then convert to a pandas data frame and append. Like I said, amateur hour! 🤣
My man made a df out of the time diff to plot them!! Really useful video. Will definitely keep this in mind from now.
Haha. Thanks Nirbhay!
I'm over here as a newbie data scientist, copying the logic step-by-step in order to have good coding habits in the future lmao. Thanks for the video, really valuable!
Glad you found it helpful!
Thanks for the great video! I have a project with some calculations that take some minutes in the loops. I'm going to use the vectorized way, so I'll write another comment with a comparison later. Some days later... I rewrote a significant part of my code and made it vectorized, and I got fantastic results. For example: old code - 1m 3s, new code - 6s. One more: old code - 14m 58s, new code - 11s. Awesome!
So awesome! It's really satisfying when you are able to improve the speed of code by orders of magnitude.
As I work with Pandas and large datasets, I come across code that uses iterrows often. Most developers just don't care about time, or come from programming backgrounds that keep them from using efficient methods. I wish more people used vectorization.
Thanks. That’s exactly why I wanted to make this video. Hopefully people will find it helpful.
Some of the biggest bottlenecks are from IO... especially when trying to read then concat multiple large Excel files. Shaving a few seconds off the algos just isn't gonna make much of a difference.
@@pr0skis Hard to say that definitively, though, right? You have no idea how anyone is using pandas. If they have slow algos running iteratively, it can very easily become much slower than I/O functions. I've seen some pretty wild pandas use in my business, and a lot of it is really terrible at runtime, especially anything that is wrapped in a GUI (sometimes even with multiprocessing...).
@@pr0skis You can convert the Excel files to CSV and then use the CSV files, because CSV file IO is faster.
It's the same with a relational database, we call them the cursor kids. They loop and loop and loop when they can use a set operation to go hundreds of times faster and often with less code.
Hey, I just thought I'd mention, I really appreciate that you use really huge test datasets, since a lot of the time the test datasets used in tutorials are quite small and don't show how code will scale. This video does it perfectly, though!
Hey man, nice video! Kudos from reddit!
Glad you enjoyed it. So cool that the reddit community liked this video so much. Hopefully my next one will be as popular.
Somehow I came across the vectorized method right at the beginning of my Python and pandas journey. Thanks for sharing your experience - lightning fast.
It’s a great thing to learn early!
That's another awesome video....extremely useful in the real world work. Thanks again Rob
Thanks for watching Filippo!
I loved that you used Madrid Python user group for the pandas logo :)
I did?! I didn't even realize. What's the timestamp where I show that logo?
There are several videos on pandas vectorization. This is the best.
I appreciate you saying that! Thanks for watching.
Great video! I wish I had known not to loop over my array for my machine learning project... going to go improve my code now!
Glad you learned something new!
Wow amazing. Please keep making more videos like this.
Thanks for the feedback. I’ll try my best.
Wow ! That's an excellent way of speed up the code.
Thanks Anoop. Hope your future pandas code is a bit faster because of this video :D
Thank you for this, Rob. This video made me subscribe because in 10 minutes you solved one of my biggest problems.
And your Boo is right - you are pretty cool. Thanks again, sir.
That's awesome that I was able to help you out. Check my other videos where I go over similar tips! Glad you agree with my Boo
Ok this channel is great for data science, I’ll follow
Thanks for subbing!
Dude, that row[column] thing was a shock to me, thanks!
Glad you learned something!
Awesome. Thanks Rob for introducing this concept to me.
Happy it helped!
Awesome video man! Appreciate the tips, I'll definitely be subscribing!
I appreciate that a ton. Share with a friend too!
I am a beginner; I admire your sound knowledge of Pandas.
Thanks for watching. Hope you learned some helpful stuff.
Thank you very much for this video Rob. It is very helpful for beginners like me. Have a great day.
Thank you so much for the beautiful video, but two questions please:
1 - What if we changed the engine in apply to numba:
df.apply(func, engine = 'numba', raw = True, engine_kwargs = {'nopython': True})
2 - Do you mind making a video just on vectorization, please?
Thank you again 🙂
I have always struggled to understand how vectorization works... this video of yours is the one that made it crystal clear for me.
What a great video!
Can you please do more of these efficient pandas videos and use some stock market data? Thanks!
Thanks for the feedback. I’m so happy you found this useful. I’ll try my best to do a future video related to stock market data.
I'm kicking myself now for not finding your video 10 months ago. I'm near the completion of my code and resorted to a mix of iterating For loops and small scale vectorisation by declaring new columns after applying some logic. I seriously need to adopt your methods and redo my code because mine is just not fast enough!
I totally feel you. It took me years before I really understood how important it is to avoid iterating over rows. Once you learn it, all your pandas code will be much faster.
@@robmulla I just altered one of my For loops and used your vectorized approach! Not only is it faster, I did it in just 3 lines of code and the syntax is much easier to read! I feel so embarrassed because it's much more straightforward than I thought!
Now the tricky thing is, I work on a time series dataset where I compare previous rows of data to the current row to get the "result". I assume I can use the "shift" method to look back at a previous row of data. If it works, I'm gonna Vectorize everything! THANKS SO MUCH!
More video tips for pandas would be excellent!
Great suggestion. I'll try to keep the pandas videos coming.
Great video. I am working on a DataFrame with millions of rows and pandas apply was struggling. I solved it using a vectorized solution as shown. Much, much better.
Could you imagine a situation where vectorization would be not possible?
Glad this helped! As far as examples where vectorization is not possible:
For example, if you need to perform an operation that requires branching, such as selecting different values based on some condition, vectorization may not be possible. In this case, you would need to use a loop or some other non-vectorized approach to perform the operation.
Another example where vectorization may not be possible is when working with datasets that have varying lengths or shapes. In this case, it may not be possible to perform operations on the entire dataset using vectorized methods.
Hope that helps.
NEW SUB here, this is MAGIC!
Thank you!
1:19 "a random integer between one and 100." I believe that should be from 0 to 99 (ie: inclusive at both ends). In case nobody else mentioned it.
Good catch! I think you are the first to point that out.
Cool. For really large datasets, when the conditions aren't too complicated, that vectorized method is amazing. Apply is a nice alternative because you can write a function. There should be a module that converts normal functions into this vectorized syntax, because it's quite complicated to write.
Glad it was helpful! There are some packages that compile functions (numba/jit), and there is also np.vectorize.
@@robmulla I tried to play a bit with it. Pandas is similar to numpy, and I've worked with numpy quite a bit. I put the boolean calculation in a function with 3 distinct conditions for age, pct_sleeping and time in bed, and the final return value was the combined condition. df.loc supports putting a function directly in its statement, so I did that. Finally I compared the DataFrames created with both methods, and they are the same.
My suggestion is that you should explain that more complex stuff in more depth.
great tips! and very well presented
Glad you like it. Thanks for watching.
Wow, thanks! I'm a beginner in data science, this is really interesting to me.
Great to hear! Good luck in your data science journey.
This can be really helpful, when moving data from one source to another through Internet.
Absolutely, compressing can make any data transfer faster.
Wow, awesome video, thanks! Although it takes time to figure out how to turn my limit conditions into a logical calculation that returns a boolean DataFrame.
Vectorization is the whole point of Pandas. But there are cases where vectorization is impossible and you need to process row-by-row, in that case it's best to switch to numba for a precompiled function.
Your teaching style is awesome. Where can I find your videos in a structured manner?
Great video!!!
Thank you!!
Would vectorization also be faster than an np.where statement with multiple conditions?
Great question! I think someone tested it out in the reddit thread where I posted it and found maybe a slight speed increase over the vectorized version.
np.where() is what I use almost exclusively. However, it tends to be a little unreadable if you need to use additional if statements to go from binary (either or) to 3 or more possible values. Of course, one could also nest np.where() statements too. :)
I expect a video about moped gangs in the future, Rob.
That’s a great idea Richie! I practically majored in moped gangs in college. 😂
Super awesome video.
I appreciate that. Thanks for commenting!
Great video Bob, thanks.
I'm curious, which interface for Jupyter Notebook are you using?
Glad you liked it. This is jupyterlab with the solarized dark theme. Check out my full video on jupyter where I go into detail about it.
@@robmulla Tks Bob ✌️
@@johnidouglasmarangon no problem. Jane!
You should plot the results on a log plot, which would give the fold speedup for each.
Great video! very useful
Glad you found it useful!
oh my!!! awesome. thanks!!!
Thanks 🙏
Interesting. Thank you will try it.
Awesome! Let me know how it goes.
Which font do you use? That's an awesome font and color scheme.
I had a similar situation creating a new column based on conditions. My data set has 520,000 records so the apply was very slow. I got good results with using the select function from numpy. I'm curious how that would compare to the vectorization in your case.
Edit: in my case, the numpy select is slightly faster than the vectorization.
Thanks for sharing. It would be cool to see an example code snippet similar to what I used in this video for comparison.
The Polars lib is quite fast with my 1-million-row dataset, I recommend trying it.
As an R user I use vectorization via mutate without even thinking about the other methods for such a task. R is so much more suitable for data science and wrangling.
I had to do the same thing in my first internship lol. The script went from 4 hours to like 10 minutes to run
Yea, when I learned this it 100% changed the way I write pandas code.
It's a great tip, but I don't feel like I understood what vectorized means, or how I make a function vectorized. Is it just creating the boolean array by applying the conditions to the whole data frame in this way, or are there other ways to vectorize as well?
i _am_ excited! show the solution in machine code next pls thx
Working on it…
Thank you very much for great video.
Glad you liked it! Thanks for the feedback.
The Pandas doc literally tells you that iterrows is slow and should be avoided lol. As for vectorization, Pandas uses (slightly tweaked so they can hold different types) numpy arrays, which are hosted in contiguous memory blocks... So of course vectorization will be faster than apply/map.
Yep. This is obvious to a seasoned veteran, but as I mentioned in the video, for many newbies who haven't read the docs and aren't fully aware of the backend, they don't know that iterrows is a bad idea.
@@robmulla Maybe, when you have time for that, you could publish a video that describes to newbies what "RTFM" means, and what is the benefit of that. You are popular, a role model for some 🙂
(in this case "M" -> docs)
What theme are you using to get your Jupyter Notebook to look like that?
Solarized dark theme. I have a whole video about my jupyter setup
Thanks, one query: in the case of vectorize, which one would be faster - np.where or the method you mentioned?
Nice video, though in some cases we want to use a specific complex function from a library. The apply method works fine for that case. But is there a way to use it with vectorization?
You can try to vectorize using something like numba. But it depends on the complexity of the function.
Jesus, I'm over here blowing up my laptop. Had no idea, thank you!
Hah. My name is Rob. But glad you learned something new.
Is vectorization applicable for huge datasets? Like millions of rows, for example.
If it can fit in your computer’s memory then yes!
Love the video, thanks Rob! Is there a vectorized way to create a column that is the sum of several columns?
I tried df['total'] = df.iloc[:, 5:13].sum(), which was way faster but returned all NaN values.
Any help would be appreciated.
So close! I think all you need to do is change it to `sum(axis=1)` and it should work!
@@robmulla Brilliant! Works perfect now. Thanks for the quick reply
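For anyone else hitting the same all-NaN result, a minimal sketch of the fix (the 5:13 column slice is just the one from the question; the frame here is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 15))   # hypothetical frame with 15 numeric columns

# sum() with no axis reduces each column to one value, which can't align back to the
# row index -> all NaN when assigned. axis=1 sums across the columns for every row.
df['total'] = df.iloc[:, 5:13].sum(axis=1)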
Thank you very much Rob.
I love how that Pandas logo has cannibalized the city of Madrid (Spain) logo
Not everything can be vectorized, commonly when processing time series data. For these, a great tool to know about is numba's njit.
Agreed. njit / numba can be great when you need to make pseudo-compiled Python code.
That’s all I do at work, vectorize is the way to go. I was able to do some complex logic with them.
Love it.
What theme do you use in your notebook?
I have a whole video on my jupyter lab setup. But it’s just the solarized dark theme.
Can you save lists using Parquet
Or is it only applicable to dataframes?
Why didn't you remove get rewards function from other run times?
Oops. Did I do it incorrectly? Can you share the timestamp?
Nice video! Thanks for detailed explanation. My only problem with this is that I often have to apply functions that depend on sequential time data and a loop setup makes the most sense since the next time step depends on the previous time steps.
Are there some advanced methods on how to set up more complex vectorized functions that don't fit into a one-liner expression?
Yes there are! I think I'll probably make a few more videos on the topic considering how interested people seem in this. But I'd suggest if you can do any of your processing that goes across rows in groups - first do a `groupby()` and then you can multiprocess the processing of each group on a different CPU thread. If you have 8 or 16 CPU threads you can speed things up a lot!
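A minimal sketch of that groupby-then-multiprocess pattern, assuming a hypothetical 'customer' group key and a per-group function that has to stay sequential:

from multiprocessing import Pool

import numpy as np
import pandas as pd


def process_group(group_df):
    # stand-in for per-group logic that genuinely needs to run row by row / in order
    group_df = group_df.sort_values('timestamp')
    group_df['cum_value'] = group_df['value'].cumsum()
    return group_df


if __name__ == '__main__':
    df = pd.DataFrame({
        'customer': np.random.choice(list('ABCD'), 1_000),
        'timestamp': pd.date_range('2023-01-01', periods=1_000, freq='min'),
        'value': np.random.rand(1_000),
    })
    groups = [g for _, g in df.groupby('customer')]
    with Pool(processes=4) as pool:          # each group is processed on its own worker
        results = pool.map(process_group, groups)
    out = pd.concat(results)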
Pandas has a lot of useful time series methods, but without knowing exactly what you're trying to do, it'd be hard to suggest any specific functions. But if you only need to refer to step (n-1) when processing step n, you can use df.shift() to store step n-1 IN the row for step n. Hope this helps!
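A tiny sketch of the shift() idea, with a made-up 'close' column:

import pandas as pd

df = pd.DataFrame({'close': [10.0, 10.5, 10.2, 11.0]})   # hypothetical price series

# shift(1) stores row n-1's value on row n, so the comparison stays fully vectorized
df['prev_close'] = df['close'].shift(1)
df['change'] = df['close'] - df['prev_close']             # NaN on the first row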
Great video! I am currently looping through a data frame column for each customer and print the data to PDF. Is there a vectorized version that can be much faster?
Nice glasses! where are they from?
Haha. Thanks! 🤓 - They are from warby parker, but I accidentally broke this pair :(
Thanks for sharing such nice information.
Could also use query instead of loc
Not sure that would work for this case because we aren’t straight filtering.
In these cases it is easy to vectorize, but how can you vectorize when the process or the function that needs the df as input is more complex? For example, can you vectorize a procedure that uses specific rows (not one whole column) based on a condition, and then uses those elements to perform calculations with a row offset rather than within the same row, e.g. df.loc[i,"A"] - df.loc[i-1,"B"]?
Thanks Rob
Thanks for watching!
Damn, I knew lvl 2 but lvl 3 was awesome!
Thanks Jake! Yea, vectorized functions are super fast. If you can't vectorize then there are other ways to make it faster (like chunking and multiprocessing)... I might make a video about that next!
I'm new so this might be a silly question. I am using an API to get additional data for each row in my dataset. Can I use vectorized approach while making API calls as well?
That's kind of different. You just want to gather the results from the api as fast as possible. Check out something like async calls to the api. This might help: stackoverflow.com/questions/71232879/how-to-speed-up-async-requests-in-python
@@robmulla thanks!
What if I want to apply a more complicated or non-numeric test? Like, instead of df['pct-sleeping'] > 0.5, I was looking at whether "teeth" was in df['dream-themes'] (a list of tags for the things the sleeper dreamt about, e.g. [teeth, whale, dog, slide, school]). Is the only way to do this with .apply, or can this still be vectorized?
This is a good question. I think it depends on the dtype of the dream-themes column. Would it only contain a single value or potentially multiple ones? Check the 'isin' function in the pandas docs - it's a vectorized way of doing this.
I haven't tested this but try df['dream-themes'].str.contains('teeth'). If df['dream-themes'] is a bunch of comma-delimited strings, you should be good to go (but watch out for partial matches e.g. "teeth whitening" contains "teeth"); not sure whether this will work if df['dream-themes'] contains a bunch of proper lists. Try it and let me know!
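Both suggestions sketched out, with a made-up 'dream-themes' column - one branch for comma-delimited strings and one for columns that hold actual Python lists:

import pandas as pd

# case 1: comma-delimited strings -> str.contains is vectorized;
# the word boundary guards against partial matches like "teeth whitening"
df = pd.DataFrame({'dream-themes': ['teeth, whale, dog', 'slide, school', 'whale']})
has_teeth = df['dream-themes'].str.contains(r'\bteeth\b')

# case 2: the column holds real lists (object dtype) - .str won't help here,
# so either fall back to apply, or explode the lists and use isin
df2 = pd.DataFrame({'dream-themes': [['teeth', 'whale'], ['slide', 'school']]})
has_teeth_apply = df2['dream-themes'].apply(lambda tags: 'teeth' in tags)
has_teeth_isin = df2['dream-themes'].explode().isin(['teeth']).groupby(level=0).any()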
Amazing video
Thanks for the feedback!
What about Numba, if it can bring some more optimization, it will be nice if you make a video for it.
Numba/jit is great for speeding up more complex operations. I've had limited experience with it, but in every case it really sped things up. Doing it as a video is a good idea.
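A minimal numba sketch, assuming a numeric, order-dependent calculation (a running drawdown here, purely as an example) that's awkward to write as a single pandas expression:

import numpy as np
import pandas as pd
from numba import njit


@njit
def running_drawdown(prices):
    # loop-heavy logic that numba compiles to machine code on the first call
    out = np.empty(prices.shape[0])
    peak = prices[0]
    for i in range(prices.shape[0]):
        peak = max(peak, prices[i])
        out[i] = prices[i] / peak - 1.0
    return out


df = pd.DataFrame({'price': np.random.rand(1_000_000).cumsum() + 100})
df['drawdown'] = running_drawdown(df['price'].to_numpy())   # pass the raw ndarray, not the Series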
I typically use numpy where in this situation (mainly because I like the syntax better!), so I was curious about the speed vs. the level 3 solution. Where ran a little faster (~15-20% for datasets sized 10K - 50M records).
# level 3 - vectorized
%%timeit
df = get_data(10_000)
df['reward'] = df['hate_food']
df.loc[((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5))
       | (df['age'] > 90), 'reward'] = df['favorite_food']
# 3.74 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# level 4 - where
%%timeit
df = get_data(10_000)
df['reward'] = np.where(((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5))
                        | (df['age'] > 90),
                        df['favorite_food'],
                        df['hate_food'])
# 3.15 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I love that you ran that experiment! I'm actually surprised the numpy version isn't even faster. Thanks for sharing.
@@robmulla I'll throw another wrench into this. df.at[] vs df.loc[]. df.at[] is considerably faster than df.loc[]. But I've never ran conditionals with df.at[]. I'm also an np.where() user. :)
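A tiny sketch of the .at vs .loc difference - .at only takes a single row/column label (no boolean masks), which is why conditional assignment still goes through .loc or np.where:

import pandas as pd

df = pd.DataFrame({'a': range(5)}, index=list('vwxyz'))

scalar_fast = df.at['x', 'a']    # scalar access with minimal overhead
scalar_slow = df.loc['x', 'a']   # same value; .loc carries more machinery but accepts masks/slices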
Thank you for your very extremely perfect video, so so helpful for me, love you so much.
I'm so glad! Share it with a friend or two who you think might also appreciate it.
when you vectorize with loc, don't you have to vectorize the right side of the equation too. df['favorite_food'] is not masked. It's the whole array. Right? So you are setting the reward to the first N of df['favorite_food'] where N is the length of the mask.
I don't think so because pandas will use the index when populating. But I'm also not 100% sure.
@@robmulla Make a GitHub repo so we can test! That would be great.
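A quick way to test the alignment question with a tiny made-up frame - the right-hand side is the full column, but pandas aligns it on the index, so only the masked rows get overwritten, each with its own value:

import pandas as pd

df = pd.DataFrame({
    'favorite_food': ['pizza', 'tacos', 'sushi', 'ramen'],
    'hate_food': ['kale', 'liver', 'okra', 'beets'],
    'age': [10, 95, 40, 99],
})

df['reward'] = df['hate_food']
mask = df['age'] > 90
df.loc[mask, 'reward'] = df['favorite_food']     # rows 1 and 3 get their own favorite_food
print(df['reward'].tolist())                     # ['kale', 'tacos', 'okra', 'ramen']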
Now, instead of two actions + .loc in the third example, use np.where for a one-liner and it will be even faster.
Is there a way to use np.vectorize() instead of df.loc so things are more tidier?
That's a great point. I've used np.vectorize before but not too frequently. I agree the current solution isn't very clean to read and could be much tidier.
@@robmulla No, this is not so much a suggestion and more a question. I'm new to this and literally don't know. I had previously read about np.vectorize(). I tried doing your vectorize method using np.vectorize but couldn't figure out the syntax.
@@DK-rl1sf Yea, there is a lot of overhead when using pandas instead of numpy - but you get the benefit of named columns, easy filtering and sorting. In my experience np.vectorize worked, but I was working with just numpy arrays, not pandas dataframes.
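In case it helps, here is one way the np.vectorize syntax can be written against DataFrame columns (column names taken from the video; as another comment in this thread points out, this is not true vectorization, just a convenience wrapper):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 95, 40],
    'pct_sleeping': [0.9, 0.1, 0.6],
    'time_in_bed': [8, 2, 6],
    'favorite_food': ['pizza', 'tacos', 'sushi'],
    'hate_food': ['kale', 'liver', 'okra'],
})


def reward_calc(age, pct_sleeping, time_in_bed, favorite_food, hate_food):
    # same branching logic as the row-wise version in the video
    if (pct_sleeping > 0.5 and time_in_bed > 5) or age > 90:
        return favorite_food
    return hate_food


df['reward'] = np.vectorize(reward_calc)(
    df['age'], df['pct_sleeping'], df['time_in_bed'],
    df['favorite_food'], df['hate_food'],
)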
Question: I followed your code and an error came up saying "'reward_calc' is not defined".
In the video the function name has no letter c at the end.
Oh. Good catch. Sorry it was confusing
So useful, thank you!!!!!
You're welcome! Thanks for commenting.
can you vectorize using query? I suppose you can't
So what about the scenario where we want to perform the same operation but only in one column?
Such as, if pct_sleep
You can use something like qcut in this case, or a vectorized statement with an and/or condition.
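A small sketch of the cut/qcut idea for bucketing a single column (the thresholds and labels here are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'pct_sleeping': np.random.rand(1_000)})

# equal-frequency buckets
df['sleep_quartile'] = pd.qcut(df['pct_sleeping'], q=4, labels=['q1', 'q2', 'q3', 'q4'])

# or explicit thresholds
df['sleep_band'] = pd.cut(df['pct_sleeping'], bins=[0, 0.5, 0.8, 1.0],
                          labels=['low', 'mid', 'high'])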
Hey, this is a great video and truly shows the benefit of vectorisation
I would like to point out that always remembering the vectorized way of writing is hard. Fortunately, the NumPy module does provide a neat method called "vectorize" that vectorizes your non-vectorized function.
an example (from the docs):
## this is the function
def myfunc(a, b):
    "Return a-b if a>b, otherwise return a+b"
    if a > b:
        return a - b
    else:
        return a + b

## vectorising the function and then applying it
vfunc = np.vectorize(myfunc)
vfunc([1, 2, 3, 4], 2)
array([3, 4, 1, 2])
This works on DataFrames as well.
Do note though that this is not true vectorisation; because of that, in some cases it performs similarly to functions like "apply". However, for the most part it does a tremendous job and has significantly increased the speed of my functions.
The reasons for why it is not "true vectorisation" are mentioned in this thread : stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c
Thanks! I've used that before and it does come in handy. Also using things like jit/numba can compile numpy operations.
For looping over big datasets I switch to polars when speed becomes an issue.
I have an entire video on my channel about polars. It’s great! Check it out.
Nice trick, but what if you code it completely in C / C++ / Rust? Literature says those are 50 - 80 times faster?
I have a whole video on polars, which is written in rust. It’s faster for sure. But keep in mind pandas backend is just C code.
Use polars?
This is the basic way all pandas code should be written, but if you really need it to scale for high memory consumption and a large number of calculations, use njit (numba) for numerics - or even easier, just use Polars for everything.
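For anyone curious, the same reward logic from the video expressed in Polars might look roughly like this (a sketch, not benchmarked here):

import polars as pl

df_pl = pl.DataFrame({
    'age': [25, 95, 40],
    'pct_sleeping': [0.9, 0.1, 0.6],
    'time_in_bed': [8, 2, 6],
    'favorite_food': ['pizza', 'tacos', 'sushi'],
    'hate_food': ['kale', 'liver', 'okra'],
})

df_pl = df_pl.with_columns(
    pl.when(((pl.col('pct_sleeping') > 0.5) & (pl.col('time_in_bed') > 5))
            | (pl.col('age') > 90))
      .then(pl.col('favorite_food'))
      .otherwise(pl.col('hate_food'))
      .alias('reward')
)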
I wonder how it compares to np.where
I guess depends.
3.5 seconds for a for loop with only 10k rows...
Is this done in a Docker container or another VM(-like) environment?
Just done locally on my fairly beefy machine.