Very good video, thank you! However, I replicated exactly the same groupby example with Numba, and it is still much slower than the Pandas groupby. I don't understand how he got the njit version to run so quickly: my Pandas groupby took 3 seconds while the njit version takes 14 seconds, so much slower.
OK, I noticed what I had done differently: I had created a DataFrame and not a Series. I am surprised by how much difference this makes! Basically, if you create a DataFrame instead of a Series, everything gets much slower. I would understand that for the Pandas groupby, but I don't understand why it affects the njit version. Nowhere in the function do we specify whether the input is a DataFrame or a Series, since we pass NumPy arrays to the function and not the Pandas objects. So how come Numba is much slower? How does it know?
OK, replying to my own message. Numba does end up seeing the difference between a DataFrame and a Series, because at some point we use np.zeros_like(m). If m comes from a DataFrame, it is presumably a 2-D array, so np.zeros_like(m) creates the output (the output array in the function) with that same 2-D shape. Strictly speaking it is a NumPy array and not an actual DataFrame, but it slows the function a lot compared to when m comes from a Series and the output is 1-D. So if you start from a DataFrame, you can just replace m_numba = np.zeros_like(m) with m_numba = np.zeros(len(m)) and it will be fast.
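To make the shape difference concrete, here is a minimal NumPy-only sketch. It assumes the arrays are obtained with something like `.values` (the video's exact code is not shown here), and the names `m_series`, `m_frame`, and `n` are made up for illustration. A Series gives a 1-D array, while a single-column DataFrame gives a 2-D array of shape (n, 1), and `np.zeros_like` copies whichever shape it receives.

```python
import numpy as np

n = 5  # hypothetical size, for illustration only

# Stand-in for series.values: a 1-D array of shape (n,)
m_series = np.arange(n, dtype=np.float64)

# Stand-in for df.values on a single-column DataFrame: a 2-D array of shape (n, 1)
m_frame = m_series.reshape(-1, 1)

# zeros_like copies the shape (and dtype) of its argument
out_like_series = np.zeros_like(m_series)
out_like_frame = np.zeros_like(m_frame)

# zeros(len(m)) always gives a 1-D array, whatever the shape of m
out_fixed = np.zeros(len(m_frame))

print(out_like_series.shape)  # (5,)
print(out_like_frame.shape)   # (5, 1)
print(out_fixed.shape)        # (5,)
```

One thing to keep in mind: `np.zeros(len(m))` always produces float64, whereas `np.zeros_like(m)` keeps the dtype of `m`, so the two are only interchangeable when `m` is already float64 (or when pass `dtype=m.dtype` explicitly).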