UPDATE: Thanks to @swni on Reddit for the suggestion to use the `ids_pairs` array to index to get `x_pairs` and `y_pairs` as opposed to reusing the `torch.combinations` function. This reduces the simulation time required for 10000 particles to only 20 seconds (about half what is shown in the video). Code has been updated on GitHub!
To compare NumPy and PyTorch fairly under these new conditions, I simulate 5000 particles in each case. PyTorch takes 6.3 seconds to run (remember, it also has around a 2 second overhead), while NumPy takes about 823 seconds, a speedup of roughly 100x.
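A minimal sketch of the indexing trick described in the update, assuming a 1-D tensor `x` of particle x-coordinates (the values here are random and purely illustrative). It builds `ids_pairs` once and then gathers `x_pairs` by indexing instead of calling `torch.combinations` on the values again:

```python
import torch

n = 5
# Particle x-coordinates (random values, purely illustrative).
x = torch.rand(n)

# Build the index pairs once, before the time loop.
ids = torch.arange(n)
ids_pairs = torch.combinations(ids, 2)  # shape (n*(n-1)/2, 2)

# Slower variant: rebuild the value pairs with torch.combinations every step.
x_pairs_slow = torch.combinations(x, 2)

# Faster variant: reuse the precomputed index pairs to gather the values.
x_pairs_fast = x[ids_pairs]

# Both variants hold the same pairs in the same order.
same = torch.equal(x_pairs_slow, x_pairs_fast)

# Pairwise separation along x, one entry per pair.
dx_pairs = x_pairs_fast[:, 1] - x_pairs_fast[:, 0]
```

The gather `x[ids_pairs]` is a single indexing operation, so the pair bookkeeping is paid for only once rather than on every time step.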
Could you test CuPy please?
There must still be a lot of potential. A GPU calculates 1080x1920, about 2 million RGB values, per frame. You don't need to check n^2 combinations for collision, n(n-1)/2 should be enough because if P1 collides with P2, P2 also collides with P1. Especially something like checking for collision can be blazingly fast on a GPU. Your 3070 has over 5000 cores and each one has SIMD instructions. So you can do about 20k fp ops per clk. I would check the particles for collision when creating the pairs. You have the function anyway so it's an easy-to-fix bug.
The same thing can be applied for NumPy as well. Replacing the get_delta_pairs function with
def get_delta_pairs(x, ids_pairs):  # added extra parameter
    return np.diff(np.array(x[ids_pairs[:]]), axis=1).ravel()
The itertools.combinations function takes a long time when the number of particles increases, so using the ids_pairs array, which was already created, can reduce the time taken, as combinations is called twice in each iteration. Using 400 particles now takes 3 seconds instead of 57 seconds (NumPy).
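For anyone who wants to check the NumPy version, here is a small sketch, assuming `x` is a 1-D NumPy array and `ids_pairs` is built once with `itertools.combinations` (the sizes and random values are made up for illustration). It compares the indexed result with the result of calling `combinations` on the values directly:

```python
from itertools import combinations

import numpy as np


def get_delta_pairs(x, ids_pairs):
    # x[ids_pairs] has shape (n_pairs, 2); diff along axis 1 gives x[j] - x[i].
    return np.diff(x[ids_pairs], axis=1).ravel()


n = 6
rng = np.random.default_rng(0)
x = rng.random(n)

# Built once, outside the time loop.
ids_pairs = np.array(list(combinations(range(n), 2)))

# Fast path: plain fancy indexing with the precomputed pairs.
dx_fast = get_delta_pairs(x, ids_pairs)

# Reference path: run itertools.combinations on the values themselves.
dx_ref = np.diff(np.array(list(combinations(x, 2))), axis=1).ravel()
```

Both arrays should hold the same differences in the same order, since `combinations` visits indices and values in the same lexicographic order.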
Bro please never stop doing physics videos, they are amazing! I know they are not the most popular videos on your channel but they are super helpful for someone who only had one programming subject, and it was with Fortran :( . Greetings from the Dominican Republic! haha
It always catches me off guard to see non-meme videos from you. I am more into web-interacting services rather than data manipulation/science-so async is my wheelhouse rather than this stuff. Still fascinating to watch.
I am a numerical physicist, and this will be very helpful for me. I am currently running all my simulations on CPU (though using MPI for parallelization)
Please learn about nvblas and openblas and code vectorization in my other comment here today. The two keys are writing vectorized numpy or pandas code, plus activating the nvblas or openblas subsystem. Let me know if you want help.
You may want to take a look at CUDA C++ if you have Nvidia GPU(s) and are concerned with performance
Awesome vid! Love seeing Pytorch being leveraged for its first class GPU support for things other than machine learning. If I recall correctly, someone had a blog post about using pytorch to optimize a shape for rolling (i.e. reinventing the wheel) and it used pytorch, super funny, but cool. Great video!!
I absolutely love this. I'm making my own game engine (fun hobby, tbh) with OpenGL, numpy and Python, and for some time I've thought about where to simulate my physics. This is an eye-opener, and it looks fun as heck! Espec. the matplotlib animation for some "lazy" collision simulations. This vid brings me straight to my college days
Amazing content.
I had a professor during my physics degree who told us about the power of the GPU when coding "big numbers". GPUs have up to 1000 times more ("dumb") cores than the CPU and that can be really powerful. I am now working on my PhD and I use Python to do the work. I think that I can learn a lot from you.
Thank you!
first non-humour vid i see and its awesome! will try to learn more! thank you professor!
Congrats! Great video! Please, don't stop, your videos are incredibly didactic! I always cite your channel to my students in my Classical Dynamics classes.
Dude a GPU accelerated python series would be amazing 😍😍😍
Very interesting content and I really appreciate the way you show both notebooks side by side to compare the results. Thank you very much!
Great video, thanks! Consider using indexing by the coordinates of particles in space. The idea is that the coordinates of the particles are rounded to the size of the box, and the collision check occurs only for those particles that are inside the same box. This usually reduces the number of pairs by 90%.
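A rough sketch of the box-indexing idea, under a few assumptions: positions `r` have shape `(n, 2)`, the cell size is at least the collision distance, and the helper name `candidate_pairs` is hypothetical. Unlike the description above, this version also looks at the eight neighbouring boxes, because two particles can be closer than the cutoff while sitting on opposite sides of a box boundary:

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def candidate_pairs(r, cell_size):
    """Return index pairs (i, j), i < j, that share a cell or sit in adjacent cells."""
    cells = defaultdict(list)
    keys = np.floor(r / cell_size).astype(int)
    for i, (cx, cy) in enumerate(keys):
        cells[(int(cx), int(cy))].append(i)

    pairs = set()
    for (cx, cy), members in cells.items():
        # Pairs inside the same cell (members are stored in increasing order).
        pairs.update(combinations(members, 2))
        # Pairs that straddle a boundary with one of the 8 neighbouring cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if (dx, dy) == (0, 0):
                    continue
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            pairs.add((i, j))
    return pairs


rng = np.random.default_rng(1)
r = rng.random((200, 2))
d_cutoff = 0.05

cand = candidate_pairs(r, cell_size=d_cutoff)

# Brute-force reference: every pair that is actually closer than the cutoff.
brute = {
    (i, j)
    for i, j in combinations(range(len(r)), 2)
    if np.sum((r[i] - r[j]) ** 2) < d_cutoff**2
}
```

Every truly colliding pair should appear among the candidates, while the candidate set stays far smaller than the full list of pairs. This pure-Python loop is only meant to show the bookkeeping; a GPU version would need a vectorized way to build the cells.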
I'm honored that your stuff comes up on my feed. Amazing work!
GPU takes advantage of linear operations. So I'm not really sure, but if you use some data structures like quadtree the complexity of the computation might drastically simplify. And you won't need to calculate all n² distances. In fact most particles are not colliding with each other. One would need to test it, but with that the CPU might still outperform the GPU with its overhead, since there won't be that many computations.
Nvblas can be used by numpy. Nv stands for nvidia. Just configure your host a bit, which is easy. Openblas can also be used by numpy, which is more common. By default, your linux is using a gnu blas which is super slow by comparison. Nvblas uses the gpu for the linear algebra operations in your numpy code. Just be sure to write vectorized numpy code, not for loops. You don't change your application code at all, which is a big benefit for ease of maintenance. Openblas will recruit all your cpu cores and implicitly parallelize your matrix math, greatly speeding it up, as will nvblas.
Great educational video, mate! I'm a CS Grad student and was beginning to get to the later ML courses. Your explanation and side-by-side logic demonstration with Numpy convinced me to do a bit of research and switch from TF to Pytorch! Thanks so much!! I eagerly look forward to the next video!
It's so relaxing to see someone else's explanation. I'm so tired of doing work in graduate school XD
Nice I've been waiting for this one ! thanks , looking forward to seeing the next ones
Oh damn, this is what my thesis is on! Good to see that some great resources are being put out for it
I hope you continued this series
12:00 why don't you simply use torch.cdist (if you have a batch of vectors, otherwise use torch.pdist) which calculates the p-norm (p=2 in your case) distance between each pair of the two collections of row vectors. This is supposed to be much faster than your code, even though I didn't test it.
torch by default includes in each tensor the telemetry necessary to calculate the derivative for the error back propagation algorithm. Use the requires_grad=False parameter, this will speed up the calculation even more.
x = torch.randn(3, requires_grad=False)
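A quick check may help here. As far as I know, tensors created with factory functions such as `torch.randn` do not track gradients unless `requires_grad=True` is passed, so the flag above matches the default. The sketch below inspects the flag and shows `torch.no_grad()`, which switches tracking off inside a block even when an input does require gradients (the tensor names are illustrative):

```python
import torch

# A freshly created tensor: gradient tracking is off unless requested.
x = torch.randn(3)
default_flag = x.requires_grad

# Tracking only starts once some input asks for it.
w = torch.randn(3, requires_grad=True)
y = (w * x).sum()
tracked = y.requires_grad

# Inside no_grad, results are not tracked even though w requires gradients.
with torch.no_grad():
    z = (w * x).sum()
untracked = z.requires_grad
```

For a simulation that never calls `backward()`, this suggests there may be little extra speed to gain from the flag alone.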
Nice video, did not think about using pytorch to replace Numpy, but it makes perfect sense for parallelizing numpy code👍. Just a quick tip for additional speedup. Instead of comparing the distance directly you can compare the squared distance for collision detection, this avoids using the square root function which is "slow" at least compared to all the dot products, though it might not matter much for simulations of this scale.
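The squared-distance tip can be sketched in a few lines, assuming arrays `dx` and `dy` of pairwise separations and a cutoff `d_cutoff` (all values below are made up for illustration). Since both sides of the comparison are non-negative, comparing squares selects the same pairs as comparing distances:

```python
import numpy as np

rng = np.random.default_rng(0)
dx = rng.normal(size=1000)  # illustrative pairwise x-separations
dy = rng.normal(size=1000)  # illustrative pairwise y-separations
d_cutoff = 0.5

# Original style: take the square root, then compare with the cutoff.
with_sqrt = np.sqrt(dx**2 + dy**2) < d_cutoff

# Suggested style: skip the square root and compare with the squared cutoff.
without_sqrt = dx**2 + dy**2 < d_cutoff**2
```

The same pattern carries over to PyTorch tensors unchanged, since the arithmetic operators are the same.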
The multiprocessing library also helps to utilize all available threads. I was generating a Mandelbulb and it went from 4 minutes to 1 minute when I optimized the code to use it.
Nice video!
I’ve also got a question about the part that calculates whether particles collide with each other. Is there any advantage of the video’s method compared to using:
DIS = torch.cdist(points, points) < collision_distance
DIS = torch.triu(DIS, diagonal=1)
Pairs = DIS.nonzero()
Or do they have the same computational complexity?
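To make the comparison concrete, here is a small runnable sketch, assuming `points` has shape `(n, 2)` and using a made-up `collision_distance`. It runs the `torch.cdist` route next to an explicit pair-index route similar in spirit to the video's:

```python
import torch

torch.manual_seed(0)
n = 20
points = torch.rand(n, 2)
collision_distance = 0.2

# Route 1: full n x n distance matrix, keep the strict upper triangle.
close = torch.cdist(points, points) < collision_distance
close = torch.triu(close, diagonal=1)
pairs_cdist = close.nonzero()

# Route 2: explicit index pairs, one distance per unique pair.
ids_pairs = torch.combinations(torch.arange(n), 2)
diff = points[ids_pairs[:, 0]] - points[ids_pairs[:, 1]]
d_pairs = torch.sqrt((diff**2).sum(dim=1))
pairs_explicit = ids_pairs[d_pairs < collision_distance]
```

Both routes do work proportional to n², so the asymptotic complexity looks the same. The `cdist` route stores all n² distances, while the pair route stores n(n-1)/2, so the practical difference is probably in memory use and constant factors rather than scaling.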
Never seen "torch.cdist" before! Thank you for this comment. Huge reason why I post videos like this...to learn more from the comments :)
Super cool, your meme videos are hilarious but this quality content is why I subbed in the first place
Very interesting, would love to see more of this!
You said you use a GTX 3070?
All I can find here is an RTX 3070.
Was trying to figure out what makes your calculations for 10,000 particles that much faster,
compared to my GTX 750Ti that of course would crash the system with 10,000 particles,
until an additional Tesla M4 was installed, completely bypassing the functions of the GTX 750Ti,
resulting in
rs,vs = motion(r, v, ids_pairs, ts=1000, dt=0.000008, d_cutoff=2*radius)
calculations: Wall time: 6min 55s,
which is still much slower than 48.9 sec, as your demonstration boasts.
The animation with 10,000 particles looks awesome. Thx.
Have not yet found the necessary control software to raise the clockspeed of the Tesla M4.
Current driver installations supplied by Nvidia have cleverly merged the two cards as one.
Ergo.: We have now an augmented GTX 750 Ti that benefits from the original factory settings of the Tesla M4.
Additional software needed to raise the clockspeed of the cards in their combined setup to the level of their full capacities was not found yet.
Could another distribution really come up for different potentials around the particles? I thought of the Boltzmann distribution as a thermodynamic necessity due to maximization of entropy
Just recently got familiar with multithreading so I guess this is the natural progression
How do I resolve the error:
[WinError 2] The system cannot find the file specified
Wow this is really interesting! Thanks! Waiting for more videos
perfect intro to torch for someone who is familiar with numpy
If this is all about optimization, you should probably compare the squared distance between particles to eliminate the need to calculate the square root. :)
GTX 3070? Do you mean rtx?
Haha ya 😂
Thank you so much , I was already using PyTorch for something, but I couldn't figure out how to create the equivalent of the "x_pairs" array I needed to use, thanks.
Thanks for the video ! You are awesome !
I have a question. Is it possible to use pytorch to optimize a code with a lot of functions from scipy ? Like solving a lot of differential equations, nonlinear equations, interpolating and integrating functions all in one big code. I'm currently optimizing my code with the joblib library to run it in parallel.
Very good intro video into GPU programming, it even gave me couple of ideas. One question if I may. Why wouldn't you do the simulation with event driven algorithm, since that would save a lot of resources and you can avoid overlaps of particles (ie the need to choose small timesteps). I get this is a tutorial/introduction video, but that implementation would be very interesting as well!
I might add that you can use AMD GPUs, but currently only on Linux (as Nvidia has CUDA, AMD has ROCm)
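A device-agnostic sketch may be useful for readers without an Nvidia card. As far as I know, ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` interface, so the check below should cover both cases and fall back to the CPU otherwise (the array shape is illustrative):

```python
import torch

# Reports whether this PyTorch build can see a supported GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the data directly on the chosen device; the rest of the code is unchanged.
r = torch.rand(2, 1000, device=device)
dist_sq = (r**2).sum(dim=0)
```

Writing the simulation against a `device` variable like this keeps the same notebook runnable on machines with and without a GPU.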
Great to see ya again mate
It would be interesting to know what sort of a GPU you are using to achieve the record-breaking walltime of 48.9sec.
Compared to my 33:12.4 min for a max of 8000 particles,
on an NVIDIA GeForce GTX 750 Ti sporting 640 CUDA cores @ 100% of the available 2GB memory in use,
while offering badly interrupted video performance on all 3 monitors.
However, after rewriting your code as a CUPY hybrid, the wall time was reduced to 11:31 min for the 8000 particles,
while keeping GPU memory use between 33-75%, and thus well away from the videocard crash
encountered when attempting 10000 particles using your code unmodified.
Anyway, arrangements were made today,
to obtain an NVIDIA Tesla M4 graphics card to be used in conjunction with the existing card as a dedicated number cruncher.
Hopefully that will get us closer to the desired performance.
What about cupy (CUDA drop-in replacement for numpy)? Is the performance uplift comparable to pytorch?
Yes...
I frequently do so-called "model-fitting" using MCMC (or anything good enough), where each set of data consists of 1k-10k data. I wonder whether this could benefit from GPU acceleration or the overhead would be too much.
Thanks, now I finally will have one reason to tell my dad to buy me a graphics card😂
Where billy
Nice sales pitch for Microsoft Visual Studio.
Had `cuda` up and running nicely in a previous installation of Visual Studio.
All of that had been with C++ in mind. So Python was not really considered at that time.
Was hoping to do the same with PyCharm and Browser-based Notebook using Python exclusively.
That's when it got confusing to the point of dropping the idea. 😒
Definitely a must watch!
thats so cool
need this more in field of quantum chemistry❤❤
Why was it 'bad' that some of particles were colliding in the initial conditions?
"GPU, which most people have access to today"
Looks like we've got some serious worldknowing issue going on here
Lots of free nodes you can use here that have access to GPU resources:
colab.research.google.com/
So my intuition of rewriting stuff in pytorch just for fun was not unreasonable after all!
I think operations on PyTorch tensors are also faster than on NumPy arrays, with both on the CPU
Torch jit and torch compile are a lot faster than just torch
Sweet topic. Thank you!
This is brilliant!
I mean you start by saying most people have access to a GPU these days and this is absolutely true. But plenty don't have an NVIDIA GPU and as I understand it pytorch doesn't support non NVIDIA gpus? might be worth re writing this with pyopencl.
It's all great but one thing. I don't need your face (person) sitting in front of 2 screens and covering them 😂
What program are you using here that lets you put notes in the code like this?
VSCode and Jupyter Notebook!
2 minutes into the video, what about the performance of pytorch compared to numpy in CPU? is it faster there also !? have you tried numba !?
Does GPU mean NVIDIA GPU specifically? Will we ever have libraries utilizing ANY general GPU?
it works with any GPU
can you suggest me books for this relevant problems of laplace transform via python
PLEASE MAKE A TUTORIAL ON HOW TO HANDLE BIG INTEGERS (>64INT) ON THE GPU 🙏
What are libraries that must be imported?
When I try to run your code, I get the error message: No module named 'torch' What am I doing wrong?
I can't get accepted into your discord. In the two lines ani.save()... I get a file not found exception. I am using Python 3.11.3. I really like the article and video. Thanks
I found the problem, I had not installed python-ffmpeg. It is fixed now. Thanks
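For anyone hitting the same "file not found" error on `ani.save()`, a quick standard-library check is sketched below. It assumes the animation is saved with an ffmpeg-based writer and only tests whether an `ffmpeg` executable is visible on PATH; the helper name `ffmpeg_on_path` is hypothetical:

```python
import shutil


def ffmpeg_on_path():
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_on_path():
    print("ffmpeg not found on PATH; ani.save(...) with an ffmpeg writer will likely fail")
```

If the check fails, installing ffmpeg (or making sure its folder is on PATH) is the usual first thing to try.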
thank you for this!
love it, keep it up :)
Great video :)
Would this work on an RX 6800 or Intel Arc A770?
If I remember correctly pytorch runs on AMD and intel arc as well
Can You Make video on the PyOpenCl
Does Pytorch have numerical integration capabilities?
there's torch.trapz which does exactly that
@ДмитроПрищепа-д3я That's cool. Any idea how its accuracy compares to, say, the scipy.integrate.quad function?
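On the accuracy question, a small sketch with a known answer: integrating sin(x) from 0 to π, which is exactly 2, using `torch.trapz` on a fixed grid. The grid size of 1001 points is an arbitrary choice for illustration:

```python
import math

import torch

# Fixed, evenly spaced grid in double precision.
x = torch.linspace(0.0, math.pi, 1001, dtype=torch.float64)
y = torch.sin(x)

# Trapezoidal rule on the sampled values.
approx = torch.trapz(y, x).item()

exact = 2.0
abs_error = abs(approx - exact)
```

The trapezoidal rule works on whatever samples it is given, and its error shrinks roughly with the square of the grid spacing. `scipy.integrate.quad` is adaptive and takes a callable instead, so it usually reaches much tighter tolerances on smooth integrands; the two are not really drop-in replacements for each other.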
this is really cool!
What!!!!!!!!!!!!!!!!!!!!!!!!!
Great video
0:03 "If you coded in python before" while showing a screen full of braces
Oh so good
soo instead of using the poor man´s version of Fortran for such calculations, just use Fortran. It is not only perfect for arrays but also natively parallel. You can even make a python wrapper if you want some gui to please the eye. But , I get it, it would not be cool for the kids on youtube...but if you really need efficiency , give it a try.
Hey Nvidia did i miss the new gtx 3070 !!
Very nice video
wowowowoowwww
i'll break your comment bar with C++
Gtx 3070?
Is this from china!? 🤣😂
Damn, he used a Deep Learning Framework to replace Numpy, a Mathematics Framework :v
The entire PART 1 can be more efficiently rewritten in one line:
d_pairs = torch.pdist( r.T )
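A short sketch to check that one-liner, assuming `r` has shape `(2, n)` as the `r.T` suggests. It compares `torch.pdist` with an explicit computation over `torch.combinations` index pairs; the call to `.contiguous()` is only a precaution for the transposed view:

```python
import torch

torch.manual_seed(0)
n = 6
r = torch.rand(2, n)  # row 0: x-coordinates, row 1: y-coordinates

# One-line version: distances between all unique pairs of rows of r.T.
d_pdist = torch.pdist(r.T.contiguous())

# Explicit version with precomputed index pairs.
ids_pairs = torch.combinations(torch.arange(n), 2)
dx = r[0][ids_pairs[:, 1]] - r[0][ids_pairs[:, 0]]
dy = r[1][ids_pairs[:, 1]] - r[1][ids_pairs[:, 0]]
d_explicit = torch.sqrt(dx**2 + dy**2)
```

Both results list the pairs in the same order, (0, 1), (0, 2), ..., so `d_pdist` can be used together with `ids_pairs` to look up which particles are close.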
#Include
int main()
{
std::cout
the RTX 3070 is already considered mid range??!! 🥲
Use transfer function its billion times faster:
h = np.random.rand(10000)
idx = np.arange(10000)
X = X_train[idx].astype("float32")/255.0
yt = y_train[idx] + 1 # 1..10
x = X.mean(1)
ids = np.argsort(x)
i=0
while True:
    err = yt[ids] - x[ids] * h
    h += 0.1*err
    print(np.mean(err**2))
    i+=1
    if np.mean(err**2)