It's so hard to find concise, direct info on SIMD intrinsics. Thanks for this!
Glad I could help!
This is the best video on SIMD. Short, concise, and to the point. Everything else I found was just blabbering.
Thanks! Glad you thought so!
Excellent content, thanks!
Well, well. Looks like YT finally sent me a C++ channel that's worth watching.
This guy 🔥 and The Cherno channel 🔥
@preethamdbz2023 The Cherno is crap outside of the C++ series.
Straight to the point, clear and precise.
Thank you for the nice explanation. I have the following questions:
1. How can I optimize a Euclidean distance function using SIMD?
2. How can I use SIMD instructions in Java?
1.) You could just compute multiple distances at the same time using intrinsics. There are ones for add, subtract, multiply, square root, etc. I had a project where I had to calculate millions of Manhattan distances, so I just offloaded it to the GPU.
2.) I don't know if there's a way to use SIMD intrinsics in Java. I have never used Java, but from a brief Google search, there seems to be no easy way to do this.
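A minimal sketch of the "multiple distances at the same time" idea from the reply above, not the code from the video. It assumes 2D points stored as separate coordinate arrays; the function name `euclidean_distances` and that layout are hypothetical. It uses the SSE add, subtract, multiply, and square root intrinsics to produce four distances per iteration, with a scalar loop for leftover elements and for non-x86 builds.

```cpp
#include <cmath>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>  // SSE intrinsics (baseline on x86-64)
#define HAVE_SSE 1
#endif

// Euclidean distance between n pairs of 2D points.
// Hypothetical layout: x1/y1 hold the first points, x2/y2 the second.
void euclidean_distances(const float* x1, const float* y1,
                         const float* x2, const float* y2,
                         float* out, std::size_t n) {
  std::size_t i = 0;
#ifdef HAVE_SSE
  // Four distances per iteration; unaligned loads avoid alignment requirements.
  for (; i + 4 <= n; i += 4) {
    __m128 dx = _mm_sub_ps(_mm_loadu_ps(x1 + i), _mm_loadu_ps(x2 + i));
    __m128 dy = _mm_sub_ps(_mm_loadu_ps(y1 + i), _mm_loadu_ps(y2 + i));
    __m128 sum = _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy));
    _mm_storeu_ps(out + i, _mm_sqrt_ps(sum));
  }
#endif
  // Scalar loop for the remaining elements (or everything without SSE).
  for (; i < n; ++i) {
    float dx = x1[i] - x2[i];
    float dy = y1[i] - y2[i];
    out[i] = std::sqrt(dx * dx + dy * dy);
  }
}
```

A Manhattan version would replace the multiply and square root with absolute values and a single add; the loop structure stays the same.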
@CoffeeBeforeArch Many thanks. Euclidean distance is just another form of Manhattan distance with minor changes. It would be great if you could upload the Manhattan distance project to GitHub or somewhere. I will try to modify it for Euclidean distance.
Java or C++ is no problem for me. By the way, I have found the ND4J library for Java. ND4J makes extensive use of vectorized C++ code for all numerical operations (utilizing JavaCPP).
What do you think about that?
@ASD9344 Yep, I've used both Manhattan and Euclidean distance in research work, so I'm familiar. This is the short CUDA app I wrote for calculating Manhattan distance on the GPU: github.com/CoffeeBeforeArch/research_utilities/blob/master/acceleration/m_distance.cu . I really have no suggestions for anything Java-related because I have never used Java. If that library works for you, go ahead and use it.
Nice, thanks!!
Glad you enjoyed the video!
Thank you!
Hi, I was just wondering under what conditions the compiler is not smart enough to automatically use SIMD instructions, forcing us to write intrinsics by hand?
For GCC, the auto-vectorizer kicks in at the -O3 optimization level (GCC 12 and later also enable a restricted form at -O2), or if you manually enable it with -ftree-vectorize. There are cases where your compiler may not perform vectorization (e.g., if there's an alignment or aliasing problem). Furthermore, it seems some SIMD instructions are just not produced by compilers (likely due to the high effort of matching high-level code to them and their niche use cases). The dot-product intrinsic seems to be an example of this (I've yet to have a compiler produce it for me, and I've always had to use the intrinsic).
Cheers,
--Nick
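To illustrate the aliasing case mentioned above, here is a small sketch with two hypothetical functions. In the first, the compiler has to assume `dst` may overlap `a` or `b`, so it may insert runtime overlap checks or keep a scalar loop. In the second, `__restrict` (a compiler extension in GCC, Clang, and MSVC, not standard C++) promises that the arrays do not overlap. Whether either loop is actually vectorized depends on the compiler, version, and flags; with GCC, `-fopt-info-vec-missed` reports loops it could not vectorize.

```cpp
#include <cstddef>

// The compiler must assume dst could overlap a or b, which can block or
// complicate auto-vectorization.
void add_may_alias(float* dst, const float* a, const float* b,
                   std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = a[i] + b[i];
  }
}

// __restrict tells the compiler the three arrays never overlap.
// Calling this with overlapping arrays is undefined behavior.
void add_no_alias(float* __restrict dst, const float* __restrict a,
                  const float* __restrict b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = a[i] + b[i];
  }
}
```

Both functions compute the same result for non-overlapping inputs; the only difference is how much freedom the optimizer has.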
@CoffeeBeforeArch Thanks for the detailed reply!
Thanks so much for this very helpful explanation.
What happens if you use intrinsics, but it turns out that the user's processor doesn't support them?
Is there a way to detect at run time what sort of hardware support exists?
Is there a reference somewhere that will help us map detected hardware elements to support for particular intrinsics?
If your processor does not support the instruction behind an intrinsic, executing it will generate an invalid opcode exception (#UD). How you check for support at runtime will differ based on the OS and compiler. I think the utilities for checking what is supported are in cpuid.h (on GCC and Clang). Intel's software developer manuals specify which instructions/intrinsics are supported on which processors.
Cheers,
--Nick
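A hedged sketch of the runtime check described above, assuming GCC or Clang on x86 (other toolchains take the scalar path here). It uses `__get_cpuid` from cpuid.h to read CPUID leaf 1, where ECX bit 19 reports SSE4.1, and only then calls a function that uses the dot-product intrinsic `_mm_dp_ps`. The function names `cpu_has_sse41`, `dot4_sse41`, `dot4_scalar`, and `dot4` are hypothetical.

```cpp
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>      // __get_cpuid
#include <smmintrin.h>  // _mm_dp_ps (SSE4.1)
#define X86_GNU 1
#endif

// Portable fallback: dot product of two 4-element vectors.
float dot4_scalar(const float* a, const float* b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

#ifdef X86_GNU
// True if CPUID leaf 1 reports SSE4.1 (ECX bit 19).
bool cpu_has_sse41() {
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
  return (ecx & (1u << 19)) != 0;
}

// The target attribute lets this one function use SSE4.1 instructions
// without compiling the whole file with -msse4.1.
__attribute__((target("sse4.1")))
float dot4_sse41(const float* a, const float* b) {
  __m128 va = _mm_loadu_ps(a);
  __m128 vb = _mm_loadu_ps(b);
  // 0xF1: multiply all four lanes and write the sum to lane 0.
  return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
}
#else
bool cpu_has_sse41() { return false; }
#endif

// Picks the SSE4.1 path only when the CPU reports support for it.
float dot4(const float* a, const float* b) {
#ifdef X86_GNU
  if (cpu_has_sse41()) return dot4_sse41(a, b);
#endif
  return dot4_scalar(a, b);
}
```

In real code the check would normally run once at startup and be cached (for example in a function pointer) rather than repeated on every call. Wider extensions such as AVX also need a check that the OS saves the extra register state, which this sketch does not cover.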