Parallel C++: SIMD Intrinsics

CoffeeBeforeArch

4.8K views

Published: Oct 10, 2024

Comments • 14

@markusbuchholz3518 1 year ago ⁺⁹
Hi Nick, I do not have a question, but I would like to highlight again that your channel is remarkable. As far as I know, only by following your channel can one keep up consistently with the latest achievements in software, especially in excellent C++. It is a huge distinction to be here. Additionally, I appreciate the effort you put into preparing a video each day. I have been using C++ for over two decades, mostly within the robotics domain. Your impressive work gives all of us (the community) a new look at this beauty and encourages us to study more. Thank you so much. Have a nice day!
@CoffeeBeforeArch 1 year ago ⁺¹
Thank you for the kind words! Always nice to hear when others enjoy the content. Autonomous robotics was where I got my start in research many years ago (primarily with SLAM) before moving more into the architecture/performance side of things.
Cheers,
--Nick
@user-vw5ex4wf1e 1 year ago ⁺¹
Hey Nick, amazing videos as always! Compiling with -ffast-math seems to unlock intrinsics for the transform_reduce baseline as well. Btw your videos are very inspirational, keep it up!
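For readers who want to try this, here is a minimal sketch of what a `transform_reduce` dot-product baseline could look like. The function name `dot_product` is hypothetical and this is not necessarily the code from the video. Building it once with something like `-O3` and once with `-O3 -ffast-math` and comparing the generated assembly should show whether the reduction gets vectorized; the outcome depends on compiler version and target.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not taken from the video): a dot product written as a
// transform_reduce. The four-argument overload multiplies element pairs and
// adds the products, starting from the initial value 0.0f.
float dot_product(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());  // both inputs must have the same length
  return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0f);
}
```

The source itself stays portable here: the choice between scalar and SIMD instructions is left to the compiler and its flags.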
@shaytal100 1 year ago ⁺¹
The performance benefit of using SIMD intrinsics is really impressive! I wonder how often the use of SIMD instructions could speed up everyday computing tasks.
My really blind guess would be that they are very underused, even in compute-intensive software.
Thanks Nick for this fantastic series so far!
@CoffeeBeforeArch 1 year ago ⁺¹
Glad you enjoyed it! For many cases, the auto-vectorizer code is good enough. There can be an incredibly high software development cost for using low-level intrinsics (programming in assembly can be tricky work).
There are great examples though of code written entirely/almost entirely in assembly. Intel's MKL (math kernel library) is a great example (along with many high-performance linear algebra libraries out there).
Cheers,
--Nick
@shaytal100 1 year ago ⁺¹
@@CoffeeBeforeArch Well, that is true. I guess there are many libraries for common algorithms that make good use of SIMD instructions. Good point!
@eladon19153 3 months ago
Hi Nick, I just read about alignment, and I would like to know why aligning to 32 is an improvement over aligning to 64. An alignment of 64 (on a 64-bit system) would mean a worst case of 4 cache misses and a read of 64 bytes, while an alignment of 32 would mean a worst case of 6 cache misses.
Unless we are talking about a 32-bit system.
Again, I might be wrong about how I understand the cache, but I figured I would just ask while I am still reading about it.
Thanks a lot
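As background for this question, the sketch below contrasts the two alignments in code. It assumes the 32 and 64 refer to bytes rather than bits: 32 bytes is the width of one 256-bit AVX register (8 floats), and 64 bytes is the cache-line size on typical x86 processors. The names `is_aligned`, `AvxBlock` and `CacheLineBlock` are hypothetical and only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
  // The alignment must be a non-zero power of two.
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// 8 floats = 32 bytes, the size of one 256-bit AVX register.
struct alignas(32) AvxBlock {
  float v[8];
};

// 16 floats = 64 bytes, one cache line on typical x86 parts (an assumption
// about the target, not something the language guarantees).
struct alignas(64) CacheLineBlock {
  float v[16];
};

static_assert(sizeof(AvxBlock) == 32, "one AVX register worth of floats");
static_assert(sizeof(CacheLineBlock) == 64, "one assumed cache line");
```

A 64-byte-aligned block is also 32-byte aligned, so the stricter alignment never hurts an aligned 256-bit load. A 32-byte-aligned, 32-byte-wide access also cannot straddle a 64-byte cache line.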
@juancolmenares6185 17 days ago
Why did the compiler not recognize that it could use the vdpps instruction? You did mention something about the compiler implementation, but a dot product seems like something it should be able to figure out...
@CoffeeBeforeArch 17 days ago ⁺¹
If I recall correctly, it's because of the compiler's guarantees about floating-point arithmetic.
Compilers guarantee floating-point results (regarding the ordering and precision being used) to give repeatable results across platforms.
The vector dot product, I believe, uses a higher precision for intermediate operations and only rounds the final result, therefore giving a different result than a dot product done in the standard, floating-point-standard-compliant way. That result will often be more accurate than the standard calculation, but it is non-portable (because intrinsics are hardware specific).
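The ordering part of this explanation can be illustrated without any intrinsics. The sketch below compares a strict left-to-right sum with a sum that keeps four partial accumulators, mimicking the lanes of a SIMD register. Both helper names are hypothetical, and the lane count of four is an arbitrary choice for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds strictly left to right, the order a compiler has
// to preserve for floats unless flags such as -ffast-math relax it.
float sequential_sum(const std::vector<float>& data) {
  float acc = 0.0f;
  for (float x : data) acc += x;
  return acc;
}

// Hypothetical helper: keeps four partial sums (like four SIMD lanes) and
// combines them at the end, which changes the order of the additions.
float lane_sum(const std::vector<float>& data) {
  constexpr std::size_t kLanes = 4;
  assert(data.size() % kLanes == 0);  // keep the sketch free of tail handling
  std::array<float, kLanes> lanes{};  // zero-initialised partial sums
  for (std::size_t i = 0; i < data.size(); ++i) lanes[i % kLanes] += data[i];
  float acc = 0.0f;
  for (float lane : lanes) acc += lane;
  return acc;
}
```

With the input `{1e8f, 1, 1, 1, -1e8f, 1, 1, 1}` and IEEE-754 single-precision evaluation, the left-to-right sum yields 3 (the first three 1s are absorbed by 1e8f), while the lane-wise sum yields 6. Because float addition is not associative, the compiler cannot make this change on its own under the default rules.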
@juancolmenares6185 17 days ago ⁺¹
@@CoffeeBeforeArch Great, thank you for the explanation and the content!
@ahmedazeem5975 1 year ago ⁺¹
Is it possible for you to also cover Arm Neon intrinsics?
This is a good topic and a good video :)
@CoffeeBeforeArch 1 year ago ⁺¹
Thanks for the suggestion, and glad you enjoyed the video :^)
I would like to do more ARM-based performance videos, but unfortunately, I don't have an ARM proc at this moment, so it's a non-starter until that changes.
Cheers,
--Nick
@anm3037 1 year ago ⁺¹
It’s unfortunate that SIMD doesn’t fit so many practical scenarios.
@SneedsFeeduckAndSeeduck 5 months ago
SIMD should be designed like a GPU kernel, to execute one stream of instructions on arbitrary amounts of independent data, instead of simply making a normal program that operates on 4 or 8 or 16 values at once. It just doesn't lend itself well to arbitrary processing the way it is currently common.
