Can someone explain why, at 56:14, calling multiplyAdd with the array pointers offset cancels vectorisation? What data is fetched into the SSE register, and how? Thank you.
SIMD expects the data that will be loaded to be aligned a certain way. By offsetting slightly, the compiler could either 1. use unaligned SIMD load instructions if the CPU supports them, or 2. fall back to a non-vectorized loop. On older x86 CPUs unaligned loads were significantly slower, but nowadays it's not so bad. So option 1 may be significantly slower, and option 2 (almost) certainly will be.
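A tiny sketch of the two options above, using SSE intrinsics (the function names are mine, and an x86-64 target is assumed):

```cpp
#include <immintrin.h>

// Option 1: unaligned loads, valid for any pointer (historically slower).
// Sums 8 floats starting at p.
float sum8_unaligned(const float* p) {
    __m128 s = _mm_add_ps(_mm_loadu_ps(p), _mm_loadu_ps(p + 4));
    alignas(16) float out[4];
    _mm_store_ps(out, s);
    return out[0] + out[1] + out[2] + out[3];
}

// Aligned loads: p must be 16-byte aligned, or the CPU faults.
float sum8_aligned(const float* p) {
    __m128 s = _mm_add_ps(_mm_load_ps(p), _mm_load_ps(p + 4));
    alignas(16) float out[4];
    _mm_store_ps(out, s);
    return out[0] + out[1] + out[2] + out[3];
}
```

Offsetting the input by, say, 2 floats makes the aligned variant unusable, which is exactly the choice the compiler faces.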
If `rand()` is inside the for loop, then the `random loop on 2-d array` would be slow partly because of the cost of `rand()` as well. [Of course random 2-D access will be slow, but it should not be that slow compared to column-wise access.]
I suspect that in a case with only a few columns, column-major might even be faster, because it could use a cache line for each column, thus using more of the cache (if you don't line them up so they step on each other's cache slots, of course). Most of this is covered in the computer architecture course of a CS bachelor's.
37:24 It's incredibly shocking how, in the relatively short space of around 30 years, the vast majority of today's programmers, even C++ programmers, have bypassed so much critical expertise in assembly language. It's truly, almost criminally, shocking, especially given that 1) assembly language is at the absolute core of C++ debugging, as demonstrated in this video, and 2) different compilers can frequently produce different machine code, running at different speeds. Higher-level languages may well have a more valid excuse, but C/C++ programmers have literally no excuse.
In the example at 20:44, wouldn't the rand() be fairly expensive? It's not a system call, but (at least in glibc) it does real work per call and takes an internal lock.
I'm angry too, still wondering if a floppy disk is faster than internal RAM... haha xD But I think it's common knowledge at this point; if you are going into such detail on coding performance, this should be common knowledge. If you are watching a video where a guy explains that getting data from L1 or L2 cache is time-critical, you'd better already know that disk and ROM memory are slow as fuck on that scale.
That's a rabbit hole; he can only cover the basics in the time given. Hash maps -> self-balancing trees -> B-trees -> esoteric data structures for modelling databases (improving disk-read performance when searching) take up an entire lecture on their own.
Really great talk, though next time maybe try to reduce the number of sub-topics or ask for more time to present them, since it looked like you had to rush in the second half to get everything out.
I'm not sure knowing the hardware helps us tremendously; we are bound by data access times whatever we do, and if the impact is significant, that means the hardware is slow. Until we actually design hardware, it's not our problem.
It’s really only “our problem” when high performance is a business requirement. The people who really stress out about this stuff typically work in gaming, high-frequency trading, low-level networking, etc. The other thing to point out is that we are hitting the physical limits of processor speed and transmission of data between caches. It’s no longer possible for us to just “wait until hardware gets faster.” Either you figure out how to optimize your code and parallelize what you can on multiple threads, or you just eat dirt until some unforeseeable scientific breakthrough allows us to make faster hardware. Those are your options.
You can't make one thing fit all purposes, so it is "our" problem. You can't bend physics and force electricity to run faster, but you can bend your approach to data and code and make it easier for the computer to do its work. If someone managed to send Voyager into space in 1977 and it still operates to this day, we can definitely make a video game/social network app/whatever run on hardware that's a zillion times faster than what the people at NASA had 40+ years ago.
The comparison at the beginning is kind of lame... You don't render just 60 frames per second; those frames often consist of 2 million or more samples each. In comparison, 44,000 samples per second is nearly nothing; even the slowest of CPUs can handle that in real time nowadays.
It applies to both C and C++. To some extent other languages too depending on how they handle things internally. And this is something that has been present for years already. For as long as you keep your data aligned and your array accesses linear you should be good to go. But it's good to know about possible pitfalls in case you need to squeeze some performance out of some of your programs.
The material covered is decent, but the way things are explained is quite bad! It is safe to say that the people who understand this talk already have a decent idea of the content.
I was going to disagree with you, but when I watched to the end, I came to agree. I think the purpose of the talk is more about raising questions and curiosity about the actual hardware your C++ code runs on.
I actually thought this talk was really good. It reminds you that if you want to do proper development of high-performance software, you MUST know about computer architecture. And this guy Timur really seems to know his stuff. For some reason, SIMD and VLIW and all the fancy things processors do nowadays are quite ignored in the community. But the bottom line is, yeah, this talk is not for explaining things to someone who is just learning. The people who attend are industry experts, who probably studied all of this but don't have years of hands-on experience in the particular topics covered in the talks.
Good content, but I feel very distracted by the "hmmm" and "aaaa" before each sentence. He really needs to learn how to speak in public. His delivery is so stressful that I can't understand a lot of the sentences. Aaaam.
"You found books from computer science", yeah the fact he didn't know about and also didn't say "computer architecture" suggest he never attempted to learn a bit of formal computer science. "C++ books never talk about this stuff" LOL are you kidding me? *Skipping thru* Oh wow branch prediction, data alignment, array access i.e. cache locality, false sharing I never heard of it... NOT This is the nth average talk with stuff already heard dozen times.
This accessing of L1 cache is the reason game programmers have been talking about Data-Oriented Design (DOD) in the past few years. It is expensive to send an entire C++ object through L1 cache just to update an x and a y coordinate; you cannot put 100,000 ships on screen quickly that way. With DOD, however, all the x's for all the ships are in a single array. Done this way, the array gets sent to the cache and you rip through all 100,000 x values, then you do the same for the y array. Insanely fast.
Game programmers talk about DOD/DOP because this style of programming reduces software complexity. And the simpler the software, i.e. the less generic it is, the faster it generally is. An array of floats is simpler than a vector of objects. Dereferencing a float pointer is simpler than calling a getter. Loading 4 floats from an array into an XMM register is simpler than gathering those floats from 4 separate objects, etc.
@@jeeperscreeperson8480 It seems like you are mixing up software complexity, which mostly describes the complexity of source code as perceived by the humans reading it, with "simple" code that generates fewer (or faster) instructions than non-DOD code. DOD code can be much harder for an average programmer to read and write; its complexity actually increases, because a lot of things are no longer hidden behind abstractions, and you control and think about much more than simply putting together some logic and abstracting it away via an interface/design pattern/etc. What you call "simple" code is in fact just "fast", "optimized" code, which is what the previous comment already said.
For new programmers playing around and eventually wanting to program a game of some sort: you often end up wanting something to represent a world map or game map, and often this is supposed to be tiles, from a simple Pac-Man to Warcraft.
When you make that map, you think you need a 2D array rather than a std::map, which is slower.
That 2D array is easy to implement as a class where you actually use a 1D array, and get each "tile" by the formula
CurrentWidth + (MaxWidth * CurrentHeight)
That way you just implement the array as, for example,
char WorldMap[MaxWidth * MaxHeight]
This is just a very simple implementation; you can of course do the same with a vector, and a 1D vector is faster than a vector of vectors. The compiler will turn a 2D array like
int WorldMap[100][100]
into one contiguous block anyway, and use the same
CurrentWidth + (MaxWidth * CurrentHeight)
computation to index it.
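A minimal sketch of the 1D-array-backed tile map described above (the class and method names are mine, not from the comment):

```cpp
#include <cstddef>
#include <vector>

class TileMap {
    std::size_t width_, height_;
    std::vector<char> tiles_;  // one contiguous block: cache friendly
public:
    TileMap(std::size_t w, std::size_t h, char fill = '.')
        : width_(w), height_(h), tiles_(w * h, fill) {}

    // row-major indexing: x + width * y, as in the comment's formula
    char&       at(std::size_t x, std::size_t y)       { return tiles_[x + width_ * y]; }
    const char& at(std::size_t x, std::size_t y) const { return tiles_[x + width_ * y]; }
};
```

Iterating y in the outer loop and x in the inner loop then walks memory linearly, which is the access pattern the talk recommends.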
It's very unlikely in a game engine (engine, not game) that you need to, or can, update 1000 entities' positions in the same function.
It's usually lots of different things happening to all those actors.
I do like these kinds of optimizations, but the problem with DOP is that it makes it quite hard to implement your game, and it's somewhat unnatural.
@@digitalconsciousness ECS mentioned!!
Well-organized talk with excellent explanations of the causes of performance degradation, using easy-to-understand examples. Would be very helpful even for quite experienced developers.
I am just a mechanical engineer doing vibration and signal analysis, but this presentation was very informative, even for me barely starting out in C++ a few days ago.
No. 1 tip for high performance: MEASURE EVERYTHING.
It doesn't matter how strongly you THINK something is better. The technology stack is complex enough to interact in unforeseeable ways.
You want both. Especially since some things are hard to debug or measure accurately, being able to deterministically understand SOME part of what's going on gives you an invaluable heuristic that statistics alone can't grant. (E.g., reading the assembly alongside your microcontroller's datasheet gives you very exact performance characteristics of instruction X versus Y.)
In addition, it's STILL important to understand WHY x is slower than y, so you can work around it and/or get hardware vendors to improve.
Dude, you have no idea how right you are. I have been trying my hardest to benchmark multithreaded code and get accurate results, especially comparing it against sequential algorithms. Between compiler optimizations, CPU optimizations, and the overall difficulty of benchmarking MT code in the first place, this entire ordeal has become a nightmare. No matter what results I get, and how close they are to what I believe they should be, there is still that little voice in my head telling me I just cannot trust the measurements, because there is so damn much going on under the hood that I don't know what to believe.
You NEED to have significant expertise in assembly language. No point in having that debugger if you don't FULLY understand that core component...!!!
I wasted so many hours of my life optimising based on theory, only for the result to end up being slower… definitely benchmark, and do it often if you care about the results.
Thank you very much. I really like your choice of presenting very intuitive examples.
Excellent presentation, as usual from Timur Doumler! Concise, comprehensive, emotional, fascinating. Recommended! ❤
These conventions NEVER fail to provide useful knowledge and springboard ideas to help inspire newer developments and technologies!
I once got rejected at a job interview because I told them exactly the thing you pointed out about lists/trees: that if you must use one, it can at least be accelerated with a custom allocator (preallocating nodes), depending on your use case and on whether cache misses are a bottleneck. Apparently their "senior" programmer didn't agree with that one.
Anyone who listens to the people next to them without doing their own research is dumb.
lol
What company was that?
Ya when you talk like that to your everyday devs they think you are from a different universe or making stuff up lol
He knows his stuff.
Quite dry stuff, and not really what I need for my daily work. But gave me a huge amount of new knowledge. Very well presented, really good examples, and in general a clear and entertaining talk. I have really enjoyed watching it!
Glad it was helpful!
I know this talk is quite a few years old, and it has some solid points to consider, but I did want to comment on a few things. One of them is that he compares audio software engineering to game development, saying that while game devs worry about code being performant at 60 fps, audio software has to worry about 44k sps (samples per second), and this is not a fair comparison. Yes, a game loop may need to run at 60 fps, but within a single frame many computations are done on the CPU, likely on way more than a few hundred thousand data points.
The second thing: at one point he says that in general you want your array of structs laid out contiguously for locality, which is true in general, but if you use the Struct of Arrays (SoA) paradigm, which I don't think he mentions at all, you may well get even better performance, because the memory is more tightly packed and you avoid fetching fields you don't need for a given computation.
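A hedged sketch of the AoS vs SoA contrast (the struct and field names are illustrative, not from the talk): in the SoA version, the x update streams through one tightly packed array and never drags hp/shield bytes through the cache.

```cpp
#include <vector>

struct ShipAoS { float x, y, hp, shield; };  // hp/shield ride along in cache

struct ShipsSoA {                            // each field contiguous
    std::vector<float> x, y, hp, shield;
};

// AoS: every iteration pulls a whole 16-byte object through the cache.
void moveAoS(std::vector<ShipAoS>& ships, float dx, float dy) {
    for (auto& s : ships) { s.x += dx; s.y += dy; }
}

// SoA: x[] and y[] stream through the cache with no wasted bytes,
// and each loop vectorizes trivially.
void moveSoA(ShipsSoA& ships, float dx, float dy) {
    for (auto& v : ships.x) v += dx;
    for (auto& v : ships.y) v += dy;
}
```

Both produce identical positions; the difference is purely in memory traffic.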
This is exactly what I was looking for. Great work! Thanks for the info!
Glad it was helpful!
Very intuitive examples! Down-to-earth knowledge!
Thank you. Some of this I knew, but I'd never heard of denormals. Will have to check them out :)
15:30 The hash for integers is usually the identity function, which is a no-op.
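This is an implementation detail rather than a standard guarantee: libstdc++ and libc++ hash integral types with the identity, while MSVC uses FNV-1a. A quick check, assuming one of the former:

```cpp
#include <cstddef>
#include <functional>

// On libstdc++/libc++, std::hash<int> just widens the value to size_t.
std::size_t intHash(int v) { return std::hash<int>{}(v); }
```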
Excellent talk, full of great information and tips!
Glad it was helpful!
This is super informative, thank you for the effort Timur.
Good talk. People need to remember that performance in general-purpose computing has its limits, where FPGAs could take over. There is only so much performance that can be squeezed out of general-purpose computing, notwithstanding the versatility and low development cost of that paradigm. The talk was very good.
Good presentation, knew about all of it except the FPU stuff, but got way more useful details than I had.
What a great talk, thanks!
Summary of what is most surprising:
1. cache associativity hole on round values
2. some modern cpus are good at unaligned access (I was sure we are moving away from that instead?)
3. false sharing
52:08: Here I see a different reason than the one mentioned. In the left code, 4 different values are accessed (a[i], b[i], b[i+1] and c[i]), while in the right version only 3 values are accessed (b[i+1], c[i] and a[i+1]). So is it really about vectorization? I don't understand this part; can someone explain, please?
The same thing stood out to me initially, but I'm pretty confident it's insignificant here. b[i] and b[i+1] are extremely likely to be on the same cache line.
Excellent talk!
I loved this talk. Never knew that a simple contiguous set of variables can be SO MUCH faster than randomly situated ones.
Nailed it, great examples.
Nice explanation with code samples and supporting data analysis.
47:27 What's the point in wrapping a, b, c, d in atomic if each thread works on its own variable?
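A likely answer: that example in the talk is presumably about false sharing. Even though each thread touches only its own variable, four plain variables typically land on the same cache line, which the cores then bounce between their private caches on every write. A hedged sketch of the usual fix (64 is an assumed line size; the variable names follow the question above):

```cpp
#include <atomic>

// Give each per-thread counter its own cache line, so a write by one
// thread no longer invalidates the line holding the other counters.
struct Counters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
    alignas(64) std::atomic<long> c{0};
    alignas(64) std::atomic<long> d{0};
};
```

(The timing difference only shows up with real threads hammering the counters; the layout is the part that matters here.)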
This was really good!!!
I have a question: when Timur Doumler demonstrates that randomly accessing a 2D array is almost 100x slower than row-major access, is the cost of rand() being accounted for? I mean, it is reasonable that random access is slow, but I am not sure how much the rand() calls affect the measurement.
I think there's a mistake at 27:00. An 8-way set-associative cache means there are 8 blocks (cache lines) per set. So if we have a 128 KB 8-way cache with 64 bytes per block, that gives 128 * 1024 / (64 * 8) = 256 sets, 512 bytes per set, 8 blocks per set; not 8 sets of 16 KB each as shown in the presentation.
Great stuff, at least for someone without a degree in CS. Thanks!
To vectorise the loop (at 52:00), why not just do this:
for (int i = 1; i < 998; ++i)
    a[i] += b[i];
for (int i = 1; i < 998; ++i)
    b[i+1] += c[i];
Not only does this get rid of the dependency, but each loop needs only 2 positions in the cache (one per array) rather than 3, so it's better overall practice for reducing crowding. Or does this actually screw up the SIMD vectorisation? I don't know much about it.
Make certain you confirm - and fully understand - what the compiler is doing by critically examining the assembly version of your routine.
This could take double the time. A CPU can sometimes execute two instructions in one cycle on the same core, so it can be better to do it in one loop.
Hello! Great content.
Is there a link for a video mentioned at 19:31 ?
Just curious, isn't this why data-oriented design is preferred? Seems like this gives a good example for data-oriented design. Am I wrong??
At 41:00 he stated that no branch predictor can work with random numbers, but I was wondering: if predictors can detect patterns and look ahead, why can't one generate the random number in advance and make predictions using that pregenerated number?
Not an expert, but randomness is usually directly affected by the CPU clock as a source of entropy, so picking a number in advance is not the same as picking it normally. Also, picking extra random numbers and then throwing them away because of mispredicted branches would have an effect: some would be missing, and the following random numbers would not match the original pattern of distribution.
Depending on the CPU architecture, branch prediction can look ahead and precalculate different possibilities only up to a certain point before running out of cache. If it invests entirely in one possibility and that guess is wrong, there is a huge penalty, since it has to start calculating the branch all over again (from the branching point in the code, e.g. an if statement). This limits how far the branch predictor can look ahead when dealing with random numbers (the number of possibilities quickly inflates), for example in a nested if structure.
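The classic way to see the effect the replies above describe (a sketch, not from the talk): count bytes >= 128 in a shuffled buffer, then sort the buffer and count again. The answer is identical, but on most CPUs the sorted pass runs several times faster because the branch becomes perfectly predictable.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// The branch inside is a coin flip on shuffled data, but perfectly
// predictable on sorted data.
long countBig(const std::vector<std::uint8_t>& v) {
    long n = 0;
    for (auto x : v)
        if (x >= 128) ++n;
    return n;
}

std::vector<std::uint8_t> randomBytes(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<std::uint8_t> v(n);
    for (auto& x : v) x = static_cast<std::uint8_t>(dist(gen));
    return v;
}
```

Timing is left to the reader (wrap each countBig call in a clock); only correctness is checked here, since sorting changes the speed, not the answer.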
At 53:20, is there a typo in the left-side code: a[i] += b[i+1] instead of a[i] += b[i]?
The example at 56:20 is interesting. I didn't quite understand the alignment issue with the offset. Is it because the two buffers have different offsets? If I used the same offset, say `+2`, for both, would there still be an alignment issue?
Watched this talk again a year later. I still have the same doubt. Haha.
The problem is that aligned SIMD load instructions only work on aligned data. If both pointers have the same offset, the compiler can special-case the first 2 elements and then process 4 aligned elements at a time.
What is the talk he refers to at 49:45?
Not considering any oversampling in either domain, isn't it more like
gfx: 1920 * 1080 * 30
sfx: 44100 * 2
samples per second?
Why 30?
Yeah, I don't really know why he made that statement. Kind of needlessly provocative, IMHO. It's even more skewed toward games than you indicate, because in games you are not just sending colors to the screen but also dealing with physics systems, multiplayer, tons of custom tools, fast compile times, and a general gameplay complexity limited only by a designer's creativity. Otherwise, good talk though.
It's harder to write a program that does some computation in 1/44100 seconds than 44100 such computations in 1 second. It's about bandwidth vs latency. His point is that not only do you need to make computations as efficient as possible, but they also have to have low latency. You don't worry so much about latency in programming a game renderer, and the GPU itself is not optimized for latency, unlike the CPU.
Of course he is not an idiot who thinks that the computation he's doing in audio is 200 times faster than game engines. Game engines are written in C++ and C, not Python.
I too think processing 60 frames (or even 250 with modern monitors these days) in a video game, is way way way more expensive than sampling 44kHz audio...
The CPU (which you use to process audio) usually has 8 strong cores while modern GPUs have 3000+ "stupid" cores. It's not the same type of sample and you can't really compare that.
Actually, frame rendering has to produce far more than 24-60 samples per second. You have to render screenwidth * screenheight samples per frame, so at 1080p that is about 2 million samples.
Maybe, but real-time sound programming is a little different. In a game, for example, you can render the same frame twice and nothing bad will happen. In sound, the membrane of a speaker is a physical object with kinetic energy, and if you send the same signal 2 samples in a row, that means the magnet should apply infinite braking power to stop it for a moment. It's like pulling the emergency brake on a locomotive, the listener will hear an audible click.
Thanks a lot! Nice explanation!
Great talk! Thank you! :)
33 mins 40 secs, I found this out myself trying to optimise a virtual machine. I thought aligning instructions and arguments on 4 or 8 byte boundaries would help. In the end I saw I was only getting a 1% increase in performance.
🤔 How do you find the size of a cache line [before using alignas(cachelineSize)]?
Can someone explain the example at 54:19, please? I have a hard time understanding it.
Something to do with SIMD vectorisation.
I hope you're not benchmarking random memory access speed with the rand function and a modulus operator on every access at 20:44 🤦♂️
Lol yeah i was thinking that
So, the correct thing to do is to add that rand()%n calculation to the other two loops as well? Or maybe precompute the random indices?
Of course replacing %m with &(m-1) is much faster (if m is a power of 2 and m-1 comes from a const variable), but he might have kept it for readability instead.
Yeah. Easily 100 unnecessary cycles just from these two.
Good explanation
16:03, not true. VTune shows you "these things".
What are we supposed to do about problems that specifically require random traversal of arrays?
you wait ;)
Can someone explain why at 56:14 , calling multiplyAdd with the array pointers offset cancels vectorisation? What data is fetched into the SSE register and how? Thank you.
SIMD expects the data being loaded to be aligned a certain way. With the offset, the compiler can either 1. use unaligned SIMD load instructions, if the CPU supports them, or 2. fall back to a non-vectorized loop. On older x86 CPUs unaligned loads were significantly slower, but nowadays it's not so bad. So option 1 may be significantly slower, and option 2 (almost) certainly will be.
@@CBaggers Thank you!!
If `rand()` is inside the for loop, then the `random loop on 2-d array` would be slow partly because of the cost of `rand()` as well.
[Of course, random 2-D access will be slow, but it should not be that slow compared to column-wise access.]
Awesome Thanks...
I suspect that in a case with only a few columns, column-major might even be faster, because it could use a cache line per column and thus use more of the cache. Provided you don't line the columns up so they evict each other's cache sets, of course. Most of this is covered in the computer architecture course of a CS bachelor's.
37:24 It's incredibly shocking how, in the relatively short space of around 30 years, the vast majority of today's programmers - even C++ programmers - have bypassed so much critical expertise in understanding assembly language. I mean it's truly...almost criminally shocking, especially given that...
1) Assembly language is at the absolute core of C++ debugging and, as demonstrated in this video
2) Different compilers can frequently produce different machine code, running at different speeds
Higher level languages may well have a more valid reason, but C/C++ programmers have - literally - no excuse.
In the example at 20:44, wouldn't the rand() be very expensive, since it (at least on linux) uses a system call, with the context switch and everything?
It is not a syscall; it's an LCG (linear congruential generator) in glibc.
Very interesting
Wouldn't it be much better if you generated the random numbers beforehand?
talking about 21:40
You should've included the SSD / hard disk in your memory speed / latency discussion.
What for? If you're interacting with the hard disk, you know it will be slow.
I'm angry too, still wondering if a floppy disk is faster than internal RAM... haha xD
But I think it's common knowledge at this point; if you are going into such detail with coding performance, then this should be common knowledge.
If you are watching a video where a guy explains that even getting data from L1 or L2 cache costs precious time, you'd better already know that disk storage is slow as fuck in that dimension.
That's a rabbit hole, he can only cover the basics given the time. HashMaps -> Self Balancing Trees -> BTrees -> Esoteric Data Structures to model databases (improve disk read performance when searching) take up an entire lecture on their own.
Really great talk, though next time maybe try to reduce the number of sub-topics or ask for more time to present them, since it looked like you had to rush in the second half to get everything out.
“want fast python? well tough shit”
Perfect talk except for the ums
I'm not sure knowing hardware helps us tremendously; we are bound by data access times whatever we do, and if the impact is significant, it means the hardware is slow. Until we actually design hardware, it's not our problem.
It’s really only “our problem” when high performance is a business requirement. The people who really stress out about this stuff typically work in gaming, high-frequency trading, low-level networking, etc.
The other thing to point out is that we are hitting the physical limits of processor speed and transmission of data between caches. It’s no longer possible for us to just “wait until hardware gets faster.” Either you figure out how to optimize your code and parallelize what you can on multiple threads, or you just eat dirt until some unforeseeable scientific breakthrough allows us to make faster hardware. Those are your options.
Dude, do you think slack running slow on ryzens is because the hardware is slow?
You can't make one thing fit all purposes so it is “our” problem.
You can't bend physics and force electricity to run faster, but you can bend your approach to data and code and make it easier for the computer to do stuff. If someone managed to send Voyager to space in 1977 and it still operates to this day - we definitely can make a videogame/social network app/whatever to run on a hardware that's a zillion times faster than what people from NASA had 40+ years ago.
The comparison at the beginning is kind of lame... You don't render just 60 frames per second; those frames often consist of 2 million or more samples each. In comparison, 44000 samples is nearly nothing; even the slowest of CPUs can emulate that in real time nowadays.
Cache lines, eh. Thanks!
So we have another thing to worry about while coding in C++; I sometimes believe it is getting too overcomplicated!
It applies to both C and C++, and to some extent to other languages too, depending on how they handle things internally. And it's something that has been present for years already. As long as you keep your data aligned and your array accesses linear, you should be good to go. But it's good to know about the possible pitfalls in case you need to squeeze some performance out of your programs.
Can you offer me a job?!
Wp Avada, nobody said it was gonna be easy 😀 Just remember that C++ lets you work efficiently, while other higher-level languages can't provide that.
Well, it's hard because some are really bad programmers.
I wish it was more obvious somehow whether or not these videos are about basic topics. I keep getting clickbaited in and wasting time.
The material covered is decent, but the way things are explained is quite bad! It's safe to say that the people who understand this talk already have a decent idea of the content.
I was gonna disagree with you, but after watching to the end, I now agree. I think the purpose of the talk is more about raising questions and curiosity about the actual hardware your C++ code runs on.
I actually thought this talk was really good. It reminds you that if you want to do real, proper development of high-performance applications, you MUST know about computer architecture. And this guy Timur seems to really know his stuff. For some reason, SIMD and VLIW and all that fancy stuff processors do nowadays are quite ignored in the community. But the bottom line is, yeah, this talk is not for someone who is just learning. The people who attend this talk are industry experts, who probably studied all of these things but don't have years of hands-on experience in the particular topics they are listening to in the talks.
The "hmm" counter is exploding. The speaker should take a Toastmasters course; that would make his presentation clearer and more engaging.
Good content, but I feel very distracted by the "hmmm" and "aaaa" before each sentence. He really needs to learn how to speak in public. His speech is so halting that I can't understand a lot of sentences. Aaaam.
"You found books from computer science" — yeah, the fact that he didn't know about, and also didn't say, "computer architecture" suggests he never attempted to learn a bit of formal computer science.
"C++ books never talk about this stuff" LOL are you kidding me?
*Skipping thru* Oh wow branch prediction, data alignment, array access i.e. cache locality, false sharing I never heard of it... NOT
This is the nth average talk with stuff already heard a dozen times.
Did all of this in my high performance class at my university
Well some of us dont know and now we do.