CppCon 2017: Chandler Carruth “Going Nowhere Faster”

  • Published: 24 Nov 2024

Comments • 84

  • @MrKatoriz
    @MrKatoriz 2 months ago +1

    For the last question: the CPU's store buffer and register renaming make it possible to hide the results of operations from external observers (i.e. other cores, or memory). Changes are only made visible once they are program-correct: the CPU will speculatively execute past the array bounds, but before actually "publishing" the changes it checks whether the branch prediction was correct and discards the incorrect results. That's why anything works.

  • @eauxpie
    @eauxpie 6 years ago +149

    If only he had dug into that last question, we might have known about Spectre that much earlier.

    • @henke37
      @henke37 5 years ago +12

      Maybe not Spectre in general, but certainly the variant with stale cache-entry loads.

    • @VivekYadav-ds8oz
      @VivekYadav-ds8oz 2 years ago +4

      What a butterfly effect would that have been lol

    • @emilien.breton
      @emilien.breton 2 months ago

      Though that last question was kind of off topic.

  • @echosystemd
    @echosystemd 7 years ago +111

    Chandler helps me realize that I know nothing about benchmarking.

  • @jackwang3397
    @jackwang3397 2 years ago +11

    I have no idea about assembly language or the finer details, but I can still feel the passion of the lecturer and the audience. I am satisfied, and I cheered like everyone else did at 35:10 during my lunch break. Good job!

  • @hanneshauptmann
    @hanneshauptmann 7 years ago +78

    I am a simple man, I see a good talk, I press like.

    • @dipi71
      @dipi71 7 years ago

      ... and write an encouraging comment to do the same.
      So do I. Cheers!

  • @osere6432
    @osere6432 3 years ago +32

    In his 2018 talk he discusses the last question in great detail.

    • @andik70
      @andik70 3 years ago +1

      Do you have a link?

    • @shadowtuber1589
      @shadowtuber1589 2 years ago

      ruclips.net/video/_f7O3IfIR2k/видео.html

    • @kamilziemian995
      @kamilziemian995 2 years ago

      This talk from 2018 is: CppCon 2018: Chandler Carruth “Spectre: Secrets, Side-Channels, Sandboxes, and Security”.
      ruclips.net/video/_f7O3IfIR2k/видео.html

  • @ldmnyblzs
    @ldmnyblzs 7 years ago +19

    That command prompt is just pure madness!

  • @Zeturic
    @Zeturic 5 years ago +50

    20:48
    Going from .01 to .04 is a 300% increase in cache misses (i.e. 4x the amount), not .03%. When you look at it that way, dramatic changes in performance aren't that surprising.

    • @movax20h
      @movax20h 4 years ago +11

      The problem is that he's counting L1 dcache misses relative to L1 dcache loads within a single run. Because of the dcache misses his program runs slower, does fewer iterations, and therefore fewer dcache loads overall. A clearer way would be to fix the number of iterations, so that each benchmark attempts the same number of loads overall, and then compare between runs. Still, the 4x observation from the relative measurements is a good one. A 0.01% cache-miss rate doesn't mean that 0.01% of your TIME was spent missing; it's just a counter. But L1 dcache misses are something like 20 times more expensive than hits, so if you weight them properly, you get genuinely meaningful information. You just need to know how to interpret these numbers.

  • @MrKatoriz
    @MrKatoriz 2 months ago

    cmov is for cases where the condition is close to a random 50/50, since branching performs absolutely horribly there. In the presentation, the random numbers generated for the test are in the range 0-intmax (roughly 2 billion) while the clamp limit is 255, so the probability of a mispredict is ~1.18e-7, which is why cmov is slower in this example.

  • @kamilziemian995
    @kamilziemian995 1 year ago +10

    How I wish I had 1/1000 of Carruth's knowledge about compilers.

  • @emanuellandeholm5657
    @emanuellandeholm5657 1 year ago +1

    I'm not a C++ developer (I have a background in C), and I don't really know x86, but these talks by Chandler Carruth are so interesting to me. This is like crack! :D

  • @abc3631
    @abc3631 4 years ago +2

    That's what I love about Chandler's talks: he gets to the nub of the topic, and hard topics at that, rather than just glossing over them.

  • @christophrcr
    @christophrcr 7 years ago +15

    In the clamp example, after the first iteration all the values will already be clamped. I guess that's why branch prediction is always right after a certain point.

  • @bluecircle56
    @bluecircle56 5 years ago +7

    "I don't even know. Magic and unicorns?"
    Great talk!

  • @ZeroPlayerGame
    @ZeroPlayerGame 5 years ago +5

    The reason for the tight-loop speedup is that the branch is never missed after its first execution, so it's predicted extremely well and the memory store is simply avoided altogether.

    • @movax20h
      @movax20h 4 years ago +2

      The memory store is not avoided; it still happens. It's super fast because it goes to a cache line that's already in L1 dcache (we just loaded it), and because of store-to-load forwarding (not important here, but it works behind the scenes to resolve load-after-store problems quickly, even on an L1 dcache miss).
      The probable reason the code runs at equal speed is something called op fusion: the CPU recognizes the jump by its small relative offset and the move after it, and, I think, basically turns them into a conditional store.

  • @JackAdrianZappa
    @JackAdrianZappa 5 years ago +5

    The most important thing you can get out of this talk is that Magic and Unicorns keep the processor from crashing! :D

  • @tobiasfuchs7016
    @tobiasfuchs7016 7 years ago +12

    10:00 Ohmygosh I'm using the same nvim setup, shell and even colorscheme as Chandler Carruth, I will never change my ~/.config again!

    • @orbital1337
      @orbital1337 7 years ago +1

      What's the colorscheme? Looks pretty nice.

    • @leonhrad
      @leonhrad 7 years ago

      pretty sure it's jellybeans

    • @victornoagbodji
      @victornoagbodji 7 years ago +1

      yes please, give away the colorscheme and nvim setup : )

    • @ericcurtin2812
      @ericcurtin2812 6 years ago +1

      Vanilla vim, dark background for life lol:
      .vimrc
      syntax off
      set background=dark
      set mouse=a
      set hlsearch

  • @AlexeySalmin
    @AlexeySalmin 7 years ago +1

    TSO doesn't allow reordering anything past a store. Therefore a regular load won't flush the store buffer, because it's being reordered ahead of the buffered writes, not past them.

  • @warrenhenning8064
    @warrenhenning8064 3 years ago +3

    6:08 did anything ever come of the Efficiency Sanitizer?

  • @llothar68
    @llothar68 7 years ago +5

    Dammit, he always gets the nicest hardware to play with. Hey boss, can you hear me? I want an AMD EPYC workstation too.

  • @N00byEdge
    @N00byEdge 7 years ago +18

    So what if he used $255 instead of %ebx on the cmove too??

    • @CornedBee
      @CornedBee 6 years ago +25

      You can't. cmov doesn't take an immediate operand; the source has to be a register or memory.

  • @davidjulitz7446
    @davidjulitz7446 5 years ago +1

    Hmmm, cmovxx should be the right choice.
    Could it be that cmovxx performs worse only because it can't take an immediate? It looks to me like moving an immediate made the performance difference here.
    However, modern processors are obviously too complex to know the best optimization for every particular case.

  • @chris3kpl
    @chris3kpl 7 years ago +4

    I wonder if simply masking with 0xff, or casting to char, would be faster in the clamp example, because of this special case of clamping to 255?

    • @MatthewChaplain
      @MatthewChaplain 7 years ago +14

      Yes, but you would get a modulo instead of a clamp. e.g. 257 & 0xFF is 1, not 255.

    • @chris3kpl
      @chris3kpl 7 years ago +4

      Yes, you're right! :)

  • @PeterFoxtrott
    @PeterFoxtrott 5 years ago +2

    Could someone explain to me what he means by modern processors no longer having forward-branch prediction? What else would a branch predictor do other than predict forward? And does that mean it no longer exists on modern processors? 32:06

    • @movax20h
      @movax20h 4 years ago +4

      There are two branch predictors at a global level in the CPU: a 'static' predictor and a dynamic predictor. The static predictor is used when the code is completely fresh and has never run before (cold code). It usually only takes into account whether a jump is forward or backward, small or big, and relative or absolute. A static predictor will typically predict short relative backward jumps as taken (they are most likely loop jumps), but large forward jumps as not taken (they would lead to even colder code, possibly error handling or end-of-loop handling). Big relative backward jumps will usually also be predicted not taken, as they are unlikely to be loops, or are likely to be very big loops anyway, in which case the CPU probably already has plenty to do. It might still ask the cache to prefetch the target cache line for a big jump either way, but that can be wasteful, and on a multi-core CPU it can add a lot of wasted memory traffic that other cores could have used better. Sometimes these things can be tweaked in the BIOS or in microcode for a specific CPU, but they also change between CPU generations and between manufacturers, and either way it's hard to know which choice is better. Most CPU ISAs don't have assembly-level hints to indicate which jumps are more likely; it's done purely by code locality (closer code is more likely to be part of a loop and should be preferred, but I don't know where the cutoff is - probably tuned on a lot of codebases during design).
      The issue in his code is actually mostly due to the random distribution of his data and how speculative execution works: he just flushes the entire pipeline on 50% of the loop iterations.

  • @Cwize1
    @Cwize1 7 years ago +4

    My theory on why using 255 is faster: 255 is the max value of an unsigned byte, so it tends to crop up a lot in programs, and internally the CPU has a register dedicated solely to storing that value. (Though I know nothing about CPU design, so I'm probably completely wrong.)

    • @peterfireflylund
      @peterfireflylund 7 years ago +2

      You are. Some CPUs hardwire a "register" to a specific value, but that value is 0, not 255 (or any other "all bits set" value). x86 CPUs don't do this, but ARM CPUs do in 64-bit mode. The x86 CPUs are in your laptop; the ARM CPUs are in your phone.
      Byte masks are common in performance-critical code, so modern compilers recognize them.

    • @ChandlerCarruth
      @ChandlerCarruth 7 years ago +23

      Sadly, the entire demo with the clamp loop was broken; no need to speculate about this particular goof. =[ My apologies - I feel really bad about this. I tried to craft a tiny reproducer of an issue I see commonly in the real world and thought I had succeeded, but my data was poor.
      I have a much better example now. Suffice to say, a key ingredient is that the branch *has* to be well predicted, or the branch can't possibly help (which you can see in the subsequent discussion of how the processor works). A second key ingredient is that you can craft much more dramatic examples using cmov vs. a branch, which I should have done to make this clearer and less confusing. Again, sorry about that.

    • @Peter_Cordes
      @Peter_Cordes 5 years ago +1

      @@ChandlerCarruth - I wrote up an analysis a while ago of a case where `gcc -O3` is slower (on sorted data) because it chooses CMOV: stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 (And some versions really shoot themselves in the foot by insisting on putting CMOV into the loop-carried dep chain, even if you write it in C as conditional zeroing of an input to a loop-carried ADD.) Fun fact: gcc -O3 -fprofile-use figures this out and uses a branch when you do profile-guided optimization on predictable data.
      I assume that's the kind of case you say would make a better example, because it would tie in with your 2nd bit about the dot product loop (which does have a loop-carried dep chain, so it would hurt the "unrolling" across iterations that out-of-order execution gives us). Using a control dependency which the CPU will speculate past can very much be a win vs. a CMOV data dependency in a loop-carried dependency chain.
      ----
      If CMOV had been much slower, it would make me think of stackoverflow.com/questions/54190958/throughput-analysis-on-memory-copy-benchmark where a byte-at-a-time asm copy loop only runs at 1 iter per 4 clocks on IvyBridge, even though it's mostly getting cache hits. But I think that's some kind of HW-prefetch failure, not entirely due to stores depending on loads.
      Breaking the data dependency between load and store with a branch instead of CMOV was plausible in your test, but unlikely on a modern x86 with out-of-order execution deep enough to keep lots of those independent dependency chains in flight at once. Maybe on an in-order Atom. :P

  • @arnabthakuria2243
    @arnabthakuria2243 8 months ago

    What font is this? Looks very nice.

  • @KD-username
    @KD-username 3 years ago +2

    What are his vimrc settings?

  • @victornoagbodji
    @victornoagbodji 7 years ago +1

    amazing as always!

  • @DeepakGiri-z5z
    @DeepakGiri-z5z 9 months ago

    Where can I get the C++ code in his presentation?

  • @zarakivishalkenchan
    @zarakivishalkenchan 5 years ago +1

    This talk's title should have been "cmov spoiling instruction pipelining".

  • @aqg7vy
    @aqg7vy 6 years ago +3

    What is that vim setup??

  • @marianaldenhoevel7240
    @marianaldenhoevel7240 1 year ago

    My takeaway:
    If you really care about performance you have to measure it (fine) and try non-obvious things to see whether they change anything (not so great).
    Even if Chandler disagrees, we then just hope that the system you'll be running on in anger agrees with the one you measured. The next CPU may think differently, and that CPU might be on the machine you use in production, or simply change under you next month.
    That feels like the software equivalent of "shut up and calculate" in physics. Not a very satisfying place to be.
    Luckily my code has all the time in the world and can just blunder through whatever my thick brain writes down.

  • @FalcoGer
    @FalcoGer 1 year ago

    36:30 Use std::inner_product. I'm sure it's well optimized and does what you want without you re-implementing it.
    Running std::inner_product on vectors with 65535 elements with random values was around 3 times faster than calling the dotproduct function (27.2ms vs 75.3ms on my machine).

  • @pfeilspitze
    @pfeilspitze 5 years ago +7

    "April 2019: Intel® Architecture Code Analyzer has reached its End Of Life" :(

    • @LeDabe
      @LeDabe 1 year ago

      Intel IACA is dead; long live llvm-mca.

  • @BenjaminDirgo
    @BenjaminDirgo 6 years ago +5

    I wonder if Spectre makes this talk obsolete.

    • @ericcurtin2812
      @ericcurtin2812 6 years ago +1

      I opened a PR on his GitHub wondering the exact same thing: github.com/chandlerc/chandlerc.github.io/pull/1
      If I were going to get a response, I probably would have gotten one by now.

    • @meneldal
      @meneldal 6 years ago +4

      Well, he gave a talk about that this year: ruclips.net/video/_f7O3IfIR2k/видео.html

    • @mcneeleypeter
      @mcneeleypeter 6 years ago +1

      Nothing from this talk has changed due to Spectre. The hardware still operates in exactly the same way.

  • @SatyajitGhana7
    @SatyajitGhana7 5 years ago

    Code for the benchmark utility?

  • @ginkner
    @ginkner 7 years ago +2

    Stupid question, but what command line is that? I've never seen git look like that.

    • @chris3kpl
      @chris3kpl 7 years ago +6

      Ben, it's the fish shell in tmux

    • @chris3kpl
      @chris3kpl 7 years ago

      Ben, and the editor is probably nvim

    • @MitchPleune
      @MitchPleune 7 years ago +2

      looks like it says fish and that looks like a powerline something or other.

    • @gehngis
      @gehngis 7 years ago +3

      Yes and the theme looks like bobthefish: github.com/oh-my-fish/oh-my-fish/blob/master/docs/Themes.md#bobthefish

  • @charlesreynolds8696
    @charlesreynolds8696 2 years ago

    What are the implications of this post-spectre?

    • @alpers.2123
      @alpers.2123 2 years ago +2

      It turns out, it wasn't magic and unicorns

  • @cmilkau
    @cmilkau 6 years ago

    "You can't speculate on cmov" - there seems to be no reason why you couldn't. Maybe today's processors can't, but I don't see any difference between branching and conditional operations that would prevent speculation. After all, a branch is just a conditional write to the instruction pointer, which is about the hardest thing you could speculate on. What am I missing?

    • @Peter_Cordes
      @Peter_Cordes 5 years ago +2

      If you want speculation that can mispredict and might need to roll back, use a branch! If you want a data dependency, use CMOV. Speculating CMOV would defeat the entire purpose (unless you dynamically evaluate and predict whether that would be a good idea for any given CMOV).
      But that would be hard to implement. An instruction that could decode to *either* an ALU operation or a branch would need another whole predictor, and a whole new mechanism for pretending we branched in the internal uops when the x86 code didn't contain a branch.
      Plus you'd need that predictor as an input to every decoder. Or, if it only fed the complex decoder, then CMOV could delay decoding by always having to decode in the complex decoder, even when it chooses to decode to a single ALU uop (with a normal data dependency) instead of a branch - i.e. in cases where it wouldn't predict well as a branch, e.g. a condition that is only true some of the time rather than almost all of the time.
      (Possibly related: one reason that `rep movs` startup is slow is that microcode branches on Intel CPUs are special and can't use branch prediction. See stackoverflow.com/questions/33902068/what-setup-does-rep-do for a quote from Andy Glew, the former Intel architect who implemented fast strings on P6. This may mean it would be impossible for a CMOV to decode to a branch-like uop instead of a mov uop in a way that could actually speculate.)
      See also stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 for another case where CMOV is not a win, but it is a win on unsorted data - with more links and details about when CMOV is and isn't a good idea.

    • @bloopbleep7082
      @bloopbleep7082 4 years ago

      Conditional moves don't change the instruction stream; conditional branches potentially do. So you usually don't need to flush the instruction pipeline with conditional moves.

  • @slavanap
    @slavanap 6 years ago +1

    35:03 Man, you just jump over a memory write operation (for the majority of your data elements). There is a huge impact from that - I'd guess more noticeable than not using `cmov`.
    ADDED: 49:01 YES! Keep your hands off `cmov`! The issue isn't with that instruction!

    • @SolomonUcko
      @SolomonUcko 4 years ago +1

      Didn't he say that the majority of the elements are over 255, and need to be modified?

  • @Thecommet
    @Thecommet 5 years ago

    9:08 Your indices array ranges from 0 to count, when it should be 0 to count-1. v[count] is out of bounds.

    • @movax20h
      @movax20h 4 years ago +5

      The RNG will not generate count. RNG(0, count, count) generates 'count' elements (the last parameter), each of which is between 0 and count-1 inclusive. Excluding the max element is very common in RNG design, and matches how iterators and loops work.

  • @SanjeevkumarMSnj
    @SanjeevkumarMSnj 4 years ago

    Which terminal is he using?

  • @marcpanther7924
    @marcpanther7924 7 years ago

    What is he using? My bash doesn't look as cool as that.

    • @dipi71
      @dipi71 7 years ago +2

      It's bobthefish; but note the numerous redraw bugs during the talk. It doesn't look completely trustworthy to me. Oh well, maybe it's nvim.

  • @kosbarable
    @kosbarable 6 years ago

    Was this in The Simpsons?
    Probably not.
    Thumbs up!

  • @joe-ti8rz
    @joe-ti8rz 6 years ago

    Go Chandler. Survive. Be better. Buy the Tibetan Book of the Dead and meditate.

  • @noxabellus
    @noxabellus 7 years ago

    Looks, talks, acts like a Chandler. Thinks like a damn Einstein O_O

  • @DeepakGiri-z5z
    @DeepakGiri-z5z 9 months ago

    Where can I find the code that was used in this presentation?