C++ and Beyond 2012: Herb Sutter - atomic Weapons 2 of 2

  • Published: 6 Nov 2024

Comments • 21

  • @Roms8313 · 1 year ago +4

    this talk is amazing :O thanks so much

  • @dv8silencer · 6 years ago +1

    Thank you Herb Sutter

  • @jaredmulconry · 1 year ago

    21:49 It's worth noting that speculative execution in modern processors will behave like the top example: since speculatively executed operations do not commit their changes until proven correct, the result is as if the final write had happened conditionally.
    The good news is that, in practice, this doesn't significantly impact how the CPU operates.
    However, Herb is correct that the compiler needs to give the CPU the equivalent of that code for it to behave that way.
    Key takeaway: avoid putting memory barriers or conditional logic in performance-critical code, loops especially. The optimiser and processor will thank you.
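
    A minimal sketch of the kind of rewrite being discussed (x and cond are illustrative, not the slide's code):

    // Source: x is written only when cond is true.
    if (cond)
        x = 42;

    // An illegal rewrite under the C++11 memory model:
    int tmp = x;            // speculative load
    x = cond ? 42 : tmp;    // unconditional store: invents a write on the
                            // cond == false path, introducing a data race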

  • @qwertyuiop-cu2ve · 1 year ago +2

    At 1:25:39, why can line B go above the decrement? More generally why can something in the body of an if statement be reordered before something in the condition check? This seems to defeat the point of an if statement to avoid doing work in certain cases depending on the condition.

    • @BertPdeboy · 5 months ago

      My guess is that "delete" is itself a high-level operation that performs memory writes, and hence requires memory ordering. The presentation has an optimization example where unrelated stores are moved up before the release. The example below applies the same optimization; the observable multithreaded effect of deleting the pointer is that the pointer becomes null.

      if (control_block_ptr->refs.fetch_sub(1, memory_order_release) == 0) {
          global::allocator::mem_block_free_list = { ... };
          control_block_ptr = nullptr;
      }

      ~ ~ could become ~ ~

      r1 = control_block_ptr;
      r2 = &control_block_ptr->refs;
      control_block_ptr = nullptr;                   // legal compiler optimization
      r3 = r2->fetch_sub(1, memory_order_release);   // flushes write buffer
      if (r3 == 0) {
          global::allocator::mem_block_free_list = { ... };
      }
      if (r3 != 0) {
          control_block_ptr = r1;                    // updates write buffer
      }
      // no further release and no buffer flush after the fetch_sub call, i.e. a race condition

      which shows a wrong ordering. Note:
      - the code above is still high-level; the compiler produces more concrete assembly statements
      - the reordering could happen anywhere in the technology stack between the C++ program and the CPU cache

    • @aharrison24 · 4 months ago +2

      Because the compiler is allowed to assume that there are no data races on accesses to non-atomic variables. Without the acquire ordering on the decrement, the compiler is for instance allowed to speculatively reorder a non-atomic load from the body of the if-statement to a point *before* the if-statement. Because there can't be data races, it 'knows' the value it reads should be unaffected by whether it does the load before or after checking the condition. That's not a safe assumption for the ref-count code, so adding acquire ordering ensures that any modifications to the object in thread 1 are completed and visible to thread 2 before it is able to perform any operations from the body of the if-statement.
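
      A sketch of that reordering (refs, obj, and tmp are illustrative):

      // With only release ordering on the decrement:
      if (refs.fetch_sub(1, std::memory_order_release) == 1) {
          int tmp = obj->data;    // non-atomic load in the body
          delete obj;
      }

      // ...may become (a release constrains only the operations *before* it,
      // and the compiler assumes the non-atomic load is race-free):
      int tmp = obj->data;        // hoisted: may race with the last writer of obj
      if (refs.fetch_sub(1, std::memory_order_release) == 1) {
          delete obj;
      }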

    • @qwertyuiop-cu2ve · 3 months ago

      @@aharrison24 It doesn't sit right with me that the body can be done before checking. This goes against the causality that you intend by writing the if statement. A load of the pointer itself won't cause any data races (as far as I know in this example it doesn't change after initialization), it's about the data that the pointer points to. If the other thread dereferences the pointer after delete, then we have a problem.

    • @aharrison24 · 2 months ago

      @@qwertyuiop-cu2ve I'm totally with you on it not sitting right! But that's why we have to completely change our mindset when reasoning about this stuff. In a single thread of execution, the compiler is allowed to pull all sorts of cunning tricks to make code go faster. As long as those tricks don't change the *observable* behaviour of the code, then they are fair game (the "as if" rule). The CPU too will perform out-of-order execution of instructions in order to avoid costly pipeline stalls.
      Given that accessing main memory can be so expensive, it's common to reorder loads to happen as early as possible, so that there is less waiting around when the value is actually needed. In the specific case you asked about, if the compiler can see that a variable might be read inside an if-block, it's perfectly reasonable for it to generate code that speculatively reads the value into a register *before* evaluating the if-condition. If control flow subsequently doesn't enter the if-block then the value is simply thrown away.
      In single-threaded code this can be done safely if the compiler can see that no-one else could change the value of the variable in the meantime (and it can assume no data races). In multi-threaded code these optimisations can obviously cause very surprising effects. And that's why the control-block decrement needs to have acquire semantics in the slide you referenced.
      The art of lock-free programming is in using the memory-order arguments to atomic operations to constrain the behaviour of both the compiler and the CPU. But seeing things through the lens of 'instruction reordering' is often not helpful. Really the only way to successfully (and portably) reason about multi-threaded code involving relaxed atomics is to think in terms of the *synchronizes-with* and *happens-before* relations of the C++ memory model. Those relations do not talk about control flow. They are concerned only with controlling how memory operations on one thread are made visible to other threads.
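
      For completeness, a sketch of the decrement pattern that adds the needed acquire ordering (control_block and refs are illustrative, not the slide's code):

      // Release on every decrement; acquire only on the thread that hits zero:
      if (control_block->refs.fetch_sub(1, std::memory_order_release) == 1) {
          std::atomic_thread_fence(std::memory_order_acquire);
          delete control_block;    // writes by all earlier owners are now visible
      }

      // Or, more simply, fold both orderings into the decrement itself:
      if (control_block->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
          delete control_block;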

  • @dhairyar3556 · 7 months ago +1

    herb is a god

  • @mysanoop · 1 year ago

    Is a standalone fence like a barrier, not tied to any particular variable?
    And alternatively, in the case of a release+acquire pair, can other unrelated releases and acquires interleave with the release+acquire in question?
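
    On the first question: a standalone std::atomic_thread_fence is indeed not tied to any variable; it takes effect relative to the relaxed atomic operations around it. A minimal sketch, assuming int data and std::atomic<bool> ready:

    // Thread 1:
    data = 42;                                            // plain, non-atomic store
    std::atomic_thread_fence(std::memory_order_release);  // standalone fence
    ready.store(true, std::memory_order_relaxed);

    // Thread 2:
    while (!ready.load(std::memory_order_relaxed)) { }    // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    assert(data == 42);                                   // guaranteed to hold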

  • @brewtaldeafmedal1262 · 8 months ago +1

    And people say God isn't real?

  • @szirsp · 1 year ago

    28:10 What if the read accesses a HW register and the register read has a side effect? (maybe register read clears the interrupt status flags)
    Are all reads always truly safe everywhere?
    Would volatile or some other keyword prevent the compiler from generating unwanted reads?

    • @szirsp · 1 year ago

      (or consumes data from HW FIFO)

    • @nobodynada_ · 1 year ago +1

      At the compiler level: yes, 'volatile' will prevent the compiler from generating unwanted reads. However, as Herb discusses in the last section, volatile only applies to compiler optimizations; the hardware is still allowed to synthesize memory reads, and in addition to marking your variables as volatile you need to turn on some architecture-specific settings to tell the hardware to play ball.
      On both x86 and ARM, you need to set some bits in the pagetables to indicate the memory type, which tells the hardware what optimizations it's allowed to apply when accessing the memory. For x86, you'll want to mark the memory as "uncacheable" or "strong uncacheable" to deny speculative reads (x86 developers' manual, volume 3A, sections 4.9 "Paging and Memory Typing" and 11.3 "Methods of Caching Available"). On ARM, you want to mark the pages as device memory (developer.arm.com/documentation/den0024/a/Memory-Ordering/Memory-types/Device-memory; ARM Architectural Reference Manual, sections B2.7 "Memory types available" and D8.5 "Memory region attributes").
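
      A sketch of the compiler-level half (the register name and address are made up):

      #include <cstdint>

      // Hypothetical interrupt-status register; reading it clears the flags in HW.
      volatile std::uint32_t* const INT_STATUS =
          reinterpret_cast<volatile std::uint32_t*>(0x40001010);

      std::uint32_t flags = *INT_STATUS;  // volatile: exactly one read is emitted;
                                          // the compiler may not duplicate or elide it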

    • @szirsp · 1 year ago

      @@nobodynada_ Thank you

  • @jamesonnash4948 · 3 years ago +1

    At about minute 12, you said that it may be impossible to update a single bit in a byte. However, isn't this merely possibly slower, and not impossible on any processor for regular memory? For example, a possible implementation on x86 might be to use cmpxchg in a loop until a write succeeds that only changes the specific bit? On LL/SC processors, it seems like this would be even easier, and at no increased performance penalty, since all atomic operations would be written with this loop pattern. What makes this impossible?
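
    The loop described would look roughly like this (b is an assumed std::atomic<unsigned char>):

    std::atomic<unsigned char> b{0};

    // Atomically set bit 0 without disturbing concurrent writers of the other bits:
    unsigned char old = b.load(std::memory_order_relaxed);
    while (!b.compare_exchange_weak(old, static_cast<unsigned char>(old | 0x01))) {
        // on failure, old is refreshed with the current value; retry
    }
    // (atomic integral types also provide fetch_or, which does this directly)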

    • @quangtung2912 · 3 years ago +4

      That's not the problem he talked about. The problem is updating bits that are not conceptually shared: because there is no update operation for a single bit, they can end up being shared anyway.
      For example, suppose you have a byte in which 1 bit is used by only a single thread, so no synchronization is needed, while the other 7 bits are shared and protected by a lock.
      If you update that one bit, you might end up updating the other 7 bits too.
      The compiler doesn't know what's shared or atomic, and it makes no sense for the compiler to always be conservative and generate LL/SC instructions everywhere.
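
      A sketch of the failure mode (flags is illustrative):

      unsigned char flags;  // bit 0: private to thread A; bits 1-7: guarded by a mutex

      // Thread A, holding no lock, intending to touch only its own bit:
      flags |= 0x01;  // compiles to load-byte / OR / store-byte, so the store
                      // writes back all 8 bits and can silently undo a
                      // concurrent, properly locked update of bits 1-7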

  • @hanyanglee9018 · 2 years ago +4

    Do we have any update to this topic today? It's been 10 years.

    • @herbsutter4340 · 1 year ago +11

      It's all still current. I just referred a commercial compiler optimization question to this talk a few minutes ago :)

  • @MMIIRRKKOO · 5 years ago +2

    Intel 8051 does single-bit reads and writes, if I'm not mistaken (on *some* memory locations)

    • @rostislavstelmach9168 · 3 years ago +2

      The 8051 is a microcontroller, not a microprocessor, which is why it can do single-bit writes (it has a bit-addressable memory region). General-purpose processors, on the other hand, can only write whole words (1 byte or 4 bytes).