this talk is amazing :O thanks so much
Thank you Herb Sutter
21:49 It's worth noting that speculative execution in modern processors will behave like the top example. Since speculatively executed operations do not commit their changes until they are proven correct, it is as if the final write happened conditionally.
The good news is, in practice, this doesn't significantly impact how the CPU operates.
However, Herb is correct that the compiler needs to give the CPU the equivalent of that code for it to behave that way.
Key takeaway: avoid putting memory barriers or conditional logic in performance-critical code, especially in loops. The optimiser and processor will thank you.
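If I'm remembering the slide correctly, the transformation being ruled out is roughly this (x shared, cond local):

    if (cond) x = 42;    // source: the write to x happens only if cond holds

must not become

    int tmp = x;         // "speculative store" transformation
    x = 42;              // other threads may observe this transient value
    if (!cond) x = tmp;  // restoring x doesn't undo what they already saw

The CPU's own speculation preserves the top behaviour, because nothing retires until the branch resolves; the compiler just has to hand it the conditional form in the first place.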
At 1:25:39, why can line B go above the decrement? More generally, why can something in the body of an if statement be reordered before something in the condition check? This seems to defeat the point of an if statement, which is to avoid doing work depending on the condition.
my guess is "delete" being an high-level operation that itself requires memory writes, hence requires memory ordering. The presentation has an optimization example where unrelated stores are moved up before the release.
The example below uses the same optimization example. The multithreaded observed effect of pointer delete is the pointer becomes "null".
if control_block_ptr->refs.fetch_sub(1,memory_order_release) == 0 {
....global::allocator::mem_block_free_list = { ... }
....control_block_ptr = null
}
~ ~ could become ~ ~
r1 = control_block_ptr
r2 = control_block_ptr.refs
control_block_ptr = null // legal compiler optimization
r3 = r2.fetch_sub(1,memory_order_release) // flushes write buffer
if r3 == 0 {
....global::allocator::mem_block_free_list = { ... }
}
if r3 != 0 {
....control_block_ptr = r1 // updates write buffer
}
// lack of additional release, lack of buffer flush, aka race condition since fetch_sub call.
which shows a wrong ordering. note;
- that the code above is still highlevel, the compiler produces more concrete assembly statements
- the ordering change could happen anywhere in the technology stack between ]c-program ; cpu-cache[
Because the compiler is allowed to assume that there are no data races on accesses to non-atomic variables. Without the acquire ordering on the decrement, the compiler is for instance allowed to speculatively reorder a non-atomic load from the body of the if-statement to a point *before* the if-statement. Because there can't be data races, it 'knows' the value it reads should be unaffected by whether it does the load before or after checking the condition. That's not a safe assumption for the ref-count code, so adding acquire ordering ensures that any modifications to the object in thread 1 are completed and visible to thread 2 before it is able to perform any operations from the body of the if-statement.
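For reference, the canonical fix (used in Boost's shared_ptr, for example) is a release decrement plus an acquire fence on the path that actually destroys the object:

    #include <atomic>

    struct control_block {
        std::atomic<int> refs{1};
        // ... the managed object / deleter live here too
    };

    void release_ref(control_block* cb) {
        if (cb->refs.fetch_sub(1, std::memory_order_release) == 1) {
            // Only the last owner reaches this point. The acquire fence makes
            // every other thread's writes to the object visible before we free it.
            std::atomic_thread_fence(std::memory_order_acquire);
            delete cb;
        }
    }

Using memory_order_acq_rel on the fetch_sub works too, at the cost of acquire ordering on every decrement rather than just the last one.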
@aharrison24 It doesn't sit right with me that the body can be done before the check. This goes against the causality you intend by writing the if statement. A load of the pointer itself won't cause any data races (as far as I know, in this example it doesn't change after initialization); it's about the data that the pointer points to. If the other thread dereferences the pointer after the delete, then we have a problem.
@qwertyuiop-cu2ve I'm totally with you on it not sitting right! But that's why we have to completely change our mindset when reasoning about this stuff. In a single thread of execution, the compiler is allowed to pull all sorts of cunning tricks to make code go faster. As long as those tricks don't change the *observable* behaviour of the code, then they are fair game (the "as if" rule). The CPU too will perform out-of-order execution of instructions in order to avoid costly pipeline stalls.
Given that accessing main memory can be so expensive, it's common to reorder loads to happen as early as possible, so that there is less waiting around when the value is actually needed. In the specific case you asked about, if the compiler can see that a variable might be read inside an if-block, it's perfectly reasonable for it to generate code that speculatively reads the value into a register *before* evaluating the if-condition. If control flow subsequently doesn't enter the if-block then the value is simply thrown away.
In single-threaded code this can be done safely if the compiler can see that no-one else could change the value of the variable in the meantime (and it can assume no data races). In multi-threaded code these optimisations can obviously cause very surprising effects. And that's why the control-block decrement needs to have acquire semantics in the slide you referenced.
The art of lock-free programming is in using the memory-order arguments to atomic operations to constrain the behaviour of both the compiler and the CPU. But seeing things through the lens of 'instruction reordering' is often not helpful. Really the only way to successfully (and portably) reason about multi-threaded code involving relaxed atomics is to think in terms of the *synchronizes-with* and *happens-before* relations of the C++ memory model. Those relations do not talk about control flow. They are concerned only with controlling how memory operations on one thread are made visible to other threads.
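To make the synchronizes-with idea concrete, here's the minimal release/acquire message-passing pattern (my own sketch, not from the slides):

    #include <atomic>

    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {                // thread 1
        payload = 42;
        ready.store(true, std::memory_order_release);
    }

    void consumer() {                // thread 2
        while (!ready.load(std::memory_order_acquire)) { }
        // The acquire load that reads 'true' synchronizes-with the release
        // store, so the write to payload happens-before this read:
        int x = payload;             // guaranteed to see 42; no data race
    }

Note that nothing in the guarantee mentions instruction order or control flow; it's purely about which writes are made visible to the other thread.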
herb is a god
Is a standalone fence like a barrier, not tied to any particular variable?
And alternatively, in the case of a release+acquire pair -- can other unrelated releases and acquires interleave with the release+acquire in question?
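Re the first question: yes, as I understand it a standalone std::atomic_thread_fence is not tied to any variable; it takes effect through whatever atomic operations surround it. A sketch of fence-based publication:

    #include <atomic>

    int data = 0;
    std::atomic<bool> flag{false};

    void producer() {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release); // orders all earlier writes
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!flag.load(std::memory_order_relaxed)) { }
        std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
        int x = data;                                        // guaranteed to see 42
    }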
And people say God isn't real?
28:10 What if the read accesses a HW register and the register read has a side effect? (maybe the register read clears the interrupt status flags)
Are all reads always truly safe everywhere?
Would volatile or some other keyword prevent the compiler from generating unwanted reads?
(or consumes data from HW FIFO)
At the compiler level: yes, 'volatile' will prevent the compiler from generating unwanted reads. However, as Herb discusses in the last section, volatile only applies to compiler optimizations; the hardware is still allowed to synthesize memory reads, and in addition to marking your variables as volatile you need to turn on some architecture-specific settings to tell the hardware to play ball.
On both x86 and ARM, you need to set some bits in the page tables to indicate the memory type, which tells the hardware what optimizations it's allowed to apply when accessing the memory. For x86, you'll want to mark the memory as "uncacheable" or "strong uncacheable" to deny speculative reads (x86 developers' manual, volume 3A, sections 4.9 "Paging and Memory Typing" and 11.3 "Methods of Caching Available"). On ARM, you want to mark the pages as device memory (developer.arm.com/documentation/den0024/a/Memory-Ordering/Memory-types/Device-memory; ARM Architectural Reference Manual, sections B2.7 "Memory types available" and D8.5 "Memory region attributes").
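To make the compiler-level half concrete (the register address below is made up for illustration):

    #include <cstdint>

    // Hypothetical read-to-clear interrupt-status register.
    volatile std::uint32_t* const INT_STATUS =
        reinterpret_cast<volatile std::uint32_t*>(0x40001000);

    std::uint32_t ack_interrupts() {
        // volatile: the compiler must emit exactly one load here, and may not
        // invent, remove, or reorder accesses to *INT_STATUS.
        return *INT_STATUS;
    }

The page-table attributes described above are what stop the hardware itself from speculatively or repeatedly reading that address.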
@nobodynada_ Thank you
At about minute 12, you said that it may be impossible to update a single bit in a byte. However, isn't this merely possibly slower, and not impossible on any processor for regular memory? For example, a possible implementation on x86 might be to use cmpxchg in a loop until a write succeeds that only changes the specific bit? On LL/SC processors, it seems like this would be even easier, and at no increased performance penalty, since all atomic operations would be written with this loop pattern. What makes this impossible?
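Concretely, I'm imagining something like this (just a sketch):

    #include <atomic>

    // Atomically set bit k of a shared byte without clobbering the other bits.
    void set_bit(std::atomic<unsigned char>& b, unsigned k) {
        unsigned char old = b.load(std::memory_order_relaxed);
        while (!b.compare_exchange_weak(
                   old, static_cast<unsigned char>(old | (1u << k)))) {
            // On failure, 'old' is refreshed with the current value; retry.
        }
    }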
That's not the problem he talked about. The problem is updating bits that are not conceptually shared: because there is no update operation for a single bit, they can end up being shared anyway.
For example, say you have a byte where 1 bit is used by only a single thread, so no synchronization is needed, and the other 7 bits are shared and protected by a lock.
If you update that one bit, you might end up updating the other 7 bits, too.
The compiler doesn't know whether the data is shared or atomic, and it makes no sense for the compiler to always be conservative here by generating LL/SC instructions everywhere.
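To illustrate (my own example, not from the talk):

    struct S {
        unsigned mine   : 1;  // touched only by thread 1, unsynchronized
        unsigned shared : 7;  // protected by a mutex, touched by other threads
    };
    S s;

    // Thread 1 writes s.mine = 1; with no single-bit store available, the
    // generated code is a read-modify-write of the whole byte:
    //     tmp = load_byte(&s);   // reads 'shared' as a side effect
    //     tmp |= 0x01;
    //     store_byte(&s, tmp);   // writes back a possibly stale 'shared',
    //                            // silently undoing a concurrent locked update

(In C++11 terms, adjacent bit-fields are one memory location, so this is defined to be the programmer's problem rather than the compiler's.)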
Do we have any updates on this topic today? It's been 10 years.
It's all still current. I just referred a commercial compiler optimization question to this talk a few minutes ago :)
The Intel 8051 does single-bit reads and writes if I'm not mistaken (on *some* memory locations)
The 8051 is a microcontroller, not a microprocessor, so it can do single-bit writes. Processors, on the other hand, can only do word-sized writes (1 byte or 4 bytes).