Thanks for an interesting topic and talk. Your presentation skills are beyond outstanding. It’s interesting to consider that most people will use something like this only with a “dispose” operation that relies on the default allocator, which in turn takes a lock on the heap when freeing memory, thus negating much of the cleverness.
Great presenter, great topic, and great talk!
Very interested in lock-free programming. Thank you for sharing knowledge about lock-free reference counting!
Atomic shared_ptrs could be nearly as fast as copying a raw pointer for RCU-like patterns. Simply keep a thread_local snapshot of the latest data and update it frequently from the shared atomic. The update should take into account that both pointers may be equal, in which case it does nothing. For RCU-like patterns the central atomic shared_ptr isn't updated very often, so in most cases the copy process notices that both pointers are equal and nothing is done. If you have a futex'd solution, the comparison should take place before any locking, and as I said, in most cases locking won't be necessary.
As the current atomic shared_ptr lacks this feature, I implemented something similar with a class shared_obj, which is similar to a shared_ptr, and a tshared_obj, which is similar to an atomic. I benchmarked my idea against the MSVC atomic, and because there's no cache invalidation while copying equal pointers, my solution is about 900 times faster with 32 contending threads on my Zen 4 CPU.
So if you optimize that way you can simply stick with futex'd atomics and you won't run into performance problems.
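A minimal sketch of the "compare before you do anything expensive" idea described above. The commenter's shared_obj/tshared_obj aren't shown anywhere, so this is my own approximation using standard parts (all names here are made up): an atomic raw pointer mirrors the current value so readers can detect "nothing changed" without locking or touching the reference count.

```cpp
#include <atomic>
#include <memory>
#include <mutex>

template <class T>
class snapshot_source {
    std::mutex m_;
    std::shared_ptr<T> value_;      // protected by m_ (stand-in for the futex)
    std::atomic<T*> raw_{nullptr};  // mirror of value_.get(), readable lock-free
public:
    void store(std::shared_ptr<T> v) {
        std::lock_guard<std::mutex> lk(m_);
        raw_.store(v.get(), std::memory_order_release);
        value_ = std::move(v);
    }
    // Refresh `snap` only if the published pointer actually changed.
    void refresh(std::shared_ptr<T>& snap) {
        if (snap.get() == raw_.load(std::memory_order_acquire))
            return;  // fast path: no lock, no ref-count cache-line traffic
        std::lock_guard<std::mutex> lk(m_);
        snap = value_;  // slow path: one ref-count increment, under the lock
    }
};
```

The reader side then matches the comment: keep a thread_local `std::shared_ptr<T>` snapshot and call `refresh` as often as you like; while the central value is unchanged, a refresh is just one relaxed pointer comparison. Note the fast path is safe from ABA because the snapshot itself keeps its object alive, so an equal address really does mean the same object.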
Amazing presentation, the hard topic is so very well explained!
An additional conclusion I drew from it is to never have a public mutable variable (not just a shared_ptr) accessible to more than one thread. Threads competing to reassign and copy a shared pointer are so scary and carry so many potential issues, even when the pointer operation itself is thread-safe, that this is best avoided.
Great presentation of a hard topic!
So many interesting talks at CppCon in 2023. Thanks for sharing.
Very well explained. Thanks
I have to ask - why not just increment the atomic counter twice then? Sounds like the same "split counter" approach without any changes to the structure layout.
Still, as long as an approach has one thread deleting the memory of the control block while another thread took a pointer to the control block and didn't get a chance to increment the counter inside the block - you have a problem. The split-count approach lets the two threads coordinate using another counter that lives outside the control block. The deferred reclamation approach ensures that the 'deleting thread' keeps the memory around a little longer if needed.
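A very rough sketch of the split-count idea (the packing scheme and all names are my own, not the talk's exact code): the atomic word that readers load holds both the control-block pointer and a small "external" counter in spare bits, so grabbing the pointer and announcing a pending reference happen in one atomic operation - there is no window in which a reader holds the pointer without a counted claim on it.

```cpp
#include <atomic>
#include <cstdint>

struct control_block { std::atomic<long> strong{1}; /* object, deleter... */ };

// Assumes a 64-bit platform with 48-bit virtual addresses, leaving the
// high 16 bits free for the external counter.
constexpr std::uintptr_t kPtrMask = (std::uintptr_t(1) << 48) - 1;
constexpr std::uintptr_t kOneRef  = std::uintptr_t(1) << 48;

std::atomic<std::uintptr_t> g_packed{0};

control_block* acquire() {
    // One fetch_add both reads the pointer and bumps the external count.
    std::uintptr_t old = g_packed.fetch_add(kOneRef, std::memory_order_acquire);
    return reinterpret_cast<control_block*>(old & kPtrMask);
}

// A thread that swaps the pointer out must fold the external count it
// observed into control_block::strong, and may only delete the block once
// that combined count reaches zero.
```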
All around great talk!
Very nice feature potential, one more tool to help ease parallel scaling.
(I wish everyone would just ignore that bureaucrat's claim about languages; it was clearly written by some fool who doesn't understand basic computer science fundamentals. It would be like the DOT declaring that farm tractors should be avoided because they are prone to rollovers with poorly trained operators, then listing a bunch of highway vehicles as preferred alternatives.) (For the trivia-minded: many farmers have been killed by pre-1980s tractors turning over, but this has been mitigated by rollover protection bars that stop the tractor from going completely upside-down [like RAII]. Though people still remove them, because bad designers make them too tall and they interfere with low barn doors or [orchard] tree branches. [not zero cost])
The gcc version is going to behave algorithmically mostly like the split-reference-count version in no- or low-contention scenarios, and I think that is what it is optimized for, since if you have high contention, the compare-exchange operation will eat you alive anyway with cache misses.
It seems to me there is a misunderstanding of the multithreading issue at minute 8 - I couldn't understand what the real use case is.
There are two things to know about shared_ptr in this context:
1. The ref counter in shared_ptr is thread-safe; that means the managed object is deleted exactly once.
2. The managed object itself is not thread-safe (that depends on the object's implementation; shared_ptr provides no such functionality). That means if you call methods of that object from different threads, no synchronization comes from shared_ptr.
Regarding the example at minute 8:
* One thread creates an object, and then the ownership can be shared.
* To create a copy of a shared_ptr, you must own it, so the ref counter is always > 0.
* When the object is being destroyed, the ref counter is 0, meaning no one owns it anymore (the last and only owner has just released it) - and since no one owns it, there is no way to make a copy.
So the problem above is an issue in the shared-ownership logic, not in shared_ptr itself. It has nothing to do with the shared_ptr implementation; it's a design issue.
Okay, let’s say you don’t want to own the shared_ptr (it is a cache or something like that), but you want to use it from time to time; then it seems you should use a weak_ptr to reference the shared_ptr, which has its pros and cons but works well.
Then, let’s say someone recreated the shared_ptr, so the weak_ptr reports that it has expired - at that point you somehow have to get a reference to the new shared_ptr, which you never owned before.
It seems this is the problem that should be solved in the video :)
And the weak_ptr wouldn't be needed if you could reference a shared_ptr without owning it.
That looks a bit like trying to solve a bad design issue without redesigning, unless of course I'm missing something important here.
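A small sketch of the weak_ptr cache pattern described above (all names are hypothetical). The awkward branch is exactly the one this comment points at: weak_ptr can tell you the old object is gone, but it cannot hand you the recreated one - you need some owning source for that.

```cpp
#include <memory>
#include <mutex>

struct Data { int value = 0; };

std::shared_ptr<Data> recreate() {       // stand-in for "someone recreated it"
    return std::make_shared<Data>();
}

std::mutex g_m;
std::weak_ptr<Data> g_cache;             // non-owning reference to the data

std::shared_ptr<Data> get() {
    std::lock_guard<std::mutex> lk(g_m); // weak_ptr assignment isn't atomic
    if (auto sp = g_cache.lock())        // object still alive: just use it
        return sp;
    auto sp = recreate();                // expired: obtain the new shared_ptr
    g_cache = sp;                        // re-point the non-owning reference
    return sp;
}
```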
I think what he wants to say is that the reference counting and the deletion together are not an atomic operation, so there is a risk of a race condition.
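A minimal sketch of that race (deliberately broken code - don't run it as-is). The ref-count updates themselves are atomic, but a copy is two separate steps: read the control-block pointer, then increment the count inside that block.

```cpp
#include <memory>
#include <thread>

std::shared_ptr<int> g_ptr = std::make_shared<int>(42);  // shared, mutable

void reader() {
    // Copy construction: (1) read g_ptr's control-block pointer,
    // (2) atomically increment the ref count inside that block.
    // The writer can run between (1) and (2).
    std::shared_ptr<int> local = g_ptr;  // UB: g_ptr modified concurrently
}

void writer() {
    // If this drops the old count to zero after the reader did step (1)
    // but before step (2), the reader increments freed memory.
    g_ptr = std::make_shared<int>(7);
}

int main() {
    std::thread t1(reader), t2(writer);  // data race - this gap is exactly
    t1.join(); t2.join();                // what atomic<shared_ptr> must close
}
```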
4:00 The lock-free definition here is "if a scheduled thread fails to make progress, it must be because another thread made progress". In my mind that isn't lock-free programming; that's just code that doesn't deadlock. That seems to be a much lower bar than the normal definition of lock-free that you would expect.
EDIT: I've watched the rest of the talk and the definition at the beginning doesn't seem to matter for the rest.
I actually like Daniel's definition (and think it's common). One thing I like about it is that it explains why a "compare-exchange" loop is considered lock-free while a "spin lock" loop isn't. In the former, every extra iteration happens because another thread actually changed some shared data; in the latter, we can have more and more iterations while the other thread is doing totally unrelated work.
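A toy example of the distinction (my own, not from the talk): in the CAS loop a retry means another thread's CAS succeeded in between, i.e. the system as a whole progressed; in the spin lock we can spin indefinitely while the holder does unrelated work.

```cpp
#include <atomic>

std::atomic<int> counter{0};

void lockfree_increment() {
    // A real counter would just use fetch_add; the CAS loop is illustrative.
    int expected = counter.load(std::memory_order_relaxed);
    // Apart from the weak CAS's rare spurious failures, a failed CAS
    // reloads `expected` with a value some other thread just installed -
    // our retry is evidence of their progress.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
    }
}

std::atomic_flag lock = ATOMIC_FLAG_INIT;
int guarded = 0;

void locked_increment() {
    // Spins as long as the holder hasn't released the flag - even if the
    // holder is descheduled and making no progress at all. Not lock-free.
    while (lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
    ++guarded;
    lock.clear(std::memory_order_release);
}
```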
It is the standard definition of lock-free (not wait-free which is stronger and even more tricky).
You see, it guarantees progress for the system as a whole, not for every thread.
Deadlock is not the only problem with locks. Another typical problem is the crashed / killed / non-scheduled thread. With a lock, you can have the lock owned forever (by a sleeping or dead thread); with lock-free there is no such problem.
But, yes, lock-free doesn't guarantee anything for each individual thread, only for the system as a whole. As long as something is scheduled (we have live threads), there is progress. Each individual thread can be starved by an unfortunate scheduling pattern, but something works.
@Roibarkan Right, but just for elaboration I'll add that one should be a little careful here. Not every compare-exchange loop is lock-free - even if the loop only has to repeat when another thread changed the value. You could still have a livelock situation where two threads continuously invalidate each other's compare-exchanges, but neither makes any "real" progress in the sense of actually finishing some operation. (And of course there are compare-exchange loops that just aren't lock-free at all, e.g., those that compare against a fixed value.)
28:42: No it's not. There are already server CPUs with a 57-bit address space. The 48-bit thing was never defined in any ABI anyway, so there is no opt-in for 57-bit addresses. More fundamentally, this technique is not portable to other architectures.
Interesting ideas presented in this one, thanks! I am inclined to ask whether retiring itself is lock-free, since it appends the retired control block to some global list of retired objects. I suppose it is possible to use a linked list, which could be lock-free, minus the allocator overhead - but then the allocator itself will need a lock. You can't escape!
It could have a statically known size - think of a global std::array. Then, as the presenter said, you occasionally (every 1000th defer, say) go through that array and finish the deferring. Or, if during a defer you see that the array is full, you finish the deferring of some or all of the elements in it. Of course, choosing the size of the array depends... on the maximum number of threads, I guess, or on the frequency of the finish-defer operation, or the maximum contention, or something like that.
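A rough sketch of that fixed-size idea (the capacity, policy, and names are my guesses, not from the talk): a global array of retired blocks, flushed whenever it fills up. A real implementation would also check hazard pointers before destroying anything, and would avoid the mutex used here for brevity.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

struct retired { void* block = nullptr; void (*destroy)(void*) = nullptr; };

constexpr std::size_t kCapacity = 1024;    // tune to thread count / contention
std::array<retired, kCapacity> g_retired;  // statically sized - no allocation
std::size_t g_count = 0;
std::mutex g_m;

void flush_locked() {
    // "Finish the deferring": destroy everything accumulated so far
    // (simplified: assumes none of these blocks is still protected).
    for (std::size_t i = 0; i < g_count; ++i)
        g_retired[i].destroy(g_retired[i].block);
    g_count = 0;
}

void retire(void* block, void (*destroy)(void*)) {
    std::lock_guard<std::mutex> lk(g_m);
    if (g_count == kCapacity)              // array full: flush before adding
        flush_locked();
    g_retired[g_count++] = {block, destroy};
}
```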
I think this explains, in example 3, what Qt implemented as "deleteLater" 🤔
It sounds to me like the problems with shared_ptr (hard to do this, hard to do that) originate from the fact that its functionality is bloated: it supports weak_ptr, separate allocation, offsets, inheritance. And now we want it to be atomic. Do we need all these features at the same time? If we wanted a subset of features, still useful in most cases, the control block wouldn’t need to be separate and it would be easier to have an atomic shared_ptr.
I appreciate what you're trying to do here, but it seems a bit like using GC to avoid GC. The whole point of using shared_ptrs and the like is to not use GC at all. This may just mean that people need to design better around parallelism, or it may be that for parallel programming GC would be a better solution. I can't say for certain, and while I would rather avoid GC, I kind of feel that for some programmers it might be the solution they're looking for to escape their bad programming habits. Just don't let them work on any embedded devices, especially in the medical field.
There surely is nondeterminism, timing- and memory-wise, in retiring control blocks - normally perhaps buried in the nondeterminism of today's multi-core, multi-level-cache CPUs and system/memory architectures.
But are scenarios possible that retire enormous numbers of control blocks in a short time, producing significant memory-waste peaks and timing effects?
You can control the maximum number of deallocations that a hazard pointer will need to make by controlling how often `retire` performs the scan. The longer you wait, the more efficient it is; you can scan more frequently for a lower bound on latency.
58:48 Daniel's talk from the previous year: ruclips.net/video/OS7Asaa6zmY/видео.html
Unfortunately I'm quite disappointed by this talk, which sucks because Daniel's on-stage presentation skills are exemplary.
The main reason being assumptions and claims, which I thought an academic would have nailed.
Two I shall cherry-pick:
1. "Most CPUs in the last 15 years are x86." I find this very hard to believe.
2. Several claims, particularly in the first half, that lock-free programming is more performant, implying that we should all be doing it because it's better. Then the benchmarking says (unsurprisingly) that it is situational and only relevant on a massive server. Particularly on RUclips I think we need to be more careful; most people don't watch to the end, including more junior developers. I think we need to be cautious about making claims at the start of content that are then only disproved right at the end without forewarning.
Right. Lock-free is usually only more performant "as a principle" under oversubscription scenarios, where a thread that is preempted while holding a lock can slow down the whole system until it gets to run again.
There are lock-free algorithms that are more performant than locked algorithms, but that's usually not *because* they are lock-free, but rather because they avoid contention or unnecessary work by being clever about how they do things. You could usually add a lock to those without making them significantly slower (except in oversubscription scenarios).
After he put while(true) it effectively became a spinlock at that point, didn’t it?
EDIT: ah, I see, he hand-waved it at 59:30 as being “not wait free, but definitely lock-free”.
Sure, with that definition a spinlock is “lock-free”.
There's only one definition of lock-free. The implementation of a spinlock is not lock-free. This code is.
The only way not to exit the loop iteration is that the pointer is not null, and between reading the pointer value and trying to increment its reference count, the reference count has *become* zero.
I.e., wasn't already zero when reading the pointer.
That immediately makes it lock free: the only way to be stuck is if someone else is making progress (namely: finished a store operation that eventually led to reducing the reference count to 0).
In contrast, a spinlock is stuck as long as nobody else made progress.
That's pretty much the exact opposite.
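A sketch of the loop in question as I understand it from the talk (names are mine): an "increment only if not already zero" on the reference count. This is what makes it lock-free under the definition above - a failed CAS means another thread just changed the count, i.e. made progress.

```cpp
#include <atomic>

// Returns true if we safely took a reference; false if the count had
// already reached zero (the object is being destroyed - don't revive it).
bool increment_if_not_zero(std::atomic<long>& ref_count) {
    long cur = ref_count.load(std::memory_order_relaxed);
    while (cur != 0) {
        if (ref_count.compare_exchange_weak(cur, cur + 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
            return true;  // we own a reference now
        // CAS failure reloaded `cur`; the loop exits if it became zero.
    }
    return false;
}
```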
ruclips.net/video/lNPZV9Iqo3U/видео.html
In this example we wouldn't have an issue if the shared_ptr atomic counter used the compare-exchange idiom, which guarantees that we only do the increment if the previous value is the one we read a moment ago - and if it is zero, we don't make any copy in thread 3.
The issue is with the control block (and the pointer to it), and it is in the control block that the ref count lives. What if we have copied that pointer on thread 3, and then thread 2 decrements the counter and destroys the control block (before thread 3 has incremented the counter)? Thread 3 now has a dangling control-block pointer. As he said later, we would need to be able to copy the pointer and increment atomically.
I believe that as long as the atomic counter lives inside the control block - the ‘compare-exchange idiom’ isn’t enough, because (as Daniel points out) once one thread manages to decrement to zero, it deletes the control block, and any other thread that might be inside a ‘compare exchange loop’ (trying to increment) will refer to deleted memory, and reach memory unsafety
No disrespect is meant to anyone individually, but I can't help but make note of this. And I think it ought to be noted as well. As I watch some of these talks every year, a recurring theme is the highly repetitive, almost systemic, use of phrases like "it's hard, it's really hard, yeah this problem it's really really really really REALLY hard, it's a hard problem, it's so difficult, yeah it's just a hard problem, it's extremely complex, it's hard". I think after enough of this a person has to step back, out of the shared mental space so to speak (what one could call suspension of disbelief, or a type of somnambulism), and ask themselves "what in the hell is going on here?" I mean look, I'm not posting with the solution in hand, but all this talk about how hard everything is to solve just seems ridiculous and unbelievable. Yeah, it's that simple. I don't believe it.
There's also a chilling effect on both the speaker and the listener. The speaker - much like if they say magical trigger phrases like "I'm tired" and then start to feel more tired, or "it hurts" and it feels more painful - when they speak the words "it's hard", their unconscious mind latches on and starts to execute that script. Basically, it makes it hard. Just like "I'm tired", or "dumb", or "weak", or whatever makes you tired, dumb, and weak. This is demonstrable. So when the listener hears "it's hard": 1) they'll make it harder, 2) it'll dissuade people from trying to solve these problems. So it's almost like an anti-competitive thing. "This is really complex, cutting edge. Yeah, stay out, you can't do it, I'm working on it, it's too hard."
I don't like this. I don't like it at all. And I'm sick of hearing it, year after year after year. I myself am one of the "if you want it done, and done right, do it yourself" types. Just not believable. Leaves me thinking a combo of "what's really going on here" and "what is wrong with these people".
I agree with a lot of what you're saying here, but I differ in that I don't blame the speaker. The language is significantly more complex than others to get right, and increasingly so. I think that's down to the default (i.e. simplest) behaviours not being the right ones with respect to modern best practices.
The NSA saying what it did is not surprising, and it should really highlight the fundamental issue with C++: you need a significant amount of experience to write good C++, and even then the risk of a dangerous mistake is high. I find that pretty damning for the next generation of C++ programmers.