I agree, another one worth catching all his talks is Ansel Sermersheim. He makes hard to understand subjects highly consumable. Timur Doumler also has good talks but he is not the best public speaker, not the worst, he is very good, but not Herb or Chandler levels at all.
I always enjoy Chandler's talks - this one has great video/audio quality as well. I would like to emphasize the fact that this pointer-packing stuff is implementation-defined, and generally pointless outside of a performance bottleneck. So, some of the code you don't see in the slides likely contains a lot of platform-specific configuration macros (but I haven't looked at these parts of the Clang code base yet). TL;DW: Cater to the cache, pack your pointers.
@20:00 Microsoft has very similar implementation in their ATL, since 2003. Their implementation is called CAtlPlex, and is used internally by many ATL containers.
Apple also uses that at the language level (of both ObjC and Swift IIRC). In fact on 64b they use not only the low bits but also the high bits, because all extant 64b implementations only have 48b address space at the moment (and x86-64 is architecturally limited to 52 bits).
As usual, it is a joy to listen to Chandler's talks. I am working on an Open Source library of data structures for High Performance Computing and one of the first thing I had to do is to rebuilt a custom std::vector. One of the many reason I had to do that is that std::vector v(n); fills the vector with zeros which trashes the cache and prevents you to use the "first touch policy" which is extremely important for NUMA architectures such as dual socket nodes. Alignement is also an issue to get efficient vectorization in HPC. These things could be "solved" with allocators, but I just hate them as I don't want them to change the type of my containers. I have a few questions though: - Do you have a llvm::Vector or do you use llvm::SmallVector as a replacement for std::vector? Do you also use llvm::Vector as a replacement of std::array? - There is a BumpPtrAllocator, but is does not seem possible to use it with SmallVector. Do you have any design ideas for allocators? I just can't get them to work the way I way. I hate both the way they are done in the C++ standard library and the Bloomberg implementation.
We use SmallVector pretty much everywhere. On very rare occasions we'll want what std::vector provides, but it's unusual. std::array is a totally different beast -- it is a statically sized array which has essentially no overlap with the vector use cases. I suspect there are some places where we could use std::array and currently use llvm::SmallVector, but I think they're rare.
And regarding allocators -- for LLVM, the pattern we fall into more often is needing to control allocation of the *objects* with allocators, not the buffers in the data structures. For those buffers, tcmalloc, jemalloc, or any of the other optimized free-list based malloc libraries will do a fine job.
There are too many things in the standard library which are broken for me and I am glad I decided to write my own containers. It provides me so many useful things. As I am working mainly on numerical simulation, all I really need is a good std::vector implementation. Having my own allows me to: - Have an il::Vector v(n) that does not fills the vector with 0 (same with float, int, etc) - Have an il::Vector v(n) that gets initialized to NaN in debug mode (very useful if you want to track uninitialized values) - Add my own constructor such as il::Vector v(n, il::align, 64) if I want to align the buffer to a cacheline - Easily count the number of copy constructor called in il::Vector (if I want to make sure that the move semantics are properly done) - Having il::Vector::size() return a signed integer (I hate unsigned integers and all the bugs they lead to) - Write a il::SmallVector as Chandler has pointed - so many useful things... I like the language C++, but I just have so many troubles with the standard library. I am glad to hear that LLVM worked on their own containers. I can point people who say I am stupid not to use the standard library to this kind of talk.
Slowly learning C++20, but how would you achieve that? The only way I can think of is rejecting incompatible allocators and giving a compiler error, which doesn't really achieve the result of having the SmallVector on API boundaries.
Is there a significant performance benefit to use the hybrid 32-bit pointers and 64-bit registers ABI, namely x32 ABI? I don't mean for the generated code, but specifically for the compiler itself?
So you get an expanded register file with 8 extra registers, but register scarcity has not been a major issue since about Pentium Pro, since top of stack is actually aliased to an unknown number of hidden registers and the corresponding operations get removed at the opcode decoding stage. You'd get a slight bottleneck at opcode decoding but it was pretty much removed in the later iterations of Pentium III. There also doesn't seem to be any difference whatsoever in compiler performance between pure 32-bit and 64-bit ABIs, at least from my experience with GCC - don't know whether it extends to Clang, but I strongly suspect so.
Can you elaborate on "top of stack" aliasing? I don't find any mentions of it in intel optimization guide or Agner Fog's guides. Maybe you are talking about FP register stack?
The smallvector slide at 5:26 is confusing me because of omissions: 1) Can all small size vectors be upgraded to heap storage on overflow 2) Is SmallVectorImpl what you use for function params? I assume so, it just sounds like a wierd / wordier name.
Yes, but at the expense of having to come up with a new name for each different small-size-optimized container in a function's API boundary. =/ Doesn't seem to scale very well.
You can use the same technique you use to remove the size parameter from the type of SmallVectorImpl. Allocators are passed by reference, so no slicing worries, and you can use the pimpl idiom for polymorphic copying. My jury is still out on whether it's better, but it is possible.
6:17 Slide 6 -- How does the SmallVectorImpl::push_back know whether to free the old buffer? Without knowing the original N, it can't tell if it's been reallocated to the heap already, or was still using the small buffer.
Hi Chandler, I may be misinterpreting you, but at ruclips.net/video/vElZc6zSIXM/видео.html are you implying that modern CPUs do indirect memory prefetching? e.g. are you implying that CPUs will work out you are iterating through an array of pointers, read one or more indices ahead in that array, and then prefetch the memory pointed to by that pointer?
Note that prefetching on pointer load is bound to be slow no matter what. If you want to measure if the CPU does prefetching or not, make sure you use the same algorithm, but one doing prefetching explicitly using compiler intrinsic, and the other letting the CPU do its thing. If you see no significant perf difference, your CPU does prefectching. If the one with prefectching is faster, then you CPU doesn't do prefecthing. I wouldn't be able to tell you if your specific CPU does it. Some Intel do it, but I'm not sure which ones.
@onosendaiAAB I think you're right there is no indirect prefetching. Possibly he means that the execution of later iteration loads can occur speculatively and in parallel to the first load because there is no dependency.
I don't get the point about SmallVector class having a fixed size, whereas the containers goal is to change it. I can only guess that 'N' is the maximum size the "simulated container" will ever get, right? But how would you know that?
+Chandler Carruth I saw the push_back f(), however, it will produce just 1 element. And even if it create another buffer, how Begin and End would work with the 2 or more arrays, without an extra code to handle the situation? I guess you are talking about low-level, which would allow you to delete an compiling-time array, and realocate it, unlike C/C++ allow to do, right?
No, this is totally built with what C++ already allows you to do. The begin and end pointers start off pointing at the inline small buffer, and if extra space is needed it gets allocated and the begin and end pointers are moved to point at the new buffer. You can find all the code for the real implementation here: github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/ADT/SmallVector.h
Chandler Carruth So, the penalty is risking to have an unused array, after the memory enlarge. Good idea, though. The pointers can also return to the array, if memory is freed. How much an array is faster than std::list? For GCC, some years ago, I measured that an array, simulating std::map, is 6-7 times faster.
btw How about intrinsic lists? Or when you say "list" you mean list `List` instead of `List`? www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists This have very good properties if list are very dynamic and object can move between lists. Of corse it have one big drawback, object can belongs only to one list at time.
ok i see so the idea behind smallvector is that you can allocate some space before hand? instead of allocating when pushing something on to the vector array ?
No, the point is that you allocate it on the stack, not the heap. If it's too large for the stack, then it'll fallback to the heap. You can allocate space in a vector beforehand with vector::reserve().
It's hard to parse as a human, and you just happened to get it wrong is all. Read it as (Value).(template is)(). It's for when you call a template member function with type parameters that cannot be deduced from its formal parameters.
> 5:00, in derived class, you could had use the base constructor instead (C++11 and above): _using SmallVectorImpl::SmallVectorImpl;_. > 8:36, at least for GCC, _std::pair_ drops performance by the half, compared to 2 separately values. It was shocking to me (absurd geek moment)! However, I guess that an std::array , and a first and second f()s, fix this issue. _Edit: nope. This array is as slow as std::pair._
+EvanED I'm comparing: 1) int m; int n; 2) std::pair p; 3) class Pair { std::array a; public: inline const int &first () const {return a[0];} inline int &first () {return a[0];} The same for second, returning a[1]; }; The case 1 is more than 100% faster than the other ones. I also tryed to make a 4th case: packing a pair inside a size_t variable, which turned out to be more than 100% slower than case 1 too.
Contrary to what Chandler says, bit packing is slow. You won't find it in use very often in the modern code. Bit packing makes sense only if you save substantial amount of data size (megabytes).
Do you have any qualification for that? On the machines I use (Haswell server class), bit packing is almost free, taking around 2/3rds of a CPU cycle to shift and mask.
Bit packing: 1/2 cycles. L1: 3-4 cycles If your data fit in L1, then bitpacking is probably not worth it. L1 is 32k. L2: 10-12 cycles If your data fit in L2, it's probably a good idea, but not sure you'll win a lot. L2 is 256k. If you manipulate anything bigger than this, bitpacking is DEFINITIVELY worth it. Most programs are here. Caveat: bitpacking makes your code bigger, so if you are icache bound - you have a large executable with fairly flat profile - then you may want to reconsider.
Depends on your usage case, if you are primarily setting/clearing bits then it will need more instructions compared to say, using an array of bytes, but if you are testing them, particularly multiple bits at the same time then you make savings.
Chandler Carruth talks are great. I try every one of them available on the YT.
man, Chandler makes the camera man work really hard for his money xD
😂
Chandler is my favorite speaker after Herb, good talk!
I agree, another one worth catching all his talks is Ansel Sermersheim. He makes hard to understand subjects highly consumable. Timur Doumler also has good talks but he is not the best public speaker, not the worst, he is very good, but not Herb or Chandler levels at all.
I always enjoy Chandler's talks - this one has great video/audio quality as well.
I would like to emphasize the fact that this pointer-packing stuff is implementation-defined, and generally pointless outside of a performance bottleneck. So, some of the code you don't see in the slides likely contains a lot of platform-specific configuration macros (but I haven't looked at these parts of the Clang code base yet).
TL;DW: Cater to the cache, pack your pointers.
I supose you are talking about the 4-bits-packing stuff, instead of the array of unique pointers, for large objects, right?
Agreed, he's definitely my favorite presenter.
@20:00 Microsoft has very similar implementation in their ATL, since 2003. Their implementation is called CAtlPlex, and is used internally by many ATL containers.
Apple also uses that at the language level (of both ObjC and Swift IIRC). In fact on 64b they use not only the low bits but also the high bits, because all extant 64b implementations only have 48b address space at the moment (and x86-64 is architecturally limited to 52 bits).
As usual, it is a joy to listen to Chandler's talks. I am working on an Open Source library of data structures for High Performance Computing and one of the first thing I had to do is to rebuilt a custom std::vector. One of the many reason I had to do that is that std::vector v(n); fills the vector with zeros which trashes the cache and prevents you to use the "first touch policy" which is extremely important for NUMA architectures such as dual socket nodes. Alignement is also an issue to get efficient vectorization in HPC. These things could be "solved" with allocators, but I just hate them as I don't want them to change the type of my containers.
I have a few questions though:
- Do you have a llvm::Vector or do you use llvm::SmallVector as a replacement for std::vector?
Do you also use llvm::Vector as a replacement of std::array?
- There is a BumpPtrAllocator, but is does not seem possible to use it with SmallVector. Do you have any design ideas for allocators? I just can't get them to work the way I way. I hate both the way they are done in the C++ standard library and the Bloomberg implementation.
We use SmallVector pretty much everywhere. On very rare occasions we'll want what std::vector provides, but it's unusual.
std::array is a totally different beast -- it is a statically sized array which has essentially no overlap with the vector use cases. I suspect there are some places where we could use std::array and currently use llvm::SmallVector, but I think they're rare.
And regarding allocators -- for LLVM, the pattern we fall into more often is needing to control allocation of the *objects* with allocators, not the buffers in the data structures. For those buffers, tcmalloc, jemalloc, or any of the other optimized free-list based malloc libraries will do a fine job.
I tried to get attention to this issue a while back:
groups.google.com/a/isocpp.org/forum/#!msg/std-proposals/b946VongJjY/RYYj--v63qsJ
There are too many things in the standard library which are broken for me and I am glad I decided to write my own containers. It provides me so many useful things. As I am working mainly on numerical simulation, all I really need is a good std::vector implementation. Having my own allows me to:
- Have an il::Vector v(n) that does not fills the vector with 0 (same with float, int, etc)
- Have an il::Vector v(n) that gets initialized to NaN in debug mode (very useful if you want to track uninitialized values)
- Add my own constructor such as il::Vector v(n, il::align, 64) if I want to align the buffer to a cacheline
- Easily count the number of copy constructor called in il::Vector (if I want to make sure that the move semantics are properly done)
- Having il::Vector::size() return a signed integer (I hate unsigned integers and all the bugs they lead to)
- Write a il::SmallVector as Chandler has pointed
- so many useful things...
I like the language C++, but I just have so many troubles with the standard library. I am glad to hear that LLVM worked on their own containers. I can point people who say I am stupid not to use the standard library to this kind of talk.
So, did you already write your own, or just want to?
I'm quite sure that the paramaterized allocators problem is easily solvable with C++20's `concept` feature now.
Slowly learning C++20, but how would you achieve that? The only way I can think of is rejecting incompatible allocators and giving a compiler error, which doesn't really achieve the result of having the SmallVector on API boundaries.
Is there a significant performance benefit to use the hybrid 32-bit pointers and 64-bit registers ABI, namely x32 ABI? I don't mean for the generated code, but specifically for the compiler itself?
So you get an expanded register file with 8 extra registers, but register scarcity has not been a major issue since about Pentium Pro, since top of stack is actually aliased to an unknown number of hidden registers and the corresponding operations get removed at the opcode decoding stage. You'd get a slight bottleneck at opcode decoding but it was pretty much removed in the later iterations of Pentium III.
There also doesn't seem to be any difference whatsoever in compiler performance between pure 32-bit and 64-bit ABIs, at least from my experience with GCC - don't know whether it extends to Clang, but I strongly suspect so.
Can you elaborate on "top of stack" aliasing? I don't find any mentions of it in intel optimization guide or Agner Fog's guides. Maybe you are talking about FP register stack?
The smallvector slide at 5:26 is confusing me because of omissions: 1) Can all small size vectors be upgraded to heap storage on overflow 2) Is SmallVectorImpl what you use for function params? I assume so, it just sounds like a wierd / wordier name.
1) Yes, with SmallVector v; they can call v.resize(10);
2) The name is a bit weird, but I don't know which one they could have chosen
For 2), the names should have been switched. Then you allocate SmallVectorImpl but pass around SmallVector.
@@xfreeman86 what about allocating SmallVector and passing around BasicSmallVector
12:00 Can alias templates be used to avoid those mistakes?
Yes, but at the expense of having to come up with a new name for each different small-size-optimized container in a function's API boundary. =/ Doesn't seem to scale very well.
You can use the same technique you use to remove the size parameter from the type of SmallVectorImpl. Allocators are passed by reference, so no slicing worries, and you can use the pimpl idiom for polymorphic copying. My jury is still out on whether it's better, but it is possible.
6:17 Slide 6 -- How does the SmallVectorImpl::push_back know whether to free the old buffer? Without knowing the original N, it can't tell if it's been reallocated to the heap already, or was still using the small buffer.
It can check if Begin points right after SmallVectorImpl object - this is where derived class keeps the buffer
Hi Chandler,
I may be misinterpreting you, but at ruclips.net/video/vElZc6zSIXM/видео.html
are you implying that modern CPUs do indirect memory prefetching?
e.g. are you implying that CPUs will work out you are iterating through an array of pointers, read one or more indices ahead in that array, and then prefetch the memory pointed to by that pointer?
Modern Intel do it to some extent. I'm not aware of any other architecture doing it.
Interesting, do you have any further information or references about that?
I've just done some tests on my Ivy Bridge, and I'm pretty sure it doesn't do indirect memory prefetching.
Note that prefetching on pointer load is bound to be slow no matter what. If you want to measure if the CPU does prefetching or not, make sure you use the same algorithm, but one doing prefetching explicitly using compiler intrinsic, and the other letting the CPU do its thing.
If you see no significant perf difference, your CPU does prefectching. If the one with prefectching is faster, then you CPU doesn't do prefecthing.
I wouldn't be able to tell you if your specific CPU does it. Some Intel do it, but I'm not sure which ones.
@onosendaiAAB I think you're right there is no indirect prefetching. Possibly he means that the execution of later iteration loads can occur speculatively and in parallel to the first load because there is no dependency.
i love this guy
The talk from 2 years ago he mentions in the beginning: ruclips.net/video/fHNmRkzxHWs/видео.html
Is this the previous talk? ruclips.net/video/fHNmRkzxHWs/видео.html ("Efficiency with Algorithms, Performance with Data Structures")
I don't get the point about SmallVector class having a fixed size, whereas the containers goal is to change it. I can only guess that 'N' is the maximum size the "simulated container" will ever get, right? But how would you know that?
It's not a fixed size -- it's an initial *allocation* size. The container still is dynamically sized like std::vector.
+Chandler Carruth I saw the push_back f(), however, it will produce just 1 element. And even if it create another buffer, how Begin and End would work with the 2 or more arrays, without an extra code to handle the situation?
I guess you are talking about low-level, which would allow you to delete an compiling-time array, and realocate it, unlike C/C++ allow to do, right?
No, this is totally built with what C++ already allows you to do.
The begin and end pointers start off pointing at the inline small buffer, and if extra space is needed it gets allocated and the begin and end pointers are moved to point at the new buffer.
You can find all the code for the real implementation here: github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/ADT/SmallVector.h
Chandler Carruth So, the penalty is risking to have an unused array, after the memory enlarge. Good idea, though. The pointers can also return to the array, if memory is freed.
How much an array is faster than std::list? For GCC, some years ago, I measured that an array, simulating std::map, is 6-7 times faster.
btw How about intrinsic lists? Or when you say "list" you mean list `List` instead of `List`?
www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists
This have very good properties if list are very dynamic and object can
move between lists. Of corse it have one big drawback, object can
belongs only to one list at time.
ok i see so the idea behind smallvector is that you can allocate some space before hand? instead of allocating when pushing something on to the vector array ?
No, the point is that you allocate it on the stack, not the heap. If it's too large for the stack, then it'll fallback to the heap.
You can allocate space in a vector beforehand with vector::reserve().
Nice!
_Value.template_, I don't even know what this means.
It is a disambiguation for accessing a templated member of an object that is itself a dependent type.
It's hard to parse as a human, and you just happened to get it wrong is all. Read it as (Value).(template is)(). It's for when you call a template member function with type parameters that cannot be deduced from its formal parameters.
@@xfreeman86 ...wait, you're Gordon Freeman's brother, aren't you?
Bug in slide 21: `(Size >= (End - CurPtr))` should be `(Size > (End - CurPtr))`.
In the original source code it looks a bit different: "if (Adjustment + SizeToAllocate
> 5:00, in derived class, you could had use the base constructor instead (C++11 and above): _using SmallVectorImpl::SmallVectorImpl;_.
> 8:36, at least for GCC, _std::pair_ drops performance by the half, compared to 2 separately values. It was shocking to me (absurd geek moment)! However, I guess that an std::array , and a first and second f()s, fix this issue.
_Edit: nope. This array is as slow as std::pair._
Re. pair, what are you comparing? Separate correlated arrays for keys and values (e.g. 'KeyType keys[N]; ValueType values[N];) vs an array of pairs?
+EvanED I'm comparing:
1) int m; int n;
2) std::pair p;
3) class Pair {
std::array a;
public:
inline const int &first () const {return a[0];}
inline int &first () {return a[0];}
The same for second, returning a[1];
};
The case 1 is more than 100% faster than the other ones.
I also tryed to make a 4th case: packing a pair inside a size_t variable, which turned out to be more than 100% slower than case 1 too.
the thing is array is homogenous
>llvm
>Fast
black magic
Contrary to what Chandler says, bit packing is slow. You won't find it in use very often in the modern code. Bit packing makes sense only if you save substantial amount of data size (megabytes).
Do you have any qualification for that? On the machines I use (Haswell server class), bit packing is almost free, taking around 2/3rds of a CPU cycle to shift and mask.
Bit packing almost always wins if it means your working set fits in cache, or for anything that is to be transmitted over a network.
Bit packing: 1/2 cycles.
L1: 3-4 cycles
If your data fit in L1, then bitpacking is probably not worth it. L1 is 32k.
L2: 10-12 cycles
If your data fit in L2, it's probably a good idea, but not sure you'll win a lot. L2 is 256k.
If you manipulate anything bigger than this, bitpacking is DEFINITIVELY worth it. Most programs are here.
Caveat: bitpacking makes your code bigger, so if you are icache bound - you have a large executable with fairly flat profile - then you may want to reconsider.
Depends on your usage case, if you are primarily setting/clearing bits then it will need more instructions compared to say, using an array of bytes, but if you are testing them, particularly multiple bits at the same time then you make savings.
Found the Javascript coder.
i just got lost on xkcd.com for 10 minutes