This guy kinda feels like Tom Scott but for C++. I love it!
"Perfect fidelity interferes with understanding" 🔥🔥
"turns out we can... and it fits on one slide"
*slaps 3 pages of template meta programming on one slide*
came for the hash table, stayed for the jokes.
This 👆
Great talk, and I like the progressive way of introducing the material.
Thank you!
this man went on stage and did a null pointer dereference.
This is definitely a good talk. I love the subject matter and am quite surprised by the results. I would've figured probing would be slower no matter what, but using a bitmap jump table is a pretty good idea. One of my many attempts had a binary tree for each chain, which was horrible for locality and memory usage, but still sped up the prior flat array approach, somehow. I guess because I didn't insert or delete often, and truthfully inserts were never slow either way, but it did speed up finds, especially if you iterated the entire collection and found each item recursively. This method looks a lot better though, I'll have to do an implementation. Also, nice job on using powers of two for the table sizes so you can bitmask for the index. I found that trick myself back in high school and was shocked that no one used it; now I'm finally seeing it all over.
Ignore the haters lol. I thought you were fucking hilarious dude. Just a tough crowd
@Matt Kulukundis,
A few ideas to explore:
As two consecutive full groups likely mean the table is near its load limit and needs a resize anyway, why keep looping until a matching or empty slot is found? Instead, detect the issue at insert time and execute some other logic when a group is full.
If it's a table of one or two groups, do as you do now.
If there are more groups, see if the just checked group has an "extension group" assigned to it (one per table).
If so, check this group first before trying the next/last group.
If it has no extension group, just check the next/last group.
The first group to fill up is thus extended by doubling it, which might stave off the need for growing the table as a whole.
It could be a small price to pay compared to potentially doubling the size in a growth action.
It also keeps the data dense for longer, at least if the hash distributes well enough.
I haven't tested this, but it seems feasible. For very large tables one could consider having more spillover groups to compensate for the increased cost of growing. In this scenario I would use some low-order index bits of the group that is full to point towards the one extension group that must match it.
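An untested sketch of the probe order described above; every name here is invented for illustration (this is not Abseil's code). The single extension group is modeled as one extra group stored past the normal groups, at index num_groups, and on insert the first group that fills up would record itself as extension_owner:

```cpp
#include <cstddef>

struct TableShape {
  static constexpr std::size_t kNoOwner = ~std::size_t{0};
  std::size_t num_groups = 0;              // power of two, at least 1
  std::size_t extension_owner = kNoOwner;  // set to the first group that fills up
};

// Where to look next after probing `group` and finding it completely full.
std::size_t next_group_to_probe(const TableShape& t, std::size_t group) {
  if (t.num_groups <= 2) {
    return (group + 1) & (t.num_groups - 1);  // tiny table: behave as today
  }
  if (t.extension_owner == group) {
    return t.num_groups;  // this group owns the extension group: check it first
  }
  return (group + 1) & (t.num_groups - 1);  // no extension here: just the next group
}
```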
Thanks for the ideas, I will add them to the queue of experiments :)
I would certainly also try to special-case small tables, say 32-128 entries, and replace the lookup with a linear scan. That will expose the hidden constant overhead.
Looking forward to eventually trying this out.
What a great talk. Both interesting and entertaining, and I actually learned something. The humor reminds me a bit of Zack Freedman/Voidstar Labs.
"I'll give you for a moment to read it while I drink some of this lovely, lovely, diet Mountain Dew"
Oh, "diet"! I understood "dead", like "gone completely flat". m)
Love his style!
24:38 "One and a half bits, that is slightly awkward."
It is actually log2(3) bits i.e. ~ 1.58496
I wonder how many people are writing the code themselves based on this talk, just for fun :D
I would definitely do that if I could jump to the place where he describes the structure and algorithms. A third of the way in, all I have seen is how to do it badly. Plonk.
I took some time to implement each of those steps, up to the table with the extension. Fun ride, and it works! :D
I am.
Going to write in C.
Top-notch technical work and an insightful presentation, thanks! Do you know what effect or benchmark configuration is responsible for the surprising result that Large Payload is faster than Small Payload for flat_hash_set (around minute 35)?
Honestly not sure. I suspect that the Large Payload one is a slight outlier and if we reran those we would see it come up to be about the same as Small Payload.
+Matt Kulukundis,
I made a post with a few ideas that might be beneficial and just reply here so I am sure you get a notification.
ruclips.net/video/ncHmEUmJZf4/видео.html&lc=UgyIBvVL_JNXGPZhY4B4AaABAg
I hope it makes sense and does what I expect it to do.
I wonder if the elephant in the room[1][2] is that you are ostensibly guaranteed to have two cache misses as a result of having two separate arrays. If this is correct, I further wonder if the reason it's not addressed is that the solution to this problem is what separates the Google version from the Abseil version. The mitigation is to do simultaneous memory fetches[3], and while I'm guessing you could do a better job of that with assembly instructions by forcing a prefetch of more than one cache line (I'm no expert on assembly), all that's necessary (after the sliding window optimization lands[4][5]) is to store the key in the first probe location in a variable before probing the metadata array[3]. This would create an additional branch in the subsequent code, but that could be eliminated by replacing "Key first_key = m_key_array[index]" with "volatile Key first_key = m_key_array[index]" and then ignoring the variable.
[1]: ruclips.net/video/ncHmEUmJZf4/видео.html
[2]: ruclips.net/video/ncHmEUmJZf4/видео.html
[3]: larshagencpp.github.io/blog/2016/05/01/a-cache-miss-is-not-a-cache-miss
[4]: ruclips.net/video/ncHmEUmJZf4/видео.html
[5]: ruclips.net/video/JZE3_0qvrMg/видео.html
insightful
How about adding new elements? We are talking about finding elements all the time, but we skipped the part (or I missed it) about how to add a new element to such a hash table. Having the elements stored directly in the table makes it more likely that a new element with a given hash collides with already existing elements (this is why we start with a linked list in the first place). Do we just iterate over the table and use the first free spot?
And how do we choose the Group to add an element into? The final design suggests that we can add it in a random place because later, the search is done in parallel.
Have you guys played with prefetching? I honestly never got it to work positively myself, but on the other hand I've never gone down to SSE instructions either.
Perhaps you could prefetch the data before doing the lookup on the metadata, so by the time you find a match (if any) the data could already be in place or on its way.
By the way, I'm implementing a hash table using the ideas from this talk. Apparently the prefetch did give me 15% or so, for the use cases my benchmark is aimed at.
On second thought, maybe my implementation is just crappy enough that there is time for the data from the prefetch to arrive...
On a third thought, for large tables the metadata will be a cache miss, and prefetching the data before accessing the metadata gives it a good chance for the data to be there if it turns out to be necessary.
That's what I observe in my implementation.
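For anyone curious, here is the idea boiled down to a toy linear-probing table (invented names, not Abseil's actual code), using GCC/Clang's __builtin_prefetch: start fetching the slot the hash points at before touching the control bytes, so the two likely cache misses overlap instead of happening back to back.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyTable {
  std::vector<std::int8_t> ctrl;     // per-slot tag byte, or kEmpty
  std::vector<std::uint64_t> slots;  // the actual keys
  std::size_t mask = 0;              // capacity - 1, capacity is a power of two
  static constexpr std::int8_t kEmpty = -128;

  bool contains(std::uint64_t key, std::size_t hash) const {
    std::size_t i = (hash >> 7) & mask;                        // H1: slot index
    const std::int8_t tag = static_cast<std::int8_t>(hash & 0x7F);  // H2: 7-bit tag
    __builtin_prefetch(&slots[i]);                             // start fetching the data early
    for (std::size_t n = 0; n <= mask; ++n, i = (i + 1) & mask) {
      if (ctrl[i] == kEmpty) return false;                     // empty slot: not present
      if (ctrl[i] == tag && slots[i] == key) return true;      // metadata match, then data
    }
    return false;
  }
};
```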
We have played with it but haven't concluded firmly yet on it. Intuitively issuing a prefetch for the data before accessing the metadata should be a win for find hit cases.
Waiting for it. Please make it open source.
I am working on it. I cannot make any hard commitments or guarantees, but it is high on my priority list.
Any movement on this front? Would love it as a new year's gift!
I would too. There is some internal movement (although clearly I would have lost any bet I made). I can't make any commitments, but we are working on it.
It's open source now on github abseil
So like uhh, if this Matt dude isn't from Southern California, that'd be like...shocking? I never knew the "surfer dude" went on to work at Google :P -- But really though, I found this talk fascinating! Great job on the presentation and keeping the mood real light!
Actually I am from Minnesota. Although I attended college on the east coast.
Finally open source! abseil.io/blog/20180927-swisstables
Matt, thanks for the awesome talk! I have a question: can you please clarify what was the "green" map in these stacked graphs and why it was being used over std::unordered_map and dense_hash_map?
`__gnu_cxx::hash_map`. It was an extension to the library that is similar to std::unordered_map but predates C++11.
did y'all ever consider doing h & (b-1) instead of h % b for the starting bucket. it just seems that you need something to truncate h to be in range [0,b).
nevermind! looking at 51:30 it seems like y'all already do
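For anyone skimming: when the bucket count b is a power of two, the two expressions pick the same bucket; the mask form just avoids an integer division. A minimal illustration:

```cpp
#include <cassert>
#include <cstddef>

// Assumes b is a nonzero power of two.
std::size_t bucket_index(std::size_t h, std::size_t b) {
  assert(b != 0 && (b & (b - 1)) == 0);
  return h & (b - 1);  // identical to h % b when b is a power of two
}
```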
I must say I don't understand bashing on the jokes in the comments. I found most of the jokes to be quite amusing
When removing with arbitrary group positions, even if the group that the element was found in has empty slots, there may exist another group containing the same element and no empty slots. Do you check for empty slots in both the group ending and the group beginning with the to-be-removed element, in order to know whether the slot may again be marked as empty, or must be marked as deleted?
With arbitrary group positions you have to verify that [pos-15, pos+15] does not contain a run of 16 non-empty items.
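A scalar sketch of that check (invented names, assuming a power-of-two capacity of at least a couple of groups; the real implementation does this with SIMD over the two neighbouring groups): count the consecutive non-empty slots just before and just after the erased position, and only reuse the slot as empty if they cannot form a full 16-slot run together with it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWidth = 16;       // group size
constexpr std::int8_t kEmpty = -128;
constexpr std::int8_t kDeleted = -2;

// Returns the control byte to write at `pos` after erasing the element there.
std::int8_t mark_after_erase(const std::vector<std::int8_t>& ctrl, std::size_t pos) {
  const std::size_t cap_mask = ctrl.size() - 1;  // capacity is a power of two
  std::size_t before = 0, after = 0;
  // Count consecutive non-empty slots just before and just after pos.
  while (before < kWidth - 1 &&
         ctrl[(pos - 1 - before) & cap_mask] != kEmpty) ++before;
  while (after < kWidth - 1 &&
         ctrl[(pos + 1 + after) & cap_mask] != kEmpty) ++after;
  // If pos sits inside a run of kWidth non-empty slots, a probe may have walked
  // past this window, so the slot must become a tombstone instead of empty.
  return (before + after + 1 >= kWidth) ? kDeleted : kEmpty;
}
```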
Really inspiring presentation!
Any reason why you chose 128-bit SSE over 256-bit AVX2, or even AVX-512?
Did you get the chance to test those?
We have done tests with 256-bit versions and did not find a win over 128. We have not tried 512.
release date as open source?
We are not committing to anything, but we are working on it. I know it is frustrating to wait. We are balancing the value to the community of releasing it as part of Abseil and the value we get from doing internal migrations of our code base.
I don't get the example at 38:35. He says it is going to pick the r-value insert, but that's wrong. It is going to pick the templated forwarding-reference one, which will forward the pair as a reference, so I don't get how he can measure any difference. Maybe I misunderstood something, but I even tried the code on VS2019 and there was no additional copy.
It's going to pick the templated r-value ref overload, which is a better match for non-const types:
... insert( const value_type& value ); // not this one
template ... insert( P&& value ); // but this one
@@pruibiebehastoet1914 hi, I understand it is picking the template-insert overload. But why there would be a copy? I think the template is instantiated as `pair insert(P& value)`, so it is still a ref, and then forward into string's constructor. I can't figure out why there's a copy😨, would you please explain a little? thanks
okay, I think I figured it out. It is because inside the template insert, for gcc's implementation (v12.1), it will construct a temp key. clang doesn't seem to have this problem
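A small self-contained example of the overload resolution being discussed (plain std::unordered_map, nothing Abseil-specific): value_type is pair<const K, V>, so an lvalue pair<string, int> is not an exact match for insert(const value_type&); the templated insert(P&&) wins, and the node's pair<const string, int> still has to be constructed from it, which copies the string.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

int main() {
  std::unordered_map<std::string, int> m;

  // Exactly value_type (pair<const string, int>): insert(const value_type&) is
  // chosen and the string is copied once into the node.
  std::pair<const std::string, int> exact{"a", 1};
  m.insert(exact);

  // Not value_type (key is not const): the templated insert(P&&) is the better
  // match; it forwards a reference, but the node's pair<const string, int>
  // still has to be constructed from it, copying the string.
  std::pair<std::string, int> loose{"b", 2};
  m.insert(loose);
}
```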
Is the new hash table faster than vectors for iterations?
Depends on what you are trying to do
Almost never. vectors are perfectly contiguous so iteration is essentially optimal.
Excellent talk! I have the same sense of humor, so I thought it was funny too.
Very insightful talk. Is flat_hash_set open source now? If yes, can you please provide a pointer to find it on the internet?
github.com/abseil/abseil-cpp/blob/master/absl/container/flat_hash_set.h
Nice work I will say, but boy is C++ in a messy state when it's even hell for library makers.
I know they like hell as it allows them to show off their cryptic ways, but it's still a big loss when it comes down to it. I used to love C++, but only use parts of it due to the clutter. With all the new layering of crap on top of crap, it looks less and less appealing to ever go back to.
It's a reasonable question to ask which knowledgeable developers will seriously consider it now as a productivity language.
And without actual users, there is no need for library makers and the language slowly dies out.
Reminds me of games where devs focus on the wrong things, it never ends well.
Still, good to see a focus on efficiency and daring to break baseless standard requirements that are of use to no one. We need much more of that in this field. Simplify our languages and other interfaces and work more declaratively instead of coding everything out all the time. Current languages are all wrong in these ways, as the true problems people need to solve are very similar and could be described compactly and with little code.
The focus must shift from making tools that satisfy tool makers to making tools that satisfy end-user requirements. This hash is a pretty good example of better priorities. I wish this was done at the level of languages and core products, such as databases too. In fact these should be integrated much more, along with data visualization tools.
I don't know if you'll see this 4+ years later, but 100% agreed. Solve new problems, don't rewrite old solutions.
Very nice talk! One question though: Is there a particular reason for the slot count of a group being 16? Why not 32 for example?
We tried 32 and didn't see any performance win from it, so we went with the one that allows for smaller tables
Great talk, thanks. When you say "the first 7 bits", does this mean "the 7 least significant bits of the most significant byte" on a little-endian system? Also, I tried to search for this bit distribution entropy thing but get mostly results for crypto hash functions that discuss the distribution of the hash function outputs over the output space, which sounds a little different to my ears, because we can still be uniformly distributed over the output space without any entropy in the LSB. If you could provide a link regarding this I would appreciate it. And just to confirm my thoughts: would MurmurHash2 or 3 comply with the requirements (I think they do), and would using a 32-bit hash value on a 64-bit system cause an issue (I see no issue with this besides diverging from the standard)? Thanks
MurmurHash is good. It happens that a 32-bit hash will work if your table has fewer than 2**25 elements, but that is not a guarantee we will provide officially.
Realizing I didn't answer your question: it is currently the least significant bits, but we don't guarantee anything on that front.
Thank you for the feedback. I think I will set out to code a quick version of this; it all seems quite accessible (leaving out the asm optimization and the smart stuff that figures out the best performing version of what the user wants).
You can use a so-called "bit mixer" to spread the entropy of only a few bits across all the bits. Murmur hash's "finalizer" is known to be a good bit mixer. A detailed discussion of Murmur3's finalizer can be found here: zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html. I should add that the article claims the bit mixer adds entropy, but that is not correct.
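For reference, Murmur3's 64-bit finalizer (fmix64) is just this; it spreads whatever entropy the input already has across all 64 bits, it does not create any:

```cpp
#include <cstdint>

// Murmur3 fmix64: redistributes the input's entropy across all 64 bits.
std::uint64_t fmix64(std::uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}
```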
Is there any concern on the size overhead of the object itself? Are you guys optimizing for that as well?
I sometimes face situations where there are millions of small hash map instantiations (and some of those are large, so I need them to be succinct). And some compact hash map implementations I found go up to 88 bytes or more for the object itself. That kills the gains from the implementation being compact.
sizeof(flat_hash_map) is about 40 bytes. We would like it to be smaller, but that has been a lower priority optimization. We might get it down to 32, but we need to ensure no performance regressions doing it. A zero element flat_hash_map allocates no memory at all.
Sweet!
I can't wait :)
40 bytes is already far better than the 56 from STL (godbolt.org/g/Uiiau8).
40/56 bytes only on the stack, not on the heap ;)
Ivan Ivanov yes I know. Those "stack" 40/56 bytes make a difference in application where millions of hashmaps are allocated. I say "stack" because if you allocate the hash map on the heap, then it's on the heap :)
Check out my small hash optimization: github.com/greg7mdp/sho
11:45 why not just put it at the second slot of the linked list?
GOML ass-embly optimizer
In the first 3 minutes of unfunny jokes he almost lost me; I skipped to 23:00 (skipping unordered_map). I will probably watch the rest.
EDIT 25:00 good point, "we want to use all the bits well", but another joke.. OMG how old is he? 8? The point was fewer collisions on average, adjusted for the data set's distribution. At this moment I came up with the "stupid idea" of making a translation table for the data (data -> table[data] -> result data for the hash calc), statistically adjusted for your data distribution (in a first testing pass): use more bits for the numbers/letters seen most of the time, fewer bits for ranges seen less frequently, and focus more bits on the local context (local order). I noticed that some hashes use sum and xor in such a way: the sum is the broad context of the data, xor is the local context and won't lose any data, the shifts are just to "position the data" and cut off useless bits, and the % prime numbers do the same with the sums.. Am I correct?
I assume all "research into the perfect hash" is just an attempt to adjust the hash to the most common data distribution to get fewer collisions. Since, for example, ASCII uses 5-6 bits at most, shifting any data by 2-3 bits usually helps. Probably what I just wrote is stupid..
Question: watch more or not?