CppCon 2017: Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”

  • Published: 25 Nov 2024

Comments • 88

  • @Florian-sh9nf · 1 year ago +14

    This guy kinda feels like Tom Scott but for C++. I love it!

  • @nimcapcik · 3 months ago +1

    "Perfect fidelity interferes with understanding" ‍🔥🔥

  • @8292-d6n · 3 years ago +51

    "turns out we can... and it fits on one slide"
    *slaps 3 pages of template metaprogramming on one slide*

  • @jg5275 · 3 years ago +78

    came for the hash table, stayed for the jokes.

  • @allanwind295 · 3 years ago +18

    Great talk, and I like the progressive way of introducing the material.

    • @CppCon · 3 years ago +2

      Thank you!

  • @thomash9008 · 19 days ago +2

    this man went on stage and did a null pointer dereference.

  • @anon_y_mousse · 2 years ago +11

    This is definitely a good talk. I love the subject matter and am quite surprised by the results. I would've figured probing would be slower no matter what, but using a bitmap jump table is a pretty good idea. One of my many attempts had a binary tree for each chain, which was horrible for locality and memory usage, but still improved on the prior flat array approach, somehow. I guess because I didn't insert or delete often; truthfully, inserts were never slow either way, but it did speed up finds, especially if you iterated the entire collection and found each item recursively. This method looks a lot better though, I'll have to do an implementation. Also, nice job on using powers of two for the table sizes so you can bitmask for the index. I found that trick myself back in high school and was shocked that no one used it; now I'm finally seeing it all over.

  • @marcusklaas4088 · 5 years ago +57

    Ignore the haters lol. I thought you were fucking hilarious dude. Just a tough crowd

  • @TheEVEInspiration · 7 years ago +17

    @Matt Kulukundis,
    A few ideas to explore:
    As two consecutive full groups likely mean the table is near its load limit / needs a resize anyway, why keep looping until a matching or empty slot? Instead, detect the issue at insert time and execute some other logic in case of a full group.
    If it's a table of one or two groups, do as you do now.
    If there are more groups, see if the just-checked group has an "extension group" assigned to it (one per table).
    If so, check this group first before trying the next/last group.
    If it has no extension group, just check the next/last group.
    The first group to fill up is thus extended by doubling it, which might stave off the need for growing the table as a whole.
    It could be a small price to pay compared to potentially doubling the size in a growth action.
    It also keeps the data dense for longer, at least if the hash distributes well enough.
    I haven't tested this, but it seems feasible. For very large tables one can consider having more spillover groups to compensate for the increased cost of growing. In this scenario I would use some low-order index bits of the group that is full to point towards the one extension group that must match it.

    • @mattkulukundis6638 · 7 years ago +9

      Thanks for the ideas, I will add them to the queue of experiments :)

    • @ReiniUrban · 6 years ago

      I would certainly also try to special-case small tables, say 32-128 entries, and replace the probing with a linear scan. This will reveal the hidden constant overhead.

  • @IllumTheMessage · 7 years ago +4

    Looking forward to eventually trying this out.

  • @SaHaRaSquad · 1 year ago

    What a great talk. Both interesting and entertaining, and I actually learned something. The humor reminds me a bit of Zack Freedman/Voidstar Labs.

  • @disk0__ · 7 years ago +28

    "I'll give you for a moment to read it while I drink some of this lovely, lovely, diet Mountain Dew"

    • @Runoratsu · 6 years ago +1

      Oh, "diet"! I understood "dead", like "gone completely flat". m)

  • @bernadettetreual · 5 years ago +4

    Love his style!

  • @lucacavalleri8649 · 4 years ago +10

    24:38 "One and a half bits, that is slightly awkward."
    It is actually log2(3) bits, i.e. ~1.58496.
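
To make the awkward fraction concrete (a standard information-theory packing argument, not something from the talk): a three-state value (empty / deleted / full) carries

    \log_2 3 \approx 1.58496 \ \text{bits}, \qquad \text{and since } 3^5 = 243 \le 256 = 2^8,

five three-state slots fit in one byte, i.e. 8/5 = 1.6 bits per slot, close to the theoretical bound.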

  • @brenogi · 7 years ago +33

    I wonder how many people are writing the code themselves based on this talk, just for fun :D

    • @JiveDadson · 6 years ago

      I would definitely do that if I could jump to the place where he describes the structure and algorithms. A third of the way in, all I have seen is how to do it badly. Plonk.

    • @brenogi · 6 years ago +3

      I took some time to implement each of those steps, up to the table with the extension. Fun ride, and it works! :D

    • @figfox2425 · 6 years ago

      I am.

    • @abhisheksingh-mj6kb · 2 years ago

      Going to write in C.

  • @astro511 · 7 years ago +18

    Top-notch technical work and an insightful presentation, thanks! Do you know what effect or benchmark configuration is responsible for the surprising result that Large Payload is faster than Small Payload for flat_hash_set (around minute 35)?

    • @mattkulukundis6638 · 7 years ago +4

      Honestly not sure. I suspect that the large payload one is a slight outlier and if we reran those we would see it come up to be about the same as Small Payload.

    • @TheEVEInspiration · 7 years ago

      +Matt Kulukundis,
      I made a post with a few ideas that might be beneficial, and am replying here so I am sure you get a notification.
      ruclips.net/video/ncHmEUmJZf4/видео.html&lc=UgyIBvVL_JNXGPZhY4B4AaABAg
      I hope it makes sense and does what I expect it to do.

  • @m.j.rayburn3 · 1 year ago

    I wonder if the elephant in the room[1][2] is that you are ostensibly guaranteed to have two cache misses as a result of having two separate arrays. If this is correct, I further wonder if the reason it's not addressed is that the solution to this problem is what separates the Google version from the Abseil version. The mitigation is to do simultaneous memory fetches[3], and while I'm guessing you could do a better job of that with assembly instructions by way of forcing a greater-than-one cache line prefetch (I'm no expert on assembly), all that's necessary (post the sliding window optimization landing[4][5]) is to store the key in the first probe location in a variable before probing the metadata array[3]. This would create an additional branch in the subsequent code, but that could be eliminated by replacing "Key first_key = m_key_array[index]" with "volatile Key first_key = m_key_array[index]" and then ignoring the variable.
    [1]: ruclips.net/video/ncHmEUmJZf4/видео.html
    [2]: ruclips.net/video/ncHmEUmJZf4/видео.html
    [3]: larshagencpp.github.io/blog/2016/05/01/a-cache-miss-is-not-a-cache-miss
    [4]: ruclips.net/video/ncHmEUmJZf4/видео.html
    [5]: ruclips.net/video/JZE3_0qvrMg/видео.html

  • @vipuljain4494 · 3 years ago +2

    insightful

  • @MalicePandrodor · 3 months ago

    What about adding new elements? We talk about finding elements all the time, but we skipped the part (or I missed it) about how to add a new element to such a hash table. Having elements at fixed positions in the table makes adding new elements with a given hash more likely to collide with already existing elements (this is why we start with a linked list in the first place). Do we just iterate over the table and use the first free slot?
    And how do we choose the group to add an element to? The final design suggests that we can add it in a random place because the search is later done in parallel.
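
A sketch of how insertion works in an open-addressing table of this kind, for readers with the same question (assumptions: `ctrl`, `kEmpty`, and `kDeleted` are illustrative names, the capacity is a power of two, and wraparound plus the group-at-a-time SSE probing are simplified to a byte-at-a-time loop). The group is not chosen randomly: insertion follows exactly the probe sequence that find() uses, so a later find() retraces the same path.

    #include <cstddef>
    #include <cstdint>

    constexpr int8_t kEmpty   = -128;  // illustrative sentinel values
    constexpr int8_t kDeleted = -2;

    // Sketch: pick the slot for a new element. A real insert first runs a
    // find() over the same probe sequence (stopping at the first empty slot,
    // which proves absence) and only then reuses the earliest kDeleted or
    // kEmpty slot it passed.
    size_t find_insert_slot(const int8_t* ctrl, size_t capacity, size_t hash) {
        const size_t mask = capacity - 1;   // capacity is a power of two
        size_t pos = hash & mask;
        while (ctrl[pos] != kEmpty && ctrl[pos] != kDeleted)
            pos = (pos + 1) & mask;         // real code probes 16-slot groups
        return pos;
    }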

  • @brenogi · 6 years ago +8

    Have you guys played with prefetching? I honestly never got it to work positively myself, but on the other hand I've never gone down to SSE instructions either.
    Perhaps you could prefetch the data before doing the lookup on the metadata, so by the time you find a match (if any) the data could already be in place or on its way.

    • @brenogi · 6 years ago +1

      By the way, I'm implementing a hash table using the ideas from this talk. Apparently the prefetch did give me 15% or so, for the use cases my benchmark targets.

    • @brenogi · 6 years ago

      On second thought, maybe my implementation is still crappy enough that there is time for the data from the prefetch to arrive...

    • @brenogi · 6 years ago

      On third thought, for large tables the metadata access will be a cache miss, and prefetching the data before accessing the metadata gives the data a good chance to be there if it turns out to be necessary.
      That's what I observe in my implementation.

    • @mattkulukundis6638 · 6 years ago +5

      We have played with it but haven't reached a firm conclusion yet. Intuitively, issuing a prefetch for the data before accessing the metadata should be a win for find-hit cases.
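
A sketch of that intuition, for anyone experimenting along with this thread (assumptions: all names are illustrative, `__builtin_prefetch` is the GCC/Clang builtin, the hash is 64 bits, and the group-wise probing is simplified to a byte-wise loop): issue the prefetch for the slot array as soon as the home position is known, then probe the metadata while the data cache line is in flight.

    #include <cstddef>
    #include <cstdint>

    constexpr int8_t kEmpty = -128;                     // illustrative sentinel

    inline bool h2_matches(int8_t ctrl, size_t hash) {  // hypothetical helper:
        return ctrl == static_cast<int8_t>(hash >> 57); // compare top 7 hash bits
    }

    template <class T>
    const T* find_with_prefetch(const int8_t* ctrl, const T* slots,
                                size_t capacity, size_t hash, const T& key) {
        size_t pos = hash & (capacity - 1);
        __builtin_prefetch(&slots[pos]);  // start pulling the data line now
        while (ctrl[pos] != kEmpty) {     // probe metadata while data is in flight
            if (h2_matches(ctrl[pos], hash) && slots[pos] == key)
                return &slots[pos];
            pos = (pos + 1) & (capacity - 1);
        }
        return nullptr;
    }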

  • @llothar68 · 7 years ago +5

    Waiting for it. Please make it open source.

    • @mattkulukundis6638 · 7 years ago +20

      I am working on it. I cannot make any hard commitments or guarantees, but it is high on my priority list.

    • @PeterGoodman · 7 years ago

      Any movement on this front? Would love it as a new year's gift!

    • @mattkulukundis6638 · 6 years ago +6

      I would too. There is some internal movement (although clearly I would have lost any bet I made). I can't make any commitments, but we are working on it.

    • @goauld88 · 6 years ago +2

      It's open source now on GitHub, under Abseil.

  • @Bittboy · 7 years ago +5

    So like uhh, if this Matt dude isn't from Southern California, that'd be like...shocking? I never knew the "surfer dude" went on to work at Google :P -- But really though, I found this talk fascinating! Great job on the presentation and keeping the mood real light!

    • @mattkulukundis6638 · 7 years ago +9

      Actually I am from Minnesota. Although I attended college on the east coast.

  • @goauld88 · 6 years ago +3

    Finally open source! abseil.io/blog/20180927-swisstables

  • @cutekitten9856 · 7 years ago +13

    Matt, thanks for the awesome talk! I have a question: can you please clarify what the "green" map in these stacked graphs was, and why it was being used over std::unordered_map and dense_hash_map?

    • @mattkulukundis6638 · 7 years ago +3

      `__gnu_cxx::hash_map`. It was an extension to the library that is similar to std::unordered_map but predates C++11.

  • @jacobschmidt · 3 years ago +2

    did y'all ever consider doing h & (b-1) instead of h % b for the starting bucket? it just seems that you need something to truncate h to be in range [0, b).

    • @jacobschmidt · 3 years ago

      nevermind! looking at 51:30 it seems like y'all already do
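
For anyone following along: when the bucket count b is a power of two, the two forms pick the same bucket, but the mask compiles to a single AND where % would emit an integer division. A tiny sketch:

    #include <cassert>
    #include <cstddef>

    // When b is a power of two, the low bits of h are exactly h % b,
    // so the bucket index reduces to a single AND.
    size_t bucket_index(size_t h, size_t b) {
        assert(b != 0 && (b & (b - 1)) == 0);  // b must be a power of two
        return h & (b - 1);                    // same result as h % b
    }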

  • @marinversic9469 · 4 years ago +9

    I must say I don't understand the bashing of the jokes in the comments. I found most of the jokes quite amusing.

  • @laurivasama593 · 6 years ago

    When removing with arbitrary group positions, even if the group that the element was found in has empty slots, there may exist another group containing the same element and no empty slots. Do you check for empty slots in both the group ending and the group beginning with the to-be-removed element, in order to know whether the slot may again be marked as empty, or must be marked as deleted?

    • @mattkulukundis6638 · 6 years ago

      With arbitrary group positions you have to verify that [pos-15, pos+15] does not contain a run of 16 non-empty items.
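
A sketch of that check (assumptions: `kEmpty` is an illustrative sentinel, the scan is byte-wise, and table-edge wraparound is omitted). The slot may go back to empty only if no window of 16 consecutive non-empty slots covers it, because only such a window could have forced some probe to continue past it; otherwise it must become a tombstone.

    #include <cstddef>
    #include <cstdint>

    constexpr int8_t kEmpty = -128;  // illustrative sentinel

    // After erasing the element at `pos`: may the slot be marked kEmpty,
    // or must it be marked kDeleted? Count the run of non-empty slots
    // around pos; a run of 16 or more means a probe may have passed it.
    bool can_mark_empty(const int8_t* ctrl, size_t capacity, size_t pos) {
        size_t before = 0;
        while (before < 15 && pos > before && ctrl[pos - before - 1] != kEmpty)
            ++before;
        size_t after = 0;
        while (after < 15 && pos + after + 1 < capacity &&
               ctrl[pos + after + 1] != kEmpty)
            ++after;
        return before + 1 + after < 16;
    }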

  • @frostiebek · 7 years ago +6

    Really inspiring presentation.
    Any reason why you chose 128-bit SSE over 256-bit AVX2, or even AVX-512?
    Did you get the chance to test those?

    • @mattkulukundis6638 · 7 years ago +16

      We have done tests with 256-bit versions and did not find a win over 128. We have not tried 512.
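
For reference, the 128-bit match the talk shows is only a few SSE2 intrinsics (a sketch of the technique; the function and variable names are mine):

    #include <cstdint>
    #include <emmintrin.h>  // SSE2

    // Compare one 16-byte group of control bytes against a 7-bit hash
    // fragment. Bit i of the result is set iff slot i matches h2.
    uint32_t group_match(const int8_t* ctrl, int8_t h2) {
        __m128i group  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
        __m128i needle = _mm_set1_epi8(h2);            // broadcast h2 to 16 lanes
        __m128i eq     = _mm_cmpeq_epi8(group, needle);
        return static_cast<uint32_t>(_mm_movemask_epi8(eq));
    }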

  • @greg7mdp2 · 6 years ago +1

    release date as open source?

    • @mattkulukundis6638 · 6 years ago +3

      We are not committing to anything, but we are working on it. I know it is frustrating to wait. We are balancing the value to the community of releasing it as part of Abseil and the value we get from doing internal migrations of our code base.

  • @ulfjohansen2139 · 4 years ago +4

    I don't get the example at 38:35. He says it is going to pick the r-value insert, but that's wrong. It is going to pick the templated forwarding-reference one, which will forward the pair as a reference, so I don't get how he can measure any difference. Maybe I misunderstood something, but I even tried the code on VS2019 and there was no additional copy.

    • @pruibiebehastoet1914 · 3 years ago

      It's going to pick the templated P&& overload, which is a better match for non-const arguments:
      ... insert( const value_type& value );       // not this one
      template <class P> ... insert( P&& value );  // but this one

    • @董宏峰 · 1 year ago

      @@pruibiebehastoet1914 Hi, I understand it is picking the templated insert overload. But why would there be a copy? I think the template is instantiated as `pair insert(P& value)`, so it is still a reference, and is then forwarded into string's constructor. I can't figure out why there's a copy 😨. Would you please explain a little? Thanks

    • @董宏峰 · 1 year ago

      Okay, I think I figured it out. It is because inside the templated insert, GCC's implementation (v12.1) constructs a temporary key. Clang doesn't seem to have this problem.
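
A small experiment for this sub-thread (the copy-reporting `Noisy` key type is mine, and C++17 is assumed for the out-of-namespace std::hash specialization; whether an extra copy shows up depends on the standard library, per the GCC-vs-Clang observation above):

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // A key type that reports copies and moves.
    struct Noisy {
        std::string s;
        Noisy(const char* p) : s(p) {}
        Noisy(const Noisy& o) : s(o.s) { std::puts("copy"); }
        Noisy(Noisy&& o) noexcept : s(std::move(o.s)) { std::puts("move"); }
        bool operator==(const Noisy& o) const { return s == o.s; }
    };

    template <>
    struct std::hash<Noisy> {
        size_t operator()(const Noisy& n) const {
            return std::hash<std::string>{}(n.s);
        }
    };

    int main() {
        std::unordered_map<Noisy, int> m;
        std::pair<Noisy, int> p{"hello", 1};  // note: not pair<const Noisy, int>
        // Overload resolution picks template<class P> insert(P&&) with
        // P = pair<Noisy, int>&, an exact match; the library then has to
        // build the node's pair<const Noisy, int> from it, and how many
        // copies that takes is implementation-specific.
        m.insert(p);
    }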

  • @atomicgray · 7 months ago

    Is the new hash table faster than vectors for iterations?

    • @hampus23 · 7 months ago

      Depends on what you are trying to do

    • @mattkulukundis6638 · 5 months ago +1

      Almost never. Vectors are perfectly contiguous, so iteration over them is essentially optimal.

  • @projectshave · 6 years ago +2

    Excellent talk! I have the same sense of humor, so I thought it was funny too.

  • @saurabhnsit2001 · 4 years ago

    Very insightful talk. Has flat_hash_set been made open source now? If yes, can you please provide a pointer for finding it on the internet?

  • @TheEVEInspiration · 7 years ago +8

    Nice work, I will say, but boy is C++ in a messy state when it's hell even for library makers.
    I know they like hell, as it allows them to show off their cryptic ways, but it's still a big loss when it comes down to it. I used to love C++, but only use parts of it due to the clutter. With all the new layering of crap on top of crap, it looks less and less appealing to ever go back to.
    It's a reasonable question to raise which knowledgeable developer will seriously consider it now as a productivity language.
    And without actual users, there is no need for library makers and the language slowly dies out.
    Reminds me of games where devs focus on the wrong things; it never ends well.
    Still, it is good to see a focus on efficiency, and the daring to break baseless standard requirements that are of use to no one. We need much more of that in this field. Simplify our languages and other interfaces and work more declaratively instead of coding everything out all the time. Current languages are all wrong in these ways, as the true problems people need to solve are very similar and could be described compactly and with little code.
    The focus must shift from making tools that satisfy tool makers to making tools that satisfy end-user requirements. This hash is a pretty good example of better priorities. I wish this were done at the level of languages and core products, such as databases, too. In fact these should be integrated much more, along with data visualization tools.

    • @anon_y_mousse · 2 years ago

      I don't know if you'll see this 4+ years later, but 100% agreed. Solve new problems, don't rewrite old solutions.

  • @marcel_pi · 6 years ago

    Very nice talk! One question though: Is there a particular reason for the slot count of a group being 16? Why not 32 for example?

    • @mattkulukundis6638 · 6 years ago +13

      We tried 32 and didn't see any performance win from it, so we went with the one that allows for smaller tables.

  • @xavierthomas1980 · 7 years ago

    Great talk, thanks. When you say "the first 7 bits", does this mean "the 7 least significant bits of the most significant byte" on a little-endian system? Also, I tried to search for this bit-distribution entropy thing but mostly get results for crypto hash functions that discuss the distribution of the hash function outputs over the output space, which sounds a little different to my ears, because we can still be uniformly distributed over the output space without any entropy in the LSB. If you could provide a link regarding this I would appreciate it. And just to confirm my thoughts: would MurmurHash2 or 3 comply with the requirements (I think they do), and would using a 32-bit hash value on a 64-bit system cause an issue (I see no issue with this besides diverging from the standard)? Thanks

    • @mattkulukundis6638 · 7 years ago

      MurmurHash is good. It happens that a 32-bit hash will work if your table has fewer than 2**25 elements, but that is not a guarantee we will provide officially.

    • @mattkulukundis6638 · 7 years ago

      Realizing I didn't answer your question: it is currently the least significant bits, but we don't guarantee anything on that front.

    • @xavierthomas1980 · 7 years ago

      Thank you for the feedback. I think I will set out to code a quick version of this; it all seems quite accessible (minus the asm optimization and the smart stuff to figure out the best-performing version of what the user wants).

    • @immanuelhaffner6915 · 4 years ago

      You can use a so-called "bit mixer" to populate all bits with the entropy of only a few bits. MurmurHash's "finalizer" is known to be a good bit mixer. A detailed discussion of Murmur3's finalizer can be found here: zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html. I should add that the article claims the bit mixer adds entropy, but that is not correct.
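
For reference, the finalizer in question is Murmur3's fmix64, shown below; it is a bijection that diffuses each input bit across all 64 output bits, which is exactly why it redistributes entropy without adding any:

    #include <cstdint>

    // MurmurHash3's fmix64 finalizer: a bijective bit mixer. Every input
    // bit influences every output bit, but no entropy is created.
    uint64_t fmix64(uint64_t h) {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }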

  • @brenogi · 7 years ago +1

    Is there any concern about the size overhead of the object itself? Are you guys optimizing for that as well?
    I sometimes face situations where there are millions of small hash map instantiations (and some of those are large, so I need them to be succinct). And some compact hash map implementations I found go up to 88 bytes or more for the object itself. That kills the gains from the implementation being compact.

    • @mattkulukundis6638 · 7 years ago +5

      sizeof(flat_hash_map) is about 40 bytes. We would like it to be smaller, but that has been a lower priority optimization. We might get it down to 32, but we need to ensure no performance regressions doing it. A zero element flat_hash_map allocates no memory at all.

    • @brenogi · 7 years ago

      Sweet!
      I can't wait :)
      40 bytes is already far better than the 56 from STL (godbolt.org/g/Uiiau8).

    • @bikereview3117 · 7 years ago

      40/56 bytes only on the stack, not in the heap ;)

    • @brenogi · 7 years ago +2

      Ivan Ivanov, yes, I know. Those "stack" 40/56 bytes make a difference in applications where millions of hash maps are allocated. I say "stack" because if you allocate the hash map on the heap, then it's on the heap :)

    • @greg7mdp2 · 7 years ago +2

      Check out my small hash optimization: github.com/greg7mdp/sho

  • @TimAnderson-ye4rq · 9 months ago

    11:45 why not just put it at the second slot of the linked list?

  • @johnjonjhonjonathanjohnson3559 · 6 years ago

    GOML ass-embly optimizer

  • @KabelkowyJoe · 1 year ago

    In the first 3 minutes of unfunny jokes he almost lost me; I skipped to 23:00 (skipping unordered_map) and will probably watch the rest.
    EDIT 25:00: good point ("we want to use all the bits well"), but another joke... OMG, how old is he? 8? The point was fewer collisions on average, adjusted for the data set's distribution. At this moment I came up with a "stupid idea": make a translation table for the data (data -> table[data] -> result data for the hash calculation), statistically adjusted for your data distribution (in a first testing pass), so more bits go to the numbers/letters seen most of the time and fewer bits to ranges seen less frequently, focusing more bits on local context (local order). I noticed that some hashes use sum and xor in exactly this way: the sum is the broad context of the data, xor is the local context and loses no data, shifts just position the data and cut off useless bits, and % by prime numbers does the same with the sums. Am I correct?
    I assume all the "research into perfect hashing" is just an attempt to adjust the hash to the most common data distribution so as to have fewer collisions. Since, for example, ASCII uses 5-6 bits at most, shifting the data by 2-3 bits usually helps. Probably what I just wrote is stupid...
    Question: to watch or not to watch more?