CppCon 2016: Chandler Carruth “High Performance Code 201: Hybrid Data Structures"

Поделиться
HTML-код
  • Опубликовано: 3 янв 2025

Комментарии • 71

  • @kamilziemian995
    @kamilziemian995 2 года назад +11

    Chandler Carruth talks are great. I try every one of them available on the YT.

  • @MarcusAseth
    @MarcusAseth 6 лет назад +120

    man, Chandler makes the camera man work really hard for his money xD

  • @subbyraccoon
    @subbyraccoon 8 лет назад +19

    Chandler is my favorite speaker after Herb, good talk!

    • @seditt5146
      @seditt5146 5 лет назад +1

      I agree, another one worth catching all his talks is Ansel Sermersheim. He makes hard to understand subjects highly consumable. Timur Doumler also has good talks but he is not the best public speaker, not the worst, he is very good, but not Herb or Chandler levels at all.

  • @barrettellisadair
    @barrettellisadair 8 лет назад +7

    I always enjoy Chandler's talks - this one has great video/audio quality as well.
    I would like to emphasize the fact that this pointer-packing stuff is implementation-defined, and generally pointless outside of a performance bottleneck. So, some of the code you don't see in the slides likely contains a lot of platform-specific configuration macros (but I haven't looked at these parts of the Clang code base yet).
    TL;DW: Cater to the cache, pack your pointers.

    • @MrAbrazildo
      @MrAbrazildo 8 лет назад

      I supose you are talking about the 4-bits-packing stuff, instead of the array of unique pointers, for large objects, right?

    • @10se1ucgo
      @10se1ucgo 8 лет назад

      Agreed, he's definitely my favorite presenter.

  • @soonts
    @soonts 8 лет назад +9

    @20:00 Microsoft has very similar implementation in their ATL, since 2003. Their implementation is called CAtlPlex, and is used internally by many ATL containers.

    • @BlairdBlaird
      @BlairdBlaird 2 года назад +1

      Apple also uses that at the language level (of both ObjC and Swift IIRC). In fact on 64b they use not only the low bits but also the high bits, because all extant 64b implementations only have 48b address space at the moment (and x86-64 is architecturally limited to 52 bits).

  • @insideloop22
    @insideloop22 8 лет назад +5

    As usual, it is a joy to listen to Chandler's talks. I am working on an Open Source library of data structures for High Performance Computing and one of the first thing I had to do is to rebuilt a custom std::vector. One of the many reason I had to do that is that std::vector v(n); fills the vector with zeros which trashes the cache and prevents you to use the "first touch policy" which is extremely important for NUMA architectures such as dual socket nodes. Alignement is also an issue to get efficient vectorization in HPC. These things could be "solved" with allocators, but I just hate them as I don't want them to change the type of my containers.
    I have a few questions though:
    - Do you have a llvm::Vector or do you use llvm::SmallVector as a replacement for std::vector?
    Do you also use llvm::Vector as a replacement of std::array?
    - There is a BumpPtrAllocator, but is does not seem possible to use it with SmallVector. Do you have any design ideas for allocators? I just can't get them to work the way I way. I hate both the way they are done in the C++ standard library and the Bloomberg implementation.

    • @ChandlerCarruth
      @ChandlerCarruth 8 лет назад +8

      We use SmallVector pretty much everywhere. On very rare occasions we'll want what std::vector provides, but it's unusual.
      std::array is a totally different beast -- it is a statically sized array which has essentially no overlap with the vector use cases. I suspect there are some places where we could use std::array and currently use llvm::SmallVector, but I think they're rare.

    • @ChandlerCarruth
      @ChandlerCarruth 8 лет назад +2

      And regarding allocators -- for LLVM, the pattern we fall into more often is needing to control allocation of the *objects* with allocators, not the buffers in the data structures. For those buffers, tcmalloc, jemalloc, or any of the other optimized free-list based malloc libraries will do a fine job.

    • @henrikvallgren1943
      @henrikvallgren1943 8 лет назад

      I tried to get attention to this issue a while back:
      groups.google.com/a/isocpp.org/forum/#!msg/std-proposals/b946VongJjY/RYYj--v63qsJ

    • @insideloop22
      @insideloop22 8 лет назад +5

      There are too many things in the standard library which are broken for me and I am glad I decided to write my own containers. It provides me so many useful things. As I am working mainly on numerical simulation, all I really need is a good std::vector implementation. Having my own allows me to:
      - Have an il::Vector v(n) that does not fills the vector with 0 (same with float, int, etc)
      - Have an il::Vector v(n) that gets initialized to NaN in debug mode (very useful if you want to track uninitialized values)
      - Add my own constructor such as il::Vector v(n, il::align, 64) if I want to align the buffer to a cacheline
      - Easily count the number of copy constructor called in il::Vector (if I want to make sure that the move semantics are properly done)
      - Having il::Vector::size() return a signed integer (I hate unsigned integers and all the bugs they lead to)
      - Write a il::SmallVector as Chandler has pointed
      - so many useful things...
      I like the language C++, but I just have so many troubles with the standard library. I am glad to hear that LLVM worked on their own containers. I can point people who say I am stupid not to use the standard library to this kind of talk.

    • @MrAbrazildo
      @MrAbrazildo 8 лет назад +1

      So, did you already write your own, or just want to?

  • @Cons-Cat
    @Cons-Cat 3 года назад +2

    I'm quite sure that the paramaterized allocators problem is easily solvable with C++20's `concept` feature now.

    • @ivanzhechev4215
      @ivanzhechev4215 Год назад +1

      Slowly learning C++20, but how would you achieve that? The only way I can think of is rejecting incompatible allocators and giving a compiler error, which doesn't really achieve the result of having the SmallVector on API boundaries.

  • @davidforgeas2235
    @davidforgeas2235 8 лет назад +4

    Is there a significant performance benefit to use the hybrid 32-bit pointers and 64-bit registers ABI, namely x32 ABI? I don't mean for the generated code, but specifically for the compiler itself?

    • @SianaGearz
      @SianaGearz 8 лет назад +4

      So you get an expanded register file with 8 extra registers, but register scarcity has not been a major issue since about Pentium Pro, since top of stack is actually aliased to an unknown number of hidden registers and the corresponding operations get removed at the opcode decoding stage. You'd get a slight bottleneck at opcode decoding but it was pretty much removed in the later iterations of Pentium III.
      There also doesn't seem to be any difference whatsoever in compiler performance between pure 32-bit and 64-bit ABIs, at least from my experience with GCC - don't know whether it extends to Clang, but I strongly suspect so.

    • @dat_21
      @dat_21 6 лет назад

      Can you elaborate on "top of stack" aliasing? I don't find any mentions of it in intel optimization guide or Agner Fog's guides. Maybe you are talking about FP register stack?

  • @spudtaters8419
    @spudtaters8419 8 лет назад +4

    The smallvector slide at 5:26 is confusing me because of omissions: 1) Can all small size vectors be upgraded to heap storage on overflow 2) Is SmallVectorImpl what you use for function params? I assume so, it just sounds like a wierd / wordier name.

    • @insideloop22
      @insideloop22 8 лет назад +2

      1) Yes, with SmallVector v; they can call v.resize(10);
      2) The name is a bit weird, but I don't know which one they could have chosen

    • @xfreeman86
      @xfreeman86 6 лет назад +3

      For 2), the names should have been switched. Then you allocate SmallVectorImpl but pass around SmallVector.

    • @smestre
      @smestre 5 лет назад +1

      @@xfreeman86 what about allocating SmallVector and passing around BasicSmallVector

  • @enhex
    @enhex 8 лет назад +2

    12:00 Can alias templates be used to avoid those mistakes?

    • @ChandlerCarruth
      @ChandlerCarruth 8 лет назад +3

      Yes, but at the expense of having to come up with a new name for each different small-size-optimized container in a function's API boundary. =/ Doesn't seem to scale very well.

    • @xfreeman86
      @xfreeman86 6 лет назад

      You can use the same technique you use to remove the size parameter from the type of SmallVectorImpl. Allocators are passed by reference, so no slicing worries, and you can use the pimpl idiom for polymorphic copying. My jury is still out on whether it's better, but it is possible.

  • @JohnDlugosz
    @JohnDlugosz 5 лет назад +2

    6:17 Slide 6 -- How does the SmallVectorImpl::push_back know whether to free the old buffer? Without knowing the original N, it can't tell if it's been reallocated to the heap already, or was still using the small buffer.

    • @pazdziochowaty
      @pazdziochowaty 5 лет назад +3

      It can check if Begin points right after SmallVectorImpl object - this is where derived class keeps the buffer

  • @onosendaiAAB
    @onosendaiAAB 8 лет назад +3

    Hi Chandler,
    I may be misinterpreting you, but at ruclips.net/video/vElZc6zSIXM/видео.html
    are you implying that modern CPUs do indirect memory prefetching?
    e.g. are you implying that CPUs will work out you are iterating through an array of pointers, read one or more indices ahead in that array, and then prefetch the memory pointed to by that pointer?

    • @deadalnix
      @deadalnix 8 лет назад

      Modern Intel do it to some extent. I'm not aware of any other architecture doing it.

    • @onosendaiAAB
      @onosendaiAAB 8 лет назад

      Interesting, do you have any further information or references about that?

    • @onosendaiAAB
      @onosendaiAAB 8 лет назад

      I've just done some tests on my Ivy Bridge, and I'm pretty sure it doesn't do indirect memory prefetching.

    • @deadalnix
      @deadalnix 8 лет назад

      Note that prefetching on pointer load is bound to be slow no matter what. If you want to measure if the CPU does prefetching or not, make sure you use the same algorithm, but one doing prefetching explicitly using compiler intrinsic, and the other letting the CPU do its thing.
      If you see no significant perf difference, your CPU does prefectching. If the one with prefectching is faster, then you CPU doesn't do prefecthing.
      I wouldn't be able to tell you if your specific CPU does it. Some Intel do it, but I'm not sure which ones.

    • @noahgoldstein3951
      @noahgoldstein3951 2 года назад +2

      @onosendaiAAB I think you're right there is no indirect prefetching. Possibly he means that the execution of later iteration loads can occur speculatively and in parallel to the first load because there is no dependency.

  • @TheEmT33
    @TheEmT33 5 лет назад +2

    i love this guy

  • @wintermute701
    @wintermute701 8 лет назад +5

    The talk from 2 years ago he mentions in the beginning: ruclips.net/video/fHNmRkzxHWs/видео.html

  • @SolomonUcko
    @SolomonUcko 4 года назад +1

    Is this the previous talk? ruclips.net/video/fHNmRkzxHWs/видео.html ("Efficiency with Algorithms, Performance with Data Structures")

  • @MrAbrazildo
    @MrAbrazildo 8 лет назад

    I don't get the point about SmallVector class having a fixed size, whereas the containers goal is to change it. I can only guess that 'N' is the maximum size the "simulated container" will ever get, right? But how would you know that?

    • @ChandlerCarruth
      @ChandlerCarruth 8 лет назад +2

      It's not a fixed size -- it's an initial *allocation* size. The container still is dynamically sized like std::vector.

    • @MrAbrazildo
      @MrAbrazildo 8 лет назад

      +Chandler Carruth I saw the push_back f(), however, it will produce just 1 element. And even if it create another buffer, how Begin and End would work with the 2 or more arrays, without an extra code to handle the situation?
      I guess you are talking about low-level, which would allow you to delete an compiling-time array, and realocate it, unlike C/C++ allow to do, right?

    • @ChandlerCarruth
      @ChandlerCarruth 8 лет назад +2

      No, this is totally built with what C++ already allows you to do.
      The begin and end pointers start off pointing at the inline small buffer, and if extra space is needed it gets allocated and the begin and end pointers are moved to point at the new buffer.
      You can find all the code for the real implementation here: github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/ADT/SmallVector.h

    • @MrAbrazildo
      @MrAbrazildo 8 лет назад

      Chandler Carruth So, the penalty is risking to have an unused array, after the memory enlarge. Good idea, though. The pointers can also return to the array, if memory is freed.
      How much an array is faster than std::list? For GCC, some years ago, I measured that an array, simulating std::map, is 6-7 times faster.

    • @von_nobody
      @von_nobody 7 лет назад

      btw How about intrinsic lists? Or when you say "list" you mean list `List` instead of `List`?
      www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists
      This have very good properties if list are very dynamic and object can
      move between lists. Of corse it have one big drawback, object can
      belongs only to one list at time.

  • @Shockszzbyyous
    @Shockszzbyyous 7 лет назад

    ok i see so the idea behind smallvector is that you can allocate some space before hand? instead of allocating when pushing something on to the vector array ?

    • @tomcheng3903
      @tomcheng3903 7 лет назад +7

      No, the point is that you allocate it on the stack, not the heap. If it's too large for the stack, then it'll fallback to the heap.
      You can allocate space in a vector beforehand with vector::reserve().

  • @tribalfromthetrapdontrappi3030
    @tribalfromthetrapdontrappi3030 6 лет назад +1

    Nice!

  • @MrAbrazildo
    @MrAbrazildo 8 лет назад +6

    _Value.template_, I don't even know what this means.

    • @barrettellisadair
      @barrettellisadair 8 лет назад +6

      It is a disambiguation for accessing a templated member of an object that is itself a dependent type.

    • @xfreeman86
      @xfreeman86 6 лет назад +3

      It's hard to parse as a human, and you just happened to get it wrong is all. Read it as (Value).(template is)(). It's for when you call a template member function with type parameters that cannot be deduced from its formal parameters.

    • @NXTangl
      @NXTangl 5 лет назад +1

      @@xfreeman86 ...wait, you're Gordon Freeman's brother, aren't you?

  • @max0x7ba
    @max0x7ba 8 лет назад

    Bug in slide 21: `(Size >= (End - CurPtr))` should be `(Size > (End - CurPtr))`.

    • @stdafx
      @stdafx 5 лет назад +2

      In the original source code it looks a bit different: "if (Adjustment + SizeToAllocate

  • @MrAbrazildo
    @MrAbrazildo 8 лет назад +1

    > 5:00, in derived class, you could had use the base constructor instead (C++11 and above): _using SmallVectorImpl::SmallVectorImpl;_.
    > 8:36, at least for GCC, _std::pair_ drops performance by the half, compared to 2 separately values. It was shocking to me (absurd geek moment)! However, I guess that an std::array , and a first and second f()s, fix this issue.
    _Edit: nope. This array is as slow as std::pair._

    • @EvanED
      @EvanED 8 лет назад

      Re. pair, what are you comparing? Separate correlated arrays for keys and values (e.g. 'KeyType keys[N]; ValueType values[N];) vs an array of pairs?

    • @MrAbrazildo
      @MrAbrazildo 8 лет назад

      +EvanED I'm comparing:
      1) int m; int n;
      2) std::pair p;
      3) class Pair {
      std::array a;
      public:
      inline const int &first () const {return a[0];}
      inline int &first () {return a[0];}
      The same for second, returning a[1];
      };
      The case 1 is more than 100% faster than the other ones.
      I also tryed to make a 4th case: packing a pair inside a size_t variable, which turned out to be more than 100% slower than case 1 too.

    • @Sopel997
      @Sopel997 8 лет назад

      the thing is array is homogenous

  • @rubberduck2078
    @rubberduck2078 3 года назад +1

    >llvm
    >Fast

  • @NoNameAtAll2
    @NoNameAtAll2 3 года назад +2

    black magic

  • @alexeiz
    @alexeiz 8 лет назад +1

    Contrary to what Chandler says, bit packing is slow. You won't find it in use very often in the modern code. Bit packing makes sense only if you save substantial amount of data size (megabytes).

    • @MattGodbolt
      @MattGodbolt 8 лет назад +20

      Do you have any qualification for that? On the machines I use (Haswell server class), bit packing is almost free, taking around 2/3rds of a CPU cycle to shift and mask.

    • @DeGuerre
      @DeGuerre 8 лет назад +13

      Bit packing almost always wins if it means your working set fits in cache, or for anything that is to be transmitted over a network.

    • @deadalnix
      @deadalnix 8 лет назад +9

      Bit packing: 1/2 cycles.
      L1: 3-4 cycles
      If your data fit in L1, then bitpacking is probably not worth it. L1 is 32k.
      L2: 10-12 cycles
      If your data fit in L2, it's probably a good idea, but not sure you'll win a lot. L2 is 256k.
      If you manipulate anything bigger than this, bitpacking is DEFINITIVELY worth it. Most programs are here.
      Caveat: bitpacking makes your code bigger, so if you are icache bound - you have a large executable with fairly flat profile - then you may want to reconsider.

    • @SerBallister
      @SerBallister 8 лет назад

      Depends on your usage case, if you are primarily setting/clearing bits then it will need more instructions compared to say, using an array of bytes, but if you are testing them, particularly multiple bits at the same time then you make savings.

    • @nameguy101
      @nameguy101 8 лет назад +21

      Found the Javascript coder.

  • @jonesconrad1
    @jonesconrad1 6 лет назад +1

    i just got lost on xkcd.com for 10 minutes