Programs can also exploit modern architecture's implementation of virtual address translation, which is blazingly fast - done in hardware. Therefore you don't need to copy data around. Because the virtual address space is always very large (128 to 256 TiB), you can reserve large enough areas of virtual memory for each array - even more than total system RAM - and delay committing pages to physical memory until they are needed, as the array grows. This approach has the added benefit that raw pointers will always remain valid after the array is expanded. The disadvantages of using this approach right now are basically none, AFAIK, but in the future it may run into scalability issues as programs grow to consume ever larger amounts of RAM.
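For anyone curious what that looks like in practice, here is a minimal sketch of the reserve-then-commit idea on Linux with mmap/mprotect (Windows would use VirtualAlloc with MEM_RESERVE/MEM_COMMIT instead); the 64 GiB reservation and the 1 MiB "grow" step are just made-up numbers for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define RESERVE (1ULL << 36)        /* reserve 64 GiB of address space */

int main(void) {
    /* Reserve address space only: no physical memory is committed yet. */
    char *base = mmap(NULL, RESERVE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) return 1;

    size_t committed = 0;
    size_t needed = 1 << 20;        /* pretend the array grew to 1 MiB */
    if (needed > committed) {
        /* Commit pages as the array grows; the base pointer never changes. */
        if (mprotect(base, needed, PROT_READ | PROT_WRITE) != 0) return 1;
        committed = needed;
    }
    memset(base, 0, needed);        /* safe: these pages are now committed */
    printf("array lives at %p and will never move\n", (void *)base);
    munmap(base, RESERVE);
    return 0;
}
```

Because only the committed pages ever consume physical memory, the huge reservation mostly just costs address space.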
@@Nesdac-k1l caching space locality uses the virtual address space to know which pages/lines to fetch. Very simplified example: if you have addresses 1, 2, 3, 4, 5, and 6 in your virtual space, which map to pages 3, 8, 5, 12, 1, 2 of physical memory, and you access virtual page 2, it will look ahead and access virtual pages 3 and 4, look them up, and load physical pages 5 and 12. (Do note that caching is not really at the page level - but accessing any memory will then actually access the physical memory mapped to the virtual address you access, and it will get cached as normal)
@@Nesdac-k1l doesn't affect caching. L1 and L2 cache are based on virtual addresses and L3 on physical ones. An mremap doesn't change either, though the system call itself will invalidate caches.
Can people stop calling things "blazingly fast", please? I can only associate it with companies making unsubstantiated claims in order to sell you their product.
You overlooked a key point here: if there's enough space to account for the resized array, nearly all allocators, at least the good ones, will merely alter the recorded size of the allocation and return the same pointer to you. In that situation you don't need to copy the elements of the array at all, because the array was resized in place. Where you have the space to keep allocating in place, a growth factor of 2 will be significantly better. If the allocator places chunks far enough apart, then multiple array allocations will work in that more efficient manner.
@@RF-xr7sx Problem is, if you've reached that kind of worst case, your allocator probably won't be able to satisfy your request anyway, regardless of strategy.
As I said on another comment, this is actually decades out of date... at least for most allocations with modern malloc() implementations. Except sometimes for giant allocations, the heap will be broken into size classes and realloc() rarely gets a chance to do that magic grow in-place. Basically it was an important optimization for vectors in the 1980s or 1990s. Rarely helps now with modern malloc implementations because their allocation strategies are optimized to avoid heap fragmentation (which is critical for performance) and that happens to make grow-in-place difficult.
@@mitchblank No, it's not out of date. Size classes are large enough that multiple subsequent allocations will indeed grow in place. For that matter, they also coalesce adjacent deallocations so that subsequent allocations in the same arenas will yield memory that was previously fragmented.
Videos like this are so inspiring and I literally got into a mini rabbit hole about it. I was just thinking about it for like 2-3 hours and tried calculating stuff. In the end I accomplished nothing, but it never felt like wasted time. It was truly fun!
Real-world applications and memory allocators are actually much more complex (and smarter) than the one given in the video, so some of the claims are less applicable in real life.

First off, as mentioned in the video, real-world applications tend to have many arrays allocating together. With a free-list allocator like that in the video (e.g. glibc malloc), different reallocated arrays will fill in the gaps created during resizes, so the total memory usage will not be as bad as the initial examples.

Additionally, many modern allocators actively control and reduce memory fragmentation (unused gaps between allocations). Some allocators (like jemalloc and mimalloc) have the notion of a "size class", allocating memory of similar sizes together (wasting a little bit of memory for padding) and different ones in completely different places. One page can be dedicated to allocating memory chunks of 481 to 512 bytes, another dedicated to 513 to 640 bytes, and so on. Arrays of different sizes won't meet at all because they belong to different size classes, and the empty space within one page will be filled with allocations of other arrays or data structures of similar sizes (because there are so many of them).

Other allocators, while not having size classes, force memory allocations to exhibit specific patterns to reduce fragmentation, at the cost of slightly over-allocating. Buddy allocation is a famous trick first seen in the 1960s that only allocates memory within the page in pre-divided partitions, still used today in e.g. jemalloc. Other allocators like TLSF try to allocate memory into its best fit to reduce fragmentation within constrained memory and time limits.

For large-enough chunks of memory, for example those spanning multiple OS pages, a realloc() might be as simple as a remap of the occupied pages onto a larger contiguous memory address. No data is copied at all; the only change happens in the virtual address mapping tables, which decide which physical memory chunk the virtual addresses point to.

Finally, for garbage-collected languages/allocators, many of the garbage collectors (like those used in V8 JavaScript, JVM, CoreCLR C#, and many more) actively do memory compaction when collecting garbage, so memory fragmentation is not a problem for them at all.

Not to deny the work done by the author - it's a very good way to visualize and compare memory allocations. It's just that I happened to read through some of these topics and noticed that some claims are oversimplifications :) In the end, if you're designing a dynamic array yourself, you might as well just pick 1.5 or 2 based on your expectations of how frequently the user expands it and how much memory you would like to waste, and let the allocator handle the rest. If you're still in doubt - just pick 2. It's good enough!

References
- Glibc malloc internals: sourceware.org/glibc/wiki/MallocInternals
- jemalloc implementation notes: jemalloc.net/jemalloc.3.html#implementation_notes
- mimalloc: microsoft.github.io/mimalloc/index.html
- Buddy allocation: en.wikipedia.org/wiki/Buddy_memory_allocation
- TLSF: www.gii.upv.es/tlsf/files/papers/ecrts04_tlsf.pdf
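To make the size-class idea concrete, here is a rough illustration (not the actual jemalloc/mimalloc code) of rounding a request up to a class, with each power-of-two range split into four evenly spaced classes; it happens to reproduce the 481-512 and 513-640 buckets mentioned above (__builtin_clzll assumes GCC/Clang):

```c
#include <stddef.h>

/* Round a request up to its size class: 4 classes per power-of-two range,
 * e.g. 481..512 -> 512, 513..640 -> 640, 641..768 -> 768, 769..896 -> 896. */
static size_t size_class(size_t n) {
    if (n <= 16) return 16;                       /* smallest class, arbitrary here */
    int msb = 63 - __builtin_clzll((unsigned long long)(n - 1));
    size_t step = (size_t)1 << (msb - 2);         /* class spacing in this range */
    return (n + step - 1) & ~(step - 1);          /* round up to a multiple of step */
}
```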
I was thinking along similar lines: why not align the allocation sizes with the bucket sizes of the underlying allocator... (In case it is a bucket allocator, and the sizes are known to the authors of the data structure...)
Powers of two reign supreme, but in situations where memory efficiency is more of a concern, I do something slightly different that I haven't seen anywhere else. Arrays are expanded up to the next "half" power of two. A "half" power of two I define to be either a power of two, or the value halfway between two powers of two. An example "half" power of two is 12, since it's halfway between 8 and 16. You can also think of this as the sum of the previous two powers of two (4 + 8 = 12). So an example sequence of growing dynamic array sizes starting at 8 would be: 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, ... You can also think of this as alternating between a growth factor of 1.5 (e.g. from 8 to 12) and 1 1/3 (e.g. 12 to 16). Some advantages of this method:
1. The binary representation stays simple/nice for computers to work with. It has just one bit set if it's a power of two (0b10...0), or two consecutive bits set if it's a half power of two (0b110...0). This allows for certain optimizations.
2. When the size is not a whole power of two, the "gap" to the next power of two is always itself a power of two. E.g. if the size is 12, then the "gap" to 16 is 4. This can greatly simplify handling memory fragmentation in certain cases.
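If it helps, here's one way that rounding could be written (a sketch, not the commenter's actual code; __builtin_clzll assumes GCC/Clang):

```c
#include <stdint.h>

/* Round n up to the next "half power of two": a value of the form 2^k (0b10...0)
 * or 3*2^(k-1) (0b110...0), giving the sequence 8, 12, 16, 24, 32, 48, 64, ... */
static uint64_t next_half_pow2(uint64_t n) {
    if (n <= 8) return 8;
    int msb = 63 - __builtin_clzll(n - 1);   /* highest set bit of n-1 */
    uint64_t p = 1ULL << msb;                /* largest power of two <= n-1 */
    uint64_t half = p + (p >> 1);            /* the 0b110...0 value in this range */
    return (n <= half) ? half : (p << 1);
}
```

For example, next_half_pow2(13) == 16 and next_half_pow2(17) == 24.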
I don't know if this was the original intention, but I interpreted Lankinen's comment as a way to do more copying but keeping memory totally compact. In other words, if the original allocation is N units and stored starting at position 0, after the two copying steps we have an allocation of size 2N also starting at position 0. Kind of hard to show the diagram in a youtube comment. Starts with allocation of N located at position 0. Then you allocate 2N starting at N. You copy the first elements into the *second half* of the new 2N allocation. Now you have empty space for 2N units starting at position 0. Copy the N units starting at 2N into the N positions starting at 0. Now you have a contiguous 2N vector starting at position 0. And you have free space to grow starting at position 2N.
Interesting! I've visualised out that algorithm here: ruclips.net/video/0NTSZf2AhPM/видео.html (if I understand it rightly!) I can't quite read that idea into his comment (you can read the full thing here, along with some other interesting comments from him - groups.google.com/g/comp.lang.c++.moderated/c/asH_VojWKJw/m/LiYGlFd4drIJ ) - and you do have to make some (what appear to me) slightly funny assumptions about how the allocator behaves in order to implement this, but it does seem to work and definitely achieves the benefits you stated. I haven't checked how it performs with more than one array using the same memory space but I will do so. I imagine it performs as well as a "realloc contiguously if you can, otherwise alloc a new block of 2x" - but I may be wrong. Thanks for the great comment!
@@neoeno4242 This approach doesn't really make sense in practice, as it depends on the 2N allocation being directly adjacent to the N allocation. That can only happen in the case where no other allocations have been done in between. Just calling a function like C's `realloc` would in this case be vastly better, as in the bad case it just copies the vector once, and in the best case it simply grows the space allocated to the vector without any copying. For the scenario to work as described in the comment by @NathanWhitehead, realloc would always work as in the best case. Re-watching your implementation, I think it does make sense: the L2 in the video has compacting properties that could theoretically be useful and are distinct from realloc, while @Nathan's approach can't be more useful than realloc.
@@neoeno4242 Cool visualization! Yeah, I think my interpretation of the comment doesn't really work. If the original allocation of size N is guaranteed to have free space of size 2N after it, then you don't need to do any copying. Just allocate another contiguous N units after the original and you're done, you now have 2N spaces with the first N used. If you're *not* guaranteed the next allocation is contiguous then my interpretation doesn't work, you can't free up 2N space by using the second half of the first new allocation. Oh well!
FYI, modern memory allocators have both the allocate/deallocate (e.g. malloc/free) operations, but also a reallocate operation (e.g. realloc). If the memory allocator knows that the realloc can grow in place, it will do that, avoiding the copy. So in real life, there's more going on that will alter what is a good choice of growth factor, and also what is better for larger arrays that are more likely to need a full reallocation/copy pass over a grow-in-place. This is made even more messy if you are using huge pages or some totally owned arena, for example, from an mmap, so that you always can grow in place. And as you noted, if many different things are being allocated into the same arena, there are other factors in play. Lastly, if you truly can't avoid an endlessly growing array-like structure, you almost certainly want a linked list or binary tree of arrays. If you store the pointers in a separate array of pointers from the array of actual values, you can recover very good performance even with insertions and deletions. Anyway, I think I'm teaching you to suck eggs, so I'll shut up now.
realloc() isn't really "modern", it's been part of the standard C library since the 1970s. However most of the fastest modern malloc implementations actually *don't* benefit from its use much (if at all). That's because the hardest problem for malloc() is keeping heap fragmentation low in long-lived processes, and one of the most important strategies for that is to group allocations into size classes. (i.e. one page of memory is all 96-byte allocations, another page might be all 256-byte allocations, etc) The upshot is that it's usually not possible for realloc to actually just grow in-place... at least if you're growing outside of the "size class". Since a vector is normally growing by 1.5x or 2x, you're always switching to a different size class and never benefiting from realloc doing grow-in-place. What you CAN do (and the best vector implementations DO do) is to use APIs that the allocator provides to tell you what its size classes are (see malloc_good_size() on MacOS, malloc_usable_size() on Linux, etc). Then every time you grow your vector (whether by 1.5x, 2x, whatever) you also make sure to grow to the next size class. For example, if you are about to do malloc(200), but you learn that malloc() would actually give you a 256-byte chunk in that case, just grow your vector to use 256 bytes right away. That way you can delay the next move as long as possible.
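As a sketch of that trick (glibc-specific: malloc_usable_size lives in <malloc.h>; on macOS you'd use malloc_size/malloc_good_size instead), the vector can simply adopt whatever the allocator actually handed out:

```c
#include <malloc.h>
#include <stdlib.h>

/* Grow the buffer to (at least) `want` bytes, then record the real usable size
 * of the chunk as the capacity, so the next few appends need no reallocation. */
static void *grow(void *data, size_t want, size_t *cap_bytes) {
    void *p = realloc(data, want);
    if (!p) return NULL;
    *cap_bytes = malloc_usable_size(p);   /* often larger than `want` */
    return p;
}
```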
Yeah, while I doubt vector or any base list in a programming language uses mmap, if you are expecting to have a gigantic dynamic array it would make sense to have a small custom class that can just mmap the expected ceiling and lazily map the pages as you access them, and then you don't even need to "resize" - just keep track of a length for bounds checking.
@@garriganstafford9113 Dynamic arrays for gigantic data are usually a sign that your design is wrong IMHO. In practice, they are usually either small, or a symptom of using throw-away code that didn't get thrown away once the scale of the problem ramped up.
"memory efficiency" is only really applicable on architecture that does not use virtual memory. Factor 2 maximizes amortized cost. The actual memory mapping is less important. As long as the memory allocator doesn't fragment over time then you will always get the block you need. Which is why i think factor 2 is picked by the high performance compilers, it exploits the one property that matters by the computers we actually use, i.e minimize bandwidth use, anything else is managed by the hardware and a non-problem.
That's always the problem with theoretical algorithms versus actual implementations, isn't it? For example, copying data might be trivial for the hardware to do without spending a lot of CPU cycles. An instruction to move a very large word of 512 bits might just schedule the memory management unit to perform a DMA copy of the actual bytes without even using the execution units of the CPU, so other instructions can run in parallel while the data is copied.
@@scoreunder it does, once the memory used starts exceeding the size of one page. The memory used by the dynamic array only needs to be contiguous in virtual address space, not in physical address space. When you run out of space in the first page, that page can be returned to the system. Then that page's physical memory can be reused for future pages in a way that appears contiguous in virtual memory space to the user space process. So even though the virtual memory is never reused, the physical memory is. So unless there's a risk of running out of virtual memory space (very unlikely in a 64 bit architecture), the 'memory gaps' issue of the growth factor 2 algorithm is not a real issue. Another way of thinking about it is that any potential memory gap that spans a whole page will always simply be repurposed by the OS. Page size is typically only 4 KB, so this kicks in pretty quickly for any appreciably big dynamic array.
Great vid! Here is the 201st something something allocator comment: The problem model *should* assume that the allocator is stupid, that lets it stay elegant while still relevant to practical considerations. Imo the single-vector model of the video doesn't assume a stupid allocator, it assumes an adversarial one. That makes the model less relevant for practical consideration. Counting unallocated preceding memory with the same cost as reserved memory only makes sense if we assume that the memory is clear of other objects. Not permitting reallocation/checking if the allocator gave an address directly following the reserved blocks is only valid assuming that the memory is full of other objects. It is hard to conjure a scenario (for me) where both of these assumptions hold.
Due to how memory and allocations actually work there is little to be gained from supposed "memory efficiency" unless you are running rather large applications that can get close to exhausting virtual addressing. And that again is rather unlikely, as on normal consumer machines you can't even get 10TB of memory while the VAS is over 10 times larger. So a factor 2 just doesn't really have any particularly big drawbacks. When talking performance you also just want to minimise the number of allocations and if possible get nice cache alignment. If you've got a rough ballpark for the expected elements then pre-allocate that size. So if you expect something like 5-10 elements then just make it 15 and it will nearly never need to reallocate.
One thing to remember is that not all data types occupy the same size. So a variation of the growth factor 2 is to round up the required space in bytes to the next power of two. It's not quite a factor of 2 if the data size is weird, but it has the effect that arrays of any type can re-use the "memory fragments" of other arrays. Not sure if any language does that. If you know you will need more arrays (and are on a single thread), you don't even have to de-allocate, just put the chunk in a list to re-use. I am using this in a wasm project as the default allocator seems to be really really bad for memory fragmentation.
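A sketch of that byte-level rounding (names made up; __builtin_clzll assumes GCC/Clang):

```c
#include <stddef.h>

/* Round a byte count up to the next power of two, so arrays of differently
 * sized element types end up requesting the same block sizes from the allocator. */
static size_t next_pow2_bytes(size_t bytes) {
    if (bytes <= 1) return 1;
    return (size_t)1 << (64 - __builtin_clzll((unsigned long long)(bytes - 1)));
}

/* New capacity in elements: round the byte request, then see how many whole
 * elements fit in the rounded block. */
static size_t grow_capacity(size_t needed_elems, size_t elem_size) {
    size_t block = next_pow2_bytes(needed_elems * elem_size);
    return block / elem_size;
}
```

For example, 100 elements of 12 bytes asks the allocator for 2048 bytes, the same block size a 256-element array of 8-byte values would use.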
I have wondered about this for so long. I have always been a person who is scared of doing the "unoptimal" thing, especially when deciding how much a dynamic array should grow. This is a godsend for me, thx
Fascinating, but these solutions address a problem that shouldn't even exist today. Modern computers provide up to 16 exabytes of addressable virtual space. By reserving large, contiguous regions of virtual memory and committing or decommitting them as needed, we can avoid the complexities of reallocating, copying, and defragmenting memory.
You are missing a critical piece: the allocator. There are types of allocators (like buddy allocators) that only return memory blocks in power-of-two sizes. There are allocators that support resizing an allocation in place. etc. etc.
I remember reading a few old articles about optimizing Java programs, and most of these articles mentioned that one of the main reasons for big memory consumption is allocating a lot of arrays and maps that have only a few elements or are completely empty (I never researched it, but I believe that might be the reason why modern Java doesn't actually allocate memory for arrays until you try to insert the first element). And that got me thinking: what if we try to calculate the new capacity dynamically? A few ideas (a sketch of the first one follows below):
* Start with a growth factor of 2 and decrease it by 0.1 on each new allocation until reaching 1.5
* The same strategy as above, but with a delay until a certain number of elements
* Start with a growth factor of 2 and switch to 1.5 after a certain limit
* Alternate between 2 and 1.5
My reasoning is: surely most of the arrays and maps in the real world have very few elements. By using different growth factors we can prioritize speed for small arrays and memory usage for large arrays. And if someone needs very large arrays (say a few million elements), they probably already allocate them with enough capacity, so a small growth factor wouldn't be an issue in most cases. It would be interesting to test this theory on real-life data like Rust does
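Here's what the first idea could look like as code (a hypothetical sketch, not anything a real standard library does):

```c
#include <stddef.h>

/* Idea 1 above: start with a growth factor of 2.0 and shave off 0.1 per
 * reallocation until it settles at 1.5, so small arrays grow fast and big
 * arrays grow frugally. */
static size_t next_capacity(size_t cap, unsigned resizes_so_far) {
    double factor = 2.0 - 0.1 * (double)resizes_so_far;
    if (factor < 1.5) factor = 1.5;
    size_t next = (size_t)((double)cap * factor);
    return next > cap ? next : cap + 1;   /* always make progress for tiny caps */
}
```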
That's some interesting theory being laid out, but these days computer hardware is a bigger factor than algorithmic theory. How allocation, reallocation, copying and memory mapping work may completely break apart this whole video's concept. So despite all the simulated numbers, the actual answer lies in the benchmarks, as the Rust experiment hinted. Not to mention that because of this discrepancy in the results, trying out just 1.25, 1.5 and 2 is possibly an unnecessary constraint. For all we know, because of the aforementioned factors in play, the best growth factor could turn out to be 1.75, 2.25 or even 3.5 (all losslessly encodable numbers in float32).
Like others have commented already, the allocator itself plays a large part in handling how the memory is utilized in modern systems. Good allocators split memory blocks into separate buckets based on their sizes to minimize fragmentation, and are able to use realloc() to increase the size of the memory block in-place so no data copy is needed at all when resizing.
This is an excellent research project on dynamic array allocation efficiency! The insights you've shared here are not only academically valuable but also practical for real-world applications. As a retrocomputing enthusiast, I'm particularly excited about leveraging your findings to test a new memory allocation algorithm for the Apple II. Projects like this bridge the gap between modern research and the optimization challenges of classic systems. Keep up the great work-it's inspiring to see research that has such wide-reaching potential!
A few years ago, I wrote a simple simulation that calculates how often old space can be reused for a given growth factor and made a plot. (x-axis: growth factor between 1 and 2, y-axis: average number of resizing-operations before the first slot in the simulated memory is used again) The plot I got traced out an ellipse, peaking at x=1 and ending at x=(the golden ratio). At the time, I saw people in RUclips comments arguing that the golden ratio would be the best, simply because of the whole nature of Fibonacci numbers. But that graph clearly meant that the golden ratio is just the boundary between memory efficient and time efficient growth factors, and more importantly, that it is the worst growth factor! A lot of the research in this video hints at 2 being the best growth factor, but what about larger numbers? If we are ok with larger-than-golden factors, then anything is fair game.
Wonderful video. I stray away from dynamic arrays when I can, but I had to use them again because the old method involved millions of insertions into a single huge array, which as you know is way too slow. It now has tens of thousands of 'buckets' with a growth factor of 2. The numbers to be stored are generated in a pretty unpredictable order, so which bucket I put them in is proportional to the square root of their size. This can still leave some empty buckets, while the largest one ends up with over 6,000 numbers. Of course, the more tens of millions of numbers I generate, the fewer buckets are still empty.
I wondered this myself - I believe the answer is that we typically don't get the choice to do this unless we control the allocator. If we don't, but we know it uses something like a first-fit algorithm (the assumption Koenig and later Lankinen was making) then we can maximise our chances of creating the right spaces for the allocator to do what we want by copying twice. Like a lot of the ideas here, it requires a lot of assumptions that may not pan out in reality! (And, to be fair, all the thinkers I referred to in this video acknowledge this) but that's the idea as I interpret it.
At the end of the intro and my search has led me to the Padovan numbers! Hope these come up in the video :) Edit: rip :( no Padovan numbers in this one. Editedit: This sequence approaches the plastic ratio (which is a close cousin of the golden ratio)! I've learned so much stuff today :O For interested readers, the Padovan numbers follow the formula f(x) = f(x-2) + f(x-3) with initial values of 1, 0, 0 (although something like 2,3,4 or 7,9,12 are better starting places for this application). I derived the formula directly from the problem statement. When allocating iteration n, the array n-1 is still in memory. If n-1 was just allocated sequentially after n-2 and n-3, then the space above n-1 is the sum of n-2 and n-3. When we produce reasonable starting conditions, the size is guaranteed to increase (which is not true for some starting conditions). After finding values that looked good for this application, I found the name of the sequence via OEIS.
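For anyone who wants to play with it, here is the sequence that recurrence produces using the 7, 9, 12 starting values suggested above:

```c
#include <stdio.h>

int main(void) {
    /* f(n) = f(n-2) + f(n-3), seeded with 7, 9, 12 as suggested above. */
    unsigned long long f[24] = {7, 9, 12};
    for (int i = 3; i < 24; i++)
        f[i] = f[i - 2] + f[i - 3];
    for (int i = 0; i < 24; i++)
        printf("%llu ", f[i]);   /* 7 9 12 16 21 28 37 49 65 86 114 ... */
    printf("\n");
    /* Consecutive ratios approach the plastic ratio (~1.3247); the space freed
     * by arrays n-3 and n-2 is exactly f(n-2) + f(n-3) = f(n), the size of the
     * next allocation. */
    return 0;
}
```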
Good point. This sequence also keeps the advantage that a growth factor of 2 and Fibonacci numbers have, which is that any possible array size can also be used without gap by two arrays of a smaller size. (Which growth factors like 1.5 or 1.25 don't have.) For a growth factor of 2, an array of size 2^N that is deallocated can be used by two arrays of size 2^(N-1) each. And for Padovan numbers, an array of f(N) that is deallocated can be used by arrays of size f(N-2) and f(N-3). The Fibonacci sequence has the same "sub-divisibility" property, just with both gap-filling arrays being "one size larger".
The only problem with this sequence is that the growth factor (1.324718...) is a good bit worse than 1.5 and even worse than 1.6-ish for the Fibonacci sequence. You could get a growth factor that is closer to the golden ratio if you summed more terms: The Padovan sequence grows "forwards" twice and then back fills the space. f(N) = f(N-2) + f(N-3) If you used the sequence that summed the 3 terms f(N) = f(N-2) + f(N-3) + f(N-4), you'd grow by about a factor of 1.46557. A growth factor of 1.5 grows "forwards" 4 times and then back fills _almost all_ of the space. For a sequence that summed 4 terms f(N) = f(N-2) + f(N-3) + f(N-4) + f(N-5), the growth factor would be at 1.53146. 1.57015 for summing 5 terms, then 1.59, 1.601 and so on... A growth factor of the golden ratio would grow forwards an infinite amount of times and never back fill the space. And notice that the more often you grow before back-filling the space, the more likely something else will have snatched the space.
@ Very cool! You obviously have more intuition in this space than I do :) Perhaps an upside to the lower growth factor would be a smaller likelihood that another object occupies the old space. My intuition says that where the Padovan set would have 2 "chances" (arbitrary percent chance) to fill a space, the 5-term version would have about 4 "chances". I'd expect more than twice as often for an old spot to be taken (although this is probably near 100% for systems that are actually doing something). Another interesting upside to this ratio is that IF a spot is taken by an array n-2, it can also take spot n-1. Maybe not practically useful, but I have a pretty picture in my head of Russian nesting doll array allocations.
If you're allocating immediately before the existing array you could, I suppose, allocate with an overlap as you know you will have copied the overlapped bytes before you overwrite them with the copy. For your question at the end of the video, this may give you one additional expansion. That assumes a lot, and I'm not sure it would be "allowed" by memory protection unless the allocation routine is completely under your control.
Excellent video, this is something I never thought about wrt array sizing! Really enjoyed! Now I'm going to be thinking more closely about common programming wisdom.
I’m struggling to understand why we reallocate all existing elements each time, instead of just allocating new memory directly after the existing array, if there are enough contiguous available spaces.
Good presentation. I puzzled over this same question for years too. A couple of missed things:
* Overallocation - say you double the array size, but you're not actually going to use it. At a factor of 2 you're always wasting about 25% of your allocated memory. That is, after allocation you're using 50%, and on average you're going to be at the 75% mark (50% in-between-allocations). This will have a drastic impact on overall memory usage of your application.
* Allocator efficiency - allocators are tuned for specific memory sizes. When you ask, say, for a chunk of memory of size 1024 - the allocator might have one ready and will give it to you immediately, but if you ask for 1025 - it might just say "screw it" and give you a chunk of size 2048. You're thinking you're very clever by "only" "using" 1025, but actually - you're not. This happens because allocators don't want to do compaction, it's expensive, so they try to operate on fixed-sized chunks, which are way more likely to be reused. A bit of waste here is often preferable to increased runtime overhead.
I worked for a boutique Smalltalk (and later Java) consultancy in the UK in the 90s. We had every copy of JOOP, until we left that office in ~2001 or so.
I am missing some real-world data collection, where someone just checks a huge number of samples from real-world applications for typical usage of dynamic arrays. My intuition tells me that a big (2x or more) growth factor is better, because most dynamic arrays are pretty small, a bigger growth factor works better for average CPU usage on small arrays, and you don't care about memory usage as most of them will die soon (Weak Generational Hypothesis). It's also worth mentioning from a software engineering perspective: huge arrays are not a problem - they are pretty rare and easy to spot in a profiler, so anyone who cares about performance will find inefficient usage of big dynamic arrays and fix it with a well-applied reserve(). The real problem is a huge quantity of small arrays, which are hard to pick out in a profiler, a.k.a. performance death by a thousand cuts. The growth strategy should be optimized for cases where the developer doesn't care about reserve, and in my mind that is mostly small arrays
You might want to try this strategy as well. Use 2 separate allocations. When a new item needs to be added, if it's outside the current bounds, copy the old arrays into a single array, free the old arrays, allocate the second array, add the new item. The second array should be able to fit in the slot left from the old two arrays. This makes indexing very slightly slower, but still O(1).
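If I'm reading that right, here is a sketch of that two-block scheme (all names hypothetical; the merge step assumes the allocator will indeed reuse the freed 2n-sized hole for the new second block; call tv_init with a nonzero starting capacity):

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    int *lo, *hi;   /* two separate allocations of `cap` elements each */
    size_t cap;     /* capacity of EACH block, so total capacity is 2*cap */
    size_t len;     /* number of elements stored across both blocks */
} twovec;

static int tv_init(twovec *v, size_t cap) {
    v->lo = malloc(cap * sizeof *v->lo);
    v->hi = malloc(cap * sizeof *v->hi);
    v->cap = cap;
    v->len = 0;
    return (v->lo && v->hi) ? 0 : -1;
}

/* Indexing stays O(1): just pick the block. */
static int *tv_at(twovec *v, size_t i) {
    return i < v->cap ? &v->lo[i] : &v->hi[i - v->cap];
}

static int tv_push(twovec *v, int x) {
    if (v->len == 2 * v->cap) {
        size_t ncap = 2 * v->cap;
        /* Merge both old blocks into one new block of twice the size... */
        int *merged = malloc(ncap * sizeof *merged);
        if (!merged) return -1;
        memcpy(merged, v->lo, v->cap * sizeof *merged);
        memcpy(merged + v->cap, v->hi, v->cap * sizeof *merged);
        free(v->lo);
        free(v->hi);
        /* ...then allocate the new second block, which should fit in the
         * 2*cap-sized slot the two freed blocks left behind. */
        v->lo = merged;
        v->hi = malloc(ncap * sizeof *v->hi);
        if (!v->hi) return -1;
        v->cap = ncap;
    }
    *tv_at(v, v->len++) = x;
    return 0;
}
```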
When working with the heap, 1.5x is sufficient when working blindly; however, it is much more efficient to work towards "filling" the cap space prior to testing/requesting a change. What I mean by this is that if there are free bytes following the array, it is better to just expand the current heap allocation at a 2x clamp rate toward the maximum allowance before making a re-allocation request at 1.5x, saving you the write time of data already performed. When you have reached the maximum threshold of heap space, then your only option should be at heap-space intervals with the OS, and you allow the OS to handle CPU pointer refs and memory balancing.

Let's work an example: say you are working on an OS with a 4096-byte page/heap size and full virtual pointer redirect at the CPU level... and we will also say that each element of the array takes up 4 bytes, and allocation data takes up 16. Starting with an array size of 2, that is 24 bytes of data at a position in the heap. When you hit the limit, the next number will be 4, but the allocation header is first checked to see if it can accommodate; if so, the 16-byte header is altered to increase to 4 entries (16 bytes) and no movement of data has to happen. The next increase request will check for 8, then 16, and so on... say when going from 8 to 16 there are only 48 bytes before the next allocation header is reached from some other data that requested memory - then the header is only updated to 12 entries (48 bytes) instead. The next time it tries, it wants 24, but has 0 left to gain, therefore it now drops down to 18 (1.5x) as a complete re-allocation within the heap to find space for it, then moves the 12 entries over and continues.

When the size reaches 1020 entries (4080 bytes plus the 16-byte header), the entire page is used; now we move to the direct OS page system at 4096 bytes per new allocation, and tell the OS that the new virtual set is 0x01000 more than the previous in the block. This means that when you read/write to an address, the page is redirected from the virtual address to the physical address, and it is continuous in virtual numbering but not necessarily in physical memory... and as pages get used less, the OS will automatically offload them to hard disk in limbo until called upon. There will be a point where the OS runs out of actual space, at which point you need to start thinking about scaling to actual database-style coding.

Why work so hard? But if you think that's just cheating, ok, fine, then you can program the entire indirect pointer system yourself... the modern CPU architecture has abilities for memory mapping under high-authority flags, but at that point you are writing the entire OS system. You could request the entire virtual space from the OS with little reserve and just increase the reserve later. Or, lastly, if you really think the size is going to be infinitely undefined, then just use indirect pointer blocks instead and handle most of the load on rotating files within the same heap space, like a cache system... if done right, you can cluster the data onto servers and have unlimited space.
1:43 I'm not following this part, how could we possibly guarantee our new allocation will be contiguous after the old array? What if, between growths of the array, the OS has received another call to allocate memory and has decided the bytes immediately after our array is the place for it? We then can't use the old memory at all
What happens "after" the array is not really relevant; the only important part about reuse is that the sum of all your previous arrays will be 2^n - 1 and your new array will be 2^n, thus it can't fit. What you point out is the same as pointed out later in the video: once you have 2 arrays (or literally, _any_ other allocation that can fit in that space), it could take up that space and the reuse point becomes moot.
I'm confused. How is Lankinen's algorithm not just exactly the same as a 2x factor but with an extra step? I've rewatched that bit like 10 times and I don't get it at all.
@@sunofabeach9424 step by step, where () is empty memory, and [] is allocated:
[]()()()()()()()() - initial array
()[][]()()()()()() - vector doubling
()()()[][]()()()() - reallocation
()()()()()[][][][] - vector doubling
[][][][]()()()()() - reallocation
Lankinen's algorithm eventually lets you reuse the start of your memory which would otherwise be occupied, at the cost of additional copies compared to 2x (which still add up to being basically the same amount as with a 1.5x factor)
You might be confused by the fact that the _sizes_ still grow with a factor of 2. But the _starting position_ of a new array gets "artificially" pushed further along, meaning that you're making more space at the beginning where you might be able to slot the next array into.
I suspect that, while quite clever, Lankinen's algorithm would be basically useless in practice. Unless you write your own allocator (a move I would highly respect, but is almost never done in practice and done well even less frequently), you are probably, directly or indirectly, using `libc`'s `malloc` family of functions for dynamic allocations. If the only dynamic allocations you make are for a single dynamic array, I believe that you would actually never copy the array, because `realloc` can simply expand the block that was previously allocated if the block after it is unused and large enough. This trivializes reallocation, rendering this problem irrelevant. It is thus only useful to examine this problem in the context of multiple dynamic allocations. Imagine the following scenario. You have a dynamic array of 64 bytes. You then use a reference-counted object (let's say that requires a 16-byte allocation). You then add elements to the dynamic array until it needs to be reallocated. It will have to be placed in memory after the reference-counted object, because there is no space before the dynamic array or between it and the reference-counted object. Imagine instead that you made a linked list. Every time you add a new node, you make it impossible for the dynamic array to reuse any of the memory it has already used. In practice, you will most likely either make like 0-2 dynamically allocated objects (if you specifically avoid dynamic allocations where possible like I do), or at least dozens of a wide variety of sizes. Either way, this problem is not particularly relevant. I believe that the main cost that you can actually save in practice is the copying operations, and any memory savings will likely be marginal. Unless benchmarks show some other factor as optimal, I believe that a factor of 2 is still best in practice.
fopen() (or the first fread()) will claim a chunk of the heap when executed. printf(), too, and may even spool lots'n'lots before flushing its buffers... There's many things to account for when skating close to the edge...
The memory gap issue with factor 2 disappears if your language can perform in-place resizing, aka realloc which we have in C. Just doing a small micro benchmark with a growing array from size 1 all the way up to 1048576 required only 3 instances of where memory had to be copied in the 20 needed resizes (when using a factor of 2).
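For anyone wanting to reproduce that kind of measurement, a rough sketch (the exact count of moves depends entirely on your libc's allocator):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t cap = 1;
    int *buf = malloc(cap * sizeof *buf);
    int moves = 0, resizes = 0;
    while (cap < 1048576) {
        int *old = buf;
        cap *= 2;                                  /* growth factor 2 */
        buf = realloc(buf, cap * sizeof *buf);
        if (!buf) return 1;
        resizes++;
        if (buf != old) moves++;                   /* block could not grow in place */
    }
    printf("%d of %d resizes required a copy\n", moves, resizes);
    free(buf);
    return 0;
}
```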
Thanks for this video, there is a lot of stuff that one would have to go to great lengths to work out! I wonder if the resizing factor being dynamically adjusted would have been a better approach entirely (although probably not worth maintaining state for small arrays and hence not a design choice most languages would prefer)
One thing I see missing is why not allocate the new array overlapping the previous one, as long as you then copy the elements front-to-back, you should be able to copy them over with no loss of data, unless the new array starts half way inside the old one, in which case back-to-front copying would also solve it. I guess it's a very impractical approach because no modern allocators would do this, but it's still something I think would be interesting to explore.
I think Lankinen's idea of early resize has some potential. I'd be very curious how resizing at 1.62 but to a size of 2 would perform. To be real, I think optimizing for whatever the CPU architecture prefers will make a bigger difference. I might bother my microkernel colleagues next week so I can run A/B testing on aarch64 and a few generations of x86
It's funny because recently, I came to a similar idea (of the golden ratio rate of growth) for allocation. No need of multiplication, just storing two numbers is enough, computing the Fibonacci sequence/algo.
A very strange explanation TBH. First, you assume the new array is always above the initial one, that the fragmented memory can not be reused, and that the copy always happens. IRL in some cases the reserved space just happens to expand into the higher mapped addresses without any copying. Although this alone is not an excuse to call realloc() every time you add a new item, the fact that the arrays may sometimes be really huge makes it clear that multiplying their size may be a bad thing for sure. With this, simply growing your array by a fixed additional size is probably the best way out nowadays!
obviously each instance of array should be able to specify what its growth factor is because a fast growing type might be better served with GF 3 and a slow growing one might be better served with GF 0.3. and that my friends is why we use C.
The argument of not being able to reuse pre-allocated space is a bit "silly" as an argument against a resize factor of 2, because in practice you won't reuse it anyway. It suffices that you create some damn tiny string and the allocator decides to take one from the "previous void", and you no longer have a long enough contiguous space you can reuse. Visual Studio uses a growth factor of 1.5 because of that "argument", while clang and gcc simply don't care, I guess because of the practical considerations I just said, and they use a growth factor of 2 (and also because, I guess, multiplying by 2 is much faster than by 1.5, since * 2 is a simple 1-bit shift which takes 1 cycle while multiplying by 1.5 could imply 10 or 20 cycles; if you have a shit-ton of little vectors that 20x difference is significant). Second, with respect to memory efficiency: in 64 bits you have 128 terabytes of virtual memory. So having space you have reserved but never written to (unallocated) won't consume RAM, just "virtual space", which in practice is infinite nowadays, so the argument of memory efficiency doesn't apply either, because you aren't "wasting" anything. The virtual space you "left behind", as long as the allocation is > 128KB (in Linux), will be returned back to the operating system; if it's < 128KB, it will be reused for other objects (the allocator/malloc will take care of that), and the space that is "in front of you" won't consume RAM until you write to it for the first time.
I think you're both over and under thinking the 1.5 scaling factor. For one, a growth of 1.5 exactly can be done with just a few clocks when you consider that x*1.5 == x+(x/2). That comes out to a register copy, a shift in the opposite direction, then an add. The real issue here is in how often you have to copy. If you have to move the array often, which isn't necessarily guaranteed regardless of growth factor, then the fewer moves of a 2 times scaling factor will be better. If the allocator can reuse older space before prior allocations then it may be more memory efficient, but will also copy more often. The optimal performance is really when it resizes fewer times and is capable of adjusting the allocation size when it can because there's enough space following the current allocation. Now, if virtual memory works as efficiently as is suggested by multiple commenters, yourself included, then all of this should be a moot point, but clearly the fragmentation causes enough overhead because otherwise there'd be no difference in CPU time between any of them.
@@anon_y_mousse You are right about the x + x/2. Regardless the CPU time, yeah, the most copies you have to do, the more CPU time must be used: you have to allocate the new chunk, copy the data, and deallocate the old chunk. Allocating and deallocating (for chunks > 128KiB) implies a syscall, where the kernel must update the virtual memory tables etc, and for copy, the CPU is who does the copy, and whenever a full page is copied, the kernel must also find a free RAM page so that the copy of the next page can continue. So you really want to minimize the number of copies you do. But that has nothing to do with virtual memory. I mean, the point is that you don't need to worry "too much" for the virtual memory utilization, because nowadays OS's are very efficient in terms of "resource consumption": you are not wasting resources if you allocate virtual memory that you don't use. As far as you don't have memory leaks and you don't do stupid things like zeroing big chunks of memory beforehand, you are in safe zone (because "writing" memory is what needs RAM, which is the actual physical limited resource; allocating memory that you don't write is basically free). The CPU time depends on what your program does: if you copy more you need more CPU time to do such copies. That is why different strategies gives different CPU utilization times, but because they copy more or less, not because of fragmentation.
@@Peregringlk When you're talking about a cold boot, that will be mostly true, depending on what software you have running from startup and various other factors, but not everyone turns their computer off every day. In the most common cases, where someone just leaves their computer on all the time, you'll get variances with the same allocation strategy, regardless of 1.5 or 2. This also holds with mobile devices that just about never get rebooted.
For the exercise: pretty straightforward, but there is the surprisingly interesting part of calculating ceil(1.618 * x) exactly (of course if you have access to bignums/>128 bit integers then it's pretty easy, but rust doesn't natively). You can just reimplement 256-bit integers, but there is a short but ugly formula only in terms of 128-bit (or smaller) operations. *** spoilers below **** ceil(1.618 * x) = x + 309 * (x / 500) + (309 * (x % 500) + 499 - (309 * (x % 500) + 499) % 500) / 500 Divisions on the RHS are 128-bit unsigned integer divisions (rounding down), same for modulo. (And most of those actually fit into 32 bits, so you can get away with 4 128-bit operations) The result of the exercise should look like: 126*********************************641.
Look at these kids with their new fangled dynamic arrays. My industry has just about got the hang of variable names automatically allocating memory space - I'll let you know when we get literally anything dynamic.
In automotive and other RT industries, you must not use dynamic memory. Allocators give non deterministic performance and may potentially throw exceptions. So… C, C++
@mastermati773 you could use something like the upcoming std::inplace_vector (C++26) or some custom allocator on std::vector that pulls from a pre-allocated pool.
11:00 it looks like the performance of L2 would be similar to that of the normal sqrt(2). That raises the question: did Lankinen find a way to make the scaling factor act like its square root in performance? Or is there something deeper going on? Also, even if it did, computer memory tends to be a lot messier, so what is in theory better can become much worse in practice. In particular, trying to use an L3 algorithm (for instance) may consume far more memory in the real world compared with a no-allocation experimenting environment. That said, I'd say it's fair to compare L2 with a normal factor of 1.5 (and also 2)
Here's an interesting "optimization": if, when you allocate the second block, its start is higher than the first reallocation, why bother transferring the data to it? Instead, free the 2nd reallocation rather than the first. Granted, that means it becomes less often that a 2nd reallocation is an improvement (why Lankinen's algorithm is an improvement at all), but the sacrifice may be less costly than the savings of one less full-array recopying. In fact, a smart enough allocator wouldn't even need to truly eat the memory before the check, making the modified step far cheaper than with a full 2nd reallocation if such isn't needed.
Perhaps I am misunderstanding the idea, but why would you need to copy data twice for Lankinen's algorithm? Couldn't you do something like the following: Given a full array of length n, allocate space for 2 arrays of length 2n. Then, copy over the data from the original array to the 2nd 2n array and free both the original array + the first 2n array (freeing 3n blocks).
This has lower operations, but also worse memory efficiency, because your modification delays one of the frees, which could mean the difference between being able to fit the last array size and not being able to
I wonder if the memory efficiency is less of an issue in practice. VA space is so large that running out of VAs is not really an issue compared to physical memory, and re-using the first block for the array requires the allocator to be first-fit, which just depends. If we consider page sizes then it would seem that most of the wasted memory would be 2KB, since once you get bigger you start filling full pages which are no longer contiguous physically. If we are just using the array then that is a big waste, a full half page, but in practice we will likely have more heap-allocated data, much of which could likely fit into that 2KB or whatever half-step block is left as a prefix in the 2x case. It would be interesting to see how this plays out in a program with a large dynamic array but also some other smaller data and larger data structures close to a page size, because then you could see how well the allocator packs it and also whether you end up wasting space via page boundaries. Either way this hunch leads me to think that in practice 2x is probably the best choice due to the significantly cheaper operation-per-append cost.
The other thing that wasn't touched on is that for the compiled languages (e.g. C and Rust), the memory allocation is a first-class primitive. This means that a compiler will absolutely optimize out a redundant copy or "useless" allocation
I don't want multiplicative growth for buffers with sizes measured in GB; too much RAM is wasted. When I recently needed to allocate vastly different dynamically sized blocks of memory in my program, I used 4 small integer control codes, 1 byte each, packed into a 32-bit integer. The first i.e. lowest byte is the base-2 logarithm of the minimum allocation size. A reasonable value is between 6 (64 bytes = a cache line) and 12 (4kb = page size). The next byte is the base-2 logarithm of the first threshold. Below the threshold, the allocator rounds up the requested size to a power of 2, i.e. to a single most significant bit. The next bytes are similar thresholds to round up to 2, 3 or 4 most significant bits. Note standard libraries of most languages expose a function which compiles to BSR (old PCs), LZCNT (current gen PCs), or CLZ on ARM, which quickly returns the base-2 logarithm of an integer.
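A sketch of the "round up to the top few significant bits" part (hypothetical helper name; __builtin_clzll assumes GCC/Clang and compiles to BSR/LZCNT/CLZ as mentioned):

```c
#include <stdint.h>

/* Round n up so that only its `keep_bits` most significant bit positions may
 * be set, i.e. round away everything below that precision. */
static uint64_t round_up_msb(uint64_t n, int keep_bits) {
    if (n == 0) return 0;
    int msb = 63 - __builtin_clzll(n);      /* base-2 logarithm, rounded down */
    int drop = msb + 1 - keep_bits;         /* low bits to be rounded away */
    if (drop <= 0) return n;                /* already within keep_bits of precision */
    uint64_t mask = ((uint64_t)1 << drop) - 1;
    return (n + mask) & ~mask;
}
/* e.g. round_up_msb(700, 1) == 1024 and round_up_msb(700, 2) == 768. */
```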
I'm surprised that you never tried to subtract one from the Fibonacci sequence, only the regular one, and 1.5. AFAICT the regular one seems to be at least a local minimum in the parameter space, as it's the smallest growth rate (thus high average ops per append) that can never reuse previously used space. My hunch says that it should have the lowest possible memory efficiency in the simple single array case. Subtracting one might make it optimal however, as it's the largest growth rate (thus smallest average ops per append) that allows for deallocated memory re-use in the single array case. It's somewhat easier to use 1.5, as you mention, but I suspect that the performance cost of a few floating point operations (besides being constant) will be vanishingly small for any N above ~32 or so. I would really like to see you compare a fib(n)-1 or floor(n*phi) growth strategy.
The sad truth is that growing the size in this manner will NEVER result in reuse of old memory. It doesn't matter what you add or subtract to the Fibonacci numbers. The reason is the following (sorry if it's a bit much):

Let "s" be the sequence of array sizes, such as [1, 2, 4, 8, 16, ...] (powers of 2) or [1, 2, 4, 7, 12, 20, ...] (Fibonacci minus 1). "s[0]" would be the initial array size, "s[1]" the array size after the first re-allocation, etc. Now, let's say our array is at size s[n] and we want to resize to s[n+1]. The previously allocated memory can only be reused if s[0] + s[1] + ... + s[n-1] (*aka the total size of previously used memory*) is at least as big as s[n+1] (*aka the new memory size*).

The Fibonacci numbers have the property that fib(a) + fib(a+1) + ... + fib(b) = fib(b+2) - fib(a+1). So, when using s[n] = fib(n), we require fib(n+1) - fib(1) >= fib(n+1), which is clearly impossible.

Subtracting 1 from the Fibonacci numbers results in an even worse situation: with s[n] = fib(n)-1 we get s[0] + ... + s[n-1] = fib(0) + ... + fib(n-1) - n = fib(n+1) - fib(1) - n. This needs to be at least as big as s[n+1] = fib(n+1) - 1. That's right, the two sides of the inequality are now even further apart. Subtracting larger numbers will only widen the gap!

Counterintuitively, adding 1 to the Fibonacci numbers would result in the left side of the inequality gaining a +n rather than a -n, closing the gap. However, fib(n) grows exponentially, while n grows, well, linearly. So yet again, we never see previously used memory becoming used again. The Fibonacci numbers (and by extension, the golden ratio) only spell doom for the array-resizing problem.
@ You are right, I did discover a similar result in an uncle comment. I also found multiple optimal constants to achieve the particular effect I incorrectly theorized subtracting one would have.
I did wonder the same thing about subtracting 1! And did try it out, but it didn't end up making it into the video because it was clear to me in the simulation that it wasn't working - but my math wasn't strong enough to see immediately why, and I didn't end up prioritising puzzling it out fully because there are only so many "let's take a look at this other promising solution... which also doesn't work that well!" threads you can have in one video :) Thanks @Tumbolisu for your comment also laying it out, great thread.
The Lankinen algorithm could be improved dramatically by *not* performing the first copy and instead allocating the n*2 array twice and copying from the original array into the second array before deleting the original and 1st allocated n*2 array.
6:37 As "usual" (well, it's what we see in empirical data, 2 is a better factor), wasting main RAM is always a good trade-off for performance. Unless we're talking about cache memory, then it's the opposite: wasting CPU cycles is better than wasting cache lines.
The memory allocator should provide a way to allow a range. I want between 1.5 and 2 times the size, and it could give you something in that range that reduces gaps.
In practice all of the algorithms use realloc; you're assuming that they don't. And you said it yourself: the difference between practice and theory is bigger in practice than in theory
Very interesting video, but it neglects the influence of the memory allocator. Glibc-like allocation is far from the fastest nowadays, and in the case of pool-based allocators a growth factor of two is *much* more suitable.
I feel like you miss (/ didn't cover) two points which seem important to me: To increase the contiguous memory region used by the Arraylist you don't always have to allocate, copy and free. The operation happening is basically C's realloc function which will try to just increase the current memory block's size before allocating a new one, moving data over and freeing the old one. In reality that won't happen a lot but the benefits are enormous if it does, especially for larger Arraylists. The program using the Arraylist, let alone the Arraylist itself, doesn't live alone. Lots of other processes will also be dynamically allocating blocks of memory of varying size at the same time as the Arraylist. Following this thought, a) your focus on the "new" block being able to fit into memory previously used by "old" blocks seems unreasonable. Chances are that memory is already in use elsewhere and not free, we can't know and also don't really care. That memory is already freed by us. b) your definition of memory efficiency seems unreasonable. Why care about memory which our process already freed instead of just the currently allocated one?
While I somewhat get where you are coming from, it is not really valid to consider other processes - memory is going to be allocated within a process, and other processes are not going to be able to fill in the slots left behind unless that area is large (on the order of multiple pages) - because the memory allocator (e.g. the clib malloc) within the main process is not going to free the pages to the OS unless there are many such pages; it is just going to keep the entire arena alive - far less bookkeeping. And within a single process the allocation behavior can be a lot more predictable - the only allocations happening being a few dynamic arrays is a pretty reasonable assumption for well-optimized small-to-medium scope programs. Of course, if you reach the level of dozens of pages, then yeah, the considerations become far less relevant, and you will also get to the point where your allocations take up exactly entire pages, and you should probably keep it that way.
I wonder, is there ever any case where, assuming a pure program (i.e. no I/O), you can't figure out an exact size to preallocate the size of an array? Or if there is such a case, how common is it? I wonder if the ultimate solution to array efficiency is to have better language design that can predict what size to prealloc at.
I'm pretty sure this is essentially equivalent to the halting problem, so not possible. Also, you can just preallocate the size you need manually beforehand a lot of the time, since you (as the programmer) have domain knowledge of the entire operation. Depending on how you build up your array, some languages also have support for size hints, like Rust iterators containing lower and upper bounds, which can be used to preallocate a buffer to collect into
Seeing all the memory usage benchmarks even out as more arrays are added makes me wonder what would happen if a growth factor of 4 or 8 was used instead.
I'm wondering why you used the final full array sizes to check whether a new array would fit before the old one: since you are essentially doing a memmove, the act of copying does create space. In fact, shouldn't every second growth be able to fit the new array before the current one?
Even after watching the explanation several times I still do not get the new algorithm at all. I think you may have tricked me into attempting to implement it myself...
Great research and presentation, thank you! The memory gap you describe is certainly an issue in microcontrollers. Do you think it is also relevant in systems that implement virtual memory through an MMU?
This video itched my brain a lot; something didn't seem right with it. Had a delicious coffee and a walk back and forth - high praise for a YouTube video, mind you - and I think it's realloc that bothers me. What I mean by that is that realloc, i.e. a function that has information about the underlying memory layout, changes these statistics completely. realloc has been around on Unix since the start, and in C since the start as well. Basically, when you resize an array using realloc you will only copy if there isn't any free memory after the currently allocated block. Using your meaning of memory efficiency, this means a program with only one array being allocated will always be at 100%. I can only guess here, but I would think that this is taken for granted in old papers from the 80s: an algorithm for resizing an array in C++ would use realloc and not malloc, and any frequency analysis presented in such a paper would be based on such an assumption, I would think. In any case, excellent video as usual.
C can usually use realloc; C++ almost never can, for several reasons. First, only trivial types can be realloc'ed - anything with a move constructor needs that constructor to be used (at least until and unless trivial relocation gets in). And second, because new/delete, as well as the allocator interface, don't have a realloc equivalent. Instead, the standard recently added allocate_at_least as a replacement for non-portable APIs like malloc_usable_size, which covers smaller sizes well since those allocations usually can't be resized above their size class anyway, but once you are outside the allocator's largest size class it won't do much. Another function, try_resize_alloc (or something like that), to complement it would be good. This would be like realloc, except that instead of doing the copying for you it would just fail if it can't grow in place, and thus wouldn't run afoul of C++'s object lifetime rules the way C's realloc does - but I don't know of an active proposal for that kind of function.
Great video, thanks! One question: looking at the animation of growing with a factor of 2x, it seems it would be fine to reuse the gap before the array, since by the time you reach the start of your current allocation you would already have copied that part out. I don't know how practical it is for an allocator to take advantage of this, but I'm interested to know if this is something that is talked about.
Memcpy can move memory much quicker than individual inserts - how do these numbers change if moving an element between arrays only costs 1/16th of an operation? A normal insert has to write 1 place in the array and change the number of elements in the array, while copying between arrays is simply duplicating a range of bytes from one place to another. The other interesting choice made by a language is the minimum size of an array - do you make that initial list 1 element, 10 elements, 100? The perfect list size is the one matching the number of elements that ultimately land in it, so you need no copies; the worst size is 1 element less than that. Memory pages are often 4096 bytes, so pre-allocating 1000 4-byte elements makes quite a bit of sense, as that is the size of memory fetches, and it helps with contention between adjacent objects - when 2 lists are in play on different CPU cores, they can be updated at the same time if they are on different memory pages.
I've heard that 2 is actually a bad growth factor when you consider that the allocator is also trying to minimize the number of copies of the array.
The question is, could we entirely eliminate the copying operation with the help of the kernel doing its virtual memory handling thing? I mean, if we were able to ask for an allocation that maps the old backing array into the new backing array (unmapping the old one to prevent memory safety problems), we would get a new array address with the new size and all the contents of the old one. I have no idea if operating systems have this kind of memory operation as a syscall, but it might be a winning strategy.
I feel like operating systems could implement a specific operation to grow an allocated region of memory that takes into account that the region holding the data will be deallocated after copying it. In this way, if an array of length n starts at location n and needs to be increased to 2n, and there is a gap from 0 to n-1, the n values could be copied into the 0 to n-1 gap, and then the whole region from 0 to 2n-1 could be marked as allocated. This would make the *2 scheme as good as the other schemes discussed, and I feel like it's not too big of an overhead for the kernel to check this case. There are probably downsides to this idea though. Is there any reason this doesn't exist?
Could you try actually growing with Fibonacci? I imagine storing the previous array size along with the current one and adding them together to get the new one. This will obviously require a bit of extra space in the array struct.
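A minimal sketch of that Fibonacci-style growth (hypothetical struct and field names): keep the previous capacity next to the current one and add them to get the next, so no multiplication is needed and the effective factor tends towards the golden ratio.

```c
#include <stdlib.h>

typedef struct {
    double *data;
    size_t  len;
    size_t  cap;       /* current capacity */
    size_t  prev_cap;  /* capacity before the last grow */
} FibVec;

/* Grow so capacities follow 1, 2, 3, 5, 8, 13, ...
   (the small floor on prev_cap just bootstraps the empty vector). */
static int fibvec_grow(FibVec *v) {
    size_t next = v->cap + (v->prev_cap ? v->prev_cap : 1);
    double *p = realloc(v->data, next * sizeof *p);
    if (!p) return -1;
    v->data = p;
    v->prev_cap = v->cap;
    v->cap = next;
    return 0;
}
```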
Doesn't this mix allocation strategies and growth factor? You kind of need to co-optimize though. And for large arrays MMUs can avoid copying through the power of page tables.
I don't understand why two copies are necessary at all here. Why not just allocate the first resized array, then allocate the second and copy the data from the original array, without freeing the original after the first allocation?
I don't understand one thing: the way memory is allocated on computers is a black box. It's not guaranteed that the next allocation is going to land right after your pointer; memory gaps aren't something you can control yourself. What you can do, though, is preallocate, say, a gigabyte of virtual memory and continuously map it in on page faults.
@@sunofabeach9424 allocators don't use this approach because it uses a lot of pages and virtual memory space, but it's a really good approach if you have a big dynamic list
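For the curious, the reserve-then-commit idea sketched above looks roughly like this on Linux - a simplified illustration, not production code, and the 1 GiB reservation size is arbitrary.

```c
#include <stdio.h>
#include <sys/mman.h>

#define RESERVE_BYTES (1UL << 30)   /* reserve 1 GiB of address space */

int main(void) {
    /* Reserve address space only: PROT_NONE + MAP_NORESERVE commits no RAM. */
    char *base = mmap(NULL, RESERVE_BYTES, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* "Commit" the first 64 KiB once the array actually needs it.
       Physical pages are still only handed out on first touch. */
    size_t committed = 64 * 1024;
    if (mprotect(base, committed, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }

    base[0] = 'x';      /* the array can now grow into [0, committed) */
    munmap(base, RESERVE_BYTES);
    return 0;
}
```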
Joke 1: me, using a zone allocator.
Joke 2, except not really a joke: why not switch between growth factors depending on how "hot" an array is? When you go to resize, check when the last resize was: if it's below some threshold x, use a 2x resize; if it's been a while since the last resize, use 1.5x - kinda like what GCs do. Although that then introduces a SECOND, temporal tuning factor, but I imagine you'd want to tune that one to the specific situation anyway.
Surprisingly fascinating!
Omg 3b1b my beloved :>
Thanks :) Means a lot
omg the pi creature
Oh noes! Bad mathematician, stay away from our CS videos. 🤣😂🥰
hey no way it’s the math guy
Programs can also exploit modern architecture's implementation of virtual address translation, which is blazingly fast - done in hardware. Therefore you don't need to copy data around. Because the virtual address space is always very large (128 to 256 TiB), you can reserve large enough areas of virtual memory for each array - even more than total system RAM - and delay commiting pages to physical memory until they are needed, as the array grows. This approach has the added benefit that raw pointers will always remain valid after the array is expanded. The disadvantages of using this approach right now are basically none, AFAIK, but in the future it may run into scalability issues as programs grow to consume ever larger amounts of RAM.
how does this affect memory caching ?
@@Nesdac-k1l Caching's spatial locality uses the virtual address space to know which pages/lines to fetch.
Very simplified example: if you have addresses 1, 2, 3, 4, 5, and 6 in your virtual space, which map to pages 3, 8, 5, 12, 1, 2 of physical memory, and you access virtual page 2, it will look ahead and access virtual pages 3 and 4, look them up, and load physical pages 5 and 12.
(Do note that caching is not really at the page level - but accessing any memory will then actually access the physical memory mapped to the virtual address you access, and it will get cached as normal.)
@@Nesdac-k1l Doesn't affect caching. L1 and L2 cache are based off the virtual address and L3 is physical. An mremap doesn't change either, though the system call itself will invalidate caches.
And why does nobody exploit it?
Would be a simple optimization?
Can people stop calling things "blazingly fast", please? I can only associate it with companies making unsubstantiated claims in order to sell you their product.
You overlooked a key point here in that if there's enough space to account for the resized array, nearly all allocators, at least the good ones, will merely alter the recorded size of the allocation and return the same pointer to you. In situations like that, you don't need to copy the elements of the array at all because the array was resized in place. In situations like this, where you have the space to keep allocating in place, a growth factor of 2 will be significantly better. If the allocator places chunks far enough apart, then multiple array allocations will work in that more efficient manner.
That is what realloc does, last I checked, but computer scientists are always looking at worst-case scenarios, right?
I think this is the reason why Rust's benchmarks performed much better on 2x than on 1.5x.
@@RF-xr7sx Problem is, if you've reached that kind of worst case, your allocator probably won't be able to satisfy your request anyway, regardless of strategy.
As I said on another comment, this is actually decades out of date... at least for most allocations with modern malloc() implementations. Except sometimes for giant allocations, the heap will be broken into size classes and realloc() rarely gets a chance to do that magic grow in-place.
Basically it was an important optimization for vectors in the 1980s or 1990s. Rarely helps now with modern malloc implementations because their allocation strategies are optimized to avoid heap fragmentation (which is critical for performance) and that happens to make grow-in-place difficult.
@@mitchblank No, it's not out of date. Size classes are large enough that multiple subsequent allocations will indeed grow in place. For that matter, they also coalesce adjacent deallocations so that subsequent allocations in the same arenas will yield memory that was previously fragmented.
Videos like this are so inspiring and I literally got into a mini rabbit hole about it. I was just thinking about it for like 2-3 hours and tried calculating stuff. In the end I accomplished nothing, but it never felt like wasted time. It was truly fun!
Real-world applications and memory allocators are actually much more complex (and smarter) than the one given in the video, so that some claims are less applicable in real life.
First off, as mentioned in the video, real-world applications tend to have many arrays allocating together. With a free-list allocator like that in the video (e.g. glibc malloc), different reallocated arrays will fill in the gaps created during resizes, so the total memory usage will not be as bad as the initial examples.
Additionally, many modern allocators actively control and reduce memory fragmentation (unused gaps between allocations).
Some allocators (like jemalloc and mimalloc) have the notion of a "size class", allocating memory of similar sizes together (wasting a little bit of memory for padding) and different ones in completely different places. One page can be dedicated to allocating memory chunks of 481 to 512 bytes, another dedicated to 513 to 640 bytes, and so on. Arrays of different sizes won't meet at all because they belong to different size classes, and the empty space within one page will be filled with allocations of other arrays or data structures of similar sizes (because there are so many of them).
Other allocators, while not having size classes, force memory allocations to exhibit specific patterns to reduce fragmentation, at the cost of slightly over-allocating. Buddy allocation is a famous trick first seen in the 1960s that only allocates memory within a page in pre-divided partitions, still used today in e.g. jemalloc. Other allocators like TLSF try to allocate memory into its best fit, to reduce fragmentation within constrained memory and time limits.
For large enough chunks of memory, for example those spanning multiple OS pages, a realloc() might be as simple as remapping the occupied pages onto a larger contiguous range of addresses. No data is copied at all; the only change happens in the virtual address mapping tables, which control which physical memory chunk the virtual addresses point to.
Finally, for garbage-collected languages/allocators, many of the garbage collectors (like those used in V8 JavaScript, JVM, CoreCLR C#, and many more) actively do memory compactions when collecting garbage, so memory fragmentation is not a problem for them at all.
Not to deny the work done by the author -- it's a very good way to visualize and compare memory allocations. It's just that I happened to have read through some of these topics and noticed that some claims are oversimplifications :)
In the end, if you're designing a dynamic array yourself, you might as well just pick 1.5 or 2 based on your expectations of how frequent the user expands it and how much memory you would like to waste, and just let the allocator handle the rest. If you're still in doubt -- just pick 2. It's good enough!
References
- Glibc malloc internals: sourceware.org/glibc/wiki/MallocInternals
- jemalloc implementation notes: jemalloc.net/jemalloc.3.html#implementation_notes
- mimalloc: microsoft.github.io/mimalloc/index.html
- Buddy allocation: en.wikipedia.org/wiki/Buddy_memory_allocation
- TLSF: www.gii.upv.es/tlsf/files/papers/ecrts04_tlsf.pdf
I was thinking along similar lines: why not align the allocation sizes with the bucket sizes of the underlying allocator... (in case it is a bucket allocator, and the sizes are known to the authors of the data structure...)
I love a comment with a good references list :]
Emailing Koenig asking for the paper probably would've been worth a try :3
I hope they do
Powers of two reign supreme, but in situations where memory efficiency is more of a concern, I do something slightly different that I haven't seen anywhere else.
Arrays are expanded up to the next "half" power of two. A "half" power of two I define to be either a power of two, or the value halfway between two powers of two. An example "half" power of two is 12, since it's halfway between 8 and 16. You can also think of this as the sum of the previous two powers of two (4 + 8 = 12).
So an example sequence of growing dynamic array sizes starting at 8 would be: 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, ...
You can also think of this as alternating between a growth factor of 1.5 (e.g. from 8 to 12) and 1 1/3 (e.g. 12 to 16).
Some advantages of this method:
1. The binary representation stays simple/nice for computers to work with. It has just one bit set if it's a power of two (0b10...0), or two consecutive bits set if it's a half power of two (0b110...0). This allows for certain optimizations.
2. When the size is not a whole power of two, the "gap" to the next power of two is always itself a power of two. E.g. if the size is 12, then the "gap" to 16 is 4. This can greatly simplify handling memory fragmentation in certain cases.
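A small sketch of how stepping to the next "half" power of two can be done with bit operations (assuming 64-bit sizes and GCC/Clang's __builtin_clzll; the function name is made up):

```c
#include <stdint.h>

/* Step from one "half" power of two to the next:
   8 -> 12 -> 16 -> 24 -> 32 -> 48 -> 64 -> ...   (assumes n >= 2) */
static uint64_t next_half_pow2(uint64_t n) {
    if ((n & (n - 1)) == 0)        /* one bit set: 2^k -> 2^k + 2^(k-1) */
        return n + (n >> 1);
    /* two adjacent bits set: 3*2^(k-1) -> 2^(k+1), i.e. double the top bit */
    uint64_t top = 1ULL << (63 - __builtin_clzll(n));
    return top << 1;
}
```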
I don't know if this was the original intention, but I interpreted Lankinen's comment as a way to do more copying but keeping memory totally compact. In other words, if the original allocation is N units and stored starting at position 0, after the two copying steps we have an allocation of size 2N also starting at position 0. Kind of hard to show the diagram in a youtube comment. Starts with allocation of N located at position 0. Then you allocate 2N starting at N. You copy the first elements into the *second half* of the new 2N allocation. Now you have empty space for 2N units starting at position 0. Copy the N units starting at 2N into the N positions starting at 0. Now you have a contiguous 2N vector starting at position 0. And you have free space to grow starting at position 2N.
Yup, this is how I understood it as well.
That makes a lot more sense to me as an explanation, but it would require a custom allocator.
Interesting! I've visualised that algorithm here: ruclips.net/video/0NTSZf2AhPM/видео.html (if I understand it right!)
I can't quite read that idea into his comment (you can read the full thing here, along with some other interesting comments from him - groups.google.com/g/comp.lang.c++.moderated/c/asH_VojWKJw/m/LiYGlFd4drIJ ) - and you do have to make some (what appear to me) slightly funny assumptions about how the allocator behaves in order to implement this, but it does seem to work and definitely achieves the benefits you stated.
I haven't checked how it performs with more than one array using the same memory space but I will do so. I imagine it performs as well as a "realloc contiguously if you can, otherwise alloc a new block of 2x" - but I may be wrong.
Thanks for the great comment!
@@neoeno4242 This approach doesn't really make sense in practice, as it depends on the 2N allocation being directly adjacent to the N allocation. That can only happen in the case where no other allocations have been done in between.
Just calling a function like C's `realloc` would in this case be vastly better as in the bad case it just copies the vector once, and in the best case it simply grows the space allocated to the vector without any copying. For the scenario to work as described in the comment by @NathanWhitehead, realloc would always work as in the best case.
Re-watching your implementation, I think it does make sense. That is, the L2 in the video has compacting properties that could theoretically be useful and are distinct from realloc, while @Nathan's approach can't be more useful than realloc.
@@neoeno4242 Cool visualization! Yeah, I think my interpretation of the comment doesn't really work. If the original allocation of size N is guaranteed to have free space of size 2N after it, then you don't need to do any copying. Just allocate another contiguous N units after the original and you're done, you now have 2N spaces with the first N used. If you're *not* guaranteed the next allocation is contiguous then my interpretation doesn't work, you can't free up 2N space by using the second half of the first new allocation. Oh well!
FYI, modern memory allocators have both the allocate/deallocate (e.g. malloc/free) operations, but also a reallocate operation (e.g. realloc). If the memory allocator knows that the realloc can grow in place, it will do that, avoiding the copy. So in real life, there's more going on that will alter what is a good choice of growth factor, and also what is better for larger arrays that are more likely to need a full reallocation/copy pass over a grow-in-place. This is made even more messy if you are using huge pages or some totally owned arena, for example, from an mmap, so that you always can grow in place. And as you noted, if many different things are being allocated into the same arena, there are other factors in play. Lastly, if you truly can't avoid an endlessly growing array-like structure, you almost certainly want a linked list or binary tree of arrays. If you store the pointers in a separate array of pointers from the array of actual values, you can recover very good performance even with insertions and deletions. Anyway, I think I'm teaching you to suck eggs, so I'll shut up now.
realloc() isn't really "modern", it's been part of the standard C library since the 1970s. However most of the fastest modern malloc implementations actually *don't* benefit from its use much (if at all). That's because the hardest problem for malloc() is keeping heap fragmentation low in long-lived processes, and one of the most important strategies for that is to group allocations into size classes. (i.e. one page of memory is all 96-byte allocations, another page might be all 256-byte allocations, etc)
The upshot is that it's usually not possible for realloc to actually just grow in-place... at least if you're growing outside of the "size class". Since a vector is normally growing by 1.5x or 2x, you're always switching to a different size class and never benefiting from realloc doing grow-in-place.
What you CAN do (and the best vector implementations DO do) is use APIs that the allocator provides to tell you what its size classes are (see malloc_good_size() on macOS, malloc_usable_size() on Linux, etc). Then every time you grow your vector (whether by 1.5x, 2x, whatever) you also make sure to grow to the next size class. For example, if you are about to do malloc(200), but you learn that malloc() would actually give you a 256-byte chunk in that case, just grow your vector to use 256 bytes right away. That way you can delay the next move as long as possible.
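A rough illustration of that size-class trick (glibc-specific: malloc_usable_size lives in <malloc.h>; the struct and function names here are made up): after growing, ask the allocator how big the chunk it handed back really is and adopt that as the capacity.

```c
#include <stdlib.h>
#include <malloc.h>   /* malloc_usable_size (glibc) */

typedef struct {
    char  *data;
    size_t len;
    size_t cap;        /* capacity in bytes */
} Buf;

/* Grow by roughly 1.5x, then round the capacity up to whatever the
   allocator actually handed back, so the next grow is delayed as long
   as possible. */
static int buf_grow(Buf *b) {
    size_t want = b->cap ? b->cap + b->cap / 2 : 64;
    char *p = realloc(b->data, want);
    if (!p) return -1;
    b->data = p;
    b->cap  = malloc_usable_size(p);   /* adopt the whole usable chunk */
    return 0;
}
```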
Yeah, while I doubt vector or any base list type in a programming language uses mmap, if you are expecting a gigantic dynamic array it would make sense to have a small custom class that just mmaps the expected ceiling and lazily maps the pages as you access them; then you don't even need to "resize" - just keep track of a length for bounds checking.
@@garriganstafford9113 Dynamic arrays for gigantic data are usually a sign that your design is wrong IMHO. In practice, they are usually either small, or a symptom of using throw-away code that didn't get thrown away once the scale of the problem ramped up.
"memory efficiency" is only really applicable on architecture that does not use virtual memory. Factor 2 maximizes amortized cost. The actual memory mapping is less important. As long as the memory allocator doesn't fragment over time then you will always get the block you need. Which is why i think factor 2 is picked by the high performance compilers, it exploits the one property that matters by the computers we actually use, i.e minimize bandwidth use, anything else is managed by the hardware and a non-problem.
That's always the problem with theoretical algorithms versus actual implementations, isn't it?
For example, copying data might be trivial for the hardware to do without spending a lot of CPU cycles.
An instruction to move a very large word of 512 bits might just schedule the memory management unit to perform a DMA copy of the actual bytes without even using the execution units of the CPU, so other instructions can run in parallel while the data is copied.
Virtual memory does not solve the problems of memory fragmentation and I'm not sure why it's brought up as if it does in this comment
@@scoreunder it does, once the memory used starts exceeding the size of one page. The memory used by the dynamic array only needs to be contiguous in virtual address space, not in physical address space. When you run out of space in the first page, that page can be returned to the system. Then that page's physical memory can be reused for future pages in a way that appears contiguous in virtual memory space to the user space process. So even though the virtual memory is never reused, the physical memory is. So unless there's a risk of running out of virtual memory space (very unlikely in a 64 bit architecture), the 'memory gaps' issue of the growth factor 2 algorithm is not a real issue.
Another way of thinking about it is that any potential memory gap that spans a whole page will always simply be repurposed by the OS.
Page size is typically only 4 KB, so this kicks in pretty quickly for any appreciably big dynamic array.
Great vid! Here is the 201st something something allocator comment:
The problem model *should* assume that the allocator is stupid; that lets it stay elegant while still being relevant to practical considerations. IMO the single-vector model of the video doesn't assume a stupid allocator, it assumes an adversarial one, and that makes the model less relevant for practical consideration.
Counting unallocated preceding memory with the same cost as reserved memory only makes sense if we assume that the memory is clear of other objects.
Not permitting reallocation/checking if the allocator gave an address directly following the reserved blocks is only valid assuming that the memory is full of other objects.
It is hard to conjure a scenario (for me) where both of these assumptions hold.
okay, but your hair looks super cute
Due to how memory and allocations actually work there is little to be gained from supposed "memory efficiency" unless you are running rather large applications that can get close to exhausting virtual addressing. And that again is rather unlikely as on normal consumer machines you can't even get 10TB of memory while the VAS is over 10 times larger. So a factor 2 just doesn't really have any particularly big drawbacks.
When talking performance you also just want to minimise the number of allocations and, if possible, get nice cache alignment. If you've got a rough ballpark for the expected number of elements, then pre-allocate that size. So if you expect, say, something like 5-10 elements, just make it 15 and it will almost never need to reallocate.
This is going to blow up. Fascinating exploration!
16:10 that's one of the best single lines I've heard in a while
I might need to add that to my email signature 🤣
One thing to remember is that not all data types occupy the same size. So a variation of the growth factor 2 is to round up the required space in bytes to the next power of two. It's not quite a factor of 2 if the data size is weird, but it has the effect that arrays of any type can re-use the "memory fragments" of other arrays. Not sure if any language does that.
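If anyone wants to try the round-the-byte-size-up-to-a-power-of-two variant, the usual bit trick (assuming 64-bit sizes) looks like this:

```c
#include <stdint.h>

/* Round a byte count up to the next power of two, e.g. 100 -> 128.
   Classic bit-smearing trick; assumes 1 <= n <= 2^63. */
static uint64_t round_up_pow2(uint64_t n) {
    n--;
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16; n |= n >> 32;
    return n + 1;
}
```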
If you know you will need more arrays (and are on a single thread), you don't even have to de-allocate - just put the chunk in a list to re-use. I am using this in a wasm project, as the default allocator seems to be really, really bad for memory fragmentation.
I've wondered about this for so long. I have always been a person who is scared of doing the "unoptimal" thing, especially when deciding how much a dynamic array should grow. This is a godsend for me, thx.
Fascinating, but these solutions address a problem that shouldn't even exist today. Modern computers provide up to 16 exabytes of addressable virtual space. By reserving large, contiguous regions of virtual memory and committing or decommitting them as needed, we can avoid the complexities of reallocating, copying, and defragmenting memory.
I hope you don't ever stop making videos. They're so good!
You are missing a critical piece: the allocator
There are types of allocators (like buddy allocators) that only return memory blocks in power-of-two sizes.
There are allocators that support resizing an allocation in place.
etc. etc.
I remember reading a few old articles about optimizing Java programs, and most of them mentioned that one of the main reasons for big memory consumption is allocating a lot of arrays and maps that have only a few elements or are completely empty (I never researched it, but I believe that might be the reason why modern Java doesn't actually allocate memory for these arrays until you try to insert the first element).
And that got me thinking: what if we try to calculate new capacity dynamically? A few ideas:
* Start with the growth factor of 2 and decrease it by 0.1 for each new allocation until reaching 1.5
* The same strategy as above, but with a delay until a certain amount of elements
* Start with the growth factor of 2 and switch to 1.5 after a certain limit
* Alternate between 2 and 1.5
My reasoning is: surely most of the arrays and maps in the real world have very few elements. By using different growth factors we can prioritize the speed for small arrays and memory usage for large arrays.
And if someone needs very large arrays (say a few million elements), they probably already allocate them with enough capacity so the small growth factor wouldn't be an issue in most cases.
It would be interesting to test this theory on real-life data, like Rust does.
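One possible shape of such a dynamic policy, as a tiny sketch (the threshold and factors here are made up and would need real benchmarking):

```c
#include <stddef.h>

/* Grow fast while the array is small, switch to a gentler factor once it
   is big. The 4096-element threshold is arbitrary. */
static size_t next_capacity(size_t cap) {
    if (cap == 0)   return 8;            /* small initial allocation */
    if (cap < 4096) return cap * 2;      /* prioritize speed while small */
    return cap + cap / 2;                /* 1.5x to limit waste when large */
}
```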
That's some interesting theory being laid out, but these days computer hardware is a bigger factor than algorithmic theory. How allocation, reallocation, copying and memory mapping work may completely break apart this whole video's concept. So despite all the simulated numbers, the actual answer lies in the benchmarks, as the Rust experiment hinted. Not to mention that, because of this discrepancy in the results, trying out just 1.25, 1.5 and 2 is possibly an unnecessary constraint. For all we know, because of the aforementioned factors in play, the best growth factor could turn out to be 1.75, 2.25 or even 3.5 (all losslessly encodable in float32).
*Fixed-Point:* _First time?_ /s
Depending on the constant you don’t need to waste a 32-bit float for a few bits of precision.
Aside from your technical prowess you seem like a very nice person, keep on!
Like others have commented already, the allocator itself plays a large part in handling how the memory is utilized in modern systems. Good allocators split memory blocks into separate buckets based on their sizes to minimize fragmentation, and are able to use realloc() to increase the size of the memory block in-place so no data copy is needed at all when resizing.
This is an excellent research project on dynamic array allocation efficiency! The insights you've shared here are not only academically valuable but also practical for real-world applications. As a retrocomputing enthusiast, I'm particularly excited about leveraging your findings to test a new memory allocation algorithm for the Apple II. Projects like this bridge the gap between modern research and the optimization challenges of classic systems. Keep up the great work-it's inspiring to see research that has such wide-reaching potential!
A few years ago, I wrote a simple simulation that calculates how often old space can be reused for a given growth factor and made a plot. (x-axis: growth factor between 1 and 2, y-axis: average number of resizing-operations before the first slot in the simulated memory is used again)
The plot I got traced out an ellipse, peaking at x=1 and ending at x=(the golden ratio). At the time, I saw people in RUclips comments arguing that the golden ratio would be the best, simply because of the whole nature of Fibonacci numbers. But that graph clearly meant that the golden ratio is just the boundary between memory efficient and time efficient growth factors, and more importantly, that it is the worst growth factor!
A lot of the research in this video hints at 2 being the best growth factor, but what about larger numbers? If we are ok with larger-than-golden factors, then anything is fair game.
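For anyone who wants to reproduce that kind of plot, the core of such a simulation is only a few lines (one possible model, not necessarily exactly the one the commenter used): keep growing by factor g, accumulate the space freed behind the live array, and count how many grows happen before that space could hold the next allocation.

```c
#include <stdio.h>

/* For growth factor g, count how many grows happen before the space
   freed behind the live array could hold the next allocation. */
static int grows_until_reuse(double g, int max_steps) {
    double live = 1.0, freed = 0.0;
    for (int step = 1; step <= max_steps; step++) {
        double next = live * g;
        if (freed >= next)
            return step;      /* the next array would fit in the old gap */
        freed += live;        /* the current array is freed after the copy */
        live = next;
    }
    return -1;                /* never reusable within max_steps */
}

int main(void) {
    for (double g = 1.1; g < 2.05; g += 0.1) {
        int n = grows_until_reuse(g, 1000);
        if (n < 0) printf("g = %.1f  ->  gap never reusable\n", g);
        else       printf("g = %.1f  ->  gap reusable after %d grows\n", g, n);
    }
    return 0;
}
```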
Wonderful video. I stray away from dynamic arrays when I can, but I had to use them again because the old method involved millions of insertions into a single huge array, which as you know is way too slow. It now has tens of thousands of 'buckets' with a growth factor of 2. The numbers to be stored are generated in a pretty unpredictable order, so which bucket I put them in is proportional to the square root of their size. This can still leave some empty buckets, while the largest one ends up with over 6,000 numbers. Of course, the more tens of millions of numbers I generate, the fewer buckets are still empty.
Why do you copy twice in Lankinen's algorithm? Couldn't you calculate the gap and offset and just copy with a gap directly the first time?
I wondered this myself - I believe the answer is that we typically don't get the choice to do this unless we control the allocator. If we don't, but we know it uses something like a first-fit algorithm (the assumption Koenig and later Lankinen was making) then we can maximise our chances of creating the right spaces for the allocator to do what we want by copying twice. Like a lot of the ideas here, it requires a lot of assumptions that may not pan out in reality! (And, to be fair, all the thinkers I referred to in this video acknowledge this) but that's the idea as I interpret it.
At the end of the intro, my search has led me to the Padovan numbers! Hope these come up in the video :)
Edit: rip :( no Padovan numbers in this one. Edit edit: This sequence approaches the plastic ratio (which is a close cousin of the golden ratio)! I've learned so much stuff today :O
For interested readers, the Padovan numbers follow the formula f(x) = f(x-2) + f(x-3) with initial values of 1, 0, 0 (although something like 2, 3, 4 or 7, 9, 12 are better starting places for this application). I derived the formula directly from the problem statement: when allocating iteration n, the array n-1 is still in memory. If n-1 was just allocated sequentially after n-2 and n-3, then the space above n-1 is the sum of n-2 and n-3. With reasonable starting conditions the size is guaranteed to increase (which is not true for some starting conditions).
After finding values that looked good for this application, I found the name of the sequence via OEIS.
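To make that concrete, here is a tiny generator for the growth sequence described above, using the 2, 3, 4 starting values suggested: each new capacity is the sum of the capacities two and three steps back, so the new array fits exactly into the space the two already-freed arrays used to occupy.

```c
#include <stdio.h>

int main(void) {
    /* Padovan-style capacities with the 2, 3, 4 start:
       each term is the sum of the terms two and three places back. */
    unsigned long long f[32] = {2, 3, 4};
    for (int i = 3; i < 32; i++)
        f[i] = f[i - 2] + f[i - 3];
    for (int i = 0; i < 16; i++)
        printf("%llu ", f[i]);   /* 2 3 4 5 7 9 12 16 21 28 37 49 ... */
    printf("\n");
    return 0;
}
```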
Good point. This sequence also keeps the advantage that a growth factor of 2 and Fibonacci numbers have, which is that any possible array size can also be used without gap by two arrays of a smaller size. (Which growth factors like 1.5 or 1.25 don't have.)
For a growth factor of 2, an array of size 2^N that is deallocated can be used by two arrays of size 2^(N-1) each.
And for Padovan numbers, an array of f(N) that is deallocated can be used by arrays of size f(N-2) and f(N-3). The Fibonacci sequence has the same "sub-divisibility" property, just with both gap-filling arrays being "one size larger".
The only problem with this sequence is that the growth factor (1.324718...) is a good bit worse than 1.5 and even worse than 1.6-ish for the Fibonacci sequence.
You could get a growth factor that is closer to the golden ratio if you summed more terms:
The Padovan sequence grows "forwards" twice and then back-fills the space: f(N) = f(N-2) + f(N-3).
If you used the sequence that summed the 3 terms f(N) = f(N-2) + f(N-3) + f(N-4), you'd grow by about a factor of 1.46557.
A growth factor of 1.5 grows "forwards" 4 times and then back fills _almost all_ of the space.
For a sequence that summed 4 terms f(N) = f(N-2) + f(N-3) + f(N-4) + f(N-5), the growth factor would be at 1.53146.
1.57015 for summing 5 terms, then 1.59, 1.601 and so on...
A growth factor of the golden ratio would grow forwards an infinite amount of times and never back fill the space.
And notice that the more often you grow before back-filling the space, the more likely something else will have snatched the space.
@ Very cool! You obviously have more intuition in this space than I do :) Perhaps an upside to the lower growth factor would be a smaller likelihood that another object occupies the old space. My intuition says that where the Padovan set would have 2 "chances" (arbitrary percent chance) to fill a space, the 5-term version would have about 4 "chances". I'd expect an old spot to be taken more than twice as often (although this is probably near 100% for systems that are actually doing something). Another interesting upside to this ratio is that IF a spot is taken by an array n-2, it can also take spot n-1. Maybe not practically useful, but I have a pretty picture in my head of Russian nesting doll array allocations.
Great video! I love your visualizations! I really like that you actually went and searched the paper
If you're allocating immediately before the existing array you could, I suppose, allocate with an overlap as you know you will have copied the overlapped bytes before you overwrite them with the copy. For your question at the end of the video, this may give you one additional expansion. That assumes a lot, and I'm not sure it would be "allowed" by memory protection unless the allocation routine is completely under your control.
Who would have ever expected that the Golden Ratio contains a hidden off-by-one error!
Excellent video, this is something I never thought about wrt array sizing! Really enjoyed! Now I'm going to be thinking more closely about common programming wisdom.
I’m struggling to understand why we reallocate all existing elements each time, instead of just allocating new memory directly after the existing array, if there are enough contiguous available spaces.
Amazing! Thank you so much for the great work! I'm a big fan!
Good presentation. I puzzled over this same question for years too.
Couple of missed things:
* Overallocation - say you double the array size, but you're not actually going to use all of it. At a factor of 2 you're wasting about 25% of your allocated memory on average: right after a reallocation you're using 50%, right before the next one 100%, so on average you sit around the 75% mark. This will have a drastic impact on the overall memory usage of your application.
* Allocator efficiency - allocators are tuned for specific memory sizes. When you ask, say, for a chunk of memory of size 1024 - the allocator might have one ready and will give it to you immediately, but if you ask for 1025 - it might just say "screw it" and give you a chunk of size 2048. You're thinking you're very clever by "only" "using" 1025, but actually - you're not. This happens because allocators don't want to do compaction, it's expensive, so they try to operate on fixed-sized chunks, which are way more likely to be reused. A bit of waste here is often preferable to increased runtime overhead.
I worked for a boutique Smalltalk (and later Java) consultancy in the UK in the 90s. We had every copy of JOOP, until we left that office in ~2001 or so.
I am missing some real-world data collection, where someone just checks a huge number of samples from real-world applications for typical usage of dynamic arrays. My intuition tells me that a big (2x or more) growth factor is better, because most dynamic arrays are pretty small, a bigger growth factor just works better for average CPU usage on small arrays, and you don't care about the memory usage since most of them will die soon (weak generational hypothesis).
Also it is worth mentioning the software engineering perspective. Huge arrays are not a problem: they are pretty rare and easy to spot in a profiler, so anyone who cares about performance will find inefficient usage of big dynamic arrays and fix it with a well-applied reserve(). The real problem is a huge quantity of small arrays, which are hard to pick out in a profiler - a.k.a. performance death by a thousand cuts. The growth strategy should be optimized for the cases where the developer doesn't bother with reserve(), and in my mind that is mostly small arrays.
Awesome vid! Will probably have to rewatch it a couple of times though.
You might want to try this strategy as well. Use 2 separate allocations. When a new item needs to be added, if it's outside the current bounds, copy the old arrays into a single array, free the old arrays, allocate the second array, add the new item. The second array should be able to fit in the slot left from the old two arrays. This makes indexing very slightly slower, but still O(1).
I know I used the term "array" a little too freely. I hope it's not too confusing.
Also, I am almost sure with a few tweaks this can be used to make prepend operations O(1) as well.
good code to keep your home warm in winter
When working with the heap, 1.5x is sufficient when working blindly; however, it is much more efficient to work towards "filling" the capacity before requesting a change. What I mean by this is that if there are free bytes following the array, it is better to just expand the current heap block at a 2x clamp rate toward the maximum allowance before making a re-allocation request at 1.5x, saving you from re-writing data you have already written. When you have reached the maximum threshold of heap space, then your only option should be requesting whole heap-space intervals from the OS, and letting the OS handle the CPU pointer refs and memory balancing.
Let's work an example, say you are working on an OS with a 4096 byte page/heap size and has full virtual pointer redirect at the cpu level....and we will also say that each element of the array takes up say 4 bytes each, and allocation data takes up 16. Starting with an array size of 2, that is 24 bytes of data at a position in the heap, when you hit the limit, the next number will be 4, but the allocation header is first checked to see if it can accommodate, if so, the 16 byte header is altered to increase to 4 (16 bytes) and no movement of data has to happen. The next increase request will check for 8, then 16, and so on.....say when at 8 going to 16 that there is only 48 bytes before the next allocation header is reached from some other data that requested memory, then the header is only updated to 12 entries (48 bytes) instead. The next time it tries, it wants 24, but has 0 left to gain, therefore it now drops down to 18 (1.5x) as a complete re-allocation within the heap to find space for it, then move the 12 entries over and continuing.
When the size reaches 1020 entries (4080 bytes plus the 16 byte header), the entire page is used, now we move to direct OS page system at 4096 bytes each new allocation, and tell the OS that the new virtual set is 0x01000 more than the previous in the block. This means that when you read/write to an address, the page is redirected from the virtual address to physical address, and it is continuous in virtual number but not necessarily in physical memory.....and as pages get used less, the OS will automatically offload to hard disk in limbo until called upon. There will be a point where the OS runs out of actual space, at which point you need to start thinking about scaling to actual database style coding.
Why work so hard? But if you think that's just cheating, ok, fine, then you can program the entire indirect pointer system yourself.....the modern cpu architecture has abilities for memory mapping under high authority flags, but at that point you are writing the entire OS system. You could request the entire virtual space from the OS with little reserve and just increase the reserve later. Or, lastly, if you really think the size is going to be infinitely undefined, then just use indirect pointer blocks instead and handle most of the load on rotating files within the same heap space, like a cache system.....if done right, you can cluster the data onto servers and have unlimited space.
1:43 I'm not following this part - how could we possibly guarantee our new allocation will be contiguous after the old array? What if, between growths of the array, the OS has received another call to allocate memory and has decided the bytes immediately after our array are the place for it? Then we can't use the old memory at all.
What happens "after" the array is not really relevant; the only important part about reuse is that the sum of all your previous arrays will be 2^n-1 and your new array will be 2n, thus it can't fit.
What you point out is the same as pointed out later in the video: once you have 2 arrays (or literally, _any_ other allocation that can fit in that space), it could take up that space and the reuse point becomes moot.
You got a new subscriber
Another great video. Thank you so much.🙏
I'm confused. How is Lankinen's algorithm not just exactly the same as a 2x factor but with an extra step? I've rewatched that bit like 10 times and I don't get it at all.
when somebody explains this, please tag me
@@sunofabeach9424
step by step, where () is empty memory, and [] is allocated:
[]()()()()()()()() - initial array
()[][]()()()()()() - vector doubling
()()()[][]()()()() - reallocation
()()()()()[][][][] - vector doubling
[][][][]()()()()() - reallocation
Lankinen's algorithm eventually lets you reuse the start of your memory, which would otherwise stay occupied, at the cost of additional copies compared to plain 2x (which still add up to basically the same amount as with a 1.5x factor).
You might be confused by the fact that the _sizes_ still grow with a factor of 2.
But the _starting position_ of a new array gets "artificially" pushed further along, meaning that you're making more space at the beginning where you might be able to slot the next array into.
I suspect that, while quite clever, Lankenin's Algorithm would be basically useless in practice. Unless you write your own allocator (a move I would highly respect, but is almost never done in practice and done well even less frequently), you are probably, directly or indirectly, using `libc`'s `malloc` family of functions for dynamic allocations. If the only dynamic allocations you make are for a single dynamic array, I believe that you would actually never copy the array, because `realloc` can simply expand the block that was previously allocated if the block after it is unused and large enough. This trivializes reallocation, rendering this problem irrelevant. It is thus only useful to examine this problem in the context of multiple dynamic allocations.
Imagine the following scenario. You have a dynamic array of 64 bytes. You then use a reference-counted object (lets say that requires a 16-byte allocation). You then add elements to the dynamic array until it needs to be reallocated. It will have to be placed in memory after the reference counted object, because there is no space before the dynamic array or between it and the reference counted object. Imagine instead that you made a linked list. Every time you add a new node, you make it impossible for the dynamic array to reuse any of the memory it has already used.
In practice, you will most likely either make like 0-2 dynamically allocated objects (if you specifically avoid dynamic allocations where possible like I do), or at least dozens of a wide variety of sizes. Either way, this problem is not particularly relevant. I believe that the main cost that you can actually save in practice is the copying operations, and any memory savings will likely be marginal. Unless benchmarks show some other factor as optimal, I believe that a factor of 2 is still best in practice.
fopen() (or the first fread()) will claim a chunk of the heap when executed. printf(), too, and may even spool lots'n'lots before flushing its buffers... There's many things to account for when skating close to the edge...
RUclips's recommendation algorithm is scary
The memory gap issue with factor 2 disappears if your language can perform in-place resizing, aka realloc, which we have in C. A small micro-benchmark with an array growing from size 1 all the way up to 1048576 required only 3 instances where memory had to be copied across the 20 resizes needed (when using a factor of 2).
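That kind of micro-benchmark is easy to reproduce. Here is a sketch that counts how often realloc actually had to move the block while doubling from 1 element up to 2^20; results will vary with the allocator and with whatever else the process has allocated (and, strictly speaking, comparing the old pointer value after a successful realloc is outside what the C standard blesses, but it works as a diagnostic on mainstream platforms).

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t cap = 1, moves = 0;
    int *data = malloc(cap * sizeof *data);
    if (!data) return 1;

    while (cap < (1u << 20)) {
        int *old = data;
        int *p = realloc(data, cap * 2 * sizeof *data);
        if (!p) { free(data); return 1; }
        if (p != old)              /* the block could not be grown in place */
            moves++;
        data = p;
        cap *= 2;
    }
    printf("doublings that required a copy: %zu\n", moves);
    free(data);
    return 0;
}
```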
I really like the aesthetics of your graphs/visualisations. What are you using to make them?
Great video by the way!
Thanks :) This one was almost all with Motion Canvas - a great tool.
Thanks for this video, there is a lot of stuff that one would have to go to great lengths to work out! I wonder if the resizing factor being dynamically adjusted would have been a better approach entirely (although probably not worth maintaining state for small arrays and hence not a design choice most languages would prefer)
What a great video!
One thing I see missing is: why not allocate the new array overlapping the previous one? As long as you then copy the elements front-to-back, you should be able to copy them over with no loss of data - unless the new array starts halfway inside the old one, in which case back-to-front copying would solve it.
I guess it's a very impractical approach because no modern allocators would do this, but it's still something I think would be interesting to explore.
I wanted to watch a video to go sleep but that somehow hooked me too much
I think Lankinen's idea of an early resize has some potential. I'd be very curious how resizing at 1.62 but to a size of 2 would perform.
To be real, I think optimizing for whatever the CPU architecture prefers will make a bigger difference. I might bother my microkernel colleagues next week so I can run A/B testing on aarch64 and a few generations of x86.
It's funny because recently I came to a similar idea (the golden ratio rate of growth) for allocation. No need for multiplication: just storing two numbers is enough, computing the Fibonacci sequence as you go.
A very strange explanation TBH. First, you assume the new array always lands above the initial one, that the fragmented memory cannot be reused, and that the copy always happens. IRL, in some cases the reserved space just happens to expand into the higher mapped addresses without any copying. And although this alone is not an excuse to call realloc() every time you add a new item, the fact that arrays can sometimes be really huge makes it clear that multiplying their size can be a bad thing. Given that, simply growing your array by a fixed additional amount is probably the best way out nowadays!
Obviously each array instance should be able to specify what its growth factor is, because a fast-growing type might be better served with GF 3 and a slow-growing one might be better served with GF 0.3.
and that my friends is why we use C.
The argument of not being able to reuse previously allocated space is a bit "silly" as an argument against a resize factor of 2, because in practice you won't reuse it anyway. It suffices that you create some damn tiny string and the allocator decides to carve it out of the "previous void", and you no longer have a long enough contiguous space to reuse. Visual Studio uses a growth factor of 1.5 because of that "argument", while clang and gcc simply don't care, I guess because of the practical considerations I just mentioned, and they use a growth factor of 2 (and also because, I guess, multiplying by 2 is much faster than by 1.5, since *2 is a simple 1-bit shift that takes 1 cycle, while multiplying by 1.5 could imply 10 or 20 cycles; if you have a shit-ton of little vectors that 20x difference is significant).
Second, with respect to memory efficiency: in 64 bits you have 128 terabytes of virtual memory. So space you have reserved but never written to (unallocated) won't consume RAM, just "virtual space", which in practice is infinite nowadays, so the memory-efficiency argument doesn't really apply either, because you aren't "wasting" anything. The virtual space you "left behind", as long as the allocation is > 128KB (on Linux), will be returned to the operating system; if it's < 128KB, it will be reused for other objects (the allocator/malloc will take care of that), and the space that is "in front of you" won't consume RAM until you write to it for the first time.
I think you're both over and under thinking the 1.5 scaling factor. For one, a growth of 1.5 exactly can be done with just a few clocks when you consider that x*1.5 == x+(x/2). That comes out to a register copy, a shift in the opposite direction, then an add. The real issue here is in how often you have to copy. If you have to move the array often, which isn't necessarily guaranteed regardless of growth factor, then the fewer moves of a 2 times scaling factor will be better. If the allocator can reuse older space before prior allocations then it may be more memory efficient, but will also copy more often. The optimal performance is really when it resizes fewer times and is capable of adjusting the allocation size when it can because there's enough space following the current allocation. Now, if virtual memory works as efficiently as is suggested by multiple commenters, yourself included, then all of this should be a moot point, but clearly the fragmentation causes enough overhead because otherwise there'd be no difference in CPU time between any of them.
@@anon_y_mousse You are right about the x + x/2. Regarding the CPU time, yeah, the more copies you have to do, the more CPU time must be used: you have to allocate the new chunk, copy the data, and deallocate the old chunk. Allocating and deallocating (for chunks > 128KiB) implies a syscall, where the kernel must update the virtual memory tables etc., and as for the copy, the CPU is what does the copying, and whenever a full page is copied the kernel must also find a free RAM page so that the copy of the next page can continue. So you really want to minimize the number of copies you do.
But that has nothing to do with virtual memory. I mean, the point is that you don't need to worry "too much" about virtual memory utilization, because nowadays OSes are very efficient in terms of resource consumption: you are not wasting resources if you allocate virtual memory that you don't use. As long as you don't have memory leaks and you don't do stupid things like zeroing big chunks of memory beforehand, you are in the safe zone (because "writing" memory is what needs RAM, which is the actual, physically limited resource; allocating memory that you don't write is basically free). The CPU time depends on what your program does: if you copy more, you need more CPU time to do those copies.
That is why different strategies give different CPU utilization times - because they copy more or less, not because of fragmentation.
@@Peregringlk When you're talking about a cold boot, that will be mostly true, depending on what software you have running from startup and various other factors, but not everyone turns their computer off every day. In the most common cases, where someone just leaves their computer on all the time, you'll get variances with the same allocation strategy, regardless of 1.5 or 2. This also holds with mobile devices that just about never get rebooted.
For the exercise: pretty straightforward, but there is the surprisingly interesting part of calculating ceil(1.618 * x) exactly (of course, if you have access to bignums / >128-bit integers then it's pretty easy, but Rust doesn't have those natively). You can just reimplement 256-bit integers, but there is a short-but-ugly formula only in terms of 128-bit (or smaller) operations.
*** spoilers below ****
ceil(1.618 * x) = x + 309 * (x / 500) + (309 * (x % 500) + 499 - (309 * (x % 500) + 499) % 500) / 500
Divisions on the RHS are 128 bit unsigned integer divisions (rounding down), same for modulo.
(And most of those actually fit into 32 bits, so you can get away with 4 128 bit operations)
The result of the exercise should look like: 126*********************************641.
Look at these kids with their new fangled dynamic arrays. My industry has just about got the hang of variable names automatically allocating memory space - I'll let you know when we get literally anything dynamic.
what language is your industry coding in? 🤣
@@zyansheep ST and Ladder, according to IEC 61131-3. It's not a fun experience.
In automotive and other RT industries, you must not use dynamic memory. Allocators give non deterministic performance and may potentially throw exceptions. So… C, C++
@mastermati773 you could use something like the upcoming std::inplace_vector (C++26) or some custom allocator on std::vector that pulls from a pre-allocated pool.
11:00 it looks like the performance of L2 would be similar to that of the normal sqrt(2). That raises the question: did Lankinen find a way to make the scaling factor act like its square root in performance? Or is there something deeper going on?
Also, even if it did, computer memory tends to be a lot messier, so what is in theory better can become much worse in practice. In particular, trying to use an L3 algorithm (for instance) may consume far more memory in the real world compared with a no-allocation experimenting environment. That said, I'd say it's fair to compare L2 with a normal factor of 1.5 (and also 2)
Here's an interesting "optimization": if, when you allocate the second block, its start is higher than the first reallocation, why bother transferring the data to it? Instead, free the 2nd reallocation rather than the first. Granted, that means it becomes less often that a 2nd reallocation is an improvement (why Lankinen's algorithm is an improvement at all), but the sacrifice may be less costly than the savings of one less full-array recopying. In fact, a smart enough allocator wouldn't even need to truly eat the memory before the check, making the modified step far cheaper than with a full 2nd reallocation if such isn't needed.
Why not try to think ahead?
Just jump a few spaces before allocating memory for the array so the next allocation could start at 0.
Examples (starting at n=2):
[block_size description]
Growth factor = 2:
1) [2 array] 100% memory efficiency
2) [8 padding][4 array] 33% memory efficiency
3) [8 array] 100% memory efficiency
Growth factor = Fibonacci sequence times 2
1) [2 array] 100% memory efficiency
2) [6 padding][4 array] 40% memory efficiency
3) [6 array] 100% memory efficiency
Perhaps I am misunderstanding the idea, but why would you need to copy data twice for Lankinen's algorithm? Couldn't you do something like the following:
Given a full array of length n, allocate space for 2 arrays of length 2n. Then, copy over the data from the original array to the 2nd 2n array and free both the original array + the first 2n array (freeing 3n blocks).
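A rough sketch of that variant with raw byte buffers (names are illustrative and it assumes n > 0); the point is that only one copy ever happens:

    use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

    unsafe fn grow_single_copy(old: *mut u8, n: usize) -> *mut u8 {
        debug_assert!(n > 0);
        let old_layout = Layout::array::<u8>(n).unwrap();
        let new_layout = Layout::array::<u8>(2 * n).unwrap();
        let first = alloc(new_layout);  // placeholder 2n block, never written
        let second = alloc(new_layout); // the block we actually keep
        if first.is_null() || second.is_null() {
            handle_alloc_error(new_layout);
        }
        std::ptr::copy_nonoverlapping(old, second, n); // the single copy
        dealloc(old, old_layout);   // free the original n block...
        dealloc(first, new_layout); // ...and the untouched 2n block (3n freed)
        second
    }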
This needs fewer operations but also has worse memory efficiency, because your modification delays one of the frees, which could mean the difference between being able to fit the last array size and not being able to.
I wonder if the memory efficiency is less of an issue in practice.
VA space is so large that running out of VAs is not really an issue compared to physical memory, and reusing the first block for the array requires the allocator to be first-fit, which just depends. If we consider page sizes, then it would seem the most wasted memory would be 2 KB, since once you get bigger you start filling full pages, which no longer need to be physically contiguous.
If we are just using the one array then that is a big waste, a full half page, but in practice we will likely have more heap-allocated data, much of which could fit into that 2 KB (or whatever half-step block is left as a prefix in the 2x case).
It would be interesting to see how this plays out in a program with a large dynamic array but also some other smaller data and larger data structures close to a page size, because then you could see how well the allocator packs things and whether you end up wasting space at page boundaries.
Either way, this hunch leads me to think that in practice 2x is probably the best choice, due to the significantly cheaper per-append operation cost.
underrated channel!
Did you just... casually make a huge advancement in computer science?
The other thing that wasn't touched on is that for the compiled languages (e.g. C and Rust), memory allocation is a first-class primitive. This means that a compiler will absolutely optimize out a redundant copy or "useless" allocation.
I don’t want multiplicative growth for buffers with sizes measured in GB; too much RAM is wasted.
When I recently needed to allocate vastly different dynamically sized blocks of memory in my program, I used 4 small integer control codes, 1 byte each, packed into a 32-bit integer. The first (i.e. lowest) byte is the base-2 logarithm of the minimum allocation size. A reasonable value is between 6 (64 bytes = a cache line) and 12 (4 KB = page size).
The next byte is the base-2 logarithm of the first threshold. Below that threshold, the allocator rounds the requested size up to a power of 2, i.e. to a single most significant bit. The remaining bytes are similar thresholds for rounding up to 2, 3 or 4 most significant bits.
Note that the standard libraries of most languages expose a function which compiles to BSR (old PCs), LZCNT (current-gen PCs) or CLZ (ARM), and which quickly returns the base-2 logarithm of an integer.
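A hedged sketch of that rounding step (the parameter names are mine, standing in for the control bytes described above); u64::leading_zeros is the call that compiles down to those instructions:

    fn round_up(size: u64, min_log2: u32, keep_bits: u32) -> u64 {
        let size = size.max(1 << min_log2);  // enforce the minimum allocation
        let top = 63 - size.leading_zeros(); // floor(log2(size))
        if top < keep_bits {
            return size; // already representable with that many bits
        }
        let granularity = 1u64 << (top + 1 - keep_bits);
        // Round up so only the `keep_bits` most significant bits can be set.
        (size + granularity - 1) & !(granularity - 1)
    }

    fn main() {
        assert_eq!(round_up(45, 0, 1), 64); // next power of two
        assert_eq!(round_up(45, 0, 2), 48); // two significant bits: 0b110000
        assert_eq!(round_up(45, 6, 4), 64); // clamped to the 64-byte minimum
    }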
I love your videos!
I'm surprised that you never tried subtracting one from the Fibonacci sequence - only the regular one, and 1.5. AFAICT the regular one seems to be at least a local minimum in the parameter space, as it's the smallest growth rate (and thus has a high average ops per append) that can never reuse previously used space. My hunch is that it should have the lowest possible memory efficiency in the simple single-array case.
Subtracting one might make it optimal, however, as it's the largest growth rate (and thus the fewest reallocations) that allows deallocated memory to be reused in the single-array case. It's somewhat easier to use 1.5, as you mention, but I suspect that the performance cost of a few floating-point operations (besides being constant) will be vanishingly small for any N above ~32 or so.
I would really like to see you compare a fib(n)-1 or floor(n*phi) growth strategy.
Hold on, I'm calculating
what you got
The sad truth is that growing the size in this manner will NEVER result in reuse of old memory. It doesn't matter what you add to or subtract from the Fibonacci Numbers.
The reason is the following (sorry if it's a bit much):
Let "s" be the sequence of array sizes, such as [1, 2, 4, 8, 16, ...] (powers of 2) or [1, 2, 4, 7, 12, 20, ...] (fibonacci minus 1). "s[0]" would be the initial array size, "s[1]" the array size after the first re-allocation, etc.
Now, let's say our array is at size s[n] and we want to resize to s[n+1]. The previously allocated memory can only be reused if: s[0] + s[1] + ... + s[n-1] (*aka the total size of previously used memory*) is at least as big as s[n+1] (*aka the new memory size*).
The Fibonacci Numbers have the property that fib(a) + fib(a+1) + ... + fib(b) = fib(b+2) - fib(a+1). So, when using s[n] = fib(n), we require fib(n+1) - fib(1) >= fib(n+1), which is clearly impossible.
Subtracting 1 from the Fibonacci Numbers will result in an even worse situation: with s[n] = fib(n)-1 we get s[0] + ... + s[n-1] = fib(0) + ... + fib(n-1) - n = fib(n+1) - fib(1) - n. This needs to be at least as big as s[n+1] = fib(n+1)-1. That's right, the two sides of the inequality are now even further apart, and subtracting larger numbers will only widen the gap!
Counterintuitively, adding 1 to the Fibonacci Numbers makes the left side of the inequality gain a +n rather than lose an n, closing the gap. However, fib(n) grows exponentially, while n grows, well, linearly. So yet again, we never see previously used memory becoming used again. The Fibonacci Numbers (and by extension, the golden ratio) only spell doom for the array-resizing problem.
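For anyone who wants to check the fib(n) and fib(n) - 1 cases above numerically, a quick sanity-check sketch (starting the sequence at fib(2) to skip the degenerate 0/1 sizes):

    fn fib(n: usize) -> u64 {
        let (mut a, mut b) = (0u64, 1u64); // fib(0) = 0, fib(1) = 1
        for _ in 0..n {
            let t = a + b;
            a = b;
            b = t;
        }
        a
    }

    fn main() {
        for &offset in &[0i64, -1] {
            let size = |n: usize| fib(n) as i64 + offset;
            // Reuse condition from above: freed prefix >= next size.
            let reuse_possible = (4usize..60).any(|n| {
                let freed: i64 = (2..n).map(|k| size(k)).sum();
                freed >= size(n + 1)
            });
            // Prints false for both offsets, matching the identity argument.
            println!("offset {offset}: reuse ever possible = {reuse_possible}");
        }
    }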
@ You are right, I did discover a similar result in an uncle comment. I also found multiple optimal constants to achieve the particular effect I incorrectly theorized subtracting one would have.
I did wonder the same thing about subtracting 1! And did try it out, but it didn't end up making it into the video because it was clear to me in the simulation that it wasn't working - but my math wasn't strong enough to see immediately why, and I didn't end up prioritising puzzling it out fully because there are only so many "let's take a look at this other promising solution... which also doesn't work that well!" threads you can have in one video :) Thanks @Tumbolisu for your comment also laying it out, great thread.
The Lankinen algorithm could be improved dramatically by *not* performing the first copy: instead, allocate the n*2 array twice and copy from the original array into the second one before deleting the original and the first allocated n*2 array.
Really interesting idea :o
6:37 As "usual" (well, it's what we see in the empirical data: 2 is a better factor), wasting main RAM is always a good trade-off for performance. Unless we're talking about cache memory - then it's the opposite: wasting CPU cycles is better than wasting cache lines.
The memory allocator should provide a way to request a range: "I want between 1.5 and 2 times the size", and it could give you something in that range that reduces gaps.
In practice all of these algorithms use realloc; you're assuming that they don't. And as you said yourself, the difference between practice and theory is bigger in practice than in theory.
Very interesting video, but it neglects the influence of the memory allocator. Glibc-like allocation is far from the fastest nowadays, and in the case of pool-based allocators a growth factor of two is *much* more suitable.
I feel like you miss (/ didn't cover) two points which seem important to me:
To increase the contiguous memory region used by the Arraylist you don't always have to allocate, copy and free.
The operation happening is basically C's realloc, which will try to simply extend the current memory block before allocating a new one, moving the data over and freeing the old block. In reality that won't happen a lot, but the benefits are enormous when it does, especially for larger Arraylists (see the sketch after this comment).
The program using the Arraylist, let alone the Arraylist itself, doesn't live alone. Lots of other processes will also be dynamically allocating blocks of memory of varying size at the same time as the Arraylist. Following this thought,
a) your focus on the "new" block being able to fit into memory previously used by "old" blocks seems unreasonable. Chances are that memory is already in use elsewhere and not free; we can't know, and we also don't really care - that memory has already been freed by us.
b) your definition of memory efficiency seems unreasonable. Why care about memory which our process already freed instead of just the currently allocated one?
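Re the first point, a minimal sketch of that grow path through std::alloc::realloc (the pointer must have come from std::alloc::alloc with the same layout; the names are illustrative). The allocator may simply extend the block and hand back the same pointer, and only moves the data when it can't:

    use std::alloc::{handle_alloc_error, realloc, Layout};

    unsafe fn grow(ptr: *mut u8, old_cap: usize, new_cap: usize) -> *mut u8 {
        let old_layout = Layout::array::<u8>(old_cap).unwrap();
        let new_ptr = realloc(ptr, old_layout, new_cap);
        if new_ptr.is_null() {
            handle_alloc_error(Layout::array::<u8>(new_cap).unwrap());
        }
        new_ptr // equals ptr whenever the block could be enlarged in place
    }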
While I somewhat get where you are coming from, it is not really valid to consider other processes - memory is allocated within a process, and other processes are not going to be able to fill in the slots left behind unless that area is large (on the order of multiple pages) - because the memory allocator (e.g. the C library's malloc) within the main process is not going to return pages to the OS unless there are many such pages; it will just keep the entire arena alive, which is far less bookkeeping.
And within a single process the allocation behavior can be a lot more predictable - "the only allocations happening are a few dynamic arrays" is a pretty reasonable assumption for well-optimized small-to-medium scope programs.
Of course, if you reach the level of dozens of pages, then yeah, these considerations become far less relevant, and you will also get to the point where your allocations take up exact whole pages, and you should probably keep it that way.
I wonder, is there ever any case where, assuming a pure program (i.e. no I/O), you can't figure out an exact size to preallocate for an array? Or if there is such a case, how common is it? I wonder if the ultimate solution to array efficiency is better language design that can predict what size to preallocate.
I'm pretty sure this is essentially equivalent to the halting problem, so not possible in general. Also, you can often just preallocate the size you need manually, since you (as the programmer) have domain knowledge of the entire operation. Depending on how you build up your array, some languages also have support for size hints, like Rust iterators carrying lower and upper bounds, which can be used to preallocate a buffer to collect into.
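A small sketch of both preallocation routes (the function names are mine): letting an exact-size iterator carry the hint into collect, or reserving by hand when you know the final length:

    fn squares_collected(n: usize) -> Vec<u64> {
        // 0..n knows its exact length and map preserves the size hint,
        // so collect can reserve the whole buffer before filling it.
        (0..n).map(|i| (i as u64) * (i as u64)).collect()
    }

    fn squares_manual(n: usize) -> Vec<u64> {
        let mut v = Vec::with_capacity(n); // one allocation, no regrowth
        for i in 0..n {
            v.push((i as u64) * (i as u64));
        }
        v
    }

    fn main() {
        assert_eq!(squares_collected(5), vec![0, 1, 4, 9, 16]);
        assert_eq!(squares_collected(1000), squares_manual(1000));
    }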
Thanks!
Seeing all the memory usage benchmarks even out as more arrays are added makes me wonder what would happen if a growth factor of 4 or 8 was used instead.
I'm wondering why you used the final full array sizes to check whether a new array would fit before the old one: since you are essentially doing a memmove, the act of copying itself creates space. In fact, shouldn't you be able to fit the array before the current one every second growth?
I wonder how this scales with multiple arrays when we're allowed to expand the existing allocation into empty space?
Even after watching the explanation several times I still do not get the new algorithm at all. I think you may have tricked me into attempting to implement it myself...
What about realloc returning the same pointer if there's enough space to expand? Does this affect the ideal growth factor?
Great research and presentation, thank you! The memory gap you describe is certainly an issue in microcontrollers. Do you think it is also relevant in systems that implement virtual memory through an MMU?
This video made my brain itch a lot; something didn't seem right about it. Had a delicious coffee and a walk back and forth - high praise for a YouTube video, mind you - and I think it's realloc that bothers me. What I mean is that realloc, i.e. a function that has information about the underlying memory layout, changes these statistics completely. Realloc has been around on Unix since the start, and in C since the start as well. Basically, when you resize an array using realloc you will only copy if there isn't any free memory after the currently allocated block. Using your meaning of memory efficiency, this means a program with only one array that gets reallocated will always be at 100%. I can only guess here, but I would think that this is taken for granted in old papers from the 80s. An algorithm for resizing an array in C++ will use realloc and not malloc. Any frequency analysis presented in such a paper would be based on such an assumption, I would think.
In any case, excellent video as usual.
C can usually use realloc; C++ almost never can, for several reasons. First, only trivial types can be realloc'd; anything with a move constructor needs that constructor to be used (at least until and unless trivial relocation gets in). And second, because new/delete, as well as the allocator interface, don't have a realloc equivalent. Instead, the standard recently added allocate_at_least as a replacement for non-portable APIs like malloc_usable_size, which covers smaller sizes well, since those allocations usually can't be resized above their size class anyway, but once you are outside the largest size class in the allocator it won't do much. Another function, try_resize_alloc (or something like that), to complement it would be good. This would be like realloc, except that instead of doing the copying for you it would just fail, and thus wouldn't run afoul of C++'s object lifetime rules like C's realloc does, but I don't know of an active proposal for that kind of function.
Great video, thanks!
One question: looking at the animation of growing with factor 2x, it seems it would be fine to reuse the gap before the array, since by the time you reach the start of your current allocation you would already have copied that part out. I don't know how practical it is for an allocator to take advantage of this, but I'm interested to know if this is something that gets talked about.
memcpy can move memory much more quickly than individual inserts - how do these numbers change if moving an element between arrays only costs 1/16th of an operation? A normal insert has to write one place in the array and update the number of elements; copying between arrays is simply duplicating a range of bytes from one place to another.
The other interesting choice a language makes is the minimum size of an array - do you make that initial list 1 element, 10 elements, 100? The perfect list size is the one matching the number of elements that ultimately land in it, so you need no copies; the worst size is 1 element less than that. Memory pages are often 4096 bytes, so pre-allocating 1000 4-byte elements makes quite a bit of sense, as that roughly matches the page size, and it limits contention between adjacent objects - when 2 lists are in play on different CPU cores, they can be updated at the same time if they are on different memory pages.
I've heard that 2 is actually a bad growth factor when you consider that the allocator is also trying to minimize the number of copies of the array.
The question is: could we entirely eliminate the copying operation with the help of the kernel doing its virtual memory handling thing? I mean, if we were able to ask for an allocation that maps the old backing array's pages into the new backing array (unmapping the old one to prevent memory safety problems), we'd get a new array address with the new size and all the contents of the old one.
I have no idea if operating systems expose this kind of memory operation as a syscall, but it might be a winning strategy.
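Linux does have a syscall for this: mremap. A hedged sketch using the libc crate (an external dependency; the region must be a page-aligned mapping obtained from mmap, not a pointer from malloc):

    use libc::{c_void, mremap, MAP_FAILED, MREMAP_MAYMOVE};

    unsafe fn grow_mapping(old: *mut c_void, old_len: usize, new_len: usize) -> *mut c_void {
        // The kernel rewires page tables; with MREMAP_MAYMOVE the region can
        // be re-mapped at a new virtual address without copying the
        // underlying physical pages.
        let p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
        assert!(p != MAP_FAILED, "mremap failed");
        p
    }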
I feel like operating systems could implement a specific operation to grow an allocated region of memory that takes into account that the region holding the data will be deallocated after copying. That way, if an array of length n starts at location n and needs to be increased to 2n, and there is a gap from 0 to n-1, the n values could be copied into the 0 to n-1 gap and then the whole region from 0 to 2n-1 could be marked as allocated. This would make the 2x scheme as good as the other schemes discussed, and I feel like it's not too big of an overhead for the kernel to check for this case.
There are probably downsides to this idea, though. Is there any reason this doesn't exist?
For Fibonacci sizing, can you take the two previous lengths and add them up? Seems more efficient and reliable than floating-point multiplication.
Could you try actually growing with Fibonacci? I imagine storing the previous array size along with the current one and adding them together to get the new one. This obviously requires a bit of extra space in the array struct.
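A sketch of that (the type and field names are made up): keep the previous capacity next to the current one and grow by their sum, with no floating point involved:

    struct FibVec<T> {
        data: Vec<T>,
        prev_cap: usize, // capacity before the most recent growth
    }

    impl<T> FibVec<T> {
        fn new() -> Self {
            FibVec { data: Vec::with_capacity(1), prev_cap: 1 }
        }

        fn push(&mut self, value: T) {
            if self.data.len() == self.data.capacity() {
                let cur = self.data.capacity();
                let next = cur + self.prev_cap; // 1, 2, 3, 5, 8, 13, ...
                self.data.reserve_exact(next - cur);
                self.prev_cap = cur;
            }
            self.data.push(value);
        }
    }

    fn main() {
        let mut v = FibVec::new();
        for i in 0..100 {
            v.push(i);
        }
        assert!(v.data.capacity() >= 100); // grew roughly along 1, 2, 3, 5, 8, ...
    }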
fascinating!
Doesn't this mix allocation strategies and growth factor? You kind of need to co-optimize though. And for large arrays MMUs can avoid copying through the power of page tables.
I don't understand why two copies are necessary at all here. Why not just allocate the first resized array, then allocate the second and copy the data over from the original array, without freeing it after the first allocation?
Realloc singlehandedly carrying 2x growth factor: 🗿
I don't understand one thing: the way memory is allocated on computers is a black box - it's not guaranteed where the next pointer will be allocated, and memory gaps aren't something you can control yourself. What you can do, though, is preallocate, say, a gigabyte of virtual memory and continuously map it in on page faults.
I think that's what allocators are for
@@sunofabeach9424 allocators don't use this approach because it uses a lot of pages and virtual memory space, but it's a really good approach if you have a big dynamic list
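A hedged sketch of that reservation trick on Linux, via the libc crate (an external dependency); the kernel only backs pages with physical RAM once they're actually written:

    use libc::{
        mmap, MAP_ANONYMOUS, MAP_FAILED, MAP_NORESERVE, MAP_PRIVATE, PROT_READ, PROT_WRITE,
    };

    unsafe fn reserve_bytes(len: usize) -> *mut u8 {
        let p = mmap(
            std::ptr::null_mut(),
            len,                                         // e.g. 1 << 30 for a GiB
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, // reserve address space only
            -1,
            0,
        );
        assert!(p != MAP_FAILED, "mmap failed");
        p as *mut u8 // pages fault in lazily as the array grows into them
    }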
joke 1: me, using a zone allocator
joke 2, except not really a joke: why not switch between growth factors depending on how "hot" an array is?
When you go to resize, check when the last resize happened: if it's below some threshold, use a 2x resize, and if it's been a while since the last resize, use 1.5x.
kinda like what GCs do
Although then that introduces a SECOND, temporal growth factor - but I imagine you'd want to tune that one to the specific situation anyway.
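For what it's worth, a toy sketch of that switch (the 10 ms threshold and all names are invented for illustration):

    use std::time::{Duration, Instant};

    struct AdaptiveVec<T> {
        data: Vec<T>,
        last_grow: Instant,
    }

    impl<T> AdaptiveVec<T> {
        fn new() -> Self {
            AdaptiveVec { data: Vec::new(), last_grow: Instant::now() }
        }

        fn push(&mut self, value: T) {
            if self.data.len() == self.data.capacity() {
                // Recently resized ("hot") arrays grow by 2x, cold ones by ~1.5x.
                let hot = self.last_grow.elapsed() < Duration::from_millis(10);
                let cap = self.data.capacity().max(1);
                let new_cap = if hot { cap * 2 } else { cap + cap / 2 + 1 };
                self.data.reserve_exact(new_cap - self.data.len());
                self.last_grow = Instant::now();
            }
            self.data.push(value);
        }
    }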