Why is Radix Sort so Fast? Part 3: Comparison and Code, Radix Sort vs QuickSort

  • Published: 28 Aug 2024

Comments • 296

  • @BradleyBroom
    @BradleyBroom 4 years ago +305

    For when you can't use radix sort, there's a whole science to picking the pivot for quicksort. For large arrays, picking pivots close to the median gets you closer to O(n log n) and helps to avoid the worst case (O(n²)). Picking a single random value from the array is somewhere between those (since you're unlikely to pick the median). To pick a good pivot, pick several random values from the array and use the median of those. The science is how many random values to use as a function of n.
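    A rough sketch of what that might look like (hypothetical helper; choosing k as a function of n is the "science" part left open here):
    #include <algorithm>
    #include <random>
    #include <vector>
    int pickPivot(const std::vector<int>& arr, int lo, int hi, int k, std::mt19937& rng)
    {
        std::uniform_int_distribution<int> dist(lo, hi);
        std::vector<int> samples(k);
        for (int i = 0; i < k; i++)
            samples[i] = arr[dist(rng)];            // k random values from the range
        std::nth_element(samples.begin(), samples.begin() + k / 2, samples.end());
        return samples[k / 2];                      // the median of the samples
    }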

    • @WhatsACreel
      @WhatsACreel  4 years ago +66

      Really great stuff mate! I've usually gone with random in the past. Median sounds like a great choice! Cheers for sharing mate, and cheers for watching :)

    • @MrWorshipMe
      @MrWorshipMe 4 years ago +3

      In his benchmark it scales pretty much as N log N: a factor of 1000 for the input giving a factor of 10000 for the run time is what you'd expect.

    • @freespam9236
      @freespam9236 4 years ago +3

      @@WhatsACreel my pivot selection has been the 1st, middle and last element: find the median of those and use that as the pivot.
      So there are extra comparisons, but the worst case is hard to hit - in reality it only happens when somebody has deliberately designed input to hurt the pivot picking algorithm.

    • @sharpfang
      @sharpfang 4 years ago +51

      Yeah, picking the pivot close to median is easy! Just sort the array and pick the middle element!

    • @freespam9236
      @freespam9236 4 years ago

      @@sharpfang median of 3 elements is not that hard, just some if's
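      E.g. a branchy sketch (ties just resolve to one of the equal values):
      int median3(int a, int b, int c)
      {
          if ((a < b) != (a < c)) return a;   // a lies between b and c
          if ((b < a) != (b < c)) return b;   // b lies between a and c
          return c;                           // otherwise c is the median
      }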

  • @captainbodyshot2839
    @captainbodyshot2839 4 years ago +199

    std::chrono::high_resolution_clock is the way to go when you want to measure performance of anything in C++

    • @WhatsACreel
      @WhatsACreel  4 years ago +40

      Yes, good suggestion! I certainly could have used something more accurate :)

    • @anthonynjoroge5780
      @anthonynjoroge5780 4 years ago +38

      No, I'd say std::chrono::steady_clock is the best since it is, well, steady.
      high_resolution_clock is unstable in some implementations and is not recommended for measuring time spans.

    • @sharpfang
      @sharpfang 4 years ago +4

      Or at least repeat the run multiple times and sum up the numbers. Instead of ten 0s you get a measurable time from start to finish (subtract n times the duration of a dry run that just generates the array and copies it to the output unsorted).

    • @philipbotha6718
      @philipbotha6718 4 years ago +1

      @@anthonynjoroge5780 Sometimes it is the same clock. I've seen libs where high_res_clk is an alias of -steady_clk-system_clock. Edit: wrong clock.

    • @anthonynjoroge5780
      @anthonynjoroge5780 4 years ago +1

      @@philipbotha6718 I doubt that. I only know that to be true for system_clock and high_resolution_clock. However, please give me an example of such a library. I'd love to see it.

  • @TomStorey96
    @TomStorey96 4 years ago +55

    As if I needed any more distractions at the moment (it's hard enough to finish existing projects), now I want to write some sorting algorithms!

  • @yvrelna
    @yvrelna 4 years ago +75

    The weakness of Quicksort is that it cannot take advantage of partially sorted data, or data that's exactly or almost exactly reversed. In real life, you often have two or more sorted datasets that you want to combine into a single sorted whole, or a dataset that's mostly sorted but with just a few items in incorrect places. Indeed, the worst case scenarios for quicksort are usually sorted or reversed datasets.
    Some hybrid sorting algorithms like timsort or introsort can take advantage of that existing structure in the data and perform a nearly O(n) sort, while avoiding worst case scenarios and maintaining O(n log n) for any input.
    The standard sorting algorithm in the standard libraries of most programming languages is usually one of these hybrid sorts rather than a naive sorting algorithm like quicksort.

    • @WhatsACreel
      @WhatsACreel  4 years ago +21

      That's very often the case, yes! Certainly best to use a sorting algorithm that takes advantage of any patterns in the data! Cheers for sharing :)

    • @movax20h
      @movax20h 4 years ago +7

      Radix sort doesn't take advantage of partially sorted data either. Hybrids like introsort (quicksort-based) and timsort (merge-based) can be massaged to do these tricks; for radix sort that's kind of hard without some extra passes over the data. Radix sort will usually destroy any partial sorting of the data in its first step.

    • @mnslr
      @mnslr 3 years ago

      @@movax20h If you knew a priori that some selection of the data was already sorted, could you improve things by splitting off the initially sorted bit? I'm thinking something like this: copy out the sorted bit and replace its values in the modified inputArr with zeros/minimal/maximal possible values, except for the sorted selection's first and last (minimum and maximum) values. Then quicksort the inputArr using the retained first and last values as pivots. I'm having trouble thinking of a way to reintegrate the separately stored part, but it feels like maybe there's a way. Maybe some kind of binary search or something?

    • @benhetland576
      @benhetland576 3 years ago +1

      @@mnslr You could do a merge sort as the final merging step, which I suppose should even be doable in-place if the two parts (pre-sorted + qsort output) still remain in that same array.

    • @SevenRiderAirForce
      @SevenRiderAirForce 3 years ago +6

      "naive sorting algorithm like quicksort" I can just feel the freshmen CS majors sweating as they read that haha!

  • @CompanionCube
    @CompanionCube 4 years ago +33

    use this if you often need to toggle between 2 lines
    //* remove first / to toggle
    func1();
    /*/
    func2();
    //*/

    • @WhatsACreel
      @WhatsACreel  4 years ago +7

      That's awesome! :)

    • @DVSProductions
      @DVSProductions 4 years ago +8

      Or just use
      #if doFunc
      Func1();
      #else
      Func2();
      #endif

  • @NuriTANRIKUT
    @NuriTANRIKUT 3 years ago +3

    great series! very informative.
    a couple of things I might improve, in order of importance IMO:
    1) keep the GenerateRandomData() call out of the timing block. You don't need to measure random number generation.
    2) use std::chrono::steady_clock for better timing (milliseconds is enough, but you can go up to nanosecond precision). The important point here is to use double instead of the default int64_t:
    using milliseconds = std::chrono::duration<double, std::milli>;
    using std::chrono::steady_clock;
    auto start = steady_clock::now();
    // stuff
    auto end = steady_clock::now();
    double elapsed_ms = milliseconds(end - start).count();
    3) use std::vector (when possible: std::array) instead of C style arrays, for two main reasons: a) no need to new/delete, b) swapping the input and output arrays becomes as simple as std::swap(input, output), and it is O(1) - see the sketch below
    there's more small stuff, but hey, it's a rabbit hole :)
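    A minimal sketch of the swap trick in 3) (assuming the two-buffer passes from the video; the counting/scatter body is elided):
    #include <cstdint>
    #include <utility>
    #include <vector>
    void radixPasses(std::vector<uint32_t>& data)
    {
        std::vector<uint32_t> output(data.size());
        for (int shift = 0; shift < 32; shift += 8)
        {
            // ... count digits, prefix sum, scatter from data into output ...
            std::swap(data, output);   // O(1): just swaps the vectors' internal pointers
        }
        // 4 passes = an even number of swaps, so the result ends up back in 'data'
    }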

  • @GabrielPerez-ml2go
    @GabrielPerez-ml2go 4 years ago +4

    I was always scared of radix sort, thanks to your three part series I feel much much better, THANK YOU

  • @ridlr9299
    @ridlr9299 2 years ago +8

    Although it looked like radix performed better at all sizes, I think it’s important to clarify that if each of these sorts was done 1000 times for each given size, quick sort would indeed perform much better on the smaller sizes.

  • @xniyana9956
    @xniyana9956 2 years ago +3

    Wow. I never knew there was a sort that was so much faster than Quick Sort. I learned something here today. Radix Sort is nothing short of amazing!

    • @plasmaastronaut
      @plasmaastronaut 2 years ago +1

      quicksort is a misnomer. It's not that quick. The 'quick' means that if you don't want to spend time thinking about which sorting algorithm to use, you can quickly use quicksort for now and worry about improving the quickness later. Should be called lazysort, but then no-one would use it.

  • @galier2
    @galier2 4 years ago +34

    2 small optimisations are available for your RadixSort. memset and memcpy are a tick faster than loops, but the more important optimisation is that you don't need to shift and mask the input values: you can use a byte* instead of the uint*, incrementing it by 4 in the inner loop and by 1 in the outer loop (of course endianness becomes an issue, but in practice not - everything is little endian nowadays).
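    A hedged sketch of the counting loop with that trick (assumes little endian and 32-bit uints; 'pass' is which byte acts as the current digit; the caller zeroes the counts):
    #include <cstdint>
    void CountDigits(const uint32_t* arr, uint32_t n, int pass, uint32_t count[256])
    {
        const uint8_t* p = reinterpret_cast<const uint8_t*>(arr) + pass;  // byte 'pass' of element 0
        for (uint32_t i = 0; i < n; i++, p += 4)   // step 4 bytes to the same byte of the next element
            count[*p]++;                           // no shift, no mask
    }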

    • @AAA-de6gt
      @AAA-de6gt 4 years ago

      I don't think endianness matters. Big-endian just ends up with sorting bytes from the MSD.

    • @WhatsACreel
      @WhatsACreel  4 years ago +4

      Great suggestions! I really like this *byte idea! :)

    • @galier2
      @galier2 4 years ago +1

      @@WhatsACreel Theoretically it will not make a big difference though. It only pushes the explicit shifting/masking from the ALU to the load-store unit, where it is done implicitly. It can also introduce the partial register write stall that plagued Intel CPUs for a long time (when writing a byte register AL, BL etc., accessing the rest of the register (EAX, EBX or RAX, RBX) would result in a penalty). So careful testing is necessary.

    • @benjaminfacouchere2395
      @benjaminfacouchere2395 4 years ago +11

      There's a great video about modern compiler optimization: ruclips.net/video/w0sz5WbS5AM/видео.html
      The key takeaway is: compilers have become extremely good at unrolling loops and optimizing bit manipulations; if you are able to recognize the optimization easily, the compiler will too.

    • @Viewable11
      @Viewable11 2 years ago +1

      @@benjaminfacouchere2395 Another optimization: use constants instead of variables wherever possible, as this affects the CPU's branch prediction and speculative execution.

  • @sheridenboord7853
    @sheridenboord7853 4 years ago +2

    Amazing stuff. Love your work. I started on 8 bit machines back in 1978. You guys trying to learn: subscribe to this guy and watch all his stuff. Now Mr Creel, you've been doing assembler, so you could use the processor's clock counter to measure how long things take in CPU clocks. Forget the libraries. The magic x86 instruction is RDTSC - Read Time-Stamp Counter into EDX:EAX, opcode 0F 31.
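    A minimal sketch using the compiler intrinsic rather than hand-written asm (MSVC exposes __rdtsc() in <intrin.h>, GCC/Clang in <x86intrin.h>; note that on modern CPUs the TSC ticks at a constant rate rather than the actual core clock, and serializing fences are omitted here):
    #include <intrin.h>    // MSVC; on GCC/Clang include <x86intrin.h> instead
    #include <cstdint>
    uint64_t TimeInTicks()
    {
        uint64_t t0 = __rdtsc();       // read time-stamp counter
        // ... code under test ...
        return __rdtsc() - t0;         // elapsed TSC ticks
    }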

  • @kleetus92
    @kleetus92 2 years ago

    Fascinating series on sorting. It's been 3 decades since I messed with this stuff, and even then we didn't get this deep. Thank you for taking the time to explain this.
    Side note, has anyone told you that you sound very much like Eric Idle from the Monty Python crew? I don't mean that as defamatory.

  • @shyrealist
    @shyrealist 4 years ago +2

    I think you can do the radix sort in place by calculating the prefix sum as you go and performing swaps based on the intermediate values. It still requires the final pass per iteration, in the reverse direction, to put them in the correct order... Actually, on second thought, the swaps would have to be inserts (shifting elements). To work around that you could treat the original array as two buffers, and I think one of them would need to be circular... hmmm... I wish I had the time to explore this idea!

  • @roccov3614
    @roccov3614 4 years ago +18

    Am I right in thinking that the number of digits in the numbers affects radix sort but not quicksort? If so, I would have liked to see a set of tests where the number of elements is constant but the number of digits in the elements increases, to see what effect that would have.

    • @DerekRoss1958
      @DerekRoss1958 1 year ago

      That applies to sorting strings more than numbers. If you're just using the standard integer types used in the video, they have a fixed number of digits; it's just that all the leading digits are zeroes. So you can use the radix sort exactly as implemented in the video.
      The only place where numbers might cause a problem for radix sort is where you have a number type that's not implemented using the standard types. Even there (or for strings) you just need to re-implement radix sort with a variable number of passes, depending upon the length of the string or the number of base-256 digits in the number.
      And when sorting non-standard datatypes, quicksort can be affected too.

  • @movax20h
    @movax20h 4 years ago +2

    When comparing to quicksort, you should really use the built-in C++ std::sort. Granted, it is not required to be quicksort, but it is usually a hybrid of quicksort, in-place heapsort, and insertion sort for small arrays, sometimes with some additional tricks (like detection of partially sorted or reversed arrays, as in timsort, or other good pivot selection methods, i.e. median of 3 or something like it). For totally random data most of these tricks are not super important, but some minor details and micro-optimisations in std::sort are still relevant: optimising increments/decrements, index comparisons, and the form of the loop makes a difference.
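    For reference, the baseline in the benchmark would then simply be (sketch):
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    void SortBaseline(uint32_t* arr, size_t n)
    {
        std::sort(arr, arr + n);   // typically introsort: quicksort + heapsort + insertion sort
    }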

  • @cptechno
    @cptechno 4 years ago

    Thanks for bringing this topic up! I didn't study these sorts at university. I intend to use both the bucket sort and the radix sort to sort trees and graphs. These algorithms have wonderful extensions to more complex objects than numbers, and the number of variations on them opens up a lot of possibilities!

  • @johnvodopija
    @johnvodopija 4 years ago +1

    This was a terrific and engaging mini series. It gave me flashbacks of computer science at uni. Thank you 👍😎🇦🇺

    • @sylvainforget2174
      @sylvainforget2174 3 years ago +1

      Indeed. Also, Radix Sort was never even mentioned. 🇨🇦

  • @movax20h
    @movax20h 4 years ago +2

    Btw, for 64 bit integers radix sort might be a bit worse, unless you have a really big number of numbers. But for a lot of 32 bit or 16 bit integers it works blazingly fast. A trick is to decide the number of buckets before the start of the algorithm, based on the size of the key and the number of keys. I.e. if you have a billion large numbers, it probably makes sense to use Radix-65536, because a few extra MB for counts is nothing compared to the temporary array. Also, radix sort is reasonably easy to parallelize: in each phase, each thread updates the counts using atomics, or its own local copy of the counts, which are then aggregated (by one thread or by multiple threads; various methods are possible). Then the prefix sum is calculated (by one thread, or in parallel). Then the shuffling is done in parallel too, with some extra tricks to make it correct.
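    A rough sketch of just the per-thread counting phase (local histograms, then a single-threaded aggregate; the parallel prefix sum and shuffle tricks are omitted):
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>
    std::array<uint32_t, 256> ParallelCount(const uint32_t* arr, size_t n, int shift, unsigned nThreads)
    {
        std::vector<std::array<uint32_t, 256>> local(nThreads, std::array<uint32_t, 256>{});
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nThreads; t++)
            pool.emplace_back([&, t] {
                size_t begin = n * t / nThreads, end = n * (t + 1) / nThreads;
                for (size_t i = begin; i < end; i++)
                    local[t][(arr[i] >> shift) & 0xff]++;   // each thread owns its counts
            });
        for (auto& th : pool) th.join();
        std::array<uint32_t, 256> total{};
        for (auto& c : local)                               // aggregate the local histograms
            for (int d = 0; d < 256; d++) total[d] += c[d];
        return total;
    }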

  • @cacheman
    @cacheman 4 years ago +4

    I have an article on GitHub that covers sorting other numeric types and some radix sort optimizations, e.g. column skipping. It's called "Radix sorting from the ground up".

  • @mivoe99
    @mivoe99 4 years ago

    Great video series. Really enjoyed it. Although I had already learned about sorting algorithms at university, I hadn't taken a look at RadixSort until this series.

  • @elliott8175
    @elliott8175 4 years ago +3

    Nice video. FYI:
    *do {} while (/*code*/);* is the same as *while (/*code*/);* in C++.
    *if (originalArr == output) /*...*/* This section of the function isn't needed because it'll always be false: the number of shifts is a constant 4, an even number of buffer swaps.
    For optimisations:
    - We could put *count* on the stack to avoid an unnecessary heap allocation (using *new* ).
    - As others have pointed out, *memset* should double the speed of zeroing the *count* array.

    • @safulkin
      @safulkin 4 years ago

      A statically allocated count is not a good idea: what will happen when two different threads execute this code on different CPU cores? [spoiler] data corruption.
      int[256] is small enough for stack allocation.

    • @elliott8175
      @elliott8175 4 years ago

      @@safulkin Sorry, I meant allocate it on the stack. (I was meaning that *count* shouldn't be _dynamically_ allocated, and so chose the antonym _static,_ but of course that has its own meaning.)

    • @R4ngeR4pidz
      @R4ngeR4pidz 4 years ago +1

      If I'm not mistaken, in the case where the condition isn't met, do-while will always run the code at least once, whereas while won't.
      Is this different in cpp? o.o

    • @elliott8175
      @elliott8175 4 years ago +2

      @@R4ngeR4pidz A while loop will run the body of the code if the while condition is true - but to know if the condition is true it needs to run the code in *while(/*code*/)* . So it really is the same as *do/while* .

  • @chesshooligan1282
    @chesshooligan1282 4 years ago

    After three 15-minute videos explaining in detail why the radix sort is so fast, ends the series with...
    "It's probably got something to do with it."
    Great work, mate. Thanks.

  • @mikefochtman7164
    @mikefochtman7164 4 years ago +1

    Another small optimization, regarding the allocation of the output array and the extra logic to detect an odd number of pointer swaps and copy the results:
    instead, make the caller responsible for allocating the output array, and pass a pointer to it in along with the input array. Then the RadixSort() routine can just pass back the 'output' pointer. The caller can use the returned pointer for the sorted array, and also test it against what was originally allocated to deallocate whichever buffer is no longer needed. No need to copy results, and the onus is on the caller to allocate all memory and clean up. (The 256 element 'count' array could be static as well, so very minimal stack usage.)
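    A sketch of what that interface might look like (hypothetical signature; RadixSort returns whichever buffer ended up holding the sorted data):
    #include <cstddef>
    #include <cstdint>
    uint32_t* RadixSort(uint32_t* input, uint32_t* output, size_t n);  // defined elsewhere
    uint32_t* SortAndFree(uint32_t* data, size_t n)
    {
        uint32_t* output = new uint32_t[n];
        uint32_t* sorted = RadixSort(data, output, n);   // may be 'data' or 'output'
        delete[] (sorted == data ? output : data);       // free whichever buffer lost the last swap
        return sorted;
    }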

  • @Matt23488
    @Matt23488 4 years ago +2

    Yes, with radix sort, the greater than/less than comparison is built into the fact that the order in which we store the counts is itself an ordered list. You're leveraging compile-time information to help with speed. Of course, that's directly why you're using more memory.

  • @mshonle
    @mshonle 2 years ago

    One note on your quicksort here: in the worst case, it requires O(n) space (the space is the call stack). To fix this, you need two modifications: (1) on the recursive call, recurse on the smallest subarray first; and (2) change the second recursive call into a tail call [in C/C++, reassign the parameters and goto the function start; in Kotlin declare the function as tailrec, et cetera].
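    In C++ the two modifications might look like this (a sketch assuming a Partition(arr, lo, hi) helper that returns the final pivot index, as in the video):
    int Partition(int* arr, int lo, int hi);   // defined elsewhere
    void QuickSort(int* arr, int lo, int hi)
    {
        while (lo < hi)
        {
            int p = Partition(arr, lo, hi);
            if (p - lo < hi - p) {
                QuickSort(arr, lo, p - 1);   // recurse on the smaller (left) side...
                lo = p + 1;                  // ...loop on the larger instead of a second call
            } else {
                QuickSort(arr, p + 1, hi);   // recurse on the smaller (right) side
                hi = p - 1;
            }
        }
    }
    With this, the recursion depth is bounded by O(log n) even in the worst case.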

  • @crysicle
    @crysicle 4 years ago +1

    @What's a Creel Note: it's possible to get a more precise speed for the algorithms at the 0 - 10000 array lengths by running each algorithm n times and then dividing the total time taken by n. That way you wouldn't have a 0 sort time for the lower inputs, as the timer is given enough time to reach values greater than 0, and after the division you'd have fractional values showing the algorithm's speed.
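    A sketch of that harness (assuming the RadixSort signature from the video, and with data generation kept outside the timed loop):
    #include <chrono>
    #include <cstdint>
    #include <vector>
    void RadixSort(uint32_t* arr, uint32_t n);   // the sort under test (assumed signature)
    double AverageSortMs(const std::vector<uint32_t>& data, int runs)
    {
        using clock = std::chrono::steady_clock;
        auto start = clock::now();
        for (int r = 0; r < runs; r++)
        {
            std::vector<uint32_t> copy = data;   // copy rather than regenerate each run
            RadixSort(copy.data(), (uint32_t)copy.size());
        }
        return std::chrono::duration<double, std::milli>(clock::now() - start).count() / runs;
    }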

    • @khatharrmalkavian3306
      @khatharrmalkavian3306 4 years ago

      He's done that before (you can still see the loop in his test harness, but set to one iteration). He puts the data generation inside the test unit because he doesn't realize that a GB of RAM can store 268,435,456 integers, so it's okay to generate his test data outside the loop.

  • @torak456
    @torak456 4 years ago +3

    Was the pattern of Radix Sort using 2x the memory of Quick Sort something that held through larger arrays? Would have been cool to see that in the final chart at the end. Awesome video series, thank you! :)

    • @WhatsACreel
      @WhatsACreel  4 years ago +3

      Sorry, I glossed over that! It does continue, yes. Radix Sort, as used in the video, doubles the array, so it takes about double the RAM of QuickSort. Well, cheers for watching mate, and have a good one :)

  • @calypso1985
    @calypso1985 4 years ago +3

    A comment here, Mr Creel: I would like to see the RAM use for every instance in that table.

  • @horaciomlhh
    @horaciomlhh 4 years ago +10

    It would have been nice to see the memory ratio in the comparison table.

    • @kazedcat
      @kazedcat 4 years ago +2

      The memory ratio is 2, or rather it approaches 2 as the number of elements approaches infinity.

  • @goranjosic
    @goranjosic 4 years ago

    This is my first time watching some of your videos and this is great! Excellently explained.

  • @mechwarrior83
    @mechwarrior83 1 year ago

    Thank you so much for putting up this masterclass.

  • @morpheus_uat
    @morpheus_uat 2 years ago

    this was really an interesting adventure into the radix sorting alg, good job there mate

  • @najamhaq
    @najamhaq 3 years ago

    Good job on not giving links to the older parts, so people who land here first have a harder time finding where it all began.

  • @ferinzz
    @ferinzz 4 years ago

    Awesome stuff! Very informative series that demonstrates why the methods do become faster, and not just how to do the method itself. Thank you very much, I greatly appreciated this.

  • @gloweye
    @gloweye 2 years ago +1

    Good note at the end about exploiting patterns in your data.
    Also, many people think way too hard about sorts. If you're sorting 10 000 000 elements, it's sometimes better to check if you really need to sort it, or whether you can get it sorted some other way.
    Often, the best sort is whatever the standard library gives you.

  • @snnwstt
    @snnwstt 4 years ago

    There is also a "database sort". We know how many "pages" are required to hold the sorted result, since each page can hold a fixed number of elements (that number is our choice), so the memory allocation occurs just once. Then, taking one element at a time from the array to be sorted, we look up which page it should go into (each page keeps the min and max value it holds as book-keeping while adding or transferring data) and insert it into the page it belongs to if there is room; otherwise we move half of that page's data to a blank page (blank up to now - its memory had already been reserved), which is then properly linked. Since the amount of required memory is known at the start (TWICE the initial data), the allocation is done just once, and all the other manipulations are book-keeping. Note that a database "index" ranks the data by keeping the order of the index, not of the values, which works fine for strings, or floats, or even orderable structures of multiple "fields".
    I suspect it MAY be faster than a quicksort given that, in the first place, if you have to sort 1G elements they don't arrive 1G at a time, but one by one - making the database sort work from "day one" instead of having to wait for the 1G-th element to be known.

  • @jugoslavilic9761
    @jugoslavilic9761 1 year ago

    Great videos.
    Is it possible to somehow check the input list of data - whether it is just integer numbers, negative, positive, floating point, just letters, a combination of all, ...?
    So spend some time checking the data, get information on how much memory we have, and then decide which sort algorithm is the best - actually the fastest - for the given data and available resources.
    That would be very close to an ultimate, universal, fastest algorithm for any type of data.
    Of course this only applies if we do not know what type of data to expect.
    I know it's not just the type of data and the memory we have, so maybe include more information and checks.

  • @blarghblargh
    @blarghblargh 4 years ago

    I really like your explanations. For a programmer they are very easy to understand.
    My frustration with compsci books and articles is so often that they talk a little too abstractly, so, for example, they wouldn't really worry about trying to address the "bucket" problem in radix sort.
    One odd hunch I had while you were describing that problem: could you avoid allocating a second array if you used a radix of 2 to somehow do clever in-place bucketing, or would that not work because you might get too many items in a row for one bucket and it wouldn't balance out? I plan to work through that soon myself to find out, but figured I'd mention it before then.

  • @stephenevans8355
    @stephenevans8355 3 years ago

    Most of the data I deal with is already partially sorted with combinations of hill sorted, valley sorted and reverse sorted. This can be pathological input for Quicksort but makes another algorithm like CombSort11 shine.

  • @raunaquepatra3966
    @raunaquepatra3966 2 years ago +2

    I think the complexity of radix sort should be O(N log_b(max value)), where log_b is log with base b = number of buckets used. For most computers max value = 2^32, so the complexity becomes O(32N) = O(N) (with 2 buckets). But everywhere it's taught as plain O(N), which is misleading according to me.

    • @ichigo_nyanko
      @ichigo_nyanko 2 years ago

      k = log_b(max value) is always a constant, so it will always end up as O(kN) = O(N).

  • @bartekltg
    @bartekltg 4 years ago

    std::sort uses a hybrid sorting algorithm called introsort: as you mentioned, it is quicksort + heapsort (+ of course an O(n^2) sort for short subarrays, insertion sort I think). The quicksort uses a "median-of-three" (*) pivot. It works well for sorted, reverse sorted and random tables, a bit faster even than a random pivot - but there exist permutations that break it. This is where heapsort comes in: as the quicksort iterations progress, the recursion depth is checked, and if it is too large (greater than log[n]*const), the problematic region is sorted using heapsort (a bit slower in the mean case, but the worst case is n log n).
    *) It is the median of the first, middle, and last elements, not of random ones. What's interesting is that gcc uses the second, middle and last elements, and the median of those is inserted in the first place. Not sure why, but I would guess it gives fewer jumps in the code and is a bit faster.
    The main advantage of the median rule is that we are guaranteed the range contains at least one element not smaller, and at least one element not greater, than the pivot. This allows us to get rid of some if-statements.

  • @neverremindyou6172
    @neverremindyou6172 4 years ago +1

    Great vid again! This sure makes it look like, if I don't know how much there will be to sort, it would be better to just use radix sort straight up.

  • @hs3881
    @hs3881 3 years ago +1

    It would be great to see the shootout on strings of different lengths, too.

  • @pissfilth
    @pissfilth 3 years ago

    Well explained, Creel.
    Showing the actual data on slides, and the process steps, is quite helpful for any viewer.
    Cheers mate!
    Will there be an AVX512 example implementation of the Radix sort in the future? And some high precision timing measurement perhaps (like others also suggested in previous comments)? Preferably in clock cycles.
    It would make a nice tutorial video on how to tackle problems using SIMD: count in registers, shuffle data elements (or indexes of elements) in registers, and in the end you have a really useful, very(!) fast sorting algorithm.
    ...or maybe make it a competition: Can you make it faster? // viewer_interaction++; (no, I did not make that code myself yet :) )
    BTW: ***Your dad makes nice drawings***
    I myself had that hobby too - painting and drawing - until I discovered computers, and asm.

  • @mrmr4737
    @mrmr4737 2 года назад

    FANTASTIC Video! I appreciate it:) Cheers mate🐲

  • @blendpinexus1416
    @blendpinexus1416 3 years ago

    I like the comparison done in this series.

  • @cmdlp4178
    @cmdlp4178 4 years ago +5

    Small optimization:
    The count array could have been allocated on the stack; that would just be 4*256 = 1024 bytes. Today, stack usage and arrays on the stack are not a problem, as long as the arrays have constant size. In C there are variable length arrays, with the array size not known at compile time; the same thing can be done with alloca. If you declare an array with variable size, you should always check that it has a relatively tiny size, like 10 kB or less. On Windows you have a default maximum stack size of 1 MB.

    • @harrysvensson2610
      @harrysvensson2610 4 years ago

      "On windows you have a default maximum stack size of 1MB."
      Is that the maximum size an array can be when passed as an argument to a function (aka put on the stack)?

    • @mikefochtman7164
      @mikefochtman7164 4 years ago

      @@harrysvensson2610 You wouldn't normally pass an array as an argument, but rather a pointer to the memory location where the array is (or in C++ it could be a reference, but the effect is the same). The stack just has to hold 'sizeof(pointer)' bytes.

    • @benjaminfacouchere2395
      @benjaminfacouchere2395 4 years ago

      I don't really see the optimization here, care to explain it?

    • @phizc
      @phizc 4 years ago

      @@benjaminfacouchere2395 Allocating and freeing memory on the heap takes some time. If you sort a large array it's not going to matter much, but if you sort many small arrays that time is going to add up. Having the array on the stack is just changing the stack pointer, so it's essentially free.
      Replacing
      int* count = new int[256];
      with
      int count[256];
      and removing
      delete[] count;
      makes the count array live on the stack instead of the heap. It will be removed automatically when the function ends, and again, it's just changing the stack pointer, so it's free.
      Regarding the maximum stack size: every time the program enters a function it "creates" space on the stack for the variables that function uses. So if you make a recursive function that needs 1000 bytes of stack space, it will have used a million bytes once it's gone 1000 steps "deep";
      e.g.
      void rec( int x )
      {
          if( x < 0 ) return;
          int a[250];
          rec( x - 1 );
      }
      rec( 1000 );
      will use 1 million bytes (more actually, since arguments and return addresses are also put on the stack) before it starts to unwind. With a maximum stack size of 1 MB, you may run out of stack space and crash if something down the call stack also allocated a lot of data on the stack.

    • @benjaminfacouchere2395
      @benjaminfacouchere2395 4 years ago

      @@phizc Sure, you have a small one-time allocation penalty, but that's it.
      Obviously we won't use radix sort on small arrays, as other algorithms are more efficient there; also, a count array of 256 is overkill then.

  • @sebbes333
    @sebbes333 4 years ago +1

    *For more sort comparisons:*
    Check out *w0rthy* : ruclips.net/user/w0rthyAvideos
    For example: "Sorts - Classic" ruclips.net/video/CnzYaK5aE9c/видео.html
    Check out *Timo Bingmann* videos: ruclips.net/user/TimoBingmannvideos
    For example: "15 Sorting Algorithms in 6 Minutes" ruclips.net/video/kPRA0W1kECg/видео.html

  • @florianh.6256
    @florianh.6256 4 years ago +3

    An interesting topic related to this, and probably relevant for the many CS students who are obviously watching (judging by comments from previous videos), would be the effects of multithreading these things. Spoiler: quicksort is not stable enough, even with good pivots, to get as much out of it. Also, to justify the overhead you need more than 1e7 elements to begin with. Thinking about it, multithreading is a good example where the stability of an algorithm actually matters.
    If you want to spice it up: sorting is one of those things that can make really good use of GPU acceleration. Maybe run a CPU radix sort vs a GPU bubble sort on the same data set. So many fun topics around this to explore.
    Another nice topic for an addendum video: the practical side of things. In the wild, the choice of sorting algorithm is more or less academic. There is close to no cost/benefit in custom-fitting some std_lib sorting to your particular dataset. Yes, there are cases where every ms matters, particularly in lowest-end embedded or HPC implementations, but for your average job you just use some standard sort and are done with it. As with any algorithm you should be aware of its pros and cons, but do not obsess over reinventing the wheel. The "you can always come back later to fix it, when it breaks everything" mantra is sadly reality. As is trying to "outsmart" modern compilers with your newly learned asm knowledge.

    • @WhatsACreel
      @WhatsACreel  4 years ago +1

      Well, there is a GPU radix sort, yes. I remember watching a GTC talk on it a long time ago, though I couldn't tell you which one it was. I did actually make a multi-threaded version of the radix sort for this video series too, but scrapped it. I just split the data into 8, and then used a merge step at the end. It was ok. Mostly bottlenecked by the RAM reads, I think.
      Certainly a lot of fun topics, yes!

    • @nikilragav
      @nikilragav 4 years ago

      Hmm, how do you even parallelize radix sort for a GPU? I was thinking about this yesterday - it is still iterative along the number of digits: you have to do one digit first, reshuffle the numbers, then do the next digit, so I'm not sure there's much of a speedup. Also, you'll run into weird race conditions if you try to change the same memory location from multiple GPU kernels/threads. Have to think more about how to avoid that.

    • @florianh.6256
      @florianh.6256 4 years ago +2

      @@nikilragav There is a ton of literature on the topic. I think one of the best illustrations of the idea is in "HARRIS M. Parallel prefix sum (scan) with CUDA. GPU Gems 3", as it also addresses the implementation "problems" you face on GPUs. There have been quite a few optimizations of the idea, and GPUs have better implementations nowadays (and more cores, and shaders to play with) - for example "Fast 4-way parallel radix sorting on GPUs".
      Another thing to keep in mind: GPUs do not use the same memory system as CPUs; it is quite different in its usage. There are also tons of absolutely unintuitive parallel algos that one just doesn't know about when starting with the topic.
      There is also other neat sorting stuff from the last few years; worth looking through their cites too: doi: 10.1109/ACCESS.2019.2920917 or doi: 10.1155/2020/4793545

    • @nikilragav
      @nikilragav 4 years ago

      @@florianh.6256 Hi! I *briefly* looked at that one after I wrote my comment. Yep I'm somewhat familiar with GPU algos and memory allocations

    • @Akronymus_
      @Akronymus_ 4 years ago

      At work, I often found myself getting the best performance with insertion sort, e.g. inserting events into a file from networked sources based on timestamp. The "better" sorting algorithms are often worse, depending on your dataset.

  • @pleilleur
    @pleilleur 2 years ago

    Thank you! Interesting and fun video

  • @darske1
    @darske1 4 years ago

    Awesome video as always, sir
    Thank you very much!

  • @az-kalaak6215
    @az-kalaak6215 3 years ago

    If you know the size of the array in advance, then radix sort has a huge advantage, since you can just use a template to allocate the array at compile time, saving time; you can even do the same for the counter array, again saving time.
    Unfortunately, initializing them to 0 will not be reduced by this.
    It will even remove all allocation, and hence all deallocation, leading to a faster and more stable program.

  • @_notch
    @_notch 3 years ago +2

    You're including the time of generating the array in the time it takes to sort the array. For an O(n) sort, this is not a trivial overhead, but it is for O(n log n). This makes the test skewed against radix sort.

    • @WhatsACreel
      @WhatsACreel  3 years ago +1

      Good point, I should certainly have removed that from the timings!
      You are a legend! Thank you for stopping by :)

    • @_notch
      @_notch 3 years ago +1

      @@WhatsACreel Hey, little mistakes keep me on my toes, I suppose! Then the trick is to just try to do enough research to make sure the feedback is correct, but I like to gloss over that part.
      Lovely channel. Not sure I've seen anything like this before!

    • @WhatsACreel
      @WhatsACreel  3 years ago +1

      @@_notch Well I certainly make lots of mistakes. Looks like solid advice to me mate!
      Cheers for the kind words! Hey, reminds me of a game I played, never saw anything like it! Built a sweet little house with a farm by a river, then I fought off zombies and found some diamonds!!
      Hahaha, magic! If you’re ever down this neck of the woods give us a yell, I reckon I can find a bbq and a cold beer :)

    • @_notch
      @_notch 3 years ago +1

      @@WhatsACreel That sounds lovely!

  • @NoorquackerInd
    @NoorquackerInd 3 years ago +1

    I'm _real_ happy I only sort by Unix timestamps

  • @melickon
    @melickon 3 years ago

    Great stuff! P.S. The final table is missing a required-RAM column.

  • @DavidLindes
    @DavidLindes 2 years ago

    Cool series; thanks! Not sure how I managed to go this long without having this algorithm on my radar. I suppose just because it's relatively scope-limited to fixed-length unsigned integers, so its utility is near-zero in other contexts?? Anyway, still cool to know about, and has its place. I imagine it could also be used to pre-sort other kinds of data, if one can easily compute a fixed-length unsigned int that represents the most significant sorting components (e.g. casting char[] arrays, to effectively sort by the first few characters)? Interesting stuff.

  • @davidjames1684
    @davidjames1684 3 years ago

    I think a really good sort algorithm would be one that first scans the data to be sorted (size, type, randomness, min/max data values...) and, based on that, selects the most appropriate sorting algorithm. For example, it could have 10 complete sorting algorithms available to itself, but decide which one to use based on the data. Some of those 10 could even be combination algorithms involving pieces of different subalgorithms.
    Also, I wanted to mention that heapsort has a simple tweak: instead of just plucking the root node each time, you can pluck the root and one of its children (if available). For example, if we had a maxheap with the first 3 elements being 10, 4, and 6 (in that order), instead of just plucking the 10 and then reheapifying, we can pluck the 10 and the 6. Someone actually benchmarked this for me, and it was about 3% faster than the one-element-popping heapsort. A mild improvement, but hey, if it works and it is faster, why not use it?

    • @JirayD
      @JirayD 2 years ago

      Isn't that what introsort is all about?

  • @Vaaaaadim
    @Vaaaaadim 4 years ago +1

    A nice short video series. The prefix sum idea was especially clever.
    Could you possibly do even better if you try to optimize the sorting algorithm to make better use of locality? By that I mean, this seems like it would be doing fairly random accesses to memory, whereas having accesses that are mostly linear would perform better. To my understanding, having fairly random accesses causes page faults.

    • @WhatsACreel
      @WhatsACreel  4 years ago

      Hmmm... that's an interesting thought. The data is random, so there will have to be randomness somewhere in the algorithm - in a QuickSort the randomness is in the elements that we swap. If we use a small radix, we get much better locality, but at the cost of more iterations. I can't think of a way out other than finding a nice compromise. That seems to be about base 256 radix sort: the counts are small enough that they fit into the L1 cache, but large enough that there's not too many iterations. Well, cheers for sharing this thought, it is very interesting! And cheers for watching :)

    • @Vaaaaadim
      @Vaaaaadim 4 years ago

      @@WhatsACreel
      Perhaps one could make buffers for each possible radix digit (symbol? value?), sized such that the L1 (or possibly the larger caches) can fit everything. Once a buffer is totally filled, we flush its contents to where the data is supposed to go. Then the page faulting would only occur once per buffer flush.

  • @kestans
    @kestans 4 years ago +1

    I had some data to extract from a Siemens PLC database to make a monthly report in Excel, or to push the report on the fly to the web interface. What I did was read the month's data into RAM and save the indexes where each new day's data starts, then create multiple threads on the CPU so that each thread handles the calculations for one day's data (each thread receives the memory address of the data and the index locations of that day's data). So it runs e.g. 30 threads, and on a computer with more cores it reduced the time needed for the calculations by as much as the extra cores allowed.

  • @dm121984
    @dm121984 4 years ago +2

    I wonder if you can use radix-type sorting on classes and the like. For example, a string: you could radix sort character by character, I think; I just wonder if the overhead would kill it. For other data types, for example a class with 3 numbers (hours, mins, secs), I imagine you could radix by seconds, then minutes, then hours.

    • @WhatsACreel
      @WhatsACreel  4 years ago

      It's certainly capable of sorting strings, yep! It's pretty flexible, really, so long as you can figure out how to read the data as something like unsigned int you're good to go! Cheers for watching mate :)

    • @sakari_n
      @sakari_n 4 years ago

      You are correct about sorting data with multiple key values per element: because radix sort is a type of stable sort, you can sort starting from the least significant key value to the most significant (like in your example: first sort by seconds, then by minute, and last by hour).
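      A small sketch of that least-significant-key-first idea, with std::stable_sort standing in for a stable radix pass per field:
      #include <algorithm>
      #include <vector>
      struct Time { int h, m, s; };
      void SortTimes(std::vector<Time>& v)
      {
          std::stable_sort(v.begin(), v.end(), [](const Time& a, const Time& b) { return a.s < b.s; });
          std::stable_sort(v.begin(), v.end(), [](const Time& a, const Time& b) { return a.m < b.m; });
          std::stable_sort(v.begin(), v.end(), [](const Time& a, const Time& b) { return a.h < b.h; });
          // most significant key last: stability preserves the earlier orderings
      }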

  • @harrysvensson2610
    @harrysvensson2610 4 years ago +1

    It would be fun to hear if any of the algorithms could be sped up by using SIMD functions, and then by utilizing locality (maybe structs with alignas?) for the cache lines to reduce the number of cache misses.
    Optimize it to the ground!

    • @kungfujesus06
      @kungfujesus06 4 years ago

      It can be and has been. Intel's Integrated Performance Primitives (IPP) provides a very nice radix sort for all integer widths as well as floats.

    • @controlledsingularity8084
      @controlledsingularity8084 4 years ago

      gpgpu will smoke cpu simd.

    • @kungfujesus06
      @kungfujesus06 4 years ago

      @@controlledsingularity8084 For a very branchy, integer-based operation? Maybe, but probably not by the margin you think it will. Do any of the STL-like utility libraries for CUDA have a radix sort to compare it to? Likely the copy latency will be the longest part.

  • @GRHmedia
    @GRHmedia 7 months ago

    The radix sort should effectively require 2x the data size in memory, while the quicksort should be able to work within the data's own space.
    If one used a thread pool, the quicksort could be improved a good bit in performance. I doubt it would catch up with radix sort though: it would require waiting until each division is done before splitting the workload.

  • @DFPercush
    @DFPercush 4 years ago

    Fascinating video series, thanks for educating us. :) I will say though, I don't often find applications where I only need a list of numbers sorted. The numbers are often attached or keyed to a data structure, and the whole structure needs to travel with the number, or at least be referenced somehow. Combined with something like a hash map to quickly... well, map the numbers back to their containing structures, this could be a nice intermediate step. But if you've already constructed a hash map, why would you need a sort... *scratches head for a minute* ... Actually, if you can pack the number to sort, as well as an index to the original structure, into a single integral value, perhaps an int64 or just a byte array, and only do the radix sort on the least significant bytes, then the indices or pointers would be part of the sorted data. Yeah... I'll have to try this some time. Comparison sorts are all I learned in school; it's very interesting to see other approaches.

    • @DFPercush
      @DFPercush 4 years ago

      So, follow up... I wrote a radix sort template class. It can do integers, strings, and any structure really, as long as it can be represented as a sortable buffer of bytes. I haven't tried negative numbers yet though. The way I ended up doing it was to allocate two buffers of indices into the original data, in and out, with a buffer swapping approach. A little heavy on the memory, but very flexible. That also gives the option of returning just the list of indices without modifying the data (view()), or doing swaps on the original array (sort()). It relies on two functor classes to obtain the size of the inputs and evaluate the sortable bytes, but it was trivial to default those for ints and make a "prefab" for strings that you can pass. It also has a persistent buffer option and preAlloc() to avoid those malloc penalties for repeated calls.
      sorter rs; rs.sort(data,size);
      or
      sorter rs; rs.sort(data, size);
      Pretty handy! Thanks again for the inspiration!
      If anybody wants it, github.com/DFPercush/RadixSort
      I'll keep working on getting that negative number thing... sorted.

  • @hugmynutus
    @hugmynutus 4 years ago +16

    5:35 Are you on an Intel CPU? There is a big performance penalty for 16-bit ints on Intel CPUs.

    • @WhatsACreel
      @WhatsACreel  4 years ago +13

      I sure am! Now I'm looking at the speeds of 16 bit instructions against 32, and it does seem that some of them are strangely slow...? How odd. Mostly they're ok though, at least on this Kaby Lake. Still, it really does make me wonder: why was 16 bit so slow?? I just assumed it was the zeroing of the counts. Could certainly be wrong! Great info mate, cheers for sharing, and cheers for watching :)

    • @Waccoon
      @Waccoon 4 years ago +14

      @@WhatsACreel x86 instructions favor 32-bit and 8-bit operations by default and switch between them using a single bit. 16-bit operations require an operand-size prefix byte, so 16-bit instructions are both slower and larger.

    • @jeffbeasley8235
      @jeffbeasley8235 4 years ago +2

      Would using the "fast" types help with this? I'm only familiar with C and not C++, but I imagine they exist in this language as well.

    • @AAA-de6gt
      @AAA-de6gt 4 years ago +2

      @@jeffbeasley8235 I think usually uint_fast16_t is 32-bit on x86.

    • @meneldal
      @meneldal 4 years ago +4

      I don't think it has much to do with the speed of 16 bit vs 8 bit instructions when it comes to performance in relation to the size of the radix. There are multiple issues at hand. The most obvious is that your table of 256 ints will fit in cache, and a 64k one probably won't; and because you do random accesses to it (the data is random, so you're as likely to increment any count as any other), you lose time on cache accesses.
      Another point is that using 64k counts means you need to compute the partial sum of a 64k array, which obviously takes much more time than the 256 array (128 times more in total, considering you do it only twice instead of 4 times). That's a significant constant time added, which will definitely stand out for smaller arrays.
      It is important to note, however, that the variable part of the time is theoretically halved with a 16 bit radix (going over the array half as many times), and since the AND instructions would be 32 bit anyway (since your input is 32 bit) - and even then I doubt they'd be twice as slow - memory access is most likely the reason.
      I believe that using a 16 bit radix could be better if you start multithreading this, because you won't create as much contention on the various array elements. However, while the counting is trivial to parallelize, putting the elements in place is not; you'd have to use some clever tricks there.
      Also, I tried checking the disassembly on godbolt: the big loop isn't unrolled, which is surprising, since you'd probably get some extra performance by avoiding the AND in one of the passes (since the result will always be smaller than the value you're AND'ing it with).

  • @overlisted
    @overlisted 2 years ago

    A funny thing I just thought of while watching this: you can use radix sort to sort strings alphabetically (letters = digits, radix = the size of the alphabet)! Is this how programs usually do that?

  • @YerushalayimShelZahv
    @YerushalayimShelZahv 4 years ago

    love your videos! If you get a chance, would you mind doing a deep dive into hash maps vs arrays for linear search?

  • @RockinMax
    @RockinMax 3 years ago +1

    I found your 2 previous videos about Radix Sort great! Then I saw the curly braces on a new line.

  • @AlessioSangalli
    @AlessioSangalli 3 years ago +3

    Fantastic mini-series; absolutely fantastic. However, the part where you measure the performance of sorting 10 or 100 elements is cringe 😅😊 Come on, you should have started at 1.000.000 🤣🤣

  • @FalcoGer
    @FalcoGer 3 years ago

    Allocating stack memory is basically free (just adjust the stack pointer register).
    malloc/new is where the overhead comes in: asking the operating system for memory is a big deal.

  • @NicosLeben
    @NicosLeben 3 years ago

    The process memory is shown in real time, so when you are looking at it the sorting is already done and all the memory has been freed again. Look at @11:54 and @12:21 in the process memory graph: you can see it spike at the beginning, and then it's freed again.
    I am also curious why you are measuring the time of the random number generator too. Yes, it is just O(n), but for a comparison between the two sorting algorithms you can't just add the same overhead to each time.

  • @alex_smallet
    @alex_smallet 4 years ago +1

    Great series. But why didn't you mention that radix sort can start from either end? I.e. there is LSD (the one you covered) and MSD sort, which is basically the same but starts from the leftmost digit.

    • @WhatsACreel
      @WhatsACreel  4 years ago

      Yes, sorry. It's just I had the code for LSD and some animation ideas. No real reason other than that. Maybe we can do another vid in the future on MSD? I'm not sure. Both are fine algorithms, and cheers for watching :)

  • @pw6048
    @pw6048 3 years ago +1

    For radix sort, would it matter if you scanned the input only one time and built 4 count arrays, instead of scanning 4 times?
    So instead of:
    for (int shift = 0, s = 0; shift < 4; shift++, s += 8)
    {
    // Zero the counts
    for (int i = 0; i < 256; i++)
    count[i] = 0;
    // Store count of occurrences in count[]
    for (int i = 0; i < n; i++)
    count[(arr[i] >> s) & 0xff]++;
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i < 256; i++)
    count[i] += count[i - 1];
    // Build the output array
    for (int i = n - 1; i >= 0; i--)
    .........
    something like:
    // Zero the counts
    for (int i = 0; i < 256; i++)
    {
    count[0][i] = 0;
    count[1][i] = 0;
    count[2][i] = 0;
    count[3][i] = 0;
    }
    // Store count of occurrences in count[]
    for (int i = 0; i < n; i++)
    {
    count[0][(arr[i] >> 0) & 0xff]++;
    count[1][(arr[i] >> 8) & 0xff]++;
    count[2][(arr[i] >> 16) & 0xff]++;
    count[3][(arr[i] >> 24) & 0xff]++;
    }
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i < 256; i++)
    {
    count[0][i] += count[0][i - 1];
    count[1][i] += count[1][i - 1];
    count[2][i] += count[2][i - 1];
    count[3][i] += count[3][i - 1];
    }
    for (int shift = 0, s = 0; shift < 4; shift++, s += 8)
    {
    // Build the output array
    for (int i = n - 1; i >= 0; i--)
    This way initializing would be a little faster, everything would still fit in the L1 cache, and you don't scan the input 4 times - more work per loop.
    The building of the output arrays still has to take place 4 times.
    But would this be faster? (I'm not good enough at C++ to test this myself.)

    • @surters
      @surters 3 months ago

      The quad counting seems like a nice idea, write a paper and post the address here!

  • @tektronix475
    @tektronix475 3 years ago +1

    Image processing in assembly: any advice about books, tutorials, forums, info? Thanks!

  • @clock4883
    @clock4883 4 years ago

    Once again! Amazing video. Keep it up! :D

  • @FernandoTakeshiSato
    @FernandoTakeshiSato 2 years ago

    Currently revising the fundamentals in preparation for interviews... I had never actually taken the time to implement radix sort; quite interesting! I guess it doesn't matter what base the "digits" you extract from the numbers are in: the algorithm works beautifully, provided you adjust the count / prefix sum array length accordingly. Ended up with far more comments than code 🤣 Thanks for the video!

  • @TheEmolano
    @TheEmolano 2 years ago

    One thing I don't get is why people say that radix can't sort complex objects. As long as you have a key function that returns an integer, anything can be sorted.

  • @soberhippie
    @soberhippie 4 years ago +1

    Would using other moduli (apart from 10) give better or worse results? Or rather, how would it affect the speed, and what are the things to consider when choosing the base for the modulo?

    • @WhatsACreel
      @WhatsACreel  4 years ago +1

      Yes, it can make a big difference! Mostly, we stick to powers of 2 for the modulus, so we can use boolean AND instead of dividing to get the digits. Cheers for watching mate, have a good one :)

    • @soberhippie
      @soberhippie 4 years ago

      @@WhatsACreel Thank you for your videos, they are not only educational, but a lot more fun than other similar ones )

  • @ThePandafriend
    @ThePandafriend 4 years ago +3

    Great video, but one thing I miss is how much RAM the algorithms use and how that changes with the number of elements in the array. Maybe a graph would be nice.

    • @billy65bob
      @billy65bob 4 years ago +5

      Doesn't really matter here: quicksort needs basically no overhead, and radix sort will double it, because it needs a temporary work area.
      In other words, if you pass in 2GB of numbers (that's over 500 million ints), quicksort needs nothing extra; radix sort will need an additional 2GB.

    • @mshonle
      @mshonle 2 years ago

      @@billy65bob Quicksort has a recursive call, so it requires the stack. The problem isn't how much time the stack allocation takes (as you say, it's very fast); the problem is that in the worst case it requires O(n) calls. To imagine the worst case, consider a scenario where the pivot element only reduces a subarray of s elements into two smaller subarrays of size (s-1) and 1. All of those remaining subarrays of size 1 "pile up" as something remaining to be done in the worst case. If your input array is 2GB, accounting for just the word size of the return pointer, and having the stack grow n/4 times, then you've already required a 4GB stack! Causing, as they say, a stack overflow.
      The fix isn't so bad: turn the second recursive call into a tail call, and force the remaining (first) recursive call to operate on the smaller subarray first.

    • @billy65bob
      @billy65bob 2 years ago

      @@mshonle Quicksort doesn't work that way; it's an unstable in-place sort and requires no temporary working area, except to store a value for a swap.
      You are right that it's not free and that its recursion requires stack space, but this cost is trivial, even if the compiler cannot do tail recursion optimisation.
      The worst case would be O(log n) levels deep; more than small enough to comfortably fit in 1 kB of stack.

    • @mshonle
      @mshonle 2 years ago

      @@billy65bob Needing O(lg n) stack entries is the BEST case - and, without tail call optimization, you'll have O(n) stack entries in the WORST case. If you are sorting anything large, then the "trivial" linear space will certainly overflow your stack!

    • @billy65bob
      @billy65bob 2 years ago

      @@mshonle I really don't give a toss; if your quicksort has a space cost of O(n), then you have screwed up, and catastrophically so.

  • @AutomaticHourglass
    @AutomaticHourglass 2 years ago

    @creel Did you try using SSE or AVX2? It might give you a significant performance boost on radix sort, since you can mask 16 8-bit values (SSE) or 32 8-bit values (AVX2) at the same time. I'm curious about the results..

  • @raunaquepatra3966
    @raunaquepatra3966 2 years ago +1

    Can we do radix sort for floats?

  • @NorbiPeti
    @NorbiPeti 3 years ago

    Nice series

  • @AndreSomers
    @AndreSomers 3 years ago

    As quicksort (& friends) use a divide & conquer approach, they are relatively easy to parallelize, which does not help the big-O speed of the algorithm but does help the wall-clock speed. How well does radix sort scale in this way? I don't see an obvious way to do that in the steps you described, except perhaps for chopping up the input array into parts, radix-sorting those, and then using a merge sort on the sorted parts.
    Also: how useful (if at all) is this technique if you don't just want to sort a bunch of integers, but a collection of structs that you want to sort by a field that is an integer?

    • @CrystalLily1302
      @CrystalLily1302 2 years ago

      Very late reply, but the way you could do it is by using the fields as keys and then reiterating over the list of integer elements to create a sorted list of the structs. It would be slower than a quicksort unless you have something in the 5k-10k range of structs/objects, but it's theoretically useful if you have massive object systems that need to be sorted.

  • @Gutagi
    @Gutagi 3 years ago +1

    I'm a beginner programmer and would like to know how to easily make a function getDigits(number, base).
    I managed to make a base 10 one; however, other bases don't work...
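    For any base >= 2, the same divide/modulo pattern works (a minimal sketch, least significant digit first):
    #include <vector>
    std::vector<int> getDigits(unsigned number, unsigned base)
    {
        std::vector<int> digits;
        do {
            digits.push_back(number % base);   // the current lowest digit
            number /= base;                    // drop it
        } while (number > 0);
        return digits;                         // e.g. getDigits(26, 2) -> {0, 1, 0, 1, 1}
    }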

  • @wedusk
    @wedusk 4 years ago

    Thoroughly enjoyed. Reminds me of uni!

  • @unvergebeneid
    @unvergebeneid 3 years ago

    Is there a reason those two loops at the beginning of the Quicksort code are do while loops instead of simple while loops? Since the body is empty, I can't see how they would behave differently.

  • @vukpsodorov5446
    @vukpsodorov5446 4 years ago +1

    I was wondering: returning the radix sort counts to 0 runs through 256 iterations, when in reality, if we're dealing with decimal digits, you can only have 10 of them (0-9), as you showed in the previous video. I'm not sure if I just missed something, but I think there's a bit more performance to be had (not much though, considering it only does that once per main iteration, of which there are only a few). Alternatively, if you do use a larger array of counters, could you dodge returning them to 0 altogether by moving to the next 10 counters in the array each iteration (so the first iteration uses 0-9, the second uses 10-19, and so on)? You never return any to 0, and then you just delete the whole array as you do when you're done. I haven't given it much thought or experimented with it, but I figure it's doable and would likely increase performance (although I guess that would only really matter when dealing with some seriously large numbers in the array, with lots of digits, where every instruction in that part of the sort counts).

    • @WhatsACreel
      @WhatsACreel  4 years ago

      Yes, sorry, it was using base 256. So there are 256 digits: instead of 0 to 9, each digit goes from 0 to 255 - one byte. It's just easier for the CPU.
      I really like the idea of using 4 count arrays to avoid the zeroing! If we could be guaranteed that malloc would zero for us, we'd be good as gold! I'm sure that would save some time! I don't think malloc zeroes anything though, so it mightn't save us time and we'd probably have to do it ourselves :( Haha, still, cheers for watching mate, and cheers for the suggestion :)

    • @vukpsodorov5446
      @vukpsodorov5446 4 years ago

      @@WhatsACreel Well, I suppose you could set all of the counts to zero when initiating the array (or at least I think so, if we're thinking of the same problem).
      And in any case, thank you for these videos!

    • @mikefochtman7164
      @mikefochtman7164 4 years ago

      @@WhatsACreel Might even try something like memset(count, 0, 256*sizeof(count[0])). The '256*sizeof(count[0])' would be resolved at compile time to something like 'memset(count, 0, 1024)' (assuming 32-bit int), so this is probably the fastest way to zero out a block of memory. No overhead setting up a for() loop or indexing and such.

  • @HeyItsSahilSoni
    @HeyItsSahilSoni 4 years ago +1

    You probably should have put the code in a gist or something; the youtube description is not the best place for code.
    Nonetheless, great video!

  • @juniorjedi2563
    @juniorjedi2563 3 years ago

    Couldn’t you initialize the counts to -1 instead of 0 to skip the subtraction?

  • @sykeassai
    @sykeassai 2 years ago

    I came into this at video 2, and the explanation of radix sort makes a lot of sense, but I'm left with a couple of questions.
    I did some basic programming in my degree about 12 years ago and I honestly can't say I remember all that much of it,
    but I am wondering: 1) what language is that written in? (I'm guessing Visual Studio? In which case, if I want to relearn programming at some point, that is maybe where I should start writing software?)
    2) what is the program used to run that code?

    • @pkmnfrk
      @pkmnfrk 2 years ago

      The language is C++. The compiler/debugger is Visual Studio. If you are looking to get back into programming, I suggest Python instead of C++; it's much easier to use, with many fewer gotchas. The downside is that it's not nearly as fast as C++.

  • @j.p.hendrix4389
    @j.p.hendrix4389 4 years ago

    I'm trying to figure out whether radix sort turns into a special case if the bucket size is one bit. So instead of dividing by 256, dividing by 2. From an assembly point of view this could be interesting for code size, I guess?

  • @AndrewLapteff
    @AndrewLapteff 1 year ago

    AMAZING!

  • @kortaffel
    @kortaffel 4 years ago +2

    Have you watched "15 Sorting Algorithms in 6 Minutes"? Great visualization of all the sorting algos.

  • @bagok701
    @bagok701 3 years ago

    I wonder if radix sorting by the most significant digit instead of the least would be faster: could reads of memory near another memory address be binned? Or would it be worse if the memory modules allow reads from different parts of memory in parallel?

  • @AC-bi3bz
    @AC-bi3bz 3 years ago

    Thanks for this amazing series of videos ... explained well and entertaining. Chapeau!

  • @nebulagaming678
    @nebulagaming678 2 years ago

    Hi guys! How would I extend this function to work with 64-bit and potentially 128-bit numbers? Thanks!

  • @wazydotmeter
    @wazydotmeter 4 years ago

    Have a contest between different users' rigs to see whose is the fastest, using the samples of code from the video?

  • @kashgarinn
    @kashgarinn 4 years ago

    What would be the fastest sort for lexicographical sorting of double floating-point (x,y) datapoints?

  • @murrayhorn8817
    @murrayhorn8817 3 years ago

    It would be interesting to see this on a RISC processor like ARM or RISC-V, with a radix of 16 compared with 256. This could allow the counts to be stored in 16 of the 32 registers, as opposed to memory. Would that speed things up enough to compensate for double the number of passes compared to a radix of 256?

  • @fev4
    @fev4 4 years ago

    This is fucking awesome. Subscribed. I just found this channel today, incredibly good. :o

  • @HL65536
    @HL65536 4 years ago

    It would be interesting to see what the speeds are like with longer data types: 64 bit, 128 bit... Is there a tipping point?