Yes, with radix sort, the greater than/less than comparison is built into the fact that the order in which we store the counts is itself an ordered list. You're leveraging compile-time information to help with speed. Of course, that's directly why you're using more memory.
There is also a "database sort", that is, we know how many "pages" are required to hold the sorted result since each "pages" can hold a fix number of elements (that number is our choice). The memory allocation occurs thus just once. Then, one element at a time from the array to be sorted, we look into which page it should go (each page keeping the min and max value it holds in book-keeping while adding or transferring data to another page), incorporate it to the page it belongs if there is room in that page, else, move half of its data to an blank page (blank up to now, its memory had already been reserved), new page which will be properly linked. Since the amount of all required memory is known at the start ( TWICE the initial data), the memory allocation is done just once. All the other manipulations are book-keeping. Note that database "index" rank the data by keeping the order of the index, not of the value, being ordered, which works fine for string, or float, or even orderable structures of multiple "fields". I suspect it MAY be faster than a quick sort, given, that, in the first place, if you have to sort 1G elements, they don't happen 1G at a time, but one by one, making the database sort working from "day one" instead of having to wait the 1G-th element to be known.
Most of the data I deal with is already partially sorted with combinations of hill sorted, valley sorted and reverse sorted. This can be pathological input for Quicksort but makes another algorithm like CombSort11 shine.
I think a really good sort algorithm would be one that first scans the data to be sorted (size, type, randomness, min/max data values...), and based on that, selects the most appropriate sorting algorithm. For example, it could have 10 complete sorting algorithms available to itself, but decides which one to use based on the data. Some of those 10 could even be combination algorithms involving pieces of different subalgorithms. Also, I wanted to mention that heapsort has a simple tweak in that instead of just plucking the root node each time, you can pluck that and one of the children (if available). For example, if we had a maxheap with the first 3 elements being 10, 4, and 6 (in that order), instead of just plucking the 10 and then reheapifying, we can pluck the 10 and the 6. Someone actually benchmarked this for me and it was like 3% faster than a 1-element-popping heapsort. A mild improvement, but hey, if it works and it is faster, why not use it?
An interesting topic related to this, and probably relevant for many CS students that are obviously watching this (judging by comments from previous videos), would be the effects of multithreading these things. Spoiler: Quicksort is not stable enough, even with good pivots, to get as much out of it. Also, to justify the overhead you need to have more than 1e7 elements to begin with. Thinking about it: multithreading is a good example where the stability of an algorithm actually matters. If you want to spice it up: sorting is one of those things that can make really good use of GPU acceleration. Maybe run a CPU radix sort vs a GPU bubble sort on the same data set. So many fun topics around this to explore. Another nice topic for an addendum video: the practical side of things. In the wild, the choice of sorting algorithm is more or less academic. There is close to no cost/benefit to custom-fitting some std_lib sorting to your particular dataset. Yes, there are cases where every ms matters, particularly in lowest-end embedded or HPC implementations, but for your average job you just use some standard sorting and be done with it. As with any algorithm you should be aware of its pros and cons, but do not obsess over reinventing the wheel. The "you can always come back later to fix it, when it breaks everything" mantra is sadly reality. Or trying to "outsmart" modern compilers with your newly learned asm knowledge.
Well, there is a GPU radix sort, yes. I remember watching a GTC talk on it a long time ago. Though I couldn't tell you which it was. I did actually make a multi-threaded version of the Radix sort for this video series too, but scrapped it. I just split the data into 8, and then used a merge step at the end. It was ok. Mostly bottlenecked by the RAM reads I think. Certainly a lot of fun topics, yes!
Hmm, how do you even parallelize radix sort for a GPU? I was thinking about this yesterday - it is still iterative along the number of digits. You have to do one digit first, reshuffle the numbers, then do the next digit, so I'm not sure there's much of a speed up. Also you'll run into weird race conditions if you try to change the same memory location in multiple GPU kernels/threads. Have to think more about how to avoid that
@@nikilragav There is a ton of literature on the topic. I think one of the best illustrations of the idea is in "HARRIS M. Parallel prefix sum (scan) with CUDA. GPU Gems 3", as it also addresses implementation "problems" you face on GPUs. There are quite a few optimizations on the idea and GPUs have better implementations nowadays (and more cores, and shaders to play with). For example "Fast 4-way parallel radix sorting on GPUs". Another thing to keep in mind: GPUs do not use the same memory system as CPUs. It is quite different in its usage. There are also tons of absolutely unintuitive parallel algos that one just doesn't know about when starting with the topic. There is also other neat sorting stuff from the last years, worth looking through their cites too. doi: 10.1109/ACCESS.2019.2920917 or doi: 10.1155/2020/4793545
At work, I often found myself having the best performance with insertion sort. Eg.: Inserting events into a file from networked sources based on timestamp. The "better" sorting algorithms often are worse, depending on your dataset.
std::sort is using a hybrid sorting algorithm called introsort; as you mentioned, it is quicksort + heapsort (+ of course an O(n^2) sort for short subarrays, insertion sort I think). The quicksort uses a "median-of-three" (*) pivot. It works well for sorted, reverse-sorted and random tables, a bit faster even than a random pivot. But there exist permutations that break it. This is where heapsort is used. As the quicksort recursion progresses, the depth is checked, and if it is too large (greater than log[n]*const), the problematic region is sorted using heapsort (a bit slower in the mean case, but the worst case is n log n). *) it is the median of the first, middle, and last element, not of random ones. What's interesting is that gcc uses the second, middle and last elements, and the median of those is inserted in the first place. Not sure why, but I would guess it gives fewer jumps in the code and is a bit faster. The main advantage of the median rule is that we are guaranteed that in the range there is at least one element not smaller, and at least one element not greater, than the pivot. This allows us to get rid of some if-statements.
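Not the actual libstdc++ code, but here's a rough sketch of that idea in C++ - median-of-three pivot, a depth limit that falls back to heapsort, and small ranges left for a final insertion sort (the 16-element threshold and all the names are just illustrative):

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch only: quicksort with a median-of-three pivot, falling back to heapsort
// when the recursion gets too deep, leaving small ranges for a final insertion sort.
void introsortImpl(std::vector<int>& a, int lo, int hi, int depthLimit) {
    while (hi - lo > 16) {                        // small ranges handled later by insertion sort
        if (depthLimit-- == 0) {                  // too deep: heapsort this range and stop
            std::make_heap(a.begin() + lo, a.begin() + hi);
            std::sort_heap(a.begin() + lo, a.begin() + hi);
            return;
        }
        int mid = lo + (hi - lo) / 2;             // median of first, middle and last element
        if (a[mid] < a[lo]) std::swap(a[mid], a[lo]);
        if (a[hi - 1] < a[lo]) std::swap(a[hi - 1], a[lo]);
        if (a[hi - 1] < a[mid]) std::swap(a[hi - 1], a[mid]);
        int pivot = a[mid];
        int i = lo, j = hi - 1;                   // partition around the pivot
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (pivot < a[j]) --j;
            if (i <= j) { std::swap(a[i], a[j]); ++i; --j; }
        }
        introsortImpl(a, lo, j + 1, depthLimit);  // recurse into the left part...
        lo = i;                                   // ...and loop on the right part
    }
}

void introsort(std::vector<int>& a) {
    int depthLimit = 2 * (int)std::log2(a.size() + 1);
    introsortImpl(a, 0, (int)a.size(), depthLimit);
    for (int i = 1; i < (int)a.size(); ++i)       // final insertion sort pass
        for (int j = i; j > 0 && a[j] < a[j - 1]; --j)
            std::swap(a[j], a[j - 1]);
}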
Was the pattern of Radix Sort using 2x the memory of Quick Sort something that held through larger arrays? Would have been cool to see that in the final chart at the end. Awesome video series, thank you! :)
Sorry, I glossed over that! It does continue, yes. Radix Sort, as used in the video, doubles the array, so it takes about double the RAM of QuickSort. Well, cheers for watching mate, and have a good one :)
If you know the size of the array in advance, then radix sort has a huge advantage, since you can just use a template to allocate the array at compile time, saving time; you can even do the same for the counter array, again saving time. Unfortunately, the cost of initializing them to 0 will not be reduced. It will even remove all allocation, leading to no needed deallocation, so a faster and more stable program.
*For more sort comparisons:* Check out *w0rthy* : th-cam.com/users/w0rthyAvideos For example: "Sorts - Classic" th-cam.com/video/CnzYaK5aE9c/w-d-xo.html Check out *Timo Bingmann* videos: th-cam.com/users/TimoBingmannvideos For example: "15 Sorting Algorithms in 6 Minutes" th-cam.com/video/kPRA0W1kECg/w-d-xo.html
I really like your explanations. For a programmer they are very easy to understand. My frustration with compsci books and articles is so often that they talk a little too abstractly, so for example wouldn't really worry about trying to address the "bucket" problem in radix sort. One odd hunch I had while you were describing that problem was: could you avoid allocating a second array if you used a radix 2 to somehow do clever in-place bucketing, or would that not work because you might get too many items in a row for one bucket and it wouldn't balance out? I plan to work through that soon myself to find out, but figured I'd mention it before then.
Fascinating video series, thanks for educating us. :) I will say though, I don't often find applications where I only need a list of numbers sorted. The numbers are often attached or keyed to a data structure, and the whole structure needs to travel with the number, or at least be referenced somehow. Combined with something like a hash map to quickly... well, map, the numbers back to their containing structures, this could be a nice intermediate step. But if you've already constructed a hash map, why would you need a sort... *scratches head for a minute* ... Actually, if you can pack the number to sort, as well as an index to the original structure, into a single integral value, perhaps an int64 or just a byte array, and only do the radix sort on the least significant bytes, then the indices or pointers would be part of the sorted data. Yeah... I'll have to try this some time. Comparison sorts are all I learned in school, it's very interesting to see other approaches.
So, follow up... I wrote a radix sort template class, it can do integers, strings, and any structure really, as long as it can be represented as a sortable buffer of bytes. I haven't tried negative numbers yet though. The way I ended up doing it was to allocate two buffers for indices into the original data, in and out, with a buffer swapping approach. A little heavy on the memory, but very flexible. That also gives the option of returning just the list of indices without modifying the data (view()), or doing swaps on the original array (sort()). It relies on two functor classes to obtain the size of the inputs and evaluate the sortable bytes, but it was trivial to default those for ints and make a "prefab" for strings that you can pass. It also has a persistent buffer option and preAlloc() to avoid those malloc penalties for repeated calls. sorter rs; rs.sort(data,size); or sorter rs; rs.sort(data, size); Pretty handy! Thanks again for the inspiration! If anybody wants it, github.com/DFPercush/RadixSort I'll keep working on getting that negative number thing... sorted.
Cool series; thanks! Not sure how I managed to go this long without having this algorithm on my radar. I suppose just because it's relatively scope-limited to fixed-length unsigned integers, so its utility is near-zero in other contexts?? Anyway, still cool to know about, and has its place. I imagine it could also be used to pre-sort other kinds of data, if one can easily compute a fixed-length unsigned int that represents the most significant sorting components (e.g. casting char[] arrays, to effectively sort by the first few characters)? Interesting stuff.
I think the complexity of radix sort should be O(N log_b(max value)), where log_b is the log with base b = the number of buckets used. For most computers max value = 2^32, so the complexity becomes O(32N) = O(N). But it's usually taught as plain O(N), which is misleading in my opinion.
Great videos. Is it possible to somehow check the list of data at the start - whether it is just integer numbers, negative, positive, floating point, just letters, a combination of all, …? So spend some time checking the data, get information on how much memory we have, and then decide which sort algorithm is the best, actually the fastest, for the given data and available resources. That would be very close to an ultimate, universal, fastest algorithm for any type of data. Of course this is only if we do not know what type of data to expect. I know it's not just the type of data and the memory we have, so maybe include more information and checks.
The radix sort should effectively require 2x the data size in memory, while the quicksort should be able to work within the data's own space. If one used a thread pool the quicksort could be improved a good bit in performance. Doubt it would catch up with radix sort though. It would require waiting until each division is done to split the workload.
For radixsort, would it matter if you scan only one time and build 4 count arrays instead of scanning 4 times? So instead of:

for (int shift = 0, s = 0; shift < 4; shift++, s += 8) {
    // Zero the counts
    for (int i = 0; i < 256; i++) count[i] = 0;
    // Store count of occurrences in count[]
    for (int i = 0; i < n; i++) count[(arr[i] >> s) & 0xff]++;
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i < 256; i++) count[i] += count[i - 1];
    // Build the output array
    for (int i = n - 1; i >= 0; i--) .........

something like:

// Zero the counts
for (int i = 0; i < 256; i++) {
    count[0][i] = 0; count[1][i] = 0; count[2][i] = 0; count[3][i] = 0;
}
// Store count of occurrences in count[]
for (int i = 0; i < n; i++) {
    count[0][(arr[i] >> 0) & 0xff]++;
    count[1][(arr[i] >> 8) & 0xff]++;
    count[2][(arr[i] >> 16) & 0xff]++;
    count[3][(arr[i] >> 24) & 0xff]++;
}
// Change count[i] so that count[i] now contains
// actual position of this digit in output[]
for (int i = 1; i < 256; i++) {
    count[0][i] += count[0][i - 1];
    count[1][i] += count[1][i - 1];
    count[2][i] += count[2][i - 1];
    count[3][i] += count[3][i - 1];
}
for (int shift = 0, s = 0; shift < 4; shift++, s += 8) {
    // Build the output array
    for (int i = n - 1; i >= 0; i--) .........

This way initializing would be a little faster, it would still fit in L1 cache, and you don't scan the input 4 times. More work per loop. The building of the output arrays still has to take place 4 times. But would this be faster? (I'm not that good at C++ to test this myself.)
Small optimization: The count array could have been allocated on the stack, that would just be 4*256=1024 bytes. Today stack usage and arrays on the stack are not a problem, as long as the arrays have constant size. In C there are variable length arrays with the array size not known at compile time; the same thing could be done with alloca. If you declare an array with variable size you should always check that it has a relatively tiny size, like 10kB or less. On Windows you have a default maximum stack size of 1MB.
"On windows you have a default maximum stack size of 1MB." Is that the maximum size an array can be when passed as an argument to a function (aka put on the stack)?
@@harrysvensson2610 You wouldn't normally pass an array as an argument, but rather a pointer to the memory location where the array is (or in C++ it could be a reference, but the effect is the same). The stack just has to hold 'sizeof(pointer)' bytes.
@@benjaminfacouchere2395 Allocating and deleting allocated memory on the heap takes some time. If you sort a large array it's not going to matter much, but if you sort many small arrays that time is going to add up. Having the array on the stack is just changing the stack pointer, so it's essentially free. Replacing int* count = new int[256]; with int count[256]; and removing delete[] count; makes the count array live on the stack instead of the heap. It will be removed automatically when the function ends, and again, it's just changing the stack pointer, so it's free. Regarding the maximum stack size: every time the program enters a function it "creates" space on the stack for the variables that function uses. So if you make a recursive function that needs 1000 bytes of stack space, it will have used a million bytes when it's gone 1000 steps "deep"; e.g. void rec( int x ) { if( x < 0 ) return; int a[250]; rec( x - 1); } rec( 1000 ); will use 1 million bytes (more actually, since arguments are also put on the stack) before it starts to unwind. With a maximum stack size of 1 MB, you may run out of stack space and crash if something down the call stack also allocated a lot of data on the stack.
@@phizc Sure, you have a small one-time allocation penalty, but that's it. Obviously we won't use radix sort on small arrays as other algorithms are more efficient there, also a count array of 256 is overkill then.
I had some data to extract from a Siemens PLC database to make a monthly report in Excel, or to push the report on the fly to the web interface. What I did was read the month's data into RAM and save the indexes where each new day's data starts. I then create multiple threads on the CPU, so each thread handles the calculations for one day's data: each thread receives the memory address of the data and the index locations of that day's data. So it runs e.g. 30 threads, and the more cores the computer has, the more the calculation time is reduced.
Allocating stack memory is basically free (just increment the stack pointer register); malloc/new is where the overhead comes in. Asking the operating system for memory is a big deal.
I wonder if you can use radix-type sorting on classes and the like. For example, a string - you could radix sort character by character I think - I just wonder if overhead would kill it. For other data types, for example a class with 3 numbers (hours, mins, secs), I imagine you could radix by seconds, then minutes, then hours.
It's certainly capable of sorting strings, yep! It's pretty flexible, really, so long as you can figure out how to read the data as something like unsigned int you're good to go! Cheers for watching mate :)
You are correct about sorting data with multiple key values per element. Because radix sort is a stable sort, you can sort starting from the least significant key value to the most significant (like in your example: first sort by seconds, then by minute, and last by hour).
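A small sketch of that in C++, using a hypothetical time struct and one stable counting-sort pass per field, least significant first - the field names and ranges are just assumptions for the example, and stability is what makes the passes compose:

#include <cstdio>
#include <vector>

// LSD multi-key sorting sketch: stable counting sort per field, seconds first.
// Each pass is stable, so earlier passes survive as tie-breakers in later ones.
struct Time { int h, m, s; };

// Stable counting sort of 'v' by one member in [0, range).
void countingPass(std::vector<Time>& v, int Time::*key, int range) {
    std::vector<int> count(range + 1, 0);
    for (const Time& t : v) count[t.*key + 1]++;               // histogram, shifted by one
    for (int i = 1; i <= range; ++i) count[i] += count[i - 1]; // prefix sum -> start offsets
    std::vector<Time> out(v.size());
    for (const Time& t : v) out[count[t.*key]++] = t;          // stable placement
    v.swap(out);
}

void sortByTime(std::vector<Time>& v) {
    countingPass(v, &Time::s, 60);   // least significant field first
    countingPass(v, &Time::m, 60);
    countingPass(v, &Time::h, 24);   // most significant field last
}

int main() {
    std::vector<Time> v = {{10, 4, 30}, {9, 59, 59}, {10, 4, 5}};
    sortByTime(v);
    for (const Time& t : v) std::printf("%02d:%02d:%02d\n", t.h, t.m, t.s);
}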
A funny thing that I just thought while watching this: You can use radix sort to sort strings alphabetically (letters = numbers, radix = the size of the alphabet)! Is this how programs usually do that?
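As far as I know most standard libraries use comparison-based sorts for strings, but the idea does work. Here's a minimal sketch for strings that are all the same length - one stable counting pass per character position, rightmost character first (variable-length strings need an MSD or length-grouped variant):

#include <string>
#include <vector>

// LSD radix sort for equal-length strings: stable counting sort per character,
// starting from the rightmost position.
void lsdStringSort(std::vector<std::string>& v, size_t len) {
    std::vector<std::string> out(v.size());
    for (size_t pos = len; pos-- > 0; ) {                       // rightmost character first
        int count[257] = {0};
        for (const std::string& s : v) count[(unsigned char)s[pos] + 1]++;
        for (int i = 1; i <= 256; ++i) count[i] += count[i - 1];
        for (const std::string& s : v) out[count[(unsigned char)s[pos]]++] = s;
        v.swap(out);
    }
}

// e.g. lsdStringSort(words, 4) for a vector of 4-character strings.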
A nice short video series. The prefix sum idea was especially clever. Could you possibly do even better if you try to optimize the sorting algorithm to make better use of locality? By that I mean, this seems like it would be doing fairly random accesses to memory, whereas having accesses that are mostly linear would perform better. To my understanding, having fairly random accesses causes page faults.
Hmmm... that's an interesting thought. The data is random, so there will have to be randomness somewhere in the algorithm - in a QuickSort the randomness is in the elements that we swap. If we use a small radix, we get much better locality, but at the cost of more iterations. I can't think of a way out other than finding a nice compromise. That seems to be about base 256 radix sort. The counts are small enough that they fit into the L1 cache, but large enough that there aren't too many iterations. Well, cheers for sharing this thought, it is very interesting! And cheers for watching :)
@@WhatsACreel Perhaps one could make buffers for each possible radix (digit? symbol? value?) such that the L1(or possibly the larger caches) can fit everything. Once a buffer is totally filled, we flush its contents to where the data is supposed to go. Then the page faulting would only occur once during each time we need to flush a buffer.
Fantastic mini-series; absolutely fantastic. However the part where you measure the performance of sorting 10, 100 elements is cringe 😅😊 come on you should have started at 1.000.000 🤣🤣
The process memory is shown in real time. When you are looking at it the sorting is already done and all the memory was freed again. Look at @11:54 and @12:21 in the process memory graph. You can see it spiking at the beginning and then it's freed again. I am also curious why you are measuring the time of the random number generator too. Yes, it is just O(n), but for a comparison between both sorting algorithms you cannot add a constant number to each time.
You're including the time of generating the array in the time it takes to sort the array. For an O(n) sort, this is not a trivial overhead, but it is for O(n log n). This makes the test skewed against radix sort.
@@WhatsACreel Hey, little mistakes keep me on my toes, I suppose! Then the trick is to just try to do enough research to make sure the feedback is correct, but I like to gloss over that part. Lovely channel. Not sure I've seen anything like this before!
@@_notch Well I certainly make lots of mistakes. Looks like solid advice to me mate! Cheers for the kind words! Hey, reminds me of a game I played, never saw anything like it! Built a sweet little house with a farm by a river, then I fought off zombies and found some diamonds!! Hahaha, magic! If you’re ever down this neck of the woods give us a yell, I reckon I can find a bbq and a cold beer :)
It would be fun to hear if any of the algorithms could be sped up by using SIMD functions, and then by utilizing locality (maybe structs with alignas?) for the cache lines to reduce the number of cache misses. Optimize it to the ground!
@@controlledsingularity8084 For a very branchy integer based operation? Maybe, but probably not by the margin you think it will. Do any of the STL-like utility libraries for CUDA have a radix sort to compare it to? Likely the copy latency will be the longest thing
One thing I don't get is why people say that radix can't sort complex objects. As long as you have a function that returns an integer key for each object, anything can be sorted.
Good note at the end about exploiting patterns in your data. Also, many people think way too hard about sorts. If you're sorting 10 000 000 elements, it's sometimes better to check if you really need to sort it, or whether you can get it sorted. Often, the best sort is whatever the standard library gives you.
Would using other modulos (apart from 10) give better or worse results? Or, rather, how would it affect the speed and what are the things to consider when choosing the base for modulos?
Yes, it can make a big difference! Mostly, we stick to powers of 2 for the modulus, so we can use a bitwise AND instead of dividing to get the digits. Cheers for watching mate, have a good one :)
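A tiny illustration of the difference (the function names are made up for the example):

// Extracting digit number 'pass' of 'value': base 10 needs division and modulo,
// a power-of-two base like 256 needs only one shift and one bitwise AND.
unsigned digitBase10(unsigned value, int pass) {
    for (int i = 0; i < pass; ++i) value /= 10;   // repeated division
    return value % 10;
}
unsigned digitBase256(unsigned value, int pass) {
    return (value >> (8 * pass)) & 0xFF;          // shift, then mask the low byte
}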
Great series. But why did you not mention that radix sort can start from both ends? I.e. - there is LSD (the one you covered) and MSD sort, which is basically the same, but starts from the leftmost digit.
Yes, sorry. It's just I had the code for LSD and some animation ideas. No real reason other than that. Maybe we can do another vid in the future on MSD? I'm not sure. Both are fine algorithms, and cheers for watching :)
Looking at the results for quick sort I noticed it was almost linear once you reached 100k elements and above, which I wouldn't have expected. I mean, it's still nearly an order of magnitude slower than radix at 1bn elements, but even at 10m elements it's still a usable algorithm if 1 second of wait time is acceptable, especially with radix sort's massive RAM requirements.
When you're working with truly huge datasets radix has no comparison. It's the only choice for pre-bucketizing string lists over a million unique points. I've been using a pseudo-radix methodology to keep my homebrew linked-list map class sorted live. The memory overhead isn't impossible to overcome when using linked lists, and radix also allows for parallelism where practically all others are necessarily linear. When I'm dealing with data sets that won't fit in working RAM, radix to the rescue again! Just make your first bucketizing step output to file storage, and continue bucketizing to disk until buckets fit in working RAM. Array radix is kinda limiting for that tiny lookup bias; I feel like on average you gain more from being able to move blocks of memory and insert nodes than you lose from that lookup while the sort process is happening. My linked list comes with 2 jump pointers that keep relatively close to binary locality when the map is open for active additions or deletions. Optionally the map could be collapsed into an array when its keys become const.
Is there a reason those two loops at the beginning of the Quicksort code are do while loops instead of simple while loops? Since the body is empty, I can't see how they would behave differently.
I sure am! Now I'm looking at the speeds of 16-bit instructions against 32, and it does seem that some of them are strangely slow...? How odd. Mostly they're ok though. At least on this Kaby Lake. Still, really does make me wonder: why was 16 bit so slow?? I just assumed it was the zeroing of the counts. Could certainly be wrong! Great info mate, cheers for sharing, and cheers for watching :)
@@WhatsACreel x86 instructions favor 32-bit and 8-bit operations by default and switch between them using a single bit. 16-bit operations require the operand-size prefix byte, so 16-bit instructions are both slower and larger.
I don't think it has anything to do with the speed of 16-bit vs 8-bit instructions when it comes to performance in relation to the size of the radix. There are multiple issues at hand. The most obvious is that your table of 256 ints will fit in cache, while 64k probably won't, and because you do random access on it (the data is random, so you're as likely to increment any value as any other), you lose time on cache accesses. Another point is that using 64k means you need to compute the prefix sum of your 64k array, which obviously takes much more time than the 256 array (128 times more in total, considering you do it only twice instead of 4 times). That's a significant constant time added, which will definitely stand out for smaller arrays. It is important to note, however, that the variable time is theoretically halved with a 16-bit radix (going over the array half as many times), and since the AND instructions would be 32-bit anyway (since your input is 32-bit), and even then I doubt they'd be twice as slow, memory access is most likely the reason. I believe that using a 16-bit radix could be better if you start multithreading this, because you won't create as much contention on the various array elements. However, while counting is trivial to parallelize, putting the elements in place is not; you'd have to use some clever tricks there. Also, I tried checking the disassembly on godbolt: the big loop isn't unrolled, which is surprising since you'd probably get some extra performance by avoiding the AND for one pass (since the result will always be smaller than the value you're AND'ing it with).
@creel Did you try using SSE or AVX2? It might give a significant performance boost on radix sort since you can mask 16 bytes (SSE) or 32 bytes (AVX2) at the same time. I'm curious about the results..
As quicksort (& friends) use a divide & conquer approach, they are relatively easy to parallelize, which does not help the big-O speed of the algorithm but does help the wall-clock speed. How well is radix sort scalable in this way? I don't see an obvious way to do that in the steps you described, except for chopping up the input array into parts, radix-sorting those, and then using a merge sort on the sorted parts perhaps? Also: how useful (if at all) is this technique if you don't just want to sort a bunch of integers, but a collection of structs that you want to sort by a field that is an integer?
Very late reply, but the way you could do it is by using the fields as keys and then reiterating over the list of integer elements to create a sorted list of the structs. It would be slower than a quicksort unless you have something in the 5k-10k range of structs/objects, but theoretically useful if you have massive object systems that need to be sorted.
I'm trying to figure out if radix sort would turn into a special case if the bucket size was one bit in size. So instead of dividing by 256, dividing by 2. From an assembly point of view this could be interesting in code size I guess?
I wonder if radix sorting by the most significant digit instead of the least would be faster - could reads to nearby memory addresses be binned together? Or would it be worse if memory modules allow reads from different parts of memory in parallel?
@@WhatsACreel Maybe, I don't really know the terminology. A lot of the programming I do is for monochrome calculators, and to simulate grayscale, we can toggle between two buffers (or even two sets of two buffers if we need to do some animation and don't want the screen tearing). In this application, we don't really need to keep track of the frame, so being able to just XOR a pointer with some pre-computed value in order to toggle between two values saves a few precious clock cycles and bytes on the Z80 :P I have a pretty limited formal experience with CS, so the only other time I've seen this technique used was XOR-linked lists. EDIT: And the first time I did mergesort on the Z80, that's what I used to toggle between the input and output buffer.
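For anyone curious, the same buffer-toggle trick sketched in C++ (the buffer names are made up, and the pointer/integer round trip is technically implementation-defined - it just happens to work on common platforms; std::swap of two pointers is the portable equivalent):

#include <cstdint>

// Precompute a^b once; flipping between the two buffers is then a single XOR.
uint32_t bufA[1024], bufB[1024];
uintptr_t toggle = (uintptr_t)bufA ^ (uintptr_t)bufB;   // computed once

uint32_t* current = bufA;

void flip() {
    current = (uint32_t*)((uintptr_t)current ^ toggle); // bufA <-> bufB
}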
Great video, but one thing I miss is how much RAM the algorithms use and how it changes with the amount of numbers in the array. Maybe a graph would be nice.
Doesn't really matter here; quicksort needs basically no overhead, and radix sort will double it, because it needs a temporary work area. In other words, if you pass in 2GB of numbers (that's over 500 million ints), quicksort needs nothing, radix sort will need an additional 2GB.
@@billy65bob Quicksort has a recursive call, so it requires the stack. The problem isn't how much time the stack allocation takes (as you say, it's very fast). The problem is that in the worst case it requires O(n) calls. To imagine the worst case, consider a scenario where the pivot element only reduces a subarray with s elements into two smaller subarrays of size (s-1) and 1. All of those pending calls "pile up" as something remaining to be done in the worst case. If your input array is 2GB, accounting for just the word size of the return pointer, and having the stack grow n/4 times, then you've already required a 4GB stack! Causing, as they say, a stack overflow. The fix isn't so bad: turn the second recursive call into a tail call, and force the remaining (first) recursive call to operate on the smaller subarray first.
@@mshonle quicksort doesn't work that way; it's an unstable in place sort and requires no temporary working area, except to store a value for a swap. You are right in that it's not free and uses recursion which requires stack space, but this cost is trivial, even if the compiler cannot do tail recursion optimisation. The worst case would be O(logn) levels deep; more than small enough to comfortably fit in 1kb of stack.
@@billy65bob needing O(lg n) stack entries is the BEST case- and, without tail call optimization, you’ll have O(n) stack entries in the WORST case. If you are sorting anything large, then the “trivial” linear space will certainly overflow your stack!
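A sketch of that fix in C++ - recurse into the smaller partition and loop on the larger one, so the call depth stays O(log n) even on adversarial input. The middle-element pivot here is just for illustration, not anyone's production pivot rule:

#include <utility>

// Sorts a[lo..hi] inclusive. Recurse on the small side, loop on the big side.
void quicksort(int* a, int lo, int hi) {
    while (lo < hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {                        // partition
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        if (j - lo < hi - i) {                  // left part is smaller
            quicksort(a, lo, j);                // recurse on the small side
            lo = i;                             // loop on the big side
        } else {
            quicksort(a, i, hi);
            hi = j;
        }
    }
}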
I'm a beginner programmer and would like to know how to easily make a function getDigits(number, base). I managed to make a base 10 one; however, other bases don't work...
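One way that works for any base >= 2 is to repeatedly divide by the base and keep the remainders - a minimal sketch in C++ (the digits come out least significant first; reverse them if you want the usual reading order):

#include <vector>

// Minimal getDigits(number, base): peel off digits with % and /.
std::vector<int> getDigits(unsigned number, unsigned base) {
    std::vector<int> digits;
    do {
        digits.push_back(number % base);  // current lowest digit
        number /= base;                   // drop that digit
    } while (number != 0);
    return digits;                        // e.g. getDigits(1234, 10) -> {4, 3, 2, 1}
}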
Quicksort can be horribly slow in a worst case scenario (such as an almost sorted array), which is why std::sort is usually implemented as introsort (quicksort that switches to heap sort if it reaches a certain recursion depth) followed by an insertion sort at the very end when the array is almost sorted.
For when you can't use radix sort, there's a whole science to picking the pivot for quicksort. For large arrays, picking pivots close to the median gets you closer to O(n log n) and helps to avoid the worst case, O(n²). Picking a single random value from the array is somewhere between those (since you're unlikely to pick the median). To pick a good pivot, pick several random values from the array and use the median of those. The science is how many random values to use as a function of n.
Really great stuff mate! I've usually gone with random in the past. Median sounds like a great choice! Cheers for sharing mate, and cheers for watching :)
In his benchmark it scales pretty much as N log N: a factor of 1000 for the input equals a factor of 10000 for the run time, which is what you'd expect.
@@WhatsACreel my pivot selection has been 1st, middle and last element, find the median and use that as pivot
so a few extra comparisons, but the worst case is hard to hit - in reality it only happens when somebody actually designs input to hurt the pivot-picking algorithm
Yeah, picking the pivot close to median is easy! Just sort the array and pick the middle element!
@@sharpfang median of 3 elements is not that hard, just some if's
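For the "median of several random values" version from the top comment, here's a small sketch using std::nth_element on a sample - the sample size k is an arbitrary choice here, and picking k as a function of n is exactly the "science" part:

#include <algorithm>
#include <random>
#include <vector>

// Pick a pivot for the subarray a[lo, hi) as the median of k random samples.
int pickPivot(const std::vector<int>& a, int lo, int hi, int k = 9) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(lo, hi - 1);
    std::vector<int> sample;
    for (int i = 0; i < k; ++i) sample.push_back(a[dist(rng)]);
    std::nth_element(sample.begin(), sample.begin() + k / 2, sample.end());
    return sample[k / 2];                 // median of the sample, used as the pivot
}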
std::chrono::high_resolution_clock is the way to go when you want to measure performance of anything in C++
Yes, good suggestion! I certainly could have used something more accurate :)
No, I'd say std::chrono::steady_clock is the best since it is, well, steady.
High_resolution_clock is unstable in some implementations and is not recommended for measuring time spans
Or at least repeat the run multiple times and sum up the numbers. Instead of ten 0's, get the time from start to finish (and subtract n times the time of a dry run: generating the array and copying it to output unsorted).
@@anthonynjoroge5780 Sometimes it is the same clock. I've seen libs where high_res_clk is an alias of -steady_clk-system_clock. Edit: wrong clock.
@@philipbotha6718 I doubt that. I only know that to be true for system_clock and high_resolution_clock. However, please give me an example of such a library. I'd love to see it.
As if I needed any more distractions at the moment (it's hard enough to finish existing projects), now I want to write some sorting algorithms!
100% agree
The weakness of Quicksort is that it cannot take advantage of partially sorted data, or data that's exactly or almost exactly reversed. In real life, you often have two or more sorted datasets that you want to combine into a single sorted whole, or a dataset that's mostly sorted but with just a few items in incorrect places. Indeed, the worst case scenarios for quicksort usually are with sorted or reversed datasets.
Some hybrid sorting algorithms like timsort or introsort can take advantage of those existing structures in the data and perform a nearly O(n) sort, while avoiding worst case scenarios and maintaining O(n log n) for any input.
The standard sorting algorithm in the standard libraries of most programming languages usually uses one of these hybrid sorts rather than a naive sorting algorithm like quicksort.
That's very often the case, yes! Certainly best to use a sorting algorithm that takes advantage of any patterns in the data! Cheers for sharing :)
Radix sort doesn't take advantage of partially sorted data either. Hybrids like timsort and introsort can be massaged to do these tricks; for radix sort that's kind of hard without some extra passes over the data. Radix sort will usually destroy any partial sorting of the data in the first pass.
@@movax20h if you knew a priori that some selection of the data was already sorted, could you improve it by splitting off the initially sorted bit? I'm thinking something like this: copy out the sorted bit and replace its values in the modified inputArr with zeros/minimal/maximal possible values, except for the sorted selection's first and last (minimum and maximum) values. Then do quicksort on the inputArr using the retained first and last values as pivots. I'm having trouble thinking of a way to reintegrate the separately stored part, but it feels like maybe there's a way. Maybe some kind of binary search or something?
@@mnslr You could do a merge sort as the final merging step, which I suppose should even be doable in-place if the two parts (pre-sorted + qsort output) still remain in that same array.
"naive sorting algorithm like quicksort" I can just feel the freshmen CS majors sweating as they read that haha!
use this if you often need to toggle between 2 lines
//* remove first / to toggle
func1();
/*/
func2();
//*/
That's awesome! :)
Or just use
#if doFunc
Func1();
#else
Func2() ;
#endif
Wow. I never knew there was a sort that was so much faster than Quick Sort. I learned something here today. Radix Sort is nothing short of amazing!
quicksort is a misnomer. its not that quick. The 'quick' means if u don't want to spend time thinking about which sorting algorithm to use u can quickly use quicksort for now and worry about improving the quickness later. Should be called lazysort, but then no-one would use it.
I was always scared of radix sort, thanks to your three part series I feel much much better, THANK YOU
2 small optimisations available for your RadixSort. memset and memcpy are a tick faster than loops, but the more important optimisation is that you don't need to shift and mask the input values. You can use a byte * instead of the uint * and you increment it by 4 in the inner loop and by 1 in the outer loop (of course endianness becomes an issue, but in practice not, everything is little endian nowadays).
I don't think endianness matters. Big-endian just ends up with sorting bytes from the MSD.
Great suggestions! I really like this *byte idea! :)
@@WhatsACreel Theoretically it will not make a big difference though. It only pushes the explicit shifting/masking from the ALU to the load-store unit where it is done implicitly. It can also introduce a partial register write stall that plagued Intel CPUs for a long time (when writing a byte register AL, BL etc., accessing the rest of the register (EAX, EBX or RAX, RBX) would result in a penalty). So careful testing is necessary.
There's a great video about modern compiler optimization: th-cam.com/video/w0sz5WbS5AM/w-d-xo.html
The key takeaway is: compilers have become extremely good at unrolling loops and optimizing bit manipulations; if you are able to recognize the optimization easily, the compiler will too.
@@benjaminfacouchere2395 Another optimization: use constants instead of variables wherever possible, as this affects branch prediction and speculative execution in the CPU.
Amazing stuff. Love your work. I started on 8 bit machines back in 1978. You guys trying to learn, subscribe to this guy and watch all his stuff. Now Mr Creel, you been doing assembler. So you could use the processors clock counter to measure how long things take in cpu clocks. Forget the libraries. The magic x86 instruction is RDTSC - Read Time-Stamp Counter into EDX:EAX, opcode 0F 31.
Although it looked like radix performed better at all sizes, I think it’s important to clarify that if each of these sorts was done 1000 times for each given size, quick sort would indeed perform much better on the smaller sizes.
great series! very informative.
couple of things I might improve, in order of importance IMO:
1) keep GenerateRandomData() call out of timing block. you don't need to measure random number generation
2) use std::chrono::steady_clock for better timing (milliseconds is enough but you can go up to nanoseconds precision). important point here is to use double instead of default int64_t:
using milliseconds = std::chrono::duration<double, std::milli>;
using std::chrono::steady_clock;
auto start = steady_clock::now();
// stuff
auto end = steady_clock::now();
double elapsed_ms = std::chrono::duration_cast<milliseconds>(end - start).count();
3) use std::vector (when possible: std::array) instead of C style arrays for two main reasons: a) no need to new/delete, b) swapping input and output arrays becomes as simple as std::swap(input, output) and it is O(1)
there are more small stuff but hey it's a rabbit hole :)
I think you can do the radix sort in place by calculating the prefix sum as you go and performing swaps based on the intermediate values, it still requires the final pass per iteration in the reverse direction to put them in the correct order... actually on second thoughts the swaps would have to be inserts (shifting elements), to work around that you could treat the original array as two buffers and I think one of them would need to be circular... hmmm... I wish I had the time to explore this idea!
Thanks for bringing this topic up! I didn't study these sorts at university. I intend to use both the bucket sort and the radix sort to sort trees and graphs. These algorithms have wonderful extensions to more complex objects than numbers, and the number of variations on these opens up a lot of possibilities!
When comparing to quicksort, you should really use the built-in C++ std::sort. Granted, it is not required to be quicksort, but it is usually a hybrid of quicksort, in-place heapsort, and insertion sort for small arrays, sometimes with some additional tricks (like detection of partially sorted or reversed arrays, like timsort; or other good pivot selection methods, i.e. median of 3, or something like this). For totally random data most of these tricks are not super important, but some minor details and micro-optimisations in std::sort are still relevant; optimising increments/decrements, index comparisons, and the form of the loop makes a difference.
I have an article on github that covers sorting other numeric types and some radix sort optimizations, e.g column skipping. It's called "Radix sorting from the ground up"
Great video series. Really enjoyed it. Although I had already learned about sorting algorithms at university, I hadn't taken a look at RadixSort until this series.
Btw. for 64 bit integers the radix sort might be a bit worse, unless you have a really big number of numbers. But for a lot of 32 bit or 16 bit integers, it works blazing fast. A trick is to decide the number of buckets before the start of the algorithm, based on the size of the key and the number of keys. I.e. if you have a billion large numbers, it probably makes sense to use Radix-65536, because a few extra MB for counts is nothing compared to the temporary array. Also radix sort is reasonably easy to parallelize: in each phase, each thread updates counts using atomics, or its own local copy of the counts, and then they are aggregated (by one thread or by multiple threads, various methods are possible). Then the prefix sum is calculated (by one thread, or in parallel). Then the shuffling is done in parallel too, with some extra tricks to make it correct.
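A sketch of just the counting phase of that scheme in C++ - per-thread local histograms merged afterwards, no atomics needed; the parallel shuffle with its "extra tricks" is deliberately left out:

#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Count one byte of the keys in parallel: each thread histograms its own chunk
// into a local array, and the local histograms are summed at the end.
std::vector<size_t> parallelCount(const std::vector<uint32_t>& data, int shift, int numThreads) {
    std::vector<std::vector<size_t>> local(numThreads, std::vector<size_t>(256, 0));
    std::vector<std::thread> threads;
    size_t chunk = (data.size() + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(data.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i)
                local[t][(data[i] >> shift) & 0xFF]++;   // local histogram, no contention
        });
    }
    for (auto& th : threads) th.join();
    std::vector<size_t> count(256, 0);
    for (int t = 0; t < numThreads; ++t)                 // merge the local histograms
        for (int b = 0; b < 256; ++b)
            count[b] += local[t][b];
    return count;
}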
Another small optimization. Regarding the allocation of the output array and the extra logic to detect an odd number of pointer-swaps and copying the results.
Instead, have the caller responsible for allocating the output array and passing a pointer for that in along with the input array. Then the RadixSort() routine can just pass back the 'output' pointer. The caller can use the returned pointer for the sorted array, and also test against what was originally allocated to deallocate whichever one is no longer needed. No need to copy results and onus is on caller to allocate all memory and clean up. (the 256 element 'count' array could be static as well, so very minimal stack usage).
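A sketch of that interface in C++, assuming 32-bit keys and a base-256 radix: the caller supplies both buffers, RadixSort ping-pongs between them and returns whichever pointer ends up holding the sorted data, so no final copy is needed. The caller then frees whichever of its two allocations the returned pointer doesn't match.

#include <cstdint>
#include <cstring>
#include <utility>

// Caller-allocates interface: both buffers are passed in, the passes ping-pong
// between them, and the returned pointer is the buffer holding the sorted data.
uint32_t* RadixSort(uint32_t* input, uint32_t* output, size_t n) {
    size_t count[256];
    for (int shift = 0; shift < 32; shift += 8) {
        std::memset(count, 0, sizeof(count));
        for (size_t i = 0; i < n; ++i)                    // histogram of this byte
            count[(input[i] >> shift) & 0xFF]++;
        size_t total = 0;
        for (int b = 0; b < 256; ++b) {                   // exclusive prefix sum = start offsets
            size_t c = count[b];
            count[b] = total;
            total += c;
        }
        for (size_t i = 0; i < n; ++i)                    // stable placement into the other buffer
            output[count[(input[i] >> shift) & 0xFF]++] = input[i];
        std::swap(input, output);                         // ping-pong for the next pass
    }
    return input;   // after the final swap this points at the sorted data
}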
Nice video. FYI:
*do {} while (/*code*/);* is the same as *while(/*code*/);* in C++.
*if (originalArr == output) /*...*/* This section of the function isn't needed because it'll always be false: The number of shifts is a constant of 4.
For optimisations:
- We could put *count* on the memory stack to avoid unnecessary system calls (using *new* ).
- As others have pointed out, *memset* should double the speed of zeroing the *count* array.
Statically allocated count is not a good idea. What will happen when two different threads execute this code on different CPU cores? [spoiler] data corruption
int[256] is small enough for stack allocation.
@@safulkin Sorry, I meant allocate it on the stack. (I was meaning that *count* shouldn't be _dynamically_ allocated, and so chose the antonym _static,_ but of course that has its own meaning.)
If I'm not mistaken,
in the case where the condition isn't met, a do-while will always run the code at least once, whereas a while won't.
Is this different in C++? o.o
@@R4ngeR4pidz A while loop will run the body of the code if the while condition is true - but to know if the condition is true it needs to run the code in *while(/*code*/)* . So it really is the same as *do/while* .
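For anyone wondering what the two loop shapes look like, here is a small sketch (a paraphrase of the scanning step, not the video's exact code) showing why an empty-body do-while and while behave the same:

// Both forms advance i past elements smaller than the pivot. The increment and
// the comparison both live in the condition, so with an empty body there is
// nothing to skip and the two forms are interchangeable. (Assumes the pivot
// value is present in the range, so the scan terminates.)
int scanForward(const int* arr, int i, int pivot)
{
    do {} while (arr[++i] < pivot);      // empty body runs, then the test
    // Equivalent: while (arr[++i] < pivot) {}
    return i;                            // index of the first element >= pivot
}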
@What's a Creel Note: it's possible to get a more precise speed for the algorithms at the 0 - 10000 array lengths by running each algorithm n times and then dividing the total time taken by n. This way you wouldn't have a 0 sort time for the smaller inputs, as enough time would be given for the timer to reach values greater than 0. Thus, after division, you'd have some fractional values to show the algorithm's speed.
He's done that before (you can still see the loop in his test harness, but set to one iteration). He puts the data generation inside the test unit because he doesn't realize that a GB of RAM can store 268,435,456 integers, so it's okay to generate his test data outside the loop.
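A small sketch of that repeat-and-average timing idea using std::chrono::steady_clock, with the data generated once outside the timed region (the template, names, and the RadixSort signature in the usage note are assumptions of mine, not the video's harness):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch: time a sort by repeating it and dividing, so small inputs don't just
// read as 0 on a coarse timer. Data generation and the copy back to an
// unsorted state happen outside the measured region.
template <typename SortFn>
double averageSortMillis(SortFn sortFn, size_t n, int repetitions)
{
    std::mt19937 rng(12345);
    std::vector<uint32_t> original(n);
    for (auto& v : original) v = rng();

    std::vector<uint32_t> work(n);
    double totalMs = 0.0;

    for (int r = 0; r < repetitions; ++r)
    {
        work = original;                                   // reset input, not timed
        auto start = std::chrono::steady_clock::now();
        sortFn(work.data(), n);                            // the only timed part
        auto stop = std::chrono::steady_clock::now();
        totalMs += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return totalMs / repetitions;
}

// Usage (assuming a RadixSort(uint32_t*, size_t) like the one in the video):
// double ms = averageSortMillis(RadixSort, 10'000, 1000);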
this was really an interesting adventure into the radix sorting alg, good job there mate
This was a terrific and engaging mini series. It gave me flashbacks of computer science at uni. Thank you 👍😎🇦🇺
Indeed. Also, Radix Sort was never even mentioned. 🇨🇦
One note on your quicksort here: in the worst case, it requires O(n) space (the space is the call stack). To fix this, you need two modifications: (1) on the recursive call, recurse on the smaller subarray first; and (2) change the second recursive call into a tail call [in C/C++, reassign the parameters and goto the function start; in Kotlin declare the function as tailrec, et cetera].
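A minimal sketch of those two modifications (my own toy version with a plain Lomuto partition, not the video's code): the explicit recursion only ever handles the smaller side, and the larger side is handled by looping, so the stack depth stays O(log n).

#include <cstddef>
#include <utility>

// Plain Lomuto partition around the last element, for illustration only.
static size_t partitionRange(int* arr, size_t lo, size_t hi)
{
    int pivot = arr[hi];
    size_t i = lo;
    for (size_t j = lo; j < hi; ++j)
        if (arr[j] < pivot)
            std::swap(arr[i++], arr[j]);
    std::swap(arr[i], arr[hi]);
    return i;
}

// Quicksort with O(log n) worst-case stack usage: recurse on the smaller
// partition, loop (the manual "tail call") on the larger one.
void quickSortBoundedStack(int* arr, size_t lo, size_t hi)
{
    while (lo < hi)
    {
        size_t p = partitionRange(arr, lo, hi);

        if (p - lo < hi - p)
        {
            if (p > lo) quickSortBoundedStack(arr, lo, p - 1);  // smaller: left side
            lo = p + 1;                                         // continue with the right side
        }
        else
        {
            quickSortBoundedStack(arr, p + 1, hi);              // smaller: right side
            if (p == 0) break;                                  // defensive: avoid size_t underflow
            hi = p - 1;                                         // continue with the left side
        }
    }
}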
After three 15-minute videos explaining in detail why the radix sort is so fast, ends the series with...
"It's probably got something to do with it."
Great work, mate. Thanks.
Fascinating series on sorting. it's been 3 decades since I messed with this stuff, and even then, we didn't get this deep. Thank you for taking the time to explain this.
Side note, has anyone told you that you sound very much like Eric Idle from the Monty Python crew? I don't mean that as defamatory.
Awesome stuff! Very informative series that demonstrates why the methods do become faster, and not just how to do the method itself. Thank you very much, I greatly appreciated this.
Am I right in thinking that the number of digits in the numbers affects radix sort but not quicksort? If so, I would have liked to see a set of tests where the number of elements is constant but the number of digits in the elements increases, to see what effect that would have.
That applies to sorting strings more than numbers. If you're just using the standard integer number types used in the video they tend to have a fixed number of digits. It's just that all the leading digits are zeroes. So you can just use the radix sort, implemented as shown in the video.
The only place where numbers might cause a problem for radix sort is where you have a number type that's not implemented using the standard types. Even there (or for strings) you just need to re-implement radix sort with a variable number of passes, depending upon the length of the string or the number of base-256 digits in the number.
And when sorting non-standard datatypes, quicksort can be affected too.
A comment here, Mr Creel: I would like to see the RAM use for every instance in that table.
Yes, with radix sort, the greater than/less than comparison is built into the fact that the order in which we store the counts is itself an ordered list. You're leveraging compile-time information to help with speed. Of course, that's directly why you're using more memory.
This is my first time watching some of your videos and this is great! Excellently explained.
Good job on not giving links to the older parts, so people who land here first will have a harder time finding where it all began.
Thank you so much for putting up this masterclass.
There is also a "database sort": we know how many "pages" are required to hold the sorted result, since each page can hold a fixed number of elements (that number is our choice), so the memory allocation occurs just once. Then, taking one element at a time from the array to be sorted, we look up which page it should go into (each page keeps the min and max value it holds as book-keeping while adding or transferring data) and insert it into the page it belongs to if there is room; otherwise, we move half of that page's data to a blank page (blank until now, its memory already reserved), a new page which is then properly linked. Since the amount of required memory is known at the start (TWICE the initial data), the memory allocation is done just once. All the other manipulations are book-keeping. Note that database "indexes" rank the data by keeping the index, rather than the values, in order, which works fine for strings, floats, or even orderable structures of multiple "fields".
I suspect it MAY be faster than a quicksort, given that if you have to sort 1G elements, they don't arrive 1G at a time but one by one, so the database sort is working from "day one" instead of having to wait for the 1G-th element to be known.
It would be great to see the shootout on strings of different lengths, too.
i like the comparison done in this series
Most of the data I deal with is already partially sorted with combinations of hill sorted, valley sorted and reverse sorted. This can be pathological input for Quicksort but makes another algorithm like CombSort11 shine.
Great vid again! This sure looks like, if I don't know how much there will be to sort, it would be better to just use radix sort straight up.
FANTASTIC Video! I appreciate it:) Cheers mate🐲
I think a really good sort algorithm would be one that first scans the data to be sorted (size, type, randomness, min/max data values...), and based on that, selects the most appropriate sorting algorithm. For example, it could have 10 complete sorting algorithms available to itself, but decide which one to use based on the data. Some of those 10 could even be combination algorithms involving pieces of different subalgorithms.
Also, I wanted to mention that heapsort has a simple tweak in that instead of just plucking the root node each time, you can pluck that and one of the children (if available). For example, if we had a maxheap with the first 3 elements being 10, 4, and 6 (in that order), instead of just plucking the 10 and then reheapifying, we can pluck the 10 and the 6. Someone actually benchmarked this for me and it was like 3% faster than 1 element popping heapsort. A mild improvement but hey if it works and it is faster, why not use it?
Isn't that what introsort is all about?
An interesting topic related to this and probably relevant for many cs students that are obviously watching this (judging comments from previous videos) would be the effects of multithreading these things. Spoiler: Quicksort is not stable enough even with good pivots to get as much out of it. Also to justify the overhead you need to have more than 1e7 elements to begin with. Thinking about it: Multithreading is a good example where the stability of an algorithm actually matters.
If you want to spice it up: Sorting is one of those things that can make really good use of GPU acceleration. Maybe run a cpu radix sort vs a gpu bubble sort on the same data set. So many fun topics around this to explore.
Another nice topic for an addendum video: the practical side of things. In the wild, the choice of sorting algorithm is more or less academic. There is close to no cost/benefit in custom-fitting some std_lib sorting to your particular dataset. Yes, there are cases where every ms matters, particularly in lowest-end embedded or HPC implementations, but for your average job you just use some standard sort and be done with it. As with any algorithm you should be aware of its pros and cons, but do not obsess over reinventing the wheel. The "you can always come back later to fix it, when it breaks everything" mantra is sadly reality. As is trying to "outsmart" modern compilers with your newly learned asm knowledge.
Well, there is a GPU radix sort, yes. I remember watching a GTC talk on it a long time ago, though I couldn't tell you which it was. I did actually make a multi-threaded version of the radix sort for this video series too, but scrapped it. I just split the data into 8, and then used a merge step at the end. It was ok. Mostly bottlenecked by the RAM reads, I think.
Certainly a lot of fun topics, yes!
Hmm, how do you even parallelize radix sort for a GPU? I was thinking about this yesterday. It is still iterative along the number of digits: you have to do one digit first, reshuffle the numbers, then do the next digit, so I'm not sure there's much of a speed up. Also you'll run into weird race conditions if you try to change the same memory location in multiple GPU kernels/threads. Have to think more about how to avoid that.
@@nikilragav There is a ton of literature on the topic. I think one of the best illustrations of the idea is in "HARRIS M. Parallel prefix sum (scan) with CUDA. GPU Gems 3", as it also addresses implementation "problems" you face on GPUs. There are quite a few optimizations of the idea, and GPUs have better implementations nowadays (and more cores, and shaders to play with). For example, "Fast 4-way parallel radix sorting on GPUs".
Another thing to keep in mind: GPUs do not use the same memory system as CPUs. It is quite different in its usage. There are also tons of absolutely unintuitive parallel algos that one just dont know about when starting with the topic.
There is also other neat sorting stuff from the last few years; worth looking through their citations too. doi: 10.1109/ACCESS.2019.2920917 or doi: 10.1155/2020/4793545
@@florianh.6256 Hi! I *briefly* looked at that one after I wrote my comment. Yep I'm somewhat familiar with GPU algos and memory allocations
At work, I often found myself having the best performance with insertion sort. Eg.: Inserting events into a file from networked sources based on timestamp. The "better" sorting algorithms often are worse, depending on your dataset.
std::sort uses a hybrid sorting algorithm called introsort; as you mentioned, it is quicksort + heapsort (+ of course an O(n^2) sort, insertion sort I think, for short subranges). The quicksort uses "median-of-three" (*) pivot selection. It works well for sorted, reverse-sorted and random tables, a bit faster even than a random pivot, but there exist permutations that break it. This is where heapsort comes in: as the quicksort iterations progress, the recursion depth is checked, and if it is too large (greater than log[n]*const), the problematic region is sorted using heapsort (a bit slower in the mean case, but the worst case is n log n).
*) It is the median of the first, middle, and last elements, not of random ones. What's interesting is that gcc uses the second, middle and last elements, and the median of those is placed in the first position. Not sure why, but I would guess it gives fewer jumps in the code and is a bit faster.
The main advantage of the median rule is that we are guaranteed the range contains at least one element not smaller, and at least one element not greater, than the pivot. This allows us to get rid of some if-statements.
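A small sketch of that median-of-three selection (my own arrangement of it; libraries differ in which three elements they sample and where they park the median):

#include <algorithm>

// Median-of-three: sort the first, middle and last elements in place and use
// the middle one as the pivot. The range then contains at least one element
// not greater and one not smaller than the pivot, which is what lets the
// partition scans drop some bounds checks.
int medianOfThreePivot(int* arr, int lo, int hi)
{
    int mid = lo + (hi - lo) / 2;
    if (arr[mid] < arr[lo])  std::swap(arr[mid], arr[lo]);
    if (arr[hi]  < arr[lo])  std::swap(arr[hi],  arr[lo]);
    if (arr[hi]  < arr[mid]) std::swap(arr[hi],  arr[mid]);
    return arr[mid];   // the median of the three samples
}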
Was the pattern of Radix Sort using 2x the memory of Quick Sort something that held through larger arrays? Would have been cool to see that in the final chart at the end. Awesome video series, thank you! :)
Sorry, I glossed over that! It does continue, yes. Radix Sort, as used in the video, doubles the array, so it takes about double the RAM of QuickSort. Well, cheers for watching mate, and have a good one :)
I'm _real_ happy I only sort by Unix timestamps
If you know the size of the array in advance, then radix sort has a huge advantage, since you can just use a template to give the array a compile-time size and skip runtime allocation, saving time; you can even do the same for the counter array, again saving time.
Unfortunately, initializing them to 0 will not reduce it
It will even remove all allocations, leaving nothing to deallocate, so a faster and more stable program.
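A minimal sketch of the compile-time-size idea for the count table using std::array (the template parameter N for the data would only apply when the data size really is a compile-time constant; the names are my own):

#include <array>
#include <cstddef>
#include <cstdint>

// Sketch: when sizes are known at compile time, the count table (and, if N is
// a compile-time constant, even the data buffer) can live in std::array, so
// there is no new/delete at all.
template <size_t N>
void radixPassCounts(const std::array<uint32_t, N>& arr, int shift,
                     std::array<uint32_t, 256>& count)
{
    count.fill(0);                              // zero the 256 counts
    for (size_t i = 0; i < N; ++i)
        count[(arr[i] >> shift) & 0xff]++;      // histogram of one byte
}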
*For more sort comparisons:*
Check out *w0rthy* : th-cam.com/users/w0rthyAvideos
For example: "Sorts - Classic" th-cam.com/video/CnzYaK5aE9c/w-d-xo.html
Check out *Timo Bingmann* videos: th-cam.com/users/TimoBingmannvideos
For example: "15 Sorting Algorithms in 6 Minutes" th-cam.com/video/kPRA0W1kECg/w-d-xo.html
I really like your explanations. For a programmer they are very easy to understand.
My frustration with compsci books and articles is so often that they talk a little too abstractly; for example, they wouldn't really worry about trying to address the "bucket" problem in radix sort.
One odd hunch I had while you were describing that problem: could you avoid allocating a second array if you used a radix of 2 to somehow do clever in-place bucketing, or would that not work because you might get too many items in a row for one bucket and it wouldn't balance out? I plan to work through that soon myself to find out, but figured I'd mention it before then.
Would have been nice to see the memory ratio in the comparison table.
The memory ratio is 2 or it approaches 2 as the number of elements approaches infinity.
Awesome video as always, sir
Thank you very much!
Great stuff! P.S. The final table is missing a required-RAM column.
Fascinating video series, thanks for educating us. :) I will say though, I don't often find applications where I only need a list of numbers sorted. The numbers are often attached or keyed to a data structure, and the whole structure needs to travel with the number, or at least be referenced somehow. Combined with something like a hash map to quickly... well, map the numbers back to their containing structures, this could be a nice intermediate step. But if you've already constructed a hash map, why would you need a sort... *scratches head for a minute* ... Actually, if you can pack the number to sort, as well as an index to the original structure, into a single integral value, perhaps an int64 or just a byte array, and only do the radix sort on the least significant bytes, then the indices or pointers would be part of the sorted data. Yeah... I'll have to try this some time. Comparison sorts are all I learned in school; it's very interesting to see other approaches.
So, follow up... I wrote a radix sort template class; it can do integers, strings, and any structure really, as long as it can be represented as a sortable buffer of bytes. I haven't tried negative numbers yet though. The way I ended up doing it was to allocate two buffers for indices into the original data, in and out, with a buffer-swapping approach. A little heavy on the memory, but very flexible. That also gives the option of returning just the list of indices without modifying the data (view()), or doing swaps on the original array (sort()). It relies on two functor classes to obtain the size of the inputs and evaluate the sortable bytes, but it was trivial to default those for ints and make a "prefab" for strings that you can pass. It also has a persistent buffer option and preAlloc() to avoid those malloc penalties for repeated calls.
sorter rs; rs.sort(data,size);
or
sorter rs; rs.sort(data, size);
Pretty handy! Thanks again for the inspiration!
If anybody wants it, github.com/DFPercush/RadixSort
I'll keep working on getting that negative number thing... sorted.
Cool series; thanks! Not sure how I managed to go this long without having this algorithm on my radar. I suppose just because it's relatively scope-limited to fixed-length unsigned integers, so its utility is near-zero in other contexts?? Anyway, still cool to know about, and has its place. I imagine it could also be used to pre-sort other kinds of data, if one can easily compute a fixed-length unsigned int that represents the most significant sorting components (e.g. casting char[] arrays, to effectively sort by the first few characters)? Interesting stuff.
I think the complexity of radix sort should be O(N log_b(max value)), where log_b is log with base b = the number of buckets used. For most computers the max value is 2^32, so the complexity becomes O(32N) = O(N). But everywhere it's taught as O(N), which I find misleading.
k = log_b_(Max value) is always a constant so it will always end up as O(kN) = O(N)
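Written out (a quick sketch, folding the per-pass work on the count table into the bound):

k = \left\lceil \log_b(\text{max value}) \right\rceil
\quad\Rightarrow\quad
T(n) = O\big(k \cdot (n + b)\big)

For 32-bit keys with b = 256: k = ceil(log_256(2^32)) = 4, so T(n) = O(4(n + 256)) = O(n); the hidden constant 4 is exactly the k being discussed above.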
Great videos.
Is it possible to somehow check the starting list of data, whether it is just integer numbers, negative, positive, floating point, just letters, a combination of all, …
So spend some time checking the data, get information on how much memory we have, and then decide which sort algorithm is the best (actually the fastest) for the given data and available resources.
That would be very close to an ultimate, universal, fastest algorithm for any type of data.
Of course this is only if we do not know what type of data to expect.
I know it's not just the type of data and the memory we have, so maybe include more information and checks.
Thank you ! Interesting and fun video
The radix sort should effectively require 2x the data size in memory, while the quicksort should be able to work within the data's own space.
If one used a thread pool, the quicksort could be improved a good bit in performance. I doubt it would catch up with radix sort though. It would require waiting until each division is done before splitting up the work.
For radix sort, would it matter if you scanned only one time and built 4 count arrays, instead of scanning 4 times?
so instead of :
for (int shift = 0, s = 0; shift < 4; shift++, s += 8)
{
    // Zero the counts
    for (int i = 0; i < 256; i++)
        count[i] = 0;
    // Store count of occurrences in count[]
    for (int i = 0; i < n; i++)
        count[(arr[i] >> s) & 0xff]++;
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i < 256; i++)
        count[i] += count[i - 1];
    // Build the output array
    for (int i = n - 1; i >= 0; i--)
        .........
something like :
// Zero the counts
for (int i = 0; i < 256; i++)
{
    count[0][i] = 0;
    count[1][i] = 0;
    count[2][i] = 0;
    count[3][i] = 0;
}
// Store count of occurrences in count[]
// (note: after the shift, the digit is already in the low byte, so all four masks are 0xff)
for (int i = 0; i < n; i++)
{
    count[0][(arr[i] >> 0)  & 0xff]++;
    count[1][(arr[i] >> 8)  & 0xff]++;
    count[2][(arr[i] >> 16) & 0xff]++;
    count[3][(arr[i] >> 24) & 0xff]++;
}
// Change count[i] so that count[i] now contains
// actual position of this digit in output[]
for (int i = 1; i < 256; i++)
{
    count[0][i] += count[0][i - 1];
    count[1][i] += count[1][i - 1];
    count[2][i] += count[2][i - 1];
    count[3][i] += count[3][i - 1];
}
for (int shift = 0, s = 0; shift < 4; shift++, s += 8)
{
    // Build the output array
    for (int i = n - 1; i >= 0; i--)
This way initializing would be a little faster, it still fits in L1 cache, and you don't scan the input 4 times. More work per loop.
The building of the output array still has to take place 4 times.
But would this be faster? (I'm not that good at C++ to test this myself.)
The quad counting seems like a nice idea, write a paper and post the address here!
Small optimization:
The count array could have been allocated on the stack; that would just be 4*256 = 1024 bytes. Today, stack usage and arrays on the stack are not a problem, as long as the arrays have a constant size. In C there are variable-length arrays whose size is not known at compile time, and the same thing could be done with alloca. If you declare an array with a variable size, you should always check that it is relatively tiny, like 10 kB or less. On Windows you have a default maximum stack size of 1 MB.
"On windows you have a default maximum stack size of 1MB."
Is that the maximum size an array can be when passed as an argument to a function (aka put on the stack)?
@@harrysvensson2610 You wouldn't normally pass an array as an argument, but rather a pointer to the memory location where the array is (or in C++ it could be a reference, but the effect is the same). The stack just has to hold 'sizeof(pointer)' bytes.
I don't really see the optimization here, care to explain it?
@@benjaminfacouchere2395 Allocating and deleting allocated memory on the heap takes some time. If you sort a large array it's not going to matter much, but if you sort many small arrays that time is going to add up. Having the array on the stack is just changing the stack pointer, so it's essentially free.
Replacing
int* count = new int[256];
with
int count [256];
and removing
delete[] count;
makes the count array be on the stack instead of the heap. It will be removed automatically when the function ends, and again, it's just changing the stack pointer, so it's free.
Regarding the maximum stack size: Every time the program enters a function it "creates" space on the stack for the variables that function uses. So if you make a recursive function that needs 1000 bytes of stack space, it will have used a million bytes when it's gone 1000 steps "deep";
e.g.
void rec( int x )
{
    if( x < 0 ) return;   // stop recursing once the counter runs out
    int a[250];           // 250 ints = 1000 bytes of stack space per call
    rec( x - 1 );         // each nested call adds another stack frame
}
rec( 1000 );
will use 1 million bytes (more actually, since arguments are also put on the stack) before it starts to unwind. With a maximum stack size of 1 MB, you may run out of stack space and crash if something down the call stack also allocated a lot of data on the stack.
@@phizc Sure, you have a small one-time allocation penalty, but that's it.
Obviously we won't use radix sort on small arrays as other algorithms are more efficient there, also a count array of 256 is overkill then.
I had some data to extract from a Siemens PLC database to make a monthly report in Excel, or push the report on the fly to the web interface. What I did was read the month's data into RAM and save the indexes where each new day's data starts. I then create multiple threads on the CPU so that each thread handles the calculations for one day's data: each thread receives the memory address of the data and the index locations of its day. So it runs e.g. 30 threads, and the more cores the CPU has, the more the calculation time is reduced.
Nice one mate! :)
allocating stack memory is basically free (just increment the stack pointer register)
malloc/new is where the overhead comes in. Asking the operating system for memory is a big deal.
Once again! Amazing video. Keep it up! :D
Have a contest between different users rigs to see who's is the fastest? with the samples of code used.
I wonder if you can use radix-type sorting on classes and the like. For example, a string - you could radix sort character by character I think - I just wonder if overhead would kill it. For other data types, for example a class with 3 numbers (hours, mins, secs), I imagine you could radix by seconds, then minutes, then hours.
It's certainly capable of sorting strings, yep! It's pretty flexible, really, so long as you can figure out how to read the data as something like unsigned int you're good to go! Cheers for watching mate :)
You are correct about sorting data with multiple key values per element. Because radix sort is a type of stable sort, you can sort starting from the least significant key value to the most significant (as in your example: first sort by seconds, then by minutes, and last by hours).
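A minimal sketch of that least-significant-key-first ordering, using std::stable_sort as a stand-in for the stable radix passes (the Time struct and field names are my own):

#include <algorithm>
#include <vector>

// Sort by seconds, then minutes, then hours. Each pass is stable, so the
// earlier (less significant) orderings are preserved within ties.
struct Time { int hours, mins, secs; };

void sortTimes(std::vector<Time>& v)
{
    std::stable_sort(v.begin(), v.end(), [](const Time& a, const Time& b) { return a.secs  < b.secs;  });
    std::stable_sort(v.begin(), v.end(), [](const Time& a, const Time& b) { return a.mins  < b.mins;  });
    std::stable_sort(v.begin(), v.end(), [](const Time& a, const Time& b) { return a.hours < b.hours; });
}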
Can we do radix sort for floats?
A funny thing that I just thought while watching this: You can use radix sort to sort strings alphabetically (letters = numbers, radix = the size of the alphabet)! Is this how programs usually do that?
Thoroughly enjoyed. Reminds me of uni!
image processing in assembly. any advice about books, tutorials, forums, info?. thanks!
Thanks for this amazing series of videos ... explained well and entertaining. Chapeau!
A nice short video series. The prefix sum idea was especially clever.
Could you possibly do even better if you try to optimize the sorting algorithm to make better use of locality? By that I mean, this seems like it would be doing fairly random accesses to memory, whereas having accesses that are mostly linear would perform better. To my understanding, having fairly random accesses causes page faults.
Hmmm... that's an interesting thought. The data is random, so there will have to be randomness somewhere in the algorithm - in a QuickSort the randomness is in the elements that we swap. If we use a small radix, we get much better locality, but at the cost of more iterations. I can't think of a way out other than finding a nice compromise. That compromise seems to be about base-256 radix sort: the counts are small enough that they fit into the L1 cache, but large enough that there aren't too many iterations. Well, cheers for sharing this thought, it is very interesting! And cheers for watching :)
@@WhatsACreel
Perhaps one could make buffers for each possible radix value (digit? symbol?) such that the L1 (or possibly the larger caches) can fit everything. Once a buffer is totally filled, we flush its contents to where the data is supposed to go. Then the page faulting would only occur once each time we need to flush a buffer.
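A rough sketch of that per-digit buffering idea for the scatter phase (the buffer size and names are my own, and the flush of partially filled buffers at the end is only hinted at):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: software write-combining for the scatter step of a radix-256 pass.
// Elements are staged in small per-digit buffers and written out in blocks,
// so the random writes to 'output' happen a chunk at a time.
void bufferedScatter(const uint32_t* arr, uint32_t* output, size_t n,
                     size_t* position,        // per-digit write offsets (prefix sums)
                     int shift)
{
    constexpr size_t BUF = 16;                // elements staged per digit
    std::vector<uint32_t> buffer(256 * BUF);
    size_t fill[256] = {0};

    for (size_t i = 0; i < n; ++i)
    {
        unsigned d = (arr[i] >> shift) & 0xff;
        buffer[d * BUF + fill[d]++] = arr[i];
        if (fill[d] == BUF)                   // buffer full: flush it as one block
        {
            std::memcpy(&output[position[d]], &buffer[d * BUF], BUF * sizeof(uint32_t));
            position[d] += BUF;
            fill[d] = 0;
        }
    }
    // (Remaining partially filled buffers would be flushed here.)
}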
Fantastic mini-series; absolutely fantastic. However the part where you measure the performance of sorting 10, 100 elements is cringe 😅😊 come on you should have started at 1.000.000 🤣🤣
The process memory is shown in real time. When you are looking at it the sorting is already done and all the memory was freed again. Look at @11:54 and @12:21 in the process memory graph. You can see it spiking at the beginning and then it's freed again.
I am also curious why you are measuring the time of the random number generator too. Yes, it is just O(n), but for a comparison between both sorting algorithms you cannot just add a constant amount to each time.
love your videos, if you get a chance would you mind doing a deep dive into hashmaps vs arrays for linear search?
You're including the time of generating the array in the time it takes to sort the array. For an O(n) sort, this is not a trivial overhead, but it is for O(n log n). This makes the test skewed against radix sort.
Good point, I should certainly have removed that from the timings!
You are a legend! Thank you for stopping by :)
@@WhatsACreel Hey, little mistakes keep me on my toes, I suppose! Then the trick is to just try to do enough research to make sure the feedback is correct, but I like to gloss over that part.
Lovely channel. Not sure I've seen anything like this before!
@@_notch Well I certainly make lots of mistakes. Looks like solid advice to me mate!
Cheers for the kind words! Hey, reminds me of a game I played, never saw anything like it! Built a sweet little house with a farm by a river, then I fought off zombies and found some diamonds!!
Hahaha, magic! If you’re ever down this neck of the woods give us a yell, I reckon I can find a bbq and a cold beer :)
@@WhatsACreel That sounds lovely!
It would be fun to hear if any of the algorithms could be sped up by using SIMD functions, and then by utilizing locality (maybe structs with alignas?) for the cache lines to reduce the number of cache misses.
Optimize it to the ground!
It can be and has been. Intel performance primitives provides a very nice radix sort for all integer widths as well as floats
gpgpu will smoke cpu simd.
@@controlledsingularity8084 For a very branchy integer based operation? Maybe, but probably not by the margin you think it will. Do any of the STL-like utility libraries for CUDA have a radix sort to compare it to? Likely the copy latency will be the longest thing
Nice series
One thing I don't get is why people say that radix can't sort complex objects. As long as you have a key function that returns an integer, anything can be sorted.
Good note at the end about exploiting patterns in your data.
Also, many people think way too hard about sorts. If you're sorting 10 000 000 elements, it's sometimes better to check if you really need to sort it, or whether you can get it sorted.
Often, the best sort is whatever the standard library gives you.
Would using other modulos (apart from 10) give better or worse results? Or, rather, how would it affect the speed and what are the things to consider when choosing the base for modulos?
Yes, it can make a big difference! Mostly, we stick to powers of 2 for the modulus, so we can use boolean AND instead of dividing to get the digits. Cheers for watching mate, have a good one :)
@@WhatsACreel Thank you for your videos, they are not only educational, but a lot more fun than other similar ones )
Great series. But why did you not mention that radix sort can start from both ends? I.e. - there is LSD (the one you covered) and MSD sort, which is basically the same, but starts from the leftmost digit.
Yes, sorry. It's just I had the code for LSD and some animation ideas. No real reason other than that. Maybe we can do another vid in the future on MSD? I'm not sure. Both are fine algorithms, and cheers for watching :)
Looking at the results for quick sort I noticed it was almost linear once you reached 100k elements and above, which I wouldn't have expected.
I mean, it's still nearly an order of magnitude slower than radix at 1bn elements, but even at 10m elements it's still a usable algorithm if 1 second of wait time is acceptable, especially given radix sort's massive RAM requirements.
AMAZING!
when you're working with truly huge datasets radix has no comparison. it's the only choice for pre-bucketizing string lists over a million unique points.
I've been using a pseudoradix methodology to keep my homebrew linked list map class sorted live.
the memory overhead isn't impossible to overcome when using linked lists, and radix also allows for parallelism where practically all other sorts are necessarily sequential.
when I'm dealing with data sets that won't fit in working ram, radix to the rescue again! just make your first bucketizing step output to file storage.
continue bucketizing to the disc until buckets fit in working ram.
array radix is kinda limiting for that tiny lookup bias. I feel like on average you gain more from being able to move blocks of memory and insert nodes than you lose from that lookup while the sort process is happening.
my linked list comes with 2 jump pointers that keep relatively close to binary locality when the map is open for active additions or deletions. Optionally the map could be collapsed into an array when its keys become const.
Tnx!!!
This is fucking awesome. Subscribed. I just found this channel today, incredibly good. :o
I'm here for the g'day :)
Is there a reason those two loops at the beginning of the Quicksort code are do while loops instead of simple while loops? Since the body is empty, I can't see how they would behave differently.
It'd be fun to see a video on the deliberately bad sorting algorithms like Bogosort.
5:35 are you on an Intel CPU? There is a big performance penalty for 16bit ints on Intel CPU's
I sure am! Now I'm looking at the speeds of 16 bit instructions against 32, and it does seem that some of them are strangely slow...? How odd. Mostly they're ok though, at least on this Kaby Lake. Still, it really does make me wonder why 16 bit was so slow?? I just assumed it was the zeroing of the counts. Could certainly be wrong! Great info mate, cheers for sharing, and cheers for watching :)
@@WhatsACreel x86 instructions favor 32-bit and 8-bit operand sizes by default; 16-bit operations require an operand-size prefix byte, so 16-bit instructions are both slower and larger.
Would using the "fast" types help for this? I'm only familiar with c and not cpp but imagine they exist in this language as well.
@@jeffbeasley8235 I think usually uint_fast16_t is 32-bit on x86.
I don't think it has anything to do with the speed of 16 bit vs 8 bit instructions when it comes to performance relative to the size of the radix. There are multiple issues at hand. The most obvious is that a table of 256 ints will fit in cache, while a 64k one probably won't, and because you access it randomly (the data is random, so you're as likely to increment any entry as any other), you lose time on cache misses.
Another point is using 64k means you need to compute the partial sum of your 64k array , which obviously takes much more time than the 256 array (128 times total considering you do it only twice instead of 4 times). That's a significant constant time added, which will definitely stand out for smaller arrays.
It is important to note, however, that the variable part of the time is theoretically halved with a 16 bit radix (you go over the array half as many times). And since the AND instructions would be 32 bit anyway (since your input is 32 bit), and even then I doubt they'd be twice as slow, memory access is most likely the reason.
I believe that using a 16 bit radix could be better if you start multithreading this, because you won't create as much contention on the various arrays elements. However, while counting is trivial to parallelize, putting elements in is not, you'd have to use some clever tricks there.
Also tried checking disassembly on godbolt, the big loop isn't unrolled, which is surprising since you'd probably get some extra performance as you avoid and'ing for one loop (since the result will always be smaller than the value you're and'ing it with).
@creel Did you try using SSE or AVX2? It might give you a significant performance boost on radix sort, since you can mask 16 bytes (SSE) or 32 bytes (AVX2) at the same time. I'm curious about the results..
As quicksort (& friends) use a divide & conquer approach, they are relatively easy to parallelize, which does not help the big-O speed of the algorithm but does help the wall-clock speed. How well does radix sort scale in this way? I don't see an obvious way to do that in the steps you described, except for chopping up the input array into parts, radix-sorting those, and then using a merge sort on the sorted parts perhaps?
Also: how useful (if at all) is this technique if you dont't just want to sort a bunch of integers, but a collection of structs that you want to sort by a field that is an integer?
Very late reply, but the way you could do it is by using the fields as keys and then reiterating over the list of integer elements to create a sorted list of the structs. It would be slower than a quicksort unless you have something in the 5k-10k range of structs/objects, but it's theoretically useful if you have massive object systems that need to be sorted.
I'm trying to figure out if radix sort would turn into a special case if the bucket size was one bit. So instead of dividing by 256, dividing by 2. From an assembly point of view this could be interesting for code size, I guess?
Radix sort is also super easy to parallelize.
Hi guys! How would I extend this function to work with 64 bits and potentially with 128 bit numbers? Thanks!
The slowness for the 16-bit radix is probably because of cache issues. 256 buckets might be able to make better use of the cache.
Oh. Code...
Awesome! I thought this was gonna be a 2 part video 😅😅
I wonder if radix sorting by the most significant digit instead of the least would be faster, reading memory nearby another memory address could be binned? or would it be worse if memory modules allow reads from different parts of memory in parallel?
When I want to toggle between two pointers, I do:
xor_ptr = ptr1^ptr2
cur_ptr = ptr1
Then to toggle cur_ptr:
cur_ptr ^= xor_ptr
(sorry if this technique was used, I got distracted during the early part)
Looks like an xor swap?
A ^= B
B ^= A
A ^= B
I love it! Will say, I was tempted to :)
@@WhatsACreel Maybe, I don't really know the terminology.
A lot of the programming I do is for monochrome calculators, and to simulate grayscale, we can toggle between two buffers (or even two sets of two buffers if we need to do some animation and don't want the screen tearing). In this application, we don't really need to keep track of the frame, so being able to just XOR a pointer with some pre-computed value in order to toggle between two values saves a few precious clock cycles and bytes on the Z80 :P
I have a pretty limited formal experience with CS, so the only other time I've seen this technique used was XOR-linked lists.
EDIT: And the first time I did mergesort on the Z80, that's what I used to toggle between the input and output buffer.
@@ZedaZ80 Pure class mate!! XOR-Linked lists are marvellous :)
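For anyone curious, here is a small C++ sketch of that XOR pointer-toggling trick for flipping between two buffers (the variable names are my own):

#include <cstdint>

// Sketch: toggle one pointer between two buffers without a branch or a second
// variable, by XORing it with a precomputed mask of the two addresses.
void xorToggleDemo(uint32_t* bufA, uint32_t* bufB)
{
    uintptr_t toggleMask = reinterpret_cast<uintptr_t>(bufA) ^ reinterpret_cast<uintptr_t>(bufB);
    uint32_t* current = bufA;

    for (int pass = 0; pass < 4; ++pass)
    {
        // ... read from 'current', write to the other buffer ...
        current = reinterpret_cast<uint32_t*>(reinterpret_cast<uintptr_t>(current) ^ toggleMask);
    }
    // After an even number of toggles, 'current' is back to bufA.
}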
Cool series! Ta.
Great video, but one thing I miss is how much RAM the algorithms use and how it changes referring to the amount of numbers in the array. Maybe a graph would be nice.
Doesn't really matter here; quicksort needs basically no overhead, and radix sort will double it, because it needs a temporary work area.
In other words, if you pass in 2GB of numbers (that's over 500 million ints), quicksort needs nothing, radix sort will need an additional 2GB.
@@billy65bob Quicksort has a recursive call, so it requires the stack. The problem isn't how much time the stack allocation takes (as you say, it's very fast). The problem is that in the worst case it requires O(n) calls. To imagine the worst case, consider a scenario where the pivot element only reduces a subarray with s elements into two smaller subarrays of size (s-1) and 1. All of those remaining subarrays of size 1 "pile up" as something remaining to be done in the worst case. If your input array is 2GB, accounting for just the word size of the return pointer, and having the stack grow n/4 times, then you've already required a 4GB stack! Causing, as they say, a stack overflow.
The fix isn’t so bad: turn the second recursive call into a tail call, and force the remaining (first) recursive call to operate on the smaller subarray first.
@@mshonle quicksort doesn't work that way; it's an unstable in place sort and requires no temporary working area, except to store a value for a swap.
You are right in that it's not free and uses recursion which requires stack space, but this cost is trivial, even if the compiler cannot do tail recursion optimisation.
The worst case would be O(logn) levels deep; more than small enough to comfortably fit in 1kb of stack.
@@billy65bob needing O(lg n) stack entries is the BEST case- and, without tail call optimization, you’ll have O(n) stack entries in the WORST case. If you are sorting anything large, then the “trivial” linear space will certainly overflow your stack!
@@mshonle I really don't give a toss; if your quicksort has a space cost of O(n), then you have screwed up, and catastrophically so.
I'm a beginner programmer and would like to know how to easily make a function getDigits(number, base).
I managed to make a base-10 one, however other bases don't work...
Quicksort can be horribly slow in a worst-case scenario (such as an almost sorted array), which is why std::sort is usually implemented as introsort (quicksort that switches to heap sort if it reaches a certain recursion depth), followed by an insertion sort at the very end when the array is almost sorted.
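A rough sketch of that depth-limit idea (my own toy version: a naive last-element pivot, std::make_heap/std::sort_heap as the heapsort fallback, and a plain insertion sort to finish; real library implementations are considerably more refined):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy introsort: quicksort that switches to heapsort once the recursion gets
// too deep (about 2*log2(n) levels), then an insertion sort finishes the
// nearly sorted array. This bounds the worst case at O(n log n).
static void introSortImpl(std::vector<int>& a, int lo, int hi, int depthLimit)
{
    while (hi - lo > 16)                      // leave small ranges for the final insertion sort
    {
        if (depthLimit-- == 0)
        {
            std::make_heap(a.begin() + lo, a.begin() + hi + 1);
            std::sort_heap(a.begin() + lo, a.begin() + hi + 1);
            return;
        }
        int pivot = a[hi];                    // naive last-element pivot, for illustration only
        int i = lo;
        for (int j = lo; j < hi; ++j)
            if (a[j] < pivot) std::swap(a[i++], a[j]);
        std::swap(a[i], a[hi]);
        introSortImpl(a, lo, i - 1, depthLimit);   // recurse left...
        lo = i + 1;                                // ...iterate right
    }
}

void introSort(std::vector<int>& a)
{
    if (a.size() < 2) return;
    introSortImpl(a, 0, static_cast<int>(a.size()) - 1,
                  2 * static_cast<int>(std::log2(a.size())));

    // Final insertion sort over the nearly sorted array.
    for (size_t k = 1; k < a.size(); ++k)
    {
        int v = a[k];
        size_t j = k;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; --j; }
        a[j] = v;
    }
}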