Between caches and prefetchers, on modern hardware arrays are almost always better than linked lists, and often better than other data structures too. Especially in things like game engine architecture, putting items in arrays versus linked lists of randomly allocated elements (i.e. items scattered all over memory) can yield very significant speed benefits.
that is true, however this is mostly because Intel has made its own under-the-hood optimizations; I can bet that the performance on an ARM or AMD chip is quite different. The problem with this is that at application level you have no idea what is going on and you can't optimize for it; it is just a question of time until this falls off, just like DirectX compared to Vulkan.
@@D4no00 The same performance hit happens in PowerPC architectures, and for that matter, any processor with a data cache. Also, arrays are safer than pointers stored in memory. Some safety critical coding standards prohibit stored pointers.
I remember when Computerphile was brand new, I was in my Data Structures and Algorithms class, and doing horribly. I think I would have done a lot better if they had started this channel about a year sooner. I did tell my professor about this channel on one of the last days of class, so I never found out if he ever used it in any of his lectures or not.
First impression. Depends on the operation. I tend to use arrays when I need contiguous memory, or I have a known length, and if I need random access. I use linked lists when building lists with unknown length and where random access is not required.
Used that spinner on VDTs, back in the day... { putchar('\b'), putchar( "\\|/-"[ a++ & 03 ] ); } His was displayed inside "()", so I suspect some kind of "printf()" that certainly wouldn't help speed things along during initialisation (on the Atari)... Not sure why he's generating random numbers to initialise all those 'p's... Any (smallish) number would do just as well: struct.p = loopcounter; Having written lots of code for diverse tasks, imo, each approach has its use and purpose. Arrays when their size is not prone to fluctuation, single or double linked lists when there's lots of inserting/deleting to be done. There's no one "optimal" scheme that is best in all circumstances. (A recipe book has a fixed number of recipes, but each recipe has a variable number of ingredients... Thankfully 'C' has dynamic heap allocation and pointers...)
@S B C's "volatile" keyword makes optimisation a little less secure about its assumptions of what stays and what can be replaced. Scary will be the time when optimisation replaces some code's explicit linked list with array operations, or vice versa, and coders come to rely on that. The demo would work just as well with recursive function calls for the linked list example using the stack as the 'chain' of object instances.
Wow. That has to be one of the best videos on the topic of computer hardware differences and how we as programmers should not make any assumptions about how a machine is supposed to function when executing code. I love the "I'm writing in a high-level language" remark when actually it's C, one of the least abstract languages you can use without resorting to assembly. Just shows you how much this man knows his stuff.
Thank you for clearing that up. I started programming at the end of the 1980s. I didn't get the linked-list thing and why arrays weren't used; people were talking about linked lists instead.
A lot of it was about allocating just the memory you used and then recycling it, which favoured dynamic approaches that store structs/records together. Traditional arrays require sizing and static allocation, which imposed arbitrary limits. You didn't want to allocate a massive array on start-up for some feature which wasn't used, or which would restrict what the system could do. The languages used weren't necessarily as flexible as C, which limited options for long-running programs. Allocate at the beginning, process, and terminate worked fine for short-lived programs running jobs on multi-user systems. As users began using larger GUI programs, static pre-allocation became less feasible.
I stumbled across this video this evening, some 5 years after it was published. Brilliant work, mate! I'll be searching to watch more of your videos right now!
Also, depending on the fragmentation of the allocator the linked list nodes are taken from, on a cached CPU it can be much, much slower than an array, which always has its elements on the same cache lines relative to one another. And then there's a funny case: supposing you have an allocator which allocates things linearly and you allocate the whole linked list at once, it basically gives you an array with a bunch of (useless) pointers to next nodes between each element. This will perform not quite as fast as the array, because it won't be able to fit as many elements per cache line due to the space wasted by the pointers, BUT it'll be much closer to the array than before. Overall, the speed of operating on a full linked list on a cached CPU will vary depending on where the nodes were put in memory relative to one another. Another thing to note is that in my work, anyhow, you're very unlikely to see a list as big as 125k elements, so the effect of the cache on full array loops is even more pronounced.
@@tanveerhasan2382 Watch it, then you'll know. But generally, because growing an array geometrically reduces the amount of time you need for expanding/shrinking it, and because insertions/deletions, which are a linked list's strength, still require traversing it in O(n). From my own, limited experience, I've not yet come across a case where anything was actually faster than an array. In one case a hash map came close (might even have been the same speed, and it was definitely more convenient because I had key/value pairs), but I ended up ditching it and using a sorted std::vector, because I needed the simplicity for MPI compatibility.
When using the Raspberry Pi 3 there is the NEON co-processor. Compiling with -O3 could generate vectorized code; to be sure it is not, vectorization should be disabled. Vectorized code would increase the speed of the array manipulation.
Taking the median would give more accurate results. The OS scheduler might occasionally interrupt the program, which distorts the results when taking the average.
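For what it's worth, a minimal C sketch of that measurement style (the trial count, workload and timer here are all placeholder choices, not the video's code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TRIALS 9

/* hypothetical benchmark body: returns elapsed seconds for one run */
static double run_once(void) {
    clock_t start = clock();
    volatile long sum = 0;
    for (long i = 0; i < 10000000; i++) sum += i;  /* stand-in workload */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    double t[TRIALS];
    for (int i = 0; i < TRIALS; i++) t[i] = run_once();
    /* sort the samples and report the middle one: a scheduler hiccup
       lands in the tail instead of skewing the reported number */
    qsort(t, TRIALS, sizeof t[0], cmp_double);
    printf("median: %f s (min %f, max %f)\n", t[TRIALS / 2], t[0], t[TRIALS - 1]);
    return 0;
}
```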
I'm certain Mr. Bagley already knows about this, but on the 68k specifically an array will ALWAYS be faster than a linked list when coding in assembly, since you can predecrement/postincrement the address pointer (so direction doesn't matter), ie:

    lea.l   array,a0        ; address of array
    clr.l   d0
    clr.l   d1
    move.l  #1048576,d2     ; number of iterations
Loop:
    move.b  (a0)+,d1
    add.l   d1,d0
    subq.l  #1,d2
    bne.b   Loop

d0 will contain a checksum32. You could also add 1048576 to the address of the array and do "move.b -(a0),d1" inside the loop to read it in reverse; speed should be more or less the same. With a linked list you also have to fetch a new address for every iteration, so there's no chance it is faster on these specifically :)
When GCC switched to the GPLv3 license, Apple forked GCC. I believe the GCC version shipped on OSX is something like 4.2, mainline GCC is up to 7.1. This was in like 2006 or something. GCC has been significantly improved since then. So it's not unlikely that his version of GCC won't unroll and vectorize the code, even though real GCC will.
I can understand that the benchmark had to be trivial in order to run on the Atari ST, but I think simply summing integer elements in a loop gives a poor indication of the performance differences between linked lists and arrays/vectors on modern architectures. If the pointers in a linked list point to a larger data structure, it can absolutely thrash the cache and cause big performance hits. It is even worse if the elements in the linked list are scattered around in memory. With the three levels of cache we have now, cache misses carry a hefty penalty. Using arrays/vectors is almost always the right choice nowadays when performance is important, even if you don't do random access. The only exception I can think of is huge data sets with frequent insertions and removals.
The point wasn't to address a typical use case. The point was to illustrate that the fundamental architecture underlying your code can in fact cause things to behave in unexpected ways. Yes, most of the time X may be true, but *assuming* it's a universal case can cause problems.
It depends on your use case, I often end up in cases where an unsorted array and a bit of brute force is faster than a tree or hash based set when your data size is tiny (for example under 10) but you're doing a huge number of lookups.
+Madsy9 Sorting is another area where linked lists can be faster, especially if the record size is large, since sorting a list only requires altering the pointers, not moving entire records around.
Possibly the compiler is doing something less than optimal when making the code for an array reference. While in the Air Force in the early 90s we had a problem writing a document paginator on a worst case 5MB report file. (Ultrix workstations). The program sat there and churned for over 30 seconds while the paginator routine determined the starting location of every page in 17 different page formats. Users found this waiting intolerable. I read an article then about how C compilers may generate a multiplication to reference array elements, and that traversing an array by pointer reference instead of subscripts would not include this extra math behind the scenes and should run faster. I tried this idea out on my Amiga 3000 and the paginator using pointers ran in 12 seconds for an equivalent 5MB text file. I was pretty sure the 25MHz 030 should not be a lot faster than the Ultrix boxes at work. The recoded pointer version of the paginator ran in 5 seconds on the ultrix workstations. Happy users.
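For anyone curious what that rewrite looks like, here's a hedged C sketch (the struct and function names are invented, not the actual paginator code): the subscript loop leaves the index-to-address math to the compiler, while the pointer loop replaces it with one addition per element.

```c
#include <stddef.h>

struct rec { int p; };

/* Subscript version: each a[i] may compile to base + i * sizeof(struct rec)
   on a compiler that doesn't strength-reduce the multiply. */
long sum_by_index(const struct rec *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i].p;
    return total;
}

/* Pointer version: one addition per element, no multiply. */
long sum_by_pointer(const struct rec *a, size_t n) {
    long total = 0;
    for (const struct rec *q = a, *end = a + n; q != end; q++)
        total += q->p;
    return total;
}
```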
If you need linked lists on a modern architecture, store them in arrays, and instead of pointers use indices. It can be faster than a normal linked list, but of course you still need to measure that. The general approach for fast code is to put all required data relatively nearby in memory. Allocating one big array chunk kind-of ensures that (lol, there are OS caveats still), as long as the elements aren't themselves pointers to data stored elsewhere.
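A tiny illustration of that idea in C; all names and sizes here are made up:

```c
#include <stdio.h>

/* A linked list whose nodes live in one array, linked by indices
   instead of pointers. -1 marks the end of the list. */
#define CAP 8

struct node { int value; int next; };

int main(void) {
    struct node pool[CAP];
    int head = -1;

    /* push 0, 10, 20 at the head; the nodes stay contiguous in memory */
    for (int i = 0; i < 3; i++) {
        pool[i].value = i * 10;
        pool[i].next  = head;
        head = i;
    }

    /* traversal still chases links, but every hop lands inside the
       same array, so it stays cache friendly */
    long sum = 0;
    for (int cur = head; cur != -1; cur = pool[cur].next)
        sum += pool[cur].value;

    printf("sum = %ld\n", sum);
    return 0;
}
```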
Damodara Kovie Maybe it's part of who he is. I like to think it is intentional, after all they're computer scientists; they know how to use a damn DSLR.
So in all fairness you'd need to enable loop unrolling in gcc (experimental) and compare again. That should significantly speed up the array-based RPi compilation.
A lot of people in the comments do not know (or at least 4 years ago did not know) that you can insert linked-list elements at the head in O(1) if order does not matter. It will reverse the order of the incoming elements of course, but it gets rid of the need for an O(n) insert or needing to track the last element in the linked list.
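For illustration, head insertion in C (a minimal sketch, not the video's code):

```c
#include <stdlib.h>

struct node { int value; struct node *next; };

/* O(1) insertion at the head: no traversal, no tail pointer needed.
   The list ends up in reverse insertion order, as noted above. */
struct node *push_front(struct node *head, int value) {
    struct node *n = malloc(sizeof *n);
    if (!n) return head;   /* allocation failure: list unchanged */
    n->value = value;
    n->next  = head;
    return n;              /* new head */
}
```

Building a list of n items this way costs n constant-time operations total.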
Things become a bit more interesting for embedded real-time systems, when finding a memory location to allocate takes too long, so you have to allocate all of your memory in advance or write your own allocation algorithm. This shouldn't make a significant difference for reading the array/linked list, but it should affect creation time.
Being a C programmer is no excuse for confusing or non-descriptive variable names. As much as possible variable names should explain the purpose of their data unambiguously. I recognize there are times when this is not easy, but with modern IDEs there is no justification for bad variable names. Don't ignore the power of your tools.
Actually this is a very interesting video. Many might have thought that the RPi is just an intermediate device for running tests, but it is actually running on a totally different architecture, arm64. Many servers are migrating to arm64 nowadays as it is much more energy efficient; however, those cycle counts are alarming, as the clock speeds of those processors are no higher than Intel ones.
Generally, yes... across most use cases and particularly with modern cpu/compiler/memory architecture arrays will be faster than linked lists. But the point of the video is that it's not a fundamental given. Even in cases where you can math out the number of steps involved in the logic, your real-world systems that the code runs on can affect the actual results.
correct me if i am wrong: linked lists are slower but memory efficient, slower than arrays as they don't provide random access, but heaps can overcome all that stuff. is there anything more to it in the video?
Linked lists are not memory efficient, as they require more data per node (every node keeps its key and the addresses of the previous and next nodes). However, it is easier to add or remove nodes from the middle of the list. If you search data structures on Wikipedia, I think they give both the memory efficiency and the time efficiency for different operations.
If the linked list stores its data like an array, it'll likely be about as fast. But in practice, linked lists see insertions, deletions and concatenations, all of which fragment the data and create cache-miss opportunities. Context matters in this case, which is why it's best to profile these things in the situation you need them in.
9:03 I thought I was the only one still using floppies for some applications... ;) 21:40 I would have liked to see the results on an Intel Windows computer.
So you're saying the node pointer is Mario, the node is a remote floating platform, the "next node" pointer is a chuckster, and that my neighbors have already reported me to the police for blaring the a cappella theme on max volume. Well, I must admit, I like what you have to say, sir. I like it a lot.
Actually... you may want to look at the index calculation time for the array on the older machine. Some compilers didn't optimize for powers of two and used a multiply for the index. This was pretty painful back in the day... remember the ARM processor had a serial multiplier on some models, and a 5 or 6 clock multiply was also a possibility. People forget this today; typically it's 1 clock. IIRC the 80486 had a 3-clock integer multiply.
Hahaha, good shit. I was really really concerned when the linked list turned out to be faster on the Atari, just because I figured an array must be faster. Huge sigh of relief with the results on the mac and pi; I'm not insane.
Instead of using an array index, use a pointer to the first element of the array, and increment the pointer. I think that will be even faster. Modern Intel processors handle adding an index to a pointer and dereferencing the incremented pointer all in a single instruction. The reverse speed is likely due to how the cache works. (I haven’t seen the end of the video yet).
Loved this video! Thanks guys. Great takeaway: "if you're not sure, come up with some tests and collect the real data". This echoes Mike Acton's 'truths': 1. Hardware is the platform. 2. Design around the data.
It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing. You take the starting address and add (n * sizeOfElem). Arrays are fixed size; linked lists are of variable size and more flexible. At the same time, memory allocation is very slow; if you put your array on the stack, allocation is again literally a single CPU instruction (one adjustment of esp).

Running over all the elements would still be faster in an array, since in a linked list any element might be anywhere in memory, and thus might not be in the cache for a sufficiently large list, while the stack is almost certainly in the cache. Also, you need to dereference a pointer, meaning looking at one memory address to fetch the memory address at which the next element is, which is 2 memory accesses, while the array is again literally ESP + offset. Also, linked lists are larger in memory, since you have to store that extra bit of information.
"It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing." No - Arraya only have the complexity O(1) for lookups - when it comes to inserts and deletions in average half of data in the array has to be moved/copied so complexity is O(N) for inserts/deletions in arrays. Arrays are (compared to linked lists) faster in random read but slower in inserting or removing elements. They are also faster for sequential reads where all elements starting from the first one have to be read when optimized by modern compiler. Normally you need the calculation startingAddress + index * sizeOfElement when you want to access the element in the Array with the given index. But if you have a loop which increases the index every cycle by one a modern compiler can get rid of that muplication by introduing a variable "offset" which is set to startingAdress before the loop and to which sizeOfElement is added after each cycle in the loop..
Can Dr Bagley do another shootout, this time between lookup tables and live calculation? That would be amazing to see how modern processors differ from older ones.
Dear me, this is dodgy really. The choice between data structures should be made depending on what your use case is, e.g. is random access by index needed? are inserts needed? is iteration needed? I found this video interesting in some ways, but it fails to address the important point of how to choose a data structure.
NOTE: This 30 minute video analyzes in depth the one task where arrays don't crush linked lists by default. Outside of insertion and deletion at a known position, there is no operation for which the asymptotic complexity on an array is higher than on a linked list; in most cases, the array wins.
Very insightful benchmark. At first I was worried that you wouldn't mention the cache's role in it, but you did. The only thing missing is that this is a synthetic test, and real-life applications would show far different results. The reason is not only the cache misses you mentioned, but also that on x64 a linked-list node would be twice the size of an array element. That has consequences for how much memory gets prefetched into the cache, and the resulting cache pollution means greater performance hits during execution. Also, as this is a synthetic test, all the memory for the linked list is allocated one node after another anyway, so there are fewer cache misses than in the more likely real-life scenario where nodes are scattered through memory and nearly every fetch would be a miss.
Everything you say here is true, but beside the point. He's trying to teach a more fundamental fact about CompSci in general - just because X is faster/better/whatever on paper doesn't mean that will always turn out to be true when the code hits metal. While it's true in modern cases that most programs can ignore the underlying architecture in most cases, sometimes it can result in unexpected behaviors.
Extra credit earned for using the Atari! And I'm pretty sure the reason why it's faster when run in reverse is because no compare instruction is needed in the low-level loop, only a check for zero. (I see you clarified this at the end, and I'm right. Atari 2600 programmer here...)
Extremely disappointed this was a comparison of a linear scan of both data structures. What about arbitrary insertion and deletion? Those are some of the use cases linked lists were invented to address (no pun intended).
On top of that, something was seriously wrong with the compiler implementation on the Atari ST, that a linear forward scan of an array was somehow not faster than a linked list. And noting the speed of list/array creation -- something that is typically done only once -- was sort of pointless.
Arbitrary insertion and deletion of large structures (100's or 1000's of bytes) is going to be slow in an array. Then there's making a list out of items that may not be contiguously allocated, like a list of request structures for I/O coming from different threads that they allocate themselves, possibly on their own stacks. Of course, for the former, you can create an array of pointers to the big structures and move those pointers around the array if that suits you.
I found this disappointing too. It's billed as a comparison of data structures and ends up being a comparison of hardware architectures. When I was in school we used the concept of an abstract machine, and everyone learned the pros and cons of arrays and lists for sequential access, random access, inserts, deletes and moves. If a linked list beats an array for sequential or random access, something is definitely screwed up.
While I definitely agree that the choice really comes down to what you need the structure for, I found this information important as well, as it explains another aspect of choosing a structure that some people might not think about and that isn't considered as often as big-O notation. I guarantee you that if he had done the video we had expected, there would be people complaining in the comments that he didn't consider how they run on different architectures. Besides, at 30 minutes this video was long enough as is without comparing all the different use cases (and the previous video did mention some of them anyway).
Why would you do that, as it's being used in a machine that, odds on, only has a double density drive? (I know HD was at least a later option for the MSTe, but STs in the main have 720k drives) As well as bothering with that trick when proper HD floppies remain cheap (heck, pretty much free these days) and plentiful, and have formulation that's a better match for the drive characteristics? PC drives including USB ones can still happily read and write 720k discs, btw.
A huge advantage of arrays is, as mentioned, cache friendliness (data locality). But more than that, newer CPUs also include SIMD instructions to perform operations on a lot of data at once, say adding 4 or 8 ints in parallel. Add this in and the array will beat a linked list even more badly on modern machines. I would try compiling and running with -march=native (along with -O3), which allows CPU-specific optimizations, or use some intrinsics to manually add the needed SSE or AVX instructions.
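To see what the auto-vectorizer wants, here's a sketch of the friendly case. Built with something like gcc -O3 -march=native, GCC can usually turn this reduction into SSE/AVX (or NEON) adds; with -fno-tree-vectorize it stays scalar. Exact flag behaviour varies by compiler version, so treat this as an assumption to verify, not a guarantee:

```c
#include <stdio.h>
#define N 125000

static int data[N];

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = i & 0xff;

    long sum = 0;
    /* a plain reduction over a contiguous array: exactly the shape
       auto-vectorizers look for; a pointer-chasing linked-list loop
       cannot be vectorized this way */
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("%ld\n", sum);
    return 0;
}
```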
"The compiler is being clever", which is why you don't optimize code to compare algorithm speeds. Also, this is just traversal. How about speed testing insertion in the middle, which is what linked lists excel at?
And I thought I knew how this works. This was a brilliant answer to this "trivial" question. I learned a lot, which is some-divided-by-zero from what I expected at first 😀 Bravo and thank you
I'd like to see some tests regarding arrays and lists where you're frequently or constantly adding and removing entries. Because the array would have to rebuild every time, while the list should just be able to skip a value and allocate another.
Isn't the Atari array code running slow vs the linked list because the index-to-memory-address calculation uses multiplication? It's more related to C array syntax and the math occurring because of the index parameter. I expect this (multiplication) is a larger factor than memory access and the lack of caches on the old Motorola 68000 Atari system. The linked-list code will not involve multiplication. Did you look at the assembly code generated?
Did you notice that it was an array of objects, and not an array of primitives? Makes the array testing come out far, far worse than it otherwise would.
Joseph Cote While I don't care about the whole "object" vs "struct" thing, it is important to note that, unlike in Java, data structures in C aren't referenced by pointers unless the code specifies that they are. In C (and other languages), any structure, regardless of its type, can be accessed either directly or by pointer. In this case, the array was a contiguous sequence of structures, no pointers involved other than the pointer to the array on the heap.
Let me make my point even more plain.

Linked list: pointer to the head. Each item has data and the pointer to the next "thing". For each item in the list you have to get the address of the item either from the head pointer or the previous item. Memory access to get the data, memory access to get the next pointer. Perhaps even more processing to resolve the actual addresses within the "thing".

Primitive data array: using the base address of the array and the index, calculate the address to read (very fast, all in the CPU), access memory to retrieve the data, go to the next item. For each item, ONE address calculation and done.

This is my ONLY point: arrays of primitive values will ALWAYS be much more efficient in tests like this than any kind of fooling around with pointers and objects or whatever you want to call those "things." The video compared a linked list of "things" to an array of "things", NOT a big difference.
Wouldn't it have been faster without printing stuff on the screen in between? I know that many of my C and C++ programs run much quicker if you leave out the print statements.
that's a good point, writing to the screen can take some time. however both versions print to the screen, so they are subject to the same slowness. it's like if you had two people run a race through syrup - you can still tell who's faster than who.
yes, however the percentage speed difference is exactly the same, as they are printing the same length (minus one or two chars) the same number of times, so it won't affect the comparison
If you're finding that printing half a line of low rez text to the console makes a significant difference to the loop speed of your otherwise ~850ms routine, even on an ST, then you've got bigger problems than whether or not to print said text. Like, realising that you're actually stuck in a runtime interpreter for a particularly shoddy variety of BASIC when you thought you were making compiled C programs. If they're routines that take closer to 850us, then, sure, maybe it'll be worth it. But the text only prints after the loop is done (which is how it knows how long said loop took) and before the start of the next one, so it's not going to affect the reported results any, and in most cases the difference between printing or not printing 100 half-lines of text will save you, ooh, maybe one video frame's worth of runtime?
It's not necessary to pre-declare arrays in Perl. They grow to whatever size is required. (It may be slightly faster to create large ones in one go, if it's known in advance how many will be needed.)
You could also add that on the iMac the array could fit entirely in cache, as opposed to the linked list, and optimizing the code with loop unrolling gave it all the speed boost. So theoretically it might never have accessed memory except to load the program from disk when running the array version.
How about sorting? When you have big enough objects in an array/list wouldn't sorting a list be quicker when you can manipulate pointers instead of moving whole objects around in an array?
For BIG structs this should be true, but first you must come up with an efficient in-place algorithm where you need no random access. I'm not quite sure whether you can do selection sort or something comparable with a singly linked list (this example should work with doubly linked lists); with quicksort this should be even harder...
skeliton11211 I meant to add that to the question, to be honest. Of course an array of pointers would be much quicker. When you have a warehouse full of containers, you make a checklist of where everything is and organise that, rather than moving all the containers around.
Stefan Gräber I realised that later. Sorting a one-way list might be problematic, since you have to relink the neighbouring objects and insert the object in its new place. It results in far more pointer manipulation than an array of pointers. My question was more about the "pure" version of the structure.
I would imagine it depends on what you're sorting, and what sorting algorithm you use.

First, on the data being sorted: arrays have to swap the contents, so the time depends on the size of that content; linked lists have to swap the pointers, so it's constant. The two are equal if the data being sorted is the size of a pointer (or smaller); otherwise the linked list (I must assume, I should point out that I'm not an expert) will win out.

As for how you sort: insertion sort moves backwards along the array/list, and so is an outright no for linked lists (unless, of course, you use a doubly linked list, i.e. a linked list linked in both directions, which would have even greater memory use than a singly linked list and would require two pointer swaps rather than just one for every data swap). Heap sort relies on random access, so would take a hit on linked lists. Merge sort and quicksort (properly implemented) work well on both arrays and linked lists. Radix sort I'm not sure about: on the one hand, from what I recall, what you're sorting is specifically the sort of thing that would benefit from linked lists only needing to swap pointers rather than data; on the other, I would also think those to be things where the array would only contain pointers to them, for example strings (assuming you aren't accepting the restriction of all strings being the same maximum length, and probably all close to it, otherwise you're wasting lots of space).

Provided you keep in mind which sorting algorithms to use and which not, and your array elements aren't unreasonably large, the difference is (probably) negligible: sorting an array/list from scratch is something you probably won't be doing that often, or with data large enough, for it to really matter which of the two you use, at least in comparison to other operations you're likely to use on the same data. Lastly, I suspect that if you're sorting an array or list, you're probably going to want to do other things to it that linked lists are better suited to, that is inserting and removing items. That bit however is purely subjective and I don't have anything to really back it up.
Quicksort can be done in place so there's no need to do large copies. Even then, moving small structures is still often faster than moving pointers in a linked list, again because of the cache, which can perform the move operations in bulk.
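The array-of-pointers compromise mentioned earlier in this thread, sketched in C (record size and names are invented): keep the big records where they are and qsort an array of pointers to them, so only pointer-sized items ever move.

```c
#include <stdio.h>
#include <stdlib.h>

struct big { int key; char payload[1020]; };  /* ~1 KB record */

static int cmp_by_key(const void *a, const void *b) {
    const struct big *x = *(const struct big * const *)a;
    const struct big *y = *(const struct big * const *)b;
    return (x->key > y->key) - (x->key < y->key);
}

int main(void) {
    enum { N = 4 };
    static struct big records[N];
    struct big *index[N];

    for (int i = 0; i < N; i++) {
        records[i].key = N - i;      /* keys 4, 3, 2, 1 */
        index[i] = &records[i];
    }

    /* swaps 8-byte pointers instead of ~1 KB structs */
    qsort(index, N, sizeof index[0], cmp_by_key);

    for (int i = 0; i < N; i++)
        printf("%d ", index[i]->key);  /* prints 1 2 3 4 */
    printf("\n");
    return 0;
}
```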
I'm 2 years too late, but I think that there is one change to the linked list version that will speed it up on the cached processors. I assume that the elements of the linked list are added to the beginning as they are generated. As the summing loop walks the list, the memory address for each successive element is decreasing, making it more difficult for the CPU to predict and do speculative prefetch. If the newly allocated elements are appended, the general direction is toward higher addresses, which are easier to predict for the prefetch mechanism. I worked on a commercial C compiler for the Atari ST back in the 1980s, and benchmarks were important to the success of the product. Optimization strategies were weighted toward better performance on things like the Dhrystone with the knowledge that there was no cache.
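The suggested change, as a hedged C sketch (assuming the allocator hands out successive mallocs at roughly ascending addresses, which is not guaranteed): keep a tail pointer and append instead of prepending.

```c
#include <stdlib.h>

struct node { int value; struct node *next; };

struct list { struct node *head, *tail; };

/* O(1) append: traversal order then matches allocation order, which
   is friendlier to a forward-prefetching CPU than walking a list
   that was built by prepending. */
void append(struct list *l, int value) {
    struct node *n = malloc(sizeof *n);
    if (!n) return;
    n->value = value;
    n->next  = NULL;
    if (l->tail) l->tail->next = n;
    else         l->head = n;
    l->tail = n;
}
```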
Also, if you decide to manage the memory for the linked list yourself, you could allocate a block of memory in which the linked list gets stored, with new blocks allocated when space runs out. It would mean fewer cache misses, and it would run a bit faster.
Your original test is invalid. You unnecessarily used a different type of loop, with the loop used for iterating over the array doing a bunch of useless loop-maintaining arithmetic and comparisons that any competent programmer would have either removed or accounted for in the benchmark. Pro tip: summing the elements of an array is in fact faster, since it reads half the amount of data from memory for each access (only the number being summed, and no pointer). Since the x86 has more registers, the optimizing compiler had better luck keeping your loop counter in a register and not in memory, whereas on the 68k the loop counter was stored as a local variable (on the stack, in memory), and you access it 3 times per iteration: once to compare it, once to read it for incrementing, and once to write it back after incrementing. That's completely unfair to the array implementation.
68k has 8 data registers and 7 address registers, it should be enough for all three programs. The operation that slowed down the forward array iteration was probably comparing the index with the constant, which requires one more instruction than comparing it to zero.
it's not "unfair." the point is that the speed depends on the architecture's bottlenecks. imagine we were dealing with HDD instead of RAM. you would want a data structure that minimized reads from the disk because it's so much slower than memory. On the other hand with an SSD, you are not penalized so much for excessive reads and writes. Therefore you can use different data structures and algorithms without incurring the performance hit of the HDD.
The test is unfair and should be invalid, not because the 68k is inferior, but because the for-loop uses more complex mechanics and thus is significantly slower than the while-loop. The ascending for-loop is the slowest because it uses loop-counter mechanics and has to do an active end-of-loop check every iteration. The descending for-loop is faster because it uses the very nice decrease-and-branch-all-in-one instruction, using the decrement's own result to compare against zero instead of a separate compare instruction; but it still has the loop-counter mechanics to deal with. The linked list is fastest because it just loads the next address and checks whether it's zero.
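The three loop shapes being compared, written as C (a sketch of the idea; what the 68k compiler actually emitted may differ):

```c
/* Ascending for-loop: every iteration pays an increment plus a
   compare against the limit n. */
long sum_up(const int *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++) s += a[i];
    return s;
}

/* Descending for-loop: the decrement's own zero flag can drive the
   branch (DBRA-style on 68k), saving the separate compare. */
long sum_down(const int *a, long n) {
    long s = 0;
    for (long i = n - 1; i >= 0; i--) s += a[i];
    return s;
}

/* Linked-list walk: no counter at all, just load next and test it
   against zero (NULL). */
struct node { int value; struct node *next; };

long sum_list(const struct node *p) {
    long s = 0;
    while (p) { s += p->value; p = p->next; }
    return s;
}
```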
Commenting on an Atari isn't particularly relevant, as modern computers generally use Intel chips that are very cache heavy. Many tests have been done on these computers showing that almost all optimizations are next to nothing compared to the cache effects. The bottom line, as illustrated by many, is that arrays are almost always much faster than linked lists, and therefore linked lists should only be used in special cases. Arrays should always be considered the default. My first computer was an 8080 running at 1 megahertz (linked lists were probably OK on that computer), but that isn't very important to what works on modern computers. There is no argument, don't use linked lists.
Modern computers are not "generally" using Intel chips. Intel chips are dominant for PCs, but PCs do not represent a majority of even consumer computer systems. As an example, take the 3 big gaming consoles of the 7th generation (Xbox 360, Wii and PS3): between them they have over 260 million units sold, and none of them use Intel chipsets. Furthermore, most cellphones today run on Qualcomm chipsets. That's not even mentioning the non-consumer market. There the Atari test is absolutely relevant, as many embedded chipsets do not have a cache.
Pretty sure the majority are ARM chips. That's what you're going to find in just about anything that doesn't need a high performance CPU and in some things that do, like most smartphones or that Raspberry Pi for instance.
Maybe if you're using something dismal like an Atom or an embedded / microcontroller core based off a 386SX or what-have-you, but I don't think that was the point of the test. But even if it was, you can use the 68000 in the ST as a fairly prime example of such; embedded CPUs based on the 68k core (e.g. the Coldfire) were extremely popular up until a few years ago when ARMs and ATmegas started offering greater power and lower energy consumption alongside a lower price tag. So we already have the answer - sophisticated large instruction set, large cache chips prefer regular arrays, and that includes the higher end ARMs. Simpler, cacheless chips prefer linked lists, unless they also display unusual behaviour like being over 50% faster when traversing an array from the bottom up rather than top down.
I expected the reverse loop to be fastest, because that's one subtraction less. With a normal loop for (int i=0; i<N; i++) the end-of-loop check has to compare i against N every iteration; with a reverse loop for (int i=N-1; i>=0; i--) there is no subtraction and i is directly compared to 0. In any case, testing speed optimizations is needed. I sometimes theorized optimizations that turned out to actually be slower than something simpler.
Even more, I want to see the assembly language listings of the compiled output. p[i].p is well known to be less efficient than using pointer arithmetic (even inside an array); however, some (but not all!) compilers can recognize when p[i] is used inside a loop and automatically substitute pointer-based code. While I expect this to occur for x86-based code considering gcc's/clang's relative maturity, I question whether the 68K compilers provide similar optimizations.
"p[i].p is well known to be less efficient than using pointer arithmetic". My understanding is that p[i] is equivalent to *(p + i), why would it be any slower?
Mike Scott Because it involves a multiply and an addition, versus p++ which is just a simple addition. This becomes particularly visible with multiple dereferences to p[i].
This video was just quite misleading and clickbaity. Very disappointing compared to your typical videos. I've gotta put a dislike on this video for a few reasons:
- The description, which made it out as though in the general case arrays were slower
- He made out as though he was NOT going to be running it on a modern machine, just on the Atari
- The fact one has to watch the first 22 minutes (thank god for YT x2 speed) before he properly addresses the fact there are MAJOR differences between the machines
For cached architectures, It would interesting to see the performance difference with the linked list version if you malloc'ed a single contiguous block of memory and then 'sub-allocate' the structures in the memory block.
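A minimal sketch of that experiment in C (the block size and node layout are illustrative): malloc one slab up front and carve the nodes out of it sequentially, so the list's traversal order matches its memory order.

```c
#include <stdio.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

int main(void) {
    enum { N = 125000 };

    /* one contiguous block holding every node */
    struct node *slab = malloc(N * sizeof *slab);
    if (!slab) return 1;

    /* sub-allocate: node i is simply slab[i], linked forward */
    for (int i = 0; i < N; i++) {
        slab[i].value = rand() & 0xff;
        slab[i].next  = (i + 1 < N) ? &slab[i + 1] : NULL;
    }

    /* the walk is still pointer-chasing, but every hop lands in the
       adjacent cache line, so prefetching works almost as well as
       for a plain array */
    long sum = 0;
    for (struct node *p = slab; p; p = p->next)
        sum += p->value;

    printf("%ld\n", sum);
    free(slab);
    return 0;
}
```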
OK, so real computer science talk here regarding array vs linked list.

A linked list is a modular data structure that can grow or shrink to accommodate data in real time; arrays need to be completely reallocated and copied to change size. Linked lists also require an extra bit of data in each element, while arrays do not. Linked lists must be accessed from the top to the bottom, unless you saved an item of interest in advance.

It should be noted that the overhead of a for loop adds more instructions to this operation than just moving a pointer into a local variable. A for loop requires an index value such as i, a comparison operation, and an increment, while a linked-list walk only needs to go to the next item until that next item is 0. Both sound simple and the same, however they are not when it comes to the assembly: with the list you just move the next item's address into a register, compare it, and keep going, but in a for loop you have to compare the index counter to the length of the array and then increment the index counter. Since the for-loop version involves two variables (the length of the array and the index counter), you need to push those values onto the stack and pop them back off as your application steps deeper into its functions, or move them into and out of registers each time you reach the comparison. With the linked list you only need to check whether the value stored at a fixed offset from the list item's pointer is 0 or not; since that offset is fixed, there's no extra value to keep pulling from memory into a register.

I'm not sure I'm making a good explanation here, but essentially a for loop uses more steps to maintain the loop, while a linked-list loop just keeps going until a condition is met.
Memory management is a thing as well. With fragmented memory it can be difficult to allocate larger blocks. The old Macintosh used "Handles" to deal with fragmentation: the OS could move arrays in memory without the programs knowing.
I thought the array would have been faster, even on the Atari.. I was surprised by this. I think it comes down to memory speed and the way the memory controller addresses memory registers. The reason I thought an array would be faster is because it points to a specific register immediately.. and a linked list has to search and compare. But in the end.. *cache* is king!
What it looks like he's done is create an array of data objects: access via an index first, THEN the pointer at that index into the object. Had he created a direct array of primitive values, the array method might have taken a third of the time. I wouldn't have considered this a fair comparison either way. You have some problem you need to solve and you use the data structure that is appropriate for it. Maybe a linked list is best, maybe a doubly linked list, maybe an array of objects, whatever. And you live with the constraints of the chosen data structure.
"We're not gonna run this on the iMac, we're not gonna run it on the Pi, we're gonna run it on the Atari behind me" I love this guy so much.
Nick1wasd, he's taking CPU caching out of the equation, but on modern architectures, CPU-cache-friendly data structures can make an enormous difference in performance - arrays can be designed to be more CPU cache friendly
TheSulross You probably already know this by now but he goes into the major speed differences made due to CPU cache later into the video
Let’s run it on the Amiga and C64 and 800xl
@@TheSulross isn't that the point of why you use arrays more than linked lists just the fact it's CPU cache friendly?
I learned Java programming at an institute. There I basically learned the basics of how to program, develop websites and things like that, but they never even began to explain how things actually work. It still worries me how little I actually know about computers, since all I know is basically what years of progress in the computing world have achieved, and I'm just using the tools they gave me without questioning them. Videos like this are very important for someone like me; I appreciate it.
they didn't teach you data structures?
In my opinion this doesn't really have too much to do with data structures; it's more about highlighting compilers and CPU architecture @@abdulhfhd
@@jaydavis6357 you are right
The more important difference is how the performance scales for different operations and what you actually plan to use the data structure for. Both of these were O(n) operations, but selecting an arbitrary element is O(1) in an array and O(n) in a linked list. However, trying to dynamically increase the size of an array would be O(n) (since you have to make a new array and copy the data), while a linked list can add another item to the front in O(1). Adding or removing arbitrary items next to a node you're currently looking at is O(1) for linked lists and O(n) for arrays (though this would be O(n) for linked lists too if you had to walk to the element you want to change).
They are different tools for different things. If you're only using operations where the two structures are comparable (like the summation mentioned in the video) and speed is particularly important, then such a test makes sense, but what you use the data structure for is likely to be far more important.
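For concreteness, the O(1)-vs-O(n) trade-offs above sketched in C (helper names invented; bounds and capacity checks left to the caller):

```c
#include <string.h>
#include <stddef.h>

struct node { int value; struct node *next; };

/* Array: arbitrary access is O(1), just address arithmetic... */
int array_get(const int *a, size_t i) { return a[i]; }

/* ...but inserting at the front shifts every element, O(n).
   (Assumes the caller ensured there is spare capacity.) */
void array_insert_front(int *a, size_t *len, int value) {
    memmove(a + 1, a, *len * sizeof *a);
    a[0] = value;
    (*len)++;
}

/* List: arbitrary access means walking i links, O(n)
   (assumes i is within the list length)... */
int list_get(const struct node *head, size_t i) {
    while (i--) head = head->next;
    return head->value;
}

/* ...but splicing a node in after one you already hold is O(1). */
void list_insert_after(struct node *pos, struct node *n) {
    n->next = pos->next;
    pos->next = n;
}
```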
I agree with your point about choosing the right tool for the job, but insertion into a dynamically resizing array is amortized θ(1) if you change the size geometrically.
Chris Hinton I think Nerdnumberone was referring to inserts into the middle or beginning of the array, which is O(n).
"Amortized" means allocate once, expand if needed, use many times. Yes, this is the case when array has to be used. But in case you have to do many insert/delete operations, there is no amortization.
Having worked mostly on web since school, I jumped to the absolute stupidest way to resize an array: making a new array that is one element larger and copying the data into it, assuming that any array is necessarily a contiguous block of memory. Sorry, I haven't dug into the low-level implementation of data structures since college, where an array was the contiguous block and anything else was either a different data structure or a hand-waved black box.
That approach was idiotic of me, and it only arises when the programmer failed to allocate enough memory for their application in the first place. If I'm skimming this Wikipedia article correctly, dynamic arrays just allocate more memory than needed, just in case. You still have this problem every time you hit that limit, but it should be rare enough to be negligible (as in, every time you hit your cap you double the memory allocation).
+Ivo Ivanov That is not what amortized means at all. Amortized θ(1) means that *on average* the time per operation is constant. Yes, whenever you have to reallocate the array, that particular insertion will take θ(n) time, but because you're increasing the size of the array geometrically, the interval between such reallocations grows in proportion to the array size, so *on average* every insertion is performed in constant time.
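A sketch of that geometric growth in C (the initial capacity and names are arbitrary):

```c
#include <stdlib.h>

struct vec { int *data; size_t len, cap; };

/* Append with doubling: most calls are plain O(1) stores; the
   occasional realloc copies n elements, but doubling means that
   cost is spread across the n appends that preceded it, so the
   amortized cost per append stays constant. */
int vec_push(struct vec *v, int value) {
    if (v->len == v->cap) {
        size_t newcap = v->cap ? v->cap * 2 : 8;
        int *p = realloc(v->data, newcap * sizeof *p);
        if (!p) return -1;          /* out of memory: vector unchanged */
        v->data = p;
        v->cap  = newcap;
    }
    v->data[v->len++] = value;
    return 0;
}
```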
This stuff is worth gold. Not many modern computer science degrees teach this, because modern CPUs have caches and compilers have optimizations, which means there is often virtually no difference in the speed. So people have started assuming that everything runs equally quickly. Scale the CPU down and you understand the true answer.
Very well put video! Worth the 30 minutes.
Any "Computer Science degree" that doesn't teach linked lists is not computer science.
What do you mean by "this"? Teach about caches? Teach about unrolling loops?
What practical difference is there if there is no difference? Academic?
There's effectively no difference in small problems or most common teaching assignments (which are usually fast enough to be instant); then you go out into the real world with huge datasets or untidy problems where the differences can be a LOT more noticeable. Especially working in games, where performance can be critical and you're sometimes working with very limited hardware; this is one of the things I always try to drum into new graduates.
Thanks, I understand what you're saying but I guess I was just hung up on the word practical. As in, in practice. If there is a difference, there is a difference.
05:22 We're going to visit every single element (125k) in an array & linked list and add up all their values.
11:18 Running linked list version on Atari computer takes ~166 clock ticks.
13:07 Running array version on Atari computer takes ~179 clock ticks.
17:55 On the Atari computer. Traversing the array backwards is ~114 clock ticks.
18:20 Why? What's happening is that the instructions the Atari machine can execute favor walking backwards.
21:00 Summary of results for Atari, Raspberry Pi, and iMac.
On the Raspberry Pi and the iMac, the array is much faster than the linked list.
22:15 What's happening? Caching. The Atari doesn't have a cache. Every bit of data has to be fetched from memory each time. But the CPU isn't much faster than the memory. But for the iMac and the Pi, the CPU is much faster and has to wait for things to be fetched from memory.
Almost, but you haven't properly annotated *why* the linked list is marginally faster than plain walking forwards on the ST (your next address is prebaked and can be used as a JMP IMMediate, so you don't have to depend on using the CPU's own, slightly slower address calculation), nor why the cache speeds things up so dramatically for the plain array in both directions vs the LL on the others (reading an uncached memory address actually pulls in a whole block of data surrounding that one location into the cache; if you're simply traversing a large chunk of memory either wholly linearly or with very small jumps for each step, that means you could potentially have a few dozen extra cache hits after the initial miss before crossing the next block boundary, massively accelerating the processing speed; the linked list, if it doesn't allocate each new value close to the previous one but instead at an essentially random location, cannot benefit from that speedup)...
Also, going backwards on a 68k series chip is markedly faster because of one particular instruction group: Decrement and Branch if a condition is met - typically decrement-and-branch-until-zero, ie it will spin rapidly through a loop that does some particular task, finishing only when the content of a particular register hits zero (and sets its easily testable zero flag). There's no particularly easy or fast equivalent for counting upwards - instead you have to repeatedly test whether your loop-count register is equal to some particular arbitrary value, and simply supplying *that* and doing the more complicated equal/not equal test takes a nontrivial amount of extra processor (and possibly memory bus) time. You can somewhat split the difference by using a decrementing counter to control the loop while either incrementing a separate counter to control the forwards motion, or setting a register to minus the decrementing one, but that only shaves off a little and will still be quite a bit slower than just running through the array backwards. The memory in those older machines has no burst output mode (which typically only works in incrementing-address mode) or even anything much like FPM or EDO (which allow reasonably fast, zero-page-esque random addressing within a single block of memory), so you get no compensating speed advantage in terms of reduced memory latency; wholly random access is exactly as fast as wholly linear forward stepping. Which in a way is also why the linked list is faster, besides the lack of cache.
That particular instruction is also why, if it was better optimised for later 68k-series chips, the program would show rather curious results on the Falcon (I wouldn't want to try and predict them, in fact) or any other similarly engineered 68030 machine, and show an even stronger bias towards backwards array walking on an ST upgraded to a 68010 (or maybe 020, which has a rather poorer cache than the 030) - the very tight "Loop Mode" instruction introduced with the 010 is literally engineered for doing nothing more than blasting through very simple memory-walking routines bounded by exit conditions such as DBZ. For the most applicable uses, optimising code to use that instruction can give a 2 to 3 times speedup vs the more explicitly written-out plain 68000 version, and in mixed-instruction code that can occasionally make use of it you still get acceleration measured in dozens of percent. Essentially, if you already thought 68k code was strange because you're more used to how x86 does things, Loop Mode just makes it even weirder... the 010 and later processors are pretty much tuned, in cases where burst mode or caches are unavailable or no use, for skimming through memory, very quickly, _backwards._
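For anyone who wants to see the shape of it in C: the countdown idiom below is what maps naturally onto that decrement-and-branch instruction, as opposed to the forward loop, which has to compare against an arbitrary bound on every pass (a sketch, not Dr. Bagley's actual code):
/* forward: i must be compared against n on every iteration */
long sum_forward(const int *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
/* backward: the decrement itself produces the zero test the loop needs */
long sum_backward(const int *a, long n) {
    long sum = 0;
    while (n-- > 0)
        sum += a[n];
    return sum;
}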
I wish people didn't create such bloated videos in the first place.
Thanks for the laconized info!
thx, nice summary
@@gwenynorisu6883 as someone who's programmed for the 68k a bit, I'm curious, how would a data register being decremented by DBRA speed up array access when walking backwards? Wouldn't you have to increase or decrease a value in an address register in every iteration of the loop? How does the value in the data register get used to calculate the address of an array element? Off the top of my head I can't really think of anything that would be faster than just adding a constant value (i.e. the size in bytes of one element) to an address register every iteration.
"Confusing variable names, but hey, I'm a C programmer" 😂
As a student learning C, I can confirm that variable names are either long and confusing or short and confusing
@MLU8811 Same
hell0
@MLU8811 Did exactly the same AND like the video
Say that to the Assembly programmer, they don't even have VARIABLEs.
Between caches and prefetchers, on modern hardware arrays are almost always better than linked lists and other pointer-chasing data structures. Especially in things like game engine architecture, putting items in arrays versus linked lists of randomly allocated elements (i.e. items all over in memory) can yield very significant speed benefits.
that is true, however this is mostly because intel have made their backdoor optimizations, I can bet that the performance on an ARM or AMD is quite different. The problem with this is that at application level you have no idea what is going on and you can't optimize it; it is just a question of time until this falls off, just like directx compared to vulkan.
@@D4no00 The same performance hit happens in PowerPC architectures, and for that matter, any processor with a data cache. Also, arrays are safer than pointers stored in memory. Some safety critical coding standards prohibit stored pointers.
Steve is by far the clearest when it comes to explaining anything Computerphile. Clear, Concise, C programmer - all the C's! :)
videos like this make me really appreciate C as a language
C is the ultimate language.
“...which means I need a very high tech piece of equipment”
*showed us a floppy disk*
Second term computer science students, rejoice. Computerphile has come to save you.
i swear, this channel has saved me many times.
Saved me when studying recursion
I've already graduated but better late than never, although I'm doubtful it'll be useful in my job
I remember when Computerphile was brand new, I was in my Data Structures and Algorithms class, and doing horribly. I think I would have done a lot better if they had started this channel about a year sooner. I did tell my professor about this channel on one of the last days of class, so I never found out if he ever used it in any of his lectures or not.
Damn son. Saved me again.
First impression. Depends on the operation. I tend to use arrays when I need contiguous memory, or I have a known length, and if I need random access. I use linked lists when building lists with unknown length and where random access is not required.
a + b = b + a
you can watch Numberphile for that 😂😂 LoL
more like LoL are three letters => Half life 3 confirmed
SABER - FICTIONAL CHARACTER Not in computer science. ab does not equal ba.
seems legit
But you forget that we can extend a list easily, but if we want to add one extra element to an array we must create a new one
wait, i thought that there was a conjecture, that a+b=c
so is that not valid?
A fantastic video. It's set up great. And the fact you do "myth busting" and teach people how to test a question like this is simply awesome.
This will completely be a function of use case and implementation. I'd love to see what you all set up for this test.
He's talking about complex computer stuff and I'm just intrigued by the little ASCII spinner at 9:37.
Edit: Haha, they talk about it later on. :D
lol same here. The first thing I wondered was "I wonder how many valuable CPU cycles are being wasted animating that."
Used that spinner on VDTs, back in the day... { putchar('\b'), putchar( "\\|/-"[ a++ & 03 ] ); }
His was displayed inside "()", so I suspect some kind of "printf()" that certainly wouldn't help speed things along during initialisation (on the Atari)...
Not sure why he's generating random numbers to initialise all those 'p's... Any (smallish) number would do just as well: struct.p = loopcounter;
Having written lots of code for diverse tasks, imo, each approach has its use and purpose. Arrays when their size is not prone to fluctuation, singly or doubly linked lists when there's lots of inserting/deleting to be done. There's no one "optimal" scheme that is best in all circumstances. (A recipe book has a fixed number of recipes, but each recipe has a variable number of ingredients... Thankfully 'C' has dynamic heap allocation and pointers...)
@S B C's "volatile" keyword makes optimisation a little less secure about its assumptions of what stays and what can be replaced.
Scary will be the time when optimisation replaces some code's explicit linked list with array operations, or vice versa, and coders come to rely on that.
The demo would work just as well with recursive function calls for the linked list example using the stack as the 'chain' of object instances.
Great info about the differences in the way the chips and the compilers handle the same simple code. Stuff like that fascinates me. Thanks.
Wow. That has to be one of the best videos on the topic of computer hardware differences and how we as programmers should not make any assumptions about how a machine is supposed to function when executing code. I love the "I'm writing in a high level language" when actually it's C, one of the least abstract languages you can use without resorting to assembly. Just shows you how much this man knows his stuff.
Thank you for clearing that up. I came to programming at the end of the 1980s. I didn't get the linked lists thing and why arrays weren't used; people were talking about linked lists instead.
A lot of it was about just allocating the memory you used, then recycling it, which favoured dynamic approaches storing structs/records together.
Traditional arrays require sizing and static allocation which put arbitrary limits. You didn't want to allocate a massive array on start up for some feature which wasn't used, or would restrict what the system could do.
The languages used weren't necessarily as flexible as C which limited options for long running programs.
Allocate at beginning, process and terminate worked fine for short lived programs running jobs on multi-user systems.
As a user began using larger GUI programs, static pre-allocation was less feasible.
I stumbled across this video this evening, some 5 years after it was published. Brilliant work, mate! I'll be searching to watch more of your videos right now!
Also, depending on the fragmentation of the allocator from which the linked list nodes are taken, on a cached CPU it can be much, much slower than an array, which always has its elements on the same cache lines relative to one another. And then there's a funny case where, supposing you have an allocator which allocates things linearly and you allocate the whole linked list at once, it basically gives you an array with a bunch of (useless) pointers to next nodes between each element. This will perform not quite as fast as the array, because it won't be able to fit as many elements per cache line due to the space wasted by the pointers, BUT it'll be much closer to the array than before. Overall, the speed of operating on a full linked list on a cached CPU will vary depending on where the nodes were put in memory relative to one another.
Another thing to note is that in my work, anyhow, you're very unlikely to see a list as big as 125k elements. So the effect of the cache on full array loops is increased even more so.
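A minimal sketch of that "whole linked list allocated at once" case - carving every node out of a single malloc'd block so neighbours sit next to each other in memory (make_contiguous_list is an illustrative name, not anything from the video):
#include <stdlib.h>
struct node { int value; struct node *next; };
/* all n nodes come from one contiguous block, so traversal walks linearly */
struct node *make_contiguous_list(size_t n) {
    if (n == 0) return NULL;
    struct node *pool = malloc(n * sizeof *pool);
    if (!pool) return NULL;
    for (size_t i = 0; i + 1 < n; i++)
        pool[i].next = &pool[i + 1];
    pool[n - 1].next = NULL;
    return pool;
}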
Ironically I see this video next on the auto-play list "Bjarne Stroustrup: Why you should avoid Linked Lists"
why indeed
@@tanveerhasan2382 Watch it, then you'll know. But generally, because growing an array geometrically reduces the amount of time you need for expanding/shrinking it, and because insertions/deletions, which are a Linked List's strength, still require traversing it in O(n).
From my own, limited experience, I've not yet come across a case where anything was ever actually faster than an array. In one case, a hash map came close. Might even have been the same speed, and was definitely more convenient because I had key/value pairs, but I ended up ditching it and using a sorted std::vector, because I needed the simplicity for MPI compatibility.
Good call on running it on a simple machine. More entertaining and more consistency, win win!
Would love to see Dr. Jason Atkin next time. Had him for my C++ module. He is amazing. Dr. Bagley is awesome as well.
It's always a good idea to have an understanding of what's going on under the hood.
When using the Raspberry Pi 3 there is the NEON co-processor. Compiling with -O3 could generate vectorized code. To be sure it is not, the vector optimization should be disabled. Vectorized code would increase the speed of array manipulation.
Taking the median would give more accurate results. The OS scheduler might interrupt the program sometimes resulting in distorted results when taking the average.
True
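Something like this would do it, assuming the individual run times have already been collected (median is a hypothetical helper, not part of the video's code):
#include <stdlib.h>
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}
/* the median ignores the occasional run inflated by a scheduler interrupt */
double median(double *times, size_t n) {
    qsort(times, n, sizeof *times, cmp_double);
    return (n % 2) ? times[n / 2]
                   : (times[n / 2 - 1] + times[n / 2]) / 2.0;
}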
I'm certain Mr. Bagley already knows about this, but on 68k specifically an array will ALWAYS be faster than a linked list when coding in assembly, since you can predecrement/postincrement the address-pointer.
(as in direction doesn't matter)
ie:
lea.l "address of array", a0
clr.l d0
clr.l d1
move.l #1048576, d2 (Number of iterations.)
Loop:
move.b (a0)+, d1
add.l d1, d0
subq.l #1, d2
bne.b Loop
d0 will contain a checksum32
you could also add 1048576 to the address of the array and do "move.b -(a0), d1" inside the loop to read it in reverse. Speed should be more or less the same.
With a linked list you also have to fetch a new address for every iteration so there's no chance that is faster on these specifically :)
gcc can unroll loops too, it's not a clang-specific feature.
Every decent compiler can.
for(p=0; p
He wasn't talking about unrolling a loop, he was talking about SIMD.
When GCC switched to the GPLv3 license, Apple forked GCC. I believe the GCC version shipped on OSX is something like 4.2, mainline GCC is up to 7.1. This was in like 2006 or something. GCC has been significantly improved since then. So it's not unlikely that his version of GCC won't unroll and vectorize the code, even though real GCC will.
Reckless Roges Fails to compile. You smartass...
8:11 "p inside p, were using p inside p, confusing variable names, but then I'm a C programmer" LOL.
I can understand that the benchmark had to be trivial in order to run on the Atari ST, but I think simply summing integer elements in a loop gives a bad indication of the performance differences between linked lists and arrays/vectors on modern architectures. If the pointers in a linked list point to a larger data structure, it can absolutely thrash the cache and cause big performance hits. It is even worse if the elements in the linked list are scattered around in memory. With the three levels of cache we have now, cache misses carry a hefty penalty.
Using arrays/vectors is almost always the right choice nowadays when performance is important, even if you don't do random-access. The only exception I can think of is for huge data sets with frequent insertions and removals.
Madsy9 worth noting is that if you don't care about the order of items in your array, both insert and remove operations are O(1)
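That's the classic swap-with-last trick - a minimal sketch, assuming a plain int array with its length tracked separately:
#include <stddef.h>
/* O(1) unordered remove: overwrite the victim with the last element */
void swap_remove(int *a, size_t *len, size_t i) {
    a[i] = a[*len - 1];
    (*len)--;
}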
The point wasn't to address a typical use case. The point was to illustrate that the fundamental architecture underlying your code can in fact cause things to behave in unexpected ways. Yes, most of the time X may be true... but *assuming* it's a universal case can cause problems.
It depends on your use case, I often end up in cases where an unsorted array and a bit of brute force is faster than a tree or hash based set when your data size is tiny (for example under 10) but you're doing a huge number of lookups.
+Madsy9 Sorting is another area where linked lists can be faster, especially if the record size is large, since sorting a list only requires altering the pointers, not moving entire records around.
Possibly the compiler is doing something less than optimal when making the code for an array reference.
While in the Air Force in the early 90s we had a problem writing a document paginator on a worst case 5MB report file. (Ultrix workstations). The program sat there and churned for over 30 seconds while the paginator routine determined the starting location of every page in 17 different page formats. Users found this waiting intolerable.
I read an article then about how C compilers may generate a multiplication to reference array elements, and that traversing an array by pointer reference instead of subscripts would not include this extra math behind the scenes and should run faster. I tried this idea out on my Amiga 3000 and the paginator using pointers ran in 12 seconds for an equivalent 5MB text file. I was pretty sure the 25MHz 030 should not be a lot faster than the Ultrix boxes at work. The recoded pointer version of the paginator ran in 5 seconds on the ultrix workstations. Happy users.
If you need linked lists on a modern architecture, store them in arrays, and use indices instead of pointers. It can be faster than a pointer-based linked list, but of course you still need to measure that. The general approach for fast code is to put all required data relatively nearby in memory. Allocating one big array chunk kind-of ensures that (lol, there are OS caveats still), as long as the elements aren't pointers.
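A sketch of that index-based approach (the struct and names are invented for illustration): the nodes live in one array, so following "next" stays inside a block the cache already likes.
#define NIL (-1)
struct inode { int value; int next; };   /* next is an index into the same array, not a pointer */
int sum_index_list(const struct inode *nodes, int head) {
    int sum = 0;
    for (int i = head; i != NIL; i = nodes[i].next)
        sum += nodes[i].value;
    return sum;
}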
It'd be great if Dr. Bagley were recorded in front of a darker background. The overexposed window background is rather distracting.
Damodara Kovie Maybe it's part of who he is. I like to think it is intentional, after all they're computer scientists; they know how to use a damn DSLR.
Bryan Keller Fair enough, but I'm appealing to Sean Riley, the videographer and editor.
Just needs some front lighting
So in all fairness you'd need to enable loop unrolling in gcc (experimental) and compare again. That should significantly speed up the array-based RPi compilation.
A lot of people in the comments do not know (or at least 4 years ago did not know) that you can insert linked list elements at the head in O(1) if order does not matter. It will reverse the order of elements coming in, of course, but this gets rid of the need for an O(n) insert or needing to track the last element in the linked list.
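For reference, head insertion is just two pointer moves - a minimal sketch with an invented push_front:
struct node { int value; struct node *next; };
/* O(1) insert at the head: no traversal, no tail pointer needed */
struct node *push_front(struct node *head, struct node *n) {
    n->next = head;
    return n;   /* n is the new head */
}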
"There's technically a cache but not really but essentially no but there's something like a prefetch"
Things become a bit more interesting for embedded real time systems - when finding a memory location to allocate takes too long, so you have to allocate all of your memory in advance, or write your own allocation algorithm. This shouldn't make a significant difference for reading the array/linked list, but should affect creation time.
Being a C programmer is no excuse for confusing or non-descriptive variable names. As much as possible variable names should explain the purpose of their data unambiguously.
I recognize there are times when this is not easy, but with modern IDEs there is no justification for bad variable names. Don't ignore the power of your tools.
He is a computer scientist, not an industrial programmer
Ok so i will name it
"Pointer_that_Points_To_Next_Element_In_The_Array_List"
♫♪Ludwig van Beethoven♪♫ Or you could just name it nextpointer#
Butthurt.
He said "asleep" programmer, not a C programmer, so he had an excuse
Actually this is a very interesting video. Many might have thought that the RPi is just an intermediate device for running tests, but actually it is running on a totally different architecture, arm64. I know that many servers are migrating to arm64 nowadays as it is much more energy efficient, however those cycle counts are alarming, as clock speeds of those processors are not any higher than Intel ones.
Man, I really wish I could switch my major rn... I've always had something for computers
Do it
The most important thing that everyone should learn is this: BOTH ARE FAST ENOUGH! Use whatever method works for your application.
well... you should really re-run it with C++11 range-based for-loops, because these have a kinda big impact on the CPU performance of arrays
Those computerphile nerds are so friendly. If all teachers were like them...
But in the end the array was faster on all the CPUs with a cache, which should be all CPUs now. Right?
Generally, yes... across most use cases and particularly with modern cpu/compiler/memory architecture arrays will be faster than linked lists. But the point of the video is that it's not a fundamental given. Even in cases where you can math out the number of steps involved in the logic, your real-world systems that the code runs on can affect the actual results.
Even CPUs considered obsolete have caches. CPUs without one are old enough to be considered antiques.
correct me if i am wrong, linked lists are slower but memory efficient, slower than arrays as they don't provide random access, but heaps can overcome all that stuff, is there anything more to it in the video?
Not all modern processors have caches. For instance, most Arduinos do not have a cache.
Linked lists are not memory efficient, as they require more data to keep (every node keeps its key and the addresses of the previous and next nodes). However, it is easier to add or remove nodes from the middle of the list. If you search data structures on Wikipedia, I think they give both memory efficiency and time efficiency for different operations.
If the Linked List stores its data like an Array, it'll likely be about as fast. But in practice, Linked Lists will have insertions, deletions, concatenations, all of which fragment this data and create cache miss opportunities. Context matters in this case, which is why it's best to profile these things in the situation you need them in.
9:03 I thought I was the only one still using floppies for some applications... ;)
21:40 I would have liked to see the results on an Intel Windows computer.
So you're saying the node pointer is Mario, the node is a remote floating platform, the "next node" pointer is a chuckster, and that my neighbors have already reported me to the police for blaring the a cappella theme on max volume. Well, I must admit, I like what you have to say, sir. I like it a lot.
Actually... you may want to look at the index calculation time on the array on the older machine. Some compilers didn't optimize for powers of two and used a multiply for the index. This was pretty painful back in the day.. remember the ARM processor had a serial multiplier on some models but a 5 or 6 clock multiply was also a possibility.
People forget this today; typically it's 1 clock.
IIRC the 80486 had a 3 clock integer multiply
Hahaha, good shit. I was really really concerned when the linked list turned out to be faster on the Atari, just because I figured an array must be faster. Huge sigh of relief with the results on the mac and pi; I'm not insane.
8:14 "Confusing variable names, but then, I'm a C-programmer." Haha!
Instead of using an array index, use a pointer to the first element of the array, and increment the pointer. I think that will be even faster.
Modern Intel processors handle adding an index to a pointer and dereferencing the incremented pointer all in a single instruction.
The reverse speed is likely due to how the cache works. (I haven’t seen the end of the video yet).
8:15 "But then, I'm a C programmer." xD
Loved this video! Thanks guys. Great take away; "if you're not sure, come up with some tests and collect the real data". This echoes Mike Acton's 'truths', 1. Hardware is the platform
2. Design around the Data.
It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing. you take the starting address and add (n * sizeOfElem). Arrays are fixed size, linked list are of variable size and more flexible. At the same time memory allocation is very slow. if you put your array on the stack it's again literally a single cpu instruction (add to esp). running over all the elements still would be faster in an array still, since in a linked list, any element might be anywhere in memory, and thus it might not be in the cache in a sufficiently large list while the stack is almost certainly on the cache. also you need to dereference a pointer, meaning looking at the memory address to fetch the memory address at which the next element is, which is 2 memory accesses, while the array is again literally ESP + offset. Also linked lists are larger in memory, since you have to store that extra bit of information.
"It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing."
No - arrays only have the complexity O(1) for lookups - when it comes to inserts and deletions, on average half of the data in the array has to be moved/copied, so complexity is O(N) for inserts/deletions in arrays.
Arrays are (compared to linked lists) faster in random read but slower in inserting or removing elements.
They are also faster for sequential reads where all elements starting from the first one have to be read, when optimized by a modern compiler.
Normally you need the calculation
startingAddress + index * sizeOfElement
when you want to access the element in the array with the given index.
But if you have a loop which increases the index every cycle by one,
a modern compiler can get rid of that multiplication by introducing a variable "offset",
which is set to startingAddress before the loop and to which sizeOfElement is added after each cycle in the loop.
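Roughly this transformation, sketched in C rather than the compiler's internal form (sum_indexed and sum_strided are illustrative names):
struct elem { int value; };
int sum_indexed(const struct elem *base, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)      /* address = base + i * sizeof(struct elem) each pass */
        sum += base[i].value;
    return sum;
}
int sum_strided(const struct elem *base, int n) {
    int sum = 0;
    for (const struct elem *p = base; p < base + n; p++)   /* address just steps by sizeof(struct elem) */
        sum += p->value;
    return sum;
}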
Can Dr Bagley do another shootout, this time between lookup tables and live calculation? That would be amazing to see how modern processors differ from older ones.
Dear me, this is dodgy really. The choice of data structure should be made depending on what your use case is, e.g. is random access by index needed? are inserts needed? is iteration needed? I found this video interesting in some ways, but it fails to address the important points of how to choose a data structure.
NOTE: This 30 minute video analyzes in depth the one task where arrays don't crush linked lists by default. There is no operation for which the asymptotic complexity on an array is higher than the asymptotic complexity on a linked list; in most cases, array wins.
Very insightful benchmark. At first I was worried that you wouldn't mention the cache's role in it, but you did.
Only thing that's missing is that this is a synthetic test, and real life applications would show far different results. The reason is not only those cache misses that you mentioned, but also that on x64 a linked list node would be 2x greater in size than an array element. This has consequences for how much memory will be prefetched into the cache and how quickly it gets polluted, and thus greater performance hits during execution.
Also, as this is a synthetic test, all memory for the linked list is allocated one block after another anyway, thus fewer cache misses than there would be in the more likely real life scenario, where nodes would be scattered through memory and a single fetch would likely be a miss.
Everything you say here is true, but beside the point. He's trying to teach a more fundamental fact about CompSci in general - just because X is faster/better/whatever on paper doesn't mean that will always turn out to be true when the code hits metal. While it's true in modern cases that most programs can ignore the underlying architecture in most cases, sometimes it can result in unexpected behaviors.
Extra credit earned for using the Atari! And I'm pretty sure the reason why it's faster when run in reverse is because no compare instruction is needed in the low-level loop, only a check for zero. (I see you clarified this at the end, and I'm right. Atari 2600 programmer here...)
Extremely disappointed this was a comparison of a linear scan of both data structures. What about arbitrary insertion and deletion? Those are some of the use cases linked lists were invented to address (no pun intended).
Couldn't agree more. The entire video I couldn't stop thinking about how this comparison is unfair.
On top of that, something was seriously wrong with the compiler implementation on the Atari ST, such that a linear forward scan of an array was somehow not faster than a linked list. And noting the speed of list/array creation -- something that is typically done only once -- was sort of pointless.
Arbitrary insertion and deletion of large structures (100's or 1000's of bytes) is going to be slow in an array. Then there's making a list out of items that may not be contiguously allocated, like a list of request structures for I/O coming from different threads that they allocate themselves, possibly on their own stacks. Of course, for the former, you can create an array of pointers to the big structures and move those pointers around the array if that suits you.
I found this disappointing too. It's billed as a comparison of data structures and ends up being a comparison of hardware architectures. When I was in school we used the concept of an abstract machine, and everyone learned the pros and cons of arrays and lists for sequential access, random access, inserts, deletes and moves. If a linked list beats an array for sequential or random access, something is definitely screwed up.
While I definitely agree that the choice really comes down to what you need the structure for, I found this information important as well, as it explains another aspect of choosing a structure that some people might not think about and that isn't considered as often as big O notation.
I guarantee you that if he had done the video we had expected, then there would be people complaining in the comments saying that he didn't consider how they run on different architectures.
Besides, at 30 minutes this video was long enough as is without comparing all the different use cases (and the previous video did mention some of it anyway).
why is backwards array faster than array on Atari?
I'd say that it comes down to the line
for (i=0; i<SIZE; i++)
finally, something academic 😅
Loving the Atari MegaST in the background!
a question which interests me most now: is the floppy HD format or DD?
"DD" with a hole drilled to make HD
Why would you do that, as it's being used in a machine that, odds on, only has a double density drive? (I know HD was at least a later option for the MSTe, but STs in the main have 720k drives)
As well as bothering with that trick when proper HD floppies remain cheap (heck, pretty much free these days) and plentiful, and have a formulation that's a better match for the drive characteristics?
PC drives including USB ones can still happily read and write 720k discs, btw.
You guys should make a video about branch prediction. That trips up a lot of people, especially on modern CPUs, which try to do some clever shit.
Did he actually say "doog yrev"!? lmao
Did eeh sey
That's what it was!!!
A huge advantage of arrays is, as mentioned, cache friendliness (data locality). But more than that, newer CPUs also include SIMD instructions to perform operations on a lot of data at once, say adding 4 or 8 ints in parallel. Add this in and the array will beat a linked list even worse on modern machines.
I would try compiling and running with -march=native (along with -O3), which will allow for CPU specific optimizations. Or use some intrinsics to manually add the needed SSE or AVX instructions.
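For the intrinsics route, an SSE2 sketch of the summing loop (assumes an x86 with SSE2, and that n is a multiple of 4 just to keep it short - the auto-vectorizer will emit roughly equivalent code from the plain loop):
#include <emmintrin.h>
int simd_sum(const int *a, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4)                 /* 4 ints per iteration */
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)&a[i]));
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);       /* spill and sum the 4 lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}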
insert word association programming humor here
That could probably be done faster with a linked list than with an array.
THC IS IN MY P
Dream selection of retro kit in that office :)
"The compiler is being clever", which is why you don't optimize code to compare algorithm speeds. Also, this is just traversal. How about speed testing insertion in the middle, which is what linked lists excel at?
And I thought I knew how it works.
This was a brilliant answer to this "trivial" question.
I learned a lot, which is some-divided-by-zero from what I expected first 😀
Bravo and thank you
MEMORY IS AN ARRAY!!!!!!!!!
I'd like to see some tests regarding arrays and lists where you're frequently or constantly adding and removing entries.
Because the array would have to rebuild every time, while the list should just be able to skip a value and allocate another.
it's hard to imagine a use case for it.
Steven: *uses VSCode*
me: MAH BOIiii
I just learnt about the "-O3" setting in GCC. Nice one, thanks!
FWIW, -O3 has a tendency to break code, -O2 is a safer, albeit slightly slower alternative
@@anshul5243 ozone mode bad
Isn't the Atari array code running slow vs the linked list because converting an index to a memory address uses multiplication? It's more related to C array syntax and the math occurring because of the index parameter. I expect this (multiplication) is a larger factor than memory access and the lack of caches on the old Motorola 68000 Atari system.
Linked list code will not involve multiplication.
Did you look at the assembly code generated?
Did you notice that it was an array of objects, and not an array of primitives? Makes the array testing come out far, far worse than it otherwise would.
Joseph Cote, This is C, not Java. They're called "structs". ;)
So what? My point still stands. Arrays of pointers will always be less efficient than an array of primitives.
Joseph Cote While I don't care about the whole "object" vs "struct" thing, it is important to note that, unlike in Java, data structures in C aren't referenced by pointers unless the code specifies that they are. In C (and other languages), any structure, regardless of its type, can be accessed either directly or by pointer. In this case, the array was a contiguous sequence of structures, no pointers involved other than the pointer to the array on the heap.
Let me make my point even more plain.
Linked list. Pointer to the head. Each item has a data and the pointer to the next "thing". For each item in the list you have to get the address of the item either from the head pointer or the previous item. Memory access to get the data, memory access to get the next pointer. Perhaps even more processing to resolve the actual addresses within the "thing."
Primitive data array: Using the base address of the array and the index, calculate the address to read (very fast all in cpu) access memory to retrieve data, go to next item. For each item ONE address calculation and done.
This is my ONLY point: Arrays of primitive values will ALWAYS be much more efficient in tests like this than any kind of fooling around with pointers and objects or whatever you want to call those "things." The video compared a linked list of "things" to an array of "things", NOT a big difference.
Great deep dive into what I thought would be a trivial problem !
Wouldn't it have been faster without printing stuff on the screen inbetween? I know that many of my c and c++ programs run much quicker if you leave out the print statement.
that's a good point, writing to the screen can take some time. however both versions print to the screen, so they are subject to the same slowness. it's like if you had two people run a race through syrup - you can still tell who's faster than who.
yes, however the percentage difference in speed is exactly the same, as they are printing the same length (minus one or two chars) the same number of times, so it won't affect the comparison
They did the prints outside the timings, so it would not affect the recorded time.
If you're finding that printing half a line of low rez text to the console makes a significant difference to the loop speed of your otherwise ~850ms routine, even on an ST, then you've got bigger problems than whether or not to print said text. Like, realising that you're actually stuck in a runtime interpreter for a particularly shoddy variety of BASIC when you thought you were making compiled C programs.
If they're routines that take closer to 850us, then, sure, maybe it'll be worth it. But the text only prints after the loop is done (which is how it knows how long said loop took) and before the start of the next one, so it's not going to affect the reported results any, and in most cases the difference between printing or not printing 100 half-lines of text will save you, ooh, maybe one video frame's worth of runtime?
It's not necessary to pre-declare arrays in Perl. They grow to whatever size is required. (It may be slightly faster to create large ones in one go, if it's known in advance how many will be needed.)
VS Code master race reporting in
i think it's C
Yes, but the text editor he was using to write it on the Mac was VS Code
You could also add that on the iMac, the array could fit entirely in cache as opposed to the linked list, and optimizing the code with loop unrolling gave it all the speed boost. So theoretically it might never have accessed memory except to load the program from disk when running as an array.
How about sorting?
When you have big enough objects in an array/list wouldn't sorting a list be quicker when you can manipulate pointers instead of moving whole objects around in an array?
For BIG structs this should be true, but first you must come up with an efficient in-place algorithm where you need no random access. I'm not quite sure whether you can do selection sort or something comparable with a singly linked list (this example should work with doubly linked lists); with quicksort this should be even harder...
skeliton11211 I meant to add that to the question, to be honest. Of course an array of pointers would be much quicker. When you have a warehouse full of containers you make a checklist of where everything is and organise that, and don't move all the containers around.
Stefan Gräber I realised that later. Sorting a one-way list might be problematic since you have to relink the neighbouring objects and insert the object in its new place. It results in far more pointer manipulation than an array of pointers. My question was more about the "pure" version of the structure.
I would imagine it depends on what you're sorting, and what sorting algorithm you use.
First, on the data being sorted:
Arrays have to swap the contents, so the time is dependent on the size of that content.
Linked lists have to swap the pointers, so constant.
These two are equal if the data being sorted is the size (or smaller) of a pointer, otherwise the linked list (I must assume, I should point out that I'm not an expert) will win out.
As for how you sort:
Insertion sort moves backwards along the array/list, and so is an outright no for linked lists (unless, of course, you use a doubly linked list, i.e. a linked list linked in both directions, which would have even greater memory use than a single linked list and would require two pointer swaps rather than just one for every data swap).
Heap sort relies on random access, so would take a hit on linked lists.
Merge sort and quick sort (properly implemented) work well on both arrays and linked lists.
Radix sort I'm not sure about. On the one hand, from what I recall, what you're sorting is specifically the sort of thing that would benefit from linked lists only needing to swap pointers rather than data. On the other, I would also think these things to be things where the array would only contain pointers to them, for example strings (assuming you aren't accepting the restriction of all strings being the same maximum length, and probably all close to it (otherwise you're wasting lots of space)).
Provided you keep in mind which sorting algorithms to use and which not, and that your array elements aren't unreasonably large, the difference is (probably) negligible: sorting an array/list from scratch is something you probably won't be doing that often, or with data large enough, for it to really matter which of the two you use, at least in comparison to other operations you're likely to use on the same data.
Lastly, I suspect that if you're sorting an array or list, you're probably going to want to do other things to it that linked lists are better suited to, that is inserting and removing items. That bit however is purely subjective and I don't have anything to really back that up.
Quicksort can be done in place so there's no need to do large copies. Even still, moving small structures is still often faster than moving pointers in a linked list, again because of the cache, which can perform the move operations in bulk.
I'm 2 years too late, but I think that there is one change to the linked list version that will speed it up on the cached processors. I assume that the elements of the linked list are added to the beginning as they are generated. As the summing loop walks the list, the memory address for each successive element is decreasing, making it more difficult for the CPU to predict and do speculative prefetch. If the newly allocated elements are appended, the general direction is toward higher addresses, which are easier to predict for the prefetch mechanism.
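A sketch of that append-instead-of-prepend idea, keeping a tail pointer so the insert stays O(1) (illustrative code, assuming nodes come from an allocator that hands out ascending addresses):
struct node { int value; struct node *next; };
/* appending keeps list order equal to allocation order, so addresses ascend */
void append(struct node **head, struct node **tail, struct node *n) {
    n->next = NULL;
    if (*tail) (*tail)->next = n;
    else       *head = n;
    *tail = n;
}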
I worked on a commercial C compiler for the Atari ST back in the 1980s, and benchmarks were important to the success of the product. Optimization strategies were weighted toward better performance on things like the Dhrystone with the knowledge that there was no cache.
Oooh, which C compiler was that?
@@DrSteveBagley mwc (Mark Williams C) for the Atari ST.
Doogy rev, doogy rev. Gronda gronda Rangdo!
wopsim
Also, if you decide to manage the memory with the linked list, you could allocate a block of memory in which the linked list gets stored, with new blocks getting allocated when space runs out. It would mean you get fewer cache misses and it would run a bit faster.
Your original test is invalid. You unnecessarily used a different type of loop, with the loop used for iterating over the array doing a bunch of useless loop-maintaining arithmetic and comparisons that any competent programmer would have either removed or accounted for in the benchmark. Pro tip: summing the elements in an array is in fact faster, since it reads half the amount of data from memory for each access (only the number being summed, and no pointer). Since the x86 has more registers, the optimizing compiler had better luck keeping your loop counter in a register and not in memory, whereas on the 68k the loop counter was stored as a local variable (on the stack in memory), and you access it 3 times per iteration: once to compare it, once to read it for incrementing, and once to write it for incrementing. That's completely unfair to the array implementation.
68k has 8 data registers and 7 address registers, it should be enough for all three programs.
The operation that slowed down the forward array iteration was probably comparing the index with the constant, which requires one more instruction than comparing it to zero.
it's not "unfair." the point is that the speed depends on the architecture's bottlenecks.
imagine we were dealing with HDD instead of RAM. you would want a data structure that minimized reads from the disk because it's so much slower than memory. On the other hand with an SSD, you are not penalized so much for excessive reads and writes. Therefore you can use different data structures and algorithms without incurring the performance hit of the HDD.
That's not how Big O works.
The test is unfair and should be invalid, not because the 68k is inferior, but because the for-loop uses more complex mechanics and thus is significantly slower than the while-loop.
The ascending for-loop is the slowest because it uses loop-counter mechanics and has to do an active end-of-loop check every iteration.
The descending for-loop is faster because it uses the very nice decrease-and-branch-all-in-one instruction and uses the decrease instruction's result to compare to zero instead of a separate compare instruction. But it still has the loop-counter mechanics to deal with.
The linked list is fastest because it just loads the next address and checks if it's zero.
I love these in depth looks on computing processes, keep it up!
Commenting on an Atari isn't particularly relevant as modern computers generally use Intel chips that are very cache heavy. Many tests have been done on these computers that show that almost all optimizations are next to nothing compared to the cache effects. The bottom line, as illustrated by many, is that arrays are almost always much faster than linked lists, and therefore linked lists should only be used in special cases. Arrays should always be considered the default. My first computer was an 8080 running at 1 megahertz (linked lists were probably ok on that computer) but that isn't very important to what works on modern computers. There is no argument, don't use linked lists.
To be fair that was not the point of the video. It was that testing was king.
Modern computers are not "generally" using Intel chips. Intel chips are dominant for PCs, but PCs do not represent a majority of even consumer computer systems. As an example take the 3 big gaming consoles of the 7th generation (Xbox 360, Wii and PS3). Between them they have over 260 million units sold, and none of them use Intel chipsets. Furthermore most cellphones today are running on Qualcomm chipsets. That's not even mentioning the non-consumer market. There the Atari test is absolutely relevant, as many embedded chipsets do not have a cache.
Pretty sure the majority are ARM chips. That's what you're going to find in just about anything that doesn't need a high performance CPU and in some things that do, like most smartphones or that Raspberry Pi for instance.
Maybe if you're using something dismal like an Atom or an embedded / microcontroller core based off a 386SX or what-have-you, but I don't think that was the point of the test. But even if it was, you can use the 68000 in the ST as a fairly prime example of such; embedded CPUs based on the 68k core (e.g. the Coldfire) were extremely popular up until a few years ago when ARMs and ATmegas started offering greater power and lower energy consumption alongside a lower price tag. So we already have the answer - sophisticated large instruction set, large cache chips prefer regular arrays, and that includes the higher end ARMs. Simpler, cacheless chips prefer linked lists, unless they also display unusual behaviour like being over 50% faster when traversing an array from the bottom up rather than top down.
I expected the reverse loop to be fastest, because that's one subtraction less.
With a normal loop, for (int i=0; i<size; i++), you compare i against size every iteration; with a reverse loop, for (int i=size-1; i>=0; i--), there is no subtraction and i is directly compared to 0.
In any case, testing speed optimizations is needed. I sometimes theorized optimizations that turned out to actually be slower than something simpler.
I don't buy this. Give us a pastebin of the source code so we can see if there are any other differences between the source code.
Even more, I want to see the assembly language listings of the compiled output. p[i].p is well known to be less efficient than using pointer arithmetic (even inside an array); however, some (but not all!) compilers can recognize when p[i] is used inside a loop and automatically substitute pointer-based code. While I expect this to occur for x86-based code considering gcc's/clang's relative maturity, I question whether the 68K compilers provide similar optimizations.
github or it didn't happen
"p[i].p is well known to be less efficient than using pointer arithmetic". My understanding is that p[i] is equivalent to *(p + i), why would it be any slower?
Mike Scott My guess is that although equivalent, the computer still needs to compute p[i].p into *(p+i).
Mike Scott Because it involves a multiply and an addition, versus p++ which is just a simple addition. This becomes particularly visible with multiple dereferences to p[i].
Very informative video on this topic. And actually very clear about what is happening. Great job.
This video was just quite misleading and clickbait'y. Very disappointing compared to your typical videos.
I've gotta put a dislike on this video for a few reasons:
- The description which made it out as though in the general case arrays were slower
- He made out as though he was NOT going to be running it on a modern machine - just on the Atari
- The fact one has to watch the first 22 minutes (thank god for YT x2 speed) before he properly addresses the fact there are MAJOR differences between the machines
CPU/GPU's with SIMD instructions will make processing the array faster as those instructions were made specifically for processing arrays!
For cached architectures, it would be interesting to see the performance difference with the linked list version if you malloc'ed a single contiguous block of memory and then 'sub-allocated' the structures in the memory block.
Also, the iMac has SSE4 and AVX instructions, which can add 8 or 16 array items in one cycle.
Which is facilitated by the loop unrolling, and then the cache makes it so that all 8 or 16 of those elements are accessed in a single read.
OK so real computer science talk here regarding array vs linked list.
So a Linked list is a modular data structure that can grow or shrink to accommodate data in real time. Arrays need to be completely reallocated and copied to change size. Linked lists also require an extra bit of data in each element while arrays do not. Linked lists must be accessed from the top to the bottom, unless you saved an item of interest in advance.
It should be noted that the overhead used in a for loop adds more instructions to this operation than just moving a pointer into a local variable.
For loops require an index value such as i, a comparison operation, and an increment, while a linked list only needs to go to the next item until that next item is 0.
Both situations sound simple and the same, however they are not when it comes to the assembly, since with the list you would just move the next item's address into a register, compare it, and keep going, but in a for loop you would have to compare the index counter to the length of the array and then increment the index counter. Since the for loop operation involves two variables (the length of the array and the index counter), you would need to push those values onto the stack, then back off the stack as your application stepped deeper into its functions, and/or move those variables into and out of registers each time you got to the comparison. With the linked list you only need to check whether the value stored at a fixed distance away from the list item's pointer is 0 or not. Since this distance is fixed, there's no need to pull an extra value from memory into a register.
I'm not sure I am making a good explanation here but essentially a for loop uses more steps to maintain the loop while a linked list loop just keeps going until a condition is met.
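Side by side, the two loop shapes being compared look something like this (a sketch, not the video's code):
struct node { int value; struct node *next; };
/* array walk: maintain i and compare it against len every pass */
int sum_array(const int *a, int len) {
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += a[i];
    return sum;
}
/* list walk: the pointer just loaded doubles as the exit test against 0 */
int sum_list(const struct node *n) {
    int sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}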
Memory management is a thing as well. In case of fragmented memory it can be difficult to allocate larger memory blocks.
The old Macintosh used "Handles" to deal with fragmented memory; the OS could move the arrays in memory without the programs knowing.
Excellent break down. Another gem.
I thought the array would have been faster, even on the Atari.. I was surprised by this. I think it comes down to memory speed and the way the memory controller addresses memory registers. The reason I thought an array would be faster is because it points to a specific register immediately.. and a linked list has to search and compare. But in the end.. *cache* is king!
What it looks like he's done is create an array of data objects. Access via an index first, THEN the pointer at that index into the object. Had he created a direct array of primitive values the array method might have been a third of the time.
I wouldn't have even considered this a fair comparison either way. You have some problem you need to solve and you use the data structure that is appropriate for it. Maybe a linked list is best, maybe a doubly linked list, maybe an array of objects, whatever. And you live with the constraints of the chosen data structure.
I like this video. The caching mechanism is fascinating. Honestly I didn't understand how this clang hack works.