05:22 We're going to visit every single element (125k) in an array & linked list and add up all their values.
11:18 Running the linked list version on the Atari computer takes ~166 clock ticks.
13:07 Running the array version on the Atari computer takes ~179 clock ticks.
17:55 On the Atari computer, traversing the array backwards is ~114 clock ticks.
18:20 Why? What's happening is that the instructions the Atari machine can execute favor walking backwards.
21:00 Summary of results for Atari, Raspberry Pi, and iMac. On the Raspberry Pi and the iMac, the array is much faster than the linked list.
22:15 What's happening? Caching. The Atari doesn't have a cache; every bit of data has to be fetched from memory each time, but the CPU isn't much faster than the memory. For the iMac and the Pi, the CPU is much faster and has to wait for things to be fetched from memory.
Almost, but you haven't properly annotated *why* the linked list is marginally faster than plain walking forwards on the ST (your next address is prebaked and can be used as a JMP IMMediate, so you don't have to depend on the CPU's own, slightly slower address calculation), nor why the cache speeds things up so dramatically for the plain array in both directions vs the LL on the others (reading an uncached memory address actually pulls a whole block of data surrounding that one location into the cache; if you're simply traversing a large chunk of memory either wholly linearly or with very small jumps per step, that means you could potentially get a few dozen extra cache hits after the initial miss before crossing the next block boundary, massively accelerating the processing speed; the linked list, if it doesn't allocate each new value close to the previous one but instead at an essentially random location, cannot benefit from that speedup)...

Also, going backwards on a 68k-series chip is markedly faster because of one particular instruction group: Decrement and Branch if condition met (DBcc), typically decrement-and-branch-until-zero - ie it will spin rapidly through a loop that does some particular task, finishing only when the content of a particular register hits zero (and sets its easily testable zero flag). There's no particularly easy or fast equivalent for counting upwards - instead you have to repeatedly test whether your loop-count register is equal to some particular arbitrary value, and simply supplying *that* and doing the more complicated equal/not-equal test takes a nontrivial amount of extra processor (and possibly memory bus) time. You can somewhat split the difference by using a decrementing counter to control the loop while either incrementing a separate counter to drive the forwards motion, or setting a register to minus the decrementing one, but that only shaves the penny and will still be quite a bit slower than just running through the array backwards.

The memory in those older machines has no burst output mode (which typically only works in incrementing-address mode) or even anything much like FPM or EDO (which allow reasonably fast, zero-page-esque random addressing within a single block of memory), so you get no compensating speed advantage in terms of reduced memory latency; wholly random access is exactly as fast as wholly linear forward stepping. Which in a way is also why the linked list is faster, besides the lack of cache.

That particular instruction is also why, if the code were better optimised for later 68k-series chips, the program would show rather curious results on the Falcon (I wouldn't want to try to predict them, in fact) or any other similarly engineered 68030 machine, and an even stronger bias towards backwards array walking on an ST upgraded to a 68010 (or maybe 020, which has a rather poorer cache than the 030) - the very tight "Loop Mode" introduced with the 010 is literally engineered for doing nothing more than blasting through very simple memory-walking routines bounded by exit conditions such as a decrement hitting zero. For the most applicable uses, optimising code to use it can give a 2 to 3 times speedup vs the more explicitly written-out plain 68000 version, and in mixed-instruction code that can occasionally make use of it you still get acceleration measured in dozens of percent. Essentially, if you already thought 68k code was strange because you're more used to how x86 does things, Loop Mode just makes it even weirder...
the 010 and later processors are pretty much tuned, in cases where burst mode or caches are unavailable or no use, for skimming through memory, very quickly, _backwards._
@@gwenynorisu6883 as someone who's programmed for the 68k a bit, I'm curious, how would a data register being decremented by DBRA speed up array access when walking backwards? Wouldn't you have to increase or decrease a value in an address register in every iteration of the loop? How does the value in the data register get used to calculate the address of an array element? Off the top of my head I can't really think of anything that would be faster than just adding a constant value (i.e. the size in bytes of one element) to an address register every iteration.
I learned Java programming at an institute. There I basically learned the basics of how to program, develop websites and things like that, but they never even began to explain how things actually worked. It still worries me how little I actually know about computers, since all I know is basically what years of progress in the computer world have achieved, and I'm just using the tools they gave me without questioning them. Videos like this are very important for someone like me; I appreciate it.
This stuff is worth gold. Not many modern computer science degrees teach this, because modern CPUs have caches and compilers have optimizations, which means there is virtually no difference in the speed - so people have started assuming that everything runs as quick. Scale the CPU down and you understand the true answer. Very well put video! Worth the 30 minutes.
There's effectively no difference in small problems or most common teaching assignments (which are usually fast enough to be instant); then you go out into the real world with huge datasets or untidy problems where the differences can be a LOT more noticeable. Especially working in games, where performance can be critical and you're sometimes working with very limited hardware - this is one of the things I always try to drum into new graduates.
Thanks, I understand what you're saying but I guess I was just hung up on the word practical. As in, in practice. If there is a difference, there is a difference.
Between caches and prefetchers, on modern hardware arrays are almost always better than linked lists, and often better than other node-based data structures too. Especially in things like game engine architecture, putting items in arrays rather than linked lists of randomly allocated elements (i.e. items scattered all over memory) can yield very significant speed benefits.
That is true, however this is mostly because Intel have made their backdoor optimizations; I can bet that the performance on ARM or AMD is quite different. The problem with this is that at application level you have no idea what is going on and you can't optimize for it. It is just a question of time until this falls off, just like DirectX compared to Vulkan.
@@D4no00 The same performance hit happens in PowerPC architectures, and for that matter, any processor with a data cache. Also, arrays are safer than pointers stored in memory. Some safety critical coding standards prohibit stored pointers.
I remember when Computerphile was brand new, I was in my Data Structures and Algorithms class, and doing horribly. I think I would have done a lot better if they had started this channel about a year sooner. I did tell my professor about this channel on one of the last days of class, so I never found out if he ever used it in any of his lectures or not.
First impression. Depends on the operation. I tend to use arrays when I need contiguous memory, or I have a known length, and if I need random access. I use linked lists when building lists with unknown length and where random access is not required.
I stumbled across this video this evening, some 5 years after it was published. Brilliant work, mate! I'll be searching to watch more of your videos right now!
Used that spinner on VDTs, back in the day... { putchar('\b'); putchar( "\\|/-"[ a++ & 03 ] ); } His was displayed inside "()", so I suspect some kind of printf() that certainly wouldn't help speed things along during initialisation (on the Atari)...

Not sure why he's generating random numbers to initialise all those 'p's... Any (smallish) number would do just as well: struct.p = loopcounter;

Having written lots of code for diverse tasks, imo each approach has its use and purpose: arrays when their size is not prone to fluctuation, single or double linked lists when there's lots of inserting/deleting to be done. There's no one "optimal" scheme that is best in all circumstances. (A recipe book has a fixed number of recipes, but each recipe has a variable number of ingredients... Thankfully C has dynamic heap allocation and pointers...)
@S B C's "volatile" keyword makes optimisation a little less secure about its assumptions of what stays and what can be replaced. Scary will be the time when optimisation replaces some code's explicit linked list with array operations, or vice versa, and coders come to rely on that. The demo would work just as well with recursive function calls for the linked list example using the stack as the 'chain' of object instances.
Also, depending on the fragmentation of the allocator the linked list nodes are taken from, on a cached CPU it can be much, much slower than an array, which always has its elements on the same cache lines relative to one another. And then there's a funny case: supposing you have an allocator which allocates things linearly and you allocate the whole linked list at once, it basically gives you an array with a bunch of (useless) pointers to next nodes between the elements. This will perform not quite as fast as the array, because it won't be able to fit as many elements per cache line due to the space wasted by the pointers, BUT it'll be much closer to the array than before. Overall, the speed of operating on a full linked list on a cached CPU will vary depending on where the nodes sit in memory relative to one another. Another thing to note is that, in my work anyhow, you're very unlikely to see a list as big as 125k elements, so the effect of the cache on full array loops is increased even more so.
Thank you for clearing that up. I started programming at the end of the 1980s. I didn't get the linked lists thing and why arrays weren't used; people were talking about linked lists instead.
A lot of it was about just allocating the memory you used, then recycling it, which favoured dynamic approaches storing structs/records together. Traditional arrays require sizing and static allocation which put arbitrary limits. You didn't want to allocate a massive array on start up for some feature which wasn't used, or would restrict what the system could do. The languages used weren't necessarily as flexible as C which limited options for long running programs. Allocate at beginning, process and terminate worked fine for short lived programs running jobs on multi-user systems. As a user began using larger GUI programs, static pre-allocation was less feasible.
@@tanveerhasan2382 Watch it, then you'll know. But generally, because growing an array geometrically reduces the amount of time you need for expanding/shrinking it, and because insertions/deletions, which are a linked list's strength, still require traversing it in O(n). From my own and limited experience, I've not yet come across a case where anything was actually faster than an array. In one case, a hash map came close. It might even have been the same speed, and was definitely more convenient because I had key/value pairs, but I ended up ditching it and using a sorted std::vector, because I needed the simplicity for MPI compatibility.
Wow. That has to be one of the best videos on the topic of computer hardware differences and how we as programmers should not make any assumptions about how a machine is supposed to function when executing code. I love the "I'm writing in a high level language" when actually it's C, one of the least abstract languages you can use without resorting to assembly. Just shows you how much this man knows his stuff.
When GCC switched to the GPLv3 license, Apple forked GCC. I believe the GCC version shipped on OSX is something like 4.2, mainline GCC is up to 7.1. This was in like 2006 or something. GCC has been significantly improved since then. So it's not unlikely that his version of GCC won't unroll and vectorize the code, even though real GCC will.
I'm certain Mr. Bagley already knows about this, but on 68k specifically an array will ALWAYS be faster than a linked list when coding in assembly, since you can predecrement/postincrement the address pointer (so direction doesn't matter), ie:

        lea.l   array, a0       ; address of array
        clr.l   d0
        clr.l   d1
        move.l  #1048576, d2    ; number of iterations
Loop:   move.b  (a0)+, d1
        add.l   d1, d0
        subq.l  #1, d2
        bne.b   Loop

d0 will contain a checksum32. You could also add 1048576 to the address of the array and do "move.b -(a0), d1" inside the loop to read it in reverse; speed should be more or less the same. With a linked list you also have to fetch a new address for every iteration, so there's no chance that is faster on these specifically :)
I can understand that the benchmark had to be trivial in order to run on the Atari ST, but I think simply summing integer elements in a loop gives a bad indication of the performance differences between linked lists and arrays/vectors on modern architectures. If the pointers in a linked list point to a larger data structure, it can absolutely thrash the cache and cause big performance hits. It is even worse if the elements in the linked list are scattered around in memory. With the three levels of cache we have now, cache misses give a hefty penalty. Using arrays/vectors is almost always the right choice nowadays when performance is important, even if you don't do random access. The only exception I can think of is huge data sets with frequent insertions and removals.
The point wasn't to address a typical use case. The point was to illustrate that the fundamental architecture underlying your code can in fact cause things to behave in unexpected ways. Yes, most of the time X may be true... but *assuming* it's a universal case can cause problems.
It depends on your use case, I often end up in cases where an unsorted array and a bit of brute force is faster than a tree or hash based set when your data size is tiny (for example under 10) but you're doing a huge number of lookups.
+Madsy9 Sorting is another area where linked lists can be faster, especially if the record size is large, since sorting a list only requires altering the pointers, not moving entire records around.
Possibly the compiler is doing something less than optimal when making the code for an array reference. While in the Air Force in the early 90s we had a problem writing a document paginator on a worst case 5MB report file. (Ultrix workstations). The program sat there and churned for over 30 seconds while the paginator routine determined the starting location of every page in 17 different page formats. Users found this waiting intolerable. I read an article then about how C compilers may generate a multiplication to reference array elements, and that traversing an array by pointer reference instead of subscripts would not include this extra math behind the scenes and should run faster. I tried this idea out on my Amiga 3000 and the paginator using pointers ran in 12 seconds for an equivalent 5MB text file. I was pretty sure the 25MHz 030 should not be a lot faster than the Ultrix boxes at work. The recoded pointer version of the paginator ran in 5 seconds on the ultrix workstations. Happy users.
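The rewrite being described is roughly this - a minimal sketch with made-up names and a made-up record type, not the actual paginator code:

    #include <stddef.h>

    struct line { int length; char text[80]; };

    /* Subscript version: a naive compiler computes base + i * sizeof(struct line)
       for every buf[i] access - potentially a multiply per iteration. */
    long count_chars_indexed(const struct line *buf, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += buf[i].length;
        return total;
    }

    /* Pointer version: one pointer increment per element, no index math. */
    long count_chars_pointer(const struct line *buf, size_t n) {
        long total = 0;
        for (const struct line *p = buf, *end = buf + n; p < end; p++)
            total += p->length;
        return total;
    }

Modern compilers usually perform this strength reduction themselves, but on early-90s compilers doing it by hand could pay off exactly as described.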
Generally, yes... across most use cases and particularly with modern cpu/compiler/memory architecture arrays will be faster than linked lists. But the point of the video is that it's not a fundamental given. Even in cases where you can math out the number of steps involved in the logic, your real-world systems that the code runs on can affect the actual results.
Correct me if I am wrong: linked lists are slower but memory efficient - slower than arrays as they don't provide random access - but heaps can overcome all that stuff. Is there anything more to it in the video?
Linked lists are not memory efficient, as they require more data to keep (every node keeps its key and the addresses of the previous and next nodes). However, it is easier to add or remove nodes from the middle of the list. If you search data structures on Wikipedia, I think they list both memory efficiency and time efficiency for the different operations.
When using the Raspberry Pi 3 there is the NEON co-processor. Compiling with -O3 could generate vectorized code. To be sure it is not, the vector optimization should be disabled. Vectorized code would increase the speed of array manipulation.
It's not necessary to pre-declare arrays in Perl. They grow to whatever size is required. (It may be slightly faster to create large ones in one go, if it's known in advance how many will be needed.)
Taking the median would give more accurate results. The OS scheduler might interrupt the program sometimes resulting in distorted results when taking the average.
If you need linked lists on modern architecture store them in arrays, and instead of pointers use indices. It can be faster than linked list, but of course you still need to measure that. General approach for fast code is to put all required data relatively nearby in memory. Allocating the big array chunk kind-of ensures that (lol there are OS caveats still), as long as elements aren't the pointers.
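A minimal sketch of that index-based approach (illustrative names and sizes; next == -1 marks the end of the chain):

    #include <stdio.h>

    #define N 8

    struct node { int value; int next; };   /* next = index of next node, -1 = end */

    int main(void) {
        struct node pool[N];
        /* Build the chain 10 -> 20 -> 30 inside one contiguous array. */
        pool[0] = (struct node){ 10, 1 };
        pool[1] = (struct node){ 20, 2 };
        pool[2] = (struct node){ 30, -1 };

        long sum = 0;
        for (int i = 0; i != -1; i = pool[i].next)
            sum += pool[i].value;
        printf("%ld\n", sum);   /* prints 60 */
        return 0;
    }

You keep the O(1) splicing of a list, but all the nodes live in one block, so traversal stays reasonably cache friendly.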
So in all fairness you'd need to enable loop unrolling in gcc (experimental) and compare again. That should significantly speed up the array-based RPi compilation.
A lot of people in the comments do not know (or at least 4 years ago did not know) that you can insert Linked List elements at the head for O(1) if order does not matter. It will reverse the order of elements coming in of course, but this gets rid of the need for an O(n) insert or needing to track the last element in the linked list.
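For instance, a minimal sketch of constant-time head insertion in C (illustrative names, not anyone's actual code):

    #include <stdlib.h>

    struct node { int value; struct node *next; };

    /* O(1) head insertion: no traversal and no tail pointer needed. The list
       ends up in reverse insertion order, as noted above. */
    struct node *push_front(struct node *head, int value) {
        struct node *n = malloc(sizeof *n);
        if (!n) return head;          /* out of memory: leave list unchanged */
        n->value = value;
        n->next  = head;
        return n;                     /* the new node is the new head */
    }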
Being a C programmer is no excuse for confusing or non-descriptive variable names. As much as possible variable names should explain the purpose of their data unambiguously. I recognize there are times when this is not easy, but with modern IDEs there is no justification for bad variable names. Don't ignore the power of your tools.
15:20 - "But if u think about it, if we started at the end and moved backwards, it would be exactly the same as starting from the beginning and moving forwards, so there's no point in testing it." Doesn't that seem like a rather unfair assumption to make? I mean, most people would have assumed looping backwards through an array would be "exactly the same" as looping forwards, but that turned out to be false. Why would u then assume, without even testing it, that going backwards through a list would be just the same?
Dear me, this is dodgy really. The choice between data structures should be made depending on what your use case is, e.g. is random access by index needed? Are inserts needed? Is iteration needed? I found this video interesting in some ways, but it fails to address the important points of how to choose a data structure.
I expected the reverse loop to be fastest, because that's one subtraction less. With a normal loop for (int i=0; i<n; i++) the end-of-loop test i<n is effectively a subtraction; with a reverse loop for (int i=n-1; i>=0; i--) there is no subtraction and i is directly compared to 0. In any case, testing speed optimizations is needed. I sometimes theorized optimizations that turned out to actually be slower than something simpler.
9:03 I thought I was the only one still using floppies for some applications... ;) 21:40 I would have liked to see the results on an Intel Windows computer.
I'm 2 years too late, but I think that there is one change to the linked list version that will speed it up on the cached processors. I assume that the elements of the linked list are added to the beginning as they are generated. As the summing loop walks the list, the memory address for each successive element is decreasing, making it more difficult for the CPU to predict and do speculative prefetch. If the newly allocated elements are appended, the general direction is toward higher addresses, which are easier to predict for the prefetch mechanism. I worked on a commercial C compiler for the Atari ST back in the 1980s, and benchmarks were important to the success of the product. Optimization strategies were weighted toward better performance on things like the Dhrystone with the knowledge that there was no cache.
It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing. you take the starting address and add (n * sizeOfElem). Arrays are fixed size, linked list are of variable size and more flexible. At the same time memory allocation is very slow. if you put your array on the stack it's again literally a single cpu instruction (add to esp). running over all the elements still would be faster in an array still, since in a linked list, any element might be anywhere in memory, and thus it might not be in the cache in a sufficiently large list while the stack is almost certainly on the cache. also you need to dereference a pointer, meaning looking at the memory address to fetch the memory address at which the next element is, which is 2 memory accesses, while the array is again literally ESP + offset. Also linked lists are larger in memory, since you have to store that extra bit of information.
"It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing." No - Arraya only have the complexity O(1) for lookups - when it comes to inserts and deletions in average half of data in the array has to be moved/copied so complexity is O(N) for inserts/deletions in arrays. Arrays are (compared to linked lists) faster in random read but slower in inserting or removing elements. They are also faster for sequential reads where all elements starting from the first one have to be read when optimized by modern compiler. Normally you need the calculation startingAddress + index * sizeOfElement when you want to access the element in the Array with the given index. But if you have a loop which increases the index every cycle by one a modern compiler can get rid of that muplication by introduing a variable "offset" which is set to startingAdress before the loop and to which sizeOfElement is added after each cycle in the loop..
Can Dr Bagley do another shootout, this time between lookup tables and live calculation? That would be amazing to see how modern processors differ from older ones.
Hmm... another thing with the Falcon is that it makes use of the 68030's burst mode memory transfer mode (sort of an early attempt at EDO, which naturally makes loading large blocks faster than true random-access), as well as the cache. It'd be interesting to see the code run again on the MSTe at full 16MHz speed with cache enabled, to see which program is faster in that case, or if it ends up being pretty much 50:50.
Unfortunately, the Falcon030 doesn't support burst mode. The TT did, though. Both Alive and New Beat document this deficiency. I agree that it would be interesting to compare with the MSTe in 16MHz mode with cache enabled, though.
Damodara Kovie Maybe it's part of who he is. I like to think it is intentional, after all they're computer scientists; they know how to use a damn DSLR.
Actually this is a very interesting video. Many might have thought that the RPi is just an intermediate device for running tests, but actually it is running on a totally different architecture, arm64. I know for sure that many servers are migrating to arm64 nowadays as it is much more energy efficient, however those cycle counts are alarming, as the clock speeds of those processors are not any higher than the Intel ones.
So you're saying the node pointer is Mario, the node is a remote floating platform, the "next node" pointer is a chuckster, and that my neighbors have already reported me to the police for blaring the a cappella theme on max volume. Well, I must admit, I like what you have to say, sir. I like it a lot.
Extremely disappointed this was a comparison of a linear scan of both data structures. What about arbitrary insertion and deletion? Those are some of the use cases linked lists were invented to address (no pun intended).
On top of that, something was seriously wrong with the compiler implementation on the Atari ST if a linear forward scan of an array was somehow not faster than a linked list. And noting the speed of list/array creation - something that is typically done only once - was sort of pointless.
Arbitrary insertion and deletion of large structures (100's or 1000's of bytes) is going to be slow in an array. Then there's making a list out of items that may not be contiguously allocated, like a list of request structures for I/O coming from different threads that they allocate themselves, possibly on their own stacks. Of course, for the former, you can create an array of pointers to the big structures and move those pointers around the array if that suits you.
I found this disappointing too. It's billed as a comparison of data structures and ends up being a comparison of hardware architectures. When I was in school we used the concept of an abstract machine, and everyone learned the pros and cons of arrays and lists for sequential access, random access, inserts, deletes and moves. If a linked list beats an array for sequential or random access, something is definitely screwed up.
While I definitely agree that the choice really comes down to what you need the structure for, I found this information important as well, as it explains another aspect of choosing a structure that some people might not think about and that isn't considered as often as big-O notation. I guarantee you that if he had done the video we had expected, there would be people complaining in the comments that he didn't consider how they run on different architectures. Besides, at 30 minutes this video was long enough as is without comparing all the different use cases (and the previous video did mention some of it anyway).
Isn't the Atari array code running slow vs the linked list because the index-to-memory-address calculation uses multiplication? It's more related to C array syntax and the math occurring because of the index parameter. I expect this (multiplication) is a larger factor than memory access and the lack of caches on the old Motorola 68000 Atari system. The linked list code will not involve multiplication. Did you look at the assembly code generated?
Did you notice that it was an array of objects, and not an array of primitives? Makes the array testing come out far, far worse than it otherwise would.
Joseph Cote While I don't care about the whole "object" vs "struct" thing, it is important to note that, unlike in Java, data structures in C aren't referenced by pointers unless the code specifies that they are. In C (and other languages), any structure, regardless of its type, can be accessed either directly or by pointer. In this case, the array was a contiguous sequence of structures, no pointers involved other than the pointer to the array on the heap.
Let me make my point even more plain.

Linked list: pointer to the head. Each item has data and the pointer to the next "thing". For each item in the list you have to get the address of the item either from the head pointer or the previous item: a memory access to get the data, a memory access to get the next pointer, perhaps even more processing to resolve the actual addresses within the "thing".

Primitive data array: using the base address of the array and the index, calculate the address to read (very fast, all in the CPU), access memory to retrieve the data, go to the next item. For each item, ONE address calculation and done.

This is my ONLY point: arrays of primitive values will ALWAYS be much more efficient in tests like this than any kind of fooling around with pointers and objects or whatever you want to call those "things". The video compared a linked list of "things" to an array of "things" - NOT a big difference.
This whole thing is really muddied by basically every detail of the hardware architecture, but part of the reason walking backwards through an array can be faster is the loop end condition test (it's also more likely to be noticeable in this case because the actual work done by the loop is very small, so loop overheads are magnified). Unfortunately the video didn't show the loop code for the backwards case, but consider the fundamental work needed to handle these two cases (ignoring compiler optimisations like loop unrolling and auto-vectorization):

for (i = 0; i < 100; i++) sum += a[i];
for (i = 99; i >= 0; i--) sum += a[i];

Both visit the same elements of a: [0, 99], but the forwards version needs these sorts of operations:
- fetch the value of a[i] from memory
- add that to an accumulator register (sum)
- increment the loop counter (i)
- subtract the value of the loop end point (100) from the loop counter
- jump back to the first instruction if the result of the above operation wasn't 0

In the backwards version:
- fetch the value of a[i] from memory
- add that to an accumulator register (sum)
- decrement the loop counter (i)
- jump back to the first instruction if the result of the above operation wasn't 0

Notice that in the second (backwards) version we need one fewer arithmetic operation, because we can take advantage of the fact that the decrement will set the zero flag when it reaches 0 (in every architecture I can remember working on), whereas the increment to 100 doesn't trigger any flag change we can use to conditionally jump to the top of the loop - the CPU has to subtract the end value to see if the difference is 0 (cmp is just a subtract that throws away the result but keeps the effect on the flags register), and then do the conditional jump. Since the actual work done in the loop is 1 arithmetic operation and 1 memory fetch, the additional arithmetic operation to test for the loop end is a significant extra overhead.
Why would you do that, as it's being used in a machine that, odds on, only has a double density drive? (I know HD was at least a later option for the MSTe, but STs in the main have 720k drives) As well as bothering with that trick when proper HD floppies remain cheap (heck, pretty much free these days) and plentiful, and have formulation that's a better match for the drive characteristics? PC drives including USB ones can still happily read and write 720k discs, btw.
A huge advantage of arrays is as mentioned cache friendliness (data locality). But more than that newer CPUs also include SIMD instructions to perform operations on a lot of data as once, say adding 4 or 8 ints in parallel. Adding in this the array will beat a linked list even worse on modern machines. I would try compiling and running with -march=native (along with -O3), which will allow for CPU specific optimizations. Or use some intrinsics to manually add the needed SSE or AVX instructions.
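For instance, with GCC that's just (sum.c being a stand-in name for the benchmark source):

    gcc -O3 -march=native sum.c -o sum

On some 32-bit ARM toolchains the equivalent spelling is -mcpu=native; either way the compiler is then free to emit NEON or SSE/AVX instructions for the build machine rather than a generic baseline.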
"The compiler is being clever", which is why you don't optimize code to compare algorithm speeds. Also, this is just traversal. How about speed testing insertion in the middle, which is what linked lists excel at?
You could also add that on the iMac, the array could fit entirely in cache, as opposed to the linked list, and optimizing the code with loop unrolling gave it all the speed boost. So theoretically it might never have accessed memory except to load the program from disk when running as an array.
Actually... you may want to look at the index calculation time for the array on the older machine. Some compilers didn't optimize for powers of two and used a multiply for the index. This was pretty painful back in the day... remember the ARM processor had a serial multiplier on some models, but a 5 or 6 clock multiply was also a possibility. People forget this today; typically it's 1 clock. IIRC the 80486 had a 3 clock integer multiply.
If the linked list stores its data like an array, it'll likely be about as fast. But in practice, linked lists will have insertions, deletions, concatenations, all of which fragment this data and create cache-miss opportunities. Context matters in this case, which is why it's best to profile these things in the situation you need them in.
I had the same question. I think he probably means the fact that indexing has to do a calculation to get the result. I.e. getting the 5th element of the array requires doing " arrayheadptr + (5 * element_size)" or something like that. This doesn't have to be done in the linked list version because the memory location pointed to by each next pointer is already 'precalculated'.
Not sure... I didn't pay attention to this particular part. Maybe he's talking about the fact that modern processors do not run each instruction in a single step but through pipelines. Let's say you want to compute a+b. The addition will be processed in several steps, for instance 3 steps:

(a,b) -> STEP 1 | STEP 2 | STEP 3 -> a+b

Imagine that each step takes 1 unit of time. The time to complete the addition is 3 units. Now let's imagine you want to compute a+b+c+d. Three additions have to be performed, so the time to compute a+b+c+d should be 3*3=9, right? Well... in fact, you can go faster, because each component of the processor can work independently. Here is how the compiler can schedule the computation of a+b+c+d:

t=1: STEP1 of a+b
t=2: STEP2 of a+b AND STEP1 of c+d
t=3: STEP3 of a+b AND STEP2 of c+d
t=4: STEP3 of c+d
t=5: STEP1 of (a+b)+(c+d)
t=6: STEP2 of (a+b)+(c+d)
t=7: STEP3 of (a+b)+(c+d)

So it takes only 7 time units instead of 9!!!! As you can see, (c+d) is in some way "precomputed" during the computation of (a+b) itself. And as everyone knows, (a+b)+(c+d)=((a+b)+c)+d [Numberphile probably has a video for it], so you get the expected result (but faster).
How about sorting? When you have big enough objects in an array/list wouldn't sorting a list be quicker when you can manipulate pointers instead of moving whole objects around in an array?
For BIG structs this should be true, but first you must come up with an efficient in-place algorithm where you need no random access. I'm not quite sure whether you can do selection sort or something comparable with a singly linked list (this example should work with doubly linked lists); with quicksort this should be even harder...
skeliton11211 I meant to add that to the question, to be honest. Of course an array of pointers would be much quicker. When you have a warehouse full of containers, you make a checklist of where everything is and organise that, and don't move all the containers around.
Stefan Gräber I realised that later. Sorting a singly linked list might be problematic, since you have to relink the neighbouring objects and insert the object in its new place. That results in far more pointer manipulation than an array of pointers. My question was more about the "pure" version of the structure.
I would imagine it depends on what you're sorting, and what sorting algorithm you use.

First, on the data being sorted: arrays have to swap the contents, so the time is dependent on the size of that content. Linked lists have to swap the pointers, so constant. These two are equal if the data being sorted is the size (or smaller) of a pointer; otherwise the linked list (I must assume, I should point out that I'm not an expert) will win out.

As for how you sort: insertion sort moves backwards along the array/list, and so is an outright no for linked lists (unless, of course, you use a doubly linked list, i.e. a linked list linked in both directions, which would have even greater memory use than a singly linked list and would require two pointer swaps rather than just one for every data swap). Heap sort relies on random access, so would take a hit on linked lists. Merge sort and quicksort (properly implemented) work well on both arrays and linked lists. Radix sort I'm not sure about. On the one hand, from what I recall, what you're sorting is specifically the sort of thing that would benefit from linked lists only needing to swap pointers rather than data. On the other, I would also think these to be things where the array would only contain pointers to them, for example strings (assuming you aren't accepting the restriction of all strings being the same maximum length, and probably all close to it - otherwise you're wasting lots of space).

Provided you keep in mind which sorting algorithms to use and which not, and that your array elements aren't unreasonably large, the difference is (probably) negligible: sorting an array/list from scratch is something you probably won't be doing that often, or with data large enough, for it to really matter which of the two you use, at least in comparison to other operations you're likely to use on the same data.

Lastly, I suspect that if you're sorting an array or list, you're probably going to want to do other things to it that linked lists are better suited to, that is, inserting and removing items. That bit however is purely subjective and I don't have anything to really back it up.
Quicksort can be done in place, so there's no need to do large copies. Even still, moving small structures is often faster than moving pointers in a linked list, again because of the cache, which can perform the move operations in bulk.
Instead of using an array index, use a pointer to the first element of the array, and increment the pointer. I think that will be even faster. Modern Intel processors handle adding an index to a pointer and dereferencing the incremented pointer all in a single instruction. The reverse speed is likely due to how the cache works. (I haven’t seen the end of the video yet).
Very insightful benchmark. At first I was worried that you wouldn't mention the cache's role in it, but you did. The only thing missing is that this is a synthetic test, and real-life applications would show far different results. The reason is not only the cache misses you mentioned, but also that on x64 a linked list node would be 2x the size of an array element. This has consequences for how much memory can be prefetched into cache and for its pollution, and thus greater performance hits during execution. Also, as this is a synthetic test, all the memory for the linked list is allocated one node after another anyway, thus fewer cache misses than in the more likely real-life scenario, where nodes would be scattered through memory and a single fetch would likely be a miss.
Everything you say here is true, but beside the point. He's trying to teach a more fundamental fact about CompSci in general - just because X is faster/better/whatever on paper doesn't mean that will always turn out to be true when the code hits metal. While it's true in modern cases that most programs can ignore the underlying architecture in most cases, sometimes it can result in unexpected behaviors.
There are use cases for linked lists, and use cases for arrays. If your collection is immutable then linked lists are always faster, as you would have to copy the whole array on change. If we are talking mutable collections then array(list)s are usually faster, as long as you are not changing the length of the list. If you are just adding/deleting at the end, arrays are still usually faster; when you are changing something in the middle, some form of linked list is faster.
Linked lists are only useful if you make only small jumps from node to node. Adding or deleting random elements in a linked list is about as fast as in an array (provided the array has free space).
Spaces in an array are definitely a problem and a huge source of bugs. That is, if we are talking ordered collections. If we have a hashmap or something like that it is obviously totally fine. But comparing a map to a list won't give you the results you want.
Then I misread your comment. I know that arraylists normally have space at the end. For smaller lists this means they will be faster than a linked list. But once your array is so big that you have to array-copy over multiple cache lines, a linked list will be faster.
Wouldn't it have been faster without printing stuff on the screen inbetween? I know that many of my c and c++ programs run much quicker if you leave out the print statement.
that's a good point, writing to the screen can take some time. however both versions print to the screen, so they are subject to the same slowness. it's like if you had two people run a race through syrup - you can still tell who's faster than who.
Yes, however the percentage speed difference is exactly the same, as they are printing the same length (minus one or two chars) the same number of times, so it won't affect the comparison.
If you're finding that printing half a line of low rez text to the console makes a significant difference to the loop speed of your otherwise ~850ms routine, even on an ST, then you've got bigger problems than whether or not to print said text. Like, realising that you're actually stuck in a runtime interpreter for a particularly shoddy variety of BASIC when you thought you were making compiled C programs. If they're routines that take closer to 850us, then, sure, maybe it'll be worth it. But the text only prints after the loop is done (which is how it knows how long said loop took) and before the start of the next one, so it's not going to affect the reported results any, and in most cases the difference between printing or not printing 100 half-lines of text will save you, ooh, maybe one video frame's worth of runtime?
When I choose a list and not an array it is not about the speed but the arithmetic. You run into complicated delete and replace methods if you don't choose the right type to store your data.
For cached architectures, it would be interesting to see the performance difference in the linked list version if you malloc'ed a single contiguous block of memory and then 'sub-allocated' the structures in the memory block.
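Something like this, as a rough sketch (names are made up; a real pool allocator would also need a free list for node reuse):

    #include <stdio.h>
    #include <stdlib.h>

    struct node { long value; struct node *next; };

    int main(void) {
        const size_t n = 125000;
        struct node *block = malloc(n * sizeof *block);  /* one allocation */
        if (!block) return 1;

        /* Carve the block into a chain of nodes: contiguous and ascending,
           so traversal behaves much like an array for the cache. */
        for (size_t i = 0; i < n; i++) {
            block[i].value = (long)i;
            block[i].next  = (i + 1 < n) ? &block[i + 1] : NULL;
        }

        long sum = 0;
        for (struct node *p = block; p; p = p->next)
            sum += p->value;
        printf("%ld\n", sum);

        free(block);
        return 0;
    }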
OK, so real computer science talk here regarding array vs linked list.

A linked list is a modular data structure that can grow or shrink to accommodate data in real time. Arrays need to be completely reallocated and copied to change size. Linked lists also require an extra bit of data in each element, while arrays do not. Linked lists must be accessed from the top to the bottom, unless you saved an item of interest in advance.

It should be noted that the overhead of a for loop adds more instructions to this operation than just moving a pointer into a local variable. For loops require an index value such as i, a comparison operation, and an iteration, while a linked list only needs to go to the next item until that next item is 0. Both situations sound simple and the same, however they are not when it comes to the assembly. With the list you would just move the next item's address into a register, compare it, and keep going; but in a for loop you have to compare the index counter to the length of the array and then increment the index counter. Since the for loop involves two variables (the length of the array and the index counter), you would need to push those values onto the stack and back off it as your application stepped deeper into its functions, or move those variables into and out of registers each time you got to the comparison. With the linked list you only need to check whether the value stored at a fixed distance from the list item's pointer is 0 or not; since this distance is fixed, there's no extra value to pull from memory into a register.

I'm not sure I'm making a good explanation here, but essentially a for loop uses more steps to maintain the loop, while a linked list loop just keeps going until a condition is met. The two shapes are sketched below.
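Roughly (a sketch; what the compiler actually emits for each is what matters):

    struct node { long value; struct node *next; };

    /* Array loop: maintain i, compare it against len, index the array. */
    long sum_array(const long *a, long len) {
        long sum = 0;
        for (long i = 0; i < len; i++)    /* compare + increment every pass */
            sum += a[i];
        return sum;
    }

    /* List loop: the exit test is just "is the pointer 0?". */
    long sum_list(const struct node *p) {
        long sum = 0;
        while (p) {
            sum += p->value;
            p = p->next;
        }
        return sum;
    }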
Your original test is invalid. You unnecessarily used a different type of loop, with the loop used for iterating over the array doing a bunch of useless loop-maintaining arithmetic and comparisons that any competent programmer would have either removed or accounted for in the benchmark. Pro tip: summing the elements of an array is in fact faster, since it reads half the amount of data from memory for each access (only the number being summed, and no pointer). Since the x86 has more registers, the optimizing compiler had better luck keeping your loop counter in a register and not in memory, whereas on the 68k the loop counter was stored as a local variable (on the stack, in memory), and you access it 3 times per iteration: once to compare it, once to read it for incrementing, and once to write it back after incrementing. That's completely unfair to the array implementation.
68k has 8 data registers and 7 address registers, it should be enough for all three programs. The operation that slowed down the forward array iteration was probably comparing the index with the constant, which requires one more instruction than comparing it to zero.
it's not "unfair." the point is that the speed depends on the architecture's bottlenecks. imagine we were dealing with HDD instead of RAM. you would want a data structure that minimized reads from the disk because it's so much slower than memory. On the other hand with an SSD, you are not penalized so much for excessive reads and writes. Therefore you can use different data structures and algorithms without incurring the performance hit of the HDD.
The test is unfair and should be invalid, not because the 68k is inferior, but because the for-loop uses more complex mechanics and thus is significantly slower than the while-loop. The ascending for-loop is the slowest because it uses loop-counter mechanics and has to do an active end-of-loop check every iteration. The descending for-loop is faster because it uses the very nice decrease-and-branch-all-in-one instruction and uses the decrease instruction's result to compare to zero instead of a separate compare instruction - but it still has the loop-counter mechanics to deal with. The linked list is fastest because it just loads the next address and checks if it's zero.
Commenting on an Atari isn't particularly relevant, as modern computers generally use Intel chips that are very cache heavy. Many tests have been done on these computers showing that almost all optimizations are next to nothing compared to the cache effects. The bottom line, as illustrated by many, is that arrays are almost always much faster than linked lists, and therefore linked lists should only be used in special cases. Arrays should always be considered the default. My first computer was an 8080 running at 1 megahertz (linked lists were probably OK on that computer), but that isn't very important to what works on modern computers. There is no argument, don't use linked lists.
Modern computers are not "generally" using Intel chips. Intel chips are dominant for PCs, but PCs do not represent a majority of even consumer computer systems. As an example, take the 3 big gaming consoles of the 7th generation (Xbox 360, Wii and PS3): between them they have over 260 million units sold, and none of them use Intel chipsets. Furthermore, most cellphones today run on Qualcomm chipsets. That's not even mentioning the non-consumer market. There the Atari test is absolutely relevant, as many embedded chipsets do not have cache.
Pretty sure the majority are ARM chips. That's what you're going to find in just about anything that doesn't need a high performance CPU and in some things that do, like most smartphones or that Raspberry Pi for instance.
Maybe if you're using something dismal like an Atom or an embedded / microcontroller core based off a 386SX or what-have-you, but I don't think that was the point of the test. But even if it was, you can use the 68000 in the ST as a fairly prime example of such; embedded CPUs based on the 68k core (e.g. the Coldfire) were extremely popular up until a few years ago when ARMs and ATmegas started offering greater power and lower energy consumption alongside a lower price tag. So we already have the answer - sophisticated large instruction set, large cache chips prefer regular arrays, and that includes the higher end ARMs. Simpler, cacheless chips prefer linked lists, unless they also display unusual behaviour like being over 50% faster when traversing an array from the bottom up rather than top down.
Things become a bit more interesting for embedded real-time systems - when finding a memory location to allocate takes too long, so you have to allocate all of your memory in advance, or write your own allocation algorithm. This shouldn't make a significant difference for reading the array/linked list, but should affect creation time.
Even more, I want to see the assembly language listings of the compiled output. p[i].p is well known to be less efficient than using pointer arithmetic (even inside an array); however, some (but not all!) compilers can recognize when p[i] is used inside a loop and automatically substitute pointer-based code. While I expect this to occur for x86-based code considering gcc's/clang's relative maturity, I question whether the 68K compilers provide similar optimizations.
"p[i].p is well known to be less efficient than using pointer arithmetic". My understanding is that p[i] is equivalent to *(p + i), why would it be any slower?
Mike Scott Because it involves a multiply and an addition, versus p++ which is just a simple addition. This becomes particularly visible with multiple dereferences to p[i].
This video was just quite misleading and clickbait'y. Very disappointing compared to your typical videos. I've gotta put a dislike on this video for a few reasons: - The description which made it out as though in the general case arrays were slower - He made out as though he was NOT going to be running it on a modern machine - just on the Atari - The fact one has to watch the first 22 minutes (thank god for YT x2 speed) before he properly addresses the fact there are MAJOR differences between the machines
Extra credit earned for using the Atari! And I'm pretty sure the reason why it's faster when run in reverse is because no compare instruction is needed in the low-level loop, only a check for zero. (I see you clarified this at the end, and I'm right. Atari 2600 programmer here...)
28:15 - "That's because I used a slightly different compiler there. I used clang on the iMac rather than gcc on the other 2". Doesn't that seem rather unfair? How do u know the code compiled in clang wouldn't have been faster than the code compiled in gcc on the other 2 machines?
If I may, in this case it might have been interesting to also compare the array of structures you had with a structure of arrays (one for p, the other for q), to encourage vectorization of the summing instructions and optimize cache use. Although, as you used the -O3 optimization, may it be that the q part of the structures had been optimized out?
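For reference, the two layouts look roughly like this - a minimal sketch with made-up names, not the code from the video:

    #include <stdio.h>

    #define N 125000

    struct pair { long p, q; };

    static struct pair aos[N];    /* AoS: p and q interleaved in memory */
    static long soa_p[N];         /* SoA: the p components alone, contiguous */

    int main(void) {
        long sa = 0, ss = 0;
        for (int i = 0; i < N; i++)
            sa += aos[i].p;       /* every cache line fetched is half q data */
        for (int i = 0; i < N; i++)
            ss += soa_p[i];       /* streams pure p data; vectorizes cleanly */
        printf("%ld %ld\n", sa, ss);
        return 0;
    }

Summing only p, the SoA loop wastes no cache space or memory bandwidth on the unused q values.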
Also, if you decide to manage the memory for the linked list yourself, you could allocate a block of memory in which the linked list gets stored, with new blocks allocated when space runs out. It would mean you get fewer cache misses and it would run a bit faster.
While there is some amusement value to this video, it provides very little information about what is actually going on.

We do not know how the nodes of the linked list are ordered in memory, and this is very relevant. The fact that the nodes were all generated in one swoop just before the test would suggest they are probably linearly ordered, with each link pointing a fixed distance forward from itself; by contrast, in a linked list that has been used for some time in the ways linked lists are made for (insertions, deletions, permutations of links), the nodes would tend to be scrambled in memory in a way that is detrimental to locality and therefore to cache performance. We also don't know whether the address arithmetic implicit in finding the address of p[i] is actually performed each time, or whether (quite likely) the compiler has optimised it away by just incrementing the address of the previous array access. Unless we have some idea of what happens at the assembly level, we are just watching a race between contenders we hardly know at all.

No mention is made of locality of access, though this is the unique feature that makes the data cache relevant: in the example, no item in memory is ever used more than once (I'm assuming the few local variables are in registers), so the _only_ gain from the cache is that it fetches not only the item requested by the CPU but a whole cache line, so that the next several requests might be cache hits. Caches (and many other things) have been designed specifically to optimise linearly repetitive access to arrays, so it is not so surprising that an array program benefits most from such optimisations. On the other hand, mention of the instruction cache is beside the point: both solutions are small loops that can reside entirely in the instruction cache.

If you really want to compare the intrinsic efficiency of arrays and linked lists as data structures, I would suggest a test that forces the array to be accessed non-linearly, and the list to be non-linearly ordered in memory. For instance, just insist that the items be visited in increasing order of the q-component while using the p-components (maybe doing something other than summation, for which order matters). In the array case, leave everything in place and pre-compute the permutation giving the order of visit, while in the list case (where we cannot do random access) pre-sort the list by increasing q, with a sorting method that only adjusts the links. The net effect would be that the linked list is scrambled in memory in the same way the array accesses are scrambled. Then the playing field is levelled in several respects (neither has much locality, so caching has little influence; array indexing cannot be optimised away), and the test would be a lot more interesting.
Yesterday I fell asleep while watching this video, and when the outro popped, I swear, I woke up THE WHOLE HOUSE with the greatest scream humanity has ever heard.
"We're not gonna run this on the iMac, we're not gonna run it on the Pi, we're gonna run it on the Atari behind me" I love this guy so much.
Nick1wasd, he's taking CPU caching out of the equation, but on modern architectures, CPU-cache-friendly data structures can make an enormous difference in performance - arrays can be designed to be more CPU cache friendly
TheSulross You probably already know this by now but he goes into the major speed differences made due to CPU cache later into the video
Let’s run it on the Amiga and C64 and 800xl
@@TheSulross isn't that the point of why you use arrays more than linked lists just the fact it's CPU cache friendly?
The more important difference is how the performance scales for different operations and what you actually plan to use the data structure for. Both of these were O(n) operations, but selecting an arbitrary element is O(1) in an array and O(n) in a linked list. However, trying to dynamically increase the size of an array would be O(n) (since you have to make a new array and copy the data), while a linked list can add another item to the front in O(1). Adding or removing arbitrary items next to a node you're currently looking at is O(1) for linked lists and O(n) for arrays (though this would be O(n) for linked lists if you had to walk to the element you want to change).
They are different tools for different things. If you're only using operations where the two structures are comparable (like the summation mentioned in the video) and speed is particularly important, then such a test makes sense, but what you use the data structure for is likely to be far more important.
I agree with your point about choosing the right tool for the job, but insertion into a dynamically resizing array is amortized Θ(1) if you grow the size geometrically.
Chris Hinton I think Nerdnumberone was referring to inserts into the middle or beginning of the array, which is O(n).
"Amortized" means allocate once, expand if needed, use many times. Yes, this is the case when array has to be used. But in case you have to do many insert/delete operations, there is no amortization.
Having worked mostly on web since school, I jumped to the absolute stupidest way to resize an array: making a new array that is one element larger and copying the data into it, assuming that any array is necessarily a contiguous block of memory. Sorry, I haven't dug into the low-level implementation of data structures since college, where an array was the contiguous block and anything else was either a different data structure or a hand-waved black box.
This assumption was idiotic of me and would require the programmer to fail to allocate enough memory for their application. If I'm skimming this Wikipedia article correctly, then dynamic arrays just allocate more memory in advance, just in case. You still have this problem every time you hit that limit, but it should be rare enough to be negligible (as in, every time you hit your cap you double the memory allocation).
+Ivo Ivanov That is not what amortized means at all. Amortized Θ(1) means that *on average* the time per operation is constant. Yes, whenever you have to reallocate the array, that particular insertion will take Θ(n) time, but because you're increasing the size of the array geometrically, the interval between such reallocations also grows linearly, so *on average* every insertion is performed in constant time.
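As a concrete sketch of that doubling strategy in C (illustrative code; names like dynarray and push are mine, not from the video):

#include <stdlib.h>

struct dynarray {
    int *data;
    size_t len, cap;
};

/* Amortized O(1) append: a reallocation at size n is "paid for" by
   the roughly n cheap appends that happened since the previous one. */
static int push(struct dynarray *a, int v)
{
    if (a->len == a->cap) {
        size_t ncap = a->cap ? a->cap * 2 : 8;   /* geometric growth */
        int *nd = realloc(a->data, ncap * sizeof *nd);
        if (!nd)
            return -1;
        a->data = nd;
        a->cap = ncap;
    }
    a->data[a->len++] = v;
    return 0;
}

The occasional Θ(n) copy is exactly the cost that gets amortized away; growing by a fixed amount instead of a fixed factor would break this and make appends Θ(n) on average.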
05:22 We're going to visit every single element (125k) in an array & linked list and add up all their values.
11:18 Running linked list version on Atari computer takes ~166 clock ticks.
13:07 Running array version on Atari computer takes ~179 clock ticks.
17:55 On the Atari computer. Traversing the array backwards is ~114 clock ticks.
18:20 Why? What's happening is that the instructions the Atari machine can execute favor walking backwards.
21:00 Summary of results for Atari, Raspberry Pi, and iMac.
On the Raspberry Pi and the iMac, the array is much faster than the linked list.
22:15 What's happening? Caching. The Atari doesn't have a cache: every bit of data has to be fetched from memory each time, but the CPU isn't much faster than the memory. For the iMac and the Pi, the CPU is much faster than memory and has to wait for things to be fetched.
Almost, but you haven't properly annotated *why* the linked list is marginally faster than plain walking forwards on the ST: your next address is prebaked in the node, so you don't have to depend on the CPU's own, slightly slower address calculation. Nor why the cache speeds things up so dramatically for the plain array in both directions vs the LL on the others: reading an uncached memory address actually pulls a whole block of data surrounding that one location into the cache, so if you're simply traversing a large chunk of memory either wholly linearly or with very small jumps for each step, you could potentially have a few dozen extra cache hits after the initial miss before crossing the next block boundary, massively accelerating the processing speed. The linked list, if it doesn't allocate each new value close to the previous one but instead at an essentially random location, cannot benefit from that speedup...
Also, going backwards on a 68k-series chip is markedly faster because of one particular instruction group: Decrement and Branch if a condition is met, typically decrement-and-branch-until-zero; i.e. it will spin rapidly through a loop that does some particular task, finishing only when the content of a particular register hits zero (and sets its easily testable zero flag). There's no particularly easy or fast equivalent for counting upwards - instead you have to repeatedly test whether your loop-count register is equal to some particular arbitrary value, and simply supplying *that* and doing the more complicated equal/not-equal test takes a nontrivial amount of extra processor (and possibly memory bus) time. You can somewhat split the difference by using a decrementing counter to control the loop while either incrementing a separate counter to drive the forwards motion, or setting a register to minus the decrementing one, but that only shaves off a little and will still be quite a bit slower than just running through the array backwards. The memory in those older machines has no burst output mode (which typically only works in incrementing-address mode) or even anything much like FPM or EDO (which allow reasonably fast, zero-page-esque random addressing within a single block of memory), so you get no compensating speed advantage in terms of reduced memory latency; wholly random access is exactly as fast as wholly linear forward stepping. Which in a way is also why the linked list is faster, besides the lack of cache.
That particular instruction is also why, if the code were better optimised for later 68k-series chips, the program would show rather curious results on the Falcon (I wouldn't want to try and predict them, in fact) or any other similarly engineered 68030 machine, and show an even stronger bias towards backwards array walking on an ST upgraded to a 68010 (or maybe 020, which has a rather poorer cache than the 030) - thanks to the very tight "Loop Mode" introduced with the 010, which is literally engineered for doing nothing more than blasting through very simple memory-walking routines bounded by exit conditions such as decrement-and-branch. For the most applicable uses, optimising code to use that mode can give a 2 to 3 times speedup vs the more explicitly written-out plain 68000 version, and in mixed-instruction code that can occasionally make use of it you still get acceleration measured in dozens of percent. Essentially, if you already thought 68k code was strange because you're more used to how x86 does things, Loop Mode just makes it even weirder... the 010 and later processors are pretty much tuned, in cases where burst mode or caches are unavailable or no use, for skimming through memory, very quickly, _backwards._
I wish people didn't create such bloated videos in the first place.
Thanks for the laconized info!
thx, nice summary
@@gwenynorisu6883 as someone who's programmed for the 68k a bit, I'm curious, how would a data register being decremented by DBRA speed up array access when walking backwards? Wouldn't you have to increase or decrease a value in an address register in every iteration of the loop? How does the value in the data register get used to calculate the address of an array element? Off the top of my head I can't really think of anything that would be faster than just adding a constant value (i.e. the size in bytes of one element) to an address register every iteration.
I learned Java programming at an institute. There I basically learned the basics of how to program, develop websites and things like that, but they never even began to explain how things actually worked. It still worries me how little I actually know about computers, since all I know is basically what years of progress in the computer world have achieved, and I'm just using the tools they gave me without questioning. Videos like this are very important for someone like me, I appreciate it.
they didn't teach you data structures?
In my opinion this doesn't really have too much to do with data structures; it's more about highlighting compilers and CPUs @@abdulhfhd
@@jaydavis6357 you are right
"Confusing variable names, but hey, I'm a C programmer" 😂
As a student learning in C i can confirm that variable names are either long and confusing or short and confusing
@MLU8811 Same
hell0
@MLU8811 Did exactly the same AND like the video
Say that to the Assembly programmer, they don't even have VARIABLEs.
This stuff is worth gold. Not many modern Computer Science degrees teach this, because modern-day CPUs have caches and compilers have optimizations, which means that there is virtually no difference in the speed. So people have started assuming that everything runs just as quick. Scale the CPU down and you understand the true answer.
Very well put video! Worth the 30 minutes
Any "Computer Science degree" that doesn't teach linked lists is not computer science.
What do you mean by "this"? Teach about caches? Teach about unrolling loops?
What practical difference is there if there is no difference? Academic?
There's effectively no difference in small problems, or most common teaching assignments (which are usually fast enough to be instant), then you go out into the real world with huge datasets or untidy problems where the differences can be a LOT more noticeable. Especially working in games where performance can be critical and you're sometimes working with very limited hardware - this is one of the things I always try to drum into new graduates
Thanks, I understand what you're saying but I guess I was just hung up on the word practical. As in, in practice. If there is a difference, there is a difference.
Between caches and prefetchers, on modern hardware arrays are almost always better than linked lists, and better than many other data structures much of the time. Especially in things like game engine architecture, putting items in arrays versus linked lists of randomly allocated elements (i.e. items all over in memory) can yield very significant speed benefits.
that is true, however this is mostly because Intel have made their backdoor optimizations; I can bet that the performance on an ARM or AMD chip is quite different. The problem with this is that at the application level you have no idea what is going on and you can't optimize for it; it is just a question of time until this falls off, just like DirectX compared to Vulkan.
@@D4no00 The same performance hit happens in PowerPC architectures, and for that matter, any processor with a data cache. Also, arrays are safer than pointers stored in memory. Some safety critical coding standards prohibit stored pointers.
Steve is by far the clearest when it comes to explaining anything Computerphile. Clear, Concise, C programmer - all the C's! :)
“...which means I need a very high tech piece of equipment”
*showed us a floppy disk*
videos like this make me really appreciate C as a language
C is the ultimate language.
Second term computer science students, rejoice. Computerphile has come to save you.
i swear, this channel has saved me many times.
Saved me when studying recursion
I've already graduated but better late than never, although I'm doubtful it'll be useful in my job
I remember when Computerphile was brand new, I was in my Data Structures and Algorithms class, and doing horribly. I think I would have done a lot better if they had started this channel about a year sooner. I did tell my professor about this channel on one of the last days of class, so I never found out if he ever used it in any of his lectures or not.
Damn son. Saved me again.
A fantastic video. It's set up great. And the fact you do "myth busting" and teach people how to test a question like this is simply awesome.
First impression. Depends on the operation. I tend to use arrays when I need contiguous memory, or I have a known length, and if I need random access. I use linked lists when building lists with unknown length and where random access is not required.
I stumbled across this video this evening, some 5 years after it was published. Brilliant work, mate! I'll be searching to watch more of your videos right now!
This will completely be a function of use case and implementation. I'd love to see what you all set up for this test.
He's talking about complex computer stuff and I'm just intrigued by the little ASCII spinner at 9:37.
Edit: Haha, they talk about it later on. :D
lol same here. The first thing I wondered was "I wonder how many valuable CPU cycles are being wasted animating that."
Used that spinner on VDTs, back in the day... { putchar('\b'), putchar( "\\|/-"[ a++ & 03 ] ); }
His was displayed inside "()", so I suspect some kind of "printf()" that certainly wouldn't help speed things along during initialisation (on the Atari)...
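For the curious, here is a self-contained version of that spinner idiom (my own sketch, not the code from the video); the fflush matters because stdout is usually buffered:

#include <stdio.h>

int main(void)
{
    for (unsigned long i = 0; i < 4000000UL; i++) {
        /* Only redraw occasionally, so the spinner doesn't dominate runtime. */
        if ((i & 0xFFFF) == 0) {
            printf("\b%c", "\\|/-"[(i >> 16) & 3]);
            fflush(stdout);   /* force the character out immediately */
        }
        /* ... the real work of the loop would go here ... */
    }
    putchar('\n');
    return 0;
}

As the thread notes, every redraw costs cycles, which is exactly why you throttle it to every 65536th iteration rather than printing each time round.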
Not sure why he's generating random numbers to initialise all those 'p's... Any (smallish) number would do just as well: struct.p = loopcounter;
Having written lots of code for diverse tasks, imo, each approach has its use and purpose. Arrays when their size is not prone to fluctuation, single or double linked lists when there's lots of inserting/deleting to be done. There's no one "optimal" scheme that is best in all circumstances. (A recipe book has a fixed number of recipes, but each recipe has a variable number of ingredients... Thankfully 'C' has dynamic heap allocation and pointers...)
@S B C's "volatile" keyword makes the optimiser a little more cautious about its assumptions of what stays and what can be replaced.
Scary will be the time when optimisation replaces some code's explicit linked list with array operations, or vice versa, and coders come to rely on that.
The demo would work just as well with recursive function calls for the linked list example using the stack as the 'chain' of object instances.
Great info about the differences in the way the chips and the compilers handle the same simple code. Stuff like that fascinates me. Thanks.
a + b = b + a
you can watch Numberphile for that 😂😂 LoL
more like LoL are three letters => Half life 3 confirmed
SABER - FICTIONAL CHARACTER Not in computer science. ab does not equal ba.
seems legit
But you forget that we can extend list easily but if we want to add 1 extra element to an array we must create new one
wait, i thought that there was a conjecture, that a+b=c
so is that not valid?
Also, depending on the fragmentation of the allocator the linked list nodes are taken from, on a cached CPU it can be much, much slower than an array, which always has its elements on the same cache lines relative to one another. And then there's a funny case: supposing you have an allocator which allocates things linearly and you allocate the whole linked list at once, it basically gives you an array with a bunch of (useless) pointers to next nodes between each element. This will perform not quite as fast as the array, because it won't be able to fit as many elements per cache line due to the space wasted by the pointers, BUT it'll be much closer to the array than before. Overall, the speed of operating on a full linked list on a cached CPU will vary depending on where the nodes were put in memory relative to one another.
Another thing to note is that, in my work anyhow, you're very unlikely to see a list as big as 125k elements. So the effect of the cache on full array loops is increased even more so.
Thank you for clearing that up. I come from programming at the end of the 1980s. I didn't get the linked-lists thing and why arrays weren't used; people were talking about linked lists instead.
A lot of it was about just allocating the memory you used, then recycling it, which favoured dynamic approaches storing structs/records together.
Traditional arrays require sizing and static allocation, which puts arbitrary limits on what a program can handle. You didn't want to allocate a massive array on start-up for some feature which wasn't used, or which would restrict what the system could do.
The languages used weren't necessarily as flexible as C which limited options for long running programs.
Allocate at beginning, process and terminate worked fine for short lived programs running jobs on multi-user systems.
As users began using larger GUI programs, static pre-allocation became less feasible.
Ironically I see this video next on the auto-play list "Bjarne Stroustrup: Why you should avoid Linked Lists"
why indeed
@@tanveerhasan2382 Watch it, then you'll know. But generally, because growing an array geometrically reduces the amount of time you need for expanding/shrinking it, and because insertions/deletions, which are a Linked List's strength, still require traversing it in O(n).
From my own, limited experience, I've not yet come across a case where anything was actually faster than an array. In one case, a hash map came close. It might even have been the same speed, and was definitely more convenient because I had key/value pairs, but I ended up ditching it and using a sorted std::vector, because I needed the simplicity for MPI compatibility.
Wow. That has to be one of the best videos on the topic of computer hardware differences, and on how we as programmers should not make any assumptions about how a machine is supposed to function when executing code. I love the "I'm writing in a high level language" when actually it's C, one of the least abstract languages you can use without resorting to assembly. Just shows you how much this man knows his stuff.
gcc can unroll loops too, it's not a clang-specific feature.
Every decent compiler can.
for(p=0; p
He wasn't talking about unrolling a loop, he was talking about SIMD.
When GCC switched to the GPLv3 license, Apple forked GCC. I believe the GCC version shipped on OSX is something like 4.2, mainline GCC is up to 7.1. This was in like 2006 or something. GCC has been significantly improved since then. So it's not unlikely that his version of GCC won't unroll and vectorize the code, even though real GCC will.
Reckless Roges Fails to compile. You smartass...
why is the backwards array faster than the forwards array on the Atari?
I'd say that it comes down to the line
for (i = 0; i < number_of_elements; i++) - where the test against the upper limit has to be done explicitly on every iteration, unlike counting down to zero.
8:11 "p inside p, were using p inside p, confusing variable names, but then I'm a C programmer" LOL.
Would love to see Dr. Jason Atkin next time. Had him for my C++ module. He is amazing. Dr. Bagley is awesome as well.
Good call on running it on a simple machine. More entertaining and more consistency, win win!
I'm certain Dr. Bagley already knows about this, but on 68k specifically an array will ALWAYS be faster than a linked list when coding in assembly, since you can predecrement/postincrement the address pointer.
(as in direction doesn't matter)
ie:
lea.l "address of array", a0
clr.l d0
clr.l d1
move.l 1048576, d2 (Number of iterations.)
Loop:
move.b (a0)+, d1
add.l d1, d0
subq.l 1, d2
bne.b Loop
d0 will contain a checksum32
you could also add 1048576 to the address of the array and do "move.b -(a0), d1" inside the loop to read it in reverse. Speed should be more or less the same.
With a linked list you also have to fetch a new address for every iteration so there's no chance that is faster on these specifically :)
I can understand that the benchmark had to be trivial in order to run on the Atari ST, but I think simply summing integer elements in a loop gives a bad indication of the performance differences between linked lists and arrays/vectors on modern architectures. If the pointers in a linked list point to a larger data structure, it can absolutely thrash the cache and cause big performance hits. It is even worse if the elements in the linked list are scattered around in memory. With the three levels of cache we have now, cache misses give a hefty penalty.
Using arrays/vectors is almost always the right choice nowadays when performance is important, even if you don't do random-access. The only exception I can think of is for huge data sets with frequent insertions and removals.
Madsy9 Worth noting: if you don't care about the order of items in your array, both insert and remove operations are O(1) (you can swap with the last element).
The point wasn't to address a typical use case. The point was to illustrate that the fundamental architecture underlying your code can in fact cause things to behave in unexpected ways. Yes, most of the time X may be true... but *assuming* it's a universal case can cause problems.
It depends on your use case, I often end up in cases where an unsorted array and a bit of brute force is faster than a tree or hash based set when your data size is tiny (for example under 10) but you're doing a huge number of lookups.
+Madsy9 Sorting is another area where linked lists can be faster, especially if the record size is large, since sorting a list only requires altering the pointers, not moving entire records around.
Possibly the compiler is doing something less than optimal when making the code for an array reference.
While in the Air Force in the early 90s we had a problem writing a document paginator on a worst case 5MB report file. (Ultrix workstations). The program sat there and churned for over 30 seconds while the paginator routine determined the starting location of every page in 17 different page formats. Users found this waiting intolerable.
I read an article then about how C compilers may generate a multiplication to reference array elements, and that traversing an array by pointer reference instead of subscripts would not include this extra math behind the scenes and should run faster. I tried this idea out on my Amiga 3000 and the paginator using pointers ran in 12 seconds for an equivalent 5MB text file. I was pretty sure the 25MHz 030 should not be a lot faster than the Ultrix boxes at work. The recoded pointer version of the paginator ran in 5 seconds on the ultrix workstations. Happy users.
It's always a good idea to have an understanding of what's going on under the hood.
But in the end the array was faster on all the CPUs with a cache, which should be all CPUs now. Right?
Generally, yes... across most use cases and particularly with modern cpu/compiler/memory architecture arrays will be faster than linked lists. But the point of the video is that it's not a fundamental given. Even in cases where you can math out the number of steps involved in the logic, your real-world systems that the code runs on can affect the actual results.
Even CPUs considered obsolete have caches. CPUs without one are old enough to be considered antiques.
correct me if I am wrong: linked lists are slower but memory efficient; slower than arrays as they don't provide random access. But heaps can overcome all that stuff. Is there anything more to it in the video?
Not all modern processors have caches. For instance, most Arduinos do not have a cache.
Linked lists are not memory efficient, as they require more data to keep (every node keeps its key and the addresses of the previous and next nodes). However, it is easier to add or remove nodes in the middle of the list. If you search data structures on Wikipedia, I think they give both the memory efficiency and the time efficiency for the different operations.
When using the Raspberry Pi 3 there is the NEON co-processor. Compiling with -O3 could generate vectorized code. To be sure it is not, vectorization should be disabled. Vectorized code would increase the speed of array manipulation.
It's not necessary to pre-declare arrays in Perl. They grow to whatever size is required. (It may be slightly faster to create large ones in one go, if it's known in advance how many will be needed.)
Taking the median would give more accurate results. The OS scheduler might interrupt the program sometimes, resulting in distorted results when taking the average.
True
If you need linked lists on modern architectures, store them in arrays, and instead of pointers use indices. It can be faster than a pointer-based linked list, but of course you still need to measure that. The general approach for fast code is to put all required data relatively nearby in memory. Allocating one big array chunk kind-of ensures that (lol, there are OS caveats still), as long as the elements aren't themselves pointers.
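A minimal sketch of that index-based idea in C (illustrative code and names, mine):

#include <stddef.h>

#define NIL ((size_t)-1)   /* sentinel index playing the role of NULL */

/* A linked list stored inside arrays: 'next' holds indices, not pointers,
   and every node lives in the same contiguous allocation. */
struct index_list {
    int    value[1024];
    size_t next[1024];
    size_t head;           /* index of the first node, or NIL if empty */
};

static long sum(const struct index_list *l)
{
    long total = 0;
    for (size_t i = l->head; i != NIL; i = l->next[i])
        total += l->value[i];
    return total;
}

You keep O(1) splicing (rewrite a next index instead of a pointer) while the nodes stay packed together, which is the cache-friendliness being described.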
So in all fairness you'd need to enable loop unrolling in gcc (experimental) and compare again. That should significantly speed up the array-based RPi compilation.
A lot of people in the comments do not know (or at least 4 years ago did not know) that you can insert linked list elements at the head in O(1) if order does not matter. It will reverse the order of elements coming in, of course, but this gets rid of the need for an O(n) insert or needing to track the last element in the linked list.
Being a C programmer is no excuse for confusing or non-descriptive variable names. As much as possible variable names should explain the purpose of their data unambiguously.
I recognize there are times when this is not easy, but with modern IDEs there is no justification for bad variable names. Don't ignore the power of your tools.
He is a computer scientist not an industrial programmer
Ok so i will name it
"Pointer_that_Points_To_Next_Element_In_The_Array_List"
♫♪Ludwig van Beethoven♪♫ Or you could just name it nextpointer#
Butthurt.
He said "asleep" programmer, not a C programmer, so he had an excuse
15:20 - "But if u think about it, if we started at the end and moved backwards, it would be exactly the same as starting from the beginning and moving forwards, so there's no point in testing it."
Doesn't that seem like a rather unfair assumption to make? I mean, most people would have assumed looping backwards through an array would be "exactly the same" as looping forwards, but that turned out to be false. Why would u then assume, without even testing it, that going backwards through a list would be just the same?
Dear me, this is dodgy really. The choice between data structures should be made depending on what your use case is. E.g. is random access by index needed? Are inserts needed? Is iteration needed? I found this video interesting in some ways, but it fails to address the important points of how to choose a data structure.
I expected the reverse loop to be fastest, because that's one subtraction less.
With a normal loop for (int i = 0; i < n; i++), the end-of-loop test effectively subtracts i from n every iteration; with for (int i = n - 1; i >= 0; i--) there is no subtraction and i is directly compared to 0.
In any case, testing speed optimizations is needed. I sometimes theorized optimizations that turned out to actually be slower than something simpler.
9:03 I thought I was the only one still using floppies for some applications... ;)
21:40 I would have liked to see the results on an Intel Windows computer.
I'm 2 years too late, but I think that there is one change to the linked list version that will speed it up on the cached processors. I assume that the elements of the linked list are added to the beginning as they are generated. As the summing loop walks the list, the memory address for each successive element is decreasing, making it more difficult for the CPU to predict and do speculative prefetch. If the newly allocated elements are appended, the general direction is toward higher addresses, which are easier to predict for the prefetch mechanism.
I worked on a commercial C compiler for the Atari ST back in the 1980s, and benchmarks were important to the success of the product. Optimization strategies were weighted toward better performance on things like the Dhrystone with the knowledge that there was no cache.
Oooh, which C compiler was that?
@@DrSteveBagley mwc (Mark Williams C) for the Atari ST.
It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing. You take the starting address and add (n * sizeOfElem). Arrays are fixed size; linked lists are of variable size and more flexible. At the same time, memory allocation is very slow; if you put your array on the stack it's again literally a single CPU instruction (adjusting esp). Running over all the elements would still be faster in an array, since in a linked list any element might be anywhere in memory, and thus it might not be in the cache in a sufficiently large list, while the stack is almost certainly in the cache. Also you need to dereference a pointer, meaning looking at a memory address to fetch the memory address at which the next element is, which is 2 memory accesses, while the array is again literally ESP + offset. Also linked lists are larger in memory, since you have to store that extra bit of information.
"It's quite obvious. Linked lists are of complexity O(n) for lookup, insertion or deletion. Arrays are of complexity O(1) for the same thing."
No - arrays only have complexity O(1) for lookups. When it comes to inserts and deletions, on average half of the data in the array has to be moved/copied, so the complexity is O(n) for inserts/deletions in arrays.
Arrays are (compared to linked lists) faster in random read but slower in inserting or removing elements.
They are also faster for sequential reads, where all elements starting from the first one have to be read, when optimized by a modern compiler.
Normally you need the calculation
startingAddress + index * sizeOfElement
when you want to access the element in the array with a given index. But if you have a loop which increases the index by one every cycle, a modern compiler can get rid of that multiplication by introducing a variable "offset", which is set to startingAddress before the loop and to which sizeOfElement is added after each cycle in the loop.
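In C terms, that transformation (often called strength reduction) looks roughly like this - an illustrative sketch with made-up names, not the video's code:

#include <stddef.h>

struct item { long value; };

/* What the source says: index each element, implying an address
   computation p + i * sizeof *p on every access. */
long sum_indexed(const struct item *p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i].value;
    return total;
}

/* What the optimizer effectively emits: bump a pointer instead,
   so each iteration does one addition and no multiplication. */
long sum_reduced(const struct item *p, size_t n)
{
    long total = 0;
    for (const struct item *end = p + n; p != end; p++)
        total += p->value;
    return total;
}

Both functions compute the same sum; the question raised in this thread is whether the 68k compilers of the era performed this rewrite, or left the multiply in.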
Can Dr Bagley do another shootout, this time between lookup tables and live calculation? That would be amazing to see how modern processors differ from older ones.
"There's technically a cache but not really but essentially no but there's something like a prefetch"
Hmm... another thing with the Falcon is that it makes use of the 68030's burst mode memory transfer mode (sort of an early attempt at EDO, which naturally makes loading large blocks faster than true random-access), as well as the cache. It'd be interesting to see the code run again on the MSTe at full 16MHz speed with cache enabled, to see which program is faster in that case, or if it ends up being pretty much 50:50.
Unfortunately, the Falcon030 doesn't support burst mode. The TT did, though. Both Alive and New Beat document this deficiency. I agree that it would be interesting to compare with the MSTe in 16MHz mode with cache enabled, though.
It'd be great if Dr. Bagley was recorded in front of a darker background. The over exposed window background is rather distracting.
Damodara Kovie Maybe it's part of who he is. I like to think it is intentional, after all they're computer scientists; they know how to use a damn DSLR.
Bryan Keller
Fair enough, but I'm appealing to Sean Riley, the videographer and editor.
Just needs some front lighting
Actually this is a very interesting video. Many might have thought that the RPi is just an intermediate device for running tests, but it actually runs on a totally different architecture, arm64. I know that many servers are migrating to arm64 nowadays as it is much more energy efficient; however those cycle counts are alarming, as the clock speeds of those processors are not any higher than Intel ones.
well... you should really re-run it with C++11 range-based for-loops, because these can have a pretty big impact on the performance of array code
So you're saying the node pointer is Mario, the node is a remote floating platform, the "next node" pointer is a chuckster, and that my neighbors have already reported me to the police for blaring the a cappella theme on max volume. Well, I must admit, I like what you have to say, sir. I like it a lot.
Extremely disappointed this was a comparison of a linear scan of both data structures. What about arbitrary insertion and deletion? Those are some of the use cases linked lists were invented to address (no pun intended).
Couldn't agree more. The entire video I couldn't stop thinking about how this comparison is unfair.
On top of that, something was seriously wrong with the compiler implementation on the Atari ST, if a linear forward scan of an array was somehow not faster than a linked list. And noting the speed of list/array creation - something that is typically done only once - was sort of pointless.
Arbitrary insertion and deletion of large structures (100's or 1000's of bytes) is going to be slow in an array. Then there's making a list out of items that may not be contiguously allocated, like a list of request structures for I/O coming from different threads that they allocate themselves, possibly on their own stacks. Of course, for the former, you can create an array of pointers to the big structures and move those pointers around the array if that suits you.
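To put the thread's point in code, here is a sketch of the two middle-insertion costs being contrasted (illustrative helpers of my own; the array version assumes spare capacity already exists):

#include <string.h>
#include <stddef.h>

/* Array: make room by shifting everything after 'pos' - O(n) element
   moves, and the moves get expensive when the elements are large structs. */
void array_insert(int *a, size_t *len, size_t pos, int v)
{
    memmove(&a[pos + 1], &a[pos], (*len - pos) * sizeof *a);
    a[pos] = v;
    (*len)++;
}

/* Linked list: splice a node in after 'prev' - O(1) pointer surgery,
   however big the payload, once you have already walked to 'prev'. */
struct node { int value; struct node *next; };

void list_insert_after(struct node *prev, struct node *n)
{
    n->next = prev->next;
    prev->next = n;
}

The catch, raised elsewhere in the comments, is that *finding* 'prev' is itself O(n) in a list, so the O(1) splice only pays off when you are already standing at the insertion point.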
I found this disappointing too. It's billed as a comparison of data structures and ends up being a comparison of hardware architectures. When I was in school we used the concept of an abstract machine, and everyone learned the pros and cons of arrays and lists for sequential access, random access, inserts, deletes and moves. If a linked list beats an array for sequential or random access, something is definitely screwed up.
While I definitely agree that the choice really comes down to what you need the structure for, I found this information important as well, as it explains another aspect of choosing a structure that some people might not think about and that isn't considered as often as big-O notation.
I guarantee you that if he had done the video we expected, there would be people complaining in the comments saying that he didn't consider how they run on different architectures.
Besides, at 30 minutes this video was long enough as is without comparing all the different use cases (and the previous video did mention some of it anyway).
It would be helpful to have a link in the description to Alex's video...
Isn't the Atari array code running slow vs the linked list because the index-to-memory-address conversion uses multiplication? It's more related to C array syntax and the math occurring because of the index parameter. I expect this (multiplication) is a larger factor than memory access and the lack of caches on the old Motorola 68000 Atari system.
Linked list code will not involve multiplication.
Did you look at the assembly code generated?
Did you notice that it was an array of objects, and not an array of primitives? Makes the array testing come out far, far worse than it otherwise would.
Joseph Cote, This is C, not Java. They're called "structs". ;)
So what? My point still stands. Arrays of pointers will always be less efficient than an array of primitives.
Joseph Cote While I don't care about the whole "object" vs "struct" thing, it is important to note that, unlike in Java, data structures in C aren't referenced by pointers unless the code specifies that they are. In C (and other languages), any structure, regardless of its type, can be accessed either directly or by pointer. In this case, the array was a contiguous sequence of structures, no pointers involved other than the pointer to the array on the heap.
Let me make my point even more plain.
Linked list. Pointer to the head. Each item has a data and the pointer to the next "thing". For each item in the list you have to get the address of the item either from the head pointer or the previous item. Memory access to get the data, memory access to get the next pointer. Perhaps even more processing to resolve the actual addresses within the "thing."
Primitive data array: using the base address of the array and the index, calculate the address to read (very fast, all in the CPU), access memory to retrieve the data, go to the next item. For each item, ONE address calculation and done.
This is my ONLY point: Arrays of primitive values will ALWAYS be much more efficient in tests like this than any kind of fooling around with pointers and objects or whatever you want to call those "things." The video compared a linked list of "things" to an array of "things", NOT a big difference.
This whole thing is really muddied by basically every detail of the hardware architecture, but part of the reason walking backwards through an array can be faster is the loop end condition test (it's also more likely to be noticeable in this case because the actual work done by the loop is very small, so loop overheads are magnified). Unfortunately the video didn't show the loop code for the backwards case, but consider the fundamental work needed to handle these two cases (ignoring compiler optimisations like loop unrolling and auto vectorization):
for (i = 0; i < 100; i++)
    sum += a[i];
versus
for (i = 99; i >= 0; i--)
    sum += a[i];
Both visit the same elements of a: [0, 99], but the forwards version needs these sort of operations:
- fetch the value of a[i] from memory
- add that to an accumulator register (sum)
- increment the loop counter (i)
- subtract the value of the loop end point (100) from the loop counter
- jump back to the first instruction if the result of the above operation wasn't 0
In the backwards version:
- fetch the value of a[i] from memory
- add that to an accumulator register (sum)
- decrement the loop counter (i)
- jump back to the first instruction if the result of the above operation wasn't 0
Notice that in the second (backwards) version we need one fewer arithmetic operation, because we can take advantage of the fact that the decrement will set the zero flag when it reaches 0 (in every architecture I can remember working on), whereas the increment to 100 doesn't trigger any flags to change that we can use to conditionally jump to the top of the loop - the cpu has to subtract the end value to see if the difference is 0 (cmp is just a subtract that throws away the result but keeps the effect on the flags register), and then do the conditional jump.
Since the actual work done in the loop is 1 arithmetic operation and 1 memory fetch, the additional arithmetic operation to test for the loop end is a significant extra overhead.
a question which interests me most now: is the floppy HD format or DD?
"DD" with a hole drilled to make HD
Why would you do that, as it's being used in a machine that, odds on, only has a double density drive? (I know HD was at least a later option for the MSTe, but STs in the main have 720k drives)
As well as bothering with that trick when proper HD floppies remain cheap (heck, pretty much free these days) and plentiful, and have formulation that's a better match for the drive characteristics?
PC drives including USB ones can still happily read and write 720k discs, btw.
A huge advantage of arrays is, as mentioned, cache friendliness (data locality). But more than that, newer CPUs also include SIMD instructions to perform operations on a lot of data at once, say adding 4 or 8 ints in parallel. Add this in and the array beats a linked list even harder on modern machines.
I would try compiling and running with -march=native (along with -O3), which will allow for CPU specific optimizations. Or use some intrinsics to manually add the needed SSE or AVX instructions.
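If you want to check whether the compiler actually vectorized the loop, both major compilers can report it (these reporting flags exist in mainline GCC and Clang, though the exact output varies by version; sum.c is a stand-in name for whatever source file you're testing):

gcc -O3 -march=native -fopt-info-vec sum.c -o sum
clang -O3 -Rpass=loop-vectorize sum.c -o sum

The array loop is a prime candidate for auto-vectorization; the linked-list walk is not, because each load depends on the pointer fetched in the previous iteration.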
"The compiler is being clever", which is why you don't optimize code to compare algorithm speeds. Also, this is just traversal. How about speed testing insertion in the middle, which is what linked lists excel at?
Also, the iMac has SSE4 and AVX instructions, which can add 8 or 16 array items in one instruction.
Which is facilitated by the loop unrolling, and then the cache makes it so that all 8 or 16 of those elements are accessed in a single read.
8:14 "Confusing variable names, but then, I'm a C-programmer." Haha!
You could also add that on the iMac the array could fit entirely in cache, as opposed to the linked list, and optimizing the code with loop unrolling gave it all the speed boost. So theoretically it might never have accessed main memory except to load the program from disk when running the array version.
8:15 "But then, I'm a C programmer." xD
Dream selection of retro kit in that office :)
Actually... you may want to look at the index calculation time on the array on the older machine. Some compilers didn't optimize for powers of two and used a multiply for the index. This was pretty painful back in the day.. remember the ARM processor had a serial multiplier on some models but a 5 or 6 clock multiply was also a possibility.
People forget this today; typically it's 1 clock.
IIRC the 80486 had a 3 clock integer multiply
If the linked list stores its data like an array, it'll likely be about as fast. But in practice, linked lists see insertions, deletions and concatenations, all of which fragment the data and create cache-miss opportunities. Context matters in this case, which is why it's best to profile these things in the situation you need them in.
Man, I really wish I could switch my major right now... I've always had something for computers
Do it
26:28
"the fact that the value is precalculated in memory..." what exactly does this mean? what is he referring to?
I had the same question. I think he probably means the fact that indexing has to do a calculation to get the result. I.e. getting the 5th element of the array requires doing " arrayheadptr + (5 * element_size)" or something like that. This doesn't have to be done in the linked list version because the memory location pointed to by each next pointer is already 'precalculated'.
Not sure... I didn't pay attention to this particular part.
Maybe he's talking about the fact that modern processors don't run each instruction in a single step but through pipelines.
Let's say you want to compute a+b. The addition will be processed in several steps, for instance 3 steps
(a,b)-> STEP 1 | STEP 2 | STEP 3 -> a+b
Imagine that each step takes 1 unit of time. The time to complete the addition is 3 units.
Now let's imagine you want to compute a+b+c+d.
Three additions have to be performed, so the time to compute a+b+c+d should be 3*3=9. Right?
Well.... in fact, you can go faster because each component of the processor can work independently. Here is how the compiler can schedule the computation of a+b+c+d
Time
t=1 STEP1 of a+b
t=2 STEP2 of a+b AND STEP1 of c+d
t=3 STEP3 of a+b AND STEP2 of c+d
t=4 STEP3 of c+d
t=5 STEP1 of (a+b)+(c+d)
t=6 STEP 2 of (a+b)+(c+d)
t=7 STEP 3 of (a+b)+(c+d)
So it takes only 7 unit times instead of 9 !!!!
As you can see (c+d) is in some way "precomputed" during the computation of (a+b) itself.
And as everyone knows, (a+b)+(c+d)=((a+b)+c)+d [Numberphile probably has a video for it], so you get the expected result (but faster).
How about sorting?
When you have big enough objects in an array/list wouldn't sorting a list be quicker when you can manipulate pointers instead of moving whole objects around in an array?
For BIG structs this should be true, but first you must come up with an efficient in-place algorithm where you need no random access. I'm not quite sure whether you can do selection sort or something comparable with a singly linked list (this example should work with doubly linked lists); with quicksort it should be even harder...
skeliton11211 I meant to add that to the question, to be honest. Of course an array of pointers would be much quicker. When you have a warehouse full of containers you make a checklist of where everything is and organise that, and don't move all the containers around.
Stefan Gräber I realised that later. Sorting a one-way list might be problematic, since you have to relink the neighbouring objects and insert the object in its new place. It results in far more pointer manipulation than an array of pointers. My question was more about the "pure" version of the structure.
I would imagine it depends on what you're sorting, and what sorting algorithm you use.
First, on the data being sorted:
Arrays have to swap the contents, so the time is dependent on the size of that content.
Linked lists have to swap the pointers, so constant.
These two are equal if the data being sorted is the size of a pointer (or smaller); otherwise the linked list (I must assume, I should point out that I'm not an expert) will win out.
As for how you sort:
Insertion sort moves backwards along the array/list, and so is an outright no for linked lists (unless, of course, you use a doubly linked list, i.e. a linked list linked in both directions, which would have even greater memory use than a single linked list and would require two pointer swaps rather than just one for every data swap).
Heap sort relies on random access, so would take a hit on linked lists.
Merge sort and quick sort (properly implemented) work well on both arrays and linked lists.
Radix sort I'm not sure about. On the one hand, from what I recall, what you're sorting is specifically the sort of thing that would benefit from linked lists only needing to swap pointers rather than data. On the other, I would also think these things to be things where the array would only contain pointers to them, for example strings (assuming you aren't accepting the restriction of all strings being the same maximum length, and probably all close to it (otherwise you're wasting lots of space)).
Provided you keep in mind which sorting algorithms to use and which not, and that your array elements aren't unreasonably large, the difference is (probably) negligible: sorting an array/list from scratch is something you probably won't be doing that often, or with data large enough, for it to really matter which of the two you use, at least in comparison to other operations you're likely to use on the same data.
Lastly, I suspect that if you're sorting an array or list, you're probably going to want to do other things to it that linked lists are better suited to, that is inserting and removing items. That bit however is purely subjective and I don't have anything to really back that up.
Quicksort can be done in place so there's no need to do large copies. Even still, moving small structures is still often faster than moving pointers in a linked list, again, because of the cache which can preform the move operations in bulk.
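Since merge sort keeps coming up as the list-friendly option, here is a compact relink-only merge sort for a singly linked list - a sketch of the standard technique with my own names, not code from the video. No record is ever copied; only the next pointers change hands:

#include <stddef.h>

struct node { int key; struct node *next; };

/* Merge two sorted lists by relinking. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node head = {0}, *tail = &head;
    while (a && b) {
        struct node **smaller = (a->key <= b->key) ? &a : &b;
        tail->next = *smaller;      /* link in the smaller front node */
        tail = *smaller;
        *smaller = (*smaller)->next;
    }
    tail->next = a ? a : b;         /* append whatever is left over */
    return head.next;
}

struct node *merge_sort(struct node *list)
{
    if (!list || !list->next)
        return list;                /* 0 or 1 node: already sorted */
    struct node *slow = list, *fast = list->next;
    while (fast && fast->next) {    /* fast moves twice as quickly, */
        slow = slow->next;          /* so slow ends up at the middle */
        fast = fast->next->next;
    }
    struct node *right = slow->next;
    slow->next = NULL;              /* cut the list in two */
    return merge(merge_sort(list), merge_sort(right));
}

This stays O(n log n) regardless of record size, which is the advantage being discussed: for large records, relinking beats moving the records themselves around an array.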
Instead of using an array index, use a pointer to the first element of the array, and increment the pointer. I think that will be even faster.
Modern Intel processors handle adding an index to a pointer and dereferencing the incremented pointer all in a single instruction.
The reverse speed is likely due to how the cache works. (I haven’t seen the end of the video yet).
Very insightful benchmark. At first I was worried that you wouldn't mention the cache's role in it, but you did.
The only thing missing is that this is a synthetic test, and real-life applications would show rather different results. The reason is not only the cache misses you mentioned, but also that on x64 a linked-list node would be 2x the size of an array element. This has consequences for how much useful data gets prefetched into the cache and how much the cache is polluted, and thus bigger performance hits during execution.
Also, as this is a synthetic test, all the memory for the linked list is allocated one block after another anyway, so there are fewer cache misses than there would be in the more likely real-life scenario, where the nodes would be scattered through memory and a single fetch would likely be a miss.
Everything you say here is true, but beside the point. He's trying to teach a more fundamental fact about CompSci in general - just because X is faster/better/whatever on paper doesn't mean that will always turn out to be true when the code hits metal. While it's true in modern cases that most programs can ignore the underlying architecture in most cases, sometimes it can result in unexpected behaviors.
There are use cases for linked lists, and use cases for arrays.
If your collection is immutable then linked lists are always faster, as you would have to copy the whole array on every change.
If we are talking mutable collections then array(lists) are usually faster.
That is, as long as you are not changing the length of the list. If you are just adding/deleting at the end, arrays are still usually faster; when you are changing something in the middle, some form of linked list is faster.
Linked lists are only useful if you make small jumps from node to node.
Adding or deleting random elements in a linked list is about as fast as in an array.
(That is, provided the array has free space.)
Spaces in an array are definitely a problem and a huge source of bugs. That is, if we are talking ordered collections. If we have a hashmap or something like that, it is obviously totally fine. But comparing a map to a list won't give you the results you want.
aullik the free space should be at the end of the array (look at Vector or ArrayList)
Then I misread your comment. I know that array lists normally have space at the end.
For smaller lists this means they will be faster than a linked list. But once your array is so big that you have to copy it across multiple cache lines, a linked list will be faster.
Wouldn't it have been faster without printing stuff to the screen in between? I know that many of my C and C++ programs run much quicker if you leave out the print statements.
that's a good point, writing to the screen can take some time. however both versions print to the screen, so they are subject to the same slowness. it's like if you had two people run a race through syrup - you can still tell who's faster than who.
yes, however the percentage speed difference is exactly the same, as they are printing the same length of text (minus one or two chars) the same number of times, so it won't affect the comparison
They did the prints outside the timings, so it would not affect the recorded time.
If you're finding that printing half a line of low rez text to the console makes a significant difference to the loop speed of your otherwise ~850ms routine, even on an ST, then you've got bigger problems than whether or not to print said text. Like, realising that you're actually stuck in a runtime interpreter for a particularly shoddy variety of BASIC when you thought you were making compiled C programs.
If they're routines that take closer to 850us, then, sure, maybe it'll be worth it. But the text only prints after the loop is done (which is how it knows how long said loop took) and before the start of the next one, so it's not going to affect the reported results any, and in most cases the difference between printing or not printing 100 half-lines of text will save you, ooh, maybe one video frame's worth of runtime?
When I choose a list and not an array, it is not about the speed but the arithmetic.
You run into complicated delete and replace methods if you don't choose the right type to store your data.
finally, something academic 😅
I love these in depth looks on computing processes, keep it up!
Did he actually say "doog yrev"!? lmao
Did eeh sey
That's what it was!!!
For cached architectures, it would be interesting to see the performance difference in the linked list version if you malloc'ed a single contiguous block of memory and then 'sub-allocated' the structures in that memory block.
insert word association programming humor here
That could probably be done faster with a linked list than with an array.
THC IS IN MY P
OK, so real computer science talk here regarding array vs linked list.
A linked list is a modular data structure that can grow or shrink to accommodate data in real time. Arrays need to be completely reallocated and copied to change size. Linked lists also require an extra bit of data in each element, while arrays do not. Linked lists must be accessed from top to bottom, unless you saved an item of interest in advance.
It should be noted that the overhead of a for loop adds more instructions to this operation than just moving a pointer into a local variable.
For loops require an index value such as i, a comparison operation, and an iteration, while a linked list only needs to go to the next item until that next item is 0.
Both situations sound simple and the same; however, they are not when it comes to the assembly. With the list you would just move the next item's address into a register, compare it, and keep going, but in a for loop you have to compare the index counter to the length of the array and then increment the index counter. Since the for loop involves two variables (the length of the array and the index counter), you would need to push those values onto the stack and back off again as your application stepped deeper into its functions, and/or move those variables into and out of registers each time you reach the comparison. With the linked list you only need to check whether the value stored at a fixed offset from the list item's pointer is 0 or not, and since that offset is a constant, there is less bookkeeping to pull in from memory.
I'm not sure I'm making a good explanation here, but essentially a for loop uses more steps to maintain the loop, while a linked list loop just keeps going until a condition is met.
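In C, the two loops being contrasted look roughly like this (a sketch; the names are mine):

#include <stddef.h>

struct node { int value; struct node *next; };

/* List walk: the loop test is just "is the pointer NULL yet?" */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        total += n->value;
    return total;
}

/* Array walk: the loop must also maintain i and compare it to n. */
long sum_array(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

Whether the extra index bookkeeping actually costs anything depends on register pressure and the compiler, which is rather the point of the video.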
MEMORY IS AN ARRAY!!!!!!!!!
I just learnt about the "-O3" setting in GCC. Nice one, thanks!
FWIW, -O3 has a tendency to break code, -O2 is a safer, albeit slightly slower alternative
@@anshul5243 ozone mode bad
Steven: *uses VSCode*
me: MAH BOIiii
Loving the Atari MegaST in the background!
Your original test is invalid. You unnecessarily used a different type of loop, with the loop used for iterating over the array doing a bunch of useless loop-maintaining arithmetic and comparisons that any competent programmer would have either removed or accounted for in the benchmark. Pro tip: summing the elements of an array is in fact faster, since it reads half the amount of data from memory for each access (only the number being summed, and no pointer). Since the x86 has more registers, the optimizing compiler had better luck keeping your loop counter in a register and not in memory, whereas on the 68k the loop counter was stored as a local variable (on the stack, in memory), and you access it 3 times per iteration: once to compare it, once to read it for incrementing, and once to write it for incrementing. That's completely unfair to the array implementation.
68k has 8 data registers and 7 address registers, it should be enough for all three programs.
The operation that slowed down the forward array iteration was probably comparing the index with the constant, which requires one more instruction than comparing it to zero.
it's not "unfair." the point is that the speed depends on the architecture's bottlenecks.
imagine we were dealing with HDD instead of RAM. you would want a data structure that minimized reads from the disk because it's so much slower than memory. On the other hand with an SSD, you are not penalized so much for excessive reads and writes. Therefore you can use different data structures and algorithms without incurring the performance hit of the HDD.
That's not how Big O works.
The test is unfair and should be invalid, but not because the 68k is inferior; rather because the for-loop uses more complex mechanics and is thus significantly slower than the while-loop.
The ascending for-loop is the slowest because it uses loop-counter mechanics and has to do an active end-of-loop check every iteration.
The descending for-loop is faster because it uses the very nice decrement-and-branch-all-in-one instruction, and uses the decrement instruction's result to compare against zero instead of a separate compare instruction. But it still has the loop-counter mechanics to deal with.
The linked list is fastest because it just loads the next address and checks if it's zero.
Those computerphile nerds are so friendly. If all teachers were like them...
Commenting on an Atari isn't particularly relevant, as modern computers generally use Intel chips that are very cache-heavy. Many tests have been done on these computers that show that almost all optimizations are next to nothing compared to the cache effects. The bottom line, as illustrated by many, is that arrays are almost always much faster than linked lists, and therefore linked lists should only be used in special cases. Arrays should always be considered the default. My first computer was an 8080 running at 1 megahertz (linked lists were probably ok on that computer), but that isn't very important to what works on modern computers. There is no argument: don't use linked lists.
To be fair, that was not the point of the video. The point was that testing is king.
Modern computers are not "generally" using Intel chips. Intel chips are dominant in PCs, but PCs do not represent a majority of even consumer computer systems. As an example, take the three big gaming consoles of the 7th generation (Xbox 360, Wii, and PS3). Between them they sold over 260 million units, and none of them use Intel chipsets. Furthermore, most cellphones today run on Qualcomm chipsets. That's not even mentioning the non-consumer market, where the Atari test is absolutely relevant, as many embedded chipsets do not have a cache.
Pretty sure the majority are ARM chips. That's what you're going to find in just about anything that doesn't need a high performance CPU and in some things that do, like most smartphones or that Raspberry Pi for instance.
Maybe if you're using something dismal like an Atom or an embedded/microcontroller core based off a 386SX or what-have-you, but I don't think that was the point of the test. But even if it was, you can use the 68000 in the ST as a fairly prime example of such: embedded CPUs based on the 68k core (e.g. the ColdFire) were extremely popular up until a few years ago, when ARMs and ATmegas started offering greater power and lower energy consumption alongside a lower price tag. So we already have the answer: sophisticated, large-instruction-set, large-cache chips prefer regular arrays, and that includes the higher-end ARMs. Simpler, cacheless chips prefer linked lists, unless they also display unusual behaviour like being over 50% faster when traversing an array from the bottom up rather than top down.
Things become a bit more interesting for embedded real-time systems, where finding a memory location to allocate can take too long, so you have to allocate all of your memory in advance or write your own allocation algorithm. This shouldn't make a significant difference for reading the array/linked list, but it should affect creation time. (Rough sketch below.)
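For illustration, a minimal sketch of the allocate-everything-in-advance approach (the pool size, struct, and names here are hypothetical):

    #include <stddef.h>

    #define POOL_SIZE 125000

    struct node { long p; long q; struct node *next; };

    static struct node pool[POOL_SIZE];   /* reserved before the system goes live */
    static size_t next_free = 0;

    /* O(1) and deterministic: never calls malloc; returns NULL when exhausted. */
    struct node *node_alloc(void) {
        return (next_free < POOL_SIZE) ? &pool[next_free++] : NULL;
    }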
VS Code master race reporting in
I think it's C.
Yes, but the text editor he was using to write it on the Mac was VS Code
Excellent break down. Another gem.
Doogy rev, doogy rev. Gronda gronda Rangdo!
wopsim
The most important thing that everyone should learn is this: BOTH ARE FAST ENOUGH! Use whatever method works for your application.
I don't buy this. Give us a pastebin of the source code so we can see if there are any other differences between the two implementations.
Even more, I want to see the assembly-language listings of the compiled output. p[i].p is well known to be less efficient than using pointer arithmetic (even inside an array); however, some (but not all!) compilers can recognize when p[i] is used inside a loop and automatically substitute pointer-based code. While I'd expect this to occur for x86 code, considering gcc's/clang's relative maturity, I question whether the 68K compilers provide similar optimizations.
github or it didn't happen
"p[i].p is well known to be less efficient than using pointer arithmetic". My understanding is that p[i] is equivalent to *(p + i), why would it be any slower?
Mike Scott My guess is that although they're equivalent, the computer still needs to translate p[i].p into *(p + i) and evaluate the address arithmetic.
Mike Scott Because it involves a multiply and an addition, versus p++, which is just a simple addition. This becomes particularly visible with multiple dereferences of p[i].
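To make the difference concrete, a hypothetical sketch of the two forms (assuming an item struct with a p field, as described in the video; everything else here is made up):

    #include <stddef.h>

    struct item { long p; long q; };

    /* Indexed form: each access conceptually computes base + i * sizeof(struct item). */
    long sum_indexed(const struct item *a, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += a[i].p;
        return total;
    }

    /* Pointer form: the address is simply incremented each iteration. */
    long sum_pointer(const struct item *a, size_t n) {
        long total = 0;
        for (const struct item *end = a + n; a != end; a++)
            total += a->p;
        return total;
    }

A strength-reducing compiler will usually turn the first into the second, but on older or less mature compilers the difference can show up in the generated code.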
Nice test and comprehensive explanation as always. Love your videos. 😊
This video was just quite misleading and clickbaity. Very disappointing compared to your typical videos.
I've gotta put a dislike on this video for a few reasons:
- The description, which made it out as though arrays were slower in the general case
- He made out as though he was NOT going to be running it on a modern machine - just on the Atari
- The fact that one has to watch the first 22 minutes (thank god for YT x2 speed) before he properly addresses the fact that there are MAJOR differences between the machines
Extra credit earned for using the Atari! And I'm pretty sure the reason why it's faster when run in reverse is because no compare instruction is needed in the low-level loop, only a check for zero. (I see you clarified this at the end, and I'm right. Atari 2600 programmer here...)
28:15 - "That's because I used a slightly different compiler there. I used clang on the iMac rather than gcc on the other 2".
Doesn't that seem rather unfair? How do you know the code compiled with clang wouldn't have been faster than the code compiled with gcc on the other two machines?
If I may, in this case it might have been interesting to also compare the array of structures you had against a structure of arrays (one array for p, another for q), to encourage vectorization of the summing instructions and optimize cache use. Although, since you used the -O3 optimization, is it possible that the q part of the structures was optimized out?
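For what it's worth, a hypothetical sketch of the two layouts (the item struct with p and q fields follows the video's description; the rest of the names are made up):

    #include <stddef.h>

    struct item { long p; long q; };      /* AoS: p and q interleaved in memory */

    struct items_soa {                    /* SoA: each field in its own array */
        long *p;
        long *q;
    };

    long sum_aos(const struct item *a, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += a[i].p;              /* strides over the unused q fields */
        return total;
    }

    long sum_soa(const struct items_soa *s, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += s->p[i];             /* every byte fetched is used */
        return total;
    }

With the SoA layout each cache line holds twice as many useful values, and the contiguous p array is also easier for the compiler to vectorize.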
Also, if you decide to manage the memory for the linked list yourself, you could allocate a block of memory in which the list nodes get stored, with new blocks allocated when space runs out. That would mean fewer cache misses, and it would run a bit faster. (Sketch below.)
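Something like this hypothetical sketch (names and chunk size are made up): nodes are carved out of large malloc'd chunks, so consecutive nodes end up near each other in memory:

    #include <stdlib.h>

    #define CHUNK_NODES 1024

    struct node { long p; long q; struct node *next; };

    struct chunk {
        struct chunk *next_chunk;          /* the chunks themselves form a list */
        struct node nodes[CHUNK_NODES];
    };

    static struct chunk *chunks = NULL;
    static size_t used = CHUNK_NODES;      /* forces a chunk on the first call */

    struct node *node_alloc(void) {
        if (used == CHUNK_NODES) {         /* current chunk full: grab another */
            struct chunk *c = malloc(sizeof *c);
            if (c == NULL) return NULL;
            c->next_chunk = chunks;
            chunks = c;
            used = 0;
        }
        return &chunks->nodes[used++];
    }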
While there is some amusement value to this video, it provides very little information about what is actually going on. We do not know how the nodes of the linked list are ordered in memory, and this is very relevant. The fact that the nodes were all generated in one swoop just before the test suggests they are probably linearly ordered, with each link pointing a fixed distance forward from itself; by contrast, in a linked list that has been used for some time in the ways that linked lists are made for (insertions, deletions, permutations of links), the nodes would tend to be scrambled in memory in a way that is detrimental to locality and therefore to cache performance. We also don't know whether the address arithmetic implicit in finding the address of p[i] is actually performed each time, or whether (quite likely) the compiler has optimised this away by just incrementing the address of the previous array access. Unless we have some idea of what happens at the assembly level, we are just watching a race between contenders we hardly know at all.
No mention made of locality of access, though this is the unique feature that makes the data cache relevant: in the example, no item in memory is ever used more than once (I'm assuming the few local variables are in registers), so the _only_ gain from cache is that it fetches not only the item requested by the CPU, but a whole cache line, so that the next several requests might be cache hits. Caches (and many other things) have been designed specifically to optimise linearly repetitive access to arrays, so it is not so surprising that an array program benefits most from such optimisations. On the other hand, mention of the instruction cache is beside the point: both solutions are small loops that can reside entirely in the instruction cache.
If you really want to compare the intrinsic efficiency of arrays or linked lists as data structures, I would suggest a test that forces the array to be accessed non-linearly, and the list to be non-linearly ordered in memory. For instance, just insist that the items be visited in increasing order by the q-component while using the p-components (maybe doing something other than summation, for which order matters). In the array case, leave everything in place and pre-compute the permutation giving the order of visit, while in the list case (where we cannot do random access) pre-sort the list by increasing q, with a sorting method that only adjusts the links. The net effect would be that the linked list is scrambled in memory in the same way the array accesses are scrambled. Then the playing field is levelled in several respects (neither has much locality, so caching has little influence; array indexing cannot be optimised away), and the tests would be a lot more interesting.
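As a rough sketch of the array side of that idea (hypothetical names; order[] would hold the precomputed permutation from sorting on the q-components, and the body is simplified back to a sum for brevity):

    #include <stddef.h>

    /* Visit the p values through a shuffled index table, so the array side
       gets no linear-access advantage and the indexing can't be optimised away. */
    long sum_permuted(const long *p, const size_t *order, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += p[order[i]];
        return total;
    }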
Yesterday I fell asleep while watching this video, and when the outro popped, I swear, I woke up THE WHOLE HOUSE with the greatest scream humanity has ever heard.