Sorting Algorithms: Speed Is Found In The Minds of People - Andrei Alexandrescu - CppCon 2019

แชร์
ฝัง
  • เผยแพร่เมื่อ 18 ม.ค. 2025

ความคิดเห็น • 89

  • @oleksiimomot4040
    @oleksiimomot4040 5 ปีที่แล้ว +91

    Amazing talk with important conclusions and funny jokes from Andrei. Thank a lot!

  • @movax20h
    @movax20h 4 ปีที่แล้ว +25

    Excellent talk. There is so many unsolved problems in sorting (especially adaptive sorts, parallel and distributed sorting, in place merge-style sorting, etc). People often overlook actual complexity (both theoretical and practical) of sorting. Many years ago I had an interesting sorting requirements, and well, I found out that despite research for 60 years and a lot of progress there is no known optimal solution, nor a well working scheme that works good on average to this day.

  • @fritsgerms3565
    @fritsgerms3565 2 ปีที่แล้ว +19

    Andrei is one of the original pall-bearers of c++. Always enjoy his talks. He is creative & unique in his thinking. I wish cppcon would become smaller and increase its quality like it's golden age in +-2014.

  • @CedricMialaret
    @CedricMialaret 5 ปีที่แล้ว +26

    Must-watch presentation, as always with A.A.

  • @andrewzmorris
    @andrewzmorris 5 ปีที่แล้ว +18

    Fantastic talk. Andrei is such a great speaker, and the content was very interesting. I'm not sure how to really improve our code in the ways Alexei highlights, but it certainly seems to be something that we should be talking about.

    • @boguscoder
      @boguscoder 5 ปีที่แล้ว +8

      *Andrei ;)

  • @dengan699
    @dengan699 4 ปีที่แล้ว +8

    Best CPP speaker
    And I am not even CPP developer!

  • @absurdengineering
    @absurdengineering 5 ปีที่แล้ว +71

    I have a problem with this talk: it presumes a lot.
    It presumes branch prediction, caches of certain depth, speculative execution, and who knows what else. Sure: modern desktop CPUs have it, as do the large ARMs. Now take this code and start moving it backwards in history. Say, to Raspberry Pi A. To a P-II. To a 486. To a 386. To AVR. Heck, rewrite it in Turbo Pascal for CP/M. I’ve tried: it all falls apart as you pass ARMv7. And the simpler the architecture, the worse the regression.
    The standard library is supposed to be as universal as possible. It may run on varied architectures. Yet here we micro-optimize in ways that make things much worse as soon as you’re outside the realm of the most modern CPU. These improvements make the sort worse than 2x as slow on a 486, for example. Sure, it’s not exactly a common target. But Arduino supports C++20, and if we try this trick there, we get same performance regressions. So, before anyone sticks this code into any standard libraries, they better examine the supported platforms and make it optional, and enabled only on modern architectures.
    I also have a problem with sweeping statements like “they don’t teach it in books”. CS introductory texts all presume an uncached in-order architecture where every memory access and comparison has equal cost. We all know this, even if the authors fail miserably at putting this assumption in big bold letters at the start of every chapter. But the standard analysis of algorithms doesn’t somehow fail here: as soon as you make the costs take into account caching and branch prediction, you get somewhat more complex equations, but you recover the approximation of computed costs to costs of running on real hardware. So, given that we all understand what assumptions the classic CS texts make, it’s disingenuous to say “they don’t teach it in the books”. They don’t because even with the simplified logical architecture the texts presume, it’s still hard enough. And there are plenty of theoretical algorithm treatments that do in fact take architectural details into account.
    Also, it’s not as if the CS classics are completely detached from modernity. There’s this thing that was all the rage back when cores were too small to fit all the business data. Remember tape sorting? It takes excellent advantage of memory caching and prefetch, and you start seeing these advantages as soon as you “upgrade” from 386 to 486.
    Also, this general admonition against branches but in favor of data-integrated condition tests: all is fine and dandy if you don’t introduce data dependencies. Not a thing back on a 486, but these days short-acting data dependencies deparallelize the CPU, and the worse they get the more the billion transistor CPU starts to work like fast but “dumb” 486, with 1000x less transistors…
    This talk is a wonderful show and it is full of technical insights, but this is high-grade engineering entertainment, not an engineering project report. Don’t try to apply these things without having a clear understanding of what’s involved but unsaid, and for crying out loud: repeat all the benchmarks in your target application, with your target data.

    • @Sogartar
      @Sogartar 3 ปีที่แล้ว +4

      I think his point was exactly that. If you can introspect much of the relevant things like the type of data, hardware, etc. you can compile an algorithm more adapted to the task.

    • @KafeinBE
      @KafeinBE ปีที่แล้ว

      Of course the talk should probably have mentioned this caveat, but I don't agree the standard library should optimize for any CPUs other than the most modern. Because the most modern are going to be the fastest, and that's naturally where speed and efficiency matters the most.
      Additionally, caches, pipelining and speculative execution have been a feature of architecture design for a while now, and a pretty pervasive. Chances are, if you're not coding baremetal then you have to take these things into account. Sure, a very small IoT device won't have the best sort, but chances are the programmer ought to roll their own sort anyway.

    • @godDIEmanLIVE
      @godDIEmanLIVE 2 หลายเดือนก่อน

      That ties in directly with Mike Acton's talk. Taking into account the hardware and the data is part of the problem solving and not something you just ignore and abstract away. So what you're saying is exactly Andrei's point: hardware matters and data matters so you tune your "generic" algos to them. The presumtions are exactly the point and it is how you write optimal code for your hardware and data.

  • @saaah707
    @saaah707 7 หลายเดือนก่อน +1

    56:00 IIRC, reverse-sorted array is worst case for insertion sort. Why are the performances better than the sorted case (55:40)?

  • @evikone
    @evikone 5 หลายเดือนก่อน

    I have Andrei's Book on D language--an excellent and insightful work.

  • @oscarasterkrans9301
    @oscarasterkrans9301 4 ปีที่แล้ว +3

    Fantastic talk as always from Andrei.

  • @JordiMontesSanabria
    @JordiMontesSanabria 4 ปีที่แล้ว +21

    25:50 "I am almost sorry there is no rotation there"
    LOL

    • @MalcolmParsons
      @MalcolmParsons 4 ปีที่แล้ว +1

      There's a rotate() in insertion_sort().

  • @greenfloatingtoad
    @greenfloatingtoad 4 ปีที่แล้ว +5

    selecting an algorithm though specifics about types I believe is compatible/part of what Stepanov envisioned. The difference between a Concept and an Algebraic Object is the operations of a concept come with a cost

  • @MichaelPohoreski
    @MichaelPohoreski 5 ปีที่แล้ว +29

    _Design by Introspection_ is a sub-set of **Data Orientated Design.**
    There are 3 types of optimizations (from slowest to fastest):
    1. Bit Twiddling
    2. Algorithmic
    3. Data orientated (cache usage and branch prediction)
    The FIRST rule for optimization is:
    **KNOW THY DATA**

    • @kuhluhOG
      @kuhluhOG 5 ปีที่แล้ว +1

      @Ziggi Mon data-orientated design means that you write your algorithms for your data, and don't abstract your data to something general but stay at the specific
      it ends at: use different algorithms for different data, don't generalize (at least how I understand it, I don't use it though)
      the first time I saw something about this topic was about by a (professional) video game developer who also said: "If we would have the time, we would write everything in Assembler, but we don't."

    • @sethsora7963
      @sethsora7963 5 ปีที่แล้ว

      Glad to know that someone else recognized this too.

    • @LEGOOOOOOOOOOOS
      @LEGOOOOOOOOOOOS 2 ปีที่แล้ว

      Is there a book or talk on data orientated design?

  • @chidude16
    @chidude16 2 ปีที่แล้ว +1

    that was fantastic! brilliant. add that with Carbon and we're rocking....

  • @6754bettkitty
    @6754bettkitty 3 ปีที่แล้ว +5

    42:20 That goto comment is one of the funniest things I have ever seen in programming!

  • @ruroruro
    @ruroruro 5 ปีที่แล้ว +15

    I think, it may be interesting to try use a different intermediate structure, than the heap. If I understood it correctly, the key reason, why constructing a heap improves the performance is that a heap is in some sense already partially sorted and so, the linear searches end up being shorter or more uniform, improving both locality and maybe even branch prediction.
    I think, that there may exist some data structure, that would be simpler to construct than the heap (maybe in terms of big-O or maybe in terms of performance), but would have the same or similar effects for the following linear insertion sort.
    After all, as Alexandrescu himself said "the heap is a delicate fine-tuned structure", so maybe constructing a heap is actually overkill. After all, we are not actually using all the invariants the heap provides. So maybe we can get away with using some structure with weaker invariants.

    • @michaelryabko885
      @michaelryabko885 5 ปีที่แล้ว +2

      It would be interesting to see performance with different types of heaps, even. Here A.A. only tried with the basic 2 child heap(which makes sense as it is a lot easier to reason with), but using something like a fibonnaci heap might have interesting results. Although this reasoning may go harder in the direction of strong invariants and may lead to worse performance overall.

    • @arirahikkala
      @arirahikkala 4 ปีที่แล้ว +3

      @@michaelryabko885 Keep in mind that these heaps are still all under a size threshold that went up to, what, 255 elements at most. And that your heap has to live in the array that's getting sorted.

    • @yash1152
      @yash1152 2 ปีที่แล้ว

      rrlated question at 1:25:53

  • @OMGclueless
    @OMGclueless 4 ปีที่แล้ว +9

    How on earth do you write unguarded_insertion_sort for a generic type? It's one thing for std::sort to return elements in an arbitrary order if a programmer overlooks that their less-than predicate doesn't define a total order. It's totally another for the insertion operation to buffer underflow and start writing to arbitrary locations.
    Actually, I don't even know that you can do it for doubles. At least, his code has a bug: at 29:22, if the element at *begin is NaN, then make_heap will leave it there because greater() is false for all comparisons with that element. Then unguarded_insertion_sort will underflow below that element because all less() comparisons will be false as well. IEEE764 floating point doesn't define a total order so you can't use the default comparison operators in the way Andrei does.

    • @mxBug
      @mxBug 3 ปีที่แล้ว

      OK. so let the consumer of the sort function supply a comparison with a total order.
      if you are concerned about how the compiler can ensure the correctness of the comparison, perhaps dependent typing is your thing?

    • @Kromaatikse
      @Kromaatikse 2 ปีที่แล้ว +3

      First you compare the element being inserted with the element at the very beginning of the array. If it goes before that, you insert it there and you're done. If it doesn't, then the unguarded insertion run is guaranteed to terminate at or before the first element, *regardless* of whether the elements are totally-ordered or not.

  • @bluecircle56
    @bluecircle56 4 ปีที่แล้ว +3

    Andrei Alexandrescu talk, wheee!...

  • @srh80
    @srh80 2 ปีที่แล้ว

    Guy from Somebody feed Phil, moonlights as a C++ guru. Amazing talk.

  • @maximilian19931
    @maximilian19931 ปีที่แล้ว

    ASM instructions have cycle costs, mostly defined in CPU Documentation booklet(mostly a table of how many cycles it takes), would be useful at compile time to produce code that uses the smallest instruction block to get the job done.
    The Architecture definition for compile target should include the ASM instruction cycle cost and make a decision to choose the least costly route available.

  • @stephenhowe4107
    @stephenhowe4107 5 ปีที่แล้ว +7

    Slide at 26:56 is wrong.
    It should be unguarded_insertion_sort(begin +1, end);
    Because the 2nd smallest element can be at either begin +1 or begin +2 and it is still a valid heap.

    • @MikhailMatrosov
      @MikhailMatrosov 5 ปีที่แล้ว +2

      Exactly, but he insists that should be +3. Maybe we are missing something?..

    • @jeffdovalovsky7259
      @jeffdovalovsky7259 5 ปีที่แล้ว +6

      The slides in the linked github acknowledge that begin+3 was wrong, and correctly use begin+2.
      Stepanov's unguarded_insertion_sort(first,last) accepts a range of unsorted elements, which it will insert as needed into a preceding pre-sorted sub-sequence. begin and begin+1 are already sorted, begin+2 is not, so it can be unguarded_insertion_sort(begin+2,end)

    •  5 ปีที่แล้ว +3

      @@MikhailMatrosov , Stephen Howe - in the comments to this talk at Reddit (link in the description of the video) Alexei says that it should be +2, not +3 (because the two first elements are already sorted). The +3 was mistake in slides, and he has +2 in the actual code.

    • @stephenhowe4107
      @stephenhowe4107 4 ปีที่แล้ว +2

      @ : No that is not a correct explanation. Andrei explained to me, Your right about the +2.
      The first element is sorted.
      The second and third elements can be out of order, because that is still a valid heap.
      You can have 1, 7, 4 and that is a valid heap as the root is less than 2nd and 3rd element its children.
      The fact that the 2nd and 3rd element are out of order is neither here and there, it is a valid heap.
      But it does not matter because this is unguarded insertion sort not insertion sort.
      So the 2nd and 3rd element get corrected. That is Andrei has +2 (works due to unguarded version) rather than +3 (does not work).
      In correspondence with Andrei, I was insisting on +1 because of this, but he pointed out this is unguarded insertion sort not insertion sort and therefore +2 is the minimum value which it is guaranted to work.

  • @sakuraikamoto3699
    @sakuraikamoto3699 5 ปีที่แล้ว +3

    Thank you sir is the best talk ini learning education from engineer 👍👍

  • @03Prashanth
    @03Prashanth 3 ปีที่แล้ว +7

    22:50 That comment on the undergrads in india was hilarious. Quite true but still... hahahaha

  • @CT7ALW
    @CT7ALW 4 ปีที่แล้ว +6

    All that "boolean arithmetic instead of ifs" goes a bit against the "you're not smarter than the compiler", doesn't it? It's almost like that division example where you try to be smart with bit shifts and the compiler puts the division back...

    • @Theonewhowantsaname
      @Theonewhowantsaname 3 ปีที่แล้ว +1

      That’s the first question he answers, at 1:17:25

  • @aretha6360
    @aretha6360 4 ปีที่แล้ว +3

    I thought infinite loops were a bad thing in C/C++ since the compiler has no easy way of unrolling the loop because it doesn't know when it stops and it would interfere with the compiler's vectorisation ability? Also, in my code if I want an infinite loop I tend to write while(true), but he uses for(;;), is there a difference?

    • @lewistherin4096
      @lewistherin4096 4 ปีที่แล้ว +1

      I guess there is no difference in for and while, but perhaps check godbolt.org/ and test it yourself with the compiler you are using

    • @darkengine5931
      @darkengine5931 4 ปีที่แล้ว +2

      I'm curious about the first question as well. If the loop unrolling is applied at the IR level, it may make no difference as far as how we write them. But I'm not sure how optimizers typically approach loop unrolling optimizations. For the second question, 'for(;;)' tends to be what's accepted as the standard way to write infinite loops in C++. If you use 'while (true)', some compilers may complain about branching with a compile-time constant for its expression.

    • @6754bettkitty
      @6754bettkitty 3 ปีที่แล้ว

      Cpp insights might show a difference, if there is one...

    • @kylek.3689
      @kylek.3689 2 ปีที่แล้ว

      Vectorization is undesirable for sorting an arbitrarily sized list of elements, since you can't predict how many elements you will need to sort ahead of time.

  • @Mage_Chartreux
    @Mage_Chartreux 2 ปีที่แล้ว +1

    Was this tested on many different computers? I just watched an excellent talk on how memory layout during runtime can account for plus or minus 40% of the variation in the time it takes to run something. I'm wondering if a lot of this isn't just because the data is getting restructured.

    • @yash1152
      @yash1152 2 ปีที่แล้ว

      1:24:35 > _"different intel machines"_

    • @NXTangl
      @NXTangl ปีที่แล้ว

      Anything that lays out doubles differently is going to be pretty damn exotic.

  • @Sebanisu
    @Sebanisu 4 ปีที่แล้ว +3

    I wonder if GNU has fixed the issues he found.

  • @KafeinBE
    @KafeinBE ปีที่แล้ว

    What happens if you repeat the silly idea? Do make_heap twice, once on the entire array as shown, then once again with the array minus the three first elements?

  • @peceed
    @peceed 2 ปีที่แล้ว

    For every set of parametrical disorders there exist asymptotically optimal algorithm for all elements, unfortunately proof interleaves Turing machines so we can expect linear (according to set size) slowdown. What is surprising there exist algorithms that are optimal and practically quick for many different disorders.

  • @Dante3085
    @Dante3085 5 ปีที่แล้ว +6

    1:16:58 Thank you

  • @joseville
    @joseville 3 ปีที่แล้ว +3

    36:12 I went the other way. I thought you'd add an arbitrarily large element (larger than any element in the list) to the list before heapify and then pop after sort.
    39:45 Infinite loops FTW

    • @tyronewilson6463
      @tyronewilson6463 ปีที่แล้ว

      The arbitrary large element idea is probably hard with the use of function pointers to define the comparisons

  • @hanyanglee9018
    @hanyanglee9018 2 ปีที่แล้ว

    17:55 15 seconds ago I came up with the idea of branch prection. When he asked the question I answer to my screen "Catching!". I'm definitely in poverty.

  • @colinmaharaj
    @colinmaharaj 3 ปีที่แล้ว +2

    I need to start traveling to meet these giants

  • @antonpetrenko9176
    @antonpetrenko9176 4 ปีที่แล้ว +3

    fire man and great talk!

  • @monad_tcp
    @monad_tcp 3 ปีที่แล้ว

    48:57 I guess there's a way to improve more by exploiting addressing modes

  • @yash1152
    @yash1152 2 ปีที่แล้ว

    16:35 why the binary search insertion sort is bad

  • @rkalle66
    @rkalle66 2 ปีที่แล้ว

    The performance order of doing silly things is O(n²). Am I right? It's like gambling with hindsight 20/20. It reminds me in Germany talking about soccer and somebody saying "The game has 90 minutes or till the referee whistles the end". Something you have to put a 5-Mark coin into the piggy bank.

    • @andralex
      @andralex 2 ปีที่แล้ว

      It is indeed quadratic, but not a gamble. Note that the quadratic algorithm is always invoked on small to medium input sizes (for larger inputs you'd use quicksort).

  • @yash1152
    @yash1152 2 ปีที่แล้ว

    1:04:35 whats hot/cold code??

    • @saaah707
      @saaah707 7 หลายเดือนก่อน

      hot is code that is executed many times, cold is code that is not executed often
      i believe the goal is to minimize thrashing of instruction and data cache

  • @wessmall7957
    @wessmall7957 5 ปีที่แล้ว

    This is earth shattering

  • @ThanhNguyen-rz4tf
    @ThanhNguyen-rz4tf 3 ปีที่แล้ว +3

    Note for myself(You should ignore it):
    15:22 -> 16:43Compare linear-optimistic insertion sort vs binary insertion sort. => Worse O(n) (linear) doesn't mean worse runtime.
    22:00 Try to insert condition into artimetic

  • @greenfloatingtoad
    @greenfloatingtoad 4 ปีที่แล้ว +1

    Until we have introspection, I think we can fake it with type_traits style structs of constants

  • @HtS643KyS6555GxQ3edA
    @HtS643KyS6555GxQ3edA 3 ปีที่แล้ว +1

    I wonder what the benchmarks would be on Apple m1 chip?

    • @yash1152
      @yash1152 2 ปีที่แล้ว

      > _"am wondering about benchmarks on apple m1 chip"_
      did u check?

  • @Attlanttizz
    @Attlanttizz 5 ปีที่แล้ว +9

    33:38 The underscore bonus... 🤣

  • @Dante3085
    @Dante3085 5 ปีที่แล้ว +1

    Yeah, this guy is fun :D

  • @NXTangl
    @NXTangl ปีที่แล้ว

    PDQsort would have handled random01.

  • @JorgeLuis-ts6qp
    @JorgeLuis-ts6qp 2 ปีที่แล้ว +1

    Il leave 1:03:00 for me in the future.

  • @tomcheng2437
    @tomcheng2437 5 ปีที่แล้ว

    I thought quicksort is not stable...

    • @VioletGiraffe
      @VioletGiraffe 5 ปีที่แล้ว

      Good point, there's std::sort and std::stable_sort.

    • @greenfloatingtoad
      @greenfloatingtoad 4 ปีที่แล้ว +8

      Idempotency is weaker than stability. If you put in perfectly sorted data, an idempotent sort will leave it alone. But if it's even a little bit unsorted, it might change the order equivalent elements show up in (stability)

  • @akshitjangra2471
    @akshitjangra2471 ปีที่แล้ว

    What sorting algorithm was the speaker trying to improve?

  • @LaurentLaborde
    @LaurentLaborde 3 ปีที่แล้ว +1

    from 1:09:34 I keep shouting "A DATABASE !!" every 20s but he's not answering :(

    • @LaurentLaborde
      @LaurentLaborde 3 ปีที่แล้ว

      a database is doing A LOT of work it order to be as fast as possible when needed. Which is 50% of his talk

  • @qm3ster
    @qm3ster 2 ปีที่แล้ว +1

    Nah, generic programming should support this.
    Don't just statically introspect the size of the element.
    Compile the moving and comparison code just to analyze it, then use the results of that in the body of the sort implementation :v

  • @max0x7ba
    @max0x7ba 5 ปีที่แล้ว +2

    That statement that Radix Sort only works for integers is not accurate. See Malte Skarupke brilliant talk "Sorting in less than O(n log n): Generalizing and optimizing radix sort": th-cam.com/video/zqs87a_7zxw/w-d-xo.html

  • @childhood1888
    @childhood1888 2 ปีที่แล้ว

    46:11

  • @monad_tcp
    @monad_tcp 3 ปีที่แล้ว

    1:20:27 concepts don't work, the only solution for that is tagging types, and dependently typing, you have to tag directly the AST and attribute types in it, and them forward them, you can't do that in C++, because you can't introspect non-const-expr code, so concepts are useless (to solve that problem), as is SIFINAE, and template metaprogramming. So in a way, Generics Programming (in C++) is why we can't have nice things, you could actually do this optimization in C# thou, because the JIT compiler has that information !
    Or you could manually tag your UDTs as he said, but that's not-optimal, it does work, but we can do better, its possible, perhaps C++ is why we can't have nice things, but we can have better languages.

  • @mishatsaritsyn4843
    @mishatsaritsyn4843 6 หลายเดือนก่อน

    So this is how generic programming dies... with thunderous applause