Fastware - Andrei Alexandrescu

แชร์
ฝัง
  • เผยแพร่เมื่อ 26 ส.ค. 2024

ความคิดเห็น • 29

  • @dedvzer
    @dedvzer 2 ปีที่แล้ว +3

    I like how he expands most of these topics into complete talks for the next 10 years (this is 2008). I've seen probably most of them, but they deserve some time. And here he is, ready to do it all in one hour.

    • @simonfarre4907
      @simonfarre4907 11 หลายเดือนก่อน +3

      The talk was not from 2008. It was from 2017.

  • @TheOnlyAndreySotnikov
    @TheOnlyAndreySotnikov 3 ปีที่แล้ว +11

    I tried his version of improved atoi. It indeed is faster than the standard library atoi. However, manually implemented atoi with "bad" dependency "result = result * 10 + (*b - '0')" showed the same performance as the "improved" version . IMHO it makes sense, since "result += ..." presents the same dependency on the previous result as "result = result * 10 + (*b - '0')".
    In clang the results are even more stunning. His "bad" version of atoi is in fact *2 times faster* than the trick with precalculated powers of 10, because when clang sees "result = result * 10 + (*b - '0')", it uses two LEA instruction instead of multiply: leaq (%rcx,%rcx,4), %rcx, leaq (%rsi,%rcx,2), %rcx. The first one performs multiplication by 5, the second combines multiply by 2 and addition.

    • @tedrivera8051
      @tedrivera8051 2 ปีที่แล้ว +5

      He did mention that some of the advice might "not age very well with improvements in compiler and CPU technology" (3:23). Clang has done a lot in that space, but it's good to know which advice hasn't aged well. Thanks for that!

    • @MrAbrazildo
      @MrAbrazildo 8 หลายเดือนก่อน

      54:34, I recently discovered that unrolling a loop of 'sscanf's to (I don't remember well) 50 numbers made it more than 3x faster, for both Clang and GCC. I thought they do that by default in -O2/3.

  • @fritzschnitzmueller3768
    @fritzschnitzmueller3768 3 ปีที่แล้ว +2

    I love how he lets the audience choose the topic of the talk on the fly :D Who else does that :D

  • @Voltra_
    @Voltra_ 2 ปีที่แล้ว +1

    I honestly didn't know how to implement strtr. So as a challenge when listening to this, I wrote it as an infinite loop. It's much easier to think about and I think I got a pretty optimized linear version

  • @MrAbrazildo
    @MrAbrazildo 8 หลายเดือนก่อน

    50:12, I would expect isdigit to be faster, because it would only need to check some bits.

  • @victornoagbodji
    @victornoagbodji 7 ปีที่แล้ว +2

    amazing talk!

  • @xvoidee
    @xvoidee 4 ปีที่แล้ว +2

    Now I have big doubts regarding my optimization skills :-)

  • @ace100hyper3
    @ace100hyper3 2 ปีที่แล้ว

    I like Andrei, but I have to disagree that optimal way to implement StrStr is by starting with an infinite loop. I just implemented the algorithm in C# and I used the loop condition regularly. I would really like to see Andrei's implementation on this, because it may be that he is doing extra comparisons in the loop body.

  • @chriswatch582
    @chriswatch582 6 ปีที่แล้ว +5

    On Visual Studio 2015 using Microsoft 7 Professional with 2 Cores, I ran a test version of the non-parallel, parallel computation and unrolling, but the non-parallel routine was always faster. I even tried putting the large array and string length computation in the routine but it was still slower than the non-parallel computation. Relying on parallelism was slightly slower than unrolling in the last 2 tests.
    I developed 2 programs. I included the first program below and the output from the 2nd program!
    Output from 1st program:
    TEST 1: Convert string to number: 1234.
    Not Use Parallelism Times: min = 0 max = 5307 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Parallelism Times: min = 0 max = 6310 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Unrolling Times: min = 0 max = 6621 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    TEST 2: Convert string to number: 12345678.
    Not Use Parallelism Times: min = 1 max = 9573 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Parallelism Times: min = 1 max = 10227 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Unrolling Times: min = 1 max = 9799 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    TEST 3: Convert string to number: 1234567890123456.
    Not Use Parallelism Times: min = 1 max = 12625 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Parallelism Times: min = 2 max = 17928 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Unrolling Times: min = 2 max = 17227 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    ------------------------------------------------------------------------------------------------------------------------------
    I quickly modified the first version of the below program. In this 2nd program for the unrolling test, I replaced the "for" loop with an "if" statement. The "if" statement checked if there were "4", "8", or "16" digits in the test string. Then for each active option, I unrolled 4 times, 8 times, or 16 times. The resulting output from the program was that the routine based on unparllelism was always fastest. The second fastest was always unrolling. The routine based on parallelism was always the slowest routine.
    ** Basically, all the interesting things Andrei said about parallelism (and unrolling) does not work with the program that I developed on Visual Studio 2015. I believe all those fine and eloquent things should work elsewhere though!
    ------------------------------------------------------------------------------------------------------------------------------
    Output from 2nd program:
    TEST 1: Convert string to number: 1234.
    Not Use Parallelism Times: min = 0 max = 3520 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Parallelism Times: min = 0 max = 5156 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Unrolling Times: min = 0 max = 4933 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    TEST 2: Convert string to number: 12345678.
    Not Use Parallelism Times: min = 0 max = 5161 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Parallelism Times: min = 1 max = 8520 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Unrolling Times: min = 1 max = 8295 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    TEST 3: Convert string to number: 1234567890123456.
    Not Use Parallelism Times: min = 1 max = 9065 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Parallelism Times: min = 1 max = 15273 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Unrolling Times: min = 2 max = 15128 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    ------------------------------------------------------------------------------------------------------------------------------
    #include "stdafx.h" //USED BY VISUAL STUDIO 2015
    #include
    #include
    using namespace std;
    // 2 ROUTINES THAT CREATE AN INTEGER FROM A STRING REPRESENTATION OF THAT NUMBER.
    //THE SLOWEST IS SUPPOSE TO RELY ON NON-PARALLELISM.
    //ROUTINE FROM TALK: slide 59
    unsigned long long atoui_notuse_parallelism(const char* b, const char* e)
    {
    unsigned long long result = 0;
    for (; *b != *e; ++b)
    {
    result = result * 10 + (*b - '0');
    }
    return result;
    }
    static const long long pow10[21] = {
    1UL,
    10UL,
    100UL,
    1000UL,
    10000UL,
    100000UL,
    1000000UL,
    10000000UL,
    100000000UL,
    1000000000UL,
    10000000000UL,
    100000000000UL,
    1000000000000UL,
    10000000000000UL,
    100000000000000UL,
    1000000000000000UL,
    10000000000000000UL,
    100000000000000000UL,
    1000000000000000000UL,
    10000000000000000000UL
    };
    //IDEA FOR ROUTINE CAME FROM :slide 63
    unsigned long long atoui_use_parallelism(const char* b, const char* e, short i)
    {
    unsigned long long result = 0;
    for (; *b != *e; ++b)
    {
    result += pow10[i--] * (*b - '0');
    }
    return result;
    }
    //IDEA FOR ROUTINE CAME FROM UNROLLING SLIDE: slide
    unsigned long long atoui_use_unrolling (const char* b, const char* e, short i)
    {
    unsigned long long result = 0;
    for (; *b != *e;)
    {
    result += pow10[i--] * (*b - '0'); b++;
    result += pow10[i--] * (*b - '0'); b++;
    result += pow10[i--] * (*b - '0'); b++;
    result += pow10[i--] * (*b - '0'); b++;
    }
    return result;
    }
    int main(int argc, int** argv[])
    {
    int elapsed_Time;
    int max, min;
    int max_num_iterations, min_num_iterations;
    long long result;
    std::clock_t start; //Holds starting time
    __int64 test_Number;
    char test[30]; //3 TEST VALUES
    char test2[9] = "1234.";
    char test3[11] = "12345678.";
    char test4[18] = "1234567890123456.";
    const char* b;
    const char* e = ".";
    int strlength;
    for (auto j = 1; j

    • @nivo6379
      @nivo6379 3 ปีที่แล้ว +1

      For benchmarking, you should probably use Clang or GCC.

    • @NoNameAtAll2
      @NoNameAtAll2 3 ปีที่แล้ว

      You use std::clock, but you need to know that it isn't representing real time, but cpu-usage time
      Use high_resolution_clock or steady_clock

  • @abc3631
    @abc3631 4 ปีที่แล้ว +5

    What a deadpan audience

  • @yuehuajia3176
    @yuehuajia3176 6 ปีที่แล้ว +2

    Great Talk. I learned a lot from you.

  • @leipzig3
    @leipzig3 3 ปีที่แล้ว

    Certain cicadas come out every 17 years. If you need to know if this year is a cicada year, return (year %17 + offset)

  • @eduardomadrid2380
    @eduardomadrid2380 7 ปีที่แล้ว +5

    Another good presentation by Alexandrescu, but watch out!, "digits10" is actually a pessimization, as benchmarked in my x86-64 computer, here:
    github.com/thecppzoo/inprogress/blob/master/src/main.cpp
    The reason is that it has too many unpredictable microbranches for reasonable inputs, this is "sloware". Eventually I will publish a more detailed analysis, for the time being, you may use my implementation of digits10 which is almost twice as fast for sequential inputs and about five times faster for random inputs

    • @FoxDr
      @FoxDr 3 ปีที่แล้ว +1

      Thanks for the bench. The whole hour, I've been thinking "why not just use `ceil(log10(abs(a)))`?".
      Sure, float ops up the wazoo, might not be worth it for small numbers, but no looping required at all. I'll add it to your benches when I find some time :)

  • @bool2max
    @bool2max ปีที่แล้ว

    why does he call integers "integrals"

  • @FEE1DEAD
    @FEE1DEAD 3 ปีที่แล้ว +1

    Interesting talk. But that audience....

  • @typon1
    @typon1 5 ปีที่แล้ว +13

    Did they find a bunch of zombies to attend this talk or are the English completely humor less?

    • @Vitorian
      @Vitorian 4 ปีที่แล้ว +1

      Aren't those the same?

    • @GeorgeTsiros
      @GeorgeTsiros 3 ปีที่แล้ว +2

      he didn't do _the hands_
      on a more serious note, sometimes i watch andrei for serious... sometimes not so much (example, the allocators video, that one was a kneeslapper)

  • @GeorgeTsiros
    @GeorgeTsiros 5 ปีที่แล้ว

    37:15 wtf we have them on our WRISTS these days!

  • @TheDvrx
    @TheDvrx 3 ปีที่แล้ว

    So pipelined cpu's are secretly vector processors.

  • @gbizzotto
    @gbizzotto 5 ปีที่แล้ว

    I bet this abstract was writen by andrei

  • @GeorgeTsiros
    @GeorgeTsiros 5 ปีที่แล้ว

    42:44 FORTH! RPL! PS!