CppCon 2016: Tim Haines “Improving Performance Through Compiler Switches..."

  • Published Dec 3, 2024

Comments • 22

  • @juzujuzu4555 · 6 years ago +6

    I have been wondering what kind of information we could provide within our source code, for example through annotations, that could help the compiler optimize better or enable completely new optimizations. We could add new annotations as we find more information that is useful to compilers. What does the developer know about how his code should work that the compiler could use?
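Compilers already accept some annotations of exactly this kind. A minimal sketch of two of them, GCC/Clang's `__restrict` and C++20's `[[unlikely]]`; the function names `scale` and `classify` are hypothetical illustrations, not from the talk:

```cpp
#include <cstddef>

// __restrict promises that dst and src never alias, which lets the
// compiler vectorize the loop without emitting runtime overlap checks.
void scale(float* __restrict dst, const float* __restrict src,
           std::size_t n, float a) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a * src[i];
}

// C++20 [[unlikely]] tells the compiler which branch is cold, so it
// can lay out the hot path as the straight-line fall-through.
int classify(int x) {
    if (x == 0) [[unlikely]]
        return -1;
    return x * 2;
}
```

Both are promises from the developer that the compiler cannot prove on its own, which is the kind of information the question is asking about.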

  • @peppybocan · 8 years ago +7

    Have you tried to test Array-of-Structures vs. Structure-of-Arrays? How much does it matter to you as HPC-developer?

    • @timhaines3877 · 8 years ago +13

      It matters enormously. I work primarily with N-body simulations where each "particle" can occupy between ~60-200 bytes depending on which simulation package and physics I am using. They all use the AoS format which means each particle can occupy between 1 and 4 cache lines, depending on which variables in the particle I need. This is terrible for performance, but people who write these simulations just don't stop and think about how much cache space is wasted on unused data with the AoS format. There have been some experimental simulation packages written that address this, but they are not in widespread use, sadly. I can send you some links to papers, if you are interested in implementation details.
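To make the cache-line arithmetic above concrete, here is a minimal sketch (with my own field names, not those of any simulation package mentioned) of the same particle in AoS and SoA form. A mass-only pass over the AoS version drags every other field of each particle through the cache; the SoA version streams through exactly the bytes it uses:

```cpp
#include <vector>

// AoS: one struct per particle. Summing just the masses still pulls
// positions, velocities, and potentials into the cache.
struct ParticleAoS {
    double x, y, z;      // position
    double vx, vy, vz;   // velocity
    double mass;
    double potential;
};

// SoA: one contiguous array per field.
struct ParticlesSoA {
    std::vector<double> x, y, z;
    std::vector<double> vx, vy, vz;
    std::vector<double> mass;
    std::vector<double> potential;
};

double total_mass_aos(const std::vector<ParticleAoS>& ps) {
    double m = 0.0;
    for (const auto& p : ps) m += p.mass;  // 8 useful bytes per 64-byte struct
    return m;
}

double total_mass_soa(const ParticlesSoA& ps) {
    double m = 0.0;
    for (double mi : ps.mass) m += mi;     // every loaded byte is used
    return m;
}
```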

    • @stephenborntrager6542 · 8 years ago +5

      I feel I need to warn people to be careful with SoA designs, because I have seen (and made) some really terrible SoA attempts.
      I second the point that the difference is huge though.
      Unfortunately, it's sometimes hugely bad. I once tried to optimize a particle system to make use of a SoA, and it ended up being worse than just allocating individual elements with new. (Not as part of an array!)
      Then, I tried moving the data elements around into hot and cold structs allocated into arrays. I got a marginal improvement over using new, but still not as good as a plain array of structs. I tried re-arranging the elements again, and got a slight improvement that time. The hot-cold paths were not at all what I had originally expected, and needed several iterations of tweaking before it actually made a substantial improvement.
      So it absolutely matters, but it's also not always intuitive. Keep your profiler close, and your assumptions far...
      @Tim: Any way I could see those papers? I struggle with these kinds of optimizations to data. I would gladly welcome any resources that might help demystify the 'right' way to do this kind of thing.
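A minimal sketch of the hot/cold split described above, with hypothetical fields; the point is that the per-frame loop only streams through the data it actually touches, while rarely-read metadata lives in a parallel cold array:

```cpp
#include <vector>

// Fields the update loop reads and writes every frame.
struct ParticleHot {
    float x, y;    // position
    float vx, vy;  // velocity
};

// Fields read only occasionally (e.g. when a particle dies).
struct ParticleCold {
    float spawn_time;
    int   emitter_id;
};

struct ParticleSystem {
    // Index i in both arrays refers to the same particle.
    std::vector<ParticleHot>  hot;
    std::vector<ParticleCold> cold;

    void step(float dt) {
        for (auto& p : hot) {  // tight loop never touches cold data
            p.x += p.vx * dt;
            p.y += p.vy * dt;
        }
    }
};
```

As the comment warns, which fields are actually "hot" is an empirical question: profile before and after each rearrangement.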

    • @0xCAFEF00D · 7 years ago

      Really nice talk. Thanks.

    • @seditt5146 · 5 years ago +4

      I saw a speedup on a simple sim I created to give a jello-like surface to a 3D sphere. The SoA setup increased my performance many times over, something like 10-30x faster. My jaw hit the floor, even though I understand why it is better and why I did it. I HATE SoA programming: I just hate managing the code; it looks like trash and feels like a mess, but man... that performance, though.

  • @LemonChieff · 4 years ago +4

    I'm gonna have to tape my jaw back to my head cause it's sitting on the floor right now. 🤯

  • @braindeadbonobo · 8 years ago +3

    At 42:28 and 44:28, it's not loading the constant from the stack.

    • @bernardgingold4386 · 6 years ago

      It's loading constants from the .text section, i.e. [rip+0x850].

  • @nicolasgascon1676 · 8 years ago +3

    You talk about -O2, -O3, and other flags, but what does -Ofast do?

    • @szaszm_ · 7 years ago +1

      On gcc it enables everything -O3 does, plus -ffast-math.

    • @movax20h · 4 years ago +3

      In general, you should not use -Ofast unless you really know what you are doing. -Ofast is not really applicable in scientific computing, because it is a great risk in terms of losing FP correctness. It is OK to use in some non-critical situations, but I would never use it in production code.

  • @vsoch · 9 months ago

    "If you live in a 32 bit world... I'm sorry."
    lol!

  • @dat_21 · 5 years ago +1

    At 13:04: ILP and OoO can't affect the result because they respect dependencies.

  • @MrAbrazildo · 8 years ago +1

    -O2 faster?! o-0 Wow!

  • @morthim · 8 years ago +5

    Label your axes. You talk about turning on flags, but you talk about data without metrics. What is 1.00 supposed to mean? What is 0.80 supposed to mean? What is 1.15 supposed to mean?
    In a previous slide you say "relative run times". Relative to what? None of the points land on 1.00.
    Imagine someone tells you a bookcase is seven: seven shelves, part of a seven-piece set, seven units of height, seven units of width, seven units of weight, seventh most expensive, seventh best selling, seven years old, etc. They just say it is seven.

    • @bluespeck · 8 years ago

      Aren't the values relative to running without any flags? For example, 0.85 for clang -O2+arch on slide 40 would be 0.85 of the clang time without any flags.

    • @timhaines3877 · 8 years ago +4

      Indeed. This was my first time giving a recorded talk where the largest audience would be those not in the room. I found the lack of graphical cues to be a problem, myself, when I was looking over the slides and watching the talk last night. I need to add more animations to the slides to make sure these smaller points are easier to discern for those who can't see my laser pointer.
      That said, the discussion of how the relative timings were constructed starts at 50:10.

    • @timhaines3877 · 8 years ago

      The description I gave was true, but not complete. By default, the Intel compiler uses "-fp-model fast=1" which is mostly like gcc and clang's "-ffast-math." In truth, using "-fp-model fast=2" is a closer approximation to what "-ffast-math" does. But Intel doesn't like to divulge what they are _actually_ doing with their flags, so it's hard to say exactly what's going on between the different levels of "fastness" in Intel.

    • @timhaines3877 · 8 years ago

      All of my code examples are on the github repo mentioned in the slides linked in the description. The very last slide (after the Appendix) is a complete example of doing the SAXPY with AVX intrinsics.

    • @timhaines3877 · 8 years ago +1

      This is essentially what I'm doing in my little SIMD library, but I use tag dispatch to select the desired intrinsic at compile time. This is also how Boost.SIMD does theirs. It is a very bad idea to mix architectures in a single translation unit, though, because switching between SSE and AVX incurs a non-zero runtime overhead. I'm not sure whether that is true for AVX2 or AVX-512. Of course, we'll have to wait until Q1 of next year to see AMD's Zen, but I'm excited!
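A minimal sketch of the tag-dispatch idea described above. The names are my own invention (not Boost.SIMD's), and the per-architecture bodies are scalar stand-ins for the real intrinsics; the point is that the overload is chosen at compile time, so there is no runtime branch on architecture:

```cpp
#include <cstddef>

// One empty tag type per instruction set.
struct sse_tag {};
struct avx_tag {};

// Pick the best tag for the current translation unit's flags.
#if defined(__AVX__)
using best_arch = avx_tag;
#else
using best_arch = sse_tag;
#endif

// One overload per architecture; a real library would use
// _mm_add_ps / _mm256_add_ps in the loop bodies.
inline float sum_impl(const float* p, std::size_t n, sse_tag) {
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

inline float sum_impl(const float* p, std::size_t n, avx_tag) {
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Public entry point: overload resolution on the tag happens at
// compile time, costing nothing at runtime.
inline float sum(const float* p, std::size_t n) {
    return sum_impl(p, n, best_arch{});
}
```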