GPU Performance Portability Using Standard C++ with SYCL - Hugh Delaney & Rod Burns - CppCon 2022

  • Published Oct 5, 2024
  • cppcon.org/
    ---
    github.com/Cpp...
    The proliferation of accelerators, in particular GPUs, over the past decade is changing the way software is developed. Most developers who have been using CPU-based machines are now considering how to improve application performance by offloading execution to many-core processors. Many emerging disciplines, including AI, deep neural networks and machine learning, have shown that GPUs can deliver many times the performance of CPU-only architectures. New hardware features such as "tensor cores" are also starting to emerge to address specific problems, including mixed-precision computing. The new challenge for developers is figuring out how to develop for heterogeneous architectures that include GPUs made by different companies. Currently the most common way to develop software for GPUs is the CUDA programming model, but this has pitfalls: CUDA uses non-standard C++ syntax and semantics, is a proprietary interface, and can only target Nvidia GPUs. Alternatives include HIP, which offers another proprietary programming interface capable of targeting only AMD GPUs.
    This presentation will demonstrate how standard C++ code with SYCL can be used to achieve performance portability on processors from multiple vendors, including Nvidia GPUs, AMD GPUs and Intel GPUs. The SYCL programming interface is a royalty-free, industry-defined open standard designed to enable the latest features of accelerators. Using an open source project, we'll show how standard C++ syntax and semantics are used to define the SYCL kernel and memory management code required to offload parallel execution to a range of GPUs. Further to this, we'll explain how easy it is to compile this C++ code with a SYCL compiler so that it can run on Nvidia, AMD and Intel GPUs, and compare its performance with the same code written using the proprietary CUDA and HIP environments. Lastly, we'll share our tips for achieving the best performance on different processor architectures, including dealing with varying memory resources, using the most appropriate memory access patterns, using hardware-specific features such as "tensor cores," and ensuring high utilization of the processor cores.
    ---
    Rod Burns
    Rod Burns has been helping developers to build complex software for well over a decade, with experience in organizing training, tutorials and workshops. At Codeplay, Rod leads the effort to support and educate developers using SYCL. Rod helped to create “SYCL Academy,” an open source set of materials for teaching SYCL that has already been adopted by some of the top universities in the world and has been used at multiple conferences to teach SYCL.
    ---
    Hugh Delaney
    Hugh is a software engineer at Codeplay, where he works on the DPC++ compiler. Hugh’s academic background is in mathematics and HPC with a focus on numerical algorithms and linear algebra. Hugh has been teaching mathematics and computing in some manner for all of his adult life.
    ---
    Videos Filmed & Edited by Bash Films: www.BashFilms.com
    YouTube Channel Managed by Digital Medium Ltd events.digital...
    #cppcon #programming #gpu

Comments • 10

  • @John-xl5bx
    @John-xl5bx 1 year ago +9

    I wish the Kokkos discussion was audible, as it really raises the question of why the national labs (which are the sponsors of this oneAPI stuff) won't commit to just SYCL. I also would like a more in-depth discussion of how you "abstract away" the data migration. Every previous attempt (Unified Memory and the like) has failed whenever faced with non-trivial codes. If I've got to do it myself, I may as well stick with something broadly supported, like OpenMP. If you have a more detailed discussion elsewhere, please reply with a link.

    • @perrystyle6028
      @perrystyle6028 1 year ago +2

      This is a bit of a late reply, but I think it's because the national labs don't want to commit to just one resource. They want to branch out and ensure they aren't locked into a single resource in case it ever fails. You can see this in the building of supercomputers, where the new ones coming out use different vendors.

  • @trejkaz
    @trejkaz 1 year ago +1

    In many modern languages we can just write code the way we always wrote it, and the compiler and/or runtime will decide to automatically produce CPU vector instructions. My hope is that some day the compiler and/or runtime will decide to offload some blocks of code to the GPU where possible.

  • @apivovarov2
    @apivovarov2 7 months ago +1

    It looks a bit strange that to run such a basic y = x*x example we need nested lambdas, where the outer lambda captures by reference and the inner lambda captures by value. It feels a bit overdesigned and overcomplicated. Getting device info via templates is another contradictory spot.

  • @catlakprofesormfb
    @catlakprofesormfb 1 year ago +1

    Great talk!

  • @arisweedler4703
    @arisweedler4703 1 year ago +2

    Does anyone have recommendations for where I could learn how to write CUDA code that's highly optimized for the hardware? I took some computer engineering classes in college but didn't even minor in it. I had fun with Verilog and, later, my computer architecture class, though.

  • @12nites
    @12nites 1 year ago +6

    5:50: it says Intel's oneAPI runs on any CPU. Does it, though? Can you reliably run it on AMD CPUs?

    • @dimula73
      @dimula73 1 year ago +5

      As far as I understand, it runs on the CPU through an OpenCL backend, so it basically depends on how well the OpenCL backend is optimized for the CPU in use.
