Making Golang 13x faster with Assembly code

  • Published on Oct 5, 2024
  • One of the coolest parts of Go (Golang) is that there are many ways to speed up your program. One such way is to take advantage of the ability to create .s and .asm assembly code files that are compiled directly into your program. In this video I go over what I did in my Golang Vulkan game engine to improve the performance of the linear algebra math. By taking advantage of SIMD (AVX) instructions we can improve some functions by nearly 13x. SIMD stands for "single instruction, multiple data" and is a key component missing from the standard Go compiler. We can of course use Go's built-in assembly capabilities to improve performance and reach otherwise inaccessible CPU instructions for much more than vectorization, but this is probably the most common reason people drop into assembly.
    Go assembly file ► github.com/Kai...
    Twitter ► brentfarris.com...
    Website ► brentfarris.com
    GitHub ► brentfarris.com...

Comments • 89

  • @user-tw2kr6hg4r 3 months ago +404

    you know it's serious computer engineering when the source code is printed on a sheet of paper

    • @lozyodella4178 3 months ago +3

      😂😂😂

    • @w花b 3 months ago +11

      The first one I've seen like that was Ben Eater, but this one even has colors. That's next level.

    • @araz911 3 months ago +4

      @@w花b For syntax highlighting. It's paper from 2024, ok?!

    • @gregandark8571 3 months ago +1

      Always has been.

  • @AK-vx4dy 3 months ago +79

    In assembly you have only one layer of abstraction... paper 😅

  • @crowlsyong 3 months ago +15

    What a supernatural gift to the world

  • @baxiry. 3 months ago +10

    What a supernatural topic

  • @MaxPicAxe 2 months ago +2

    That's nice how, for four floats, the 2 bits for src, 2 bits for dst and 4 bits for bitmask conveniently fit into exactly a byte.
    The next convenient number of floats with this pattern appears to be a very large number, where convenient means when the amount of space src,dst,bitmask take up in bits is a power of 2.
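The byte layout mentioned here can be sketched in Go. This is an illustrative model, not code from the video: the helper names `insertpsImm` and `insertps` are made up for the example, and the field positions follow the documented INSERTPS immediate (bits 7:6 source lane, bits 5:4 destination lane, bits 3:0 zero mask).

```go
package main

import "fmt"

// insertpsImm (hypothetical helper) packs the three INSERTPS control fields
// into one byte: bits 7:6 select the source lane, bits 5:4 select the
// destination lane, and bits 3:0 mark destination lanes to zero afterwards.
func insertpsImm(srcLane, dstLane, zeroMask uint8) uint8 {
	return (srcLane&3)<<6 | (dstLane&3)<<4 | (zeroMask & 0xF)
}

// insertps (hypothetical helper) models the instruction on plain Go arrays.
// Lane 0 is the lowest element of the register.
func insertps(dst, src [4]float32, imm uint8) [4]float32 {
	// Copy the selected source lane into the selected destination lane.
	dst[(imm>>4)&3] = src[imm>>6]
	// Then clear every lane whose bit is set in the zero mask.
	for lane := 0; lane < 4; lane++ {
		if imm&(1<<lane) != 0 {
			dst[lane] = 0
		}
	}
	return dst
}

func main() {
	// Copy src lane 2 into dst lane 1, then zero lane 3.
	imm := insertpsImm(2, 1, 0b1000)
	fmt.Printf("imm8 = %#02x\n", imm)
	fmt.Println(insertps([4]float32{1, 2, 3, 4}, [4]float32{10, 20, 30, 40}, imm))
}
```

Two bits are enough to address four lanes, which is why the source and destination selectors plus a four-lane mask add up to exactly eight bits.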

    • @BrentFarris 2 months ago

      You can also operate on doubles, but half as many due to using the same space.

  • @lufsss_ 3 months ago +5

    What a supernatural explanation

  • @mrrolandlawrence 3 months ago +1

    Wow, love this. I used to be an ARM programmer many, many years ago. Back in those days you really had to optimise code for the number of CPU cycles needed. Sophie Wilson really made ARM instructions a doddle to use.

  • @grimquokka9843 3 months ago +3

    This is a good idea you came up with, and I also appreciate the paper explanation. Please keep up with these videos, sir.

  • @tubbystubby 2 months ago +1

    I started Go half a year ago and have been enjoying it a lot. This was awesome, learned a lot. Thanks for the great content.
    You know you are getting the juiciest stuff if it's on paper.

  • @QW3RTYUU 2 months ago

    Ben Eater vibes this gives me. Thanks for the video!

  • 3 months ago +1

    Love that you did it with go. It's just such a clean language.

    • @BrentFarris 3 months ago +3

      I picked up Go after learning that Ken Thompson helped design it. Slices and goroutine/channels are awesome

    • @shurizzle 2 months ago

      @@BrentFarris Goroutines/channels come from Plan9, as does that ASM syntax. After all, Rob Pike is behind both Golang and Plan9.

  • @treelibrarian7618 3 months ago +13

    Just thought you might be interested: the pack operation with 16 insertps's (16p23, 16p5 ops) may instead be done as an in-register matrix transpose using unpckhps/unpcklps (4p23, 8p5 ops) in half the time.
    I'm not familiar with Golang's inline asm, so I'll use Intel asm instead; I'm sure you'll be able to translate:
    vmovups xmm1, [rbp + start] ; a3a2a1a0
    vmovups xmm2, [rbp + start + 16] ; b3b2b1b0
    vmovups xmm3, [rbp + start + 32] ; c3c2c1c0
    vmovups xmm4, [rbp + start + 48] ; d3d2d1d0
    vunpckhps xmm5, xmm2, xmm4 ; d3b3d2b2
    vunpcklps xmm4, xmm2, xmm4 ; d1b1d0b0
    vunpcklps xmm2, xmm1, xmm3 ; c1a1c0a0
    vunpckhps xmm3, xmm1, xmm3 ; c3a3c2a2
    vunpcklps xmm1, xmm2, xmm4 ; d0c0b0a0
    vunpckhps xmm2, xmm2, xmm4 ; d1c1b1a1
    vunpckhps xmm4, xmm3, xmm5 ; d3c3b3a3
    vunpcklps xmm3, xmm3, xmm5 ; d2c2b2a2
    This has the advantage that whether you use xmm, ymm, or zmm registers, it's still 8 unpack ops to do 1, 2, or 4 4x4 matrix transposes. This formula uses one extra register and produces the result in the same order in the registers as your insertps-based version.
    Edit: realized a couple of days later that I used the AVX 3-operand versions of the instructions, not the SSE1 2-operand versions, so I've added the V's. It's not so pretty if you can't use them, since every output has to be copied first and the operand ordering is inconvenient too, so it doesn't fit in 5 registers any more...
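For readers who want to check the shuffle pattern without writing assembly, the eight unpack steps can be modeled on plain Go arrays. This is only a sketch: `vec4`, `unpcklps`, `unpckhps`, and `transpose4x4` are illustrative names, and the two helpers mimic the lane interleaving of the 128-bit instructions rather than calling them.

```go
package main

import "fmt"

// vec4 stands in for one xmm register; index 0 is the lowest lane.
type vec4 [4]float32

// unpcklps interleaves the two low lanes of a and b: a0 b0 a1 b1.
func unpcklps(a, b vec4) vec4 { return vec4{a[0], b[0], a[1], b[1]} }

// unpckhps interleaves the two high lanes of a and b: a2 b2 a3 b3.
func unpckhps(a, b vec4) vec4 { return vec4{a[2], b[2], a[3], b[3]} }

// transpose4x4 applies the same eight unpack operations as the listing:
// four to pair up rows (a with c, b with d), four to combine the pairs.
func transpose4x4(a, b, c, d vec4) [4]vec4 {
	bdHi := unpckhps(b, d) // b2 d2 b3 d3
	bdLo := unpcklps(b, d) // b0 d0 b1 d1
	acLo := unpcklps(a, c) // a0 c0 a1 c1
	acHi := unpckhps(a, c) // a2 c2 a3 c3
	return [4]vec4{
		unpcklps(acLo, bdLo), // a0 b0 c0 d0
		unpckhps(acLo, bdLo), // a1 b1 c1 d1
		unpcklps(acHi, bdHi), // a2 b2 c2 d2
		unpckhps(acHi, bdHi), // a3 b3 c3 d3
	}
}

func main() {
	rows := transpose4x4(
		vec4{0, 1, 2, 3},
		vec4{4, 5, 6, 7},
		vec4{8, 9, 10, 11},
		vec4{12, 13, 14, 15},
	)
	for _, r := range rows {
		fmt.Println(r)
	}
}
```

Each output row of the model collects one column of the input, which is the result the comment's lane annotations describe.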

    • @lozyodella4178 3 months ago +2

      Is this the language of Gods?

    • @stercorarius 3 months ago

      @@lozyodella4178 nah thats lisp

    • @treelibrarian7618 3 months ago

      a further thought for you: perhaps the transpose is entirely unneeded anyway: this code here does the 4x4 matrix multiply without it. see the inline comments for functional details.
      I adapted this from a 16x16 avx512 version where the macro was just 16 fma instructions with inline broadcast loading the elements of A directly. Here pshufd (SSE2) serves as the broadcast equivalent, since the two-operand shufps takes its low two lanes from the destination register, and the multiplies and adds are separated.
      ;; 4x4 matrix multiply
      ;; A is the matrix that is scanned horizontally,
      ;; B is the matrix to be scanned vertically.
      ;; output to O
      %macro domatrixrowSSE 0
      pshufd xmm0, xmm3, 0 ; broadcast first element of A row 1
      mulps xmm0, xmm4 ; multiply whole first row of B
      pshufd xmm1, xmm3, 0x55 ; bcast second element of A row 1
      mulps xmm1, xmm5 ; multiply by second row of B
      addps xmm0, xmm1 ; add to first result
      pshufd xmm1, xmm3, 0xaa ; e3 of A row 1
      mulps xmm1, xmm6 ; mult B row 3
      addps xmm0, xmm1 ; add
      pshufd xmm1, xmm3, 0xff ; e4 of A row 1
      mulps xmm1, xmm7 ; mult B row 4
      addps xmm0, xmm1 ; last add
      %endmacro
      multiply4x4function: ; this is not complete: replace the tokens of a, b and o
      ; with whatever you have those pointers in.
      ; Can be used as a base for larger matrix multiplies
      ; if you load the prior output content before adding all 4 lines
      ; and change 16/32/48 to 1/2/3x row length in bytes,
      ; and a/b/o point to the relevant parts of the input/output matrices.
      movups xmm4, [b + 0] ; load whole b matrix
      movups xmm5, [b + 16]
      movups xmm6, [b + 32]
      movups xmm7, [b + 48]
      movups xmm3, [a + 0] ; load first row of A matrix
      domatrixrowSSE ; the macro multiplies one row of A by 4 columns of B
      movups [o + 0], xmm0 ; store results to first row of output matrix
      ; e1r1O = : e2r1O = : e3r1O = : e4r1O =
      ; e1r1A*e1r1B : e1r1A*e2r1B : e1r1A*e3r1B : e1r1A*e4r1B
      ; + e2r1A*e1r2B : + e2r1A*e2r2B : + e2r1A*e3r2B : + e2r1A*e4r2B
      ; + e3r1A*e1r3B : + e3r1A*e2r3B : + e3r1A*e3r3B : + e3r1A*e4r3B
      ; + e4r1A*e1r4B : + e4r1A*e2r4B : + e4r1A*e3r4B : + e4r1A*e4r4B
      movups xmm3, [a + 16] ; load second row of A
      domatrixrowSSE
      movups [o + 16], xmm0 ; store to second row of O
      movups xmm3, [a + 32] ; third row of A
      domatrixrowSSE
      movups [o + 32], xmm0 ; to third row of O
      movups xmm3, [a + 48] ; 4th row of A
      domatrixrowSSE
      movups [o + 48], xmm0 ; to 4th row of O
      ;; 28p01, 16p5, 8r4w. 16cycles/matrix on icelake, 28c/matrix on older CPU with only 1 vfp port (eg sandy bridge)

    • @shappertallw 2 months ago +1

      @@treelibrarian7618 this is insane i never thought i would see the day where someone cold rolled asm with sse instructions no less in a yt comments section. props

    • @treelibrarian7618 2 months ago

      @@shappertallw it's a hobby of mine: I've done it before and I'll probably do it again. I think I might have scared one or two youtubers away from posting asm-related videos - which was not my intention. I really should be making videos myself...

  • @user-tw2kr6hg4r 3 months ago +31

    matrix multiplication in primary school?

    • @BrentFarris 3 months ago +63

      One year, you're learning to read books without pictures. The next, you're calculating the cross product on a 4 dimensional matrix. Then you go to middle school, learn about girls, forget it all, and have to relearn it in pre-calc.

    • @Paul-zh2jp 3 months ago

      this is what i came to comment lol

    • @w花b 3 months ago +3

      @@BrentFarris Don't have to forget it if no girls approach you. That's a win in my book.

  • @mr.daniish 2 months ago

    This is some serious knowledge! More of these please

  • @sanderbos4243 2 months ago

    Extremely good explanation

  • @joeybasile1572 2 months ago +1

    Thanks dude. Informative. Good presentation.

  • @Antonio-yy2ec 3 months ago +2

    Pure gold!!

  • @kira.herself 3 months ago +1

    What a supernatural video

  • @blockshift758 3 months ago +1

    I always see comments like "matrix math in middle/high school?!" on videos like this, and laugh to myself because I remember we did it in elementary school (grades 4-6).

  • @Decastyled 3 months ago +1

    'Cause you're a supernatural
    A beating heart of stone
    You gotta be so cold
    To make it in this world
    Yeah, you're a supernatural
    Living your life cutthroat
    You gotta be so cold
    Yeah, you're a supernatural

  • @--bountyhunter-- 3 months ago +3

    what a natural super

  • @danielsmith5626 3 months ago

    ASMR backend is peak

  • @hyprland 3 months ago +3

    What a super nature

  • @MrTomyCJ 3 months ago +1

    There is a flaw in the system: I can deduce from the comments that I should reply something supernatural without having watched the entire video.
    The next time you'll have to provide a function to determine what to comment instead of a phrase, so that the appropriate comment can't be deduced from the comments.
    You got me to comment anyway though.

  • @TheCyberBully420 2 months ago +1

    You made an engine with Vulkan or you made something similar to Vulkan??

    • @BrentFarris 2 months ago

      Using Vulkan, I've made engines in C, C++, and Go. It has a pretty nice and straightforward structure once you get a handle on it.

  • @tiskanto 2 months ago

    This is the "Ben Eater" style

  • @timofeysobolev7498 2 months ago

    Great video!)

  • @Caellyan 3 months ago +1

    What about using something like volk (vector optimized library of kernels)?
    Is Go FFI slow?

    • @BrentFarris 3 months ago

      You likely can without issues. You may have to benchmark it though, because you do have to pay the small cost of swapping stacks. Go's stacks are built lean for goroutines, so the runtime has to swap to a C-compatible stack to call C.

  • @harold2718 3 months ago

    Instead of transposing B and then essentially doing dot products, you can take a row of B, multiply it by a broadcast element of A, and then add it into the result. That's more efficient than doing dot products; HADDPS isn't that efficient (essentially equal to 2 shuffles plus ADDPS). Also, even if you do want to transpose, you can do it with 8 shuffles instead of 16 INSERTPSes, similar to how the _MM_TRANSPOSE4_PS macro does it (but you have no access to that, so you'd implement it manually).
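The two formulations described here compute the same product, which a scalar Go model makes easy to see. This is a sketch under assumptions: `mat4`, `mulDot`, and `mulBroadcast` are illustrative names, and the code uses ordinary loops rather than SIMD, with the innermost loop of `mulBroadcast` marking the part a single vector multiply and add would cover.

```go
package main

import "fmt"

type mat4 [4][4]float32

// mulDot computes each output element as the dot product of a row of A and a
// column of B, which is the shape of the transpose-then-dot-product approach.
func mulDot(a, b mat4) mat4 {
	var out mat4
	for i := 0; i < 4; i++ {
		for j := 0; j < 4; j++ {
			for k := 0; k < 4; k++ {
				out[i][j] += a[i][k] * b[k][j]
			}
		}
	}
	return out
}

// mulBroadcast builds each output row as a sum of whole rows of B, each scaled
// by one element of A. No transpose and no horizontal add is needed.
func mulBroadcast(a, b mat4) mat4 {
	var out mat4
	for i := 0; i < 4; i++ {
		for k := 0; k < 4; k++ {
			s := a[i][k] // the element that would be broadcast to all lanes
			for j := 0; j < 4; j++ {
				out[i][j] += s * b[k][j] // one row-wide multiply and add
			}
		}
	}
	return out
}

func main() {
	a := mat4{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}}
	b := mat4{{2, 0, 0, 0}, {0, 3, 0, 0}, {0, 0, 4, 0}, {1, 1, 1, 1}}
	fmt.Println(mulDot(a, b) == mulBroadcast(a, b))
	fmt.Println(mulBroadcast(a, b)[0])
}
```

Only the loop order differs between the two functions; the broadcast form keeps every step as a full-width operation on a row of B, which is what makes it a good fit for vector registers.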

  • @fqidz 3 months ago +1

    supernatural season 2 ep 2

  • @maximus1172 3 months ago

    very cool!!, you should also try making the engine in rust

    • @BrentFarris 3 months ago

      One day, I may. I enjoy trying out languages, and game frameworks/engines tend to be my testbed. Either as the core engine code or as a scripting language depending on the nature of the language.

  • @Onyx-it8gk 3 months ago

    Neat video! If you have this much programming knowledge and skill, I think you'd really appreciate Vale. It's a new language that takes a novel approach to memory management without a GC. It borrows concepts from many languages like Rust, Cyclone, Pony and Forty2.

    • @BrentFarris 3 months ago +1

      Thanks! I'll have to check it out, I have a lot of fun trying out different languages. There have been a lot of languages popping up lately. It's so hard to keep up, haha

    • @Onyx-it8gk 3 months ago

      @@BrentFarris I know what you mean! I'm sure someone such as yourself has a very long list of things to check out with not enough time in the day!

  • @rdubb77 3 months ago

    Primary school? Linear algebra is generally a college subject, I didn’t learn matrix multiplication even in high school

    • @BrentFarris 3 months ago +4

      What? Kids nowadays don't do linear algebra after nap time anymore?

  • @gbucks5117 3 months ago

    When the code is on paper, you know the shit is serious

  • @sirbumblefuck 3 months ago +3

    What a supernatural way of explaining

  • @iant9053 3 months ago

    Holy. If you had to learn everything from scratch, in what order would you learn your languages? I'm just starting with C. Thanks, wizard.

    • @BrentFarris 3 months ago +2

      I would learn C if I had to go from scratch. It's just high level enough to do huge projects and just low level enough to teach you how computers work internally.
      I learned C++ as my first language, but I wish it were C.

  • @hz8711 2 months ago

    I am missing too many things to understand this. Can someone explain it in a few sentences, at least what the idea is? Thanks!

  • @Jhat 3 months ago

    the real question is... WHAT IS THAT PENCIL HOLDER????

  • @cvabds 3 months ago +1

    How much would you want to create a game engine for TempleOS?

    • @BrentFarris 3 months ago +1

      Haha, I haven't booted up TempleOS yet. It's still on my bucket list. When I do, it might just happen!

    • @cvabds 3 months ago

      @@BrentFarris please don't be restricted to the whole religious thing, use it to the full potential please, 4k high res

  • @wakanda6357 3 months ago

    What should one do or learn to understand assembly??

    • @greenrocket23 3 months ago

      Well, a pretty good resource for beginners is the MIT OpenCourseWare for the x86_64 architecture

    • @BrentFarris 3 months ago

      Program some small things in 6502 assembly. It is an incredibly small assembly language and will teach you 90% of what you need to know. You can then get a book or read online docs for x86/x64 and ARM instructions.
      Check out this 6502 tutorial. It comes with an emulator and is a lot of fun:
      skilldrick.github.io/easy6502/index.html

  • @Kyle-do6nj 3 months ago

    All this to ultimately have a 20% efficiency at candy crush...

  • @trungthanhbp 2 months ago

    nice

  • @hulakdar 3 months ago +1

    Is there no way to natively emit vector instructions in Go?
    If that is true, then that is quite unfortunate.
    Isn't it easier to write those functions in C and link with them instead of writing out assembly?

    • @BrentFarris 3 months ago +1

      Not at the moment in Go directly. You have a few options:
      1. Write assembly as we did here directly (fastest execution).
      2. Write vectorized assembly instructions as their own functions (similar to C) and use them at a higher level, but you'll need to take care to follow the calling conventions so you don't clobber your asm work.
      3. Use the C vectorization library functions and call into C (via cgo). This will have the tiny overhead of swapping stacks, though.
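As a rough illustration of the second option, here is what a small vector helper might look like. Everything named here is an assumption made for the example: `addVec4` and the file name `vec4_amd64.s` do not come from the video or the engine. The assembly is shown inside a comment because it would live in a separate `.s` file, and a pure Go body stands in for it so the sketch runs as-is.

```go
package main

import "fmt"

// In a real build the Go file would contain only the declaration
//
//	func addVec4(a, b, out *[4]float32)
//
// with no body, and a file such as vec4_amd64.s (hypothetical name) in the
// same package would supply the implementation in Go's assembly syntax,
// roughly like this:
//
//	#include "textflag.h"
//
//	// func addVec4(a, b, out *[4]float32)
//	TEXT ·addVec4(SB), NOSPLIT, $0-24
//		MOVQ   a+0(FP), AX    // first pointer argument
//		MOVQ   b+8(FP), BX    // second pointer argument
//		MOVQ   out+16(FP), CX // output pointer
//		MOVUPS (AX), X0       // load four floats from a
//		MOVUPS (BX), X1       // load four floats from b
//		ADDPS  X1, X0         // X0 = X0 + X1, all four lanes at once
//		MOVUPS X0, (CX)       // store the result
//		RET
//
// The pure Go body below is a stand-in with the same signature.
func addVec4(a, b, out *[4]float32) {
	for i := range out {
		out[i] = a[i] + b[i]
	}
}

func main() {
	a := [4]float32{1, 2, 3, 4}
	b := [4]float32{10, 20, 30, 40}
	var sum [4]float32
	addVec4(&a, &b, &sum)
	fmt.Println(sum)
}
```

The `$0-24` in the TEXT line declares no local stack frame and 24 bytes of arguments (three 8-byte pointers), and the `name+offset(FP)` references are how the stack-based calling convention for assembly functions exposes them. Getting those sizes and offsets right is the "calling conventions" care the reply refers to.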

    • @gregandark8571 3 months ago +1

      @@BrentFarris Go is a bullshit language exactly because of such technical shortcomings :(

  • @alejandroulisessanchezgame6924 3 months ago

    Is it possible to develop 3D games with Golang like this, even though it's a GC language?

    • @BrentFarris 3 months ago +5

      Yes, you can either write it from scratch like I do for fun (see the Kaiju engine GitHub link in the description), or you can load up helper C libraries for OpenGL, SDL, etc., which I've done in the past.
      You'll find most game engines like Unreal and Unity use an internally built garbage collector, so don't let the GC hold you back from experimenting.

    • @alejandroulisessanchezgame6924 3 months ago

      Thanks, I will try.

    • @nittani. 3 months ago

      What is garbage @@BrentFarris

    • @QW3RTYUU 2 months ago

      @@nittani. something to be collected it seems

    • @ErikOnNoobTube 2 months ago

      @@QW3RTYUU you made me spit my coffee

  • @sokiuwu 3 months ago +2

    Making assembly 30× faster by writing in binary

    • @BrentFarris 3 months ago +2

      Don't tempt me with a good time

  • @spoonikle 3 months ago +2

    Who else is naturally this super?

    • @Miles-co5xm 3 months ago

      Java base classes

  • @domelessanne6357 3 months ago

    wow

  • @opkp 3 months ago

    Neat

  • @emirsahin4105 2 months ago

    Madman

  • @mikejohneviota9293 3 months ago

    primary school huh for linear math i feel dumb

    • @BrentFarris 3 months ago +1

      Me too, I must have missed the linear algebra class they taught at recess...

  • @blockshift758 3 months ago

    Bruh is explaining code on paper

  • @user-lh3xs9km6z 3 months ago

    Nice results, without a doubt... but at that level of needed optimization, isn't going back to C/C++ better?

    • @BrentFarris 3 months ago +1

      Actually, there are some highly optimized Go functions that beat their Go assembly counterparts. I'll make a video on this next.
      I always advocate for people to write in C, I'm biased because it's my main language. But, you really do get some amazing benefits in Go that you just don't in C/C++. So it's really up to the taste of the developer. I've written 3D Vulkan game engines in all 3 languages (C, C++, and Go)