Making Golang 13x faster with Assembly code
- Published on Oct 5, 2024
- One of the coolest parts of Go (golang) is that there are many ways to speed up your program. One such way is to take advantage of the ability to create .s and .asm assembly code files that are compiled directly into your program. In this video I go over what I did in my Golang Vulkan game engine to improve the performance of the linear algebra math. By taking advantage of the SIMD (AVX) instructions, we can improve some functions by nearly 13x. SIMD stands for "single instruction, multiple data" and is a key component missing from the standard Go compiler. We can of course use Go's built-in assembly capabilities to improve performance and reach otherwise inaccessible CPU instructions for many things other than vectorization, but this is probably the most common reason people drop into assembly.
Go assembly file ► github.com/Kai...
Twitter ► brentfarris.com...
Website ► brentfarris.com
GitHub ► brentfarris.com...
You know it's serious computer engineering when the source code is printed on a sheet of paper
😂😂😂
The first one I've seen like that was Ben Eater, but this one even has colors. That's next level
@@w花b For syntax highlighting. It's paper from 2024, ok?!
Always has been.
In assembly you have only one layer of abstraction... paper 😅
What a supernatural gift to the world
What a supernatural topic
That's nice how, for four floats, the 2 bits for src, 2 bits for dst and 4 bits for bitmask conveniently fit into exactly a byte.
The next convenient number of floats with this pattern appears to be a very large number, where convenient means when the amount of space src,dst,bitmask take up in bits is a power of 2.
You can also operate on doubles, but half as many due to using the same space.
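The byte layout described above can be sketched in Go. This is a minimal illustration, not code from the video: `insertpsImm` and `controlBits` are hypothetical helper names. The field positions follow the INSERTPS immediate (bits 7:6 select the source lane, bits 5:4 the destination lane, bits 3:0 the zero mask). `controlBits` assumes the same layout is simply widened for other lane counts, which is an assumption and not a real instruction encoding.

```go
package main

import (
	"fmt"
	"math/bits"
)

// insertpsImm packs the INSERTPS control byte:
// bits 7:6 = source lane, bits 5:4 = destination lane, bits 3:0 = zero mask.
func insertpsImm(src, dst, zeroMask uint8) uint8 {
	return (src&3)<<6 | (dst&3)<<4 | zeroMask&0xF
}

// controlBits counts the bits a same-shaped control word would need for a
// given number of lanes: two lane indices plus one mask bit per lane.
// Hypothetical generalisation, for illustration only.
func controlBits(lanes int) int {
	indexBits := bits.Len(uint(lanes - 1)) // bits needed to address one lane
	return 2*indexBits + lanes
}

func main() {
	// Copy source lane 1 into destination lane 2, zero nothing.
	fmt.Printf("%#x\n", insertpsImm(1, 2, 0)) // 0x60

	for _, lanes := range []int{4, 8, 16} {
		fmt.Println(lanes, "lanes:", controlBits(lanes), "bits") // 8, 14, 24
	}
}
```

With four lanes the two 2-bit indices and the 4-bit mask add up to exactly 8 bits, while 8 and 16 lanes would need 14 and 24 bits under the same assumed layout.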
What a supernatural explanation
Wow, love this. I used to be an ARM programmer many, many years ago. Back in those days you really had to optimise code for the number of CPU cycles needed. Sophie Wilson really made ARM instructions a doddle to use.
This is a good idea you came up with, and I also appreciate the explanation on paper. Please keep up with these videos, sir.
I started go half a year ago and have been enjoying it a lot. This was awesome, learned a lot. Thanks for great content.
You know you are getting the juiciest stuff if it's on paper.
Ben Eater vibes this gives me. Thanks for the video!
Love that you did it with go. It's just such a clean language.
I picked up Go after learning that Ken Thompson helped design it. Slices and goroutine/channels are awesome
@@BrentFarris Goroutines/channels come from Plan9, as does that ASM syntax. After all, Rob Pike is behind both Golang and Plan9.
Just thought you might be interested: the pack operation with 16 insertps instructions (16p23, 16p5 ops) may instead be done as an in-register matrix transpose using unpckhps/unpcklps (4p23, 8p5 ops) in half the time.
I'm not familiar with Go's asm syntax, so I'll use Intel asm instead. I'm sure you'll be able to translate:
vmovups xmm1, [rbp + start] ; a3a2a1a0
vmovups xmm2, [rbp + start + 16] ; b3b2b1b0
vmovups xmm3, [rbp + start + 32] ; c3c2c1c0
vmovups xmm4, [rbp + start + 48] ; d3d2d1d0
vunpckhps xmm5, xmm2, xmm4 ; d3b3d2b2
vunpcklps xmm4, xmm2, xmm4 ; d1b1d0b0
vunpcklps xmm2, xmm1, xmm3 ; c1a1c0a0
vunpckhps xmm3, xmm1, xmm3 ; c3a3c2a2
vunpcklps xmm1, xmm2, xmm4 ; d0c0b0a0
vunpckhps xmm2, xmm2, xmm4 ; d1c1b1a1
vunpckhps xmm4, xmm3, xmm5 ; d3c3b3a3
vunpcklps xmm3, xmm3, xmm5 ; d2c2b2a2
This has the advantage that, whether you use xmm, ymm, or zmm registers, it's still 8 unpack ops to do 1, 2, or 4 4x4 matrix transposes. This formula uses one extra register and produces the result in the same register order as your insertps-based version.
Edit: I realized a couple of days later that I had used the AVX 3-operand versions of the instructions, not the SSE1 2-operand versions, so I've added the V's. It's not so pretty if you can't use them, since every output has to be copied first and the operand ordering is inconvenient too, so it doesn't fit in 5 registers any more...
Is this the language of Gods?
@@lozyodella4178 nah thats lisp
A further thought for you: perhaps the transpose is entirely unneeded anyway. This code does the 4x4 matrix multiply without it. See the inline comments for functional details.
I adapted this from a 16x16 AVX-512 version where the macro was just 16 FMA instructions with inline broadcast loading the elements of A directly. Here shufps serves as the SSE broadcast equivalent, and the multiplies and adds are separated.
;; 4x4 matrix multiply
;; A is the matrix that is scanned horizontally,
;; B is the matrix to be scanned vertically.
;; output to O
%macro domatrixrowSSE 0
movaps xmm0, xmm3 ; copy the A row: 2-operand shufps takes its low lanes from the destination
shufps xmm0, xmm0, 0 ; broadcast first element of the A row
mulps xmm0, xmm4 ; multiply by whole first row of B
movaps xmm1, xmm3
shufps xmm1, xmm1, 0x55 ; broadcast second element of the A row
mulps xmm1, xmm5 ; multiply by second row of B
addps xmm0, xmm1 ; add to first result
movaps xmm1, xmm3
shufps xmm1, xmm1, 0xaa ; broadcast third element of the A row
mulps xmm1, xmm6 ; multiply by B row 3
addps xmm0, xmm1 ; add
movaps xmm1, xmm3
shufps xmm1, xmm1, 0xff ; broadcast fourth element of the A row
mulps xmm1, xmm7 ; multiply by B row 4
addps xmm0, xmm1 ; last add
%endmacro
multiply4x4function: ; this is not complete: replace the tokens of a, b and o
; with whatever you have those pointers in.
; Can be used as a base for larger matrix multiplies
; if you load the prior output content before adding all 4 lines
; and change 16/32/48 to 1/2/3x row length in bytes,
; and a/b/o point to the relevant parts of the input/output matrices.
movups xmm4, [b + 0] ; load whole b matrix
movups xmm5, [b + 16]
movups xmm6, [b + 32]
movups xmm7, [b + 48]
movups xmm3, [a + 0] ; load first row of A matrix
domatrixrowSSE ; the macro multiplies one row of A by 4 columns of B
movups [o + 0], xmm0 ; store results to first row of output matrix
; e1r1O = : e2r1O = : e3r1O = : e4r1O =
; e1r1A*e1r1B : e1r1A*e2r1B : e1r1A*e3r1B : e1r1A*e4r1B
; + e2r1A*e1r2B : + e2r1A*e2r2B : + e2r1A*e3r2B : + e2r1A*e4r2B
; + e3r1A*e1r3B : + e3r1A*e2r3B : + e3r1A*e3r3B : + e3r1A*e4r3B
; + e4r1A*e1r4B : + e4r1A*e2r4B : + e4r1A*e3r4B : + e4r1A*e4r4B
movups xmm3, [a + 16] ; load second row of A
domatrixrowSSE
movups [o + 16], xmm0 ; store to second row of O
movups xmm3, [a + 32] ; third row of A
domatrixrowSSE
movups [o + 32], xmm0 ; to third row of O
movups xmm3, [a + 48] ; 4th row of A
domatrixrowSSE
movups [o + 48], xmm0 ; to 4th row of O
;; 28p01, 16p5, 8r4w. 16cycles/matrix on icelake, 28c/matrix on older CPU with only 1 vfp port (eg sandy bridge)
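The broadcast-and-accumulate scheme in the assembly above can be sketched in scalar Go to make the data flow easier to follow. This is an illustrative model, not the commenter's or the video's code: `Vec4`, `Mat4`, `broadcast`, `MulBroadcast`, and `mulNaive` are hypothetical names, and matrices are assumed to be row-major.

```go
package main

import "fmt"

// Vec4 stands in for one 128-bit register holding four float32 lanes.
type Vec4 [4]float32

// Mat4 is a row-major 4x4 matrix: m[r] is row r.
type Mat4 [4]Vec4

// broadcast copies lane i of v into all four lanes (the shufps step).
func broadcast(v Vec4, i int) Vec4 { return Vec4{v[i], v[i], v[i], v[i]} }

// mul and add are lane-wise, like mulps and addps.
func mul(a, b Vec4) Vec4 { return Vec4{a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]} }
func add(a, b Vec4) Vec4 { return Vec4{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]} }

// MulBroadcast builds each output row as a sum of B's rows, each scaled by
// one broadcast element of the matching A row. No transpose is needed.
func MulBroadcast(a, b Mat4) Mat4 {
	var o Mat4
	for r := 0; r < 4; r++ {
		acc := mul(broadcast(a[r], 0), b[0])
		for k := 1; k < 4; k++ {
			acc = add(acc, mul(broadcast(a[r], k), b[k]))
		}
		o[r] = acc
	}
	return o
}

// mulNaive is the textbook triple loop, used here only as a reference.
func mulNaive(a, b Mat4) Mat4 {
	var o Mat4
	for r := 0; r < 4; r++ {
		for c := 0; c < 4; c++ {
			for k := 0; k < 4; k++ {
				o[r][c] += a[r][k] * b[k][c]
			}
		}
	}
	return o
}

func main() {
	a := Mat4{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}}
	b := Mat4{{1, 0, 0, 0}, {0, 2, 0, 0}, {0, 0, 3, 0}, {0, 0, 0, 4}}
	fmt.Println(MulBroadcast(a, b))
	fmt.Println(MulBroadcast(a, b) == mulNaive(a, b)) // true
}
```

Each pass of the inner loop corresponds to one shuffle, multiply, and add group in the macro, so one call to the macro produces one full row of the output.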
@@treelibrarian7618 This is insane. I never thought I would see the day where someone hand-rolled asm, with SSE instructions no less, in a YT comments section. Props.
@@shappertallw It's a hobby of mine: I've done it before and I'll probably do it again. I think I might have scared one or two YouTubers away from posting asm-related videos, which was not my intention. I really should be making videos myself...
matrix multiplication in primary school?
One year, you're learning to read books without pictures. The next, you're calculating the cross product on a 4 dimensional matrix. Then you go to middle school, learn about girls, forget it all, and have to relearn it in pre-calc.
this is what i came to comment lol
@@BrentFarris You don't have to forget it if no girls approach you. That's a win in my book.
This is some serious knowledge! More of these please
Extremely good explanation
Thanks dude. Informative. Good presentation.
Pure gold!!
What a supernatural video
I always see comments like "matrix math in middle/high school?!" on videos like this, and laugh to myself because I remember we did it in elementary school (grades 4-6).
'Cause you're a supernatural
A beating heart of stone
You gotta be so cold
To make it in this world
Yeah, you're a supernatural
Living your life cutthroat
You gotta be so cold
Yeah, you're a supernatural
what a natural super
ASMR backend is peak
What a super nature
There is a flaw in the system: I can deduce from the comments that I should reply something supernatural without having watched the entire video.
The next time you'll have to provide a function to determine what to comment instead of a phrase, so that the appropriate comment can't be deduced from the comments.
You got me to comment anyway though.
You made an engine with Vulkan or you made something similar to Vulkan??
Using Vulkan, I've made engines in C, C++, and Go. It has a pretty nice and straightforward structure once you get a handle of it.
This is the "Ben Eater" style
Great video!)
What about using something like volk (vector optimized library of kernels)?
Is Go FFI slow?
You likely can without issues. You may have to benchmark it, though, because you do pay the small cost of swapping stacks. Go's stack is built lean for goroutines, so it has to swap to a C-compatible stack to call C.
Instead of transposing B and then essentially doing dot products, you can take a row of B, multiply it by a broadcast element of A, and add it into the result. That's more efficient than doing dot products, since HADDPS isn't that efficient (it is essentially equal to 2 shuffles plus ADDPS). Also, even if you do want to transpose, you can do it with 8 shuffles instead of 16 INSERTPS instructions, similar to how the _MM_TRANSPOSE4_PS macro does it (but you have no access to that, so you'd implement it manually).
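For contrast, the transpose-then-dot-product formulation this comment refers to can be modelled in scalar Go. This is a sketch under assumptions: `Vec4`, `Mat4`, `hadd`, `dot`, and `MulDot` are hypothetical names, `hadd` mirrors the lane behaviour of HADDPS, and the exact instruction sequence used in the video may differ.

```go
package main

import "fmt"

// Vec4 stands in for one 128-bit register holding four float32 lanes.
type Vec4 [4]float32

// Mat4 is a row-major 4x4 matrix: m[r] is row r.
type Mat4 [4]Vec4

// hadd mirrors HADDPS: pairwise sums of a fill the low lanes,
// pairwise sums of b fill the high lanes.
func hadd(a, b Vec4) Vec4 { return Vec4{a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]} }

// dot multiplies lane-wise, then reduces with two horizontal adds.
func dot(a, b Vec4) float32 {
	p := Vec4{a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]}
	h := hadd(p, p) // p0+p1, p2+p3, p0+p1, p2+p3
	h = hadd(h, h)  // every lane now holds the full sum
	return h[0]
}

// transpose swaps rows and columns so B's columns become contiguous rows.
func transpose(m Mat4) Mat4 {
	var t Mat4
	for r := 0; r < 4; r++ {
		for c := 0; c < 4; c++ {
			t[c][r] = m[r][c]
		}
	}
	return t
}

// MulDot computes a*b as 16 dot products of A's rows with B's columns.
func MulDot(a, b Mat4) Mat4 {
	bt := transpose(b)
	var o Mat4
	for r := 0; r < 4; r++ {
		for c := 0; c < 4; c++ {
			o[r][c] = dot(a[r], bt[c])
		}
	}
	return o
}

func main() {
	a := Mat4{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}}
	b := Mat4{{1, 0, 0, 0}, {0, 2, 0, 0}, {0, 0, 3, 0}, {0, 0, 0, 4}}
	fmt.Println(MulDot(a, b))
}
```

In this form every output element needs its own horizontal reduction, which is the cost the comment is pointing at. The broadcast form only ever uses lane-wise multiplies and adds.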
supernatural season 2 ep 2
Very cool!! You should also try making the engine in Rust
One day, I may. I enjoy trying out languages, and game frameworks/engines tend to be my testbed. Either as the core engine code or as a scripting language depending on the nature of the language.
Neat video! If you have this much programming knowledge and skill, I think you'd really appreciate Vale. It's a new language that takes a novel approach to memory management without a GC. It borrows concepts from many languages like Rust, Cyclone, Pony and Forty2.
Thanks! I'll have to check it out, I have a lot of fun trying out different languages. There have been a lot of languages popping up lately. It's so hard to keep up, haha
@@BrentFarris I know what you mean! I'm sure someone such as yourself has a very long list of things to check out with not enough time in the day!
Primary school? Linear algebra is generally a college subject, I didn’t learn matrix multiplication even in high school
What? Kids nowadays don't do linear algebra after nap time anymore?
When the code is on paper, you know the shit is serious
What a supernatural way of explaining
Holy, If you had to learn everything from scratch, in what order would you learn your langs? just starting with C, thx wizard
I would learn C if I had to go from scratch. It's just high level enough to do huge projects and just low level enough to teach you how computers work internally.
I learned C++ as my first language, but I wish it were C.
I am missing too many things to understand this. Can someone explain it in a few sentences, or at least say what the idea is? Thanks!
the real question is... WHAT IS THAT PENCIL HOLDER????
How much would you want to create a game engine for TempleOS?
Haha, I haven't booted up TempleOS yet. It's still on my bucket list. When I do, it might just happen!
@@BrentFarris please don't be restricted to the whole religious thing, use it to the full potential please, 4k high res
What should one do or learn to understand assembly??
Well, a pretty good resource for beginners is the MIT OpenCourseWare for the x86_64 architecture
Program some small things in 6502 assembly. It is an incredibly small assembly language and will teach you 90% of what you need to know. You can then get a book or read online docs for x86/x64 and arm instructions.
Check out this 6502 tutorial. It comes with an emulator and is a lot of fun:
skilldrick.github.io/easy6502/index.html
All this to ultimately have a 20% efficiency at candy crush...
niec
is there no way to natively emit vector instructions in go?
If that is true, then that is quite unfortunate
Isn't it easier to write those functions in C and link with them instead of writing out assembly?
Not at the moment in Go directly. You have a few options:
1. Write assembly as we did here directly (fastest execution).
2. Write vectorized assembly instructions as their own functions (similar to C) and use them at a higher level, but you'll need to take care to follow the calling conventions so you don't clobber your asm work.
3. Use C vectorization library functions and call them from Go. This will have the tiny overhead of swapping stacks, though.
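Whichever of the options above you pick, it helps to measure it against a pure-Go baseline first. The sketch below times a scalar dot product with `testing.Benchmark` from the standard library. `Vec4`, `dotGo`, and `sink` are hypothetical names, and the assembly or C variant you would compare against is not shown here.

```go
package main

import (
	"fmt"
	"testing"
)

// Vec4 is four float32 values, the shape a 128-bit SIMD register would hold.
type Vec4 [4]float32

// dotGo is the pure-Go baseline to compare an assembly or C version against.
func dotGo(a, b Vec4) float32 {
	return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
}

// sink keeps the compiler from discarding the benchmarked call.
var sink float32

func main() {
	// Init registers the testing flags; it is needed when calling
	// Benchmark outside of "go test".
	testing.Init()

	a, b := Vec4{1, 2, 3, 4}, Vec4{5, 6, 7, 8}
	res := testing.Benchmark(func(bm *testing.B) {
		for i := 0; i < bm.N; i++ {
			sink = dotGo(a, b)
		}
	})
	fmt.Println("dotGo:", res) // iterations and ns/op for the baseline
}
```

Running the same harness around the assembly or C-backed function gives a like-for-like number, including any call overhead such as the stack swap mentioned in option 3.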
@@BrentFarris Go is bullshit language exactly for such technical lacks :(
Is it possible to develop 3D games with Golang like this, even though it's a GC language?
Yes, you can either write it from scratch like I do for fun (see Kaiju github engine link in description). Or you can load up helper C libraries for OpenGL, SDL, etc; which I've done in the past.
You'll find most game engines like Unreal and Unity use an internally built garbage collector, so don't let the GC hold you back from experimenting.
Thanks i will try.
What is garbage @@BrentFarris
@@nittani. something to be collected it seems
@@QW3RTYUU you made me spit my coffee
Making assembly 30× faster by writing in binary
Don't tempt me with a good time
Who else is naturally this super?
Java base classes
wow
Neat
manyakadam
primary school huh for linear math i feel dumb
Me too, I must have missed the linear algebra class they taught at recess...
Bruh is explaining code on paper
Nice results, without a doubt... but at that point of needed optimization, isn't going back to C/C++ better?
Actually, there are some highly optimized Go functions that beat its Go Assembly counterparts. I'll make a video on this next.
I always advocate for people to write in C, I'm biased because it's my main language. But, you really do get some amazing benefits in Go that you just don't in C/C++. So it's really up to the taste of the developer. I've written 3D Vulkan game engines in all 3 languages (C, C++, and Go)