"Oh Golly I'm running out of time"- Dude, never enough time for ya, great speaker no doubt up there with Herb, Chandler and Andrea as well as some others.
17:00 Not only one instruction, LEA is using AGU instead of ALU unit for its calculations, when two instructions LEA and ADD are close each other both will be done same time as if it was 0.5 instruction for each. Of course if (and this is common) CPU have 3-4 ALU units 3-4 ADD, SUB will run same time (if independent) but if AGU used you can increase this pararelism up to 8
18:30 Reason why GCC still using LEA even for multiply cause it will be done in just one cycle, without LEA it would be faster to use SHL still than multiply witch takes 4 cycles usualy or SHR instead division witch uses 12 cycles usually
Actually at 58:36, compiler is perfectly free to replace this implementation of bogo-sort with call to qsort or something. Programs that don't terminate are ill formed. So compiler can assume there will be a progress, and the function will finish. There is no other observable state (mt is a local, and there is no other global state touched). So it must terminate and exit to ensure progress. So it knows that there is no other side effects, and on the exit the table is sorted. So sort it. :) It is kind of allowed optimization, but no sane compiler will actually do it, because it is essentially impossible to deduce the above logic by looking at this code, including stability of sort used, or which version is actually faster (it might depend on the size of the array possibly, and even the data values in the array itself).
This is partly why we need (more-)high-level concepts. They will allow to distinguish RNG from PRNG (which is handy for crypto) and qualify what is mean for array to pass `is_sorted`. With this replacing bogosort with `std::sort` is reasonable. So, yeah. We can make a lot of fantastic decisions and optmizations on compiler front end.
Programs that don’t terminate are perfectly legal. But suppose for the sake of argument that not terminating is undefined. That would just lead to the principle of explosion, of which the desired outcome is an infinitesimal subset. To be more explicit in a way that still incorporates some of the usual logic, why wouldn’t is_sorted be considered the culprit? Forcing it to be true in a case where the MT shuffle would never hit the right permutation is no more ridiculous than magically sorting the list (or setting your aunt’s hair on fire while still not terminating) in that same case. A theoretical super smart compiler might recognize that the defined case always results in a sorted list, and a simple way to omit the undefined case is to run the sorting algorithm there as well. Maybe an even smarter compiler could optimize the sort based on some property of bogosortable lists, and make even a charitable interpretation of undefined behavior break things again. Just don’t ever go there.
That highlights the two principled kinds of optimization: those that transform, and those that calculate (or deduce), cache, and leverage runtime results at compile time - the second kind is always problematic at the edge cases because the compiler is dictating that it has perfect information when it in practice doesnt (it is assuming the state of runtime memory within an abstract machine that has multiple actors that can change it) - there are entire keywords dedicated to making the second type of optimization less problematic but they dont solve the problem they only disable those problematic optimizations (ex: volatile)@@EllipticGeometry
Can one do fixed point arithmetic with 256 bit registers, I ended up using fixed point for a ray-trace sqrt algorithm for 32 bit code in masm. Probably around 2007.
Your avatar reminded me of how much I've been wanting to try Arch Linux on my desktop. Currently running openSUSE Tumbleweed on my laptop, and honestly, I freaking love it.
github.com/bad-alloc-heavy-industries/substrate/blob/master/substrate/fd#L192 - my answer to the insanity that is "just use the netcode library to conditionally 'fix' the endian-ness of your data once it's already too late".. and the great thing is, this compiles out entirely on a CPU with matching endian.. and compiles to an efficient swap instruction otherwise
I've been listening to these talks by Kevlin while trying to learn Rust, and I'm just thinking trying to write in Rust is like playing that crazy create password online game where any time you satisfy a condition, you get a new requirement until it agrees to compile. So basically at the end you're putting out fires protecting the egg and losing the workings of your already existing code and it never compiles. Especially if you're trying to combine features.
Wouldn't it be more appropriate if the CPU itself does most vectorization? With ever changing architectures and instruction set inflation it becomes increasingly pointless to target a specific subset. Also we are stuck since over 10 years in the time of saturating single core performance, it is time for the CPU itself to be able to auto-vectorize and auto-paralellize to give more performance. Maybe this job of mapping the code to the available hardware of GPU, CPU cores, co-processor, etc. will be pioneered in a virtual machine like WASM before a CPU manufacturer manages to deliver something.
I am not sure that is the way to go. (1) It is really hard to do in hardware and it would mean that you would need to devote lots of transistors on the chip to this kind of optimization. And that means that CPUs would become even more inefficient than they are now. Even today only a tiny fraction of the transistors on the CPU are dedicated to the actual computation. Pretty much all transistors are used to keep the processor pipeline moving: caches and cache-management, out-of-order execution, speculative execution and stuff like that. (2) Such optimizations cross the boundary to algorithms. IMHO Software should not rely on the hardware to fix stupid algorithmic design. There is another way to stay machine-independent while still using machine-dependent optimizations: Acceleration libraries. I am thinking about libraries like Intel IPP, MKL. You can write architecture-independent code and at program startup the libraries will automatically detect the CPUs capabilities and select the appropiate implementation.
One of the obstacles is, that the original C++ source target C++ abstract machine. After you compile it to the machine code of some target architecture like x64, you are throwing away all the original constraints and contracts defined by the C++ source and introducing the new set of constraints and contracts defined by the target-architecture. The machine code constraints are of course result-wise compatible with the original constraints to ensure that the result of machine code is correct under the C++ abstract machine considerations, but you can't reconstruct the original set of constraints from the machine code, you would end with lot more complex (more strict) requirements. But the CPU does see only the machine code, so it would have to vectorize against the lot more harsh contract, often ending with less optimal code than the compiler could produce, as compiler understands better the original intent of programmer (I mean the intent encoded in the C++ source, not the true intent of programmer, that will be revealed later over multiple debugging sessions after the programmer will realize how to write his original intent correctly in C++).
@@compuholic82 I didn't say it has to be done with specific CPU logic circuitry, just by the CPU from the code perspective. Doing it with hardwired logic would be a rather stupid idea for such a complicated and constantly changing problem. I implied by mentioning virtual machines that it might just be that, a virtual machine running software doing the machine specific mapping of a abstract machine to the actual computing resources. The CPU HW only has to allow for this virtualization, as it is already to some degree, the rest is SW. And the computation for this can be partly pooled, put into a cloud, as there is significant homogeneity in computational resource architectures.
There is no need to do this within the CPU. Just run the compiler on some kind of intermediate code as you install a program, so the compiler can optimize for the CPU. I believe Android does this.
yeah, compiling all your code on the user's machine at program startup under extreme time constraints using a 100 megabyte runtime that ships with your program just so you can inline virtual functions 25% more frequently sounds cool and all... until you compare it with something that generates better code in every other case, once, ahead of time, and doesn't require a runtime.
That does not sound cool to me, but to each their own I guess... But, just talk me through your reasoning here. So you have a program which needs to respond quickly, and you've picked a JIT which compiles on startup, but then you have also decided to start it up every time you need it? Why? Either way, you can't fanboy over C++ in those "extreme" situation either, because that is C territory.
@@ThePlacehole what are you even saying? my entire comment was sarcastic. i \*don't\* like JIT. i think it's a stupid idea. how does it follow that if you want your program to "respond quickly" that you would use a JIT? also yes, every time the program itself STARTS, the JIT has to recompile it. unless .NET or whomever has some stupid bytecode cache thing. i wouldn't know, i'm not a pajeet, and i'm most certainly not "fanboying" over sepples. my profile picture is literally yuki nagato holding up the k&r book.
@@loli42 I am trying to say that your hypothetical situation does not really detract from JITs, because using a JIT in that way is retarded. Also that the only place where technologies have no negatives is the Magical Fanboy-land.
@@ThePlacehole using a JIT in what way?? the only way to use a JIT is to compile just-in-time, i.e. when the program starts up! in your hypothetical situation, does a JIT just never fucking compile the program? you know it takes TIME TO COMPILE A PROGRAM RIGHT??? it's like i'm talking to a fucking serial killer.
Peter Mortensen first and foremost: either the mic or the audio editing of the video are far from being ok. The “volume” (i.e. the level) is far too low. Secondly, the speaker occasionally bursts into loud sentences or laughter and then all of a sudden starts murmuring something to himself as if he wasn’t addressing an audience. That’s extremely disturbing and spoils the otherwise enjoyable and insightful content. I spent most of the time riding the volume control.
I listened to most of it with half my attention, but honestly the sound is fine for me, granted I usually listen to things at a comfortable mild level, never loud, so the only times I complain about sound is when something is too low to be actually discernable
"Oh Golly I'm running out of time"- Dude, never enough time for ya, great speaker no doubt up there with Herb, Chandler and Andrea as well as some others.
Go on, funny verb man, blow my mind.
17:00 Not only one instruction, LEA is using AGU instead of ALU unit for its calculations, when two instructions LEA and ADD are close each other both will be done same time as if it was 0.5 instruction for each. Of course if (and this is common) CPU have 3-4 ALU units 3-4 ADD, SUB will run same time (if independent) but if AGU used you can increase this pararelism up to 8
18:30 Reason why GCC still using LEA even for multiply cause it will be done in just one cycle, without LEA it would be faster to use SHL still than multiply witch takes 4 cycles usualy or SHR instead division witch uses 12 cycles usually
Actually at 58:36, compiler is perfectly free to replace this implementation of bogo-sort with call to qsort or something. Programs that don't terminate are ill formed. So compiler can assume there will be a progress, and the function will finish. There is no other observable state (mt is a local, and there is no other global state touched). So it must terminate and exit to ensure progress. So it knows that there is no other side effects, and on the exit the table is sorted. So sort it. :)
It is kind of allowed optimization, but no sane compiler will actually do it, because it is essentially impossible to deduce the above logic by looking at this code, including stability of sort used, or which version is actually faster (it might depend on the size of the array possibly, and even the data values in the array itself).
This is partly why we need (more-)high-level concepts.
They will allow to distinguish RNG from PRNG (which is handy for crypto) and qualify what is mean for array to pass `is_sorted`. With this replacing bogosort with `std::sort` is reasonable.
So, yeah. We can make a lot of fantastic decisions and optmizations on compiler front end.
Programs that don’t terminate are perfectly legal. But suppose for the sake of argument that not terminating is undefined. That would just lead to the principle of explosion, of which the desired outcome is an infinitesimal subset. To be more explicit in a way that still incorporates some of the usual logic, why wouldn’t is_sorted be considered the culprit? Forcing it to be true in a case where the MT shuffle would never hit the right permutation is no more ridiculous than magically sorting the list (or setting your aunt’s hair on fire while still not terminating) in that same case. A theoretical super smart compiler might recognize that the defined case always results in a sorted list, and a simple way to omit the undefined case is to run the sorting algorithm there as well. Maybe an even smarter compiler could optimize the sort based on some property of bogosortable lists, and make even a charitable interpretation of undefined behavior break things again. Just don’t ever go there.
That highlights the two principled kinds of optimization: those that transform, and those that calculate (or deduce), cache, and leverage runtime results at compile time - the second kind is always problematic at the edge cases because the compiler is dictating that it has perfect information when it in practice doesnt (it is assuming the state of runtime memory within an abstract machine that has multiple actors that can change it) - there are entire keywords dedicated to making the second type of optimization less problematic but they dont solve the problem they only disable those problematic optimizations (ex: volatile)@@EllipticGeometry
What a wonderful website and introduction! Thanks for the resource.
I am in fact, using this to help write an interpreter.
I learned enough in my compiler class in university 30 years ago to know I didn't ever want to make one.
The easiest compilers to make are Lisp macros if you think about it...
28:45 he meant to say 256 bit wide registers, right? 32 bytes are 256 bits
52:34 “GCC has something called speculative virtualization…” **crowd member giggles** “yeah!”
Can one do fixed point arithmetic with 256 bit registers, I ended up using fixed point for a ray-trace sqrt algorithm for 32 bit code in masm. Probably around 2007.
His name is GODBOLT
Great talk.
3:21 is really a good joke :D
All this time I thought int was defined as the machine word size. I guess not.
Assembly is scary?
Nah x86 is .
Your avatar reminded me of how much I've been wanting to try Arch Linux on my desktop. Currently running openSUSE Tumbleweed on my laptop, and honestly, I freaking love it.
@@ominous-omnipresent-they
Happy to hear that .
"Compilers are awesome"
me writing mostly with the 65C02:
:(
But the 65C02 is awesome.
@@rabidbigdog My first CPU.
Assembly language is language
github.com/bad-alloc-heavy-industries/substrate/blob/master/substrate/fd#L192 - my answer to the insanity that is "just use the netcode library to conditionally 'fix' the endian-ness of your data once it's already too late".. and the great thing is, this compiles out entirely on a CPU with matching endian.. and compiles to an efficient swap instruction otherwise
Wow!
we should have a T-shirt: Compilers are awesome
I've been listening to these talks by Kevlin while trying to learn Rust, and I'm just thinking trying to write in Rust is like playing that crazy create password online game where any time you satisfy a condition, you get a new requirement until it agrees to compile. So basically at the end you're putting out fires protecting the egg and losing the workings of your already existing code and it never compiles. Especially if you're trying to combine features.
This depends on your skill level. You can build exactly what you want as you know the language more and more
Terry Davis has entered the chat.
RIP Terry, those glowies can't get to you now.
Who is terry ?
@@xrafter best programmer to ever live
@@lucianoosinaga2980
Wow.
Friends don't let friends use Windows for serious work.
Wouldn't it be more appropriate if the CPU itself does most vectorization? With ever changing architectures and instruction set inflation it becomes increasingly pointless to target a specific subset.
Also we are stuck since over 10 years in the time of saturating single core performance, it is time for the CPU itself to be able to auto-vectorize and auto-paralellize to give more performance.
Maybe this job of mapping the code to the available hardware of GPU, CPU cores, co-processor, etc. will be pioneered in a virtual machine like WASM before a CPU manufacturer manages to deliver something.
I am not sure that is the way to go. (1) It is really hard to do in hardware and it would mean that you would need to devote lots of transistors on the chip to this kind of optimization. And that means that CPUs would become even more inefficient than they are now. Even today only a tiny fraction of the transistors on the CPU are dedicated to the actual computation. Pretty much all transistors are used to keep the processor pipeline moving: caches and cache-management, out-of-order execution, speculative execution and stuff like that.
(2) Such optimizations cross the boundary to algorithms. IMHO Software should not rely on the hardware to fix stupid algorithmic design.
There is another way to stay machine-independent while still using machine-dependent optimizations: Acceleration libraries. I am thinking about libraries like Intel IPP, MKL. You can write architecture-independent code and at program startup the libraries will automatically detect the CPUs capabilities and select the appropiate implementation.
@@compuholic82 Don't forget about ISPC!
One of the obstacles is, that the original C++ source target C++ abstract machine. After you compile it to the machine code of some target architecture like x64, you are throwing away all the original constraints and contracts defined by the C++ source and introducing the new set of constraints and contracts defined by the target-architecture. The machine code constraints are of course result-wise compatible with the original constraints to ensure that the result of machine code is correct under the C++ abstract machine considerations, but you can't reconstruct the original set of constraints from the machine code, you would end with lot more complex (more strict) requirements.
But the CPU does see only the machine code, so it would have to vectorize against the lot more harsh contract, often ending with less optimal code than the compiler could produce, as compiler understands better the original intent of programmer (I mean the intent encoded in the C++ source, not the true intent of programmer, that will be revealed later over multiple debugging sessions after the programmer will realize how to write his original intent correctly in C++).
@@compuholic82 I didn't say it has to be done with specific CPU logic circuitry, just by the CPU from the code perspective.
Doing it with hardwired logic would be a rather stupid idea for such a complicated and constantly changing problem.
I implied by mentioning virtual machines that it might just be that, a virtual machine running software doing the machine specific mapping of a abstract machine to the actual computing resources.
The CPU HW only has to allow for this virtualization, as it is already to some degree, the rest is SW.
And the computation for this can be partly pooled, put into a cloud, as there is significant homogeneity in computational resource architectures.
There is no need to do this within the CPU. Just run the compiler on some kind of intermediate code as you install a program, so the compiler can optimize for the CPU. I believe Android does this.
Yeah, compile-time speculative inlining of virtual functions sounds cool and all... Until you compare it to any JIT ever.
yeah, compiling all your code on the user's machine at program startup under extreme time constraints using a 100 megabyte runtime that ships with your program just so you can inline virtual functions 25% more frequently sounds cool and all... until you compare it with something that generates better code in every other case, once, ahead of time, and doesn't require a runtime.
That does not sound cool to me, but to each their own I guess...
But, just talk me through your reasoning here. So you have a program which needs to respond quickly, and you've picked a JIT which compiles on startup, but then you have also decided to start it up every time you need it? Why?
Either way, you can't fanboy over C++ in those "extreme" situation either, because that is C territory.
@@ThePlacehole what are you even saying? my entire comment was sarcastic. i \*don't\* like JIT. i think it's a stupid idea. how does it follow that if you want your program to "respond quickly" that you would use a JIT? also yes, every time the program itself STARTS, the JIT has to recompile it. unless .NET or whomever has some stupid bytecode cache thing. i wouldn't know, i'm not a pajeet, and i'm most certainly not "fanboying" over sepples. my profile picture is literally yuki nagato holding up the k&r book.
@@loli42 I am trying to say that your hypothetical situation does not really detract from JITs, because using a JIT in that way is retarded. Also that the only place where technologies have no negatives is the Magical Fanboy-land.
@@ThePlacehole using a JIT in what way?? the only way to use a JIT is to compile just-in-time, i.e. when the program starts up! in your hypothetical situation, does a JIT just never fucking compile the program? you know it takes TIME TO COMPILE A PROGRAM RIGHT??? it's like i'm talking to a fucking serial killer.
Lol Haskell is actually closer to what the CPU does than C with out-of-order execution.
c++ on C😂
The laughter is super creepy
-funsafe-laughter
it's not, WTH
Big congrats for the hideous audio level. The guy is perhaps a C++ guru, but he doesn't know how to speak properly
Peter Mortensen first and foremost: either the mic or the audio editing of the video are far from being ok. The “volume” (i.e. the level) is far too low. Secondly, the speaker occasionally bursts into loud sentences or laughter and then all of a sudden starts murmuring something to himself as if he wasn’t addressing an audience. That’s extremely disturbing and spoils the otherwise enjoyable and insightful content. I spent most of the time riding the volume control.
I listened to most of it with half my attention, but honestly the sound is fine for me, granted I usually listen to things at a comfortable mild level, never loud, so the only times I complain about sound is when something is too low to be actually discernable
Doru Min very glad it works for you. You clearly have different standards when it comes to audio quality.
@@Dorumin The audio was fine for me as well.
I see what you mean, I wouldn't necessarily complain but it is indeed an inconvenience