How, in assembly language, does one write a function that takes two or more arguments and returns a result? And how does one afterwards call that function from other languages, such as Python, C++, C#, or F#?
Well, there’s videos on here for doing some of these things, mostly very old videos. Calling from C++ is easier than C#, and I found that writing a wrapper in C++ and then calling that from C# was maybe the best way? C# just has a lot of extra type safety and memory management issues you have to work with.

To call native code from C#, I have found it convenient to compile the native code to a DLL. Then you use something like ‘interopservices’ and ‘importdll’ from C# to import the functions you want to use. Something like that, you’d have to look up the details.

As for calling native ASM from C++, there’s a lot of ways. If you’re in 32 bit, you can code inline ASM. If you’re in x64, it’s a little trickier, but a lot of the videos on my channel here involve calling ASM from C++, so maybe if you have a look at one of the early ASM and C++ vids, you will see one way to do this. I’m pretty sure we did this in the very first video I uploaded.

You might want to try assembling to a library file, either LIB or DLL, and linking to that in your C++. I do not usually do this in these videos because they’re usually just little code snippets, but in a real project it helps to set things out like that. Then you’re looking for how to call a native DLL from C++, which is bound to get plenty of results on googs.

As for Python and F#, I must say I have no idea, sorry. Hope this helps, have a good one :)
Thank you. I've figured out how to do some of that. I have DLLs created from legacy code written in Fortran that I then call from C# using interop services. But in that situation, the Fortran compiler made the DLL for me, so I didn't learn much, if anything, about the internal layout of the DLL in the process. Do you have any link to clear instructions on how to code a DLL from scratch in pure assembly?
Excellent video, cheers for the upload. I wish there were conditional moves back in the day with the 6502/10, then again, I like spaghetti. For the count-the-set-bits question, using that 8-bit 6510, I would look at bit shifting, e.g. ASL of the value being examined (split over 4 bytes), examining the carry flag and incrementing a counter if set. I wrote the following as one way to do the job.

        ldy #4          ;4 bytes
lp0     lda Data-1,y
        ldx #8          ;8 bits
lp1     asl
        bcs BitSet
lp2     dex
        bne lp1
        dey
        bne lp0
        rts
BitSet  inc Count
        bne lp2         ;faster than jmp, and ok to use as long as Count never wraps to 0
                        ;faster if Count is in zero page
Count   byte $0
Data    byte %11110000,%00000111,%10101010,%11100111
@@rsa5991 That's a really good idea. Could you provide a working example for the 4 bytes? My brain was too sore to write an optimised version for the 4 bytes.
@@ChrisM541 I don't have any 6502 tools, so I cannot confirm it working, but:

        ldy #4          ;bytes
        lda #0          ;count
byteLoop
        ldx Data-1,y
        stx 0           ;use zero page to store current byte
        asl 0           ;"prime" the loop
bitLoop
        adc #0          ;spoils ZF, so should be before ASL
        asl 0           ;sets CF for ADC, ZF on last '1' shifted out
        bne bitLoop
        adc #0          ;count last '1'
        dey
        bne byteLoop
        rts             ;reg A holds the result

UPDATE: I checked on an online emulator; it seems to be working. Runs in 312 cycles vs yours at 506.
@@rsa5991 Fantastic! Works perfectly! I've got CBM Prg Studio installed; I'm working on an old platformer CDU magazine entry I submitted a 'wee while' ago - wish I had that utility then. Always nice to see different ways to solve problems, cheers.
Okay, adding two numbers together is easy, but that's not really helpful, is it? What we want isn't to read the result off in the debugger, or to change our code to change which numbers to add. In other words, I/O is missing. I know that to deal with I/O you use syscalls or system interrupts, or on embedded devices you access the mapped memory of attached peripherals and read/write values at the addresses which map to those devices, possibly in response to an interrupt.
I think the important takeaway here is that if you haven't experienced the pain of writing your own assembler for a fictional CPU, you don't truly know the assembly meta.
Your second scheme is what I refer to as "converting algorithmic operation to arithmetic operation", eliminating branches. The routine is "straight-line code". The complexity of code is proportional to the number of branches in it.

Your scheme for calculating the number of 1's in a number only works if you have the special instruction. Without one, I have a scheme for determining if the number has one or fewer 1's in it: copy the number to another register, decrement one of the numbers, then do a bitwise AND between the 2 numbers. If the result is zero, the number had one or fewer 1's.

ROUNDING CRAP: get RID of floating point math! Floating point math belongs only in hastily "slapped together" programs written to get a quick answer. Most programmers are too lazy to properly scale their numbers.

For the sort: as you explain in your sort videos, there are 3! = 6 possible outcomes. Do 3 compares, say 1&2, 2&3, & 1&3. After each compare, shift the carry flag (set if 1st arg >= 2nd arg) into a precleared register with SHIFT LEFT (with carry). You have a 3-bit number (8 possible outcomes, of which 6 are "legal"). Use this to index into a look-up table of the swaps required. Put the index of the "get" of the swap into the table of 8 entries. Example: let's say: A>B, A
Your passion for assembly language is hugely inspiring!
Thank you my friend!! Cheers for watching :)
Agreed.
I seem to learn more from people with a temperament like yours: jovial, unassuming in instruction. Thanks for keeping the topic straightforward, short and sharp
Finally, somebody does some real programming.
Glad to see you laying some of the groundwork for assembly.
Highly underrated channel! Keep up the good work!
I could watch videos like these all day. Great way to learn something new and refresh my memory. Awesome video!!!
To sort 3 numbers, I'd rather suggest this method, which requires 6 operations (instead of 8 in the video):
min3 = min( min(a,b), c );               //2 operations
max3 = max( max(a,b), c );               //2 operations
med3 = max( min(a,b), min(c,max(a,b)) ); //4 operations, but can be done in 2 operations instead!
As the temporary results of "min(a,b)" and "max(a,b)" can be kept from the first steps in registers, this method requires only 6 min/max operations!!!
(BTW, no issue any more with floating point precision)
Well, yes, but actually no. The XOR solution is faster, since the 2 serial xors that come after finding min3 and max3 can happen in less time than the 2 equivalent min/max ops, (latency of fp min/max = fp add/sub = 4, latency of xor = 1). The initial 2 xors can happen concurrently with the min/max ops on the third unused vector port. For the same reason using XOR also reduces load on the 2 vector min/max capable ports and enables faster looping of the whole sequence, although the difference in reality is minimal (8/3 vs. 6/2 cycles throughput - about 11% faster).
The sort3() function via XOR is very neat, you could in fact do it in scalar using 64-bit integer regs!
The scariest part when sorting fp numbers happens when you have infinities or NaNs:
I.e. the add-the-three-then-subtract-the-min/max trick fails completely with a single Inf or a single NaN, even though the ordering is at least defined when you have a mix of regular values and a single Inf.
With NaN, all comparisons return false!
I'm always happy to see you posting
So I found your content because I'd like to start with an understanding of how computers work. I just started out learning Java as well as assembly. I don't do it for commercial reasons but out of fascination. Thank you for your work! Greetings from Switzerland 🍾
BTW: I have no experience in programming/computer architecture, but I built an 8-bit calculator in Minecraft (Redstone). This gave me a huge fascination with core concepts in CS and EE. Highly recommend this!
I highly recommend Ben Eater if you're interested in the nitty-gritty details of the hardware. I built a redstone computer as well as a physical one based on what I learned from his videos!
@@Eidolon2003 I actually was watching his "Hello, World from scratch" video about a week ago :D. Oh wow, a physical as well? How was the process?
@@change_profile_n8755 It was a long but very rewarding process. Being able to say that I fully understand how it functions is really cool, especially since I didn't really stick to Ben's design at all. His is really simple, but not very capable tbf. I think knowing how a simple computer like that works makes it easier to understand how a modern x86/64 system works too. Honestly modern computers never cease to blow me away. They're so complex it's insane lol
I love your videos, I can just watch them without having to think too much but still learn a lot
I've been learning the open-source RISC-V assembly. Liking it so far.
Also, keep up the good work.
Just wanted to say a huge thanks for your videos about modern x64 assembly. I'm a teacher, I'm preparing a course about x64 assembly, and your videos helped me a lot. Many thanks. Subscribed to the channel.

Concerning the side effect of 32-bit operations on the high 32 bits of 64-bit registers: after doing some research, it seems that there are no physical RAX, RBX, etc. registers, but there is a bank of registers, and registers are allocated to instructions and then merged when instructions complete... maybe for optimisation reasons it is faster to just put zeros in the 32 high bits... but indeed it's a strange effect.
This was really interesting. I've only ever come across assembly in the Linux kernel's architecture-dependent code. It looks like you need to be a certain kind of masochist to enjoy the challenge of writing actual problem-solving code in assembly... I should give it a try...
There's another concern with the min/max/subtract technique for sorting 3 numbers that is worse than losing floating point precision: if all three floating point numbers are close enough to the absolute maximum representable value, adding them will overflow.
Not sure what an overflow looks like with floating points, but if it's like with integers you'll get something very wrong in the end.
In any case, thanks for the video, that's interesting. I'd be for a follow-up with more usual patterns and tricks. And maybe another video about ARM and RISCV assembly at some point?
An overflow in floating point gives you an Inf, and subsequent operations on it (like Inf - Inf) can then turn into a NaN
Oh, just discovered CMOV 😢 That would have been extra useful when I was doing CS. The worst part is that I read through helppc at the time to find useful mnemonics we didn't learn. I don't know how I missed that one! That shows how the basics can benefit everyone ^^
CMOV (on x86 CPUs) is very rarely a win! The branch predictors are so good that it is _almost_ always faster to simply load the one possible return value, then branch over a single instruction that loads the alternative:
;; EAX has a, EBX has b, return the smaller of a & b:
cmp eax,ebx
jl done
mov eax,ebx
done:
When EAX is the prevalent smaller value, the CPU will predict this correctly and run the entire block in a single cycle (or even less if there is some other work which can overlap).
With EBX being the return value we also have to execute the MOV, but this can be done in the renamer and so doesn't actually take any cycles! 🙂
The CMOV version will always take the same number of cycles, typically 2 or 3. (There are other architectures where CMOV is much faster, sometimes down to one or zero cycles.)
The renamer can take extra time depending on surrounding code, so it’s not always free
@@Double-Negative Sure, I thought that was clear from the way I wrote it, but I see now that it wasn't. Anyway, absolute worst case a MOV REG,REG takes a single cycle unless the CPU is from before about 1992 (Intel 486). 🙂
Reminder of XOR⊕ property:
a⊕a=0
a⊕b=b⊕a
(a⊕b)⊕c=a⊕(b⊕c)=a⊕b⊕c
a⊕0=a
So, if we call the 3 registers a, b and c, and the min and max m and M respectively, we have the following expressions:
a⊕b⊕c (xor the 3)
(a⊕b⊕c)⊕m⊕M (and xor with min and max)
Let's say m=a and M=c (it could be any pair); then the expression becomes:
(a⊕b⊕c)⊕a⊕c
Per the above properties we can remove the parentheses:
a⊕b⊕c⊕a⊕c
We can reorder and group terms:
(a⊕a)⊕b⊕(c⊕c)
which reduces to:
0⊕b⊕0
Which evaluates to:
b
We need to get all the language experts in a room (you being one of them) and create another assembly abstraction like C, but with modern memory protection and better/modern op representation built into the syntax, while still being a "mid/low level" structured, typed, functional programming language that closely represents the codegen.
ok
You're not gonna get a functional programming language if you're abstracting assembly. You'd be better off making a procedural language because it fits the architecture more, but C already exists, so I don't see the point.
Fascinating stuff. Love your videos, I'm always impressed by your knowledge and find this all VERY interesting! Keep up the good work, thanks. 🙂
Cheers mate! Thanks for watching :)
Using Intel syntax:
1. fast addition, 16-bit instructions:
LEA bx, [bx+si] ; no memory access, no flags touched, result has to fit the target
32-bit:
LEA ecx, [ecx+eax]
Great content, always keeping me stoked for the next video. For clarity's sake, don't you think you should update the leftover comments that still state that XORing "sums" or "subtracts"? You even sinfully say it out loud. 🙂 It accumulates and extracts, which is good enough and just what we want, but far from adding or subtracting. Keep it up pleeeease! 👍
Excellent stuff, amazing channel!!!!!
Nice video! I am probably too late to the party, but I think you didn't answer question 3. The question was how to count bits in a dword on an 8086 (16-bit processor). No fancy bit-count instructions there.
A drawback of these sorting methods is they can't be applied if there are not only "keys" but also values which should be sorted along with those keys. Plain old bubble sort in that case, I presume...
Nice video nevertheless!
You've been teaching this chem teacher to code for years now. Cheers! One question I have: How would you sort *four* numbers in asm?
Ha! I reckon BubbleSort would do the trick :)
Another potential problem with the additive method is the potential for overflow. Sure, it's not likely with three values, but it is possible. What instruction format do you prefer (AT&T, Intel, NASM)? Over the years I've found I'm liking NASM more.
Excellent!
Just a note for those implementing FPN comparisons via binary: treat the sign, exponent & mantissa as separate comparisons, and remember that for negative numbers a larger exponent/mantissa means a *smaller* value:

int cmpf( fpn a, fpn b )
{
	int sigA, sigB, expA, expB, r;
	intmax_t manA, manB;
	/* Extract info */
	...
	if ( sigA != sigB )
		return sigB - sigA; /* negative < positive */
	if ( expA != expB )
		r = (expA < expB) ? -1 : 1;
	else
		r = cmp(manA,manB,bits);
	return sigA ? -r : r; /* flip the ordering when both are negative */
}

fpn minf( fpn a, fpn b ) { return cmpf(a, b) < 0 ? a : b; }
fpn maxf( fpn a, fpn b ) { return cmpf(a, b) > 0 ? a : b; }

Doing it that way avoids the possibility of incorrect return values (provided I got the signs the right way round in cmpf)
I have a question about ARM Assembly. If you use malloc, will the kernel try to give you a pointer that is 8-bit-rotatable (i.e. one that can be loaded into a register using a single instruction)?
I don't even wanna learn assembly, but I still watch your videos as they make me feel smart.
What happened to the FADD (float add) instruction? Why do we use SIMD all the time for one floating-point value?
It's faster, and easier, to use the SIMD registers.
Hey man you got nice skills and look & talk similar to Mr.rocky balboa ! 👍🙂
Does a jump always have to immediately follow a cmp? Or could you execute some other instructions in-between?
Some instructions don't affect the flags, so you can execute some instructions in between. MOV, for example, doesn't change the flags. Usually the CMP and Jcc are close by, though.
Usually on x86 the answer is yes, but ultimately it depends on the instructions you're using. On ARM, RISC-V, and MIPS you can do whatever you want in between.
If you look at compiler output, jumps are often put far after cmps or other flag altering instructions! I’ve seen a loop where the comparison was a SUBS instruction at the very top of the loop like 40 instructions before the branch.
I think compilers strive for this because it means the comparison result is fully resolved by the time the branch executes, so a misprediction can be detected and repaired as early as possible.
Imagine: since the Intel Core 2 architecture we can execute 4 integer instructions in parallel, if there are no dependencies between them and the code has a good mixture of complex and simple instructions in the pipelines. This is not a pure CISC CPU; it is a mixture of RISC and CISC. The CPU splits complex x86 instructions into micro-ops to execute on some of the RISC-like units.
@@pyromen321 Could that actually become counter-productive at some point? I.e., could it happen that by the time you reach the jmp, the cmp has been evicted from the instruction cache?
x86 conditional jump instructions:
for unsigned values
JA jump above
JB jump below
...
for signed values
JG jump greater
JL jump less
I've got a question about assembly.
If you have a loop with a known, fixed size, does it execute faster if written out line by line (unrolled), even if only marginally so?
So what about writing an application like FreeCAD or Autodesk Fusion 360 in assembly language?
I would personally write the forms, buttons, and front end in C++ or C#, and just keep ASM for the number crunching. I'm not sure I have the engineering skill to organize a very large-scale, 100% assembly project like that!
It would certainly be a challenge :)
RollerCoaster Tycoon comes to mind!
That would be a massive drain of time and effort: not much to gain and much to lose. Anything written directly in assembly has to be important enough to justify being written in assembly.
Also, fp errors are "weird", as the gap between "consecutive" numbers just widens like crazy as you get far from 0 (expected, since the mantissa has finite precision ^^).
I think fp introduces too much weirdness because of that, and it can be a big hurdle for beginners.
Oh yes, I remember when I was a young lad and started writing my first code in AutoIt and trying to figure out what ASM is, and I was like... "WTF are these? Are they just there for moving numbers around, adding and subtracting them? What for?" as I was trying to create a program with a nice UI and messageBox and stuff... I'm pretty sure there are many people out there having the same question when looking at ASM at first ;) One day it just clicks. I still have no idea what I'm doing with ASM most of the time, but I can read and understand some parts of it.
How do compilers generate object code which can run on the variety of AMD64 family CPUs?
There are so many variants which have extended complex action opcodes, how can the compiler know when to use those opcodes? I know there are compiler flags but in software distribution it’s impossible to know ahead of time which CPU instructions are supported. How is this handled at runtime?
There is a CPUID instruction that returns information about supported instruction sets. You can patch your code at the start of the program. You can also compile several versions and make the installer pick one depending on CPUID.
But the default behavior is to target a sufficiently old CPU, and simply crash if an even older one is present.
If we can use xor to get perfect float math, why don't all CPUs just always do that? Why is floating point error still a problem we have to deal with? Even assuming that trick doesn't work with multiply and divide, making add and subtract perfect would be amazing.
I love assembly, the problem is that it gets too addictive and you want to use it everywhere LoL..
Could you possibly do an explanation of how to call C library functions like puts() from assembly, or maybe just link to a guide with the correct answer? I've found a couple different guides online and I couldn't get it to work for one reason or another. I'm just not experienced enough to know why. I'm using VS2022 btw
he covered this 11 years ago
th-cam.com/video/txFXiFafTTc/w-d-xo.html
What’s your day job? Are you a systems engineer? Or do you contribute to open source projects, is that ebook just for patreons or can anyone read it?
12:55, nah if I was to code that I would've skipped eax completely:
mov ecx, 17
mov edx, 32
cmp ecx, edx
cmovl edx, ecx
ret
19:17, I thought you would do x = min(a,b), y = max(b,c), z = (a+b+c)-(x+y)
**Edit:** Gave it more thought and noticed a scenario where the wrong answer would be given, I'll leave finding that as a thought exercise for peops who care
How, in assembly language, does one write a function that takes two or more arguments and returns a result? And how does one afterwards call that function from other languages, such as Python, C++, C#, or F#?
Well, there are videos on here for doing some of these things, mostly very old ones. Calling from C++ is easier than from C#, and I found that writing a wrapper in C++ and then calling that from C# was maybe the best way? C# just has a lot of extra type safety and memory management issues you have to work with.
To call native code from C#, I have found it convenient to compile the native code to a DLL. Then you use something like 'System.Runtime.InteropServices' and a '[DllImport]' attribute from C# to import the functions you want to use. Something like that, you’d have to look up the details.
As for calling native ASM from C++, there’s a lot of ways. If you’re in 32 bit, you can code inline ASM. If you’re in x64, then it’s a little trickier, but a lot of the videos on my channel here involve calling ASM from C++, so maybe if you have a look at one of the early ASM and C++ vids, you will see one way to do this. I’m pretty sure we did this in the very first video I uploaded.
You might want to try assemble to a library file, either LIB or DLL, and link to that in your C++. I do not usually do this in these videos because they’re usually just little code snippets, but in a real project it helps to set things out like that. Then you’re looking for how to call a native DLL from C++, which is bound to get plenty of results on googs.
As for Python and F#, I must say I have no idea sorry.
Hope this helps, have a good one :)
Thank you.
I've figured out how to do some of that. I have DLLs created from legacy code written in Fortran that I then call from C# using interop services. But in that situation, the Fortran compiler made the DLL for me, so I didn't learn much, if anything, about the internal layout of the DLL in the process.
Do you have any link to clear instructions on how to code a DLL from scratch in pure assembly?
Excellent video, cheers for the upload. I wish there were conditional moves back in the day with the 6502/10; then again, I like spaghetti.
For the count-the-set-bits question, using that 8-bit 6510, I would look at bit shifting, e.g. ASL of the value being examined (split over 4 bytes), examining the carry flag and incrementing a counter if set. I wrote the following as one way to do the job.
ldy #4 ;4 bytes
lp0 lda Data-1,y
ldx #8 ;8 bits
lp1 asl
bcs BitSet
lp2 dex
bne lp1
dey
bne lp0
rts
BitSet inc Count
bne lp2 ;faster than jmp and ok to use as long as Count never wrapped to 0
;faster if below in zero page
Count byte $0
Data byte %11110000,%00000111,%10101010,%11100111
Instead of conditional jumping, you can use ADC #0 to add CF to A.
@@rsa5991 That's a really good idea. Could you provide a working example for the 4 bytes? - my brain was too sore to write an optimised version for the 4 bytes.
@@ChrisM541 I don't have any 6502 tools, so I cannot confirm it working, but:
ldy #4 ;bytes
lda #0 ;count
byteLoop ldx Data-1,Y
stx 0 ;use zero page to store current byte
asl 0 ;"prime" the loop
bitLoop adc #0 ;spoils ZF, so should be before ASL
asl 0 ;sets CF for ADC, ZF on last '1' shifted out
bne bitLoop
adc #0 ;count last '1'
dey
bne byteLoop
rts ; reg A holds the result
UPDATE: Checked on an online emulator, seems to be working. Runs in 312 cycles vs. your 506.
@@rsa5991 Fantastic! works perfectly! - I've CBM Prg Studio installed, working on old platformer CDU magazine entry I submitted a 'wee while' ago - wish I had that utility then.
Always nice to see different ways to solve problems, cheers.
Okay, adding two numbers together is easy. but that's not really helpful, is it? What we want isn't to read the result off in the debugger, or to change our code to change which numbers to add. In other words, I/O is missing. I know to deal with I/O you use syscalls, system interrupts or in embedded devices access the mapped memory of attached devices and read/write values to specified addresses which map to those devices, possibly in response to an interrupt.
Do you have Irish in your family??
I think the important takeaway here is that if you haven't experienced the pain of writing your own assembler for a fictional CPU, you don't truly know the assembly meta.
Your second scheme is what I refer to as "converting algorithmic operations to arithmetic operations", eliminating branches. The routine is "straight-line code". The complexity of code is proportional to the number of branches in it.
Your scheme for calculating the number of 1's in a number only works if you have the special instruction. Without one, I have a scheme for determining whether a number has one or fewer 1's in it: copy the number to another register, decrement one of the copies, then do a bitwise AND between the two. If the result is zero, the number had one or fewer 1's.
ROUNDING CRAP: get RID of floating point math! Floating point math belongs only in hastily "slapped together" programs written to get a quick answer. Most programmers are too lazy to properly scale their numbers.
For the sort: as you explain in your sort videos, there are 3! = 6 possible outcomes. Do 3 compares, say 1&2, 2&3, and 1&3. After each compare, SHIFT LEFT the carry flag (set if the 1st arg >= the 2nd arg) into a precleared register. You end up with a 3-bit number (8 possible outcomes, of which 6 are "legal"). Use this to index into a look-up table of the swaps required, putting the "get" index of each swap into the 8-entry table.
Example: let's say : A>B, A
But that comment is 4 months old
Damn mate it's nice to have you back but you put on some weight. Please don't let it get any worse!
Ha! Sure did! :)