Assembly, System Calls, and Hardware in C++ - David Sankel - CppNow 2023

  • Published on 11 Jan 2025

Comments • 77

  • @MattGodbolt 1 year ago +74

    One day I'll understand what goes on, on the right hand side of that web site!
    Great talk!

    • @WilhelmDrake 1 year ago +5

      So, I'm not the only one?! Maybe I'm not so hopeless after all!

  • @chrzan6101 1 year ago +45

    Great talk! David Sankel is great at presenting and keeping listeners engaged.

  • @singhharmeet 3 months ago

    I'm a fan of David. He is helping me overcome my fear of C++ and assembly. Love how his talks always have a lively audience and a lot of humor going on. Great talk, sir.

  • @flamewingsonic 1 year ago +11

    @26:00 There is, indeed, an error in the slide; but it is not whether R12 needs to be saved by caller or callee. Instead, argument 6 is passed in R9, not R12 (see figure 3.4 on page 21 of System V Application Binary Interface AMD64 Architecture Processor Supplement Draft Version 0.99.6).
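
For reference, the integer-argument order described there can be written down as a small lookup. This is only a sketch: the names `kSysVIntegerArgRegs` and `sysv_integer_arg_register` are made up for illustration and do not come from any ABI header.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Registers used for the first six integer/pointer arguments in the
// System V AMD64 calling convention, in argument order.
constexpr std::array<std::string_view, 6> kSysVIntegerArgRegs = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"};

// Hypothetical helper: where does the Nth (1-based) integer argument go?
// Arguments beyond the sixth are passed on the stack.
constexpr std::string_view sysv_integer_arg_register(std::size_t position) {
    if (position >= 1 && position <= kSysVIntegerArgRegs.size()) {
        return kSysVIntegerArgRegs[position - 1];
    }
    return "stack";
}
```

With this table, argument 6 maps to `r9`, which matches the correction above.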

  • @trondenver5017 1 year ago +14

    Shout out to grugq and scrut! Great talk sir.

  • @jasoncole7711 1 year ago +10

    @08:40 - the Base Pointer is typically used for Stack Frames. You'll see the typical idiom at the start of a function: "PUSH BP; MOV BP, SP" with the corresponding pop at the end. There's usually a "SUB SP," immediately afterwards to provide storage for stack variables.

    • @GeorgeTsiros 11 months ago

      ... which are no longer used in AMD64, for better or for better 🙃

  • @ZeroZ30o 1 year ago +5

    YES THANK YOU FOR TALKING ABOUT ATOMICS/CONCURRENCY
    I could not find this information ANYWHERE.
    47:40
    Also in general, thanks for making Assembly sound so simple and "in reach"

    • @GeorgeTsiros 11 months ago

      fedor pikus talks _a lot_ about atomics!

  • @bartlx 11 months ago

    Thank you David and CppNow for this surprisingly entertaining and educational trip down memory lane! :) Very well presented. Loved it!

    • @BoostCon 11 months ago

      So very pleased to hear your appreciation of David Sankel's presentation. Thank you for commenting!

  • @der.Schtefan 1 year ago +7

    ADD / LEA combinations always use the multiple ALU/AGU units of the processor. Both operations will effectively execute in parallel, since the AGU is a free extra ALU, with the only data dependency wait being in the third assembly instruction, which combines both results. The code will run faster not only because it is just "3 bytes", but because this is the best way to parallelize the operations. Plus the AGU, being restricted to very specific operations, often completes faster than the ALU, so you will see LEA everywhere, even where it would seem insane to use.

    • @der.Schtefan 1 year ago

      The same with XOR eax, eax. It is basically a free operation that can run in parallel while the mov/cmp/test is trying to figure things out. If the resulting jump is taken to skip overwriting eax, then the register already is 0.

    • @CensoredUsername_ 1 year ago +2

      @der.Schtefan XOR eax, eax gets special treatment and gets absorbed in renaming (otherwise it'd be kinda terrible actually, with that unnecessary data dependency). So it doesn't even use a pipe at all.
      Also another fun thing about LEA: LEA doesn't set flags so it doesn't have flag dependencies. That's why it often gets preferred over add.

    • @FantomX932 1 year ago

      In modern CPUs LEA is run on ALUs only... You need to check each CPU spec. On Haswell there is only one port, btw, for 3 operands...

  • @NonTwinBrothers 1 year ago

    One of my favorite Sankel talks at this point (I have yet to watch more!)

  • @gwendolynhunt2493 1 year ago +3

    Nicely done David, thoroughly enjoyed the presentation. Excellent!

  • @superscatboy 1 year ago +5

    Spent the whole talk thinking "man, I need to re-watch all of Creel's videos", and then he got a shout-out. Nice 👍

  • @digama0 1 year ago +20

    @55:16 The reason it is zeroing out ECX with the MOVZX instruction is most likely to avoid a spurious data dependency on whatever the previous value of ECX was, because if you just MOV into it then all the high bytes have to be preserved, which can cause pipeline stalls if e.g. this register was previously filled by a value from memory that is still pending (remember that modern processors are deeply out-of-order now and love to do as many independent things as possible). Q: So why isn't it zero extending to RCX instead of ECX? Doesn't the same issue appear there? A: When moving to 64-bit they wised up on this issue, and so all 32-bit ops automatically zero out the top half of the registers, even though the 16-bit ones don't (and the 8-bit ones definitely can't because there are low and high registers there). So MOVZX ECX and MOVZX RCX are the same, but the latter is an extra byte to encode so compilers don't bother.
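
The difference between the partial and the zero-extending write can be modelled in plain C++. This is a sketch of the register semantics only, not of the pipeline behaviour; the three function names are hypothetical and simply treat a 64-bit integer as the register.

```cpp
#include <cstdint>

// Models an 8-bit write such as "mov cl, value": only the low byte changes,
// so the result still depends on the old register contents.
constexpr std::uint64_t write_low_byte(std::uint64_t old_reg, std::uint8_t value) {
    return (old_reg & ~std::uint64_t{0xFF}) | value;
}

// Models "movzx ecx, value": the whole register is replaced, so the old
// contents (first parameter) are deliberately ignored.
constexpr std::uint64_t write_zero_extended(std::uint64_t /*old_reg*/, std::uint8_t value) {
    return value;
}

// Models a 32-bit write such as "mov ecx, value" on x86-64: the upper
// 32 bits are cleared, so again nothing of the old value survives.
constexpr std::uint64_t write_low_dword(std::uint64_t /*old_reg*/, std::uint32_t value) {
    return value;
}
```

Only the first function takes its old value as a real input, which is the dependency the comment is describing.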

    • @ufufuawa401 1 year ago

      Yep, accessing a 64-bit register needs a REX prefix

    • @patrickwoolard4340 1 year ago

      I think what I remember from my uni class is that the "fastest" way to clear out a register is to XOR it with itself. Something about how the bit operation is faster than having to load things into register memory (which makes sense because you would imagine the bit operation would be highly optimized). I think that's why when you look into a decompiled C program you will see a lot of instances of registers XOR-ing themselves as a quick way of clearing them before they are used later in the code section.

    • @CensoredUsername_ 1 year ago

      @patrickwoolard4340 In x64 there are several ways of zeroing registers that are the fastest. One of the flaws of xor reg, reg is that it technically has a data dependency on what was in that register, as opposed to mov reg, 0 which is theoretically faster (but more instruction bytes).
      However, modern processors tend to handle zeroing registers directly in the renaming logic. Several ways of setting a register to zero (sub reg, reg; xor reg, reg; and their vector variants) get recognized as a zeroing instruction on afaik both modern Intel and AMD chips, and just disappear into renaming.

    • @styleisaweapon 1 year ago

      the data dependency with xor is also on the flags register and it's the same with many similar instructions used for the same purpose (sub reg, reg) @CensoredUsername_

  • @paulfloyd9258 1 year ago +4

    FreeBSD and macOS also have argv envp auxv on the stack. Probably most if not all unix variants do.
    As a Valgrind developer I'm used to making direct syscalls. To avoid initialisation conflicts (and to execute all of libc's startup) Valgrind doesn't link with libc. Anything it needs it implements itself. This is all very educational for learning about how the OS starts applications.
    The AMD manual is also multiple thousands of pages long.
    Does cpuid count as the craziest instruction? Probably the longest entry in the manuals. For AMD it takes up about 50 pages in appendix E of Volume 3 of the manual.
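
As a rough illustration of what a "direct syscall" looks like from C++, here is a minimal sketch that assumes Linux with glibc. Note that it still goes through libc's generic `syscall()` function, so it is one step short of what a libc-free program like Valgrind would do; the helper name `getpid_via_raw_syscall` is hypothetical.

```cpp
#include <sys/syscall.h>  // SYS_getpid
#include <unistd.h>       // syscall(), getpid()

// Asks the kernel for the process id by system call number instead of
// calling the getpid() wrapper. Assumes Linux with glibc.
inline long getpid_via_raw_syscall() {
    return syscall(SYS_getpid);
}
```

The value should match what `getpid()` returns for the same process.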

  • @kentvanvels 8 months ago

    Enjoyed this talk. I don't think I have laughed out loud at a cpp presentation. The crowd was in a good mood, too.

  • @colinmaharaj50 1 year ago

    10:25 Where I work, my team used to work with Itanium Servers. I wanted to make a belt buckle with one.

  • @dascandy 1 year ago +15

    @31:58 that's not how the stack works. The stack contains the argc value, the argv pointer and the envp pointer. Even if you don't name your envp pointer, it is on the stack. Argv and envp are stored in memory adjacent as arrays, with envp appearing directly behind argv.
    The values in argv are definitely not stored on the stack. Only the (singular) pointer is.
    Have to admit, I completely forgot we're doing x64 primary nowadays. On x64 David is actually correct and I'm wrong - the stack holds the actual values, while registers hold the arguments (argc, argv, envp). On x86, the stack holds the arguments to main too, so would hold argc, argv and envp.
    @1:03:00 What about the AAA instruction? Nothing beats a cpu instruction that's just yelling out into the void.

    • @Darkstar2342 1 year ago

      can you see him waving his hands? ;-)
      I'm sure you found all the other inaccuracies in the talk too, but still, the argument holds, and the strings are very close to the actual pointers so they might as well be on the stack FWIW

    • @gamma77-mr1gk 1 year ago

      I'm confused why argc would be on the stack, but the array of char* pointed to by argv is on the stack, above the stack frame of main. argc and argv are passed to main in registers.

    • @digama0 1 year ago +1

      On linux, the values stored in argv are indeed in the same memory allocation as the stack, usually just a few words up, at the very top of the stack. Same thing for envp, the environment is all there in the same allocation. You can observe this by compiling an empty asm function with a _start procedure and breakpointing on the entry and looking at the stack allocation you get.

    • @reductor_ 1 year ago +1

      31:58 I think the confusion might be between main and _start (the actual entry point). The stack is in that state at the entry point; then libc's entry point turns those into the argc, argv, envp that can be passed to main, which for many calling conventions means having argc, argv and envp in registers.

    • @paulfloyd9258 1 year ago

      @gamma77-mr1gk just a question of i386 vs amd64 calling conventions

  • @GeorgeTsiros 11 months ago

    51:35 so yeah about `mov dword ptr[rdi], 1`: the second argument, "1", as written, does not _define_ if it is an 8bit, 16bit, 32bit or 64bit value. It can fit in _1_ byte (it can technically fit into 1 _bit_ but let's not get weird) but it can also fit in 8 bytes. So the assembler does not know if you want to write just one byte of data _at_ the position pointed at by rdi, or _8_ bytes of data _starting at_ that position. _If_ the second argument was AL, AX, EAX or RAX, then the assembler would say "ah, the register you want me to copy to memory is 1, 2, 4, 8 bytes(respectively) so i will use the instruction that copies _that_ part of (or the whole) register". `MOV [RDI], AL` gets assembled into the byte sequence "88 07", "MOV [RDI], AX" becomes "66 89 07", "MOV [RDI], EAX" becomes "89 07" and "MOV [RDI], RAX" becomes "48 89 07".
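
The four byte sequences quoted above can be reproduced with a tiny encoder. This is a sketch that only handles `mov [rdi], <al/ax/eax/rax>`; the enum and function names are invented for illustration and it is not a general assembler.

```cpp
#include <cstdint>
#include <vector>

enum class OperandSize { Byte, Word, Dword, Qword };

// Encodes "mov [rdi], reg" where reg is the accumulator at the given width
// (al, ax, eax or rax). Hypothetical helper for illustration only.
inline std::vector<std::uint8_t> encode_mov_to_rdi_from_accumulator(OperandSize size) {
    // ModRM byte: mod=00 (memory, no displacement), reg=000 (accumulator),
    // rm=111 (rdi) gives 0x07.
    constexpr std::uint8_t modrm = 0x07;
    switch (size) {
        case OperandSize::Byte:  return {0x88, modrm};        // 8-bit opcode
        case OperandSize::Word:  return {0x66, 0x89, modrm};  // operand-size prefix
        case OperandSize::Dword: return {0x89, modrm};        // default 32-bit
        case OperandSize::Qword: return {0x48, 0x89, modrm};  // REX.W prefix
    }
    return {};
}
```

The operand width lives entirely in the opcode and prefixes, which is why an immediate like `1` needs `dword ptr` to pick one of them.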

  • @styleisaweapon 1 year ago

    no those brackets arent for weird reasons nor is it simply special syntax .. intel always had powerful addressing modes .. the brackets are used anywhere you use the address generation hardware be it because you are reading/writing to memory, or when you'd rather get the address without the reading/writing to memory

  • @un2mensch 1 year ago +1

    56:35 - JE = "jump if equal", which is the same instruction as JZ = "jump if zero". A programmer would write "JZ" here for clarity, but a compiler doesn't need to care. When comparing two things, eg "CMP eax, ebx" the CPU is simply performing a "SUB eax, ebx", setting the flags accordingly, but throwing away the result. Similarly, for "TEST cl, 1" the CPU performs "AND cl, 1" (the bitwise AND operation) then sets the flags accordingly, and throws away the result.
    So if the boolean value you're testing is guaranteed to have "1" in bit 0 when the value itself is non-zero, then "CMP cl, 0" and "TEST cl, 1" will always give you the same result in the zero flag.
    Just a comment about what the compiler has generated here for "get_atomic": it's garbage. I mean it's functionally the same but there is no reason for the extra shuffling around with movzx ecx *in this specific example*, being that it is a maximally simple example. It quickly becomes more convenient to have the atomic value pre-loaded into a register as the logic gets more complex. I'd guess it also becomes convenient to have zero-extended the value into the full register.
    However in this example it's just wasting code and space. Even according to the Intel manual as quoted here. Both subroutines are reading the same byte. You see, there are different definitions of atomicity going on here. The Intel manual is specifically talking about whether any particular memory read or write will *complete* within one bus cycle. The issue is that the memory bus might need to perform more than one fetch or store in order to fulfil a value wider than 1 byte depending on *alignment*.
    The kind of "atomicity" that you'd have in mind as a C++ programmer using the std::atomic paradigms has more to do with whether a read-modify-write operation (eg, "foo++") completes without interference from other cores/threads, and guaranteeing how memory accesses get ordered (ie, fencing). The LOCK prefix is needed for operations that take more than 1 bus cycle, which certainly includes a read-modify-write op, and any memory access that spans across the alignment boundary relevant to your architecture.
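
The read-modify-write case can be sketched with `std::atomic`. The helper name `take_ticket` is hypothetical. On x86-64, compilers typically lower a `fetch_add` whose result is used to a `LOCK XADD`, but that is a code generation detail and not something the C++ standard promises.

```cpp
#include <atomic>

// Hands out consecutive ticket numbers. fetch_add performs the load, the
// addition and the store as one indivisible read-modify-write and returns
// the value that was stored before the increment.
inline int take_ticket(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```

A plain `int` incremented with `++` from several threads gives no such guarantee, which is the distinction the comment draws from the single-access atomicity described in the Intel manual.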

    • @mgancarzjr 1 year ago

      Thanks. I think I get it, but let me see if I understand. CMP cl, 0 subtracts 0 from cl and sets the Zero Flag if the result is 0. JE then checks the Zero Flag and jumps if it's set.
      TEST cl, 1 does a bitwise AND operation and sets the Zero Flag if the (result is greater than 0?) (none of the original bit flipped?) Is this particular to the situation because we're only comparing a Boolean? JE then checks the Zero Flag and jumps if it's set.
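
One way to check that reading is to model the Zero Flag in C++. This is a sketch with hypothetical function names that covers only ZF and ignores the other flags.

```cpp
#include <cstdint>

// Zero Flag after "cmp a, b": set when the (discarded) result a - b is zero.
constexpr bool zf_after_cmp(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a - b) == 0;
}

// Zero Flag after "test a, mask": set when the (discarded) result
// a AND mask is zero.
constexpr bool zf_after_test(std::uint8_t a, std::uint8_t mask) {
    return (a & mask) == 0;
}
```

For the values 0 and 1, `zf_after_cmp(x, 0)` and `zf_after_test(x, 1)` agree, so TEST sets the flag when the masked result is zero, not when it is greater than zero. For a value such as 2 they disagree, which is why the equivalence only holds for a boolean stored as 0 or 1.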

  • @Yupppi 1 year ago

    Ok the first one was Alice, but the second one was Van Halen and that really got me excited.

  • @douggale5962 1 year ago +2

    It's not crazy. The order of the encoding is ax, cx, dx, bx, sp, bp, si, di (first being 000, last being 111). Notice anything? On sysv abi, the callee saved registers are bx sequentially thru bp (plus rex escaped sp sequentially thru di for r12-r15) The parameters are right to left, skipping over the callee saved, which visits di, si, dx, cx. Then in the new region, there is no reason to avoid anything so parameters continue at r8 and r9 (rex escaped ax, cx). It is logical with offsets to get it away from ax cx dx, which it is hardwired to use in some instructions.
    On Windows ABI, the differences are that the whole range from bx all the way thru di is callee saved, xmm6 thru xmm15 are callee saved as well, and the first four integer arguments go in cx, dx, r8, r9, still avoiding the callee-saved registers.
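
The encoding argument for the System V case can be checked against a small table. This is a sketch with made-up names; the encodings are the 4-bit register numbers, with the top bit supplied by the REX prefix.

```cpp
#include <array>
#include <string_view>

// 64-bit general-purpose registers indexed by their hardware encoding.
constexpr std::array<std::string_view, 16> kRegisterByEncoding = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"};

// Callee-saved registers in the System V AMD64 ABI: rbx, rsp, rbp, r12-r15.
constexpr bool is_sysv_callee_saved(int encoding) {
    return (encoding >= 3 && encoding <= 5) || (encoding >= 12 && encoding <= 15);
}

// Encodings of the six integer argument registers in argument order:
// rdi, rsi, rdx, rcx, r8, r9.
constexpr std::array<int, 6> kSysVArgumentEncodings = {7, 6, 2, 1, 8, 9};
```

Reading `kSysVArgumentEncodings` left to right shows the pattern from the comment: 7, 6, then a jump over the callee-saved 5, 4, 3 to 2, 1, then on to 8 and 9.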

  • @KabelkowyJoe 1 year ago +3

    19:00 NOOOOOOOOOO..... LEA is not just about "oh so let's make a 3 byte opcode". Most x86 CPUs have separate ALUs and AGUs, 8 units in total. It's almost like DX9 vertex and pixel shaders. ADD is executed on an ALU (Arithmetic & Logic Unit), you have 4 of them, and it modifies FLAGS! LEA is executed on an AGU (Address Generation Unit), you have 4 of them, and it doesn't modify FLAGS. If you use them wisely you can run up to 8 instructions at once. The AGU can also combine 3 numbers (just like most RISC CPUs: ARM, MIPS) and generate a result. The difference between x86 and MIPS, for example, is that x86 uses only 2 operands and one of them is the result. MIPS always uses 3 operands: 1 is where the result is stored, and 2 are the operands to add. So if you want to add just two registers, not three, you use the zero register. The x86 AGU adds 2 registers together but can also add a third, an immediate number, to them. The ALU can't do that. So thanks to LEA not only is it executed in parallel, but one instruction can also combine 3 numbers instead of just 2, so it's quicker.
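
The "combine 3 numbers" point corresponds to the x86 addressing form base + index*scale + displacement. Here is a minimal C++ sketch of an expression with that shape; the function name is hypothetical, and whether a given compiler actually emits a single LEA for it (something like `lea rax, [rdi + rsi*4 + 10]`) depends on the compiler and flags.

```cpp
#include <cstdint>

// base + index * 4 + 10 has exactly the shape of an x86 effective address:
// one base register, one index register scaled by 1, 2, 4 or 8, and a
// constant displacement.
constexpr std::int64_t scaled_sum(std::int64_t base, std::int64_t index) {
    return base + index * 4 + 10;
}
```

Written with ADD and a shift or multiply, the same computation would take several instructions, each of which also updates FLAGS.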

    • @musaran2 1 year ago +1

      x86 really got overly complicated.

    • @PEGuyMadison 1 year ago +1

      @musaran2 that's why it's faster than anything out there.

    • @PEGuyMadison 1 year ago

      That's great insight, I was listening to the presenter and thought to myself.. hmm... ok I guess. But I was wrong to assume he had it correct, I suspected the address generation logic was being used though.
      Thanks for the great comment.

    • @ante646 1 year ago

      thank you

    • @MrHaggyy 1 year ago

      @PEGuyMadison not really, for any problem there is at least one faster architecture out there.
      What x86 is kind of good at is doing everything people might need reasonably well (general purpose). And it's probably the biggest market we have, so it can afford any goodies chipmakers can provide.

  • @GeorgeTsiros 11 months ago

    but why do _private_ functions use the system-wide convention?
    they do not need to, do they?
    like, if i write some int myfoo(int,int); and it is visible _only to my code_ the compiler _should_ see that "oh yeah, this thing here exists only inside this program, i guess i can pass the arguments to whatever registers are most convenient"

  • @TerjeMathisen 1 year ago

    The nolib startup uses a slightly strange way to scan for a dword of zero:
    next:
    add eax,8
    cmp ebp,[eax-8]
    jnz next
    IMHO it would be better to use
    next:
    cmp ebp,[eax]
    lea eax,[eax+8]
    jnz next
    The reason being that the cmp+lea pair can execute in the same cycle, while add+cmp forces the cmp to wait until the add has finished.
    OTOH, it really doesn't matter for code that only runs during startup!
    BTW, for lockless programming I strongly suggest using LOCK XADD!
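
In C++ terms the scan is a walk over an array until a zero entry is found. The sketch below assumes the loop is stepping over a null-terminated array of pointers such as envp, which the comment does not actually state; the function name is hypothetical.

```cpp
// Returns a pointer to the slot just past the first null entry, which is
// where the loops above leave eax once the zero has been found.
inline char** skip_past_null(char** entry) {
    while (*entry++ != nullptr) {
        // Keep advancing; the increment happens even on the final, null entry.
    }
    return entry;
}
```

Either ordering of the compare and the increment yields the same final pointer; the difference discussed above is only in how the two instructions can be scheduled.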

  • @__hannibaalbarca__ 1 year ago +1

    My field is mathematics, and I underestimated the capability of C/C++ until last year; I'm very upset that I didn't know this until now (27 years).

    • @MrHaggyy 1 year ago

      Compilers are a very interesting field for some mathematicians. There are a lot of problems where the right number system with the right rules can make a huge impact on performance or efficiency for every line of code that's using your compiler.
      Especially intermediate representations nowadays would see that the first two functions are associatively equal, so the compiler would only keep one and eliminate the other.

  • @colin398 1 year ago +1

    Awesome

  • @GeorgeTsiros 11 months ago

    you did not explain how it can work when two *64bit* registers are added and are expected to fit into a *32bit* register :(

  • @GeorgeTsiros 11 months ago

    56:04 THAT'S FEDOR
    FEDOOOOOOOOOOOR

  • @shardator 1 year ago

    Argument 6 callee save makes sense. It means that the ABI reserves the right to reuse the value in that register after the call, while for the other arguments it does not.
    It can be thought of as an always const ref argument. If the register is used only as input, the caller may just continue using the value in it

  • @fynnwilliam 1 year ago

    Thanks David…

  • @coreC.. 1 year ago

    👍
    Those SIMD instructions are funky indeed. It's cool you can perform some calculation on multiple data, but i always keep thinking there are better ways to come up with ways to process such data. The instructions are just woot sometimes..
    If i was to invent a new instruction set (for SIMD only), i would focus on the things that are used most often, like vector/matrix stuff. For example, now, if you try to optimise some matrix-multiplication manually, you may find out that your code is not as efficient as you expected it would be, because of some "silly" cache-misses because of how your matrices are stored in memory (row/column major).
    My point: You need to be an all-round assembler programmer because of so many CISC things happening in the CPU. There are more and more conditions to take into account.
    I guess i'm more a RISC dude..
    BTW: Creel is cool. Very much aware of the CISC stuff going on (branchless programming, for example).
    And Agner Fog is just the master. I've learned a lot from his articles. A true expert.

    • @MrHaggyy 1 year ago

      Well in embedded or hardware accelerators you have specific operations to manipulate a matrix for a given size.
      And we do use MIMD a lot to set an action in hardware without having the CPU to touch the result at all.
      That's how you dim a light or control a motor rpm with close to zero CPU load.

    • @coreC.. 1 year ago

      @MrHaggyy My expertise stops at SIMD, confined to a single chip/CPU.
      The specific operations you talk about, MIMD, seem to target multiple chips.
      That goes beyond the scope of this video. I mean, David is speaking about single chips/CPU's, obviously..
      If it comes to the low assembler level instructions, how does MIMD perform? There have to be expensive/time-consuming mechanisms to exchange the data between processors. That is even more costly than the "critical section" code to execute on each core to make threads talk to each other.
      BTW: I make a LED dim by choosing the (closest+1) timer prescaler, and then in the interrupt-handler i switch it on or off, according to the PWM duty cycle value that is provided on some register of the MCU.
      I can make that LED blink at a super precise interval. That seems impossible for an MIMD approach.
      Single processor, or multiple processors, is quite a difference..

  • @CornedBee 7 months ago

    argv[argc] is defined in the C++ standard to be 0.
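
That guarantee means the argument list can be walked without consulting argc at all. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cstddef>

// Counts entries up to, but not including, the terminating null pointer.
// When given main's argv, the result equals argc.
inline std::size_t count_until_null(const char* const* argv) {
    std::size_t count = 0;
    while (argv[count] != nullptr) {
        ++count;
    }
    return count;
}
```

Inside `main`, `count_until_null(argv)` would return the same number as `argc`.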

  • @mircoi.3256 1 year ago +1

    repeating the comments/questions would have been really nice though :/ happens way too seldom - and then they do just to get dissuaded...

  • @GeorgeTsiros 11 months ago

    14:33 nono, arm is another thing entirely

  • @ttrss 1 year ago +11

    the constant comments are sort of annoying

    • @CPSPD 1 year ago +3

      It’s like a computer science university lecture haha. Though ours mellowed out near the end when I was in my final year, and the interjections were more interesting/were better questions.

  • @DxXNA 11 months ago

    Bruh 15:06 🤣
    So how come... MASM does params like RCX, RDX, R8, R9, then Stack/RSP beyond that.
    Where this does RDI, RSI, RDX, RCX, R8, R9 (not sure why you have R12 the GXX x86-64 godbolt compiler shows a 6th param is R9). So either there's more dumb stuff going on with certain compilers / hardware or one of the two is wrong.
    I feel like they should have been somewhat the same on this stuff and it's semi close but at least they could have had the first 4 params the same so logic transfers between things better.

  • @sdstorm 1 year ago +2

    These clearly AI generated images are hurting my brain. Great talk though.

  • @7cyber 1 year ago

    Nice for migration from ASM to C, but PASCAL no migration required. Else where nice joke !

  • @estrizhok 1 year ago

    a lot of hand waving

  • @alexkfridges 11 months ago

    Fantastic talk, most annoying audience of all time

  • @sergiocoder 1 year ago +3

    I'm sorry, but I have to say it: Adobe Acrobat is a piece of garbage

  • @retropaganda8442 1 year ago +3

    I think people who use compiler explorer in their talks should stop showing x86-64 as the assembly output. Prefer some RISC or LLVM's bytecode.