Assembly Language Misconceptions

  • Published on 22 Jan 2025

Comments • 580

  • @mheermance
    @mheermance 4 years ago +405

    I learned to program in the 80s when compilers stunk, and it was a piece of cake to beat them with hand coded assembly. As a result many projects were written in assembler to run on older and newer hardware. The advent of efficient compilers was a godsend, and for work I was glad to see it sidelined. But for fun I still code in assembly because building high level features like lambda functions or garbage collectors from the ground up teaches you a great deal.

    • @sallylauper8222
      @sallylauper8222 3 years ago +12

      Yeah, I thought it was really interesting that he said that today, to write faster assembly, you have to know all the tricks of the compilers.

    • @SunMasterXIV
      @SunMasterXIV 3 years ago +12

      I used Lattice C (and 68k assembly) on the Amiga in the 80s, and I thought it was pretty good. But the way modern compilers are able to optimize the code is sometimes amazing. It doesn't help anyone tailor-making assembly that so many x64 CPU variations are available, with instruction execution times that vary between them.

    • @AURORAFIELDS
      @AURORAFIELDS 3 years ago +4

      68000 is a good example of why C compilers are not good for everything. A lot of the efficient code relies on passing arguments via registers, while C relies on stack frames. Memory access on the 68000 is really slow, so automatically C will be slow too.

    • @mheermance
      @mheermance 3 years ago +7

      @@AURORAFIELDS true, but many C compilers implement fast call linkage. They pass by registers and the called function saves on the stack if it calls another function.

    • @Ehal256
      @Ehal256 3 years ago +1

      @@mheermance finding a compiler that does that for the 68k nowadays however, is quite difficult. GCC doesn't, and while llvm recently added support, I doubt it does either. Maybe something from the 80s, but I'd rather code things by hand when performance is really important.

  • @randyscorner9434
    @randyscorner9434 3 years ago +85

    With current compiler technology there is one area where the move to assembly provides massive advantages. That is when you can vectorize the code to fully use the SSE and MMX extensions. For one routine, unrolling the loop 1 time fit the register set, allowed 8 wide vector calculations and increased the overall performance of a high end electronic piano by 12X. This was sufficient to move the program off a new Mac to a RPI3. The load went from 40% of the CPU on the Mac to 9% of the CPU on the RPI3 with just one thread. Getting to this point with a high level programming language requires a different compiler and coupling that to C or C++ is much harder than doing the 60 assembly instructions by hand.
    It's all about how badly one or two routines dominate the runtime. It's often the case that these "hotspots" can get extra love and show major performance improvement. Of course, the best optimization would be to stop using Python as production code.....

    • @thomasmaughan4798
      @thomasmaughan4798 3 years ago +28

      "Of course, the best optimization would be to stop using Python as production code"
      LOL 🙂

    • @FM-tq2gs
      @FM-tq2gs 1 year ago +2

      Newbie question: why can't compilers do that kind of optimization? Will they be able to one day?

    • @Mr8lacklp
      @Mr8lacklp 1 year ago +17

      @@FM-tq2gs they will be able to do it sometime in the future but there are really two problems here:
      One is that the compiler can only do an optimization if it can prove that it won't change the behavior of the program for any value it might possibly see, and it simply doesn't have all the information, as all it sees is the source code. You might for example have a number that represents the day of the week, so *you* know it's never going to be greater than seven, but the compiler can't know that, so it can't apply any optimizations that assume that the number won't be greater than seven. So there are some optimizations you can do that are literally impossible for a compiler, no matter how advanced.
      The other problem is that both finding an optimization and proving that it doesn't change the behavior of the code are very difficult and not generally things computers can do at all. And this is where compilers are steadily getting better but it's very possible that there are some optimizations that will just never be worth the longer compile times or the effort of implementing them.
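To make the day-of-the-week point concrete, here is a minimal C sketch. The function names and the `DAY_NAMES` table are hypothetical, chosen only for illustration. The first function has to cope with any `int`, because that is all the source code promises; the second drops the range check, which is only valid if the programmer really does know the value is always 1 to 7.

```c
/* Hypothetical lookup table, used only for this illustration. */
static const char *const DAY_NAMES[7] = {
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
};

/* General version: correct for every possible int, so the range
 * check cannot be removed without changing behaviour. */
const char *day_name_checked(int day) {
    if (day < 1 || day > 7) {
        return "invalid";
    }
    return DAY_NAMES[day - 1];
}

/* Version that relies on knowledge outside the source code:
 * the caller guarantees 1 <= day <= 7. Passing anything else is
 * undefined behaviour (an out-of-bounds read). */
const char *day_name_trusted(int day) {
    return DAY_NAMES[day - 1];
}
```

The second version is the kind of shortcut a human can take by hand. A compiler that sees only `day_name_checked` has no basis for removing the check unless it can prove the range from the surrounding code.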

    • @FM-tq2gs
      @FM-tq2gs 1 year ago +3

      @@Mr8lacklp thank you for the explanation!

    • @robegatt
      @robegatt 1 year ago

      @@Mr8lacklp yeah, that is why some programming languages are better than others... a Pascal compiler could easily do what you said in the first example.

  • @spacewolfjr
    @spacewolfjr 4 years ago +196

    I work in CyberSecurity and end up using assembly a lot when reverse engineering / disassembling malware, it's an essential skill for that kind of work

    • @shanehebert396
      @shanehebert396 3 years ago +28

      Well... you have to since I doubt the malware writers are going to give you the source and all you have is the executable ;)

    • @tappineapple3381
      @tappineapple3381 3 years ago +5

      Did you go to college? If so what did you major in? I am currently a junior in high school and I would like to further learn about reverse engineering and getting better with stuff like IDA and reclass. Any advice?

    • @y2ksw1
      @y2ksw1 3 years ago +1

      Agreed.

    • @y2ksw1
      @y2ksw1 3 years ago +12

      @@tappineapple3381 I suggest disassembling viruses. Most of them are brilliant examples of engineering, and most of them are made by true masters of the art.
      The next step I suggest is to make your own operating system. If you master this step, you will have no problem solving all the other problems you may come across.

    • @tappineapple3381
      @tappineapple3381 3 years ago +4

      @@y2ksw1 Thank you! I have been following the tutorials on guided hacking and I have very much enjoyed reversing video games, and I feel like malware would be the next best step. Now, making an operating system scares me.

  • @ChiliTomatoNoodle
    @ChiliTomatoNoodle 4 years ago +248

    Really good information quality and density here. This guy knows his stuff.

    • @WhatsACreel
      @WhatsACreel  4 years ago +24

      Means a lot brus! You are a legend, Chili :)

    • @classicnosh
      @classicnosh 3 years ago +10

      @@WhatsACreel - He's not wrong. I learned Pascal and C wasn't really taught in my school since Pascal was considered "academic". Assembly was also easier in those days since the microcomputers were much smaller and it was possible to really understand the memory map. Nowadays, the philosophy is very different. The rule of thumb is, don't try to outsmart the compiler. ;)

    • @tootaashraf1
      @tootaashraf1 3 years ago

      The c++ guy

    • @Andoxico
      @Andoxico 2 years ago

      ayy it's papa Chili

  • @craigmhall
    @craigmhall 1 year ago +6

    I rarely write in assembly any more, but it's good to know for:
    -debugging release / optimized code
    -studying the generated assembly and finding ways to tweak the source code to generate better assembly
    -generally understanding how the machine works, what is expensive and what is not

    •  1 year ago +2

      This! I personally write asm only as a hobby for microcontrollers, where cycle-level timing is sometimes required (the rest of the time C suffices), but I read it a lot more as disassembled code for the reasons you mentioned.

  • @lgrantcdg
    @lgrantcdg 3 years ago +36

    Excellent talk!
    In the 1970s at General Motors Research Labs, they ran an experiment with a PL/I-based computer graphics system. They recoded a few high-usage routines in assembly language. The system got faster. Then they recoded them in PL/I and the system got even faster. Then they recoded them in assembly language again, and it got faster still.
    It turned out that each time they recoded the routines, they improved the algorithm, and that made much more of a difference than which language they used.

  • @guillermoleon0216
    @guillermoleon0216 3 years ago +19

    First Assembly I ever learned was for the Z80 and I absolutely loved it! I don't use it at work but getting to know it taught me a lot about how computers work.

  • @wingunder
    @wingunder 3 years ago +63

    "If you can help yourself, try not to write a virus." 😂😂😂
    You should put this quote on a t-shirt. Your sense of humor is simply wicked 👍

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      I love that line.
      And the background to that is, if you can do that, you don't need to write a virus. You will also find a well-paid job without having to drift into the criminal corner to make a lot of money.

  • @starpawsy
    @starpawsy 3 years ago +2

    Most successful assembly program I wrote was in 1992. I did a square root function using Newton's method, that was faster than what the compiler of the day provided in the maths library! In those days, the width of the floating point divide register was 80 bits. Dunno what it is today. This might not work today.
    As an aside, some people might say "only 80 bits"? Well, consider that 80 bits == 24 significant decimal digits. Consider that if you measure the diameter of the known universe to 24 significant figures, the last figure is less than the classical diameter of a hydrogen atom.
    Newton's method for calculating the square root of x.
    Start with a guess, call it a.
    Calculate b = x/a.
    Take the average c = (a + b)/2.
    That will be closer than either a or b. Use c as your next guess for a and iterate. Keep going until a & b vary only by 1 in the LSB.
    The challenge was making a really really good guess for a that works for all numbers. I hit on the idea of dividing the exponent by 2 (shift right by 1), and zeroing all but the most significant bit of the mantissa. For negative exponents you do the opposite - double the value of the exponent. This actually worked really well!
    Here's a worked example. Square root of 10 (well actually 10.000000000000000000000000000000)
    start with 3
    10 / 3 = 3.33...
    3 + 3.33...= 6.33...
    divide by 2 = 3.166...
    In one iteration, you've got 2 decimal places.
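The procedure described above can be sketched in C roughly as follows. This is not the original 1992 routine: it assumes 64-bit IEEE 754 `double` rather than the 80-bit format mentioned, handles finite positive inputs only, halves the exponent for both positive and negative exponents, and stops when the guess no longer changes (with a cap of 64 iterations) instead of comparing a and b in the last bit. The names `sqrt_guess` and `newton_sqrt` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Initial guess: halve the unbiased exponent and keep only the most
 * significant mantissa bit. Assumes 64-bit IEEE 754 doubles. */
static double sqrt_guess(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    int exponent = (int)((bits >> 52) & 0x7FF) - 1023;  /* remove the bias */
    uint64_t top_mantissa_bit = bits & (1ULL << 51);
    int half = exponent / 2;                 /* C division truncates toward zero */

    uint64_t guess_bits = ((uint64_t)(half + 1023) << 52) | top_mantissa_bit;
    double guess;
    memcpy(&guess, &guess_bits, sizeof guess);
    return guess;
}

/* Newton's method: a <- (a + x/a) / 2, repeated until a stops changing. */
double newton_sqrt(double x) {
    if (x <= 0.0) {
        return 0.0;  /* sketch only: zero and negative inputs are not handled */
    }
    double a = sqrt_guess(x);
    for (int i = 0; i < 64; i++) {           /* cap guards against oscillation */
        double b = x / a;
        double c = (a + b) / 2.0;
        if (c == a) {
            break;
        }
        a = c;
    }
    return a;
}
```

For x = 10 the guess produced here is 2.0 rather than the 3 used in the worked example, so the first iteration gives 3.5 and the second about 3.18. After that the number of correct digits roughly doubles per step.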

  • @kevinjensen3056
    @kevinjensen3056 3 years ago +12

    Been programming in assembly and C since '79. Assembly is still widely used in my field of embedded programming, but I haven't needed to resort to it for years. The code density that an expert on the CPU can achieve in assembly is incredible. Still, most of what you've said is correct for most complex CPUs, but some comments are a little inaccurate for embedded processors today. Most MCU core instructions are still atomic, but the problem of multithreaded read/write race conditions still applies when the data size is less than the bus width. This sort of issue appears in most interview tests for embedded programmers.
    You really should do a lecture on race conditions at the sub instruction level (as you just did), the instruction level, at the thread level, the o/s level and even beyond.
    Liked your lecture on radix sort. Never tried that one before. Keep up the good work.

  • @mattias3668
    @mattias3668 4 years ago +49

    There are some cases where you want to use assembly for performance because the compiler will not choose the best instructions for your code. For example, if you are doing addition on bigints, you will probably wish to use the add-with-carry instruction, which the compiler probably will not be able to figure out that it can use. And there are probably a large number of very specialised instructions like this; I imagine, for example, that the compiler won't use the SHA or AES instructions.
    Not only are there different assembly languages for different architectures, you also have different dialects for different assemblers.
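For readers who have not seen the bigint case, here is a minimal portable C sketch of multi-word addition. The function name `add_limbs` and the little-endian limb layout are assumptions made for this illustration. The carry out of each 64-bit word has to be rebuilt from comparisons, whereas hand-written assembly would simply chain add-with-carry instructions. Whether a particular compiler collapses this loop into such a chain varies, so inspect the generated code rather than assuming either way.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two n-limb numbers stored least-significant limb first.
 * Writes n limbs to sum and returns the final carry (0 or 1). */
unsigned add_limbs(uint64_t *sum, const uint64_t *a, const uint64_t *b, size_t n) {
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t partial = a[i] + b[i];        /* wraps modulo 2^64 */
        unsigned carry_out = partial < a[i];   /* wrapped iff result is smaller than an operand */
        uint64_t total = partial + carry;      /* add the carry from the previous limb */
        carry_out |= total < partial;          /* that addition can wrap as well */
        sum[i] = total;
        carry = carry_out;
    }
    return carry;
}
```

Adding 2^64 - 1 and 1 as two-limb numbers, for example, yields limbs {0, 1} with a final carry of 0.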

    • @WhatsACreel
      @WhatsACreel  4 years ago +18

      I absolutely agree!

    • @shanehebert396
      @shanehebert396 3 years ago +10

      You would hope that if you are using a library that's implemented bigint or SHA/AES that the people who wrote the library used intrinsics to implement the library calls.

    • @mattias3668
      @mattias3668 3 years ago +9

      @@shanehebert396 Actually, I wouldn't necessarily hope that. When I implemented addition for bigint, GCC didn't have a good intrinsic for doing add with carry (I don't know if it has now); the closest it had was addition with overflow detection, which it couldn't optimise, so inline assembly was necessary for good performance. So you want your bignum to use inline assembly in this case, and then just add a portable fallback for unknown architectures. In other situations, intrinsics may work just as well, but in these cases you still need a portable fallback, so the only reason to use intrinsics instead of inline assembly in these situations is that the intrinsics may be supported for multiple architectures, and hopefully most compilers will recognise them, but that's not necessarily the case, and it is more likely that they will recognise the inline assembly.
      Similarly, intrinsics for SHA/AES, if there even are any, are not portable.

    • @shanehebert396
      @shanehebert396 3 years ago +3

      @@mattias3668 yeah, that's the beauty of conditional compilation ;) if the arch is detected, use the version of the library that uses intrinsics, if not, fall back to the library made from portable code. Then it's up to the library providers (or an interested 3rd party in the case of open source) to add to the project.
      But yes, you're also at the mercy of the compiler and how it generates code (gcc, in your case, with add with carry).

    • @andrewdunbar828
      @andrewdunbar828 3 years ago

      Rotate instructions are also not accessible from your high level language. Endian-switching instructions used to be inaccessible too but various compiler + CPU combos I looked at a while ago could recognize most ways to do endian switching in C and produce the right ASM code... but not always!
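As a sketch of what "ways to do endian switching in C" tend to look like: C has no rotate or byte-swap operator, so both are usually spelled with shifts and masks, in the hope that the compiler matches the pattern to a single instruction. The function names below are made up for this illustration. Many current compilers do recognise these particular forms, but, as the comment says, not always, so checking the generated assembly is the only way to be sure.

```c
#include <stdint.h>

/* Rotate left by n bits. Masking the counts keeps every shift in the
 * range 0..31, so n == 0 (or n >= 32) does not hit undefined behaviour. */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Reverse the byte order of a 32-bit value (endian switch). */
uint32_t bswap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

For instance, `rotl32(0x12345678u, 8)` gives `0x34567812` and `bswap32(0x12345678u)` gives `0x78563412`.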

  • @clickrick
    @clickrick 3 years ago +6

    I'm glad you got to the point that there are assembly languages for just about every processor and didn't allow people to assume that x86 is all there is.
    As someone who has written assembler on ICL 1900, IBM 360 & 370, DEC PDP 11, as well as microprocessors like the 6502 and Z80, I've become aware of just how different the fundamental architectures are, in particular addressing modes.

  • @ricos1497
    @ricos1497 4 years ago +14

    If I'm to take just one thing from this video, it's that I shouldn't write viruses. One virus, absolutely fine - or recommended perhaps - viruses, not. Great advice, thanks.

  • @brannonharris4642
    @brannonharris4642 3 years ago +4

    Reductive learning. Discovering what something is not is seemingly more potent than only pondering on what that thing is.
    Love this video!

  • @ParagonX13
    @ParagonX13 3 years ago +8

    i'm a young person and i taught myself reverse engineering/assembly over the past several years (messing around with disassemblers and searching my questions on the internet) and actually enjoyed it way more than i thought i would... at first it was just a means to an end but i very quickly grew fascinated with it all. i have no idea what to do with this passion though other than hobby projects... :p

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      If you need a playground: many open source audio and video codecs are already optimized for the x86 and ARM architectures, but this is not yet the case for the RISC-V architecture. So you could buy a single board computer (SBC) with a RISC-V CPU and then see what could be optimized there. You would need to learn RISC-V assembly though.

  • @TerjeMathisen
    @TerjeMathisen 3 years ago +4

    Congratulations Creel, you've managed to create a very informative set of videos on x86 asm, all stuff that I would have loved to have back in the days, starting in 1982 when I had to write interrupt drivers in hex. :-)
    PS. I went on to use asm on everything from video (DVD & BluRay) & audio codecs (ogg vorbis), crypto (AES competition), games (Quake) and I still write some really low-level code, usually using compiler intrinsics since Visual Studio doesn't allow inline asm anymore. :-(

  • @hell0kitje
    @hell0kitje 4 years ago +15

    Glad to see you back, mate :) I started with your C++ vids and now I'm discovering asm, keep posting more!

  • @draconite
    @draconite 2 years ago +11

    #1: This does depend on the architecture you're building for. Compiling for the 68000 with GCC, it's easy to beat the compiler if you know what you're doing

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      You've already made an assumption here, using a specific compiler. On the other hand, if you use a compiler that is optimized for the use of fast calls and 68k, then it can look different.

  • @Alex-op2kc
    @Alex-op2kc 3 years ago +6

    Here's an alternative definition: An assembly language is a set of mnemonics and other language elements defined by an assembler that let you write symbolic statements that map to hardware instructions.
    Under that definition, there can be multiple assembly languages per architecture. For example, there are multiple assemblers for x86: MASM, NASM, YASM, and fasm. And each defines a different, although very similar, assembly language.

    • @YourCloseCoop
      @YourCloseCoop 1 year ago

      Nasm has the finest "classical" syntax, while all you wanna do looking at masm is to go back to C. Can't tell anything about fasm and yasm, don't have enough experience

  • @Guztav1337
    @Guztav1337 4 years ago +26

    You should get more cushions/backdrop in the room, there is a bit of echo in the background.

    • @mrdouble
      @mrdouble 3 years ago

      Was thinking the same, looks like an expensive mic though :/

    • @swharden
      @swharden 3 years ago +1

      The condenser microphone is "too nice". It's picking-up every little echo in the room. A dynamic microphone or a basic gaming headset (microphone closer to the mouth) could be better options for this space.
      Edit: audio is good in later videos

  • @roax206
    @roax206 2 years ago +1

    Though from my understanding, assembly is mostly just machine code but replacing the binary instruction IDs with short nicknames for the instruction.
    Technically any compiled "higher level" language will be converted into assembly at one point (unless the person who wrote the compiler is a masochist and memorized all the instruction ID numbers). Whether assembly ends up quicker then simply depends on whether the problem is easier to express in assembly language than in the HLL used, and on how far you are willing to manually optimize the assembly code.

  • @VTdarkangel
    @VTdarkangel 2 years ago +1

    I had to do some SPARC assembly programming when I was in school. The real advantage of it was when we had to do hardware interfaces. Those functions could have been done in C, but when I broke the object files down, I found out that the compiler was inserting a bunch of extra commands that were completely unnecessary, such as writes to the master register for settings that weren't being used. By doing the interfaces in assembly, I could bypass all of that.

  • @alberto3028
    @alberto3028 4 years ago +46

    ASM is perfect for bootloaders and some parts of OS

    • @WhatsACreel
      @WhatsACreel  4 years ago +14

      It is indeed! UEFI changed the necessity a little, but certainly low level OS code is one of the most important use cases for ASM! Cheers for watching mate :)

    • @lewiscole5193
      @lewiscole5193 4 years ago +5

      Assembly language gives complete control of the hardware to the programmer in a way that no HLL can, in no small part because assembly language is processor architecture specific, while an HLL is supposed to be processor architecture independent.
      So, it's not that "ASM is perfect for bootloaders and some parts of OS", it's that there is no other way to get there from here using an HLL.

    • @WhatsACreel
      @WhatsACreel  4 years ago +3

      @ozan o. I would love to :) Judging by the recent reviews of Apple’s new M1, I think maybe ARM will give x86 a very good shake very soon! We might be witnessing the beginnings of the fall of x86 in the laptop and desktop markets...? Unbelievable!
      Not sure when I can cover these things, but they’re certainly on my to-do list. Thanks for the suggestions, and cheers for watching :)

    • @lewiscole5193
      @lewiscole5193 4 years ago

      @ozan o.
      OSs have to change over time to meet new hardware and/or user demands, or else they die off.
      Unix is no different and has evolved over time to be different than what it originally started out as.
      So in a very real sense, I suspect that Tony Hoare's famous saying, “I don't know what the language of the year 2000 will look like, but I know it will be called Fortran,” has applicability to OSs with "Linux"/"Unix" being substituted for "Fortran".
      And keep in mind that there are already environments where "Linux"/"Unix" is not king ... real time environments such as can be found in cars, where QNX, a proprietary message passing microkernel based OS (which can run on ARM based systems by the way), is already more common.
      Yet, thanks to the Posix standard and the QNX people's interest in it, QNX offers a similar interface ("abstraction") to application programs so that their developers feel warm and fuzzy about it.
      I suspect the same thing will likewise happen with any OS that depends on C, including Fuchsia.

    • @lewiscole5193
      @lewiscole5193 4 years ago +2

      @ozan o.
      > As you know, processes never really
      > pause in posix, I don't know if it
      > was due to hardware restriction or
      > design error during constructing
      > of unix back then.
      I don't know what you mean by "processes never really pause in posix".
      Posix is an interface standard for OSs that just happens to look like the interface that Unix/Linux typically used to present.
      It's not an OS itself.
      An OS can be something other than Unix/Linux entirely under the hood and yet present a Posix compliant interface as is the case with QNX which is a proprietary message passing microkernel based OS that is Posix compliant as I indicated before.
      To the extent that Posix was supposed to look like Unix/Linux to the outside world (programmer), various interface calls such as a file read or write do block (pause) because that's what they historically did in Unix/Linux in The Good Old Days.
      That doesn't mean that an OS can't natively use non-blocking interfaces internally that look like they are blocking to the user.
      > there is also root privilege problem.
      Again, I don't know what you mean since Posix isn't an OS.
      > Plus Android turn into giant layers of burger.
      > I guess that's why google wanna leave Android.
      Android *IS* Linux by another name. Really.
      > if any other new os becomes complicated
      > and consist of many layers in the future,
      > it will be loop then they will be wandering
      > new solutions in the future:).
      Again, OSs change over time or they die.
      To the extent that everyone thinks that what they want done is the way things should be, OS developers are likely to toss in lots of crap to satisfy different users.
      If you want a lean, mean OS for your specific machine(s)/application(s), feel free to write one yourself ... and spend forever doing it.

  • @_mrgrak
    @_mrgrak 4 years ago +1

    The best programming related content on youtube right now. Creel explains complex topics simply, truly a great teacher. Looking forward to the next video!

  • @CallousCoder
    @CallousCoder 1 year ago +1

    ARM 64 CPUs actually have a couple of assembly dialects. You have your AARCH64 but also your Thumb instructions, which are a smaller instruction encoding to save space.

  • @SimGunther
    @SimGunther 4 years ago +154

    Gotos are NOT considered harmful
    Wormholes on the other hand are considered VERY harmful

    • @k7iq
      @k7iq 4 years ago +31

      If one does not like "goto" then just rename it to jmp and then it's OK because it's what the compiler might output in assembly anyway ! 😁

    • @imperatoreTomas
      @imperatoreTomas 4 years ago +4

      Goto is my favorite function

    • @programaths
      @programaths 4 years ago +8

      In BASIC, well, it was very present. I learned that on my own and was used to put GOTO everywhere as it was the way to skip code based on a value "ON x GOTO label1,label2,label3" (or line numbers!)
      Then I used GOTO also to recycle code (as in GOSUB).
      Very good for state machines too, even if I didn't know it had a name.
      Then I had to take visual basic courses at school and the teacher was pulling her hair reading my code...no FOR and IF, GOTO worked just fine. On top of that, I kept my habit of reusing code.
      I am not even sure I would be able to understand my own code as I totally forgot that habit. Still, have good memories of that because the teacher ended up saying she will not correct it anymore and just give points for it working as intended. ^^ At the same time, others had troubles to understand what a variable was and I had already implemented snake and Sokoban just for fun :-D
      (As devs, we find it to be very simple, but I taught a bit too and this is a huge hurdle!)

    • @LionKimbro
      @LionKimbro 4 years ago +16

      Wormhole = en.wikipedia.org/wiki/COMEFROM

    • @roygalaasen
      @roygalaasen 4 years ago +3

      @@programaths when I started out with computer classes back in 1991, we had to draw flowcharts before we were allowed to write a single line of code. Only one entry point, one exit point and no lines were allowed to cross, essentially banning goto entirely.
      Now my favourite programming language, Swift, is sometimes forcing you to use a label to tell which loop you want to BREAK out of, which is essentially a goto in disguise.
      My brain cringes but I have to get used to it lol
      Edit: to clarify. Break in all programming languages breaks out of the nearest LOOP. If you are in a switch .. case you will still break out of the nearest loop. In Swift you will break out of the switch case, still stuck in the loop unless you label the loop you want to break out of.

  • @PaulaBean
    @PaulaBean 1 year ago

    When the rubber hits the road, you can always benchmark the speeds of your C++ code against assembly code. Measurement trumps speculation. Thanks for the nice video!

  • @herrbonk3635
    @herrbonk3635 3 years ago +11

    2:34 _"That one clockcycle is called the latency"_ Not really, that one cycle is called _throughput_ in these contexts. The latency *for simple instructions* (like ALU reg,reg/im) usually equals the number of pipeline stages. In a simple pipelined CPU, that would be: fetch+decode+calculate+write result, i.e. 4 stages and so 4 clock cycles. For the 486, that was five stages and five cycles, for the P4 it was around 20 stages and cycles, and so on (again for simple instructions like ALU reg,reg/im).

    • @laurelsporter
      @laurelsporter 3 years ago

      But calculate can be repeated ad nauseam, and as long as that can go on, write can be hidden. The full pipeline isn't executed fully for each instruction before the next one executes.

    • @herrbonk3635
      @herrbonk3635 3 years ago

      @@laurelsporter Yes, that's the basic idea with a "pipeline", i.e. having all the stages of the instruction execution fully overlapping, so that (different stages of) several instructions in a sequence can be processed at the same time.
      (Typically instruction fetch -> decode -> effective address calculation -> operand fetch -> ALU -> write-back.)

    • @TellowKrinkle
      @TellowKrinkle 2 years ago

      Don't know how people talked about the 486, but on modern processors, when people talk about latency, they mean the number of cycles from when the register value is first needed to when it's available to the subsequent instruction. If your CPU has forwarding circuitry (like every modern processor), that's only the number of calculation stages.
      For the example of an `inc rax`, if you had four of those in a row, the cpu would fetch all four in parallel, decode them all in parallel, and calculate them serially, with each one forwarding its result to the next without waiting for writeback. In the end, four (dependent) `inc rax`s would run in four consecutive clock cycles, which is why `inc` is considered to have a latency of just 1 cycle, not 20 or however many a modern processor's pipeline has. The throughput of inc is not 1 but 1/4 for a skylake processor, meaning that the processor can execute four non-dependent inc's in one clock cycle.
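The dependent-versus-independent distinction can also be seen from C. The sketch below uses illustrative function names and makes no timing claims. It sums an array once through a single accumulator, where every addition waits on the previous one, and once through four accumulators that form independent chains. Whether the second form is actually faster depends on the CPU, and an optimizing compiler may already restructure or vectorize the first loop on its own, so treat this as a picture of the idea rather than a recommended rewrite.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each addition depends on the result of the one
 * before it, so the adds form a single serial chain. */
uint64_t sum_one_chain(const uint32_t *v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Four accumulators: the four additions in each iteration do not
 * depend on one another, so the hardware is free to overlap them. */
uint64_t sum_four_chains(const uint32_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) {  /* leftover elements when n is not a multiple of 4 */
        s0 += v[i];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum for any input. Only the shape of the dependency chain differs.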

  • @Lantalia
    @Lantalia 3 years ago +1

    So, with regards to #1, inline assembly skips the function call overhead; the main reason to do it is to use instructions not yet supported by your compiler

  • @3Balala3
    @3Balala3 4 years ago +7

    Great video, helps a lot with understanding assembly's place and purpose nowadays. Also great timing. Tomorrow I have an exam in assembly. We are programming on an emulated DOS program. Really, really interesting... :D

  • @stevem3432
    @stevem3432 3 years ago

    I began learning assembly at uni this semester and I actually enjoy it. Thanks for these videos.

  • @theDemong0d
    @theDemong0d 4 years ago +3

    In my experience writing assembly (mostly to capitalize on AVX), yes the function call overhead is a huge performance hit, but you need to write your program in assembly anyways because when you switch to AVX intrinsics, you need to know what assembly you want the intrinsics to produce. Writing the function first in assembly makes it easy to translate into AVX intrinsics, and the intrinsics should allow you to write C++ that compiles almost exactly instruction-for-instruction identical to your handwritten assembly. Yeah, it's not quite as cool as your program running your handwritten x86, but it's the next best thing and with the call overhead eliminated, you can reap large performance boosts.

  • @programaths
    @programaths 4 years ago +1

    First year in school: Compute the volume of a cone...in assembly!
    Most students were blocked on the division!!! That's when they learn about overflow AND underflow.
    I do not remember the ins and outs, but the division gives you a good ride if you didn't pay attention to the curriculum.
    Then it's while you are doing your work that you realize that registers can be split in different ways, and that there is a flag register too.
    At that time (15 years ago), there was "help PC" with nice explanations of all of this...
    Another difficulty of assembly is that it's "verbose". In higher language, "if" is identified as is. In assembly CMP+JNE,JEQ,JZ,JNZ,JNP.
    And even conditions with conjunctive or disjunctive becomes challenging.
    Another nicety was using the stack for local variables instead of trying to guess which register is safe to use ^^
    It's a bit cloudy, because it's far away now. But that wasn't that easy! It's a gymnastic on its own!
    But overall, whatever is the language, programming is really complicated.
    It's all about solving problems and expressing the solution as code...And most of the time, the problem to be solved is also to be found!
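To illustrate the "verbose" point about `if` turning into a compare plus a conditional jump, here is a small C sketch with hypothetical function names. The first function is ordinary structured code. The second expresses the same logic using only tests and `goto`, which is close to the shape a compiler (or an assembly programmer) ends up with. The mnemonics in the comments are loose x86-style annotations, not the output of any particular compiler.

```c
/* Structured version: count the negative values in v[0..n-1]. */
int count_negatives(const int *v, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] < 0) {
            count++;
        }
    }
    return count;
}

/* Same logic with every for/if replaced by a test and a goto. */
int count_negatives_goto(const int *v, int n) {
    int count = 0;
    int i = 0;
loop_top:
    if (i >= n) goto done;      /* cmp i, n     ; jge done */
    if (v[i] >= 0) goto skip;   /* cmp v[i], 0  ; jge skip */
    count++;                    /* inc count               */
skip:
    i++;                        /* inc i                   */
    goto loop_top;              /* jmp loop_top            */
done:
    return count;
}
```

Note how each condition is inverted in the `goto` form: the jump skips the body when the original `if` would not have run it.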

    • @WhatsACreel
      @WhatsACreel  4 years ago +1

      So true! Cheers for watching :)

  • @rfvtgbzhn
    @rfvtgbzhn ปีที่แล้ว

    From what I heard, you can get a significant performance boost in some cases by disassembling the compiled code and rewriting parts in Assembly language.

  • @jeffm2787
    @jeffm2787 3 ปีที่แล้ว +5

    I was writing x86 before it was called x86. Did 6502, 6809, etc. as well. Stopped when the 486 came out.

  • @RufianEmbozado
    @RufianEmbozado ปีที่แล้ว

    Assembly will always retain two strong points. First, when you learn to code in assembly you go through a rush of "illuminations" (I'm always thinking of 8-bit platforms because they are simple enough to have a grasp on all the landscape, and because I'm that old. Nothing is yet done, you push and pull all those pesky bits all over the place "by hand", a blazingly fast hand) that put a lot of pieces of the information science puzzle right into place. Second, there is an inherent beauty in assembly code. Motorola 68000 had a beautiful, beautiful assembler (I crashed on it with an Amiga 500 and, man, what a joy it was! All those fancy chips at your command... Most missed piece of hardware ever). I never got that feeling when I tried to code assembly on i386. I still think learning to write assembly for any CPU is worth the price. No need to do great things, just some humble tasks. You'll have the ride of your life (as a nerd, at least) and won't fall for those kinds of misconceptions. Great video, of course. Assembly has the virtue to dispel all sorts of misconceptions. But assembly itself is covered by some key misconceptions which keep it from teaching all it can.

  • @spacewolfjr
    @spacewolfjr 4 ปีที่แล้ว +2

    The legend returns! Thanks Mr. Creel.. man..

  • @y2ksw1
    @y2ksw1 3 ปีที่แล้ว +8

    I have been programming in Assembly for a large part of my life, and the most challenging task was writing code so that it runs in parallel in the separate pipelines (superscalar). The example you have given would have been rewritten, possibly longer, in order to get the parallel mechanism working. One way would be:
    mov ebx, eax
    inc eax
    nop
    inc ebx
    So the first two run together, and then the remaining two. And we would gain at least 2 clock cycles.
    However: assembly made a lot of sense in the old days. Now, with multi-core multi-scalar processors and the brilliant optimisation of compilers, Assembly code pretty much died out.
    I still use it on special hardware though. I am eyeballing the Raspberry Pi Pico, for example 😊

    • @OpenGL4ever
      @OpenGL4ever ปีที่แล้ว

      inc eax
      mov ebx, eax
      Does the same job as your code and requires less RAM.

    • @y2ksw1
      @y2ksw1 ปีที่แล้ว

      @@OpenGL4ever It's not a question of memory, but to get part of this code running in a different pipeline and thus double up the speed.

    • @y2ksw1
      @y2ksw1 ปีที่แล้ว

      Your code would run 4 times slower

    • @OpenGL4ever
      @OpenGL4ever ปีที่แล้ว

      @@y2ksw1 Why should it? In my opinion it runs at the same speed.
      Your code might do
      mov ebx, eax
      inc eax
      in its own pipeline, but
      nop ; does nothing
      and
      inc ebx
      depends on the mov ebx, eax before.

    • @y2ksw1
      @y2ksw1 ปีที่แล้ว

      @@OpenGL4ever If you first do an operation on eax and then assign its value to another register, it stalls and waits to settle just that tiny bit, which doesn't allow the code to move to the other pipeline. I have timed these instructions very accurately, and your version, while technically correct, performs far less efficiently. In time-critical applications, such as the real-time graphics manipulation I was working on, code alignment and sometimes illogical reordering of instructions made the difference between fluent and stuttering graphics.
      Mainly I got filter and render code prepared by graphics specialists, and my task was to speed it up. But also big number mathematics and operating system libraries. Most of them grew noticeably in size, but were of unmatched speed.
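
      For anyone following this thread, a small sketch in C of what the two instruction sequences compute. The Regs struct and function names are made up for illustration, flags are not modeled, and nothing here says anything about timing or pairing, which is what the disagreement is actually about.

```c
#include <stdint.h>

/* Toy register file: only the two registers used in the thread. */
typedef struct { uint32_t eax, ebx; } Regs;

/* mov ebx, eax ; inc eax ; nop ; inc ebx */
Regs sequence_a(Regs r) {
    r.ebx = r.eax;   /* mov ebx, eax */
    r.eax += 1;      /* inc eax      */
                     /* nop          */
    r.ebx += 1;      /* inc ebx      */
    return r;
}

/* inc eax ; mov ebx, eax */
Regs sequence_b(Regs r) {
    r.eax += 1;      /* inc eax      */
    r.ebx = r.eax;   /* mov ebx, eax */
    return r;
}
```

      Both leave eax and ebx holding the original eax plus one, so the two versions agree on the result. Which one runs faster depends on the specific CPU and has to be measured there.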

  • @BlackStarEOP
    @BlackStarEOP 3 ปีที่แล้ว +1

    8:10 "Race conditions are brilliant" :D (y) Thumbs up for that... Tracking down race conditions has been the most difficult part of my career as a software engineer.
    If you implement something using more than 1 thread, if you carefully think things through, there's not much you can do wrong. However... when suddenly one guy in your team says "yes I know how to improve the performance, just put this and this into its own thread" then you know you need to buckle up. You're in for one hell of a ride...

  • @danepane527
    @danepane527 3 ปีที่แล้ว

    The algo sent me here.. was watching a bunch of Coach McGuirk videos.. subbed!

  • @AngDavies
    @AngDavies 3 ปีที่แล้ว +1

    Minor nit/clarification: while you definitely need to know assembly on a deep level to be able to code an optimising compiler- after all, it's a program that turns code in a given language into as efficient/fast machine code representation as possible.
    That doesn't mean you necessarily should write one in assembly itself- it wouldn't make faster code, only code, faster.
    The better option is often to write the compiler in the language that you intend to compile with.
    You spend loads of time writing a compiler that can create really optimised code for a given platform, build it using some existing compiler, which doesn't make very optimised code, and so the compiled compiler takes ages to compile code.
    But now you've just created a program that turns your code in your language into optimised machine code, so just feed the original code through the new compiler, and you now have an optimised optimizing compiler :D
    Having just "GCC" that compiles to your machine is so much better than having to find a version of GCC tailored to your exact platform

  • @DownhillAllTheWay
    @DownhillAllTheWay 3 ปีที่แล้ว +11

    12:15 "Assembly language is the language of the hardware."
    Permit me to nit-pick. *_Machine language_* is the language of the hardware. Asm is a near-English representation of it.
    Many years ago, I had access to a Data General Nova computer (it was the back-up machine on a customer site). I knew how to swap modules, and I was OK at hardware maintenance (scopes, and that sort of stuff) but I didn't know anything about computers at the time. By reading the manual, I entered a 3 (in binary) into a memory address, and a 6 into another address using the front-panel switches, then I wrote an instruction in machine code to add them together - and it produced a 9 in the destination address - a thrill that I remember to this day.
    I learned the machine code pretty well on that machine, and wrote an assembler in binary code. I had been intending to write diagnostics on the machine, but I moved on before I did that, and never used my (rather strange) assembler. Well, I had never seen an assembler up to that point, so I didn't have much to go on.
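
    A rough illustration of the machine-language versus assembly distinction, as a sketch in C. The 8-bit encoding below is entirely made up for this example (it is NOT the Nova's instruction format): the high nibble is the opcode and the low nibble is a memory address. The hex words are the "machine language"; the mnemonics in the comments are the "assembly".

```c
#include <stdint.h>

/* Hypothetical toy encoding: high nibble = opcode, low nibble = address. */
enum { OP_HALT = 0x0, OP_LOAD = 0x1, OP_ADD = 0x2, OP_STORE = 0x3 };

/* Executes the program in mem[] starting at pc until HALT (or an unknown
 * opcode) and returns the accumulator. */
uint8_t run(uint8_t mem[16], uint8_t pc) {
    uint8_t acc = 0;
    for (;;) {
        uint8_t word = mem[pc & 0x0F];
        pc++;
        uint8_t op = (uint8_t)(word >> 4);
        uint8_t addr = (uint8_t)(word & 0x0F);
        switch (op) {
        case OP_LOAD:  acc = mem[addr]; break;
        case OP_ADD:   acc = (uint8_t)(acc + mem[addr]); break;
        case OP_STORE: mem[addr] = acc; break;
        default:       return acc;   /* HALT or unknown opcode */
        }
    }
}

/* Fills memory with raw machine words; the comments give the mnemonics
 * an assembler would accept for the same words. */
void load_demo(uint8_t mem[16]) {
    for (int i = 0; i < 16; i++) mem[i] = 0;
    mem[0] = 0x1D;   /* LOAD  13 */
    mem[1] = 0x2E;   /* ADD   14 */
    mem[2] = 0x3F;   /* STORE 15 */
    mem[3] = 0x00;   /* HALT     */
    mem[13] = 3;     /* first operand  */
    mem[14] = 6;     /* second operand */
}
```

    After load_demo and run(mem, 0), address 15 holds the sum of the two operands, which is the same experiment as toggling a 3 and a 6 in through the front panel and adding them.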

    • @ancapftw9113
      @ancapftw9113 3 ปีที่แล้ว

      The best example I saw was a guy making a 6202 (I think) program by writing to a ram chip and feeding it into the processor. He showed what the assembly would look like, but had to program it in hex code.

  • @wrtlpfmpf
    @wrtlpfmpf 3 ปีที่แล้ว

    One thing that doing a project in assembler on a small chip can really help with is coding style. I used to write multiple screen long functions with control structures nested several levels deep. Writing in assembler can really teach you how to write code that is as simple as possible, yet correct. I once did that for a little project on an ATMega. Those are cute little 8-Bit micro controllers. Since they have different addresses for RAM and Flash, programming them in assembler is a lot less painful than, for example, C. Anyhow that project really helped me write readable code when I later did C projects. I later played around with those microcontrollers in C and looking at the assembly created by the compiler I have to say that it's highly dense.
    (The rationale behind assembler was that I had more experience with AVR assembler and that that code would use the remaining flash program storage as data storage, something that is even harder to do in C)

  • @BrightBlueJim
    @BrightBlueJim 3 ปีที่แล้ว +1

    So to summarize a couple of things you said:
    1) Functions written in assembly don't really run faster than compiled functions.
    6) Assembly is still necessary for low-level optimization, where speed is really important.
    Also, your point on atomic operations applies just as directly to C and C++, or indeed for ANY program written to take advantage of multi-threading.

  • @mikefochtman7164
    @mikefochtman7164 3 ปีที่แล้ว

    Good information. When we had some ASM instruction dependencies, we sometimes would look down a few lines and see if we could move some other instruction in between the dependent instructions. That meant we could space out the two dependent instructions to let the first one finish and give another ALU something to do while the first one crunched.
    Also worked on a different processor that had a special increment. Used in the OS interrupt handling, it had a couple of instructions that were non-interruptable so we could guarantee that the increment and sto would be atomic.
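
    The same idea of spacing out dependent instructions shows up in C as using more than one accumulator. A minimal sketch, with made-up function names; whether it actually helps depends on the CPU and on what the compiler already does, so treat it as an illustration rather than a guaranteed speedup.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition depends on the previous one. */
uint64_t sum_single(const uint32_t *v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* Two accumulators: the additions into s0 and s1 are independent of each
 * other, so neighbouring additions do not have to wait on one another. */
uint64_t sum_interleaved(const uint32_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n) s0 += v[i];   /* odd element left over */
    return s0 + s1;
}
```

    Both functions return the same total; the second one just gives the hardware two independent chains to work on.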

  • @PvblivsAelivs
    @PvblivsAelivs 3 ปีที่แล้ว

    I have seen many people say that compilers do these wonderful tricks and that hand-coded assembly language is not (generally) faster than a compiler's output. While there may be some compilers that do this, no compiler I have actually used does so.
    "You might get the right result."
    Especially if you use the lovely little LOCK. Any processor that can feasibly be part of a multi-processor system needs a way of executing at least certain instructions without interference from other processors.
    "The CPU will perform the instruction a lot slower."
    It will if two processor units are trying to access the same memory at the same time. After all, one must stall. But the processor that "gets there first" has a negligible performance penalty. It was a two-cycle penalty on the 8086. (I only have timing information up to the 486.)
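
    For readers who want the C-level equivalent of a LOCKed increment, here is a minimal sketch using C11 atomics. The names hits, record_hit and hits_so_far are made up for illustration. On x86, compilers typically turn atomic_fetch_add into a LOCK-prefixed instruction, but that is up to the compiler and target, so check the generated assembly rather than assuming it.

```c
#include <stdatomic.h>

/* Shared counter; static storage, so it starts at zero. */
static atomic_int hits;

/* Atomically adds one and returns the value from before the increment.
 * Safe to call from several threads at once. */
int record_hit(void) {
    return atomic_fetch_add(&hits, 1);
}

/* Atomically reads the current count. */
int hits_so_far(void) {
    return atomic_load(&hits);
}
```

    A plain "hits++" on an ordinary int would be a read, an add and a write that other threads can interleave with; the atomic version makes the whole update indivisible.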

  • @trashtrashisfree
    @trashtrashisfree ปีที่แล้ว

    I always wrote a good macro library for the assembly I was working in. System 360/370 didn't even have stacks so my first priority was writing things to push and pull values and create subroutines. Everyone else was hand-cutting every single line. Far more error free. Same for other issues in 6502.

  • @connclark2154
    @connclark2154 3 ปีที่แล้ว +1

    I think one thing that wasn't mentioned was that assembly allows you flexibility that higher level languages do not. With this flexibility you can implement more efficient algorithms. For example, between assembly routines you can return more than one value from a function by using a custom calling convention. It's the ability to leverage these freedoms that gives assembly its power and performance.

    • @bigshrekhorner
      @bigshrekhorner ปีที่แล้ว

      That's not something exclusive to Assembly.
      C is able to do this by using pointers as function arguments. Even higher level languages are also able to do this by using tuples that mix types (or simply the same type), or with methods similarly to C, if they allow memory management concepts like pointers.
      Compilers and compiler engineers are extremely smart and definitely way smarter than me or you. That means that if you have thought of an efficient implementation of an algorithm in Assembly, it's also pretty likely the compiler engineers have also thought of it and implemented it. At least if we are talking about mainstream compilers, like GCC or Clang (for the case of C/C++)
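
      A minimal sketch in C of the two approaches mentioned here, returning two values through a small struct and through a pointer argument. The type and function names are made up for illustration. On common ABIs a struct this small is usually returned in registers, though that is an ABI detail and not something the C language promises.

```c
/* Two results bundled into one return value. */
typedef struct { int quot; int rem; } DivResult;

/* Returns quotient and remainder together; b must be nonzero. */
DivResult div_struct(int a, int b) {
    DivResult r = { a / b, a % b };
    return r;
}

/* Returns the quotient and writes the remainder through a pointer;
 * b must be nonzero and rem must point to valid storage. */
int div_outparam(int a, int b, int *rem) {
    *rem = a % b;
    return a / b;
}
```

      Either form gives the caller both values from a single call, which covers the everyday use of a "two return values" calling convention.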

  • @thomasmaughan4798
    @thomasmaughan4798 3 ปีที่แล้ว

    There was a time when assembly was much faster than compiled but eventually the compiler optimizations produced code that executed efficiently. Depending on what one is doing, assembly is considerably smaller. A function in COBOL to parse a text file was 30 kilo-words and took 30 seconds to execute; I re-wrote it in assembly and it produced an executable that was only 3 kilo-words and parsed the same file in 3 seconds. 1/10th the size and ten times faster! But that extreme example is a result partly of COBOL not really a good choice for that sort of thing and my re-write also used static linking; everything it needed was already linked in the executable so at run time, no "fixups" were needed.

  • @thadtheman3751
    @thadtheman3751 3 ปีที่แล้ว

    Actually part of the complexity of assembler comes from the fact that "decorations" of instructions are not uniform. To clarify I will make up an example (it's been a while so don't expect this to be a real world example ).
    You might have INC A,N.
    increase A by N.
    A might be a memory location and N a number (direct addressing)
    INC $A, N
    A might be a memory location pointed to by a memory location (indirect addressing)
    INC [$A],N
    N might be a memory location
    INC A,$N
    ...
    The thing is that some commands accept some of these addressing modes and others do not. A JMP, for example, might accept all addressing modes, but a JSR would not. So it gets complicated keeping track of which instruction does what.
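
    In C terms, the made-up modes above map loosely onto the following sketch. The names are hypothetical and the comments refer to the invented INC notation from this comment, not to any real instruction set.

```c
#include <stdint.h>

/* A variable at a fixed memory location. */
static int32_t counter;

/* Destination is a fixed location, the amount is a plain number:
 * roughly "INC $A, N". */
void add_direct(int32_t n) { counter += n; }

/* Destination is whatever location the pointer holds:
 * roughly "INC [$A], N". */
void add_indirect(int32_t *p, int32_t n) { *p += n; }

/* The amount itself is fetched from memory: roughly "INC A, $N". */
void add_from_memory(const int32_t *n) { counter += *n; }

/* Reads the fixed location back. */
int32_t read_counter(void) { return counter; }
```

    A C compiler picks the addressing mode for each of these on its own; in assembly you have to know which combinations a given instruction actually supports.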

  • @sergiomarroquinjr3587
    @sergiomarroquinjr3587 3 ปีที่แล้ว

    I always seem to learn something new from you. Keep it up!

  • @microdocker
    @microdocker 2 ปีที่แล้ว

    Very good and explanatory shot.
    One small weird thing (not related to the topic): the guy is literally sitting in front of a mic and still recording his voice on the on-camera microphone ^_^

  • @vikassm
    @vikassm 4 ปีที่แล้ว +1

    Fantastic video and channel! Subbed.
    My 2¢ about the poor audio: Use your mobile phone with a ~5$ lapel mic to capture your "B-Roll" audio 🙂
    That way if your nice desktop mic doesn't record for some reason, the backup audio from your cellphone is still wayyyyy better than the absolute garbage camera mic.
    Just clap once (Aaand ACTION) at the beginning and the end of each take to simplify A/V sync during editing.

  • @cthutu
    @cthutu 3 ปีที่แล้ว +1

    INC RAX won't execute in one clock cycle on a x86 because of fetching and decoding. However, pipelining can make it seem like it does.

  • @BobDiaz123
    @BobDiaz123 3 ปีที่แล้ว

    When I program the Microchip PICs in Assembly, the code is very simple. The RISC instructions are only 33 or 35 depending on the core used. The fun part is making the task work in the PIC's limited memory.

    • @GodzillaGoesGaga
      @GodzillaGoesGaga 3 ปีที่แล้ว

      PIC's are glorified state machines. They have no stack!! At least the early ones I used!!

  • @dcocz3908
    @dcocz3908 3 ปีที่แล้ว

    I agree but there are lots of situations where the compiler simply fails for example gnuarm won't use multiple load and store properly which for me generated a lot larger code that wouldn't fit in SRAM so it had to run with wait states from flash on my project. By re-writing it in hand assembly allowed me to get a much smaller function, allowing it to be moved into SRAM with the data that was required by application and that is where I got a really large speed improvement. I couldn't have done it without swapping micro for larger memory footprint using just compiler

  • @johnyoungquist6540
    @johnyoungquist6540 4 ปีที่แล้ว +62

    Talking about assembly in general across different processors is fraught with trouble. I do embedded apps in 8051 assembly only. In fact I wrote the assembler. I can promise that C in the 8051 environment is at least 500% slower and also 500% bigger than assembly even for simple things that C should be good at. It is widely accepted that compilers use a tiny fraction of the instruction set and leave a lot behind. It is easy to point out that ordinary languages contain no information to help compilers use special instructions or constructs. The assembly programmer will recognize an AES algorithm and use the AES instructions; a C compiler won't. In modern processors the compiler code generator could hold a significant advantage over the programmer with a detailed knowledge of architecture magic like pipelines, cores, caches, threads. I don't know how they handle the moving target of the new processor of the week, or tell what processor they will run on. One processor's optimization is another's downfall. In contrast the assembly programmer wizard may better the C code speed by 100 times or more with devilishly clever thinking and detailed knowledge of the whole instruction set.
    One thing that is universally overlooked is how assembly and high level applications are similar. Apps are typically constructed of functions tailored to do common things for that app. If you need 98 digits precision you'll be writing routines to handle that in any language. These modules are easy to define and test and spread among several programmers. We build bricks first then walls later. A function call is about the same complexity and work to implement in any language. Now all of a sudden apps in all languages are basically function calls and logically look about the same. Neither is more difficult than the other. The planning stage and logic can be nearly identical for any language.
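
    As an illustration of the "bricks first" point, here is a sketch in C of the kind of small routine a high-precision package is built from: adding two long decimal numbers digit by digit. The name add_digits and the least-significant-digit-first layout are assumptions for this example. The structure (loop, add with carry, store) would be the same in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two n-digit decimal numbers stored least-significant digit first
 * (each element is 0..9). Writes n digits to out and returns the final
 * carry (0 or 1). */
uint8_t add_digits(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    uint8_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t s = (uint8_t)(a[i] + b[i] + carry);  /* at most 9 + 9 + 1 */
        carry = (uint8_t)(s >= 10);
        out[i] = carry ? (uint8_t)(s - 10) : s;
    }
    return carry;
}
```

    With n set to 98 this handles 98-digit operands; subtraction, multiplication and the rest are more bricks of the same shape.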

    • @donjindra
      @donjindra 4 ปีที่แล้ว +5

      Exactly. People who don't regularly program in assembler have no idea how much faster assembler is than any high level language. Compiler optimization cannot compete with a programmer who knows the instruction set intimately and can tailor the use of those instructions for a particular task. A 10x improvement in speed is pretty normal. OTOH, a poor programmer is not going to benefit much from assembler code. You have to know what you're doing. The 8051 is a good example. That cpu is so weird a compiler can't deal with it efficiently. A compiler does better with something like ARM.

    • @SimonBuchanNz
      @SimonBuchanNz 4 ปีที่แล้ว +14

      @@donjindra compiler optimisation can definitely best any reasonable amount of effort for the majority of code, assuming you're not using the trivial C implementations that come with microcontrollers - inlining and avoiding pipeline stalls is drudge work that's better to let the computer handle, especially when your problem is getting something working or cleaning up a mess, not making something faster. Not always, there's always going to be some cases that confuse a compiler enough that it's easier for you to use assembly than to figure out how to mangle your code so the compiler does the right thing, but advanced instructions are available through intrinsics, and compilers will auto-vectorize loops, and so on. The low hanging fruit is getting picked all the time.

    • @donjindra
      @donjindra 4 ปีที่แล้ว +1

      @@SimonBuchanNz I don't know why you think that. In fact, I don't even know what sort of code you have in mind. I don't advocate using assembler to add two register-width numbers.

    • @SimonBuchanNz
      @SimonBuchanNz 4 ปีที่แล้ว +3

      @@donjindra sorry, could you clarify what I said that you have an issue with? I was talking about your statement that "a compiler can never compete with [an assembly] programmer": trivially true in that said assembly programmer could at worst use the same instructions, but not practically true. Not sure where you're getting adding numbers from, but if that's literally all you're doing, then actually yeah, you probably will beat a compiler. It's the 50kloc of "adding two numbers" that's not worth the absurd effort to keep optimized in assembly, and mixing and matching can (depending on your baseline) actually pessimize the code since the compiler can't inline now.

    • @donjindra
      @donjindra 4 ปีที่แล้ว +1

      @@SimonBuchanNz Concerning adding numbers I said the opposite of what you think I said. If the task is simple, such as adding two numbers, the compiler does just fine. There's no point in resorting to assembler. It's the complicated, time consuming tasks that benefit from assembler. Compiler optimization was done by assembly language programmers. But they optimize general cases. They aren't magicians. They can't predict all particular cases. Therefore they cannot optimize for all of them. I have no idea what you mean by the end of your comment.

  • @den2k885
    @den2k885 3 ปีที่แล้ว +1

    Compilers optimize very well... for general purpose code, without knowing its data layout. It's very rare that a compiler will use SIMD instructions, and in the rare cases it does it won't make use of the inner characteristics of your problem, as it has no knowledge of them.
    Using Assembler I managed to double a linear Sobel algorithm's performance and triple a segmented integral table algorithm's performance. Not even the Intel compiler managed to equal those times.

  • @jp5000able
    @jp5000able 2 ปีที่แล้ว

    Back in the early 80's I did some 6502 assembly programming. What made it so difficult, the cpu was only 8 bits. There were no instructions for 16 bit numbers and floating point numbers.
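
    A sketch in C of what that meant in practice: a 16-bit addition done as two 8-bit additions with a carry in between. The Word16 type and add16 name are made up for illustration, and the 6502 mnemonics in the comments are only a rough guide to the usual CLC / LDA / ADC pattern.

```c
#include <stdint.h>

/* A 16-bit value held as two separate bytes, as an 8-bit CPU sees it. */
typedef struct { uint8_t lo, hi; } Word16;

/* Adds two 16-bit values one byte at a time, carrying from low to high. */
Word16 add16(Word16 a, Word16 b) {
    Word16 r;
    uint16_t low = (uint16_t)(a.lo + b.lo);   /* CLC ; LDA a.lo ; ADC b.lo */
    uint8_t carry = (uint8_t)(low >> 8);      /* carry flag after low byte */
    r.lo = (uint8_t)low;                      /* STA r.lo                  */
    r.hi = (uint8_t)(a.hi + b.hi + carry);    /* LDA a.hi ; ADC b.hi       */
    return r;
}
```

    Every wider operation (24-bit, 32-bit, floating point) had to be built up from byte-sized steps like this one.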

  • @GogiRegion
    @GogiRegion 3 ปีที่แล้ว +1

    I’ve actually looked into virus programming, mostly out of curiosity, and it looks like good hackers will use C and then compile to assembly for optimization, then assemble it. That’s assuming that you need high level functions in order to do what you need, you want it to take up as little space as possible so it’s harder to detect, and possibly want to remove null bytes (which is supposed to allow your code to work with a wider array of hacks since some rely on a lack of null bytes). It’s actually an interesting topic, and from what I was reading, it sounds like C is preferred over assembly for the same reason Linux is written primarily in C.

  • @DigitalPhage
    @DigitalPhage 3 ปีที่แล้ว +30

    "x86 Assembly Language Misconceptions" would be a more apt title, however a good video.

    • @TheBypasser
      @TheBypasser 3 ปีที่แล้ว +1

      Oh yeah, say Arduino compared to pure AVRASM is like a snail vs a ballistic missile (just like for the most of the RISC cores, HLL vs ASM that is).

    • @niclash
      @niclash 3 ปีที่แล้ว

      Misconception; x64 Instruction Set is a typical one. The micro controllers are typically magnitudes easier to learn fully. And then there are the funky/academic outliers, like 1 OpCode Instruction Set. But the majority of Assembly Languages out there are dozens, maybe 100 and a bit, and not the thousands in the Intel/AMD world.

  • @FORTRAN4ever
    @FORTRAN4ever 3 ปีที่แล้ว

    I programmed in assembly on a Sperry Univac 1143 mainframe computer in the early 1980's. Each instruction consisted of a 36 bit word. Commenting was a must. I would prefer to program in FORTRAN or COBOL anyday over assembly.

  • @michaelbuerge
    @michaelbuerge 3 ปีที่แล้ว

    Great stuff. Interesting and relevant info. Thanks.
    Allow me a remark about audio: You invested in a nice mic. Now you might want to think about the room you're recording in. Maybe put something absorbing in place to reduce room reverberation.

  • @tchiwam
    @tchiwam 3 ปีที่แล้ว +1

    Would be fun to see a video on transforming locked multithread to lockless thread with a thread manager and completely lock less multithread manager.

  • @erwinmulder1338
    @erwinmulder1338 3 ปีที่แล้ว

    I grew up programming home computers in the 1980s. You had to write assembly (and sometimes even translate it to numbers by hand) to make anything that would run faster than at a snail's pace. I mean 8 bit computers at 3.5MHz are not incredibly fast at anything. So if you had BASIC, which was interpreted (not even compiled) that was SUPER slow. You couldn't even draw an entire screen in one second most of the time. These days, I mostly work with assembly in writing (toy) compilers for my own programming languages. In the end, what any compiler really does is basically translate the source code to assembler instructions.

  • @Cubinator73
    @Cubinator73 4 ปีที่แล้ว +8

    15:49 I think you got something wrong there. Obviously, assembly is needed in all sorts of things like programming compilers and optimizing low-level routines. The "misconception" that "assembly language is no longer needed due to optimizing compilers" expresses the fact that your average programmer doesn't need to write assembly himself because far more competent people already did it and made their optimized routines available in the optimizing compiler. I myself only ever used assembly to explore how CPUs work and how compilers optimize stuff, but I never NEEDED to write my own assembly code for my own projects.

    • @lewiscole5193
      @lewiscole5193 4 ปีที่แล้ว +2

      That's nice ... OTOH being a former OS maintainer/developer, I used assembly a lot, not just because most of the OS was also written in assembly (which it was), but because it gave me control over data/code placement that no available compiler did/could, which was especially important in the bootstrap code I was responsible for the care and feeding thereof.
      And I suspect that's still true ... the hardware defines and uses data structures that I don't want/need a compiler guessing what sort of code should be generated for.

    • @WhatsACreel
      @WhatsACreel  4 ปีที่แล้ว +3

      Yes, I do wish that the proper position of ASM was expressed more clearly in computer science education. I was taught to fear the language during my degree, encouraged to neglect it entirely. Maybe it’s different in other institutions?
      I do not disagree entirely with the sentiment. But I do think it is skewed a little too far away from ASM. I think learning ASM for OS development or to understand the CPU are excellent applications!
      Cheers for watching and commenting folks :)

    • @lewiscole5193
      @lewiscole5193 4 ปีที่แล้ว +2

      @@WhatsACreel
      I have no idea how ASM is being taught in schools these days, but back when I was a student -- just after the dinosaurs had been killed off by an asteroid -- there was no question that any non-impaired human could outdo a compiler in terms of generating fast/small code.
      The reason why you were supposed to use an HLL was because it increased programmer productivity.
      Studies had supposedly been done that showed that the average number of DEBUGGED lines of code that could be produced per programmer per day was about TEN (10) independent of programming language.
      And because each HLL statement typically turned into more than one ASM line, that meant that if you could use an HLL, you should because you could potentially get more done using an HLL than you could ASM especially in terms of code that was supposedly "portable" across platforms.
      There were also supposedly studies that showed a wide variation in programmer output as well and so YMMV, but familiarity with a particular language also had a lot to do with programmer productivity (I don't recall how much).
      The gist of this is that I usually write in ASM because that's what I'm most familiar with, and because I'm no longer getting paid for what I write, it's my choice.
      I can speak C if I have to, but I don't consider myself fluent and I simply don't see the need to spend time becoming more fluent in C when I can do what I want probably (?) faster in ASM.
      What bothers me is that people who seem to shy away from using ASM seem to think that there's something fundamentally different in how you generate ASM code versus an HLL thrown at a compiler.
      To me, though, that's not the case.
      When I occasionally do write HLL code, I do the exact same thing that I do when I write ASM code, the only difference being how far "down" I "refine" the code before I come to a valid HLL or ASM statement.
      I just don't understand what it is that makes people think there's something special when it comes to how to write ASM code versus HLL code.
      It makes me think that maybe too much time is spent teaching the structure of various HLLs and not enough on how to think and solve problems.
      Just my opinion ....

    • @WhatsACreel
      @WhatsACreel  4 ปีที่แล้ว +2

      @@lewiscole5193 Ha! I know the feeling! I learned in the 90’s. Things have changed a lot since then. Especially Assembly language. It’s gone from maybe 100 instructions and 16 registers to massive SIMD register files and 3000 instructions!
      I certainly agree that programmer productivity and portability are very important. And the choice of language is a big part of that. Sometimes ASM is a good fit, and sometimes it is not. I do love how fast it can be, and how flexible. There’s some brain-melting, deep trickery that is natural to ASM, which is too low level to be practical in HLL’s. But for the most part, anything is pretty achievable in any language, and so it becomes a matter of choosing the best tool for the job.
      I couldn’t agree more! The problem with ASM is the perception of it. Folks shy away from it in a way that might not be warranted. It’s just a language, after all. IMHO, it’s a really fun and powerful language.
      I do love a good bit of HLL code too, but ASM will always hold a special place for me. If for nothing else, I made a video about ASM 10 years ago and put it up on TH-cam, and have since built this little channel :)

    • @lewiscole5193
      @lewiscole5193 4 ปีที่แล้ว

      @@WhatsACreel
      Ten years? My how time goes by when you're having "fun".

  • @kevinz1991
    @kevinz1991 3 ปีที่แล้ว

    great information and great delivery. thanks a lot for the time you put into this. subscribed

  • @gideonz74b
    @gideonz74b 2 ปีที่แล้ว +1

    @Creel: Executing an instruction in one cycle does *not* mean that the *latency* is one cycle. It means that the *throughput* is one instruction per cycle. The latency is always a lot more than that, because it has to pass through the pipeline.
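
    To put rough numbers on latency versus throughput, here is a sketch in C of the textbook idealized model: a pipeline of a given depth that accepts one new instruction per cycle and never stalls. The function name is made up, and real CPUs (superscalar, out of order, with stalls) do not follow this formula exactly.

```c
/* Cycles needed to finish `count` instructions on an idealized pipeline
 * with `depth` stages: the first instruction takes `depth` cycles to get
 * through, and each later one completes one cycle after the previous. */
unsigned long pipeline_cycles(unsigned long depth, unsigned long count) {
    if (count == 0) return 0;
    return depth + count - 1;
}
```

    With 5 stages, a single instruction takes 5 cycles (that is the latency), while 100 instructions take 104 cycles, which is very close to one per cycle (that is the throughput).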

  • @wingman2tuc
    @wingman2tuc 3 ปีที่แล้ว

    Modern CPUs also have "deep" pipelines: fetch -> decode -> exec -> mem access -> writeback, as a very simple example.
    Today's CPUs can have 20 to 40 steps for completing a single instruction.
    Things can be pipelined, but you need a very intelligent and complicated forwarding unit and branch predictor in order to take advantage of pipelines.
    Understanding modern CPU architecture is a must in order to use ASM efficiently. Also, ASM can be CPU-specific, so it may not work on other CPUs.

  • @DIYRepairHour
    @DIYRepairHour 3 ปีที่แล้ว

    Lock? Apart from caching, the penalty can differ vastly depending on the CPU and on the clock configuration between that CPU and the memory controller...

  • @NomenNescio99
    @NomenNescio99 3 ปีที่แล้ว

    A long time ago in a galaxy far far away, before the time when gcc used the mmx instruction set to optimize vector arithmetic, there were sometimes huuuge gains to be had from inlining some assembly code.

  • @sikkavilla3996
    @sikkavilla3996 4 ปีที่แล้ว

    Happy Holidays @Creel!

    • @WhatsACreel
      @WhatsACreel  4 ปีที่แล้ว

      Happy holidays to you :)

  • @emjizone
    @emjizone ปีที่แล้ว +1

    3:53 This "one instruction per cycle" might be true for the oldest machines, with no clever vectors and lookups and with a very limited set of instructions. This might explain why people believe it to be still true today.
    In that case you'd have to program most of usual math functions yourself (modulo, square root, etc…) and they would take several cycles anyways.
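
    As an example of a math routine you would have had to write yourself, here is a sketch in C of an integer square root built only from shifts, adds, subtracts and compares (the classic bit-by-bit method). The name isqrt32 is made up for this example.

```c
#include <stdint.h>

/* Returns floor(sqrt(n)) using only shifts, additions, subtractions and
 * comparisons, one result bit per loop iteration. */
uint32_t isqrt32(uint32_t n) {
    uint32_t result = 0;
    uint32_t bit = 1u << 30;          /* highest power of four in 32 bits */

    while (bit > n) bit >>= 2;        /* start at the largest one <= n */

    while (bit != 0) {
        if (n >= result + bit) {
            n -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}
```

    Each pass through the loop costs several instructions, so even on a machine that really did run one instruction per cycle, a square root would still take many cycles.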

  • @amigalemming
    @amigalemming ปีที่แล้ว

    15:45 I am too lazy to plan register usage myself, thus I use LLVM to generate real assembly code for me. But I inspect the results regularly in order to find weaknesses in LLVM or my code.

  • @michaelmoorrees3585
    @michaelmoorrees3585 3 years ago

    Still write assembly code for microcontrollers, such as the AVR and 8051 lines. Those are a bit painful. Writing assembly on the old Motorola HC line was beautiful, in comparison. I have a hierarchy of pain. If the final binary is less than 4K, it's gonna be assembly. If 16K or larger, it will mostly be high level (i.e. C), but some critical areas will still be in assembly.
    Optimizing compilers are similar to autorouters when laying out a PCB. They will screw things up. Often you have to go into the trenches and do some manual labor.

  • @derzweistein8973
    @derzweistein8973 4 years ago +1

    Where do I learn "everything that [I need to learn] about a computer" to gain significant speed in assembly? (especially the fun hardware stuff like OoO execution, loop streaming, different execution engines)

  • @RT55J
    @RT55J 3 years ago

    The effectiveness of unrolling loops as a performance optimization can vary wildly depending on the caching situation. If your architecture has no cache to worry about, then it would give a definite performance boost. However, if you have an instruction cache to worry about, then (depending on the size of the unrolled loop vs the cache) you might suffer a performance decrease from the extra instruction fetching from RAM.

  • @lgrantcdg
    @lgrantcdg 3 years ago +3

    IBM’s DB2 database for the IBM mainframe (MVS) is written in a proprietary PLI-like language. A few years ago, they increased its speed by 20 percent, just by improving the code that the compiler emitted. Computer architectures are constantly evolving, as newer and fancier instructions are added. Even if you are the world’s best assembly programmer, and know every instruction inside and out, there is no way you can update a large assembly-language code base to take advantage of each improvement in the architecture.

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      Fortunately, C has a preprocessor for such cases. It allows you to write all code in C, optimize where necessary for one or more CPU architectures in their specific assembly language, and then use the C code as a fallback. And if you later get a much better C compiler, all that is needed is a recompilation with the improved compiler using only the C code. Then you can see where the C compiler optimizes better. And where it's still worse, you compile the assembler routines back in.

  • @k7iq
    @k7iq 4 years ago +5

    I program ARM in C lately... I find that being able to view the ASM output helps me tighten up my C code. For instance, I recently looked at a particular IF statement that I suspected might not work as well as it could, and found that defining one of the variables as a local register int32_t reduced the time of that bit of code by two, and the size was a bit less too.
    Also, I needed to create an ASM function, a float-to-int function, because the compiler did not output the FPU instruction for the rounded version of that conversion.
    ASM has its uses but mainly, for me I think, in debugging C code.

  • @seneca983
    @seneca983 3 years ago +3

    14:36 "The difficulty of assembly is the number of instructions."
    Is this part specific to x86?

    • @kelvinyonger8885
      @kelvinyonger8885 3 years ago +2

      afaik this whole video is for in-vogue modern uarchs (x86/x64, ARM)

    • @seneca983
      @seneca983 3 years ago +1

      @@kelvinyonger8885 Doesn't ARM have far fewer instructions than x86?

  • @MURTYPHYSICSVIDEOS
    @MURTYPHYSICSVIDEOS 4 years ago

    Do we need to learn assembly language to build an operating system?

    • @baldrofasgard7926
      @baldrofasgard7926 3 years ago

      Yes, but not for all of it. Operating systems (Windows/Unix/Linux) are mostly written in C.
      C is just about the "lowest" high-level language you will get. Pointers and its direct access to system memory make it very close to assembler, but it has a big advantage over assembler ... it has high-level constructs like if/switch blocks and for/while loops (in assembler you need to construct these using cmp/jmp instructions).
      Some other features, like calling functions, are a lot cleaner and simpler in C. In assembler it means pushing each parameter onto the stack, executing the call, then popping each one off at the end ... making sure to keep track of the exact number of parameters each step of the way.
      The problem with C is its limited ability to directly access hardware. C doesn't have a built-in keyword to read/write IO ports or read/write hardware registers on specialist hardware (e.g. video cards, sound cards). This is where you will need to use assembler.
      There are a couple of options systems programmers use to achieve this:
      1. Embed assembler in the C code. Generally not considered a good option. It makes the C code hardware dependent, and therefore not portable.
      2. Call an external "assembler" function from the C code. An assembler function is created using the C function calling convention and compiled to an external library (static or dynamic). The external function is then called from the C program. Using this method means all of the C code can be transferred to a different CPU architecture. The only thing that needs to be rewritten on the new architecture is the assembler accessing the IO ports or hardware registers.

  • @Alex-op2kc
    @Alex-op2kc 3 years ago

    Creel's back on his cubemaps!

  • @lohphat
    @lohphat 2 years ago

    Why can't modern compilers keep up with new instructions as new CPU families are released and offer more optimized object code modules, so that installer packages can offer better object code at install time?

  • @brorelien8447
    @brorelien8447 4 years ago +23

    14:43 I partially disagree with you on this point. Some processors, like the 6502, have a small instruction set which can be easily learned (only around 56 instructions). I know an 8-bit CPU can't really be compared with a modern x64, but some embedded CPUs still use these simpler 8-bit instruction sets.
    Otherwise I like the video.

    • @y2ksw1
      @y2ksw1 3 years ago +2

      Well, some 8 bit processors have a lot of instructions. Of course, if you group, then almost any processor has only a few:
      Add, subtract, multiply, divide, invert, move. That's about it.
      When I teach, I actually point out that most processors can only add and negate. They do it in a very efficient way though.

    • @NoNameAtAll2
      @NoNameAtAll2 3 years ago

      risk v >_>

  • @xeridea
    @xeridea 3 years ago +1

    Older compilers were known for producing slow code, and assembly was often used, especially in early consoles. Modern compilers are highly optimized. Besides all the basic stuff, they have all sorts of tricks for optimizing multiply, divide, and what instructions to use, even specific to CPUs if you want. Sometimes CPUs have weird quirks that compiler developers can take advantage of, or at least avoid penalties. Optimizing multiply and divide goes beyond obvious stuff like bitshifts for powers of 2; they have all sorts of tables of methods for various numbers. Often they can even convert loops into SIMD instructions automatically. If not, doing SIMD completely manually is very tedious, though there are methods available in some lower-level languages to make it a bit easier.
    Some things can still be hand optimized, but that requires very in-depth knowledge of CPUs, and even then, may not even be faster. For most purposes, it's not worth it, though some low-resource embedded systems, some drivers, and some other niche cases benefit.
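
As a rough illustration of the multiply/divide tricks mentioned above, here is a minimal sketch in Python. It models unsigned 32-bit values with a mask; the function names are made up for this example, and the constant 0xAAAAAAAB is the commonly used reciprocal for unsigned 32-bit division by 3, not a claim about what any particular compiler emits for a given target.

```python
MASK32 = 0xFFFFFFFF  # model unsigned 32-bit registers


def mul_by_8(x):
    """x * 8 as a single left shift (8 is 2**3)."""
    return (x << 3) & MASK32


def mul_by_10(x):
    """x * 10 as two shifts and an add: 10*x = 8*x + 2*x."""
    return ((x << 3) + (x << 1)) & MASK32


def div_by_3(x):
    """Unsigned 32-bit x // 3 via multiply-by-reciprocal.

    0xAAAAAAAB is ceil(2**33 / 3); multiplying and shifting right by 33
    gives the exact quotient for every x in 0..2**32-1.
    """
    return (x * 0xAAAAAAAB) >> 33


if __name__ == "__main__":
    for x in (0, 7, 100, 2**32 - 1):
        print(x, mul_by_10(x), div_by_3(x))
```

The division case shows why these tables exist: a multiply and a shift are typically much cheaper than a hardware divide, but the right constant and shift amount differ for every divisor.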

  • @rjones6219
    @rjones6219 1 year ago

    Assembler and machine code are where I did all my programming. Obviously, writing in assembler takes more time than writing in a higher-level language. But the code can be more space-efficient.

  • @DukeDudeston
    @DukeDudeston 3 years ago +2

    "You can do a lot of stupid things in any language"
    I was able to delete ntfs.sys in a language called "DarkBASIC" when I first started out. So yes. You can do a lot of stupid things in languages.

  • @MaximYudayev
    @MaximYudayev 4 years ago +1

    That's mostly applicable to general-purpose CISC, no? For example RISC, namely ARM, RISC-V (okay, not always), PIC and other embedded processors and DSPs, execute instructions in one clock cycle and seem to be the main targets for optimization in ASM where compilers are not smart enough to take advantage of all the ins and outs of the dedicated CPU.

    • @WhatsACreel
      @WhatsACreel  4 years ago

      Yes, I do recall the PIC is designed to run instructions at the same speed. Except for branching, maybe?? My memory is a little shaky there. ARM takes different times for instructions much like x86. I was definitely thinking mostly about x86 in this video, but most of it is applicable to other hardware.
      Cheers for watching mate :)

  • @WolfCoder
    @WolfCoder 3 years ago +3

    The only time I've written assembly was for the 6502 (because it's fun), the Z80 clone in the Gameboy (because it's fun and the only compiler I found was terrible and couldn't handle ROM paging well, etc.) and the ARM7TDMI in the GBA where, while there's a port of gcc for it, you still have to write assembly for heavy-duty subroutines like interrupts, audio engines, etc. as the compiler optimizations don't seem to work as well in the gcc port. For x86-64 though? Uh.. I think I'll let the compiler have the 'fun' when it comes to that.

  • @LukeAvedon
    @LukeAvedon 4 years ago

    Wonderful video! Glad you are back.

  • @DanEllis
    @DanEllis 3 years ago

    I was a bit puzzled by the first one. You seemed to be suggesting that _calling_ code written in assembly language was slower (disregarding the actual execution time of the function). But of course that's not so.
    Regarding malware, it's sometimes necessary to write code in assembly language to strictly control what machine code is generated. For example, to ensure there are no zeros.
    Finally, "assembly language is etched into the chip". Not really, though. The ISA doesn't dictate the syntax of the assembly language. For example, x86 has two very different syntaxes (Intel and... the other one. AT&T?)

  • @vk3fbab
    @vk3fbab 4 years ago +1

    I think a lot of people confuse assembly language and machine code. Assembly language is the mnemonics like JMP, LDA etc. Machine code is the actual binary or opcode the CPU executes. A CPU only executes machine code. There can also be multiple assembly languages for a given CPU, but there is only ever one machine code. I think the confusion comes because often one mnemonic translates to one binary opcode. So machine code and assembly mnemonics morph into one in our minds.
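
To make the mnemonic-versus-opcode distinction concrete, here is a toy assembler sketch. The opcode bytes are the real 6502 encodings for these four instruction forms, but everything else (the function names, the tiny syntax it accepts, treating every `$` operand as a 16-bit absolute address) is a simplification assumed for this example.

```python
# Opcode bytes for a handful of real 6502 instructions, keyed by
# (mnemonic, addressing mode). The table is deliberately tiny.
OPCODES = {
    ("LDA", "immediate"): 0xA9,
    ("STA", "absolute"): 0x8D,
    ("JMP", "absolute"): 0x4C,
    ("NOP", "implied"): 0xEA,
}


def assemble_line(line):
    """Translate one line such as 'LDA #$01' into machine-code bytes."""
    parts = line.split()
    mnemonic = parts[0].upper()
    if len(parts) == 1:                      # no operand, e.g. NOP
        return bytes([OPCODES[(mnemonic, "implied")]])
    operand = parts[1]
    if operand.startswith("#$"):             # immediate: one operand byte
        return bytes([OPCODES[(mnemonic, "immediate")], int(operand[2:], 16)])
    if operand.startswith("$"):              # absolute: 16-bit, low byte first
        address = int(operand[1:], 16)
        return bytes([OPCODES[(mnemonic, "absolute")],
                      address & 0xFF, address >> 8])
    raise ValueError(f"unsupported operand: {operand}")


def assemble(source):
    """Assemble a multi-line program, skipping blank lines."""
    return b"".join(assemble_line(l) for l in source.splitlines() if l.strip())


if __name__ == "__main__":
    program = "LDA #$01\nSTA $0200\nJMP $0600\nNOP"
    print(assemble(program).hex(" "))  # a9 01 8d 00 02 4c 00 06 ea
```

The text on the left is assembly language; the bytes that come out are the machine code. A different assembler syntax could produce exactly the same bytes.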

    • @WhatsACreel
      @WhatsACreel  4 years ago +1

      That’s an interesting point! ASM and machine code are not entirely distinct languages, but there is certainly a distinction. Maybe one day we can explore some machine code on this channel.
      Thank you for watching :)

    • @jackw7714
      @jackw7714 4 years ago

      Lots of assemblers support macros too

  • @lower_case_t
    @lower_case_t 4 years ago +2

    Fun fact: The simplest assembly language imaginable has only one CPU instruction. To achieve Turing completeness, an imperative language only needs to have conditional branching and the ability to manipulate memory. There are instructions like subleq (subtract and branch if less than or equal to zero) that fulfill those requirements, so you can write any program just using this single instruction. There is even an operating system out there that was written as a proof of concept, called Dawn OS.
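
A minimal subleq interpreter shows how little machinery the single instruction needs. This is a sketch under assumed conventions (three cells per instruction, a negative program counter halts); it is not how Dawn OS or any specific subleq machine is implemented, and the example program layout is made up for illustration.

```python
def run_subleq(memory, max_steps=10_000):
    """Run a subleq program in place and return the memory list.

    Each instruction is three cells a, b, c:
        memory[b] -= memory[a]; jump to c if the result is <= 0,
        otherwise continue with the next instruction.
    A negative program counter halts the machine (an assumed convention).
    """
    pc = 0
    steps = 0
    while pc >= 0:
        if steps == max_steps:
            raise RuntimeError("step limit reached without halting")
        a, b, c = memory[pc], memory[pc + 1], memory[pc + 2]
        memory[b] -= memory[a]
        pc = c if memory[b] <= 0 else pc + 3
        steps += 1
    return memory


if __name__ == "__main__":
    A, B, Z = 9, 10, 11          # data cells placed after the 9 code cells
    program = [
        A, Z, 3,                 # Z = Z - A  (Z becomes -A)
        Z, B, 6,                 # B = B - Z  (B becomes B + A)
        Z, Z, -1,                # Z = 0, result <= 0 so jump to -1: halt
        7, 5, 0,                 # A = 7, B = 5, Z = 0
    ]
    print(run_subleq(program)[B])  # 12
```

Addition here takes three subleq instructions and a scratch cell, which hints at why one-instruction machines are a curiosity rather than a practical target.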

    • @WhatsACreel
      @WhatsACreel  4 years ago +1

      Indeed! NAND and NOR are universal themselves. A ridiculously long sequence of NAND and some RAM is all we need to produce any bit pattern we like! Conditional branching is certainly a welcome luxury, haha ;)
      NAND output, in1, in2
      Bam! The simplest language of them all! Wow, Imma port Crysis :)
      I did not know about this OS. That’s awesome! Thank you for watching and sharing :)
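
The universality of NAND mentioned above can be checked with a short sketch: every gate below is built only from `nand`, ending in a one-bit full adder. The function names are illustrative; bits are represented as the integers 0 and 1.

```python
def nand(a, b):
    """The only primitive: 0 when both inputs are 1, otherwise 1."""
    return 0 if (a and b) else 1


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor_(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a, b, carry_in):
    """Add three bits using NAND-derived gates; returns (sum, carry_out)."""
    partial = xor_(a, b)
    total = xor_(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c, full_adder(a, b, c))
```

Chaining full adders gives multi-bit addition, and from there the rest of an ALU, which is the sense in which a long enough sequence of NANDs can compute any fixed bit mapping.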

    • @KohuGaly
      @KohuGaly 4 years ago +3

      Important note, by "branching" we mean conditional GOTO statements. Not mere if(condition){code block}. The latter can't produce loops.

    • @lewiscole5193
      @lewiscole5193 4 years ago +1

      Meanwhile, in the real world, there is no such thing as "Turing complete" computers because all real computers have finite amounts of memory and so cannot fully emulate a theoretical computer that has an infinite amount of memory required by a universal Turing machine.
      Moreover, I fail to see how a "subtract and branch" instruction is capable of simulating/emulating the sort of conditional write instruction (or its equivalent "Load Linked"/"Store Conditional" sequence) needed to make wait-free and/or lock-free programming work.
      (See Maurice Herlihy's seminal paper, "Wait-Free Synchronization" here:
      < cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf >)
      So I question whether or not it really can be used to write any program, at least not without some assumptions being made about the atomicity of the instruction.
      Finally, not to sound like too much of a curmudgeon, but ask me if I care about proof of concept OSs ... or better yet don't.

    • @WhatsACreel
      @WhatsACreel  4 years ago +1

      @@lewiscole5193 Yes, there are no computers with an infinite amount of memory. When people say NAND is universal or Turing Complete, it just means given any finite set of inputs and the outputs to which they map, there is a series of NAND gates that will compute the mapping.
      NAND, in this context, should be executed as a single instruction stream, so no atomicity is required.
      I agree - modern computers are not strictly Turing Complete. But for practical purposes, it is common to use the term for modern computers and languages.

    • @lewiscole5193
      @lewiscole5193 4 years ago +1

      @@WhatsACreel
      Thanks for your reply.
      As it so happens, I have a general idea of what Turing Complete means and so I know that strictly speaking it doesn't apply to any real world computer ... ever.
      Yes, I know that the expression "Turing Complete" has been tweaked to mean "approximates 'Turing Complete'" when it comes to real world computers and languages, but I also know that this expression is meaningless for "practical purposes" and so it annoys me when people whip it out as if it weren't.
      If you're a computer scientist trying to determine whether or not something is "computable", then by all means drag out Turing machines, and Turing Completeness with it, and have at it.
      But let's be clear here ... dropping the name "Turing" and some OS with it is not particularly impressive to someone like me who has actually worked on an OS.
      Perhaps that's because there are a lot of "problems" here in the real world that are *NOT* "computable", which is to say, you can't write some sequence of code that will halt when the problem is "solved" because the "problem" itself is a process over time ... case in point, the OS that you are currently using.
      Whatever it is, Linux, Windows, Dawn ... it doesn't matter ... while it is most certainly made up of parts that are "computable" (i.e. routines that halt or do the equivalent like "return"), the OS itself is not "computable" since it should never stop ... it should never reach a point where the "problem" that you're trying to solve is done enough so that you can/should halt.
      The same holds true for any language interpreter that asks a user for a command, executes it, and then prints the results.
      But back to "Turing Completeness", as it turns out, just about every language, and anything that can be called a digital computer, has the capability of performing a conditional jump and so bringing up "Turing Completeness" is like bringing up the fact that we all breathe air ... so what?
      And that brings me back to Conditional Replace, which I suspect cannot be simulated by any Turing machine, whether or not it has infinite memory, as it requires coordination between multiple machines effectively reading/writing over the same infinite tape at the same time.
      So "Turing Completeness" by itself doesn't get you where you want to be if you're trying to claim that just a bunch of Subtract and Conditionally Jump instructions can let you do whatever any other piece of software running on a modern computer can do.
      Finally, I'm well aware of the fact that you can use just NAND gates to produce any modern digital computer, but not without coordination with a clock to prevent sampling during a metastable state.
      And with the introduction of a clock, you then have to be concerned with atomicity of a multi-cycle operation in the face of multiple processors ... hence locking primitives, of which there are some things that cannot be accomplished in the absence of an atomic Conditional Replace instruction ... including any number of NAND instructions that you might be able to fill memory with.
      Again, see Herlihy for the hierarchy of locking primitives.

  • @MorningNapalm
    @MorningNapalm 3 years ago +1

    I made it to point 3, and at that point I realised that the correct title of this video should be “x86 Assembly Language Misconceptions”. Byeeee

  • @controlflow89
    @controlflow89 3 years ago

    Absolutely amazing channel, keep up the great work!

  • @0MoTheG
    @0MoTheG 1 year ago

    0:50 I disagree. Using inline asm does not take much overhead. In the simplest case it is just a mov to set up the register you want to change. A register to register mov does not take any clk cycles.

  • @Renville80
    @Renville80 1 year ago

    One aspect of assembly language that can be annoying is that you can use various shortcuts to save on the number of bytes taken up in memory. One piece I’ve been trying to analyze is effectively 5 kilobytes crammed into a 4 kilobyte EPROM.

  • @pugboi8017
    @pugboi8017 3 years ago

    what is dis? I’m so glad i got recommended this channel. The coding gsus is australian

  • @kindpotato
    @kindpotato 3 years ago +5

    "race conditions are brilliant" This guy is awesome.

  • @coder2k
    @coder2k 4 years ago +1

    Looking forward to seeing that next video you already teased :)