CPU Pipeline - Computerphile

  • Published on 18 Nov 2024

Comments • 123

  • @crystalsoulslayer
    @crystalsoulslayer 7 months ago +128

    Shoutout to the people that design processors. This is hard enough to understand as an abstraction, much less when you try to de-abstract it. All of this is being done with _physical circuitry_ inside the computer. Utterly mind-boggling. Hats off.

    • @unvergebeneid
      @unvergebeneid 7 months ago +17

      Also shout-out to the people discovering side-channel attacks. They have to understand the internal workings of these CPUs _at least_ as well as the engineers :D

    • @thewhitefalcon8539
      @thewhitefalcon8539 7 months ago +2

      Actually I think it's not so hard to design an 80s-era non-pipelined CPU if you know digital logic.

    • @dascandy
      @dascandy 7 months ago +3

      @@thewhitefalcon8539 The logic is the simple part. It's the clocking that makes it harder, and every year in progress beyond the most basic CPUs keeps adding to it.

    • @rafa_br34
      @rafa_br34 7 months ago +1

      @@unvergebeneid Honestly if you stop to think about it side channel attacks aren't hard to understand, the clever part is just using cache to test if the guess was correct or not.

    • @meneldal
      @meneldal 7 months ago +2

      It has been many years since anyone designed the circuitry by hand; there's just nobody smart enough, and with enough free time, to do it by themselves. There are people who make software that converts your design into actual plans to send to the fab, people who design and simulate the CPU design, and so on. People can only deeply understand a tiny part of the whole process.

  • @pyromen321
    @pyromen321 7 months ago +96

    Somehow I missed that this guy is Matt Godbolt!
    Compiler Explorer is without a doubt my favorite tool in software engineering as a whole!

    • @londonbobby
      @londonbobby 7 months ago +3

      What? The guy who created the BBC Micro emulator that runs in a browser? Wow, he's a genius 🤩

    • @axelBr1
      @axelBr1 6 months ago +3

      I was going to comment about Compiler Explorer. I've only recently heard of it, and it's amazing.
      I recently used it to test my assumption about the declaration of a variable used in a loop. I'd always declared the variable outside the loop (this is going back a few decades) to avoid the cost of it being repeatedly created inside the loop. The drawback, of course, is that inside the loop you have some "random" variable and have to go searching for its declaration (not such a big issue with modern IDEs). It recently dawned on me that maybe the compiler is smart enough to move the declaration outside the loop for me. I tested this out in Compiler Explorer and it's amazing to see what the compiler will do: declaring inside the loop creates smaller code (presumably faster), as the compiler knows the variable is going to be discarded and appears to even optimise away its creation and tidy-up after use.

    • @Functionviktor
      @Functionviktor 6 months ago +1

      I use it every day now to see what assembly my C++ code generates, and now I'm dominating in reverse engineering using that knowledge 😅

    • @vindieu
      @vindieu 6 months ago

      same, i got it at 5 minutes after noticing the compiler explorer shirt

  • @MattGodbolt
    @MattGodbolt 7 months ago +41

    A huge shout out to Sean's fantastic animations! Thanks Sean! 🎉

    • @Computerphile
      @Computerphile 7 months ago +16

      We'll shovel some more binary soon!

  • @smoorej
    @smoorej 7 months ago +57

    As a former IBM/370 Assembler programmer I approve of this presentation!

    • @goldnutter412
      @goldnutter412 7 months ago +1

      👏👏👏

    • @NateEngle
      @NateEngle 7 months ago

      I always found the PDP-11 architecture very elegant, but of course C was the most productive way to write for it, so the closest I usually got to PDP assembly was designating some integers as register variables and then happily admiring what was written into the intermediate .a files by the System III C compiler.
      I think the instruction set I spent the most time writing actual code for (and hence cursing over its many flaws) was the i8031 (and its various later versions), but the internal operation of that thing made efficiency experts weep. It was kind of a learning environment for CPU designers whose ideas were inspired by the microcontroller that chose to do everything in the least intelligent manner.

  • @SupaKoopaTroopa64
    @SupaKoopaTroopa64 7 months ago +10

    Another interesting way the pipeline is kept busy is the uOP cache. Imagine the executor robot keeping a list of frequently accessed instruction addresses, and what instructions they contain. When the executor calculates the address of the next instruction, it can check its address book, and if it finds a match, it can start executing that instruction without waiting for it to be fetched and decoded. Modern CPUs usually keep a book of ~5-7 thousand entries.
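
The "address book" idea above can be sketched as a small lookup table keyed by instruction address. This is only a toy model: the class name `DecodedOpCache`, the `fetch_and_decode` callback, the LRU eviction policy and the tiny capacity are all assumptions made for illustration, not a description of how any real CPU organises its uOP cache.

```python
from collections import OrderedDict


class DecodedOpCache:
    """Toy model of a decoded-instruction (uOP) cache with LRU eviction."""

    def __init__(self, capacity, fetch_and_decode):
        self.capacity = capacity
        # Hypothetical slow path: fetch the instruction and decode it.
        self.fetch_and_decode = fetch_and_decode
        self.entries = OrderedDict()  # address -> decoded operation
        self.hits = 0
        self.misses = 0

    def lookup(self, address):
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)  # mark as most recently used
            return self.entries[address]
        self.misses += 1
        op = self.fetch_and_decode(address)
        self.entries[address] = op
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return op


# A three-instruction loop body executed four times.
program = {0: "load", 1: "add", 2: "branch 0"}
cache = DecodedOpCache(capacity=4, fetch_and_decode=lambda addr: program[addr])
for _ in range(4):
    for addr in (0, 1, 2):
        cache.lookup(addr)
print(cache.hits, cache.misses)
```

Only the first trip around the loop pays for fetch and decode; later iterations are served from the table, which is the effect the comment describes.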

  • @trevinbeattie4888
    @trevinbeattie4888 7 months ago +10

    That last segment about parallel execution units where one might need to wait for the other to finish reminded me of the optimization options in gcc (at least a decade or two ago) where the compiler would figure out how to order the instructions to avoid such dependencies in the CPU pipeline without materially altering the results of the code. As I understand it, modern processors can do instruction re-ordering on the fly to get the same optimization results. It’s rather mind-boggling to me as I’ve just got a rudimentary understanding of microcode as a simple series of opening and closing logic gates (thanks to Ben Eater’s videos.)
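
A rough sketch of why instruction order matters when two execution units run side by side. The model is an assumption made for illustration: an in-order machine that issues up to two instructions per cycle, where the second may only pair with the first if it does not read the first's result, and every result is ready by the next cycle. The function name `cycles_dual_issue` and the register names are hypothetical.

```python
def cycles_dual_issue(instructions):
    """Count cycles for an in-order machine that issues up to two
    instructions per cycle, unless the second reads the first's result."""
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1
        dest, _sources = instructions[i]
        i += 1
        if i < len(instructions) and dest not in instructions[i][1]:
            i += 1  # independent neighbour: issue it in the same cycle
    return cycles


# Each instruction is (destination, sources).
naive = [
    ("a", ("x", "y")),  # a = x + y
    ("b", ("a", "z")),  # b = a + z  (needs a)
    ("c", ("p", "q")),  # c = p + q
    ("d", ("c", "r")),  # d = c + r  (needs c)
]
# Same four operations, interleaved so that neighbours are independent.
scheduled = [naive[0], naive[2], naive[1], naive[3]]

print(cycles_dual_issue(naive), cycles_dual_issue(scheduled))
```

Under these assumptions the reordered sequence needs fewer cycles for the same results, whether the reordering is done ahead of time by a compiler or on the fly by the processor.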

  • @unvergebeneid
    @unvergebeneid 7 months ago +10

    Can't wait to see more from Matt!

  • @anhha7502
    @anhha7502 5 months ago

    I'm taking a computer architecture course and this really makes the learning easier! Thanks to everyone in production and Mr. Godbolt!

  • @easyactually
    @easyactually 7 months ago +107

    Can't believe rollercoaster tycoon was made in assembly

    • @TerrorTerros
      @TerrorTerros 7 months ago +10

      Performance was amazing though, it really showed 😮

    • @georgeprout42
      @georgeprout42 7 months ago +9

      I can. Chris Sawyer joined the European Coaster Club and came on a few trips to Europe with us. He went from "I have an idea for a game" to an adrenaline junkie with his first ride.

    • @codycast
      @codycast 7 months ago +5

      yO mamma was made in assembly

    • @casinatorzcraft
      @casinatorzcraft 7 months ago

      I can. You can still abstract with subroutines and such

    • @blazer511
      @blazer511 7 months ago +5

      That guy is a real software engineer, unlike the react kiddies of today

  • @stupossibleify
    @stupossibleify 7 months ago +27

    Any podcast or video benefits from Matt Godbolt appearing

    • @MattGodbolt
      @MattGodbolt 7 months ago +8

      Aww shucks :) Thank you very much!

  • @CallousCoder
    @CallousCoder 7 months ago +8

    This video should’ve been done 20 years ago, when I tried to explain pipelining, superscalar execution and out-of-order processing to our intern in the context of pipeline poisoning due to unnecessary branching that I saw in a section of his code, where actually making the code branchless with relatively expensive multiplication instructions made it 5 times faster.
    It’s staggering how quickly the complexity branches out when you have to explain that to a CS person who only works in theory and has no concept of the hardware layer.
    And soon you’re also touching on out-of-order processing and superscalar execution 😂
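
A minimal sketch of the branchless-with-multiplication trick mentioned above, using a clamp as a stand-in (the original code is not shown in the comment, so both function names and the clamp itself are assumptions). Python will not show the speed-up, since the interpreter overhead dominates; the point is only the shape of the transformation a compiled language would benefit from.

```python
def clamp_branchy(value, limit):
    """Return value, capped at limit, using a conditional branch."""
    if value > limit:
        return limit
    return value


def clamp_branchless(value, limit):
    """Same result, selecting with arithmetic instead of a branch."""
    over = int(value > limit)  # 1 if over the limit, else 0
    return over * limit + (1 - over) * value


print(clamp_branchy(42, 10), clamp_branchless(42, 10))
```

Both inputs are always computed and the comparison result picks one of them by multiplication, so there is no branch for the pipeline to mispredict.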

  • @KipIngram
    @KipIngram 6 months ago +1

    You can design your processor so that the branch "decision" is separated from the branch "action." Imagine having separate fetch and execute units. Between the two, there is a "flag" system - a flag can be empty, true, or false. It starts out empty. So you put the branch decision code as early as you can (you can't always separate them by much, but sometimes you can). As soon as that gets executed, the execute unit will post the result into the flag (so it's now either true or false). Meanwhile the fetch unit is "fetching ahead" - it will eventually see the branch action instruction, and will consult the flag. Hopefully the flag has been set - if so, the fetch unit knows exactly what to do and can carry on. If the flag is still empty, though, then we have the problem you're discussing and we have to block until that flag value arrives. But by separating them like this we give ourselves "maximum opportunity" to avoid losing time.
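
The flag scheme above can be sketched as a tiny cycle-by-cycle simulation. Everything here is a hypothetical model: `BranchFlag`, `run`, and the two cycle parameters are invented names, and `None` stands in for the "empty" state.

```python
class BranchFlag:
    """Three-state flag: None (empty), True (taken) or False (not taken)."""

    def __init__(self):
        self.value = None

    def post(self, taken):
        self.value = taken

    def consume(self):
        value, self.value = self.value, None  # reading empties the flag
        return value


def run(decision_ready_cycle, action_fetch_cycle, taken):
    """Return (branch outcome, cycles the fetch unit spent blocked)."""
    flag = BranchFlag()
    stalled = 0
    cycle = 0
    while True:
        if cycle == decision_ready_cycle:
            flag.post(taken)  # execute unit posts the decision
        if cycle >= action_fetch_cycle:
            if flag.value is None:
                stalled += 1  # fetch unit blocks on an empty flag
            else:
                return flag.consume(), stalled
        cycle += 1


print(run(decision_ready_cycle=2, action_fetch_cycle=5, taken=True))
print(run(decision_ready_cycle=5, action_fetch_cycle=2, taken=False))
```

When the decision lands before the fetch unit reaches the branch action, no time is lost; when it lands later, the fetch unit blocks for the difference.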

  • @FCHenchy
    @FCHenchy 7 months ago +20

    The non-binary robot bit was adorable. Thank you.

  • @jms019
    @jms019 7 months ago +4

    Some years back a particularly stupid agency didn’t get that my skills, including assembly, did in fact match a job requirement for "assembler".

  • @trevinbeattie4888
    @trevinbeattie4888 7 months ago +6

    Love Matt’s T-shirt

  • @lucidmoses
    @lucidmoses 7 months ago +2

    As per your analogy, having two sets of the fetch-decode lines per arithmetic unit would also work now that multiple threads are more common.

  • @kayakMike1000
    @kayakMike1000 7 months ago +1

    I am learning about CPU pipelines now. I am working on a RISC-V with a five-stage pipeline.

  • @BoomBaaamBoom
    @BoomBaaamBoom 7 months ago +1

    Thank you for making Compiler Explorer. It is very helpful to me!❤

  • @aaronr.9644
    @aaronr.9644 7 months ago

    OMG!! Godbolt!!! I didn't realise he contributed here.

  • @olafurw
    @olafurw 7 months ago +2

    Matt! The best!

  • @YS_Production
    @YS_Production 4 months ago

    Wow, the 20 minutes flew by :)

  • @Roarshark12
    @Roarshark12 7 months ago +5

    Thanks so much for this video. Can you please also cover Tomasulo's algorithm?

    • @MattGodbolt
      @MattGodbolt 7 months ago +14

      I hope to get there :) we're kinda heading in that direction...

    • @Roarshark12
      @Roarshark12 7 months ago +6

      @@MattGodbolt yeay!! It's something I really enjoyed learning about twenty-plus years ago in the CPU architecture class I took in computer engineering in college. I wonder what new innovations beyond that (and tournament predictors) have been made since I graduated in 2003.
      Also: I appreciated your lecture style, funneling from the most abstract ideas and intentions down into the nuts and bolts of how things get done inside a CPU.

  • @TheUglyGnome
    @TheUglyGnome 7 months ago +8

    This imaginary processor is not well suited for pipelining. Execute stage uses more than one clock cycle every time it needs to access memory. That's why pure RISC processors are load/store architectures. They don't need to access memory during execution stage. Their instruction sets are designed for pipelining from the start.

    • @adingbatponder
      @adingbatponder 6 months ago

      Have you got a link or further info on this? The difference seems important. Cheers.

    • @TheUglyGnome
      @TheUglyGnome 6 months ago +1

      @adingbatponder There's a great Wikipedia article, "Classic RISC Pipeline". The article "Load-store architecture" also gives a bit of background info.
      For further reading, at least early editions of the book "Computer Architecture: A Quantitative Approach" by David Patterson & John Hennessy have great info about RISC vs. CISC pipelining.
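
The cost raised at the top of this thread, an execute stage that sometimes takes more than one cycle, can be put into rough numbers. This is a back-of-the-envelope model under stated assumptions: a three-stage in-order pipeline with execute as the last stage, one cycle per stage except where noted, and made-up instruction counts. The function name `pipeline_cycles` is hypothetical, and the model ignores that a load/store design may need extra instructions for the same work.

```python
def pipeline_cycles(execute_cycles, stages=3):
    """Total cycles for an in-order pipeline whose final (execute) stage
    takes execute_cycles[i] cycles for instruction i; every earlier stage
    takes one cycle and simply waits behind a busy execute stage."""
    if not execute_cycles:
        return 0
    fill = stages - 1  # cycles before the first instruction reaches execute
    return fill + sum(execute_cycles)


# Ten instructions that all execute in a single cycle.
single_cycle = [1] * 10
# The same ten, but three of them touch memory and need two cycles.
with_memory_access = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]

print(pipeline_cycles(single_cycle), pipeline_cycles(with_memory_access))
```

Every extra cycle spent in execute holds up the whole line behind it, which is why a design that keeps memory access out of the execute stage pipelines more smoothly.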

  • @FalcoGer
    @FalcoGer 7 months ago +2

    And now superscalar, out-of-order, speculative execution with multiple execution units, register renaming and with multilevel cache, multicore and multiple memory channels, and of course virtual addressing. And how to avoid cache-timing side-channel attacks in such a setting.

  • @absurdengineering
    @absurdengineering 5 months ago

    One thing that may be not clear is why can’t the “addition robot” just know that say 01 is addition. That’s because instructions on real machines are much more complicated to decode, so that takes some good time to do. For the very simple “cpu” discussed in this video, it looks like an unnecessary complication to have a decoder - but that’s only an artifact of the simplification.

  • @NickEllis-nr6ot
    @NickEllis-nr6ot 7 months ago

    Absolutely brilliant explanation!

  • @SeanHoulihane
    @SeanHoulihane 7 months ago +4

    This was just starting to get interesting...

  • @avi12
    @avi12 7 months ago +3

    3:24 Megahertz means millions of times per second, not thousands.
    Thousands of times per second would be kilohertz.
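
The prefix arithmetic behind this correction, as a quick sketch (the table and function names are just illustrative helpers):

```python
# SI prefixes for frequency units, in cycles per second.
PREFIXES = {"kHz": 10**3, "MHz": 10**6, "GHz": 10**9}


def cycles_per_second(value, unit):
    """Convert a clock frequency to cycles per second."""
    return value * PREFIXES[unit]


def cycle_time_ns(value, unit):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / cycles_per_second(value, unit)


print(cycles_per_second(2, "MHz"))  # 2000000
print(cycle_time_ns(1, "MHz"))      # 1000.0
print(cycle_time_ns(4, "GHz"))      # 0.25
```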

  • @ACCPhil
    @ACCPhil 7 months ago

    I could understand the 6502. I miss those days. But the conditional instructions in ARM, we bow down before Sophie.

  • @vadrif-draco
    @vadrif-draco 6 months ago

    Very cool and informative animations! :)

  • @MrStevetmq
    @MrStevetmq 5 months ago

    English language and grammar tell you all you need to know. You can tell from the ending of the word what it is. So ...ly is often a verb, ...er is the thing or person that is doing the action that is the verb. The only odd one out here is "Assembly Language", which is a noun; "Assembly" is the verb and "An Assembler" is the operator (the program that turns the code written in "Assembly Language" into "Machine code").
    In higher-level languages there is a "Compiler" that compiles the "Program code" into "Assembly Language" that is then "Assembled" into "Machine code", often divided up into routines that are linked together by the "Linker" to make the complete program in "Machine code".

  • @kwazar6725
    @kwazar6725 7 months ago

    Oh. Nice intro to instruction pipeline. Welcome to cpus

  • @zxuiji
    @zxuiji 7 months ago

    Instead of branch prediction, it should just preload both via 2 data buses (an adaptor circuit can be used to ensure only 1 fetch is passed on to the RAM/HDD/SSD/etc. at a time; ones designed for dual data buses can just use 2 slots, which is easily detected by other hardware). While the CPU is waiting for the 2nd data bus to catch up with the 1st bus it sent a fetch along, it can do its comparison. Think about it: you have to fetch 2 values in some cases to compare already, then you have 2 branches to fetch for, so just optimise for fetching both at once and ignore any data you wind up not using.

    • @meneldal
      @meneldal 7 months ago

      You can do that for the basic branches, but this doesn't work for indirect branches, as the destination could literally be anywhere.

    • @zxuiji
      @zxuiji 7 months ago

      @@meneldal Not really, just delay fetching the values to compare until after fetching the addresses of the indirect branches. Then use those addresses after fetching the values to compare. Not rocket science.

    • @meneldal
      @meneldal 7 months ago +1

      @@zxuiji The problem is you don't know the addresses, that's the whole point. Take a switch written as "add pc, r5" (or some variation depending on the architecture), where you can't predict r5 because it is the result of a computation just before.

    • @zxuiji
      @zxuiji 7 months ago

      @@meneldal Ah, that one. Thought you were on about callback functions straight after if statements (or switch statements).

    • @zxuiji
      @zxuiji 7 months ago

      @@meneldal Just thought of a partial fix for that problem. Build ALUs into memory devices like RAM. Doing so allows the CPU to offload the task to the source memory device, which requires only 1 fetch, the fetch of the result. That is assuming both values come from the same source anyway.

  • @tlhIngan
    @tlhIngan 7 months ago

    ARM is known for its conditionally executed instructions, except these days it's not done anymore. AArch32 (ARM 32 bit) still supports it, but AArch64 (64 bit ARM) does not support it. This is because conditional execution is incompatible with superscalar architectures (where you can execute more than one instruction simultaneously), so AArch64 code only allows conditionals in certain conditions. And on the very small end, Thumb never supported conditional execution at all. This was due to the limited length of a Thumb instruction.

    • @meneldal
      @meneldal 7 months ago

      Conditional execution for almost every instruction is also a bit of a waste of space and limits the number of instructions / size of literals you can have in the instructions. They were also not used as much as they could have been, making them feel like even more of a waste.

  • @dyretna681
    @dyretna681 6 months ago +1

    Ok ready for TIS-100

  • @rafa_br34
    @rafa_br34 7 months ago

    Excellent video. Quite disappointed it didn't touch on branchless programming much.

  • @Lion_McLionhead
    @Lion_McLionhead 7 months ago

    Guess pipeline optimizations didn't hit PCs until the 486, but the marketing for the 586 was the first time the public became widely aware of them. PGCC was just starting to bring delay slot optimizations & out-of-order instructions to the gcc series by 1997. Primitive, early microprocessors from the stone ages indeed.

    • @Roxor128
      @Roxor128 7 months ago

      The original 8086 actually had some pipelining, except it wasn't for instructions, but for the microcode used to execute them. See Ken Shirriff's blog series on reverse-engineering the 8086 if you want some detail.

  • @iNireus
    @iNireus 6 months ago

    There needs to be more on microcoding; I do wonder now if they need to revisit microcoding.

  • @noname2588o
    @noname2588o 6 months ago

    Requesting suggestions for some more reading/video resources to learn more about this and, in general, about computer architecture, compilers, and OS and kernel internals.

  • @rumplstiltztinkerstein
    @rumplstiltztinkerstein 7 months ago +1

    3:22 I know how stressful it is to give a lesson live correctly on the first shot. I'm just here to be that annoying guy with no friends who points out that MHz is "millions of times per second".

  • @j7ndominica051
    @j7ndominica051 7 months ago

    What's the LCD clock in the background scrolling names of programming languages and some number?

    • @MattGodbolt
      @MattGodbolt 7 months ago +1

      It's a Tidbyt graphic display, showing the number of compilations by programming language on Compiler Explorer.

  • @doge8530
    @doge8530 7 months ago +10

    3:20 Megahertz, so thousands of times per second.
    Kilo is thousand, Mega is million.
    I don't mind mistakes, but it really takes away from the thought that Giga means literal billions of clock cycles per second in modern processors

    • @mytech6779
      @mytech6779 7 months ago +8

      It is thousands. Thousands of thousands in fact.

    • @tiredcaffeine
      @tiredcaffeine 7 months ago

      @@mytech6779 The guy in the video said "thousands of times a second". "Thousands of times" doesn't mean "thousands of thousands of times".

    • @mytech6779
      @mytech6779 7 months ago

      @@tiredcaffeine One is inclusive of the other. If you have 100 of something then you also have 10 of that same thing. My 3.8 GHz 8-core CPU is in fact capable of thousands of operations per second, many many thousands.

    • @tiredcaffeine
      @tiredcaffeine 7 months ago +1

      @@mytech6779 However in daily usage we say millions when we mean millions, not thousands.

    • @mytech6779
      @mytech6779 7 months ago

      @@tiredcaffeine That's fine but it wasn't being used daily he just used it once.

  • @jacejunk
    @jacejunk 7 months ago

    @3:24 "Megahertz - thousands of times per second" could have been rephrased as "Kilohertz", or "millions of times per second".

    • @mytube001
      @mytube001 6 months ago +2

      Well, it IS thousands of times per second, just thousands of thousands! :D

  • @sudazima
    @sudazima 6 months ago

    ive played enough factorio to know how CPUs work

  • @KipIngram
    @KipIngram 6 months ago

    Yes - it. A robot is an it. And robots don't have minds. They're doing the same thing you're trying to model here - they just follow a set of instructions.

    • @Princess_Jessie3414
      @Princess_Jessie3414 6 months ago

      It is an it but we prefer to make things human by giving them a gender

  • @Rowlesisgay
    @Rowlesisgay 7 months ago +4

    Now somehow simplify superscalar out-of-order pipelined VLIW (I know VLIW is not out of order or even really superscalar) RISC (I know RISC and VLIW are mutually exclusive) multi-core CPUs with 14 layers of cache.

  • @gsestream
    @gsestream 6 months ago

    Maybe this tech pipe needs some unclogging. GPGPU with CPU compatibility.

  • @mrvulcan
    @mrvulcan 4 months ago

    I hate 6500-style assembly
    Z80 code was much more logical.
    i.e. you don't "branch", you jump.
    the "disable interrupts" was not ambiguous...
    and more pet peeves...

  • @jaydenritchie1992
    @jaydenritchie1992 7 months ago

    i cant remember if its il or 'e levore (i work) 'e (for) ono plata (one plate) munjari pasta (feed/eat pasta)

  • @mytube001
    @mytube001 6 months ago

    I wish I had been born 20-30 years earlier. Assembly is so much more straightforward than modern languages. That stuff feels intuitive to me, while modern languages are very abstract. Sure, they're more powerful in most ways and the only way to do it on modern CPUs, but you've completely lost the connection to the hardware.

    • @ArneChristianRosenfeldt
      @ArneChristianRosenfeldt 6 months ago

      I want to live in a world of zero-cost abstractions: a modern language which compiles to readable assembler code (not x86). How do you optimize your hand-written assembly for the pipeline? Optimization almost by definition is not straightforward. This is the reason why I don't understand how a greedy algorithm to allocate registers for local variables counts as an optimization. Yeah, spill-over lands on the stack. Optimization is when you hold back your greed, to keep some register free for later use. I guess that I'm talking about SPARC. On MIPS and x86, registers need to be assigned bottom-up. Then for functions calling a lot of other functions, no registers are left. So they do everything on the stack. Thankfully, there still is block scope for your inner loops.

  • @dj10schannel
    @dj10schannel 7 months ago

    Cool

  • @goldnutter412
    @goldnutter412 7 months ago

    Perfect 🤣👏

  • @sumanalbargaw3108
    @sumanalbargaw3108 7 months ago

    in GPU

  • @TheJimmyCartel
    @TheJimmyCartel 7 months ago +7

    Assembly is so much cooler than soydev unity bloat

  • @aryansingh7209
    @aryansingh7209 7 months ago

    6 minutes ago is crazy

  • @habibie
    @habibie 7 months ago +2

    Shout out for all non binary robots 😂😅

  • @MonochromeWench
    @MonochromeWench 6 months ago

    Delay slots, something completely irrelevant to any modern CPU architecture.

  • @JeremyP1
    @JeremyP1 7 months ago

    second

  • @zxcaaq
    @zxcaaq 7 months ago +1

    clickbait

  • @Robstafarian
    @Robstafarian 2 months ago

    Nonbinary programmers, unite! 💛🤍💜🖤 (quantum entanglements welcome)

    • @Cohnan13
      @Cohnan13 a month ago +1

      superpositions* 😜