@@skilz8098 what is modern? The 6502 is from 1975 (the 70s, I am sure). I think it predates the SAP-1 book. The 6502 solved the variable length of instructions: there are end instructions in microcode. Some illegal instructions manage to miss those and freeze the CPU.
Wiring of clock signals is important in ICs. For example, up to 1985 commercial chips only had a single metal layer, so any crossing needed a bridge through highly resistive poly-Si. The clock distribution network on the DEC Alpha is so powerful and has such wide conductors that it is clearly visible in the die shot. They were clearly pushing the limits.
SAP uses adders, but I read that you could buy TTL ALUs in the 60s. The Datapoint 2200 and other processors even operated on single bits (Usagi Electric has videos). Discrete multiplexers and shifters must have been cheap. But then, the C64 first started with memory with a single data pin. Maybe Datapoint used those, or those early Intel rotation registers which acted as main RAM.
I like how Atari and Commodore insisted on tri-state address pins for the 6502 CPU. So indeed everything sits on a common bus. Same with the Atari Jaguar. The N64 has a point-to-point connection between the custom chip and the RAMBus to the DRAM chips.
Arcades are a bit like a PC: lots of expansion cards on a mainboard, optimized for neither cost nor energy efficiency. Though some are straightforward: buy the expensive 6809 and a lot of RAM for a double buffer with many colors. This much RAM filled its dedicated board on its own back in the day and was obviously expensive. How do low-volume ROMs work?
I think loading data into registers before using them for ALU operations is essential: loading directly from main memory to the ALU might simplify CPU operations, but it is practically inefficient. It's crucial to note that accessing main memory is much slower than registers. Registers are accessible in one clock cycle, while getting to main memory may take dozens or even hundreds of cycles, which makes the ALU's work slow.
Moreover, ALUs are intended to work with a few, fixed number of registers. Direct access to main memory would make them difficult to design and potentially slow down the whole CPU. Most CPU architectures are structured around using registers for data manipulation because it simplifies the instruction set and makes instructions more efficient. Changing this design would require a complete rethink of the architecture, resulting in higher complexity as well as compatibility problems.
Pipelining is heavily used in modern CPUs to execute numerous instructions simultaneously. In this process, registers play a key role by providing fast storage for intermediate results and facilitating quick movement of data between different stages of the pipeline. Direct memory access would increase latencies, interrupting the smooth flow and reducing overall efficiency. Using main memory also leads to higher power consumption compared to employing registers, which is a major concern both in mobile devices and in servers.
One of the better ways to handle this (I will share one example for now), as most modern architectures do, is to include a memory hierarchy: L1, L2, and L3 caches for prefetching, which overall reduces latency and improves performance.
Also, realistic access to memory uses address registers, so you don't actually cut down on registers. I think that the 6502 zero page is a mistake. In the NEC TurboGrafx, CPU, RAM, and ROM all cycle at 8 MHz. No 8-bit CPU after this. But at least for 10 years external memory was fast enough. The MIPS R10000 uses external SRAM caches at CPU clock. Intel was afraid of using pins, and MIPS has 3 32-bit data connections on their chip? The third is to DRAM. The power goes somewhere: electromagnetic interference.
The program Logisim allows you to do all of this and build all these logic circuits. It's brilliant and will help get this information really lodged in your brain and fully understood.
Can you make a more in depth video where you do this but with a stack and function calls? We already saw how the stack works, but not in the context of a working program. For example, what exactly is a stack frame from the perspective of a CPU? How does it know which variables are located where in the stack, even when you make a new stack frame? How does it know where each stack frame is?
I'm currently in high school, and I'm thinking about studying Computer Science. This low-level stuff is really interesting, you are really one of the best YouTube finds of mine this year
About the question: You load the data into registers before performing an arithmetic operation for multiple reasons:
1) Fetching data from the RAM is way slower than using registers.
2) The CPU doesn't know if the result is going to be stored, so it would be a waste of time doing so.
3) I think that reading data from registers is safer, but I'm not sure.
4) Also, as someone else mentioned, you would need bigger instructions to specify the memory address.
1) The MIPS 5-stage pipeline was invented in 1980. The inventors observed that real programs need addressing modes. So access to RAM reads the address register, adds some offset using the ALU, sends the address out to RAM, and gets back the result in the next cycle. This works for ROM and SRAM (think: Commodore PET or a console like the PCEngine). DRAM at the time already multiplexed the address bus, so it (could) take two cycles just to transfer the address. Also, often in the first cycle some external component checks which memory chip should be enabled: ROM or DRAM or some IO chips?
2) No. All CPUs have CMP and TEST instructions. On RISC there is the zero register. Starting with the SH2, the flag output is optional.
3) On a computer with multiple components which can write to RAM, the registers are your safe space. Many computers had DMA read-out for graphics and sound, but the Amiga introduced a blitter which would write to (chip) memory. The Sega Genesis could have conflicts between the Z80 and the 68k.
4) That's why instruction encoding is so inefficient on CISC. At least the 68k has 16 registers and the swap instruction. Why did people not swap register banks on the Z80 that often? Always wanting to stay compatible with interrupts?
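A minimal sketch (mine, not from the video or the reply above) of the base-plus-offset addressing described in point 1: the ALU adds an address register to a literal offset, and the sum goes out as the memory address. The register names and toy memory contents are all made up for illustration.

memory = {0x1000: 42, 0x1004: 7}   # toy RAM: address -> value
regs = [0] * 4
regs[1] = 0x1000                   # r1 holds the base pointer

def load(rt, offset, rs):
    """lw-style load: regs[rt] = memory[regs[rs] + offset]"""
    address = regs[rs] + offset    # address computed by the ALU
    regs[rt] = memory[address]     # memory answers in a later cycle

load(2, 4, 1)                      # r2 = memory[r1 + 4]
print(regs[2])                     # -> 7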
In my college I had to write some assembly for a few weeks. We had to use accumulators for arithmetic operations. I don't know why, and now I can't wait for the next episode to learn if accumulators are just registers, or if that's the answer to the question 😂
They are just registers. I always thought that the accumulator in the 6502 is smarter, but it turns out that it can only calculate decimals, which is useless in games. JRISC on the Atari Jaguar gets slower if you load two registers. To keep things fast, reuse the result of the last instruction. There is a hidden accumulator. Actually, there are two; another one for MAC.
Wow!!! Awesome presentation! I already know most of this stuff, but I LOVE to hear such a clear presentation! One small quibble: the word "arithmetic" is pronounced aRITHmetic as a noun, but arithMETic as an adjective. E.g. an arithMETic operation performs aRITHmetic on its operands. Quibble aside, awesome stuff!
@@EMLtheViewer It was just my suggestion, I don't know. I researched the topic and didn't find anything, so I'm probably wrong. But I do know that in graphics cards, you can load an array of data in one instruction.
@@EMLtheViewer The 6502 has Increment and Rotate instructions which act directly on memory, and I hate it. But ROR was too much? Or a second accumulator B. But did you know that the 6800 only had one address register and could not efficiently memCopy?
Great video as usual! I love seeing all the pieces come together, even if it's getting hard to remember all the lower levels that have been abstracted away xd. I would love to see a video on drivers and how connecting external devices works at a low level. As for the question at the end of the video, my guess (without doing further research) would be to have the two data streams in the ALU connect directly to memory and set it up so that their write enable is turned on when an arithmetic instruction is performed.
Answer 🎉: the registers are much faster as they are part of the CPU, while RAM is separate and can be accessed by different cores, which means it's not fast enough to work directly alongside the ALU. So the CPU loads data into the registers and then executes and processes the instruction. However, there's a much faster type of memory called cache memory, which is even faster and stores limited amounts of the data that is being processed. I learned a lot from your vids; not sure about the answer or how that is being done. I knew about cache and registers before, but I had no understanding of how they work. Big thanks ❤❤
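To make the cache idea from the comment above concrete, here is a toy direct-mapped cache: a small, fast memory that holds a limited number of recently used values. The sizes and behavior are made-up illustration, not real hardware figures.

CACHE_LINES = 4
cache = [None] * CACHE_LINES       # each entry: (tag, value) or None

def cached_read(address, ram):
    index = address % CACHE_LINES  # which cache line this address maps to
    tag = address // CACHE_LINES   # identifies who currently occupies the line
    entry = cache[index]
    if entry is not None and entry[0] == tag:
        return entry[1], "hit"     # fast path: no RAM access needed
    value = ram[address]           # slow path: fetch from RAM...
    cache[index] = (tag, value)    # ...and remember it for next time
    return value, "miss"

ram = {a: a * 10 for a in range(16)}
print(cached_read(5, ram))         # -> (50, 'miss')
print(cached_read(5, ram))         # -> (50, 'hit')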
@@CoreDumpped In my code a lot of variables are initialized to some literal value: bitfields and safety limits for for-loops. With non-portable code, which knows that we are dealing with exactly 8 bits, a lot more helper variables have known values at compile time.
My answer: Because loading data into the registers allows us to use simpler and shorter instructions due to simpler operand addressing (yet I am not so sure; I am a complete newbie and have just started learning). As for the second part of the question: yes, I think there might be a better way to do it (yet I'm not completely sure it would really be better). My solution would be to use longer instructions (16-bit instead of 8-bit), which would allow us to use longer addresses for the operands, so we could grant the ALU direct memory access and it could fetch the values via separate data buses from memory without using any registers. It's just my concept and understanding; however, I have no idea whether it is correct or not.
Feels great to have an understanding of the components I'm learning about for my GCSE, this is really helpful! (If a bit beyond the spec of the course lol)
Answer: You can use carry-lookahead in the adder, or a carry-save adder. It uses more components but it is much faster than what you have shown in previous videos.
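Rough sketch of the carry-save idea mentioned above (my illustration, assuming Python integers stand in for bit vectors): three numbers are reduced to two, a "sum" word and a "carry" word, with no carry propagation at all; only the final addition needs a normal carry chain.

def carry_save(a, b, c):
    sum_word = a ^ b ^ c                              # per-bit sum, no carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carries, shifted
    return sum_word, carry_word

s, c = carry_save(5, 6, 7)
print(s + c, 5 + 6 + 7)   # both print 18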
Man, I don't know what kind of words would give you enough appreciation for this channel, but all I can say is: thank you for making these videos free and thank you for educating people. You're the best.
I really like the designed opcodes. I also recommend the new Computerphile videos on out-of-order execution and branch prediction. Also, the compiler optimizes the code for modern CPUs, and many of its tricks show how a CPU is built; for example, unrolling a loop into four steps at a time helps performance. Thank you for the great videos.
I would love to see a steampunk demake of an instruction scheduler. I like the instruction queue in the 8086. Now let's take the 4-bit ALU from the Z80. Then let's try to occupy memory I/O, the ALU, and two-port register file access every cycle. No global flags, but RISC-V branch instructions.
Why load data into registers before using them for ALU operations? Registers are the closest and fastest memory storage units within the CPU. They provide temporary storage for data that is about to be processed or data that has just been processed by the ALU. Is there a better way to do this? One possible alternative is placing memory closer to the processing unit to minimize data movement. However, this approach still requires a temporary location, such as registers, to perform operations efficiently.
The machine language (ISA) of the SAP-1 does not know about registers, yet the implementation needed at least 4. When do you call something a latch and when a register? There is an instruction "register" after all.
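One common way to draw the line asked about above, as a toy model of my own (not an answer from the video): a latch is level-sensitive, transparent while its enable is high, while a register/flip-flop is edge-triggered and captures its input only when the clock rises.

class Latch:
    def __init__(self): self.q = 0
    def tick(self, enable, d):
        if enable:                      # transparent while enable is high
            self.q = d
        return self.q

class Register:
    def __init__(self): self.q = 0; self.prev_clk = 0
    def tick(self, clk, d):
        if clk and not self.prev_clk:   # captures on the rising edge only
            self.q = d
        self.prev_clk = clk
        return self.q

r = Register()
r.tick(0, 5); r.tick(1, 5)   # value 5 is captured on this rising edge
print(r.q)                   # -> 5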
I wanted to know which book or documentation explains these things so thoroughly. Even the most advanced books (that I know of) don't explain it like this. Could you please give some suggestions for books that explain architecture and how components are built/connected at such a low level?
This is great, and I want to know where you learnt all of this stuff. Both generally and specifically, because I would love to go into more detail myself.
If you make a video on the control unit and how it is made, I would be appreciative. Even though you said it would be long, it would be helpful and unique for sure.
Can you organize these in a playlist? You mentioned twice that "we've seen before" something, but I don't know which video starts this course; I imagine it's the "HOW TRANSISTORS RUN CODE?" one.
This video was sponsored by Brilliant.
To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/CoreDumped. You’ll also get 20% off an annual premium subscription.
But how the program first loads from a HARD DISK into MEMORY to get started is missing, bro, I think. Anyways, lots of love from India ..!
What do you use to animate your text in your videos? It's so slick! Really top quality.
Everything shown in the video you can build yourself, in a playful manner, in the virtual environment of the Steam game "Turing Complete". You start by creating simple NOT and AND logic gates, build more complex components like an ALU, and finish the game with a simple version of a Turing-complete computer that you can use in-game to write real programs. Highly recommend!
Thank you so much
great game
Thank you, need to check this out
Thanks ❤
Thanks for sharing!
Started with JS, didn't like the abstraction level of the language, came all the way down to 0's and 1's. No regrets.
0s and 1s? Too much abstraction. You better start learning how the MOSFET transistors route current through capacitors and diodes
Really? You write code in binary?
Liar.
I totally get it. I'm a frontend dev but I'm fascinated with lower level code. As soon as you get into decoding Bluetooth streams or parsing protobuf bytes or if you do anything with WebAssembly, it becomes pretty clear that JS just wasn't meant to handle those situations.
Same here!
Epic!
probably the only channel I willingly watch ads for. Please don't stop producing these videos, you are the man
There is another ...
It's called The Why Files.
You have no idea how much this video series has helped me understand how data flows in circuits inside a CPU and, more importantly, how the CPU identifies a specific instruction and how it then determines which data to move where. Thank you 🙏
Since you mentioned in this video that you would explain how the CPU clock works, I am eagerly waiting to understand that as well.
Brilliant, absolutely brilliant! Here's a tip for all devs out there: implementing simple emulators/virtual machines can be a great way to learn more about how CPUs operate. The Chip-8 is a great place to start!
Another great place to start is watching Ben Eater's 8-bit breadboard CPU series.
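To give the Chip-8 suggestion above some shape, here is a tiny slice of an interpreter (my sketch, not anyone's project): fetch a 16-bit opcode, decode it by its nibbles, execute. Only two real Chip-8 instructions are shown (6XNN: set register, 7XNN: add to register); a full interpreter has around 35, and real programs load at address 0x200 rather than 0.

regs = [0] * 16                    # Chip-8 has 16 8-bit registers, V0..VF
memory = bytes([0x60, 0x05,        # 6005: V0 = 0x05
                0x70, 0x03])       # 7003: V0 += 0x03
pc = 0

while pc < len(memory):
    opcode = memory[pc] << 8 | memory[pc + 1]   # fetch: two bytes, big-endian
    pc += 2
    op, x, nn = opcode >> 12, (opcode >> 8) & 0xF, opcode & 0xFF
    if op == 0x6:
        regs[x] = nn                            # 6XNN: VX = NN
    elif op == 0x7:
        regs[x] = (regs[x] + nn) & 0xFF         # 7XNN: VX += NN (8-bit wrap)

print(hex(regs[0]))   # -> 0x8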
Learning more from bro's videos than my CS degree
I learned all these things, many, many, years ago when writing code for a Commodore 64 in assembler. However, I have never seen a better explanation! Thanks a lot! 😊
5:00 "quite straightforward isn't it ?!" BRO, I don't even know why I'm watching this video, I'm a student in management with no programming skills xD
if you can use your fingers to perform counting math, you're good enough to understand this video's content. and i am not joking. computers are actually idiots counting using their fingers, they just do it insanely fast.
@@lis6502 yep I agree
Simple: don't watch it then, bro! The dude's information is amazing for somebody who has studied this, even for people who can't use their fingers....
@@lis6502 they do it fast just because electricity flows fast, and the whole thing works like clockwork automation, just faster.
@@lis6502 they could be counting fingers in parallel, extremely fast
These videos are brilliant! I already have experience building CPUs out of logic gates, but these videos would have been insanely helpful a few years ago.
19:37 The reason is we can't both read from a register and write to a register at the same time without a clock. Using a clock, we can set the rising edge of the clock for reading from registers and the falling edge of the clock for storing the result in the register.
This will take only 1.5 CPU cycles (half for fetch, one for both decode and execute). I hope this was enough.
To be able to increment a register in the file, the 6502 has a hidden input register in front of the ALU. The ARM2 has a write-back register behind the ALU.
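A toy model of the two-phase scheme described in the answer above (my sketch; register numbers and values are invented): operands are read from the register file on the rising edge, and the ALU result is written back on the falling edge, so a read and a write never collide.

regs = [0, 3, 4, 0]                # r1 = 3, r2 = 4
latched_a = latched_b = 0

def rising_edge():                 # phase 1: read operands into the ALU inputs
    global latched_a, latched_b
    latched_a, latched_b = regs[1], regs[2]

def falling_edge():                # phase 2: write the ALU result back
    regs[3] = latched_a + latched_b

rising_edge(); falling_edge()
print(regs[3])                     # -> 7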
Generally true, but that also depends on the convention one uses. One can use low and high clock pulses for different tasks which is typically referred to as the rising and falling edges, but you also have high and low logic too. So, in general, there are 4 possible variants to choose from, consider the following table:
Clock Edge | Logic Type
------------------ | ------------------
rising | low
rising | high
falling | low
falling | high
This is something to be aware of. One always has the rising and falling edge of the clock pulse to work with, but as for logic type or convention, the main distinction is the difference between using NAND-based or NOR-based technologies within the construction of sequential logic (memory-based components). Some latches or flip-flops can be constructed from NAND-based components while others can be constructed from NOR-based components. The main difference between the two is when the Reset lines of these latches are asserted.
Now, the vast majority by convention typically choose NAND over NOR because NAND is cheaper: it requires fewer transistors to construct as well as less space in its layout design. This doesn't mean that NOR-based logic doesn't have its uses. In some situations and devices, NOR might be preferred, as it depends on the nature of the problem and the context in which it is being used.
So, in general and in most situations, we typically use the NAND variant. Now, as for low and high logic in conjunction with NAND- and NOR-based components, we can always convert one to the other. You can have a purely NAND system working in its natural logic, but you can also configure it to work in its opposite variant, and the same can be done with NOR-based components. It's just that the conversion process requires more logic to be thrown at it.
But yes, memory access with read and write operations that trigger their control lines is determined by the rising and/or falling edges of the clock. Yet there is another component that is vital to this: the edge detector. Without the edge detector, the rising and falling edge signals of the clock pulse would be basically useless. If one cannot determine whether the edge is rising or falling, then the entire discussion of setting the memory's control lines is kind of moot. However, many (though not all) memory devices have their own edge detector built in internally. So in many cases we can abstract away from this, yet that's not always the case. Some memory components don't have any edge detector, and in those situations we do have to roll our own, for without an edge detector the clock pulse is meaningless.
@@skilz8098 there is a great write-up of the clock in the i386. Basically, there are two kinds of clock lines routed across the chip; their signals are inverted. Rising and falling latches are grouped so that we don't have to pay twice for the "last mile". At least on this chip (or any microprocessor I looked at) there is no edge detector. Just the flip-flop is switching from input to feedback cycle, so it captures its current value on the edge of the clock. The edge does not need special properties. It can even bounce, as long as the input value is stable during this period.
I wonder if we could reduce noise on the power lines with a three-phase clock with one phase inactive at a time. All rise and fall switches are balanced and don't introduce common-mode noise on the power supply. So sequential logic would have 3 groups of latches. One of them passes through the signal; at all times at least one latch latches.
This is peak YouTube. Everyone else give up making videos. This channel wins.
I would think the reason for using registers is to reduce the amount of “wires” needed. For example, to add/subtract values, there need to be dedicated “wires” to each register you may potentially operate on. If you wanted to try and perform arithmetic directly on a register in memory, you would need to multiplex every possible memory register (twice) and have the multiplexed output go to the ALU.
Regarding a better way to do it, idk maybe get rid of the multiplexer somehow? Or serialize it with a shift register of some sort? Although I’m guessing that would slow things down.
A shift “register”. So still a register. The AtariJaguar actually can read two values out of palette SRAM at the same time. The processor writes the output into the linebuffer ( another SRAM ).
Main reasons for using registers are speed (to avoid slow memory read / write operations) and to increase the compactness of the machine code (to avoid specifying memory addresses which can be up to 64-bits long). Historically, all sorts of CPU designs existed. Some stored almost all registers in memory (such as TMS9900), and some partially (6502 "zero page" can be viewed as 256 external registers of the CPU).
To answer the question at 19:27: we load data into the registers before using them in the ALU because we simply need somewhere to put the data while it is being used by the ALU.
One approach to optimizing this would be to allow one operand to be inside a register with the other operand in memory. This would require us to switch to 16-bit wide instructions, so that we have extra bits to specify whether the 2nd operand is in memory or in a register. (We still want to be able to do arithmetic operations with both operands in registers.) An example instruction on this architecture might look like 00 10 11 10 01100110 (00 to indicate an arithmetic operation, 10 to indicate subtraction, 11 to indicate register 3, 10 to indicate the second operand is in memory, 01100110 to indicate the second operand is at memory address 01100110). This approach would also require that we change our control unit's wiring a little, since we would need the memory data bus to also connect to the ALU, and we would need to add extra decoders with logic to handle the larger instructions. This approach effectively saves us one load instruction, since now our assembly for adding two numbers would look like: load r1 1100, add r1 [01100110], store r1 1101 (with the square brackets indicating that we want to use the value at the contained address as our rhs).
Beyond this approach, we could modify our CPU even further to support performing operations directly on memory if we wanted; however, this would require even larger (and probably variable-length) instructions, and potentially adding buffers to one of the inputs of the ALU to hold the first operand while the second one is being fetched. That might be hard to implement due to multiple operations being performed per instruction, and it might not be any faster, since we would still have to access memory multiple times. Plus, this is starting to go a little beyond the simple example CPU and is getting closer to some of the things modern x86 CPUs do! Thanks for the thought-provoking question!
Reg-mem instructions were actually quite common on CISC. 68k for this reason has 16 bit instructions. 6502 used the zero page of memory a lot. The clean approach is to use 8 bit offsets. So you have a base pointer ( into an array, heap, or stack ) and use a literal offset to access fields. ( or branch around in code ). The nice thing: load is just a normal instruction. Still Store is the weird one.
That is of course known as "absolute" (direct) addressing and is something you can do with the MOS 6502
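Decoding the hypothetical 16-bit format proposed above, field by field. The example word is the "sub r3, [01100110]" from that comment; none of this is a real ISA, it is just that proposal expressed in code.

word = 0b0010111001100110

op_class = (word >> 14) & 0b11     # 00 -> arithmetic operation
alu_op   = (word >> 12) & 0b11     # 10 -> subtraction
dest_reg = (word >> 10) & 0b11     # 11 -> register 3
mode     = (word >> 8)  & 0b11     # 10 -> second operand is in memory
operand  =  word        & 0xFF     # 01100110 -> memory address

print(op_class, alu_op, dest_reg, mode, bin(operand))
# -> 0 2 3 2 0b1100110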
Really love what you've done on how CPUs run programs at the very circuitry level.
Do you have any plans to illustrate how GPUs work at the circuitry level in future episodes? How GPUs handle matrix calculations and graphics calculations?
Thanks for the support! Yes, I do have plans for GPUs from scratch.
Can we start with 2D? C64, NES, MSX2, PCEngine?
If you care for Matrix multiplication, check out the Atari Jaguar which has some optimization to keep the fused multiply and add unit busy every cycle. PS1 GTE works similar.
Both have a pixel shader. The Jaguar manual goes into great detail how they tried to shade 4px at once to keep up with the speed of the memory and avoid any cache logic.
This video was very useful, and it presents a lot of content with simple explanations. Please do a video on clock signals.
I can only say EXCELLENT to this animated series about how computers work, from CPU structure to assembly code. Although we already cover this concept in university courses, it is not easy to explain to people not majoring in computer science. But you did it: you made every step a CPU takes when reading code friendly to understand.
15:35 The TITLE DROP was EPIC😨😨
It's amazing. It's building on things someone else did before, incrementally, till you have the powerful computers of today, even on your wrist. These are all things you learn in Computer Science and OS concepts.
The animations that transition from diagram to diagram help immensely to not get lost in what's happening. Great Work!
19:12 One note though:
The compiler will precompute that and just output a single store command, or directly use the precomputed value elsewhere if its value isn't changed somewhere.
They are clever in that regard.
So the better you are able to communicate to the compiler what you want to achieve, the more optimized your program will be, essentially for free.
Nice comment! I didn't get into optimizations because there's a video dedicated to it. But I guess mentioning it wouldn't hurt.
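To illustrate the constant folding described in the comment above, here is a toy folder of my own over a tuple-based expression tree; real compilers do this on their IR, so treat this purely as the idea in miniature.

def fold(expr):
    # expr is either a number, a variable name, or a tuple like ("+", left, right)
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return left + right if op == "+" else left - right  # fold it away
        return (op, left, right)   # something is unknown; keep the operation
    return expr

print(fold(("+", 2, ("-", 7, 4))))   # -> 5, no runtime arithmetic left
print(fold(("+", "x", 3)))           # -> ('+', 'x', 3), can't fold a variable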
I can't tell you how much I love your videos. This one in particular helped me fill in the gaps in Matt Godbolt's series on how CPUs work over on Computerphile. Can't wait for more :)
XXX episodes later... crafting a simulated universe
For your question:
Why?
1. because registers are fast and memory is slow.
1.a because the memory doesn't have a direct line to your ALU
2. because addressing registers is easy, addressing memory takes up a lot of space.
2.a because addressing multiple kinds of things makes the instruction set, and therefore the decoding, a lot more complicated, and you couldn't have efficient fixed-width instructions.
Better way?
Not without making the architecture a lot more complicated. And compilers are really good at juggling registers.
But in theory, you could make a direct data bus from memory to ALU, more precisely to the funnel shaped multiplexers. But there's still the problem of memory addresses being 4 bit. Of course you can make instructions that do arithmetic with the memory address and a fixed register, so the 4 address bits can all be used for the memory address. But that sounds like a step back, not forward.
Also, adding ALU access to each memory cell would drastically increase the circuit size with very little benefit compared to the cost. Memory size would decrease by orders of magnitude even if you just connected every cell to an ALU, not to mention the problems in doing so for non-NAND-based memory.
Your assumption about "memory is slow vs registers" is not applicable to the example CPU architecture in this video. Memory here is the same pile of transistors as the registers; the only difference is a set of wires.
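A made-up cycle-count model for the thread above. The numbers are invented purely to show the shape of the argument: once real memory is slower than registers, a memory-memory add pays the latency on every operand (and, as the reply above notes, in this video's toy CPU the two costs would be about equal, which weakens the argument there).

REG_ACCESS = 1     # cycles, hypothetical
MEM_ACCESS = 10    # cycles, hypothetical

def add_reg_reg():                 # operands already sitting in registers
    return REG_ACCESS + REG_ACCESS + 1     # read, read, ALU

def add_mem_mem():                 # operands fetched from memory every time
    return MEM_ACCESS + MEM_ACCESS + 1

print(add_reg_reg(), add_mem_mem())   # -> 3 21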
Always some of the best videos on this platform. Next semester I will probably point some of my students to these videos for those interested in systems programming.
Amazing series!
For the question:
1) Fetching from memory is slow; by fetching first, we can cache the value if we need it in following instructions.
2) The number of wires connected to the ALU will be at least 2× the address bus, because we need to tell the ALU the two addresses in the case of a two-operand instruction.
3) The instruction size will be bigger, because it needs to hold at least 2× the address bus; currently, choosing from registers needs a lower number of wires.
Wooow! I loved these concepts, congratulations and thank you!! I've used these concepts on my CPU made from scratch with logic gates.
This video is all I needed to fulfil my curiosity. I always used to wonder how instructions are given to a computer.
They always say computers use 1s and 0s but never specify how that works at the hardware level.
Brilliant video again!!! Could you make a video explaining how a cpu clock works in detail to go from one instruction to another?
I think it's because registers are like variables for the hardware, and you might want to do something else with this "variable" (which is the value stored in the register) before changing it.
Please continue making these kinds of videos; I really learn a lot.
You could directly wire the memory to the ALU, but this would require two data buses. Working directly with registers is magnitudes faster than loading data from memory. Usually an instruction does not live in a vacuum, and a value could be used multiple times once it is loaded into a register. Also, memory management is its own can of worms when you start taking virtual address spaces and caching into account.
Yeah, people forget that while the 6502 has only 8 data pins, internally it has two data buses (okay, data only, no address) which can be parted to have 4 in total. The partition border is (or can be) between the ALU inputs. So for inputs you would need to pick registers from different halves of the register file.
This is the super cheap CPU from 1975!
Thanks!
Your series helped me a lot to understand the lower level concepts. Thanks for making and sharing them!
Answer to the question:
I think you can: just directly pick data from the memory, let the ALU do the math, and then store the result. Except the instruction needs to be a lot bigger and you need more memory, and I think there will be an issue with storing the result.
I mean, a lot of home computers and consoles had multiple memory chips. I always feel like it is a design failure… So a mem-mem instruction would always move data from one bank to the other.
@@ArneChristianRosenfeldt these machines had PLAs and MMUs doing the heavy lifting of juggling the buses and providing enough abstraction for chips not to collide with each other.
Your thinking has a fundamental flaw: a mem-mem instruction would only always move data from one bank to another if these memory banks were the only chips attached to said bus. When you have to arbitrate between multiple nodes wanting to share their results or tackle a new problem, you have a shitload of switches to correctly flip to isolate two and only two interesting parties on the bus at the exact moment of data exchange between them.
@@lis6502 I was kinda spreading my “wisdom” across many similar comments. Many people don’t seem to know how complicated memory access is in a real computer. Just do the math: each bit costs 6 transistors in SRAM and 1 transistor plus a huge capacitor in DRAM. So a C64 with 8kBit chips has more transistors in every memory chip than in any other. So they are expensive and of course designers try to utilise them to the max.
MMU is something I don’t really understand. I think that the N64 was the first console to have it. The Amiga OCS and Atari ST don’t have it. But then again, to select a chip on the bus, the most significant bits of the address need to be checked anyway. So you could have like 16 chip selects which can be mapped to arbitrary pages in memory. So like in the NEC TurboGrafx with their 8kB pages to ROM? I thought that the PLA does not cost time because it runs in parallel. It is just there to not enable write or read until the last possible point in time.
Answer to your question: registers have bit-level control of read and write, which allows passing one at a time to the ALU and facilitates multiplexing, but memory is just a block with no granular control (probably 😅).
I am thinking about registers like a workbench and memory as the closet.
Improvement!!!
Dividing memory into subdivisions, keeping closer proximity to the registers; for that, a more defined instruction set will be required.
Please respond if I am right or wrong.
Actually, you are the best at presenting and explaining such complex content. I watched the previous episodes and they were wonderful. Keep going 👍👍👍
The best video explanation series ever about how everything was built up! Thank you!
Really great effort, bro; this video means a lot to me, even though I've been an embedded systems engineer for more than 7 years.
I recommended this channel to my diploma mates, engineering mates, and even my colleagues.
Great effort, please do continue your work. One of the great channels on YouTube; I regularly look forward to new videos.
This is amazing! I'm learning so much from you!
I love that you started from transistors and latches, and now we are going into more and more abstraction. It would be really difficult to keep track of all the wiring if ALL transistors would still be visible!
Excellent video. The control scheme depicted is a "hard-wired" instruction set. I suggest a follow-up discussing the differences between a microcoded instruction set and a hard-wired instruction set. Reason: most CPUs today (even RISC) are microcoded. Source: CPU architect Jim Keller.
This is a fantastic video. The animations are excellent and help a lot to teach all the concepts and make the execution logic and sequence very clear. Congratulations and thank you!
If this channel existed in the 2000's, I probably would've been able to graduate in electronics engineering. This kind of content actually makes me want to study it again!! :D
Currently in high school, and I plan to learn as much as possible before college. You are way too underrated to be providing free education, man! Love your videos, please keep posting :D
Instead of registers, you could use a stack. Push a value from memory when loading onto the stack and ALU operations take the top two entries as inputs, remove them, and put their result on the top, then pop the top value back to memory to store it. The Intel 8087 Floating Point Unit does this, with a small stack of 8 entries, although it's not a strict implementation, as it does allow some access out of order.
The 8087 stack does not reside in memory, though; it's not a separate chip. But then again I guess it is optimized more for density than for speed. 80 bits in 1981!!
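A minimal sketch of that stack-machine idea in C (the depth and the operations are invented for illustration, not the 8087's actual instruction set):

    #include <stdio.h>

    static double stk[8];   /* a small 8-entry stack, as the comment describes */
    static int top = 0;     /* number of live entries                          */

    static void   push(double v) { stk[top++] = v; }
    static double pop(void)      { return stk[--top]; }

    int main(void) {
        /* Compute (3 + 4) * 2 with no named registers at all: */
        push(3.0);
        push(4.0);
        push(pop() + pop());   /* ADD consumes the top two, pushes the result */
        push(2.0);
        push(pop() * pop());   /* MUL does the same */
        printf("%g\n", pop()); /* prints 14 */
        return 0;
    }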
There are multiple advantages to restricting arithmetic instructions to operating on registers only (this is a rephrasing of your question).
First, registers are usually faster than memory, even more so in flash, off-chip, or dirty-cache situations.
Second, with fixed-width instructions it leaves more bits to specify the particular ALU operation.
Third, it enables microcode. By handling complicated instructions (like multiplication or bit-swapping) internally as many simpler instructions, you can offer a much richer instruction set to the user without adding a lot more dedicated logic to your silicon.
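The second point is easy to see with a back-of-the-envelope encoding; here is a C sketch of a hypothetical fixed 16-bit instruction word (all field sizes are assumptions):

    #include <stdint.h>
    #include <stdio.h>

    /* 4-bit opcode + three 3-bit register numbers = 13 bits, with room to spare. */
    static uint16_t encode_rrr(unsigned op, unsigned rd, unsigned rs1, unsigned rs2) {
        return (uint16_t)(((op & 0xF) << 12) | ((rd & 7) << 9)
                        | ((rs1 & 7) << 6)   | ((rs2 & 7) << 3));
    }

    int main(void) {
        printf("ADD r2, r0, r1 -> 0x%04X\n", encode_rrr(0x1, 2, 0, 1));
        /* A mem-mem ADD would need two full 16-bit addresses (32 bits of operand
           field alone), so it could never fit in one fixed 16-bit word. */
        return 0;
    }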
I wonder if, on the Amiga, it is possible to time the start of the microcode execution of DIV or MUL so that the 68k gets totally off the bus while, for example, the graphics DMA runs?
Ah, context switching!
@@skilz8098 the context is “microcode”, your third paragraph. What does microcode bring to the table when no one else uses the bus in the meantime? Microcode already takes instruction fetch off the bus. Why also remove data access? JRISC, for example, uses microcode to implement the inner vector product as reg-mem.
@@ArneChristianRosenfeldt Well, for modern CPUs, architectures, and computer systems, sure, you wouldn't be wrong.
Yet if one is building their own system from discrete logic gates or ICs, they are literally hardwiring the buses, the bus transceivers, decoders, etc., and they have to measure or be aware of their clock signals and even the strength of those signals.
It all depends on which path they take: they could build out all of the logic for their decoder, or they could just wire it to, say, an EEPROM and program it with the binary equivalent.
When I started to implement Ben Eater's 8-bit breadboard CPU in Logisim, I had to build out the components (registers, logic, counters, etc.) for the control unit. Ben did all of that on the breadboard, where they controlled the I/O signals coming out of his EEPROM module. Here is where my implementation varied slightly from his: I used Logisim's built-in ROM, and I wrote a C++ application similar to his EEPROM C-like code to generate all of the binaries. Then I had to take the printed output from my C++ project, massage it (trim off all whitespace), and load it into the ROM module within Logisim.
This was also tightly integrated with a second counter for all of the micro-instructions. Not for every single main system clock cycle itself, but for every increment of the program counter there were up to 4 or 5 sub-cycles. This secondary counter, on a varying repeatable loop, managed the fetch, decode, and execute cycle. Some instructions only took, say, 3 cycles, where others took 4 and some took 5. All of this had to be taken into account to tell each device when to either accept or assert data on the bus. It was a single bus that handled instructions, addresses, and data: a system-wide shared bus.
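A minimal C sketch of that kind of ROM-image generator (the control bits, opcodes, and step counts are invented stand-ins, not Ben Eater's actual layout):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical control-word bits, one per device control line. */
    enum {
        CO = 1 << 0,  /* program counter out        */
        MI = 1 << 1,  /* memory address register in */
        RO = 1 << 2,  /* RAM out                    */
        II = 1 << 3,  /* instruction register in    */
        CE = 1 << 4,  /* program counter enable     */
        IO = 1 << 5,  /* instruction register out   */
        AI = 1 << 6,  /* A register in              */
    };

    #define STEPS 5   /* up to 5 sub-cycles per instruction */

    int main(void) {
        for (unsigned opcode = 0; opcode < 16; opcode++) {
            uint8_t rom[STEPS] = {
                CO | MI,       /* step 0: put the PC on the address bus      */
                RO | II | CE,  /* step 1: latch the instruction, bump the PC */
                0, 0, 0        /* steps 2..4: filled in per opcode below     */
            };
            if (opcode == 0x1) {   /* hypothetical LDA                       */
                rom[2] = IO | MI;  /* operand address into the MAR           */
                rom[3] = RO | AI;  /* memory value into register A           */
            }
            for (unsigned step = 0; step < STEPS; step++)
                printf("%02X ", rom[step]);  /* ROM address = opcode*STEPS + step */
            printf("\n");
        }
        return 0;
    }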
So, again, it also depends on the overall layout of the system; it depends on its architecture. Sure, a modern CPU or system doesn't use a single bus anymore, and for most intents and purposes, even at the assembly level, we don't usually concern ourselves with the fetch, decode, and execute sub-cycles, but we should still be aware of their mechanics and properties.
Take an NES apart and trace out all its paths. Or take apart an Atari and trace out all of its paths. Get really creative and take apart an Arcade cabinet and trace all of its paths. And that doesn't account for all of the processors, coprocessors, and microprocessors within those systems.
@@skilz8098 what is modern? The 6502 is from 1975 (the 70s, I am sure). I think it predates the SAP-1 book. The 6502 solved the variable length of instructions: there are end markers in the microcode. Some illegal instructions manage to miss those and freeze the CPU. Wiring of clock signals is important in ICs: for example, up to 1985 commercial chips only had a single metal layer, so any crossing needed a bridge through highly resistive poly-Si. The clock distribution network on the DEC Alpha is so powerful and has such wide conductors that it is clearly visible in the die shot. They were clearly pushing the limits.
SAP uses plain adders, but I read that you could buy TTL ALUs in the 60s. The Datapoint 2200 and other processors even operated on single bits (Usagi Electric has videos). Discrete multiplexers and shifters must have been cheap. But then, the C64 started out with memory chips that had a single data pin. Maybe the Datapoint used those, or those early Intel rotation registers which acted as main RAM.
I like how Atari and Commodore insisted on tri-state address pins for the 6502 CPU, so indeed everything sits on a common bus. Same with the Atari Jaguar. The N64 has a point-to-point connection between the custom chip and the DRAM chips over RAMBus. Arcades are a bit like a PC: lots of expansion cards on a mainboard, optimized for neither cost nor energy efficiency. Though some are straightforward: buy the expensive 6809 and a lot of RAM for a double buffer with many colors. That much RAM filled its own dedicated board back in the day and was obviously expensive. How do low-volume ROMs work?
Dear JORJ, Thank you for your excellent presentation.
Great! I believe an episode on how the program counter works would complement this series.
I think loading data into registers before using them for ALU operations is essential. Loading directly from main memory into the ALU might seem to simplify CPU operation, but it is practically inefficient. It's crucial to note that accessing main memory is much slower than accessing registers: registers are accessible in one clock cycle, while reaching main memory may take dozens or even hundreds of cycles, which would stall the ALU.
Moreover, ALUs are designed to work with a small, fixed number of registers. Giving them direct access to main memory would make them harder to design and potentially slow down the whole CPU. Most CPU architectures are structured around using registers for data manipulation because it simplifies the instruction set and makes instructions more efficient. Changing this design would require a complete rethink of the architecture, resulting in higher complexity as well as compatibility problems.
Pipelining is heavily used in modern CPUs to execute numerous instructions simultaneously. In this process, registers play a key role by providing fast storage for intermediate results and facilitating quick movement of data between different stages of the pipeline. Direct memory access would increase latencies, interrupting that smooth flow and reducing overall efficiency.
Using main memory also leads to higher power consumption compared to using registers, which is a major concern both in mobile devices and in servers.
One of the better ways to handle this (I'll share one example for now), as most modern architectures do, is to include several levels of memory hierarchy, such as L1, L2, and L3 caches for prefetching, which reduces latency and improves performance overall.
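If you want to feel that hierarchy from C, one classic (and unscientific) experiment is to walk the same matrix in cache-friendly and cache-hostile order; the size here is an arbitrary choice:

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    static long m[N][N];   /* 128 MB, so it cannot fit in any cache level */

    int main(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)       /* touch everything first so the OS   */
            for (int j = 0; j < N; j++)   /* really allocates the pages         */
                m[i][j] = 1;
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)       /* row-major: sequential addresses,   */
            for (int j = 0; j < N; j++)   /* almost every load hits a warm line */
                sum += m[i][j];
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)       /* column-major: stride of N longs,   */
            for (int i = 0; i < N; i++)   /* a cache miss on nearly every load  */
                sum += m[i][j];
        clock_t t2 = clock();
        printf("row-major:    %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return (int)(sum & 1);  /* use sum so the loops aren't optimized away */
    }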
Also, realistic memory access goes through address registers, so you don't actually cut down on registers. I think the 6502 zero page is a mistake.
In the NEC TurboGrafx, the CPU, RAM, and ROM all cycle at 8 MHz; no 8-bit CPU after this one. But for at least 10 years, external memory was fast enough. The MIPS R1000 uses external SRAM caches at CPU clock. Intel was afraid of using pins, while MIPS has three 32-bit data connections on their chip? The third one goes to DRAM.
The power has to go somewhere: electromagnetic interference.
The program Logisim allows you to do all of this and build all these logic circuits. It's brilliant, and it will help get this information really lodged in your brain and fully understood.
Hey man, it's nice to see you are getting sponsors for your videos. Keep up the good work.
Can you make a more in-depth video where you do this but with a stack and function calls? We already saw how the stack works, but not in the context of a working program. For example, what exactly is a stack frame from the perspective of the CPU? How does it know which variables are located where in the stack, even when you make a new stack frame? How does it know where each stack frame is?
Thank you very much.
I have understood the whole thing.
Easy explanation.
From 🇧🇩🇧🇩🇧🇩
I'm currently in high school, and I'm thinking about studying Computer Science. This low-level stuff is really interesting; you are one of the best YouTube finds of mine this year.
Then this will only be one class at university. CS really loves high-level stuff.
This channel inspired me to re-read the nand2tetris book.
6502 Assembly (Sybex). Loved that book; your video is incredible. Thank you.
Even if you make a 10-hour video, I will watch it without skipping a second. The quality is unmatched.
Dude, every time I think of a question, you answer it very soon after. These are awesome.
About the question:
You load the data into registers before performing an arithmetic operation for multiple reasons:
1) Fetching data from RAM is way slower than using registers.
2) The CPU doesn't know whether the result is going to be stored, so writing it back automatically could be a waste of time.
3) I think reading data from registers is safer, but I'm not sure.
4) Also, as someone else mentioned, you would need bigger instructions to specify the memory addresses.
1) The MIPS 5-stage pipeline was invented in 1980. The inventors observed that real programs need addressing modes. So an access to RAM reads the address register, adds some offset using the ALU, sends the address out to RAM, and gets back the result in the next cycle (see the sketch after this list). This works for ROM and SRAM (think: Commodore PET, or a console like the PC Engine). DRAM at the time already multiplexed the address bus, so it could take two cycles just to transfer the address. Also, often in the first cycle some external component checks which memory chip should be enabled: ROM, DRAM, or some I/O chip?
2) No. All CPUs have CMP and TEST instructions. On RISC there is the zero register. Starting with the SH2, the flag output is optional.
3) On a computer with multiple components that can write to RAM, the registers are your safe space. Many computers had DMA readout for graphics and sound, but the Amiga introduced a blitter which would write to (chip) memory. The Sega Genesis could have conflicts between the Z80 and the 68k.
4) That's why instruction encoding is so inefficient on CISC. At least the 68k has 16 registers and the swap instruction. Why did people not swap register banks on the Z80 that often? Did they always want to stay compatible with interrupts?
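Here is the sketch mentioned in point 1: a rough C model of the load path ("lw rt, imm(rs)") through such a pipeline. The names, endianness, and sizes are illustrative assumptions, not a faithful MIPS model.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t regs[32];
    static uint8_t  ram[1 << 16];

    static void lw(unsigned rt, unsigned rs, int16_t imm) {
        uint32_t addr = regs[rs] + (int32_t)imm;      /* EX: the ALU adds base+offset */
        uint32_t data = (uint32_t)ram[addr]           /* MEM: the data comes back in  */
                      | (uint32_t)ram[addr + 1] << 8  /* the following cycle          */
                      | (uint32_t)ram[addr + 2] << 16
                      | (uint32_t)ram[addr + 3] << 24;
        regs[rt] = data;                              /* WB: write the register file  */
    }

    int main(void) {
        ram[0x100] = 42;   /* little-endian word 42 at address 0x100 */
        regs[1] = 0x0F0;   /* base register                          */
        lw(2, 1, 0x10);    /* lw r2, 0x10(r1)                        */
        printf("r2 = %u\n", regs[2]);
        return 0;
    }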
Congrats bro, these videos are magnificent: well explained, with nice animations. The best YouTube channel on CS.
In my college I had to write some assembly for a few weeks. We had to use accumulators for arithmetic operations. I don't know why, and now I can't wait for the next episode to learn whether accumulators are just registers, or whether they are the answer to the question 😂
They are just registers. I always thought the accumulator in the 6502 was smarter, but it turns out it can only additionally calculate in decimal (BCD) mode, which is useless in games. JRISC on the Atari Jaguar gets slower if you load two registers; to keep things fast, reuse the result of the last instruction. There is a hidden accumulator. Actually, there are two: another one for MAC.
Your ability to synthesize information is alien-level.
Insanely educational! I'm outside the computer scene, with only a passing interest, but your content literally hooks me. Keep cooking!
Thank you so much, brother. Great video; it gave answers to questions of mine that had gone unanswered for decades. Thank you so much.
This is exactly the kind of video I've been looking for for a while.
Wow!!! Awesome presentation! I already know most of this stuff, but I LOVE to hear such a clear presentation! One small quibble: the word “arithmetic” is stressed a-RITH-me-tic as a noun, but arith-ME-tic as an adjective. E.g. an arith-ME-tic operation performs a-RITH-me-tic on its operands. Quibble aside, awesome stuff!
Bro, I love your videos so much. I want to know more about digital electronics; any books or videos you'd recommend for it?
Please never stop making these.
Just the Greatest Explanation Ever Seen!!!
You're the best, thank you man.
Could you make an episode about the kernel, or about how a hypervisor works, please?
For the last question: you could read directly from memory into the ALU and store the result back directly. That should be faster, but it needs additional logic.
Is this type of operation supported by modern assembly languages in any way, or is it a hypothetical architecture feature?
@@EMLtheViewer It was just my suggestion, I don't know. I researched the topic and didn't find anything, so I'm probably wrong.
But I do know that on graphics cards you can load an array of data in one instruction.
@@EMLtheViewer The 6502 has increment and rotate instructions which act directly on memory, and I hate it. But ROR was too much? Or a second accumulator, B? And did you know that the 6800 only had one address register and could not do an efficient memcopy?
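For the curious, a rough C model of what such a memory-direct read-modify-write does internally (the cycle comments and flag layout are assumptions, not a cycle-accurate 6502):

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t ram[65536];
    static uint8_t flag_z, flag_n;   /* zero and negative flags */

    /* INC $addr: the ALU operand comes from memory and goes back to memory,
       with no programmer-visible register touched. */
    static void inc_abs(uint16_t addr) {
        uint8_t v = ram[addr];       /* read cycle  */
        v = (uint8_t)(v + 1);        /* ALU cycle   */
        ram[addr] = v;               /* write cycle */
        flag_z = (v == 0);
        flag_n = (v >> 7) & 1;
    }

    int main(void) {
        ram[0x2000] = 0x7F;
        inc_abs(0x2000);
        printf("ram=0x%02X Z=%u N=%u\n", ram[0x2000], flag_z, flag_n); /* 0x80 0 1 */
        return 0;
    }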
By far the Best Explanation Ever... You are just amazing
Great video as usual! I love seeing all the pieces come together, even if it's getting hard to remember all the lower levels that have been abstracted away xd. I would love to see a video on drivers and how connecting external devices works at a low level. As for the question at the end of the video, my guess (without doing further research) would be to connect the two data inputs of the ALU directly to memory and set it up so that its write-enable is turned on when an arithmetic instruction is performed.
Exactly as I thought, but I think the problem is storing the result
Answer 🎉: the registers are much faster because they are part of the CPU, while RAM is separate and can be accessed by different cores, which means it's not fast enough to work directly alongside the ALU. So the CPU loads data into the registers, then executes and processes the instruction.
However, there's another type of memory called cache, which is even faster than RAM and stores limited amounts of the data being processed.
I learned a lot from ur vids; not sure about the answer or how that is being done.
I knew about cache and registers before, but I had no understanding of how they work.
Big thanks ❤❤
Ok, but here's the thing: registers are fast, but you still have to load data into them from memory.
I have a better reason.
@@CoreDumpped well, that's all I have
@@mahdoosh1907 good for u
Did u win the Nobel Prize yet... no? Don't give a shit, u know
@@CoreDumpped In my code, a lot of variables are initialized to some literal value: bitfields and safety limits for for-loops. With non-portable code, which knows that we are dealing with exactly 8 bits, a lot more helper variables have known values at compile time.
My answer: because loading data into registers allows us to use simpler and shorter instructions due to simpler operand addressing (yet I am not so sure; I am a complete newbie and have just started learning). As for the second part of the question: yes, I think there might be a better way to do it (though I'm not completely sure it would really be better). My solution would be to use longer instructions (16-bit instead of 8-bit), which would allow longer addresses for the operands, so we could grant the ALU direct memory access and it could fetch the values via separate data buses from memory without using any registers. It's just my concept and understanding; I have no idea whether it is correct or not.
BTW, great videos; imo everything is very well explained. Great job! Thank you for your hard work!
What are separate data busses (plural)? Memory sits only on one bus.
Just perfect for enthusiasts and CS/CE students. Thank you!
Feels great to have an understanding of the components I’m learning about for my GCSE; this is really helpful! (If a bit beyond the spec of the course lol)
Truly excellent videos! 👍
I will point some of the people I am coaching in electronics and computer science to these videos. Thanks for your work.
Answer: You can use a carry in the adder, or a carry-save adder. It uses more components, but it is much faster than what was shown in the previous videos.
Use a carry-lookahead adder instead.
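For anyone curious what carry-lookahead actually buys you, here is a small C model of a 4-bit version. The point is that each carry is a flat AND/OR expression of generate/propagate terms, so nothing ripples; the bit packing is just for illustration.

    #include <stdio.h>

    static unsigned cla4(unsigned a, unsigned b, unsigned c0) {
        unsigned g[4], p[4], c[5];
        for (int i = 0; i < 4; i++) {
            g[i] = (a >> i & 1) & (b >> i & 1);  /* bit i generates a carry  */
            p[i] = (a >> i & 1) ^ (b >> i & 1);  /* bit i propagates a carry */
        }
        c[0] = c0;
        /* In hardware, all four of these are evaluated simultaneously: */
        c[1] = g[0] | (p[0] & c[0]);
        c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
        c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
             | (p[2] & p[1] & p[0] & c[0]);
        c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c[0]);
        unsigned sum = 0;
        for (int i = 0; i < 4; i++)
            sum |= (p[i] ^ c[i]) << i;
        return sum | c[4] << 4;   /* 5-bit result including carry-out */
    }

    int main(void) {
        printf("9 + 7 = %u\n", cla4(9, 7, 0));  /* prints 16 */
        return 0;
    }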
Good to know even for those who don't wanna build hardware.
Man, I don't know what kind of words would give you enough appreciation on this channel, but all I can say is: thank you for making these videos free, and thank you for educating people. You're the best.
Wow, incredible production quality. I really like your videos.
Yeah, it is the type of content we deserve 😀
Thank you for educating! Best video!
Greaaaaaaaat one. We need a video about the clock & clock cycle: what it is, why it was invented, and how it works.
How would you execute a program? You go through it line by line, like playing music or reading a text aloud word by word.
I really like the designed opcodes.
I also recommend the new Computerphile videos on out-of-order execution and branch prediction.
Also, compilers optimize code for the modern CPU, and many of the tricks show how a CPU is built; for example, unrolling a loop to do four steps at a time helps performance.
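A small C illustration of that unrolling trick (the function names are made up, and optimizing compilers often do this on their own):

    #include <stddef.h>
    #include <stdio.h>

    /* Rolled: one add and one branch test per element. */
    long sum_rolled(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by four: four independent partial sums give the CPU more
       instruction-level parallelism and a quarter of the branch tests. */
    long sum_unrolled(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)   /* leftover elements */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        long a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        printf("%ld %ld\n", sum_rolled(a, 10), sum_unrolled(a, 10)); /* 55 55 */
        return 0;
    }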
Thank you for the great videos
I would love to see a steampunk demake of an instruction scheduler. I like the instruction queue in the 8086. Now let's take the 4-bit ALU from the Z80. Then let's try to keep memory I/O, the ALU, and two-port register-file access all busy every cycle. No global flags, but RISC-V branch instructions.
thank you brother
Why load data into registers before using them for ALU operations?
Registers are the closest and fastest storage units within the CPU. They provide temporary storage for data that is about to be processed, or has just been processed, by the ALU.
Is there a better way to do this?
One possible alternative is placing memory closer to the processing unit to minimize data movement. However, this approach still requires a temporary location, such as registers, to perform operations efficiently.
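To picture that register round-trip end to end, here is a toy load/store machine in C (the opcodes and layout are invented for illustration):

    #include <stdint.h>
    #include <stdio.h>

    enum { LOAD, ADD, STORE, HALT };   /* hypothetical opcodes */

    int main(void) {
        uint8_t ram[16] = { [14] = 7, [15] = 5 };
        uint8_t r[2]    = { 0, 0 };    /* two registers sitting next to the ALU */
        uint8_t prog[][3] = {
            { LOAD,  0, 14 },          /* r0 <- ram[14]                          */
            { LOAD,  1, 15 },          /* r1 <- ram[15]                          */
            { ADD,   0,  1 },          /* r0 <- r0 + r1 (ALU touches registers)  */
            { STORE, 0, 13 },          /* ram[13] <- r0                          */
            { HALT,  0,  0 },
        };
        for (int pc = 0; prog[pc][0] != HALT; pc++) {
            uint8_t op = prog[pc][0], x = prog[pc][1], y = prog[pc][2];
            if (op == LOAD)  r[x] = ram[y];
            if (op == ADD)   r[x] = (uint8_t)(r[x] + r[y]);
            if (op == STORE) ram[y] = r[x];
        }
        printf("ram[13] = %u\n", ram[13]);   /* prints 12 */
        return 0;
    }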
The machine language (ISA) of the SAP-1 does not know about registers, yet the implementation needed at least 4. When do you call something a latch and when a register? There is an instruction "register", after all.
The best explanation of CPU fundamentals I have ever seen!!!
Ultimate and fine content. Please keep uploading videos like this that remove abstraction and make learning intuitive and logical ❤.
I don't have a CS degree, so this was very helpful to me. Thank you, dude.
Underrated creator. I hope you grow even more.
I wanted to know: what book or documentation explains these things so thoroughly? Even the most advanced books (that I know of) don't explain it like this. Could you please give some suggestions for books that explain architecture and how components are built/connected at such a low level?
This is great, and I want to know where you learned all of this stuff, both generally and specifically, because I would love to go into more detail myself.
My man, keep moving forward
This channel's few videos are worth more than all of my university's professors combined.
I'll be showing this to your professors
If you make a video on how control units are made, I would be appreciative. Even though you said it would be long, it will be helpful and unique for sure.
hey george, this video is awesome as always. btw, what software do you use to create the animations? thanks!
I'm going to share this with all my friends daammmmm
Can you organize these into a playlist? You mentioned twice that "we've seen before" something, but I don't know which video starts this course. I imagine it's the "HOW TRANSISTORS RUN CODE?" one.
Such an amazing channel with great descriptions of everything. Great work.