11:15 how the *_hell_* has no one mentioned this before?! That single fact is insanely huge and yet I've *_never_* heard anyone mention it. The constant lingering issue I've always worried about regarding RISC-V was the risk that people wouldn't really standardize and everyone would just spin off their own little custom projects without the whole thing ever moving forward, but integrated intercompatibility features like that are insanely good. Making a desktop/laptop CPU, for instance, you could make the chip, add an extension which you think makes things better, then write a micro-simulator for that instruction, upstream it to the Linux kernel, and boom: everyone using your chip gets high performance and everyone not using your chip doesn't get incompatibilities. (Yeah, MS/Apple are probably not going to have this apply to them, but they can be left in the dust trying to make everything themselves.)
This.
Probably it doesn't sell well in marketing papers ;) RISC-V suffers the same pain as 2007-2015 smartphone cameras: no one really marketed physically bigger sensors, no one emphasized the correlation between Carl Zeiss optics and actual image quality; all that mattered was megapexelz u dumb peasant, WE HAVE MOAR BUY OUR PRODUCT.
Same here: RV nowadays is presented as a "better, more energy-efficient ARM", and I myself, as an enthusiast of this technology (and owner of a few RV-based SBCs :P), didn't know until this presentation that the core is modular and the low-performance integer base is in fact the same CPU as the G extension, just without the bells and whistles. Probably this, and the fact that you can trap an unsupported instruction and handle the exception without halting and catching fire, is a selling point, but for chip manufacturers such as StarFive or CVitek. For us mere consumers it's CPU cores, NN cores and TDP watts ;) And presentations like these, of course :)
Mill can do this too: since genASM -> conASM is effectively software-controlled microcode generation, it can drop in emulation routines wherever needed at zero runtime cost. The downside is you have to rely on the Mill making it to market, whereas RISC-V is available today, and you lose the ability to migrate threads between heterogeneous processors. The upside is that the code can be inlined or a regular call instead of a hardware trap, and it theoretically allows mutually incompatible extensions to execute on the same hardware.
This was one of the beauties of the 68K.
this was such a good refresher on computer architecture.
That's what i was thinking too
Yes, I felt back in university too 😂
@4:38 "And that has a cost, not in dollars ..." YES in dollars. More instructions => more transistors => greater chip area => lower yields => higher cost. Lower yields both because the part is larger and because the chances of a defect in the part scale with the area.
Ah, he says something like that around @8:30.
Ending on a Godbolt reference. Epic!
This is really great, especially considering that the x86 microprocessor has become a kind of emulator of an x86 at this point. They should go back a few generations, since, unlike RISC-V, they lost their way at some point.
they did with Itanium. It was a disaster.
@@JurekOK Itanium simply wasn't a good architecture design: VLIW with funky add-ons to make it arbitrary-width in hardware.
Actually there are (or were) operating system extensions for some x86 instructions (floating point), but now that ALL x86 processors include floating point, this probably is no longer a thing.
AVX is something where there are optional extensions (e.g. AVX512 comes to mind)
Excellent talk!
This is what i bult in my senior year comp arch course
I like the idea of fusion - and the idea of x0 (r0) - the /dev/null register is my favourite - it has never failed and is the fastest :) - even faster than XOR.
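For anyone curious, a minimal sketch of how x0 shows up in everyday code (standard RV32I pseudo-instructions; not taken from the talk):

    nop                  # really addi x0, x0, 0 - a do-nothing write to x0
    mv   a0, a1          # really addi a0, a1, 0 - copy by adding zero
    li   a0, 42          # really addi a0, x0, 42 - constant = 0 + 42
    beqz a0, done        # really beq a0, x0, done - compare against the zero register
done:
    ret

One hardwired zero register replaces a whole family of special-case instructions.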
For me its enough that its supposed to be the 'Linux of processors', but the insights to how it works makes it better.
If you look at it, the humble 6502 was the first RISC CPU. It had competition (6800, Z80, 6809, 8008/8080), but its minimized register count and minimized instruction set made it very simple to implement efficiently, and it could do everything that any other 8-bit CPU could do, and often just as quickly.
Neither the 6502 nor any of the other common 8-bit microprocessors were either RISC or CISC. They were more like a minimal ISA -- they were struggling to get something sensible and usable onto a single chip at all. The 6502's indirect addressing modes absolutely disqualify it from being RISC, as they require the CPU to access memory at two unrelated addresses sequentially in the same instruction -- the first to fetch the two bytes of a pointer, and the second to access memory relative to that pointer (adding Y to it for the indirect-Y mode). The first CPU where simplicity was a goal, not a limitation, and that followed modern RISC principles, is Seymour Cray's CDC 6600 from 1964 (as well as his own Cray-1 a decade later).
I was a Customer Engineer for IBM, and the term MICROCODE was used regularly regarding the System/32 and /34. I must have asked a couple of dozen people and no one could explain it.
Nice explanation of RISC-V and CISC.
36:00
Micro operations as in micro-manage rather than micro-meter.
When you micro-manage you use more words than regular management.
I didn't know that the RISC-V instruction set was so much smaller than for ARM. RISC-V being much leaner than ARM I guess could mean for example many more CPU cores on the same die.
One would think so, but then remember all the caveats with compression, fusion, etc.? They start with a small instruction set, and then, duh, it doesn't look so great compared to the competitors, so let's add this, and that, and in the end I wonder: where's the genius? Why is this supposed to be better than ARM or Intel?
Apple is looking for RISC-V experts. Check their site.
@@Teluric2 I would love to see them make the jump to RISC-V, now that they're heavily investing in doing their own chips. At Apple's scale, the savings on the licensing would be quite significant, and other companies tend to follow Apple's lead when they make a technology jump.
I gather Google wants to get Android on RISC-V and leave ARM behind, too.
@@fakecubed Google already formally supports RV on Android. They're not too invested in the hardware though, they don't make the majority of the phones.
And I think the M1 optimisations they performed won't carry over to RV that well. Some will, but certainly not all. They would need to start over, and the development cost they put into the M1 dwarfs the licensing fees they hand over to Arm. Google doesn't have this stake: they license their cores from ARM (Apple doesn't, they design their own), so when it's time to license a new core, taking a RISC-V one is a lot less friction for Google than it would be for Apple to develop another design from scratch.
And on the software side, Linux already works on RISC-V. Most software will simply recompile just fine with another ISA as its target. The only issue is the bootloader stuff, but since vendors like Qualcomm keep this issue around anyway by not releasing documentation or source code for that, you're no worse off.
Because of its modular microarchitecture.
It's not so much about fast vs. slow. The one kind of core is more for general-purpose decision-making code. The others are specialized more for processing a data stream.
I built an RV32I core in a game called Logic World. It's my most recent upload to date.
I had trouble with that, it's not exactly an easy thing to do...
Do you have a link to that? Sounds really cool!
@@erikengheim1106 his profile, interesting
Great talk, thank you
Pleased to hear that you enjoyed it!
Nice explanation! Thanks!
Glad it was helpful!
Yes Bjorn is a very good presenter
I did not get the macro fusion part. Where does that happen?
If it happens inside the CPU then how does that help code density?
It doesn't, only compressed instructions do. But it allows chips to put together common operations into one micro-operation, e.g. if their ALU pipeline always includes a shift for both operands, then a shift followed by an add with shared operand can be fused by that specific core. So a speed optimisation, not a code size one.
Macro fusion doesn't improve code density, but performance. But remember, in the slides I talk about BOTH instruction compression and macro-op fusion. You can do both. The clever choice of RISC-V is to have, say, a 32-bit word contain two compressed instructions which then get fused into a single instruction upon execution. So why not just have a single 32-bit instruction instead? Because this is a much more flexible approach. You can make a simple chip that performs no fusion and thus takes fewer transistors to make. Or you can choose not to support compression. Thus you avoid eating up encoding space just to have a special fast 32-bit instruction. Instead you leave it to the implementers of the chips to decide if they want to support compression or fusion.
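A tiny sketch of that flexibility (which instruction pairs a given core fuses, if any, is an implementation choice; this pair is just a plausible example):

    # Both of these have 16-bit encodings in the C extension (c.slli, c.add),
    # so the pair fits in a single 32-bit fetch word:
    slli a1, a1, 2       # scale an index by 4
    add  a0, a0, a1      # add it to a base pointer

A minimal core decodes and runs them as two plain instructions; a fancier core may fuse them into one shift-and-add micro-op. Same binary either way.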
Nothing new or brilliant, but a solid design that allows scalability and customizations through modular extensions.
It's as if the designers of RISC-V did a very good refactor of an ISA.
The only advantage is the standardization and simplification that removes all the (in the long term) failed ideas. But they are really hard to work with, and are generally inconvenient/unpleasant for small contributors. It is not an open-source project at its heart, rather a "standardization body business".
Substract :-) Good talk, thank you!
The analogy with two sports cars and many trucks would make so much more sense if it were two semi trucks vs. many small trucks. The semi is optimal when it is fully loaded, going from A to B. But in real-life computing they are rarely fully occupied. The small trucks will have much better utilization.
Ha! He says don't think of it like Linux, because libre open source is not the interesting part. Then he immediately describes the same modular benefits that give Linux its advantages. Either looking at the kernel with loadable modules and customizable compile options, or looking at a full GNU OS with mix-and-match utilities. All only viable because they are built on libre IP.
At what point are reduced instructions not competitive (theoretically)? Is this answer dependent on the nature of problem/application? If so, can we have a modular instruction "frontend" that can be configured on a core upon application startup - similar to an FPGA?
At 05:30 ... isn't that comparing apples and oranges? The full 32-bit x86 (very usable, with all features, including floating point and SIMD) against RV32I (which is integer only, AFAIK)?
Not really. It is a comparison of what you must minimally support. You cannot build an x86 CPU that runs current software without supporting around 1300 instructions. However, you can target a RISC-V processor with only 40 instructions. You can build e.g. an embedded device with just 40 instructions which works just fine. No such option exists for Arm or x86. There is no well-defined 40-instruction subset for either of those processors which you could target and which software and tools would work with.
@@erikengheim1106 I mean, technically you could target the original 8086 opcodes that are still backwards compatible... for reasons.
Well explained, thank you!
Just a small question: at 26:31, why are t0-t6 split into t0-t2 and t3-t6, so far apart? Are t0-t2 used more than t3-t6 in practice?
It's because of the compressed mode. In the C extension (16-bit instructions), many instructions can only encode 8 of the 32 general-purpose registers (x8-x15), so the standard ABI names the GP registers in banks arranged around that subset. Keep in mind you can use the GP registers however you like, and think of them as x0 through x31 if you'd like.
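To make that concrete: the 3-bit register fields used by most compressed loads/stores can only name x8-x15 (s0, s1 and a0-a5), so for example:

    lw a0, 0(a1)         # a0=x10, a1=x11 are both in x8-x15: assembler can emit c.lw
    lw t3, 0(t4)         # t3=x28, t4=x29 are outside x8-x15: needs the full 32-bit lw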
WCH has RISC-V MCUs from 10 cents up to $3, some with 480 Mbps USB and 100 Mb Ethernet.
Something about static rules of the system that will always be there...
The emulator would always be limited in outputs for exploits, at least then.
He was asked to put in signal limitations and how factors affect this, but he didn't like that idea and said it would be tedious and it is already in.
"get a scrub to do it..someone who can burn time.""i will have to burn through physical chips to do it and even then i would be better at working on ai or exploits."
"[or] a better unit."
An ISA isn't enough to get performance; you need caches on several levels, vector processing, branch prediction, etc. ARM has an advantage there now, but chips from China will probably also have that in a few years.
Why can't we execute ADD and SUB instructions in one clock cycle?
The Return Address register is a big surprise! I have not seen that since machines designed in the '60s and '70s. It does mean that subroutines must first disable interrupts and save it to the stack before doing any work. Fast, but is it really practical?
Huh? The interrupt-handling point is not relevant; there are many ways to handle that, and they are HW-specific. There is no automatic requirement for all user-mode subroutines to disable interrupts; that is you assuming what you know applies here.
Now the great thing with the register RA is that the subroutine gets to decide if, when and how to save RA, on a per-subroutine optimization basis. The code examples in this video show small subroutines that never save RA, as they are pure functions not crossing any ABI boundary in their implementation. Using a register puts a little less pressure on the memory subsystem.
Actually, ARM does it the same way: the "branch and link" instruction copies the return address to LR (link register). To nest subroutines, you use another register as a stack pointer and one or more instructions to push/pop the return register. On ARM, there are several interrupt priorities, and each has its own LR and SP. An interrupt at any given priority locks out all but higher-priority interrupts until or unless your handler pushes the necessary registers and re-enables them.
Not at all. Interrupts don’t need to use that register. It’s an open architecture and such details can be chosen by implementation. Either the interrupt pushes that register onto the stack automatically, or it uses another dedicated register for its return address. You can also mix those: some interrupts can use a dedicated return address register - for speed, some can use stack. Also… LR - link register - is not so old fashioned at all! Many architectures from the 90s till now use it, and every single smartphone uses many of those registers :)
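A minimal sketch of the per-subroutine choice described in the replies above (plain RV32I and the standard calling convention; not code from the talk):

# Leaf function: makes no calls, so ra is never written to memory.
double:
    add  a0, a0, a0      # a0 = 2*a0
    ret                  # i.e. jalr x0, 0(ra): jump back through ra

# Non-leaf function: saves ra only because its own calls overwrite ra.
quadruple:
    addi sp, sp, -16
    sw   ra, 12(sp)      # spill the return address
    call double          # call = jal ra, double: ra is clobbered here
    call double
    lw   ra, 12(sp)      # restore the caller's return address
    addi sp, sp, 16
    ret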
that can make executing Lisp code so much faster
It's also practical for code that uses continuations, async code basically, or anything that uses tail-call optimization.
Just curious why @46:00 you didn't write RISC-V code as:
SLLI a1, a1, 2
LW a0, a0, a1
RET
Because loads and stores in RISC-V only support base + 12-bit signed immediate offset. The offset cannot be another register, so the addition is necessary.
There is no LW a0, a0, a1 instruction. The third operand cannot be a register; it is a 12-bit immediate value. RISC-V has LB, LH, LW, LBU and LHU load instructions to deal with different bit-lengths. One can imagine that adding variants with a register as the third operand would require adding 5 more instructions to the ISA. That isn't really needed: in a loop, for example, you don't have to redo all these operations; a simple add to increment the pointer is enough.
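For reference, a sketch of how that indexed load has to be written, with the register-register add made explicit (the 0 in the load is the 12-bit immediate offset):

# a0 = base of a word array, a1 = index; returns array[index] in a0
get_word:
    slli a1, a1, 2       # byte offset = index * 4
    add  a0, a0, a1      # compute the full address into a register
    lw   a0, 0(a0)       # immediate offset only
    ret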
MIPS but with extra suck factor.
A 68k with an orthogonal instruction set and lots of internal tricks and optimizations to make it go fast would be genius.
can I get the presentation slide?
Die cost doesn't go up with the square of the area. The number of chips per wafer is directly proportional to area, and net cost is slightly more than linear due to yield, but not the square for sure.
The reason die cost goes up by the square is because of defects. If you have 20 defects on a wafer and 100 dies then 20% or fewer of the chips will have defects. If you have 25 dies per wafer then you have maybe 5 or 6 without defects.
@@phookadude There is also the effect of incomplete chips at the wafer edge. The smaller the chip, the less edge silicon is wasted. Unless you want to pack a large chip wafer with small chips at the edge. But then, a design with small chips using wasted edge silicon would accelerate cost savings for smaller chips.
Also, a manufacturer could have a marginal production line which has a high areal defect rate. Small chips could be fabbed on such a line with moderate yield, whereas large chips would yield far too low to be economical.
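A quick worked example of "more than linear, less than quadratic", using the simple Poisson yield model Y = e^(-D*A), where D is defect density and A is die area (the numbers are made up purely for illustration). With D = 0.2 defects/cm²: a 1 cm² die yields e^(-0.2) ≈ 0.82, so cost per good die is proportional to 1/0.82 ≈ 1.22; a 2 cm² die yields e^(-0.4) ≈ 0.67, so cost per good die is proportional to 2/0.67 ≈ 2.98. Doubling the area here costs about 2.4x per good die - well above 2x, but nowhere near 4x - and edge waste shifts the ratio a bit further.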
You can buy CH32V003 RISC-V RV32E-core MCUs for 10 or 15 cents (depending on config) in small quantities, with built-in program flash and RAM. Used for embedded devices: remote controls, motor controllers, light controllers, maybe calculators, DECT handsets, energy monitors, oven and washing machine controllers. Small instruction set = small silicon = cheap chip. These 32-bit chips are price-competitive with the STM8 8-bit MCUs. By comparison, the cheapest ARM embedded microcontrollers (STM32 Cortex-M0 value line) cost about 60 cents. It's not quite an apples-to-apples comparison, but you could say maybe about half price for the lowest-end devices, comparing Cortex-M0 with RV32E.
Exactly. Unless you make really big dies, the yield is a lesser factor compared to the overhead for saw kerf, seal ring and I/Os, pads or any peripherals. In this regard, twice the core area is actually less than twice the cost per finished die.
great presentation!
Glad you liked it!
What's the percentage of an i3-12100 that is filled with instructions? 0.01%?
I wonder if there is a joke, that the x0-register is the correct way to get rid of data.
If you leave data in memory for too long it will rot, get moldy, or even evolve to sentient lifeforms. So you do regularly an automatic garbage collection, which will send it to x0, where it is disposed correctly!
@erikengheim1106 In higher-level languages like C/C++ you need to write the code in a certain way so the compiler can optimize it efficiently. Now I see the same applies to assembly... It's an odd feeling when you find out that ASM is not actually the lowest level and it gets compiled (compression + fusion) into something lower (although in the same domain).
What will be next? Writing single bits into FPGA?
Great insight !
Machine code programming is a thing.
Assembly is not machine code. Never has been.
When will a RISC-V laptop be a reality?
It is now
Where slides?
Sorry, the speaker did not upload the slides for this presentation.
Why does RISC-V have no flags (e.g. an overflow flag)?
In short, to reduce hardware complexity for high-performance implementations.
RISC-V doesn't have any predicated instructions (execute instruction only if flag is set) for the sake of reducing complexity, as this requires to keep track of extra state in the processor (the flags). Furthermore, it would introduce more instructions to manipulate (read, set, clear) the flags, and use up instruction encoding space to encode under which flag conditions instructions should execute or not.
Predication becomes quite complex in hardware if you want to execute instructions out of order or speculatively, as you would need to track and snapshot the flags as well as the destination register to roll things back in case the predication is false. Instead, RISC-V uses explicit branch instructions that operate on registers or immediates (e.g., branch if register a is not equal to register b), which does not add extra complexity in high-performance architectures beyond the dependency tracking you'd already need for any other instruction.
One explanation I think is great is looking at it from the perspective of functional programming and concurrency. For instance, in regular programming, if you have functions manipulating global state then it is really hard to run those functions in parallel without getting race conditions. Ideally you want pure functions which only operate on their inputs. That allows you to run things in parallel. It is the same at the microprocessor level. A high-performance superscalar processor runs numerous instructions in parallel. You make the job of implementing that a lot easier by not having global state manipulated by each instruction. You could say the general-purpose registers are global, but there are many of those, and thus there is ample opportunity to avoid conflict. However, there is in principle only one status register.
Because things like flags are state that things like interrupts break, unless you turn them into registers under the hood, which can make pipelining messier. Just testing whether a subtraction's result register is zero gives you the same information with more simplicity, for example. And you can make that any register, or have the results from several saved at a time for later steps... it's just good design.
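A small sketch of what "no flags" looks like in practice (plain RV32I; the carry idiom is a common pattern, not something from the talk):

# Is a0 < a1 (signed)? The comparison is folded into the branch; no condition
# codes anywhere. (slt a0, a0, a1 does this in one instruction if you just
# want the result as a value.)
is_less:
    blt  a0, a1, 1f      # branch if a0 < a1
    li   a0, 0
    ret
1:  li   a0, 1
    ret

# Need a carry? It is just another register value:
add_with_carry:
    add  t0, a0, a1      # t0 = a0 + a1 (mod 2^32)
    sltu a1, t0, a0      # a1 = carry out: the sum wrapped iff it is < an operand
    mv   a0, t0
    ret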
The ONE flaw with RISC-V is that the ENTIRE set of documentation available online is almost COMPLETELY useless for one purpose or another. You want to see the available opcodes and their function and syntax? You go to that document over there, and that one over there, and that one over there. To see the opcode encodings you need to see these ones over here, that one over there and... the ENTIRE bullshit is scatterbrained across so many documents that it's one huge obfuscated PAIN IN THE ASS.
Look at ARM's documentation. They give mnemonics, timing, syntax, behavior AND ENCODING all in one PDF for EVERY... SINGLE... OPCODE!!!!!!!!!!!!!!!!!!
Only seen the first 6 minutes yet, but a modular instruction set... doesn't that mean that software written for one RISC-V might not work because your CPU might not have those instructions? Seems a bit like going backwards?
And that part about extensions sounds like a real nightmare, actually. So you could basically get software simulation for a lot of instructions, and your CPU will feel really slow.
About 16 min in. I think we use GPUs for AI because they are fast at matrix multiplication, but I am no expert.
Same way the x86 world still ticks: multi-versioning your compile targets.
Instead of floats, how about posits on RISC-V, or the unum number format, from John Gustafson (NUS)? Better arithmetic with fewer bytes, fewer errors, far fewer NaNs, and fewer overflows!
The only problem right now for assembly programmers is that there are no good books about "real" programming on RISC-V (useful stuff: embedded, desktop or gaming). It is mostly just instruction listings with some really basic examples. Of course, a C or C++ programmer doesn't need to know anything about RISC-V, because the compiler, assembler and linker will do their magic.
RISC-V only exists on paper as an instruction set. That's why there are no books, good or bad, about programming RISC-V CPUs.
@@ferrellsl Looks like you forgot SiFive and other companies... Does the ESP32-C3 with its RISC-V processor not exist, for example?
@@ferrellsl Does there need to be if you're not using assembly?
Recently I used a RISC-V microprocessor and it was like any other new ARM chip I've used before (minus having to use another toolchain) - new headers, registers, documentation, but otherwise normal C. One flaw of my specific core: there's a new possibility of a hard fault if you read, for example, through a uint32_t pointer at an unaligned address; apparently that is not supported. I had to redo some parsing code, but other than that it was the same.
Not that I ever read a C/C++ or any other book, maybe people want a RISC-V specific one for C, I for sure don't see the point.
@@ferrellsl th-cam.com/video/izOpUfU4_FE/w-d-xo.html "RISC-V delivered over 7 billion chips ... That ain't nothing!"
I suspect many people are underestimating RISC-V. Including me; I've been making the assumption that it will not catch up with ARM for 5 to 10 years, at least, but perhaps developments such as the Esperanto ET-SoC-1 (1088 cores in 20 watts) might convince me otherwise! Most modern low-level languages (e.g. Rust or Zig) support cross-compilation, and many languages that use LLVM can easily cross-compile to new targets, so software migration could be fast, once a certain critical mass of applications has been reached, to where hardware vendors and OS vendors consider it worthwhile dedicating leading-edge resources (hardware designers and foundry capacity at 7nm or below) to it. Apple's migration to ARM took me by surprise; maybe they'll target RISC-V within 3 to 5 years. Although I doubt license fees on ARM are enough justification; at 1% to 2% of a processor's selling price, it does not seem that substantial, especially since Apple already has a long-term license to design their own ARM processors. But perhaps the power-saving and additional flexibility advantages could be seen as compelling.
@@vocassen Unaligned memory access is an optional feature of a specific hardware implementation. It looks like the CPU you used did not support it, but other RV CPUs might choose to trap and emulate unaligned access in software, or implement unaligned access in the CPU memory interface. The spec allows all these options and chooses the baseline minimum to be the simplest, so as not to force the extra transistors upon all implementations.
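For what it's worth, the usual portable workaround on cores without unaligned support is byte loads (a sketch in RV32I, assuming little-endian data):

# Load a 32-bit little-endian word from a possibly unaligned address in a0.
load_unaligned:
    lbu  t0, 0(a0)       # single-byte loads are always aligned
    lbu  t1, 1(a0)
    lbu  t2, 2(a0)
    lbu  t3, 3(a0)
    slli t1, t1, 8
    slli t2, t2, 16
    slli t3, t3, 24
    or   t0, t0, t1
    or   t0, t0, t2
    or   a0, t0, t3      # assembled word returned in a0
    ret

In C, the same thing is usually spelled as a memcpy into a local variable, which compilers turn into essentially this sequence on such cores.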
Basically the perfect Internet of Garbage processor ;-)
an Itanium deja vu.
It will really only stand out when it reaches 128-bit architecture. It could be a trendsetter and PCIe coprocessor rather than a hobbyist commodity.
With the new ESP32C3 and C6, RISC V is used in millions of cheap, connected devices.
RISCV is indeed an interesting processor and topic, but I feel this presentation was too disorganised and messy and didn't do the topic justice. There are much better introductions elsewhere.
Sorry you feel that way. I cannot really improve my style of presentation or talk without more specific advice or criticism. Lots of people are great at presenting. Especially the ones who made RISC-V hence I see no reason to attempt to duplicate their effort. I have to pick my own angle and perspective. I prefer to look at a broad perspective, as people rarely do that.
His presentation isn't instilling confidence that he is a subject matter expert
The HiFive Unmatched is unobtainium.
RISC-V is not genius. It's a polished evolution of 40 years but it's weird that in all that time, Patterson & co. haven't brought any new compelling feature or paradigm. It's all the same old pipeline of the 80's, the instruction words are quite the same except these two bits they saved for extensions.
"substract"
RISC-V's performance relies a lot on the assembler/compiler and software toolchain. That is the brilliance of RISC-V: relying more on the tools and less on the programmers (pseudo-instructions and linker optimizations). The sad part is that RISC-V will eventually be used mostly to run bloated stuff in bloated web browsers. Just like Unicode and emojis opened the door to useless information and bloat, the web is the reason why you need to update your hardware every 5 years. Because real programmers are an extinct species, the industry uses unqualified and cheaper web scripters to do stuff, requiring more and more layers to execute their lazy scripts.
What performance are you talking about? RISC-V only exists on paper. There is no RISC-V hardware, not even for testing.
@@ferrellsl Not true
And you're basing this all on...?
literal schizo std::string user. there's nothing wrong with unicode, 🖕.
Tell that to my 12 year old desktop PC handling my web browsing needs just fine. Or to my parent's 16 year old PC still doing that just fine. Of course, power supplies have been swapped over the years, HDs gave way to SSDs, but the motherboard, CPU and RAM are all really old and that hasn't made those computers useless yet.
I don't think there's anything about RISC-V that require way better compilers than x86 or arm, you must be thinking about VLIW designs and variants like the Itanium, which are fine for specialized applications, but are terrible for general purpose stuff. That stuff is mostly dead and gone now.
Both x86 and ARM started out as low-power, low-performance stuff; it's only natural that RISC-V will see its first implementations in the embedded space and gradually work its way up to the higher performance and performance/watt space of mobile computing and servers.
I like the CONCEPT of RISC, and compared to the behemoth that is the x86 instruction set, RISC-V is not terrible, but it could be better. First, it needs to be set in stone, and not extensible. Second, I have coded professionally in 8080 assembly, and even those relatively few commands were more than most of us needed. The BASE RISC-V set is 47 instructions, but if you include the variations (RV32E, RV32M, RV32A, RV32F, RV32D, RV32C, etc.), the total number of instructions in the RISC-V ISA can be over 300. What the hell is reduced about 300 instructions? You don't need an instruction for every conceivable function. That's what subroutines are for. I designed an ISA with 16 instructions. Just the fundamentals: ADD, AND, NOT, OR, SHR, SUB, XOR, LDA, STA, RDM, PSH, POP, JC, JN, JV, JZ. You can write anything you want with those. I know, because I have. When your ISA is that small, you don't even need microcode. You can decode with pure logic, mostly in one machine cycle. With only 16 commands to learn, this would be great for beginners. All good wishes.
Could you not choose to just use the base RISC-V set, though?
Of course you can write anything with a few instructions, but that is not the point. The point is to squeeze more performance out of the silicon. An ADD may add a value in memory to a value in a register, but that doesn't give you SIMD and fancy stuff like a cross product or vector product in hardware. Of course you could emulate this in software with your few basic instructions, but it will take a lot of steps, and that is slow.
Why do you even need 16 instructions? You only need one of a select few instructions that, when manipulated and chained in just the right way, can do the same thing.
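To make the "emulate it in software, but slowly" point above concrete, here is a sketch of multiplication written in base RV32I only - what you'd do without the M extension's single mul instruction (classic shift-and-add, up to 32 loop iterations versus one instruction):

# a0 = a0 * a1 (low 32 bits), base integer instructions only.
mul32:
    li   t0, 0           # running sum
1:  andi t1, a1, 1       # low bit of the multiplier set?
    beqz t1, 2f
    add  t0, t0, a0      # yes: accumulate the shifted multiplicand
2:  slli a0, a0, 1       # multiplicand <<= 1
    srli a1, a1, 1       # multiplier >>= 1
    bnez a1, 1b          # loop while multiplier bits remain
    mv   a0, t0
    ret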
🤎💜🧡
RISC-V has screwed up instruction compression in a very spectacular way, wasting opcodes on non-orthogonal floating-point instructions - absolutely obsolete in most chips where it really matters (embedded), and non-critical in the other (serious code uses vector extensions anyway).
It doesn't have address modes that are critical for code density and performance on low-spec cores: post/pre-increment.
Even adhering to a strict 2R1W instruction design, it could have had stores with them.
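Concretely, the complaint is about sequences like this (a sketch; RISC-V spells the post-increment out as a second instruction, though such pairs are classic macro-fusion candidates):

walk:
    lw   t0, 0(a0)       # load *p
    addi a0, a0, 4       # p++, one extra instruction per element

An ISA with a post-increment load mode does both in one instruction.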
RISC-V is a bad architecture:
1. No guaranteed unaligned accesses, which are needed for I/O. E.g., every database server lays out its rows inside blocks mostly unaligned.
2. No predicated instructions because there are no CPU-flags.
3. No FPU traps, just status flags which you can probe.
There is nothing "genius" about the RISC architecture because the first microprocessor was a 4 bit RISC! (CISC came after RISC)
It's a bit of a stretch to say anything as small as a 4 bit microprocessor is RISC or CISC.
CISC as an architectural approach was certainly around before RISC. It was there in the large and ever-growing instruction sets of IBM mainframes and others, which is when it was noticed that the RISC approach could be an efficient alternative way to use all those transistors. That was a little genius right there.
Until there are real engineering samples of RISC-V CPUs that can be purchased and tested, this is all just smoke, mirrors and speculation.
There are some boards actually
RISC-V SoCs exist and you can purchase them right now.
There are boards available right now.
And it wouldn't be "smoke, mirrors and speculation" anyway. These are tested architectures based on principles that have been well known for more than ten years (>30 years counting earlier RISC architectures).
Eh... what do you mean? You can buy plenty, and they are already widely used. Lookup the SiFive company. They sell many different boards and designs.
ESP32-C3 (RV32IMC)... the coolest things really are the smallest chips.
Man, people really hate Linux. Lol. I don't get it. I mean, the first functioning desktop operating systems developed for RISC-V are Linux distros. But, whatever, go off on how terrible it is to have RISC-V associated with Linux.
Missing the point much huh
He was just complaining that it was a bad comparison.
The buffet analogy is NOT true of ARM, please stop lying!!!!!!!!!!!!!!!!!!!!!!
He's from the era when RISC was the panacea. Those days are over. CISC is king now. PowerPC was RISC's final shot, and they failed. Game Over, Man.
ARM is RISC.