Can the N64 do operations directly on the cache buckets? I would assume it would still have to load the data to a register connected to the ALU, so they’re more like extra _SUPER_-volatile ram addresses that you then have to “flush” (i.e. bring the original data back from RAM over the Ram-Bus so that you don’t overwrite it) I imagine that the trick would be getting as much use out of the cache buckets as you can before needing to reset them back to their original data, or perhaps even invalidate that section of RAM altogether and pretend that the cache _is_ the RAM until the data in it has to be accessed by something other than the CPU.
The SNES CPU also had some memory-mapped bytes on the CPU die ($43x0..$43xB for x=0..7, so 12*8=96 bytes) but sadly they were used mostly just as a place to store DMA/HDMA parameters. Afaik only 1 game used that area as a fast cache for instructions: Another World (SNES port by Rebecca Heineman).
There are a few NES games that use undocumented instructions. On the CHIP-8 (technically a fantasy console) a few undocumented instructions got used so much that they became official
@@VlaDexa_MAX all procs have undocumented instructions...though modern ones can have them "disabled" via the Instruction decoder being set to convert their opcodes to NOPs in the end-user versions.
Essentially you're using dynamic ranges of cache as a sort of register-window; bravo! I've not seen this sort of cache-line optimization talk outside of Linux kernel specific talks before. Excellent!
Yeah, this streaming out of sub-16 byte data packages looked very register windows. The Jaguar has a (buggy) helper registers to let the GPU assemble 32:32 bits to write out in one go as 64 bit.
lots of talk about this kind of optimization going on right now in Dreamcast-land with the community port of GTA3. Currently none of it is implemented as everyone hashes out the detail with profiling to see exactly the best way to attack the problem, with the added complexity that both vertex transformation and vertex submission can *_potentially_* thrash cache depending on how it's done.
Same. I don't even understand half of it, but hearing someone go in depth on their niche interest without being boring is magical when it's clear they have taken their nerdiness to expert level.
2:09 As someone who did maths for their undergrad, I can confirm, I have absolutely no memory (it's kinda why generality and derivations from first principles appeal in the first place).
Kaze : "Alright, full disclosure : i am not using quantum physics in my mario 64 mod-" Also Kaze : "-YET" At this rate we'll have ray tracing in RtYI by the time it releases.
@@LokiScarletWasHere Raytraced Yoshi's Island 64, coming to a nintendo 64 near you in 2025. Actually, reminds me of that one guy who made a Ray-Tracing chip for the super nintendo.
Nintendo: *releases N64 specs & development docs* SGI: look how they massacred my boy Edit: Tbf, this is basically software engineering in a nutshell. Hardware folks come up with some rocket science bullshit to squeeze extra perf out of the silicon, and the software people waste all of that work by having compilers ignore modern special-purpose instructions for the sake of backwards compatibility, and putting the entire program behind all the polymorphism, virtual functions, dependency injections, virtual machines & interpreters, and God knows how many other abstractions and obfuscations. Despite the different nature of software optimization then vs. now, it boils down to a similar amount of fundamentally misunderstanding how the hardware actually functions that led to most of the N64 library having lackluster performance. Modern apps are written like a labyrinth, and the CPU is given the unreasonable task of translating the map from a foreign language and solving the labyrinth as quickly as possible. This is often why modern software is ~1000x slower than it could be.
Well, yes and no. Doing both hardware and software myself (even if at a way simpler level than a CPU), I agree with you that some very powerful hardware possibilities go unused. On the other hand, take into account that the docs (for the N64, but also on a lot of projects I worked on) are not as simple or readable as you might expect, and the software side doesn't have a lot of time to learn the hardware and code for it. That's why, most of the time, backwards compatibility is a thing: you can reuse old building blocks to save a bit of time. And I agree with you that modern frameworks are just an unstable pile of horrendous things (that you can't even modify easily). But just try telling people that the game will run faster, and on old hardware, if you rewrite the engine from scratch with Raylib instead of using Unity...
> Modern apps are written like a labyrinth, and the CPU is given the unreasonable task of translating the map from a foreign language and solving the labyrinth as quickly as possible. Compilers can do a lot of heavy lifting there.
@@canaconn2388 and not spoken like "Today, we, are going, to, discuss, a groundbreaking, piece, of techonogical, development.. so we, will get, to see, and amazing, hard to believe, sight... So hold on to your papers"
I have to wonder if all his research into N64 hardware will indirectly help FPGA N64 projects improve and act truly like the real hardware, or at least a bit more like it. If you look at the firmware update history of Analogue products you can see they are frequently updating the cores to address inaccuracies in certain games, even very popular games. So FPGA may never truly be 100% accurate for 100% of games. So Kaze's intense research into how N64 hardware actually works and how it is actually used, and how it could be used, is probably important for achieving that goal, or at the very least for putting the spotlight on N64 hardware when people inevitably try to run all these things on their Analogue 3Ds and such.
Man I love the visuals in this one. It's been great learning something new every time. Few of the concepts here I don't think I would have understood without the little graphics.
Ah! A direct mapped cache! The Sega Dreamcast has a similar cache setup. I've got a good scheme created to maximize direct mapped cache by using absolute addressing in gcc with an ld script to create striped zones. Separate the direct mapped cache into 4 zones, each separated by the width of the cache spacing, to ensure writes to buffers don't overwrite the previous line. On the Dreamcast, you can also enable OCRAM mode, which halves the cache into a scratch pad for fast math. This is actually optimal, because the physical layout of the Dreamcast's memory is (for sake of brevity ignoring the 64-bit dual RAM setup) 2 RAM "chips" with 2 banks inside, each bank made up of 2048 rows of memory "cells," each cell being a cacheline in size. Each bank has a mechanism inside to read a bank called a sense amplifier. To read a cell, a sense amplifier must be attached to the row, so if you read a row outside of the boundary, it incurs a performance penalty as the sense amplifier must detach, move to the appropriate row, and reattach. If you operate in OCRAM mode, the sizing of the remaining cache is *juuuust* right to fit 4 rows at once if you stripe your memory without sense amplifier penalty. It sounds like the DC and N64 actually share quite a bit in common memory wise. A really cool feature of the Dreamcast memory map is the entire memory is mirrored to an alternate address which skips cache when read, as well. So you can actually store things in memory and call them using an alternate address without thrashing your data cache. The Dreamcast also naturally has prefetch and invalidate instructions, which when combined with absolute addressing and OCRAM mode, gives you quite a bit of granularity in how you control your cache. EDIT - Question: Does the N64 offer any sort of degree of instruction parallelization?
The Dreamcast uses a 5-stage Harvard architecture for instruction fetch, which allows parallelization when basically using any instruction from alternate groups providing they aren't a move opcode. Anything like that exist on the N64? EDIT AGAIN: Welp, looked a little further and it turns out this is actually a part of the MIPS name, lol. "Microprocessor without Interlocked Pipelined Stages." Very, very, verrry cool. The architecture of the DC and N64 are very similar!
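To make the "mirrored to an alternate address which skips cache" part concrete, here is a minimal sketch of the address arithmetic. It assumes the usual layout as I understand it, where the cached window starts at 0x80000000 and the uncached mirror at 0xA0000000 (P1/P2 on the SH-4, KSEG0/KSEG1 on MIPS), both covering the same low 512 MB of physical memory. The function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed segment bases: cached window and its uncached mirror. */
#define SEG_CACHED   0x80000000u
#define SEG_UNCACHED 0xA0000000u
#define PHYS_MASK    0x1FFFFFFFu  /* low 29 bits = physical address */

/* Strip the segment bits to get the physical address. */
uint32_t to_physical(uint32_t vaddr)
{
    return vaddr & PHYS_MASK;
}

/* Same memory, but accesses through this alias bypass the cache. */
uint32_t to_uncached(uint32_t vaddr)
{
    return to_physical(vaddr) | SEG_UNCACHED;
}

/* Same memory, accessed through the cache. */
uint32_t to_cached(uint32_t vaddr)
{
    return to_physical(vaddr) | SEG_CACHED;
}
```

So a buffer at 0x8C010000 in Dreamcast main RAM would be reachable without touching the data cache at 0xAC010000. On real hardware you would still have to write back or invalidate any cached copy first, which this sketch does not model.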
This video is so complicated that I am almost relieved that the Atari Jaguar only has scratchpad RAM for code and a Matrix and a ton of registers for the data.
@@ArneChristianRosenfeldt Oh man I've done Jaguar programming with my Skunkboard. I consider Dreamcast development way, way easier lol. The Dreamcast is so elegant, nice FPU with fat registers for 2 full matrices, a bunch of really cool SH4 fast math functions. Plus, the absolute coolest feature: Order-independent transparencies, owed to deferred rasterization. You bin all your polygons upfront before sending them to a tile accelerator to rasterize, which gives the tile accelerator, which generates pixel fragments, the opportunity to depth-test against every other polygon in the bucket. This gives the Dreamcast per-pixel transparency without needing to order polygons. I absolutely love 68000 programming, though. When I do Jag development, I make atari age weep because I play mainly with the 68000 lol.
@ I just try to redeem Ataris hardware decisions. Running code out of external memory probably was an accident due to the unified data and code cache and external data access. LOL. I cannot code 68k , only 6502
@@ArneChristianRosenfeldt Coming from 6502, I think you'd find the 68000 a dream to work with. They feel very similar, except the 68000 is just more of everything, especially registers. That's the absolute best thing about the 68000 -- FAAAAAAT registers. The 68000 is 32-bit internal, that's seven 32-bit address registers, and eight 32-bit data registers. With bitmasking and bitshifting, that's essentially the same as sixteen 16-bit data registers, or thirty-two 8-bit registers! And unlike the 6502, data registers are general purpose, use however you want. You can also use the address registers in clever ways. Hands down my favorite CPU of all time, simple enough to know the ins and outs of, but feature packed enough to do some incredible stuff. Definitely give it a try!
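The "bitmasking and bitshifting" trick for treating one 32-bit register as two 16-bit ones looks roughly like this in C. This is only a sketch of the idea, with helper names I made up; on a real 68000 you would do it with SWAP (which exchanges the two halves) and word-sized operations rather than function calls.

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit "register". */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Read back the upper half. */
uint16_t high16(uint32_t reg)
{
    return (uint16_t)(reg >> 16);
}

/* Read back the lower half. */
uint16_t low16(uint32_t reg)
{
    return (uint16_t)(reg & 0xFFFFu);
}

/* Replace only the lower half, leaving the upper half untouched. */
uint32_t set_low16(uint32_t reg, uint16_t lo)
{
    return (reg & 0xFFFF0000u) | lo;
}
```

The cost is the extra shift/mask work every time you touch a half, so it pays off mainly when you are short on registers in a tight loop.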
Off topic, but kinda funny: That's me in your profile picture. Or, rather, I posed for the reference picture when I was a kid. Wasn't expecting to see myself in the comment section. 😂
And what wasn't, was digitized in the simplest of ways: pulse or frequency counting was used as an ADC (getting, I think, 10 to 18 bit operands usually?). They didn't have integrated peripherals for this, not even dedicated ICs, like we do today. (For fast conversion applications, there were digital conversion CRTs: an electron beam sweeps across a punch-coded plate, producing a serial bit sequence corresponding to beam deflection in the other axis. Not sure who was using these; Bell telephone maybe? Military?) Calculations didn't need to run too often -- a few times a second to update spacial navigation and maneuvering, basically solving differential equations by incremental difference; and managing what digital systems (i.e. on/off switches, relays, lights, display and keypad (DSKY), etc.) were set to automatic (including the autopilot controlling thrusters). It was slow (clock rate low 100s kHz?), but had reasonable bus width (18b?) and a couple of otherwise quite powerful numerical instructions (mul/div/etc.?). Things you might not expect given the low capability generally, but customized perfectly for the workload. Computer design back then was very different: instead of starting with a standard system, there was simply no such thing, as having a CPU at all was already such a massive hurdle; you have a strong incentive to strip out everything unnecessary, and customize the architecture (not just bus sizes, but parallel/serial, instruction timings, pipelining even, etc.) to suit your purpose. There were no standard instruction sets to pick from (for general applications; arguably IBM's System/360 was the first, perhaps only, standardized instruction set -- but only for mainframe data applications, and this might give you some idea of the scale required to obtain value from standardization, and what the scale of computing generally was like back then!). 
What we think of today as a CPU, reading instructions and processing data, was a more nebulous concept back then. So, between these things being built from gates, or individual transistors, the tremendous design and hand-assembly effort to put those together, let alone writing ROM (e.g. "rope") and assembling RAM (hand-threaded core!), and the rarefied applications that demanded such lavish expense -- they were very bespoke and specialized systems indeed! Pipelining is interesting to mention here... System/360 was the first to have it, ca. 1967, according to one article? More important going into the 70s, and again only for the biggest machines that would benefit from it. It seems like a new thing, but it's relatively new _in the consumer space_ to have needed pipelining, or caching or what have you. What used to be supercomputer tech in the 70s, filtered down to single chip consumer hardware in the 90s, and so on. This pattern hasn't changed much: what passed for a supercomputer in the 2000s (multi-CPU, SMP or asym.; vector instructions; etc.) has filtered down, in a sense, to your smartphone today. We've since settled on the best of both worlds: SMP CPU with moderate vectorization, augmented with large-vector parallel processing ([GP]GPU). We carry in our pockets, for the measly cost of a couple watts power dissipation, the power of myriad Cray Supercomputers. Interestingly, grid or flow computing has long been known, but not gained any traction aside from limited use cases where the flow of data is optimal for the calculation (differential field solvers?). Anyway, modern CPUs and GPUs are so extraordinarily powerful that such applications can still run on them with very reasonable execution time, even if not well suited to the flow and dependency of data (i.e. RAM/cache limited). I wonder if that's changing with the availability of tensor cores today (neural net stuff; ugh, "AI"). 
(Standard disclaimer: any keywords and inaccuracies are largely from memory, and should be taken as incentive to go and research these things yourself. There are many excellent and accessible articles, going into any level of detail, on the above subjects; highly encouraged!)
*premature* optimization is. However, sometimes, you've traced your performance bottleneck to a specific area, using somewhat realistic very stressful workloads. Now you need to optimize something everyone says is impossible to optimize further, because you have no hope of learning how fast is fast enough (it'll always be too slow for something), and performance is a feature. That's when you reach for the esoteric stuff. I did that a couple months ago for something at work, it gave like... well, it's hard to quantify. It was noticeable on the test case, at least a 10% throughput improvement of this function (which originally was 33% of runtime), how much time it saves depends on a bunch of parameters, we have an O(n) algorithm with a large constant factor that I can't do anything about, and this function has a O(M^4) section (yes, that's a slow complexity, I haven't figured out how to make it M^3 or smaller)
@@skylerross8054 I hardly ever see the full quote, which really undercuts the "you shouldn't optimize this" crowd: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." Note that Knuth is only talking about "small efficiencies", and that even there 3% (a remarkably specific number) are still useful. I prefer the much simpler saying "until you measure you're wrong about where your code is slow", or the much more general "you're wrong". Keep that in mind and you'll be a much happier programmer!
@@SimonBuchanNz I'm not sure why I'm getting atted with this lol. I mean, I do agree with this, your conclusion is a nicer way of saying what I already agree with. I'm far from an "anti optimization" person, I'm more a "spend effort where it's the most effective for your goals" person. Indeed, performance is a goal/feature. There are an unfortunate number of people who don't think of it like that, and that's how we get apps that take multiple seconds to open, and have to use animations all over the place to mask the fact that things are taking longer than is comfortable. Alas, performance is a single goal, and we have others that are at least as important (the most performant x is one that does nothing), and working on this goal before checking that the work is useful at all, much less useful to the goal, is... wasteful. Measure twice, cut once. I may not engineer physical... anything really, but the motto still applies, a lot of advice from other engineering fields does. I prefer the term software engineer over any other term to describe the profession for that reason.
You are very much in the territory where speed is no longer a priority. When writing code commercially, you have to balance readability, expandability and execution speed. Even if devs back then knew your arcane arts, I doubt they would use such tricks. If your game loop runs 10 microseconds faster but everything breaks whenever you update the code, it's not a good change. I am genuinely infinitely impressed with your dedication to this madness though.
When I did dev, many times when more speed was needed the understandable code got commented out, a paragraph added about what the optimisations were, and if you were lucky, another about WHY and what not to do lol.
Games get shipped. They don't always break even on revenue. Maintenance is a champagne problem. Good enough performance on low end systems is not, because it increases your revenue.
AArch64 (a.k.a. ARM 64-bit) has a "dc zva" instruction that AFAIK does the exact same thing as Create_Dirty_Exclusive but sets the entire cache line to zeroes instead of unpredictable values. It is used in reference implementations of memset released by Arm. So this is definitely a known issue and many modern CPUs can work around it.
It's funny reading the official N64 development docs. They explain what a polygon is to developers because 3D was so new. Can you imagine working at an AAA studio and needing to explain what a 3D model is? But it makes sense, there was a start to everything.
This is actually just a great lesson on computer hardware, like if Kaze's schedule wasn't full I'd say he should definitely do some teaching on the side.
While I'm a programmer, I'm not really a low-level programmer, and these videos are still fascinating as hell to watch. Love your content, can't wait to play your game!
Yep it's a memory throughput issue in the sense that at the moment of this video going up all the best gaming CPUs achieve their top spot on their respective benchmarks exclusively by having an unholy amount of 3D V-Cache. In that sense it's kind of funny that the N64 was almost prophetic in its first party developers 'not understanding the hardware'. Except nowadays it's not limited to videogames and can close down airports and cost several billions of $ in a single day.
Aside from loving all your videos and being extremely impressed at the level of detail you go into developing on the N64, in this video, I really loved the Ridge Racer Type 4 track (Naked Glow) at 10:14! Well done!
Would really like to see a playlist of all of your optimizations over time in release/watch order. Would love an easy way to see the progress over the years as you've optimized so much.
I've never seen anyone as enthusiastic about the N64 hardware as you and it's amazing to see what you've accomplished so far. However, I keep wondering, if you know so much about the hardware, why haven't you considered writing an N64 emulator yourself? I ask because I'm pretty sure yours could be one of the most accurate since you've accumulated so much knowledge about it over the years. Keep up the good work, btw!
There are people with more knowledge than me contributing to emulators. (Also, even if I was the one with the most knowledge, I would not enjoy spending my time writing an emulator, I'd rather make my games) I think the bottleneck for emulators is often not that perfect accuracy is hard to achieve but rather that it is difficult to be perfectly accurate and performant enough to run games.
@@KazeN64 I totally understand that, and you're right. My point was more about the fact that you are so enthusiastic about the hardware and an emulator from you would be like an added bonus. I understand what you mean about not enjoying programming an emulator, since I'm a programmer too, but I don't enjoy working on emulation.
Cache manipulation is still very much necessary in the modern (console) development space. Most vendor APIs handle much of it automatically, but if you’re trying to squeeze out absolutely every drop of performance you still need to worry about it. Generally it's just between the CPU and GPU at this point, but back in the PS3 era dealing with the SPUs was a very fun time. Other low-level/embedded development also frequently hits you right in the cache, and it’s almost guaranteed that when things go wrong, the cache is to blame!
The equivalent of Create Dirty Exclusive is to write to a write-combining mapping. On x86 you can also use streaming writes from SSE2 to do the same thing. It waits for a full cache line of writes, then flushes. It also does the right thing if you don't fill the cache line. You can also do a prefetch with the right hints to say that you're going to be writing to it. Other architectures likely have similar streaming writes. There are a lot of related optimisations. Write combining is mostly for memory-mapped devices and things like CPU access to GPU memory. Here write combining or uncached would be set by using the Memory Type Range Registers, or in the page table.
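For anyone who hasn't seen the SSE2 streaming writes in practice, here is a rough sketch using the `_mm_stream_si128` intrinsic. The `stream_fill` function itself is just an illustration I made up, it assumes the destination is 16-byte aligned and the size is a multiple of 16, and it falls back to plain `memset` when SSE2 isn't available.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill `bytes` bytes at `dst` with `value` without first pulling the
   old contents into the cache.
   Assumes: dst is 16-byte aligned, bytes is a multiple of 16. */
void stream_fill(void *dst, uint8_t value, size_t bytes)
{
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++) {
        _mm_stream_si128(p + i, v);  /* non-temporal 16-byte store */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    memset(dst, value, bytes);       /* portable fallback */
#endif
}
```

Whether this beats a normal fill depends heavily on buffer size and on whether you read the data back soon afterwards, so measure before relying on it.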
On modern hardware, you just have the non-temporal instructions, which bypass the cache, but nothing in the instruction set like Create Dirty Exclusive. At the microarchitectural level, though, the OOO circuits may eliminate the useless loads if you write immediately over the loaded data. It depends on the load/store queue implementation, and most of the time the OOO memory system tends to do loads before writes, because the ALUs are hungry for data and writes can be postponed or merged in a write buffer. The interaction of this optimisation with prefetching must also be taken into account. Using the cache as RAM reminds me of what's used in modern GPUs, notably NVIDIA ones. The L1 cache can be configured to act as an addressable scratchpad memory (yes, the shared memory in CUDA is just the L1 cache reconfigured). It's not surprising, since direct-mapped and associative caches contain one or more RAMs.
Modern software video encoders still perform cache optimizations, and some video game engines also do this. It's gotten less frequently done due to the hardware just no longer really requiring it, but it still has performance gains even today. It's why the AMD X3D CPUs are so much faster than the ones without, they're no longer slamming into the RAM latency as often.
Many of these low level explicit cache management instructions are pretty useful for today's modern HPC applications. Specifically, these are great in lockless multithreaded contexts (alongside volatile reads/writes and memory barriers). Really cool video showcasing some sick usecases!
You can make 2 builds; one with hardware and all other optimizations, the other with only the optimizations that work on emulator. Not ideal having different systems work differently, but as emulators get better maybe your super optimized build would eventually work. Unfortunately I doubt emulators will get much better because they work with the whole N64 library already :/
I already replied to this before: they are already done at the same time. The same code executes one way on real hardware, but if it detects that it is running in an emulator, because of accuracy limitations, it can switch to an emulator-friendly code path.
with the quantum physics cache where we can have the cache change and decide later if we want to commit to ram, we could do speculative execution or branch prediction in software. we could run code without knowing if we should be waiting for the gpu and reduce the idle time. maybe. i have no idea but this sounds like mad programming and i'm here for it.
R4300i cache is direct mapped, which you explained in a roundabout way. This means accessing instructions n*16KB apart (up to cache line length) or data n*8 KB apart will evict one already in cache cause they collide. I wonder if it's possible to instrument such events. This could enable some madman optimizations in tight loops.
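The collision rule is easy to check with a bit of arithmetic. This sketch assumes the cache geometry as I understand it (16 KB instruction cache with 32-byte lines, 8 KB data cache with 16-byte lines, both direct mapped); the function names are mine, not from any SDK.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed R4300i cache geometry; both caches are direct mapped. */
#define ICACHE_SIZE (16u * 1024u)  /* 16 KB instruction cache */
#define ICACHE_LINE 32u            /* 32-byte lines */
#define DCACHE_SIZE (8u * 1024u)   /* 8 KB data cache */
#define DCACHE_LINE 16u            /* 16-byte lines */

/* Index of the one cache line an address is allowed to occupy. */
uint32_t cache_index(uint32_t addr, uint32_t cache_size, uint32_t line_size)
{
    return (addr % cache_size) / line_size;
}

/* Two addresses evict each other when they map to the same index
   but come from different lines of memory. */
bool lines_collide(uint32_t a, uint32_t b,
                   uint32_t cache_size, uint32_t line_size)
{
    bool same_index = cache_index(a, cache_size, line_size) ==
                      cache_index(b, cache_size, line_size);
    bool same_line = (a / line_size) == (b / line_size);
    return same_index && !same_line;
}
```

For example, data at 0x80000000 and 0x80002000 (8 KB apart) fight over data cache line 0, while the same two addresses land on different lines of the instruction cache. A profiling build could run every hot address pair through a check like this to flag loops that thrash.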
16:16-16:33 I remember reading that Azul Vega had an instruction for zeroing memory without reading the previous value from memory. It was added to make memory allocation faster, because Java initializes all fields with zero when allocating new objects. It improved performance greatly - there was always plenty of memory bandwidth available. They had asked for Intel to add a similar instruction, but at least back then x86 didn't have anything similar. I don't know how the situation is in recent years.
ARM has DC ZVA, which is essentially 'zero a cache line' (slight oversimplification). There are cases where DC ZVA before writing the cacheline does improve performance. That being said, many ARM processors also automatically pause linefills for full-cacheline writes if they detect said linefills are unnecessary.
It does have some cache instructions, for prefetching and invalidating a cache line (maybe some more), and also non-temporal loads/writes. Some of them are part of the SSE instruction set extension.
@guyg.8529 given things like Rowhammer and Spectre that come from cache manipulation, maybe not giving even more cache control to userspace is a good thing.
That last one where you can decide whether the cache you wrote should be written back to RAM or not made me think of transactions in a database. I'm sure there's going to be code situations where you generate some data and then keep it or discard it based upon whether it passes some test, although seems niche.
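The transaction analogy can be spelled out with a toy model. To be clear, this is a plain C simulation of a single write-back cache line, not N64 code, and all the names are made up: writes land in the cached copy only, "writeback" plays the role of commit, and "invalidate" plays the role of rollback.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 16u

/* Toy model: one write-back cache line shadowing one line of "RAM". */
typedef struct {
    uint8_t *ram_line;         /* backing line in RAM */
    uint8_t  data[LINE_SIZE];  /* cached copy */
    bool     dirty;            /* cached copy differs from RAM */
} CacheLine;

/* Fill the cache line from RAM (start of the "transaction"). */
void line_load(CacheLine *c, uint8_t *ram_line)
{
    c->ram_line = ram_line;
    memcpy(c->data, ram_line, LINE_SIZE);
    c->dirty = false;
}

/* A CPU store: only the cached copy changes, RAM is untouched. */
void line_write(CacheLine *c, uint32_t offset, uint8_t value)
{
    c->data[offset % LINE_SIZE] = value;
    c->dirty = true;
}

/* Commit: push the modified line out to RAM. */
void line_writeback(CacheLine *c)
{
    if (c->dirty) {
        memcpy(c->ram_line, c->data, LINE_SIZE);
        c->dirty = false;
    }
}

/* Rollback: throw the cached copy away; RAM keeps its old contents.
   Real hardware would just mark the line invalid and refill on the
   next access; the model refills immediately to stay simple. */
void line_invalidate(CacheLine *c)
{
    memcpy(c->data, c->ram_line, LINE_SIZE);
    c->dirty = false;
}
```

The catch on real hardware is that the line can also get evicted (and therefore written back) behind your back if something else maps to the same index, so the "transaction" only holds while you control what touches the cache.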
I saw your comment on the Mario in the Multiverse hack (which I love after I got it to stop crashing on my PC) and am wondering if you will make a video discussing on how you think it's unpolished. Your attention to detail is superb and I think your input (as well as your work here) benefits the Mario 64 mod community tremendously...
From cache to cash: so we have control to cache memory without cost with some constraints ? That is, complex operations can stay in cache for as long as we need before rambus meddling? Can we then use compression as a way to sink the extra cpu idling and virtually increase bandwidth and cache memory?
I always did wonder if the barrel shifter in the Arm CPU on the 3do was meant for efficient data bit packing for LZW and Huffman . That CPU also has cache. MIPS ISA is different.
I wouldn't be at all surprised, if he wanted to spend the time. If it's true that the Pak's main function in DK64 is to store cached lighting data, then Kaze could probably just optimize to the point where the N64 can render the lighting in real-time and avoid caching anything.
There is a lot of cache in old Source games. Once you load it in and store it, your game always becomes way faster than it was, and it's only a one-time thing! (at least in there)
Yeah, some of the weirder modes make you manually manage everything as if you were using registers, but you can still use index and addressing modes for arrays.
Slight correction to the video: It looks like NEC (Nippon Electric Company) were actually the ones that designed most of the CPU core here, so give them the credit instead of SGI.
Correction correction: Apparently it was SGI after all? I have no idea. Someone else can argue this out. It doesn't really matter to me or the video.
companies like NEC need to start designing and manufacturing chips on old nodes again. we don't need faster chips, we need more cheap chips in 2025 :)
@@xyzabc123-o1l Intel currently has some 14nm fabs sitting idle I believe. They need to find some customers for those fabs!
@@xyzabc123-o1l According to Sophie Wilson (co-designer of ARM processor), price per gate is lowest at the 28nm process node.
@@Dweditty thanks for the info, do you know of any easily purchasable chips on that process node?
Sounds like N64 emulators are just going to have to use your game as an accuracy benchmark.
Expect to need a new PC to play 30 year old games.
@@zeggyiv ?
@@iwiffitthitotonacc4673 emulation is really hard to do, and can require more computing power than the original console
@@iwiffitthitotonacc4673 Emulation gets more computationally expensive the greater accuracy is required. Ultra HLE used to run on what would be considered a toaster today.
@@TSquitz I can emulate the most intensive PS2 games at 2x on my 200$ tablet, like unless we're talking 7th gen+ or REALLY high accuracy stuff its just not hard anymore (unless your machine is really underpowered).
N64 developers: "Well excuse me we didn't have 20 years to study the architecture and had to ship something by the end of the month!"
This is extremely true
Yeah, it really seems like the N64 was actually so much more powerful than any of the games ever took advantage of. It also seems a lot of this is down to lack of documentation spelling out the best way to use these capabilities, plus overworked devs who had to get a game out ASAP.
We have to send Kaze back in time to revolutionize the N64
I bet we could do spectacular effects with modern hardware, it's just that the complexity to optimize it to this degree isn't humanly possible
Dk64 moment
18:00 this trick is called Cache-As-RAM (CAR) and as far as I know it is used by BIOS code in most (all?) PCs. In the earliest part of the boot process you simply do not have any RAM yet, since DDR RAM initialization is so complicated. So when modern x86 CPUs come out of reset, they need to start executing code to initialize their memory controller, so for this CAR is used.
oh that's really cool! I didn't know this was commonly used already. Interesting that they do it out of necessity instead of for performance.
Huh, that's really interesting, do other things like the 6502 or something use that?
@@ThatOSDeveloper 6502 and similar super early microprocessors have no cache. First processor I saw with an instruction cache was the 68020 on the Amiga 1200, but IIRC they work differently because the amiga itself has a funky bootstrap sequence. The SH-4 has such a mode, however, it's called "OCRAM mode." The Dreamcast has an integrated MMU so, without checking, I'm fairly sure it'd boot the same way.
@@ThatOSDeveloper Nope, anything where the RAM is straight SRAM or something comparable will have RAM after reset. This is only the case for processors that have to initialize their own memory controller with a complicated algorithm.
@@ThatOSDeveloper Technically, @thebackyardchemist is wrong, and early PCs (along with the whole 8 bit space) don't use this, as they generally don't have CPU cache. The 486 was the first processor where it could rely on having cache internal to the CPU. 6502/z80/4004/8008/8080/8086 did not have any cache.
Kaze seems to have blown past the RTX 5090 phase of development and discovered that the N64 has a pseudo-quantum computer inside
Saw this comment before watching and thought this was a joke 😭
This is extra funny considering the people who try to claim that quantum computers use parallel universes and that Mario 64 has parallel universes
Especially now since Mario in the multiverse released… XD
Scrolling past this halfway into the video I thought this was a joke but it turns out HES ACTUALLY STRAIGHT UP DOING THAT WTF
You all must be exaggerating and keeping the joke going. Lemme see where you got the idea..
Create_Dirty_Exclusive sounds like the general idea behind Conker's Bad Fur Day
this made me laugh way too hard
Damn, just made an almost identical comment before stumbling across this one. We must both be very handsome, intelligent, and charismatic.
@@CottonModem 20 intelligent people including us and the OP could have thought of this OP’s comment
And there I was thinking it was the name of Kaze's onlyfans page
Dirty Cash(e) 😂
So basically, you're taking a couple of cachelines and telling them "you don't cache any more, you are now extra CPU registers."
Can the N64 do operations directly on the cache buckets? I would assume it would still have to load the data to a register connected to the ALU, so they’re more like extra _SUPER_-volatile ram addresses that you then have to “flush” (i.e. bring the original data back from RAM over the Ram-Bus so that you don’t overwrite it)
I imagine that the trick would be getting as much use out of the cache buckets as you can before needing to reset them back to their original data, or perhaps even invalidate that section of RAM altogether and pretend that the cache _is_ the RAM until the data in it has to be accessed by something other than the CPU.
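For anyone who wants to see why claiming a cache line without loading it saves bus traffic, here is a rough toy model in Python (obviously not N64 code). The names `ToyCache` and `fill_buffer`, the 16-byte line size and the 512-line count are all assumptions made up for this sketch; it only counts how often a line would have to be fetched from RAM and does not model timing.

```python
# Toy model of a direct-mapped, write-back data cache.
# LINE_SIZE and NUM_LINES are illustrative assumptions, not measured values.
LINE_SIZE = 16      # bytes per cache line
NUM_LINES = 512     # 512 * 16 bytes = 8 KiB of cache


class ToyCache:
    def __init__(self, ram):
        self.ram = ram                      # bytearray standing in for main RAM
        self.tags = [None] * NUM_LINES      # which RAM line each slot holds
        self.data = [bytearray(LINE_SIZE) for _ in range(NUM_LINES)]
        self.dirty = [False] * NUM_LINES
        self.bus_reads = 0                  # line fills from RAM
        self.bus_writes = 0                 # line write-backs to RAM

    def _slot(self, addr):
        line_addr = addr // LINE_SIZE
        return line_addr % NUM_LINES, line_addr

    def _evict(self, slot):
        # Write the old line back to RAM only if it holds modified data.
        if self.tags[slot] is not None and self.dirty[slot]:
            base = self.tags[slot] * LINE_SIZE
            self.ram[base:base + LINE_SIZE] = self.data[slot]
            self.bus_writes += 1
        self.dirty[slot] = False

    def store(self, addr, value):
        """Ordinary store: a miss first fetches the whole line from RAM."""
        slot, line_addr = self._slot(addr)
        if self.tags[slot] != line_addr:
            self._evict(slot)
            base = line_addr * LINE_SIZE
            self.data[slot][:] = self.ram[base:base + LINE_SIZE]
            self.bus_reads += 1
            self.tags[slot] = line_addr
        self.data[slot][addr % LINE_SIZE] = value
        self.dirty[slot] = True

    def create_dirty_exclusive(self, addr):
        """Claim the line without fetching it; its contents are stale until written."""
        slot, line_addr = self._slot(addr)
        if self.tags[slot] != line_addr:
            self._evict(slot)
            self.tags[slot] = line_addr
        self.dirty[slot] = True


def fill_buffer(cache, start, length, use_cde):
    """Overwrite [start, start + length) and return the number of line fetches.

    Assumes start is line-aligned and length is a multiple of LINE_SIZE,
    so every claimed line is completely overwritten.
    """
    before = cache.bus_reads
    for addr in range(start, start + length):
        if use_cde and addr % LINE_SIZE == 0:
            cache.create_dirty_exclusive(addr)
        cache.store(addr, 0xAB)
    return cache.bus_reads - before
```

In this model, filling 256 bytes with plain stores costs 16 line fetches, while claiming each line first costs none. The catch is the same as in the comments above: the claimed line holds garbage until you have written all of it.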
The SNES CPU also had some memory-mapped bytes on the CPU die ($43x0..$43xB for x=0..7, so 12*8=96 bytes) but sadly they were used mostly just as a place to store DMA/HDMA parameters. Afaik only 1 game used that area as a fast cache for instructions: Another World (SNES port by Rebecca Heineman).
2035: Kaze manages to run Crysis on N64 by using instructions that theoretically don't even exist
Don't give him ideas. You know he'll do it.
Don't know about N64, but I'm pretty sure that modern CPUs have undocumented instructions, sooo
There are a few NES games that use undocumented instructions. On the CHIP-8 (technically a fantasy console) a few undocumented instructions got used so much that they became official
@@VlaDexa_MAX all procs have undocumented instructions...though modern ones can have them "disabled" via the Instruction decoder being set to convert their opcodes to NOPs in the end-user versions.
bro explained the N64 like a country
@@LavaCreeperPeople what?
And boy did it work
And he somehow made it MORE confusing
bro looks up at the sky and says "bro is blue"
I am not using quantum physics in my Mario 64 mod YET. Famous last words.
Essentially you're using dynamic ranges of cache as a sort of register-window; bravo!
I've not seen this sort of cache-line optimization talk outside of Linux kernel specific talks before. Excellent!
Yeah, this streaming out of sub-16 byte data packages looked very much like register windows. The Jaguar has (buggy) helper registers to let the GPU assemble 32:32 bits to write out in one go as 64 bit.
lots of talk about this kind of optimization going on right now in Dreamcast-land with the community port of GTA3. Currently none of it is implemented as everyone hashes out the detail with profiling to see exactly the best way to attack the problem, with the added complexity that both vertex transformation and vertex submission can *_potentially_* thrash cache depending on how it's done.
first time a bus has been mentioned in an n64 video without it being "Imagine a bus"
If I had a nickel for every Mario related bus meme, I'd have two nickels...
@@thewhitefalcon8539 I'd have 3. Desert bus 64.
Imagine a rambus
SMB frame rule and rambus being related 😂
I am not using this information, I am not making a N64 game. I'm just watching this because I can.
This way of thinking is good on any platform.
a great use of freewill
Same. I don't even understand half of it, but hearing someone go in depth on their niche interest without being boring is magical when it's clear they have taken their nerdiness to expert level.
Listening to someone talk about a thing they're passionate about is always fun. Even if you don't understand half of it.
am i ever gonna use this information? not likely
do i like hearing this guy talk about transforming the N64 into a bloody supercomputer? absolutely
2:09 As someone who did maths for their undergrad, I can confirm, I have absolutely no memory (it's kinda why generality and derivations from first principles appeal in the first place).
That's because you're not a chad universalist who memorizes their proofs like Poincare :^)
@@JorgetePanete yrou'e*
See now I'm awful at math but my memory is fantastic, wanna connect our brains with a rambus?
If you've never forgotten the quadratic formula on an exam and re-derived it on the spot, then are you even a real mathematician :)
Rambus was finally going vroom vroom, but now it's retired :(
Bro downloaded more ram to the point he didn't need the base ram anymore
He has a good career and now he can enjoy some time off
Having (all) your ALUs munching away on useful work with some memory bandwidth to spare is the goal for a well optimized system.
Don't worry, the RAM BUS is going full time with the RCP = )
It just don't deserve the CPU that often, that's all.
@@mylittleparody2277 I think it's the RDP that needs most of the RAM bandwidth.
Kaze : "Alright, full disclosure : i am not using quantum physics in my mario 64 mod-"
Also Kaze : "-YET"
At this rate we'll have ray tracing in RtYI by the time it releases.
The satirical kaze video by sm64rise is gonna become real
You thought the Rt stood for Return To
You were sorely mistaken
@@LokiScarletWasHere Raytraced Yoshi's Island 64, coming to a nintendo 64 near you in 2025.
Actually, reminds me of that one guy who made a Ray-Tracing chip for the super nintendo.
Wait, why is this legitimately a good way to explain how a CPU works?
Nintendo: *releases N64 specs & development docs*
SGI: look how they massacred my boy
Edit: Tbf, this is basically software engineering in a nutshell. Hardware folks come up with some rocket science bullshit to squeeze extra perf out of the silicon, and the software people waste all of that work by having compilers ignore modern special-purpose instructions for the sake of backwards compatibility, and putting the entire program behind all the polymorphism, virtual functions, dependency injections, virtual machines & interpreters, and God knows how many other abstractions and obfuscations. Despite the different nature of software optimization then vs. now, it boils down to a similar amount of fundamentally misunderstanding how the hardware actually functions that led to most of the N64 library having lackluster performance.
Modern apps are written like a labyrinth, and the CPU is given the unreasonable task of translating the map from a foreign language and solving the labyrinth as quickly as possible. This is often why modern software is ~1000x slower than it could be.
Biggest revelation of this channel (besides all the amazing tech) is that the worst, most performance limiting part of the N64 was the documentation.
@@uponeric36 same with Bosch mototronic ECUs and their stolen/hidden FR manuals
Well, yes and no.
As someone doing both hardware and software (even if at a way simpler level than CPU) I agree with you that some very powerful hardware possibilities are not used.
On the other hand, take into account that the docs (for the N64, but also on a lot of projects I worked on) are not as simple or readable as you may expect, and also, the software side doesn't have a lot of time to learn the hardware and code.
That's why most of the time, retro compatibility is a thing, because you can reuse old bricks to try to gain a bit of time.
And i agree with you modern frameworks are just an unstable pile of horrendous things (that you can't even modify easily).
But just try to say that the game will run faster and on old hardware if you rewrite the engine from scratch with Raylib instead of using Unity...
> Modern apps are written like a labyrinth, and the CPU is given the unreasonable task of translating the map from a foreign language and solving the labyrinth as quickly as possible.
Compilers can do a lot of heavy lifting there.
Being a pioneer for a 30 year old console, what a time to be alive.
21 minute papers...
@Doom2pro except with actual information
@@canaconn2388 and not spoken like "Today, we, are going, to, discuss, a groundbreaking, piece, of technological, development.. so we, will get, to see, an amazing, hard to believe, sight... So hold on to your papers"
"Man Revolutionizes N64!"
"He's 25 years late and gonna get sued so IDK why he did."
I have to wonder if all his research into N64 hardware will indirectly help FPGA N64 projects improve, to act truly like the real hardware, or at least a bit more like it. If you look at the firmware update history of Analogue products you can see they are frequently updating the cores to address inaccuracies in certain games, even very popular games. So FPGA may never truly be 100% accurate for 100% of games. So Kaze's intense research into how N64 hardware actually works and how it is actually used, and how it could be used, is probably important for achieving that goal, or at the very least putting the spotlight on N64 hardware when people inevitably try to run all these things on their Analogue 3D's and such.
mario 64 has parallel universes, nintendo64 has quantum cache everything is coming together for mario64 port for a quantum computer
I was the 64th like!
Man I love the visuals in this one. It's been great learning something new every time. Few of the concepts here I don't think I would have understood without the little graphics.
Ah! A direct mapped cache! The Sega Dreamcast has a similar cache setup. I've got a good scheme created to maximize direct mapped cache by using absolute addressing in gcc with an ld script to create striped zones. Separate the direct mapped cache into 4 zones, each separated by the width of the cache spacing, to ensure writes to buffers don't overwrite the previous line. On the Dreamcast, you can also enable OCRAM mode, which halves the cache into a scratch pad for fast math. This is actually optimal, because the physical layout of the Dreamcast's memory is (for sake of brevity ignoring the 64-bit dual ram setup) 2 ram "chips" with 2 banks inside, each bank made up of 2048 rows of memory "cells," each cell being a cacheline in size. Each bank has a mechanism inside for reading a row, called a sense amplifier. To read a cell, a sense amplifier must be attached to the row, so if you read a row outside of the boundary, it incurs a performance penalty as the sense amplifier must detach, move to the appropriate row, and reattach. If you operate in OCRAM mode, the sizing of the remaining cache is *juuuust* right to fit 4 rows at once if you stripe your memory without sense amplifier penalty. It sounds like the DC and N64 actually share quite a bit in common memory wise.
A really cool feature of the Dreamcast memory map is the entire memory is mirrored to an alternate address which skips cache when read, as well. So you can actually store things in memory and call them using an alternate address without thrashing your data cache. The dreamcast also naturally has prefetch and invalidate instructions, which when combined with absolute addressing and OCRAM mode, gives you quite a bit of granularity in how you control your cache.
EDIT - Question: Does the N64 offer any sort of degree of instruction parallelization? The Dreamcast uses a 5-stage harvard architecture for instruction fetch, which allows parallelization when basically using any instruction from alternate groups providing they aren't a move opcode. Anything like that exist on the N64? EDIT AGAIN: Welp, looked a little further and it turns out this is actually a part of the MIPS name, lol. "Microprocessor without Interlocked Pipelined Stages." Very, very, verrry cool. The architecture of the DC and N64 are very similar!
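A small sketch of the direct-mapped striping idea from the comment above, written in Python just to show the arithmetic. The cache size, line size, base address and every function name here are assumptions for illustration, not the real Dreamcast or N64 values, and this is only one reading of the "4 zones" scheme rather than the commenter's actual linker script.

```python
# Which cache slot does an address land in for a direct-mapped cache?
# The sizes below are assumptions for illustration only.
CACHE_SIZE = 8 * 1024   # total bytes of cache
LINE_SIZE = 32          # bytes per line
NUM_LINES = CACHE_SIZE // LINE_SIZE


def line_index(addr):
    """Slot a byte address maps to in a direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_LINES


def conflicts(addr_a, addr_b):
    """True if two addresses from different memory lines fight over one slot."""
    same_memory_line = addr_a // LINE_SIZE == addr_b // LINE_SIZE
    return line_index(addr_a) == line_index(addr_b) and not same_memory_line


def striped_zone_bases(base, zones):
    """Split the cache into equal zones and return one buffer base per zone.

    Buffers no larger than CACHE_SIZE // zones placed at these bases
    occupy disjoint sets of slots, so they cannot evict each other.
    """
    zone_size = CACHE_SIZE // zones
    return [base + i * zone_size for i in range(zones)]


def slots_used(base, size):
    """Set of cache slots touched by a line-aligned buffer of `size` bytes."""
    return {line_index(a) for a in range(base, base + size, LINE_SIZE)}
```

Two addresses exactly one cache size apart land in the same slot and keep kicking each other out, which is the thrashing the striping is meant to avoid. Four buffers placed at the bases returned by `striped_zone_bases` each get their own quarter of the slots.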
This video is so complicated that I am almost relieved that the Atari Jaguar only has scratchpad RAM for code and a Matrix and a ton of registers for the data.
@@ArneChristianRosenfeldt Oh man I've done Jaguar programming with my Skunkboard. I consider Dreamcast development way, way easier lol. The Dreamcast is so elegant, nice FPU with fat registers for 2 full matricies, a bunch of really cool SH4 fast math functions. Plus, the absolute coolest feature: Order-independent transparencies, owed to deferred rasterization. You bin all your polygons upfront before sending them to a tile accelerator to rasterize, which gives the tile accelerator, which generates pixel fragments, the opportunity to depth-test against every other polygon in the bucket. This gives the dreamcast per-pixel transparency without needing to order polygons.
I absolutely love 68000 programming, though. When I do Jag development, I make atari age weep because I play mainly with the 68000 lol.
@ I just try to redeem Ataris hardware decisions. Running code out of external memory probably was an accident due to the unified data and code cache and external data access. LOL.
I cannot code 68k , only 6502
@@ArneChristianRosenfeldt Coming from 6502, I think you'd find the 68000 a dream to work with. They feel very similar, except the 68000 is just more of everything, especially registers. That's the absolute best thing about the 68000 -- FAAAAAAT registers. The 68000 is 32-bit internal, that's seven 32-bit address registers, and eight 32-bit data registers. With bitmasking and bitshifting, that's essentially the same as sixteen 16-bit data registers, or thirty-two 8-bit registers! And unlike the 6502, data registers are general purpose, use however you want. You can also use the address registers in clever ways. Hands down my favorite CPU of all time, simple enough to know the ins and outs of, but feature packed enough to do some incredible stuff. Definitely give it a try!
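To make the "bitmasking and bitshifting" point concrete, here is a tiny Python sketch of treating one 32-bit register as two 16-bit values. It is not 68000 assembly; the function names are made up for this example, and Python integers are masked to 32 bits by hand to imitate a fixed-width register.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack16(high, low):
    """Pack two unsigned 16-bit values into one 32-bit word."""
    return (((high & MASK16) << 16) | (low & MASK16)) & MASK32


def unpack16(word):
    """Split a 32-bit word back into its (high, low) 16-bit halves."""
    return (word >> 16) & MASK16, word & MASK16


def swap_halves(word):
    """Exchange the two 16-bit halves, similar in spirit to the 68000 SWAP instruction."""
    return ((word << 16) | (word >> 16)) & MASK32
```

For example, `pack16(0x1234, 0xABCD)` gives `0x1234ABCD`, and `swap_halves` of that gives `0xABCD1234`, so the "other" value is always one cheap operation away.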
Off topic, but kinda funny: That's me in your profile picture. Or, rather, I posed for the reference picture when I was a kid. Wasn't expecting to see myself in the comment section. 😂
0:50 BITD, I had a girlfriend with a create_dirty_exclusive mode. It wound up not being so exclusive, and then I got dumped.
were you ram
@@Mizu2023 no, but she was
@@GumSkyloard Oh right. Saw "dump" and mind went "ramdump"
10:08 Fun fact: Much of the equipment the Apollo missions used was analog, so not all data had to run through a CPU
Also, execution speed was the last priority. It was (and still is for all space missions) all about reliability for obvious reasons.
The Apollo guidance computer was probably quite memory limited in terms of size
And what wasn't, was digitized in the simplest of ways: pulse or frequency counting was used as an ADC (getting, I think, 10 to 18 bit operands usually?). They didn't have integrated peripherals for this, not even dedicated ICs, like we do today. (For fast conversion applications, there were digital conversion CRTs: an electron beam sweeps across a punch-coded plate, producing a serial bit sequence corresponding to beam deflection in the other axis. Not sure who was using these; Bell telephone maybe? Military?) Calculations didn't need to run too often -- a few times a second to update spatial navigation and maneuvering, basically solving differential equations by incremental difference; and managing what digital systems (i.e. on/off switches, relays, lights, display and keypad (DSKY), etc.) were set to automatic (including the autopilot controlling thrusters). It was slow (clock rate low 100s kHz?), but had reasonable bus width (18b?) and a couple of otherwise quite powerful numerical instructions (mul/div/etc.?). Things you might not expect given the low capability generally, but customized perfectly for the workload.
Computer design back then was very different: instead of starting with a standard system, there was simply no such thing, as having a CPU at all was already such a massive hurdle; you have a strong incentive to strip out everything unnecessary, and customize the architecture (not just bus sizes, but parallel/serial, instruction timings, pipelining even, etc.) to suit your purpose. There were no standard instruction sets to pick from (for general applications; arguably IBM's System/360 was the first, perhaps only, standardized instruction set -- but only for mainframe data applications, and this might give you some idea of the scale required to obtain value from standardization, and what the scale of computing generally was like back then!). What we think of today as a CPU, reading instructions and processing data, was a more nebulous concept back then. So, between these things being built from gates, or individual transistors, the tremendous design and hand-assembly effort to put those together, let alone writing ROM (e.g. "rope") and assembling RAM (hand-threaded core!), and the rarefied applications that demanded such lavish expense -- they were very bespoke and specialized systems indeed!
Pipelining is interesting to mention here... System/360 was the first to have it, ca. 1967, according to one article? More important going into the 70s, and again only for the biggest machines that would benefit from it. It seems like a new thing, but it's relatively new _in the consumer space_ to have needed pipelining, or caching or what have you. What used to be supercomputer tech in the 70s, filtered down to single chip consumer hardware in the 90s, and so on. This pattern hasn't changed much: what passed for a supercomputer in the 2000s (multi-CPU, SMP or asym.; vector instructions; etc.) has filtered down, in a sense, to your smartphone today. We've since settled on the best of both worlds: SMP CPU with moderate vectorization, augmented with large-vector parallel processing ([GP]GPU). We carry in our pockets, for the measly cost of a couple watts power dissipation, the power of myriad Cray Supercomputers.
Interestingly, grid or flow computing has long been known, but not gained any traction aside from limited use cases where the flow of data is optimal for the calculation (differential field solvers?). Anyway, modern CPUs and GPUs are so extraordinarily powerful that such applications can still run on them with very reasonable execution time, even if not well suited to the flow and dependency of data (i.e. RAM/cache limited). I wonder if that's changing with the availability of tensor cores today (neural net stuff; ugh, "AI").
(Standard disclaimer: any keywords and inaccuracies are largely from memory, and should be taken as incentive to go and research these things yourself. There are many excellent and accessible articles, going into any level of detail, on the above subjects; highly encouraged!)
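As a rough idea of what "solving differential equations by incremental difference" means, here is a minimal forward-difference sketch in Python. This is a generic textbook update, not the Apollo guidance code; the function names, the constant acceleration and the step size are assumptions picked so the numbers are easy to follow.

```python
def euler_step(position, velocity, acceleration, dt):
    """Advance one time step using simple forward differences."""
    new_position = position + velocity * dt
    new_velocity = velocity + acceleration * dt
    return new_position, new_velocity


def simulate(position, velocity, acceleration, dt, steps):
    """Repeat the update a fixed number of times and return the final state."""
    for _ in range(steps):
        position, velocity = euler_step(position, velocity, acceleration, dt)
    return position, velocity
```

Starting from rest with an acceleration of 2.0 and four updates per time unit, `simulate(0.0, 0.0, 2.0, 0.25, 4)` returns `(0.75, 2.0)`. The exact position would be 1.0, so the coarse step undershoots, and a smaller step closes the gap at the cost of more updates per second.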
The Apollo computer used for calculating trajectories was an old gear driven cash register
@@T3sl4Bro, I would read your substack.
"This will actually somewhat work on some emulators, too"
*Shows a smoking laptop, which is presumably overheating*
I love it
Sick 3d animations go vroom vroom.
Computers have a few functionalities programmers typically would not consciously use, but for the sake of optimization, they sometimes should.
Kids, don’t do this at home. You are not Kaze. Premature optimization is the root of all evil !
*premature* optimization is.
However, sometimes, you've traced your performance bottleneck to a specific area, using somewhat realistic very stressful workloads. Now you need to optimize something everyone says is impossible to optimize further, because you have no hope of learning how fast is fast enough (it'll always be too slow for something), and performance is a feature. That's when you reach for the esoteric stuff.
I did that a couple months ago for something at work, it gave like... well, it's hard to quantify. It was noticeable on the test case, at least a 10% throughput improvement of this function (which originally was 33% of runtime), how much time it saves depends on a bunch of parameters, we have an O(n) algorithm with a large constant factor that I can't do anything about, and this function has a O(M^4) section (yes, that's a slow complexity, I haven't figured out how to make it M^3 or smaller)
@@skylerross8054 I hardly ever see the full quote, which really undercuts the "you shouldn't optimize this" crowd:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Note that Knuth is only talking about "small efficiencies", and that even there 3% (a remarkably specific number) are still useful.
I prefer the much simpler saying "until you measure you're wrong about where your code is slow", or the much more general "you're wrong". Keep that in mind and you'll be much happier a programmer!
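In the spirit of "until you measure you're wrong", a bare-bones timing helper in Python might look like the sketch below. The `measure` helper and the two example functions are made up for illustration; a real project would more likely reach for a proper profiler.

```python
import time


def measure(func, *args, repeats=5):
    """Return (best_seconds, result) over several runs of func(*args)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def sum_squares_loop(n):
    """Straightforward version: add up i*i for i in 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n):
    """Closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6
```

Calling `measure(sum_squares_loop, 1000)` and `measure(sum_squares_formula, 1000)` gives two timings and two identical results, so you know the faster version is still correct before deciding whether the difference is worth caring about.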
@@SimonBuchanNz I'm not sure why I'm getting atted with this lol.
I mean, I do agree with this, your conclusion is a nicer way of saying what I already agree with.
I'm far from an "anti optimization" person, I'm more a "spend effort where it's the most effective for your goals" person. Indeed, performance is a goal/feature. There are an unfortunate number of people who don't think of it like that, and that's how we get apps that take multiple seconds to open, and have to use animations all over the place to mask the fact that things are taking longer than is comfortable.
Alas, performance is a single goal, and we have others that are at least as important (the most performant x is one that does nothing), and working on this goal before checking that the work is useful at all, much less useful to the goal, is... wasteful. Measure twice, cut once. I may not engineer physical... anything really, but the motto still applies, a lot of advice from other engineering fields does. I prefer the term software engineer over any other term to describe the profession for that reason.
5:50 The framerate of the game here makes me think it's more like a CPU simulator, it's gonna be 100% accurate but simulations are still heavy
So when are you transferring your consciousness to a cluster of n64s?
No need for a cluster, one N64 is plenty powerful enough, he just gotta unlock the hidden consciousness port with the right optimizations.
You are very much in the territory where speed is no longer a priority. When writing code commercially, you have to balance readability, expandability and execution speed. Even if devs back then knew your arcane arts, I doubt they would use such tricks. If your game loop runs 10 microseconds faster but everything breaks whenever you update the code, it's not a good change.
I am genuinely infinitely impressed with your dedication to this madness though.
Kaze doesn't care about readability, he cares about optimisation lol
i recommend you watch his 'optimizing with "bad code"' video
it has even more Very Fun optimization stuff
When I did dev, many times when more speed was needed the understandable code got commented out, a paragraph added about what the optimisations were, and if you were lucky, another about WHY and what not to do lol.
Games get shipped. They don't always break even on revenue. Maintenance is a champagne problem. Good enough performance on low end systems is not, because it increases your revenue.
20:26 Poor Henry Kümpel suffering from mojibake. Unless they inserted ü intentionally as a joke...
Finally, after years of us stupid people asking, Kaze has dumbed it down to our level.
Bus go vroom hehe
Bus go retire
AArch64 (a.k.a. ARM 64-bit) has a "dc zva" instruction that AFAIK does the exact same thing as Create_Dirty_Exclusive but sets the entire cache line to zeroes instead of unpredictable values. It is used in reference implementations of memset released by Arm. So this is definitely a known issue and many modern CPUs can work around it.
bro knows better n64 than nintendo themselves🙏🙏😭
Of course, Nintendo moved onto other technology. Amazing how deep you can dive into a hobby.
It's funny reading the official N64 development docs. They explain what a polygon is to developers because 3D was so new. Could you imagine working at an AAA studio and needing to explain what a 3D model is? But it makes sense, there was a start to everything.
Ps1 is next 😬🥶
Iirc devs had to get permission from Nintendo to use microcode, so finding optimizations like this was probably stalled by a bunch of red tape.
@@Genzaijh so real bro
Well done, you've made Schrödinger's Memory
I don't remember that part
@@mathphysicsnerd It's the quantum bit
This is actually just a great lesson on computer hardware, like if Kaze's schedule wasn't full I'd say he should definitely do some teaching on the side.
This *_is_* his teaching on the side. Surprise!
when i think you have run out of n64 hardware vids you keep on dropping em. i don’t regret my sub one bit.
great video
While I'm a programmer, I'm not really a low-level programmer, and these videos are still fascinating as hell to watch. Love your content, can't wait to play your game!
We got 3D Rambus (retired) before GTA 6
TL;DR: friendship ended with rambus, now cache is all kaze needs (this is exaggerated, but you get the point)
Yep it's a memory throughput issue in the sense that at the moment of this video going up all the best gaming CPUs achieve their top spot on their respective benchmarks exclusively by having an unholy amount of 3D V-Cache. In that sense it's kind of funny that the N64 was almost prophetic in its first party developers 'not understanding the hardware'. Except nowadays it's not limited to videogames and can close down airports and cost several billions of $ in a single day.
ooooo crowdstrike incident reference
Aside from loving all your videos and being extremely impressed at the level of detail you go into developing on the N64, in this video, I really loved the Ridge Racer Type 4 track (Naked Glow) at 10:14! Well done!
This is the best ELI5 and visual representation of how this all works. Great education. Bravo, chapeau and thank you!
Would really like to see a playlist of all of your optimizations over time in release/watch order. Would love an easy way to see the progress over the years as you've optimized so much.
Bro knows the N64 better than his own room
I await your Diddy Kong Racing video.
I've never seen anyone as enthusiastic about the N64 hardware as you and it's amazing to see what you've accomplished so far. However, I keep wondering, if you know so much about the hardware, why haven't you considered writing an N64 emulator yourself? I ask because I'm pretty sure yours could be one of the most accurate since you've accumulated so much knowledge about it over the years. Keep up the good work, btw!
There are people with more knowledge than me contributing to emulators. (Also, even if I was the one with the most knowledge, I would not enjoy spending my time writing an emulator, I'd rather make my games)
I think the bottleneck for emulators is often not that perfect accuracy is hard to achieve but rather that it is difficult to be perfectly accurate and performant enough to run games.
@@KazeN64 thank you for your work. It inspires programmers to further optimize their games
@@KazeN64 I totally understand that, and you're right. My point was more about the fact that you are so enthusiastic about the hardware and an emulator from you would be like an added bonus. I understand what you mean about not enjoying programming an emulator, since I'm a programmer too, but I don't enjoy working on emulation.
Cache manipulation is still very much necessary in the modern (console) development space. Most vendor APIs handle much of it automatically, but if you’re trying to squeeze out absolutely every drop of performance you still need to worry about it. Generally it's just been the CPU and GPU at this point, but back in the PS3 era dealing with the SPUs was a very fun time. Other low-level/embedded development also frequently hits you right in the cache, and it’s almost guaranteed that when things go wrong, the cache is to blame!
man i needed that 2 months ago for my memory management and scheduling class project
When this man speaks the entire modern gaming industry weeps-he saves microseconds where others can’t save seconds.
Aww the animation you did to explain things was adorable. Great job, high effort videos!
The equivalent to create dirty exclusive is to write to a write combining mapping.
On x86 you can also use streaming writes from SSE2 to do the same thing. It waits for a full cache line of writes then flushes. It also does the right thing if you don't fill the cache line. You can also do prefetch with the right hints to say that you're going to be writing to it. Other architectures likely have similar streaming writes.
There's a lot of related optimisations. Write combining is mostly for memory mapped devices and things like CPU access to GPU memory. Here write combining or uncached would be set by using the Memory Type Range Registers, or in the page table.
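Here is a toy Python model of the write-combining behaviour described above: writes to one line pile up in a buffer and go out as a single burst, either when the line is full or when something forces a flush. The `WriteCombiner` class, the 64-byte line size and the burst log are assumptions for illustration only; real hardware has several buffers and more flush conditions than this.

```python
LINE_SIZE = 64  # assumed line size in bytes


class WriteCombiner:
    """Collects byte writes to one line and emits a single burst per line."""

    def __init__(self):
        self.line_addr = None
        self.pending = {}        # offset within line -> byte value
        self.bursts = []         # log of (line_addr, bytes_written, full_line)

    def flush(self):
        """Send whatever is buffered as one burst, full line or not."""
        if self.line_addr is not None and self.pending:
            full = len(self.pending) == LINE_SIZE
            self.bursts.append((self.line_addr, len(self.pending), full))
        self.line_addr = None
        self.pending = {}

    def write(self, addr, value):
        line = addr // LINE_SIZE
        if line != self.line_addr:
            self.flush()                 # moving to another line forces a flush
            self.line_addr = line
        self.pending[addr % LINE_SIZE] = value
        if len(self.pending) == LINE_SIZE:
            self.flush()                 # a complete line goes out immediately
```

Writing 128 consecutive bytes produces two full-line bursts and no reads at all, which is the part that resembles the create dirty exclusive trick. A short run of writes still goes out, just as a partial burst.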
It's amazing how good the graphics quality you've achieved on a Nintendo 64 is! It's so beautiful! Imagine this game running in 1996
Thanks for consistently great videos
You are a genius brother, thank you for all the effort!
On modern hardware, you just have the non-temporal instructions, which bypass the cache, but nothing in the instruction set like Dirty-exclusive. But at the microarchitectural level, the OOO circuits may eliminate the useless loads if you write immediately over the loaded data. It depends on the load/store queue implementation, and most of the time, the OOO memory system tends to do loads before writes because the ALUs are hungry for data and writes can be postponed or fused in a write buffer. The interaction of this optimisation with prefetching must also be taken into account.
Using the cache as RAM reminds me of what's used in modern GPUs, notably NVIDIA ones. The L1 cache can be configured to act as an addressable scratchpad memory (yes, the shared memory in CUDA is just the L1 cache reconfigured). It's not surprising, since direct-mapped and associative caches contain one or multiple RAM arrays.
Props to you for explaining such a topic in such an understandable manner, it's a true display of intelligence
I'm so hyped to try this game of yours. It's really impressive just how far you've managed to take N64's capabilities.
Modern software video encoders still perform cache optimizations, and some video game engines also do this. It's gotten less frequently done due to the hardware just no longer really requiring it, but it still has performance gains even today. It's why the AMD X3D CPUs are so much faster than the ones without, they're no longer slamming into the RAM latency as often.
Now the bus fits so many more framerules!
(or something like that)
Many of these low level explicit cache management instructions are pretty useful for today's modern HPC applications. Specifically, these are great in lockless multithreaded contexts (alongside volatile reads/writes and memory barriers). Really cool video showcasing some sick usecases!
You can make 2 builds; one with hardware and all other optimizations, the other with only the optimizations that work on emulator. Not ideal having different systems work differently, but as emulators get better maybe your super optimized build would eventually work. Unfortunately I doubt emulators will get much better because they work with the whole N64 library already :/
Or try out the capabilities on game load and patch in shims or NOPs if something fails
I already replied to this before:
They are already done at the same time.
The same code executes one way on real hardware, but if it detects that it is running in an emulator with accuracy limitations, it can switch to emulator-friendly code.
Yeah, Bear Waker had two builds as well, where the other was console optimized.
The effort you went through to illustrate the cpu/ram/etc was top notch. Truly outdid yourself. 10/10, no notes.
with the quantum physics cache where we can have the cache change and decide later if we want to commit to ram. we could do speculative execution or branch prediction in software. we can run code without knowing if we should be waiting for the gpu and reduce the idle time. maybe. i have no idea but this sounds like mad programming and i'm here for it.
The N64 styled visuals for the analogy are just ADORABLE! :D
5:03 whoops, cache momentarily corrupted
Me nodding along as if I know what Kaze is talking about when he describes technology more complicated than a rocket for a moon landing.
R4300i cache is direct mapped, which you explained in a roundabout way. This means accessing instructions n*16KB apart (up to cache line length) or data n*8 KB apart will evict one already in cache cause they collide. I wonder if it's possible to instrument such events. This could enable some madman optimizations in tight loops.
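The collision rule described above can be sketched as a small index calculation. This is only an illustration in C, assuming the commonly documented VR4300 geometry (16 KB instruction cache with 32-byte lines, 8 KB data cache with 16-byte lines). The function names and the example addresses are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed VR4300 cache geometry; both caches are direct mapped. */
enum {
    ICACHE_SIZE = 16 * 1024, ICACHE_LINE = 32,
    DCACHE_SIZE = 8 * 1024,  DCACHE_LINE = 16
};

/* Slot that an address occupies in a direct-mapped cache. */
uint32_t cache_index(uint32_t addr, uint32_t cache_size, uint32_t line_size)
{
    assert(line_size != 0 && cache_size % line_size == 0);
    return (addr % cache_size) / line_size;
}

/* Two data addresses evict each other when they land in the same slot
 * but belong to different lines of memory. */
bool dcache_conflict(uint32_t a, uint32_t b)
{
    bool same_slot = cache_index(a, DCACHE_SIZE, DCACHE_LINE) ==
                     cache_index(b, DCACHE_SIZE, DCACHE_LINE);
    bool same_line = (a / DCACHE_LINE) == (b / DCACHE_LINE);
    return same_slot && !same_line;
}
```

For example, `dcache_conflict(0x80200000, 0x80202000)` is true because the two addresses are exactly 8 KB apart, while two addresses inside the same 16-byte line never conflict. A tool that logs such pairs for a hot loop would be one way to approximate the instrumentation the comment asks about.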
16:16-16:33 I remember reading that Azul Vega had an instruction for zeroing memory without reading the previous value from memory. It was added to make memory allocation faster, because Java initializes all fields with zero when allocating new objects. It improved performance greatly - there was always plenty of memory bandwidth available. They had asked for Intel to add a similar instruction, but at least back then x86 didn't have anything similar. I don't know how the situation is in recent years.
ARM has DC ZVA, which is essentially 'zero a cache line' (slight oversimplification). There are cases where DC ZVA before writing the cacheline does improve performance.
That being said, many ARM processors also automatically pause linefills for full-cacheline writes if they detect said linefills are unnecessary.
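As a rough illustration of the DC ZVA point, here is a sketch in C with GCC/Clang inline assembly, assuming an AArch64 target where the OS allows DC ZVA in user space. The helper names `zva_block_bytes` and `zero_region` are hypothetical. On other targets, or when the `DCZID_EL0` register reports the instruction as prohibited, it falls back to `memset`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes zeroed by one DC ZVA, or 0 if the instruction is unavailable. */
size_t zva_block_bytes(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & (1u << 4)) {
        return 0; /* DZP bit set: DC ZVA is prohibited */
    }
    return (size_t)4 << (dczid & 0xF); /* BS field: log2 of words per block */
#else
    return 0;
#endif
}

/* Hypothetical helper: zero `len` bytes starting at `buf`. */
void zero_region(void *buf, size_t len)
{
    assert(buf != NULL);
    unsigned char *p = buf;
    size_t block = zva_block_bytes();
#if defined(__aarch64__)
    /* DC ZVA zeroes the whole block containing the address, so only use
     * it when the pointer is block aligned and a full block remains. */
    if (block != 0 && (uintptr_t)p % block == 0) {
        while (len >= block) {
            __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
            p += block;
            len -= block;
        }
    }
#else
    (void)block;
#endif
    memset(p, 0, len); /* tail bytes, or the whole region on fallback */
}
```

As the reply above notes, many cores already skip the line fill for full-line writes on their own, so any gain from doing this by hand should be measured on the actual chip.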
The "computer science lore" joke at the beginning was peak.
😂 I absolutely loved the RAM bus analogy story, and the acronyms. Very entertaining and educational
I can't believe that x86 despite having stcpy as an instruction doesn't have any cache instructions.
It does have some cache instructions, for prefetching and for invalidating a cache line (maybe some more), and also non-temporal loads/writes. Some of them are part of the SSE instruction set extensions.
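For reference, the prefetch and cache-line flush instructions mentioned above are reachable from C through the `_mm_prefetch` and `_mm_clflush` intrinsics. The sketch below assumes an x86 target with SSE2 and a 64-byte cache line. The function names, the line size constant and the prefetch distance of 16 elements are illustrative choices, and the non-x86 branch does nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush, _mm_mfence; also provides _mm_prefetch */
#endif

#define ASSUMED_LINE 64 /* assumed cache line size in bytes */

/* Sum an array while hinting the CPU to fetch data a little ahead. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        if (i + 16 < n) {
            /* Hint only: the CPU is free to ignore it. */
            _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
        }
#endif
        total += data[i];
    }
    return total;
}

/* Write back and evict every cache line that overlaps [p, p + len). */
void evict_range(const void *p, size_t len)
{
#if defined(__SSE2__)
    uintptr_t start = (uintptr_t)p & ~(uintptr_t)(ASSUMED_LINE - 1);
    uintptr_t end = (uintptr_t)p + len;
    for (uintptr_t a = start; a < end; a += ASSUMED_LINE) {
        _mm_clflush((const void *)a);
    }
    _mm_mfence(); /* wait for the flushes to complete */
#else
    (void)p;
    (void)len;
#endif
}
```

`_mm_clflush` writes dirty data back before invalidating the line, so the contents of memory are preserved, which is different from the N64 trick of discarding a dirty line without writing it back.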
@guyg.8529 given things like Rowhammer and Spectre that come from cache manipulation, maybe not giving even more cache control to userspace is a good thing.
Dude you explained really complex thing in really easy to understand way. Great job
Honestly, it gets a bit confusing when you try to make a city metaphor out of everything.
Next video: "the N64 was actually capable of finding the cure for cancer, but no game ever used that feature"
That last one where you can decide whether the cache you wrote should be written back to RAM or not made me think of transactions in a database. I'm sure there's going to be code situations where you generate some data and then keep it or discard it based upon whether it passes some test, although seems niche.
Donation to the RAM bus driver now that he is unemployed
hell yes let's go
Yooo Tyler hi!
Bro, I absolutely LOVE the way you animated this in the N64 engines in style!!!!
"I'm micromanaging more than Jeff Bezos his employee's p breaks" 😂
Kudos on the visual presentation, was very fun to watch!
I saw your comment on the Mario in the Multiverse hack (which I love after I got it to stop crashing on my PC) and am wondering if you will make a video discussing on how you think it's unpolished. Your attention to detail is superb and I think your input (as well as your work here) benefits the Mario 64 mod community tremendously...
i would not want to drop a rovert roast video
@@KazeN64 Is there beef between you and Rovert?
@@MrRaiPlays no
@@KazeN64 oh good, thought I was missing something and struck a nerve... my apologies :)
Wow such an overall great explanation for a CPU and how the internal CPU cache works. This could teach kids in school a lot, it's great!
bro is porting gta 3 to n64 soon i swear
truly one of the memes of all time
dont imagine it
see it
Melody Nosurname song
omgor true :3
This is the most educational "practical programming" channel on youtube.
From cache to cash:
so we have control to cache memory without cost with some constraints ? That is, complex operations can stay in cache for as long as we need before rambus meddling? Can we then use compression as a way to sink the extra cpu idling and virtually increase bandwidth and cache memory?
I always did wonder if the barrel shifter in the Arm CPU on the 3do was meant for efficient data bit packing for LZW and Huffman . That CPU also has cache. MIPS ISA is different.
Incredible work. I've never commented on your channel, but your ability to translate here is phenomenal. You should teach.
Kaze, at this point maybe you could fix Donkey Kong 64 so it works without the Expansion Pak
Ironically, I'm pretty sure this mod requires the expansion pack
@@michawhite7613 rtyi64 doesnt normally require expansion pack, but using it does help with making performance extra stable
I wouldn't be at all surprised, if he wanted to spend the time.
If it's true that the Pak's main function in DK64 is to store cached lighting data, then Kaze could probably just optimize to the point where the N64 can render the lighting in real-time and avoid caching anything.
Iirc, the lighting data only needed to be calculated once, then stored for reference later.
@@ericlizama8552 storing it is slower than calculating i'd imagine
There is a lot of caching in old Source games. Once you load it in and store it, your game always becomes way faster than it was, and it's only a one-time thing! (at least in there)
17:12 what do you mean "Yet".
Every copy of this mod will be personalized
Peak Emanuar once again giving me the exact size video I needed to enjoy my meal 🗣️🗣️🔥
Have any of the N64 hardware devs commented on your videos?
I just came here to say how much I love the mario renders you do for the thumbnails
I had a hard time understanding, but it seems it's simply using the CPU cache as if it were a CPU register?
Yeah, some of the weirder modes make you manage everything manually as if you were using registers, but you can still use indexing and addressing modes for arrays.
1:52, imagine a City... with a framerule bus, and a rambus.... vroom vroom
the url of this video ending on "SLow" tells me youtube doesnt *quite* understand your channel lul
I feel like your scripts are getting better and more entertaining, I enjoy