And you would presumably have improved after 30 years. You might cringe at your 30 year old past self's code, or just have compassion that you did the best you could.
Wild that Nintendo implemented culling for individual rocks but not for the giant cave at the end of a long tunnel, haha. I'd always assumed that culling/loading was the whole reason the tunnel was there.
the tunnel is there to teleport mario around as the whole level including the cave is too large for a sm64 level, so in the tunnel there is an invisible teleporter that brings you back & forth between the two sections
To be fair, N64 is also built on the MIPS architecture while every Nintendo console until then was built on CISC processors. Pipelining and its associated effects on branch penalty and memory access wasn't well-understood at the time except for the few PhDs who designed these
they probably wrote most of the code on a SGI workstation way before they made the n64 hardware, and that workstation probably had much faster memory, so it made sense to optimize the code that way. When they ported the code to the retail hardware it was pretty bad but there was no time left for it and they decided to ship it. Well that’s only a theory, but it makes sense to me
@AntonioBarba_TheKaneB this happened to the Goldeneye team. They were under the assumption that the hardware would be more powerful and the carts would have more storage than it ended up having so they had to cut a lot of stuff and simplify the level geometry
It's pretty likely you're right though. Launch titles for new consoles are never particularly well-optimised exactly for this reason. The devs just don't know what they will be working with in advance...
The evidence up to now is that optimization was limited due to the novelty of the compiler, tools and hardware itself. The schedule was nuts and some of the Mario 64 programmers quit the games industry altogether after burning out.
I can’t imagine how disappointing it would feel to run the game on the new hardware only for it to have massive optimisation issues you know you don’t have enough time to fix.
Moral of the story here goes beyond N64 development: Optimizations don't exist in a vacuum. You need to know where your bottleneck actually is before you can attempt to work around it.
Some other comments said a major issue is that they didn't know the hardware it would end up being ran on. So it would be pretty hard to identify bottlenecks.
I remember watching a talk for cppcon about how virtual function calls (assembly guaranteed dynamic dispatch) on modern machines don't have a notable performance difference from regular function calls (in fact the performance of them is non-deterministic, you're basically flipping a coin on what will be faster, and it depends on so many things its impossible to predict, yes dynamic dispatch can be just as performant) and one of the primary things the speaker was saying was that benchmarking is meaningless, and its things like this that remind me of that.
@@phantom-ri2tg Benchmarks are generally designed to ignore that, compiler behavior is of little concern because what the compiler produces is deterministic in its behavior, and that's the thing we focus on. The problem is when the machine code doesn't perform deterministically, also a compiler of the same version with the same build options on the same platform and CPU will deterministically compile the same input to the same output. In most cases all x86-64 (64-bit Amd/Intel) CPUs will receive the same produced output so unless your CPU uses a different instruction set the CPU doesn't really matter. You can observe this with godbolt directly. Also this aside the CPU is not designed to inherently run the machine code its given, but at minimum it must be guaranteed to produce functionally the same observable results, how it achieves that doesn't actually matter so long as it has no observable difference as product. (like the simplest optimization is rearranging instructions in the CPU to be slightly more performant)
@@phantom-ri2tg Benchmarks are generally designed to ignore that, compiler behavior is of little concern because what the compiler produces is deterministic in its behavior, and that's the thing we focus on. The problem is when the machine code doesn't perform deterministically, also a compiler of the same version with the same build options on the same platform and CPU will deterministically compile the same input to the same output. In most cases all x86-64 (64-bit Amd/Intel) CPUs will receive the same produced output so unless your CPU uses a different instruction set the CPU doesn't really matter. You can observe this with godbolt directly. Also this aside the CPU is not designed to inherently run the machine code its given, but at minimum it must be guaranteed to produce functionally the same observable results, how it achieves that doesn't actually matter so long as it has no observable difference as product. (like the simplest optimization is rearranging instructions in the CPU to be slightly more performant)
I think the main reason Nintendo used memory instead of the CPU for maths was because Nintendo was used to programming the NES and SNES, which both had fairly weak built in CPUs. The reliance on memory for lookup tables was probably instinctual.
And me, a 386SX user hated lookup tables for a long time. Edit to clarify: it is kinda hard to find, but 386 needs two cycles for a lot of stuff, for example reading from memory. So reading the 32bit address of the table already takes 4 cycles. Then the look up takes another 4 thanks to the page miss. Yeah, 8 does not sound too bad. It is just that in 386 Intel really cleaned up MUL and DIV. Every output bit needs only a single cycle, while the CPU fills its instruction queue. Don’t branch directly after math! Mix math with lookups (interleave two threads).
@@thinkublu if I'm interpreting it correctly, it's a reference to the fact that many programmers/people who work with computers technically, tend to eventually end up doing farm work/manual labour later on in life as a way to escape the technology that has caused so much stress for them. Given that Performance Lottery would cause a LOT of stress/confusion, exposure to it would lead to the programmer being more likely to leave society to work on a farm
This video was flagged as made for kids, so I didn't get a notification not did it show up in my subscriptions tab. and the option to enable notifications is greyed out. Good ol TH-cam
well, when uploading you have to choose between NOT made for kids, and made for kids.... its confusing! i also figured that NOT for kids would mean like adult or rude content..... i did not realize that it is actually the normal option
Why in the world would this be marked for kids anyway? It deals with a lot of complicated computer topics that I don’t think kids would understand. Also, Discord screenshots are in the video, too.
it's interesting how relevant some of this is to games today, memory bandwidth is obviously a lot better now but it still is an issue so it's largely still relevant to optimise for data locality and maximise efficient usage of the cpu cache.
That reminds me of how wild it was when the AMD X3D chips came out with massive CPU cache. Some games made massive fps gains from having more cpu cache available.
@@vilian9185 Um, pretty much every modern game engine is designed with efficient data access in mind, data locality is literally why approaches like ECS are being applied. Less memory use is on average beneficial on every current CPU, and the faster the CPU is, the more it is bottlenecked by memory.
@@dycedargselderbrother5353they don't, they sell consoles at a loss, that's why they seems more powerful than a computer a the same price, the perfect example is Steamdeck, that have gains up to 25% just because it don't runs Microsoft shit OS, to be fair older consoles until ps3/xbox 360,yes, they had various advantages, they had custom hardware focused for games and devs had them as priority so they used these hardware features
Looks like the developpers were used to old architectures and had a hard time to optimise for a new modern CPU with pipeline, caches, and so on. All those optimisations were usefull on old processors with simple microarchitectures, were the constraints were different. Unrolling loops, inlining, memoization of matrices, LUTs for trig functions, are optimisations like that. The memory wall did hit hard on that period, the use of a RISC CPU wasn't helping either for the code size.
@@IncognitoActivado wouldn't a "nintendrone" try to save nintendos sorry ass over even the slightest blunder, as opposed to using there mistakes as a case study due to the sheer amount of low hanging fruit?
Well additional data is never free, but geometry not being the bottleneck did not occur to me either. Maybe some wicked design could get low LOD from full LOD on-the-fly?
@@KARLOSPCgame Kaze has already partially re-created Odissey's stages and mechanics inside Mario 64. It's not the same thing, but it does prove that Oddissey can be recreated on the N64, just with simpler graphics.
I have heard stories of western developers being given japanese manuals for hardware and being unable to make sense of them. I wonder if inverse happened here while nintendo was both creating N64 and games for it.
i don't think so, probably was lack of good benchmark tools, if they had what kaze has today where it measure exactly what's is slowing down the rendering(GPU,CPU,MEMORY) they would have saw that memory was always the bottleneck, and as Kaze said the console had to much discrepancy in how it was put together, to fast CPU at the expence of memory throughput
the developers were very new to all of this and the fact they were able to transition from SNES programming to what remains the greatest videogame of all time is a testament to their intellect and dedication.
OP, you're forgetting that manuals didn't exist. This was not just brand new hardware, but a brand new coding paradigm. These people WROTE the manuals.
@@ssl3546 yep this and very limited time. Another year or two would have made a big difference but they didnt have time they were trying to beat everyone
Removing the lod model is like the bell curve of optimization. It's good because Mario doesn't look worse from far away and because the N64 does less work by executing less code.
They did that in OoT also and it's very distracting when you notice it. I was playing a rando the other day, got bored, and was standing at the distance cut off for the low-detail Link model, moving back and forth a bit and saying, "ugly Link, normal Link, ugly Link, normal LInk".
@@tonyhakston536 If you don't know what game optimization is you might want to remove the low poly Mario because it's ugly. If you think you know what game optimization is you might want to keep it because it renders less polygons when you are far away, there for you can't see the difference very well. If you are Kaze Emanuar you might want to remove the low poly Mario script and model because it saves more memory.
@@QuasarEE I'm still waiting for a mod that has your character and held items always use the high quality model. You can only see the high quality models when the camera is smooshed against a wall to bring it closer to you :/
I love how you start the video with over compressed footage of SM64 in an incorrect aspect ratio with ugly filtering in an emulator despite the fact that you clearly know better. Really takes me back to 2010 😭
People think the PS1 was more powerful because it did transparencies, colored lights and additive blending like it was nothing, so when they play the system on an emulator with perspective correction and in high-res the games appear much better than an emulated N64 game.
It also probably helped at the time that the CD format allows PS1 games to have high quality sound and prerendered videos. I'd imagine back in the mid-90s most people had no concept of the difference between a pre-rendered video and "in-engine" as long as it was running on their TV as they played. Stuff like Parasite Eve 2 and Final Fantasy 8, using pre-rendered video behind the player as you actually moved through an environment? On a CRT to hide the compression?? It looks absolutely fucking unbelievably good, like nothing else that generation could remotely achieve. And I think it's actually worth arguing that these benefits still "count" even if for many people prerendered feels "unfair" compared to in-engine. It's the end user experience that matters, not how they pull it off, right? If Parasite Eve 2 can "trick" you into thinking you're walking through a Los Angeles street full of death and destruction using graphics ten years ahead of their time via clever tricks, I don't think that's a lesser accomplishment compared to rendering the street on the fly as best you can like Silent Hill 1.
The PS1 wasn't more powerful, but it was a better balanced system. The N64 is so bandwidth starved that even first party games waste huge amounts of CPU and GPU time doing nothing. Ocarina, which was years later, still spent over half of each frame with the GPU completely stalled waiting for data, which probably leaves it's effective clock rate (doing real work) not far off of the PS1s GPU. Like kaze said in the video, the power was in more advanced graphical features, not raw numbers.
@@jc_dogen The framebuffer as well, it's amazing that if they just made a few better choices hw wise the N64 would have been a much better system, some of the cuts Nintendo did to save a few cents ended up making games for the system much more difficult than it should.
I suspect a lot of this was due to early development on SGI workstations with different performance characteristics than the final hardware, and possibly immature compilers that didn't implement N64 optimizations well. Remember that the "source code" is derived from the decompiled binary, so unrolling loops might have been done by the compiler, possibly assuming different instruction cache characteristics.
It's always better to remove a problem than to add a solution, when possible. For me it usually makes the codebase slimmer, easier to read and easier to develop further.
Considering how bad most modern software is, watching this video about super optimized low-level code is really satisfying. Most features on Windows, for example, run hundreds or even thousands of times slower than they need to. It's a shame that efficient code just isn't made any more.
Yea its a shame if this level of optimisation was did to Windows 10/11 itd run on many generations older hardware, half as much memory and storage all while being quicker
This is the real use i can see for AI. AI is already a powerhouse for coding, fine tune it with code optimization and you can probably boost the performance of regular AAA games by 30-50% without much money spent on expensive optimization programmers. i hope this will become the reality in 3-5 years
I wonder how much Kaze could improve some of the other games with his level of expertise. Imagine a highly optimized Turok on native hardware or any other games. The N64 is one of my favorite consoles.
I really don't understand anything about software programming, but hearing people like you and Pannenkoek talk about it really helps me appreciate the work, passion, and struggles that go into developing a game. I remember being a kid and basically thinking that games just spring out of holes in the ground at Nintendo.
I love your dedication to get the absolute most out of hardware by actually rethinking your conceptualization of the software to match the hardware's capabilities. Most people are so inflexible in their approach to programming, which is why for the most part we still write software for an architecture from the 1980s.
Reminds me of making demos for my Apple IIGS back in the late 90s/early 00s when I was a kid. Look into every trick the hardware allows pulling out as much as the beefed up little iigs with maxed out (at the time) expansion/accelerators will allow. These days I almost exclusively work with the PC88VA3 for demos after the dual architecture (Z80/8086) grew on me along with the rest of the specs/modes. I've never thought about doing the same process with console hardware, kind of makes me want to try it out now.
0:30 i've uhh ""researched"" it thoroughly, and this quote belies the truth only slightly, buut the saturn's CPUs do have a division unit included specifically to accelerate 3D math (as well as the SCU-DSP which was originally intended as the matrix math unit)
Yeah, SEGA persuaded Hitachi that division is important also for other customers. Doom resurrection uses it on Sega32x . Jaguar also has division running in the background. Not sure about 3do . I think that Arm has an implicit output register, which blocks until the result is there .
The way the Saturn handles its 3D effects still wrinkles my brain. Really underappreciated console with some awesome games (Panzer Dragoon Zwei is my fave game of all time).
"Performance Lottery" is a real bitch. I've tried optimizing some code at work, added a custom SLAB allocator, to ensure, all objects are within roughly the same memory region. And now the time hasn't improved at all, because suddenly ANOTHER function caused cache misses. That one was caused by running the destructor on a large number of objects, despite the objects not being used, afterwards. (It was only one int being set to 0). Originally my boss wrote this other code, thinking that reserving 20 elements, will do less allocs. So he created an array with 20 "empty" elements on stack, instead of using std::vector which will likely use malloc. Which sounds so far so good. However the constructor and destructor now runs for 20 elements (plus the amount of actually used elements) instead of 0-3 most of the times. But that the constructor and destructor of mere 20 integers would cause a problem on even modern Clang 14 + ARM64 is something that even I would not have expected. The best benchmarked solution was to use a union to suppress the automated constructor destructor. And even that, gave only like 150ms on 1.6seconds. Which really doesn't seem worth the uglified code, in my opinion. There are a bunch of these micro-optimizations I could make, but they all make the code uglier. And there are a lot more macro-optimizations that would require the code to be completely refactored and have tests written for all of them. Seeing as we need to come to a pre-release version pretty soon, there is not much time for either of these. The initial version of the product will be shipped with the insane startup time of 8 seconds, on our device. And then I will try to figure out how to improve time, once the other bugs are fixed.
Fun fact: Crash Bandicoot, one of the games people probably use when they said PlayStation is more powerful is technically using resources on the console that aren't meant for running games. Yeah you know the joke of "Naughty dog breaks the limits of a console at the end of a generation"? Hun naughty dog started on PlayStation breaking limits and hyper optimizing their games. There is a very interesting documentary on the development of crash, down to artstyle, animation rigging, their study of how PlayStation 1 works, it's really interesting and I advise you to look it up and give it a watch if this fact peeked your curiosity.
@@HowManySmall The PlayStation 1 had segments of ram specifically allocated for running the PlayStation itself, and naughty dog found that not all of that ram was being utilized, so found a way to tap into it to make crash bandicoot 1 run better. So basically they were using dev intended ram and then snipping a bit more ram from the console from a place not intended for devs to use. So that on top of art decisions made such as making crash only colored vertex planes, and using boxes for interactive set dressing allowed them to focus on more complex environments without the use of pre rendering like other early PlayStation titles.
Nice working on 13/15. I am pumped to play Return to Yoshi's Island when it releases. Keep up the good work Kaze and thanks for sharing your deep understanding of the N64 and Super Mario 64.
I feel that Sega understood some of these things very early on. Games like Daytona USA (original Arcade -version) actually withdraw instanty everything out from enviroment that car passed by at the same speed that car goes on. Basically drawing only things directly on front of you, making draw -distance look awesome.
Daytona USA was released in 1994, by the way. And it was already a relatively mature product in the 3D-games world that was being exploited by various Japanese and other world softhouses. SEGA had a lot of experience with 3D. But in the world of Nintendo-themed TH-camrs, Mario 64 released in 1996 suddenly became one of the first 3D games ever made. 🤣
I wouldn't describe the draw distance of Model 2 games as awesome, theres a ton of completely unmasked pop-in and zero use of LODs for backgrounds. The pop-in was also done in fairly large, predetermined chunks, not gradually. I'm a big fan of AM2 and the model 2 hardware but the (visual) success of those games was a combination of incredible art design and obscenely advanced hardware, rather than genius efficient coding.
@@DenkyManner The hardware was good on Model & Model 2, but not pre-eminent. Daytona USA had 32 Bit CPU 25Mhz with 32Bit Co -Processor, and only 8Mbit (1MB) RAM, while resolution was reasonable 496 x 384. -I would say SEGA learned 3D quicker than others, or at least they moved into making polygonal graphics earlier on. Nintendo would not even knew 3D at the time without Argonaut. However, Sega was not that strong on CD -based console 3D. While Nintendo really nailed 3D gameplay/playcontrol as soon as they tried.
Couldn't you partially avoid "performance lottery" in your code by padding out the binary? If you make all of your code cache poorly but run decently, then certainly it can only run better after you undo the padding.
I'm currently taking a microprocessors class in college, and your series optimizing SM64 has helped a ton with my understanding of how the microprocessor interacts with the memory and how machine code works. Thank you!
Putting in that Minecraft Glide Minigame music from the consoles gave me crazy nostalgia for no reason whilst learning about how getting lucky will basically make the game go _vroom vroom._
Inlining, or tricks to prevent branch misses make me wonder if they developed this code on something like an Intel chip with much longer pipelines that respond to some issue much worse than a RISC with shorter pipelines. And LUTs for circular functions may just be a holdover from CPUs with no multiplier. You can approximate a sine really quickly with raw CPU power with a polynomial if you have a multiplier to do it.
Honestly, I wouldn't be terribly surprised. I'm not a video game developer, but I have had to develop code for a product that was not yet developed. The challenge in that scenario is often that you have to work with *something* in order to get your code working at all, and to start building and testing it. If the N64 wasn't available when they started development, it could absolutely have been a "just grab something, we'll adjust later" kind of situation.
My favorite game on the N64 is F-Zero X. That game is really underrated as it's not only an amazing action racer but also a technical showcase for the system. I'm glad that dedicated developers were able to make such a game back in the day and it's still very fun.
This is an amazing video. My friend and I both work at Microsoft and he's doing performance optimization on a C++ codebase. I absolutely love the way you explain and analyze these problems! You have such dedicated passion into understanding and fixing this game. Whenever you have your Return to Yoshi's Island game released, I will be playing it on my 32" Sony Trinitron TV and absolutely enjoying the experience. Looking forward to it!
This video reminds me again why Optimizations are so important. I’m a dev at a AAA company rn, and I quickly learned to design my assets with optimization in mind instead of trying to implement some crazy inefficient shit and try to fix it later lol. The reality is, time is always a factor so giving leeway for other optimizations toward the end of a project is so so crucial…instead of trying to tidy up things that should’ve been lean in the first place. Great video as always!
@@onebigsnowball I mean, the Ridge Racer developers did a turbo mode on the PS1 that made the game run at 60FPS instead of 30, but it came at the cost of some cars being removed and shading being reworked.
IIRC regarding the "PS1 is faster than N64" claim, it's difficult to even say if it could even be true. Even with the most in-depth knowledge you could have about the Playstation 1, you'd barely be able to match 2/3rds of the performance of the N64 CPU, and still had to sacrifice a lot to get there. Even with it's hexa-processor design (CPU, GTE-cop, MDEC-cop, MEM-cop, GPU, SPU), it was still functionally inferior. A few points: - The triplet of rendering processors (CPU, GTE-cop, GPU) only worked with Fixed Point, in either 16bit or 32bit. Many games had to opt for 16bit, and even the games that used 32bit had to limit their levels to relatively tiny areas compared to the N64. Those transition screens or fades between rooms are not by choice, but by necessity to hide the artifacting (vertex snapping/wobbling). - You had to perform shading, world to camera transform, z-ordering, and camera to view transform on the GTE as the GPU had very limited 3D support. No Z-Buffer, and hardly any actual support for 3D, meant that you got even more wobbly textures as a result. "But you got mip-mapping and dithering for free!" - as if anyone actually wanted that, it was needed to hide the artifacts of the PS1 hardware. - Instead of having to worry about Rambus, you have to worry about DMA abuse instead. It is very easily possible to write code that causes 0 FPS on the PS1. DMA is hard. - Cache Trashing is much harder, as you go CPU Cache -> RAM -> CD-ROM Cache -> CD-ROM. That's three misses that have to happen, but if they happen they're so much worse than N64 cache misses. You could easily spend more than a second stuck due to a scratched disc or a bad CD-ROM. - The built-in hardware decoder for videos with direct DMA to the GPU meant that you could use videos directly, and still render on top if needed. AFAIK the N64 does not have video decoding hardware, and the space on the cartridges wasn't exactly good for it either. It's been a while since I made homebrew for it, since it's just not a good console to try and develop for. Might not be entirely accurate anymore, as I wrote this from what I remember. There's a lot more, but these are like the primary ones I ran into when making homebrew. 700MB of CD space means nothing if you can't actually use it well...
Only the mpeg decoder can output true color. 3d acceleration always used a 16 bpp frame buffer. So you say that the PS1 had virtual memory and games used it? I know that N64 has virtual memory and you could write an OS which loads pages from ROM.
Wait, how is Mario 64 "one of the first true 3D games ever"? "True" and "one of" are carrying so much weight in that sentence they may generate a tiny black hole. Even if you're going to discount every racing game since Hard Drivin' in 89, every pseudo 3D fps alll the way up to System Shock and Duke Nukem 3D, every 3D fighting game since Virtua Fighter, bundle Tomb Raider and Quake into that "one of", dismiss anything with locked camera since Alone in the Dark as "not true 3D"... Mechwarrior 1 and 2 were out. Hell, by 1996 there were as many full 3D space shooters based on Star Wars as mainline Mario platformers.
Yeah I always cringe when people don't make really easy clarifications about formative games that aren't actually the first. The one I'd say for Mario 64 i'd say is "The first good 3-d platformer" or "one of the first 3-d platformers, and a launch title". No salt to Kaze, ur cool you've done it how I like previous times, you just forgor or rewrote it funny this time.
I think he meant one of the first 3D platformers as people often say but its gotten purple monkey dishwashere'd into being one of the first 3D games EVER which is absurd
@@KazeN64 This shows you have some credibility at least. But also "one of" does mean not the actual first(and could for sure cover 3 games before it), and leaving out on home console isn't so bad. I feel the intended point stands that 3D was new and very rare when mario 64 came out, and especially when making the game. So while the clarification is good to make I don't feel it's worth being upset about. Also Doom like games are not 3D for sure. So in short Mario 64 is one of the first true 3D games, not really any correction needed.
you do a bit of both, optimization after the fact can be horrendous too. You optimize for each chunk, then optimize the whole. Then disable the first optimizations to see if there is any difference, then release.
You can also just design things properly first time around and have them optimized Novel idea, i kmow The biggest sin these days is the complete lack of optimizations
That was awesome Kaze. Dude, you never cease to amaze me that you're continuing to find more optimisations. You make coding for the n64 really fun to learn. Cheers mate.
Very cool video! Good work on showcasing the various "optimization-attenpts". It's always important to check your optimization ideas against the actual hardware where the software is run. Especially when only targeting one system (as Nintendo did with Mario 64). Loop unrolling for example makes sense on newer CPUs because of their out-of-order nature but on other hardware where cache locality is much more important it's hurting you (as shown in the video).
I really appreciate all you do to truly get the most out of Mario 64. Hopefully one day a version compiling all your fixes can be made that runs basically without breaking a sweat. Keep up the great work!
I love these optimisation videos Kaze, please keep doing them, seeing how mario 64 approached these kinds of things is really useful for game devs even today
Thanks for making these videos, I'm no programmer but I do work in IT so I know a little bit about how code is supposed to work, and its very interesting to see how Mario 64 was coded. I'd like to believe that the poor optimizations in the game most likely happened due to time crunching and stuff, and as people back in the 90s were more limited in the tools that they had access to, it would have taken them a long time to troubleshoot and properly test for things, which is why games used to be much glitchier in the past, but as a result since the devs liked some of these glitches and stuff, we managed to experience a lot of them through cheat codes and stuff, which is what actually made that era of gaming a very interesting experience to grow alongside in.
Another important part of the story is PS2 dev kits bragged about real-time debuggers on live code. As far as my research goes, 1st gen N64 dev kits did not have live debug on active software. Also, it's likely either the intended platform specs were lower or the sga dev kits had lower specs. These cached renders mentioned halfway through the video May have been fail-safes against the code crashing in the dev environment.
@@kiyoskedante yeah, no console dev kit was known to have really good tools overall until the xbox. the ps2 did have some fancy kit with the performance analyzer, but i think only the main CPU had a debugger for years. Just write all your vector unit code bug free lmao
I love this video. Thanks for pointing out that there are tradeoffs and that having more consistent fps is better than just average or max fps. I remember working on a game (not N64) where I could make it hit 100fps but it would drop a lot, or I could have it hit a consistent 60 but would max out at 80fps. It was a case of premature optimization like the examples you pointed out.
Since no one else is answering you, Kaze is making a mod for SM64. If you haven’t checked out any of his other videos, he goes into depth on the programming of Mario and how he is improving the code for performance games for his mod.
I remember from where I liked to search fun facts about random things in the old days of youtube, that both the N64 and the Gamecube were equal or more powerful than their competitors but that their game storage systems (cartdriges and small disks) were limiting factors
It is great to hear that your next Mario 64 mod is near completion. I just hope it can run on PJ64 Version 3.0 or I can find an emulator that it can work on since learning about what happened with version 1.6.
It was actually because the hardware is generally much more efficient than the N64 and Saturn, and many 3D games tended to look less limited. Nowadays, it’s easy to point out the wobbly textures and weak 2D capabilities of the console.
@@solarflare9078it was easy to point out the wobbly textures and jittery pixels back in the day. I did it when I went to my friend's house, playing on the ps1 really jarring (that said a kid who didn't usually play on an n64 would probably have found the blurry textures and widespread use of fog to cover up low draw distance jarring too)
@@solarflare9078lol anyone could point out the insane texture warping on the ps1 from day 1 Same with the n64 textures being a blurry mess Having already played quake on PC, i was not impressed with either, 3d graphics on pc were lightyears ahead of them and made the early 3d consoles look retarded We even knew the n64 controller was a fucken shite too and ruined the enjoyment
@@tediustimmy PS1 also didn't suffer from memory stalling as badly as N64 did. Which at the time must've helped with keeping the perceptible gap between them smaller.
I think they had some issues with sourcing chips coupled with a lack of knowledge for 3D games. 64 bit was only a thing because a 32-bit CPU at the required spec was more expensive / less available
It's clear that a lot of the techniques learned on the SNES were being applied on the N64 like unrolling and inlining which definitely would have been more effective on the older system. Great video!
I'm not really familiar with the N64, but have you accounted for the generally less-developed state of the compilers 30 years ago? I don't doubt that the performance lottery was a thing even then, but I suspect that more naïve algorithms that covered code generation may have made loops less effective, which would be another reason why unrolled loops were deemed more effective in their tests at the time. The ability to use all registers as effectively as possible makes a considerable impact. While the 90s aren't the archaic 70s, C was 'merely' 20 years old at the time, but the first MIPS processor was from 1985, a mere 8 years prior to the start of the n64s development. There's a good chance they weren't using the newest compiler toolchain at the time yet (internet was still very niche!), so I think it would be interesting to see how well an 'unoptimized' version would do when compiled with the tools they had at the time, if that is even feasible.
Yes! I used the exact same compiler and flags they used back in 1996 here. We know this is the same because the unmodified code compiles byte for byte the same
Here I am watching with a what Kaze is achieving with N64 hardware wondering how the history of gaming would be if he got send back in time to work at Nintendo...
To be fair, the 3D programs were also in it's infancy back then as well, Max and Softimage were the top contenders back then, they didn't have current blender with the F3D exporter. Painting vertex colors wasn't probably that visual back then nor did they have such a nice texture library and authoring tools.
Kaze is an extremely skilled programmer for sure, but the N64 and Mario 64 were pretty novel and the constraints of game development made it so that you have a minimum acceptable framerate and then with any extra development time, you'd focus on ensuring minimal bugs or adding more content rather than optimizing the existing content (doesn't necessarily mean all bugs will be fixed though!). While Kaze and other N64/Mario 64 devs managing to do this without the resources of a huge corporation is insanely impressive, it's not like it was realistic to expect the devs at the time to have similar breakthroughs (although there certainly are "cheap" optimizations the Mario 64 devs could have done at the time but I am unsure of whether they would have that drastic).
@@cdj17e yeah, thats why my mind thought it to be funny to imagine him with his knowledge standing on the shoulders of giant prior using that knowledge to help those giants. :)
@@cdj17e I think it cannot be overstated how much programmers like Kaze are standing on the shoulders of the giants who came before them. I imagine that if you threw Kaze back in time, he'd still be a very talented individual, but depending on the state in which you'd send him back, the results would vary immensely. Modern tooling will have inspired a lot of visualizations that let him realize just how unusable the bars used for performance matrixes by native devs were. Just having these impressions and knowledge of the places where the 'pain' is can avoid so much wasted time and rabbit holes. But at the same time, would he be as effective if he was limited to the tools of the time? Nowadays we have so many means for rapid prototyping that allow a quick 3D scene to be whipped up in Blender and inspected with high framerates, but back then the controls for comparable programs would have been clunky, screen updates slow, and overall process not very flexible in how easily it can be prototyped against the existing product. Also don't underestimate the importance of a quick build-test cycle which very likely involved cross-compiling and maybe even taking things out to plug them into a dev kit device. And finally, assuming Kaze got to work on the product back then, he'd no doubt have to deal with superiors who impose a certain vision or have opinions of their own on how development has to happen, as well as deadlines to meet while regularly spending nights at the office (It's Japan, after all.) It would be a huge difference in every aspect in regards to how he is able to approach these development projects now as a hobby of sorts. (I have no clue if and how he monetizes his activities, but it seems quite niche so I'm assuming it's primarily hobby oriented.)
Loop unrolling is a technique commonly used in the demoscene to get the most out of old 8-bit and 16-bit computers like the C64. Maybe the developers here were used to those old techniques (NES used a 6502 just like the C64) even though applying them to N64 was not the right idea and they just didn't know.
I wonder how these 3 laggy spots look like on PAL. Given the framerate caps at 25, maybe the lags were less noticeable? ( unfortunately the game speed isn't compensated anyway, so it's still slow, just more evenly slow probably. )
@@Martyste If I'm not mistaken, Pal version had o2 compiler optimization enabled that original release didn't have. Said compiler setting slightly boosted performance on PAL and Shindou rerelease. Said boost was for PAL's equivalent of 30, meaning peak fps was 25.
I think the biggest lesson in optimizations I got was when I was making a video game for coding practice, and when I was working on a main menu background (Particles behind the screen flying from the bottom to the top) my attempt to prevent the game from loading too many particles by Deleting multiple particles as it created multiple particles caused ALOT of lag My solution to this and to make the particle background work was to actually make it that it created and deleted one particle at a time and instead to make sure there isnt "One particle only at each Y axis" a particle's spawn point would get randomized to the point where particles created BEFORE one particle could arrive after the particle that was created after them
19:20 I think this is the most important thing to know when questioning programming decisions. Previous nintendo consoles didn't have caches at all; before then memory speed was on par with CPU speed and memory accesses would take the same amount of time no matter when they happened. A huge amount of the modern optimizations are centered around cache performance, but I'd be surprised if cache performance had gotten *any* significant attention at the time.
@@mekafinchi yeah, Nintendo was late to the party and ignorant to the outside world. Probably did not allow experienced ARM coders from Archimedes to come in. Did not pay to get mentor with experience with Sun, Fuji, or SGI servers. Don’t go to trade shows. Don’t learn about profilers and instrumentation.
And at that time, even if the console had cache, no one know to to optimize for a cache memory. I'm pretty sure it's something that appaear later. For example, the Michael Abrash books about optimisation and assembly were very light about cache optimisation and most cache advice didn't care avout code size (because of the CISC x86, i know), and say nothing about memory bandwitch beside wait state. Everything is about the PIQ, teh wait state, the DRAM Refresh, the instruction, calculations, and so on... But the processor of that time had caches !
Nice, but please stop comparing FPS numbers. Use milliseconds so optimization gains can be compared. “Improved by two frames per second” means different things based on where you started.
Dude, if you think it's crazy that some people think the PS1 is more powerful than the N64, I wonder how you feel when so many people now claim the Saturn was more powerful and capable than the N64. Also, we seriously need more people like you in the SNES development scene really optimizing that system and pushing it properly to its limits imo, because there's a lot of games on that system that have room for improvement like this to be honest. And, in the right hands, I genuinely think most of the SNES games that suffer any slowdown could be running at a pretty solid 60fps. Not only that, but it would cool to see some of those game pushing the system even further too, and really showing off what it's capable of. I can only imagine what you might be able to bring to optimize games like Star Fox or Doom on SNES, never mind just the more typical 2D games there.
@@KingKrouch I'm absolutely sure the Saturn could run Doom at 60fps if it doesn't already. I mean, haven't people got it running at 60fps on some of the older consoles already like the 32X or whatever? I swear I read that somewhere. Now I'm curious, does Doom 64 run at 60fps?
A good example that proves what you said is a recent romhack for Ranma 1/2 Chougi Ranbu Hen, its one of the most poorly optimized fighting games on the SNES, it's an otherways great game but it runs like complete dogwater. Recently a user named upsilandre did a partial rewrite of the game, heavily optimizing the code and got it running at a faultless 60FPS. It further frustrates me that the majority of notoriously sluggish SNES games that earned the console a reputation were not really the fault of the console, but rather developers being cheapasses and using SlowROM chips. Kandowontu's been hacking SNES games for a while now, converting them over to FastROM and this alone has yielded significant performance improvements, removing most, if not all the rampant slowdowns in a ton of games. Manfred Trenz in one game with no expansion chips pretty much shamed every SNES dev with Rendering Ranger R2, so the whole console really deserves a redemption arc.
Nice vid. For the audience it would be beneficial to talks about performance as a resource measured in milliseconds. Saving "one fps" is very different at 10 fps vs 60 fps. Thanks for the vid 👍
Have Kaze taught about releasing an optimized Mario 64 before his original n64 game? That would gather publicity to make the release of his game more anticipated. As always fascinating to hear an expert talking about n64 programming even if 90% of the stuff flys over my head. Kaze is the n64 Carmack!
Loop unrolling and inlining are not a great idea when your cpu is much faster than your memory. They probably thought the rambus memory was more performant than it turned out to be in reality. Older 16 bit and especially 8 bit systems had fairly balanced ram to cpu performance characteristics because some cpu instructions could be really expensive and memory latency was low compared to cpu speed. RISC cpus like the N64 used had good overall IPC much better than the common 16 bit cpus of the time. No doubt Nintendo's programmers were just not familiar enough with programming a RISC platform.
The biggest problem of the N64 hardware has always been the slow memory. If Nintendo had just given that console way faster memory, it would have destroyed everything else on the market. That was a pretty bad hardware decision. The memory was the bottleneck for everything. The CPU and the GPU could hardly ever show their full potential as most of the time they were waiting for the memory to provide requested data and thus had to idle. If you can optimize your memory access in such a way that the memory keeps pumping data at maximum speed, demos have shown that you can easily use textures 10 times the size and still get better frame rates on average. So not optimization is the problem but the wrong kind of optimization. Optimization is often a trade-off between CPU time and memory storage. You can re-calculate values or you can cache them. In case of the N64, recalculation is often the way to go as that is faster than accessing cached data in memory. Actually that's true for many modern systems as well. Modern CPUs perform an addition in on clock cycle, a multiplication in on clock cycle and operations can sometimes also overlap (so if you run 10 operations, each requiring one clock cycle, you may have the final result in just 6 clock cycles as not all operations have to wait for the last one to finish). In the end, re-calculating a value may cost you 12 clock cycles but fetching that same value from cache may cost you 20 (1st level) to 60 clock cycles (2nd level) and fetching it from memory may cost you over 200 clock cycles. But optimizations work both ways. So instead of storing something and re-use it later, replacing that with code that intentionally re-calculates it is also an optimization. One that doesn't seem intuitive but can in fact make the code faster and that's what optimization is all about, right?
ACTUALLY! I used to think that too, but just recently I've ran into RSP bottlenecks. I have optimized my memory throughput in such a way that my CPU idles around 76% of the time. At that point, the RSP (=GPU) does become an issue. I might make a video about that soon. Sauraen is now working on a new microcode to fix some RSP bottlenecks.
@@KazeN64 But is the RSP really not able to keep up with the data and it's not the memory again that cannot provide vertex or texture data fast enough? After all you have proven in your other video that the RSP is a beast when it comes to processing vertices and it is also a beast when it comes to processing textures: (tried with link to Sf036fO-ZUk but apparently TH-cam filtered the reply because of the link) Usually the main problem why you cannot just blow up vertex count or texture size is because at some point the RSP gets limited by memory again, so you must be pushing it really hard if you can get it become the bottleneck by itself.
@@xcoder1122 Yeah, it's confirmed. It was the actual RSP cycles that were the limitting factor. Of course, improving memory would still reduce the RSP wait cycles so it's not entirely useless to do - but a 20% increase in RSP cycles was pretty much exactly a 20% increase in frametime.
@@KazeN64 This sounds like a very interesting topic. I'm looking forward to hearing some more technical details about it. I just subscribed to your newsletter so I won't miss it.
I have a bit of a hot take: some of the reason people think the n64 is worse is that the PS1 had an *immense* amount of space for game assets. The PS1 had 10x the amount of space for textures, and that does make a difference! That's why most PS1 games I remember have a much higher texture fidelity, for example. Of course, that also has drawbacks, but overall I think that contributed a lot to the sentiment that the PS1 is more powerful. Because in that one aspect, it truly was a beast.
The part about culling was super interesting to me, as I always wondered whether we can help the N64 perform better by culling on the CPU instead of idling, so the GPU has less work to do when it has to draw the next frame
I think a lot of people really underappreciate the work that went into making a great game like sm64. Yea the code is not perfect, and I am not a coder myself, but I'm sure even Kaze would agree that they did an amazing job with the time and knowledge that they had available.
My nightmare is to have someone 30 years in the future totally roast my code
At least you can just roast em back with "At least I set the standard for the time"
And you would presumably have improved after 30 years. You might cringe at your 30 year old past self's code, or just have compassion that you did the best you could.
I mean llet's be honest thers fat chance somone will, and ther's fat chance that person will be you.
"That was 30 years ago, I'm wiser now" 👍
@@DrsJacksonnproceeds to make the same mistake 30yrs later
Wild that Nintendo implemented culling for individual rocks but not for the giant cave at the end of a long tunnel, haha. I'd always assumed that culling/loading was the whole reason the tunnel was there.
Maybe it was but one hand didn't speak to the other.
Cool pfp
For the cave they could've just split the level in two, just like they did with Dire Dire Docks and Wet Dry World
the tunnel is there to teleport mario around as the whole level including the cave is too large for a sm64 level, so in the tunnel there is an invisible teleporter that brings you back & forth between the two sections
I guess we really just watch the same videos at this point @@LavaCreeperPeople
The code needed to optimize being larger than the issue its trying to optimize feels like some kind of punishment from greek myth
Yes
Nintendo mythology.
apparently this is a somewhat common trap in programming
@@herrabanani yeah it is, funny to see it described this way
To be fair, N64 is also built on the MIPS architecture while every Nintendo console until then was built on CISC processors. Pipelining and its associated effects on branch penalty and memory access wasn't well-understood at the time except for the few PhDs who designed these
they probably wrote most of the code on a SGI workstation way before they made the n64 hardware, and that workstation probably had much faster memory, so it made sense to optimize the code that way. When they ported the code to the retail hardware it was pretty bad but there was no time left for it and they decided to ship it. Well that’s only a theory, but it makes sense to me
steve jobs, peering over a NeXt workstation display: am i a joke to you
@AntonioBarba_TheKaneB this happened to the Goldeneye team. They were under the assumption that the hardware would be more powerful and the carts would have more storage than it ended up having so they had to cut a lot of stuff and simplify the level geometry
It's pretty likely you're right though. Launch titles for new consoles are never particularly well-optimised exactly for this reason. The devs just don't know what they will be working with in advance...
The evidence up to now is that optimization was limited due to the novelty of the compiler, tools and hardware itself. The schedule was nuts and some of the Mario 64 programmers quit the games industry altogether after burning out.
I can’t imagine how disappointing it would feel to run the game on the new hardware only for it to have massive optimisation issues you know you don’t have enough time to fix.
Moral of the story here goes beyond N64 development: Optimizations don't exist in a vacuum. You need to know where your bottleneck actually is before you can attempt to work around it.
I always wondered if this ever actually happened, where optimization backfired by being done poorly lol
Some other comments said a major issue is that they didn't know the hardware it would end up being ran on. So it would be pretty hard to identify bottlenecks.
I remember watching a talk for cppcon about how virtual function calls (assembly guaranteed dynamic dispatch) on modern machines don't have a notable performance difference from regular function calls (in fact the performance of them is non-deterministic, you're basically flipping a coin on what will be faster, and it depends on so many things its impossible to predict, yes dynamic dispatch can be just as performant) and one of the primary things the speaker was saying was that benchmarking is meaningless, and its things like this that remind me of that.
@@phantom-ri2tg Benchmarks are generally designed to ignore that, compiler behavior is of little concern because what the compiler produces is deterministic in its behavior, and that's the thing we focus on. The problem is when the machine code doesn't perform deterministically, also a compiler of the same version with the same build options on the same platform and CPU will deterministically compile the same input to the same output. In most cases all x86-64 (64-bit Amd/Intel) CPUs will receive the same produced output so unless your CPU uses a different instruction set the CPU doesn't really matter. You can observe this with godbolt directly. Also this aside the CPU is not designed to inherently run the machine code its given, but at minimum it must be guaranteed to produce functionally the same observable results, how it achieves that doesn't actually matter so long as it has no observable difference as product. (like the simplest optimization is rearranging instructions in the CPU to be slightly more performant)
@@phantom-ri2tg Benchmarks are generally designed to ignore that, compiler behavior is of little concern because what the compiler produces is deterministic in its behavior, and that's the thing we focus on. The problem is when the machine code doesn't perform deterministically, also a compiler of the same version with the same build options on the same platform and CPU will deterministically compile the same input to the same output. In most cases all x86-64 (64-bit Amd/Intel) CPUs will receive the same produced output so unless your CPU uses a different instruction set the CPU doesn't really matter. You can observe this with godbolt directly. Also this aside the CPU is not designed to inherently run the machine code its given, but at minimum it must be guaranteed to produce functionally the same observable results, how it achieves that doesn't actually matter so long as it has no observable difference as product. (like the simplest optimization is rearranging instructions in the CPU to be slightly more performant)
I think the main reason Nintendo used memory instead of the CPU for maths was because Nintendo was used to programming the NES and SNES, which both had fairly weak built in CPUs. The reliance on memory for lookup tables was probably instinctual.
And me, a 386SX user hated lookup tables for a long time.
Edit to clarify: it is kinda hard to find, but 386 needs two cycles for a lot of stuff, for example reading from memory. So reading the 32bit address of the table already takes 4 cycles. Then the look up takes another 4 thanks to the page miss. Yeah, 8 does not sound too bad. It is just that in 386 Intel really cleaned up MUL and DIV. Every output bit needs only a single cycle, while the CPU fills its instruction queue. Don’t branch directly after math! Mix math with lookups (interleave two threads).
fun fact: exposure to performance lottery can result in a shift to long term agricultural work 👀
Ha, I get that joke!
I don't get it
😂😂
Honestly I' m so curious about the joke here
@@thinkublu if I'm interpreting it correctly, it's a reference to the fact that many programmers/people who work with computers technically, tend to eventually end up doing farm work/manual labour later on in life as a way to escape the technology that has caused so much stress for them.
Given that Performance Lottery would cause a LOT of stress/confusion, exposure to it would lead to the programmer being more likely to leave society to work on a farm
The most 80's looking japanese programmer imaginable is staring dead eyed at this video while chainsmoking
This video was flagged as made for kids, so I didn't get a notification not did it show up in my subscriptions tab. and the option to enable notifications is greyed out. Good ol TH-cam
Kids these days are just interested in N64 code optimizations
smh
well, when uploading you have to choose between NOT made for kids, and made for kids.... its confusing! i also figured that NOT for kids would mean like adult or rude content..... i did not realize that it is actually the normal option
Why in the world would this be marked for kids anyway? It deals with a lot of complicated computer topics that I don’t think kids would understand. Also, Discord screenshots are in the video, too.
@@raafmaat TH-cam likes to "help" creators by forcing the option on sometimes
i didn't mark it for kids. but it looks like it fixed itself now?
it's interesting how relevant some of this is to games today, memory bandwidth is obviously a lot better now but it still is an issue so it's largely still relevant to optimise for data locality and maximise efficient usage of the cpu cache.
of couse not, no game today optimize of data locality, how you going to optimize for 300 times of cpu different
That reminds me of how wild it was when the AMD X3D chips came out with massive CPU cache. Some games made massive fps gains from having more cpu cache available.
@@vilian9185 Um, pretty much every modern game engine is designed with efficient data access in mind, data locality is literally why approaches like ECS are being applied. Less memory use is on average beneficial on every current CPU, and the faster the CPU is, the more it is bottlenecked by memory.
@@vilian9185 Ever wonder why consoles get a lot closer to higher spec PCs than it seems they should?
@@dycedargselderbrother5353they don't, they sell consoles at a loss, that's why they seems more powerful than a computer a the same price, the perfect example is Steamdeck, that have gains up to 25% just because it don't runs Microsoft shit OS, to be fair older consoles until ps3/xbox 360,yes, they had various advantages, they had custom hardware focused for games and devs had them as priority so they used these hardware features
Looks like the developpers were used to old architectures and had a hard time to optimise for a new modern CPU with pipeline, caches, and so on. All those optimisations were usefull on old processors with simple microarchitectures, were the constraints were different. Unrolling loops, inlining, memoization of matrices, LUTs for trig functions, are optimisations like that. The memory wall did hit hard on that period, the use of a RISC CPU wasn't helping either for the code size.
@@IncognitoActivado wouldn't a "nintendrone" try to save nintendos sorry ass over even the slightest blunder, as opposed to using there mistakes as a case study due to the sheer amount of low hanging fruit?
@@malachigv nintendo sucks anyway, so no.
@@IncognitoActivadoI dont know what you think you are acomplishing by not reading the comment and just insulting the person lol
@@usernametaken017 You assumed that they thought before typing.
Do they still make 32 bit RISC CPUs?
Never in a million years would I have thought removing Mario’s LOD model would actually have performance benefits.
Well additional data is never free, but geometry not being the bottleneck did not occur to me either.
Maybe some wicked design could get low LOD from full LOD on-the-fly?
@@musaran2 may i introduce you to: unrelated engine 5?
@@musaran2WAIT NO UNREAL UNREAL!
@@chickendoodle32 lel
Regarding unreal 5 and LODs: nanite is far from the silver bullet it is sold as.
Evil Kaze in a parallel universe:
How I Optimized Mario 64 to Run at
Bethesda Kaze
Running Odyssey on the N64 at the cost of being playable
To answer that, we need to talk about parallel universes.
@@KARLOSPCgame The gameplay can certainly be ported. The question is how much of the aesthetics can it preserve.
@@KARLOSPCgame Kaze has already partially re-created Odissey's stages and mechanics inside Mario 64.
It's not the same thing, but it does prove that Oddissey can be recreated on the N64, just with simpler graphics.
I have heard stories of western developers being given japanese manuals for hardware and being unable to make sense of them. I wonder if inverse happened here while nintendo was both creating N64 and games for it.
i don't think so, probably was lack of good benchmark tools, if they had what kaze has today where it measure exactly what's is slowing down the rendering(GPU,CPU,MEMORY) they would have saw that memory was always the bottleneck, and as Kaze said the console had to much discrepancy in how it was put together, to fast CPU at the expence of memory throughput
I think that happened with the Sega Saturn, not the N64.
the developers were very new to all of this and the fact they were able to transition from SNES programming to what remains the greatest videogame of all time is a testament to their intellect and dedication.
OP, you're forgetting that manuals didn't exist. This was not just brand new hardware, but a brand new coding paradigm. These people WROTE the manuals.
@@ssl3546 yep this and very limited time. Another year or two would have made a big difference but they didnt have time they were trying to beat everyone
Removing the lod model is like the bell curve of optimization. It's good because Mario doesn't look worse from far away and because the N64 does less work by executing less code.
They did that in OoT also and it's very distracting when you notice it. I was playing a rando the other day, got bored, and was standing at the distance cut off for the low-detail Link model, moving back and forth a bit and saying, "ugly Link, normal Link, ugly Link, normal LInk".
... How is that at all a bell curve?
@@tonyhakston536 If you don't know what game optimization is you might want to remove the low poly Mario because it's ugly.
If you think you know what game optimization is you might want to keep it because it renders less polygons when you are far away, there for you can't see the difference very well.
If you are Kaze Emanuar you might want to remove the low poly Mario script and model because it saves more memory.
Agreed.
@@QuasarEE I'm still waiting for a mod that has your character and held items always use the high quality model. You can only see the high quality models when the camera is smooshed against a wall to bring it closer to you :/
I love how you start the video with over compressed footage of SM64 in an incorrect aspect ratio with ugly filtering in an emulator despite the fact that you clearly know better. Really takes me back to 2010 😭
I read this comment before starting the video and still recoiled.
Needs a BANDICAM watermark.
@@SeanCMonahan nah not old enough
Fraps watermark
@@standoidontwantalastname6500no, Hypercam 2
@@standoidontwantalastname6500Unregistered Hypercam 2
People think the PS1 was more powerful because it did transparencies, colored lights and additive blending like it was nothing, so when they play the system on an emulator with perspective correction and in high-res the games appear much better than an emulated N64 game.
So for games that used those effects - it was significantly more powerful. for games that didn't - it wasnt.
It also probably helped at the time that the CD format allows PS1 games to have high quality sound and prerendered videos. I'd imagine back in the mid-90s most people had no concept of the difference between a pre-rendered video and "in-engine" as long as it was running on their TV as they played. Stuff like Parasite Eve 2 and Final Fantasy 8, using pre-rendered video behind the player as you actually moved through an environment? On a CRT to hide the compression?? It looks absolutely fucking unbelievably good, like nothing else that generation could remotely achieve. And I think it's actually worth arguing that these benefits still "count" even if for many people prerendered feels "unfair" compared to in-engine. It's the end user experience that matters, not how they pull it off, right? If Parasite Eve 2 can "trick" you into thinking you're walking through a Los Angeles street full of death and destruction using graphics ten years ahead of their time via clever tricks, I don't think that's a lesser accomplishment compared to rendering the street on the fly as best you can like Silent Hill 1.
I just can't stand the vertex wobble on the PS1
The PS1 wasn't more powerful, but it was a better balanced system. The N64 is so bandwidth starved that even first party games waste huge amounts of CPU and GPU time doing nothing. Ocarina, which was years later, still spent over half of each frame with the GPU completely stalled waiting for data, which probably leaves it's effective clock rate (doing real work) not far off of the PS1s GPU.
Like kaze said in the video, the power was in more advanced graphical features, not raw numbers.
@@jc_dogen The framebuffer as well, it's amazing that if they just made a few better choices hw wise the N64 would have been a much better system, some of the cuts Nintendo did to save a few cents ended up making games for the system much more difficult than it should.
I suspect a lot of this was due to early development on SGI workstations with different performance characteristics than the final hardware, and possibly immature compilers that didn't implement N64 optimizations well.
Remember that the "source code" is derived from the decompiled binary, so unrolling loops might have been done by the compiler, possibly assuming different instruction cache characteristics.
moral of the story:
Premature optimize, don't benchmark anything at all and sleep well at night knowing you did the best you could
Actually maybe don't program at all because in 30 years people could make a video roasting your code
It's always better to remove a problem than to add a solution, when possible.
For me it usually makes the codebase slimmer, easier to read and easier to develop further.
I also like to remove solutions. We really ought to wrap this all up, get back to playing solitaire in the computer closet
seeing those practically unusable profiler bars really puts into perspective how single digit frame optimizations could have been overlooked lol
Considering how bad most modern software is, watching this video about super optimized low-level code is really satisfying.
Most features on Windows, for example, run hundreds or even thousands of times slower than they need to. It's a shame that efficient code just isn't made any more.
you could always sacrifice your sanity and become a firmware engineer , the low level never went away lol
@@ante646 Fair point!
Yea its a shame if this level of optimisation was did to Windows 10/11 itd run on many generations older hardware, half as much memory and storage all while being quicker
@@ante646 or run linux
This is the real use i can see for AI.
AI is already a powerhouse for coding, fine tune it with code optimization and you can probably boost the performance of regular AAA games by 30-50% without much money spent on expensive optimization programmers.
i hope this will become the reality in 3-5 years
I wonder how much Kaze could improve some of the other games with his level of expertise. Imagine a highly optimized Turok on native hardware or any other games. The N64 is one of my favorite consoles.
@@Tony78432 Perfect Dark and Goldeneye honestly need it more than first Turok.
i went to see him work in zelda games
M64 is a launch game, so the later ones probably had better optimizations already
That sounds awesome.
@@joebidenVEVO No, not really.
Huge respect for including a git repo!
To think that memory bandwidth prevented this console from benefiting from conventional practice and flying at incredible speed...
Bad coding is still bad.
@@IncognitoActivadono shit
I really don't understand anything about software programming, but hearing people like you and Pannenkoek talk about it really helps me appreciate the work, passion, and struggles that go into developing a game. I remember being a kid and basically thinking that games just spring out of holes in the ground at Nintendo.
I love your dedication to get the absolute most out of hardware by actually rethinking your conceptualization of the software to match the hardware's capabilities. Most people are so inflexible in their approach to programming, which is why for the most part we still write software for an architecture from the 1980s.
To be fair they had just invented it
Reminds me of making demos for my Apple IIGS back in the late 90s/early 00s when I was a kid. Look into every trick the hardware allows pulling out as much as the beefed up little iigs with maxed out (at the time) expansion/accelerators will allow. These days I almost exclusively work with the PC88VA3 for demos after the dual architecture (Z80/8086) grew on me along with the rest of the specs/modes.
I've never thought about doing the same process with console hardware, kind of makes me want to try it out now.
0:30 i've uhh ""researched"" it thoroughly, and this quote belies the truth only slightly, buut the saturn's CPUs do have a division unit included specifically to accelerate 3D math (as well as the SCU-DSP which was originally intended as the matrix math unit)
Yeah, SEGA persuaded Hitachi that division is important also for other customers. Doom resurrection uses it on Sega32x . Jaguar also has division running in the background. Not sure about 3do . I think that Arm has an implicit output register, which blocks until the result is there .
awesome pfp
The way the Saturn handles its 3D effects still wrinkles my brain. Really underappreciated console with some awesome games (Panzer Dragoon Zwei is my fave game of all time).
"Performance Lottery" is a real bitch.
I've tried optimizing some code at work,
added a custom SLAB allocator, to ensure, all objects are within roughly the same memory region.
And now the time hasn't improved at all, because suddenly ANOTHER function caused cache misses.
That one was caused by running the destructor on a large number of objects, despite the objects not being used, afterwards.
(It was only one int being set to 0).
Originally my boss wrote this other code, thinking that reserving 20 elements, will do less allocs.
So he created an array with 20 "empty" elements on stack, instead of using std::vector which will likely use malloc.
Which sounds so far so good. However the constructor and destructor now runs for 20 elements (plus the amount of actually used elements) instead of 0-3 most of the times.
But that the constructor and destructor of mere 20 integers would cause a problem on even modern Clang 14 + ARM64
is something that even I would not have expected.
The best benchmarked solution was to use a union to suppress the automated constructor destructor.
And even that, gave only like 150ms on 1.6seconds. Which really doesn't seem worth the uglified code, in my opinion.
There are a bunch of these micro-optimizations I could make, but they all make the code uglier.
And there are a lot more macro-optimizations that would require the code to be completely refactored and have tests written for all of them.
Seeing as we need to come to a pre-release version pretty soon, there is not much time for either of these.
The initial version of the product will be shipped with the insane startup time of 8 seconds, on our device.
And then I will try to figure out how to improve time, once the other bugs are fixed.
I'm looking forward to playing your completely optimized original SM64 on real hardware, I hope it comes out soon!!
Im pleasantly surprised at how freaking fun that sounds. Ima need to dust of the ol 64
@@HashCracker it's been a few years since he said he'd do it...i'm hoping it comes out soon!
Fun fact: Crash Bandicoot, one of the games people probably use when they said PlayStation is more powerful is technically using resources on the console that aren't meant for running games. Yeah you know the joke of "Naughty dog breaks the limits of a console at the end of a generation"? Hun naughty dog started on PlayStation breaking limits and hyper optimizing their games. There is a very interesting documentary on the development of crash, down to artstyle, animation rigging, their study of how PlayStation 1 works, it's really interesting and I advise you to look it up and give it a watch if this fact peeked your curiosity.
What specifically are they doing
@@HowManySmall The PlayStation 1 had segments of ram specifically allocated for running the PlayStation itself, and naughty dog found that not all of that ram was being utilized, so found a way to tap into it to make crash bandicoot 1 run better. So basically they were using dev intended ram and then snipping a bit more ram from the console from a place not intended for devs to use. So that on top of art decisions made such as making crash only colored vertex planes, and using boxes for interactive set dressing allowed them to focus on more complex environments without the use of pre rendering like other early PlayStation titles.
Nice working on 13/15. I am pumped to play Return to Yoshi's Island when it releases. Keep up the good work Kaze and thanks for sharing your deep understanding of the N64 and Super Mario 64.
I feel that Sega understood some of these things very early on. Games like Daytona USA (original Arcade -version) actually withdraw instanty everything out from enviroment that car passed by at the same speed that car goes on. Basically drawing only things directly on front of you, making draw -distance look awesome.
Daytona USA was released in 1994, by the way. And it was already a relatively mature product in the 3D-games world that was being exploited by various Japanese and other world softhouses. SEGA had a lot of experience with 3D.
But in the world of Nintendo-themed TH-camrs, Mario 64 released in 1996 suddenly became one of the first 3D games ever made. 🤣
@@jpa3974 Cutting teeth on Virtua Racer must've helped.
AM2 was just built different. The rest of Sega? Not so much.
I wouldn't describe the draw distance of Model 2 games as awesome, theres a ton of completely unmasked pop-in and zero use of LODs for backgrounds. The pop-in was also done in fairly large, predetermined chunks, not gradually. I'm a big fan of AM2 and the model 2 hardware but the (visual) success of those games was a combination of incredible art design and obscenely advanced hardware, rather than genius efficient coding.
@@DenkyManner The hardware was good on Model & Model 2, but not pre-eminent.
Daytona USA had 32 Bit CPU 25Mhz with 32Bit Co -Processor, and only 8Mbit (1MB) RAM, while resolution was reasonable 496 x 384.
-I would say SEGA learned 3D quicker than others, or at least they moved into making polygonal graphics earlier on. Nintendo would not even knew 3D at the time without Argonaut.
However, Sega was not that strong on CD -based console 3D.
While Nintendo really nailed 3D gameplay/playcontrol as soon as they tried.
Couldn't you partially avoid "performance lottery" in your code by padding out the binary? If you make all of your code cache poorly but run decently, then certainly it can only run better after you undo the padding.
i've avoidied perf lottery entirely in my game yeah, but that requires some more optimization first.
almost entirely*
IMO it's less about generalized padding, more about avoiding moving things through recompiles.
I'm currently taking a microprocessors class in college, and your series optimizing SM64 has helped a ton with my understanding of how the microprocessor interacts with the memory and how machine code works. Thank you!
This channel and the dude making Mario 64 demake on the GBA are prime content
Putting in that Minecraft Glide Minigame music from the consoles gave me crazy nostalgia for no reason whilst learning about how getting lucky will basically make the game go _vroom vroom._
Inlining, or tricks to prevent branch misses make me wonder if they developed this code on something like an Intel chip with much longer pipelines that respond to some issue much worse than a RISC with shorter pipelines. And LUTs for circular functions may just be a holdover from CPUs with no multiplier. You can approximate a sine really quickly with raw CPU power with a polynomial if you have a multiplier to do it.
Honestly, I wouldn't be terribly surprised.
I'm not a video game developer, but I have had to develop code for a product that was not yet developed. The challenge in that scenario is often that you have to work with *something* in order to get your code working at all, and to start building and testing it.
If the N64 wasn't available when they started development, it could absolutely have been a "just grab something, we'll adjust later" kind of situation.
@@aldproductions2301RISC CPUs were available.
My favorite game on the N64 is F-Zero X. That game is really underrated as it's not only an amazing action racer but also a technical showcase for the system. I'm glad that dedicated developers were able to make such a game back in the day and it's still very fun.
Pod racer came close in terms of pure speed, but yeah that one was fun
The optimisations are so good to the point where fps gets an extremely dramatic boost and it even overflows to just about somewhere under 30 fps.
This is an amazing video. My friend and I both work at Microsoft and he's doing performance optimization on a C++ codebase. I absolutely love the way you explain and analyze these problems! You have such dedicated passion into understanding and fixing this game. Whenever you have your Return to Yoshi's Island game released, I will be playing it on my 32" Sony Trinitron TV and absolutely enjoying the experience. Looking forward to it!
I have learned more about programming from this series than any other place ... Ever
Thanks
This video reminds me again why Optimizations are so important. I’m a dev at a AAA company rn, and I quickly learned to design my assets with optimization in mind instead of trying to implement some crazy inefficient shit and try to fix it later lol. The reality is, time is always a factor so giving leeway for other optimizations toward the end of a project is so so crucial…instead of trying to tidy up things that should’ve been lean in the first place. Great video as always!
Again I'm amazed what decades of hindsight and a known scope can do to a codebase.
lets goo! another Kaze optimization video
I would send all this guy’s videos back in time to the developers.
People forget that they were trying to release a game and not some tech demo
@@onebigsnowball I mean, the Ridge Racer developers did a turbo mode on the PS1 that made the game run at 60FPS instead of 30, but it came at the cost of some cars being removed and shading being reworked.
They probably would not be able to watch them lol
@@onebigsnowball Are you salty, nintendrone?
@@ares395 They kinda sucks anyway.
IIRC regarding the "PS1 is faster than N64" claim, it's difficult to even say if it could even be true. Even with the most in-depth knowledge you could have about the Playstation 1, you'd barely be able to match 2/3rds of the performance of the N64 CPU, and still had to sacrifice a lot to get there. Even with it's hexa-processor design (CPU, GTE-cop, MDEC-cop, MEM-cop, GPU, SPU), it was still functionally inferior. A few points:
- The triplet of rendering processors (CPU, GTE-cop, GPU) only worked with Fixed Point, in either 16bit or 32bit. Many games had to opt for 16bit, and even the games that used 32bit had to limit their levels to relatively tiny areas compared to the N64. Those transition screens or fades between rooms are not by choice, but by necessity to hide the artifacting (vertex snapping/wobbling).
- You had to perform shading, world to camera transform, z-ordering, and camera to view transform on the GTE as the GPU had very limited 3D support. No Z-Buffer, and hardly any actual support for 3D, meant that you got even more wobbly textures as a result. "But you got mip-mapping and dithering for free!" - as if anyone actually wanted that, it was needed to hide the artifacts of the PS1 hardware.
- Instead of having to worry about Rambus, you have to worry about DMA abuse instead. It is very easily possible to write code that causes 0 FPS on the PS1. DMA is hard.
- Cache Trashing is much harder, as you go CPU Cache -> RAM -> CD-ROM Cache -> CD-ROM. That's three misses that have to happen, but if they happen they're so much worse than N64 cache misses. You could easily spend more than a second stuck due to a scratched disc or a bad CD-ROM.
- The built-in hardware decoder for videos with direct DMA to the GPU meant that you could use videos directly, and still render on top if needed. AFAIK the N64 does not have video decoding hardware, and the space on the cartridges wasn't exactly good for it either.
It's been a while since I made homebrew for it, since it's just not a good console to try and develop for. Might not be entirely accurate anymore, as I wrote this from what I remember. There's a lot more, but these are like the primary ones I ran into when making homebrew. 700MB of CD space means nothing if you can't actually use it well...
Only the mpeg decoder can output true color. 3d acceleration always used a 16 bpp frame buffer.
So you say that the PS1 had virtual memory and games used it? I know that N64 has virtual memory and you could write an OS which loads pages from ROM.
"But can it run Crysis?"
Kaze: "Hold my beer."
Wait, how is Mario 64 "one of the first true 3D games ever"? "True" and "one of" are carrying so much weight in that sentence they may generate a tiny black hole.
Even if you're going to discount every racing game since Hard Drivin' in 89, every pseudo 3D fps alll the way up to System Shock and Duke Nukem 3D, every 3D fighting game since Virtua Fighter, bundle Tomb Raider and Quake into that "one of", dismiss anything with locked camera since Alone in the Dark as "not true 3D"... Mechwarrior 1 and 2 were out. Hell, by 1996 there were as many full 3D space shooters based on Star Wars as mainline Mario platformers.
Fair tbh
Yeah I always cringe when people don't make really easy clarifications about formative games that aren't actually the first. The one I'd say for Mario 64 i'd say is "The first good 3-d platformer" or "one of the first 3-d platformers, and a launch title". No salt to Kaze, ur cool you've done it how I like previous times, you just forgor or rewrote it funny this time.
Yeah that bit bugged me too lol. I've made similar mistakes though, so I get it. Being accurate is hard
I think he meant one of the first 3D platformers as people often say but its gotten purple monkey dishwashere'd into being one of the first 3D games EVER which is absurd
@@KazeN64 This shows you have some credibility at least. But also "one of" does mean not the actual first(and could for sure cover 3 games before it), and leaving out on home console isn't so bad. I feel the intended point stands that 3D was new and very rare when mario 64 came out, and especially when making the game. So while the clarification is good to make I don't feel it's worth being upset about. Also Doom like games are not 3D for sure. So in short Mario 64 is one of the first true 3D games, not really any correction needed.
I think this is the best example of how premature optimization can be very bad, but optimization after the fact can help immensely as well.
I don't think it is fair to say it is "premature optimization": the tooling was just 3 bars that move about, on the screen.
you do a bit of both, optimization after the fact can be horrendous too. You optimize for each chunk, then optimize the whole. Then disable the first optimizations to see if there is any difference, then release.
You can also just design things properly first time around and have them optimized
Novel idea, i kmow
The biggest sin these days is the complete lack of optimizations
What's ironic is that I've been accused of making "premature optimizations" for making the same type of optimizations that Kaze is doing.
@@aaendi6661 He's actually undoing optimizations, in this video.
That was awesome Kaze. Dude, you never cease to amaze me that you're continuing to find more optimisations. You make coding for the n64 really fun to learn. Cheers mate.
Mario scared me, when i clicked on the video he was facing the graph but when the ad played mario was looking straight at me
the title and thumbnail are brilliant. just here to say that
I have a small project idea. How complicated could a SM64 map be whilst still achieving a locked 60fps on real hardware?
from kaze maps, very very complex
That'll be basically Return to Yoshi's Island
"""""small project"""""
@@Sauraen would only take maybe say 3 to 5,000 hours?
@@floppyD RtYI is targeting 30 on console.
Very cool video! Good work on showcasing the various "optimization-attenpts".
It's always important to check your optimization ideas against the actual hardware where the software is run. Especially when only targeting one system (as Nintendo did with Mario 64).
Loop unrolling for example makes sense on newer CPUs because of their out-of-order nature but on other hardware where cache locality is much more important it's hurting you (as shown in the video).
I want hacks for these games that enable you to play with all the low-poly LOD objects the entire time :D
I really appreciate all you do to truly get the most out of Mario 64. Hopefully one day a version compiling all your fixes can be made that runs basically without breaking a sweat. Keep up the great work!
I gotta callout Kaze for using CSS code in the background of the video’s thumbnail rather than pure, beautiful N64 assembly code.
I feel like I've seen this happen before
I noticed now because of you
That's because the game has been decompiled and you don't need to know assembly anymore to modify SM64.
Because he isnt writing in assembly, most devs of that era were writing in c/c++ for the consoles.
@@KingChewyy True, but they certainly weren’t writing CSS code, that’s for sure. So the point still stands
I love these optimisation videos Kaze, please keep doing them, seeing how mario 64 approached these kinds of things is really useful for game devs even today
13:13 A racecar with a beercan for a gas tank.
Very nice editing in this one, and super interesting topic
Thanks for making these videos, I'm no programmer but I do work in IT so I know a little bit about how code is supposed to work, and its very interesting to see how Mario 64 was coded. I'd like to believe that the poor optimizations in the game most likely happened due to time crunching and stuff, and as people back in the 90s were more limited in the tools that they had access to, it would have taken them a long time to troubleshoot and properly test for things, which is why games used to be much glitchier in the past, but as a result since the devs liked some of these glitches and stuff, we managed to experience a lot of them through cheat codes and stuff, which is what actually made that era of gaming a very interesting experience to grow alongside in.
Games didn't used to be glitcher in the past. They just don't have patches.
Another important part of the story is PS2 dev kits bragged about real-time debuggers on live code. As far as my research goes, 1st gen N64 dev kits did not have live debug on active software.
Also, it's likely either the intended platform specs were lower or the sga dev kits had lower specs. These cached renders mentioned halfway through the video May have been fail-safes against the code crashing in the dev environment.
@@kiyoskedante yeah, no console dev kit was known to have really good tools overall until the xbox. the ps2 did have some fancy kit with the performance analyzer, but i think only the main CPU had a debugger for years. Just write all your vector unit code bug free lmao
I love this video. Thanks for pointing out that there are tradeoffs and that having more consistent fps is better than just average or max fps. I remember working on a game (not N64) where I could make it hit 100fps but it would drop a lot, or I could have it hit a consistent 60 but would max out at 80fps. It was a case of premature optimization like the examples you pointed out.
Protect this guy at all cost. He is the savior we needed!
Amazing video Kaze! Again and again!
I absolutely love the longer explanation videos. I think you could make awesome videos in the style of Retro Game Mechanics Explained
0:16 what level mod is this?
I have to know the secret level from world 13
Since no one else is answering you, Kaze is making a mod for SM64. If you haven’t checked out any of his other videos, he goes into depth on the programming of Mario and how he is improving the code for performance games for his mod.
@@ObsydianX thank you!
I remember from where I liked to search fun facts about random things in the old days of youtube, that both the N64 and the Gamecube were equal or more powerful than their competitors but that their game storage systems (cartdriges and small disks) were limiting factors
holy shit dude, nice editing
It is great to hear that your next Mario 64 mod is near completion. I just hope it can run on PJ64 Version 3.0 or I can find an emulator that it can work on since learning about what happened with version 1.6.
People thought the PS1 was more powerful because of FMVs. That was it: movies.
It was actually because the hardware is generally much more efficient than the N64 and Saturn, and many 3D games tended to look less limited. Nowadays, it’s easy to point out the wobbly textures and weak 2D capabilities of the console.
@@solarflare9078it was easy to point out the wobbly textures and jittery pixels back in the day. I did it when I went to my friend's house, playing on the ps1 really jarring (that said a kid who didn't usually play on an n64 would probably have found the blurry textures and widespread use of fog to cover up low draw distance jarring too)
@@solarflare9078lol anyone could point out the insane texture warping on the ps1 from day 1
Same with the n64 textures being a blurry mess
Having already played quake on PC, i was not impressed with either, 3d graphics on pc were lightyears ahead of them and made the early 3d consoles look retarded
We even knew the n64 controller was a fucken shite too and ruined the enjoyment
it was actually the CD drive. the extra storage space compared to cartridges meant games were bigger .
@@tediustimmy PS1 also didn't suffer from memory stalling as badly as N64 did. Which at the time must've helped with keeping the perceptible gap between them smaller.
its crazy how unbalanced the hardware inside the N64 is, who tf designed it
I think they had some issues with sourcing chips coupled with a lack of knowledge for 3D games. 64 bit was only a thing because a 32-bit CPU at the required spec was more expensive / less available
@@jackthatmonkey8994 are you sure because I'm more keen to believe that they only got a 64bit CPU for the marketing
NES was the last console without a memory bottleneck.
It's clear that a lot of the techniques learned on the SNES were being applied on the N64 like unrolling and inlining which definitely would have been more effective on the older system. Great video!
1:14 Kaze saying "wide public", but being interpreted by TH-cam as "white public" in the subtitles made me laugh. It almost sounded that way haha
also at 8:15 he said "fog" and YT thought it was f*ck
Kaze's optimizing Mario 64 so much that we'll soon reach the point where unchecking Limit FPS on your emulator takes you back in time.
I'm not really familiar with the N64, but have you accounted for the generally less-developed state of the compilers 30 years ago? I don't doubt that the performance lottery was a thing even then, but I suspect that more naïve algorithms that covered code generation may have made loops less effective, which would be another reason why unrolled loops were deemed more effective in their tests at the time. The ability to use all registers as effectively as possible makes a considerable impact.
While the 90s aren't the archaic 70s, C was 'merely' 20 years old at the time, but the first MIPS processor was from 1985, a mere 8 years prior to the start of the n64s development. There's a good chance they weren't using the newest compiler toolchain at the time yet (internet was still very niche!), so I think it would be interesting to see how well an 'unoptimized' version would do when compiled with the tools they had at the time, if that is even feasible.
Yes! I used the exact same compiler and flags they used back in 1996 here. We know this is the same because the unmodified code compiles byte for byte the same
michaels video is very helpful!! 15:45
Here I am watching with a what Kaze is achieving with N64 hardware wondering how the history of gaming would be if he got send back in time to work at Nintendo...
To be fair, the 3D programs were also in it's infancy back then as well, Max and Softimage were the top contenders back then, they didn't have current blender with the F3D exporter. Painting vertex colors wasn't probably that visual back then nor did they have such a nice texture library and authoring tools.
@@xdanic3 Fair enough :D
Kaze is an extremely skilled programmer for sure, but the N64 and Mario 64 were pretty novel and the constraints of game development made it so that you have a minimum acceptable framerate and then with any extra development time, you'd focus on ensuring minimal bugs or adding more content rather than optimizing the existing content (doesn't necessarily mean all bugs will be fixed though!). While Kaze and other N64/Mario 64 devs managing to do this without the resources of a huge corporation is insanely impressive, it's not like it was realistic to expect the devs at the time to have similar breakthroughs (although there certainly are "cheap" optimizations the Mario 64 devs could have done at the time but I am unsure of whether they would have that drastic).
@@cdj17e yeah, thats why my mind thought it to be funny to imagine him with his knowledge standing on the shoulders of giant prior using that knowledge to help those giants. :)
@@cdj17e I think it cannot be overstated how much programmers like Kaze are standing on the shoulders of the giants who came before them. I imagine that if you threw Kaze back in time, he'd still be a very talented individual, but depending on the state in which you'd send him back, the results would vary immensely. Modern tooling will have inspired a lot of visualizations that let him realize just how unusable the bars used for performance matrixes by native devs were. Just having these impressions and knowledge of the places where the 'pain' is can avoid so much wasted time and rabbit holes. But at the same time, would he be as effective if he was limited to the tools of the time? Nowadays we have so many means for rapid prototyping that allow a quick 3D scene to be whipped up in Blender and inspected with high framerates, but back then the controls for comparable programs would have been clunky, screen updates slow, and overall process not very flexible in how easily it can be prototyped against the existing product. Also don't underestimate the importance of a quick build-test cycle which very likely involved cross-compiling and maybe even taking things out to plug them into a dev kit device. And finally, assuming Kaze got to work on the product back then, he'd no doubt have to deal with superiors who impose a certain vision or have opinions of their own on how development has to happen, as well as deadlines to meet while regularly spending nights at the office (It's Japan, after all.) It would be a huge difference in every aspect in regards to how he is able to approach these development projects now as a hobby of sorts. (I have no clue if and how he monetizes his activities, but it seems quite niche so I'm assuming it's primarily hobby oriented.)
Loop unrolling is a technique commonly used in the demoscene to get the most out of old 8-bit and 16-bit computers like the C64. Maybe the developers here were used to those old techniques (NES used a 6502 just like the C64) even though applying them to N64 was not the right idea and they just didn't know.
"How throwing gasoline into a fire made it COLDER"
I can't even begin to comprehend the complexity behind the technical stuff, it is impressive to see what you are doing.
I wonder how these 3 laggy spots look like on PAL. Given the framerate caps at 25, maybe the lags were less noticeable? ( unfortunately the game speed isn't compensated anyway, so it's still slow, just more evenly slow probably. )
@@Martyste If I'm not mistaken, Pal version had o2 compiler optimization enabled that original release didn't have.
Said compiler setting slightly boosted performance on PAL and Shindou rerelease. Said boost was for PAL's equivalent of 30, meaning peak fps was 25.
I think the biggest lesson in optimizations I got was when I was making a video game for coding practice, and when I was working on a main menu background (Particles behind the screen flying from the bottom to the top) my attempt to prevent the game from loading too many particles by Deleting multiple particles as it created multiple particles caused ALOT of lag
My solution to this and to make the particle background work was to actually make it that it created and deleted one particle at a time and instead to make sure there isnt "One particle only at each Y axis" a particle's spawn point would get randomized to the point where particles created BEFORE one particle could arrive after the particle that was created after them
19:20
I think this is the most important thing to know when questioning programming decisions. Previous nintendo
consoles didn't have caches at all; before then memory speed was on par with CPU speed and memory accesses would take the same amount of time no matter when they happened. A huge amount of the modern optimizations are centered around cache performance, but I'd be surprised if cache performance had gotten *any* significant attention at the time.
Consoles with cache: 3do, Jaguar, Sega 32x, PS1, Saturn
@@ArneChristianRosenfeldt one may notice that none of those are from Nintendo
@@mekafinchi yeah, Nintendo was late to the party and ignorant to the outside world. Probably did not allow experienced ARM coders from Archimedes to come in. Did not pay to get mentor with experience with Sun, Fuji, or SGI servers. Don’t go to trade shows. Don’t learn about profilers and instrumentation.
@@ArneChristianRosenfeldt ok
And at that time, even if the console had cache, no one know to to optimize for a cache memory. I'm pretty sure it's something that appaear later. For example, the Michael Abrash books about optimisation and assembly were very light about cache optimisation and most cache advice didn't care avout code size (because of the CISC x86, i know), and say nothing about memory bandwitch beside wait state. Everything is about the PIQ, teh wait state, the DRAM Refresh, the instruction, calculations, and so on... But the processor of that time had caches !
All this work to eventually get ceased and dissist by Nintendo 😶
Nice, but please stop comparing FPS numbers. Use milliseconds so optimization gains can be compared. “Improved by two frames per second” means different things based on where you started.
Rather use max frame duration in the last X time. This indicates stutter
It's so random people who stumble onto this can understand. The average idiot understands a different in fps, not milliseconds
Admirable. You really have a grasp on the whole thing.
Dude, if you think it's crazy that some people think the PS1 is more powerful than the N64, I wonder how you feel when so many people now claim the Saturn was more powerful and capable than the N64.
Also, we seriously need more people like you in the SNES development scene really optimizing that system and pushing it properly to its limits imo, because there's a lot of games on that system that have room for improvement like this to be honest. And, in the right hands, I genuinely think most of the SNES games that suffer any slowdown could be running at a pretty solid 60fps. Not only that, but it would cool to see some of those game pushing the system even further too, and really showing off what it's capable of.
I can only imagine what you might be able to bring to optimize games like Star Fox or Doom on SNES, never mind just the more typical 2D games there.
Honestly, optimizing Doom for 60FPS on the Sega Saturn (like originally intended) would be a really neat thing to see someone attempt.
@@KingKrouch I'm absolutely sure the Saturn could run Doom at 60fps if it doesn't already. I mean, haven't people got it running at 60fps on some of the older consoles already like the 32X or whatever? I swear I read that somewhere.
Now I'm curious, does Doom 64 run at 60fps?
A good example that proves what you said is a recent romhack for Ranma 1/2 Chougi Ranbu Hen, its one of the most poorly optimized fighting games on the SNES, it's an otherways great game but it runs like complete dogwater. Recently a user named upsilandre did a partial rewrite of the game, heavily optimizing the code and got it running at a faultless 60FPS.
It further frustrates me that the majority of notoriously sluggish SNES games that earned the console a reputation were not really the fault of the console, but rather developers being cheapasses and using SlowROM chips. Kandowontu's been hacking SNES games for a while now, converting them over to FastROM and this alone has yielded significant performance improvements, removing most, if not all the rampant slowdowns in a ton of games.
Manfred Trenz in one game with no expansion chips pretty much shamed every SNES dev with Rendering Ranger R2, so the whole console really deserves a redemption arc.
And there are even crazy people claiming the N64 was more powerful.
Retro Core would def think the Saturn is better than the N64, but mainly because he’s not fond of the N64 at ALL.
Cool video man. As always, the best optimisation is to do less work. 😊 Interesting to see how not avoiding loops makes for better cache utilization.
Nice vid. For the audience it would be beneficial to talks about performance as a resource measured in milliseconds. Saving "one fps" is very different at 10 fps vs 60 fps. Thanks for the vid 👍
Have Kaze taught about releasing an optimized Mario 64 before his original n64 game? That would gather publicity to make the release of his game more anticipated.
As always fascinating to hear an expert talking about n64 programming even if 90% of the stuff flys over my head. Kaze is the n64 Carmack!
Basically: Sometimes not doing anything is easier than building a machine that makes things easier
The dynamic collision on the submarine makes me think they originally intended it to move, scrapped the idea but forgot to change the collision.
Loop unrolling and inlining are not a great idea when your cpu is much faster than your memory. They probably thought the rambus memory was more performant than it turned out to be in reality. Older 16 bit and especially 8 bit systems had fairly balanced ram to cpu performance characteristics because some cpu instructions could be really expensive and memory latency was low compared to cpu speed. RISC cpus like the N64 used had good overall IPC much better than the common 16 bit cpus of the time. No doubt Nintendo's programmers were just not familiar enough with programming a RISC platform.
Let me tell you about Atari Lynx and SegaCD.
The biggest problem of the N64 hardware has always been the slow memory. If Nintendo had just given that console way faster memory, it would have destroyed everything else on the market. That was a pretty bad hardware decision. The memory was the bottleneck for everything. The CPU and the GPU could hardly ever show their full potential as most of the time they were waiting for the memory to provide requested data and thus had to idle. If you can optimize your memory access in such a way that the memory keeps pumping data at maximum speed, demos have shown that you can easily use textures 10 times the size and still get better frame rates on average.
So not optimization is the problem but the wrong kind of optimization. Optimization is often a trade-off between CPU time and memory storage. You can re-calculate values or you can cache them. In case of the N64, recalculation is often the way to go as that is faster than accessing cached data in memory. Actually that's true for many modern systems as well.
Modern CPUs perform an addition in on clock cycle, a multiplication in on clock cycle and operations can sometimes also overlap (so if you run 10 operations, each requiring one clock cycle, you may have the final result in just 6 clock cycles as not all operations have to wait for the last one to finish). In the end, re-calculating a value may cost you 12 clock cycles but fetching that same value from cache may cost you 20 (1st level) to 60 clock cycles (2nd level) and fetching it from memory may cost you over 200 clock cycles.
But optimizations work both ways. So instead of storing something and re-use it later, replacing that with code that intentionally re-calculates it is also an optimization. One that doesn't seem intuitive but can in fact make the code faster and that's what optimization is all about, right?
ACTUALLY! I used to think that too, but just recently I've ran into RSP bottlenecks. I have optimized my memory throughput in such a way that my CPU idles around 76% of the time. At that point, the RSP (=GPU) does become an issue. I might make a video about that soon. Sauraen is now working on a new microcode to fix some RSP bottlenecks.
@@KazeN64 But is the RSP really not able to keep up with the data and it's not the memory again that cannot provide vertex or texture data fast enough? After all you have proven in your other video that the RSP is a beast when it comes to processing vertices and it is also a beast when it comes to processing textures: (tried with link to Sf036fO-ZUk but apparently TH-cam filtered the reply because of the link) Usually the main problem why you cannot just blow up vertex count or texture size is because at some point the RSP gets limited by memory again, so you must be pushing it really hard if you can get it become the bottleneck by itself.
@@xcoder1122 Yeah, it's confirmed. It was the actual RSP cycles that were the limitting factor. Of course, improving memory would still reduce the RSP wait cycles so it's not entirely useless to do - but a 20% increase in RSP cycles was pretty much exactly a 20% increase in frametime.
@@KazeN64 This sounds like a very interesting topic. I'm looking forward to hearing some more technical details about it. I just subscribed to your newsletter so I won't miss it.
The WiiU right now has the exact "problem". It has low bandwith and no real dma unit. Which means slow Ram transfer! Just like with the N64!
I have a bit of a hot take: some of the reason people think the n64 is worse is that the PS1 had an *immense* amount of space for game assets. The PS1 had 10x the amount of space for textures, and that does make a difference!
That's why most PS1 games I remember have a much higher texture fidelity, for example. Of course, that also has drawbacks, but overall I think that contributed a lot to the sentiment that the PS1 is more powerful. Because in that one aspect, it truly was a beast.
Discs vs Cartridges
The part about culling was super interesting to me, as I always wondered whether we can help the N64 perform better by culling on the CPU instead of idling, so the GPU has less work to do when it has to draw the next frame
16:52 this check just confuses me. Actors shouldn't even be out of bounds unless they're placed there
I think a lot of people really underappreciate the work that went into making a great game like sm64. Yea the code is not perfect, and I am not a coder myself, but I'm sure even Kaze would agree that they did an amazing job with the time and knowledge that they had available.
13:25, so who's gonna code Super Mario for the PS1, and who's gonna be the first speedrunner for it?
I wonder how hard will be to port it, considering both machines use RISC CPUs but had dramatically different GPUs.
We'll call it Cramari.