Why Mario 64's Render Speed BLOWS
- Published on 23 Apr 2024
- 0:00 Introduction
0:53 Chapter1: Compiler Optimization
1:50 Chapter2: Math Functions
2:58 Chapter3: Shadows
3:53 Chapter4: The Instruction Cache
4:50 Chapter5: Animated Bones
5:31 Chapter6: The RDRAM
6:19 Conclusion
7:01 Bonus: Insane Man Rambling On About C Code
Yes, this would help with 60FPS Console SM64.
Subscribe for more Retro Mods!
Patreon: / kazestuff
🎥 / kazesm64
🐦 / kazeemanuar - Games
Unlisting this video because of how outdated this is and how much higher quality the new stuff is.
newer stuff is very high quality!
this is unlisted? no idea how i got here
oke!
I've said something similar on a past video, but I would be interested in a patch for the vanilla SM64 that applies all these fixes you've done over the years. I'm sure most of the rom hacking community wouldn't have much use for it, but I think it would be a great technical showcase.
Agreed. I'd love to see a more optimized Super Mario 64 with all of these fixes, optimizations, and enhancements Kaze and others have been able to pull off, and see it compared to the original game.
Maybe that could make it run at 60fps on console one day but we can only dream
@@SmashyPlays It *would* make it run at 60FPS on console, at least in optimal conditions. It's just a matter of patching the base game with the optimizations and loading the ROM onto a flashcart
I was thinking the same thing, It would be awesome to play 60fps sm64 on real hardware
Including your optimized Mario and coin models please.
I also hope this code can be shared with the people working on the reverse engineered PC port.
Sure PC hardware is more powerful, but now people are throwing new features in that port, higher resolution textures, higher poly models, etc. And then they're trying to get this port working on various different consoles with varying levels of processing power as well.
Programmers complaining that loops are not faster than unrolled ones: Watch the bonus section of the video. This is an N64 hardware specific thing. (also yes, I do have the compilerflag to not unroll loops on)
The Mario Kart 64 community sends its thanks for documenting all this information and explaining the reasoning behind it.
Any chance I can get a source for that Komm Susser Tod Cover lol
@@OverKart64 60fps 4-player when?
Bro you are legit a genius
Can U Please Do Ocarina of time
I’m glad you put the “ramble” as you call it at the end - I’m sure I’m not the only one who loves this stuff 🧐
Glenn plant is here! I watch all your reviews
Glenn! The man! The legend!
The original team of like 20 did a pretty great job creating the game, console, controller, and even 2 more games at the same time
But Kaze coming in and helping the game run better puts a smile on my face because it's like giving it new life, sorta like an old car
Yeah, and the difference is that the og team had a deadline. Kaze is doing all of this during his free time and without any pressure to finish it asap
@@ddnava96 yeah, but it's still surprising that a single guy could figure out so many things and fix them. Kaze doesn't have deadlines, but neither does he have a budget or a team, so it's respectable nonetheless.
@@TheKorenji Are you a programmer? In my experience, as a web dev, what a budget and a team gives you, is primarily time. Having a deadline takes away time.
And no finished project, was ever finished with the idea that it couldn't be optimized any more - if you're lucky, you reach "good enough".
Another big difference is, that a team will always try to improve the product. Not "improve the rendering engine". They might have tried to implement one more power up or level given enough time. Or improved the controls. Or fixed the camera angles in the haunted house.
So while super impressive, it's not at all surprising that someone could optimize SM64. Still, very impressive.
@@tokeivo you shouldn't say it's not impressive, even if you're right. because I can assure you that most people around(or at least a lot of them) have been surprised by Kaze one way or another, one of those being, his programming skills, since not everyone is around his level.. to be fair, yeah, it is quite predictable that optimizing these games should be possible, but it is the sole dedication of this man what does it for me.
@@TheKorenji i specifically mention twice that it's impressive. Dunno how you got the idea otherwise.
There is a certain charm about watching a gameplay recorded from real hardware and one recorded from emulator, I’m not sure how to explain it tho
The low res?
@@mlalbaitero Maybe. That’s probably another one of the reasons
The colors are dimmer, and it’s more pixelated.
it's just fuzzy in a way that only analog video can be
@@RizzlinHD Bingo! That’s most likely what I meant!
great stuff!
Wtf you commented on his video LULW 😆
When even MVG is impressed, you know you've done something amazing
@@samuelthecamel We need to talk about MVG and his love of Mario 64.
He is a good programmer who knows about hardware limitations and optimizations.
You know it is legit when MVG gives his stamp of approval.
Optimizing math functions using linear algebra and an intimate knowledge of how the R4300i CPU works: 3ms
Removing two unnecessary raycasts: 2.5ms
Kaze: 😐
its stupid sometimes
The golden rule of optimization: profile first!!
@@vinesthemonkey profiling sm64 without any debug symbols be like
@@mariocamspam72 SM64 has a full decomp tho?
If raycasts are that slow, maybe their collision geo could do with some acceleration structure...
I’d pay good money to take a “Kaze teaches SM64 C programming” class for an intro to modding the game. I’m someone that works on IoT embedded systems, but graphics and games are a whole different animal
just c coding in general, this man is already better than most modern developers in c++ with ue4
This is really interesting. I wonder why they felt like the game needed three? vertical raycasts. I suppose that might just go with the territory of making stuff up as you go along
yep this was clearly an oversight. it was the same raycast 3 times in different parts of the shadow processing. they should have passed the results down to the next function, but did not. i suppose the people implementing these 3 functions each received specifications that did not include knowing the surface data.
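A minimal C sketch of the fix described above: do the downward raycast once and pass the result to later stages. Every name here (`FloorHit`, `find_floor`, the `shadow_*` functions) is hypothetical and not taken from the SM64 source, and the "raycast" is a stub that only counts how often it is called.

```c
#include <assert.h>

/* Hypothetical stand-in for the surface data a floor raycast returns. */
typedef struct {
    float height;    /* floor height under the query point */
    float normal_y;  /* y component of the floor normal (slope measure) */
} FloorHit;

int g_raycast_count = 0;  /* how many times the expensive lookup ran */

/* Stub raycast: pretends to search the collision data and always finds
   the same floor. A real lookup would walk the level's triangles. */
FloorHit find_floor(float x, float y, float z) {
    (void)x; (void)y; (void)z;
    g_raycast_count++;
    FloorHit hit = { 2.0f, 0.5f };
    return hit;
}

/* Redundant version: three stages each repeat the same raycast. */
float shadow_value_redundant(float x, float y, float z) {
    FloorHit a = find_floor(x, y, z);  /* stage 1: locate the floor */
    FloorHit b = find_floor(x, y, z);  /* stage 2: read the height */
    FloorHit c = find_floor(x, y, z);  /* stage 3: read the slope */
    (void)a;
    return b.height + c.normal_y;
}

/* Later stages take the hit as a parameter instead of raycasting again. */
float shadow_height_stage(const FloorHit *hit) { return hit->height; }
float shadow_slope_stage(const FloorHit *hit)  { return hit->normal_y; }

/* Shared version: one raycast, result passed down. */
float shadow_value_shared(float x, float y, float z) {
    FloorHit hit = find_floor(x, y, z);
    return shadow_height_stage(&hit) + shadow_slope_stage(&hit);
}

/* Checks that both versions agree while the raycast count drops 3 -> 1. */
void compare_shadow_paths(void) {
    g_raycast_count = 0;
    float redundant = shadow_value_redundant(1.0f, 5.0f, 2.0f);
    assert(g_raycast_count == 3);

    g_raycast_count = 0;
    float shared = shadow_value_shared(1.0f, 5.0f, 2.0f);
    assert(g_raycast_count == 1);

    assert(redundant == shared);
}
```

Calling `compare_shadow_paths()` from a `main` exercises both paths. The output is unchanged because all three lookups start from the same position and return the same surface.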
@@KazeN64 where the HELL is it in the code? im going crazy trying to optimise super mario 64 for 3ds right now and i cant bloody find it
@@clementpoon120 they dont share the same codebase or render pipeline
@@mariocamspam72 I think Clement is trying to optimize the *direct* Super Mario 64 port for the 3DS--the one that came out recently, after the source code leak--as opposed to _Super Mario 64 DS._
@@harrisonfackrell how are the ones for the ds and 3ds different?
Could you make a rom hack of standard Mario 64 using all of these optimizations?
i wonder with all these optimizations if a 16:9 mod would run mostly at 30fps
I don't know how possible that would be without essentially rewriting and recompiling the game code
@@Oocca_Truth wouldn't these optimizations be public via recompiling rewritten code? Unless you're saying that 16:9 support requires a bunch of rewriting, which I don't think it does given Everdrive / GameShark codes existing.
the optimization is REAL!!
Ubisoft: opti-what?
True
"the compiler wasn't good enough so I rewrote this function in assembly" what a madlad
Now we just need a DeLorean so we can go back to 1995 and give a VHS of this upload to all N64 developers. 😅
They still wouldn't have enabled GCC optimizations unless you show a video that using the compiler of the time didn't make a build with bugs.
Lol
I think you should add a minecart to the outside of Bowser's Blazing Burrows. That would explain how Mario got to hop in the cart.
that is the plan
@@KazeN64 good. Really appreciate the effort you're putting into this major rom hack!
10:50 WHEEZE; bypassing compilation to save time is such a Kaze solution, well done! Just curious what sort of changes in performance would your changes bring to vanilla SM64? I'm just imagining a world where I can look dead on in Fire-Sea or Bowser's Sub in DDD without it being a slideshow ha
You're a titan brother
those 2 levels could definitely be lagless with a few more tweaks!
@@KazeN64 not drawing the whole sub each frame could bring the framerate back up from a code perspective, but do you know how much of an impact it would have if the sub was made with a fraction of the triangles instead (without any changes to the code)?
@@t0lkki the problem is not that the sub is drawn every frame. its the collision math. you could simply load it as permanent collision and it'd be fine. ive done this in my sm64 multiplayer and the sub in that game has a higher framerate in multiplayer than it'd usually have in singleplayer...
@@KazeN64 oh that's interesting, doesn't that imply the sub was intended to move at some point? now that'd been a slideshow to watch!
@@t0lkki if it moved, it'd be the exact same lag. i think the reason they didnt make it permanent collision is that it disappears on a later act and simply didnt have a function to make it permanent collision only on certain acts. it takes 2 minutes to fix though. i think it was just programmer stupidity.
The moment Kaze started to explain the code he rewrote, he just goes gigachad. Especially how he said he just decided to rewrite stuff in assembly...
I'm going to have to learn how to code just to understand this, as I've seen many people say the same sentiment
Me almost to the end of the rambling section: Okay, this is impressive. It can't possibly get any more insane...
Kaze: So I started coding in assembly
"Yea i like my gameplay optimized*
Those bonus bones are also present in Melee models, I theorize that they're probably referential bones because the bones have issues tracking their relationships to their original location, they're commonly found in shoulders and thighs, things protruding from the base structure
This footage looks like some fever dream direct-to-video Mario 64 sequel jahsnahakahkahsjahaksj
Based assembly dev.
virgin high level programming fan vs chad asm god
@@KazeN64 NOOOOOOOOOO! YOU CAN'T JUST HAVE AN INTRICATE LEVEL OF KNOWLEDGE ABOUT HOW THE CPU WORKS
Chad Asm god: Duh huh computer go brrrrr faster.
love the eva music for the code
Absolutely amazing Kaze. I really hope to learn from you someday.
"crack head version" my programming in a nutshell
That Komm Susser Tod segment was so emotionally dissonant, I love it.
If you'd like more Kaze content, definitely check out my new 2nd/backup channel! I upload stuff that wouldn't fit the main channel here!!!
th-cam.com/video/qMQZJjt90xI/w-d-xo.html
Can you understand?
yes
@@KazeN64 nice
@@KazeN64 honestly a really dumb question 😅 But it's pretty cool
Dud rambling about C code might be one of my favorite things. Very interesting if you are learning C / C++.
Need more!
I'm so glad you did the part at the end. Some of those changes I didn't know how they could be faster than what the compiler should output.
god bless that vaporwave cover of komm susser tod
seeing your optimizations, if you saw my "port" of the pipfall minigame from Fallout4 to the 68K processor, you'd kill me
that komm susser tod remix came outta nowhere lmao
You are a legend, i like your vec optimizations specially the one done in assembly
Really well thought out and well constructed video, I hope you make more of this! It was such a treat to watch :)
A 60fps Mario 64 on original hardware would be incredible. Too bad none of this improves the renderer. Super cool nonetheless.
this frees up some memory reads, meaning the renderer is also slightly sped up! 60fps sm64 is within reach.
Boy do I have news for you!
You praise them at first and then proceed to roast them.
dude keep it up, I love these technical videos of yours
you are one romhacker
i have modified the code myself and found the 3 raycasts,
I UNDERSTAND YOUR ANGER
the first one i found was if mario is over water cast the shadow on it, if he is in it, cast the shadow at the bottom, not only is this not how water works but it runs faster if you just cast it at the bottom realistically
the next one was for objects with 4 sided shadows which was identical to the last one for round shadows, i expected to do some work to make it cast 4 sided but no, if the identical function was already in cache then it would still have to load another one
there's more raycasts...
there are also a few raycasts to get the floor height around the shadow (instead of using the surface normals...) and there are FIVE raycasts during mario's step function. plus every object has a separate raycast for its physics and graphics even though both will have identical results.
@@KazeN64 O_O
You should have done a TAS side by side before/after to show the visual difference
This is great, very well done!
What about clang?
who's clang
7:17 Have you considered the restrict keyword? Without the restrict keyword, by the laws of the C standard, it must assume that dest and src overlap.
So, for example, let's say you did
float x[4] = { 1, 2, 3, 4 };
copy(&x[1], &x[0]);
If GCC loaded first then stored, it would end up in 1 1 2 3, but the C standard says it should be 1 1 1 1.
The restrict keyword says "these are never going to overlap" and therefore it doesn't need to worry about that.
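A small C sketch of the overlap case described above, assuming `copy` copies three floats from `src` to `dest`. The function names are hypothetical. The load-first ordering is written out by hand to show the result it would produce, because calling a `restrict` function with overlapping pointers is undefined behavior.

```c
#include <assert.h>

/* Element-by-element copy: each store happens before the next load, which
   is what C requires when dest and src are allowed to overlap. */
void copy3_in_order(float *dest, const float *src) {
    dest[0] = src[0];
    dest[1] = src[1];
    dest[2] = src[2];
}

/* Hand-written load-first ordering: all three loads, then all three stores.
   Only equivalent to the version above when the ranges do not overlap. */
void copy3_load_first(float *dest, const float *src) {
    float a = src[0];
    float b = src[1];
    float c = src[2];
    dest[0] = a;
    dest[1] = b;
    dest[2] = c;
}

/* With restrict (C99) the caller promises the ranges never overlap, so the
   compiler may reorder the loads ahead of the stores on its own. */
void copy3_restrict(float *restrict dest, const float *restrict src) {
    dest[0] = src[0];
    dest[1] = src[1];
    dest[2] = src[2];
}

/* Reproduces the overlapping example from the comment above. */
void check_overlap_behavior(void) {
    float x[4] = { 1, 2, 3, 4 };
    copy3_in_order(&x[1], &x[0]);    /* store feeds the next load */
    assert(x[0] == 1 && x[1] == 1 && x[2] == 1 && x[3] == 1);

    float y[4] = { 1, 2, 3, 4 };
    copy3_load_first(&y[1], &y[0]);  /* loads see the old values */
    assert(y[0] == 1 && y[1] == 1 && y[2] == 2 && y[3] == 3);

    float src[3] = { 1, 2, 3 };
    float dst[3] = { 0, 0, 0 };
    copy3_restrict(dst, src);        /* no overlap: promise is kept */
    assert(dst[0] == 1 && dst[1] == 2 && dst[2] == 3);
}
```

The two orderings only disagree when the ranges overlap, which is why the compiler cannot switch between them unless it is told that overlap never happens.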
yeah, that would have worked. im no expert at C so i had no idea until i saw a few comments like this.
@@KazeN64 "restrict" was introduced in C99, it did not exist when Mario 64 was being made. I assume there was a GNU extension prior to 1999 but I don't know when.
@@ssl3546 they didn't use GCC so I doubt it
fly me to the moon & phonk in the same vid? fuckin BANGERS kaze
Yoshi shaking his ass at the end had me weak LMAO
the guy is a sm64 modder who knows code better than nasa dudes for real
Topping it off with assembly. What a legend!
This is picturesque example of the 80/20 rule.
9:40 Kind of weird that GCC is not able to optimize float moves to immediate int moves by itself. x86 compilers can do such optimizations, so maybe it's disabled on MIPS because it could cause unexpected behavior? Or maybe that would only work in C++? Or maybe I just don't get it 🤔
Or it’s just gcc being a pile of suck
My guess is gcc isn't as good on most other platforms as it is on x86. My megadrive project uses gcc but I have to inspect the compiler's output and use inline asm often in performance critical areas to get around gcc's poor 68k codegen.
@@Ehal256 MIPS support on GCC is actually not super duper hot like it is on x86. IDO on O2 is actually pretty good for being a 1994-1996 compiler and GCC here is only *marginally* better on the default settings, although you can get a lot more out of it by being flag specific like Kaze is doing.
Maybe it doesn't realize it actually can save instructions (and especially loads) since loading a single precision float is a two-step process but so is a 32-bit immediate, since classic RISC instructions only take 16 bits of data at once. But since all the lower bits of the representation of 1.0f are zero, it can be loaded in a single step (lui $2,16256). PowerPC has a similar issue, they only figured that out for ARM. Old x86 cheats by having a FLD1 instruction.
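The `lui $2,16256` claim can be checked with a short C sketch, assuming `float` is a 32-bit IEEE-754 value (true on the N64 and on typical desktop targets). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the raw 32-bit pattern of a float (assumes 32-bit IEEE-754). */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to reinterpret */
    return bits;
}

/* A single lui sets the upper 16 bits and zeroes the lower 16, so it can
   build the whole constant only when the lower half is already zero. */
int fits_in_single_lui(float f) {
    return (float_bits(f) & 0xFFFFu) == 0;
}

/* The 16-bit immediate that lui would need for this float. */
uint32_t lui_immediate(float f) {
    return float_bits(f) >> 16;
}

/* Checks the numbers quoted in the comment above. */
void check_one_point_zero(void) {
    assert(float_bits(1.0f) == 0x3F800000u);
    assert(lui_immediate(1.0f) == 16256u);  /* 0x3F80 */
    assert(fits_in_single_lui(1.0f));
    assert(!fits_in_single_lui(0.1f));      /* 0x3DCCCCCD needs both halves */
}
```

Constants such as 0.5f, 1.0f and 2.0f have an all-zero lower half and can be built this way. A value like 0.1f cannot.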
That bit about C programming just reminds me of how much improvement needs to be done to the compilers optimization algorithm
thanks for the shout out dude!
As someone who only had taken an intro to C++ course, the only thought I had was “oh that’s a void function. That’s neat.”
I love these vids you do about optimizations
2:00 joke's on you, I'm a graphics programmer and I'm invested
SM64 needs a PSX port. Bubsy 3D proves it's feasible.
"my compiler will optimize it all, I don't need to understand the cpu"
So much this. Been working on rewriting the Kociemba algorithm (for cube solving) in assembly for modern hardware, though of course I haven't finished it because ADHD had other plans. For the record, Kociemba definitively understood how CPUs work when he wrote his C implementation, BUT turns out once you get AVX involved and do it in assembly you can reach 1 billion turns per second on a laptop. Though seriously, throughout this process I've learned that even if you don't use any asm, understanding those details under the hood is absolutely critical. Both for performance _and_ readability.
It's sad, so much CPU power nowadays is basically wasted.
I would LOVE to see something assembly based running on a modern CPU. Like you know those old fun little demo things you could run on a commodore 64? Where you could like run colors across the screen and you can see how much faster machine language was than basic?
@@locklear308 you might want to check out the channel "What's a Creel?" He does a fair bit of assembly and it's where I learned a lot of what I know.
@@rubixtheslime oh thanks man that sounds dope!
I'm trying to find a video of Michael Abrash (of quake/Oculus/Graphics Programming Black Book) and how he optimized the snot out of a naive implementation of Conway's game of life 1000x over.
> BUT turns out once you get AVX involved
oh, sure once you get hardware support the sky's the limit. It's a world of difference from an early 90s MIPS
Great video! I really appreciate how you emphasize the context that the original code was written in, and how it's different from the context you're making these improvements in. Like the disclaimer at the very start of the video, and how you brought in an actual developer of the game to ask about the compiler options!
Personally I might have given the 3 raycasts thing more slack, or at least not pinned it on 1 theoretical person. As a complete hypothetical, maybe the extra raycasts were to fix bugs that occurred in 1 specific level, and the team was on a deadline, so they chose that fix. That said, I could be wrong, and you're the one who's seen the actual code. Anyways, I appreciate all the attention given to the original developer context throughout the video.
Unrelated but I adore what you did with the vertex shading. It makes things look so vibrant and lively. Very Spyro-esque, which is a great thing in my opinion!
Instantly subbed
Really incredible work!
Super interesting, thank you!
This guy is actually insane
Incredible stuff you've done here!
3:20 im pretty new to programming but i guess they used the 3 raycasts to calculate the angle of the ground instead of just getting the normal direction from the ground below?
no, they used all 3 to get the floorheight and slopedness. like i said in the video, you can do it in 1 without changing the shadow by a single pixel.
(as in, all of the 3 raycasts are straight down from the same position)
oh lol. I was thinking it might've been like one for the height, one ahead of it for the slope one way, and one to the side for the slope the other way, but I guess it is just redundant lol
@@thegreatautismo224 yep, unfortunately it is haha. it does do what you've described for mario's shadow specifically, i did keep that one intact.
I remember noticing that it calculates Mario's circular shadow using Pi to something like 20 digits of precision. With how low resolution this game is they probably could have got away with just using 3.0 :D
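As a rough illustration of why extra digits of Pi would not matter, here is a C sketch that assumes the shadow math runs in single-precision `float` (an assumption, not something confirmed above). A 32-bit float keeps only about 7 significant decimal digits, so a 20-digit literal rounds to the same value as an 8-digit one.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if a 20-digit Pi literal and an 8-digit one round to the
   same single-precision value. */
int extra_pi_digits_are_lost(void) {
    float pi_long  = 3.14159265358979323846f;  /* digits past ~7 are dropped */
    float pi_short = 3.1415927f;
    return pi_long == pi_short;
}

/* Returns 1 if truncating Pi to 3.0 changes the stored value
   (it does: that error is far above float precision). */
int three_differs_from_pi(void) {
    float pi_short = 3.1415927f;
    return pi_short != 3.0f;
}

/* Checks both claims; FLT_DIG is the number of decimal digits a float is
   guaranteed to preserve, which the C standard requires to be at least 6. */
void check_pi_precision(void) {
    assert(FLT_DIG >= 6);
    assert(extra_pi_digits_are_lost());
    assert(three_differs_from_pi());
}
```

So the extra digits cost nothing at runtime and change nothing in the result, while using 3.0 would actually change the computed value. Whether that change would be visible at the game's resolution is a separate question.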
The left side of the picture at 0:38 about sums up my frustrations with programming culture nowadays 🤣
Very great insight! Loved the video!
Utilizing the hardware provided will always provide a better solution than expecting the compiler to do things for you.
The compiler can only reason about such a small portion of your code even with optimizations turned on.
ie.. USE THE CACHES AS INTENDED.
For this case you almost chopped the time in half by using the cache better and shaved minuscule amounts of time from using the compiler optimizations.
Making this open-source could help optimise the code even further, but this is really pushing the hardware to its limits! Can't wait for more
I doubt their team doesn't already include everyone interested enough in this topic to be helpful
It would be a sure path to a C&D from nintendo
For the first optimization (load/store x3 v. load x3 + store x3), I would guess that this is an aliasing issue, and could be solved more simply with the restrict keyword. Good stuff!
Came here to say this. That can help almost all these cases, since the optimizer must otherwise assume the worst case that every store could modify any other value you're loading.
Yep exactly this. For arguments that overlap in memory the optimization will cause different behaviour, so the compiler can't apply it without you pinky-promising that they don't.
I was thinking that the compiler should be able to optimize a simple memcpy like that, but I forgot that C allowed mutable aliasing.
Impressive work dude!!
I love the fact we are cousins. Congrats on your constant work modding the 64.
nice taste in music ;D
absolutely insane, I love this
You are absolutely amazing dude.
God damn I love in-depth technical videos relating to video game software. Thanks Kaze for being so inspirational and awesome.
10:54
Compiler: am I a joke to you?
Some deranged Yoshi nerd: yes.
"The compiler is not good enough so we do it ourselves" You're so fucking cool tbh
Wow, this is some amazing work!
The song used in the Bonus is:
PARADIGMA (Remix) - by MC ORSEN
Thanks, I appreciate these videos just as much as your others.
Nintendo should hire this man
i like how you made the letters the same color as the numbers
Didn't understand a thing you said but I enjoyed watching this whole thing while eating lunch!
Insane
The pains of working on older hardware. The load/store issue is much less of an issue with out of order CPUs. They rearrange instructions on the fly to try to prevent issues such as memory stalls. This was also the early days of realtime 3D rendering, and not as many shortcuts were known.
Even the XBOX 360 CPU used in-order execution.
Kaze this is incredible
Interesting vid.
Amazing Work! :)
I'm waiting for someone to do something like this to Golden Eye or Starfox 64.
Very pog, thanks for the nerdy part
What I’m really waiting for is a ridiculously tight and compiled mario rom I can run on my stock n64 at 60fps (or whatever it would be with this insanely optimized code)
2:29 komm susser tod by astrophysics i see, great music taste.
I love this thank you for sharing
Man I can't wait until you get your hands on OoT's code. Gonna be so awesome.
Dude you are insane in programming holy moly
Great work!👏🏻👏🏻👏🏻👏🏻
that image at the beginning speaks volumes. be more like the 1996 guy.
This is awesome!
I can't stop watching the stuff about the math function optimizations, that stuff is really interesting to me.
Every now and again its good to be reminded, as I sit there looking smug because of some optimisation I've done in the game I'm making, that there is a whole other level of big-brain optimisations I don't even know exists.
Awesome to see! The footage shown looks like a 3ds game! Insane improvements imo.
Ooo! Coding video! My favourite kind!
That level of optimization is insane! really cool!
You're a genius kaze