Why Mario 64's Render Speed BLOWS

  • Published on Sep 28, 2022
  • Gaming

Comments • 829

  • Kaze Emanuar
    Kaze Emanuar 1 year ago +781

    Programmers complaining that loops are not faster than unrolled ones: Watch the bonus section of the video. This is an N64 hardware specific thing. (also yes, I do have the compiler flag to not unroll loops on)

      YA BOI JUANITO several months ago

      @Engineer Gaming your Motorola G7 Moto Snapdragon is 632 and its a pretty alright processor and my Samsung phone processor is very comparable to a snapdragon 690.

      YA BOI JUANITO several months ago

      @Engineer Gaming my phone specs is MediaTek Dimensity 720 5G and my phone cores is 8 2x2.0 GHz Cortex-A76 & 6x2.0 GHz Cortex-A55 and I have a LG Stylo 5 Android 9 pie Snapdragon 450 it runs some n64 games like it runs sm64 at 30fps and smash bro 64 it frame drops. But it do run nes and snes and teletubbies ps1 really smooth.

    • Engineer Gaming
      Engineer Gaming several months ago

      @YA BOI JUANITO I'll try that later, I'm using Android 12L (LineageOS) so I'm curious as to how that will run.

    • Engineer Gaming
      Engineer Gaming several months ago

      @YA BOI JUANITO I'm typing this on an Android phone right now. I know from experience emulation runs like shit, maybe I'm using the wrong emulators though. Edit: of course, your phone's CPU has twice the CPU cores as mine, that probably contributes to why you can properly run emulators on it.
      Edit 2: The device I currently use is a Motorola Moto G7 actually, with a Motorola Moto G Stylus arriving later today to replace it, because I've replaced its screen 5 times, with each screen being horrible in its own unique way.

      YA BOI JUANITO several months ago

      @Engineer Gaming also Mupen64plus FZ Pro got updated and runs pretty well on android 11 and 12

  • Modern Vintage Gamer
    Modern Vintage Gamer 1 year ago +788

    great stuff!

    • Omega Rugal
      Omega Rugal 5 months ago

      mistakes are being fixed

    • BingBingWahoo
      BingBingWahoo 9 months ago


    • Brendan Dulay
      Brendan Dulay 11 months ago

      He’s finally coming to the dope pages now for cryzenx too

    • Mr Heck
      Mr Heck 11 months ago


    • Dorktales
      Dorktales 11 months ago +3

      Video approved

  • Jimmy Hirr
    Jimmy Hirr 1 year ago +419

    Optimizing math functions using linear algebra and an intimate knowledge of how the R4300i CPU works: 3ms
    Removing two unnecessary raycasts: 2.5ms
    Kaze: 😐

    • AVX512 is a waste of Silicon
      AVX512 is a waste of Silicon 2 months ago +1

      @Augusto Severini You're technically correct and Kaze made a video also optimizing collisions in his Walls video

    • Augusto Severini
      Augusto Severini 5 months ago

      If raycasts are that slow, maybe their collision geo could do with some acceleration structure...

    • vinesthemonkey
      vinesthemonkey 6 months ago +8

      @Mariocamspam SM64 has a full decomp tho?

    • Mariocamspam
      Mariocamspam 6 months ago +3

      @vinesthemonkey profiling sm64 without any debug symbols be like

    • vinesthemonkey
      vinesthemonkey 11 months ago +61

      The golden rule of optimization: profile first!!

  • JcFerggy
    JcFerggy 1 year ago +796

    I've said something similar on a past video, but I would be interested in a patch for the vanilla SM64 that applies all these fixes you've done over the years. I'm sure most of the rom hacking community wouldn't have much use for it, but I think it would be a great technical showcase.

    • ifroad33
      ifroad33 2 months ago

      The madlad did it

    • DazZ Dark Knight
      DazZ Dark Knight 5 months ago

      @DCVK I would go a step further and make it a standalone cartridge.

      NOMAD, AKA NMD 5 months ago

      The use for it is having a decent source to mod that runs on real N64 hardware. The biggest issues with modding Mario 64 are that you can’t add too much because of how much space is wasted on the cart already

    • TrafficConeMemes
      TrafficConeMemes 7 months ago

      @Mnnvint use patches

    • Spicy Bread Productions
      Spicy Bread Productions 7 months ago

      @JcFerggy I have been, actually. They’re quite useful, especially that Sonic 2 guide to speed up scattered ring lag

  • TruelyJohn64
    TruelyJohn64 1 year ago +386

    The original team of like 20 did a pretty great job creating the game, console, controller, and even 2 more games at the same time
    But Kaze coming in and helping the game run better puts a smile on my face because it's like giving it new life, sorta like an old car

    • Eduardo Anonimo
      Eduardo Anonimo 9 months ago +2

      @Toke Ivø Kids gonna kill you for show ideas that brokes their snowflake condition... I know that its a constructive critique but they dont care, I talk from the experience...
      Anyway, the big step its not that its improving SM64, its that this engine happens to be the "standard" engine in majority of Nintendo games, so basically can solve the Donkey Kong problem.

    • Michael1875l
      Michael1875l 11 months ago

      Damn. Well said.

    • Toke Ivø
      Toke Ivø 11 months ago +19

      @Eldoofus hates Christmas i specifically mention twice that it's impressive. Dunno how you got the idea otherwise.

    • Eldoofus hates Christmas
      Eldoofus hates Christmas 11 months ago +1

      @Toke Ivø you shouldn't say it's not impressive, even if you're right. because I can assure you that most people around(or at least a lot of them) have been surprised by Kaze one way or another, one of those being, his programming skills, since not everyone is around his level.. to be fair, yeah, it is quite predictable that optimizing these games should be possible, but it is the sole dedication of this man what does it for me.

    • Toke Ivø
      Toke Ivø 11 months ago +24

      @Eldoofus hates Christmas Are you a programmer? In my experience, as a web dev, what a budget and a team gives you, is primarily time. Having a deadline takes away time.
      And no finished project, was ever finished with the idea that it couldn't be optimized any more - if you're lucky, you reach "good enough".
      Another big difference is, that a team will always try to improve the product. Not "improve the rendering engine". They might have tried to implement one more power up or level given enough time. Or improved the controls. Or fixed the camera angles in the haunted house.
      So while super impressive, it's not at all surprising that someone could optimize SM64. Still, very impressive.

  • N64 Glenn Plant
    N64 Glenn Plant 1 year ago +293

    I’m glad you put the “ramble” as you call it at the end - I’m sure I’m not the only one who loves this stuff 🧐

    • Sean
      Sean 1 year ago +2

      Glenn! The man! The legend!

    • Nobbie
      Nobbie 1 year ago +2

      Glenn plant is here! I watch all your reviews

  • King Pixel
    King Pixel 1 year ago +379

    There is a certain charm about watching a gameplay recorded from real hardware and one recorded from emulator, I’m not sure how to explain it tho

    • Grub
      Grub 11 months ago +3

      The answer is because glide64 doesn't emulate the n64's VI at a low level, thus you lose the low resolution, dithering, dedithering, and antialiasing

    • Sonya Blade's Booty
      Sonya Blade's Booty 11 months ago

      Nah not rly

    • Kamoune '_'
      Kamoune '_' 1 year ago +2

      Because it feels more natural ?

    • RizzlinHD
      RizzlinHD 1 year ago +2

      @King Pixel aye. for bonus nostalgia points try watching this video on a CRT tube haha

    • King Pixel
      King Pixel 1 year ago +1

      Same for what @Trunkit wrote

  • June
    June 1 year ago +203

    After two hours of scouring the code, I managed to find the precise floating point calculation that was causing this slowdown, after re-writing this code to use less space and be less hardware intensive, I managed to boost overall performance by 20% *Deadlifts 225*

    • mcklucker
      mcklucker 4 months ago

      Kaze can BLJ IRL.

    • Black Fang
      Black Fang 5 months ago

      @snes Its almost not optional if you don't want your health to go into rapid decline.

    • Brandofreak
      Brandofreak 9 months ago +4

      Doing the math Kaze's deadlift works out to 2100 N64 carts.

    • d t
      d t 11 months ago +1

      damn you really would fold simple

    • Jimmy Hirr
      Jimmy Hirr 1 year ago +4

      @Kaze Emanuar I didn't realize that it was common to use English units for weightlifting outside of Anglo countries. Wow, that is still a lot of weight.

  • SNES drunk
    SNES drunk 11 months ago +104

    This is really interesting. I wonder why they felt like the game needed three? vertical raycasts. I suppose that might just go with the territory of making stuff up as you go along

    • Harrison Fackrell
      Harrison Fackrell 5 months ago +4

      @Mariocamspam I think Clement is trying to optimize the *direct* Super Mario 64 port for the 3DS--the one that came out recently, after the source code leak--as opposed to _Super Mario 64 DS._

    • Mariocamspam
      Mariocamspam 6 months ago

      @Clement Poon they dont share the same codebase or render pipeline

    • Clement Poon
      Clement Poon 7 months ago +5

      @Kaze Emanuar where the HELL is it in the code? im going crazy trying to optimise super mario 64 for 3ds right now and i cant bloody find it

    • Kaze Emanuar
      Kaze Emanuar 11 months ago +81

      yep this was clearly an oversight. it was the same raycast 3 times in different parts of the shadow processing. they should have passed the results down to the next function, but did not. i suppose the people implementing these 3 functions each received specifications that did not include knowing the surface data.
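
A minimal sketch of the "pass the result down" idea from the comment above. Every name here (`Surface`, `find_floor_below`, `process_shadow`, the three stage functions) is hypothetical and not taken from the SM64 source; the point is only that the expensive query runs once and later stages receive its result as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the SM64 source. */
struct Surface { float height; float normal_y; };

static int g_raycast_calls = 0;                    /* counts how often the expensive query runs */
static struct Surface g_floor = { 100.0f, 1.0f };  /* fake level data for the sketch */

/* Pretend this is the expensive vertical raycast. */
static const struct Surface *find_floor_below(float x, float y, float z) {
    (void)x; (void)y; (void)z;
    g_raycast_calls++;
    return &g_floor;
}

/* Each stage takes the already-found floor instead of searching for it again. */
static int   shadow_visible(const struct Surface *floor, float y) { return y >= floor->height; }
static float shadow_height(const struct Surface *floor)           { return floor->height; }
static float shadow_tilt(const struct Surface *floor)             { return floor->normal_y; }

/* One raycast per shadow; the result is handed to every stage. Returns 1 if a shadow is drawn. */
int process_shadow(float x, float y, float z, float *out_height, float *out_tilt) {
    assert(out_height != NULL && out_tilt != NULL);
    const struct Surface *floor = find_floor_below(x, y, z);
    if (floor == NULL || !shadow_visible(floor, y)) {
        return 0;
    }
    *out_height = shadow_height(floor);
    *out_tilt   = shadow_tilt(floor);
    return 1;
}
```

If each stage called `find_floor_below` itself, the counter would reach 3 per shadow instead of 1, which is the pattern the comment describes.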

  • zheil9152
    zheil9152 11 months ago +49

    I’d pay good money to take a “Kaze teaches SM64 C programming” class for an intro to modding the game. I’m someone that works on IoT embedded systems, but graphics and games are a whole different animal

    • snes
      snes 5 months ago +1

      just c coding in general, this man is already better than most modern developers in c++ with ue4

  • old mage
    old mage 1 year ago +88

    10:50 WHEEZE; bypassing compilation to save time is such a Kaze solution, well done! Just curious what sort of changes in performance would your changes bring to vanilla SM64? I'm just imagining a world where I can look dead on in Fire-Sea or Bowser's Sub in DDD without it being a slideshow ha
    You're a titan brother

    • Kaze Emanuar
      Kaze Emanuar 9 months ago +15

      @t0lkki if it moved, it'd be the exact same lag. i think the reason they didnt make it permanent collision is that it disappears on a later act and simply didnt have a function to make it permanent collision only on certain acts. it takes 2 minutes to fix though. i think it was just programmer stupidity.

    • t0lkki
      t0lkki 9 months ago +3

      @Kaze Emanuar oh that's interesting, doesn't that imply the sub was intended to move at some point? now that'd been a slideshow to watch!

    • Kaze Emanuar
      Kaze Emanuar 9 months ago +14

      @t0lkki the problem is not that the sub is drawn every frame. its the collision math. you could simply load it as permanent collision and it'd be fine. ive done this in my sm64 multiplayer and the sub in that game has a higher framerate in multiplayer than it'd usually have in singleplayer...

    • t0lkki
      t0lkki 9 months ago +3

      @Kaze Emanuar not drawing the whole sub each frame could bring the framerate back up from a code perspective, but do you know how much of an impact it would have if the sub was made with a fraction of the triangles instead (without any changes to the code)?

    • Kaze Emanuar
      Kaze Emanuar 1 year ago +54

      those 2 levels could definitely be lagless with a few more tweaks!

  • John Smith
    John Smith 1 year ago +59

    As a server programmer making microservices to be managed across dozens of servers, I think I'll stick to my high level abstractions and libraries. However, I always give props to people that take the time to do proper optimization where it's important, especially in projects like this when they can optimize down to the hardware level some of the calls that will happen millions of times. Nice job Kaze. Respect from a fellow dev.

  • Ben Goodwin
    Ben Goodwin 1 year ago +70

    "the compiler wasn't good enough so I rewrote this function in assembly" what a madlad

  • thekingofmoo
    thekingofmoo 1 year ago +72

    Could you make a rom hack of standard Mario 64 using all of these optimizations?

    • deyvien
      deyvien 11 months ago +1

      @RichardKidd2010 wouldn't these optimizations be public via recompiling rewritten code? Unless you're saying that 16:9 support requires a bunch of rewriting, which I don't think it does given Everdrive / GameShark codes existing.

    • RichardKidd2010
      RichardKidd2010 11 months ago

      I don't know how possible that would be without essentially rewriting and recompiling the game code

    • deyvien
      deyvien 11 months ago +6

      i wonder with all these optimizations if a 16:9 mod would run mostly at 30fps

  • AdrienTD
    AdrienTD 1 year ago +46

    9:40 Kind of weird that GCC is not able to optimize float moves to immediate int moves by itself. x86 compilers can do such optimizations, so maybe it's disabled on MIPS because it could cause unexpected behavior? Or maybe that would only work in C++? Or maybe I just don't get it 🤔

    • Pointing Device
      Pointing Device 11 months ago +5

      Maybe it doesn't realize it actually can save instructions (and especially loads) since loading a single precision float is a two-step process but so is a 32-bit immediate, since classic RISC instructions only take 16 bits of data at once. But since all the lower bits of the representation of 1.0f are zero, it can be loaded in a single step (lui $2,16256). PowerPC has a similar issue, they only figured that out for ARM. Old x86 cheats by having a FLD1 instruction.
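
The `lui $2,16256` remark above can be checked with a few lines of C. This is a sketch that assumes 32-bit IEEE-754 floats; the helper names `float_bits` and `loadable_with_single_lui` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a single-precision float (assumes IEEE-754, 32 bits). */
uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to reinterpret the bytes */
    return bits;
}

/* True when the low 16 bits are zero, i.e. one "load upper immediate" can build the value. */
int loadable_with_single_lui(float f) {
    return (float_bits(f) & 0xFFFFu) == 0;
}
```

For `1.0f` the pattern is `0x3F800000`, whose upper halfword is `0x3F80` (16256 in decimal) and whose lower halfword is zero. A value such as `0.1f` has non-zero low bits and would need a second instruction.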

    • Nobody
      Nobody 11 months ago +7

      @Erik MIPS support on GCC is actually not super duper hot like it is on x86. IDO on O2 is actually pretty good for being a 1994-1996 compiler and GCC here is only *marginally* better on the default settings, although you can get a lot more out of it by being flag specific like Kaze is doing.

    • Erik
      Erik 11 months ago +16

      My guess is gcc isn't as good on most other platforms as it is on x86. My megadrive project uses gcc but I have to inspect the compiler's output and use inline asm often in performance critical areas to get around gcc's poor 68k codegen.

    • alfie gordon
      alfie gordon 11 months ago +4

      Or it’s just gcc being a pile of suck

  • Mario
    Mario 1 year ago +41

    I think you should add a minecart outside of Bowser's Blazing Burrows. That would explain how Mario hopped into the cart.

    • Mario
      Mario 1 year ago +15

      @Kaze Emanuar good. Really appreciate the efforts you working on this major rom hack!

    • Kaze Emanuar
      Kaze Emanuar 1 year ago +43

      that is the plan

  • Bizzozeron
    Bizzozeron 1 year ago +9

    Those bonus bones are also present in Melee models, I theorize that they're probably referential bones because the bones have issues tracking their relationships to their original location, they're commonly found in shoulders and thighs, things protruding from the base structure

  • SirSethery
    SirSethery 11 months ago +10

    A 60fps Mario 64 on original hardware would be incredible. Too bad none of this improves the renderer. Super cool nonetheless.

    • Kaze Emanuar
      Kaze Emanuar 11 months ago +18

      this frees up some memory reads, meaning the renderer is also slightly sped up! 60fps sm64 is within reach.

  • B Targ
    B Targ 1 year ago +132

    Making this open-source could help optimise the code even further, but this is really pushing the hardware to its limits! Can't wait for more

    • Frogge
      Frogge 4 months ago +1

      It would be a sure path to a C&D from nintendo

    • Henrix98
      Henrix98 1 year ago +18

      I doubt their team wouldn't already have everyone deeply enough interested to this topic to be helpful

  • ????? le ????
    ????? le ???? 4 months ago +3

    The moment Kaze started to explain the code he rewrote, he just goes gigachad. Especially how he said he just decided to rewrite stuff in assembly...

  • patrickgh3
    patrickgh3 1 year ago +11

    Great video! I really appreciate how you emphasize the context that the original code was written in, and how it's different from the context you're making these improvements in. Like the disclaimer at the very start of the video, and how you brought in an actual developer of the game to ask about the compiler options!
    Personally I might have given the 3 raycasts thing more slack, or at least not pinned it on 1 theoretical person. As a complete hypothetical, maybe the extra raycasts were to fix bugs that occurred in 1 specific level, and the team was on a deadline, so they chose that fix. That said, I could be wrong, and you're the one who's seen the actual code. Anyways, I appreciate all the attention given to the original developer context throughout the video.

  • Samuel Voltz
    Samuel Voltz 11 months ago +15

    Me almost to the end of the rambling section: Okay, this is impressive. It can't possibly get any more insane...
    Kaze: So I started coding in assembly

  • gneii
    gneii 1 year ago +5

    At 7:30 another approach might be to use the __restrict keyword for the parameters. The reason it doesn’t pipeline the loads/stores is because it can’t assume the two pointers don’t overlap (in other words dst[0] might be src[1] in memory, so it thinks it has to do that store before it does the load). Using restrict lets the compiler know that the two pointers don’t overlap/alias so it can perform optimizations like you did manually. This also might be something that could be used elsewhere as well to enable better optimizations.
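
A sketch of what the suggestion above might look like, using the C99 `restrict` spelling (`__restrict` is the GCC extension form). The function name `vec3f_copy_restrict` and its signature are assumptions for illustration, not the game's actual declaration, and whether the compiler really reorders the loads depends on the compiler version and flags.

```c
#include <assert.h>

/* restrict is a promise from the caller that dest and src never overlap,
 * which leaves the compiler free to issue all three loads before any store. */
void vec3f_copy_restrict(float *restrict dest, const float *restrict src) {
    assert(dest != src);   /* debug-only reminder of the no-overlap contract */
    dest[0] = src[0];
    dest[1] = src[1];
    dest[2] = src[2];
}
```

Calling this with overlapping pointers is undefined behavior, so the keyword only belongs on functions whose callers can guarantee separate buffers.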

    • Kaze Emanuar
      Kaze Emanuar 1 year ago +5

      thats true! i had not known that before. im really bad with compilers haha

  • Mark Lee
    Mark Lee 1 year ago +60

    Now we just need a DeLorean so we can go back to 1995 and give a VHS of this upload to all N64 developers. 😅

    • WestHaddnin
      WestHaddnin 8 months ago


    • blargg
      blargg 11 months ago +5

      They still wouldn't have enabled GCC optimizations unless you show a video that using the compiler of the time didn't make a build with bugs.

  • unsubtract
    unsubtract 1 year ago +7

    Would Link-Time Optimizations be able to further speed up the game past -Ofast?

    • unsubtract
      unsubtract 1 year ago +8

      @Kaze Emanuar -flto flag in GCC (on both the compiler and linker), enables some additional optimizations done during linking.

    • Kaze Emanuar
      Kaze Emanuar 1 year ago +1

      i have no idea what that is honestly. im an asm boi

    TCMOREIRA 9 months ago +2

    Man I'd love a thousand hours video only about the insane C code ramblings!

  • Mark Grgurev
    Mark Grgurev 11 months ago +4

    A year or two ago, I captured frames from several N64 games using an emulator and RenderDoc and I noticed that most N64 games tended to use the painter's algorithm despite using a depth buffer. I'd imagine there would be a decent speedup getting rid of the unneeded write to the depth buffer, write to the color buffer, and possible read from the color buffer if the triangle was transparent. Has anybody messed with drawing opaque triangles front to back and transparent triangles back to front?

    • Mark Grgurev
      Mark Grgurev 9 months ago

      @Kaze Emanuar I know that Conkers Bad Fur Day also draws from back to front, too, so it's possible N64 games rendered like that the whole time with the exception of that game used the Z-sort microcode.

    • Kaze Emanuar
      Kaze Emanuar 11 months ago +6

      i do my best to use both of these, however, i haven't implemented a sorting algorithm yet. i simply do it by presorting the drawing order within levels atm. doing that could be a pretty insane speedup though, since usually the pixel drawing is a big limiting factor. i imagine a lot of the later released N64 games have some sort of sorting going on.
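
The ordering discussed in this thread (opaque geometry near to far, transparent geometry far to near) can be sketched as a sort over a draw list. `DrawItem` and `sort_draw_list` are hypothetical names, and a real N64 renderer would sort display lists or objects rather than a plain array; this only illustrates the comparison rule.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical draw-list entry: depth is distance from the camera. */
struct DrawItem { float depth; int transparent; int id; };

static int compare_draw_items(const void *pa, const void *pb) {
    const struct DrawItem *a = pa;
    const struct DrawItem *b = pb;
    if (a->transparent != b->transparent) {
        return a->transparent - b->transparent;   /* all opaque items first */
    }
    if (a->depth == b->depth) {
        return 0;
    }
    if (a->transparent) {
        return (a->depth > b->depth) ? -1 : 1;    /* transparent: far to near */
    }
    return (a->depth < b->depth) ? -1 : 1;        /* opaque: near to far */
}

/* Sort so the depth test rejects hidden opaque pixels early and blending stays correct. */
void sort_draw_list(struct DrawItem *items, size_t count) {
    assert(items != NULL || count == 0);
    qsort(items, count, sizeof *items, compare_draw_items);
}
```

Sorting every frame has its own CPU cost, which is presumably why presorting the draw order per level, as described above, is an attractive shortcut.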

  • Shadoninja
    Shadoninja 11 months ago +8

    You should have done a TAS side by side before/after to show the visual difference

  • easyaspi314
    easyaspi314 8 months ago +12

    7:17 Have you considered the restrict keyword? Without the restrict keyword, by the laws of the C standard, it must assume that dest and src overlap.
    So, for example, let's say you did
    float x[4] = { 1, 2, 3, 4 };
    copy(&x[1], &x[0]);
    If GCC loaded first then stored, it would end up in 1 1 2 3, but the C standard says it should be 1 1 1 1.
    The restrict keyword says "these are never going to overlap" and therefore it doesn't need to worry about that.
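
The overlap example above can be reproduced by writing both orderings out by hand. The two function names are made up for this sketch; the first follows the order the C source spells out, the second is what a compiler would only be allowed to generate if it knew the ranges did not overlap.

```c
/* Interleaved load/store: the order written in the source. */
void copy_interleaved(float *dest, const float *src) {
    dest[0] = src[0];
    dest[1] = src[1];
    dest[2] = src[2];
}

/* All loads first, then all stores: equivalent only when dest and src don't overlap. */
void copy_loads_first(float *dest, const float *src) {
    float a = src[0];
    float b = src[1];
    float c = src[2];
    dest[0] = a;
    dest[1] = b;
    dest[2] = c;
}
```

With `float x[4] = { 1, 2, 3, 4 }` and a call on `(&x[1], &x[0])`, the interleaved version leaves `1 1 1 1` while the loads-first version leaves `1 1 2 3`, matching the comment.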

    • easyaspi314
      easyaspi314 5 months ago +1

      @ss l they didn't use GCC so I doubt it

    • ss l
      ss l 5 months ago +6

      @Kaze Emanuar "restrict" was introduced in C99, it did not exist when Mario 64 was being made. I assume there was a GNU extension prior to 1999 but I don't know when.

    • Kaze Emanuar
      Kaze Emanuar 8 months ago +13

      yeah, that would have worked. im no expert at C so i had no idea until i saw a few comments like this.

  • Daniel Savage
    Daniel Savage 1 year ago +5

    For the `vec3f_copy` optimization, is the speedup coming from better pipeline usage? Like, my initial guess is that the fetch/decode bits of the processor would be consistently busy and less contested.

    • Kaze Emanuar
      Kaze Emanuar 1 year ago +6

      yep, that's how it saves 3 cycles. one cycle on each load+store

  • gumgrapes
    gumgrapes 1 year ago +6

    Absolutely amazing Kaze. I really hope to learn from you someday.

  • Xeridea
    Xeridea 11 months ago +4

    The pains of working on older hardware. The load/store issue is much less of an issue with out of order CPUs. They rearrange instructions on the fly to try to prevent issues such as memory stalls. This was also the early days of realtime 3D rendering, and not as many shortcuts were known.

    • Thelango99
      Thelango99 9 months ago

      Even the XBOX 360 CPU used in-order execution.

  • Pea Otter
    Pea Otter 9 months ago +1

    I absolutely love when someone's pushing hardware to its limits.

  • Won Chun
    Won Chun 11 months ago +6

    For the first optimization (load/store x3 v. load x3 + store x3), I would guess that this is an aliasing issue, and could be solved more simply with the restrict keyword. Good stuff!

    • angeldude101
      angeldude101 8 months ago

      I was thinking that the compiler should be able to optimize a simple memcpy like that, but I forgot that C allowed mutable aliasing.

    • reeeeeeeeemmmmmmmmmm
      reeeeeeeeemmmmmmmmmm 11 months ago

      Yep exactly this. For arguments that overlap in memory the optimization will cause different behaviour, so the compiler can't apply it without you pinky-promising that they don't.

    • blargg
      blargg 11 months ago +1

      Came here to say this. That can help almost all these cases, since the optimizer must otherwise assume the worst case that every store could modify any other value you're loading.
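
The "every store could modify any other value you're loading" point also applies to values read inside loops. A common manual workaround, sketched here with a hypothetical `scale_all` function, is to copy the value into a local before the loop so that no store through `out` can be assumed to change it.

```c
#include <assert.h>

/* Multiply n floats by *scale. Reading *scale into a local once means the compiler
 * does not have to reload it after each store to out[i], which it otherwise must
 * do because out and scale might alias. */
void scale_all(float *out, const float *in, const float *scale, int n) {
    assert(n >= 0);
    const float s = *scale;          /* loaded once, before any store */
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * s;
    }
}
```

This changes behavior in the one case where `scale` really does point into `out`, which is the same trade-off `restrict` makes explicit.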

  • liorhaddad
    liorhaddad 1 year ago +9

    Ooo! Coding video! My favourite kind!
    That level of optimization is insane! really cool!

  • roblox_harry 2006
    roblox_harry 2006 10 months ago +6

    I'm not very tech-savvy with these explanations, but I would love to see Mario 64 eventually running in 480p at 60fps on real N64 hardware!!!

    • Adriel Oliveras
      Adriel Oliveras 5 months ago

      Resolution is a hardware limitation.

  • Crim Sama
    Crim Sama 11 months ago +1

    Awesome to see! The footage shown looks like a 3ds game! Insane improvements imo.

  • AssailantLF
    AssailantLF 1 year ago +3

    God damn I love in-depth technical videos relating to video game software. Thanks Kaze for being so inspirational and awesome.

  • CAEC64's New Channel
    CAEC64's New Channel 1 year ago +42

    the optimization is REAL!!

    • Weegeepie
      Weegeepie 1 year ago +1


    • Dozaemon
      Dozaemon 1 year ago +4

      Ubisoft: opti-what?

  • Vato From the Astral Plane
    Vato From the Astral Plane 8 months ago +1

    Man I can't wait until you get your hands on OoT's code. Gonna be so awesome.

  • Random Tech, Auto, Security, & Skateboarding
    Random Tech, Auto, Security, & Skateboarding 11 months ago +3

    Wow, great stuff! It is always great to hear from talented reverse code engineers who understand the fundamental concepts required to implement such optimizations and get ever closer to squeezing out every last drop of free performance possible. What kind of tools are you using to do profiling between pre-optimization and post-optimization functions? Are you using some type of MIPS emulator (in IDA or otherwise) to get a quick idea of how changing a function in different ways affects performance (since most of this seems to be related to graphics rendering I guess a MIPS emulator probably wouldn't be helpful)? Or maybe you're using a PC N64 emulator to run the code? Having zero experience with N64 homebrew development (I work as a reverse code engineer in the security industry and mostly write software/OS exploits.. mostly x86), I'm just curious what the tool chain looks like.

    • Kaze Emanuar
      Kaze Emanuar 11 months ago +3

      I was testing on N64. It has tools to check how many cycles have passed from which you can calculate how long it takes to render a whole frame. I was not benchmarking individual functions really (that would have been pointless due to instruction cache), I was always benchmarking the whole frame with the same camera setup.
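
Turning a cycle count into a frame time is simple arithmetic. The reply does not say which counter was used, so this sketch assumes a free-running 32-bit counter ticking at 46,875,000 Hz (half of a 93.75 MHz CPU clock); the constant and both function names are assumptions to adjust for the actual setup.

```c
#include <stdint.h>

/* Assumed tick rate: 46,875,000 Hz. This is an assumption, not something stated in the thread. */
#define COUNTER_HZ 46875000u

/* Ticks between two samples; unsigned subtraction copes with a single wraparound. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end) {
    return end - start;
}

/* Convert ticks to whole microseconds, using 64-bit math so the multiply cannot overflow. */
uint32_t ticks_to_us(uint32_t ticks) {
    return (uint32_t)(((uint64_t)ticks * 1000000u) / COUNTER_HZ);
}
```

Under that assumed rate, a 30 fps frame budget is 1,562,500 ticks, or about 33,333 microseconds. Sampling at the same point of every frame with a fixed camera, as described above, keeps the numbers comparable between builds.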

  • Aunarky
    Aunarky 1 year ago

    Really well thought out and well constructed video, I hope you make more of this! It was such a treat to watch :)

  • Dissonance Paradiddle
    Dissonance Paradiddle 8 months ago

    The section with bomb Mario looks unbelievable!! I can't believe how good the lighting and polygons look. It's very polished and clean. It's so surreal

  • AntiSocial Vigilante
    AntiSocial Vigilante 7 months ago

    There is also a small micro-optimization that you could try: have for loops decrement to zero, as comparing against 0 is a lot faster for the processor. Though I don't know if compilers already optimize this.
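
For reference, a count-down loop of the kind suggested above looks like this. `sum_count_down` is a made-up example; whether it is any faster than the count-up form depends on the target and on what the compiler already does with the loop, so it is worth measuring rather than assuming.

```c
#include <assert.h>

/* Sum n ints, iterating from the last index down to 0 so the loop test is a
 * comparison against zero instead of against a second variable. */
int sum_count_down(const int *values, int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = n - 1; i >= 0; i--) {
        total += values[i];
    }
    return total;
}
```

Note that reversing the order is harmless for integers but can change the rounding of a floating-point sum.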

  • N64 Brasil
    N64 Brasil 11 months ago +1

    This is beautiful. I can't help but imagine the effect on the original game running on real hardware. Maybe a patch will come from this? (PLEASE KAZE!!!!)

  • Roxanne Celi
    Roxanne Celi 8 months ago +1

    Hello Kaze. Have you ever considered the idea of collaborating with the Render96 team?
    When I see what they have done with the graphics and sounds of the game and when I see your optimizations I can't help but think that you could give us a real definitive edition of Super Mario 64 with all the good stuff. (stable 60fps, HD models, textures and sounds, ray tracing, quality of life improvements and maybe a fix for most if not all bugs and of course Luigi and Wario).
    Thanks for what you are doing Kaze! I can't wait for your next game. Looks pretty cool so far! :)

  • An ordinary gaming channel
    An ordinary gaming channel 9 months ago

    Wow, this is some amazing work!

  • gudenau
    gudenau 1 year ago +1

    I'm so glad you did the part at the end. Some of those changes I didn't know how they could be faster than what the compiler should output.

  • Zygal Studios
    Zygal Studios 9 months ago +6

    The left side of the picture at 0:38 about sums up my frustrations with programming culture nowadays 🤣
    Very great insight! Loved the video!
    Utilizing the hardware provided will always provide a better solution than expecting the compiler to do things for you.
    The compiler can only reason about such a small portion of your code even with optimizations turned on.
    For this case you almost chopped the time in half by using the cache better and shaved minuscule amounts of time from using the compiler optimizations.

  • Richard G
    Richard G 11 months ago +2

    Every now and again it's good to be reminded, as I sit there looking smug because of some optimisation I've done in the game I'm making, that there is a whole other level of big-brain optimisations I don't even know exists.

  • Rainer K.
    Rainer K. 5 months ago

    Plenty of those optimizations bring back memories. It's how you had to code on the Amiga if you wanted fast code without having to resort to assembler.

  • Kyrieru
    Kyrieru 10 months ago

    I'm so thankful for game engines and the opportunities they give people on the artistic side of things. It's amazing we ever got video games at all in the early days given how much wizardry they took just to get things running. Early 3d Nintendo games did things that even games today don't do, which is both impressive and depressing.

  • RADkate
    RADkate 1 year ago +25

    3:20 im pretty new to programming but i guess they used the 3 raycasts to calculate the angle of the ground instead of just getting the normal direction from the ground below?

    • not a lost number
      not a lost number 6 months ago

      @TJ ;
      I hope you do see a difference in frametimes!

    • TJ
      TJ 6 months ago

      @Rena Kunisaki this was a great suggestion. I changed all references to 5 digits and I see no difference.

    • TJ
      TJ 6 months ago +1

      @not a lost number I took your advice and changed all references to PI in Mario 64 to 5 digits, just compiled and I can see no difference.

    • Konomi
      Konomi 11 months ago +1

      @not a lost number as someone who has memorised 50 digits of pi, i have never used more than 5

    • not a lost number
      not a lost number 1 year ago +8

      @Rena Kunisaki ;
      I've been doing some trigonometry for quite a bit of time, and let me tell you.
      3.1416 is very good for all cases.

  • SullySadface
    SullySadface 1 year ago +3

    nice work, NERD
    is there a patch that could be applied or a build of the pc port that utilizes this?

    PBJ AND A HIGHFIVE 11 months ago

    Every time I watch a Kaze explanation video I get a serious case of intelligence envy. This man is incredibly impressive.

  • Memnarch
    Memnarch 9 months ago +1

    The US version was compiled without optimization. The EU version was compiled with optimization. That's why Bowser's submarine runs way faster on the EU version.
    The EU version had a later release date, so I'd assume it has to do with deadlines, as hinted in this video.

  • Jack Stone
    Jack Stone 1 year ago +2

    As someone who only had taken an intro to C++ course, the only thought I had was “oh that’s a void function. That’s neat.”

  • Iwer Sonsch
    Iwer Sonsch 11 months ago +1

    Now for applications like RCPS or training neural networks to play romhacks, you don't need to render frames at all (or only on a small percentage of sample occasions). Would be an even bigger speedup.

  • BlackWorm
    BlackWorm 1 year ago +1

    Awesome video Kaze.
    Do you think you could apply these optimizations to some of your more popular rom hacks?

  • MusicManiac898
    MusicManiac898 11 months ago

    Really incredible work!

  • Cecille Wolters
    Cecille Wolters ปีที่แล้ว +1

    Awesome work man!
    If you could build your own console, would you make it similar to this or just something else entirely?

  • magnusm4
    magnusm4 5 หลายเดือนก่อน

    When it comes to optimization and relegating work, many underestimate how much performance they can save by simply putting repeated linear code on the GPU and leaving the logic to the CPU, where it belongs. There's a reason the GPU is named after its main function: highly demanding graphics computing.

  • Richand Darksbane
    Richand Darksbane ปีที่แล้ว

    Incredible stuff you've done here!

  • Elijah Robertson
    Elijah Robertson 11 หลายเดือนก่อน +1

    If I had to take a guess at all of the extra unused bones, they were probably used for corrective targeting and for driving the rig. They just didn't get rid of the non-deformation bones in the engine proper. The animations for Super Mario 64 are pretty complex and detailed for the time, so I am sure they had some pretty complex rigs!

    • Kaze Emanuar
      Kaze Emanuar  11 หลายเดือนก่อน +6

      these bones don't animate usually. they most of the time just have 1 set rotation throughout the whole animation. i think nintendo simply didnt realize how performance eating this was.

  • TCMOREIRA
    TCMOREIRA 9 หลายเดือนก่อน

    Are you planning on bringing such optimizations to the open-source PC version as well?

  • Spooky Ghost
    Spooky Ghost 9 หลายเดือนก่อน

    ideally i feel like all those math functions are written fine and compiler optimizations should be added for those cases. cool video good work

  • grizzoo
    grizzoo ปีที่แล้ว

    Thanks, I appreciate these videos just as much as your others.

  • Rubixninja314
    Rubixninja314 11 หลายเดือนก่อน +46

    "my compiler will optimize it all, I don't need to understand the cpu"
    So much this. Been working on rewriting the Kociemba algorithm (for cube solving) in assembly for modern hardware, though of course I haven't finished it because ADHD had other plans. For the record, Kociemba definitely understood how CPUs work when he wrote his C implementation, BUT it turns out that once you get AVX involved and do it in assembly, you can reach 1 billion turns per second on a laptop. Though seriously, throughout this process I've learned that even if you don't use any asm, understanding those details under the hood is absolutely critical. Both for performance _and_ readability.

    • John Simon
      John Simon 6 หลายเดือนก่อน

      I'm trying to find a video of Michael Abrash (of Quake/Oculus/Graphics Programming Black Book) and how he optimized the snot out of a naive implementation of Conway's Game of Life 1000x over.
      > BUT turns out once you get AVX involved
      oh, sure once you get hardware support the sky's the limit. It's a world of difference from an early 90s MIPS

    • locklear308
      locklear308 11 หลายเดือนก่อน +1

      @Rubixninja314 oh thanks man that sounds dope!

    • Rubixninja314
      Rubixninja314 11 หลายเดือนก่อน +5

      @locklear308 you might want to check out the channel "What's a Creel?" He does a fair bit of assembly and it's where I learned a lot of what I know.

    • locklear308
      locklear308 11 หลายเดือนก่อน +4

      It's sad, so much CPU power nowadays is basically wasted.
      I would LOVE to see something assembly based running on a modern CPU. Like, you know those old fun little demo things you could run on a Commodore 64? Where you could run colors across the screen and see how much faster machine language was than BASIC?

  • kemo
    kemo ปีที่แล้ว +2

    You are a legend, I like your vec optimizations, especially the one done in assembly

  • Kranker Geist
    Kranker Geist ปีที่แล้ว

    You are very knowledgeable when it comes to Super Mario 64 and the N64 as a whole. Impressive work, Kaze!

  • Michael Jarriel
    Michael Jarriel ปีที่แล้ว +2

    This guy is actually insane

  • Hexagon
    Hexagon 5 หลายเดือนก่อน

    This is an _insane_ video, amazing! This is probably a dumb obvious question, but have you tried clang to see if it does better for some functions? also using those other flags past just -Ofast for fast math (which seems like the main bottleneck here), but idk how the N64 FPU works as much as x86
    also such an on point meme at the beginning lmao

  • Panegyr
    Panegyr 11 หลายเดือนก่อน +1

    As something of an embedded developer myself I live for this content

  • Takumi
    Takumi ปีที่แล้ว +6

    Is there any possibility of applying these optimizations along with the other ones in previous videos to something like Star Road to give it nearly perfect framerate on console?

    • Takumi
      Takumi ปีที่แล้ว +2

      @Kaze Emanuar Sick. I'll keep my fingers crossed.

    • Kaze Emanuar
      Kaze Emanuar  ปีที่แล้ว +15

      yes. i might do it one day.

  • kiwirocket64
    kiwirocket64 11 หลายเดือนก่อน +2

    Mr Kaze, I have a question: do you think you could optimize Super Mario 64 Land? I've been really enjoying the rom hack, but in certain levels the frame rate goes down to like 5 fps, so I was wondering if you could optimize it. Then again, I downloaded it when it came out, so maybe there was an update to fix it, I don't know.

  • Mark Kane
    Mark Kane ปีที่แล้ว +10

    Frankly that's incredible! Next challenge: Perfect Dark at a playable framerate! lol

    • not a lost number
      not a lost number 11 หลายเดือนก่อน

      @westingtyler ;
      It's an example of how something can be way faster in one game than the other, and I said about pausing because it's a notorious example of something being different.
      Yes, it's nice how it shows the clock, but if you pause and unpause many times, each time it's a little bit faster, compared to PD where it's essentially the same time across the board.
      If you want a better example of performance differences, walk forward in each game while looking at the horizon. In PD you will think you move way faster than GE, but that's because of framerate discrepancies.

    • westingtyler
      westingtyler 11 หลายเดือนก่อน

      @not a lost number really? I always thought the Goldeneye pause menu loading slow was just a dumb directorial decision to show the watch moving up onto screen. do you know for sure it's to help with load times of the menu?

    • not a lost number
      not a lost number ปีที่แล้ว +6

      I believe GoldenEye would need it more; Perfect Dark is a bit more optimized (look at how fast the pause menu can be toggled in each game, for example).

  • Wolfcl0ck
    Wolfcl0ck ปีที่แล้ว

    absolutely insane, I love this

  • Johnatan Gonzalez
    Johnatan Gonzalez ปีที่แล้ว

    dude keep it up, I love these technical videos of yours

  • Kram1032
    Kram1032 11 หลายเดือนก่อน +1

    9:20 I wish you had explained the crazy thing going on here
    9:47 ok with how common that sort of thing is, I'd expect compilers to be able to recognize that and fix it for you. Just specifically the numbers 0, 1, and -1 should get optimizations of that sort. And maybe even all constants. I see no immediately obvious reason why the compiler couldn't just do that for you.

    • Kaze Emanuar
      Kaze Emanuar  11 หลายเดือนก่อน +1

      Yeah the 9:47 issue is pretty silly. I heard that old versions of GCC do actually do that for you, but some update messed it up. Maybe there is a good reason to do it the "wrong" way on modern machines.

  • Ice God
    Ice God ปีที่แล้ว

    Besides the N64 hardware-specific instruction cache thing, I actually guessed all of the optimizations because I was pausing and reading the code in Chapter 2. Most of them used pointer math to save one operation per matrix access, and some made clever use of registers, but the struct cast was really cool! The assembly one was really cool too (I use a different assembly syntax, so I am very confused about that section, rip). It took me a while to get that 0x3F800000 was the float for 1 lol.
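
For anyone else puzzled by 0x3F800000: it is the IEEE 754 single-precision bit pattern of 1.0 (sign 0, biased exponent 127, mantissa 0). The sketch below checks that in C; the helper names are made up for illustration, and it assumes a platform with 32-bit IEEE 754 floats, which is what the N64's FPU uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bits of a 32-bit float.
   memcpy is used instead of a pointer cast to stay within the
   aliasing rules of standard C. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical helper: build a float from a raw 32-bit pattern. */
static float bits_to_float(uint32_t bits) {
    float f;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With these, `float_bits(1.0f)` gives 0x3F800000 and `float_bits(-1.0f)` gives 0xBF800000, since only the sign bit differs.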

  • IWasLego
    IWasLego 8 หลายเดือนก่อน

    This man is way too clever. The amount of talent you put into your projects baffles me. 90% of your projects shouldn’t be possible on console.

  • Eduardo Simor
    Eduardo Simor 11 หลายเดือนก่อน +1

    I would love if you give ocarina of time the same treatment, I heard that the source code would be available soon

  • Michatroschka
    Michatroschka 11 หลายเดือนก่อน

    is that stuff playable?? very nice work man, love what youre doing for the n64 community

  • Sofia Lafitte
    Sofia Lafitte 9 หลายเดือนก่อน

    Unrelated but I adore what you did with the vertex shading. It makes things look so vibrant and lively. Very Spyro-esque, which is a great thing in my opinion!

  • Justin113D
    Justin113D 11 หลายเดือนก่อน +2

    i actually looked at all the math functions. Good stuff!
    Also, this madman actually just said "fuck this floating point conversion function in particular" and rewrote it in assembly

  • The 1-Up_Triforce
    The 1-Up_Triforce ปีที่แล้ว +5

    Even though I understand all of this in the most rudimentary way ("remove unused wasted space, game goes faster, woo!"), it still fascinates the hell out of me. Never stop doing these videos.

    • Merlin
      Merlin 11 หลายเดือนก่อน +1

      Same here

  • Hylian monkeys
    Hylian monkeys ปีที่แล้ว

    This is super neat.
    I build simple computers in minecraft so I can relate to the feeling of accomplishment from getting code to function faster.

  • gdm413229
    gdm413229 11 หลายเดือนก่อน

    Some of the names of the data types in the code remind me a bit of the built-in types in GLSL.

  • Clement Poon
    Clement Poon 7 หลายเดือนก่อน

    you should make a source code patch with the optimisations, it would've helped A LOT in sm64 3ds

  • Jimmy Hirr
    Jimmy Hirr ปีที่แล้ว +1

    8:51 Why do temp++ first and then temporarily subtract one to set it? Wouldn't it be faster to set it and then increment temp? Or is it faster this way because it spreads out the memory accesses?

    • Jimmy Hirr
      Jimmy Hirr ปีที่แล้ว +1

      @Kaze Emanuar So it's implemented that way because of the pipeline. Thanks for explaining! I didn't know that MIPS could do a subtraction and a store in a single instruction.

    • Kaze Emanuar
      Kaze Emanuar  ปีที่แล้ว +4

      i subtract temp first so that sum2 is available for use in the register for the addition in the next line. subtracting one happens in a single instruction (swc1 register -4(temp)), so it wastes no extra time that way
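
A rough C sketch of the two orderings discussed in this thread. It is not the code from the video: the function names, the scaling operation and the parameter `k` are placeholders. The point is only that `temp[-1]` after `temp++` is a store with a constant negative offset, which MIPS encodes directly in the store instruction (the `swc1 register -4(temp)` mentioned above), so the "subtract one" costs no extra instruction.

```c
#include <assert.h>

/* Conventional form: store, then advance the pointer. */
static void scale_post_increment(float *temp, const float *src, int n, float k) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        *temp++ = src[i] * k;
    }
}

/* Form discussed above: advance the pointer first, then store one
   element behind it. The -1 index becomes a -4 byte offset in the
   store instruction rather than a separate subtraction. */
static void scale_pre_increment(float *temp, const float *src, int n, float k) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        float value = src[i] * k;
        temp++;
        temp[-1] = value;
    }
}
```

Both functions write identical results; whether the second ordering is actually faster depends on how the compiler schedules the surrounding instructions on the target CPU.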

  • ultraokletsgo
    ultraokletsgo 11 หลายเดือนก่อน +1

    You and Displaced Gamers are producing great content!

  • Alissa Swan
    Alissa Swan 6 หลายเดือนก่อน

    Crazy how much optimization you can do on sm64

  • iamdarkyoshi
    iamdarkyoshi ปีที่แล้ว

    Impressive work dude!!

  • Extreme Wreck 2000
    Extreme Wreck 2000 11 หลายเดือนก่อน

    You know, I have to wonder what will happen if you were to run the original game, but with all those optimizations to make it run faster. Would the game start running way too fast to the point of being nearly unplayable?

  • King of the Grapes
    King of the Grapes ปีที่แล้ว +1

    I love these vids you do about optimizations

  • The 19th Fighter
    The 19th Fighter 11 หลายเดือนก่อน +1

    7:55 I found the following result to be generally a tiny bit better, as it does not involve loading FPU registers (assuming that's actually slower):
    struct Vec3fS {
        float x, y, z;
    };
    void *vec3f_copy3S(struct Vec3fS *dest, struct Vec3fS *src) {
        *dest = *src; /* whole-struct assignment copies x, y and z */
        return dest;
    }
    You could probably just cast the array to a pointer of Vec3fS as shown in another optimization and copy it that way.
    EDIT: nvm, that's why you've cast it to uint specifically.

    • Kaze Emanuar
      Kaze Emanuar  11 หลายเดือนก่อน +2

      this solution would be equivalent. fpus are not slower, ive only cast to u32 in that example because that code was copypasted from somewhere else that did something similar.
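
A minimal sketch of the "cast the array and copy it as a struct" idea from this thread, assuming the vectors are plain arrays of three floats. The function name is hypothetical, and `struct Vec3fS` is the commenter's type rather than something confirmed to exist in the game's source.

```c
#include <assert.h>

/* Three floats laid out back to back, matching a float[3] array. */
struct Vec3fS {
    float x, y, z;
};

/* Hypothetical helper: copy a 3-float array with one struct assignment,
   leaving it to the compiler to pick the loads and stores. */
static void *vec3f_copy_struct(float dest[3], const float src[3]) {
    assert(dest && src);
    *(struct Vec3fS *) dest = *(const struct Vec3fS *) src;
    return dest;
}
```

Which registers the compiler uses for the copy (integer or floating point) is up to the compiler and its flags, so this should be measured on the target rather than assumed to be faster.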

  • Amaroq Starwind
    Amaroq Starwind 11 หลายเดือนก่อน

    Maybe after you're done with Mario 64, you can also start heavily optimizing Ocarina of Time and Majora's Mask... And eventually, other N64 games.
    And then, perhaps, we can move onto other console generations. Many PS2 and some original XBOX games come to mind.

  • PEACEWALKER
    PEACEWALKER 11 หลายเดือนก่อน

    Amazing Work! :)

  • Axel Prino
    Axel Prino 11 หลายเดือนก่อน

    At this point SM64 modders are putting more work into the game than the original team ever intended to do themselves.

  • Mario Wario
    Mario Wario 11 หลายเดือนก่อน +1

    Hey kaze, will you ever release an updated 60fps mod that runs on n64 hardware now?