Finding the BEST sine function for Nintendo 64

  • Premiered May 5, 2024
  • To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant’s annual premium subscription.
    Patreon: / kazestuff
    🎥 / @kazeclips
    🐦 / kazeemanuar
    MERCH: kazemerch.myspreadshop.com/all
    This video was sponsored by Brilliant.
    0:00 Introduction
    0:35 Why is sine Important?
    3:54 Mario 64's sine
    7:04 Ocarina of Time's sine
    9:31 Interwoven co/sine
    16:58 3rd order sines
    18:38 Numerics
    20:21 5th order sines
    22:27 4th order cosine
    23:30 conclusion
  • Gaming

Comments • 1.3K

  • @KazeN64
    @KazeN64  10 months ago +115

    To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant’s annual premium subscription.

    • @official-obama
      @official-obama 10 months ago +1

      5:30 it's wikipedia, you can edit it

    • @CandiedC
      @CandiedC 10 months ago

      Considering you’ve added hi res models into SM64 before, is it possible to add Render96 to it?

    • @SuperM789
      @SuperM789 10 months ago

      @@CandiedC render96 looks like shit, so even if you could it would be better not to

    • @RanEncounter
      @RanEncounter 10 months ago +3

      @ 17:34 the derivative is wrong. The correct derivative is 3ax^2 + b, not 3ax^2 + x.

    • @KazeN64
      @KazeN64  10 months ago +5

      @@RanEncounter oh whoops thats a typo

  • @gabrieltorres6633
    @gabrieltorres6633 10 months ago +2734

    This man knows more about the n64 than nintendo themselves

    • @Havii
      @Havii 10 months ago +75

      Yeah, for real haha

    • @GoofyPoptart
      @GoofyPoptart 10 months ago +33

      Yes he does

    • @italianninjaaa
      @italianninjaaa 10 months ago +114

      @@vexice bruh, you didn't watch all his videos

    • @megadubstepxd9462
      @megadubstepxd9462 10 months ago +50

      Also probably the best assembly coder there is.

    • @megadubstepxd9462
      @megadubstepxd9462 10 months ago +20

      @@vexice well, who still codes in assembly and tries to optimize a programming language from 70 years ago?

  • @prismavoid3764
    @prismavoid3764 10 months ago +1676

    Using ancient Indian mathematical formulas to make the funny red man go bing bing wahoo even faster, even after like five massive optimizations that already gave wild speedups.
    Classic Kaze.

    • @lerikhkl
      @lerikhkl 10 months ago +84

      "funny red man go bing bing wahoo" is a great band name

    • @williamdrum9899
      @williamdrum9899 10 months ago +37

      The trick to fast computing is to use integers as much as possible

    • @Swenglish
      @Swenglish 10 months ago +7

      @@lerikhkl I love their song "everybody cheating but me".

    • @l3rvn0
      @l3rvn0 10 months ago +9

      @@williamdrum9899 I wonder if this is really true.
      A float has about 7 significant decimal digits; if I use an integer and treat the first 7 positions as decimals, would that make my code much faster and more efficient?
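The idea in this question is usually done with binary rather than decimal scaling. A minimal sketch, assuming a hypothetical Q16.16 format (16 fraction bits, so 1.0 is 65536): powers of two are preferred over "7 decimal positions" because the rescaling after a multiply becomes a single shift.

```c
#include <stdint.h>

typedef int32_t q16_t;        /* Q16.16 fixed-point number */
#define Q16_ONE (1 << 16)     /* 1.0 in this format */

static q16_t q16_from_double(double x) { return (q16_t)(x * Q16_ONE); }

/* Widen to 64 bits so the intermediate product cannot overflow,
   then shift the extra 16 fraction bits back out. */
static q16_t q16_mul(q16_t a, q16_t b) {
    return (q16_t)(((int64_t)a * b) >> 16);
}
```

Whether this beats hardware floats depends entirely on the CPU; on chips with slow or absent FPUs it usually does.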

    • @ThePondermatic
      @ThePondermatic 10 months ago +20

      This is part of Bhaskara's legacy, one he could not have possibly imagined hundreds of years ago.

  • @SeanCMonahan
    @SeanCMonahan 10 months ago +450

    Maybe the 562.5 MegaBytes/s at 5:37 should actually be Megabits/s. 562.5 Mbps = 70.3 MB/s.

    • @retrofraction
      @retrofraction 10 months ago +85

      Yeah, I'm sure it was a publication error, as most people forget the huge difference capitalization makes.

    • @InternetListener
      @InternetListener 10 months ago +40

      Add the usual confusion between base-2 and base-10 "mega" and you'll come even closer to 75... like the 1 TB drive that shows up as 931 GB.

    • @joseduarte9823
      @joseduarte9823 10 months ago +16

      @@retrofraction Good thing about Wikipedia is that you can change it and be the MVP!

    • @brandonupchurch7628
      @brandonupchurch7628 10 months ago +5

      I think the raw datapath to the RDRAM is capable of that but due to access latency and the memory architecture of the system the practical bandwidth is much much lower.

    • @lotrbuilders5041
      @lotrbuilders5041 10 months ago +3

      No, the RAM operates at 500MHz over a 9-bit data bus. The 9th bit is only usable by the GPU. The GPU uses the same memory over the same bus, and that's why CPU access is relatively slow

  • @Temulgeh
    @Temulgeh 10 months ago +277

    it's so cool to see situations where it's actually worth spending more CPU cycles to avoid RAM accesses, when a lot of programming problems are the opposite

    • @T3sl4
      @T3sl4 10 months ago +37

      The situation hasn't changed much today, actually; more in terms of scale, and levels of abstraction/recursion -- simple operations like these complete almost instantaneously in CPU time, and the rest of the time is spent ferrying the inputs and outputs through multiple levels of caches. Which may extend not just to system RAM or HDD, but over the network as well (which might in turn be cached across numerous servers, and so on!). Accounting for Rambus access is relatively painless in comparison! 😅

    • @musaran2
      @musaran2 10 months ago +26

      Actually there is a long trend this way: CPUs have accelerated way more than RAM.
      Antique platforms would pre-compute even flipped sprites, because even though RAM was very scarce, the CPU was even more sluggish than that.

    • @crazybeatrice4555
      @crazybeatrice4555 10 months ago +3

      Cache and memory have also stopped scaling as well as compute on newer process nodes

    • @ABaumstumpf
      @ABaumstumpf 10 months ago +15

      That has nearly always been the case. The SNES and earlier systems were basically the last widely available systems where memory access can be done in a single CPU cycle.
      Nowadays we are at a point where it is often faster to do the simplest linear search over hundreds or even thousands of elements instead of using things like binary search that "should" be much faster, as they only need to do a fraction of the work.
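The comparison above can be made concrete with the two searches side by side. This is an illustrative sketch, not a benchmark: both are correct over a sorted array, the linear scan simply streams memory in cache-friendly order while binary search jumps around.

```c
#include <stddef.h>

/* Sequential scan: O(n) comparisons, but perfectly predictable
   memory access that the prefetcher loves. */
static int linear_search(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (a[i] == key) return (int)i;
    return -1;
}

/* Binary search: O(log n) comparisons, but each probe is a
   hard-to-predict jump to a new cache line. */
static int binary_search(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid;
        else                   return (int)mid;
    }
    return -1;
}
```

For small n the crossover point where binary search actually wins has to be measured on the target machine.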

    • @Temulgeh
      @Temulgeh 10 months ago +2

      @@ABaumstumpf a few people have pointed that out and yeah they're right. i was more thinking about caching stuff for recursive problems but optimising for memory access is probably a more common thing

  • @csbluechip
    @csbluechip 10 months ago +499

    I've been optimising sine tables since the ZX Spectrum. I never realised how many tricks I've missed! ... BRILLIANT ... Thanks for sharing

    • @guisampaio2008
      @guisampaio2008 10 months ago +20

      Holy shit bruh, the wise sage spoketh.

    • @JohnKerrashVirgo
      @JohnKerrashVirgo 10 months ago +1

      Speccy > C64 😝

    • @keiyakins
      @keiyakins 10 months ago +38

      On the ZX Spectrum, a LUT was probably the right solution. The CPU was slow and the memory was fast. Though there wasn't very much of it, so you'd definitely want the quarter-table trick.

    • @JohnKerrashVirgo
      @JohnKerrashVirgo 10 months ago +5

      @@keiyakins it does make me want to pull the hardware out and compare the difference. Although I feel this implementation is specific to the way the N64 fetches data

    • @williamdrum9899
      @williamdrum9899 10 months ago +22

      My favorite Z80 trick is aligning a table to &xx00 so instead of loading the address into HL you load the address high byte into H and the desired index into L. Easily triples the speed of any lookup operation
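The page-alignment trick above is Z80 assembly, but the addressing idea can be sketched in C against a fake 64 KB address space. The page value 0x40 below is a stand-in, not a real memory map: with a 256-entry table at an address of the form 0xXX00, the pointer's high byte names the table and the low byte is the index, which is exactly the "load H with the page, L with the index" shortcut.

```c
#include <stdint.h>

static uint8_t ram[0x10000];   /* pretend 64 KB address space */
#define TABLE_PAGE 0x40        /* table assumed to sit at 0x4000 */

/* The whole lookup is "high byte = page, low byte = index" --
   no 16-bit address arithmetic needed. */
static uint8_t page_lookup(uint8_t index) {
    uint16_t addr = (uint16_t)((TABLE_PAGE << 8) | index);
    return ram[addr];
}
```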

  • @totalphantasm
    @totalphantasm 10 months ago +925

    Man, it's always crazy how even a game that's considered "poorly optimized" like Mario 64 is still optimized incredibly well for its time.
    Nowadays developers say "oh, most people's computers these days should have about 16 gigs of RAM? Why waste time making it run on 8 then?" And then the game still comes out rushed and bad.

    • @augustdahlkvist3998
      @augustdahlkvist3998 10 months ago +117

      Lol, Mario 64 isn't well optimised at all: they didn't even turn on the automatic compiler optimisations, and it regularly drops below 30fps even though the N64 should be able to handle it fine.
      Modern games at least ship with compiler optimisations enabled

    • @benjaminoechsli1941
      @benjaminoechsli1941 10 months ago +125

      What you have to remember is that this is a console game, from the age of cartridges.
      There was no such thing as "updates" or bug fixes, so studios were forced to give their devs time to get a game good enough to sell.

    • @FlamespeedyAMV
      @FlamespeedyAMV 10 months ago

      @@dekufiremage7808 because they keep hiring morons

    • @incognitoman3656
      @incognitoman3656 10 months ago +5

      @@benjaminoechsli1941 There were re-releases, thankfully, but changing the engine is a drastic thing to do.

    • @totalphantasm
      @totalphantasm 10 months ago +46

      @@augustdahlkvist3998 I’m not super familiar with this topic in particular but it’s a fact that there’s almost nothing in game development that’s as simple as “turning on an optimization”. I would bet money that the issue isn’t as simple as turning on a feature. There are some things that only look obvious and easy with 3 decades of retrospective.
      Even then, most modern game studios probably couldn’t get Mario 64 to run at “drops below 30 with big objects” if I’m being honest.

  • @darkwalker19
    @darkwalker19 10 months ago +136

    Me (a person with a literal degree in computer science, multiple games built, and over 5 years of industry experience): This guy is insane and so much is going in one ear and out the other, but this is so interesting to watch

    • @johnmoser3594
      @johnmoser3594 9 months ago +18

      Take a computer engineering degree and learn about signal processing, and stuff like this will look more familiar.

  • @Zaurthur
    @Zaurthur 10 months ago +359

    14:57
    The reason he has to make the value negative here is that Bhaskara is only accurate in the range -pi/2 to pi/2.
    The flip-sign calculation will be one if the value is outside of this range; if so, the value needs to be shifted and flipped, as it is after the if(flipsign) and the ternary operator after the Bhaskara algo proper.
    Here is a graph showing the process, and also how he gets sine out of this calculation.
    BUT YouTube won't let me post it because they are dicks

    • @KazeN64
      @KazeN64  10 months ago +136

      He DMd me the graph:
      www.desmos.com/calculator/dq8l6tj6cv

    • @Zaurthur
      @Zaurthur 10 months ago +57

      @@KazeN64 Thanks, I appreciate it, and hope someone finds it interesting.

    • @IronLotus15
      @IronLotus15 10 months ago +2

      Out of curiosity, how do you DM on YouTube? Or did you use an outside platform?

    • @harrifongostudios
      @harrifongostudios 10 months ago +10

      @@IronLotus15 he has his socials linked

    • @spirttomb
      @spirttomb 10 months ago +36

      @@IronLotus15 I remember when YouTube had private messaging 😢

  • @FabulousJejmaze
    @FabulousJejmaze 10 months ago +96

    Nintendo has used a lookup table for trigonometry since Super Mario World, where it was used for the rotating sprites (ball and chain, chained platforms, etc)

    • @AlexRodriguez-gb9ez
      @AlexRodriguez-gb9ez 9 months ago +9

      Back in the day (especially on the SNES), the CPU was the bottleneck, not the cache, so they needed to look it up rather than calculate it. I don't even think the SNES could do floating-point numbers in hardware.

    • @FabulousJejmaze
      @FabulousJejmaze 9 months ago +6

      @@AlexRodriguez-gb9ez It could not use floating point numbers, but fixed point (usually 8.8) are very common and can even be hardware-accelerated. In Super Mario World there's a big table with sine/cosine values in this format which is then multiplied with whatever coordinate you want to transform using hardware registers.
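The 8.8 fixed-point multiply described here is a one-liner in C. A sketch under the stated convention (256 represents 1.0, as in the table format the comment describes; the function name is illustrative):

```c
#include <stdint.h>

/* Scale a coordinate by an 8.8 fixed-point sine/cosine value:
   widen to 32 bits for the product, then drop the 8 fraction bits. */
static int16_t scale_by_88(int16_t coord, int16_t trig88) {
    return (int16_t)(((int32_t)coord * trig88) >> 8);
}
```

With a sine table in this format, rotating a point is just four of these multiplies and two adds, which the SNES's hardware multiply registers can accelerate.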

    • @Clownacy
      @Clownacy 7 months ago

      Likewise for Sonic 1, also in 1991. I'm getting so tired of this 'Mario 64 was the first game to do X' bullshit. What's next, Kaze - Mario 64 was the first game to ever have 3D graphics?

    • @XENON2028
      @XENON2028 2 months ago

      @@Clownacy mario 64 was the first game to be called super mario 64

  • @Rihcterwilker
    @Rihcterwilker 10 months ago +729

    This is a coding channel that eventually uses mario to show how programming works.

    • @amineaitsaidi5919
      @amineaitsaidi5919 9 months ago +7

      Exactly, and this is why I am here.

    • @marioluigijam3612
      @marioluigijam3612 7 months ago +6

      Coding AND MATH!!!! My favorites

    • @amineaitsaidi5919
      @amineaitsaidi5919 7 months ago +1

      I just realized how unoptimized Mario 64 is, storing a whole table of sines; it was just pure rush.

    • @telanis9
      @telanis9 5 months ago +7

      @@amineaitsaidi5919 No, a lookup table is actually quite a good optimization. Like Kaze said, computing the actual Taylor series to a high degree of accuracy would be incredibly slow. Nintendo didn't understand just how much their memory bus design constrained everything until much much later, so at the time it made perfect sense to try to save on CPU with the table.

  • @pants64
    @pants64 10 months ago +151

    SM64DS also uses an interlaced sine/cosine table, but without mirroring. I never really thought about why that is but given how cache works it makes perfect sense.

    • @zheil9152
      @zheil9152 10 months ago +40

      I wish that game had as much love for the technical, decomp, and romhacking crowd as SM64 does. Glad to see you posting vids for the game

    • @FabulousJejmaze
      @FabulousJejmaze 10 months ago +2

      Doesn't the DS work like the GBA though, where reading from ROM is pretty free?

    • @sven33r
      @sven33r 10 months ago +3

      Was SM64DS ever decompiled to produce matching assembly, like SM64 was? Or how else do you know this?

    • @williamdrum9899
      @williamdrum9899 10 months ago +1

      @@FabulousJejmaze They're both ARM machines so I would think so. But I heard the DS has a memcpy bug that the GBA did not

    • @williamdrum9899
      @williamdrum9899 10 months ago +2

      @@sven33r It wouldn't be exactly the same since each runs on a different CPU. N64 doesn't have variable pointer offsetting.

  • @christopherdigirolamo9879
    @christopherdigirolamo9879 10 months ago +196

    "I never expected anything I've learned in school to actually be a useful life skill"
    Apparently hacking out a polynomial with a low-error approximation of sine for SM64 optimization in 2023 is a useful life skill lol

    • @MrMeow-dk2tx
      @MrMeow-dk2tx 10 months ago +21

      It is, it's called fun. You ever had it in your life?

    • @christopherdigirolamo9879
      @christopherdigirolamo9879 10 months ago +9

      ​@@MrMeow-dk2tx
      Yessir

    • @Johncw87
      @Johncw87 10 months ago +38

      It is when making Mario 64 romhacks is one of your bodily functions.

    • @spyczech
      @spyczech 10 months ago +15

      His brain has a built in RISC co processor for just rom hacking

    • @vinesthemonkey
      @vinesthemonkey 10 months ago +5

      math is useful for physics and statistics if you want to consider "real world applications"

  • @Dwedit
    @Dwedit 10 months ago +44

    In GCC for ARM, when you want to return two 32 bit integers in registers, you return a 64 bit number. To convert between two 32 bit numbers and one 64 bit number, you use a union.
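The pack-and-split trick described here can be sketched in a few lines. The type and function names below are illustrative, not from any particular codebase; note that which union half is "low" depends on endianness, though a round trip through the same union is always consistent.

```c
#include <stdint.h>

/* One 64-bit value viewed as two 32-bit halves. */
typedef union {
    uint64_t u64;
    struct { uint32_t lo, hi; } pair;  /* half order is endian-dependent */
} two_u32;

/* Hypothetical combined routine: both results travel back in the
   single 64-bit return value, so they stay in registers. */
static uint64_t pack_pair(uint32_t a, uint32_t b) {
    two_u32 v;
    v.pair.lo = a;
    v.pair.hi = b;
    return v.u64;
}
```

The caller splits the value with the same union, avoiding out-parameters and the memory traffic they imply.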

    • @KazeN64
      @KazeN64  10 months ago +25

      Yeah that works on n64 too!

    • @williamdrum9899
      @williamdrum9899 10 months ago +14

      That's what I love about unions, they let you break data types when it's convenient to do so. Best part is, there is no performance penalty.

  • @starfoxmoore6037
    @starfoxmoore6037 10 months ago +312

    As a computer scientist myself, I find these types of videos incredibly interesting.

    • @Psythik
      @Psythik 10 months ago +7

      As a non-computer scientist, I wish I hadn't fallen asleep in high school geometry class. I find these types of videos incredibly interesting as well, but I have no idea why sine waves are needed to calculate angles in three dimensions. What does a sine wave have to do with math in this context?

    • @hi-i-am-atan
      @hi-i-am-atan 10 months ago +8

      @@Psythik sine waves are necessary to calculate angles in _any_ dimensions ( aside from, i suppose, 1d geometry ... mostly because it, uh, doesn't have angles, at least as we traditionally understand them ). or, more precisely, they're necessary to calculate the cartesian coordinates of a vector rotated by an angle, aka, the change in a game entity's _x_ and _y_ positions when moving _d_ game units at an angle of _a_ degrees
      basically, imagine a standard 2d coordinate plane, then imagine a right triangle with a hypotenuse ( the line that _is not_ part of the right angle; it's the one that's usually drawn diagonally ) of length _d,_ extending from ( 0, 0 ). this means that one of our non-hypotenuse sides is our x-axis, while the other is aligned with the y-axis. the _a_ of sine and cosine is one of the triangle's non-right angles, and the one that's most useful for us is the one that ends up right next to ( 0, 0 ). now, the sine and cosine of _a_ is the length of the non-hypotenuse side that is opposite / adjacent to, respectively, a, divided by the length of our hypotenuse. in other words, they're the ratio between the _y_ / _x_ coordinate of the hypotenuse's second vertex and the hypotenuse's length. so, all we need is to multiply the length by those ratios and, voila! we now know that the hypotenuse extends from ( 0, 0 ) to ( _x, y_ ). which is something you may recognize as a vector, but even if you don't, all you need to know for this example is that you can add this _x_ and _y_ to the coordinates of a point to offset that point _d_ units at angle of _a_
      if that seems like a whole lot of steps for a whole lot of nothin', remember that basically _every_ game maps out space using cartesian coordinates. every point is represented by an x value, a y value, and maybe a z value if you're working with three dimensions, and ... that's it. if you want to move 14 units from ( 32, 24 ) at a 45 degree angle, then you gotta calculate exactly _what_ position is 14 units from ( 32, 24 ) at a 45 degree angle. and it isn't ( 39, 31 ), what you get if you just cut 14 in half and slap it on both numbers, even if a 45 degree angle is perfectly diagonal and thus offsets equally on both axes - remember, all of a circle's points are an equal distance from its origin, and yet its diagonal points aren't ( 1, 1 )*. the actual change in _x_ and _y_ we get out of this is a bit under 10, so we actually end up at ( ~42, ~34 )
      incidentally, the reason why kaze spoke as if they _aren't_ necessary in 2d games is because all of this is _wholly irrelevant_ in 2d mario games, because they're sidescrollers. so the x and y axes are just completely different forms of movement, so him walking 14 units forward is entirely contained to the x axis. but it's also true that a lot of simpler top-down games, like a lot of old zelda ones, don't bother with using this sort of math for their movement and just slap a constant speed value onto the entity's x and y coordinates. this results in diagonal movement being straight-up faster than cardinal movement, but it's usually not all that noticeable because of the game's presentation. but then you have the interesting case of the original doom games, which as you'd expect from a fps, properly vectorizes your movement so your forward speed remains consistent regardless of which angle you're facing ... except, it vectorizes your forward and strafing movement _separately,_ so you _still_ get the "diagonal movement = faster" effect, it just ends up rotated so you have to move diagonal relative to your facing angle rather than the map's origin
      *well, they are when using chebyshev distance, where circles are literally just squares. yeah, there are multiple ways to define distance ( which, granted, probably _isn't_ something that got covered in your geometry class ), and it's less theoretical than you might think because ... there's just a lot of different things you can measure! like, chebyshev distance is outright useful in game design, because it represents a type of movement used in the little known game of _chess._ another common alternative to euclidean distance is taxicab distance, which you'd be familiar with if you've played any tile-based game that doesn't allow diagonal movement, like advance wars or fire emblem

    • @PowerfulKundalini
      @PowerfulKundalini 10 months ago +3

      youre not a computer scientist lol

    • @starfoxmoore6037
      @starfoxmoore6037 10 months ago +2

      @@PowerfulKundalini
      I know it is hard to believe but I really am friend.

    • @crackasaurus_rox9740
      @crackasaurus_rox9740 10 months ago +3

      ​@@starfoxmoore6037Don't take it personally. A lot of people just found out over the past few years that a bunch of liars with college degrees have been masquerading as practitioners of science. If you are practicing the scientific method, it's probably best to let your work speak for itself. People simply don't trust the formal classifications at this point, and who could blame them.

  • @ZTimeGamingYT
    @ZTimeGamingYT 10 months ago +75

    I love it when trigonometry and calculus are used with gaming and coding. Math is very entertaining.

    • @Tolbat
      @Tolbat 10 months ago +14

      Shame they do not teach it in school as an applied math in game design.

    • @Dead_Metal
      @Dead_Metal 10 months ago +4

      I thought I could escape this stuff for a few months after finishing my GCSE exams... I guess not then 🤣
      Kaze's videos are a joy to watch nevertheless.

    • @retrokoh1507
      @retrokoh1507 10 months ago +2

      It's exactly why I love the classic Sonic engine... that said, things aren't nearly as complex as they are here, obviously.

    • @chaotickreg7024
      @chaotickreg7024 10 months ago +7

      I wish they taught practical applications of higher math. I tried to make a 2D animation of dots spinning like a precessing top; turns out it's a ton of trig inside of matrix algebra. I got into game dev and wanted to make a projectile bounce off a wall: more trig inside matrix algebra. Vectors are everywhere in game dev. If you want your character to be affected by gravity, there's immediately calculus to handle. The whole concept of "ramp" in Magic: The Gathering is just exponential growth. Geometry is useful all over the place in action games; it's like you can't move your character without at least distance/hitbox equations happening. So many people say "math is miserable, you never use it, it's not worth the time," but I try to say math is important: it's a way to understand every physics process unfolding around you, and that power is extremely useful. Even if it's just so you can speak to computers and be better at basic programming, or just to be a better gamer, math is incredibly useful.

  • @Creatively_Bored
    @Creatively_Bored 10 months ago +43

    I have a hunch about the angle flipping necessity: tangent.
    The angles in the 1st and 3rd quadrants (going ccw around a circle on the Cartesian grid) have the same tangent values, and the same goes for the 2nd and 4th quadrants.
    So you have to distinguish whether you want the angle from the 1st or 3rd quadrant (and the 2nd or 4th quadrant).
    MS Excel (and Google Sheets, iirc) have the two-argument arctan function to treat the angle calculation as a vector problem, but since Mario 64 doesn't have this, you have to use a from-scratch angle discrimination setup, much like what Kaze ended up using.

    • @a2e5
      @a2e5 10 months ago

      atan2 is a godsend when it comes to not having to roll your head everywhere to figure things out. you can define it from a classic arctan though -- just some if cases for signs and stuff

  • @porterleete
    @porterleete 10 months ago +145

    The video's great! I love seeing how things like this are done on old hardware. It seems to me like it would be hard to understand how anything would be best optimized for modern hardware, with speculative execution, weird hardware tricks, and strange security flaws - more like magic than science. Even though I don't really program much, let alone optimize anything, optimizing code for old hardware seems like something that can actually be learned in a concrete and non-mystical way by a human being, even if it takes effort.

    • @KazeN64
      @KazeN64  10 months ago +147

      modern hardware is optimized for bad code. i think optimizing code on more modern hardware is less about nitty gritty details like this and more about using best practices. there is also an issue that most modern games run on many different consoles so you dont just have one metric to maximize. a lot of these n64 optimizations would slow modern machines down.

    • @incognitoman3656
      @incognitoman3656 10 months ago +12

      @@KazeN64Ever since Playstations, gaming has gotten more techy to the point of modern renders being mostly power and less “magicks”. Thanks to you, we can see their bag of tricks!

    • @KopperNeoman
      @KopperNeoman 10 months ago +4

      @KazeN64 The more powerful the hardware, the harder it is to run anything bare metal. On modern systems, it can even be impossible, requiring you to bypass even the UEFI itself.

    • @Ehal256
      @Ehal256 10 months ago

      @@KazeN64 It's still optimized for good code, it just also handles bad code a bit less poorly. Bad code on modern hardware can still easily waste 90% of memory bandwidth, etc.
      Where I think modern hardware shines is making decent, not optimized, but not wasteful code run very fast.

  • @brunogm
    @brunogm 10 months ago +15

    A reminder of the actual MIPS CPU cache sizes (a 16 KB instruction cache and an 8 KB data cache) brings a whole lot of perspective to that LUT size.

  • @Toenailslikkepind
    @Toenailslikkepind 10 months ago +54

    I love these optimization stories about old video games. I switched from application development to firmware development because I figured it would be the closest I could get in today's technology landscape. Thank you for your videos. They are such a treat ❤

    • @davidmckean955
      @davidmckean955 10 months ago +2

      You could also program industrial equipment

  • @tnelsond
    @tnelsond 10 months ago +48

    05:32
    The number for peak bandwidth is megabits per second, that's why your number which is megabytes per second is about one eighth of their number.

    • @KazeN64
      @KazeN64  10 months ago +54

      It's megabytes on their website. There are other websites claiming 4500 Mbit/s too. I thought they could be talking about Mb/s too, but unfortunately they weren't.

    • @t0biascze644
      @t0biascze644 10 months ago +19

      @@KazeN64 maybe they calculated it in Mbit/s and accidentally wrote it as MB/s? And then someone took the wrong MB/s from Wikipedia and multiplied it by 8 again, resulting in 4500?

    • @Dovahkings
      @Dovahkings 10 months ago +7

      @@t0biascze644 Definitely that, lol. It's a rookie mistake, and since everyone is using Wikipedia as their reference... well, nearly everything ends up wrong afterwards.

    • @lotrbuilders5041
      @lotrbuilders5041 10 months ago

      @@KazeN64 RDRAM operates at 250 MHz over a 9-bit bus with data (possibly) transferred on every clock. That's where the number comes from. As the GPU uses most of that and the overhead is worse, the CPU might get only 75MB/s

    • @KazeN64
      @KazeN64  10 months ago

      @@lotrbuilders5041 i've computed my 75MB/s from the graphics actually. the CPU is a lot slower than the 75MB/s. usually they share the data speed, so both components together probably average around 25MB/s during normal gameplay.

  • @youdj_app
    @youdj_app 10 months ago +25

    When I was coding 3D DOS games 25 years ago, I couldn't understand why my sin table was not that fast - it's just one line of code with no CPU computation... Cache/memory bottlenecks make a huge difference, I confirm.

    • @williamdrum9899
      @williamdrum9899 10 months ago +6

      See I program on CPUs that didn't have a cache. So for me, I don't need to worry about it since memory access is equally slow!

      @Anon.G 10 months ago

      @@williamdrum9899 then you have to start worrying about other problems!

    • @andrewdunbar828
      @andrewdunbar828 10 months ago +1

      Yeah you'd have to go back to the '80s when we had no caches to avoid that. I used to play with this stuff in the 8-bit days until about the Amiga era.

    • @williamdrum9899
      @williamdrum9899 10 months ago +1

      @@Anon.G Haha yeah. At least with C compilers you can actually finish a project. Writing an entire game in assembly is a monumental task even on an 8 bit platform

    • @williamdrum9899
      @williamdrum9899 10 months ago +1

      @@andrewdunbar828 I love the 68000, it's so easy to work with

  • @aldendwyer
    @aldendwyer 10 months ago +14

    This channel is slowly becoming the best showcase of code optimization on YouTube. I love it.

  • @jimmerjammy
    @jimmerjammy 10 months ago +23

    I've always been fascinated with how computers implement trigonometry. It's more difficult than arithmetic, and there are many ways to do trig functions. Another method is the CORDIC algorithm, which involves adding smaller and smaller angles until it reaches the input. It was developed for use in 8- and 16-bit processors. The TI-83 calculator, which uses a chip similar to the Game Boy's, uses this algorithm.

  • @st0nedpenguin
    @st0nedpenguin 10 months ago +37

    I am so notoriously bad at math that my brain literally locks up when presented with it, but you still manage to explain everything in these videos well enough that my potato brain can keep up.
    Which is good, because I love seeing how you optimize the hell out of this game despite being a math potato.

  • @squiddlecomplains9829
    @squiddlecomplains9829 10 months ago +27

    Thanks Kaze! This was the perfect video to watch while running head first into a wall!

  • @AsymptoteInverse
    @AsymptoteInverse 10 months ago +19

    It's always interesting to see how constraints inspire creativity and inventiveness. I have very limited programming experience, but a fair bit of writing experience, and having a constraint (the syllable count of a haiku, a particular meter in a poem, a more arbitrary constraint like not using a certain letter at all) really pushes you to play around with the elements you've got and try to achieve a pleasant result in spite of the constraint. It strikes me as weirdly similar to working around memory and cache constraints.
    Reminds me, too, of a book I read written by an ex-Atari programmer. (I think it was Chris Crawford.) He talked about the clever things they tried to do with graphics by executing code in the handful of machine cycles in between CRT raster lines.

  • @tumm1192
    @tumm1192 10 months ago +6

    I remember that a book called "Linear Algebra Done Right" says on page 115 that the best fifth-degree approximation of sin(x) on the interval [-pi,pi] is given by 0.987862x − 0.155271x^3 + 0.00564312x^5.
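The quoted polynomial is cheap to evaluate with Horner's rule; this sketch just checks the claim numerically (it is a least-squares fit over the whole interval, so pointwise error grows toward the endpoints):

```c
#include <math.h>

/* x*(a + x2*(b + x2*c)) expands to a*x + b*x^3 + c*x^5,
   costing four multiplies and two adds. */
static double poly_sin(double x) {
    double x2 = x * x;
    return x * (0.987862 + x2 * (-0.155271 + x2 * 0.00564312));
}
```

Being odd in x, it is exactly antisymmetric, one of the properties the video cares about.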

    • @totem6064
      @totem6064 10 หลายเดือนก่อน +23

      you have a great memory

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +9

      looks like a minmax approximation which has issues i've outlined in the video

    • @tumm1192
      @tumm1192 10 หลายเดือนก่อน +2

      ​@@KazeN64awesome.

  • @thrasher1462
    @thrasher1462 10 หลายเดือนก่อน +23

    this is so crazily impressive, kaze must have gone insane to do all this for such small performance gains lmao

    • @Bittboy
      @Bittboy 10 หลายเดือนก่อน +1

      Could also just be a passionate autistic programmer like me lol

    • @davidmckean955
      @davidmckean955 10 หลายเดือนก่อน +3

      considering how well optimized a LUT already was, he knew going into it that he was never going to achieve more than a little performance gain

  • @Diablokiller999
    @Diablokiller999 10 หลายเดือนก่อน +6

    Have you ever tried the CORDIC algorithm?
    It needs a (quarter?) sine table but calculates everything just by adding and shifting, so basically one-cycle operations, and the accuracy depends on the number of iterations you give the algorithm.
    Could be too taxing on the bus, but maybe give it a try? Not hard to implement though...
    Used it on FPGAs to reduce computation unit requirements by a lot, since multipliers are rare on those. Golem has a great article about it (and in German :3 )!

  • @Xaymar
    @Xaymar 10 หลายเดือนก่อน +28

    5:35 The wiki may be using Mbit/s, as dividing the value by 8 gets pretty close to the real bandwidth.

    • @sinom
      @sinom 10 หลายเดือนก่อน +9

      Then it should read Mb instead of MB though

    • @littlenyancat5754
      @littlenyancat5754 10 หลายเดือนก่อน +10

      @@sinom I wouldn't be surprised if some marketing guy decided to change to "MB" because it "looked neater" without understanding the technical implications of it

    • @Xaymar
      @Xaymar 10 หลายเดือนก่อน +4

      ​​​@@sinomGenerally memory transfer speed is measured in bits/second, not bytes/second. It's not impossible that in a community driven wiki someone made a mistake somewhere along the chain of copying or stating information. And the b/B difference is a newer thing - for quite a while it was common in data/tech sheets to have B just mean both, requiring a call to the vendor or knowledge of how transfer speed is measured at this particular company.
      But yes, it should be b for bits, or Mbit/s to be explicit.

  • @T3sl4
    @T3sl4 10 หลายเดือนก่อน +13

    A tip about evaluating polynomials -- start with the highest coefficient, multiply by x, add the next coefficient, multiply by x, etc. Called Horner's method, this greatly reduces floating point error as the terms tend to be of similar scale. For all-even or all-odd series, also prepare x^2 beforehand, to evaluate it as a polynomial in x^2 instead; for odd, factor out the common x, then multiply by it in a final step.
    About Bhaskara's formula: it's not so mysterious, just a rational approximation. A rational polynomial is a function f(x) = P(x) / Q(x), P and Q being regular polynomials. Generally speaking, the accuracy is equivalent to a n+m order polynomial, where n and m are the order of P and Q respectively. Evidently, this would be advantageous for modest order polynomials (maybe 7th+?) where the division can save some multiplications. Rationals are particularly well suited for peaky functions, where 1/x or 1/(x^2 + a^2) sorts of forms would otherwise be very challenging for a polynomial fit. For smooth functions like sin/cos, it's not too important; and indeed, your 4th/5th order forms are in the same order of magnitude accuracy, as expected.
    There's also a method not listed here, the hybrid between look-up and curve-fitting: you could have a table of not just values, but coefficients as well. A much smaller table (say 16 or 64 elements) could contain, say, the offset, slope and curvature, which are simply dumped into a quadratic polynomial, and out drops the spline result for that segment. Note this is less intensive than a linear (let alone quadratic or cubic) interpolation, as the coefficients don't need to be computed from adjacent points (table values); for example linear interpolation has to calculate the slope between adjacent points (evil division!), but we can just tabulate that instead. It's not obvious if this would perform better overall than the final choices in the video, but considering how comprehensive the video is, I suppose it bears mentioning.
    The mention of returning complex pairs has some very interesting outcomes, if taken to its logical conclusion. Ultimately all we're doing is coming up with a vector [x, y] = [cos theta, sin theta], and most of the time we're using both values close by so it makes sense to return both pairs directly. _We could avoid angles entirely and only pass around these vectors!_ This is probably such an immense pain that it's not worth it (every single object which takes an angle parameter has to be refactored to take a normal vector instead!), but, interesting just to imagine how it would all work. Note that, where angles can be added or subtracted, angle vectors must be multiplied or divided, using the complex vector formulas (a + bi)(c+di) = ac - bd + i(ad + bc) and (a + bi) / (c + di) = ( ac + bd + i(bc - ad) ) / (c^2 + d^2). (The division is avoided as long as c^2 + d^2 = 1, which is true for normals.) Fractional fixed-point can be used to save on FLOPs; probably short pairs, so, doubling the memory requirements of stored angles. The magnitudes will eventually drift from enough accumulated operations, and using first-principles values (like, oh I don't know, calculating sin and cos? :^) ) from time to time, or entirely in the first place (imagine that!) is probably helpful; or normalizing (divide by sqrt(a^2 + b^2) -- yuck, what a floating-point mess! -- but a simple iterated approximation can be used (or the infamous inverse square root..!) since it just has to move the value closer to normal, not set it exactly so.
    Tremendous amount of work, by the way! Numerical approximation, and implementation, can be quite a deep rabbit hole. As, uh, I think you've discovered, hah! :D
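The polynomial-evaluation scheme described in the first paragraph can be sketched as follows (the Taylor coefficients are used purely as placeholders; any odd-series coefficients fit the same shape):

```python
import math

# Horner's method for an odd series: prepare x^2 once, fold the coefficients
# from highest order down, and multiply by the common factor x at the end:
#   sin(x) ~= x * (c1 + x^2*(c3 + x^2*c5))
def horner_odd(x, coeffs):
    # coeffs listed lowest order first: [c1, c3, c5, ...]
    x2 = x * x
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x2 + c
    return x * acc

# Placeholder coefficients: the degree-5 Taylor series of sine.
def taylor_sin(x):
    return horner_odd(x, [1.0, -1.0 / 6.0, 1.0 / 120.0])
```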

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +7

      Yeah dont worry about the first point - thats how my code does it. Ffast math automatically multiplies xsquared first

    • @fibbooo1123
      @fibbooo1123 10 หลายเดือนก่อน

      +. Excellent comment!

    • @wile9763
      @wile9763 10 หลายเดือนก่อน +1

      I'm not sure why a Taylor series approximation would be too slow. There are no divisions and you can compute x^5 in three multiply instructions. The whole thing would probably take about 8 multiplies and 3 adds for an 8th order accurate approximation. Check out the musl or glibc libm implementations.

    • @wile9763
      @wile9763 10 หลายเดือนก่อน +1

      Also, -ffast-math can cause incorrect computations since it can break IEEE compliance. Do you see the same incorrect stretching when you build without it?

  • @Frenchnostalgique
    @Frenchnostalgique 10 หลายเดือนก่อน +7

    I got a welcome ego boost watching this video by thinking about the "only use one 4th of the curve" and "interweave sine and cosine" before you mentioned them. Thanks for that, I love these optimization tricks videos

  • @JonathanThe501st
    @JonathanThe501st 10 หลายเดือนก่อน +3

    "What if I told you there's room for improvement?"
    Kaze, this is YOU we're talking about.
    You'll probably be able to get Super Mario 64 to run on a TI-86 Calculator if given enough time

  • @cosmicreciever
    @cosmicreciever 10 หลายเดือนก่อน +5

    The idea to interweave the sine and cosine values to avoid cache misses is a stroke of genius! Great work!

  • @CAEC64
    @CAEC64 10 หลายเดือนก่อน +19

    thanks for enlightening me on the vertical and horizontal speed on 2d games kaze

    • @BGP00
      @BGP00 10 หลายเดือนก่อน

      NO WAY CAEC

  • @Isaac________
    @Isaac________ 10 หลายเดือนก่อน +30

    Instead of Euler angles, which have several drawbacks such as gimbal lock and numerical precision, would it be more efficient to use another rotation representation such as quaternions? Would that also reduce the number of needed trig function calls?

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +69

      Yes, I'm actually planning to convert all the animations to quaternion animations - It'll reduce it from 42 multiplications and 6 sin/cos calls to 37 multiplications. There is also an advantage in quaternions being 8 bytes per rotation instead of 6 - that sounds worse at first, but that way you can guarantee every quaternion is accessible in a single dcache access (whereas the eulers could be spread across 2 dcache lines). So yeah i'm absolutely planning on doing that.

    • @Isaac________
      @Isaac________ 10 หลายเดือนก่อน +22

      @@KazeN64 Wow, you're eventually gonna rewrite the whole damn engine! Quaternions seem less intuitive at first, but then you see their elegance. And when you start counting mults and other ops, the benefits are clear.

    • @angeldude101
      @angeldude101 10 หลายเดือนก่อน +4

      @@Isaac________ They're only less intuitive if you try to visualize them as 4-dimensional arrows. They're _very_ intuitive if you look at them for what they are: mixtures of 4 basis 3D rotations: 180° around x, 180° around y, 180° around z, and 0° (around nothing). An alternative way of visualizing them is as compositions of two reflections. They're more efficient to compose than matrices, but less efficient to apply.
      If you want to mix in translations, then you need something called dual-quaternions, which have 8 components, but that's still less than a 16-component homogenous matrix or 12-component affine matrix.

    • @Isaac________
      @Isaac________ 10 หลายเดือนก่อน +4

      @@angeldude101 Yes. That's why I wrote "at first", and noted their elegance. :)

    • @Triszious
      @Triszious 10 หลายเดือนก่อน +2

      ​@@angeldude101 I think a lot of confusion around quaternions is caused by how we usually think about imaginary numbers as a representation of 2D space. If you try to tack on a similar interpretation of the scalar part of a quaternion as relating to a basis axis, then it doesn't make sense. The scalar part is related to an inner product and projects your vector into a plane where the rotation is a simple reflection. It's the imaginary vector part that "lives" in 3D space. It also helps to realize that the multiplication rules of quaternions are just an algebraic representation of the cross product:
      j*k = i
      which is similar to:
      y × z = x
      for basis axes x, y, z.
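As a concrete illustration of this thread, here is a minimal (unoptimized) quaternion sketch: the Hamilton product below reproduces j·k = i, and rotating a vector uses the sandwich v' = q v q*:

```python
import math

# Quaternions stored as (w, x, y, z). Hamilton product:
def qmul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

# Rotate a 3D vector by a unit quaternion: v' = q * (0, v) * conj(q).
def qrotate(q, v):
    qc = (q[0], -q[1], -q[2], -q[3])    # conjugate of a unit quaternion
    w, x, y, z = qmul(qmul(q, (0.0,) + tuple(v)), qc)
    return (x, y, z)

# 90 degrees about the z axis: q = (cos 45deg, 0, 0, sin 45deg).
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

Composing two quaternions this way costs 16 multiplies naively, versus 27 for multiplying two 3x3 matrices, which is the composition advantage mentioned above.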

  • @dellybird5394
    @dellybird5394 10 หลายเดือนก่อน +4

    "You are going to learn a lot of math today, and you are going to like it" was weirdly threatening

  • @WineStrikesBack
    @WineStrikesBack 10 หลายเดือนก่อน +33

    Ah yes, one step closer to optimizing the N64's boot time. Wonderful video.

  • @crimsonvale7337
    @crimsonvale7337 วันที่ผ่านมา

    13:40 I’ve heard that division is shockingly costly for computation, but it being on par with square roots is insane

  • @dons2148
    @dons2148 10 หลายเดือนก่อน +1

    Very clever optimization to get the cosine to cache with the sine. Loved the deep dive. You definitely did look cool doing it! Great video

  • @MiguelDiaz_
    @MiguelDiaz_ 10 หลายเดือนก่อน +6

    I love that Kaze's romhacking has evolved into the most insane solutions to code optimization problems

  • @saligraphy
    @saligraphy 8 หลายเดือนก่อน

    I appreciate that you were passionate enough to go this deep into this game

  • @kestergreen3844
    @kestergreen3844 3 หลายเดือนก่อน

    Rewatching this now. It has to be one of my favourite videos! Everything is explained well.

  • @gzusfishlives8897
    @gzusfishlives8897 10 หลายเดือนก่อน +4

    @2:12 I was today years old when i learned you can get on the castle. I never even considered it before. The vid was cool but thanks for that

  • @plkh9602
    @plkh9602 8 หลายเดือนก่อน +4

    5:29 It actually comes from: 250 MHz clock speed (RDRAM, double data rate) = 500 MT/s
    9-bit bus * 500 MT/s = 4.5 Gbps = ~562.5 MB/s
    Unfortunately that's only in theory; a lot goes on in between: latency, memory controllers, data computing speed, etc...

    • @KazeN64
      @KazeN64  8 หลายเดือนก่อน

      that makes sense!
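The unit conversion above, spelled out as a tiny sketch:

```python
# N64 RDRAM: a 9-bit bus at 250 MHz double data rate = 500 MT/s.
bus_width_bits = 9
transfers_per_second = 500_000_000

peak_bits_per_second = bus_width_bits * transfers_per_second   # 4.5 Gbit/s
peak_megabytes_per_second = peak_bits_per_second / 8 / 1_000_000
```

So a sheet quoting 562.5 "MB/s" and one quoting 4.5 Gbit/s describe the same theoretical peak; real throughput is lower for the reasons listed above.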

  • @Roxor128
    @Roxor128 10 หลายเดือนก่อน +2

    The lookup table approach is used in the classic Yamaha OPL2 and OPL3 FM synthesis chips. They contain a 256-entry table of the first quarter in ROM. However, it's not a straight-up sine function, but a log of a sine function. They also contain an exponential table ROM as well, needed for converting the log-transformed values to linear ones.

  • @btarg1
    @btarg1 10 หลายเดือนก่อน

    I need more of these high-quality edited videos! Every time you make a new optimisation I look forward to another one of these for it lol

  • @citizenofhyrule
    @citizenofhyrule 10 หลายเดือนก่อน +8

    ocarina of sine

    • @angeldude101
      @angeldude101 10 หลายเดือนก่อน

      Considering that music is just sound waves of varying frequency, and sound waves are waves like any others, _yes._

  • @costelinha1867
    @costelinha1867 10 หลายเดือนก่อน +4

    20:10 while these bugged bones are not desirable normally, I think they could make for a pretty funny mechanic in another romhack... maybe a power-up?

  • @thewolfmanhulk2927
    @thewolfmanhulk2927 6 หลายเดือนก่อน

    Thank you for making videos like these explaining mathematical concepts related to Mario 64. They're informative, appreciated, and encourage learning.

  • @ryanretrogamer7
    @ryanretrogamer7 หลายเดือนก่อน

    The fact that this is mostly review for me from my Numerical Analysis class is wild!
    It just blows my mind seeing this all working! I mean, I've worked with this through many long nights of studying, but still, it amazes me!

  • @grantsparks6554
    @grantsparks6554 10 หลายเดือนก่อน +12

    Hey Kaze, this was another solid deep-dive. If I could make a suggestion: I found the additional audio of the speedrunning/glitching balanced a little too loud. Loud, glitched "wa-wa-wa-wa-wahoo"s into a wall + music + mathematics VO is a lot to listen to all at once and this is the only time I can recall a video of yours feeling this loud and overstimulating. I'd like to point to 16:45 as an example, so see if others feel the same.
    Otherwise, great topic! Love the feeling of progression to the algorithm

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +10

      Yeah thats fair! i'll make sure to turn the audio down a bit next video.

    • @grantsparks6554
      @grantsparks6554 10 หลายเดือนก่อน +2

      Thanks, I appreciate it 👍

  • @LarsDonner
    @LarsDonner 10 หลายเดือนก่อน +4

    For generating the minimax coefficients you can use a command-line-tool called Sollya. It's neither very well documented nor does it give optimal results, but it's pretty close.

  • @The4Crawler
    @The4Crawler 10 หลายเดือนก่อน

    Great video.
    In a past life, I wrote display device drivers for the GEM Desktop VDI interface, mainly on 286 and 386 systems. In that interface, they do use some angle parameters, mainly for circular arcs. Angles are specified in 0.1 degree increments. The supplied driver code used sine lookup tables and was based on octants, so 0.0 - 45.0 degrees. There was some front end code that would reduce angles into one of 8 octants, even for angles over 360 degrees.
    I started out using this code but found some strange behavior every once in a while the display update would pause for a noticeable delay. I ended up writing a test program that would load the driver and be able to send it a range of parameters for all the graphics primitive functions. This way I could load my program (and the driver) with a debugger (displaying on a separate system via serial port).
    I finally narrowed down the problem in the arc drawing code. While the comment in the code (it was originally written in a Fortran variant, then the assembler output from that was used) said to reduce the angle to octant 0..7, it was actually producing octants 1..8. This was in a loop where the code would take the angle, check for > 360 and, if larger, subtract 360 and repeat. Then it would check for octant 0..7, but if the angle reduced to octant 8, it would hit the outer loop and keep going until it finally got into the 0..7 range.
    First, I fixed the assembler code to do the proper 0..7 calculation and it worked fine. However, there was over 900 bytes of lookup table (and this was in the 16 bit / 640KB days), so that was a huge chunk of memory just sitting there in case a rarely used arc call was made. I was also in the process of converting almost all this assembler code to C and realized this whole mess could be replaced with a call to the sin() library function. Since we were selling high end graphics cards, almost every customer of ours had a 287 or 387 math co-processor in their system, so this was a fairly low overhead call. Doing this trimmed about 1KB from the driver memory footprint and it ran smoother without that annoying pause every time it got a strange angle parameter.
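The corrected octant reduction described above can be sketched like this (angles in 0.1-degree units, as in the GEM VDI interface; reducing modulo 3600 first guarantees the octant lands in 0..7, never 8):

```python
# Reduce an angle in tenths of a degree to (octant 0..7, offset in octant).
def to_octant(tenths_of_degree):
    a = tenths_of_degree % 3600      # wrap to [0, 360) degrees in one step
    return divmod(a, 450)            # 450 tenths = one 45-degree octant
```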

  • @nprocopi1231
    @nprocopi1231 10 หลายเดือนก่อน +2

    9:56 "not sure if anyone else thought of this", it was used in Crash Team Racing (1999) by Naughty Dog on PlayStation 1!

  • @pikapics222
    @pikapics222 10 หลายเดือนก่อน +6

    thank you for Mario Teaches Math, Kaze 64 edition. your lesson was a lot of fun!!!

  • @timmygilbert4102
    @timmygilbert4102 10 หลายเดือนก่อน +3

    In before he switch to mighty quaternion for cache locality

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +7

      i am 100% planning to do that for animations

  • @smartperson1
    @smartperson1 10 หลายเดือนก่อน

    My word, that sin/cos layout for caching is brilliant. Excellent work, as usual.

  • @Wezla
    @Wezla 9 หลายเดือนก่อน

    This is exactly the type of video that was interesting enough to get me into a CS program, and even more fascinating and appreciable after my masters lol
    Cheers, loved this profusely!

  • @MattGodbolt
    @MattGodbolt 10 หลายเดือนก่อน +3

    Great video! I remember using overlapping sin/cos tables for the BBC Micro back in the day. And I think I recognise the UI you used for comparing the C code compilation results! 😊

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +3

      the man himself

    • @vinesthemonkey
      @vinesthemonkey 10 หลายเดือนก่อน +1

      Matt Godbolt watches N64 optimizations nice

    • @wowimoldaf
      @wowimoldaf 8 หลายเดือนก่อน +1

      Holy fuck, didn't expect to see a true OG on here lmao.

  • @gamingSlasher
    @gamingSlasher 10 หลายเดือนก่อน +7

    I am old school, so I really liked these optimizations, considering that today most programmers just write more and more bloated code.

  • @cattuswitch7246
    @cattuswitch7246 8 หลายเดือนก่อน

    Wow, the Bhaskara approximation was discovered around the year 680, a year with only 3 digits. The 4th digit for years hadn't even been invented yet, and it's still the best / fastest approximation. Mindblowing.

  • @hi02hi36
    @hi02hi36 10 หลายเดือนก่อน +1

    Really interesting math!
    This comment is long, just so you know.
    I tried to find a 4th order approximation satisfying:
    f(0) = 0
    f_prime(0) = 1
    f(pi/2) = 1
    f_prime(pi/2) = 0
    The only one is ax^4 + bx^3 + x,
    where
    a = (16 / pi^4) * (pi - 3)
    b = (16 / pi^3) * (2 - 0.75pi)
    The maximum error is ~0.0020673.
    As for cos(x), another method is to take the derivative of the polynomial for sin(x). Although there will be a greater max error, it will be faster, as the polynomial is of lesser degree and it avoids using a sqrt. If we use the 4th degree polynomial I found, the maximum error is ~0.005. Also, because of the constraints on ~sin(x), this ~cos(x) will always be between 0 and 1 (in the range 0 .. pi/2). As for smoothness, the second derivative f''(0) = 0 correctly and f''(pi/2) = -1.0437 instead of the perfect -1.
    I also tried to find a 3rd minimax approximation, forcing the following:
    f(0) = 0
    f(-pi/2) = -1
    f(pi/2) = 1
    The function I ended up with was approximately 0.9824x - 0.14x^3, more specifically:
    0.9823962677074202x - 0.14013793513412705x^3
    It has a max error of 0.005745, which is slightly better than your function.
    non-pro tip: Those 'error bumps' at 19:35 should all have the same maximum height. Meaning, you sacrifice some error in the shorter bump in order to lower the tallest bump, which eventually yields the minimax approximation.
    However, it faces the same issue as yours. That is, the function may be > 1. The maximum is around 1.0012. Forcing the cubic to not exceed 1 (by forcing f_prime(pi/2) = 0) results in easyaspi's approximation.
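The quartic from the top of this comment checks out numerically; this sketch uses the stated closed forms for a and b and grid-scans the error on [0, pi/2]:

```python
import math

# Quartic sine approximation from the comment above: f(x) = a*x^4 + b*x^3 + x,
# constrained so f(0)=0, f'(0)=1, f(pi/2)=1, f'(pi/2)=0.
a = (16.0 / math.pi**4) * (math.pi - 3.0)
b = (16.0 / math.pi**3) * (2.0 - 0.75 * math.pi)

def f(x):
    return a * x**4 + b * x**3 + x

# Largest deviation from sin(x) on [0, pi/2], scanned on a fine grid.
def max_error(n=10001):
    half_pi = math.pi / 2.0
    return max(abs(f(half_pi * i / (n - 1)) - math.sin(half_pi * i / (n - 1)))
               for i in range(n))
```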

  • @solaerwig
    @solaerwig 10 หลายเดือนก่อน +10

    It's crazy to think that Kaze might just be one of the greatest programmers of our generation... And he uses his insane talents to make Mario 64 mods O_o

  • @thecynicalone7655
    @thecynicalone7655 10 หลายเดือนก่อน +12

    Did you consider the CORDIC algorithm? It's (to my understanding) really efficient on low end hardware, so it might be worth looking into

    • @movax20h
      @movax20h 10 หลายเดือนก่อน

      Cordic is going to be more accurate, but slower.

  • @kevinbroberg3504
    @kevinbroberg3504 9 หลายเดือนก่อน

    I can't believe I was so engaged by this video for 26 minutes. Gotta be the best ad read I've ever seen, too; I spent 15 of those minutes thinking "how could I possibly learn all this?"

  • @Platformer64
    @Platformer64 4 หลายเดือนก่อน

    Thank you Kaze, this video actually helped me figure out something really useful for my Workshop game modes in Overwatch lol. Really nice video with good explanations!

  • @timseguine2
    @timseguine2 10 หลายเดือนก่อน +4

    If you split it again into two separate approximations one from 0 to pi/8 and a different one from pi/8 to pi/4, I think you can probably get a pretty accurate quadratic approximation of sine. I didn't work out the details so it might end up being slower in the long run or have other problems.

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน +4

      it'd be cheaper and more accurate to just use the cosine of one higher degree to get around having to do float comparisons at that point

    • @timseguine2
      @timseguine2 10 หลายเดือนก่อน

      @@KazeN64 I have seen this method used in a different context to good effect, which is why I suggested it, but every processor is different, and I don't really know this one nearly as well as you do obviously. So fair enough.

    • @timseguine2
      @timseguine2 10 หลายเดือนก่อน

      @@KazeN64 And by the way, you can get better polynomial interpolations sometimes by choosing different nodes.
      I tried out a fifth order cosine with f(x) and f'(x) equal to those of cosine at x=0, pi/4, pi/2 and it gives an order of magnitude better approximation than the fifth order sine you showed. and it has the property also that it is bounded from above by the cosine function on the interval 0 to pi/2

    • @KazeN64
      @KazeN64  10 หลายเดือนก่อน

      @@timseguine2 wait why a 5th order cosine? odd orders should always be sines and even orders should always be cosines. otherwise you end up having too many coefficients. can you show the function?

    • @timseguine2
      @timseguine2 10 หลายเดือนก่อน +3

      ​@@KazeN64 The way the coefficients disappear depends on which x values are used for evaluating the interpolation. If you split them up like I suggested it works out to:
      1 + ((1024 sqrt(2) + 256) π + 4096 sqrt(2) - 11776)x^2/(128π^(2)) + ((-8192 sqrt(2) - 2560) π - 16384 sqrt(2) + 67584)x^3/(128π^(3)) + ((20480 sqrt(2) + 8192) π + 16384 sqrt(2) - 139264)x^4/(128π^(4)) + ((-128 sqrt(2) - 64) π + 768)x^5/(π^(5))
      and the maximum error on 0 to pi/2 is about 0.00004
      The thing I said about the bound was backwards, but it is never larger than 1.
      There are other alternative ways to interpolate it depending on what you are going for. A decent quartic one could work for example by dropping one of the conditions at x=pi/4, but I didn't try that

  • @bentomo
    @bentomo 10 หลายเดือนก่อน +3

    This was a fun watch. It makes me wonder how much silicon in modern gaming APUs might be dedicated to the trig portion of the ALUs

    • @williamdrum9899
      @williamdrum9899 10 หลายเดือนก่อน

      Probably enough to where your average programmer doesn't need to think about this stuff

  • @Myako
    @Myako 5 หลายเดือนก่อน

    Your degrees of enthusiasm, investigation, knowledge and actual results are absolutely amazing.
    They come shining through the video, making it even more entertaining.
    Congratulations, and I can't wait to see these implemented in your game. 👏🏻👏🏻👏🏻💪🏻💪🏻💪🏻

  • @AVX512
    @AVX512 10 หลายเดือนก่อน +3

    It's so wild to me that actually trying to approximate sine can be more efficient than the LUT

  • @bruhweirdo7992
    @bruhweirdo7992 10 หลายเดือนก่อน +3

    You should make that big-fist Mario at 12:23 a power-up in the game

  • @RiverKinn
    @RiverKinn 8 หลายเดือนก่อน

    watching your videos makes me feel smarter. i have very little expectation of ever utilizing these lessons in my waking life, but i do love learning from you. thanks man.

  • @winterispretty6464
    @winterispretty6464 10 หลายเดือนก่อน

    You can use y1~[equation] to do regression with your equation from a table by inserting the sine graph's x and y positions, and it will automatically find pretty much the best answer without you playing with the numbers yourself

  • @zheil9152
    @zheil9152 10 หลายเดือนก่อน +3

    I was really hoping the sin(x)=x approximation was going to make a guest appearance here

  • @0AThijs
    @0AThijs 10 หลายเดือนก่อน +4

    I don't know how I'm supposed to watch this, but I probably watched it like a caveman being taught about basic electricity.

    • @XpRnz
      @XpRnz 10 หลายเดือนก่อน +2

      Exactly this, i find it really interesting but i literally burst out laughing because i realise i have no idea what it all means.

    • @javierflores09
      @javierflores09 10 หลายเดือนก่อน

      well, at least I am glad I am not the only caveman around

  • @jimmygervaisnet
    @jimmygervaisnet 8 หลายเดือนก่อน

    Interesting! I'm in audio synthesis, so it's always nice to discover new sine approximations.

  • @lahma69
    @lahma69 10 หลายเดือนก่อน

    Every video you release is simply brilliant and I can't get enough. I eagerly look forward to your next release!

  • @tgrey_shift..mp334
    @tgrey_shift..mp334 10 หลายเดือนก่อน +3

    Hey Kaze!
    I'm no mathematician, but I had an idea regarding those tables. I'm working on becoming an engineer, and those tables reminded me of log tables. The values in the video aren't logs, but if they were logged values, you could perform multiplication by just adding the log values and finding their corresponding number, similar to how slide rules were used for quick multiplication or division.
    Because of how logarithms work, multiplication and division can be done by just adding and subtracting log values, respectively. Perhaps this mathematical property could be used to make your algorithm faster somehow?
    I hope this was food for thought and helps you in finding the optimum sine function! I'd love to know what you think of this detail!

    • @tgrey_shift..mp334
      @tgrey_shift..mp334 10 หลายเดือนก่อน

      Also, I wanted to say that log slide rules can calculate sine functions too!
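The slide-rule trick in one line: adding logs multiplies, subtracting them divides. A sketch, with math.log/math.exp standing in for the lookup tables:

```python
import math

# Multiplication/division via log tables: add or subtract the logs,
# then map back through the exponential ("antilog") table.
def log_mul(x, y):
    return math.exp(math.log(x) + math.log(y))

def log_div(x, y):
    return math.exp(math.log(x) - math.log(y))
```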

  • @Birb64
    @Birb64 10 หลายเดือนก่อน +3

    I think Wikipedia meant megabits, not megabytes. 8x difference

  • @lovalmidas
    @lovalmidas 9 หลายเดือนก่อน

    It is a good optimization considering a program usually deals with only a handful of values as inputs per frame. You are not likely to calculate cos(2) after cos(1), making caching cos(2) irrelevant, but you often calculate cos(1) and sin(1) together, especially for calculating projections of the same vector over two axes. Returning both values at once is very useful for 3D-driven games as both are generally consumed by the same function. You get potentially twice cache utilization before a cache invalidation. This you did in get_sin_and_cos() for the LUT method, and the F32x2 trick in the numerical method.
    Both cases can give you savings outside of the function by reducing the total number of function calls. These savings exist independently of naive instruction/cycle counting.
    Comparison between OoT sine and interwoven sine is quite illustrative. Of course, in the real game, there may be times when only one of the sin/cos is needed, but it seems to be a minority.
    I would be interested to see how much savings from the latter numerical methods can be attributed to the use of F32x2 (consumes one register, assuming the consumer code can work without that register space), as compared to fetching one of them and converting to the other via an identity (consumes at least one multiply op, add/sub op, followed by a sqrt op - which you can take out from the main sincosXXXX function since you no longer need to perform sqrt(ONE - x * x) conversion there).

  • @faranocks
    @faranocks 10 หลายเดือนก่อน

    I made a sine function with 16-bit sin/cos interwoven into 32 bits (4 bytes) for an FPGA. It worked pretty well, with fewer cache misses since sine and cosine were interwoven (an unintended effect tbh, I just wanted 1 input value to return both sin and cos). I also shortened the table to 1/8 of the curve and swapped values around as necessary using hardware muxes; on an FPGA, flipping signals based on the value is pretty fast relative to software calculations (only around 2 more cycles to swap values vs. storing the entire sin/cos wave, and no need to worry about pipeline hazard optimizations as with code). Data size was 32 x 256 bits. Another optimization was using 2048 angles rather than a degree-based calculation, so there was no need for a modulo (0 = 0 degrees/360 degrees, 1024 = 180 degrees, etc.), and I restricted the angle to an 11-bit register so that overflows would just wrap the angle from positive to negative, or from 2pi + 1 to +1, without any additional calculations.

  • @Blueyzachary
    @Blueyzachary 10 หลายเดือนก่อน +3

    My brain took so long to recognize “sixty-five thousand five hundred and thirty-six” as 65536

  • @SanaeKochiya
    @SanaeKochiya 10 หลายเดือนก่อน +4

    you are nothing short of a genius

  • @gemstone7818
    @gemstone7818 10 หลายเดือนก่อน +3

    I love seeing things get optimized

    • @johnellis9281
      @johnellis9281 10 หลายเดือนก่อน +1

      And here I am feeling good when a php page takes less than 5 seconds to load.

  • @sebastienloisel7983
    @sebastienloisel7983 8 หลายเดือนก่อน +1

    You might want to use the "true barycentric form" of Lagrange interpolation, it is very efficient and avoids the numerical difficulty you experienced. Also, the Chebyshev nodes produce a very good approximation to the polynomial of best approximation. I'm not sure if I'm allowed to post links, but both of these topics are well described on Wikipedia.

  • @iankoberlein1974
    @iankoberlein1974 8 หลายเดือนก่อน

    You are an absolute genius, Kaze.

  • @drowned309
    @drowned309 7 หลายเดือนก่อน +3

    This video makes me feel bad at math

  • @SendFoodz
    @SendFoodz 10 หลายเดือนก่อน +10

    when he started showing the sine wave... it hit me... this man is a 1-man dev team ARMY... and he makes YT content?! I'm just gonna appreciate him while we still have him before an AAA game company snipes him from us. You are unreal man, thank you!!

  • @YEWCHENGYINMoe
    @YEWCHENGYINMoe 10 หลายเดือนก่อน +1

    I also like the random jumpcut in the middle of the video

  • @TazAlonzo
    @TazAlonzo 9 months ago

    Hey man! You're doing great! We are all patiently waiting and watching you make the DLC Mario 64 deserves! Love all your content, especially optimizing SM64!

  • @Xilefian
    @Xilefian 10 months ago +4

    I brought up the idea of vector extensions in one of your live streams for returning two (or 4) floats from a function, I gave it a try just now and whilst it is better than returning a struct on the stack (4 instructions), like the struct technique it still avoids the hardware float registers...
    I'm surprised GCC fails here. On other modern architectures the vector extensions produce the same code as the complex number extension
    I suppose MIPS hasn't received much attention for this optimisation (similar story for ARM32 with hardware float)
    If you do ever need to return two (or 4) integers from a function without the horrid struct-in-stack behaviour, use the gcc vector extension
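    A minimal sketch of the struct-free return via GCC's vector extension (the `sincosf2` name is made up; whether the pair actually comes back in registers depends on the target ABI and GCC version, as discussed in this thread):

    ```c
    #include <math.h>

    /* Two packed floats; __attribute__((vector_size)) is a GNU extension,
       also understood by clang. Elements are accessed with [] like an array. */
    typedef float v2sf __attribute__((vector_size(8)));

    static v2sf sincosf2(float x) {
        return (v2sf){ sinf(x), cosf(x) };
    }
    ```

    Usage: `v2sf sc = sincosf2(a);` then `sc[0]` is the sine and `sc[1]` the cosine.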

    • @Xilefian
      @Xilefian 10 months ago +3

      And now that I tried modern GCC with the struct-on-stack technique, it actually behaves the same as the vector extension, so 4 instructions and stack is avoided (pass-by-register)
      Very few ABIs behave like this for structs, so I think modern GCC has made a MIPS specific optimisation for this case

    • @KazeN64
      @KazeN64  10 months ago +4

      huh it never did that for me - i'm not sure how to use vector extensions but if you could show some sample code that would be pretty useful to have a look at. that type of optimization might apply to a lot more stuff in the repo.

    • @optimal284
      @optimal284 10 months ago +2

      Since he's at the insane optimization level he is now and has shown that the memory bandwidth of the N64 is pretty low, using vector registers as essentially stack space might give a huge boost to performance.

  • @SpeedcoreDancecore
    @SpeedcoreDancecore 10 months ago +7

    What about using sin(x) = cos(π/2 - x) instead of sin(x) = √(1-cos(x)²)
    Also I urge you to look into series economization: it's a way to take the Taylor series (which is very precise on (-ε; ε)) and modify it so the error is distributed over the whole interval (-1; 1). Basically you convert the polynomial to the basis of Chebyshev polynomials, cut it off at a lower order, and convert back into the {Xⁿ} basis; you get a polynomial about as precise as the one you started with, but with fewer terms, and precise on the whole interval
    If your function is not on (-1; 1) you can just compose it with an affine transformation to put it there
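    A sketch of economizing the degree-7 Taylor series of sine down to degree 5 (illustrative only; the coefficients and the `sin_econ5` name are not from the video). T7(x) = 64x^7 - 112x^5 + 56x^3 - 7x stays within ±1 on [-1, 1], so substituting x^7 = (T7 + 112x^5 - 56x^3 + 7x)/64 and dropping the T7/64 term adds at most 1/(64·5040) ≈ 3e-6 of error while eliminating the x^7 term:

    ```c
    /* Degree-7 Taylor coefficients of sin(x) for x, x^3, x^5, x^7 */
    static const double c1 = 1.0, c3 = -1.0 / 6, c5 = 1.0 / 120, c7 = -1.0 / 5040;

    static double sin_econ5(double x) {
        /* fold the x^7 term into the lower ones via T7, then drop it */
        double a1 = c1 + c7 *   7.0 / 64.0;
        double a3 = c3 - c7 *  56.0 / 64.0;
        double a5 = c5 + c7 * 112.0 / 64.0;
        double x2 = x * x;
        return x * (a1 + x2 * (a3 + x2 * a5));   /* Horner form */
    }
    ```

    On [-1, 1] this degree-5 polynomial stays within roughly 1e-5 of sin(x), about as good as the degree-7 truncation it came from.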

    • @KazeN64
      @KazeN64  10 months ago +11

      that'd be a lot more expensive because we'd have to run the entire polynomial again. just taking the square root is faster (square root isn't that expensive on the n64)
      the chebyshev polynomial might reduce the error across the interval, but having no error around pi/2 is much more important to me. if we wanted the min error across the whole interval, i'd use a minimax approximation algorithm.

    • @T3sl4
      @T3sl4 10 months ago +4

      @@KazeN64 If I understand correctly, the trick is to obtain coefficients entirely by derivation (no curve fitting required) _and then transforming back to the plain polynomial_ (because Cheb. polynomials are just that, a family of polynomials, sum up the coefficients), and use those in the computation. You have the same polynomial implementation, just with different coefficients put into it. The result will, I think, be equivalent to an L^2 (least squares) curve fit, without having to run a curve fitting algorithm. (You can of course use the algorithm just as well, give or take its numerical accuracy.)
      Chebyshev polynomials are also great for curve fitting in general, as they converge much more quickly than by tweaking naked polynomial coefficients.

  • @thomas-ux8co
    @thomas-ux8co 9 months ago +1

    I actually find it pretty clever to use two different functions with different speed/accuracy tradeoffs for the different situations

  • @johnmoser3594
    @johnmoser3594 9 months ago

    You should be able to calculate cosine for an exact phase (track phase instead of accumulating the cosine value) using a Horner polynomial. As you know, it's an even function, so you only need half the multiplications. The 6th-order cosine approximation requires 4 multiplications (one is x*x which is then used for 3 multiplications by x^2) and 3 additions. Using fused multiply-add this is one multiply and 3 FMA, a total of 4 instructions to perform the actual calculation. If you normalize the phase to [0,1) for [0,τ) this then requires:
    - Load the current phase (x) into a register
    - Depending on the bits for 1/2 and 1/4, decide whether you are going forwards (use bits below 1/2 and 1/4) or backwards (subtract those bits from 0.25) and whether you will use a negative result (caveat: the cosine polynomial approximation normalized to [0,1) is technically domain [0,0.25], but the result is identical if using the range [-0.25,0.25], so it is also possible to subtract 0.25 from the phase and then subtract the result of the polynomial from 0 if the phase is greater than 0.5)
    - Perform subtraction if so needed
    - 4 instruction calculation
    - Subtract result from 0 if in the negative half of sine function
    - Update the phase (you should be adding a normalized number, e.g. 440Hz f0 with updates at 60Hz means adding 440/60, not 2*pi*440/60)
    - Store the new phase
    That should get you under 20 instructions, maybe under 15, with 3 memory accesses (load current phase, load the amount to step the phase, store the new phase)
    Cosine polynomial reference: gist.github.com/publik-void/067f7f2fef32dbe5c27d6e215f824c91#cos-rel-error-minimized-degree-6
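    The steps above can be sketched in C (an assumed sketch: plain Taylor coefficients in the normalized variable rather than the error-minimized set from the linked gist, and branches instead of bit tests for readability):

    ```c
    #include <math.h>

    static const float TWO_PI = 6.28318530717958647692f;

    /* Even polynomial approximating cos(2*pi*x) for x in [-0.25, 0.25]:
       one multiply for w*w, then three fma-able multiply-adds (Horner). */
    static float cos_quarter(float x) {
        float w  = TWO_PI * x;
        float w2 = w * w;
        return 1.0f + w2 * (-0.5f + w2 * (1.0f / 24 + w2 * (-1.0f / 720)));
    }

    /* Cosine of a phase normalized to [0,1) per full turn. */
    static float cos_phase(float phase) {
        float x = phase - floorf(phase);     /* wrap into [0,1) */
        if (x >= 0.5f) x = 1.0f - x;         /* cosine is symmetric about x = 0.5 */
        if (x <= 0.25f) return cos_quarter(x);
        return -cos_quarter(0.5f - x);       /* second quarter: mirror and negate */
    }
    ```

    An oscillator then just does `phase += f0 / update_rate;` each tick and reads `cos_phase(phase)`, matching the normalized-increment point in the comment.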

  • @NoSpamForYou
    @NoSpamForYou 10 months ago +6

    Do you have enough overhead to do some advanced texture filtering in software with a locked 30FPS? Or rudimentary bump mapping?
    Are the CPU and GPU segregated to where you can't do that?
    The textures are the thing really holding N64 quality back IMO. Of course this would clash with the increased CPU cycles you just implemented for the Sine / Cosine.

    • @KazeN64
      @KazeN64  10 months ago +6

      doing any kind of bump mapping is super expensive on the n64. the n64 has a fixed shader so you'd have to implement graphics effects like that entirely on the cpu... not reasonable for a normal level. there are only a few cpu-generated textures for special effects like oceans, light refractions or animations.

  • @lauchlangibson732
    @lauchlangibson732 10 months ago +10

    Finding the BEST dry bones function for dry bones 64

  • @VraccasVII
    @VraccasVII 9 months ago

    You were right, I did like learning about the lot-more-maths in this

  • @flameofthephoenix8395
    @flameofthephoenix8395 10 months ago

    One way of calculating sine that I came up with uses the fact that, in both 2D and 3D, you can find the angle halfway between two angles by normalizing both, averaging their positions, and normalizing again. With this you can compute any value: first you start with 3 stored values of rotation, (0,1) or 0 degrees, (1,0) or 90 degrees, and (0,-1) or 180 degrees; then, using the aforementioned function, you can take the in-between of 90 degrees and 0 degrees to get 45 degrees, then 22.5 degrees, and so on. You can then convert the angle you are trying to find the sine of into binary, and use each 1 bit to select one of the precomputed angles; you add the angles together with a rotation matrix, and you'll get the answer. This works quite well, but isn't designed for speed or to be compact in memory.
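    A hypothetical C sketch of this bisection idea (`bisect`, `compose`, and `sincos_bisect` are made-up names; the angle is given as a binary fraction of a quarter turn):

    ```c
    #include <math.h>

    typedef struct { double c, s; } rot;   /* unit vector (cos, sin) = a rotation */

    /* Halfway between two unit vectors: average, then normalize again. */
    static rot bisect(rot a, rot b) {
        double c = 0.5 * (a.c + b.c), s = 0.5 * (a.s + b.s);
        double m = sqrt(c * c + s * s);
        return (rot){ c / m, s / m };
    }

    /* Compose two rotations (a 2x2 rotation-matrix / complex multiply). */
    static rot compose(rot a, rot b) {
        return (rot){ a.c * b.c - a.s * b.s, a.c * b.s + a.s * b.c };
    }

    /* sin/cos of (frac / 2^bits) of a quarter turn: walk the bits from the
       top, halving the known angle each step and composing where a bit is set. */
    static void sincos_bisect(unsigned frac, int bits, double *s, double *c) {
        rot zero = { 1.0, 0.0 };   /* 0 degrees */
        rot step = { 0.0, 1.0 };   /* 90 degrees */
        rot acc  = zero;
        for (int i = bits - 1; i >= 0; i--) {
            step = bisect(zero, step);        /* 45, 22.5, 11.25, ... degrees */
            if (frac & (1u << i))
                acc = compose(acc, step);
        }
        *c = acc.c;
        *s = acc.s;
    }
    ```

    This is close in spirit to CORDIC, except the per-bit angles come from vector bisection instead of arctangent tables.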

    • @angeldude101
      @angeldude101 10 months ago

      Even better, you can get the components of a rotation matrix with twice the angle between two vectors with just their dot and cross products (Technically a single geometric product.) Even if the vectors aren't normalized, the factor by which the sine and cosine are off by is the same for each, so they effectively cancel out. If you want the rotation only from one to the other, then you need to either normalize the vectors themselves and add them together to get their average, or you can take the double rotation, normalize it, add 1, and then normalize again, which is both the halfway interpolation of the rotation with the identity _and_ the rotation's square root. It's possible with algebraic manipulation to prove that these two methods give the same result.
      Often when finding rotations, you do so by computing the angle between two vectors using the arcsin of the magnitude of the cross product, the arccos of their dot product, or arctan of the ratio between the two. This is then usually plugged straight into a sincos to get the components for the rotation, leaving you with having called 2 or 3 different transcendental functions which do _absolutely nothing but cancel each other out._
      "in both 3d and 2d"? This strategy works in _Nd._ It's completely dimension agnostic. You can even multiply the result by more rotations to compose them, which causes a new behavior in 4d that doesn't happen in 3d or 2d. I didn't even specify the vectors being linearly independent, since the double rotation that results is still a valid 360° rotation, meaning it technically even works in _1d!_
      Kaze mentioned in other comments that he was working on converting things to Quaternions, which are a natural part of this system, and (an obfuscated form of) the actual result given by the geometric product of two vectors, or more accurately _mirrors._
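      In 2D the dot/cross and normalize-plus-one tricks look like this (a hypothetical sketch; `rotor_between` and `rotor_sqrt` are made-up names, with the rotor stored as a cos/sin pair):

      ```c
      #include <math.h>

      typedef struct { double c, s; } rotor;   /* cos/sin pair, i.e. a 2D rotor */

      /* Rotor taking unit vector u to unit vector v, straight from the dot and
         cross products -- no arccos/sincos round trip. For unnormalized inputs
         both components scale by |u||v|, so renormalizing the rotor fixes it. */
      static rotor rotor_between(double ux, double uy, double vx, double vy) {
          return (rotor){ ux * vx + uy * vy,    /* dot   -> cos(angle) */
                          ux * vy - uy * vx };  /* cross -> sin(angle) */
      }

      /* Square root (half-angle) of a unit rotor: add the identity, renormalize.
         Degenerate at exactly 180 degrees, where the half-angle is ambiguous. */
      static rotor rotor_sqrt(rotor r) {
          double c = r.c + 1.0, s = r.s;
          double m = sqrt(c * c + s * s);
          return (rotor){ c / m, s / m };
      }
      ```

      The same recipe carries over to quaternions in 3D: the dot and cross products fill the scalar and bivector parts, and adding 1 before normalizing halves the angle.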