To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant’s annual premium subscription.
Using ancient Indian mathematical formulas to make the funny red man go bing bing wahoo even faster, even after going like five massive optimizations that give wild speedups already Classic Kaze.
@@williamdrum9899 I wonder if this is really true. Float has 7 decimal positions, if I use an integer and treat the first 7 positions as decimals, would that make my code much faster and more efficient?
On the ZX Spectrum, a LUT was probably the right solution. The CPU was slow and the memory was fast. Though there wasn't very much of it, so you'd definitely want the quarter-table trick.
@@keiyakins it does make me want to pull the hardware out and compare the difference. All though I feel this implementation is specific to the way the N64 fetches data
My favorite Z80 trick is aligning a table to &xx00 so instead of loading the address into HL you load the address high byte into H and the desired index into L. Easily triples the speed of any lookup operation
The situation hasn't changed much today, actually; more in terms of scale, and levels of abstraction/recursion -- simple operations like these complete almost instantaneously in CPU time, and the rest of the time is spent ferrying the inputs and outputs through multiple levels of caches. Which may extend not just to system RAM or HDD, but over the network as well (which might in turn be cached across numerous servers, and so on!). Accounting for Rambus access is relatively painless in comparison! 😅
Actually there is a long trend this way, CPUs have accelerated way more than RAM. Antique platforms would pre-compute even flipped sprites, because even though RAM was very scarce, the CPU was even more sluggish than that.
That has nearly always been the case. The SNES and earlier system were basically the last widely available systems were memory access can be done in a single CPU cycle. Nowadays we are at a point were in programming it often is faster to the the simplest linear search over hundreds or even thousands of elements instead of using things like binary-search that "should" be much faster as they need to only do a fraction of the work.
@@ABaumstumpf a few people have pointed that out and yeah they're right. i was more thinking about caching stuff for recursive problems but optimising for memory access is probably a more common thing
I think the raw datapath to the RDRAM is capable of that but due to access latency and the memory architecture of the system the practical bandwidth is much much lower.
No, the ram operates at 500MHz over a 9-bit data bus. The 9th bit is only useable by the GPU. The GPU uses the same memory over the same bus and that’s why the CPU access is relatively slow
14:57 The reason he has to make the value negative here is that Bhaskara is only accurate in the range -pi/2 to pi/2. The calculation for flip sign will be one, if the value is outside of this range. if this is the case the value needs to be shifted and flipped as it is after the if(flipsign) and the ternary operator after the Bhaskara algo proper Here is a graph showing the process, and also how he gets sine out of this calculation. BUT TH-cam wont let me post it because they are dicks
Nintendo has used a lookup table for trigonometry since Super Mario World, where it was used for the rotating sprites (ball and chain, chained platforms, etc)
Back in the day (especially on SNES) the CPU was the bottleneck not the cache so they needed to look it up and not calculate it. I don't even think SNES could do floating point numbers in hardware.
@@AlexRodriguez-gb9ez It could not use floating point numbers, but fixed point (usually 8.8) are very common and can even be hardware-accelerated. In Super Mario World there's a big table with sine/cosine values in this format which is then multiplied with whatever coordinate you want to transform using hardware registers.
Likewise for Sonic 1, also in 1991. I'm getting so tired of this 'Mario 64 was the first game to do X' bullshit. What's next, Kaze - Mario 64 was the first game to ever have 3D graphics?
Man it’s always crazy how even a game that’s considered “poorly optimized” like Mario 64 is still optimized incredibly well for its time. Now a days developers say “oh most peoples computers these days should have about 16 gigs of ram? Why waste time making it run on 8 then?” And then the game still comes out rushed and bad.
Lol mario 64 isn't well optimised at all, they didn't even turn on the automatic compiler optimisations and it regularly drops below 30fps even though the n64 should be able to handle it fine. Modern games at least shipped with the compiler optimisations enabled
What you have to remember is that this is a console game, from the age of cartridges. There was no such thing as "updates" or bug fixes, so studios were forced to give their devs time to get a game good enough to sell.
@@augustdahlkvist3998 I’m not super familiar with this topic in particular but it’s a fact that there’s almost nothing in game development that’s as simple as “turning on an optimization”. I would bet money that the issue isn’t as simple as turning on a feature. There are some things that only look obvious and easy with 3 decades of retrospective. Even then, most modern game studios probably couldn’t get Mario 64 to run at “drops below 30 with big objects” if I’m being honest.
SM64DS also uses an interlaced sine/cosine table, but without mirroring. I never really thought about why that is but given how cache works it makes perfect sense.
The video's great! I love seeing how things like this are done on old hardware. It seems to me like it would be hard to understand how anything would be best optimized for modern hardware with preemptive execution, and weird hardware tricks and strange security flaws - more like magic than science. Even though I don't really program much, let alone optimize anything, optimizing code for old hardware seems like it's something that can actually be learned in a concrete and non-mystical way by a human being, even if it takes effort.
modern hardware is optimized for bad code. i think optimizing code on more modern hardware is less about nitty gritty details like this and more about using best practices. there is also an issue that most modern games run on many different consoles so you dont just have one metric to maximize. a lot of these n64 optimizations would slow modern machines down.
@@KazeN64Ever since Playstations, gaming has gotten more techy to the point of modern renders being mostly power and less “magicks”. Thanks to you, we can see their bag of tricks!
@KazeN64 The more powerful the hardware, the harder it is to run anything bare metal. On modern systems, it can even be impossible, requiring you to bypass even the UEFI itself.
@@KazeN64 It's still optimized for good code, it just also handles bad code a bit less poorly. Bad code on modern hardware can still easily waste 90% of memory bandwidth, etc. Where I think modern hardware shines is making decent, not optimized, but not wasteful code run very fast.
I have a hunch about the angle flipping necessity: tangent. The angles in the 1st and 3rd quadrants (going ccw around a circle on the cartesian grid) have the same tangent values, the same with the 2nd and 4th quadrants. So you have to distinguish when you want the angle from the 1st or 3rd quadrant (and the 2nd or 4th Quadrant). MS Excel (and Google Sheets iirc) have the 2 argument arctan function to treat the angle calculation as a vector problem, but since Mario 64 doesn't have this, you have to use a from scratch angle discrimination setup, much like what Kaze ended up using.
atan2 is a godsend when it comes to not having to roll your head everywhere to figure things out. you can define it from a classic arctan though -- just some if cases for signs and stuff
Me (Person who has a literal degree in computer science, built multiple games, and over 5 years industry level experience): This guy is insane and so much is going in one ear and out the other but this is so interesting to watch
I remember from a book called "linear algebra done right" on page 115 it says the best approximation for sin(x), of fifth degree, on the interval [-pi,pi] is given by 0.987862x − 0.155271x^3 + 0.00564312x^5.
In GCC for ARM, when you want to return two 32 bit integers in registers, you return a 64 bit number. To convert between two 32 bit numbers and one 64 bit number, you use a union.
When I was coding 3D dos games 25 years ago, I couldn't understand why my sin table was not that fast, it is just one line of code with no CPU computation... Cache/memory bottleneckes make a huge different, I confirm.
Yeah you'd have to go back to the '80s when we had no caches to avoid that. I used to play with this stuff in the 8-bit days until about the Amiga era.
@@Anon.G Haha yeah. At least with C compilers you can actually finish a project. Writing an entire game in assembly is a monumental task even on an 8 bit platform
It's always interesting to see how constraints inspire creativity and inventiveness. I have very limited programming experience, but a fair bit of writing experience, and having a constraint (the syllable count of a haiku, a particular meter in a poem, a more arbitrary constraint like not using a certain letter at all) really pushes you to play around with the elements you've got and try to achieve a pleasant result in spite of the constraint. It strikes me as weirdly similar to working around memory and cache constraints. Reminds me, too, of a book I read written by an ex-Atari programmer. (I think it was Chris Crawford.) He talked about the clever things they tried to do with graphics by executing code in the handful of machine cycles in between CRT raster lines.
@@amineaitsaidi5919 No, a lookup table is actually quite a good optimization. Like Kaze said, computing the actual Taylor series to a high degree of accuracy would be incredibly slow. Nintendo didn't understand just how much their memory bus design constrained everything until much much later, so at the time it made perfect sense to try to save on CPU with the table.
I love these optimization stories of old video games. I switched from application development to firmware development because I figured it would be the closest I could get in todays technology landscape. Thank you for your videos. They are such a treat ❤
As a non-computer scientist, I wish I didn't fall asleep in high school geometry class. I find these types of videos incredibly interesting as well, but I have no idea why sine waves are needed to calculate angles in thee dimensions. What does a sign save have to do with math in this context?
@@Psythik sine waves are necessary to calculate angles in _any_ dimensions ( aside from, i suppose, 1d geometry ... mostly because it, uh, doesn't have angles, at least as we traditionally understand them ). or, more precisely, they're necessary to calculate the cartesian coordinates of a vector rotated by an angle, aka, the change in a game entity's _x_ and _y_ positions when moving _d_ game units at an angle of _a_ degrees basically, imagine a standard 2d coordinate plane, then imagine a right triangle with a hypotenuse ( the line that _is not_ part of the right angle; it's the one that's usually drawn diagonally ) of length _d,_ extending from ( 0, 0 ). this means that one of our non-hypotenuse sides is our x-axis, while the other is aligned with the y-axis. the _a_ of sine and cosine is one of the triangle's non-right angles, and the one that's most useful for us is the one that ends up right next to ( 0, 0 ). now, the sine and cosine of _a_ is the length of the non-hypotenuse side that is opposite / adjacent to, respectively, a, divided by the length of our hypotenuse. in other words, they're the ratio between the _y_ / _x_ coordinate of the hypotenuse's second vertex and the hypotenuse's length. so, all we need is to multiply the length by those ratios and, voila! we now know that the hypotenuse extends from ( 0, 0 ) to ( _x, y_ ). which is something you may recognize as a vector, but even if you don't, all you need to know for this example is that you can add this _x_ and _y_ to the coordinates of a point to offset that point _d_ units at angle of _a_ if that seems like a whole lot of steps for a whole lot of nothin', remember that basically _every_ game maps out space using cartesian coordinates. every point is represented by an x value, a y value, and maybe a z value if you're working with three dimensions, and ... that's it. if you want to move 14 units from ( 32, 24 ) at a 45 degree angle, then you gotta calculate exactly _what_ position is 14 units from ( 32, 24 ) at a 45 degree angle. and it isn't ( 39, 31 ), what you get if you just cut 14 in half and slap it on both numbers, even if a 45 degree angle is perfectly diagonal and thus offsets equally on both axes - remember, all of a circle's points are an equal distance from its origin, and yet its diagonal points aren't ( 1, 1 )*. the actual change in _x_ and _y_ we get out of this is a bit under 10, so we actually end up at ( ~42, ~34 ) incidentally, the reason why kaze spoke as if they _aren't_ necessary in 2d games is because all of this is _wholly irrelevant_ in 2d mario games, because they're sidescrollers. so the x and y axes are just completely different forms of movement, so him walking 14 units forward is entirely contained to the x axis. but it's also true that a lot of simpler top-down games, like a lot of old zelda ones, don't bother with using this sort of math for their movement and just slap a constant speed value onto the entity's x and y coordinates. this results in diagonal movement being straight-up faster than cardinal movement, but it's usually not all that noticeable because of the game's presentation. but then you have the interesting case of the original doom games, which as you'd expect from a fps, properly vectorizes your movement so your forward speed remains consistent regardless of which angle you're facing ... except, it vectorizes your forward and strafing movement _separately,_ so you _still_ get the "diagonal movement = faster" effect, it just ends up rotated so you have to move diagonal relative to your facing angle rather than the map's origin *well, they are when using chebyshev distance, where circles are literally just squares. yeah, there are multiple ways to define distance ( which, granted, probably _isn't_ something that got covered in your geometry class ), and it's less theoretical than you might think because ... there's just a lot of different things you can measure! like, chebyshev distance is outright useful in game design, because it represents a type of movement used in the little known game of _chess._ another common alternative to euclidean distance is taxicab distance, which you'd be familiar with if you've played any tile-based game that doesn't allow diagonal movement, like advance wars or fire emblem
@@starfoxmoore6037Don't take it personally. A lot of people just found out over the past few years that a bunch of liars with college degrees have been masquerading as practitioners of science. If you are practicing the scientific method, it's probably best to let your work speak for itself. People simply don't trust the formal classifications at this point, and who could blame them.
I thought I could escape this stuff for a few months after finishing my GCSE exams... I guess not then 🤣 Kaze's videos are a joy to watch nevertheless.
I wish they taught practical applications of higher math. I tried to make a 2d animation of dots spinning like a processing top, turns out it's a ton of trig inside of matrix algebra. I got into game dev and wanted to make a projectile bounce off a wall, more trig inside matrix algebra, vectors are everywhere in game dev. If you want your character to be affected by gravity, there's immediately calculus to handle. The whole concept of "ramp" in magic the gathering is just exponential growth. Geometry is useful all over the place in action games. It's like you can't move your character without at least distance/hitbox equations happening. So many people say "Math is miserable, you never use it, it's not worth the time" but I try to say math is important, it's a way to understand every physics process unfolding around you, and that power it extremely useful. Even if it's just so you can speak to computers and be better at basic programming, or just to be a better gamer, math is incredibly useful.
Instead of Euler angles, which have several drawbacks such as gimbal lock and numerical precision, would it be more efficient to use another rotation representation such as quaternions? Would that also reduce the number of needed trig function calls?
Yes, I'm actually planning to convert all the animations to quaternion animations - It'll reduce it from 42 multiplications and 6 sin/cos calls to 37 multiplications. There is also an advantage in quaternions being 8 bytes per rotation instead of 6 - that sounds worse at first, but that way you can guarantee every quaternion is accessible in a single dcache access (whereas the eulers could be spread across 2 dcache lines). So yeah i'm absolutely planning on doing that.
@@KazeN64 Wow, you're eventually gonna rewrite the whole damn engine! Quaternions seem less intuitive at first, but then you see their elegance. And when you start counting mults and other ops, the benefits are clear.
@@Isaac________ They're only less intuitive if you try to visualize them as 4-dimensional arrows. They're _very_ intuitive if you look at them for what they are: mixtures of 4 basis 3D rotations: 180° around x, 180° around y, 180° around z, and 0° (around nothing). An alternative way of visualizing them is as compositions of two reflections. They're more efficient to compose than matrices, but less efficient to apply. If you want to mix in translations, then you need something called dual-quaternions, which have 8 components, but that's still less than a 16-component homogenous matrix or 12-component affine matrix.
@@angeldude101 I think a lot of confusion around quaternions is caused by how we usually think about imaginary numbers as a representation of 2D space. If you try to tack on a similar interpretation of the scalar part of a quaternion as relating to a basis axis, then it doesn't make sense. The scalar part is related to an inner product and projects your vector into a plane where the rotation is a simple reflection. It's the imaginary vector part that "lives" in 3D space. It also helps to realize that the multiplication rules of quaternions are just an algebraic representation of the cross product: j*k = i which is similar to: y × z = x for basis axes x, y, z.
I am so notoriously bad at math my brain literally locks up when presented with it but you still manage to explain everything in these videos enough that my potato brain can keep up. Which is good because I love seeing how you optimize the hell out of this game despite being a math potato.
"What if I told you there's room for improvement?" Kade, this is YOU we're talking about. You'll probably be able to get Super Mario 64 to run on a TI-86 Calculator if given enough time
I've always been fascinated with how computers implement trigonometry. It's more difficult than arithmetic, and there are many ways to do trig functions. Another method is the CORDIC algorithm, which involves adding smaller and smaller angles until it reaches the input. It was developed for use in 8 and 16 bit processors. The TI-83 calculator, which uses a chip similar to the gameboy's, uses this algorithm.
It's megabytes on their website. There are other website claiming 4500mbit/sec too. I thought they could be talking about Mb/s too but unforuntately they werent
@@KazeN64 maybe they calculated it in Mbits/s and accidentely wrote it as MB/s? and then someone took the wrong MB/s from wikipedia and multiplied it by 8 again resulting in 4500?
@@t0biascze644 Definitely that lol. It's a rookie mistake, and since everyone is using Wikipedia as their reference.. Well nearly everything ends up wrong afterwards.
@@KazeN64RDRAM operates at 250 MHz over a 9-bit bus with data (possibly) transferred at every clock. That’s where the number comes from. As the GPU uses most of that and the overhead is worse, the CPU might get only 75MB/s
@@lotrbuilders5041 i've computed my 75mb/s from the graphics actually. the CPU is a lot slower than the 75mb/s. usually they share the data speed so both components together probably average around 25mb/s during normal gameplay.
Have you ever tried CORDIC algorithm? It needs a (quarter?) Sine table but calculates everything just by adding and shifting, so basically one cycle operations and the accuracy depends on the number of repetitions you give the algorithm. Could be too taxing on the bus but maybe give it a try? Not hard to implement though... Used it on FPGAs to reduce computation unit requirements by a lot, since multiplier are rare on those. Golem has a great article about it (and in german :3 )!
I got a welcome ego boost watching this video by thinking about the "only use one 4th of the curve" and "interweave sine and cosine" before you mentioned them. Thanks for that, I love these optimization tricks videos
A tip about evaluating polynomials -- start with the highest coefficient, multiply by x, add the next coefficient, multiply by x, etc. Called Horner's method, this greatly reduces floating point error as the terms tend to be of similar scale. For all-even or all-odd series, also prepare x^2 beforehand, to evaluate it as a polynomial in x^2 instead; for odd, factor out the common x, then multiply by it in a final step. About Bhaskara's formula: it's not so mysterious, just a rational approximation. A rational polynomial is a function f(x) = P(x) / Q(x), P and Q being regular polynomials. Generally speaking, the accuracy is equivalent to a n+m order polynomial, where n and m are the order of P and Q respectively. Evidently, this would be advantageous for modest order polynomials (maybe 7th+?) where the division can save some multiplications. Rationals are particularly well suited for peaky functions, where 1/x or 1/(x^2 + a^2) sorts of forms would otherwise be very challenging for a polynomial fit. For smooth functions like sin/cos, it's not too important; and indeed, your 4th/5th order forms are in the same order of magnitude accuracy, as expected. There's also a method not listed here, the hybrid between look-up and curve-fitting: you could have a table of not just values, but coefficients as well. A much smaller table (say 16 or 64 elements) could contain, say, the offset, slope and curvature, which are simply dumped into a quadratic polynomial, and out drops the spline result for that segment. Note this is less intensive than a linear (let alone quadratic or cubic) interpolation, as the coefficients don't need to be computed from adjacent points (table values); for example linear interpolation has to calculate the slope between adjacent points (evil division!), but we can just tabulate that instead. It's not obvious if this would perform better overall than the final choices in the video, but considering how comprehensive the video is, I suppose it bears mentioning. The mention of returning complex pairs has some very interesting outcomes, if taken to its logical conclusion. Ultimately all we're doing is coming up with a vector [x, y] = [cos theta, sin theta], and most of the time we're using both values close by so it makes sense to return both pairs directly. _We could avoid angles entirely and only pass around these vectors!_ This is probably such an immense pain that it's not worth it (every single object which takes an angle parameter has to be refactored to take a normal vector instead!), but, interesting just to imagine how it would all work. Note that, where angles can be added or subtracted, angle vectors must be multiplied or divided, using the complex vector formulas (a + bi)(c+di) = ac - bd + i(ad + bc) and (a + bi) / (c + di) = ( ac + bd + i(bc - ad) ) / (c^2 + d^2). (The division is avoided as long as c^2 + d^2 = 1, which is true for normals.) Fractional fixed-point can be used to save on FLOPs; probably short pairs, so, doubling the memory requirements of stored angles. The magnitudes will eventually drift from enough accumulated operations, and using first-principles values (like, oh I don't know, calculating sin and cos? :^) ) from time to time, or entirely in the first place (imagine that!) is probably helpful; or normalizing (divide by sqrt(a^2 + b^2) -- yuck, what a floating-point mess! -- but a simple iterated approximation can be used (or the infamous inverse square root..!) since it just has to move the value closer to normal, not set it exactly so. Tremendous amount of work, by the way! Numerical approximation, and implementation, can be quite a deep rabbit hole. As, uh, I think you've discovered, hah! :D
I'm not sure why a Taylor series approximation would be too slow. There are no divisions and you can compute x^5 in three multiply instructions. The whole thing would probably take about 8 multiplies and 3 adds for an 8th order accurate approximation. Check out the musl or glibc libm implementations.
Also, -ffast-math can cause incorrect computations since it can break IEEE compliance. Do you see the same incorrect stretching when you build without it?
Really interesting. Easy to understand the x y speed part from the beginning, but sadly, I’m having a hard time with the use of sin/cos in 3d bone movement and wishing I understood why it’s necessary from the video.
"I never expected anything I've learned in school to actually be a useful life skill" Apparently hacking out a polynomial with a low-error approximation of sine for SM64 optimization in 2023 is a useful life skill lol
I'm sure you already know this, but someone did a response to this vid where they brute-forced the best sine/4th order cosine with the least amount of error using a SAT solver. TH-cam will eat my link, but the post is "Finding the BESTEST sine function for Nintendo 64" by Steinar H. Gunderson.
The lookup table approach is used in the classic Yamaha OPL2 and OPL3 FM synthesis chips. They contain a 256-entry table of the first quarter in ROM. However, it's not a straight-up sine function, but a log of a sine function. They also contain an exponential table ROM as well, needed for converting the log-transformed values to linear ones.
@@sinom I wouldn't be surprised if some marketing guy decided to change to "MB" because it "looked neater" without understanding the technical implications of it
@@sinomGenerally memory transfer speed is measured in bits/second, not bytes/second. It's not impossible that in a community driven wiki someone made a mistake somewhere along the chain of copying or stating information. And the b/B difference is a newer thing - for quite a while it was common in data/tech sheets to have B just mean both, requiring a call to the vendor or knowledge of how transfer speed is measured at this particular company. But yes, it should be b for bits, or Mbit/s to be explicit.
1:35 As a programmer, I have noticed that we don't actually have any proof that this is the best way to represent 3d rotations, while it is without a doubt always going to return a value that is on a sphere, we don't know that it's the most efficient, or the best way to do it.
I'm pretty sure it isn't, because it requires 3 separate rotations, meaning 3 separate complex exponentials, and since the axes are relative to each other, it's possible to lose a degree of freedom if two of them align. Quaternions are much more likely to be the best way to calculate rotations since they just require a single complex exponential and an arbitrary axis on top of being as efficient to store as Euler Angles (3 components). Actually applying the rotations is a different story since it's actually more efficient to do that with matrices than quaternions, but matrices are less efficient to store and compose, so you have to balance the tradeoffs.
@@angeldude101 I think probably the optimal system would be with only two variables, in other words it could be attained by storing a 2d-value. While I know for certain that we already have systems like that, I also don't know that those systems are the best ones. I think likely it would be a lot more complicated than what we have, in a similar way to how 2d rotation is stored as a 1d variable compared to 1d rotation being a Boolean/0d variable.
@@flameofthephoenix8395 In 2D, there's only 1 plane within which you can rotate, so you only need 1 component, and from that you can derive a mixture of of rotation and not-rotation (2 components). In 3D, there are 3 orthogonal planes within which you can rotate, so you need 3 components to define the axis (which only actually requires 2 coordinates) and the angle. More generally, N-dimensional space has N choose 2 rotational degrees of freedom. (1 choose 2) = 0, so 1D rotations have 0 components; (2 choose 2) = 1, so 2D rotations have 1 component; (3 choose 2) = 3, so 3D rotations have 3 components; (4 choose 2) = 6, so 4D rotations have 6 components. That last case actually has something new happen when converting from a axis-angle to a mixture of rotations, as you can now get an 8th component that appears when the axis is actually 2 completely orthogonal axes and can't reduce to just 1.
Great video! I remember using overlapping sin/cos tables for the BBC Micro back in the day. And I think I recognise the UI you used for comparing the C code compilation results! 😊
Hey Kaze, this was another solid deep-dive. If I could make a suggestion: I found the additional audio of the speedrunning/glitching balanced a little too loud. Loud, glitched "wa-wa-wa-wa-wahoo"s into a wall + music + mathematics VO is a lot to listen to all at once and this is the only time I can recall a video of yours feeling this loud and overstimulating. I'd like to point to 16:45 as an example, so see if others feel the same. Otherwise, great topic! Love the feeling of progression to the algorithm
5:29 It actually comes from : 250 MHz clock speed (RDRAM double data rate) = 500MT/s 9 Bits bus * 500 MT/s = 4.5 Gbps = ~562.5 MB/s Unfortunately that's in theory a lot goes on in between, latency, memory controllers, data computing speed etc...
Thank you for making videos like these explaining mathematical concepts related to Mario 64. It is informative and appreciated, and encouraging learning
Hello! Floating-Point/Computer Maths engineer here. I've developed trig functions for quite a lot of architectures/my life at this point, maybe I can give some suggestions. In our world, the minimax approximation is mostly king. I say mostly because mathematically it's usually the provably optimal polynomial, however for us there's the usual floating-point issues you get trying to apply maths onto computers. But if there's an issue with a minimax approximation you usually look inwards before looking outwards (Especially if your sin/cos ends up being greater than 1) If you have problems around 0, you should probably be optimizing for relative error, as opposed to absolute error which is what it seems you've done here (I might be wrong). In the business we tend to use an error metric called the ULP error (units in last place), which roughly measures how many bits at the end of your floating point answer are incorrect. (It's very close to a relative error measure, bit not quite the same. Optimising for relative error usually works well) Because of how floating-points scale, this means that you need tiny errors for tiny floats, and can have much larger errors for larger floats. Using the absolute error as your metric allows vales near 0 to blow up to however large you want in a relative error sense. Without an exact application in mind this ULP-metric is the one people like myself would use to write eg maths libraries. Another issue is that if you take the coefficients of an optimal real-valued minimax polynomial and convert them straight to floats you lose an pile of accuracy, unfortunately for computer scientists everywhere. Probably a lot more than you might think, when compounded with floating-point evaluations. We have some tools developed to try get around a lot of these limitations, a good starting point is one called sollya. Here you can ask for relative/absolute error, and give all kinds of constraints to your problem. Importantly you can restrict the output coefficients to be floats, and it will try optimise around this, so you won't lose large amounts of accuracy converting reals coefficients. For example, on the interval [0 -> pi/4], for sin() it gives a cubic polynomial of: x * (0.9995915889739990234375f + x^2 * (-0.16153512895107269287109375f)) Or by forcing the first coefficient to be 1 (generally sollya knows best, but sometimes storing a value of 1 is free on hardware, and it also might make really tiny values more accurate at the expense of larger ones): x * (1.0f + x^2 * (-0.16242791712284088134765625f)) And similarly the next higher polynomial gives: x * (0.999998509883880615234375f + x^2 * (-0.16662393510341644287109375f + x^2 * 8.150202222168445587158203125e-3f)) Or with forced 1: x * (1.0f + x^2 * (-0.166633903980255126953125f + x^2 * 8.16328264772891998291015625e-3f)) Note that sollya does not produce the absolute best polynomial possible, but it's the best you'll get with not too much work, and the results it does give are usually very decent.
Otherse have suggested sollya as well! The problem with the polynomials you've suggested though is that you are adding additional coefficients. I've purposely chosen only odd and only even polynomials because they give you a lot of accuracy with less coefficients. instead of a 3rd order sine, an odd 4th order cosine will always be cheaper and more accurate for example. considering the constraints i gave my polynomials, i also think it's entirely impossible to come up with a different one. we have used up all degrees of freedom after all. although someone suggested easing up on some of the conditions for the higher order polynomials and it did end up improving them.
Funny I actually worked on a project that used the same sine approximation you came up with for the fifth order, but for the third equation (for the last coefficient) we calculated it by minimising the distance between the function and the ideal sine, to do so we ended up equating both integrals between 0 and pi/2 and deduced the result. We had in the end : f(x) = a*(pi/2)*x + b*((pi/2)*x)^3 + c*((pi/2)*x)^5 with a = 4*(3/pi - 9/16) , b = 5/2 - 2*a , c = a - 3/2. I haven't checked if it's better or worse than your function that you found but when we tried a strictly derivative approach (What you did kinda) it ended up doing worse for our purpose. edit: didn't see the new video my bad
yeah that's more accurate. this video was made before i even knew polynomial solvers were a thing and i just kinda went with the approach i knew worked somewhat.
20:10 while these bugged bones are not desireable normally, I think they could make for a pretty funny mechanic for another romhack... maybe a power-up?
For generating the minimax coefficients you can use a command-line-tool called Sollya. It's neither very well documented nor does it give optimal results, but it's pretty close.
If you split it again into two separate approximations one from 0 to pi/8 and a different one from pi/8 to pi/4, I think you can probably get a pretty accurate quadratic approximation of sine. I didn't work out the details so it might end up being slower in the long run or have other problems.
@@KazeN64 I have seen this method used in a different context to good effect, which is why I suggested it, but every processor is different, and I don't really know this one nearly as well as you do obviously. So fair enough.
@@KazeN64 And by the way, you can get better polynomial interpolations sometimes by choosing different nodes. I tried out a fifth order cosine with f(x) and f'(x) equal to those of cosine at x=0, pi/4, pi/2 and it gives an order of magnitude better approximation than the fifth order sine you showed. and it has the property also that it is bounded from above by the cosine function on the interval 0 to pi/2
@@timseguine2 wait why a 5th order cosine? odd orders should always be sines and even orders should always be cosines. otherwise you end up having too many coefficients. can you show the function?
@@KazeN64 The way the coefficients disappear depends on which x values are used for evaluating the interpolation. If you split them up like I suggested it works out to: 1 + ((1024 sqrt(2) + 256) π + 4096 sqrt(2) - 11776)x^2/(128π^(2)) + ((-8192 sqrt(2) - 2560) π - 16384 sqrt(2) + 67584)x^3/(128π^(3)) + ((20480 sqrt(2) + 8192) π + 16384 sqrt(2) - 139264)x^4/(128π^(4)) + ((-128 sqrt(2) - 64) π + 768)x^5/(π^(5)) and the maximum error on 0 to pi/2 is about 0.00004 The thing I said about the bound was backwards, but it is never larger than 1. There are other alternative ways to interpolate it depending on what you are going for. A decent quartic one could work for example by dropping one of the conditions at x=pi/4, but I didn't try that
watching your videos makes me feel smarter. i have very little expectation of ever utilizing these lessons in my waking life, but i do love learning from you. thanks man.
It's crazy to think that Kaze might just be one of the greatest programmers of our generation... And he uses his insane talents to make Mario 64 mods O_o
Great video. In a past life, I wrote display device drivers for the GEM Desktop VDI interface, mainly on 286 and 386 systems. In that interface, they do use some angle parameters, mainly for circular arcs. Angles are specified in 0.1 degree increments. The supplied driver code used sine lookup tables and was based on octants, so 0.0 - 45.0 degrees. There was some front end code that would reduce angles into one of 8 octants, even for angles over 360 degrees. I started out using this code but found some strange behavior every once in a while the display update would pause for a noticeable delay. I ended up writing a test program that would load the driver and be able to send it a range of parameters for all the graphics primitive functions. This way I could load my program (and the driver) with a debugger (displaying on a separate system via serial port). I finally narrowed down the problem in the arc drawing code. While the comment in the code (was originally written in a Fortran variant, then the assembler output from that was used) said reduce the angle to 0..7 octant, it was actually doing 1..8 octant. This was in a loop where the code would take the angle, check for > 360 and if larger, subtract 360 and repeat. Then it would check for octant 0..7, but if the angle reduced to 8, it would hit the outer loop and keep going until it finally would get into the 0..7 range. First, I fixed the assembler code to do the proper 0..7 calculation and it worked fine. However, there was over 900 bytes of lookup table (and this was in the 16 bit / 640KB days), so that was a huge chunk of memory just sitting there in case a rarely used arc call was made. I was also in the process of converting almost all this assembler code to C and realized this whole mess could be replaced with a call to the sin() library function. Since we were selling high end graphics cards, almost every customer of ours had a 287 or 387 math co-processor in their system, so this was a fairly low overhead call. Doing this trimmed about 1KB from the driver memory footprint and it ran smoother without that annoying pause every time it got a strange angle parameter.
Really interesting math! This comment is long, just so you know. I tried to find a 4th order approximation satisfying: f(0) = 0 f_prime(0) = 1 f(pi/2) = 1 f_prime(pi/2) = 0 The only one is ax^4 + bx^3 + x, where a = (16 / pi^4) * (pi - 3) b = (16 / pi^3) * (2 - 0.75pi) The maximum error is ~0.0020673. As for cos(x), another method is to take the derivative of the polynomial for sin(x). Although there will be a greater max error, it will be faster, as the polynomial is of lesser degree and it avoids using a sqrt. If we use the 4th degree polynomial I found, the maximum error is ~0.005. Also, because of the constraints on ~sin(x), this ~cos(x) will always be between 0 and 1 (in the range 0 .. pi/2). As for smoothness, the second derivative f''(0) = 0 correctly and f''(pi/2) = -1.0437 instead of the perfect -1. I also tried to find a 3rd minimax approximation, forcing the following: f(0) = 0 f(-pi/2) = -1 f(pi/2) = 1 The function I ended up with was approximately 0.9824x - 0.14x^3, more specifically: 0.9823962677074202x - 0.14013793513412705x^3 It has a max error of 0.005745, which is slightly better than your function. non-pro tip: Those 'error bumps' at 19:35 should have the same maximum height. Meaning, in the minimax approximation, you sacrifice some error in the shorter bump in order to lower the tallest bump, eventually resulting in the minimax approximation However, it faces the same issue as yours. That is, the function may be > 1. The maximum is at around 1.0012. Forcing the cubic to not exceed 1 (by forcing f_prime(pi/2) = 0) results in easyaspi's approximation.
I can't believe I was so engaged by this video for 26 minutes. Gotta be the best ad read I've ever seen, too, I spent 15 of those minutes "how could I possibly learn all this?"
Would it be faster to compute cos(x) as sin(x + pi/2), instead of using sqrt to get both? I'm assuming the compiler could simply emit the "offsetting" as a prepended prologue to the sin(x) function. But maybe you never need only sin or only cos, so that wouldn't be worth it?
@4:26 - If you're storing 4096 sine values as 32-bit floats, it's not 4 KiB, it's 16 KiB. In the case of the Zelda64 method, you're looking at 4 KiB. I highly doubt that you're using 8-bit chars to store angles, because you'd be looking at a precision of only 1.4 degrees.
2:01 A transcendental number is _NOT_ a number that does not repeat. √2 is a number that does not repeat, but it’s not transcendental either. The opposite of _transcendental_ is _algebraic._ An algebraic number is one that is the solution to a polynomial with integer coefficients, i.e. something like x⁵−x−1 = 0 is well-known to have a solution, which is algebraic by definition and its value is approximately 1.1673, but it cannot be described in a closed form with integers and basic operators including roots (square root, cube root, etc.).
The fact that this is mostly review for me from my Numerical Analysis class! It just blows my mind seeing this all working! I mean, I’ve worked with this many long nights studying, but still, it amazes me!
Use the angle addition formulas sin(x+y) = sin(x)cos(y) + cos(x)sin(y) and cos(x+y) = cos(x)cos(y) - sin(x)sin(y). If you need to compute sin(z), compute sin(z) = sin(x+y) Use a very small lookup table with only a handful of entries (32? I dunno. Not many.) to compute sin(x) and cos(x). (with your interleave trick (heckin' neat btw) that's only one RAM hit on the rare occasions it's not already in data cache) calculate y = z - x. (y should be very close to 0) Then 'calculate' sin(y) = y and cos(y) = 1. then sin(z) = sin(x) + y cos(x) and cos(z) = cos(x) - y sin (x). If you're using radians that's 5 arithmetic instructions, 6 if you're using not-radians. You need to one cache line twice for the lookup for the sin(x)/cos(x) values, but the table should be *very* small and it's reasonably likely to always be in cache.
with y very close to 0, x could grow as big as being very close to pi/2. sin(pi/2) = sin(0) + pi/2 cos(pi/2) = 0 is not an accurate enough approximation.
Me at the start of the video: "Yay I'm about to learn stuff!" My brain after watching this: 😵💫 Super interesting watch, was glued to my seat from start to finish!
I brought up the idea of vector extensions in one of your live streams for returning two (or 4) floats from a function, I gave it a try just now and whilst it is better than returning a struct on the stack (4 instructions), like the struct technique it still avoids the hardware float registers... I'm surprised GCC fails here. On other modern architectures the vector extensions produce the same code as the complex number extension I suppose MIPS hasn't received much attention for this optimisation (similar story for ARM32 with hardware float) ! if you do need to ever return two (or 4) integers from a function without the horrid struct-in-stack behaviour then use the gcc vector extension
And now that I tried modern GCC with the struct-on-stack technique, it actually behaves the same as the vector extension, so 4 instructions and stack is avoided (pass-by-register) Very few ABIs behave like this for structs, so I think modern GCC has made a MIPS specific optimisation for this case
huh it never did that for me - i'm not sure how to use vector extensions but if you could show some sample code that would be pretty useful to have a look at. that type of optimization might apply to a lot more stuff in the repo.
Since he's at the insane optimization level he is now and has shown that the memory bandwidth of the N64 is pretty low, using vector registers as essentially stack space might give a huge boost to performance.
You should be able to calculate cosine for an exact phase (track phase instead of accumulating the cosine value) using a Horner polynomial. As you know, it's an even function, so you only need half the multiplications. The 6th-order cosine approximation requires 4 multiplications (one is x*x which is then used for 3 multiplications by x^2) and 3 additions. Using fused multiply-add this is one multiply and 3 FMA, a total of 4 instructions to perform the actual calculation. If you normalize the phase to [0,1) for [0,τ) this then requires: - Load the current phase (x) into a register - Depending on the bits for 1/2 and 1/4, decide whether you are going forwards (use bits below 1/2 and 1/4) or backwards (subtract those bits from 0.25) and whether you will use a negative result (caveat: the cosine polynomial approximation normalized to [0,1) is technically domain [0,0.25], but the result is identical if using the range [-0.25,0.25], so it is also possible to subtract 0.25 from the phase and then subtract the result of the polynomial from 0 if the phase is greater than 0.5) - Perform subtraction if so needed - 4 instruction calculation - Subtract result from 0 if in the negative half of sine function - Update the phase (you should be adding a normalized number, e.g. 440Hz f0 with updates at 60Hz means adding 440/60, not 2*pi*440/60) - Store the new phase That should get you under 20 instructions, maybe under 15, with 3 memory accesses (load current phase, load the amount to step the phase, store the new phase) Cosine polynomial reference: gist.github.com/publik-void/067f7f2fef32dbe5c27d6e215f824c91#cos-rel-error-minimized-degree-6
To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant’s annual premium subscription.
5:30 it's wikipedia, you can edit it
Considering you’ve added hi res models into SM64 before, is it possible to add Render96 to it?
@@CandiedC render96 looks like shit, so even if you could it would be better not to
@ 17:34 the derivative is wrong. The correct derivative is 3ax^2 + b, not 3ax^2 + x.
@@RanEncounter oh whoops thats a typo
Using ancient Indian mathematical formulas to make the funny red man go bing bing wahoo even faster, even after going like five massive optimizations that give wild speedups already
Classic Kaze.
"funny red man go bing bing wahoo" is a great band name
The trick to fast computing is to use integers as much as possible
@@lerikhkl I love their song "everybody cheating but me".
@@williamdrum9899 I wonder if this is really true.
Float has 7 decimal positions, if I use an integer and treat the first 7 positions as decimals, would that make my code much faster and more efficient?
This is part of Bhaskara's legacy, one he could not have possibly imagined hundreds of years ago.
I've been optimising sine tables since the ZX Spectrum. I never realised how many tricks I've missed! ... BRILLIANT ... Thanks for sharing
Holy shit bruh, the wise sage spoketh.
Speccy > C64 😝
On the ZX Spectrum, a LUT was probably the right solution. The CPU was slow and the memory was fast. Though there wasn't very much of it, so you'd definitely want the quarter-table trick.
@@keiyakins it does make me want to pull the hardware out and compare the difference. All though I feel this implementation is specific to the way the N64 fetches data
My favorite Z80 trick is aligning a table to &xx00 so instead of loading the address into HL you load the address high byte into H and the desired index into L. Easily triples the speed of any lookup operation
This man knows more about the n64 than nintendo themselves
Yeah, for real haha
Yes he does
@@kserratabruh, u didint watch all his videos
Also probably the best assembly coder there is.
@@kserrata well who still code's in assembly and tries to optimize a coding language from 70 years ago
it's so cool to see situations where it's actually worth it taking up more CPU cycles than RAM, when a lot of programming problems are the opposite
The situation hasn't changed much today, actually; more in terms of scale, and levels of abstraction/recursion -- simple operations like these complete almost instantaneously in CPU time, and the rest of the time is spent ferrying the inputs and outputs through multiple levels of caches. Which may extend not just to system RAM or HDD, but over the network as well (which might in turn be cached across numerous servers, and so on!). Accounting for Rambus access is relatively painless in comparison! 😅
Actually there is a long trend this way, CPUs have accelerated way more than RAM.
Antique platforms would pre-compute even flipped sprites, because even though RAM was very scarce, the CPU was even more sluggish than that.
Cache and memory has also stopped scaling as well as compute on newer process nodes
That has nearly always been the case. The SNES and earlier system were basically the last widely available systems were memory access can be done in a single CPU cycle.
Nowadays we are at a point were in programming it often is faster to the the simplest linear search over hundreds or even thousands of elements instead of using things like binary-search that "should" be much faster as they need to only do a fraction of the work.
@@ABaumstumpf a few people have pointed that out and yeah they're right. i was more thinking about caching stuff for recursive problems but optimising for memory access is probably a more common thing
Maybe the 562.5 MegaBytes/s at 5:37 should actually be Megabits/s. 562.5 Mbps = 70.3 MB/s.
Yea I am sure it was a publication error as most people forget the capitalization huge differences.
Add also usual confusion on base 2 and base 10 Mega and you'll come closer to 75... Like the 1TB or 931GB...
@@retrofraction Good thing about Wikipedia is that you can change it and be the MVP!
I think the raw datapath to the RDRAM is capable of that but due to access latency and the memory architecture of the system the practical bandwidth is much much lower.
No, the ram operates at 500MHz over a 9-bit data bus. The 9th bit is only useable by the GPU. The GPU uses the same memory over the same bus and that’s why the CPU access is relatively slow
14:57
The reason he has to make the value negative here is that Bhaskara is only accurate in the range -pi/2 to pi/2.
The calculation for flip sign will be one, if the value is outside of this range. if this is the case the value needs to be shifted and flipped as it is after the if(flipsign) and the ternary operator after the Bhaskara algo proper
Here is a graph showing the process, and also how he gets sine out of this calculation.
BUT TH-cam wont let me post it because they are dicks
He DMd me the graph:
www.desmos.com/calculator/dq8l6tj6cv
@@KazeN64 Thanks, I appreciate it, and hope someone finds it interesting.
Out of curiosity, how do you DM on TH-cam? Or did you use an outside platform
@@IronLotus15he has his socials linked
@@IronLotus15I remember when TH-cam had private messaging 😢
Nintendo has used a lookup table for trigonometry since Super Mario World, where it was used for the rotating sprites (ball and chain, chained platforms, etc)
Back in the day (especially on SNES) the CPU was the bottleneck not the cache so they needed to look it up and not calculate it. I don't even think SNES could do floating point numbers in hardware.
@@AlexRodriguez-gb9ez It could not use floating point numbers, but fixed point (usually 8.8) are very common and can even be hardware-accelerated. In Super Mario World there's a big table with sine/cosine values in this format which is then multiplied with whatever coordinate you want to transform using hardware registers.
Likewise for Sonic 1, also in 1991. I'm getting so tired of this 'Mario 64 was the first game to do X' bullshit. What's next, Kaze - Mario 64 was the first game to ever have 3D graphics?
@@Clownacy mario 64 was the first game to be called super mario 64
Man it’s always crazy how even a game that’s considered “poorly optimized” like Mario 64 is still optimized incredibly well for its time.
Now a days developers say “oh most peoples computers these days should have about 16 gigs of ram? Why waste time making it run on 8 then?” And then the game still comes out rushed and bad.
Lol mario 64 isn't well optimised at all, they didn't even turn on the automatic compiler optimisations and it regularly drops below 30fps even though the n64 should be able to handle it fine.
Modern games at least shipped with the compiler optimisations enabled
What you have to remember is that this is a console game, from the age of cartridges.
There was no such thing as "updates" or bug fixes, so studios were forced to give their devs time to get a game good enough to sell.
@@dekufiremage7808because they keep hiring morons
@@benjaminoechsli1941here were re-releases thankfully, but changing the engine is drastic for one to do.
@@augustdahlkvist3998 I’m not super familiar with this topic in particular but it’s a fact that there’s almost nothing in game development that’s as simple as “turning on an optimization”. I would bet money that the issue isn’t as simple as turning on a feature. There are some things that only look obvious and easy with 3 decades of retrospective.
Even then, most modern game studios probably couldn’t get Mario 64 to run at “drops below 30 with big objects” if I’m being honest.
SM64DS also uses an interlaced sine/cosine table, but without mirroring. I never really thought about why that is but given how cache works it makes perfect sense.
I wish that game had as much love for the technical, decomp, and romhacking crowd as SM64 does. Glad to see you posting vids for the game
Doesn't the DS work like the GBA though, where reading from ROM is pretty free?
Was SM64DS ever decompiled to produce the same assembly code as with SM64? Or why do you know it?
@@FabulousJejmaze They're both ARM machines so I would think so. But I heard the DS has a memcpy bug that the GBA did not
@@sven33r It wouldn't be exactly the same since each runs on a different CPU. N64 doesn't have variable pointer offsetting.
4:20 Small correction: The table uses 16KB (20KB with cos) of memory, because floats are 4 bytes wide.
true!i was thinking of 0x4000 and just said 4kb because im so hexbrained lmao
@@KazeN64 pin the comment with corrections or update description?
as an SNES homebrew dev, it supremely fucks me up that it's faster to *calculate* a sine than to just use a simple LUT...what the hell even is the N64
Even on modern systems accessing memory costs cycles. The speed of light is a limiting factor of electronics.
Just make the computer smaller 4head, less wiring = less distance.
Ram also oscillates, reading is not free. N64s CPU was ahead of its time - both rom and ram lacked behind. That's why
The video's great! I love seeing how things like this are done on old hardware. It seems to me like it would be hard to understand how anything would be best optimized for modern hardware with preemptive execution, and weird hardware tricks and strange security flaws - more like magic than science. Even though I don't really program much, let alone optimize anything, optimizing code for old hardware seems like it's something that can actually be learned in a concrete and non-mystical way by a human being, even if it takes effort.
modern hardware is optimized for bad code. i think optimizing code on more modern hardware is less about nitty gritty details like this and more about using best practices. there is also an issue that most modern games run on many different consoles so you dont just have one metric to maximize. a lot of these n64 optimizations would slow modern machines down.
@@KazeN64Ever since Playstations, gaming has gotten more techy to the point of modern renders being mostly power and less “magicks”. Thanks to you, we can see their bag of tricks!
@KazeN64 The more powerful the hardware, the harder it is to run anything bare metal. On modern systems, it can even be impossible, requiring you to bypass even the UEFI itself.
@@KazeN64 It's still optimized for good code, it just also handles bad code a bit less poorly. Bad code on modern hardware can still easily waste 90% of memory bandwidth, etc.
Where I think modern hardware shines is making decent, not optimized, but not wasteful code run very fast.
I have a hunch about the angle flipping necessity: tangent.
The angles in the 1st and 3rd quadrants (going ccw around a circle on the cartesian grid) have the same tangent values, the same with the 2nd and 4th quadrants.
So you have to distinguish when you want the angle from the 1st or 3rd quadrant (and the 2nd or 4th Quadrant).
MS Excel (and Google Sheets iirc) have the 2 argument arctan function to treat the angle calculation as a vector problem, but since Mario 64 doesn't have this, you have to use a from scratch angle discrimination setup, much like what Kaze ended up using.
atan2 is a godsend when it comes to not having to roll your head everywhere to figure things out. you can define it from a classic arctan though -- just some if cases for signs and stuff
Me (Person who has a literal degree in computer science, built multiple games, and over 5 years industry level experience): This guy is insane and so much is going in one ear and out the other but this is so interesting to watch
Take a computer engineering degree and learn about signal processing, stuff like this will look more familiar.
To be fair to yourself, this is not a topic most computer scientists would deal with. It is much closer to engineering
I remember from a book called "linear algebra done right" on page 115 it says the best approximation for sin(x), of fifth degree, on the interval [-pi,pi] is given by 0.987862x − 0.155271x^3 + 0.00564312x^5.
looks like a minmax approximation which has issues i've outlined in the video
@@KazeN64awesome.
In GCC for ARM, when you want to return two 32 bit integers in registers, you return a 64 bit number. To convert between two 32 bit numbers and one 64 bit number, you use a union.
Yeah that works on n64 too!
That's what I love about unions, they let you break data types when it's convenient to do so. Best part is, there is no performance penalty.
Thanks Kaze! This was the perfect video to watch while running head first into a wall!
This channel is slowly becoming the best showcase of the general outline of optimizing code on TH-cam. I love it.
When I was coding 3D dos games 25 years ago, I couldn't understand why my sin table was not that fast, it is just one line of code with no CPU computation... Cache/memory bottleneckes make a huge different, I confirm.
See I program on CPUs that didn't have a cache. So for me, I don't need to worry about it since memory access is equally slow!
@@williamdrum9899then you have to start worrying about other problems!
Yeah you'd have to go back to the '80s when we had no caches to avoid that. I used to play with this stuff in the 8-bit days until about the Amiga era.
@@Anon.G Haha yeah. At least with C compilers you can actually finish a project. Writing an entire game in assembly is a monumental task even on an 8 bit platform
@@andrewdunbar828 I love the 68000, it's so easy to work with
It's always interesting to see how constraints inspire creativity and inventiveness. I have very limited programming experience, but a fair bit of writing experience, and having a constraint (the syllable count of a haiku, a particular meter in a poem, a more arbitrary constraint like not using a certain letter at all) really pushes you to play around with the elements you've got and try to achieve a pleasant result in spite of the constraint. It strikes me as weirdly similar to working around memory and cache constraints.
Reminds me, too, of a book I read written by an ex-Atari programmer. (I think it was Chris Crawford.) He talked about the clever things they tried to do with graphics by executing code in the handful of machine cycles in between CRT raster lines.
This is a coding channel that eventually uses mario to show how programming works.
Exactly, and this is why i am here.
Coding AND MATH!!!! My favorites
I just realized how Mario 64 is not optimized at all to store a whole table of sin, it is juste pure rush.
@@amineaitsaidi5919 No, a lookup table is actually quite a good optimization. Like Kaze said, computing the actual Taylor series to a high degree of accuracy would be incredibly slow. Nintendo didn't understand just how much their memory bus design constrained everything until much much later, so at the time it made perfect sense to try to save on CPU with the table.
A reminder of the actual MIPs cpu cache size (16 KB instruction cache and an 8 KB data cache) brings a whole lot perspective on that LUT size.
"You are going to learn a lot of math today, and you are going to like it" was weirdly threatening
I love these optimization stories of old video games. I switched from application development to firmware development because I figured it would be the closest I could get in todays technology landscape. Thank you for your videos. They are such a treat ❤
You could also program industrial equipment
this is so crazily impressive, kaze must have gone insane to do all this for that little performance gains lmao
Could also just be a passionate autistic programmer like me lol
considering how well optimized a LUT already was, he knew going into it that he was never going to achieve more than a little performance gain
As a computer scientist myself. I find these type of videos incredibly interesting.
As a non-computer scientist, I wish I didn't fall asleep in high school geometry class. I find these types of videos incredibly interesting as well, but I have no idea why sine waves are needed to calculate angles in thee dimensions. What does a sign save have to do with math in this context?
@@Psythik sine waves are necessary to calculate angles in _any_ dimensions ( aside from, i suppose, 1d geometry ... mostly because it, uh, doesn't have angles, at least as we traditionally understand them ). or, more precisely, they're necessary to calculate the cartesian coordinates of a vector rotated by an angle, aka, the change in a game entity's _x_ and _y_ positions when moving _d_ game units at an angle of _a_ degrees
basically, imagine a standard 2d coordinate plane, then imagine a right triangle with a hypotenuse ( the line that _is not_ part of the right angle; it's the one that's usually drawn diagonally ) of length _d,_ extending from ( 0, 0 ). this means that one of our non-hypotenuse sides is our x-axis, while the other is aligned with the y-axis. the _a_ of sine and cosine is one of the triangle's non-right angles, and the one that's most useful for us is the one that ends up right next to ( 0, 0 ). now, the sine and cosine of _a_ is the length of the non-hypotenuse side that is opposite / adjacent to, respectively, a, divided by the length of our hypotenuse. in other words, they're the ratio between the _y_ / _x_ coordinate of the hypotenuse's second vertex and the hypotenuse's length. so, all we need is to multiply the length by those ratios and, voila! we now know that the hypotenuse extends from ( 0, 0 ) to ( _x, y_ ). which is something you may recognize as a vector, but even if you don't, all you need to know for this example is that you can add this _x_ and _y_ to the coordinates of a point to offset that point _d_ units at angle of _a_
if that seems like a whole lot of steps for a whole lot of nothin', remember that basically _every_ game maps out space using cartesian coordinates. every point is represented by an x value, a y value, and maybe a z value if you're working with three dimensions, and ... that's it. if you want to move 14 units from ( 32, 24 ) at a 45 degree angle, then you gotta calculate exactly _what_ position is 14 units from ( 32, 24 ) at a 45 degree angle. and it isn't ( 39, 31 ), what you get if you just cut 14 in half and slap it on both numbers, even if a 45 degree angle is perfectly diagonal and thus offsets equally on both axes - remember, all of a circle's points are an equal distance from its origin, and yet its diagonal points aren't ( 1, 1 )*. the actual change in _x_ and _y_ we get out of this is a bit under 10, so we actually end up at ( ~42, ~34 )
incidentally, the reason why kaze spoke as if they _aren't_ necessary in 2d games is because all of this is _wholly irrelevant_ in 2d mario games, because they're sidescrollers. so the x and y axes are just completely different forms of movement, so him walking 14 units forward is entirely contained to the x axis. but it's also true that a lot of simpler top-down games, like a lot of old zelda ones, don't bother with using this sort of math for their movement and just slap a constant speed value onto the entity's x and y coordinates. this results in diagonal movement being straight-up faster than cardinal movement, but it's usually not all that noticeable because of the game's presentation. but then you have the interesting case of the original doom games, which as you'd expect from a fps, properly vectorizes your movement so your forward speed remains consistent regardless of which angle you're facing ... except, it vectorizes your forward and strafing movement _separately,_ so you _still_ get the "diagonal movement = faster" effect, it just ends up rotated so you have to move diagonal relative to your facing angle rather than the map's origin
*well, they are when using chebyshev distance, where circles are literally just squares. yeah, there are multiple ways to define distance ( which, granted, probably _isn't_ something that got covered in your geometry class ), and it's less theoretical than you might think because ... there's just a lot of different things you can measure! like, chebyshev distance is outright useful in game design, because it represents a type of movement used in the little known game of _chess._ another common alternative to euclidean distance is taxicab distance, which you'd be familiar with if you've played any tile-based game that doesn't allow diagonal movement, like advance wars or fire emblem
youre not a computer scientist lol
@@PowerfulKundalini
I know it is hard to believe but I really am friend.
@@starfoxmoore6037Don't take it personally. A lot of people just found out over the past few years that a bunch of liars with college degrees have been masquerading as practitioners of science. If you are practicing the scientific method, it's probably best to let your work speak for itself. People simply don't trust the formal classifications at this point, and who could blame them.
I love it when trigonometry and calculus are used with gaming and coding. Math is very entertaining.
Shame they do not teach it in school as an applied math in game design.
I thought I could escape this stuff for a few months after finishing my GCSE exams... I guess not then 🤣
Kaze's videos are a joy to watch nevertheless.
It's exactly why I love the classic Sonic engine... that said, things aren't nearly as complex as they are here, obviously.
I wish they taught practical applications of higher math. I tried to make a 2d animation of dots spinning like a processing top, turns out it's a ton of trig inside of matrix algebra. I got into game dev and wanted to make a projectile bounce off a wall, more trig inside matrix algebra, vectors are everywhere in game dev. If you want your character to be affected by gravity, there's immediately calculus to handle. The whole concept of "ramp" in magic the gathering is just exponential growth. Geometry is useful all over the place in action games. It's like you can't move your character without at least distance/hitbox equations happening. So many people say "Math is miserable, you never use it, it's not worth the time" but I try to say math is important, it's a way to understand every physics process unfolding around you, and that power it extremely useful. Even if it's just so you can speak to computers and be better at basic programming, or just to be a better gamer, math is incredibly useful.
The idea to interweave the sine and cosine values to avoid cache misses is a stroke of genius! Great work!
Instead of Euler angles, which have several drawbacks such as gimbal lock and numerical precision, would it be more efficient to use another rotation representation such as quaternions? Would that also reduce the number of needed trig function calls?
Yes, I'm actually planning to convert all the animations to quaternion animations - It'll reduce it from 42 multiplications and 6 sin/cos calls to 37 multiplications. There is also an advantage in quaternions being 8 bytes per rotation instead of 6 - that sounds worse at first, but that way you can guarantee every quaternion is accessible in a single dcache access (whereas the eulers could be spread across 2 dcache lines). So yeah i'm absolutely planning on doing that.
@@KazeN64 Wow, you're eventually gonna rewrite the whole damn engine! Quaternions seem less intuitive at first, but then you see their elegance. And when you start counting mults and other ops, the benefits are clear.
@@Isaac________ They're only less intuitive if you try to visualize them as 4-dimensional arrows. They're _very_ intuitive if you look at them for what they are: mixtures of 4 basis 3D rotations: 180° around x, 180° around y, 180° around z, and 0° (around nothing). An alternative way of visualizing them is as compositions of two reflections. They're more efficient to compose than matrices, but less efficient to apply.
If you want to mix in translations, then you need something called dual-quaternions, which have 8 components, but that's still less than a 16-component homogenous matrix or 12-component affine matrix.
@@angeldude101 Yes. That's why I wrote "at first", and noted their elegance. :)
@@angeldude101 I think a lot of confusion around quaternions is caused by how we usually think about imaginary numbers as a representation of 2D space. If you try to tack on a similar interpretation of the scalar part of a quaternion as relating to a basis axis, then it doesn't make sense. The scalar part is related to an inner product and projects your vector into a plane where the rotation is a simple reflection. It's the imaginary vector part that "lives" in 3D space. It also helps to realize that the multiplication rules of quaternions are just an algebraic representation of the cross product:
j*k = i
which is similar to:
y × z = x
for basis axes x, y, z.
I am so notoriously bad at math my brain literally locks up when presented with it but you still manage to explain everything in these videos enough that my potato brain can keep up.
Which is good because I love seeing how you optimize the hell out of this game despite being a math potato.
"What if I told you there's room for improvement?"
Kade, this is YOU we're talking about.
You'll probably be able to get Super Mario 64 to run on a TI-86 Calculator if given enough time
thanks for enlightening me on the vertical and horizontal speed on 2d games kaze
NO WAY CAEC
I've always been fascinated with how computers implement trigonometry. It's more difficult than arithmetic, and there are many ways to do trig functions. Another method is the CORDIC algorithm, which involves adding smaller and smaller angles until it reaches the input. It was developed for use in 8 and 16 bit processors. The TI-83 calculator, which uses a chip similar to the gameboy's, uses this algorithm.
05:32
The number for peak bandwidth is megabits per second, that's why your number which is megabytes per second is about one eighth of their number.
It's megabytes on their website. There are other website claiming 4500mbit/sec too. I thought they could be talking about Mb/s too but unforuntately they werent
@@KazeN64 maybe they calculated it in Mbits/s and accidentely wrote it as MB/s? and then someone took the wrong MB/s from wikipedia and multiplied it by 8 again resulting in 4500?
@@t0biascze644 Definitely that lol. It's a rookie mistake, and since everyone is using Wikipedia as their reference.. Well nearly everything ends up wrong afterwards.
@@KazeN64RDRAM operates at 250 MHz over a 9-bit bus with data (possibly) transferred at every clock. That’s where the number comes from. As the GPU uses most of that and the overhead is worse, the CPU might get only 75MB/s
@@lotrbuilders5041 i've computed my 75mb/s from the graphics actually. the CPU is a lot slower than the 75mb/s. usually they share the data speed so both components together probably average around 25mb/s during normal gameplay.
Have you ever tried CORDIC algorithm?
It needs a (quarter?) Sine table but calculates everything just by adding and shifting, so basically one cycle operations and the accuracy depends on the number of repetitions you give the algorithm.
Could be too taxing on the bus but maybe give it a try? Not hard to implement though...
Used it on FPGAs to reduce computation unit requirements by a lot, since multiplier are rare on those. Golem has a great article about it (and in german :3 )!
I got a welcome ego boost watching this video by thinking about the "only use one 4th of the curve" and "interweave sine and cosine" before you mentioned them. Thanks for that, I love these optimization tricks videos
A tip about evaluating polynomials -- start with the highest coefficient, multiply by x, add the next coefficient, multiply by x, etc. Called Horner's method, this greatly reduces floating point error as the terms tend to be of similar scale. For all-even or all-odd series, also prepare x^2 beforehand, to evaluate it as a polynomial in x^2 instead; for odd, factor out the common x, then multiply by it in a final step.
About Bhaskara's formula: it's not so mysterious, just a rational approximation. A rational polynomial is a function f(x) = P(x) / Q(x), P and Q being regular polynomials. Generally speaking, the accuracy is equivalent to a n+m order polynomial, where n and m are the order of P and Q respectively. Evidently, this would be advantageous for modest order polynomials (maybe 7th+?) where the division can save some multiplications. Rationals are particularly well suited for peaky functions, where 1/x or 1/(x^2 + a^2) sorts of forms would otherwise be very challenging for a polynomial fit. For smooth functions like sin/cos, it's not too important; and indeed, your 4th/5th order forms are in the same order of magnitude accuracy, as expected.
There's also a method not listed here, the hybrid between look-up and curve-fitting: you could have a table of not just values, but coefficients as well. A much smaller table (say 16 or 64 elements) could contain, say, the offset, slope and curvature, which are simply dumped into a quadratic polynomial, and out drops the spline result for that segment. Note this is less intensive than a linear (let alone quadratic or cubic) interpolation, as the coefficients don't need to be computed from adjacent points (table values); for example linear interpolation has to calculate the slope between adjacent points (evil division!), but we can just tabulate that instead. It's not obvious if this would perform better overall than the final choices in the video, but considering how comprehensive the video is, I suppose it bears mentioning.
The mention of returning complex pairs has some very interesting outcomes, if taken to its logical conclusion. Ultimately all we're doing is coming up with a vector [x, y] = [cos theta, sin theta], and most of the time we're using both values close by so it makes sense to return both pairs directly. _We could avoid angles entirely and only pass around these vectors!_ This is probably such an immense pain that it's not worth it (every single object which takes an angle parameter has to be refactored to take a normal vector instead!), but, interesting just to imagine how it would all work. Note that, where angles can be added or subtracted, angle vectors must be multiplied or divided, using the complex vector formulas (a + bi)(c+di) = ac - bd + i(ad + bc) and (a + bi) / (c + di) = ( ac + bd + i(bc - ad) ) / (c^2 + d^2). (The division is avoided as long as c^2 + d^2 = 1, which is true for normals.) Fractional fixed-point can be used to save on FLOPs; probably short pairs, so, doubling the memory requirements of stored angles. The magnitudes will eventually drift from enough accumulated operations, and using first-principles values (like, oh I don't know, calculating sin and cos? :^) ) from time to time, or entirely in the first place (imagine that!) is probably helpful; or normalizing (divide by sqrt(a^2 + b^2) -- yuck, what a floating-point mess! -- but a simple iterated approximation can be used (or the infamous inverse square root..!) since it just has to move the value closer to normal, not set it exactly so.
Tremendous amount of work, by the way! Numerical approximation, and implementation, can be quite a deep rabbit hole. As, uh, I think you've discovered, hah! :D
Yeah dont worry about the first point - thats how my code does it. Ffast math automatically multiplies xsquared first
+. Excellent comment!
I'm not sure why a Taylor series approximation would be too slow. There are no divisions and you can compute x^5 in three multiply instructions. The whole thing would probably take about 8 multiplies and 3 adds for an 8th order accurate approximation. Check out the musl or glibc libm implementations.
Also, -ffast-math can cause incorrect computations since it can break IEEE compliance. Do you see the same incorrect stretching when you build without it?
Really interesting. Easy to understand the x y speed part from the beginning, but sadly, I’m having a hard time with the use of sin/cos in 3d bone movement and wishing I understood why it’s necessary from the video.
"I never expected anything I've learned in school to actually be a useful life skill"
Apparently hacking out a polynomial with a low-error approximation of sine for SM64 optimization in 2023 is a useful life skill lol
It is, it's called fun. You ever had it in your life?
@@MrMeow-dk2tx
Yessir
It is when making Mario 64 romhacks is one of your bodily functions.
His brain has a built in RISC co processor for just rom hacking
math is useful for physics and statistics if you want to consider "real world applications"
I'm sure you already know this, but someone did a response to this vid where they brute-forced the best sine/4th order cosine with the least amount of error using a SAT solver. TH-cam will eat my link, but the post is "Finding the BESTEST sine function for Nintendo 64" by Steinar H. Gunderson.
@2:12 I was today years old when i learned you can get on the castle. I never even considered it before. The vid was cool but thanks for that
13:47 This is the best b-roll I've ever seen
I can't wait till this game is so ridiculously optimized that I can run this on my ti-84
The lookup table approach is used in the classic Yamaha OPL2 and OPL3 FM synthesis chips. They contain a 256-entry table of the first quarter in ROM. However, it's not a straight-up sine function, but a log of a sine function. They also contain an exponential table ROM as well, needed for converting the log-transformed values to linear ones.
I love that Kaze's romhacking has evolved into the most insane solutions to code optimization problems
Very clever optimization to get the cosine to cache with the sine. Loved the deep dive. You definitely did look cool doing it! Great video
5:35 The wiki may be using Mbit/s, as dividing the value by 8 gets pretty close to the real bandwidth.
Then it should read Mb instead of MB though
@@sinom I wouldn't be surprised if some marketing guy decided to change to "MB" because it "looked neater" without understanding the technical implications of it
@@sinomGenerally memory transfer speed is measured in bits/second, not bytes/second. It's not impossible that in a community driven wiki someone made a mistake somewhere along the chain of copying or stating information. And the b/B difference is a newer thing - for quite a while it was common in data/tech sheets to have B just mean both, requiring a call to the vendor or knowledge of how transfer speed is measured at this particular company.
But yes, it should be b for bits, or Mbit/s to be explicit.
1:35 As a programmer, I have noticed that we don't actually have any proof that this is the best way to represent 3d rotations, while it is without a doubt always going to return a value that is on a sphere, we don't know that it's the most efficient, or the best way to do it.
I'm pretty sure it isn't, because it requires 3 separate rotations, meaning 3 separate complex exponentials, and since the axes are relative to each other, it's possible to lose a degree of freedom if two of them align. Quaternions are much more likely to be the best way to calculate rotations since they just require a single complex exponential and an arbitrary axis on top of being as efficient to store as Euler Angles (3 components). Actually applying the rotations is a different story since it's actually more efficient to do that with matrices than quaternions, but matrices are less efficient to store and compose, so you have to balance the tradeoffs.
@@angeldude101 I think probably the optimal system would be with only two variables, in other words it could be attained by storing a 2d-value. While I know for certain that we already have systems like that, I also don't know that those systems are the best ones. I think likely it would be a lot more complicated than what we have, in a similar way to how 2d rotation is stored as a 1d variable compared to 1d rotation being a Boolean/0d variable.
@@flameofthephoenix8395 In 2D, there's only 1 plane within which you can rotate, so you only need 1 component, and from that you can derive a mixture of of rotation and not-rotation (2 components). In 3D, there are 3 orthogonal planes within which you can rotate, so you need 3 components to define the axis (which only actually requires 2 coordinates) and the angle.
More generally, N-dimensional space has N choose 2 rotational degrees of freedom. (1 choose 2) = 0, so 1D rotations have 0 components; (2 choose 2) = 1, so 2D rotations have 1 component; (3 choose 2) = 3, so 3D rotations have 3 components; (4 choose 2) = 6, so 4D rotations have 6 components. That last case actually has something new happen when converting from a axis-angle to a mixture of rotations, as you can now get an 8th component that appears when the axis is actually 2 completely orthogonal axes and can't reduce to just 1.
Great video! I remember using overlapping sin/cos tables for the BBC Micro back in the day. And I think I recognise the UI you used for comparing the C code compilation results! 😊
the man himself
Matt Godbolt watches N64 optimizations nice
Holy fuck, didnt expected to see true OG on here lmao.
thank you for Mario Teaches Math, Kaze 64 edition. your lesson was a lot of fun!!!
My brain took so long to recognize “sixty-five thousand five hundred and thirty-six” as 65536
Hey Kaze, this was another solid deep-dive. If I could make a suggestion: I found the additional audio of the speedrunning/glitching balanced a little too loud. Loud, glitched "wa-wa-wa-wa-wahoo"s into a wall + music + mathematics VO is a lot to listen to all at once and this is the only time I can recall a video of yours feeling this loud and overstimulating. I'd like to point to 16:45 as an example, so see if others feel the same.
Otherwise, great topic! Love the feeling of progression to the algorithm
Yeah thats fair! i'll make sure to turn the audio down a bit next video.
Thanks, I appreciate it 👍
11:52 yea thats exactly what I was thinking
5:29 It actually comes from : 250 MHz clock speed (RDRAM double data rate) = 500MT/s
9 Bits bus * 500 MT/s = 4.5 Gbps = ~562.5 MB/s
Unfortunately that's in theory a lot goes on in between, latency, memory controllers, data computing speed etc...
that makes sense!
Thank you for making videos like these explaining mathematical concepts related to Mario 64. It is informative and appreciated, and encouraging learning
Hello! Floating-Point/Computer Maths engineer here.
I've developed trig functions for quite a lot of architectures/my life at this point, maybe I can give some suggestions.
In our world, the minimax approximation is mostly king. I say mostly because mathematically it's usually the provably optimal polynomial, however for us there's the usual floating-point issues you get trying to apply maths onto computers.
But if there's an issue with a minimax approximation you usually look inwards before looking outwards (Especially if your sin/cos ends up being greater than 1)
If you have problems around 0, you should probably be optimizing for relative error, as opposed to absolute error which is what it seems you've done here (I might be wrong). In the business we tend to use an error metric called the ULP error (units in last place), which roughly measures how many bits at the end of your floating point answer are incorrect.
(It's very close to a relative error measure, bit not quite the same. Optimising for relative error usually works well)
Because of how floating-points scale, this means that you need tiny errors for tiny floats, and can have much larger errors for larger floats. Using the absolute error as your metric allows vales near 0 to blow up to however large you want in a relative error sense.
Without an exact application in mind this ULP-metric is the one people like myself would use to write eg maths libraries.
Another issue is that if you take the coefficients of an optimal real-valued minimax polynomial and convert them straight to floats you lose an pile of accuracy, unfortunately for computer scientists everywhere. Probably a lot more than you might think, when compounded with floating-point evaluations.
We have some tools developed to try get around a lot of these limitations, a good starting point is one called sollya. Here you can ask for relative/absolute error, and give all kinds of constraints to your problem. Importantly you can restrict the output coefficients to be floats, and it will try optimise around this, so you won't lose large amounts of accuracy converting reals coefficients.
For example, on the interval [0 -> pi/4], for sin() it gives a cubic polynomial of:
x * (0.9995915889739990234375f + x^2 * (-0.16153512895107269287109375f))
Or by forcing the first coefficient to be 1 (generally sollya knows best, but sometimes storing a value of 1 is free on hardware, and it also might make really tiny values more accurate at the expense of larger ones):
x * (1.0f + x^2 * (-0.16242791712284088134765625f))
And similarly the next higher polynomial gives:
x * (0.999998509883880615234375f + x^2 * (-0.16662393510341644287109375f + x^2 * 8.150202222168445587158203125e-3f))
Or with forced 1:
x * (1.0f + x^2 * (-0.166633903980255126953125f + x^2 * 8.16328264772891998291015625e-3f))
Note that sollya does not produce the absolute best polynomial possible, but it's the best you'll get with not too much work, and the results it does give are usually very decent.
Otherse have suggested sollya as well! The problem with the polynomials you've suggested though is that you are adding additional coefficients. I've purposely chosen only odd and only even polynomials because they give you a lot of accuracy with less coefficients. instead of a 3rd order sine, an odd 4th order cosine will always be cheaper and more accurate for example.
considering the constraints i gave my polynomials, i also think it's entirely impossible to come up with a different one. we have used up all degrees of freedom after all.
although someone suggested easing up on some of the conditions for the higher order polynomials and it did end up improving them.
Replying so i can remember this golden comment via comment history. Thanks for this comment, i also learn a lot from this..
ocarina of sine
Considering that music is just sound waves of varying frequency, and sound waves are waves like any others, _yes._
Funny I actually worked on a project that used the same sine approximation you came up with for the fifth order, but for the third equation (for the last coefficient) we calculated it by minimising the distance between the function and the ideal sine, to do so we ended up equating both integrals between 0 and pi/2 and deduced the result.
We had in the end : f(x) = a*(pi/2)*x + b*((pi/2)*x)^3 + c*((pi/2)*x)^5 with a = 4*(3/pi - 9/16) , b = 5/2 - 2*a , c = a - 3/2.
I haven't checked if it's better or worse than your function that you found but when we tried a strictly derivative approach (What you did kinda) it ended up doing worse for our purpose.
edit: didn't see the new video my bad
yeah that's more accurate. this video was made before i even knew polynomial solvers were a thing and i just kinda went with the approach i knew worked somewhat.
20:10 while these bugged bones are not desireable normally, I think they could make for a pretty funny mechanic for another romhack... maybe a power-up?
13:40 I’ve heard that division is shockingly costly for computation, but it being on par with square roots is insane
Ah yes, one step closer to optimizing the N64's boot time. Wonderful video.
9:56 "not sure if anyone else thought of this", it was used in Crash Team Racing (1999) by Naughty Dog on PlayStation 1!
I dont know how im supposed to watch this, but I probably did watch it like a Caveman being learned about basic electricity.
Exactly this, i find it really interesting but i literally burst out laughing because i realise i have no idea what it all means.
well, at least I am glad I am not the only caveman around
The words of a true computer scientist. "You are going to learn a lot of math to day, and you will like it".
For generating the minimax coefficients you can use a command-line-tool called Sollya. It's neither very well documented nor does it give optimal results, but it's pretty close.
Did you consider the CORDIC algorithm? It's (to my understanding) really efficient on low end hardware, so it might be worth looking into
Cordic is going to be more accurate, but slower.
If you split it again into two separate approximations one from 0 to pi/8 and a different one from pi/8 to pi/4, I think you can probably get a pretty accurate quadratic approximation of sine. I didn't work out the details so it might end up being slower in the long run or have other problems.
it'd be cheaper and more accurate to just use the cosine of one higher degree to get around having to do float comparisons at that point
@@KazeN64 I have seen this method used in a different context to good effect, which is why I suggested it, but every processor is different, and I don't really know this one nearly as well as you do obviously. So fair enough.
@@KazeN64 And by the way, you can get better polynomial interpolations sometimes by choosing different nodes.
I tried out a fifth order cosine with f(x) and f'(x) equal to those of cosine at x=0, pi/4, pi/2 and it gives an order of magnitude better approximation than the fifth order sine you showed. and it has the property also that it is bounded from above by the cosine function on the interval 0 to pi/2
@@timseguine2 wait why a 5th order cosine? odd orders should always be sines and even orders should always be cosines. otherwise you end up having too many coefficients. can you show the function?
@@KazeN64 The way the coefficients disappear depends on which x values are used for evaluating the interpolation. If you split them up like I suggested it works out to:
1 + ((1024 sqrt(2) + 256) π + 4096 sqrt(2) - 11776)x^2/(128π^(2)) + ((-8192 sqrt(2) - 2560) π - 16384 sqrt(2) + 67584)x^3/(128π^(3)) + ((20480 sqrt(2) + 8192) π + 16384 sqrt(2) - 139264)x^4/(128π^(4)) + ((-128 sqrt(2) - 64) π + 768)x^5/(π^(5))
and the maximum error on 0 to pi/2 is about 0.00004
The thing I said about the bound was backwards, but it is never larger than 1.
There are other alternative ways to interpolate it depending on what you are going for. A decent quartic one could work for example by dropping one of the conditions at x=pi/4, but I didn't try that
watching your videos makes me feel smarter. i have very little expectation of ever utilizing these lessons in my waking life, but i do love learning from you. thanks man.
It's so wild to me that actually trying to approximate sine can be more efficient than the LUT
That new Mario model is straight up perfection.
It's crazy to think that Kaze might just be one of the greatest programmers of our generation... And he uses his insane talents to make Mario 64 mods O_o
Great video.
In a past life, I wrote display device drivers for the GEM Desktop VDI interface, mainly on 286 and 386 systems. In that interface, they do use some angle parameters, mainly for circular arcs. Angles are specified in 0.1 degree increments. The supplied driver code used sine lookup tables and was based on octants, so 0.0 - 45.0 degrees. There was some front end code that would reduce angles into one of 8 octants, even for angles over 360 degrees.
I started out using this code but found some strange behavior every once in a while the display update would pause for a noticeable delay. I ended up writing a test program that would load the driver and be able to send it a range of parameters for all the graphics primitive functions. This way I could load my program (and the driver) with a debugger (displaying on a separate system via serial port).
I finally narrowed down the problem in the arc drawing code. While the comment in the code (was originally written in a Fortran variant, then the assembler output from that was used) said reduce the angle to 0..7 octant, it was actually doing 1..8 octant. This was in a loop where the code would take the angle, check for > 360 and if larger, subtract 360 and repeat. Then it would check for octant 0..7, but if the angle reduced to 8, it would hit the outer loop and keep going until it finally would get into the 0..7 range.
First, I fixed the assembler code to do the proper 0..7 calculation and it worked fine. However, there was over 900 bytes of lookup table (and this was in the 16 bit / 640KB days), so that was a huge chunk of memory just sitting there in case a rarely used arc call was made. I was also in the process of converting almost all this assembler code to C and realized this whole mess could be replaced with a call to the sin() library function. Since we were selling high end graphics cards, almost every customer of ours had a 287 or 387 math co-processor in their system, so this was a fairly low overhead call. Doing this trimmed about 1KB from the driver memory footprint and it ran smoother without that annoying pause every time it got a strange angle parameter.
In before he switch to mighty quaternion for cache locality
i am 100% planning to do that for animations
Really interesting math!
This comment is long, just so you know.
I tried to find a 4th order approximation satisfying:
f(0) = 0
f_prime(0) = 1
f(pi/2) = 1
f_prime(pi/2) = 0
The only one is ax^4 + bx^3 + x,
where
a = (16 / pi^4) * (pi - 3)
b = (16 / pi^3) * (2 - 0.75pi)
The maximum error is ~0.0020673.
As for cos(x), another method is to take the derivative of the polynomial for sin(x). Although there will be a greater max error, it will be faster, as the polynomial is of lesser degree and it avoids using a sqrt. If we use the 4th degree polynomial I found, the maximum error is ~0.005. Also, because of the constraints on ~sin(x), this ~cos(x) will always be between 0 and 1 (in the range 0 .. pi/2). As for smoothness, the second derivative f''(0) = 0 correctly and f''(pi/2) = -1.0437 instead of the perfect -1.
I also tried to find a 3rd minimax approximation, forcing the following:
f(0) = 0
f(-pi/2) = -1
f(pi/2) = 1
The function I ended up with was approximately 0.9824x - 0.14x^3, more specifically:
0.9823962677074202x - 0.14013793513412705x^3
It has a max error of 0.005745, which is slightly better than your function.
non-pro tip: Those 'error bumps' at 19:35 should have the same maximum height. Meaning, in the minimax approximation, you sacrifice some error in the shorter bump in order to lower the tallest bump, eventually resulting in the minimax approximation
However, it faces the same issue as yours. That is, the function may be > 1. The maximum is at around 1.0012. Forcing the cubic to not exceed 1 (by forcing f_prime(pi/2) = 0) results in easyaspi's approximation.
I can't believe I was so engaged by this video for 26 minutes. Gotta be the best ad read I've ever seen, too, I spent 15 of those minutes "how could I possibly learn all this?"
I am old school so I really liked these optimizations considering that to day most programmers just make more and more bloated codes.
I appreciate that you were passionate enough to go this deep into this game
Would it be faster to compute cos(x) as sin(x + pi/2), instead of using sqrt to get both? I'm assuming the compiler could simply emit the "offsetting" as a prepended prologue to the sin(x) function.
But maybe you never need only sin or only cos, so that wouldn't be worth it?
yeah if i only needed one, that'd be faster. but i always need both.
@4:26 - If you're storing 4096 sine values as 32-bit floats, it's not 4 KiB, it's 16 KiB. In the case of the Zelda64 method, you're looking at 4 KiB. I highly doubt that you're using 8-bit chars to store angles, because you'd be looking at a precision of only 1.4 degrees.
yeah i just misspoke. im so used to seeing it as 0x4000 that i just read the 4 out lol
Who needs math class when you have Super Mario 64?
2:01 A transcendental number is _NOT_ a number that does not repeat. √2 is a number that does not repeat, but it’s not transcendental either. The opposite of _transcendental_ is _algebraic._ An algebraic number is one that is the solution to a polynomial with integer coefficients, i.e. something like x⁵−x−1 = 0 is well-known to have a solution, which is algebraic by definition and its value is approximately 1.1673, but it cannot be described in a closed form with integers and basic operators including roots (square root, cube root, etc.).
yeah that's true. honestly chatgpt wrote most of the script and i just missed this mistake.
@@KazeN64 😂
I think Wikipedia meant megabits, not megabytes. 8x difference
The fact that this is mostly review for me from my Numerical Analysis class!
It just blows my mind seeing this all working! I mean, I’ve worked with this many long nights studying, but still, it amazes me!
I was really hoping sin(x)=x approximation was going to make a guest appearance here
I think this guy just turbo charged Super Mario 64
You should make that big fist Mario 12:23 a power up in the game
Use the angle addition formulas sin(x+y) = sin(x)cos(y) + cos(x)sin(y) and cos(x+y) = cos(x)cos(y) - sin(x)sin(y). If you need to compute sin(z), compute sin(z) = sin(x+y) Use a very small lookup table with only a handful of entries (32? I dunno. Not many.) to compute sin(x) and cos(x). (with your interleave trick (heckin' neat btw) that's only one RAM hit on the rare occasions it's not already in data cache) calculate y = z - x. (y should be very close to 0) Then 'calculate' sin(y) = y and cos(y) = 1. then sin(z) = sin(x) + y cos(x) and cos(z) = cos(x) - y sin (x).
If you're using radians that's 5 arithmetic instructions, 6 if you're using not-radians. You need to one cache line twice for the lookup for the sin(x)/cos(x) values, but the table should be *very* small and it's reasonably likely to always be in cache.
with y very close to 0, x could grow as big as being very close to pi/2. sin(pi/2) = sin(0) + pi/2 cos(pi/2) = 0 is not an accurate enough approximation.
This was a fun watch. That makes me wonder how much silicon in modern gaming apus might be dedicated to the trig portion of the ALUs
Probably enough to where your average programmer doesn't need to think about this stuff
Me at the start of the video: "Yay I'm about to learn stuff!"
My brain after watching this: 😵💫
Super interesting watch, was glued to my seat from start to finish!
you are nothing short of a genius
The Mario Kart music in the background 🔥Nintendo has so many good jams!
I brought up the idea of vector extensions in one of your live streams for returning two (or 4) floats from a function, I gave it a try just now and whilst it is better than returning a struct on the stack (4 instructions), like the struct technique it still avoids the hardware float registers...
I'm surprised GCC fails here. On other modern architectures the vector extensions produce the same code as the complex number extension
I suppose MIPS hasn't received much attention for this optimisation (similar story for ARM32 with hardware float)
! if you do need to ever return two (or 4) integers from a function without the horrid struct-in-stack behaviour then use the gcc vector extension
And now that I tried modern GCC with the struct-on-stack technique, it actually behaves the same as the vector extension, so 4 instructions and stack is avoided (pass-by-register)
Very few ABIs behave like this for structs, so I think modern GCC has made a MIPS specific optimisation for this case
huh it never did that for me - i'm not sure how to use vector extensions but if you could show some sample code that would be pretty useful to have a look at. that type of optimization might apply to a lot more stuff in the repo.
Since he's at the insane optimization level he is now and has shown that the memory bandwidth of the N64 is pretty low, using vector registers as essentially stack space might give a huge boost to performance.
Might be worth doing a short follow up video with some of the suggestions from the comments
there has been no improvement suggested so far, although getting the best coefficients from a LUT was an interesting idea.
Finding the BEST dry bones function for dry bones 64
Rewatching this now. It has to be one of my favourite videos! Everything is explained well.
11:20 kerning moment
That interwoven table is pure genius.
You are to N64 modding what Albert Einstein was to physics and cosmology.
You should be able to calculate cosine for an exact phase (track phase instead of accumulating the cosine value) using a Horner polynomial. As you know, it's an even function, so you only need half the multiplications. The 6th-order cosine approximation requires 4 multiplications (one is x*x which is then used for 3 multiplications by x^2) and 3 additions. Using fused multiply-add this is one multiply and 3 FMA, a total of 4 instructions to perform the actual calculation. If you normalize the phase to [0,1) for [0,τ) this then requires:
- Load the current phase (x) into a register
- Depending on the bits for 1/2 and 1/4, decide whether you are going forwards (use bits below 1/2 and 1/4) or backwards (subtract those bits from 0.25) and whether you will use a negative result (caveat: the cosine polynomial approximation normalized to [0,1) is technically domain [0,0.25], but the result is identical if using the range [-0.25,0.25], so it is also possible to subtract 0.25 from the phase and then subtract the result of the polynomial from 0 if the phase is greater than 0.5)
- Perform subtraction if so needed
- 4 instruction calculation
- Subtract result from 0 if in the negative half of sine function
- Update the phase (you should be adding a normalized number, e.g. 440Hz f0 with updates at 60Hz means adding 440/60, not 2*pi*440/60)
- Store the new phase
That should get you under 20 instructions, maybe under 15, with 3 memory accesses (load current phase, load the amount to step the phase, store the new phase)
Cosine polynomial reference: gist.github.com/publik-void/067f7f2fef32dbe5c27d6e215f824c91#cos-rel-error-minimized-degree-6