To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant’s annual premium subscription.
Using ancient Indian mathematical formulas to make the funny red man go bing bing wahoo even faster, even after going like five massive optimizations that give wild speedups already Classic Kaze.
@@williamdrum9899 I wonder if this is really true. Float has 7 decimal positions, if I use an integer and treat the first 7 positions as decimals, would that make my code much faster and more efficient?
On the ZX Spectrum, a LUT was probably the right solution. The CPU was slow and the memory was fast. Though there wasn't very much of it, so you'd definitely want the quarter-table trick.
@@keiyakins it does make me want to pull the hardware out and compare the difference. All though I feel this implementation is specific to the way the N64 fetches data
My favorite Z80 trick is aligning a table to &xx00 so instead of loading the address into HL you load the address high byte into H and the desired index into L. Easily triples the speed of any lookup operation
The situation hasn't changed much today, actually; more in terms of scale, and levels of abstraction/recursion -- simple operations like these complete almost instantaneously in CPU time, and the rest of the time is spent ferrying the inputs and outputs through multiple levels of caches. Which may extend not just to system RAM or HDD, but over the network as well (which might in turn be cached across numerous servers, and so on!). Accounting for Rambus access is relatively painless in comparison! 😅
Actually there is a long trend this way, CPUs have accelerated way more than RAM. Antique platforms would pre-compute even flipped sprites, because even though RAM was very scarce, the CPU was even more sluggish than that.
That has nearly always been the case. The SNES and earlier system were basically the last widely available systems were memory access can be done in a single CPU cycle. Nowadays we are at a point were in programming it often is faster to the the simplest linear search over hundreds or even thousands of elements instead of using things like binary-search that "should" be much faster as they need to only do a fraction of the work.
@@ABaumstumpf a few people have pointed that out and yeah they're right. i was more thinking about caching stuff for recursive problems but optimising for memory access is probably a more common thing
I think the raw datapath to the RDRAM is capable of that but due to access latency and the memory architecture of the system the practical bandwidth is much much lower.
No, the ram operates at 500MHz over a 9-bit data bus. The 9th bit is only useable by the GPU. The GPU uses the same memory over the same bus and that’s why the CPU access is relatively slow
14:57 The reason he has to make the value negative here is that Bhaskara is only accurate in the range -pi/2 to pi/2. The calculation for flip sign will be one, if the value is outside of this range. if this is the case the value needs to be shifted and flipped as it is after the if(flipsign) and the ternary operator after the Bhaskara algo proper Here is a graph showing the process, and also how he gets sine out of this calculation. BUT RUclips wont let me post it because they are dicks
Nintendo has used a lookup table for trigonometry since Super Mario World, where it was used for the rotating sprites (ball and chain, chained platforms, etc)
Back in the day (especially on SNES) the CPU was the bottleneck not the cache so they needed to look it up and not calculate it. I don't even think SNES could do floating point numbers in hardware.
@@AlexRodriguez-gb9ez It could not use floating point numbers, but fixed point (usually 8.8) are very common and can even be hardware-accelerated. In Super Mario World there's a big table with sine/cosine values in this format which is then multiplied with whatever coordinate you want to transform using hardware registers.
Likewise for Sonic 1, also in 1991. I'm getting so tired of this 'Mario 64 was the first game to do X' bullshit. What's next, Kaze - Mario 64 was the first game to ever have 3D graphics?
SM64DS also uses an interlaced sine/cosine table, but without mirroring. I never really thought about why that is but given how cache works it makes perfect sense.
Man it’s always crazy how even a game that’s considered “poorly optimized” like Mario 64 is still optimized incredibly well for its time. Now a days developers say “oh most peoples computers these days should have about 16 gigs of ram? Why waste time making it run on 8 then?” And then the game still comes out rushed and bad.
Lol mario 64 isn't well optimised at all, they didn't even turn on the automatic compiler optimisations and it regularly drops below 30fps even though the n64 should be able to handle it fine. Modern games at least shipped with the compiler optimisations enabled
What you have to remember is that this is a console game, from the age of cartridges. There was no such thing as "updates" or bug fixes, so studios were forced to give their devs time to get a game good enough to sell.
@@augustdahlkvist3998 I’m not super familiar with this topic in particular but it’s a fact that there’s almost nothing in game development that’s as simple as “turning on an optimization”. I would bet money that the issue isn’t as simple as turning on a feature. There are some things that only look obvious and easy with 3 decades of retrospective. Even then, most modern game studios probably couldn’t get Mario 64 to run at “drops below 30 with big objects” if I’m being honest.
Me (Person who has a literal degree in computer science, built multiple games, and over 5 years industry level experience): This guy is insane and so much is going in one ear and out the other but this is so interesting to watch
In GCC for ARM, when you want to return two 32 bit integers in registers, you return a 64 bit number. To convert between two 32 bit numbers and one 64 bit number, you use a union.
When I was coding 3D dos games 25 years ago, I couldn't understand why my sin table was not that fast, it is just one line of code with no CPU computation... Cache/memory bottleneckes make a huge different, I confirm.
Yeah you'd have to go back to the '80s when we had no caches to avoid that. I used to play with this stuff in the 8-bit days until about the Amiga era.
@@Anon.G Haha yeah. At least with C compilers you can actually finish a project. Writing an entire game in assembly is a monumental task even on an 8 bit platform
I remember from a book called "linear algebra done right" on page 115 it says the best approximation for sin(x), of fifth degree, on the interval [-pi,pi] is given by 0.987862x − 0.155271x^3 + 0.00564312x^5.
I have a hunch about the angle flipping necessity: tangent. The angles in the 1st and 3rd quadrants (going ccw around a circle on the cartesian grid) have the same tangent values, the same with the 2nd and 4th quadrants. So you have to distinguish when you want the angle from the 1st or 3rd quadrant (and the 2nd or 4th Quadrant). MS Excel (and Google Sheets iirc) have the 2 argument arctan function to treat the angle calculation as a vector problem, but since Mario 64 doesn't have this, you have to use a from scratch angle discrimination setup, much like what Kaze ended up using.
atan2 is a godsend when it comes to not having to roll your head everywhere to figure things out. you can define it from a classic arctan though -- just some if cases for signs and stuff
The video's great! I love seeing how things like this are done on old hardware. It seems to me like it would be hard to understand how anything would be best optimized for modern hardware with preemptive execution, and weird hardware tricks and strange security flaws - more like magic than science. Even though I don't really program much, let alone optimize anything, optimizing code for old hardware seems like it's something that can actually be learned in a concrete and non-mystical way by a human being, even if it takes effort.
modern hardware is optimized for bad code. i think optimizing code on more modern hardware is less about nitty gritty details like this and more about using best practices. there is also an issue that most modern games run on many different consoles so you dont just have one metric to maximize. a lot of these n64 optimizations would slow modern machines down.
@@KazeN64Ever since Playstations, gaming has gotten more techy to the point of modern renders being mostly power and less “magicks”. Thanks to you, we can see their bag of tricks!
@KazeN64 The more powerful the hardware, the harder it is to run anything bare metal. On modern systems, it can even be impossible, requiring you to bypass even the UEFI itself.
@@KazeN64 It's still optimized for good code, it just also handles bad code a bit less poorly. Bad code on modern hardware can still easily waste 90% of memory bandwidth, etc. Where I think modern hardware shines is making decent, not optimized, but not wasteful code run very fast.
I love these optimization stories of old video games. I switched from application development to firmware development because I figured it would be the closest I could get in todays technology landscape. Thank you for your videos. They are such a treat ❤
It's always interesting to see how constraints inspire creativity and inventiveness. I have very limited programming experience, but a fair bit of writing experience, and having a constraint (the syllable count of a haiku, a particular meter in a poem, a more arbitrary constraint like not using a certain letter at all) really pushes you to play around with the elements you've got and try to achieve a pleasant result in spite of the constraint. It strikes me as weirdly similar to working around memory and cache constraints. Reminds me, too, of a book I read written by an ex-Atari programmer. (I think it was Chris Crawford.) He talked about the clever things they tried to do with graphics by executing code in the handful of machine cycles in between CRT raster lines.
I thought I could escape this stuff for a few months after finishing my GCSE exams... I guess not then 🤣 Kaze's videos are a joy to watch nevertheless.
I wish they taught practical applications of higher math. I tried to make a 2d animation of dots spinning like a processing top, turns out it's a ton of trig inside of matrix algebra. I got into game dev and wanted to make a projectile bounce off a wall, more trig inside matrix algebra, vectors are everywhere in game dev. If you want your character to be affected by gravity, there's immediately calculus to handle. The whole concept of "ramp" in magic the gathering is just exponential growth. Geometry is useful all over the place in action games. It's like you can't move your character without at least distance/hitbox equations happening. So many people say "Math is miserable, you never use it, it's not worth the time" but I try to say math is important, it's a way to understand every physics process unfolding around you, and that power it extremely useful. Even if it's just so you can speak to computers and be better at basic programming, or just to be a better gamer, math is incredibly useful.
I've always been fascinated with how computers implement trigonometry. It's more difficult than arithmetic, and there are many ways to do trig functions. Another method is the CORDIC algorithm, which involves adding smaller and smaller angles until it reaches the input. It was developed for use in 8 and 16 bit processors. The TI-83 calculator, which uses a chip similar to the gameboy's, uses this algorithm.
@@amineaitsaidi5919 No, a lookup table is actually quite a good optimization. Like Kaze said, computing the actual Taylor series to a high degree of accuracy would be incredibly slow. Nintendo didn't understand just how much their memory bus design constrained everything until much much later, so at the time it made perfect sense to try to save on CPU with the table.
Have you ever tried CORDIC algorithm? It needs a (quarter?) Sine table but calculates everything just by adding and shifting, so basically one cycle operations and the accuracy depends on the number of repetitions you give the algorithm. Could be too taxing on the bus but maybe give it a try? Not hard to implement though... Used it on FPGAs to reduce computation unit requirements by a lot, since multiplier are rare on those. Golem has a great article about it (and in german :3 )!
I got a welcome ego boost watching this video by thinking about the "only use one 4th of the curve" and "interweave sine and cosine" before you mentioned them. Thanks for that, I love these optimization tricks videos
As a non-computer scientist, I wish I didn't fall asleep in high school geometry class. I find these types of videos incredibly interesting as well, but I have no idea why sine waves are needed to calculate angles in thee dimensions. What does a sign save have to do with math in this context?
@@Psythik sine waves are necessary to calculate angles in _any_ dimensions ( aside from, i suppose, 1d geometry ... mostly because it, uh, doesn't have angles, at least as we traditionally understand them ). or, more precisely, they're necessary to calculate the cartesian coordinates of a vector rotated by an angle, aka, the change in a game entity's _x_ and _y_ positions when moving _d_ game units at an angle of _a_ degrees basically, imagine a standard 2d coordinate plane, then imagine a right triangle with a hypotenuse ( the line that _is not_ part of the right angle; it's the one that's usually drawn diagonally ) of length _d,_ extending from ( 0, 0 ). this means that one of our non-hypotenuse sides is our x-axis, while the other is aligned with the y-axis. the _a_ of sine and cosine is one of the triangle's non-right angles, and the one that's most useful for us is the one that ends up right next to ( 0, 0 ). now, the sine and cosine of _a_ is the length of the non-hypotenuse side that is opposite / adjacent to, respectively, a, divided by the length of our hypotenuse. in other words, they're the ratio between the _y_ / _x_ coordinate of the hypotenuse's second vertex and the hypotenuse's length. so, all we need is to multiply the length by those ratios and, voila! we now know that the hypotenuse extends from ( 0, 0 ) to ( _x, y_ ). which is something you may recognize as a vector, but even if you don't, all you need to know for this example is that you can add this _x_ and _y_ to the coordinates of a point to offset that point _d_ units at angle of _a_ if that seems like a whole lot of steps for a whole lot of nothin', remember that basically _every_ game maps out space using cartesian coordinates. every point is represented by an x value, a y value, and maybe a z value if you're working with three dimensions, and ... that's it. if you want to move 14 units from ( 32, 24 ) at a 45 degree angle, then you gotta calculate exactly _what_ position is 14 units from ( 32, 24 ) at a 45 degree angle. and it isn't ( 39, 31 ), what you get if you just cut 14 in half and slap it on both numbers, even if a 45 degree angle is perfectly diagonal and thus offsets equally on both axes - remember, all of a circle's points are an equal distance from its origin, and yet its diagonal points aren't ( 1, 1 )*. the actual change in _x_ and _y_ we get out of this is a bit under 10, so we actually end up at ( ~42, ~34 ) incidentally, the reason why kaze spoke as if they _aren't_ necessary in 2d games is because all of this is _wholly irrelevant_ in 2d mario games, because they're sidescrollers. so the x and y axes are just completely different forms of movement, so him walking 14 units forward is entirely contained to the x axis. but it's also true that a lot of simpler top-down games, like a lot of old zelda ones, don't bother with using this sort of math for their movement and just slap a constant speed value onto the entity's x and y coordinates. this results in diagonal movement being straight-up faster than cardinal movement, but it's usually not all that noticeable because of the game's presentation. but then you have the interesting case of the original doom games, which as you'd expect from a fps, properly vectorizes your movement so your forward speed remains consistent regardless of which angle you're facing ... except, it vectorizes your forward and strafing movement _separately,_ so you _still_ get the "diagonal movement = faster" effect, it just ends up rotated so you have to move diagonal relative to your facing angle rather than the map's origin *well, they are when using chebyshev distance, where circles are literally just squares. yeah, there are multiple ways to define distance ( which, granted, probably _isn't_ something that got covered in your geometry class ), and it's less theoretical than you might think because ... there's just a lot of different things you can measure! like, chebyshev distance is outright useful in game design, because it represents a type of movement used in the little known game of _chess._ another common alternative to euclidean distance is taxicab distance, which you'd be familiar with if you've played any tile-based game that doesn't allow diagonal movement, like advance wars or fire emblem
@@starfoxmoore6037Don't take it personally. A lot of people just found out over the past few years that a bunch of liars with college degrees have been masquerading as practitioners of science. If you are practicing the scientific method, it's probably best to let your work speak for itself. People simply don't trust the formal classifications at this point, and who could blame them.
5:29 It actually comes from : 250 MHz clock speed (RDRAM double data rate) = 500MT/s 9 Bits bus * 500 MT/s = 4.5 Gbps = ~562.5 MB/s Unfortunately that's in theory a lot goes on in between, latency, memory controllers, data computing speed etc...
It's megabytes on their website. There are other website claiming 4500mbit/sec too. I thought they could be talking about Mb/s too but unforuntately they werent
@@KazeN64 maybe they calculated it in Mbits/s and accidentely wrote it as MB/s? and then someone took the wrong MB/s from wikipedia and multiplied it by 8 again resulting in 4500?
@@t0biascze644 Definitely that lol. It's a rookie mistake, and since everyone is using Wikipedia as their reference.. Well nearly everything ends up wrong afterwards.
@@KazeN64RDRAM operates at 250 MHz over a 9-bit bus with data (possibly) transferred at every clock. That’s where the number comes from. As the GPU uses most of that and the overhead is worse, the CPU might get only 75MB/s
@@lotrbuilders5041 i've computed my 75mb/s from the graphics actually. the CPU is a lot slower than the 75mb/s. usually they share the data speed so both components together probably average around 25mb/s during normal gameplay.
Instead of Euler angles, which have several drawbacks such as gimbal lock and numerical precision, would it be more efficient to use another rotation representation such as quaternions? Would that also reduce the number of needed trig function calls?
Yes, I'm actually planning to convert all the animations to quaternion animations - It'll reduce it from 42 multiplications and 6 sin/cos calls to 37 multiplications. There is also an advantage in quaternions being 8 bytes per rotation instead of 6 - that sounds worse at first, but that way you can guarantee every quaternion is accessible in a single dcache access (whereas the eulers could be spread across 2 dcache lines). So yeah i'm absolutely planning on doing that.
@@KazeN64 Wow, you're eventually gonna rewrite the whole damn engine! Quaternions seem less intuitive at first, but then you see their elegance. And when you start counting mults and other ops, the benefits are clear.
@@Isaac________ They're only less intuitive if you try to visualize them as 4-dimensional arrows. They're _very_ intuitive if you look at them for what they are: mixtures of 4 basis 3D rotations: 180° around x, 180° around y, 180° around z, and 0° (around nothing). An alternative way of visualizing them is as compositions of two reflections. They're more efficient to compose than matrices, but less efficient to apply. If you want to mix in translations, then you need something called dual-quaternions, which have 8 components, but that's still less than a 16-component homogenous matrix or 12-component affine matrix.
@@angeldude101 I think a lot of confusion around quaternions is caused by how we usually think about imaginary numbers as a representation of 2D space. If you try to tack on a similar interpretation of the scalar part of a quaternion as relating to a basis axis, then it doesn't make sense. The scalar part is related to an inner product and projects your vector into a plane where the rotation is a simple reflection. It's the imaginary vector part that "lives" in 3D space. It also helps to realize that the multiplication rules of quaternions are just an algebraic representation of the cross product: j*k = i which is similar to: y × z = x for basis axes x, y, z.
A tip about evaluating polynomials -- start with the highest coefficient, multiply by x, add the next coefficient, multiply by x, etc. Called Horner's method, this greatly reduces floating point error as the terms tend to be of similar scale. For all-even or all-odd series, also prepare x^2 beforehand, to evaluate it as a polynomial in x^2 instead; for odd, factor out the common x, then multiply by it in a final step. About Bhaskara's formula: it's not so mysterious, just a rational approximation. A rational polynomial is a function f(x) = P(x) / Q(x), P and Q being regular polynomials. Generally speaking, the accuracy is equivalent to a n+m order polynomial, where n and m are the order of P and Q respectively. Evidently, this would be advantageous for modest order polynomials (maybe 7th+?) where the division can save some multiplications. Rationals are particularly well suited for peaky functions, where 1/x or 1/(x^2 + a^2) sorts of forms would otherwise be very challenging for a polynomial fit. For smooth functions like sin/cos, it's not too important; and indeed, your 4th/5th order forms are in the same order of magnitude accuracy, as expected. There's also a method not listed here, the hybrid between look-up and curve-fitting: you could have a table of not just values, but coefficients as well. A much smaller table (say 16 or 64 elements) could contain, say, the offset, slope and curvature, which are simply dumped into a quadratic polynomial, and out drops the spline result for that segment. Note this is less intensive than a linear (let alone quadratic or cubic) interpolation, as the coefficients don't need to be computed from adjacent points (table values); for example linear interpolation has to calculate the slope between adjacent points (evil division!), but we can just tabulate that instead. It's not obvious if this would perform better overall than the final choices in the video, but considering how comprehensive the video is, I suppose it bears mentioning. The mention of returning complex pairs has some very interesting outcomes, if taken to its logical conclusion. Ultimately all we're doing is coming up with a vector [x, y] = [cos theta, sin theta], and most of the time we're using both values close by so it makes sense to return both pairs directly. _We could avoid angles entirely and only pass around these vectors!_ This is probably such an immense pain that it's not worth it (every single object which takes an angle parameter has to be refactored to take a normal vector instead!), but, interesting just to imagine how it would all work. Note that, where angles can be added or subtracted, angle vectors must be multiplied or divided, using the complex vector formulas (a + bi)(c+di) = ac - bd + i(ad + bc) and (a + bi) / (c + di) = ( ac + bd + i(bc - ad) ) / (c^2 + d^2). (The division is avoided as long as c^2 + d^2 = 1, which is true for normals.) Fractional fixed-point can be used to save on FLOPs; probably short pairs, so, doubling the memory requirements of stored angles. The magnitudes will eventually drift from enough accumulated operations, and using first-principles values (like, oh I don't know, calculating sin and cos? :^) ) from time to time, or entirely in the first place (imagine that!) is probably helpful; or normalizing (divide by sqrt(a^2 + b^2) -- yuck, what a floating-point mess! -- but a simple iterated approximation can be used (or the infamous inverse square root..!) since it just has to move the value closer to normal, not set it exactly so. Tremendous amount of work, by the way! Numerical approximation, and implementation, can be quite a deep rabbit hole. As, uh, I think you've discovered, hah! :D
I'm not sure why a Taylor series approximation would be too slow. There are no divisions and you can compute x^5 in three multiply instructions. The whole thing would probably take about 8 multiplies and 3 adds for an 8th order accurate approximation. Check out the musl or glibc libm implementations.
Also, -ffast-math can cause incorrect computations since it can break IEEE compliance. Do you see the same incorrect stretching when you build without it?
I am so notoriously bad at math my brain literally locks up when presented with it but you still manage to explain everything in these videos enough that my potato brain can keep up. Which is good because I love seeing how you optimize the hell out of this game despite being a math potato.
The lookup table approach is used in the classic Yamaha OPL2 and OPL3 FM synthesis chips. They contain a 256-entry table of the first quarter in ROM. However, it's not a straight-up sine function, but a log of a sine function. They also contain an exponential table ROM as well, needed for converting the log-transformed values to linear ones.
@@sinom I wouldn't be surprised if some marketing guy decided to change to "MB" because it "looked neater" without understanding the technical implications of it
@@sinomGenerally memory transfer speed is measured in bits/second, not bytes/second. It's not impossible that in a community driven wiki someone made a mistake somewhere along the chain of copying or stating information. And the b/B difference is a newer thing - for quite a while it was common in data/tech sheets to have B just mean both, requiring a call to the vendor or knowledge of how transfer speed is measured at this particular company. But yes, it should be b for bits, or Mbit/s to be explicit.
Really interesting. Easy to understand the x y speed part from the beginning, but sadly, I’m having a hard time with the use of sin/cos in 3d bone movement and wishing I understood why it’s necessary from the video.
"What if I told you there's room for improvement?" Kade, this is YOU we're talking about. You'll probably be able to get Super Mario 64 to run on a TI-86 Calculator if given enough time
20:10 while these bugged bones are not desireable normally, I think they could make for a pretty funny mechanic for another romhack... maybe a power-up?
Funny I actually worked on a project that used the same sine approximation you came up with for the fifth order, but for the third equation (for the last coefficient) we calculated it by minimising the distance between the function and the ideal sine, to do so we ended up equating both integrals between 0 and pi/2 and deduced the result. We had in the end : f(x) = a*(pi/2)*x + b*((pi/2)*x)^3 + c*((pi/2)*x)^5 with a = 4*(3/pi - 9/16) , b = 5/2 - 2*a , c = a - 3/2. I haven't checked if it's better or worse than your function that you found but when we tried a strictly derivative approach (What you did kinda) it ended up doing worse for our purpose. edit: didn't see the new video my bad
yeah that's more accurate. this video was made before i even knew polynomial solvers were a thing and i just kinda went with the approach i knew worked somewhat.
Great video! I remember using overlapping sin/cos tables for the BBC Micro back in the day. And I think I recognise the UI you used for comparing the C code compilation results! 😊
"I never expected anything I've learned in school to actually be a useful life skill" Apparently hacking out a polynomial with a low-error approximation of sine for SM64 optimization in 2023 is a useful life skill lol
Thank you for making videos like these explaining mathematical concepts related to Mario 64. It is informative and appreciated, and encouraging learning
For generating the minimax coefficients you can use a command-line-tool called Sollya. It's neither very well documented nor does it give optimal results, but it's pretty close.
Great video. In a past life, I wrote display device drivers for the GEM Desktop VDI interface, mainly on 286 and 386 systems. In that interface, they do use some angle parameters, mainly for circular arcs. Angles are specified in 0.1 degree increments. The supplied driver code used sine lookup tables and was based on octants, so 0.0 - 45.0 degrees. There was some front end code that would reduce angles into one of 8 octants, even for angles over 360 degrees. I started out using this code but found some strange behavior every once in a while the display update would pause for a noticeable delay. I ended up writing a test program that would load the driver and be able to send it a range of parameters for all the graphics primitive functions. This way I could load my program (and the driver) with a debugger (displaying on a separate system via serial port). I finally narrowed down the problem in the arc drawing code. While the comment in the code (was originally written in a Fortran variant, then the assembler output from that was used) said reduce the angle to 0..7 octant, it was actually doing 1..8 octant. This was in a loop where the code would take the angle, check for > 360 and if larger, subtract 360 and repeat. Then it would check for octant 0..7, but if the angle reduced to 8, it would hit the outer loop and keep going until it finally would get into the 0..7 range. First, I fixed the assembler code to do the proper 0..7 calculation and it worked fine. However, there was over 900 bytes of lookup table (and this was in the 16 bit / 640KB days), so that was a huge chunk of memory just sitting there in case a rarely used arc call was made. I was also in the process of converting almost all this assembler code to C and realized this whole mess could be replaced with a call to the sin() library function. Since we were selling high end graphics cards, almost every customer of ours had a 287 or 387 math co-processor in their system, so this was a fairly low overhead call. Doing this trimmed about 1KB from the driver memory footprint and it ran smoother without that annoying pause every time it got a strange angle parameter.
You should be able to calculate cosine for an exact phase (track phase instead of accumulating the cosine value) using a Horner polynomial. As you know, it's an even function, so you only need half the multiplications. The 6th-order cosine approximation requires 4 multiplications (one is x*x which is then used for 3 multiplications by x^2) and 3 additions. Using fused multiply-add this is one multiply and 3 FMA, a total of 4 instructions to perform the actual calculation. If you normalize the phase to [0,1) for [0,τ) this then requires: - Load the current phase (x) into a register - Depending on the bits for 1/2 and 1/4, decide whether you are going forwards (use bits below 1/2 and 1/4) or backwards (subtract those bits from 0.25) and whether you will use a negative result (caveat: the cosine polynomial approximation normalized to [0,1) is technically domain [0,0.25], but the result is identical if using the range [-0.25,0.25], so it is also possible to subtract 0.25 from the phase and then subtract the result of the polynomial from 0 if the phase is greater than 0.5) - Perform subtraction if so needed - 4 instruction calculation - Subtract result from 0 if in the negative half of sine function - Update the phase (you should be adding a normalized number, e.g. 440Hz f0 with updates at 60Hz means adding 440/60, not 2*pi*440/60) - Store the new phase That should get you under 20 instructions, maybe under 15, with 3 memory accesses (load current phase, load the amount to step the phase, store the new phase) Cosine polynomial reference: gist.github.com/publik-void/067f7f2fef32dbe5c27d6e215f824c91#cos-rel-error-minimized-degree-6
I'm sure you already know this, but someone did a response to this vid where they brute-forced the best sine/4th order cosine with the least amount of error using a SAT solver. RUclips will eat my link, but the post is "Finding the BESTEST sine function for Nintendo 64" by Steinar H. Gunderson.
One way of calculating sine that I came up with was using the fact that in both 3d and 2d you can find the angle between two angles, by normalizing both angles, then simply averaging their position, and normalizing again. So, with this you can calculate any value, first you start with 3 stored values of rotation (0,1) or 0 degrees, (1,0) or 90 degrees, and (0,-1) or 180 degrees, then using the aforementioned function you can take the in-between of 90 degrees and 0 degrees to get 45 degrees, then 22.5 degrees, and so on. You can then convert the angle you are trying to find the sine of into binary. And use each 1 as a call back to one of the computed angles, you add the angles together with a rotation matrix, and you'll get the answer. This works quite well, but isn't designed for speed or to be compact in the memory.
Even better, you can get the components of a rotation matrix with twice the angle between two vectors with just their dot and cross products (Technically a single geometric product.) Even if the vectors aren't normalized, the factor by which the sine and cosine are off by is the same for each, so they effectively cancel out. If you want the rotation only from one to the other, then you need to either normalize the vectors themselves and add them together to get their average, or you can take the double rotation, normalize it, add 1, and then normalize again, which is both the halfway interpolation of the rotation with the identity _and_ the rotation's square root. It's possible with algebraic manipulation to prove that these two methods give the same result. Often when finding rotations, you do so by computing the angle between two vectors using the arcsin of the magnitude of the cross product, the arccos of their dot product, or arctan of the ratio between the two. This is then usually plugged straight into a sincos to get the components for the rotation, leaving you with having called 2 or 3 different transcendental functions which do _absolutely nothing but cancel each other out._ "in both 3d and 2d"? This strategy works in _Nd._ It's completely dimension agnostic. You can even multiply the result by more rotations to compose them, which causes a new behavior in 4d that doesn't happen in 3d or 2d. I didn't even specify the vectors being linearly independent, since the double rotation that results is still a valid 360° rotation, meaning it technically even works in _1d!_ Kaze mentioned in other comments that he was working on converting things to Quaternions, which are a natural part of this system, and (an obfuscated form of) the actual result given by the geometric product of two vectors, or more accurately _mirrors._
Really interesting math! This comment is long, just so you know. I tried to find a 4th order approximation satisfying: f(0) = 0 f_prime(0) = 1 f(pi/2) = 1 f_prime(pi/2) = 0 The only one is ax^4 + bx^3 + x, where a = (16 / pi^4) * (pi - 3) b = (16 / pi^3) * (2 - 0.75pi) The maximum error is ~0.0020673. As for cos(x), another method is to take the derivative of the polynomial for sin(x). Although there will be a greater max error, it will be faster, as the polynomial is of lesser degree and it avoids using a sqrt. If we use the 4th degree polynomial I found, the maximum error is ~0.005. Also, because of the constraints on ~sin(x), this ~cos(x) will always be between 0 and 1 (in the range 0 .. pi/2). As for smoothness, the second derivative f''(0) = 0 correctly and f''(pi/2) = -1.0437 instead of the perfect -1. I also tried to find a 3rd minimax approximation, forcing the following: f(0) = 0 f(-pi/2) = -1 f(pi/2) = 1 The function I ended up with was approximately 0.9824x - 0.14x^3, more specifically: 0.9823962677074202x - 0.14013793513412705x^3 It has a max error of 0.005745, which is slightly better than your function. non-pro tip: Those 'error bumps' at 19:35 should have the same maximum height. Meaning, in the minimax approximation, you sacrifice some error in the shorter bump in order to lower the tallest bump, eventually resulting in the minimax approximation However, it faces the same issue as yours. That is, the function may be > 1. The maximum is at around 1.0012. Forcing the cubic to not exceed 1 (by forcing f_prime(pi/2) = 0) results in easyaspi's approximation.
If you split it again into two separate approximations one from 0 to pi/8 and a different one from pi/8 to pi/4, I think you can probably get a pretty accurate quadratic approximation of sine. I didn't work out the details so it might end up being slower in the long run or have other problems.
@@KazeN64 I have seen this method used in a different context to good effect, which is why I suggested it, but every processor is different, and I don't really know this one nearly as well as you do obviously. So fair enough.
@@KazeN64 And by the way, you can get better polynomial interpolations sometimes by choosing different nodes. I tried out a fifth order cosine with f(x) and f'(x) equal to those of cosine at x=0, pi/4, pi/2 and it gives an order of magnitude better approximation than the fifth order sine you showed. and it has the property also that it is bounded from above by the cosine function on the interval 0 to pi/2
@@timseguine2 wait why a 5th order cosine? odd orders should always be sines and even orders should always be cosines. otherwise you end up having too many coefficients. can you show the function?
@@KazeN64 The way the coefficients disappear depends on which x values are used for evaluating the interpolation. If you split them up like I suggested it works out to: 1 + ((1024 sqrt(2) + 256) π + 4096 sqrt(2) - 11776)x^2/(128π^(2)) + ((-8192 sqrt(2) - 2560) π - 16384 sqrt(2) + 67584)x^3/(128π^(3)) + ((20480 sqrt(2) + 8192) π + 16384 sqrt(2) - 139264)x^4/(128π^(4)) + ((-128 sqrt(2) - 64) π + 768)x^5/(π^(5)) and the maximum error on 0 to pi/2 is about 0.00004 The thing I said about the bound was backwards, but it is never larger than 1. There are other alternative ways to interpolate it depending on what you are going for. A decent quartic one could work for example by dropping one of the conditions at x=pi/4, but I didn't try that
Hello! Floating-Point/Computer Maths engineer here. I've developed trig functions for quite a lot of architectures/my life at this point, maybe I can give some suggestions. In our world, the minimax approximation is mostly king. I say mostly because mathematically it's usually the provably optimal polynomial, however for us there's the usual floating-point issues you get trying to apply maths onto computers. But if there's an issue with a minimax approximation you usually look inwards before looking outwards (Especially if your sin/cos ends up being greater than 1) If you have problems around 0, you should probably be optimizing for relative error, as opposed to absolute error which is what it seems you've done here (I might be wrong). In the business we tend to use an error metric called the ULP error (units in last place), which roughly measures how many bits at the end of your floating point answer are incorrect. (It's very close to a relative error measure, bit not quite the same. Optimising for relative error usually works well) Because of how floating-points scale, this means that you need tiny errors for tiny floats, and can have much larger errors for larger floats. Using the absolute error as your metric allows vales near 0 to blow up to however large you want in a relative error sense. Without an exact application in mind this ULP-metric is the one people like myself would use to write eg maths libraries. Another issue is that if you take the coefficients of an optimal real-valued minimax polynomial and convert them straight to floats you lose an pile of accuracy, unfortunately for computer scientists everywhere. Probably a lot more than you might think, when compounded with floating-point evaluations. We have some tools developed to try get around a lot of these limitations, a good starting point is one called sollya. Here you can ask for relative/absolute error, and give all kinds of constraints to your problem. Importantly you can restrict the output coefficients to be floats, and it will try optimise around this, so you won't lose large amounts of accuracy converting reals coefficients. For example, on the interval [0 -> pi/4], for sin() it gives a cubic polynomial of: x * (0.9995915889739990234375f + x^2 * (-0.16153512895107269287109375f)) Or by forcing the first coefficient to be 1 (generally sollya knows best, but sometimes storing a value of 1 is free on hardware, and it also might make really tiny values more accurate at the expense of larger ones): x * (1.0f + x^2 * (-0.16242791712284088134765625f)) And similarly the next higher polynomial gives: x * (0.999998509883880615234375f + x^2 * (-0.16662393510341644287109375f + x^2 * 8.150202222168445587158203125e-3f)) Or with forced 1: x * (1.0f + x^2 * (-0.166633903980255126953125f + x^2 * 8.16328264772891998291015625e-3f)) Note that sollya does not produce the absolute best polynomial possible, but it's the best you'll get with not too much work, and the results it does give are usually very decent.
Otherse have suggested sollya as well! The problem with the polynomials you've suggested though is that you are adding additional coefficients. I've purposely chosen only odd and only even polynomials because they give you a lot of accuracy with less coefficients. instead of a 3rd order sine, an odd 4th order cosine will always be cheaper and more accurate for example. considering the constraints i gave my polynomials, i also think it's entirely impossible to come up with a different one. we have used up all degrees of freedom after all. although someone suggested easing up on some of the conditions for the higher order polynomials and it did end up improving them.
I can't believe I was so engaged by this video for 26 minutes. Gotta be the best ad read I've ever seen, too, I spent 15 of those minutes "how could I possibly learn all this?"
You might want to use the "true barycentric form" of Lagrange interpolation, it is very efficient and avoids the numerical difficulty you experienced. Also, the Chebyshev nodes produce a very good approximation to the polynomial of best approximation. I'm not sure if I'm allowed to post links, but both of these topics are well described on Wikipedia.
Hey Kaze, this was another solid deep-dive. If I could make a suggestion: I found the additional audio of the speedrunning/glitching balanced a little too loud. Loud, glitched "wa-wa-wa-wa-wahoo"s into a wall + music + mathematics VO is a lot to listen to all at once and this is the only time I can recall a video of yours feeling this loud and overstimulating. I'd like to point to 16:45 as an example, so see if others feel the same. Otherwise, great topic! Love the feeling of progression to the algorithm
The fact that this is mostly review for me from my Numerical Analysis class! It just blows my mind seeing this all working! I mean, I’ve worked with this many long nights studying, but still, it amazes me!
1:35 As a programmer, I have noticed that we don't actually have any proof that this is the best way to represent 3d rotations, while it is without a doubt always going to return a value that is on a sphere, we don't know that it's the most efficient, or the best way to do it.
I'm pretty sure it isn't, because it requires 3 separate rotations, meaning 3 separate complex exponentials, and since the axes are relative to each other, it's possible to lose a degree of freedom if two of them align. Quaternions are much more likely to be the best way to calculate rotations since they just require a single complex exponential and an arbitrary axis on top of being as efficient to store as Euler Angles (3 components). Actually applying the rotations is a different story since it's actually more efficient to do that with matrices than quaternions, but matrices are less efficient to store and compose, so you have to balance the tradeoffs.
@@angeldude101 I think probably the optimal system would be with only two variables, in other words it could be attained by storing a 2d-value. While I know for certain that we already have systems like that, I also don't know that those systems are the best ones. I think likely it would be a lot more complicated than what we have, in a similar way to how 2d rotation is stored as a 1d variable compared to 1d rotation being a Boolean/0d variable.
@@flameofthephoenix8395 In 2D, there's only 1 plane within which you can rotate, so you only need 1 component, and from that you can derive a mixture of of rotation and not-rotation (2 components). In 3D, there are 3 orthogonal planes within which you can rotate, so you need 3 components to define the axis (which only actually requires 2 coordinates) and the angle. More generally, N-dimensional space has N choose 2 rotational degrees of freedom. (1 choose 2) = 0, so 1D rotations have 0 components; (2 choose 2) = 1, so 2D rotations have 1 component; (3 choose 2) = 3, so 3D rotations have 3 components; (4 choose 2) = 6, so 4D rotations have 6 components. That last case actually has something new happen when converting from a axis-angle to a mixture of rotations, as you can now get an 8th component that appears when the axis is actually 2 completely orthogonal axes and can't reduce to just 1.
It's crazy to think that Kaze might just be one of the greatest programmers of our generation... And he uses his insane talents to make Mario 64 mods O_o
You can use y1~[equation] to do regression with your equation from a table by inserting the sine graph x positions and y positions and it will autimatically find pretty much the best answer without playing with numbers yourself
I’m guessing that the 564 Mb/s was referring to megabits, even though MB/s typically stands for bytes, not bits, but yeah that was a sneaky way of making the bandwidth seem higher. This video is incredible, I had no idea there were so many trig functions in these games, but it makes perfect sense
Hey Kaze! I’m no mathematician, but I had an idea regarding those tables. I’m working on being an engineer and those tables reminded me of log tables. Those values in the video aren’t logs, but if they were log’d values, you could preform division by just adding the log values and finding their corresponding number similar to how slide rules were used for quick multiplication or division. Because of how logarithms work, multiplication and division could be done by just subtracting and adding log values. Perhaps this mathematical property could be used to make your algorithm faster somehow? I hope this was food for thought and helps in you finding the optimum sine function! Id love to know what you think of this detail!
I think you'd need to use the fast, inaccurate log for it to be worth it. That fast log is just treating a integer as float (or the opposite, I forgot)
Oh WOW that’s clever. Storing sine and cosine together in the interval [0°, 45°) because the rotation matrix uses both sine and cosine of the same number!
It is a good optimization considering a program usually deals with only a handful of values as inputs per frame. You are not likely to calculate cos(2) after cos(1), making caching cos(2) irrelevant, but you often calculate cos(1) and sin(1) together, especially for calculating projections of the same vector over two axes. Returning both values at once is very useful for 3D-driven games as both are generally consumed by the same function. You get potentially twice cache utilization before a cache invalidation. This you did in get_sin_and_cos() for the LUT method, and the F32x2 trick in the numerical method. Both cases can give you savings outside of the function, by reducing the number of your total function calls. This savings exist independently of naive instruction call / cycle counting. Comparison between OoT sine and interwoven sine is quite illustrative. Of course, in the real game, there may be times when only one of the sin/cos is needed, but it seems to be a minority. I would be interested to see how much savings from the latter numerical methods can be attributed to the use of F32x2 (consumes one register, assuming the consumer code can work without that register space), as compared to fetching one of them and converting to the other via an identity (consumes at least one multiply op, add/sub op, followed by a sqrt op - which you can take out from the main sincosXXXX function since you no longer need to perform sqrt(ONE - x * x) conversion there).
What about using sin(x) = cos(π/2 - x) instead of sin(x) = √(1-cos(x)²) Also I urge you to look into series economization, it's a way to take the Taylor series (which is very precise on (-ε; ε)) and modify it so the error is distributed on the whole interval (-1; 1); basically converting the polynomial to the base of Chebyshev polynomials, then cutting it at a lower order, and converting back into the {Xⁿ} base; now you get a polynomial as precise as the one you started with, but with less terms, and precise on the whole interval If your function is not on (-1; 1) you can just compose it with an affine transformation to put it there
that'd be a lot more expensive because we'd have to run the entire polynomial again. just taking the square roat is faster (square root isn't that expensive on the n64) the chebyshev polynomial might reduce the error across the interval, having no error around pi/2 is much more important to me. if we wanted the min error across the whole interval, i'd use a minmax approximation algorithm.
@@KazeN64 If I understand correctly, the trick is to obtain coefficients entirely by derivation (no curve fitting required) _and then transforming back to the plain polynomial_ (because Cheb. polynomials are just that, a family of polynomials, sum up the coefficients), and use those in the computation. You have the same polynomial implementation, just with different coefficients put into it. The result will, I think, be equivalent to an L^2 (least squares) curve fit, without having to run a curve fitting algorithm. (You can of course use the algorithm just as well, give or take its numerical accuracy.) Chebyshev polynomials are also great for curve fitting in general, as they converge much more quickly than by tweaking naked polynomial coefficients.
This is an insane video to me, not only did you have to DO all of that work, but you made it into a video too? Did you know to keep a log (with explanations!) of all of this while you were doing this? Or did you not even intend on making this a video? Amazing stuff to see.
nah i made the video after i was done. ive also switched the order of events around a little to make the story a little nicer haha, in reality the 3rd order sine came first and bhaskara second.
Thank you very much, Kaze, for answering the question that we all asked our maths teachers at least once in our lifetimes: "Wofür brauchen wir das später überhaupt?" That being said, people like you and Pannenkoek have taught me more useful knowledge about maths, computer science in general and programming in particular than five years of university. Cheers and please keep educating us about the wonders of the N64.
Heh, I spent way too long making an efficient and accurate sine function for modern processors. If you want a way to find coefficients for polynomial functions, I suggest using black box function optimization. I used julia and optimized for both reducing the maximum error and reducing the total error over the section of the function I needed. I shot for a half-sine centered on 0 and did funky bit manipulation on floats... I really don't know how much of what I figured out is useful for n64 hardware, but it's nice seeing someone suffer in a similar way as I did :D
Coefficient finding using remez or chebychev is the best, no need for manual search. I would have like to see a comparison of these algorithms with CORDIC, even though it's table based.
@@quentincorradi5646 I'm certainly not an expert! I'm not surprised there's a better way to find coefficients. I'll have to see if I can grok any of that...
This is exactly the type of video that was interesting enough to get me into a CS program, and even more fascinating and appreciable after my masters lol Cheers, loved this profusely!
I brought up the idea of vector extensions in one of your live streams for returning two (or 4) floats from a function, I gave it a try just now and whilst it is better than returning a struct on the stack (4 instructions), like the struct technique it still avoids the hardware float registers... I'm surprised GCC fails here. On other modern architectures the vector extensions produce the same code as the complex number extension I suppose MIPS hasn't received much attention for this optimisation (similar story for ARM32 with hardware float) ! if you do need to ever return two (or 4) integers from a function without the horrid struct-in-stack behaviour then use the gcc vector extension
And now that I tried modern GCC with the struct-on-stack technique, it actually behaves the same as the vector extension, so 4 instructions and stack is avoided (pass-by-register) Very few ABIs behave like this for structs, so I think modern GCC has made a MIPS specific optimisation for this case
huh it never did that for me - i'm not sure how to use vector extensions but if you could show some sample code that would be pretty useful to have a look at. that type of optimization might apply to a lot more stuff in the repo.
Since he's at the insane optimization level he is now and has shown that the memory bandwidth of the N64 is pretty low, using vector registers as essentially stack space might give a huge boost to performance.
To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant’s annual premium subscription.
5:30 it's wikipedia, you can edit it
Considering you’ve added hi res models into SM64 before, is it possible to add Render96 to it?
@@CandiedC render96 looks like shit, so even if you could it would be better not to
@ 17:34 the derivative is wrong. The correct derivative is 3ax^2 + b, not 3ax^2 + x.
@@RanEncounter oh whoops thats a typo
Using ancient Indian mathematical formulas to make the funny red man go bing bing wahoo even faster, even after going like five massive optimizations that give wild speedups already
Classic Kaze.
"funny red man go bing bing wahoo" is a great band name
The trick to fast computing is to use integers as much as possible
@@lerikhkl I love their song "everybody cheating but me".
@@williamdrum9899 I wonder if this is really true.
Float has 7 decimal positions, if I use an integer and treat the first 7 positions as decimals, would that make my code much faster and more efficient?
This is part of Bhaskara's legacy, one he could not have possibly imagined hundreds of years ago.
I've been optimising sine tables since the ZX Spectrum. I never realised how many tricks I've missed! ... BRILLIANT ... Thanks for sharing
Holy shit bruh, the wise sage spoketh.
Speccy > C64 😝
On the ZX Spectrum, a LUT was probably the right solution. The CPU was slow and the memory was fast. Though there wasn't very much of it, so you'd definitely want the quarter-table trick.
@@keiyakins it does make me want to pull the hardware out and compare the difference. All though I feel this implementation is specific to the way the N64 fetches data
My favorite Z80 trick is aligning a table to &xx00 so instead of loading the address into HL you load the address high byte into H and the desired index into L. Easily triples the speed of any lookup operation
it's so cool to see situations where it's actually worth it taking up more CPU cycles than RAM, when a lot of programming problems are the opposite
The situation hasn't changed much today, actually; more in terms of scale, and levels of abstraction/recursion -- simple operations like these complete almost instantaneously in CPU time, and the rest of the time is spent ferrying the inputs and outputs through multiple levels of caches. Which may extend not just to system RAM or HDD, but over the network as well (which might in turn be cached across numerous servers, and so on!). Accounting for Rambus access is relatively painless in comparison! 😅
Actually there is a long trend this way, CPUs have accelerated way more than RAM.
Antique platforms would pre-compute even flipped sprites, because even though RAM was very scarce, the CPU was even more sluggish than that.
Cache and memory has also stopped scaling as well as compute on newer process nodes
That has nearly always been the case. The SNES and earlier system were basically the last widely available systems were memory access can be done in a single CPU cycle.
Nowadays we are at a point were in programming it often is faster to the the simplest linear search over hundreds or even thousands of elements instead of using things like binary-search that "should" be much faster as they need to only do a fraction of the work.
@@ABaumstumpf a few people have pointed that out and yeah they're right. i was more thinking about caching stuff for recursive problems but optimising for memory access is probably a more common thing
This man knows more about the n64 than nintendo themselves
Yeah, for real haha
Yes he does
@@osparavbruh, u didint watch all his videos
Also probably the best assembly coder there is.
@@osparav well who still code's in assembly and tries to optimize a coding language from 70 years ago
Maybe the 562.5 MegaBytes/s at 5:37 should actually be Megabits/s. 562.5 Mbps = 70.3 MB/s.
Yea I am sure it was a publication error as most people forget the capitalization huge differences.
Add also usual confusion on base 2 and base 10 Mega and you'll come closer to 75... Like the 1TB or 931GB...
@@retrofraction Good thing about Wikipedia is that you can change it and be the MVP!
I think the raw datapath to the RDRAM is capable of that but due to access latency and the memory architecture of the system the practical bandwidth is much much lower.
No, the ram operates at 500MHz over a 9-bit data bus. The 9th bit is only useable by the GPU. The GPU uses the same memory over the same bus and that’s why the CPU access is relatively slow
14:57
The reason he has to make the value negative here is that Bhaskara is only accurate in the range -pi/2 to pi/2.
The calculation for flip sign will be one, if the value is outside of this range. if this is the case the value needs to be shifted and flipped as it is after the if(flipsign) and the ternary operator after the Bhaskara algo proper
Here is a graph showing the process, and also how he gets sine out of this calculation.
BUT RUclips wont let me post it because they are dicks
He DMd me the graph:
www.desmos.com/calculator/dq8l6tj6cv
@@KazeN64 Thanks, I appreciate it, and hope someone finds it interesting.
Out of curiosity, how do you DM on RUclips? Or did you use an outside platform
@@IronLotus15he has his socials linked
@@IronLotus15I remember when RUclips had private messaging 😢
Nintendo has used a lookup table for trigonometry since Super Mario World, where it was used for the rotating sprites (ball and chain, chained platforms, etc)
Back in the day (especially on SNES) the CPU was the bottleneck not the cache so they needed to look it up and not calculate it. I don't even think SNES could do floating point numbers in hardware.
@@AlexRodriguez-gb9ez It could not use floating point numbers, but fixed point (usually 8.8) are very common and can even be hardware-accelerated. In Super Mario World there's a big table with sine/cosine values in this format which is then multiplied with whatever coordinate you want to transform using hardware registers.
Likewise for Sonic 1, also in 1991. I'm getting so tired of this 'Mario 64 was the first game to do X' bullshit. What's next, Kaze - Mario 64 was the first game to ever have 3D graphics?
@@Clownacy mario 64 was the first game to be called super mario 64
SM64DS also uses an interlaced sine/cosine table, but without mirroring. I never really thought about why that is but given how cache works it makes perfect sense.
I wish that game had as much love for the technical, decomp, and romhacking crowd as SM64 does. Glad to see you posting vids for the game
Doesn't the DS work like the GBA though, where reading from ROM is pretty free?
Was SM64DS ever decompiled to produce the same assembly code as with SM64? Or why do you know it?
@@FabulousJejmaze They're both ARM machines so I would think so. But I heard the DS has a memcpy bug that the GBA did not
@@sven33r It wouldn't be exactly the same since each runs on a different CPU. N64 doesn't have variable pointer offsetting.
Man it’s always crazy how even a game that’s considered “poorly optimized” like Mario 64 is still optimized incredibly well for its time.
Now a days developers say “oh most peoples computers these days should have about 16 gigs of ram? Why waste time making it run on 8 then?” And then the game still comes out rushed and bad.
Lol mario 64 isn't well optimised at all, they didn't even turn on the automatic compiler optimisations and it regularly drops below 30fps even though the n64 should be able to handle it fine.
Modern games at least shipped with the compiler optimisations enabled
What you have to remember is that this is a console game, from the age of cartridges.
There was no such thing as "updates" or bug fixes, so studios were forced to give their devs time to get a game good enough to sell.
@@dekufiremage7808because they keep hiring morons
@@benjaminoechsli1941here were re-releases thankfully, but changing the engine is drastic for one to do.
@@augustdahlkvist3998 I’m not super familiar with this topic in particular but it’s a fact that there’s almost nothing in game development that’s as simple as “turning on an optimization”. I would bet money that the issue isn’t as simple as turning on a feature. There are some things that only look obvious and easy with 3 decades of retrospective.
Even then, most modern game studios probably couldn’t get Mario 64 to run at “drops below 30 with big objects” if I’m being honest.
Me (Person who has a literal degree in computer science, built multiple games, and over 5 years industry level experience): This guy is insane and so much is going in one ear and out the other but this is so interesting to watch
Take a computer engineering degree and learn about signal processing, stuff like this will look more familiar.
To be fair to yourself, this is not a topic most computer scientists would deal with. It is much closer to engineering
4:20 Small correction: The table uses 16KB (20KB with cos) of memory, because floats are 4 bytes wide.
true!i was thinking of 0x4000 and just said 4kb because im so hexbrained lmao
@@KazeN64 pin the comment with corrections or update description?
In GCC for ARM, when you want to return two 32 bit integers in registers, you return a 64 bit number. To convert between two 32 bit numbers and one 64 bit number, you use a union.
Yeah that works on n64 too!
That's what I love about unions, they let you break data types when it's convenient to do so. Best part is, there is no performance penalty.
When I was coding 3D dos games 25 years ago, I couldn't understand why my sin table was not that fast, it is just one line of code with no CPU computation... Cache/memory bottleneckes make a huge different, I confirm.
See I program on CPUs that didn't have a cache. So for me, I don't need to worry about it since memory access is equally slow!
@@williamdrum9899then you have to start worrying about other problems!
Yeah you'd have to go back to the '80s when we had no caches to avoid that. I used to play with this stuff in the 8-bit days until about the Amiga era.
@@Anon.G Haha yeah. At least with C compilers you can actually finish a project. Writing an entire game in assembly is a monumental task even on an 8 bit platform
@@andrewdunbar828 I love the 68000, it's so easy to work with
I remember from a book called "linear algebra done right" on page 115 it says the best approximation for sin(x), of fifth degree, on the interval [-pi,pi] is given by 0.987862x − 0.155271x^3 + 0.00564312x^5.
looks like a minmax approximation which has issues i've outlined in the video
@@KazeN64awesome.
I have a hunch about the angle flipping necessity: tangent.
The angles in the 1st and 3rd quadrants (going ccw around a circle on the cartesian grid) have the same tangent values, the same with the 2nd and 4th quadrants.
So you have to distinguish when you want the angle from the 1st or 3rd quadrant (and the 2nd or 4th Quadrant).
MS Excel (and Google Sheets iirc) have the 2 argument arctan function to treat the angle calculation as a vector problem, but since Mario 64 doesn't have this, you have to use a from scratch angle discrimination setup, much like what Kaze ended up using.
atan2 is a godsend when it comes to not having to roll your head everywhere to figure things out. you can define it from a classic arctan though -- just some if cases for signs and stuff
as an SNES homebrew dev, it supremely fucks me up that it's faster to *calculate* a sine than to just use a simple LUT...what the hell even is the N64
The video's great! I love seeing how things like this are done on old hardware. It seems to me like it would be hard to understand how anything would be best optimized for modern hardware with preemptive execution, and weird hardware tricks and strange security flaws - more like magic than science. Even though I don't really program much, let alone optimize anything, optimizing code for old hardware seems like it's something that can actually be learned in a concrete and non-mystical way by a human being, even if it takes effort.
modern hardware is optimized for bad code. i think optimizing code on more modern hardware is less about nitty gritty details like this and more about using best practices. there is also an issue that most modern games run on many different consoles so you dont just have one metric to maximize. a lot of these n64 optimizations would slow modern machines down.
@@KazeN64Ever since Playstations, gaming has gotten more techy to the point of modern renders being mostly power and less “magicks”. Thanks to you, we can see their bag of tricks!
@KazeN64 The more powerful the hardware, the harder it is to run anything bare metal. On modern systems, it can even be impossible, requiring you to bypass even the UEFI itself.
@@KazeN64 It's still optimized for good code, it just also handles bad code a bit less poorly. Bad code on modern hardware can still easily waste 90% of memory bandwidth, etc.
Where I think modern hardware shines is making decent, not optimized, but not wasteful code run very fast.
I love these optimization stories of old video games. I switched from application development to firmware development because I figured it would be the closest I could get in todays technology landscape. Thank you for your videos. They are such a treat ❤
You could also program industrial equipment
A reminder of the actual MIPs cpu cache size (16 KB instruction cache and an 8 KB data cache) brings a whole lot perspective on that LUT size.
It's always interesting to see how constraints inspire creativity and inventiveness. I have very limited programming experience, but a fair bit of writing experience, and having a constraint (the syllable count of a haiku, a particular meter in a poem, a more arbitrary constraint like not using a certain letter at all) really pushes you to play around with the elements you've got and try to achieve a pleasant result in spite of the constraint. It strikes me as weirdly similar to working around memory and cache constraints.
Reminds me, too, of a book I read written by an ex-Atari programmer. (I think it was Chris Crawford.) He talked about the clever things they tried to do with graphics by executing code in the handful of machine cycles in between CRT raster lines.
Thanks Kaze! This was the perfect video to watch while running head first into a wall!
This channel is slowly becoming the best showcase of the general outline of optimizing code on RUclips. I love it.
I love it when trigonometry and calculus are used with gaming and coding. Math is very entertaining.
Shame they do not teach it in school as an applied math in game design.
I thought I could escape this stuff for a few months after finishing my GCSE exams... I guess not then 🤣
Kaze's videos are a joy to watch nevertheless.
It's exactly why I love the classic Sonic engine... that said, things aren't nearly as complex as they are here, obviously.
I wish they taught practical applications of higher math. I tried to make a 2d animation of dots spinning like a processing top, turns out it's a ton of trig inside of matrix algebra. I got into game dev and wanted to make a projectile bounce off a wall, more trig inside matrix algebra, vectors are everywhere in game dev. If you want your character to be affected by gravity, there's immediately calculus to handle. The whole concept of "ramp" in magic the gathering is just exponential growth. Geometry is useful all over the place in action games. It's like you can't move your character without at least distance/hitbox equations happening. So many people say "Math is miserable, you never use it, it's not worth the time" but I try to say math is important, it's a way to understand every physics process unfolding around you, and that power it extremely useful. Even if it's just so you can speak to computers and be better at basic programming, or just to be a better gamer, math is incredibly useful.
I've always been fascinated with how computers implement trigonometry. It's more difficult than arithmetic, and there are many ways to do trig functions. Another method is the CORDIC algorithm, which involves adding smaller and smaller angles until it reaches the input. It was developed for use in 8 and 16 bit processors. The TI-83 calculator, which uses a chip similar to the gameboy's, uses this algorithm.
This is a coding channel that eventually uses mario to show how programming works.
Exactly, and this is why i am here.
Coding AND MATH!!!! My favorites
I just realized how Mario 64 is not optimized at all to store a whole table of sin, it is juste pure rush.
@@amineaitsaidi5919 No, a lookup table is actually quite a good optimization. Like Kaze said, computing the actual Taylor series to a high degree of accuracy would be incredibly slow. Nintendo didn't understand just how much their memory bus design constrained everything until much much later, so at the time it made perfect sense to try to save on CPU with the table.
this is so crazily impressive, kaze must have gone insane to do all this for that little performance gains lmao
Could also just be a passionate autistic programmer like me lol
considering how well optimized a LUT already was, he knew going into it that he was never going to achieve more than a little performance gain
Have you ever tried CORDIC algorithm?
It needs a (quarter?) Sine table but calculates everything just by adding and shifting, so basically one cycle operations and the accuracy depends on the number of repetitions you give the algorithm.
Could be too taxing on the bus but maybe give it a try? Not hard to implement though...
Used it on FPGAs to reduce computation unit requirements by a lot, since multiplier are rare on those. Golem has a great article about it (and in german :3 )!
The idea to interweave the sine and cosine values to avoid cache misses is a stroke of genius! Great work!
thanks for enlightening me on the vertical and horizontal speed on 2d games kaze
NO WAY CAEC
I got a welcome ego boost watching this video by thinking about the "only use one 4th of the curve" and "interweave sine and cosine" before you mentioned them. Thanks for that, I love these optimization tricks videos
As a computer scientist myself. I find these type of videos incredibly interesting.
As a non-computer scientist, I wish I didn't fall asleep in high school geometry class. I find these types of videos incredibly interesting as well, but I have no idea why sine waves are needed to calculate angles in thee dimensions. What does a sign save have to do with math in this context?
@@Psythik sine waves are necessary to calculate angles in _any_ dimensions ( aside from, i suppose, 1d geometry ... mostly because it, uh, doesn't have angles, at least as we traditionally understand them ). or, more precisely, they're necessary to calculate the cartesian coordinates of a vector rotated by an angle, aka, the change in a game entity's _x_ and _y_ positions when moving _d_ game units at an angle of _a_ degrees
basically, imagine a standard 2d coordinate plane, then imagine a right triangle with a hypotenuse ( the line that _is not_ part of the right angle; it's the one that's usually drawn diagonally ) of length _d,_ extending from ( 0, 0 ). this means that one of our non-hypotenuse sides is our x-axis, while the other is aligned with the y-axis. the _a_ of sine and cosine is one of the triangle's non-right angles, and the one that's most useful for us is the one that ends up right next to ( 0, 0 ). now, the sine and cosine of _a_ is the length of the non-hypotenuse side that is opposite / adjacent to, respectively, a, divided by the length of our hypotenuse. in other words, they're the ratio between the _y_ / _x_ coordinate of the hypotenuse's second vertex and the hypotenuse's length. so, all we need is to multiply the length by those ratios and, voila! we now know that the hypotenuse extends from ( 0, 0 ) to ( _x, y_ ). which is something you may recognize as a vector, but even if you don't, all you need to know for this example is that you can add this _x_ and _y_ to the coordinates of a point to offset that point _d_ units at angle of _a_
if that seems like a whole lot of steps for a whole lot of nothin', remember that basically _every_ game maps out space using cartesian coordinates. every point is represented by an x value, a y value, and maybe a z value if you're working with three dimensions, and ... that's it. if you want to move 14 units from ( 32, 24 ) at a 45 degree angle, then you gotta calculate exactly _what_ position is 14 units from ( 32, 24 ) at a 45 degree angle. and it isn't ( 39, 31 ), what you get if you just cut 14 in half and slap it on both numbers, even if a 45 degree angle is perfectly diagonal and thus offsets equally on both axes - remember, all of a circle's points are an equal distance from its origin, and yet its diagonal points aren't ( 1, 1 )*. the actual change in _x_ and _y_ we get out of this is a bit under 10, so we actually end up at ( ~42, ~34 )
incidentally, the reason why kaze spoke as if they _aren't_ necessary in 2d games is because all of this is _wholly irrelevant_ in 2d mario games, because they're sidescrollers. so the x and y axes are just completely different forms of movement, so him walking 14 units forward is entirely contained to the x axis. but it's also true that a lot of simpler top-down games, like a lot of old zelda ones, don't bother with using this sort of math for their movement and just slap a constant speed value onto the entity's x and y coordinates. this results in diagonal movement being straight-up faster than cardinal movement, but it's usually not all that noticeable because of the game's presentation. but then you have the interesting case of the original doom games, which as you'd expect from a fps, properly vectorizes your movement so your forward speed remains consistent regardless of which angle you're facing ... except, it vectorizes your forward and strafing movement _separately,_ so you _still_ get the "diagonal movement = faster" effect, it just ends up rotated so you have to move diagonal relative to your facing angle rather than the map's origin
*well, they are when using chebyshev distance, where circles are literally just squares. yeah, there are multiple ways to define distance ( which, granted, probably _isn't_ something that got covered in your geometry class ), and it's less theoretical than you might think because ... there's just a lot of different things you can measure! like, chebyshev distance is outright useful in game design, because it represents a type of movement used in the little known game of _chess._ another common alternative to euclidean distance is taxicab distance, which you'd be familiar with if you've played any tile-based game that doesn't allow diagonal movement, like advance wars or fire emblem
youre not a computer scientist lol
@@PowerfulKundalini
I know it is hard to believe but I really am friend.
@@starfoxmoore6037Don't take it personally. A lot of people just found out over the past few years that a bunch of liars with college degrees have been masquerading as practitioners of science. If you are practicing the scientific method, it's probably best to let your work speak for itself. People simply don't trust the formal classifications at this point, and who could blame them.
5:29 It actually comes from : 250 MHz clock speed (RDRAM double data rate) = 500MT/s
9 Bits bus * 500 MT/s = 4.5 Gbps = ~562.5 MB/s
Unfortunately that's in theory a lot goes on in between, latency, memory controllers, data computing speed etc...
that makes sense!
05:32
The number for peak bandwidth is megabits per second, that's why your number which is megabytes per second is about one eighth of their number.
It's megabytes on their website. There are other website claiming 4500mbit/sec too. I thought they could be talking about Mb/s too but unforuntately they werent
@@KazeN64 maybe they calculated it in Mbits/s and accidentely wrote it as MB/s? and then someone took the wrong MB/s from wikipedia and multiplied it by 8 again resulting in 4500?
@@t0biascze644 Definitely that lol. It's a rookie mistake, and since everyone is using Wikipedia as their reference.. Well nearly everything ends up wrong afterwards.
@@KazeN64RDRAM operates at 250 MHz over a 9-bit bus with data (possibly) transferred at every clock. That’s where the number comes from. As the GPU uses most of that and the overhead is worse, the CPU might get only 75MB/s
@@lotrbuilders5041 i've computed my 75mb/s from the graphics actually. the CPU is a lot slower than the 75mb/s. usually they share the data speed so both components together probably average around 25mb/s during normal gameplay.
Instead of Euler angles, which have several drawbacks such as gimbal lock and numerical precision, would it be more efficient to use another rotation representation such as quaternions? Would that also reduce the number of needed trig function calls?
Yes, I'm actually planning to convert all the animations to quaternion animations - It'll reduce it from 42 multiplications and 6 sin/cos calls to 37 multiplications. There is also an advantage in quaternions being 8 bytes per rotation instead of 6 - that sounds worse at first, but that way you can guarantee every quaternion is accessible in a single dcache access (whereas the eulers could be spread across 2 dcache lines). So yeah i'm absolutely planning on doing that.
@@KazeN64 Wow, you're eventually gonna rewrite the whole damn engine! Quaternions seem less intuitive at first, but then you see their elegance. And when you start counting mults and other ops, the benefits are clear.
@@Isaac________ They're only less intuitive if you try to visualize them as 4-dimensional arrows. They're _very_ intuitive if you look at them for what they are: mixtures of 4 basis 3D rotations: 180° around x, 180° around y, 180° around z, and 0° (around nothing). An alternative way of visualizing them is as compositions of two reflections. They're more efficient to compose than matrices, but less efficient to apply.
If you want to mix in translations, then you need something called dual-quaternions, which have 8 components, but that's still less than a 16-component homogenous matrix or 12-component affine matrix.
@@angeldude101 Yes. That's why I wrote "at first", and noted their elegance. :)
@@angeldude101 I think a lot of confusion around quaternions is caused by how we usually think about imaginary numbers as a representation of 2D space. If you try to tack on a similar interpretation of the scalar part of a quaternion as relating to a basis axis, then it doesn't make sense. The scalar part is related to an inner product and projects your vector into a plane where the rotation is a simple reflection. It's the imaginary vector part that "lives" in 3D space. It also helps to realize that the multiplication rules of quaternions are just an algebraic representation of the cross product:
j*k = i
which is similar to:
y × z = x
for basis axes x, y, z.
A tip about evaluating polynomials -- start with the highest coefficient, multiply by x, add the next coefficient, multiply by x, etc. Called Horner's method, this greatly reduces floating point error as the terms tend to be of similar scale. For all-even or all-odd series, also prepare x^2 beforehand, to evaluate it as a polynomial in x^2 instead; for odd, factor out the common x, then multiply by it in a final step.
About Bhaskara's formula: it's not so mysterious, just a rational approximation. A rational polynomial is a function f(x) = P(x) / Q(x), P and Q being regular polynomials. Generally speaking, the accuracy is equivalent to a n+m order polynomial, where n and m are the order of P and Q respectively. Evidently, this would be advantageous for modest order polynomials (maybe 7th+?) where the division can save some multiplications. Rationals are particularly well suited for peaky functions, where 1/x or 1/(x^2 + a^2) sorts of forms would otherwise be very challenging for a polynomial fit. For smooth functions like sin/cos, it's not too important; and indeed, your 4th/5th order forms are in the same order of magnitude accuracy, as expected.
There's also a method not listed here, the hybrid between look-up and curve-fitting: you could have a table of not just values, but coefficients as well. A much smaller table (say 16 or 64 elements) could contain, say, the offset, slope and curvature, which are simply dumped into a quadratic polynomial, and out drops the spline result for that segment. Note this is less intensive than a linear (let alone quadratic or cubic) interpolation, as the coefficients don't need to be computed from adjacent points (table values); for example linear interpolation has to calculate the slope between adjacent points (evil division!), but we can just tabulate that instead. It's not obvious if this would perform better overall than the final choices in the video, but considering how comprehensive the video is, I suppose it bears mentioning.
The mention of returning complex pairs has some very interesting outcomes, if taken to its logical conclusion. Ultimately all we're doing is coming up with a vector [x, y] = [cos theta, sin theta], and most of the time we're using both values close by so it makes sense to return both pairs directly. _We could avoid angles entirely and only pass around these vectors!_ This is probably such an immense pain that it's not worth it (every single object which takes an angle parameter has to be refactored to take a normal vector instead!), but, interesting just to imagine how it would all work. Note that, where angles can be added or subtracted, angle vectors must be multiplied or divided, using the complex vector formulas (a + bi)(c+di) = ac - bd + i(ad + bc) and (a + bi) / (c + di) = ( ac + bd + i(bc - ad) ) / (c^2 + d^2). (The division is avoided as long as c^2 + d^2 = 1, which is true for normals.) Fractional fixed-point can be used to save on FLOPs; probably short pairs, so, doubling the memory requirements of stored angles. The magnitudes will eventually drift from enough accumulated operations, and using first-principles values (like, oh I don't know, calculating sin and cos? :^) ) from time to time, or entirely in the first place (imagine that!) is probably helpful; or normalizing (divide by sqrt(a^2 + b^2) -- yuck, what a floating-point mess! -- but a simple iterated approximation can be used (or the infamous inverse square root..!) since it just has to move the value closer to normal, not set it exactly so.
Tremendous amount of work, by the way! Numerical approximation, and implementation, can be quite a deep rabbit hole. As, uh, I think you've discovered, hah! :D
Yeah dont worry about the first point - thats how my code does it. Ffast math automatically multiplies xsquared first
+. Excellent comment!
I'm not sure why a Taylor series approximation would be too slow. There are no divisions and you can compute x^5 in three multiply instructions. The whole thing would probably take about 8 multiplies and 3 adds for an 8th order accurate approximation. Check out the musl or glibc libm implementations.
Also, -ffast-math can cause incorrect computations since it can break IEEE compliance. Do you see the same incorrect stretching when you build without it?
I am so notoriously bad at math my brain literally locks up when presented with it but you still manage to explain everything in these videos enough that my potato brain can keep up.
Which is good because I love seeing how you optimize the hell out of this game despite being a math potato.
@2:12 I was today years old when i learned you can get on the castle. I never even considered it before. The vid was cool but thanks for that
The lookup table approach is used in the classic Yamaha OPL2 and OPL3 FM synthesis chips. They contain a 256-entry table of the first quarter in ROM. However, it's not a straight-up sine function, but a log of a sine function. They also contain an exponential table ROM as well, needed for converting the log-transformed values to linear ones.
5:35 The wiki may be using Mbit/s, as dividing the value by 8 gets pretty close to the real bandwidth.
Then it should read Mb instead of MB though
@@sinom I wouldn't be surprised if some marketing guy decided to change to "MB" because it "looked neater" without understanding the technical implications of it
@@sinomGenerally memory transfer speed is measured in bits/second, not bytes/second. It's not impossible that in a community driven wiki someone made a mistake somewhere along the chain of copying or stating information. And the b/B difference is a newer thing - for quite a while it was common in data/tech sheets to have B just mean both, requiring a call to the vendor or knowledge of how transfer speed is measured at this particular company.
But yes, it should be b for bits, or Mbit/s to be explicit.
Really interesting. Easy to understand the x y speed part from the beginning, but sadly, I’m having a hard time with the use of sin/cos in 3d bone movement and wishing I understood why it’s necessary from the video.
"What if I told you there's room for improvement?"
Kade, this is YOU we're talking about.
You'll probably be able to get Super Mario 64 to run on a TI-86 Calculator if given enough time
13:40 I’ve heard that division is shockingly costly for computation, but it being on par with square roots is insane
20:10 while these bugged bones are not desireable normally, I think they could make for a pretty funny mechanic for another romhack... maybe a power-up?
Funny I actually worked on a project that used the same sine approximation you came up with for the fifth order, but for the third equation (for the last coefficient) we calculated it by minimising the distance between the function and the ideal sine, to do so we ended up equating both integrals between 0 and pi/2 and deduced the result.
We had in the end : f(x) = a*(pi/2)*x + b*((pi/2)*x)^3 + c*((pi/2)*x)^5 with a = 4*(3/pi - 9/16) , b = 5/2 - 2*a , c = a - 3/2.
I haven't checked if it's better or worse than your function that you found but when we tried a strictly derivative approach (What you did kinda) it ended up doing worse for our purpose.
edit: didn't see the new video my bad
yeah that's more accurate. this video was made before i even knew polynomial solvers were a thing and i just kinda went with the approach i knew worked somewhat.
I am old school so I really liked these optimizations considering that to day most programmers just make more and more bloated codes.
Great video! I remember using overlapping sin/cos tables for the BBC Micro back in the day. And I think I recognise the UI you used for comparing the C code compilation results! 😊
the man himself
Matt Godbolt watches N64 optimizations nice
Holy fuck, didnt expected to see true OG on here lmao.
"I never expected anything I've learned in school to actually be a useful life skill"
Apparently hacking out a polynomial with a low-error approximation of sine for SM64 optimization in 2023 is a useful life skill lol
It is, it's called fun. You ever had it in your life?
@@MrMeow-dk2tx
Yessir
It is when making Mario 64 romhacks is one of your bodily functions.
His brain has a built in RISC co processor for just rom hacking
math is useful for physics and statistics if you want to consider "real world applications"
Thank you for making videos like these explaining mathematical concepts related to Mario 64. It is informative and appreciated, and encouraging learning
For generating the minimax coefficients you can use a command-line-tool called Sollya. It's neither very well documented nor does it give optimal results, but it's pretty close.
Very clever optimization to get the cosine to cache with the sine. Loved the deep dive. You definitely did look cool doing it! Great video
Did you consider the CORDIC algorithm? It's (to my understanding) really efficient on low end hardware, so it might be worth looking into
Cordic is going to be more accurate, but slower.
13:47 This is the best b-roll I've ever seen
"You are going to learn a lot of math today, and you are going to like it" was weirdly threatening
9:56 "not sure if anyone else thought of this", it was used in Crash Team Racing (1999) by Naughty Dog on PlayStation 1!
Great video.
In a past life, I wrote display device drivers for the GEM Desktop VDI interface, mainly on 286 and 386 systems. In that interface, they do use some angle parameters, mainly for circular arcs. Angles are specified in 0.1 degree increments. The supplied driver code used sine lookup tables and was based on octants, so 0.0 - 45.0 degrees. There was some front end code that would reduce angles into one of 8 octants, even for angles over 360 degrees.
I started out using this code but found some strange behavior every once in a while the display update would pause for a noticeable delay. I ended up writing a test program that would load the driver and be able to send it a range of parameters for all the graphics primitive functions. This way I could load my program (and the driver) with a debugger (displaying on a separate system via serial port).
I finally narrowed down the problem in the arc drawing code. While the comment in the code (was originally written in a Fortran variant, then the assembler output from that was used) said reduce the angle to 0..7 octant, it was actually doing 1..8 octant. This was in a loop where the code would take the angle, check for > 360 and if larger, subtract 360 and repeat. Then it would check for octant 0..7, but if the angle reduced to 8, it would hit the outer loop and keep going until it finally would get into the 0..7 range.
First, I fixed the assembler code to do the proper 0..7 calculation and it worked fine. However, there was over 900 bytes of lookup table (and this was in the 16 bit / 640KB days), so that was a huge chunk of memory just sitting there in case a rarely used arc call was made. I was also in the process of converting almost all this assembler code to C and realized this whole mess could be replaced with a call to the sin() library function. Since we were selling high end graphics cards, almost every customer of ours had a 287 or 387 math co-processor in their system, so this was a fairly low overhead call. Doing this trimmed about 1KB from the driver memory footprint and it ran smoother without that annoying pause every time it got a strange angle parameter.
thank you for Mario Teaches Math, Kaze 64 edition. your lesson was a lot of fun!!!
You should be able to calculate cosine for an exact phase (track phase instead of accumulating the cosine value) using a Horner polynomial. As you know, it's an even function, so you only need half the multiplications. The 6th-order cosine approximation requires 4 multiplications (one is x*x which is then used for 3 multiplications by x^2) and 3 additions. Using fused multiply-add this is one multiply and 3 FMA, a total of 4 instructions to perform the actual calculation. If you normalize the phase to [0,1) for [0,τ) this then requires:
- Load the current phase (x) into a register
- Depending on the bits for 1/2 and 1/4, decide whether you are going forwards (use bits below 1/2 and 1/4) or backwards (subtract those bits from 0.25) and whether you will use a negative result (caveat: the cosine polynomial approximation normalized to [0,1) is technically domain [0,0.25], but the result is identical if using the range [-0.25,0.25], so it is also possible to subtract 0.25 from the phase and then subtract the result of the polynomial from 0 if the phase is greater than 0.5)
- Perform subtraction if so needed
- 4 instruction calculation
- Subtract result from 0 if in the negative half of sine function
- Update the phase (you should be adding a normalized number, e.g. 440Hz f0 with updates at 60Hz means adding 440/60, not 2*pi*440/60)
- Store the new phase
That should get you under 20 instructions, maybe under 15, with 3 memory accesses (load current phase, load the amount to step the phase, store the new phase)
Cosine polynomial reference: gist.github.com/publik-void/067f7f2fef32dbe5c27d6e215f824c91#cos-rel-error-minimized-degree-6
I'm sure you already know this, but someone did a response to this vid where they brute-forced the best sine/4th order cosine with the least amount of error using a SAT solver. RUclips will eat my link, but the post is "Finding the BESTEST sine function for Nintendo 64" by Steinar H. Gunderson.
One way of calculating sine that I came up with was using the fact that in both 3d and 2d you can find the angle between two angles, by normalizing both angles, then simply averaging their position, and normalizing again. So, with this you can calculate any value, first you start with 3 stored values of rotation (0,1) or 0 degrees, (1,0) or 90 degrees, and (0,-1) or 180 degrees, then using the aforementioned function you can take the in-between of 90 degrees and 0 degrees to get 45 degrees, then 22.5 degrees, and so on. You can then convert the angle you are trying to find the sine of into binary. And use each 1 as a call back to one of the computed angles, you add the angles together with a rotation matrix, and you'll get the answer. This works quite well, but isn't designed for speed or to be compact in the memory.
Even better, you can get the components of a rotation matrix with twice the angle between two vectors with just their dot and cross products (Technically a single geometric product.) Even if the vectors aren't normalized, the factor by which the sine and cosine are off by is the same for each, so they effectively cancel out. If you want the rotation only from one to the other, then you need to either normalize the vectors themselves and add them together to get their average, or you can take the double rotation, normalize it, add 1, and then normalize again, which is both the halfway interpolation of the rotation with the identity _and_ the rotation's square root. It's possible with algebraic manipulation to prove that these two methods give the same result.
Often when finding rotations, you do so by computing the angle between two vectors using the arcsin of the magnitude of the cross product, the arccos of their dot product, or arctan of the ratio between the two. This is then usually plugged straight into a sincos to get the components for the rotation, leaving you with having called 2 or 3 different transcendental functions which do _absolutely nothing but cancel each other out._
"in both 3d and 2d"? This strategy works in _Nd._ It's completely dimension agnostic. You can even multiply the result by more rotations to compose them, which causes a new behavior in 4d that doesn't happen in 3d or 2d. I didn't even specify the vectors being linearly independent, since the double rotation that results is still a valid 360° rotation, meaning it technically even works in _1d!_
Kaze mentioned in other comments that he was working on converting things to Quaternions, which are a natural part of this system, and (an obfuscated form of) the actual result given by the geometric product of two vectors, or more accurately _mirrors._
Ah yes, one step closer to optimizing the N64's boot time. Wonderful video.
Really interesting math!
This comment is long, just so you know.
I tried to find a 4th order approximation satisfying:
f(0) = 0
f_prime(0) = 1
f(pi/2) = 1
f_prime(pi/2) = 0
The only one is ax^4 + bx^3 + x,
where
a = (16 / pi^4) * (pi - 3)
b = (16 / pi^3) * (2 - 0.75pi)
The maximum error is ~0.0020673.
As for cos(x), another method is to take the derivative of the polynomial for sin(x). Although there will be a greater max error, it will be faster, as the polynomial is of lesser degree and it avoids using a sqrt. If we use the 4th degree polynomial I found, the maximum error is ~0.005. Also, because of the constraints on ~sin(x), this ~cos(x) will always be between 0 and 1 (in the range 0 .. pi/2). As for smoothness, the second derivative f''(0) = 0 correctly and f''(pi/2) = -1.0437 instead of the perfect -1.
I also tried to find a 3rd minimax approximation, forcing the following:
f(0) = 0
f(-pi/2) = -1
f(pi/2) = 1
The function I ended up with was approximately 0.9824x - 0.14x^3, more specifically:
0.9823962677074202x - 0.14013793513412705x^3
It has a max error of 0.005745, which is slightly better than your function.
non-pro tip: Those 'error bumps' at 19:35 should have the same maximum height. Meaning, in the minimax approximation, you sacrifice some error in the shorter bump in order to lower the tallest bump, eventually resulting in the minimax approximation
However, it faces the same issue as yours. That is, the function may be > 1. The maximum is at around 1.0012. Forcing the cubic to not exceed 1 (by forcing f_prime(pi/2) = 0) results in easyaspi's approximation.
If you split it again into two separate approximations one from 0 to pi/8 and a different one from pi/8 to pi/4, I think you can probably get a pretty accurate quadratic approximation of sine. I didn't work out the details so it might end up being slower in the long run or have other problems.
it'd be cheaper and more accurate to just use the cosine of one higher degree to get around having to do float comparisons at that point
@@KazeN64 I have seen this method used in a different context to good effect, which is why I suggested it, but every processor is different, and I don't really know this one nearly as well as you do obviously. So fair enough.
@@KazeN64 And by the way, you can get better polynomial interpolations sometimes by choosing different nodes.
I tried out a fifth order cosine with f(x) and f'(x) equal to those of cosine at x=0, pi/4, pi/2 and it gives an order of magnitude better approximation than the fifth order sine you showed. and it has the property also that it is bounded from above by the cosine function on the interval 0 to pi/2
@@timseguine2 wait why a 5th order cosine? odd orders should always be sines and even orders should always be cosines. otherwise you end up having too many coefficients. can you show the function?
@@KazeN64 The way the coefficients disappear depends on which x values are used for evaluating the interpolation. If you split them up like I suggested it works out to:
1 + ((1024 sqrt(2) + 256) π + 4096 sqrt(2) - 11776)x^2/(128π^(2)) + ((-8192 sqrt(2) - 2560) π - 16384 sqrt(2) + 67584)x^3/(128π^(3)) + ((20480 sqrt(2) + 8192) π + 16384 sqrt(2) - 139264)x^4/(128π^(4)) + ((-128 sqrt(2) - 64) π + 768)x^5/(π^(5))
and the maximum error on 0 to pi/2 is about 0.00004
The thing I said about the bound was backwards, but it is never larger than 1.
There are other alternative ways to interpolate it depending on what you are going for. A decent quartic one could work for example by dropping one of the conditions at x=pi/4, but I didn't try that
Hello! Floating-Point/Computer Maths engineer here.
I've developed trig functions for quite a lot of architectures/my life at this point, maybe I can give some suggestions.
In our world, the minimax approximation is mostly king. I say mostly because mathematically it's usually the provably optimal polynomial, however for us there's the usual floating-point issues you get trying to apply maths onto computers.
But if there's an issue with a minimax approximation you usually look inwards before looking outwards (Especially if your sin/cos ends up being greater than 1)
If you have problems around 0, you should probably be optimizing for relative error, as opposed to absolute error which is what it seems you've done here (I might be wrong). In the business we tend to use an error metric called the ULP error (units in last place), which roughly measures how many bits at the end of your floating point answer are incorrect.
(It's very close to a relative error measure, bit not quite the same. Optimising for relative error usually works well)
Because of how floating-points scale, this means that you need tiny errors for tiny floats, and can have much larger errors for larger floats. Using the absolute error as your metric allows vales near 0 to blow up to however large you want in a relative error sense.
Without an exact application in mind this ULP-metric is the one people like myself would use to write eg maths libraries.
Another issue is that if you take the coefficients of an optimal real-valued minimax polynomial and convert them straight to floats you lose an pile of accuracy, unfortunately for computer scientists everywhere. Probably a lot more than you might think, when compounded with floating-point evaluations.
We have some tools developed to try get around a lot of these limitations, a good starting point is one called sollya. Here you can ask for relative/absolute error, and give all kinds of constraints to your problem. Importantly you can restrict the output coefficients to be floats, and it will try optimise around this, so you won't lose large amounts of accuracy converting reals coefficients.
For example, on the interval [0 -> pi/4], for sin() it gives a cubic polynomial of:
x * (0.9995915889739990234375f + x^2 * (-0.16153512895107269287109375f))
Or by forcing the first coefficient to be 1 (generally sollya knows best, but sometimes storing a value of 1 is free on hardware, and it also might make really tiny values more accurate at the expense of larger ones):
x * (1.0f + x^2 * (-0.16242791712284088134765625f))
And similarly the next higher polynomial gives:
x * (0.999998509883880615234375f + x^2 * (-0.16662393510341644287109375f + x^2 * 8.150202222168445587158203125e-3f))
Or with forced 1:
x * (1.0f + x^2 * (-0.166633903980255126953125f + x^2 * 8.16328264772891998291015625e-3f))
Note that sollya does not produce the absolute best polynomial possible, but it's the best you'll get with not too much work, and the results it does give are usually very decent.
Otherse have suggested sollya as well! The problem with the polynomials you've suggested though is that you are adding additional coefficients. I've purposely chosen only odd and only even polynomials because they give you a lot of accuracy with less coefficients. instead of a 3rd order sine, an odd 4th order cosine will always be cheaper and more accurate for example.
considering the constraints i gave my polynomials, i also think it's entirely impossible to come up with a different one. we have used up all degrees of freedom after all.
although someone suggested easing up on some of the conditions for the higher order polynomials and it did end up improving them.
Replying so i can remember this golden comment via comment history. Thanks for this comment, i also learn a lot from this..
I love that Kaze's romhacking has evolved into the most insane solutions to code optimization problems
This was a fun watch. That makes me wonder how much silicon in modern gaming apus might be dedicated to the trig portion of the ALUs
Probably enough to where your average programmer doesn't need to think about this stuff
I can't believe I was so engaged by this video for 26 minutes. Gotta be the best ad read I've ever seen, too, I spent 15 of those minutes "how could I possibly learn all this?"
In before he switch to mighty quaternion for cache locality
i am 100% planning to do that for animations
You might want to use the "true barycentric form" of Lagrange interpolation, it is very efficient and avoids the numerical difficulty you experienced. Also, the Chebyshev nodes produce a very good approximation to the polynomial of best approximation. I'm not sure if I'm allowed to post links, but both of these topics are well described on Wikipedia.
Hey Kaze, this was another solid deep-dive. If I could make a suggestion: I found the additional audio of the speedrunning/glitching balanced a little too loud. Loud, glitched "wa-wa-wa-wa-wahoo"s into a wall + music + mathematics VO is a lot to listen to all at once and this is the only time I can recall a video of yours feeling this loud and overstimulating. I'd like to point to 16:45 as an example, so see if others feel the same.
Otherwise, great topic! Love the feeling of progression to the algorithm
Yeah thats fair! i'll make sure to turn the audio down a bit next video.
Thanks, I appreciate it 👍
The fact that this is mostly review for me from my Numerical Analysis class!
It just blows my mind seeing this all working! I mean, I’ve worked with this many long nights studying, but still, it amazes me!
I can't wait till this game is so ridiculously optimized that I can run this on my ti-84
1:35 As a programmer, I have noticed that we don't actually have any proof that this is the best way to represent 3d rotations, while it is without a doubt always going to return a value that is on a sphere, we don't know that it's the most efficient, or the best way to do it.
I'm pretty sure it isn't, because it requires 3 separate rotations, meaning 3 separate complex exponentials, and since the axes are relative to each other, it's possible to lose a degree of freedom if two of them align. Quaternions are much more likely to be the best way to calculate rotations since they just require a single complex exponential and an arbitrary axis on top of being as efficient to store as Euler Angles (3 components). Actually applying the rotations is a different story since it's actually more efficient to do that with matrices than quaternions, but matrices are less efficient to store and compose, so you have to balance the tradeoffs.
@@angeldude101 I think probably the optimal system would be with only two variables, in other words it could be attained by storing a 2d-value. While I know for certain that we already have systems like that, I also don't know that those systems are the best ones. I think likely it would be a lot more complicated than what we have, in a similar way to how 2d rotation is stored as a 1d variable compared to 1d rotation being a Boolean/0d variable.
@@flameofthephoenix8395 In 2D, there's only 1 plane within which you can rotate, so you only need 1 component, and from that you can derive a mixture of of rotation and not-rotation (2 components). In 3D, there are 3 orthogonal planes within which you can rotate, so you need 3 components to define the axis (which only actually requires 2 coordinates) and the angle.
More generally, N-dimensional space has N choose 2 rotational degrees of freedom. (1 choose 2) = 0, so 1D rotations have 0 components; (2 choose 2) = 1, so 2D rotations have 1 component; (3 choose 2) = 3, so 3D rotations have 3 components; (4 choose 2) = 6, so 4D rotations have 6 components. That last case actually has something new happen when converting from a axis-angle to a mixture of rotations, as you can now get an 8th component that appears when the axis is actually 2 completely orthogonal axes and can't reduce to just 1.
It's crazy to think that Kaze might just be one of the greatest programmers of our generation... And he uses his insane talents to make Mario 64 mods O_o
The Mario Kart music in the background 🔥Nintendo has so many good jams!
ocarina of sine
Considering that music is just sound waves of varying frequency, and sound waves are waves like any others, _yes._
You can use y1~[equation] to do regression with your equation from a table by inserting the sine graph x positions and y positions and it will autimatically find pretty much the best answer without playing with numbers yourself
I was really hoping sin(x)=x approximation was going to make a guest appearance here
Rewatching this now. It has to be one of my favourite videos! Everything is explained well.
It's so wild to me that actually trying to approximate sine can be more efficient than the LUT
I’m guessing that the 564 Mb/s was referring to megabits, even though MB/s typically stands for bytes, not bits, but yeah that was a sneaky way of making the bandwidth seem higher.
This video is incredible, I had no idea there were so many trig functions in these games, but it makes perfect sense
Hey Kaze!
I’m no mathematician, but I had an idea regarding those tables. I’m working on being an engineer and those tables reminded me of log tables. Those values in the video aren’t logs, but if they were log’d values, you could preform division by just adding the log values and finding their corresponding number similar to how slide rules were used for quick multiplication or division.
Because of how logarithms work, multiplication and division could be done by just subtracting and adding log values. Perhaps this mathematical property could be used to make your algorithm faster somehow?
I hope this was food for thought and helps in you finding the optimum sine function! Id love to know what you think of this detail!
Also i wanted to say log slide rules also calculate sine functions too!
I think you'd need to use the fast, inaccurate log for it to be worth it. That fast log is just treating a integer as float (or the opposite, I forgot)
Oh WOW that’s clever. Storing sine and cosine together in the interval [0°, 45°) because the rotation matrix uses both sine and cosine of the same number!
You should make that big fist Mario 12:23 a power up in the game
I appreciate that you were passionate enough to go this deep into this game
I think Wikipedia meant megabits, not megabytes. 8x difference
It is a good optimization considering a program usually deals with only a handful of values as inputs per frame. You are not likely to calculate cos(2) after cos(1), making caching cos(2) irrelevant, but you often calculate cos(1) and sin(1) together, especially for calculating projections of the same vector over two axes. Returning both values at once is very useful for 3D-driven games as both are generally consumed by the same function. You get potentially twice cache utilization before a cache invalidation. This you did in get_sin_and_cos() for the LUT method, and the F32x2 trick in the numerical method.
Both cases can give you savings outside of the function, by reducing the number of your total function calls. This savings exist independently of naive instruction call / cycle counting.
Comparison between OoT sine and interwoven sine is quite illustrative. Of course, in the real game, there may be times when only one of the sin/cos is needed, but it seems to be a minority.
I would be interested to see how much savings from the latter numerical methods can be attributed to the use of F32x2 (consumes one register, assuming the consumer code can work without that register space), as compared to fetching one of them and converting to the other via an identity (consumes at least one multiply op, add/sub op, followed by a sqrt op - which you can take out from the main sincosXXXX function since you no longer need to perform sqrt(ONE - x * x) conversion there).
I dont know how im supposed to watch this, but I probably did watch it like a Caveman being learned about basic electricity.
Exactly this, i find it really interesting but i literally burst out laughing because i realise i have no idea what it all means.
well, at least I am glad I am not the only caveman around
3:50 sounds like a fairly interesting and unique way of computing sine.
What about using sin(x) = cos(π/2 - x) instead of sin(x) = √(1-cos(x)²)
Also I urge you to look into series economization, it's a way to take the Taylor series (which is very precise on (-ε; ε)) and modify it so the error is distributed on the whole interval (-1; 1); basically converting the polynomial to the base of Chebyshev polynomials, then cutting it at a lower order, and converting back into the {Xⁿ} base; now you get a polynomial as precise as the one you started with, but with less terms, and precise on the whole interval
If your function is not on (-1; 1) you can just compose it with an affine transformation to put it there
that'd be a lot more expensive because we'd have to run the entire polynomial again. just taking the square roat is faster (square root isn't that expensive on the n64)
the chebyshev polynomial might reduce the error across the interval, having no error around pi/2 is much more important to me. if we wanted the min error across the whole interval, i'd use a minmax approximation algorithm.
@@KazeN64 If I understand correctly, the trick is to obtain coefficients entirely by derivation (no curve fitting required) _and then transforming back to the plain polynomial_ (because Cheb. polynomials are just that, a family of polynomials, sum up the coefficients), and use those in the computation. You have the same polynomial implementation, just with different coefficients put into it. The result will, I think, be equivalent to an L^2 (least squares) curve fit, without having to run a curve fitting algorithm. (You can of course use the algorithm just as well, give or take its numerical accuracy.)
Chebyshev polynomials are also great for curve fitting in general, as they converge much more quickly than by tweaking naked polynomial coefficients.
This is an insane video to me, not only did you have to DO all of that work, but you made it into a video too? Did you know to keep a log (with explanations!) of all of this while you were doing this? Or did you not even intend on making this a video? Amazing stuff to see.
nah i made the video after i was done. ive also switched the order of events around a little to make the story a little nicer haha, in reality the 3rd order sine came first and bhaskara second.
you are nothing short of a genius
Thank you very much, Kaze, for answering the question that we all asked our maths teachers at least once in our lifetimes: "Wofür brauchen wir das später überhaupt?"
That being said, people like you and Pannenkoek have taught me more useful knowledge about maths, computer science in general and programming in particular than five years of university. Cheers and please keep educating us about the wonders of the N64.
Heh, I spent way too long making an efficient and accurate sine function for modern processors. If you want a way to find coefficients for polynomial functions, I suggest using black box function optimization. I used julia and optimized for both reducing the maximum error and reducing the total error over the section of the function I needed. I shot for a half-sine centered on 0 and did funky bit manipulation on floats... I really don't know how much of what I figured out is useful for n64 hardware, but it's nice seeing someone suffer in a similar way as I did :D
Coefficient finding using remez or chebychev is the best, no need for manual search. I would have like to see a comparison of these algorithms with CORDIC, even though it's table based.
@@quentincorradi5646 I'm certainly not an expert! I'm not surprised there's a better way to find coefficients. I'll have to see if I can grok any of that...
This is exactly the type of video that was interesting enough to get me into a CS program, and even more fascinating and appreciable after my masters lol
Cheers, loved this profusely!
I love seeing things get optimized
And here I am feeling good when a php page takes less than 5 seconds to load.
I need more of these high-quality edited videos! Every time you make a new optimisation I look forward to another one of these for it lol
I brought up the idea of vector extensions in one of your live streams for returning two (or 4) floats from a function, I gave it a try just now and whilst it is better than returning a struct on the stack (4 instructions), like the struct technique it still avoids the hardware float registers...
I'm surprised GCC fails here. On other modern architectures the vector extensions produce the same code as the complex number extension
I suppose MIPS hasn't received much attention for this optimisation (similar story for ARM32 with hardware float)
! if you do need to ever return two (or 4) integers from a function without the horrid struct-in-stack behaviour then use the gcc vector extension
And now that I tried modern GCC with the struct-on-stack technique, it actually behaves the same as the vector extension, so 4 instructions and stack is avoided (pass-by-register)
Very few ABIs behave like this for structs, so I think modern GCC has made a MIPS specific optimisation for this case
huh it never did that for me - i'm not sure how to use vector extensions but if you could show some sample code that would be pretty useful to have a look at. that type of optimization might apply to a lot more stuff in the repo.
Since he's at the insane optimization level he is now and has shown that the memory bandwidth of the N64 is pretty low, using vector registers as essentially stack space might give a huge boost to performance.
My word, that sin/cos layout for caching is brilliant. Excellent work, as usual.
You can export excel tables to pdf, or disable auto-correction to get rid of the squiggly red lines.