Is "now try it with the fast inverse sqrt" the programmer version of how every musician content creator is asked/forced to attempt Rush E and other meme songs
Nice video! Just one small point: If you want to use invsqrt(x) to calculate sqrt(x), you can use x*invsqrt(x) instead of 1/invsqrt(x). That might save a few cycles? But I still agree that the quake3 fast inverse square root algorithm is probably not that useful on N64.
to do regular sqrt you would just use a different magic constant. float fast_sqrt(float x) { int i = *(int *)&x; i = 0x1fbd1df5 + (i >> 1); return *(float *)&i; }
If I understood that table correctly from the start of the video, does that imply that the 0 newton iteration would take 6 cycles to complete just the fast inverse portion? How long does * take, may deserve another video...
Did a little reading, it sounds as though hardware implementations of sqrt make have taken a few different routes, common ones apparently were a lookup table for rough approximation followed by a number of newton iterations, or alternatively some process similar to long division. Without digging too deeply, my guess is that might be why the cycle counts for division and sqrt are the same. Since long division's one of the slower approaches to dealing with floats, it could be that the fast inverse sqrt is the way to go, since the n64's hardware was developed before the algorithm was discovered it could be that its implementation could be beaten via software. Newton iterations roughly double the precision of the result so a better initial guess can rapidly decrease the time cost.
I'm kind of surprised it was made popular by Quake III. I could've sworn Mike Abrash put it in the original Quake, but it's been years since I saw the source.
@@KazeN64 ah yes, DOOM with all of it's real time reflections and shading and floating point numbers that it definitely used. /s seriously though, i thought it was funny. because DOOM is pretty much always the first game people think of when IdSoftware is mentioned. so sometimes people think the FastInverseSqrt is also from DOOM, even though the game doesn't even use floats at all because they were way too slow at the time.
Silas' idea with the error cancelling is very cool, there are probably many other examples where we can reduce the error of one problem by dividing it into two sub-problems with opposite error
Error is a fickle issue. While I don't know for sure, usually these types of strategies have smaller error, in return they have larger error when values get extremely big or small.
@@M0liusX I definitely believe that, there are usually very specific cases where one algorithm is better than another, and in general there is no "optimal" algorithm which always works best for all situations, it always depends on the specific example.
We use similar algoritms in land surveying to estimate coordinates with high accuracy. By using a GPS reading and comparing it to a GPS reading, at a control station, we can see which sattelites are visible in both readings. By subtracting the differences between the readings, the accuracy of the initial GPS reading goes from 5-10m(15-30feet) of accuracy, to 2-3cm(1-1.5inches) of accuracy. This only works when the same satelites are visible in both readings, if the control reading is to far away, then the algorithm wouldn't work. But this cancels out errors like the density of the atmosphere, refraction errors, and random errors, since you work with more data.
modern CPUs actually have built in optimized instructions for exactly these types of things. For example, intel CPUs have an operation that does 8 inverse square root operations with perfect accuracy in just one clock cycle. These types of operations are called SIMD, and are majorly underutilized Edit: not quite perfect 2^-14 is the maximum accuracy they guarantee For a game though, that difference will be practically indistinguishable
Right? Game developers need to pay attention. We wouldn't have so many issues with modern games being unoptimized if they used more advanced optimization. Doom Eternal is a perfect example of that
@@mariotheundying Layers and layers and layers of virtualization, security, memory safety, compatibility, APIs, threading, and more before you even get close to running instructions on the bare metal
Never underestimate the value of a single cycle to a bedrock function that gets called endlessly. Like back when one of my teenage hobby projects was a primitive 3D renderer (for modding a game with), written in BASIC (specifically VB4), with basically zero access to any external, actual 3D APIs. Displaying textures in real-time was far outside my abilities but I had the coordinate transformations for wireframe rendering optimized as much as I can think of, including a few scenarios where I resorted to the oft-maligned GOSUB/RETURN type calls instead of making a function call to handle it, simply because it was the faster mechanism.
I didn't actually expect it to work at all. Because I remember you said that square roots on the N64 are relatively fast. Of course Kaze will find a way to eek out that tiny extra bit of performance and then some.
Both editions of the book Hacker's Delight mention the fast inverse square root (or as they call it, an Approximate Reciprocal Square Root Routine) and give various improvements of the algorithm. In the books they already mentioned FISR without Newton iterations: > deleting the Newton step results in a substantially faster function with a relative error within ±0.035, using a constant of 0x5F37642F.
The gas tank's just fine, actually, especially with the Expansion Pak ... The problem is that the hose from the gas tank to the engine is the size of a curly straw.
They started with designs used for SGI workstations and cut stuff waaay down. They were not going to make a new CPU design for the N64 and clocking it slower wouldn't have saved any money. It's not like today where we have ARM cores at every performance level you might want.
It's amazing to me how you can make deeply complex topics so easy to understand by explaining them based on a use case. Programming is like black magic to me, yet I can follow your videos along without any issue. God bless.
have you thought about writing some small research papers for these findings and experiments? like, even if they're extremely specific for your use case, they're still cool as hell and might even help someone some day
For the graph at 9:19, it probably would have been more clear if you'd labeled it as "Error (%) vs Cycles," since that's what the numbers actually represent. In both cases, a lower number is better, which is the inverse of what is implied by "Accuracy vs Performance" (which suggests that a higher value is more accurate or has higher performance).
All of your videos are so incredible. I love how you mix maths and humour in the way you do. Even if I can't comprehend everything, I love each and every second
I was certain it was absolutely "useless" in the Nintendo 64 hardware, I'm amazed you actually found a place to use it! Also, you know your audience very well, @06:58 I chuckled and @07:23 I almost laughed. Great content!
I'm honestly impressed at the grind. In your shoes I would probably have ignored the FISR comments for being more pop comp sci than actual well-informed optimisation tech but good on you for actually investigating it and finding a place it excels even on a chip with hardware floats
I know the project is aimed at maintaining real hardware compatibility, but maybe consider to patch a n64-emu and give one extra rambus to the console to see how far your game can go?
The levels in the beginning of this video look really pretty! This is starting to look like one of the best Mario games ever designed... And it's a rom hack of Mario 64!
I used to think there was no reason for a Mario 64 sequel to exist, but seeing how much there was to improve.on the base engine, and how many level possibilities it opens up, it's obvious they should have made one
At 7:23: I’m not an expert on this, but wouldn’t any number 1.17549435082e-38 or smaller be rounded up due to floating point accuracy, eliminating any potential issue?
that number is the lowest possible floating point number before the exponent hit -127 - if the number was smaller than this, it'd loop around and you'd get a number closer to the max float representable
You don't need to do 1/fastinvsqrt or x*fastinvsqrt to get fastsqrt, you can use the exact same method, just instead of constant-(i>>1) use constant+(i>>1) (and of course the constant is different). You can use this method to calculate any nth power, the general formula is constant+i*n (which for -½ simplifies to constant-(i>>1)).
The truth about why my dad never came back from the grocery store 😢😢😢 Edit: Yooo my mind was blown w/ Silas’s idea. Mathematically it seems obvious but getting that much accuracy improvement with the 2 fourth-root calcs multiplied together is insane! Thanks for the good content as always!
I wonder if you’ll ever do optimization for the n64 bios… I know Nintendo didn’t give very many developers access to the bios but there are a few games which load a different bios into the n64 and allow even more optimized code to be run for whatever game was developed…
You mean the RSP Microcode, and that's a whole different problem. I don't know how doucumented the microcode is. Nintendo certainly didn't want developers messing with it, and just use the ones they provided.
I made my own using weighted quadratic beziers. It's only 4% less accurate than the sine and cosine operations using the squirt, at a fraction of the performans cost. I know it can be improved, but so far so good. :3c
The inverse square root sure went through a journey, didn't it? From being cumbersome to calculate, to an ingenious bit hack, to becoming its own CPU instruction.
Well, computers have to normalize a LOT of vectors for any sort of physical simulation or 3D graphics and it's one of the big bottlenecks. So it makes sense to have its own heavily optimized CPU instruction.
Kaze Emanuar: The only person that can explain to me how to optimise a 26 year old game in high technical detail I can't even begin to understand, while keeping me invested until the end.
8:57 ooo, reminds me of Romberg integration (more generally Richardson extrapolation), add together approximations at different step sizes to cancel out one error term in the taylor expansion. Takes me back to intro to scientific computing :)
I am always fascinated by your content, i'm too dumb to understand all the maths and algorithms involved in your work, but its insane what you achieved with N64. of course specially with Mario 64.
You know, for as terrible as most everything in the world is right now, the fact that we have people like Kaze who is relentlessly optimizing Mario 64 and making great romhacks is a welcome ray of sunshine. Never stop buddy!
Your videos are really fascinating to me. I don't have a background in computing science, but I do know mathematics and videogames. Appreciating your research! :)
I don't know where your journey takes you, but the amount of data you took upon you, is painfully large. You are like a character from literature, the giver. In the story a certain character carries the memory of the world as it was. At this point it's safe to bet you are in the top 5 most knowledgeable n64 programming/development person in the world.
He has the relative luxury of only concerning himself with one set of hardware (the N64) and one set of software (the relevant programming language(s)) for decades. Almost no (game) programmer out there enjoys this kind of laser focus. Instead it's 2-3 things at a time and every other year one thing is discarded and another thing added.
when i was tutoring, i told students that there is a time and place to use the fast inverse square root. the place is on 32 bit windows PCs and the time was 1999. The trick only works for single precision IEEE 754 floats, and essentially any modern PC made after probably 2007 or so is so fast that the gains of the float hack aren't worth it lol. its important to teach though because i think it demystifies the IEEE 754 standard and helps students understand that its essentially just scientific notation but with 2 instead of 10 as your base. and if you go into embedded systems, you do need to understand how that stuff works because you will eventually need to convert a 754 float to a DEC float (or vice versa) and that has its own little hacks.
The same trick works for doubles too, but it needs a different magic constant (which has been listed on wikipedia) Otherwise, yeah, it's role is obsoleted by the dedicated reciprocal square root approximation instruction, where available.
Its great that some of the most interesting computer engineering is being done on old hardware. Can you imagine if a modern developer put this much effort into their optimizations!
@@ictogon Yet, to a programmer it doesn't matter how complicated the CPU is. The issue is that no one cares how performant your code is as long as it finishes the task in a reasonable time. People would rather solve a lot of complicated problems well than to optimize any one so it finishes in 10ms instead of 200ms. No user cares. If I can load my bank account app in 1 second, I don't care as long as I can do everything that I want with it. I'd rather have this than a website that loads on 100ms but can't do a lot of things Once hardware stops advancing, optimization will return.
Love this. One of my favourite algorithms because it looks like magic from today's point of view, and apparently back then too, going by the comments. I don't know about you but I study IT at University and it's really funny to me how obvious the algorithm gets once you grasp the mathematical concepts behind it, and how floats are implemented.
what's crazy to me is that stuff like this proves geniuses are out there. I struggle writing methods in java ffs, I didn't understand a thing going on except that Kaze has an immense IQ
At 1:30 you seem to be suggesting that the best way to get sqrt(x) from 1/sqrt(x) is to take the reciprocal via a division, but you could also just multiply by x since x/sqrt(x) = sqrt(x). I don't know if this changes anything you're saying.
To summarize: _Script kiddie:_ "HAY, you should use this really famous algorithm because it's a more efficient way of performing floating point calculations!" _Nintendo:_ "Yeah, we thought of that, dude. We stuck a chip in the system that does that _specific_ calculation all on its own because doing it any other way was hella inefficient." _Kaze:_ "Amateurs..." _Nintendo/kid:_ "What was that?" _Kaze, LVL. 99 Script Wizard:_ "AMATEURS!"
The game runs so smoothly its amazing, what do you think, could you afford a simple shadow map pass? Is the n64 buffer enough to hold a depth pass or two?
I'm sure I've seen another explanation of the FISR algorithm and their conclusion was that the constant selected couldn't be improved. That was for x86 though so perhaps the code works differently on the N64 🙂😵💫
Is there anyway to upgrade the ram's speed on a N64? I'm a bit curious just how much horsepower is locked away behind having 4 miners share one pickaxe.
The N64 uses a type of RAM called RDRAM. That's where you would need to start your search. There are faster versions of it that were made for PCs, but they were uncommon and the last ones were made around 2003. The PS2 also uses RDRAM, so that might be an easier source to find. After you find some RDRAM, you would need to find out how to make it work. The RDRAM is connected to the RCP, so you would need to find some way to speed up the RDRAM without affecting the RCP's timings.
Im using fixed-point so i kinda need the algorithmic versions, not the mantissa hack. Interested in researching these... In a few years, when i have time to spare this issue
Hi Kaze, love your coding tech video but, frankly, I almost don't understand them, surely you are very talented indeed. One simple question: what do you think about Saturn processing power? The SCU 48-bit DSP inside SS can actually be helpful in matrix transformation aside from the sual indipendet 32bit CPUs?
9:22 hey, don't need to call me out like that. i get it, its some cool math. i honestly had to do a double take when you mentioned 4th roots being multiplied by 2 magic number to cancel out each other's error. that kind of line up is nonsensically precise to land on, and tis just here now.
Quake 3? I'm almost certain it was back in Quake 1. There was a discussion about whether it was Carmack or Mike Abrash, and turns it could have been Gary Tarolli at 3dfx? My memory of that era has long faded.
Kaze, if I was capable of understanding the concepts that you break down in your videos I know I would be able to learn things from you. But because I don't code or anything, the only thing I can truly say I learned today is that you were able to make the n64 go vroom vroom even more! I am happy and grateful for you and the others that dedicate time into development for the n64. Thank you for what you do. Btw is there any updates on the progress of Return to Yoshi's Island?
8:46 calling an inverse fourth root a "fourth inverse square root" really made me lose track of what was happening for a moment, lol. Great video though
1:30 I'm sure finv + div is more expensive than sqrt, but I am certain there is a way to make a finv-like function that only computes the square root without taking the reciprocal. That would be a fairer comparison in that part of the video.
The N64 really was an insane piece of 3D hardware for a home console, it's such a shame that they decided a memory minivan was good enough substitute for the memory bus. I wonder what kind of CPU-intensive software we would have seen if that one bottleneck was enlarged.
The fast inv sqrt function isn’t “magic” like many would assume. An x87 FPU is focused on precision and id wrote that code to to use fixed point math where that IEEE 7.54 precision is not required. Never assume that one CPU architecture works the same as x86. Fast on x86 doesn’t mean fast universally.
Even though this is mostly specific to the N64 I'm going to save it to my game dev list anyway :) Maybe some day I'll find myself optimizing some code where this is useful and remember that it's in there
I have a question... how "friendly" is your code? The work you did to optimize SM64's source code is nothing short of amazing, but from what I've heard, code optimization usually comes at the cost of readability. If someone who's not that experienced with C tried to read your code, how much do you think they would struggle to understand everything you wrote?
EVERYTHING? they would struggle a lot with the new graph system, but i also doubt that someone like that wound't also struggle understanding matrix math in the first place, which is a prerequisite to understanding the code at all.
With what he's doing, he probably keeps the original code in place, with some conditional compilation statements to switch between the two sets of code (so the original code with the clearer intent stays visible), with some reminder comments in the code about what the optimizations are doing, maybe something along the lines of /*using a modification of Quake 3 inverse square algorithm here, constant hex value obtained through empirical brute force test on real N64 hardware*/.
@@jamesfrankiewicz5768that's not really possible unfortunately, huge parts of how the engine work have been changed and most vanilla mario 64 functions don't have an equivalent function in RTYI any more.
Let's be real. If anyone needs Kaze's code, they're probably gonna need it as a library. This is black magic. Just imagine, people making new n64 games using libkazen64
this question was good until we reached "who's not experienced with C tried to read your code".... if you aren't experienced with C it doesn't matter how much you comment they aren't going to understand it nor any of the other principles being used. It *should* be commented and well documented in places where major changes were made and as to why, but for it to be readable for people who have no idea what the concepts are? That's stretching way beyond the purpose of documentation/commenting. Let alone the other realm of concepts like the mathematical operations and hardware specific refinements
Hi Kaze, very cool video. i have a question about a mario 64 port for the playstation classic, it runs very well (60fps) but in under water scenes the framerate drops below 60fps. i watched you videos and question myself: is it possible to build a port for the playstation classsic with your enhanced version of the source code and will it reduce or completely remove the stutters ( framerate drops)?
My best guess.. Some optimizations might be helpful but Kaze has made a lot of very N64 hardware specific improvements, and since the hardware is different, It's not at all certain they would work as well.
You can use the inverse square root algorithm in more cases. The other best example is probably physical simulations, more exactly, simulations of gravity, electromagnetism, etc. since those follow the inverse square law. It's probably obsolete by now though, and i have no idea why you would ever want to build physical simulations on an N64 lmao.
There actually is a fast square root algorithm that uses the same technique as the fast inverse square root algorithm. I don't know if it would be useful at all on an N64 though.
TLDR 0x5F375A86 might be a better magic number I just remembered I read a comment a year ago on a doom video that mentioned a more accurate magic number. I don't know anything about it but it may be of use: "Using the default magic number of 0x5F3759DF actually gives about 0.175% error maximally, dropping thereafter. It is now believed the original author of the function found the constant through trial and error rather than derivation. If you use Chris Lomont's updated magic number of 0x5F375A86 (which he derived), the error drops to a staggering 0.09% after just 1 iteration."
1:11 well use builtin sqrt, then run a parser in cpu cache that grabs this result, and performs inversion within cpu cache, so you only would calculate the square root and write to cache buffer periodically when needed
Is "now try it with the fast inverse sqrt" the programmer version of how every musician content creator is asked/forced to attempt Rush E and other meme songs
Free bird!
Nice video!
Just one small point: If you want to use invsqrt(x) to calculate sqrt(x), you can use x*invsqrt(x) instead of 1/invsqrt(x). That might save a few cycles? But I still agree that the quake3 fast inverse square root algorithm is probably not that useful on N64.
oh true, somehow i just forgot
to do regular sqrt you would just use a different magic constant.
float fast_sqrt(float x) {
int i = *(int *)&x;
i = 0x1fbd1df5 + (i >> 1);
return *(float *)&i;
}
If I understood that table correctly from the start of the video, does that imply that the 0 newton iteration would take 6 cycles to complete just the fast inverse portion? How long does * take, may deserve another video...
Did a little reading, it sounds as though hardware implementations of sqrt make have taken a few different routes, common ones apparently were a lookup table for rough approximation followed by a number of newton iterations, or alternatively some process similar to long division. Without digging too deeply, my guess is that might be why the cycle counts for division and sqrt are the same. Since long division's one of the slower approaches to dealing with floats, it could be that the fast inverse sqrt is the way to go, since the n64's hardware was developed before the algorithm was discovered it could be that its implementation could be beaten via software. Newton iterations roughly double the precision of the result so a better initial guess can rapidly decrease the time cost.
I'm kind of surprised it was made popular by Quake III. I could've sworn Mike Abrash put it in the original Quake, but it's been years since I saw the source.
i thought it was from doom before i made this video lmao
@@KazeN64 You accidentally still said Doom in one part of the video.
ahhh shit, i thought i edited it out haha
@@KazeN64 Twice actually, lol. 5:00 and 8:30
@@KazeN64 ah yes, DOOM with all of it's real time reflections and shading and floating point numbers that it definitely used. /s
seriously though, i thought it was funny. because DOOM is pretty much always the first game people think of when IdSoftware is mentioned.
so sometimes people think the FastInverseSqrt is also from DOOM, even though the game doesn't even use floats at all because they were way too slow at the time.
Silas' idea with the error cancelling is very cool, there are probably many other examples where we can reduce the error of one problem by dividing it into two sub-problems with opposite error
This feels like reading a data book wrong every time with the wrong indicator
Perfectly balanced as all this should be
Error is a fickle issue. While I don't know for sure, usually these types of strategies have smaller error, in return they have larger error when values get extremely big or small.
@@M0liusX I definitely believe that, there are usually very specific cases where one algorithm is better than another, and in general there is no "optimal" algorithm which always works best for all situations, it always depends on the specific example.
We were doing this analogue style in the 1930s with balanced cables. Letting you run small voltages hundreds of feet with no interference.
We use similar algoritms in land surveying to estimate coordinates with high accuracy. By using a GPS reading and comparing it to a GPS reading, at a control station, we can see which sattelites are visible in both readings. By subtracting the differences between the readings, the accuracy of the initial GPS reading goes from 5-10m(15-30feet) of accuracy, to 2-3cm(1-1.5inches) of accuracy. This only works when the same satelites are visible in both readings, if the control reading is to far away, then the algorithm wouldn't work. But this cancels out errors like the density of the atmosphere, refraction errors, and random errors, since you work with more data.
It's crazy to think just what a modern computer might be capable of if we had the time and expertise to optimize code to this degree...
Where is that lost time going to? A big company can do it
modern CPUs actually have built in optimized instructions for exactly these types of things. For example, intel CPUs have an operation that does 8 inverse square root operations with perfect accuracy in just one clock cycle. These types of operations are called SIMD, and are majorly underutilized
Edit: not quite perfect
2^-14 is the maximum accuracy they guarantee
For a game though, that difference will be practically indistinguishable
Right? Game developers need to pay attention. We wouldn't have so many issues with modern games being unoptimized if they used more advanced optimization. Doom Eternal is a perfect example of that
@@mariotheundying Layers and layers and layers of virtualization, security, memory safety, compatibility, APIs, threading, and more before you even get close to running instructions on the bare metal
@@vesuvianprime a big company has all the time, all the people and all the money in the world for that, I'm talking about stuff like Microsoft
everyones gangsta till kaze finds an one cycle improvement in the most cracked function ever existing
Never underestimate the value of a single cycle to a bedrock function that gets called endlessly.
Like back when one of my teenage hobby projects was a primitive 3D renderer (for modding a game with), written in BASIC (specifically VB4), with basically zero access to any external, actual 3D APIs. Displaying textures in real-time was far outside my abilities but I had the coordinate transformations for wireframe rendering optimized as much as I can think of, including a few scenarios where I resorted to the oft-maligned GOSUB/RETURN type calls instead of making a function call to handle it, simply because it was the faster mechanism.
I like to think that whatever piece of his soul John Carmack lost when Oculus got acquired by FB/Meta ended up possessing Kaze
I didn't actually expect it to work at all. Because I remember you said that square roots on the N64 are relatively fast. Of course Kaze will find a way to eek out that tiny extra bit of performance and then some.
you're forgetting this is the reverse sqrt not just sqrt
@@Brad_Script I didn't forget. I just thought that it's not worth it compared to the performance gains on old Pentiums.
Both editions of the book Hacker's Delight mention the fast inverse square root (or as they call it, an Approximate Reciprocal Square Root Routine) and give various improvements of the algorithm. In the books they already mentioned FISR without Newton iterations:
> deleting the Newton step results in a substantially faster function with a relative error within ±0.035, using a constant of 0x5F37642F.
2:00 So they gave it a race car engine and a can of beer for a gas tank?
The gas tank's just fine, actually, especially with the Expansion Pak ... The problem is that the hose from the gas tank to the engine is the size of a curly straw.
Cutting corners in the strangest of ways.
yeah that's pretty much a good metric for why the n64 is slow lol
Yeah, I remember some N64 dev from the time made a similar anology.
They started with designs used for SGI workstations and cut stuff waaay down. They were not going to make a new CPU design for the N64 and clocking it slower wouldn't have saved any money. It's not like today where we have ARM cores at every performance level you might want.
It's amazing to me how you can make deeply complex topics so easy to understand by explaining them based on a use case. Programming is like black magic to me, yet I can follow your videos along without any issue. God bless.
have you thought about writing some small research papers for these findings and experiments? like, even if they're extremely specific for your use case, they're still cool as hell and might even help someone some day
For the graph at 9:19, it probably would have been more clear if you'd labeled it as "Error (%) vs Cycles," since that's what the numbers actually represent. In both cases, a lower number is better, which is the inverse of what is implied by "Accuracy vs Performance" (which suggests that a higher value is more accurate or has higher performance).
All of your videos are so incredible. I love how you mix maths and humour in the way you do. Even if I can't comprehend everything, I love each and every second
RAM Bus is my favorite recurring character on this show, glad to see Them back!
I was certain it was absolutely "useless" in the Nintendo 64 hardware, I'm amazed you actually found a place to use it!
Also, you know your audience very well, @06:58 I chuckled and @07:23 I almost laughed. Great content!
Really. N64 is not my platform, but if it was Gameboy code I might legit have paused the video to count the cycles.
I'm honestly impressed at the grind. In your shoes I would probably have ignored the FISR comments for being more pop comp sci than actual well-informed optimisation tech but good on you for actually investigating it and finding a place it excels even on a chip with hardware floats
Not sure why hardware floats matter for FISR. Quake III was written for Pentium processors, all of which have hardware floats.
@@jimmyhirr5773 Right, what set the N64 aside are actually hardware square roots
5:02 Ah yes, Quake's Fast Inverse Square Root algorithm, famously used in DOOM /j
I know the project is aimed at maintaining real hardware compatibility, but maybe consider to patch a n64-emu and give one extra rambus to the console to see how far your game can go?
The levels in the beginning of this video look really pretty! This is starting to look like one of the best Mario games ever designed... And it's a rom hack of Mario 64!
I used to think there was no reason for a Mario 64 sequel to exist, but seeing how much there was to improve.on the base engine, and how many level possibilities it opens up, it's obvious they should have made one
7:48 There is research for that by Chris Lomont in 2003. However, the number he found was 0x5f37642f.
he might have used a different error measuring technique. I used "Maximum relative error" as a measure of error.
At 7:23: I’m not an expert on this, but wouldn’t any number 1.17549435082e-38 or smaller be rounded up due to floating point accuracy, eliminating any potential issue?
that number is the lowest possible floating point number before the exponent hit -127 - if the number was smaller than this, it'd loop around and you'd get a number closer to the max float representable
You don't need to do 1/fastinvsqrt or x*fastinvsqrt to get fastsqrt, you can use the exact same method, just instead of constant-(i>>1) use constant+(i>>1) (and of course the constant is different). You can use this method to calculate any nth power, the general formula is constant+i*n (which for -½ simplifies to constant-(i>>1)).
oh shit that actually seems like a good idea
The truth about why my dad never came back from the grocery store 😢😢😢
Edit: Yooo my mind was blown w/ Silas’s idea. Mathematically it seems obvious but getting that much accuracy improvement with the 2 fourth-root calcs multiplied together is insane!
Thanks for the good content as always!
don't worry, he's just building up speed for 12 years
Wha
I wonder if you’ll ever do optimization for the n64 bios… I know Nintendo didn’t give very many developers access to the bios but there are a few games which load a different bios into the n64 and allow even more optimized code to be run for whatever game was developed…
You mean the RSP Microcode, and that's a whole different problem. I don't know how doucumented the microcode is. Nintendo certainly didn't want developers messing with it, and just use the ones they provided.
@@watchm4ker yeah, I meant the microcode… I’d be so interested to see what kinds of things could be done with that explored
@@drewynucci9037 having looked at a few more videos... Yeah, it's pretty wild. Look up F3DEX3.
7:20 That poor shyguy....
I made my own using weighted quadratic beziers. It's only 4% less accurate than the sine and cosine operations using the squirt, at a fraction of the performans cost. I know it can be improved, but so far so good. :3c
I didn't even bat an eye at "squirt" the first time I read this comment because that's exactly how I say it in my head
Damn! optimizing the optimization! That's inception level of programming artistry. ❤ good job!
The inverse square root sure went through a journey, didn't it?
From being cumbersome to calculate, to an ingenious bit hack, to becoming its own CPU instruction.
Well, computers have to normalize a LOT of vectors for any sort of physical simulation or 3D graphics and it's one of the big bottlenecks. So it makes sense to have its own heavily optimized CPU instruction.
Kaze Emanuar: The only person that can explain to me how to optimise a 26 year old game in high technical detail I can't even begin to understand, while keeping me invested until the end.
i used to watch your videos
Kaze's coding gives me the feeling of when the Vtech kicks in
Damn, Kaze's take on fast inverse square root is really interesting.
these types of optimizations are fun on microcontrollers. they're kind of the final frontier of optimization shenanigans.
8:57 ooo, reminds me of Romberg integration (more generally Richardson extrapolation), add together approximations at different step sizes to cancel out one error term in the taylor expansion. Takes me back to intro to scientific computing :)
I am always fascinated by your content, i'm too dumb to understand all the maths and algorithms involved in your work, but its insane what you achieved with N64. of course specially with Mario 64.
You know, for as terrible as most everything in the world is right now, the fact that we have people like Kaze who is relentlessly optimizing Mario 64 and making great romhacks is a welcome ray of sunshine. Never stop buddy!
Your videos are really fascinating to me. I don't have a background in computing science, but I do know mathematics and videogames. Appreciating your research! :)
I didn't know how much I missed hearing you say vroom vroom until now. lol
I just fucking died laughing after saying "YOOOO" outloud & IMMEDIATLY got called out for it by the video.
I don't know where your journey takes you, but the amount of data you took upon you, is painfully large. You are like a character from literature, the giver. In the story a certain character carries the memory of the world as it was. At this point it's safe to bet you are in the top 5 most knowledgeable n64 programming/development person in the world.
He has the relative luxury of only concerning himself with one set of hardware (the N64) and one set of software (the relevant programming language(s)) for decades. Almost no (game) programmer out there enjoys this kind of laser focus. Instead it's 2-3 things at a time and every other year one thing is discarded and another thing added.
when i was tutoring, i told students that there is a time and place to use the fast inverse square root. the place is on 32 bit windows PCs and the time was 1999. The trick only works for single precision IEEE 754 floats, and essentially any modern PC made after probably 2007 or so is so fast that the gains of the float hack aren't worth it lol. its important to teach though because i think it demystifies the IEEE 754 standard and helps students understand that its essentially just scientific notation but with 2 instead of 10 as your base. and if you go into embedded systems, you do need to understand how that stuff works because you will eventually need to convert a 754 float to a DEC float (or vice versa) and that has its own little hacks.
The same trick works for doubles too, but it needs a different magic constant (which has been listed on wikipedia)
Otherwise, yeah, it's role is obsoleted by the dedicated reciprocal square root approximation instruction, where available.
The amount of raw joy i felt when you said rambus goes vroom vroom
Another fun and educational video Kaze, thank you!
Its great that some of the most interesting computer engineering is being done on old hardware. Can you imagine if a modern developer put this much effort into their optimizations!
modern cpus are multiple orders of magnitude more complicated than the N64
@@ictogon Yet, to a programmer it doesn't matter how complicated the CPU is. The issue is that no one cares how performant your code is as long as it finishes the task in a reasonable time. People would rather solve a lot of complicated problems well than to optimize any one so it finishes in 10ms instead of 200ms. No user cares.
If I can load my bank account app in 1 second, I don't care as long as I can do everything that I want with it. I'd rather have this than a website that loads on 100ms but can't do a lot of things
Once hardware stops advancing, optimization will return.
Love this. One of my favourite algorithms because it looks like magic from today's point of view, and apparently back then too, going by the comments.
I don't know about you but I study IT at University and it's really funny to me how obvious the algorithm gets once you grasp the mathematical concepts behind it, and how floats are implemented.
what's crazy to me is that stuff like this proves geniuses are out there. I struggle writing methods in java ffs, I didn't understand a thing going on except that Kaze has an immense IQ
Keep at it and you'll get better. I don't know how long you've been programming, but I'm sure that Kaze has been doing it for a long time.
I forgot about making the RAMBUS go VROOM VROOM! I don't program stuff like this at all, but it is still always interesting to see your work.
Me (comp Sci grad but for enterprise software): "mhmm ahh yes I understand some of these words"
At 1:30 you seem to be suggesting that the best way to get sqrt(x) from 1/sqrt(x) is to take the reciprocal via a division, but you could also just multiply by x since x/sqrt(x) = sqrt(x). I don't know if this changes anything you're saying.
it does not
To summarize:
_Script kiddie:_ "HAY, you should use this really famous algorithm because it's a more efficient way of performing floating point calculations!"
_Nintendo:_ "Yeah, we thought of that, dude. We stuck a chip in the system that does that _specific_ calculation all on its own because doing it any other way was hella inefficient."
_Kaze:_ "Amateurs..."
_Nintendo/kid:_ "What was that?"
_Kaze, LVL. 99 Script Wizard:_ "AMATEURS!"
u mean level 199
babe wake up, Kaze just uploaded a new video
The game runs so smoothly its amazing, what do you think, could you afford a simple shadow map pass? Is the n64 buffer enough to hold a depth pass or two?
with all of your improvements, I bet you could make an N64 game engine for building anything. Not just Mario
And don't forget, if the only reason you're using sqrt to normalize your vectors is for distance comparisons, you don't even need it at all.
Am I the only one who doesn’t understand most of this stuff? Yet I’m always excited to see them. And when I see “vroom” I get happy too
I said I would watch this and I’m keeping my word. Thanks for taking the effort!
I'm sure I've seen another explanation of the FISR algorithm and their conclusion was that the constant selected couldn't be improved. That was for x86 though so perhaps the code works differently on the N64 🙂😵💫
great vid as always!! negligible (3:36) should be pronounced like "neg-*lij*-able", whereas you said "neg-li-able". again amazing work!!
Is there anyway to upgrade the ram's speed on a N64? I'm a bit curious just how much horsepower is locked away behind having 4 miners share one pickaxe.
The N64 uses a type of RAM called RDRAM. That's where you would need to start your search. There are faster versions of it that were made for PCs, but they were uncommon and the last ones were made around 2003. The PS2 also uses RDRAM, so that might be an easier source to find.
After you find some RDRAM, you would need to find out how to make it work. The RDRAM is connected to the RCP, so you would need to find some way to speed up the RDRAM without affecting the RCP's timings.
Im using fixed-point so i kinda need the algorithmic versions, not the mantissa hack. Interested in researching these... In a few years, when i have time to spare this issue
You should go over Portal64, maybe you can help with the renderer bottlenecks
Hi Kaze, love your coding tech video but, frankly, I almost don't understand them, surely you are very talented indeed. One simple question: what do you think about Saturn processing power? The SCU 48-bit DSP inside SS can actually be helpful in matrix transformation aside from the sual indipendet 32bit CPUs?
that's jaw on the floor type stuff
9:22 hey, don't need to call me out like that.
i get it, its some cool math.
i honestly had to do a double take when you mentioned 4th roots being multiplied by 2 magic number to cancel out each other's error.
that kind of line up is nonsensically precise to land on, and tis just here now.
Fast Inverse Square Root my beloved
My mans over here fuckin hand counting cpu cycles to save 😂😂😂
Every time you you upload I get amazed. Just hoping to play the mod at some point too.
Quake 3? I'm almost certain it was back in Quake 1. There was a discussion about whether it was Carmack or Mike Abrash, and turns it could have been Gary Tarolli at 3dfx? My memory of that era has long faded.
Oh yeah. This makes complete sense. We need more compute stuff. I understand everything you said……
I still listened to the whole video.
Kaze, if I was capable of understanding the concepts that you break down in your videos I know I would be able to learn things from you. But because I don't code or anything, the only thing I can truly say I learned today is that you were able to make the n64 go vroom vroom even more!
I am happy and grateful for you and the others that dedicate time into development for the n64. Thank you for what you do.
Btw is there any updates on the progress of Return to Yoshi's Island?
course 11 almost done right now! ill post some previews soon of the remade demo levels.
Yeah that's great news! Thank you for the update, and I'm sure we are all excited for it.
this video is basically melatonin, with all the zelda songs and random math. I love it. good job
Nice research, and interesting results. Bravo!
This is beautiful. I've no idea what you're saying but it sounds great.
8:46 calling an inverse fourth root a "fourth inverse square root" really made me lose track of what was happening for a moment, lol. Great video though
The camera normalization makes it look like the camera guy is tipsy. Neat
Thank you for reminding everyone it also screws up the rendering temporarily! Most forget that
1:30 I'm sure finv + div is more expensive than sqrt, but I am certain there is a way to make a finv-like function that only computes the square root without taking the reciprocal. That would be a fairer comparison in that part of the video.
It's been a minute since I did any programming, but this is super interesting stuff.
Doing proper maths revision? Nah. Watching someone optimise the N64's maths? Hell yeah.
The N64 really was an insane piece of 3D hardware for a home console, it's such a shame that they decided a memory minivan was good enough substitute for the memory bus. I wonder what kind of CPU-intensive software we would have seen if that one bottleneck was enlarged.
Are you aware of the Portal64 project? It seems like something you'd enjoy following.
The fast inv sqrt function isn’t “magic” like many would assume. An x87 FPU is focused on precision and id wrote that code to to use fixed point math where that IEEE 7.54 precision is not required. Never assume that one CPU architecture works the same as x86. Fast on x86 doesn’t mean fast universally.
You almost make me want to get a Computer Science degree.
Almost.
the bobomb factory looks like an early gamecube game!!!
I can't wait for the optimized robot uprising
Even though this is mostly specific to the N64 I'm going to save it to my game dev list anyway :) Maybe some day I'll find myself optimizing some code where this is useful and remember that it's in there
Definitely at the end of the video a shocked Pikachu face for myself.
I have a question... how "friendly" is your code?
The work you did to optimize SM64's source code is nothing short of amazing, but from what I've heard, code optimization usually comes at the cost of readability. If someone who's not that experienced with C tried to read your code, how much do you think they would struggle to understand everything you wrote?
EVERYTHING? they would struggle a lot with the new graph system, but i also doubt that someone like that wound't also struggle understanding matrix math in the first place, which is a prerequisite to understanding the code at all.
With what he's doing, he probably keeps the original code in place, with some conditional compilation statements to switch between the two sets of code (so the original code with the clearer intent stays visible), with some reminder comments in the code about what the optimizations are doing, maybe something along the lines of /*using a modification of Quake 3 inverse square algorithm here, constant hex value obtained through empirical brute force test on real N64 hardware*/.
@@jamesfrankiewicz5768that's not really possible unfortunately, huge parts of how the engine work have been changed and most vanilla mario 64 functions don't have an equivalent function in RTYI any more.
Let's be real. If anyone needs Kaze's code, they're probably gonna need it as a library. This is black magic.
Just imagine, people making new n64 games using libkazen64
this question was good until we reached "who's not experienced with C tried to read your code".... if you aren't experienced with C it doesn't matter how much you comment they aren't going to understand it nor any of the other principles being used. It *should* be commented and well documented in places where major changes were made and as to why, but for it to be readable for people who have no idea what the concepts are? That's stretching way beyond the purpose of documentation/commenting. Let alone the other realm of concepts like the mathematical operations and hardware specific refinements
I promise you I learned *nothing* but the video was fascinating anyway.
Oh fuck yeah I am excited for this video. I really dig the deep dives on code optimization like Michael Abrash's Graphics Programming Black Book.
I know nothing of coding, but it would be "Silas's Idea". With names you do the "'s".
Hi Kaze, very cool video. i have a question about a mario 64 port for the playstation classic, it runs very well (60fps) but in under water scenes the framerate drops below 60fps. i watched you videos and question myself: is it possible to build a port for the playstation classsic with your enhanced version of the source code and will it reduce or completely remove the stutters ( framerate drops)?
My best guess.. Some optimizations might be helpful but Kaze has made a lot of very N64 hardware specific improvements, and since the hardware is different, It's not at all certain they would work as well.
I think it is important that the optimization did not came from CS, but from Mathematics theory.
You can use the inverse square root algorithm in more cases. The other best example is probably physical simulations, more exactly, simulations of gravity, electromagnetism, etc. since those follow the inverse square law. It's probably obsolete by now though, and i have no idea why you would ever want to build physical simulations on an N64 lmao.
There actually is a fast square root algorithm that uses the same technique as the fast inverse square root algorithm. I don't know if it would be useful at all on an N64 though.
TLDR 0x5F375A86 might be a better magic number
I just remembered I read a comment a year ago on a doom video that mentioned a more accurate magic number. I don't know anything about it but it may be of use:
"Using the default magic number of 0x5F3759DF actually gives about 0.175% error maximally, dropping thereafter. It is now believed the original author of the function found the constant through trial and error rather than derivation.
If you use Chris Lomont's updated magic number of 0x5F375A86 (which he derived), the error drops to a staggering 0.09% after just 1 iteration."
i'm doing 0 iterations. if i was to do 1 iteration, i'd optimize all 3 constants and i think the error would be even lower.
1:11 well use builtin sqrt, then run a parser in cpu cache that grabs this result, and performs inversion within cpu cache, so you only would calculate the square root and write to cache buffer periodically when needed
So u get best accuracy with more vroom vroom
Less ram latency
Have you talked to the portal 64 dev? I think it would be a great colab for you guys, I'm sure you could help.improve the code.
Make the bus go vroom vroom...OH SHIT, Oh god...that was fantastic xD