And so now we know why the N64 was the "less performant" console in spite of the lack of disc read times. To squeeze all the potential it had out of it, you had to be a fluent practitioner of the dark arts, on top of breaking several coding Geneva convention guidelines.
For your reverse loops, for(int i = x; i-- > 0;) would iterate x - 1 to 0. I've also seen it written as for(int i = x; i --> 0;) to make it look slicker.
Once I was writing a loop in assembly that did two different things at several points based on a boolean state variable. Instead of checking the variable each time and branching I just wrote two variants of the loop and jumped between the equivalent point in the loop whenever I wanted to change the variable, eliminating said variable.
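For illustration, a rough C sketch of the two-copies-of-the-loop idea described above; the helper names are invented and only show the shape of the control flow:

    int  step_common(void);   /* hypothetical: returns nonzero when the state should flip */
    void work_mode_a(void);   /* hypothetical per-state work */
    void work_mode_b(void);

    void run_state_machine(void) {
    mode_a:
        for (;;) {
            if (step_common()) goto mode_b;  /* jump to the twin loop instead of testing a flag every iteration */
            work_mode_a();
        }
    mode_b:
        for (;;) {
            if (step_common()) goto mode_a;
            work_mode_b();
        }
    }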
Wouldn't writing some functions entirely in assembly be more readable than having to inline some asm instructions here and there to force the compiler into producing the asm that you want? At least for short functions.
GCC is also very enthusiastic about exploiting undefined behavior, and will usually use it to do things you don't want. You have to deliberately disable certain optimizations in order to get away with faster code, which will just make code elsewhere slower.
@@undefinednan7096 you don't get the benefit of inlining in that case. I would definitely recommend more inline asm, but I'm working in 68k, which is probably a bit easier to write than MIPS. You do get used to gcc's inline asm syntax eventually; it's all over my engine now.
@@Ehal256 good point, and I don't think you could use LTO to easily fix that. I'm not so sure that 68k assembler is actually easier to write than MIPS asm (a lot of places used to use MIPS as the intro to asm)
Though a word of caution: using inline assembly can sometimes discourage the compiler from optimizing certain things since it can't be sure of certain assumptions anymore, and if not, it's likely the code will break once you change optimization levels. I always avoid inline assembly inside of C functions as a result and only really use it in naked/standalone functions.
if this made the compiler any worse, i would notice it. in fact, gcc's mips compiler has a bug that makes it insert a few unnecessary NOP instructions after float comparisons, but if you have any inline assembly in the same compile unit, it will discard these NOPs. GCC's mips compiler works better with inline assembly
@@KazeN64 Might be compiler/arch dependent then. I wonder why NOPs are inserted by compilers anyway; they're usually removed when turning on optimization. Alignment/padding maybe?
@@joopie46614 As was mentioned in the video, some N64 models have a hardware bug, which makes them fail to calculate two floating point multiplications in a row. Rather than treat that as a constraint and place other instructions in between multiplications, GCC instead just replaces every float multiplication with a float multiplication followed by a NOP, considering the bug to be too rare to be worth accommodating properly.
@@angeldude101 Damn. I can think of issues with a SNES C compiler already (basically, multiplication and division requires you to wait a couple cycles before you can read the result) but Super FX gets particularly difficult with its ROM and RAM buffering system (basically, if you want to read from ROM, you have to set the pointer which also tells the processor to cache one byte whereas writing to RAM doesn't write to the memory directly but also goes through a caching system, both which take time).
First, great video, but I'd like to point out a few things for people who might not know them. A key thing that cannot be emphasized enough is that you need to measure whether these sorts of optimizations actually speed up your code. For example, the self-modifying code trick will often make your code slower, since to make it work reliably you'll need to invalidate the icache line (and flush the dcache line if you used cached access), and the possible extra reads from RAM will often cost you more than you gain. Also, I think you know this, but for other people's benefit: you want to modify code while you're executing far away from it (MIPS CPUs won't execute modified instructions if the instruction is already in the pipeline -- this isn't so bad for the VR4300, which only has a 5-stage pipeline and is in-order scalar) and in bulk (so you can minimize the number of cache flushes needed). The more modern the CPU, the worse self-modifying code is in general.
Also, if you need to do a ton of BS to get the compiler to output a specific sequence of instructions, perhaps you should just code the function or performance-critical piece of code in assembly. In the example of atan2s, it already has large portions of assembly, so making the code less fragile is probably worth it.
For the struct Surface, 0x16 isn't 4-byte aligned, so the compiler should automatically insert the padding, so the two structs should have exactly the same layout, which agrees with the offsets in the code you show onscreen (although it could be clearer to explicitly show the padding).
yeah i'd love to just write it in assembly, but i find it to be a huge pain to write inline assembly under GCC. it never seems to work right beyond small snippets. yeah i added the padding on the left side manually just to showcase this better. (although there are probably some compilers that would realign the struct on the right to make it smaller? or at least, there should be some settings that do that automatically for you)
@@KazeN64 I'm pretty sure that "setting" is just the packed attribute ( __attribute__((packed)) ), which will place the struct members such that they occupy the least amount of space (byte-wise) even if it breaks type alignment rules (like ints needing to be 32-bit aligned)
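For anyone curious, a minimal sketch of what that packed attribute does; the field names here are made up and not SM64's actual Surface layout:

    #include <stdio.h>

    struct Padded {
        short flags;   /* offset 0 */
        float height;  /* compiler inserts 2 padding bytes so this sits at offset 4 */
    };                 /* sizeof == 8 on typical targets */

    struct __attribute__((packed)) Packed {
        short flags;
        float height;  /* placed right at offset 2: smaller, but unaligned access */
    };                 /* sizeof == 6 */

    int main(void) {
        printf("%zu vs %zu\n", sizeof(struct Padded), sizeof(struct Packed));
        return 0;
    }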
I think this depends on the processor, because not every one of them has got a decrement-and-branch-if-zero opcode. This is true for most of the 65xx family (exceptions like the SPC700 do exist), which lack this opcode but have BPL/BMI (Branch if PLus/MInus, essentially branch if < 0x80 / >= 0x80), which allows you to run up to 0x81* iterations like this. In the end, the only advantage is that it makes assembly code easier to read and write because you don't need to append a -1 onto every indexed variable. *To explain why 0x81 specifically: most loops are best implemented as a do-while (execute code first and then decrement the loop count) because you'd otherwise run two branch opcodes (one which does the comparison, the other to jump back to the loop). Because you take the offset into account, a loop count of 0x81 means you store 0x80 into memory. Once the end of the loop is reached, decrement the loop counter which, if it's 0x80 (memory value), becomes 0x7F which is positive.
Hardware-aware optimization is just nuts. The most portable example I can think of is FFTW, which computes fast Fourier transforms by making plans tuned to input sizes and hardware instructions. The most optimal use is to save plans and reuse them on inputs of the same size.
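A minimal sketch of that plan-and-reuse pattern, assuming FFTW3's standard C API (link with -lfftw3); the transform size and loop are arbitrary:

    #include <fftw3.h>

    #define N 1024

    int main(void) {
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);

        /* FFTW_MEASURE benchmarks candidate strategies for this size on this machine */
        fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

        for (int frame = 0; frame < 100; frame++) {
            /* ... fill 'in' with this frame's samples ... */
            fftw_execute(plan);   /* reuse the tuned plan instead of re-planning */
            /* ... consume 'out' ... */
        }

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }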
... What I'm getting from this is that when you will be sent back in time to make the fully optimized Super Mario 64, they will either kick you out for writing this code or praise you for your genius.
They didn't have the same compiler, so some of his hacks wouldn't be usable. Back in the day, if you wanted power, you wrote assembly functions and used C as "super assembly".
since readability is going out of the window, i think kaze should keep a copy of the engine's code from before these dramatic changes, because when he releases the source code of rtyi it will be so hard to read and so tightly coupled to its function that no other hacker could make a new hack from that source code.
@@Minty_Meeo it is just having a copy of the previous code. With all the previous performance improvements it should be enough for the majority of hackers. There is no need for further improvements for 99% of the rest of mario 64 devs.
13:03 this was very new for me. I thought right shift always fills the most significant bits with 0, but I just learned that there is a difference between logical and arithmetic bit shifting. Logical shifting fills all the new bits with 0. Arithmetic right shifting, on the other hand, copies the MSB into all the new bits to guarantee the result will be half of the original value, therefore performing an arithmetic division by 2. Arithmetic shifting seems to be machine specific though; C doesn't guarantee it.
That's also a thing which confused me at first when I read the Super FX instruction set (it also has a DIV2 which is ASR except for input 0xFFFF which becomes 0 but notably no LSL nor ASL) since I've worked on 65816 ASM (SNES dev) which only has ASL and LSR but no LSL nor ASR. Fun fact: The 65xx ASL is actually a logical shift because ASL doesn't take the overflow flag (i.e. an "unexpected" change from negative to positive and vice versa) into account. It seems like the only reason why they're named that way is to make it easier to see the direction the shift applies, lol.
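A tiny C example of the logical-vs-arithmetic distinction from a couple of comments up (the values are chosen arbitrarily):

    #include <stdio.h>

    int main(void) {
        int      s = -8;
        unsigned u = 0xFFFFFFF8u;

        /* on most compilers/targets, signed >> is an arithmetic shift (sign bit
           copied in), but C leaves the negative case implementation-defined */
        printf("%d\n", s >> 1);   /* typically -4 (arithmetic shift) */
        printf("%u\n", u >> 1);   /* 0x7FFFFFFC  (logical shift, zero-filled) */
        return 0;
    }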
My favorite instance of int bit manipulation to do float math faster is absolutely the fast inverse square root. You (probably) know it, you (probably) love it. Take the bits of a float, turn them into an int, bit shift it, subtract from 0x5f3759df, and convert back to a float. Improve the approximation with Newton's method and you're done!
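For reference, a sketch of that trick in the classic Quake III style (using memcpy instead of the original pointer cast to stay within strict-aliasing rules):

    #include <stdint.h>
    #include <string.h>

    float q_rsqrt(float number) {
        float x2 = number * 0.5f;
        float y  = number;
        uint32_t i;
        memcpy(&i, &y, sizeof i);       /* reinterpret the float's bits as an integer */
        i = 0x5f3759df - (i >> 1);      /* magic constant minus the shifted bits */
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - (x2 * y * y));  /* one Newton-Raphson refinement step */
        return y;
    }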
You just exploded the heads of many clean-coders 🤯😜 In 8-bit assembly there were no caches, but squeezing every cycle from the hardware was the norm 😂 Nice tricks, by the way 💪
Reminder: the number of possibilities for just a combination of 2 bits per bit is the sum of the numbers assigned to the bits. Basically, a factorial-like bit for addition: it's 1+2+3... you get it. And 4 bits is a nightmare. Especially in Minecraft.
Some really neat tricks in this video, but dude, honestly at a certain point it might be worth just hacking gcc to add a pass that does some of these transformations for you since they're so mechanical 😂. Or even just writing a preprocessor that takes sane C and spits out optimized C (you could do this with python's pycparser, for instance).
2 Common Questions I'd like to answer:
1) Why not just pass a vector pointer to the lateral distance function?
Because load instructions under MIPS have a constant offset you can provide. By assuming how far away from the pointer the field is, we can cover one special case in 1 cycle less and have all the other cases work in the same number of cycles. Making a special case for the lateral distance function and then constantly repurposing it is an optimization!
2) Why don't you just use inline assembly?
- I need my code to be in this exact spot for cache coherency, and using MIPS inline assembly GCC-style is too much of a pain. I wish there was a way to just write assembly?? But somehow there isn't.
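A rough illustration of the constant-offset point in C; the struct and offsets below are invented for the example, not SM64's real layout:

    /* hypothetical object layout, just for illustration */
    struct Obj {
        float header[4];
        float pos[3];   /* loads of these fields become MIPS loads with the field
                           offset baked into the instruction, e.g. lwc1 $f4, 16($a0) */
    };

    float lateral_dist_sq(struct Obj *a, struct Obj *b) {
        float dx = a->pos[0] - b->pos[0];
        float dz = a->pos[2] - b->pos[2];
        return dx * dx + dz * dz;   /* callers already holding the object pointer
                                       skip computing &obj->pos before the call */
    }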
Would another compiler like LLVM/Clang maybe be able to provide better behaviour when compiling? Or maybe it's worse XD.
I’d love to see your work done to Goldeneye on the N64, maybe get it up to a half decent frame rate.
> But somehow there isn't.
Not sure if sarcasm, but you already built a custom GCC. Why not build GAS as well? It builds along with GCC if you configure the project with --with-gnu-as.
You should look at the Union feature in C++, and the Restrict feature in C.
You can write entire assembly functions in a separate source file and link them in. No more fighting the compiler that way.
"how much did you improve the performance?"
"by roughly 2 fps."
"what did it cost you?"
"i can no longer read the code, let alone make any changes."
"was it worth it?"
"you bet"
But hey, rambus go vroom vroom
Gonna make the most incomprehensible code ever, not even the most advanced of AIs being able to accurately read it, to make the game 0.1% faster, worth it (dw it adds up to 0.7% with other stuff)
"what did it cost you?"
"the resulting binary can no longer be disassembled because i use invalid opcodes, you have to poke a NOP at certain memory locations, disassemble it and then add the missing instructions back, by hand."
You realize it's completely worth it because it's a systems function that doesn't need to be read or changed. It's like that one fast inverse square root function. Literally who cares if it's impossible to follow?
@@GeorgeTsiros 😭😭😭😭😭😭😭😭😭
Man, whatever glitches you miss when you release this are gonna be wild. You're playing with extremely volatile logic alchemy here, even more unsafe than Pokémon Gen 1.
i had had to deal with some insane bugs before LOL
@@KazeN64 A video about some of those would be very entertaining lmao
@@javaguru7141agreed
@@javaguru7141 this ++++
@@javaguru7141 oh that could be cool!
Rule 1 of systems coding: "the compiler is smarter than you are"
Rule 2: "except when it isn't"
I always go by "The compiler is not your friend." which I read somewhere on the internet a long time ago. ;w;
@@colonthree "YOU NEED TO BE THE COMPILER !"
@@colonthree but it is your coworker...
Wait? Who's the compiler and who's the compilee?
compiler? i hardly know her!
i always find the most unreadable code is when they start utilizing undocumented bugs in hardware, because the code you're looking at, even if you 100% understood it, is not doing what you think it does. or when they start using undocumented instructions that do a weird combination of things. this is what makes making emulators so hard to make.
"makes making emulators so hard to make" 🤣
Luckily the N64 doesn't have too much of this on the CPU front; the only bug is shown in this video. There are also no undocumented instructions. Where, for example, the NES would partially execute and effectively combine multiple instructions when encountering an unknown opcode, the VR4300 (the CPU used by the N64) throws a Reserved Instruction exception instead. Deterministic cases of undefined behaviour (e.g. divide by zero) could in theory be abused, but I'm not sure if that ever ends up being useful.
@@i.8530 Are you also an N64 programmer? You know a lot of things!
@@dacueba-games like im likely to literate too much literation
@@Selendeki Yeah, like I've never coded on N64 but I'm a programmer by day and I love videos like this
As a software engineer myself, this was an amazing implementation of optimization and horror story of indecipherable deep magic. Nice job as always 👍
What do you work on? 99% of software engineers work with Java or C# and barely know what a clock cycle is
@@uwirl4338 yup, but besides hating it so much, Java brings bread to the table
@@uwirl4338 Java, server management. So my optimization is more in the form of SQL query optimization, data manipulation, multithreading, etc. However, I did dabble in asm and low level C in College. Also, I love watching videos that go into such detail for games, like Kaze Emanuar or Retro Game Mechanics Explained, or plenty of other speedrunning channels that talk about things like exploiting arbitrary code execution vulnerabilities!
@@uwirl4338 Well, most software engineers (at least those using Java and C#) work on some sort of "business software". And for this type of software maintainability, extendibility and "getting it done on schedule" is way more important than performance. In the end it's usually a lot cheaper to throw more cores or more memory at a problem, than paying a guy 6+ weeks to optimize the code. And if it is ever necessary to optimize the code, it's usually sufficient to switch to some other algorithm or do "high level" optimizations, not actually chasing a few clock cycles in each function.
Oh and obviously: Java and C# aren't even compiled into machine code ahead of time, so all of these optimization schemes wouldn't work in the first place. Nevertheless it's damn impressive to see these techniques in action and I love hearing about them.
Honestly I think most of the functions here (except the last one??) would be fine if they were explained with a page-long docstring explaining the theory behind the behavior of each function.
Novel-length docstrings are also a good signal for junior devs to stay away from that code.
I work in a codebase that is at least 60 years old. Looking at older versions written for IBM mainframes, everything looks like this. My favorite was, instead of having a boolean array for if a region had a particular property, they instead stored the machine code for "no-op" and "jump to function" where relevant and executed the array directly.
classy jump table, gotta love em
In 5 years, Kaze will have moved on from coding entirely and will be building this game transistor-by-transistor to improve on the inefficient architecture of N64 cartridges.
I actually laughed out loud, this is pure gold 😂😂
All copies of RTYI will require the special KAZE-1 enhancement chip on-board to boost performance.
At that point might as well modify the console to make it as optimized as possible
I guess the logical conclusion is Kaze pushing computer tech forward.
I think the next step is him writing his own compiler
It feels like every video takes me even deeper into a rabbithole with more and more cursed and even cleverer tricks.
And it's always so interesting.
There is something just beautiful in writing code that is optimized so "badly" that sometimes it doesn't even compile properly
For anyone thinking of implementing this generally, note Kaze is optimising for his specific purpose on his specific system and his specific compiler, and is testing everything in Compiler Explorer. Don't start doing this stuff needlessly in your own code blindly just "because Kaze said faster". If you're not writing an optimised N64 game then you need to do the profiling, research, and proof work all over again. Plus you actually need to save 1 or 2 cycles, which is unlikely on any x86 computer made in the last 30 years.
Plus you have to be super careful not to over-optimize for a specific implementation of x86_64, which then becomes horribly slow or even downright broken on others. These sorts of micro-optimizations are almost never worth implementing unless your target hardware is 100% fixed.
And also, compilers nowadays are pretty smart about compiling to architectures that are not 20+ years obsolete.
i'm using modern GCC unfortunately
@@KazeN64 Is gcc's mips optimizations as good as its x86 optimizations? Or does architecture not affect how "smart" gcc is?
I wouldn't be surprised if James Lambert is sitting there taking notes for potential optimizations in Portal 64 every time Kaze releases a new video.
In some old games that were super optimized in the mid-to-late 90s, I saw that they used to keep 2 versions in their source code: a cleaner version for porting to other systems and a dirty one that was platform-specific. I remember some games had both a C and an ASM version in some cases.
i'm surprised they actually cared that much to optimize code. that's a neat way of handling it. luckily for sm64, most of the functions still have an equivalent of the base decomp if anyone wants to port this stuff. (i also usually leave a pure C version of the code above it for everyone to be able to figure out what the code is supposed to do)
@@KazeN64Why do you "force" gcc to produce the machine code you want instead of just implementing the function in assembly? Is it the calling overhead?
@@uwirl4338 its just really annoying to do inline assembly in GCC. it throws so many errors. it's easier to get it right by just throwing in small assembly snippets into the C imo
@@KazeN64 I agree, the syntax is horrendous and doesn't feel like asm at all
They made so many games, but rarely released source code for most of them.
5:55 first answer on stack overflow is always one saying your question doesn't make sense.
I'm surprised it wasn't marked as a duplicate.
this has been posted before.
thread closed
This reminds me of the story of Mel the "Real Programmer" from the jargon files. Utterly indecipherable code which nonetheless is so good it's practically magic.
And Kaze even made the impossible blackjack game with SM64 Chaos Edition…
Oh my god, this was amazing, the ending line "you're welcome" is pure comedy gold. I'm never gonna use any of this but I am sending this to as many people as I can.
As someone who's intimately familiar with the RISC-V instruction set, it's interesting to see how MIPS, the N64's instruction set, is so similar that most optimizations would still apply to a RISC-V system. A stand-out example for me was the global pointer optimization avoiding a two-instruction LUI sequence at 4:11.
RISC-V was based on MIPS. You can think of it as if MIPS was designed again but with an extra ~25 years of experience.
@@Sauraenso they fixed the branch prediction design? /jk
Good video 👍 (it came out 2 minutes ago and theres no way i watched it all)
👍👍👍👍
👍👍👍👍
👍👍👍👍
👍👍👍👍
I 👍 the comment without reading it
i like the slow pan to the "This doesn't make sense" reply on the stackoverflow page
The 'goto' part got me thinking the next step here would be to implement a N64 specific compiler
i already have some n64 specific tricks in this compiler! its a custom gcc build.
@@KazeN64 Are they N64-specific or your-game-specific? Is your GCC fork public? I can imagine the libdragon guys being interested in this (unless they already implemented the same tricks in their toolchain themselves).
@@the_kovic easyaspi made it and it is publicly accessible! although she's only posted it in my discord. but i have the custom build linked in this repo's readme
@@KazeN64
Wait, since this is a custom fork of gcc: is there a way to patch in a sort of assembly passthrough in the compiler, so that you can write in both C and assembly without doing the weird inline assembly trick you mentioned?
Probably overthinking things as usual, but who knows.
I've had this idea for a while but I'm not sure how to implement it. It would basically be like Godbolt Compiler Explorer but it also showed you clock cycles. Like if you highlighted a section of code, you know how it shows the total character count? Imagine that but it also assembled it and displayed the byte count and clock cycles.
Most of the time I treat math functions as black boxes anyway, so the fact that an evil wizard built them with black magic doesn't bother me. Unless something goes wrong and I have to open the box...
I love insane code techniques like this, even if I don't understand totally normal code techniques
7:04 You can avoid this issue entirely by moving the decrement of the loop variable into the body of the loop so it executes *after* the loop condition and *before* the first loop iteration. This is also very useful for decrementing loops that use unsigned integers.
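A minimal C sketch of that pattern; do_work and the array are placeholders:

    void do_work(int item);   /* hypothetical */

    void process_all(int *items, unsigned count) {
        unsigned i = count;
        while (i != 0) {        /* the condition stays a cheap compare with zero */
            i--;                /* decrement after the test, before the work */
            do_work(items[i]);  /* visits indices count-1 .. 0, count iterations */
        }
    }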
8:15 I work as a graphics programmer for an engine at my job and this style of coding is incredibly common on the low level parts of the rendering pipeline. I always thought the person who wrote these was insane, but now I finally understand.
*They really are insane*
i love how i technically understand what i just saw while at the same time i don't understand the level of wizardry this guy is performing
Exactly this honestly
Same. I mainly deal with powershell and .Net and there are times that I don't like the way the interpreter handles something, so I end up calling a .NET class because its faster at the expense of code readability.
But hey, if I am needing it to do this thing hundreds of times within a large dataset it is worth it.
Always add comments in case you forget why.
DAMN those are some insane micro optimizations, when you said you sacrificed readability for performance, I didn't think it would be THAT bad xd
I hoped it would be even worse! 😂
Back in the day running your for loops backwards was a huge time saver in JavaScript running under Internet Explorer 6. Because it was Internet Explorer.
Just have to be careful the execution order of the items being looped through doesn't matter...
It's always better to run them backwards. But I'm guessing back then the compiler didn't rewrite them as backwards for you?
This was a trick I used to use back when my main PC was Z80 based. That was a long time ago. 😁
@@tolkienfan1972 I'd even go so far as to store my data backwards sometimes if it needed to be in a certain order. Makes me wish assemblers supported something like:
.backwards
db &05, &10, &15, &20
.endb
"This might get you fired from your job, but it makes our rambus go vroomvroom, so I'd say it's worth it."
It's always worth it.
_Kaze's next video:_ "Now we're going to delve into the apocrypha of the forbidden Old Tongue spoken only by native R'lyehn because it'll save us a whole 3 cycles and regular code just won't cut it."
Re: 4 this is so cursed and amazing
Re: 7 separating the "hot" and "cold" branches of a function is a good one. There are a few variants of this that may or may not help (usual caveats: depends on arch and runtime patterns). Instead of using "goto" it might be better to call a separate non-inline function. You can put all the "cold" branches in a separate section, so that more of your "hot" paths are in cache. In cases when the "hot" path is very small and common, you could also inline the hot path.
Re: 13 at some point it is clearer to just go full asm :)
as a gamedev of intermediate skill, i feel like this video is expanding my brain at such a dangerous speed that it will be permanently damaged. perfect video, i'll watch as many of these as you make
Personally, I see no problem sacrificing readability for performance, as long as you A document it, and B the performance gain is relevant for your use.
Well, tbh the principles are more straightforward than that. When you write code, your very first objective is to make it produce the desired result or outcome. Second, you make it robust - you take into account edge cases and variations. Third, you optimize it. And you have to do it in that order specifically, since there is no sense in optimizing code that doesn't work.
And when you optimize, you basically sacrifice maintainability - i.e. you don't expect to change that code much ever again; it is a dead end.
I'm pretty sure casting a function pointer to another one with a different return type would be undefined behavior if you didn't build for a single platform. It's funny in a sense, you don't have to ever worry about undefined behavior since you can always test to see what the n64 does.
that seems like something that would be undefined in terms of the outcome but it would have a pretty clear definition as to what it should compile to
idk the whole idea of "undefined behavior" kinda vanishes when you have 1 compiler and 1 architecture. everything's defined by how the compiler works in the end (assuming there is no random noise it pulls from)
@@KazeN64 Pretty much, I always get annoyed when people say "tHe cOmpiLeR cOuLd dO wHateVer iT waNtS!"
1:45 The lie you tell the compiler here is a neat trick, but I wonder if the cast can be avoided with an assume attribute instead, to just tell the compiler that the value is in a valid range?
oh i didnt know that was a thing to use
@@KazeN64 I've seen an ASSUME(x) macro implemented as if(!(x)) __builtin_unreachable();
@@tolkienfan1972 gcc's assume attribute is even smarter, since it doesn't semantically execute the code. That means you can tell it to assume things about the result of variable modifications.
@@gfasterOS I guess that's new. Well, new to me
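A small sketch of how the ASSUME macro from the replies above can carry a value-range promise; the function and range are made up for illustration:

    #define ASSUME(x) do { if (!(x)) __builtin_unreachable(); } while (0)

    int wrap_angle(int a) {
        ASSUME(a >= 0 && a <= 0xFFFF);  /* promise: already a 16-bit angle */
        return a & 0xFFFF;              /* with the assumption, GCC may elide the mask */
    }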
The footnote of "I use a custom GCC build" is insane enough on its own for a whole video
Kaze: Uses "goto" to swap logic for instruction caching
Me: YOU WHAT?!
Kaze: Overwrites functions with memcpy for a properly optimized call path to avoid bugs per console
Me: *I AM CALLING THE POLICE!*
i have honestly never seen the goto thing before, both are illegal but i'll permit it. I've overwritten code with raw machine code bytes at runtime before, e.g. an E8 call on x86.
E8 64 00 00 00
calls the function 100 (0x64) bytes after the end of that instruction (the rel32 offset is stored little-endian)
How do you think dynamic linking, or loading executables into memory, for that matter, works? Also, lookup STT_GNU_IFUNC
I couldn't read your "YOU WHAT?!" in any voice but Incidental 6's (from Spongebob). You know which one I am talking about.
😂
@@GreyWolfLeaderTW "You asked for a couple of ice cubes, but I only put in one!"
At this point you will end up rewriting SM64 in assembly for performance 😂
Or making your own SM64 compiler.
I love these videos so much because I know nothing about coding but I watch them like I get it.
I've been coding for over a decade, and i kinda feel the same lol.
There's a lot i get, but when he gets into the bit shifting i just have to trust him.
6:52 - same with the 68000 - it has a "DBRA" instruction - "Decrement and Branch", which decrements and branches if not -1. However this means you MUST write:
for ( i = count-1; i >= 0; i-- )
This will NOT produce DBRA (at least not with my version of GCC):
for ( i = count; i > 0; i-- )
...even if "i" is not used in the loop body! Instead it does a standard SUBQ+JNE which is slower. Same trick, but an annoyingly subtle MIPS-vs-68k difference.
That is so frustrating. When I'm doing 68k asm I usually would do something like this:
foo:
MOVE.W #9-1, D0
DBRA D0, foo
for a loop that is intended to execute 9 times
Just as a potential idea you could "document" some of the hardest functions by writing tests against them, assuming you don't have any yet.
Or having the clean code versions available somewhere to indicate what they're supposed to be equivalent to.
I know Kaze has said there are usually equivalent function names in the base decomp project, but I'd love to see optimized, even if not fully, versions of those, but with readability still intact, if available.
I wish each of these videos had before and after benchmarks. I love seeing FPS gains or even microsecond gains.
Remember to use the shortest variable names possible. This will reduce the time it takes for your compiler to process your code into an AST. On top of that use letters that have the smallest pixel area and simplest geometry so your IDE will render them the fastest due to true type fonts. l and I are great for this
The compare-with-zero trick also works on the NES, SNES, and all other 6502-based systems. Whenever you do an arithmetic operation, it sets the zero and negative flags, giving you an implicit comparison with 0.
Even LDA implictly compares to zero!
On x86 too, thanks to the zero flag
@@erkinalp Yep, that's why we use "test eax, eax" so we don't need to encode 4 bytes of zeros
I love your videos so much, optimization exactly like this is my favorite activity on the planet. Seeing a new video from you is like christmas coming early.
As someone who's coded games that fit in a few hundred bytes, I see this as an absolute win.
Boot sector games?
@@EndOfForever In that size range, sometimes smaller. A while ago I made a version of Snake that fit in a Tweet, or under 280 bytes.
Games for TI83 calculators can be pretty small if done well. I made a 5-level puzzle-platformer in around 1.5kb once. But the Bubble Bobble game I made was much bigger, somewhere near 22K.
@@jacklawsen6390 jeez, that's barely larger than Wozmon, and even that was pretty damn small for what it did.
At some point it's clearer to write the exact procedure you want in assembly than to try to trick a compiler into generating the exact assembly you want with cryptic hacks. You have reached that point. 😛
Local yoshi casts ancient voodoo magic in old videogame and create man-made horrors beyond our comprehension in Orlando, Florida.
great video, i haven't cried this much in years.
This is the most interesting video series I've ever seen. I want more exactly like this for other games or pieces of software we're not meant to have the source code for.
I would really be interested in seeing if Goldeneye 64 can be optimized to run at a stable 30 FPS.
I've always loved just how cursed console optimizations are, good work!
I love in-depth videos like these; great work!
Hmm pretty interesting optimization techniques, most of the time idk whats actually happening but your explanation is very understandable, great video Kaze!
It's scary to think, but it seems like that self modifying code trick could actually make a big impact in some real use cases. I mean you can potentially save millions of if-checks with it.
I have done something similar in a Javascript game I'm working on. I had no idea if it was even faster or anything, I just thought it was cool that I could do it. I used it for remapping keyboard inputs to different functions depending on if you're in a menu, a map, a battle, etc.
thankfully, protected mode saves us from even having to consider that on modern CPUs
@@TwilightFlower9 you can tell the kernel to make a read/write/execute page, this is how most JIT compilation works
hell, you can tell the compiler to just make your code writable (or use objcopy, if the way the compiler does it makes things not work how you need them to)
or just use mprotect(2) to mark it as writable at runtime! there's lots of fun stuff you can do
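A minimal sketch of the mprotect approach from the replies above (Linux-style systems; POSIX only guarantees mprotect on mmap'd pages, and the function name here is invented):

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int make_patchable(void *code, size_t len) {
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)code & ~(uintptr_t)(page - 1);  /* round down to a page boundary */
        uintptr_t end   = ((uintptr_t)code + len + page - 1) & ~(uintptr_t)(page - 1);
        return mprotect((void *)start, end - start, PROT_READ | PROT_WRITE | PROT_EXEC);
    }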
Generally when writing a lot of complex stuff like this where it's not obvious what it does, it's not so much that you can't do it, but they really want you to write code comments along the way to explain to other people why it's done that way and warn them what kind of scenarios could break it.
So the more inline assembly you use, the more code comments you need to go along with it... that is to say, the less human-readable the code itself is, the more you have to compensate by having human-readable translations alongside of it. For that code, you'd end up writing a novel explaining it to any programmer that comes after you... and most people would rather just make the code itself more readable than spend half a day explaining their heavily-optimized code to a hypothetical newcomer in the code comments.
But when it's absolutely necessary, you will see tricks like this used along with extensive documentation (and sometimes cursing and F-bombs about how pissed they are they had to write the code this way and how much they hate the compiler, depending on how professional they are).
or for when you come back to it next week
I chose computer science because of you!
I stopped computer science because of Kaze
@@oSmudge lmao
@@oSmudge lmao
@@oSmudge skill issue
@@oSmudge skill issue
I've heard many people talk about decrementing loops being faster over the years, and many people also claim it's false, but this is the first time I've heard anyone so clearly explain that it's about the comparison against zero. It all makes sense now.
Kaze, I am a huge fan of optimization and "form fits function". Even though I don't understand most of what you are doing here - I love watching this content from you. Thank you and keep it up!
An excellent video! I laughed so hard at some of these, partially bc I've had to do the same with microcontroller compilers. When you've got a dinky little 12 MIPS mcu, counting cycles can be important, and improperly handled casts and such by the compiler can eat into that.
The abs and atan2 optimisations will come in handy some day. Thanks 😊
Ngl, I actually miss being able to use some tricks like this to make noticeable improvements for tight inner loops of some of my old dos projects. On modern pcs a lot of this stuff would either be operationally insignificant (especially in view of much bigger gains that can be had for doing more common-sense stuff) or even detrimental. But I would be lying if I don't occasionally have things like code-locality cross my mind when writing even the most mundane of functions sometimes. I totally get the whole 'Now my day is ruined' comment about inverting logic lol
good video explaining deranged optimization specific to your hardware, love to see it
In 2 years we will see Kaze calculating and checking the memory himself on the go, just to save 2 µs and 2 memory reads
YES! This is how C/C++ was meant to be used!
Audibly said "Oh my fucking god" when you said self-modifying code. You madman. You artist.
Kaze wants every line of his code to be tagged with a "WTF" comment by a 3rd party looking at the code. The same way the Fast Inverse Square Root was commented on in the Quake 3 code (not Doom).
The comment about your first ever programming undertaking reminded me of an elementary school friend I had who spent their time, at ten years of age, hacking Mario World by simply editing the ROM directly with a hex editor. Great video!
This really feels like we're delving into the dark arts at this point, learning things mortal man was not meant to know & developing dark powers out of reach of the gods.
All to make Mario run a little bit better!
*insert meme about dark side of the force and unnatural abilities*
It's quite interesting that the reverse loop count isn't MIPS-exclusive but actually fairly natural. In fact, as an SNES dev: although the 65xx family generally lacks a dedicated decrement-and-branch-if-(not)-zero opcode (exceptions like the SPC700 do exist), it's still faster to decrement a loop counter (preferably an index register) than to increment it, because decrementing a value also sets the zero flag, which determines whether BEQ and BNE ("Branch if Equal / Not Equal", but really "branch if zero / not zero") jump. There's also the Super FX, which has a dedicated loop opcode that likewise counts down rather than up (although there's rarely any reason to use the loop counter directly, and it's easier to subtract a constant from the pointers than to recalculate pointer + index).
In the end, this is indeed a missing compiler optimisation.
Kaze is sounding more like a villain every day
I understood maybe 10% of the video but it was a good time regardless, looking forward to next time!
I wonder if there are any things the compiler maintainers could do to make Kaze's life easier in implementing these optimizations.
Had a coworker who wrote a small script in Python because it is "so easy and fast to write". When he used it he noticed it was too slow to be usable at all (simple test cases with input sizes of maybe 50 elements would work; we needed thousands). Then he spent weeks trying to make his Python faster.
He ended up rewriting it in C++ and was done in 2 weeks (and the program was a couple thousand times faster).
I did some advent of code challenges in PHP this year & BOY IS IT SLOW.
One of my solutions was a brute force of about 4 billion operations, which i broke into 10 processes & then each process took about 3-5 minutes.
Then someone in YT comments said they brute forced in Rust & got it done in a minute or two.
To make it better, my brute force didn't even get me the right answer! Lol.
Also by "operation" i mean ... a loop, basically. Each loop had quite a few things to do.
There was another advent of code challenge i did in PHP that was computationally intense (+ LOTS of memory access) where i was able to optimize the heck outta my php though & actually make it work.
That was my first encounter with memory bottleneck being a real problem.
@@reed6514 Well, PHP is (until very recently) an interpreted scripting language. It will take some time to run anything compute-intensive. It's a wonder how powerful a website-modifying script has become.
I will probably understand this in a year or two
And so now we know why the N64 was the "less performant" console in spite of the lack of disc read times.
To squeeze all the potential it had out of it, you had to be a fluent practitioner of the dark arts, on top of breaking several coding Geneva convention guidelines.
fortunately all of these just save a few microseconds here and there. i'm sure other consoles had similar dark art shenanigans you could do haha
@@KazeN64 gotta squeeze as much as you can out of the 16k microseconds you get per frame
I imagine most of this video applied to playstation as well
For your reverse loops, for(int i = x; i-- > 0;) would iterate from x - 1 down to 0. I've also seen it written as for(int i = x; i --> 0;) to make it look slicker.
Of course, you could just do for (int i = x; i--;), if feeling particularly evil.
@@lassipulkkinen273 I've been working outside C so long I forgot about that. Makes it look like you misplaced the semicolon though.
Ah yes, the mythical "approaches" operator `-->`
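All three spellings are the same tokens to the compiler; a quick sanity check (each loop body sees the values n-1 down to 0):

    #include <stdio.h>

    int main(void) {
        int n = 4;

        for (int i = n; i-- > 0; )      /* i = 3, 2, 1, 0 inside the body */
            printf("i = %d\n", i);

        for (int j = n; j --> 0; )      /* identical, just spaced as "-->" */
            printf("j = %d\n", j);

        for (int k = n; k--; )          /* the terse form: relies on 0 being false */
            printf("k = %d\n", k);

        return 0;
    }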
Yay! New kaze video!
Once I was writing a loop in assembly that did two different things at several points based on a boolean state variable. Instead of checking the variable each time and branching I just wrote two variants of the loop and jumped between the equivalent point in the loop whenever I wanted to change the variable, eliminating said variable.
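A C rendering of the same idea, with a made-up example: the current position in the code *is* the state, so no flag ever gets tested per iteration.

    #include <stdio.h>

    /* Count the characters between double quotes without keeping an
       "inside a string" flag: being in the second loop is the flag. */
    static int count_quoted(const char *s) {
        int count = 0;
    outside:
        for (; *s; s++)
            if (*s == '"') { s++; goto inside; }   /* state flips */
        return count;
    inside:
        for (; *s; s++) {
            if (*s == '"') { s++; goto outside; }  /* state flips back */
            count++;
        }
        return count;
    }

    int main(void) {
        printf("%d\n", count_quoted("ab\"cde\"f\"gh\""));  /* prints 5 */
        return 0;
    }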
Bro is approaching the Kolmogorov complexity for most performant m64 game
Didn't know that concept. And yeah, that's a wet dream come true.
0:38 That brought a tear to my eye. It's so beautiful I'm speechless
Wouldn't writing some functions entirely in assembly be more readable than having to inline some asm instructions here and there to force the compiler into producing the asm that you want? At least for short functions.
probably, but writing inline assembly for mips gcc is torture, i can never get it to just go into the rom and it throws millions of errors lmao
GCC is also very enthusiastic about exploiting undefined behavior, and will usually use it to do things you don't want. You have to deliberately disable certain optimizations in order to get away with faster code, which will just make code elsewhere slower.
@@KazeN64 what about writing out-of-line assembler then and just linking it in as a separate object file
@@undefinednan7096 you don't get the benefit of inlining in that case.
I would definitely recommend more inline asm, but I'm working in 68k which is probably a bit easier to write than mips. You do get used to gcc's inline asm syntax eventually, it's all over my engine now.
@@Ehal256 good point, and I don't think you could use LTO to easily fix that. I'm not so sure that 68k assembler is actually easier to write than MIPS asm (a lot of places used to use MIPS as the intro to asm)
Though word of caution, using inline assembly can sometimes discourage the compiler from optimizing certain things since it can't be sure of certain assumptions anymore, and if not it's likely the code will break once you change optimization levels. I always avoid inline assembly inside of C functions as a result and only really use inline assembly in naked/their own functions.
if this made the compiler any worse, i would notice it.
in fact, gcc's mips compiler has a bug that makes it insert a few unnecessary NOP instructions after float comparisons, but if you have any inline assembly in the same compile unit, it will discard these NOPs. GCC's mips compiler works better with inline assembly
Might be compiler/arch dependent then. I wonder why compilers insert NOPs anyway; they're usually removed when turning on optimization. Alignment/padding maybe? @@KazeN64
@@joopie46614 As was mentioned in the video, some N64 models have a hardware bug, which makes them fail to calculate two floating point multiplications in a row. Rather than treat that as a constraint and place other instructions in between multiplications, GCC instead just replaces every float multiplication with a float multiplication followed by a NOP, considering the bug to be too rare to be worth accommodating properly.
@@angeldude101 Damn. I can think of issues with a SNES C compiler already (basically, multiplication and division require you to wait a couple of cycles before you can read the result), but the Super FX gets particularly difficult with its ROM and RAM buffering system (basically, if you want to read from ROM, you have to set the pointer, which also tells the processor to cache one byte, whereas writing to RAM doesn't write to the memory directly but also goes through a caching system, both of which take time).
First, great video, but I'd like to point out a few things for people who might not know them. A key thing that cannot be emphasized enough is that you need to measure whether these sorts of optimizations actually speed up your code.
For example, the self-modifying code trick will often make your code slower, since to make it work reliably you'll need to invalidate the I-cache line (and flush the D-cache line if you used cached access), and the possible extra reads from RAM will often cost you more than you gain. Also, I think you know this, but for other people's benefit: you want to modify code while you're executing far away from it (MIPS CPUs won't execute modified instructions if the instruction is already in the pipeline -- this isn't so bad for the VR4300, which only has a 5-stage pipeline and is in-order scalar) and in bulk (so you can minimize the number of cache flushes needed). The more modern the CPU, the worse self-modifying code is in general.
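For the N64 specifically, the usual shape of that dance looks something like the sketch below (using libultra's cache routines; the function and variable names here are invented for illustration):

    #include <ultra64.h>

    /* Overwrite one instruction word and make sure the CPU will actually
       execute the new code. Only safe if we're not currently executing
       instructions near `target`. */
    void patch_instruction(u32 *target, u32 newInsn) {
        *target = newInsn;             /* the store lands in the D-cache        */
        osWritebackDCache(target, 4);  /* push the modified word out to RDRAM   */
        osInvalICache(target, 4);      /* evict the stale copy from the I-cache */
    }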
Also, if you need to do a ton of BS to get the compiler to output a specific sequence of instructions, perhaps you should just write the function or the performance-critical piece of code in assembly. In the example of atan2s, it already has large portions of assembly, so making the code less fragile is probably worth it.
For the struct Surface, 0x16 isn't 4-byte aligned, so the compiler should automatically insert the padding, so the two structs should have exactly the same layout, which agrees with the offsets in the code you show on screen (although it would be clearer to explicitly show the padding).
yeah i'd love to just write it in assembly, but i find it to be a huge pain to write inline assembly under GCC. it never seems to work right beyond small snippets.
yeah i added the padding on the left side manually just to showcase this better. (although there are probably some compilers that would realign the struct on the right to make the struct smaller? or at least, there should be some settings that do that automatically for you)
@@KazeN64 I'm pretty sure that "setting" is just the packed attribute ( __attribute__((packed)) ), which will place the struct members such that they occupy the least amount of space (byte-wise) even if it breaks type alignment rules (like ints needing to be 32-bit aligned)
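A small illustration of both behaviours (the struct and member names are invented; the offsets assume a typical 32-bit ABI where floats are 4-byte aligned, and packed is the GCC attribute):

    #include <stddef.h>
    #include <stdio.h>

    /* 22 bytes of members, then a float. */
    struct Padded {
        short type;       /* offset 0x00 */
        short flags;      /* offset 0x02 */
        char  data[18];   /* offsets 0x04..0x15 */
        /* 2 bytes of implicit padding: 0x16 isn't 4-byte aligned */
        float normal_x;   /* offset 0x18 */
    };

    /* packed: members are butted together even if that misaligns them. */
    struct __attribute__((packed)) Packed {
        short type;
        short flags;
        char  data[18];
        float normal_x;   /* offset 0x16 now, and sizeof shrinks accordingly */
    };

    int main(void) {
        printf("padded: normal_x at 0x%zx, sizeof %zu\n",
               offsetof(struct Padded, normal_x), sizeof(struct Padded));
        printf("packed: normal_x at 0x%zx, sizeof %zu\n",
               offsetof(struct Packed, normal_x), sizeof(struct Packed));
        return 0;
    }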
watching code return to assembly makes me tickled in all my favorite corners. yes.. make it unreadable, make it perfect, make it efficient
On reverse loops, instead of looping from N to 1, would it be useful to loop from N-1 to 0 with this? 'for (i = N; i--; )'
!
I think this depends on the processor, because not every one of them has a decrement-and-branch-if-zero opcode.
This is true for most of the 65xx family (exceptions like the SPC700 do exist), which lacks this opcode but has BPL/BMI (Branch if PLus/MInus, essentially branch if < 0x80 / >= 0x80), which allows you to run up to 0x81* iterations like this. In the end, the only advantage is that it makes assembly code easier to read and write because you don't need to append a -1 onto every indexed variable.
*To explain why 0x81 specifically: Most loops are best implemented as a do-while (execute code first and then decrement loop count) because you'd otherwise run two branch opcodes (one which does the comparison, the other to jump back to the loop). Because you take the offset into account, a loop count of 0x81 means you store 0x80 into memory. Once the end of the loop is reached, decrement the loop counter which, if it's 0x80 (memory value), becomes 0x7F which is positive.
@@MarioFanGamer659 thank you for the explanation
"Meh, who needs documentation? I'll just explain the code in various RUclips videos." - Kaze
I really hope he backed up the old codebase because wow, these optimizations are getting insane
Hardware-aware optimization is just nuts. The most portable example I can think of is FFTW, which computes fast Fourier transforms by making plans tuned to input sizes and hardware instructions. The most optimal use is to save plans and reuse them on inputs of the same size.
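If anyone wants to see what that plan/reuse pattern looks like, here is a minimal FFTW 3 sketch from memory (N and the fill/consume steps are placeholders):

    #include <fftw3.h>

    #define N 1024

    int main(void) {
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);

        /* FFTW_MEASURE benchmarks candidate strategies on this hardware;
           building the plan is the expensive part, so do it once and
           reuse it for every input of the same size. */
        fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

        for (int frame = 0; frame < 100; frame++) {
            /* ...fill in[] with this frame's samples... */
            fftw_execute(plan);   /* reuse the tuned plan */
            /* ...consume out[]... */
        }

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }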
... What I'm getting from this is that when you get sent back in time to make the fully optimized Super Mario 64, they will either kick you out for writing this code or praise you for your genius.
They didn't have the same compiler.
So, some of his hacks won't be usable.
Back in the day, if you wanted power, you wrote assembly functions and used C as "super assembly".
this was a beautiful video. thank you
10:15 I know only one guy who ever tried to use self-modifying code for performance gains. He gave up because debugging became nearly impossible
This is why assembly is the best language, you don't have to do some random thing so the compiler doesn't do a thing. You just don't do the thing.
since readability is going out of the window, i think kaze should keep a copy of the engine's code from before these dramatic changes, because when he releases the source code of rtyi it will be so hard to read and so tightly coupled to its function that no other hacker could make a new hack from that source code.
That's a lot of effort at no benefit to Kaze.
@@Minty_Meeo it is just keeping a copy of the previous code. With all the previous performance improvements it should be enough for the majority of hackers. There is no need for further improvements for 99% of the rest of mario 64 devs.
The git commit history is probably all you will get.
I think he does also keep actually readable C versions of all the functions above the bizarro optimized versions for that purpose.
@@inkoalawetrust that's good to know.
You need to write a compiler
I thought I was the only one doing cursed stuff for fun!
Also thanks for the fast absi, that's quite common so it will help. ♥
13:03 this was very new to me. I thought a right shift always fills the most significant bits with 0, but I just learned that there is a difference between logical and arithmetic bit shifting.
A logical shift fills all the new bits with 0. An arithmetic right shift, on the other hand, copies the MSB into all the new bits, so the sign is preserved and the shift behaves like a division by a power of two (rounding toward negative infinity).
Arithmetic shifting seems to be machine-specific though. C doesn't guarantee it.
In practice, GCC does an arithmetic shift for signed values and a logical shift for unsigned values (strictly, right-shifting a negative signed value is implementation-defined in C). MIPS has instructions to do either in 1 cycle.
That's also a thing which confused me at first when I read the Super FX instruction set (it also has a DIV2 which is ASR except for input 0xFFFF which becomes 0 but notably no LSL nor ASL) since I've worked on 65816 ASM (SNES dev) which only has ASL and LSR but no LSL nor ASR.
Fun fact: The 65xx ASL is actually a logical shift because ASL doesn't take the overflow flag (i.e. an "unexpected" change from negative to positive and vice versa) into account. It seems like the only reason why they're named that way is to make it easier to see the direction the shift applies, lol.
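Back on the C side of the shift discussion, a tiny demo of the two behaviours (assuming GCC's usual choice of SRA for the signed case and SRL for the unsigned case on MIPS):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t  s = -64;
        uint32_t u = (uint32_t)s;

        printf("arithmetic: %d\n", (int)(s >> 2));       /* -16: sign bit copied in      */
        printf("logical:    %u\n", (unsigned)(u >> 2));  /* 1073741808: zeros shifted in */
        return 0;
    }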
thank you kaze for ruining coding for someone somewhere at some point
My favorite instance of int bit manipulation to do float math faster is absolutely the fast inverse square root. You (probably) know it, you (probably) love it. Take the bits of a float, turn them into an int, bit shift it, subtract from 0x5f3759df, and convert back to a float. Improve the approximation with Newton's method and you're done!
i have a video about the fast inv sqrt on n64 too if you wanna see how i use it
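For anyone who hasn't seen it written out, this is roughly the classic routine (using memcpy for the type pun instead of the original pointer cast, to keep it well-defined):

    #include <stdint.h>
    #include <string.h>

    static float q_rsqrt(float number) {
        float x2 = number * 0.5f;
        float y  = number;
        uint32_t i;

        memcpy(&i, &y, sizeof i);       /* reinterpret the float's bits   */
        i = 0x5f3759df - (i >> 1);      /* magic-constant initial guess   */
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - x2 * y * y);    /* one Newton-Raphson refinement  */
        return y;
    }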
You just exploded the heads of many clean coders 🤯😜
In 8-bit assembly there were no caches, but squeezing every cycle from the hardware was the norm 😂
Nice tricks by the way 💪
Gotta love the "BIT slide"
Reminder: the possibilities of just a combination of 2 bits for every bit
is the sum of the numbers assigned to the bits.
Basically, like a factorial but for addition:
it's 1+2+3...
You get it.
And 4 bits is a nightmare. Especially in Minecraft.
Some really neat tricks in this video, but dude, honestly at a certain point it might be worth just hacking GCC to add a pass that does some of these transformations for you, since they're so mechanical 😂. Or even just write a preprocessor that takes sane C and spits out optimized C (you could do this with Python's pycparser, for instance).
When you said “cast the function signature” I let out an audible “oh god!”