The Truth About The Fast Inverse Square on N64 | Prime Reacts

  • Published: Dec 13, 2023
  • Recorded live on twitch, GET IN
    / theprimeagen
    Reviewed video: • The Truth about the Fa...
    By: Kaze Emanuar | / @kazen64
    MY MAIN YT CHANNEL: Has well edited engineering videos
    / theprimeagen
    Discord
    / discord
    Have something for me to read or react to?: / theprimeagenreact
    Kinesis Advantage 360: bit.ly/Prime-Kinesis
    Hey I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free and if you want you can get a decent amount off (the free tier is the best (better than planetscale or any other))
    turso.tech/deeznuts
  • Science

Comments • 184

  • @KazeN64
    @KazeN64 6 months ago +417

    The second column becomes irrelevant because all the instructions to run the entire graphics thread can be cached at the same time, so we have no cache misses = no memory used (basically)

    • @AlbertBalbastreMorte
      @AlbertBalbastreMorte 6 months ago +19

      We are not worthy

    • @zeckma
      @zeckma 6 months ago +10

      this needs to get pinned

    • @konberner170
      @konberner170 6 months ago +2

      Bravo!

    • @marcelocardoso1979
      @marcelocardoso1979 6 months ago +12

      I'm still baffled at how you managed to fit an entire renderer in 12kB. Absolute genius!

    • @SICunchained
      @SICunchained 6 months ago +1

      Thanks for the awesome vids man. Love watching your shit. Hope to be able to achieve your knowledge and application of math someday. ^.^

  • @capsey_
    @capsey_ 6 months ago +534

    The fact that N64 is not bounded by CPU is so funny to me. Instead of punching on Sega because 64 > 32 their marketing team could've just said "our CPU is so fast it's fucking useless"

    • @0-Kirby-0
      @0-Kirby-0 6 months ago +90

      It also creates such a deeply fascinating problem space, where you not only optimise for binary size over cycle count, which is already unusual, but you're specifically worried about fitting as much computation as possible into a single cache-load, so you don't have to go back to ram and disturb the renderer.
      It feels oddly multithread-y, where you bulk-copy what you want to work on into your own little bucket, so you don't have to acquire-release as much, except it happens with the instructions themselves, not just the memory being worked on.
      Absolutely nuts.

    • @Takyodor2
      @Takyodor2 6 months ago +11

      @@0-Kirby-0 Is it really unusual? Copying data from memory is a couple of orders of magnitude more expensive than your average floating-point calculation (with respect to energy and latency), so I'd expect this type of optimization to be sort of common?

    • @Trisslotten96
      @Trisslotten96 6 months ago +8

      The same kind of applies to modern CPUs as well.

    • @samphunter
      @samphunter 6 months ago +17

      @@Takyodor2 it is unusual. Your cpu will usually predict what instructions it will need ahead of time and load them long before the loading slows anything down. That is why branchless code is a more common optimization nowadays (avoid guessing what will run next)
      That said, it is common for data to be optimized to fit cache. There is a video about why linked lists are almost always slow which explains this.

    • @kreuner11
      @kreuner11 6 months ago +4

      They really flunked on the ram and texture vram, if these were made in a different more sensible architecture, the n64 might have been so much more ahead of the competition

  • @kuhluhOG
    @kuhluhOG 6 months ago +55

    This video in a nutshell:
    Do you still have questions?
    [ ] No, I understood everything.
    [x] No, I didn't understand enough to even begin phrasing questions.

  • @triplebog
    @triplebog 6 months ago +209

    As a graphics engineer, this is one of the only primeagen videos that actually makes sense

    • @khhnator
      @khhnator 6 months ago +14

      ikr! im a game/systems dev and most of the time i completely go "what is this react stuff" to primagen stuff

    • @ramy701
      @ramy701 6 months ago +6

      can you recommend any resources / books for someone wanting to get into graphics engineering ? :)

    • @headecas
      @headecas 4 months ago

      ​@@ramy701 dot

  • @bertram-raven
    @bertram-raven 6 months ago +29

    This reminds me of the developer who optimised drum-memory execution by scattering the instructions across the drum in such a way that the read-head was exactly over the next instruction just as it was required. He also added jump instructions which went nowhere so the motion of the drum would flatten loops and reuse code in a way which "increased" drum capacity - effectively storing 6.2KB of code in 6KB.
    This is next-level optimisation.

  • @isodoubIet
    @isodoubIet 6 months ago +44

    The reason the number of instructions becomes irrelevant is that, as far as the cpu running is concerned, what matters is the number of cycles. The number of instructions matters _only_ inasmuch as you're having to stream all that stuff from memory into the instruction cache, since the size of the machine code is roughly proportional to the number of instructions (this is not necessarily the case for monstrosities like x86 architecture, but it is the case for the n64 which has a MIPS processor with a very typical RISC pipeline). Since the entire renderer code fits into the cache anyway, you can have as many instructions as you like, and all you're comparing is how many cycles are being spent on running the instructions themselves.

    • @jeremylakeman
      @jeremylakeman 6 months ago +2

      And that's because the primary thing he's trying to optimise here is memory bus usage. Since the "GPU" needs to use it for texture mapping (etc) to improve frame rates.

  • @Entropy67
    @Entropy67 6 months ago +18

    7:50 he rewrote it so that it would all fit on a single page, it's tiny, the CPU will never remove it from cache. That means that we never cache miss. I think

  • @magfal
    @magfal 6 months ago +64

    11:52 the aggregate improvements he's made would have ensured a release of Mario 64 2nd adventure or something if Nintendo mysteriously received them back in the day.
    A 3X improvement would have enabled enough of a difference in capability to justify it.
    Also check out the Portal 64 project.

  • @Tabu11211
    @Tabu11211 6 months ago +16

    My favorite programmer streamer covering my favorite niche coder?! What a good day!

  • @TheTrienco
    @TheTrienco 6 months ago +45

    Not thought about 20us in a long time? That depends on what you're working on. Granted, 20us over a full frame probably won't matter much, but for a function you call a lot, that can add up quickly. If you think about it: for 100fps, your frame budget is 10ms.. for AI, physics, game logic, rendering and whatever else needs to be done. That's only 500 of those 20us.
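
The frame-budget arithmetic in the comment above can be sketched in C. This is only an illustration of the numbers quoted (100 fps, 20 us per call); the helper name `calls_per_frame` is made up for the example and does not come from the video.

```c
#include <stdint.h>

/* Hypothetical helper: how many calls costing `cost_us` microseconds
 * fit into one frame at `fps` frames per second.
 * Integer division rounds down; fps and cost_us must be non-zero. */
static uint32_t calls_per_frame(uint32_t fps, uint32_t cost_us)
{
    uint32_t frame_budget_us = 1000000u / fps; /* 100 fps -> 10000 us = 10 ms */
    return frame_budget_us / cost_us;
}
```

With these inputs, `calls_per_frame(100, 20)` gives 500, matching the comment; at 60 fps the same 20 us routine fits 833 times.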

    • @someonespotatohmm9513
      @someonespotatohmm9513 6 months ago +2

      In the micro electronics industry it is also quite common to see u or n as prefix.

    • @prgnify
      @prgnify 6 months ago +11

      @@someonespotatohmm9513 I've done a bit of embedded work, when he said "I don't think any of us has thought of 20us in a long time" I felt invisible.

    • @TrueHolarctic
      @TrueHolarctic 6 months ago +2

      ​@prgnify tbh we dont think about 20us a lot. Thats just one eternity. That sweet sweet 100MHz clock

    • @prgnify
      @prgnify 6 months ago +2

      @@TrueHolarctic Yes, the first thing I thought of was a joke from the embedded programming subreddit, that played with the concept of 20us being an eternity to us and completely indifferent for most others

    • @MikkoRantalainen
      @MikkoRantalainen 1 month ago

      20 µs is quite an important amount of time even for web programming if you're computing a JSON response that consists of 500 elements. If you take 20 µs per element, it will be 10 ms for the whole list already.

  • @whamer100
    @whamer100 6 months ago +11

    god i LOVE Kaze, his contributions to the sm64 romhacking community are next level

  • @zebraforceone
    @zebraforceone 6 months ago +10

    I'm pretty sure that second column becomes irrelevant because the instruction set and its operating memory are stored in the cache, as opposed to operations across the memory bus

  • @zweitekonto9654
    @zweitekonto9654 6 months ago +6

    his brain was in so much shambles, he forgot the outro.

  • @Mempler
    @Mempler 6 months ago +10

    "The Mario64 command is my favourite linux command"

    • @halle0327
      @halle0327 1 month ago

      glad I’m not the only one that says this

  • @chhihihi
    @chhihihi 6 months ago +14

    Caching and cache misses are an easy concept to understand, and taking them into account can dramatically increase the speed of your code. I strongly recommend you check it out.
    Because there's a ton more distance between the CPU and RAM than between the CPU and its cache, cache is blazingly fast. Keeping everything in cache greatly reduces the number of cycles (time) it takes to complete a set of instructions, and what keeps you in cache is having few enough instructions and data to fit. Think of cache as the primary resource and RAM as a backup.
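
One common way to see the effect described above is to walk the same array in two different orders. The sketch below is a generic illustration, not code from the video; the array size and function names are arbitrary assumptions. Both functions return the same sum, but the first reads memory in layout order while the second jumps a whole row ahead on every read, which tends to cause more cache misses on hardware where the array does not fit in cache.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* C arrays are row-major, so this visits memory in the order it is
 * laid out and consecutive reads usually land in the same cache line. */
static long sum_row_major(int grid[ROWS][COLS])
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Same result, but each read is COLS * sizeof(int) bytes away from
 * the previous one, so cache lines are reused far less. */
static long sum_column_major(int grid[ROWS][COLS])
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

How large the difference is depends entirely on the cache sizes of the target machine, so it is worth measuring rather than assuming.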

    • @entertain8648
      @entertain8648 6 months ago

      what about apple chips that have ram close to cpu?

    • @snooks5607
      @snooks5607 6 months ago

      @@entertain8648 still order(s) of magnitude difference, even apple's unified memory can have 100-400ns latency for a fetch depending on multiple things like TLB lookups, whereas on-die L1 cache access has cost just a couple of cycles ever since they were introduced to PCs with 486DX in 1989 (on a modern ~4GHz machine about 1ns)

    • @Takyodor2
      @Takyodor2 6 months ago +3

      @@entertain8648 It's still not _in_ the CPU. Analogy: going to the grocery store in your own town is faster than the one in the next closest town, but both are many times slower than accessing your fridge. (The hard drive would be on the moon in this analogy)

    • @entertain8648
      @entertain8648 6 months ago

      @@Takyodor2 well I understand what kind of difference that is
      What I am asking is whether anyone knows how Apple's design saves the situation

    • @Takyodor2
      @Takyodor2 6 months ago +3

      @@entertain8648 It doesn't save the situation, just makes the huge performance hit slightly less huge.

  • @xdanic3
    @xdanic3 6 months ago

    FINALLY! I've been waiting for you to react to kaze for a while now! And after this we could have a kaze reacts to ThePrimeTime, but you reacting first was more expected, now I gotta watch the video 👀

  • @Kiyuja
    @Kiyuja 6 months ago +6

    yeah Kaze always does great content

  • @glauco_rocha
    @glauco_rocha 6 months ago +3

    as rob pike said: never tune for performance until you have measured, and even so, don't do it until the code you're measuring OVERWHELMS the rest of the code.

    • @MikkoRantalainen
      @MikkoRantalainen 1 month ago

      I mostly agree but if you make *everything* super slow then your code ends up really bad until one single part overwhelms all the rest. I think it's better to decide the required latency at the start and then do whatever it takes to keep the final performance at least as good as your decided latency.
      For a game, the decided latency could be 1/30 or 1/60 seconds to match typical displays. For a web service, the target latency could be 50 or 100 ms.
      Then you know when you have to start optimizing: whenever your existing code cannot meet the latency requirement. What to optimize? The part that overwhelms everything else at that time.

  • @Kavukamari
    @Kavukamari 6 months ago +3

    getting some much needed neck nodding exercises in on this video, Prime is gunna be swole soon

    • @bozoc2572
      @bozoc2572 6 months ago +1

      He's clueless

  • @Emil_96
    @Emil_96 6 months ago +2

    Ocarina of Time is just pure nostalgia and it'll always have a spot in my heart

  • @JimWitschey
    @JimWitschey 1 month ago

    Kaze Emanuar has maybe the strangest career of any programmer alive

  • @alfiegordon9013
    @alfiegordon9013 6 months ago +2

    Lets all love Kaze

  • @nonetrix3066
    @nonetrix3066 6 months ago

    I think the cooler thing is that many of the things he mentions in other videos were not known at the time the game was made, so we can take more advantage of the hardware today than we could have ever dreamed of in the 90s

  • @jerichaux9219
    @jerichaux9219 6 months ago +1

    I see you and I both have mastered the ancient knowledge of almost-kind-of-remembering-floating-point-formats-but-not-really.

  • @Tobsson
    @Tobsson 6 months ago +2

    I watched so many videos of him. I'm not even 0.0000000000000001% smarter or more knowledgeable since then, but it sure sounds cool.

    • @JohnSmith-ox3gy
      @JohnSmith-ox3gy 4 months ago

      "I like your funny words, magic man." -JFK

  • @SianaGearz
    @SianaGearz 6 months ago +3

    N64 is a unified memory gal, there is no VRAM, there's just RAM. The whole RAM is connected to the GPU (semi-custom chip) and when the CPU (off the shelf chip) asks for something from RAM over the CPU bus, the GPU has to stop whatever it's doing to serve that memory request for the CPU.
    This is why verbose code that consists of a ton of individually fast machine instructions is generally BAD, since instruction loads use up your RAM bandwidth and slow down the GPU rendering.
    Except when you have a pass in your rendering or other code that munches a lot of data while using all the same code, and all the code is laid out contiguous across a handful of pages of memory (as to avoid cache lines fighting for tag space) and all those pages of code end up stored in the CPU's onboard cache for all but the first iteration, then there are no memory hits. You'd call this cache L1 today, but L2 or L3 doesn't exist so it's just Cache. Then you don't care about how verbose the code is in each function, you just need the whole code for that whole pass to fit.
    So for code that is more sprawling and expected to be uncached, such as gameplay code, you want it to be as terse as possible; while for code that can be formulated as an L1-resident pass or kernel, you can actually make it verbose in places.
    I myself program on another classic device which also only has L1, so i'm learning from Kaze a lot. I had also been cursed by a very sharp and strict numerics professor 20 years ago, it took me YEARS of hard work to pass that exam, so well i suppose i'm getting a lot more than just hype from floating point trickery, even though i'm more rusty than i'd like to be.

    • @MikkoRantalainen
      @MikkoRantalainen 1 month ago

      One could also explain N64 as having no RAM at all but being able to run CPU instructions from VRAM if GPU is stopped while that's happening. And there's 16 kB cache in the CPU itself so as long as you can keep everything in that space, the GPU doesn't need to be stopped.

  • @tacokoneko
    @tacokoneko 6 months ago +13

    i hope that some day kaze can improve the Linux kernel for N64 because right now it could already theoretically run any Linux program .. as long as it fits in 8 MB of RAM alongside everything else. It would be incredible if he could install a web server and then we can really build react for N64

    • @blarghblargh
      @blarghblargh 6 months ago

      n64 already runs doom.
      sometimes you gotta stop and ask why :D

  • @MrAbrazildo
    @MrAbrazildo 6 months ago +8

    7:08, on old hardware, the engine instructions/data didn't fit entirely in the cache. So, depending on how many instructions an action takes, the CPU had to go out to RAM, which used to be 100x slower (maybe less in a console). On modern hardware, all instructions/data are in the cache, which has much more memory than they require, for an old game. However, RAM is still used even nowadays, for multimedia stuff: images, video, audio, textures and other things larger than 64 KB. The optimization for these large things aims to load part of the RAM into VRAM (GPU memory) at a moment the user doesn't care about, like a loading scene, e.g. God of War's Kratos squeezing through some rocks. Sometimes this is used for loading from files to RAM too.
    11:58, but he is doing it for modern hardware, isn't he? The video's goal is just to explain why Quake's alg. is not meant for all cases.
    13:00, the sad truth is that these pointer transformations are UB (undefined behaviour). That's why the guy commented it as "evil": he just wanted to get his job done, leaving the comment for the future masochist who will deal with the potential nasty bug. UB means the standard does not define what the operation does. So, the app may someday start crashing or giving wrong values (out of nowhere!) if anything changes from the original setup: hardware, OS, any imaginable protocol that interacts with the game. Not even old C had an expected action for that, as far as I've heard.
    13:52, in math, a negative exponent means that the number is divided. So, x*0.5 == x / 2 == x*2^(-1). Instead of multiplying the whole number, it's possible to change its exponent by addition or subtraction, which are faster operations.
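
The last point (halving a value by adjusting its exponent instead of multiplying) can be sketched as follows. This is a minimal illustration, assuming `float` is an IEEE 754 32-bit value whose exponent field starts at bit 23; the function name `half_via_exponent` is invented here. It only holds for normal, finite, non-zero inputs whose result is still a normal number.

```c
#include <stdint.h>
#include <string.h>

/* Halve x by subtracting 1 from the biased exponent field.
 * Assumes IEEE 754 binary32; not valid for 0, denormals, inf or NaN. */
static float half_via_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* copy the bit pattern out of the float */
    bits -= 1u << 23;                /* exponent - 1  ==  value * 2^(-1) */
    memcpy(&x, &bits, sizeof x);     /* copy the adjusted pattern back */
    return x;
}
```

For example, `half_via_exponent(8.0f)` yields `4.0f` and `half_via_exponent(3.0f)` yields `1.5f`, because only the exponent changes and the mantissa is left untouched.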

    • @isodoubIet
      @isodoubIet 6 months ago +3

      No he runs this stuff on original hardware

    • @isodoubIet
      @isodoubIet 6 months ago +3

      As for it being UB, technically it is, but so many people do it that I don't think most compilers take advantage of it. C++ has only recently added a non-UB way to do it in C++20 with std::bit_cast. Before, the only way to do it was with memcpy, which would defeat the purpose.
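
For reference, the `memcpy` route mentioned above looks roughly like this in C. It is a sketch assuming a 32-bit IEEE 754 `float`; the helper names are made up. The pointer-cast version (`*(long *)&y` in the Quake code) reads a `float` object through an integer lvalue, which is what the aliasing rules of the C standard forbid, while copying the bytes is well defined.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as an unsigned 32-bit integer. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit pattern as a float. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

Many optimizing compilers turn a small fixed-size `memcpy` like this into a plain register move, but that is a property of the compiler and flags in use, so it should be checked for a toolchain as old or as specialised as an N64 one.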

    • @tannerted
      @tannerted 6 months ago

      Why are these pointer transformations UB? Casting to a different type doesn't change anything about the underlying bit representation. So cast and then shift and then cast back is just fine and deterministic according to the C spec. Am I wrong? (I might be; please teach me if I have something wrong) I feel like this is done all the time in embedded systems and OSs

    • @MrAbrazildo
      @MrAbrazildo 6 months ago

      @@isodoubIet - Are you saying that Nintendo didn't try to pack the data into the cache? This seems absurd. I can't imagine a game being made that way. It sounds so amateur.
      - I forgot about this bit_cast. I need to study C++20 deeply.
      - Because memcpy would be slower?

    • @MrAbrazildo
      @MrAbrazildo 6 months ago

      @@tannerted I heard this in a presentation. I don't read standards. But I also heard that C unions are now well defined. Since they were used for type punning, maybe this is now valid C - _UB in C++, because there are other facilities, such as this std::bit_cast_ .

  • @tornoutlaw
    @tornoutlaw 6 months ago +17

    20ms per frame...does this mean an N64 could run Mario64 in 60fps?

    • @traister101
      @traister101 6 months ago +36

      Yep. Kaze has a rom that does 60fps on a native N64.

    • @chainingsolid
      @chainingsolid 6 months ago +2

      1000/60 = ~16ms so almost..

    • @MrAbrazildo
      @MrAbrazildo 6 months ago +1

      @@chainingsolid 1000 ms / 20 ms = 50 FPS.

    • @binguloid
      @binguloid 6 months ago +5

      guys he said microseconds not milliseconds

    • @MrAbrazildo
      @MrAbrazildo 6 months ago +1

      @@binguloid Yeah, and that was a performance gain, not the entire time for the frame.

  • @apollolux
    @apollolux 6 months ago

    Sweet, it's a Kaze reaction! :)

  • @alexaneals8194
    @alexaneals8194 6 months ago +4

    The problem when you optimize for a CPU or GPU is that you should comment which CPU and GPU the code was optimized for. Later versions may break your optimization or may offer options that are far better. If the code isn't commented, then when someone goes in to make changes, they don't know whether the optimization still applies or whether it should be changed without spending a few cycles trying to figure out why the optimization was done. The same principle applies to higher level optimizations. And a note to my past self: this includes personal projects.

    • @Minty_Meeo
      @Minty_Meeo 6 months ago

      Sure, but Kaze's SM64 codebase basically only works on MIPS-GCC and only on N64 with all of the inline asm and illegal code it uses to go fast. He is way beyond the point of cross-compiling his mod.

  • @dsdy1205
    @dsdy1205 2 months ago

    10:33 I nearly spat out my drink

  • @jelliott3604
    @jelliott3604 5 months ago

    It's that there is a threehalves label for .. 3 halves .. but the offset just gets a WTF!

  • @ThatJay283
    @ThatJay283 5 months ago +1

    5:10 "you should be ashamed of yourself" - yup. back before i knew any better, i started a react app with a backend in nodejs, express, and typescript. i thought it'd be the "easy way" and instead i just ended up with a nightmare of react components and pointless middle steps that are too late to leave out now. and on top of all of that, it also runs like shit.

  • @i3looi2
    @i3looi2 11 days ago

    When that guy decides to invent another JS Framework after I just settled on my JS Framework of choice
    "BgRcky: fuck you"

  • @dominikmuller4477
    @dominikmuller4477 6 days ago

    to be fair, using the fast inverse square root and then inverting it is a dumb way to go about it. A fairer comparison would be to make an analogous fast square root algorithm. The same floating point magic that turns 1/sqrt(x) into (-1/2)* (int x) + magic number would also support turning sqrt(x) into (1/2)* (int x) + different magic number.
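
A rough sketch of the analogous "fast square root" idea described above. It assumes a 32-bit IEEE 754 `float` and uses the untuned constant `127 << 22` (0x1FC00000), which follows directly from halving the biased exponent; the function name is invented, and a tuned magic number plus a Newton step would be needed to get anywhere near the accuracy discussed in the video.

```c
#include <stdint.h>
#include <string.h>

/* Approximate sqrt(x) for positive, normal x by halving the float's
 * bit pattern and re-adding half of the exponent bias.
 * Assumes IEEE 754 binary32. Exact for powers of 4, off by up to
 * about 6% elsewhere (for example 2.0f maps to 1.5f). */
static float approx_sqrt(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = (bits >> 1) + (127u << 22);  /* (1/2) * bits + "different magic number" */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

`approx_sqrt(16.0f)` returns exactly `4.0f`, while `approx_sqrt(2.0f)` returns `1.5f` instead of about 1.414, which shows why the real algorithms refine the first guess.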

  • @madmax2069
    @madmax2069 6 months ago

    Kaze just shows how much potential game consoles (in this case the N64) have that was never reached in their active lifetime (active meaning still manufactured and sold and supported by the manufacturer). This is something that the modding community and homebrew community are good for, figuring out every little aspect of the hardware inside a game console, making better SDKs, fixing bugs and issues in the games, optimizing code for said game, heck just look at the person making portal for the n64.

  • @jorge28624
    @jorge28624 5 months ago

    2:54 we have come full circle lol

  • @csabaczcsomps7655
    @csabaczcsomps7655 6 months ago

    Know your data and now we're dancing.

  • @keyboard_g
    @keyboard_g 6 months ago +4

    Outside of the tiny texture cache and the ram latency (high bandwidth, bad latency), the N64 was a computational beast for the time.
    Nothing else was close.

    • @skilz8098
      @skilz8098 6 months ago

      I don't know about that. The PS1 and the Sega Dreamcast were both impressive machine architectures too.

    • @jc_dogen
      @jc_dogen 6 months ago +4

      @@skilz8098 Dreamcast was the next generation, and the N64 runs at 3x the clock speed of the PS1

    • @skilz8098
      @skilz8098 6 months ago

      @@jc_dogen Yeah, but the PS1 was a breakthrough in its day. It wasn't the first "CD-ROM" type console because there was the Sega CD "eh" then the Sega Saturn which was okay around the same time as the PS1. Panasonic even had their own, I think it was the 3DO but it didn't go over so well. The PS1 with its capabilities and affordable cost plus all of the available game titles made it very successful.
      Here's the 90s in a nutshell
      *Sega MegaDrive/Genesis - 88 (cart)
      Commodore 64 - 90 (cart)
      Neo Geo - 90-91 (cart)
      SNES - 90 (cart)
      Philips CD-i - 91 (disc)
      Sega CD - 91 (disc addon to the mega drive)
      3DO - 93 (disc)
      Jaguar - 93 (cart/disc addon in 95)
      Sega 32X (cart-addon to the Genesis)
      Neo Geo CD - 94 (disc)
      Sega Saturn - 94-95 (disc)
      PS1 - 94-95 (disc)
      N64 - 96 (cart)
      3DO M2 - 98 (disc)
      Dreamcast 98-99 (disc)
      *PS2 - 2000 (CD/DVD)
      Some of them were good, some were okay, some were flops. Some were great.
      For me, the SNES and the PS2 were some of the best consoles. The Genesis was decent, the PS1 was very good. The N64 and Dreamcast were both good. Some of them I never played on such as the 3DO, CD-i, Neo Geo or the Commodore. The Sega Saturn was okay, it had potential but was inferior to other consoles. The Sega CD was a nice concept but didn't go over too well, and the 32X was a major bust. And it was around this time that PC Gaming started to become a commonplace thing too.
      They were definitely the Good old days. I kinda jumped ship from SNES to PS when Final Fantasy dropped Nintendo and migrated to the PS1. And then games such as Resident Evil 1 & 2, Silent Hill, Parasite Eve, Castlevania: Symphony of the Night, Tony Hawk, Cool Boarders, etc... The PS1 then later the PS2 just took over. The SNES was a very popular and long favorite with many great titles... but eventually the PS2 became the champ. I still have both my SNES and my PS2. I don't have my original Atari, NES, Genesis, or PS1 anymore but I still have my PS1 games and I use my PS3 for that. I stopped getting into the "console" fad after the PS2 and only picked up the PS3 used about 2 years ago just for a select few titles and to be able to use my PS1 games. I still have Diablo for PS1. The only game I'm really wanting or missing for my PS1 collection is Ogre Battle.
      And as for Dreamcast being next Gen... kind of. It only came out 2-3 years after N64 just before the PS2 released. So the N64 had a head start on them. And it wasn't until 2001 until Nintendo came back with the Gamecube. So that's basically the 90s in a nutshell. Well from about 1988 - 2002.
      There were a few other consoles but not really worth mentioning as some of them were really obscure or niche console markets.
      But yeah as for performance and being a console with 3D graphics on a Cartridge in 64 bit, yes the N64 was a very nice machine. Mario Kart 64, Bomberman 64, FZero was decent but wasn't as good as FZero on the SNES, yet the FZero version for the Gamecube was great. Then you had Metroid. I could go on... What can I say, I've been gaming since Warlords, Pitfall, Circus Circus, Space Invaders, Asteroids, Missile Command, Breakout, Pacman, and much more... Been at it since the early 80s.

    • @jc_dogen
      @jc_dogen 6 months ago +2

      @@skilz8098 text dump bro. lmao
      but I agree, the dreamcast was in-between gens, though it was still a very big jump. I would also say the n64 was (mostly) much more powerful than the ps1. Some aspects made this less obvious (cartridges, very small 4K texture memory), and the hardware had some serious problems that ate into its performance (rambus latency and bandwidth problems) that were probably just mistakes. But, at the end of the day, performance sapping features like sub-pixel accurate rendering, perspective correct texturing, texture filtering, and z-buffering were only possible because of the extra power it had.

    • @skilz8098
      @skilz8098 6 months ago

      @@jc_dogen Well as the saying goes a picture is worth a 1,000 words, and I have about 1,000 pictures in mind, lol...

  • @TheHackysack
    @TheHackysack 6 months ago +1

    shoutouts to simpleflips

  • @yxyk-fr
    @yxyk-fr 6 months ago

    There's Newton algorithm (already devised by Babylonians before). And then there's Newton-Raphson iterations that converge like crazy...

  • @noxlupi1
    @noxlupi1 6 months ago

    The Fast Inverse Square Root, was ahead of its time, a long time ago.

  •  6 months ago

    I remember a qnx kernel fits in the cpu cache, weird times, isn't it?

  • @BeamMonsterZeus
    @BeamMonsterZeus 6 months ago +1

    As an amateur astrophysicist, I always knew the N64 was a universal anomaly, but not for the reasons discovered here

  • @antoniogarest7516
    @antoniogarest7516 6 months ago +2

    Prime and Kaze
    Subscribed

  • @stevez5134
    @stevez5134 6 months ago

    this is the one from only 3 weeks ago??? anyways great stuff

  • @n00blamer
    @n00blamer 6 months ago

    You guys... the Devil's in the details but the underlying maths is quite simple: reciprocal is negation of the exponent, 1/(n ^ m) is n ^ -m. sqrt(n ^ m) is n ^ (m / 2), and these can be combined into: n ^ (m * -0.5) == 1.0 / sqrt(n ^ m). The code gets a good initial value and Newton-Raphson iterations converge.
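
The "good initial value plus Newton-Raphson" structure described above can be sketched as follows. This follows the widely circulated Quake III routine and its constant 0x5f3759df, but uses `memcpy` for the bit reinterpretation instead of the original pointer cast; it assumes a 32-bit IEEE 754 `float` and a positive, normal input, and it is not Kaze's N64 version.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x): bit-level first guess, then one Newton-Raphson step.
 * Assumes IEEE 754 binary32 and x > 0. */
static float fast_rsqrt(float x)
{
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);          /* roughly: exponent * -0.5, plus magic offset */
    memcpy(&y, &i, sizeof y);

    y = y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson iteration */
    return y;
}
```

With one iteration the result is typically within a fraction of a percent of the true value; repeating the last line tightens it further at the cost of more multiplies.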

  • @FastVideoProdInNash
    @FastVideoProdInNash 6 months ago

    2:24 😂😂😂😂😂😂
    That was crazy.

  • @HrHaakon
    @HrHaakon 4 months ago

    My backend runs smooth and it has something like 150 mhz (I have a few millicpus in the cluster to play with) of Xeon time.
    So a lot more than the N64, but since we moved to a cloud based platform CPU time got expensive.

  • @cherubin7th
    @cherubin7th 5 months ago +1

    Majora's Mask is my favorite.

  • @v2ike6udik
    @v2ike6udik 6 months ago

    2:44 MAN DOWN, MAN DOWN!

  • @lukasoliverleo3730
    @lukasoliverleo3730 6 months ago

    I never expected anyone to react to Kaze

  • @MikkoRantalainen
    @MikkoRantalainen 1 month ago

    5:00 100% agreed!

  • @blipojones2114
    @blipojones2114 6 months ago

    "link to the past" is hard, just started playing and am pretty hard stuck

  • @jim0_o
    @jim0_o 2 months ago

    A Link to the Past was the pinnacle of 2D Zelda (and adventure/exploration games at the time). Ocarina of Time was the base-point of 3D Zelda games. Majora's Mask was a good look back at the gritty darker Zelda games (they all had darkness, but Zelda 2 (Link), Majora's Mask and Twilight Princess were different). So both ALttP and MM were better game-wise, but Ocarina of Time was groundbreaking... now back to the remaining 80% of the video.
    Edit: AFAIK the mask guy(The Happy mask salesman) was a kind of Deus Ex Machina, ie. he was the "hand of god" that started and ended everything, if you follow the story its all his fault (He brought the mask to Clock town getting it into the hands of the Skull kid, but he is also the one that bugs you to get it back.) this is probably also why he is designed to look like the creator of the Zelda series... now back to 50% of the video...

  • @rayanmazouz9542
    @rayanmazouz9542 6 months ago +4

    apparently Mario 64 wasn't even compiled with optimization enabled

    • @RuySenpai
      @RuySenpai 6 months ago +2

      This is a myth and Kaze himself has a video on it.

    • @felixjohnson3874
      @felixjohnson3874 6 months ago

      @@RuySenpai well that's not so; if he does have one and it says what you're saying, it's wrong. I'd love to confirm that but I can't find the video you're talking about, so I can't.
      We literally have reverse engineered the source code and, using the compiler they used at the time, we can generate byte-for-byte the same code... when optimizations are disabled.
      The PAL version IS optimized, but it's already running about 16% slower anyway.

    • @RuySenpai
      @RuySenpai 6 months ago +1

      @@felixjohnson3874 got it confused, it wasn't kaze it was modern vintage gamer who made the video.
      It was my bad, it isn't a myth that usa sm64 had compile optimizations off, but it's overstated how significant it is.

    • @Bobbias
      @Bobbias 6 months ago

      @@RuySenpai That's primarily because compilers of the time didn't have great optimizations to begin with. Even without platform specific optimizations, modern compilers can do far more optimization than the compilers back in the day could, so whether or not it was optimized was not as big an issue back then as it would be now.
      That said, even with modern optimizations, that's not going to magically buy you a ton more performance. Kaze's massive performance improvements come from the fact that he's basically rewritten the entire game engine (with some of the only untouched code being the actual movement physics and such) with performance in mind.

    • @jc_dogen
      @jc_dogen 6 months ago

      @@RuySenpai No, he made a video explaining why it was reasonable for the compiler optimizations to be turned off.

  • @guilhermeraposo6080
    @guilhermeraposo6080 6 months ago

    I think about how nice an extra 20us would be every time the wife and I are erm... Playing SM64

  • @BeamMonsterZeus
    @BeamMonsterZeus 6 months ago

    I admit I've only beat OoT a few times and MM a few as well. I was more into Goldeneye, all of this at age 4-9 btw I'm a babby

  • @skilz8098
    @skilz8098 6 months ago +1

    I have over a billion transistors that are all rated with a 0.05 nanosecond propagation delay. One of them is working at 0.09 nanoseconds. One of my logic gates is slower than the rest and it is the source of all my bottlenecks. I want a full refund! LOL!!!!

  • @wesleymcbob
    @wesleymcbob 6 months ago

    Yeesss Kaze is the greatest

  • @michaelrobb9542
    @michaelrobb9542 4 months ago

    Cause he can't stop hearing the music. 9:04.

  • @andrewtfluck
    @andrewtfluck 6 months ago

    Kaze is awesome 😎

  • @lMINERl
    @lMINERl 6 months ago

    5:16 im ashamed😢

  • @kira.herself
    @kira.herself 6 months ago

    OMG I LOVE KAZE

  • @Bliss467
    @Bliss467 6 months ago

    Game dev on limited hardware is truly next fuckin level

  • @AdhirRamjiawan
    @AdhirRamjiawan 6 months ago +2

    I'm ashamed i'm part of the bigger problem :'(

  • @bertram-raven
    @bertram-raven 6 months ago

    Adding my own piece of magic from the 1970s.
    a%=b
    b%=a
    a%=b
    Works for all types, structures, and cache swaps.

    • @n00blamer
      @n00blamer 6 months ago

      If the % is exclusive-or then that would swap in-place.

    • @MikkoRantalainen
      @MikkoRantalainen 1 month ago

      @@n00blamer Typically XOR would use syntax a^=b because usually % means remainder for a division.

    • @n00blamer
      @n00blamer 1 month ago +1

      @@MikkoRantalainen That is why I assumed OP mistakenly used % when he meant ^
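
The in-place swap discussed in this thread can be sketched in C with `^=` (exclusive-or), which is what the replies assume was meant. This is only a minimal sketch: the helper name `xor_swap` is hypothetical, and it covers unsigned integers only, since C does not define XOR for floats or structs.

```c
#include <stdint.h>

/* Swap two 32-bit values in place using three XORs.
 * xor_swap is a hypothetical helper name, not from the thread. */
static void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a == b) {
        return; /* XORing a value with itself would zero it */
    }
    *a ^= *b; /* a now holds a ^ b */
    *b ^= *a; /* b now holds the original a */
    *a ^= *b; /* a now holds the original b */
}
```

The `a == b` guard matters: without it, swapping a variable with itself would clear it, because any value XORed with itself is zero.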

  • @cweasegaming2692
    @cweasegaming2692 6 months ago

    AHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

  • @ikirachen
    @ikirachen 7 days ago

    MDK2 FTW :)

  • @mfc1190
    @mfc1190 6 months ago +4

    Ocarina of Time was so much better than Majora's Mask, like who TF said that

    • @billynasir3146
      @billynasir3146 6 months ago

      Majora's Mask is way more alive and open-world despite having fewer dungeons

    • @623-x7b
      @623-x7b 6 months ago

      The Ocarina of Time is better than GTA 5. GTA 5 is better than Majora's Mask

  • @bernicefenton
    @bernicefenton 6 months ago

    "take a moment and consider your life and what you've built... versus this... you should be ashamed of yourself" 😂 not so harsh

  • @JustinWalker951
    @JustinWalker951 4 months ago

    "Majoro's Mask" 🙄

  • @aeaehow
    @aeaehow 2 months ago

    15:30

  • @jordixboy
    @jordixboy 6 months ago

    Could someone explain what cycles refer to in the measurements? The number of operations the CPU has to do? (What's the difference from instructions? cycles != instructions?) Or is it the number of times it has to request/store data in registers/RAM?

    • @ricardoamendoeira3800
      @ricardoamendoeira3800 6 months ago +1

      Many instructions take several cycles to finish. Interacting with RAM is a good example.
      One cycle is generally the time needed for the shortest CPU instruction to run.
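
To relate a cycle count to wall-clock time, divide it by the clock frequency. The following is a rough sketch; the helper name `cycles_to_microseconds` is hypothetical, and the 93.75 MHz figure used in the note after it is the commonly cited N64 CPU clock, which is an assumption that does not come from this thread.

```c
/* Convert a CPU cycle count to microseconds for a given clock rate.
 * cycles_to_microseconds is a hypothetical helper name. */
static double cycles_to_microseconds(double cycles, double clock_hz)
{
    return cycles / clock_hz * 1e6;
}
```

Under that assumption, `cycles_to_microseconds(1875.0, 93.75e6)` comes out at about 20 microseconds, however many instructions those cycles were spent on.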

  • @TheIridescentFisherMan
    @TheIridescentFisherMan 6 months ago

    My boy was like " DAMN 20 MICRO SECONDS??? THATS WAY MORE THAN I THOUGHT ".

  • @oserodal2702
    @oserodal2702 6 months ago +1

    Doesn't the original code of the fast inverse square root also technically have undefined behaviour?

    • @ea_naseer
      @ea_naseer 6 months ago +1

      Yeah, the video mentioned that since he's not going for accuracy, they would probably never divide by zero, so there's no undefined behaviour like in the original.

    • @isodoubIet
      @isodoubIet 6 months ago +4

      @@ea_naseer The UB is not in any division by zero, it's in accessing a pointer through a different type. That violates the C (and C++) aliasing rules. I don't think it's an actual problem on any real compiler since it's such a common thing to do (and C++ only recently added a standard way to do it), but technically it's UB according to the standard.
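
The aliasing issue described above can be avoided in C by copying the bytes with `memcpy` instead of casting the pointer. This is a minimal sketch, not the code from the video or from Quake: it assumes `float` is a 32-bit IEEE 754 value, the function name `fast_rsqrt` is hypothetical, and the constant is the widely published `0x5f3759df`.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, finite x.
 * Assumes float is 32-bit IEEE 754; fast_rsqrt is a hypothetical name. */
static float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);    /* reinterpret the bytes without a pointer cast */
    bits = 0x5f3759dfu - (bits >> 1);  /* magic-constant initial guess */
    memcpy(&y, &bits, sizeof y);

    y = y * (1.5f - 0.5f * x * y * y); /* one Newton-Raphson refinement step */
    return y;
}
```

Compilers typically turn these small fixed-size `memcpy` calls into plain register moves, so the well-defined version should not cost anything extra in practice.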

  • @connorskudlarek8598
    @connorskudlarek8598 6 months ago

    Prime playing Ocarina of Time makes me wonder if he's ever played the Randomizer or Online Multi-player Randomizer with his kids.

  • @remigoldbach9608
    @remigoldbach9608 6 months ago +1

    John Carmack is a beast (sqrt approximation in Quake)

    • @jc_dogen
      @jc_dogen 6 months ago +3

      wasn't his code though

  • @BigDaddyMort
    @BigDaddyMort 6 months ago +1

    Incorrect, sir! The best Super Mario game was on the SNES: Super Mario RPG: Legend of the Seven Stars.

    • @B1GL0NGJ0HN
      @B1GL0NGJ0HN 6 months ago

      New Switch remake is a lot of fun!

  • @StingSting844
    @StingSting844 6 months ago +2

    This is a guy who gives imposter syndrome to the ones we get imposter syndromes from

  • @o0shad0oo
    @o0shad0oo 6 months ago

    Fast inverse square root *

  • @chronxdev
    @chronxdev 6 months ago

    Yep, he cooked my noodle

  • @Raven-fu1zz
    @Raven-fu1zz 6 months ago

    I think his talent would be really useful for making a compiler or IDE for new-age video games. Modern video games needlessly use so many resources, unfortunately.

    • @MadaraUchihaSecondRikudo
      @MadaraUchihaSecondRikudo 6 months ago +3

      Remember that all of this is for a game that runs on very specific custom hardware whose entire spec is known and consistent. This would be a lot more difficult today, even if you discount PC (which has so many different processors with so many different features, cache sizes, hardware optimizations, etc.) and just go for modern consoles. It essentially became impossible to do these kinds of optimizations by hand at any real scale a while back, and to be fair, the compiler will do many of those optimizations for you (e.g. SIMD, branch prediction, etc.). Instruction cache in particular isn't that big of an issue anymore.
      What you still need to be aware of today is memory allocation. You generally want to be CPU bound and not memory bound - and the primary reasons high level languages are generally slower than low-level ones is that it's harder to track and control what memory you allocate and when. If you're smart about allocating memory and working with values you've already loaded (as they're cached), you're 95% of the way there, which is generally more than enough.

  • @Valerius123
    @Valerius123 4 months ago

    Was Majora's Mask better than Ocarina of Time? I didn't think so for the longest time but in objective hindsight... yeah, probably.

  • @notapplicable7292
    @notapplicable7292 6 months ago

    Honestly considering prime apparently at some point in his career did embedded programming he seems to know very little about it.

  • @MemeConnoisseur
    @MemeConnoisseur 6 months ago

    Nintendo suing his ass, how dare he make a good mario game

  • @sa_lowell
    @sa_lowell 6 months ago

    I absolutely hate that I laughed at the vector normies thing. I'm done.

  • @CatherineBert
    @CatherineBert 6 months ago

    I don’t know who you are, but I’m interested in the video you are showing.
    Enjoy this evidence that you’re stealing this content’s views. Never seen your channel before.

  • @catskinner6
    @catskinner6 6 months ago

    Ocarina of Time >>>>>>>>>> Mario64 Fact

  • @Georgggg
    @Georgggg 6 months ago +1

    TL;DR: this all became irrelevant 25 years ago

  • @bobanmilisavljevic7857
    @bobanmilisavljevic7857 6 months ago

    I guess it's True when they say,
    D = 8
    8==D