The Genius of the N64's CACHE Instruction

Kaze Emanuar

105K views

Published: Jan 5, 2025

Comments • 787

@KazeN64 1 day ago ⁺⁵³¹
Slight correction to the video: It looks like NEC (Nippon Electric Company) were actually the ones that designed most of the CPU core here, so give them the credit instead of SGI.
Correction correction: Apparently it was SGI after all? I have no idea. Someone else can argue this out. It doesn't really matter to me or the video.
@xyzabc123-o1l 1 day ago ⁺³¹
companies like NEC need to start designing and manufacturing chips on old nodes again. we don't need faster chips, we need more cheap chips in 2025 :)
@syncmonism 1 day ago
@@xyzabc123-o1l Intel currently has some 14nm fabs sitting idle I believe. They need to find some customers for those fabs!
@incognitoman3656 1 day ago
👌
@Dwedit 1 day ago ⁺¹²
@@xyzabc123-o1l According to Sophie Wilson (co-designer of the ARM processor), price per gate is lowest at the 28nm process node.
@xyzabc123-o1l 1 day ago
@@Dwedit ty for the info, do you know of any easily purchasable chips on that process node?
@chane2k1 2 days ago ⁺¹⁵¹⁸
Sounds like N64 emulators are just going to have to use your game as an accuracy benchmark.
@zeggyiv 2 days ago ⁺⁹²
Expect to need a new PC to play 30 year old games.
@iwiffitthitotonacc4673 2 days ago ⁺¹⁵
@@zeggyiv ?
@TSquitz 2 days ago ⁺¹¹³
@@iwiffitthitotonacc4673 emulation is really hard to do, and can require more computing power than the original console
@zeggyiv 2 days ago ⁺¹⁰⁷
@@iwiffitthitotonacc4673 Emulation gets more computationally expensive the more accuracy is required. UltraHLE used to run on what would be considered a toaster today.
@EnigmaticGentleman 2 days ago ⁺³⁶
@@TSquitz I can emulate the most intensive PS2 games at 2x on my $200 tablet. Unless we're talking 7th gen+ or REALLY high accuracy stuff, it's just not hard anymore (unless your machine is really underpowered).
@rastas_4221 2 days ago ⁺⁸⁶⁷
N64 developers: "Well excuse me we didn't have 20 years to study the architecture and had to ship something by the end of the month!"
@thewhitefalcon8539 2 days ago ⁺⁹⁸
This is extremely true
@Orzorn 1 day ago ⁺¹²⁷
Yeah, it really seems like the N64 was actually so much more powerful than any of the games ever took advantage of. It also seems a lot of this is down to a lack of documentation spelling out the best way to use these capabilities, plus overworked devs who had to get a game out ASAP.
@PinkMawile 1 day ago ⁺⁶⁶
We have to send Kaze back in time to revolutionize the N64
@ENCHANTMEN_ 1 day ago ⁺⁴⁰
I bet we could do spectacular effects with modern hardware, it's just that the complexity to optimize it to this degree isn't humanly possible
@zdelrod829 1 day ago ⁺²
DK64 moment
@johnclark926 2 days ago ⁺⁷²⁰
Kaze seems to have blown past the RTX 5090 phase of development and discovered that the N64 has a pseudo-quantum computer inside
@daspeedsta9455 1 day ago ⁺⁷³
Saw this comment before watching and thought this was a joke 😭
@Bones_ 1 day ago ⁺⁴³
This is extra funny considering the people who try to claim that quantum computers use parallel universes and that Mario 64 has parallel universes
@bobcake8904 1 day ago
Especially now since Mario in the multiverse released… XD
@IXPStaticI 1 day ago ⁺⁴
Scrolling past this halfway into the video I thought this was a joke but it turns out HE'S ACTUALLY STRAIGHT UP DOING THAT WTF
@viridisspielt 1 day ago ⁺³⁹¹
Create_Dirty_Exclusive sounds like the general idea behind Conker's Bad Fur Day
@skyguysZ 1 day ago ⁺¹⁰
this made me laugh way too hard
@CottonModem 1 day ago ⁺²⁵
Damn, just made an almost identical comment before stumbling across this one. We must both be very handsome, intelligent, and charismatic.
@skyguysZ 1 day ago ⁺⁶
@@CottonModem 20 intelligent people including us and the OP could have thought of this OP’s comment
@thesenamesaretaken 1 day ago ⁺⁷
And there I was thinking it was the name of Kaze's onlyfans page
@galen__ 1 day ago
Dirty Cash(e) 😂
@TheBackyardChemist 2 days ago ⁺⁶⁹³
18:00 this trick is called Cache-As-RAM (CAR) and as far as I know it is used by BIOS code in most (all?) PCs. In the earliest part of the boot process you simply do not have any RAM yet, since DDR RAM initialization is so complicated. So when modern x86 CPUs come out of reset, they need to start executing code to initialize their memory controller, and CAR is used for this.
@KazeN64 2 days ago ⁺³²⁴
oh that's really cool! I didn't know this was commonly used already. Interesting that they do it out of necessity instead of for performance.
@ThatOSDeveloper 2 days ago ⁺¹⁷
Huh that's really interesting, do other things like the 6502 or something use that?
@gabemorales7814 2 days ago ⁺⁴⁷
@@ThatOSDeveloper 6502 and similar super early microprocessors have no cache. The first processor I saw with an instruction cache was the 68020 on the Amiga 1200, but IIRC they work differently because the Amiga itself has a funky bootstrap sequence. The SH-4 has such a mode, however; it's called "OCRAM mode." The Dreamcast has an integrated MMU so, without checking, I'm fairly sure it'd boot the same way.
@TheBackyardChemist 2 days ago ⁺²⁰
@@ThatOSDeveloper Nope, anything where the RAM is straight SRAM or something comparable will have RAM after reset. This is only the case for processors that have to initialize their own memory controller with a complicated algorithm.
@queazocotal 2 days ago ⁺¹³
@@ThatOSDeveloper Technically, @thebackyardchemist is wrong, and early PCs (along with the whole 8 bit space) don't use this, as they generally don't have a CPU cache. The 486 was the first processor that could rely on having a cache internal to the CPU. 6502/z80/4004/8008/8080/8086 did not have any cache.
@Armi1P 2 days ago ⁺²²⁰
2035: Kaze manages to run Crysis on N64 by using instructions that theoretically don't even exist
@LokiScarletWasHere 1 day ago ⁺¹⁰
Don't give him ideas. You know he'll do it.
@VlaDexa_MAX 1 day ago ⁺⁵
Don't know about N64, but I'm pretty sure that modern CPUs have undocumented instructions, sooo
@kellan5431 1 day ago ⁺⁷
There are a few NES games that use undocumented instructions. On the CHIP-8 (technically a fantasy console) a few undocumented instructions got used so much that they became official
@Akriashi 23 hours ago
@@VlaDexa_MAX all procs have undocumented instructions... though modern ones can have them "disabled" via the instruction decoder being set to convert their opcodes to NOPs in the end-user versions.
@Manabender 1 day ago ⁺¹⁸⁰
So basically, you're taking a couple of cachelines and telling them "you don't cache any more, you are now extra CPU registers."
Brilliant.
@hyperon_ion9423 1 day ago ⁺⁸
Can the N64 do operations directly on the cache buckets? I would assume it would still have to load the data to a register connected to the ALU, so they’re more like extra _SUPER_-volatile ram addresses that you then have to “flush” (i.e. bring the original data back from RAM over the Ram-Bus so that you don’t overwrite it)
I imagine that the trick would be getting as much use out of the cache buckets as you can before needing to reset them back to their original data, or perhaps even invalidate that section of RAM altogether and pretend that the cache _is_ the RAM until the data in it has to be accessed by something other than the CPU.
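The idea in this thread can be pictured with a toy model of a write-back data cache. This is only a sketch: the names (`ToyCache`, `create_dirty_exclusive`, `fill_buffer`) are invented for the example, the 16-byte line and 8 KB direct-mapped layout are assumptions modelled on the N64's data cache, and the model counts RAM transfers instead of emulating real hardware. It shows why claiming a line without filling it saves bus traffic when the whole line is about to be overwritten anyway.

```python
LINE_SIZE = 16    # bytes per cache line (assumption, modelled on the N64 data cache)
NUM_LINES = 512   # 512 * 16 bytes = 8 KB direct-mapped cache (assumption)


class ToyCache:
    """Toy direct-mapped write-back cache that only counts RAM traffic."""

    def __init__(self):
        self.tags = [None] * NUM_LINES   # line address held by each slot
        self.dirty = [False] * NUM_LINES
        self.ram_reads = 0               # line fills from RAM
        self.ram_writes = 0              # write-backs of dirty lines

    def _slot(self, addr):
        line_addr = addr // LINE_SIZE
        return line_addr % NUM_LINES, line_addr

    def _evict(self, slot):
        if self.tags[slot] is not None and self.dirty[slot]:
            self.ram_writes += 1         # dirty data has to go back to RAM
        self.tags[slot] = None
        self.dirty[slot] = False

    def store(self, addr):
        """Ordinary store: a miss fills the line from RAM first."""
        slot, line_addr = self._slot(addr)
        if self.tags[slot] != line_addr:
            self._evict(slot)
            self.ram_reads += 1          # fill happens even if we overwrite it all
            self.tags[slot] = line_addr
        self.dirty[slot] = True

    def create_dirty_exclusive(self, addr):
        """Claim the line for writing without reading its old contents."""
        slot, line_addr = self._slot(addr)
        if self.tags[slot] != line_addr:
            self._evict(slot)
            self.tags[slot] = line_addr  # no RAM read: contents are undefined
        self.dirty[slot] = True


def fill_buffer(cache, base, size, use_cde):
    """Overwrite every byte of a buffer, 4 bytes per store."""
    for addr in range(base, base + size, 4):
        if use_cde and addr % LINE_SIZE == 0:
            cache.create_dirty_exclusive(addr)
        cache.store(addr)


plain, tricky = ToyCache(), ToyCache()
fill_buffer(plain, 0x1000, 256, use_cde=False)
fill_buffer(tricky, 0x1000, 256, use_cde=True)
print(plain.ram_reads, tricky.ram_reads)  # prints "16 0"
```

In this model the ordinary path fetches 16 lines it never looks at, while the other path fetches none. The price is the one raised above: the claimed line holds garbage until every byte is written, so it only works for data that is fully overwritten.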
@shinyhappyrem8728 1 day ago ⁺⁴
The SNES CPU also had some memory-mapped bytes on the CPU die ($43x0..$43xB for x=0..7, so 12*8=96 bytes) but sadly they were used mostly just as a place to store DMA/HDMA parameters. Afaik only 1 game used that area as a fast cache for instructions: Another World (SNES port by Rebecca Heineman).
@ChineseCookie 2 days ago ⁺³³⁷
I am not using this information, I am not making an N64 game. I'm just watching this because I can.
@thewhitefalcon8539 2 days ago ⁺⁹
This way of thinking is good on any platform.
@Goose____ 2 days ago ⁺⁸
a great use of free will
@Swenglish 1 day ago ⁺²¹
Same. I don't even understand half of it, but hearing someone go in depth on their niche interest without being boring is magical when it's clear they have taken their nerdiness to expert level.
@HKlink 1 day ago ⁺³
Listening to someone talk about a thing they're passionate about is always fun. Even if you don't understand half of it.
@davidthecommenter 1 day ago ⁺³
am i ever gonna use this information? not likely
do i like hearing this guy talk about transforming the N64 into a bloody supercomputer? absolutely
@6Frxggy 2 days ago ⁺⁴⁴²
bro explained the N64 like a country
@LavaCreeperPeople 2 days ago ⁺⁵
Pro
@gabycute5128 2 days ago ⁺⁶
@@LavaCreeperPeople what?
@Hollow_Struggler 2 days ago ⁺⁵
And boy did it work
@notme8232 2 days ago ⁺¹⁵
And he somehow made it MORE confusing
@superking208 2 days ago ⁺¹⁰
bro looks up at the sky and says "bro is blue"
@dudono1744 2 days ago ⁺²⁰⁷
Rambus was finally going vroom vroom, but now it's retired :(
@MarioKartSuperCircuit 1 day ago ⁺³⁸
Bro downloaded more ram to the point he didn't need the base ram anymore
@SteveNathn 1 day ago ⁺¹⁷
He had a good career and now he can enjoy some time off
@crestdazoltral7705 1 day ago ⁺⁵
Having (all) your ALUs munching away on useful work with some memory bandwidth to spare is the goal for a well optimized system.
@3lH4ck3rC0mf0r7 2 days ago ⁺⁵¹⁴
Nintendo: *releases N64 specs & development docs*
SGI: look how they massacred my boy
Edit: Tbf, this is basically software engineering in a nutshell. Hardware folks come up with some rocket science bullshit to squeeze extra perf out of the silicon, and the software people waste all of that work by having compilers ignore modern special-purpose instructions for the sake of backwards compatibility, and putting the entire program behind all the polymorphism, virtual functions, dependency injections, virtual machines & interpreters, and God knows how many other abstractions and obfuscations. Despite the different nature of software optimization then vs. now, it boils down to a similar amount of fundamentally misunderstanding how the hardware actually functions that led to most of the N64 library having lackluster performance.
Modern apps are written like a labyrinth, and the CPU is given the unreasonable task of translating the map from a foreign language and solving the labyrinth as quickly as possible. This is often why modern software is ~1000x slower than it could be.
@LavaCreeperPeople 2 days ago ⁺⁹
Rip
@uponeric36 2 days ago ⁺¹⁴⁶
Biggest revelation of this channel (besides all the amazing tech) is that the worst, most performance limiting part of the N64 was the documentation.
@brandonlittle6444 1 day ago ⁺⁴
@@uponeric36 same with Bosch Motronic ECUs and their stolen/hidden FR manuals
@Luna5829 2 days ago ⁺¹²⁹
first time a bus has been mentioned in an n64 video without it being "Imagine a bus"
@thewhitefalcon8539 2 days ago ⁺²¹
If I had a nickel for every Mario related bus meme, I'd have two nickels...
@Pie-jacker875 1 day ago ⁺⁶
@@thewhitefalcon8539 I'd have 3. Desert bus 64.
@patientallison 1 day ago ⁺³
Imagine a rambus
@turbinegraphics16 1 day ago ⁺¹
SMB frame rule and rambus being related 😂
@johanngambolputty5351 2 days ago ⁺¹⁵⁰
2:09 As someone who did maths for their undergrad, I can confirm, I have absolutely no memory (its kinda why generality and derivations from first principles appeal in the first place).
@mathphysicsnerd 2 days ago ⁺²
That's because you're not a chad universalist who memorizes their proofs like Poincaré :^)
@JorgetePanete 1 day ago
it's*
@thezipcreator 1 day ago ⁺⁶
@@JorgetePanete yrou'e*
@wyattknutson 1 day ago ⁺²
See now I'm awful at math but my memory is fantastic, wanna connect our brains with a rambus?
@multiapples6215 1 day ago
If you've never forgotten the quadratic formula on an exam and re-derived it on the spot, then are you even a real mathematician :)
@NanNaN-jw6hl 2 days ago ⁺¹¹³
Essentially you're using dynamic ranges of cache as a sort of register-window; bravo!
I've not seen this sort of cache-line optimization talk outside of Linux kernel specific talks before. Excellent!
@ArneChristianRosenfeldt 2 days ago ⁺¹⁴
Yeah, this streaming out of sub-16 byte data packages looked very register windows. The Jaguar has (buggy) helper registers to let the GPU assemble 32:32 bits to write out in one go as 64 bit.
@gabemorales7814 2 days ago ⁺¹⁷
lots of talk about this kind of optimization going on right now in Dreamcast-land with the community port of GTA3. Currently none of it is implemented as everyone hashes out the details with profiling to see exactly the best way to attack the problem, with the added complexity that both vertex transformation and vertex submission can *_potentially_* thrash cache depending on how it's done.
@bottols 2 days ago ⁺⁸⁷
I am not using quantum physics in my Mario 64 mod YET. Famous last words.
@vilian9185 1 day ago ⁺⁶²
mario 64 has parallel universes, nintendo64 has quantum cache, everything is coming together for a mario64 port for a quantum computer
@gabemorales7814 2 days ago ⁺¹¹⁰
Ah! A direct mapped cache! The Sega Dreamcast has a similar cache setup. I've got a good scheme created to maximize direct mapped cache by using absolute addressing in gcc with an ld script to create striped zones. Separate the direct mapped cache into 4 zones, each separated by the width of the cache spacing, to ensure writes to buffers don't overwrite the previous line. On the Dreamcast, you can also enable OCRAM mode, which turns half of the cache into a scratch pad for fast math. This is actually optimal, because the physical layout of the Dreamcast's memory is (for sake of brevity ignoring the 64-bit dual ram setup) 2 ram "chips" with 2 banks inside, each bank made up of 2048 rows of memory "cells," each cell being a cacheline in size. Each bank has a mechanism inside to read a row called a sense amplifier. To read a cell, a sense amplifier must be attached to the row, so if you read a row outside of the boundary, it incurs a performance penalty as the sense amplifier must detach, move to the appropriate row, and reattach. If you operate in OCRAM mode, the sizing of the remaining cache is *juuuust* right to fit 4 rows at once if you stripe your memory without sense amplifier penalty. It sounds like the DC and N64 actually share quite a bit in common memory wise.
A really cool feature of the Dreamcast memory map is that the entire memory is mirrored to an alternate address which skips cache when read, as well. So you can actually store things in memory and call them using an alternate address without thrashing your data cache. The Dreamcast also naturally has prefetch and invalidate instructions, which when combined with absolute addressing and OCRAM mode, gives you quite a bit of granularity in how you control your cache.
EDIT - Question: Does the N64 offer any sort of degree of instruction parallelization? The Dreamcast uses a 5-stage harvard architecture for instruction fetch, which allows parallelization when basically using any instruction from alternate groups providing they aren't a move opcode. Anything like that exist on the N64? EDIT AGAIN: Welp, looked a little further and it turns out this is actually a part of the MIPS name, lol. "Microprocessor without Interlocked Pipelined Stages." Very, very, verrry cool. The architecture of the DC and N64 are very similar!
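The striping idea described above comes down to which cache slot an address maps to. A minimal sketch follows, assuming an 8 KB direct-mapped cache with 16-byte lines (roughly the N64 data cache; the Dreamcast's numbers differ). The function names and buffer addresses are hypothetical and not taken from the commenter's linker script.

```python
CACHE_SIZE = 8 * 1024   # bytes (assumption)
LINE_SIZE = 16          # bytes per line (assumption)
ZONE = CACHE_SIZE // 4  # split the cache into 4 equal zones


def cache_slot(addr):
    """Slot a byte address maps to in a direct-mapped cache."""
    return (addr % CACHE_SIZE) // LINE_SIZE


def conflicts(base_a, base_b, size):
    """Number of cache slots that two buffers of `size` bytes fight over."""
    slots_a = {cache_slot(a) for a in range(base_a, base_a + size, LINE_SIZE)}
    slots_b = {cache_slot(b) for b in range(base_b, base_b + size, LINE_SIZE)}
    return len(slots_a & slots_b)


# Two buffers exactly one cache size apart evict each other on every line...
bad = conflicts(0x10000, 0x10000 + CACHE_SIZE, ZONE)
# ...while shifting the second buffer by one zone gives each its own stripe.
good = conflicts(0x10000, 0x10000 + CACHE_SIZE + ZONE, ZONE)
print(bad, good)  # prints "128 0"
```

Placing buffers at fixed addresses (for example through a linker script) is what makes this kind of layout predictable. With ordinary heap allocation the slots would depend on wherever the allocator happens to put things.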
@ArneChristianRosenfeldt 2 days ago ⁺⁷
This video is so complicated that I am almost relieved that the Atari Jaguar only has scratchpad RAM for code and a Matrix and a ton of registers for the data.
@gabemorales7814 2 days ago ⁺¹⁶
@@ArneChristianRosenfeldt Oh man I've done Jaguar programming with my Skunkboard. I consider Dreamcast development way, way easier lol. The Dreamcast is so elegant, nice FPU with fat registers for 2 full matrices, a bunch of really cool SH4 fast math functions. Plus, the absolute coolest feature: Order-independent transparencies, owed to deferred rasterization. You bin all your polygons upfront before sending them to a tile accelerator to rasterize, which gives the tile accelerator, which generates pixel fragments, the opportunity to depth-test against every other polygon in the bucket. This gives the Dreamcast per-pixel transparency without needing to order polygons.
I absolutely love 68000 programming, though. When I do Jag development, I make AtariAge weep because I play mainly with the 68000 lol.
@ArneChristianRosenfeldt 1 day ago ⁺⁴
@ I just try to redeem Atari's hardware decisions. Running code out of external memory probably was an accident due to the unified data and code cache and external data access. LOL.
I cannot code 68k, only 6502
@gabemorales7814 1 day ago ⁺¹⁰
@@ArneChristianRosenfeldt Coming from 6502, I think you'd find the 68000 a dream to work with. They feel very similar, except the 68000 is just more of everything, especially registers. That's the absolute best thing about the 68000 -- FAAAAAAT registers. The 68000 is 32-bit internal, that's seven 32-bit address registers, and eight 32-bit data registers. With bitmasking and bitshifting, that's essentially the same as sixteen 16-bit data registers, or thirty-two 8-bit registers! And unlike the 6502, data registers are general purpose, use however you want. You can also use the address registers in clever ways. Hands down my favorite CPU of all time, simple enough to know the ins and outs of, but feature packed enough to do some incredible stuff. Definitely give it a try!
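The register arithmetic in the comment above can be checked with a few lines of shifting and masking. This is a plain-Python sketch of the idea, not 68000 code; `pack16` and `unpack16` are names made up for the illustration.

```python
DATA_REGISTERS = 8    # D0-D7, as stated in the comment above
REGISTER_BITS = 32
MASK16 = 0xFFFF


def pack16(hi, lo):
    """Store two 16-bit values in one 32-bit register image."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16(reg):
    """Recover both 16-bit halves with a shift and a mask."""
    return (reg >> 16) & MASK16, reg & MASK16


# 8 registers * 32 bits hold sixteen 16-bit or thirty-two 8-bit values.
print(DATA_REGISTERS * REGISTER_BITS // 16, DATA_REGISTERS * REGISTER_BITS // 8)

reg = pack16(0x1234, 0xBEEF)
print(hex(reg), [hex(v) for v in unpack16(reg)])
```

On the real chip the low word or byte of a data register can be addressed directly with `.w` and `.b` operand sizes, and the `swap` instruction exchanges the two halves, so the upper half costs an extra instruction to reach.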
@CantrellDouglas 1 day ago ⁺⁹
Off topic, but kinda funny: That's me in your profile picture. Or, rather, I posed for the reference picture when I was a kid. Wasn't expecting to see myself in the comment section. 😂
@lucaspec7284 2 days ago ⁺⁹⁰
Kaze: "Alright, full disclosure: i am not using quantum physics in my mario 64 mod-"
Also Kaze: "-YET"
At this rate we'll have ray tracing in RtYI by the time it releases.
@ssg-eggunner 2 days ago ⁺³
The satirical kaze video by sm64rise is gonna become real
@LokiScarletWasHere 1 day ago ⁺³
You thought the Rt stood for Return To
You were sorely mistaken
@lucaspec7284 1 day ago ⁺¹
@@LokiScarletWasHere Raytraced Yoshi's Island 64, coming to a nintendo 64 near you in 2025.
Actually, reminds me of that one guy who made a Ray-Tracing chip for the super nintendo.
@coltonroyle2341 2 days ago ⁺⁷⁸
Being a pioneer for a 30 year old console, what a time to be alive.
@Doom2pro 1 day ago ⁺⁴
21 minute papers...
@canaconn2388 1 day ago ⁺¹
@Doom2pro except with actual information
@Doom2pro 1 day ago
@@canaconn2388 and not spoken like "Today, we, are going, to, discuss, a groundbreaking, piece, of technological, development.. so we, will get, to see, an amazing, hard to believe, sight... So hold on to your papers"
@arciks11 1 day ago
"Man Revolutionizes N64!"
"He's 25 years late and gonna get sued so IDK why he did."
@Teckman8 1 day ago ⁺⁴¹
Wait, why is this legitimately a good way to explain how a CPU works?
@brunosuperman 1 day ago ⁺¹⁷
It's amazing how good the graphics quality you've achieved on a Nintendo 64 is! It's so beautiful! Imagine this game running in 1996
@TariqMKDS 2 days ago ⁺²⁷³
bro knows the n64 better than nintendo themselves🙏🙏😭
@Genzaijh 2 days ago ⁺²⁴
Of course, Nintendo moved on to other technology. Amazing how deep you can dive into a hobby.
@pleasedontwatchthese9593 2 days ago ⁺⁵³
It's funny reading the N64 official development docs. They explain what a polygon is to developers because 3D was so new. Can you imagine working at a AAA studio and needing to explain what a 3D model is? But it makes sense, there was a start to everything.
@crunchdatbacon 1 day ago ⁺²
Ps1 is next 😬🥶
@ericlizama8552 1 day ago ⁺⁴
Iirc devs had to get permission from Nintendo to use custom microcode, so finding optimizations like this was probably stalled by a bunch of red tape.
@TariqMKDS 1 day ago ⁺¹
@@Genzaijh so real bro
@わかるマーン 2 days ago ⁺⁶⁰
If I were a bus driver, having a schedule of "once every whenever we need a bus" while being paid the same salary as if I were to drive constantly, that would definitely make me quit my current web developer job.
@3dmarth 1 day ago ⁺²¹
Except you'd have a bunch of customers from different places all screaming at you at the same time, while getting upset that you can't be in three places at once!
@AjaxGb 1 day ago ⁺⁸
Funny, "you have no work schedule but are on call constantly" would be a complete deal breaker for me.
@わかるマーン 1 day ago ⁺²
@@3dmarth Good point.
@TheJesterElectronic 2 days ago ⁺³⁷
Computers have a few functionalities programmers typically would not consciously use, but for the sake of optimization, they sometimes should.
@ArneChristianRosenfeldt 2 days ago ⁺¹¹
Kids, don’t do this at home. You are not Kaze. Premature optimization is the root of all evil!
@skylerross8054 1 day ago ⁺¹⁰
*premature* optimization is.
However, sometimes, you've traced your performance bottleneck to a specific area, using somewhat realistic very stressful workloads. Now you need to optimize something everyone says is impossible to optimize further, because you have no hope of learning how fast is fast enough (it'll always be too slow for something), and performance is a feature. That's when you reach for the esoteric stuff.
I did that a couple months ago for something at work, it gave like... well, it's hard to quantify. It was noticeable on the test case, at least a 10% throughput improvement of this function (which originally was 33% of runtime), how much time it saves depends on a bunch of parameters, we have an O(n) algorithm with a large constant factor that I can't do anything about, and this function has a O(M^4) section (yes, that's a slow complexity, I haven't figured out how to make it M^3 or smaller)
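The figures quoted above (a function taking 33% of the runtime, made about 10% faster) can be turned into a whole-program estimate with Amdahl's law. This is a rough sketch that takes those two numbers at face value. The commenter notes the real saving depends on several parameters, so treat the result as an order of magnitude and not as their measurement.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# 33% of the runtime, 10% higher throughput in that part.
s = overall_speedup(0.33, 1.10)
print(round(s, 3))  # prints 1.031

# Upper bound: even an infinitely fast function leaves the other 67%.
limit = overall_speedup(0.33, float("inf"))
print(round(limit, 3))  # prints 1.493
```

So a 10% local win is about 3% for the whole program, and the ceiling for this one function is roughly 1.5x. That is why it pays to profile before reaching for the esoteric stuff.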
@pafnutiytheartist 1 day ago ⁺⁶⁷
You are very much in the territory where speed is no longer a priority. When writing code commercially, you have to balance readability, expandability and execution speed. Even if devs back then knew your arcane arts, I doubt they would use such tricks. If your game loop runs 10 microseconds faster but everything breaks whenever you update the code, it's not a good change.
I am genuinely infinitely impressed with your dedication to this madness though.
@PlaguevonKarma 1 day ago ⁺¹⁶
Kaze doesn't care about readability, he cares about optimisation lol
@Gofer925 1 day ago ⁺⁹
i recommend you watch his
'optimizing with "bad code" ' video
it has even more Very Fun optimization stuff
@Bthrecon 1 day ago ⁺⁵
When I did dev, many times when more speed was needed the understandable code got commented out, a paragraph added about what the optimisations were, and if you were lucky, another about WHY and what not to do lol.
@blarghblargh 1 day ago
Games get shipped. They don't always break even on revenue. Maintenance is a champagne problem. Good enough performance on low end systems is not, because it increases your revenue.
@JohnBromin 2 days ago ⁺³²
Man I love the visuals in this one. It's been great learning something new every time. There are a few concepts here I don't think I would have understood without the little graphics.
@ErikBod 2 days ago ⁺⁵⁴
Sick 3d animations go vroom vroom.
@thejakfan313 2 days ago ⁺¹⁸
"This will actually somewhat work on some emulators, too"
*Shows a smoking laptop, which is presumably overheating*
I love it
@xdanic3 2 days ago ⁺³¹
10:08 Fun fact: Much of the equipment the Apollo missions used was analog, so not all data needed to run through a CPU
@pafnutiytheartist 1 day ago ⁺¹⁸
Also, execution speed was the last priority. It was (and still is for all space missions) all about reliability for obvious reasons.
@cube2fox 1 day ago ⁺³
The Apollo guidance computer was probably quite memory limited in terms of size
@T3sl4 1 day ago ⁺⁶
And what wasn't, was digitized in the simplest of ways: pulse or frequency counting was used as an ADC (getting, I think, 10 to 18 bit operands usually?). They didn't have integrated peripherals for this, not even dedicated ICs, like we do today. (For fast conversion applications, there were digital conversion CRTs: an electron beam sweeps across a punch-coded plate, producing a serial bit sequence corresponding to beam deflection in the other axis. Not sure who was using these; Bell telephone maybe? Military?) Calculations didn't need to run too often -- a few times a second to update spacial navigation and maneuvering, basically solving differential equations by incremental difference; and managing what digital systems (i.e. on/off switches, relays, lights, display and keypad (DSKY), etc.) were set to automatic (including the autopilot controlling thrusters). It was slow (clock rate low 100s kHz?), but had reasonable bus width (18b?) and a couple of otherwise quite powerful numerical instructions (mul/div/etc.?). Things you might not expect given the low capability generally, but customized perfectly for the workload.
Computer design back then was very different: instead of starting with a standard system, there was simply no such thing, as having a CPU at all was already such a massive hurdle; you have a strong incentive to strip out everything unnecessary, and customize the architecture (not just bus sizes, but parallel/serial, instruction timings, pipelining even, etc.) to suit your purpose. There were no standard instruction sets to pick from (for general applications; arguably IBM's System/360 was the first, perhaps only, standardized instruction set -- but only for mainframe data applications, and this might give you some idea of the scale required to obtain value from standardization, and what the scale of computing generally was like back then!). What we think of today as a CPU, reading instructions and processing data, was a more nebulous concept back then. So, between these things being built from gates, or individual transistors, the tremendous design and hand-assembly effort to put those together, let alone writing ROM (e.g. "rope") and assembling RAM (hand-threaded core!), and the rarefied applications that demanded such lavish expense -- they were very bespoke and specialized systems indeed!
Pipelining is interesting to mention here... System/360 was the first to have it, ca. 1967, according to one article? More important going into the 70s, and again only for the biggest machines that would benefit from it. It seems like a new thing, but it's relatively new _in the consumer space_ to have needed pipelining, or caching or what have you. What used to be supercomputer tech in the 70s, filtered down to single chip consumer hardware in the 90s, and so on. This pattern hasn't changed much: what passed for a supercomputer in the 2000s (multi-CPU, SMP or asym.; vector instructions; etc.) has filtered down, in a sense, to your smartphone today. We've since settled on the best of both worlds: SMP CPU with moderate vectorization, augmented with large-vector parallel processing ([GP]GPU). We carry in our pockets, for the measly cost of a couple watts power dissipation, the power of myriad Cray Supercomputers.
Interestingly, grid or flow computing has long been known, but not gained any traction aside from limited use cases where the flow of data is optimal for the calculation (differential field solvers?). Anyway, modern CPUs and GPUs are so extraordinarily powerful that such applications can still run on them with very reasonable execution time, even if not well suited to the flow and dependency of data (i.e. RAM/cache limited). I wonder if that's changing with the availability of tensor cores today (neural net stuff; ugh, "AI").
(Standard disclaimer: any keywords and inaccuracies are largely from memory, and should be taken as incentive to go and research these things yourself. There are many excellent and accessible articles, going into any level of detail, on the above subjects; highly encouraged!)
@bluedistortions 1 day ago
The Apollo computer used for calculating trajectories was an old gear driven cash register
@daveloomis 19 hours ago
@@T3sl4 Bro, I would read your substack.
@jacekm833 2 days ago ⁺¹⁴
AArch64 (a.k.a. ARM 64-bit) has a "dc zva" instruction that AFAIK does the exact same thing as Create_Dirty_Exclusive but sets the entire cache line to zeroes instead of unpredictable values. It is used in reference implementations of memset released by Arm. So this is definitely a known issue and many modern CPUs can work around it.
@spudd86 1 day ago ⁺⁷
The equivalent to create dirty exclusive is to write to a write combining mapping.
On x86 you can also use streaming writes from SSE2 to do the same thing. It waits for a full cacheline of writes then flushes. It also does the right thing if you don't fill the cache line. You can also do prefetch with the right hints to say that you're going to be writing to it. Other architectures likely have similar streaming writes.
There are a lot of related optimisations. Write combining is mostly for memory mapped devices and things like CPU access to GPU memory. Here write combining or uncached would be set by using the Memory Type Range Registers, or in the page table.
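The "waits for a full cacheline of writes then flushes" behaviour described above can be sketched as a small model. Everything here is an assumption made for illustration: a single 64-byte combining buffer, the class name `WriteCombiningBuffer`, and the flush policy. Real CPUs have several such buffers and more subtle eviction rules.

```python
LINE_SIZE = 64  # bytes per cache line (assumption, typical for x86)


class WriteCombiningBuffer:
    """Toy model of one write-combining buffer that counts bus writes."""

    def __init__(self):
        self.line_addr = None      # line currently being collected
        self.filled = set()        # byte offsets written so far
        self.full_line_writes = 0  # complete lines sent in one burst
        self.partial_writes = 0    # lines flushed before they were full

    def _flush(self):
        if self.line_addr is None:
            return
        if len(self.filled) == LINE_SIZE:
            self.full_line_writes += 1
        else:
            self.partial_writes += 1
        self.line_addr = None
        self.filled = set()

    def store(self, addr, size):
        """Record a store of `size` bytes; nothing is ever read from RAM."""
        for byte in range(addr, addr + size):
            line = byte // LINE_SIZE
            if line != self.line_addr:
                self._flush()      # moving to another line pushes out the old one
                self.line_addr = line
            self.filled.add(byte % LINE_SIZE)
            if len(self.filled) == LINE_SIZE:
                self._flush()      # line complete: one full-line burst

    def finish(self):
        """Flush whatever is left, as a fence would."""
        self._flush()


wc = WriteCombiningBuffer()
for addr in range(0, 256, 16):     # sixteen 16-byte stores = four full lines
    wc.store(addr, 16)
wc.store(256, 8)                   # a line that is only partly written
wc.finish()
print(wc.full_line_writes, wc.partial_writes)  # prints "4 1"
```

As with Create_Dirty_Exclusive, the point is that the old contents of the destination are never fetched. The difference is that a partly written line is still handled correctly, because only the bytes actually stored are sent out.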
@BenKDesigns 2 days ago ⁺³⁵
While I'm a programmer, I'm not really a low-level programmer, and these videos are still fascinating as hell to watch. Love your content, can't wait to play your game!
@metalj 1 day ago ⁺¹⁵
Yep it's a memory throughput issue in the sense that at the moment of this video going up all the best gaming CPUs achieve their top spot on their respective benchmarks exclusively by having an unholy amount of 3D V-Cache. In that sense it's kind of funny that the N64 was almost prophetic in its first-party developers 'not understanding the hardware'. Except nowadays it's not limited to videogames and can close down airports and cost several billions of $ in a single day.
@FluffyFoxUwU 1 day ago
ooooo crowdstrike incident reference
@Reaperman4711 2 days ago ⁺⁷⁹
0:50 BITD, I had a girlfriend with a create_dirty_exclusive mode. It wound up not being so exclusive, and then I got dumped.
@brianb2308 2 days ago
F
@Mizu2023 2 days ago ⁺⁸
were you ram
@GumSkyloard 2 days ago ⁺¹⁸
@@Mizu2023 no, but she was
@reas0 1 day ago ⁺¹⁴
invalidated
@Mizu2023 1 day ago ⁺⁴
@@GumSkyloard Oh right. Saw "dump" and mind went "ramdump"
@johnnywernd2593 2 days ago ⁺²⁷
I've never seen anyone as enthusiastic about the N64 hardware as you and it's amazing to see what you've accomplished so far. However, I keep wondering, if you know so much about the hardware, why haven't you considered writing an N64 emulator yourself? I ask because I'm pretty sure yours could be one of the most accurate since you've accumulated so much knowledge about it over the years. Keep up the good work, btw!
@KazeN64 2 days ago ⁺³³
There are people with more knowledge than me contributing to emulators. (Also, even if I was the one with the most knowledge, I would not enjoy spending my time writing an emulator, I'd rather make my games)
I think the bottleneck for emulators is often not that perfect accuracy is hard to achieve but rather that it is difficult to be perfectly accurate and performant enough to run games.
@mbrofoc 2 days ago ⁺²
@@KazeN64 thank you for your work. It inspires programmers to further optimize their games
@johnnywernd2593 2 days ago ⁺³
@@KazeN64 I totally understand that, and you're right. My point was more about the fact that you are so enthusiastic about the hardware and an emulator from you would be like an added bonus. I understand what you mean about not enjoying programming an emulator, since I'm a programmer too, but I don't enjoy working on emulation.
@drgabi18 2 days ago ⁺¹⁸
5:50 The framerate of the game here makes me think it's more like a CPU simulator, it's gonna be 100% accurate but simulations are still heavy
@vinnyandlin8510 2 days ago ⁺³⁰
So when are you transferring your consciousness to a cluster of n64s?
@stealth7225 2 days ago ⁺¹⁶
No need for a cluster, one N64 is plenty powerful enough, he just gotta unlock the hidden consciousness port with the right optimizations.
@steven0719 1 day ago ⁺⁶
when i think you have run out of n64 hardware vids you keep on dropping em. i don’t regret my sub one bit.
great video
@EnigmaticGentleman 2 days ago ⁺¹⁹
This is actually just a great lesson on computer hardware, like if Kaze's schedule wasn't full I'd say he should definitely do some teaching on the side.
@mathphysicsnerd 1 day ago
This *_is_* his teaching on the side. Surprise!
@goeiecool9999 1 day ago ⁺¹¹
20:26 Poor Henry Kümpel suffering from mojibake. Unless they inserted Ã¼ intentionally as a joke...
@BenWillock 2 days ago ⁺¹¹
Finally, after years of us stupid people asking, Kaze has dumbed it down to our level.
Bus go vroom hehe
@IceYetiWins 1 day ago ⁺¹
Bus go retire
@BSEUNHIR 2 days ago ⁺⁹
We got 3D Rambus (retired) before GTA 6
@guyg.8529 2 days ago ⁺⁷
On modern hardware, you just have the non-temporal instructions, which bypass the cache, but nothing in the instruction set like Dirty-exclusive. At the microarchitectural level, though, the out-of-order (OOO) circuitry may eliminate the useless load if you write to the loaded data immediately. It depends on the load/store queue implementation, and most of the time the OOO memory system tends to do loads before writes, because the ALUs are hungry for data and writes can be postponed or merged in a write buffer. The interaction of this optimisation with prefetching must also be taken into account.
Using the cache as RAM reminds me of what modern GPUs do, notably NVIDIA ones. The L1 cache can be configured to act as an addressable scratchpad memory (yes, the shared memory in CUDA is just the L1 cache reconfigured). It's not surprising, since direct-mapped and associative caches contain one or more RAM arrays.
@ErPiova 2 days ago ⁺¹³
TL;DR: friendship ended with rambus, now cache is all kaze needs (this is exaggerated, but you get the point)
@brianb2308 2 days ago ⁺¹⁸
You can make 2 builds; one with hardware and all other optimizations, the other with only the optimizations that work on emulator. Not ideal having different systems work differently, but as emulators get better maybe your super optimized build would eventually work. Unfortunately I doubt emulators will get much better because they work with the whole N64 library already :/
@ArneChristianRosenfeldt 2 days ago ⁺³
Or try out the capabilities on game load and patch in shims or NOPs if something fails
@eduardoanonimo3031 2 days ago ⁺⁹
I already replied to this before:
Both are already done at the same time.
The same code runs one way on real hardware, but if it detects that it is running in an emulator (through the emulator's accuracy limitations), it can switch to an emulator-friendly code path.
@M1XART 1 day ago
Yeah, Bear Waker had two builds as well, where the other was console optimized.
@njmccarthy 2 days ago ⁺⁴
Aside from loving all your videos and being extremely impressed at the level of detail you go into developing on the N64, in this video, I really loved the Ridge Racer Type 4 track (Naked Glow) at 10:14! Well done!
@Tinkerer_Red 2 days ago ⁺⁸
Would really like to see a playlist of all of your optimizations over time in release/watch order. Would love an easy way to see the progress over the years as you've optimized so much.
@thenimbo2 2 days ago ⁺¹⁰
CACHE RULES EVERYTHING AROUND ME C.R.E.A.M. GET THE MEMORY
@dbarrie 1 day ago ⁺⁶
Cache manipulation is still very much necessary in the modern (console) development space. Most vendor APIs handle much of it automatically, but if you’re trying to squeeze out absolutely every drop of performance you still need to worry about it. Generally just been the CPU and GPU at this point, but back in the PS3 era dealing with the SPUs was a very fun time. Other low-level/embedded development also frequently hits you right in the cache, and it’s almost guaranteed that when things go wrong, the cache is to blame!
@rmod8 1 day ago ⁺⁷
Well done, you've made Schrödinger's Memory
@mathphysicsnerd 1 day ago
I don't remember that part
@Fyshtako 1 day ago ⁺³
Aww the animation you did to explain things was adorable. Great job, high effort videos!
@hyakin7818 1 day ago ⁺¹⁰
man i needed that 2 months ago for my memory management and scheduling class project
@DeltaNovum 2 days ago ⁺⁵
This is the best ELI5 and visual representation of how this all works. Great education. Bravo, chapeau and thank you!
@InkLore-p3h 1 day ago ⁺²
When this man speaks the entire modern gaming industry weeps-he saves microseconds where others can’t save seconds.
@RicoElectrico 1 day ago ⁺³
The R4300i cache is direct mapped, which you explained in a roundabout way. This means that accessing instructions n*16 KB apart (up to the cache line length) or data n*8 KB apart will evict the line already in the cache, because they collide. I wonder if it's possible to instrument such events. This could enable some madman optimizations in tight loops.
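The collision rule described above can be sketched in C. This is only an illustration: the 16 KB / 8 KB sizes come from the comment, while the line lengths (32 bytes for instructions, 16 bytes for data) are an assumption based on the commonly documented VR4300 geometry, and the helper names `cache_slot` and `lines_collide` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cache geometry (sizes from the comment above; line lengths are assumed
 * from the commonly documented VR4300 layout). Both caches are direct mapped. */
enum {
    ICACHE_SIZE = 16 * 1024, ICACHE_LINE = 32,
    DCACHE_SIZE = 8 * 1024,  DCACHE_LINE = 16
};

/* In a direct-mapped cache, the slot a line occupies is fixed by the
 * address bits between the line offset and the cache size. */
static uint32_t cache_slot(uint32_t addr, uint32_t cache_size, uint32_t line_size)
{
    assert(line_size != 0 && cache_size % line_size == 0);
    return (addr % cache_size) / line_size;
}

/* Two addresses evict each other when they belong to different memory
 * lines but map to the same slot. */
static bool lines_collide(uint32_t a, uint32_t b,
                          uint32_t cache_size, uint32_t line_size)
{
    return a / line_size != b / line_size &&
           cache_slot(a, cache_size, line_size) ==
           cache_slot(b, cache_size, line_size);
}
```

Under these assumptions, two data addresses exactly 8 KB apart collide, while the same two addresses used as code would not, because the instruction cache only wraps every 16 KB. A build-time tool could run a check like this over a linker map to flag hot loops whose code or data fight over one slot; whether that is practical is left open here.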
@hypersonic12 2 days ago ⁺¹⁴
I await your Diddy Kong Racing video.
@kyuthefox 2 days ago ⁺³
with the quantum physics cache where we can have the cache change and decide later if we want to commit to ram, we could do speculative execution or branch prediction in software. we can run code without knowing if we should be waiting for the gpu and reduce the idle time. maybe. i have no idea but this sounds like mad programming and i'm here for it.
@mathiastoala7777 1 day ago ⁺²
Peak Emanuar once again giving me the exact size video I needed to enjoy my meal 🗣️🗣️🔥
@TheRealKeymaster 19 hours ago ⁺¹
Wow such an overall great explanation for a CPU and how the internal CPU cache works. This could teach kids in school a lot, it's great!
@Xaymar 1 day ago ⁺²
Modern software video encoders still perform cache optimizations, and some video game engines also do this. It's gotten less frequently done due to the hardware just no longer really requiring it, but it still has performance gains even today. It's why the AMD X3D CPUs are so much faster than the ones without, they're no longer slamming into the RAM latency as often.
@Hublium 1 day ago ⁺¹
IMAGINE A BUS
truly one of the memes of all time
@KazeN64 1 day ago ⁺¹
dont imagine it
see it
@lfestevao 1 day ago ⁺³
Now the bus fits so many more framerules!
(or something like that)
@MrAddemaster 2 days ago ⁺⁶
Bro knows the N64 better than his own room
@ChannelSho 1 day ago ⁺²
When I worked for a company that still used a MIPS based architecture, I ran across this instruction and thought "oh, that's interesting," but didn't think anything of it since the actual CPU implementation seemed to have modern caching features. I'm not even sure if it did anything with it at the lower levels of the bootup code either.
@LS95774 2 days ago ⁺⁷
5:03 whoops, cache momentarily corrupted
@Hollow_Struggler 2 days ago ⁺³
Props to you for explaining such a topic in such an understandable manner, it's a true display of intelligence
@cauhxmilloy7670 1 day ago
Many of these low level explicit cache management instructions are pretty useful for today's modern HPC applications. Specifically, these are great in lockless multithreaded contexts (alongside volatile reads/writes and memory barriers). Really cool video showcasing some sick usecases!
@EskoLuontola 1 day ago ⁺¹
16:16-16:33 I remember reading that Azul Vega had an instruction for zeroing memory without reading the previous value from memory. It was added to make memory allocation faster, because Java initializes all fields with zero when allocating new objects. It improved performance greatly - there was always plenty of memory bandwidth available. They had asked for Intel to add a similar instruction, but at least back then x86 didn't have anything similar. I don't know how the situation is in recent years.
@salvatronprime9882 1 day ago
This is the most educational "practical programming" channel on youtube.
@jimmyraconteur 1 day ago ⁺¹
Now I want to write a musical concept album called "Create Dirty Exclusive"
@toxicNautilus 2 days ago ⁺²
Me nodding along as if I know what Kaze is talking about when he describes technology more complicated than a rocket for a moon landing.
@thetruegoldenknight 1 day ago
The N64 styled visuals for the analogy are just ADORABLE! :D
@pkillboredom 2 days ago ⁺³
The "computer science lore" joke at the beginning was peak.
@Samwow 1 day ago ⁺¹
Kudos on the visual presentation, was very fun to watch!
@erockbrox8484 19 hours ago ⁺¹
Nobody cares about the optimizations, they just want to see you release a finished game.
@GaudyGabriev02 2 days ago ⁺³³
Kaze, at this point maybe you could fix Donkey Kong 64 so it works without the Expansion Pak
@michawhite7613 2 days ago ⁺¹³
Ironically, I'm pretty sure this mod requires the expansion pack
@ssg-eggunner 2 days ago
@@michawhite7613 rtyi64 doesn't normally require expansion pack, but using it does help with making performance extra stable
@3dmarth 1 day ago ⁺⁵
I wouldn't be at all surprised, if he wanted to spend the time.
If it's true that the Pak's main function in DK64 is to store cached lighting data, then Kaze could probably just optimize to the point where the N64 can render the lighting in real-time and avoid caching anything.
@ericlizama8552 1 day ago
Iirc, the lighting data only needed to be calculated once, then stored for reference later.
@1ups_15 1 day ago ⁺¹
I like your funny words magic man!
@playerguy2 1 day ago
x86_64 (amd64) includes, as standard, the MOVNT* instruction family.
In classic x86 mnemonic style, the name means MOVe, Non-Temporal (as opposed to "don't move"). ARM has something similar, at least on ARMv8.
Non-temporal moves generally mean "ignore data dependency/ordering constraints and don't cache the data". In the case of a store, the data is not fetched from RAM first.
Fence instructions are still respected, however.
The x86 architecture standard is so friendly that the non-temporal part of the instruction is only a hint and can be overridden by the target memory type.
This optimization is not applied by any compiler I'm aware of.
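As a rough illustration of the non-temporal stores described above, here is a minimal C sketch using the SSE2 intrinsic `_mm_stream_si128` (which maps to MOVNTDQ), followed by `_mm_sfence`. The function name `fill_streaming` is hypothetical, and the portable fallback branch is only there so the sketch still builds on targets without SSE2; it does not bypass the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill `blocks` 16-byte blocks at `dst` with a repeated 32-bit pattern.
 * `dst` must be 16-byte aligned. (Hypothetical helper for illustration.) */
static void fill_streaming(void *dst, uint32_t pattern, size_t blocks)
{
    assert(((uintptr_t)dst & 15u) == 0);
#if defined(__SSE2__)
    __m128i value = _mm_set1_epi32((int)pattern);
    __m128i *out = (__m128i *)dst;
    for (size_t i = 0; i < blocks; i++)
        _mm_stream_si128(out + i, value); /* MOVNTDQ: non-temporal store hint */
    _mm_sfence(); /* order the streamed stores before any later stores */
#else
    /* Fallback for non-SSE2 targets: ordinary cached stores. */
    unsigned char *out = (unsigned char *)dst;
    for (size_t i = 0; i < blocks * 4; i++)
        memcpy(out + 4 * i, &pattern, sizeof pattern);
#endif
}
```

This kind of store suits large buffers that are written once and not read back soon, such as a freshly cleared framebuffer. As the comment notes, the non-temporal behaviour is a hint, so the actual effect depends on the CPU and on the memory type of the destination.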
@ukyoize 1 day ago ⁺⁴
I can't believe that x86 despite having stcpy as an instruction doesn't have any cache instructions.
@guyg.8529 1 day ago ⁺⁴
It does have some cache instructions, for prefetching and for invalidating a cache line (maybe some more), and also non-temporal loads/writes. Some of them are part of the SSE instruction set extension.
@gdclemo 1 day ago
@guyg.8529 given things like Rowhammer and Spectre that come from cache manipulation, maybe not giving even more cache control to userspace is a good thing.
@Gamers_of_Oz 1 day ago
This is the greatest visual analogy I have ever listened to and seen
@ssbmoro 2 days ago ⁺²
imagine if the bus that ran every framerule in SMB1 had a cache
@sw33t.angela 1 day ago ⁺¹
Probably the easiest way to do the stack in cache for the library functions is having C macro functions that do the dirty exclusive cache op before, load the cache, call the OS function, and invalidate the cache page after. Then you rewrite the OS library functions to load from cache and delete all parameters.
@sw33t.angela 1 day ago ⁺¹
However, if you have function calls to other library functions within the call, you'll need to track the cache-stack size. This can be done in a macro by loading the cache tracking the stack size, offsetting from it and adding the total offset before the function call, then decrementing it. Either that or you inline the functions.
This does add overhead, however.
@ArneChristianRosenfeldt 1 day ago
So ARM and SPARC with their stack instructions do this already?
@sw33t.angela 1 day ago
@@ArneChristianRosenfeldt yup just realized that mips does too
@Bemental77 15 hours ago
Incredible work. I've never commented on your channel, but your ability to translate here is phenomenal. You should teach.
@TylerjWebb 1 day ago
I was so excited to see you had a new update. Love your videos.
@ThePurpleCheeseMan 2 days ago ⁺²
I'm so hyped to try this game of yours. It's really impressive just how far you've managed to take N64's capabilities.
@roboman2444 1 day ago
Deciding whether or not to write back the cache could be useful. (18:35)
I'm not sure if the N64 supports things like occlusion queries, but if it does, you could do something like this:
Send off an occlusion query task to the RSP.
Use create_dirty_exclusive to set up a "scratchpad" in cache.
Calculate the rendering task you want to happen if that query succeeds, and store the data in the scratchpad.
The query result comes back from the RSP.
If it succeeded, write back the task data.
If not, invalidate using hit_invalidate.
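The commit-or-discard flow in the steps above can be modelled in plain C. This is a toy simulation of a single write-back cache line, not N64 code: the `ToyLine` type and the `toy_*` functions are invented for illustration, the 16-byte line length is an assumption, and the "query result" is just a boolean passed in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 16 }; /* assumed data cache line length */

/* Toy model of one write-back cache line (hypothetical, for illustration). */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t line_no;          /* which RAM line this slot currently holds */
    uint8_t  data[LINE_BYTES];
} ToyLine;

/* Claim the line for `line_no` without reading RAM; contents start undefined. */
static void toy_create_dirty_exclusive(ToyLine *l, uint32_t line_no)
{
    l->valid = true;
    l->dirty = true;
    l->line_no = line_no;
}

/* Drop the line if it holds `line_no`; RAM is never touched. */
static void toy_hit_invalidate(ToyLine *l, uint32_t line_no)
{
    if (l->valid && l->line_no == line_no) {
        l->valid = false;
        l->dirty = false;
    }
}

/* Copy the line to RAM if it holds `line_no` and is dirty. */
static void toy_hit_write_back(ToyLine *l, uint32_t line_no, uint8_t *ram)
{
    if (l->valid && l->dirty && l->line_no == line_no) {
        memcpy(ram + (size_t)line_no * LINE_BYTES, l->data, LINE_BYTES);
        l->dirty = false;
    }
}

/* Build task data in the cache only, then commit or discard it. */
static void toy_speculate(ToyLine *l, uint8_t *ram, uint32_t line_no,
                          const uint8_t task[LINE_BYTES], bool query_passed)
{
    toy_create_dirty_exclusive(l, line_no);
    memcpy(l->data, task, LINE_BYTES);            /* scratchpad write */
    if (query_passed)
        toy_hit_write_back(l, line_no, ram);      /* commit to RAM */
    else
        toy_hit_invalidate(l, line_no);           /* discard, RAM untouched */
}
```

One caveat the toy model leaves out: on real hardware a dirty line is written back as soon as another access evicts it from its slot, so the discard path only holds if nothing else maps to that slot in the meantime.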
@JeanOJesus 1 day ago
The amount of effort you put in to illustrate the reasoning is admirable. Keep doing the great work! Thank you very much for sharing knowledge and congratulations
@theultimatetrashman887 2 days ago ⁺¹
There is a lot of Cache in old Source games. Once you load it in and store it, your game always becomes way faster than it was, and it's only a one time thing! (at least in there)
@hyakin7818 1 day ago ⁺¹¹
bro is porting gta 3 to n64 soon i swear
@Redpoppy80 1 day ago ⁺¹
I would love to hear the unique challenges of Diddy Kong Racing on a programming level.
@timmygilbert4102 2 days ago ⁺⁴
From cache to cash:
so we have control over cache memory without cost, with some constraints? That is, complex operations can stay in cache for as long as we need before rambus meddling? Can we then use compression as a way to sink the extra cpu idling and virtually increase bandwidth and cache memory?
@ArneChristianRosenfeldt 1 day ago
I always did wonder if the barrel shifter in the ARM CPU on the 3DO was meant for efficient data bit packing for LZW and Huffman. That CPU also has cache. MIPS ISA is different.
@MeriaDuck 2 days ago ⁺⁵
"I'm micromanaging more than Jeff Bezos his employee's p breaks" 😂
@Tsaukpaetra 1 day ago
Absolutely love the N64 rendered graphics for the story telling.
@toxicroak_gaming6754 1 day ago
I like how you explained it all! It helped me to understand a little bit deeper on these concepts. The cache is a VERY powerful tool if used correctly that not many Comp Sci people know about. If used poorly though (code has bad spatial or temporal locality), it can really screw up everything, so it's a "use with caution" type of thing
@kgreen 20 hours ago
These deep dives are great, never doubt the validity of your content.
@Elesario 2 days ago ⁺²
That last one where you can decide whether the cache you wrote should be written back to RAM or not made me think of transactions in a database. I'm sure there's going to be code situations where you generate some data and then keep it or discard it based upon whether it passes some test, although seems niche.
@Cramhead43 1 day ago ⁺¹
Another classic Mario analogy using a bus. Well said Kaze!
@Thewinner312 1 day ago ⁺³
Honestly, it gets a bit confusing when you try to make a city metaphor out of everything.
@bigchungus7870 1 day ago
I just came here to say how much I love the mario renders you do for the thumbnails
