Saturn homebrew dev here. The leaked English coding docs for the Sega Saturn are known to be full of errors as they were rushed through translation or something.
Yeah... Most likely. Homebrew kits for alcohol and beer especially have been on sale long before computer games were ever a thing. (Home BREW. after all. Brewing is how you make beer in particular.) How this transitioned to use in game development I don't know. It's also a bit weird that the term has gotten ambiguous enough that we start calling people working with old microcomputers 'homebrew' devs. Console development without an official license and dev kit is one thing. That's not a normal thing to do, so 'homebrew' kinda makes sense. But one of the core differences between a console and a microcomputer is that anyone can develop microcomputer programs, and has always been able to. Me writing a SNES game is a bit unusual... But writing a game for my atari 800XL is only unusual in terms of how old that system is. Back in the 80's you had a lot of 'bedroom coders' who did just that, and then got a publishing deal and released their work commercially;
@@segaunited3855 there is more to it than that. SoJ kept projects like Saturn secret from SoA and is why they wasted time and resources making warts for the Genesis.
This reminds me of a quote Jon made in 1997: "while the PlayStation was easier to get started on ... you quickly reach [its] limits, whereas the Saturn's "complicated" hardware had the ability to improve the speed and look of a game when all used together correctly." Judging from the mastery of the DSP, it seems so. I mean, having an entire 68K all to yourself just to produce sound alone is pretty incredible.
Playstation was simple and hardware wasn't anything over the top. Saturn had the bits to pull far above anyone then but was difficult to get everything talking at the right time but some did pull it off, look at the quake ports, they said it wasn't possible but they did it anyway because the company said they could, sadly it didn't go much further than that but like most consoles, games get better as years go on as people have a better understanding of it, playstation was to easy and max out in 2 years, saturn still hadn't reached it's limit, i think there was one game that came close but just needed optimization and it would have been better than anything released including PC games as they were on the rise
I have to wonder what kind of game you could make if you actually utilized every part of the Saturn to it. It seemed very powerful for its time but no one ever used it right.
@@PrinceSilvermane Just as Jon explains extremely well in his videos, it was a complex piece of hardware. Games that can use it best are ones where both VDP processors can be fully utilized, the DSP has a reasonable chance to get used, and both SH-2 CPUs have work to do that can be done without having to fight for the memory bus too much. All in all, either a game is designed specifically with its architecture in mind (e.g. Panzer Dragoon), or a game concept is slightly tweaked to use the system as well as possible (e.g. Sonic R). In Sonic R TT did a really tremendous effort and it is among the best one could hope to obtain on the system, considering a reasonable development schedule. I have the utmost respect for Jon and his team managed to do. With infinite time available one can always do something better, but it is not representative of a realistic commercial game development process...
Very well explained indeed. As an oldtimer PSX and Saturn programmer too, I can appreciate how challenging it is to explain plainly the very concept of simultaneous operation in the different units of a DSP. Something interesting is that the DSP multiplier unit actually performs the multiplication operation between X and Y every cycle, and the instructions are used to tell the unit at any point in time whether one is interested in the result or not... At any rate, the TMS C6000 DSPs are tricky beasts too, as different operations take a different amount of time to complete (like many other pieces of hardware), but the system does not "wait" for an operation to complete before moving forward (pipelining), and one has to (when programming in assembly) take that delay into account to avoid using a result at the wrong time ;)
Good ol' race conditions. This is the same reason making multi-threaded programs (and games) is difficult. It's like coding for the DSP, but you never know when things will finish, and you have as many threads to work with as your heart desires, but overuse will give performance penalty instead of performance benefit.
Jesus man I assumed at the very least the saturn would have pipelining! Also Thanks jollyrogger for your work on making sonic xtreme playable again. I felt like you never got your due for how much work you put in to getting all those different versions working. It seemed like people didn't appreciate it that much and i just wanted to let you know that as a long time programmer and sonic fan i appreciated the fuck out of it :)
@@littlebigcommentary Thank you so very much, if I ever have time I will go back to it, but now there are a few young developers who have time and talent, and can put the Saturn hardware to good use :)
Which is EXACTLY why most games still don't use more than two cores even now lol. Single core preformance is generally still the best bar to see for benchmarking games then multithreading ones.
Very interesting. Thanks for presenting that in a somewhat understandable manner. I was able to follow but of course that doesn't mean my mind fully grasps it. Congrats on learning the Saturn so well. I had heard that Sega kept a lot of the more complex coding abilities for themselves so that their games looked better than 3rd party games. I don't know how true that is, but it's what I read in magazines back in the day when 3rd parties were complaining about programming and Sega not exactly being helpful in their documentation or programming tools.
@Joaquín Nuñez That's just conjecture... the "no way they would do that" part. Keep in mind that SoJ was run by Nakayama who was a very strange cat indeed.
Game Sack, you're not the only one who gives Nakayama that reputation. I'm still not forgiving him for his hostile takeover in plans for the American Saturn. (Edit): Especially how he did it; as follows (Not part of edit): Turning down Silicon Graphics, insulting Sony, rushing the 32X (which doesn't regard the Saturn, but still make a large confusing position for the Sega fanbase), and pushing the release date from 9/2/95 to 5/15/95.
As somebody who has only barely dabbled in the fundamentals of coding, seeing a breakdown of hardware on such a technical level like this is both fascinating and humbling. It captivates me with how much mastery of the material you and other programmers needed to have at the time for such a dizzying machine while also being awe-inspiring to the point where I want to try and learn even more about this. Thank you, Jon!
After watching the video, now I know that I'll never have the expertise to make a simple game, even though I'm a developer myself... Coding games in the 80's and 90's were the real deal, with all those people (especially you Jon) digging into those pieces of hardware to get the maximum performance. EDIT: I started programming at age 20, I got my first computer the year before and I was determined to become a good developer; I loved playing games and I was very interested in learning C language to understand the source codes I downloaded from the Internet (about 2000-2001) including some Net Yaroze games I enjoyed playing. I was very proud of myself when I completed a playable version of Tetris in my first year of C learning, but then I started working on a consulting firm and time passed... and now I feel that I have wasted all this time. So when I see this kind of video, I can't help but thinking that I could have learnt those things instead of letting time go. It's too late now to revert those bad decisions....
This comment made me appreciate today's tech and specs. Though making games today is still difficult, even with "easy" to use engines, we're spoiled compared what people had to go through back then to make a game.
The jump from 2D to 3D created a split between ‘Engine Developer’ and ‘Gameplay Developer’, it’s all a matter of specialization. But, heck, you can learn to build games for NES or GB without too much effort. If all you’ve done is JavaScript, and you haven’t taken any computer architecture classes, sure there’s a learning curve, but 80’s kids built games for micro computers without too much fuss, and these days an emulator will have debug tools undreamed of at the time
@@johnsimon8457 I have almost entirely on my own created a SNES game engine from scratch, and although it isn't much of a game as of yet, it's still got perfect terrain collision and an infinite (well, as infinite as the cart space) world. It's all a matter of how much time and effort you wanna spend on it.
I think it's a good thing that it's easier to make games nowadays. Not just with the easy to use and free engines, but also pretty much every other software you might need can be free too. Not to mention the abundance of information, tutorials, even straight up lessons that anyone can get for free. A lot of creative minds can then express themselves without working for a big company.
nowadays coding games is still difficult.... but in all sorts of different ways. first games are more complex, all the math the sonic R used... you will be using more for things like simple texture effects and so, and you gotta code every single one of them, is essentially just a lot more to do. but the major technical difference nowadays is that you don't really have to worry about the processors having the horsepower to do anything, if you do things right, they do. the problem is the memory latency. essentially it takes more time to pull a random number on memory than the math you gonna do with it. so you need to code your program in a way that the processor can predict what is the next thing you gonna need from memory, and that can strange
Oh neat, I didn't realize there was a VLIW core in the Saturn. It seems Sega included every possible kind of CPU they could, 2 RISC (SH-2), 1 CISC (68K), and the VLIW DSP core. Truly crazy hardware design. Thanks for sharing!
SH-2 is Dual Cored. NOT Dual CPU'd. The Chips are exactly the same, they're separated on two Wafers due to the Rushed taping of Saturn, that was incompleted due to CSK's Owner Isao Okawa ordering its Taping and Beta design to be rushed out 4 months too early. The SEGA Saturn is Not a 32-bit only console. Its ACTUALLY a 64-bit Console. An Unfinished one. Saturn's design was only half finished. SEGA released an 80% compete product as Saturn's Documentation wasn't even fully finished.
@@segaunited3855 Yes, they're identical, but they operate independently. It's 2 separate 32-bit CPUs. Multicore/multi CPU effectively mean the same thing.
@@superandroidtron You are correct. They do operate independently, but the reason why the share the same DSP Bus is because the MIPS Instructions are all embedded it. Each side pushes Simultaneous Dual 32-bit Instructions as the SH-4 is QUAD threaded and Dual Register Coded. We have a PDF Document of Hitachi SH-2 Aurora. That shows how to get 64-bit instructions on its ISA setup using 6 Instruction Cycles.
@@segaunited3855 I really don't understand what you're saying. None of the CPUs in the Saturn are MIPS cores and the SH-4 (which isn't even in the Saturn, it's in the Dreamcast) is a single-threaded dual-issue CPU. I also can't find any documentation for an SH-2 called "Aurora."
@@superandroidtron Didn't say Saturn's CPU is MIPS Cored. IT HANDLES MIPS. The DSP can stream up to 50 MIPS to the SH-2 Aurora. People seem to think that Saturn can't handle MIPS when it certainly does. It has Double the MIPS of PS1. Aurora is the codename of the Saturn's chipset. The name comes from a "In between combination of JUPITER and Model 2". Its basically a Low End OTS built of Model 2 Hardware. Combining the Phase 2 of Jupiter's "System 32 with Model 2 3D" with a Full Fledged 64-bit Model 2 Powered project. SEGA names the CPUs of its consoles after its codename chipsets. Aurora is the Codename of the SH-2 ISA RISC of Saturn. BTW, RISC has evolved into MIPS. Today, Renasas uses MIPS for ISA.
@@dhkatz_ yeah but this video is a collection of statements. You can go through them one by one and single out which is the first one that doesn't seem meaningful, contains unknown terminology, or presents an apparently logical conclusion that you don't know how the author arrived at. You can request clarification for all such statements one by one and in turn all statements in replies that pose a similar issue and eventually you shall arrive at complete understanding.
This DSP chip would require quite a different paradigm compared to most other chips. Now that I see how it works, I can see that execution path modeling could be used to fairly easily cook up code for this thing, but debugging would be an absolute nightmare for sure.
The PS2 and the PS3 chipset also got their fair share of coding trouble. The PS2 runs a MIPS core with 128-bit SIMD functionality, it runs the MIPS III ISA with specially added extra instructions. the FPU is not IEEE compliant Than come two VLIW units which are simila, but get used for different functions. Oh, and an additional video decoder and an external RDRAM MMU. It got 2 audio processors, one in the main MIPS CPU and the other to emulate the chip found in the PS1 The PS3 runs a combination of one dual threaded, dual issue, in-order core running PowerPC ISA 2.02 and six additional dual-issue units with 6 execution units each and indipendent memory management but without branch predictor, they run their own ISA, have embedded SRAM. All of that connected via a ringbus. Syncing all that stuff correctly must've been a nightmare.
SEGA and Yu Suzuki and the AM2 team were so elite with programming back then. SEGA built the best arcade experiences back then and tried to bring those experiences home. Sadly they built hardware that was so complex for programmers that only the very best could use it to its full potential.
@7MGTESupraTurboA so true 32x not only took a lot of money and resources but a lot of launch and later titles from the Saturn. They should have cancelled the 32x in 94
@7MGTESupraTurboA Can't forget about the Sega CD lack of super scaler games and the push for more FMV games as well. SOA should have consider the Cart version of the Saturn if there were serious about the price point of the 5th gen console. I agree with you but however, Sega Japan made some bad decisions such as the Saturn surprise launch, presenting the Mars project to SOA and the lack of tools and utilization of the Sega CD ASIC chip.
The main problem with the DSP was the amount of time it took to load the matrix and set up the DMA for transforming the points. For transforming a small number of points, it was faster just to do it in the main CPU. At least that's what I found when porting Assault Rigs from PlayStation to Saturn and trying to get it to run at a reasonable frame rate (which kind of failed on the busier levels!).
That's a common problem when you have dedicated hardware for a task; Getting the GPU in a modern PC to do a transform on 500,000 vertices is very fast. But getting it to do a transform on 3? The setup time and draw calls will eat you alive. (actually draw call minimisation is one of the most important game engine optimisation skills of the last 20 or so years on PC - and presumably also modern consoles, since they're almost the same architecture as the PC...)
I heard this is an issue for even SIMD on the same CPU (example: SSE2). You can do things like wreck the CPU pipeline to where it's not worth using SIMD
@@ArneChristianRosenfeldt Except that the Jaguar was designed from scratch to use the custom RISC CPU as a DSP--usable for audio or even setup for the 3D graphics pipeline. On the Saturn, a DSP that looks like a Transport Triggered Architecture (rather than RISC) had a bit of a learning curve for the average programmer.
Wish this video existed about 2 months ago :( My students had to cover this topic (use of Matrices in 3D computer graphics) for an assessment task. I'll probably use this video as an example of application of the technique next year. Cheers :)
@@MrSapps no, just computer science principles, and use of Matrices in graphics applications is one of them. He only briefly touches on it at the start of this video, but it's good to see it in a practical application.
i don't think it's the best idea to use this as a reference, since it might scare them off. Modern CPU/GPU architecture uses SIMD (Single instruction, multiple data) operations instead, which I think means that you can for example, add/subtract/multiply/divide two vectors in one instruction, rather than multiply the individual components of that vector with different instructions that happen to run in parallel.
@@luke_rs hey random question but I wonder if you could help me a bit. What are the basic things I need to learn/research to be able to, Draw a 3d model in a screen? I'm learning stuff alone and sometimes it's hard because you don't know the terms and how shit is called.
FUN FACT. a game on Saturn called MDK was named after the memory of the DSP chip as it was a continual game where the accumulated numbers changed the visuals on screen. Each level just kept going and going until a time clock ended. Each level was coded with a single start “code” and the Saturn itself filled in the level, much in the same way a procedural game is made
I can't wait for the follow up video. What I like most about the saturn is it gives developers so many obscure options to get the results they want. You're made to manage exactly what part of any processor is handling a specific segment of code while another part of that same chip is processing two entirely different things. They don't make hardware like it, a highly specialized hyper paralleled system that evolves 2D game play to the next level and pushes 3D in a way the inter-mingles with what it can do in 2D. Antiquated and advanced, simultaneously! That's why I love learning about it. It's so highly specialized that it earns its own understanding.
@@retractingblinds Correct. Saturn used Trapezoids and Reshaped Sprites(N64 also used Reshaped Sprites but relied on small Triangles for Polygons). PS1 used LARGE,bloated Triangles and Simultaneous Renders. But it was far inferior to Saturn(due to it lacking Recalculation,) and N64(Lacked Perspective Correction and Z Buffering).
Yeah this reminds me a lot of the brief fad of VLIW CPUs like Itanium and Hitachi SH-4. Of course most modern CPUs are VLIW under the hood, with a CISC (or even RISC) instruction decode step that basically recompiles to VLIW on-the-fly. I was always interested in the Transmeta Crusoe since it basically did that except with the decode/conversion step in software. It's a shame that never really went anywhere on the consumer market though, aside from being in a handful of early low-power subnotebooks like the Sony Picturebook. Sun's MAJC architecture also had some interesting design choices and it's a shame that never even made it to market as far as I know.
fluffy: Transmeta was basically impossible to use as a standalone CPU that runs a modern OS, and it wasn’t, unfortunately, that fast, despite x86 JIT being very good. In particular, it lacked good SIMD support with SSE only appearing on Efficeon, too late. But it would be great, of course, if they made it open so the people could run recompilers for other architectures. Before Pentium M appeared, Transmeta was quite popular, especially in Japan. I have Fujitsu subnotebook and HP tc1000, both use Crusoe.
@@noop9k Oh, sure, they never made a standalone version (and having the software instruction decode layer was pretty much the entire point) but it'd have been cool if they'd gotten to the point of supporting multiple ISAs or whatever. Or even had a version with the VLIW ISA exposed directly. I had a Picturebook C1VN for a while and while the performance wasn't great, it was good enough and had *amazing* battery life compared to similarly-performing machines of the time. Of course Atom totally fits in that niche now (and for non-x86-legacy stuff ARM is doing quite nicely), so it's not like we've lost out in that market or anything. It's just always interesting to think about what could have been, or where the technology might have gone if it didn't end up just becoming yet another pile of zombie IP.
Your works have inspired me to become the senior 3D/VR artist that I am today. I just wanted to thank you. Seeing the face of the man behind so many of my favorite games back in the ‘90/‘00 was amazing, knowing that you are a nice person (all the efforts and the charity you are doing) was once again inspiring. Kudos my distant and long unknown mentor, kudos.
Gosh, this is just a masterpiece! I wonder if modern games have something like that, even not exactly like that, but still just beautiful and complex. Also, i just love how you title everything "Impossible %THING%"
I might know of one piece of silicon that might be harder to code for: The iAPX432. I think I might be the only one here thinking "DAMN! THAT CHIP IS AWESOME!".
@@Redhotsmasher Ah definitely the Cell was a tough partner, but the individual programmable units in the Cell weren't too bad after all, in fact not all that dissimilar from the VU units in the Playstation 2. DSP programming has always been tricky, from the early TMS and NI chips to the more modern ones. The saving grace in modern systems is the availability of good tools (mostly the compilers) that help with the instruction scheduling and pairing, filling up the pipelines for you rather than having to do it by hand. On the other hand, having intimate knowledge of the instruction set and programming in assembly can sometimes result in algorithms that are very difficult or impossible to express with a high-level programming language, and that therefore a compiler will not emit on its own...
@@Redhotsmasher Cell was basically a bunch of PPC750 (what Apple called the G3) cores with added SIMD vector instructions tied together with a high-speed memory bus accessible via DMA. It was painful to work with at a memory controller/task scheduling level but the cores themselves were incredibly straightforward to use. Later on Sony released a scheduling library called SPURS which made that a lot easier to deal with, as well. You still had to worry about breaking your tasks down to fit into each individual core's workspace memory but that was more of a design thing than an implementation thing.
I cannot thank you enough for this type of information. Thank you for explaining it in a way where someone with no coding knowledge can somewhat understand.
I am an electrical engineer and digital hardware is my favorite area, so I found this especially fascinating! Even cooler is how you managed to take advantage of the insane and convoluted hardware. I think I followed a good bulk of it, though I'd probably have to watch it again to fully grasp all the math steps going on in each calculation. This was a really neat insight into how you managed to program so much to work at once, all the while having to deal with such a complicated piece of equipment. Very well done!
@@mogo9052 Haha, I'm not sure I would count on that. While I can do at least some programming, it's not my strong suit, unless you're talking Verilog or VHDL firmware. I'd say I'm above average as far as EEs tend to go (I impressed a lot of people in my embedded systems class, to say the least), but still, I'm more interested in the design and functionality of the hardware rather than actually making the software to go with it. I do appreciate the thought though!
That's fascinating, I've never seen that form of paralellization. This kind of code really requires comments for every line to be maintained though, lol.
Hi Mr. Burton! After roughly 26 years, I've finally finished Sonic 3D Blast on Genesis. And with all emeralds! I wanted to let you know I enjoyed the game and that your videos helped me appreciate all the work that went into it. Thank you to you and the team for everything y'all did!!
If you want to understand the basics of this video, you can check this tutorial: skilldrick.github.io/easy6502/ It will teach you the very basics (put a pixel on screen, add, subtract, etc) on the same CPU that was used in the original NES.
@@DlcEnergy Well I started a just few weeks ago. There are a few good links in that same tutorial to learn every instruction code that the 6502 have. Then you can go to the specifics of the NES on this link: nintendoage.com/forum/messageview.cfm?catid=22&threadid=7155
Basically, normal CPUs at the time would take in ONE instruction at a time. (Like, put the number "2" in slot 1) The DSP can do FIVE instructions at once. (Like, put the number "2" in slot 1, take the number in slot 3 out, etc.) In essence, you could call it an octopus CPU, because of how many things it can do at once, like the phrase you hear someone say "I only have 2 arms!", when they are doing two things at once and can't do another.
Oh the joy of parallel computing. Speaking from personal experience, it indeed is hard to get used to a multithreaded or even worse, a multicore environment especially if you are coming from a single core environment. When we started making games ourselves, everything was single core. Taking advantages of multicore, requires you to think completely different. You need to build your engine from the ground up with multicore in mind. And that’s difficult. That said, we could still offload a lot of lower priority or specialized tasks to other cores. Such as network updates, audio rendering, difficult calculations, particle engine updates (syncing with rendering is tricky), and various game systems that had a lower priority or could be updated out of sync with the main thread. For instance we had a sensory system that was complex, but could run “one frame behind” the main update loop. No one would notice. Offloading that to another core, was a massive gain. But the worst bugs arise when there is a race condition. I remember spending a week or two on an obscure audio bug on the Wii U that only occurred once every couple of thousand of samples. Reproducing the bug took hours; but always resulted in a massive crash. Nintendo obviously found the bug as well. So it was a big showstopper. After making a simulator that spawned thousands of audio samples each second inside the game, I managed to get this repro time to down to 5-10 minutes. Still not ideal, but enough to finally find the bug: some core initialization order and a critical section that was thread safe, but not if those threads ran on different cores... oh the joy... I still have nightmares of that garbled audio that was produced by running thousands of samples a second 😊
Crikey, coding that must have been a real brain twister - what can and cannot be done in parallel, which result do you need before some other calculation uses it, etc. I only hope there was a repeating pattern you could grasp. p.s. Extra thanks, considering what happened in Malibu 👍
I like how you show the code/diagrams while also giving an oversimplified explanation. Those who know what they're looking at can stay interested and (i think) those who don't really know can follow
DSPs are very common in mobile phone base stations. I know at Ericsson they had entire achitectures built with many many of these and c code complied into VeryLongInstructionWord assembly to program them. So yeah, this is not ”acient technology”, it is used every day when you call your mother or whatever.
An exposed pipeline and Multiply-Accumulate units is pretty standard DSP architecture. Not a lot of revolution has gone on, mostly just evolution. Datasheets will still have a block diagram that looks very similar these days
I'm a third year university student, and I'm just now learning to program in assembly code. Your videos have inspired me to take up a project programming for the SNES. The technical details for things on the channel are always amazing!
Saturn and N64 have aged with Grace. Mario 64 still looks very colorful and vibrantly polished. Also, the Duke Nukem 3D Saturn port absolutely DESTROYS the PS1. Has Superior Lighting Effects,too.
@@RallyDon82 IKR? Its a shame on how BADLY Sega of America dropped the ball on Saturn. They NEVER taught people how to properly program and code games for it.
Nice implementation of a VLIW processor. Maybe they could've implemented Tomasulo's algorithm to simplify programming but still get a good performance. Anyone has any idea why they didn't use it? The algorithm was developed in 67, so it was already around by the time of the Saturn.
Wow great stuff Jon keep it coming, you explained the DSP so well. I’ve not touched Assembly Code since second year at Uni, it was great fun playing around with it.
Single Instruction Multiple Data (short SIMD) stuff was never really easy. Even today compilers sometimes struggle with it. For my job i needed to learn NEON assembly for our ARM processor. It's different as only one instruction is executed which handles a whole matrix of data instead of the DSP which handles multiple instructions at the same team. But still the the amount of different instructions of NEON was so high that the current GCC was not capable of correctly compiling code for this thing. Also lot's of register magic happened as this thing got 64 Bit Registers and 128 Bit registers which weren't other registers but instead the 64 Bit ones concatenated together. Lot's of thought and documentation was needed to maintain that monster. Keep up the cool videos sir.
Thank you for this video. I have to say I'm surprised the Saturn requires such complex instructions, even though I know very little about computer programming. I seriously thought you guys had some sorts of tools made by Sega to help your work, but it looks like programmers had to start from scratch.
On ALL 5th Gen Consoles. Programmers had to start from scratch when writing Assembly for a new Game. It was just easier on PS1, but just because its easier doesn't make it a superior piece of hardware. PS1 is WAY inferior to Saturn and Nintendo 64 all it did was draw basic Polygons better by making them larger in size and with Simultaneous Renders.
I don't comment often, but I absolutely loved this video. The technical aspects of programming video games truly fascinates me, and you've made it fun to watch!
Ouch. That gives me a headache. Managing That level of parallel code is not a pleasant experience. (It's bad enough when you're dealing with a full symmetric multiprocessing system - but this is much like hyperthreading - except a CPU with hyperthreading manages code balancing by itself, where here you have to code it directly.) The n64 also had a bad reputation for being difficult, but looking at why that was, I don't think it's even remotely comparable. The problems on the n64 are related to high level optimisation, and some frustrating bottlenecks in the architecture that required careful workarounds. Also Nintendo absolutely refused to provide microcode documentation for the Graphics processor until quite late into the system's life, meaning you were entirely dependent on the handful of pre-written routines Nintendo provided to get any 3d acceleration at all. - easier, sure, but not conducive to getting the best out of the system. In hindsight looking at what the system contained it's not at all in the same league of complexity. You had a MIPS 4300i CPU with a floating point unit, and a Graphics chip that consisted of a DSP with fixed function 3d logic, and a second MIPS 4300i core. This core differed from the main CPU core only in terms of having a large dedicated cache memory, accessing memory differently in general, and having a vector co-processor in place of a floating point co-processor. The instruction set was obviously much the same. Leaving aside not being given any low level documentation at all, the difficulty seems to arise not from the complexity of any given part of the system, but the interaction. The Main memory is RAMBUS ram - which is very fast, but has high access latency. (meaning working with RAM efficiently is tricky), the main CPU is incapable of accessing memory independently, so it has to get the RCP (aka the graphics chip) to do it on it's behalf. The texturing unit has a 4 kilobyte 'cache'. Which is just about enough to store a single 64x64 texture at 8 bit colour depth. (only 32x32 if you have mip-mapping enabled). This wouldn't be so bad, but despite being called a 'cache' it's manually controlled by the programmer, and in standard microcode implementations isn't used very effectively. (a major factor in why so many games appeared to have such low resolution - since most developers lacked the documentation and skills to write custom micro-code; Many of the games that seemed to defy the odds later in the system's life did so by using custom microcode.) The cpu core inside the graphics chip, which does such things as geometry transformation and the like cannot run code or work with data directly from main RAM, it has to be loaded into the local cache, which isn't that large, relatively speaking, so the size of code running on this core has to be considered. The pixel fill rate is about 60 megapixels a second. Which sounds like a lot for a system from that era, but becomes a big problem when you consider the system does multi-texturing, bilinear filtering, perspective correct texturing, anti-aliasing and environment mapping, amongst other things, all of which eat up a LOT of fill rate compared to what the system has to work with. (other systems of that era didn't use such effects, and it's estimated because of this the n64 was typically doing something like 5-8 times as much work drawing a single onscreen pixel as say, a playstation was doing.) Essentially, the n64 was frequently being used to render a level of graphical effects that objectively speaking were out of it's league, and probably shouldn't have been attempted on a system with such relatively low performance. So, no individual element of the n64 was that complex, and each part individually was actually quite powerful, but the pieces don't work well together, and the system is full of performance choke points. Thus, what makes it a pain to work with is that it has very high peak performance, but lots of bottlenecks that drag it's average performance into the gutter. In other words, the system is a high level optimisation nightmare. Still, it's pretty clear that it doesn't even come close to the Saturn, and the Saturn's reputation for complexity is well earnt. Just optimising DSP code alone already looks like a nightmare comparable to optimising n64 code in it's entirety... Yeah, I do think you're right there. May just be the trickiest chip seen in a mainstream product. (Can't say trickiest EVER, because I'm sure there's some obscure supercomputer chip or something that's worse. XD)
So that's why Quake 2 64 looked like your marine smeared vaseline all over his visor? Jokes aside, despite the complexity, it all definitely worked out well. Just look at Perfect Dark and compare it with any game in the Saturn or PSX, pretty much no game comes even close to it.
Thanks for making the time to write that. I'd heard a bit about the N64's design but was missing a few details. It makes a lot more sense now. What an odd design, especially that memory addressing quirk.
Programming for DSPs is fun! I never touched the Saturn DSP, but in the past spent some years writing code for Texas Instruments C5000 and C6000 DSPs. They have all kind of weird hacks not usually present in regular CPUs, like guard bits in the accumulator, modulo addressing (extremely useful to implement digital filters), specific loop instructions to avoid branching (causing the pipeline to get emptied), instruction buffers inside the DSP to avoid reloading looping code each iteration, duplicated data buses... If you write C code, the compiler is unable to use many of these (especially when coding for fixed point DSPs), so you end up having to write the computation intensive code in assembly language, and you need a good grasp of the hardware details to take advantage of many of these weird details.
I'm not a programmer, so I couldn't make sense of a lot of what you were showing, but it was still facinating to watch. I still hope your videos benefit the Saturn homebrew community somewhat.
This was always done to some degree, for example on a PS1 you have "double buffering" where by the GPU renders the last "commands" while the CPU is feeding in the current commands. That is concurrency - you had to be careful not to do anything that would require the CPU to wait for the GPU to do something else bye bye performance
@@segaunited3855 It absolutely does have double buffering. And the scratch pad cache is 1kb. Here is a hello world with double buffering using Gs lib: github.com/nicolas17/psxdemo/blob/master/main.c
Either you take and precisely organize meticulous notes or you have a crystal clear memory. Either way, thank you for this incredible breakdown of a complicated process.
This is brilliant, finally i found someone who shows how console optimization actually works. And it's so great to see how each line of code is thought in conjunction with the hardware schemas, .because nowadays when we program, mainly in Pc we need to abstract so many things and pray for everything to work out in hardware, see how values go from one place to another is somewhat comforting.
Wait so you can read from a register and write to it on the same cycle without the data becoming inconsistent? I wonder if the "impossible" bit from the manual is related to that.
You activate the circuits at the same time, but the multiply (for instance) starts at the same time so works with the data that is already there. By the time the new data arrives the operation has already completed.
Good observation. I suspect that there is actually some form of pipelining involved: these instructions do indeed occur at the same time but each one works on the result computed on the previous line, not on the result gotten from the instruction on their left. This is why the first line only contains instructions loading data into the registers and no actual computations. These computations happen on the second line with the content loaded into registers while the new "MOV" instructions load the registers with data to be used for computation on the next line/cycle.
That part is completely normal though. This is the magic part of flipflops. Write occurs at the clock edge, read data is available very shortly afterward, and you have the rest of the clock to compute your expression involving the flop output (which in this case is also the new input to the flop). No data races, totally safe.
It's called latching, the cycle can have a number of phases, for example if all units latch the inputs at the rising edge of the cycle pulse, then all data is consistent, and writing happens somewhere later in the cycle. You can also see a form of manual pipelining involved, as the data is pre-read into the corresponding input register of each computational unit. The computational unit itself would be protected from the change in the corresponding input register during its operation with a latch.
That's why I love but also hate acronyms. Once you've made that one link in your brain, it's never gonna go away. You seen all the boogie2988 fuss? Heard he's the new dsp
Dun-du-du-dunnnnnnnnnnnn! I got you. Try programming for the Atari Jaguar. It was infamously hard to program back in the day, which is one of the reasons it got next to no 3rd party support. Try seeing why, at the very least.
I actually understood what was going on when you explained it, but I'm pretty sure I would have had a really hard time figuring it out with just the documentation and some examples. Thanks for the really good explanation!
How does the DSP does the multiplication and move a new value into the input registers at the same time and maintain consistency between if the original or new value is used? I only takes one clock cycle but surely the operations can't be truly concurrent? Are they sequential between themselves?
An Option cycle is almost certainly multiple clock cycles. (it's possible to run various parts of a chip asynchronously and just have the physics of the chip manage timing of certain operations, but it almost never makes sense to do so)
The answer to this lies in the update order of a data flip-flop (building block of a register) and in delay. The input registers are fronted to a data bus and, after the data is left to settle on the data inputs, is clocked in - however, that doesn't mean it has instantly shown itself on the outputs as that has a time delay associated with it. At the same time, the data on the output has actually already been multiplied during the same settling period as the data inputs saw. The multiplied data is then clocked into the output register's flipflops. No data crossed streams with other data, and consistency is assured. No sub-clocking (phased clocks), etc.. just pure logic gate delay and careful engineering. This was actually animated into the diagrams with the movement of the data on the wires (delay and settling time) and flashes (clocking points).
@@RachelMant the downside is that a lot of those operations become exquisitely sensitive to process variation and design changes. Setting various operations to trigger on rising edges and others on falling edges is more robust if possible and only two operation phases are necessary.
@Robert Szasz That is what circuit analysis and process analysis is for - to determine safe max clock speeds so that process variation and physics don't kick your design into instability. For example, when prototyping to an FPGA, the synthesis tools are able to determine from a known maximum (worst case) per-gate delay time, how long each clock cycle must be to guarantee correct operation of such a design. For something as fast as this and for when it was designed and constructed, such tools existed and the process analysis had been performed. While either multiple clock cycles per operation for deep pipelining, or subdividing the clock up into segments where certain actions are clocked in different parts of the clock waveform - with how it was described here and with what is happening in this block diagram, I find it unlikely such elaborate schemes would be in use. For a DSP such as this, they'd only serve to make a larger surface area chip that runs hotter. More likely I think the DSP designers took care to ensure the worst-case timings weren't violated, and clock the simpler design as hard as they safely could. Less silicon required, less heat, faster in a real-time environment.
I'm always impressed by your videos and efforts to tell us how you guys programmed games for SEGA's consoles back in the 90's. At the same time however, it makes me wonder what made the Saturn so much of a hassle for other studios to develop while there are things such as Sonic R, Tomb Raider and even early development versions of Shenmue and Sonic Adventure running on that thing.
@@segaunited3855 It was very much a hassle. The manuals were poorly translated and riddled with errors, there was a new 68000 sound driver program every week with a different set of bugs, the development systems were flaky, the hardware itself had a lot of corner cases, and VDP1 was about 60% of the speed of the PS1's GPU at drawing polygons. Yes, the system had a certain personality that Sony's more sterile hardware lacked, but Sony's hardware also was more performant and simpler to deal with. We had to write extensive amounts of assembly on the Saturn to get performance even in the ballpark of what the PS1 gave us with straight C code.
@@ischmidt You're referring to the Western SDKs. Sega of America never provided any proper Sophia SDKs for developers because they completely dropped the ball on Saturn. Majority of your problems early on came from programmers only using one Core to code C Language Assembly. Many of the games developed for 3D using only one Core suffered from unbalanced and improper programming. It was if the Saturn was performing with one hand tied behind its back. Another was that many Non Japanese Developers like you weren't schooled on how to code Saturn's graphics and resorted only to using the VDP1 instead of Both VDP1 and VDP2. Saturn's early SDKs were pretty bad, especially due to Sega of America's utter laziness and the fact that they wasted ALL of 1994 on 32X.
5:50 If this is done in parallel, we have a race condition? Moving a value to the X register and at the same time trigger the multiplier sounds dangerous.
Good remark, the execution is actually pipelined and the move will be loaded only right in time for the instruction on the next line to take it into account but not before.
I don't care what anyone says programming in assembly is a real blast lol Anyways, I skimmed through the SCU documentation as well (because what else am I gonna do with my time?) and it seems that there is no issue with loading data from a bank to X&Y while also preforming the multiplication operation like some people seem to speculate; Sega themselves do it in some example code in the document "SCU DSP Assembler User's Manual". I have yet to find the actual "impossible" part of the code, maybe I'll keep trying to find it or I'll leave myself to be surprised in the future, we'll see.
I think i've got it? But i only skimmed, so i might be a bit off. MOV MC1,X and MOV MUL,P are both X-bus operations so they technically collide in their bit encoding, and i think someone (at least whoever wrote the assembler) knew that this combination is possible and is disambiguated by hardware by setting one additional bit to MOV MCx,X and the hardware disambiguates that to not also mean a MOV MCx,P or something like that. But this knowledge was lost on the way to SEGA's techdoc team. Y and A registers exhibit a similar non-collision that also doesn't seem accounted for in the manual.
I noted that as well, though looking at the instruction codes I don't think there's any issue. MOV MC1, X is encoded in the 6 bit X-bus control field as 100xxx (the three xs seem to determine which bank to load the data from) while MOV MUL,P is encoded 010xxx (here the xs are "don't cares"). So I don't think that's the problem...
@@PuyoPuyoMan I mean it obviously works, and seeing that it works it's easy enough to surmise why and how it works; however it's not strictly possible according to the word of the manual. The manual is merely wrong or sloppily formulated, this looks like deliberate hardware design, underpinned by the fact that the tool chain supports this.
"it's not strictly possible according to the word of the manual", did they say that it doesn't work in the documentation? I was looking for anything like that for a bit in the User's Manual but again I just skimmed it so I might've missed it if there was a part where it explicitly said "you can't use multiple x-bus commands at once" or something like that.
@@PuyoPuyoMan no, it doesn't go into enough detail and doesn't say that you can't issue two commands at the same time, but it would be implied given that both commands are specified with explicit overlapping bits that are not described as irrelevant. It is what my reading of the bit charts and it being presented there as alternate commands would suggest if I didn't know that it works. If the docteam expected it to work, they should have left the bit that sets the MUL switch out of the bitmask of the other MOVs on the same bus that are combinable with it and vice versa, and similarity with bits that are written as colliding on the other bus too. I read processor datasheets just about daily and i would have never guessed that it's possible from the datasheet, which is a gross miscommunication on their part.
Ohhhh, I see. I never really understood what "stream processing" was and why DSPs could be so fast. But now I see, instead of the "start and stop" of most CPUs where there's usually one operation at a time a DSP can read in more data at the same time and keep the data moving in a _stream_. And I can also see why this is so useful in digital audio applications where data (often from multiple sources) has to be read, mixed and output all at the same time. Using a DSP or multiple DSPs chained together a tiny, relatively slow DSP processor can handle much more data than a processor clocked much higher. It does look very tricky to program, but for the types of programs it runs it's probably not too bad. I assume most DSP programs perform the same operations over and over, like transforming vertex data through a transformation matrix. That's just a bunch of multiplies in a row with all the data in contiguous memory. I don't see any branches in there, can they even branch?
I assume that the DSP is decoding the 6 instructions simultaneously? In practice is works like a 6 stage pipeline, but in coding, you prime each stage of the pipeline, then issue instructions to perform an action for each stage, is this correct? What frequency did the DSP run at? I think you might get a kick out of the Parallax Propeller microcontroller, it has 8 individual cores with 512 longs per core. In practice, each instruction executes in 4 cycles, with shared memory instructions taking longer if you miss the access window. If written correctly, you can align code and memory accesses to ensure you hit the main memory on every access window, resulting in a true 20MIPS per core. The hub memory is 32KB with another 32KB of ROM containing a bytecode interpreter (SPIN language), SIN/COS/LOG tables, and a bitmap font. I recently got back into doing some coding on a DOS platform and writing inline asm to speed things up, programming the Propeller is a lot like coding for those old processors, but all of the goofy things like "why do I only have 5 general purpose registers when the 8051 has the first 1K as a register", and "why must I load a segment descriptor into a general purpose register first, before loading the segment register", are not present in the Propeller. All 512 longs of COG ram are registers and can be treated as instructions or data. There is no cycle penalty for byte or word memory access, and there are a quite a few non-conventional instructions. Best of all, each core (COG) has a video generator builtin, so you can easily generate high res tiled video output or low res bitmapped output, or high res low-color output (the HUB memory size is the constraint). You can do VGA or NTSC directly from the COG with just a few resistors for support components.
Absolutely, it is clearly a pipelined design, with pipeline stages all taking a single cycle, contrary to other more complex DSP designs. And yes, the Propeller is certainly interesting :)
@@jollyrogerxp It is very cool how they designed it specifically for realtime bit-banging with no unpredictability. No interrupts, no need to care about preemption priorities and saving context. Instead you have enough "cores" to poll many different inputs.
@@noop9k absolutely, this is what one would do on an FPGA or any dedicated ASIC to process streams of data with guaranteed throughput and latency, which is what DSPs are good at for hard real-time constrained systems!
This seems similar to what we called "FMA" today, but in integer only... funny to see today's generic CPUs mimicking DSPs like this in SIMD instructions to make better performance. What seems most complicated to me is all ALUs in that era are all integer only. My programming skills raised in modern days which CPUs already provided nice FPU functionality, even with robust SIMD instructions. But back then you have to use fixed point numbers which is basically just integers with an imaginary decimal. All these bit shifting, number overflowing things, make me headache just for imagine them... Programmers in this era really had hard days.
I know this is a joke but I find rewatching videos on complex topics and sometimes going as far to pause and take notes helps if you really want to understand.
@@AesculapiusPiranha Thanks for the advice...quite wasted on me though as I'm 42 and if I haven't got it by now, and I haven't I shouldn't think it'll ever click. I spent most of the late eighties and early nineties trying to learn to code...but never really progressed past a competent grasp of BASIC. I'm comfortable with the fact I'm not clever enough...it's okay. I love video games so much I love listening to basically white noise (to me) even if I don't get it.
I'm just branching out from CPUs to FPGAs (where EVERYTHING runs in parallel), and haven't looked at DSPs yet, but this totally makes sense. I've always thought of DSP as those magic chips that are somehow really good at math-intensive stuff, but didn't really know what they were doing special. In fact, now some of the (CPU) instructions that are often labeled as "for DSP routines" make more sense too, because they're usually something like multiply, add, and accumulate all in one cycle. Looking at it from the FPGA angle helps, since you can see it's just cascaded ALU blocks that can run to coherency in the time of one cycle. So thanks for the useful intro! :-)
DSPs and of course even more so FPGAs are devices that software-only people have a hard time to wrap their heads around, precisely due to the issue of many states changing simultaneously, rather than thinking about a single stream of instructions... :)
VU microcode is also hell... actually most of PS2 coding is hell for similar reasons to the saturn.. so many damn bits of hardware all operating independently
If this was difficult to learn, imagine the Atari Jaguar coders doing 3D with so archaic and rare design with its two chipsets: "Tom & Jerry" (yeah, like the animation lol). But anyway, this is simply fascinating! I'm not a programmer but i can understand how difficult is (or was) learn the Saturn infamous hardware in general. It was more powerful than the PSX, no doubt, but it was rarely see that (Radiant Silvergun, Panzer Dragoon Saga, Sonic R, and some more, but not much examples unfortunely).
The sad thing about Saturn is how its actually an unfinished 64-bit Console. And how if only it had those 4 Months it REALLY needed to complete taping and documentation, it would have MURDERED the 5th Gen race at the very start.
So I’m interested, given that you seem to know the Saturn hardware well, what you think the most accomplished, commercially released use of the hardware, and therefore the DSP, actually was. You’ve probably answered this question a million times!
also, in the 90's... if you wanted to have enough horse power to do math for a 3d game... a dsp was what you had to use, there was no other choice. unless you fricking Ken Kutaragi. the big difference between ps1 and all the other consoles from that era was the no dsp for the 3d math
Irrelevant. PS2 is a "nightmare" too, and it was an extraordinary success! For that matter i gladly believe the reports that N64 was essentially undebuggable, in spite of being, from the software perspective (not the hardware!) less convoluted. Yeah it is a bit magical how the PS1 came together - integrating a very fast special purpose DSP as processor instruction flow obeying COP was a smooth move. As was the bucketed DMA. The system's a bit crude as far as what it could do, which it pretty much had to be given the time it came out at, but it was laser focused on making it easy and accessible.
@@SianaGearz PS2 was a success mostly due to Sony's marketing and the Playstation brand created by PS1. Also due to actually being fastest at the moment of release. Unlike Saturn which had inferior 3D capabilities vs PS1 straight from the start. Still, it had quite a few bad ports where devs obviously spent most of their attention on GameCube&XBox versions. DC ports with lower res, worse textures, worse audio..
Saturn homebrew dev here. The leaked English coding docs for the Sega Saturn are known to be full of errors as they were rushed through translation or something.
Blame Sega of America. Wasted all of 1994 on 32X. And then in '95, gave Developers sloppy SDKs and failed to communicate with Saturn's Engineers.
Where did the term Homebrew originally come from?
Beer?
Yeah... Most likely.
Homebrew kits for alcohol and beer especially have been on sale long before computer games were ever a thing. (Home BREW. after all. Brewing is how you make beer in particular.)
How this transitioned to use in game development I don't know.
It's also a bit weird that the term has gotten ambiguous enough that we start calling people working with old microcomputers 'homebrew' devs.
Console development without an official license and dev kit is one thing. That's not a normal thing to do, so 'homebrew' kinda makes sense.
But one of the core differences between a console and a microcomputer is that anyone can develop microcomputer programs, and has always been able to.
Me writing a SNES game is a bit unusual...
But writing a game for my atari 800XL is only unusual in terms of how old that system is.
Back in the 80's you had a lot of 'bedroom coders' who did just that, and then got a publishing deal and released their work commercially;
@@segaunited3855 there is more to it than that. SoJ kept projects like Saturn secret from SoA and is why they wasted time and resources making warts for the Genesis.
This reminds me of a quote Jon made in 1997: "while the PlayStation was easier to get started on ... you quickly reach [its] limits, whereas the Saturn's "complicated" hardware had the ability to improve the speed and look of a game when all used together correctly."
Judging from the mastery of the DSP, it seems so. I mean, having an entire 68K all to yourself just to produce sound alone is pretty incredible.
Playstation was simple and hardware wasn't anything over the top.
Saturn had the bits to pull far above anyone then but was difficult to get everything talking at the right time but some did pull it off, look at the quake ports, they said it wasn't possible but they did it anyway because the company said they could, sadly it didn't go much further than that but like most consoles, games get better as years go on as people have a better understanding of it, playstation was to easy and max out in 2 years, saturn still hadn't reached it's limit, i think there was one game that came close but just needed optimization and it would have been better than anything released including PC games as they were on the rise
Very much so, and even the sound processor itself was really impressive for the time...
I have to wonder what kind of game you could make if you actually utilized every part of the Saturn to it. It seemed very powerful for its time but no one ever used it right.
@@PrinceSilvermane fortunately there is an engine out in public you can use but I'm not sure how well optimized and user friendly it is
@@PrinceSilvermane Just as Jon explains extremely well in his videos, it was a complex piece of hardware. Games that can use it best are ones where both VDP processors can be fully utilized, the DSP has a reasonable chance to get used, and both SH-2 CPUs have work to do that can be done without having to fight for the memory bus too much. All in all, either a game is designed specifically with its architecture in mind (e.g. Panzer Dragoon), or a game concept is slightly tweaked to use the system as well as possible (e.g. Sonic R). In Sonic R TT did a really tremendous effort and it is among the best one could hope to obtain on the system, considering a reasonable development schedule. I have the utmost respect for Jon and his team managed to do. With infinite time available one can always do something better, but it is not representative of a realistic commercial game development process...
Very well explained indeed. As an oldtimer PSX and Saturn programmer too, I can appreciate how challenging it is to explain plainly the very concept of simultaneous operation in the different units of a DSP. Something interesting is that the DSP multiplier unit actually performs the multiplication operation between X and Y every cycle, and the instructions are used to tell the unit at any point in time whether one is interested in the result or not... At any rate, the TMS C6000 DSPs are tricky beasts too, as different operations take a different amount of time to complete (like many other pieces of hardware), but the system does not "wait" for an operation to complete before moving forward (pipelining), and one has to (when programming in assembly) take that delay into account to avoid using a result at the wrong time ;)
Good ol' race conditions. This is the same reason making multi-threaded programs (and games) is difficult. It's like coding for the DSP, but you never know when things will finish, and you have as many threads to work with as your heart desires, but overuse will give performance penalty instead of performance benefit.
Jesus man I assumed at the very least the saturn would have pipelining! Also Thanks jollyrogger for your work on making sonic xtreme playable again. I felt like you never got your due for how much work you put in to getting all those different versions working. It seemed like people didn't appreciate it that much and i just wanted to let you know that as a long time programmer and sonic fan i appreciated the fuck out of it :)
@@kaldo_kaldo well said :)
@@littlebigcommentary Thank you so very much, if I ever have time I will go back to it, but now there are a few young developers who have time and talent, and can put the Saturn hardware to good use :)
Which is EXACTLY why most games still don't use more than two cores even now lol. Single core preformance is generally still the best bar to see for benchmarking games then multithreading ones.
Very interesting. Thanks for presenting that in a somewhat understandable manner. I was able to follow but of course that doesn't mean my mind fully grasps it. Congrats on learning the Saturn so well. I had heard that Sega kept a lot of the more complex coding abilities for themselves so that their games looked better than 3rd party games. I don't know how true that is, but it's what I read in magazines back in the day when 3rd parties were complaining about programming and Sega not exactly being helpful in their documentation or programming tools.
I knew I would see you here Joe! We're both fangirls of Gamehut haha
What kind of trick you used to make the perspective without using the W coordinates?
@Joaquín Nuñez That's just conjecture... the "no way they would do that" part. Keep in mind that SoJ was run by Nakayama who was a very strange cat indeed.
Game Sack, you're not the only one who gives Nakayama that reputation. I'm still not forgiving him for his hostile takeover in plans for the American Saturn.
(Edit): Especially how he did it; as follows
(Not part of edit): Turning down Silicon Graphics, insulting Sony, rushing the 32X (which doesn't regard the Saturn, but still make a large confusing position for the Sega fanbase), and pushing the release date from 9/2/95 to 5/15/95.
Ooooh this reminds me.. new Sack episode tomorrow?
The visual representations really help in making the code easier to understand. Thanks Jon!
As somebody who has only barely dabbled in the fundamentals of coding, seeing a breakdown of hardware on such a technical level like this is both fascinating and humbling. It captivates me with how much mastery of the material you and other programmers needed to have at the time for such a dizzying machine while also being awe-inspiring to the point where I want to try and learn even more about this. Thank you, Jon!
Saturn is not a Dizzying Machine.
I burst out laughing when he said "This is where it gets more complicated". I got lost 3 minutes before that.
I got lost yesterday and didn't even start watching the video until tomorrow
I give up when i saw sin cos stuff
After watching the video, now I know that I'll never have the expertise to make a simple game, even though I'm a developer myself... Coding games in the 80's and 90's were the real deal, with all those people (especially you Jon) digging into those pieces of hardware to get the maximum performance.
EDIT: I started programming at age 20, I got my first computer the year before and I was determined to become a good developer; I loved playing games and I was very interested in learning C language to understand the source codes I downloaded from the Internet (about 2000-2001) including some Net Yaroze games I enjoyed playing. I was very proud of myself when I completed a playable version of Tetris in my first year of C learning, but then I started working on a consulting firm and time passed... and now I feel that I have wasted all this time.
So when I see this kind of video, I can't help but thinking that I could have learnt those things instead of letting time go. It's too late now to revert those bad decisions....
This comment made me appreciate today's tech and specs. Though making games today is still difficult, even with "easy" to use engines, we're spoiled compared what people had to go through back then to make a game.
The jump from 2D to 3D created a split between ‘Engine Developer’ and ‘Gameplay Developer’, it’s all a matter of specialization.
But, heck, you can learn to build games for NES or GB without too much effort. If all you’ve done is JavaScript, and you haven’t taken any computer architecture classes, sure there’s a learning curve, but 80’s kids built games for micro computers without too much fuss, and these days an emulator will have debug tools undreamed of at the time
@@johnsimon8457 I have almost entirely on my own created a SNES game engine from scratch, and although it isn't much of a game as of yet, it's still got perfect terrain collision and an infinite (well, as infinite as the cart space) world. It's all a matter of how much time and effort you wanna spend on it.
I think it's a good thing that it's easier to make games nowadays. Not just with the easy to use and free engines, but also pretty much every other software you might need can be free too. Not to mention the abundance of information, tutorials, even straight up lessons that anyone can get for free. A lot of creative minds can then express themselves without working for a big company.
nowadays coding games is still difficult.... but in all sorts of different ways.
first games are more complex, all the math the sonic R used... you will be using more for things like simple texture effects and so, and you gotta code every single one of them, is essentially just a lot more to do.
but the major technical difference nowadays is that you don't really have to worry about the processors having the horsepower to do anything, if you do things right, they do. the problem is the memory latency.
essentially it takes more time to pull a random number on memory than the math you gonna do with it. so you need to code your program in a way that the processor can predict what is the next thing you gonna need from memory, and that can strange
Oh neat, I didn't realize there was a VLIW core in the Saturn. It seems Sega included every possible kind of CPU they could, 2 RISC (SH-2), 1 CISC (68K), and the VLIW DSP core. Truly crazy hardware design. Thanks for sharing!
SH-2 is Dual Cored. NOT Dual CPU'd. The Chips are exactly the same, they're separated on two Wafers due to the Rushed taping of Saturn, that was incompleted due to CSK's Owner Isao Okawa ordering its Taping and Beta design to be rushed out 4 months too early.
The SEGA Saturn is Not a 32-bit only console. Its ACTUALLY a 64-bit Console. An Unfinished one.
Saturn's design was only half finished. SEGA released an 80% compete product as Saturn's Documentation wasn't even fully finished.
@@segaunited3855 Yes, they're identical, but they operate independently. It's 2 separate 32-bit CPUs. Multicore/multi CPU effectively mean the same thing.
@@superandroidtron You are correct. They do operate independently, but the reason why the share the same DSP Bus is because the MIPS Instructions are all embedded it. Each side pushes Simultaneous Dual 32-bit Instructions as the SH-4 is QUAD threaded and Dual Register Coded.
We have a PDF Document of Hitachi SH-2 Aurora. That shows how to get 64-bit instructions on its ISA setup using 6 Instruction Cycles.
@@segaunited3855 I really don't understand what you're saying. None of the CPUs in the Saturn are MIPS cores and the SH-4 (which isn't even in the Saturn, it's in the Dreamcast) is a single-threaded dual-issue CPU. I also can't find any documentation for an SH-2 called "Aurora."
@@superandroidtron Didn't say Saturn's CPU is MIPS Cored. IT HANDLES MIPS. The DSP can stream up to 50 MIPS to the SH-2 Aurora. People seem to think that Saturn can't handle MIPS when it certainly does. It has Double the MIPS of PS1.
Aurora is the codename of the Saturn's chipset. The name comes from a "In between combination of JUPITER and Model 2". Its basically a Low End OTS built of Model 2 Hardware. Combining the Phase 2 of Jupiter's "System 32 with Model 2 3D" with a Full Fledged 64-bit Model 2 Powered project.
SEGA names the CPUs of its consoles after its codename chipsets. Aurora is the Codename of the SH-2 ISA RISC of Saturn.
BTW, RISC has evolved into MIPS. Today, Renasas uses MIPS for ISA.
I only understood about 40%, but when I first started watching your videos I barely understood 5%... Your videos have helped me learn a lot
Well what are the other 60%, i'm sure if you ask pointed questions, someone down here myself included might be able to help out.
@@SianaGearz That's the beauty of not knowing. You don't know what you don't know.
@@dhkatz_ yeah but this video is a collection of statements. You can go through them one by one and single out which is the first one that doesn't seem meaningful, contains unknown terminology, or presents an apparently logical conclusion that you don't know how the author arrived at. You can request clarification for all such statements one by one and in turn all statements in replies that pose a similar issue and eventually you shall arrive at complete understanding.
Oooh, programming that DSP sounds like a fun challenge.
Impressive chip but good god I can't imagine dealing with this
This DSP chip would require quite a different paradigm compared to most other chips. Now that I see how it works, I can see that execution path modeling could be used to fairly easily cook up code for this thing, but debugging would be an absolute nightmare for sure.
The PS2 and the PS3 chipset also got their fair share of coding trouble.
The PS2 runs a MIPS core with 128-bit SIMD functionality, it runs the MIPS III ISA with specially added extra instructions. the FPU is not IEEE compliant Than come two VLIW units which are simila, but get used for different functions. Oh, and an additional video decoder and an external RDRAM MMU. It got 2 audio processors, one in the main MIPS CPU and the other to emulate the chip found in the PS1
The PS3 runs a combination of one dual threaded, dual issue, in-order core running PowerPC ISA 2.02 and six additional dual-issue units with 6 execution units each and indipendent memory management but without branch predictor, they run their own ISA, have embedded SRAM. All of that connected via a ringbus.
Syncing all that stuff correctly must've been a nightmare.
@@johnrickard8512 Debugging on Saturn can be done two ways: Either on Sophia SDKs, or on modded Pheobe.
SEGA and Yu Suzuki and the AM2 team were so elite with programming back then. SEGA built the best arcade experiences back then and tried to bring those experiences home. Sadly they built hardware that was so complex for programmers that only the very best could use it to its full potential.
@7MGTESupraTurboA You are exactly correct. Sega of America wasted ALL of 1994 on 32X.
They went full out in a way only Sony tried later when they went with IBM's Cell for the PS3.
well it didn't really show
@7MGTESupraTurboA so true 32x not only took a lot of money and resources but a lot of launch and later titles from the Saturn. They should have cancelled the 32x in 94
@7MGTESupraTurboA Can't forget about the Sega CD lack of super scaler games and the push for more FMV games as well. SOA should have consider the Cart version of the Saturn if there were serious about the price point of the 5th gen console. I agree with you but however, Sega Japan made some bad decisions such as the Saturn surprise launch, presenting the Mars project to SOA and the lack of tools and utilization of the Sega CD ASIC chip.
The main problem with the DSP was the amount of time it took to load the matrix and set up the DMA for transforming the points. For transforming a small number of points, it was faster just to do it in the main CPU. At least that's what I found when porting Assault Rigs from PlayStation to Saturn and trying to get it to run at a reasonable frame rate (which kind of failed on the busier levels!).
It would have been easier and smooth had both sides of the SH-2 been used.
That's a common problem when you have dedicated hardware for a task;
Getting the GPU in a modern PC to do a transform on 500,000 vertices is very fast.
But getting it to do a transform on 3? The setup time and draw calls will eat you alive.
(actually draw call minimisation is one of the most important game engine optimisation skills of the last 20 or so years on PC - and presumably also modern consoles, since they're almost the same architecture as the PC...)
I heard this is an issue for even SIMD on the same CPU (example: SSE2). You can do things like wreck the CPU pipeline to where it's not worth using SIMD
@@JoeStuffznext time you tell me that single instruction multiple logic operations aka bit operations have issues.
This system is so powerful, it could steal the SOUND CHIP to aid in 3D graphics.
As could the Atari Jaguar
@@ArneChristianRosenfeldt Except that the Jaguar was designed from scratch to use the custom RISC CPU as a DSP--usable for audio or even setup for the 3D graphics pipeline. On the Saturn, a DSP that looks like a Transport Triggered Architecture (rather than RISC) had a bit of a learning curve for the average programmer.
Wish this video existed about 2 months ago :( My students had to cover this topic (use of Matrices in 3D computer graphics) for an assessment task. I'll probably use this video as an example of application of the technique next year. Cheers :)
You are teaching saturn coding?
@@MrSapps no, just computer science principles, and use of Matrices in graphics applications is one of them. He only briefly touches on it at the start of this video, but it's good to see it in a practical application.
@@luke_rs Was going to say - that could be quite a unique course ;)
i don't think it's the best idea to use this as a reference, since it might scare them off. Modern CPU/GPU architecture uses SIMD (Single instruction, multiple data) operations instead, which I think means that you can for example, add/subtract/multiply/divide two vectors in one instruction, rather than multiply the individual components of that vector with different instructions that happen to run in parallel.
@@luke_rs hey random question but I wonder if you could help me a bit.
What are the basic things I need to learn/research to be able to,
Draw a 3d model in a screen?
I'm learning stuff alone and sometimes it's hard because you don't know the terms and how shit is called.
There's something fascinating when you watch a video about stuff you can't even grasp.
lol. For me it's depressing. 'WHY DON'T I KNOW THIS TOO???' :(
I enjoy assembly programming, but I don't get to do it very often, so this video was very enjoyable.
FUN FACT. a game on Saturn called MDK was named after the memory of the DSP chip as it was a continual game where the accumulated numbers changed the visuals on screen. Each level just kept going and going until a time clock ended. Each level was coded with a single start “code” and the Saturn itself filled in the level, much in the same way a procedural game is made
Never heard of this. Was MDK actually ever released on Saturn? Maybe it's not the MDK I am thinking of...
Maybe you got the name wrong? Can't find any reference to that. The only MDK game that i can find was not released on Saturn - only PC and PSX.
I can't wait for the follow up video. What I like most about the saturn is it gives developers so many obscure options to get the results they want. You're made to manage exactly what part of any processor is handling a specific segment of code while another part of that same chip is processing two entirely different things. They don't make hardware like it, a highly specialized hyper paralleled system that evolves 2D game play to the next level and pushes 3D in a way the inter-mingles with what it can do in 2D. Antiquated and advanced, simultaneously!
That's why I love learning about it. It's so highly specialized that it earns its own understanding.
ALL 5th Generation Consoles did 3D by using 2D and 3D playing fields and Models together.
@@segaunited3855 the saturn worked significantly differently due to its reliance on quads and features like infinite planes.
@@retractingblinds Correct. Saturn used Trapezoids and Reshaped Sprites(N64 also used Reshaped Sprites but relied on small Triangles for Polygons).
PS1 used LARGE,bloated Triangles and Simultaneous Renders. But it was far inferior to Saturn(due to it lacking Recalculation,) and N64(Lacked Perspective Correction and Z Buffering).
Nowadays we have our fancy compilers do all that multi-threading nonsense for us. Really makes you appreciate it.
Yeah this reminds me a lot of the brief fad of VLIW CPUs like Itanium and Hitachi SH-4. Of course most modern CPUs are VLIW under the hood, with a CISC (or even RISC) instruction decode step that basically recompiles to VLIW on-the-fly.
I was always interested in the Transmeta Crusoe since it basically did that except with the decode/conversion step in software. It's a shame that never really went anywhere on the consumer market though, aside from being in a handful of early low-power subnotebooks like the Sony Picturebook.
Sun's MAJC architecture also had some interesting design choices and it's a shame that never even made it to market as far as I know.
Minty Meeo this is not multithreading
fluffy: Transmeta was basically impossible to use as a standalone CPU that runs a modern OS, and it wasn’t, unfortunately, that fast, despite x86 JIT being very good. In particular, it lacked good SIMD support with SSE only appearing on Efficeon, too late. But it would be great, of course, if they made it open so the people could run recompilers for other architectures.
Before Pentium M appeared, Transmeta was quite popular, especially in Japan. I have Fujitsu subnotebook and HP tc1000, both use Crusoe.
At least back then you knew exactly what was going on in your code, so in a way it was simpler.
@@noop9k Oh, sure, they never made a standalone version (and having the software instruction decode layer was pretty much the entire point) but it'd have been cool if they'd gotten to the point of supporting multiple ISAs or whatever. Or even had a version with the VLIW ISA exposed directly.
I had a Picturebook C1VN for a while and while the performance wasn't great, it was good enough and had *amazing* battery life compared to similarly-performing machines of the time. Of course Atom totally fits in that niche now (and for non-x86-legacy stuff ARM is doing quite nicely), so it's not like we've lost out in that market or anything. It's just always interesting to think about what could have been, or where the technology might have gone if it didn't end up just becoming yet another pile of zombie IP.
Your works have inspired me to become the senior 3D/VR artist that I am today.
I just wanted to thank you.
Seeing the face of the man behind so many of my favorite games back in the ‘90/‘00 was amazing, knowing that you are a nice person (all the efforts and the charity you are doing) was once again inspiring.
Kudos my distant and long unknown mentor, kudos.
Gosh, this is just a masterpiece! I wonder if modern games have something like that, even not exactly like that, but still just beautiful and complex.
Also, i just love how you title everything "Impossible %THING%"
I might know of one piece of silicon that might be harder to code for: The iAPX432.
I think I might be the only one here thinking "DAMN! THAT CHIP IS AWESOME!".
What about the Cell processor though?
@@Redhotsmasher Ah definitely the Cell was a tough partner, but the individual programmable units in the Cell weren't too bad after all, in fact not all that dissimilar from the VU units in the Playstation 2. DSP programming has always been tricky, from the early TMS and NI chips to the more modern ones. The saving grace in modern systems is the availability of good tools (mostly the compilers) that help with the instruction scheduling and pairing, filling up the pipelines for you rather than having to do it by hand. On the other hand, having intimate knowledge of the instruction set and programming in assembly can sometimes result in algorithms that are very difficult or impossible to express with a high-level programming language, and that therefore a compiler will not emit on its own...
@@jollyrogerxp Cell architecture was actually used in PlayStation, that is the reason for some scientists using them for complex calculations!
@@Redhotsmasher Cell was basically a bunch of PPC750 (what Apple called the G3) cores with added SIMD vector instructions tied together with a high-speed memory bus accessible via DMA. It was painful to work with at a memory controller/task scheduling level but the cores themselves were incredibly straightforward to use.
Later on Sony released a scheduling library called SPURS which made that a lot easier to deal with, as well. You still had to worry about breaking your tasks down to fit into each individual core's workspace memory but that was more of a design thing than an implementation thing.
/r/iamverysmart
I cannot thank you enough for this type of information. Thank you for explaining it in a way where someone with no coding knowledge can somewhat understand.
Early 3D modeling is very fascinating. Now it seems much easier, back then, probably most coding was done manually.
Loving the new logo, Jon! The Sega Saturn really is an enigma when it comes to programming.
I am an electrical engineer and digital hardware is my favorite area, so I found this especially fascinating! Even cooler is how you managed to take advantage of the insane and convoluted hardware. I think I followed a good bulk of it, though I'd probably have to watch it again to fully grasp all the math steps going on in each calculation.
This was a really neat insight into how you managed to program so much to work at once, all the while having to deal with such a complicated piece of equipment. Very well done!
Mr. Eight-Three-One, please tell us you're gonna delve into Saturn homebrew
@@mogo9052 Haha, I'm not sure I would count on that. While I can do at least some programming, it's not my strong suit, unless you're talking Verilog or VHDL firmware. I'd say I'm above average as far as EEs tend to go (I impressed a lot of people in my embedded systems class, to say the least), but still, I'm more interested in the design and functionality of the hardware rather than actually making the software to go with it.
I do appreciate the thought though!
That's fascinating, I've never seen that form of paralellization. This kind of code really requires comments for every line to be maintained though, lol.
Hi Mr. Burton! After roughly 26 years, I've finally finished Sonic 3D Blast on Genesis. And with all emeralds! I wanted to let you know I enjoyed the game and that your videos helped me appreciate all the work that went into it. Thank you to you and the team for everything y'all did!!
I Like how i watch the video and still didn't understand anything and still get impressived
did you just mix impressive with impressed? "impressived" xD that's going in the dictionary "very impressed"
If you want to understand the basics of this video, you can check this tutorial: skilldrick.github.io/easy6502/
It will teach you the very basics (put a pixel on screen, add, subtract, etc) on the same CPU that was used in the original NES.
Eduardo Alvarez just read the intro, then played the snake game. lol imgur.com/a/7jOq6kg
i'ma have to learn this. how long does it take to learn?
@@DlcEnergy Well I started a just few weeks ago. There are a few good links in that same tutorial to learn every instruction code that the 6502 have. Then you can go to the specifics of the NES on this link: nintendoage.com/forum/messageview.cfm?catid=22&threadid=7155
Basically, normal CPUs at the time would take in ONE instruction at a time. (Like, put the number "2" in slot 1)
The DSP can do FIVE instructions at once. (Like, put the number "2" in slot 1, take the number in slot 3 out, etc.)
In essence, you could call it an octopus CPU, because of how many things it can do at once, like the phrase you hear someone say "I only have 2 arms!", when they are doing two things at once and can't do another.
Oh the joy of parallel computing. Speaking from personal experience, it indeed is hard to get used to a multithreaded or even worse, a multicore environment especially if you are coming from a single core environment. When we started making games ourselves, everything was single core. Taking advantages of multicore, requires you to think completely different. You need to build your engine from the ground up with multicore in mind. And that’s difficult.
That said, we could still offload a lot of lower priority or specialized tasks to other cores. Such as network updates, audio rendering, difficult calculations, particle engine updates (syncing with rendering is tricky), and various game systems that had a lower priority or could be updated out of sync with the main thread. For instance we had a sensory system that was complex, but could run “one frame behind” the main update loop. No one would notice. Offloading that to another core, was a massive gain.
But the worst bugs arise when there is a race condition. I remember spending a week or two on an obscure audio bug on the Wii U that only occurred once every couple of thousand of samples. Reproducing the bug took hours; but always resulted in a massive crash. Nintendo obviously found the bug as well. So it was a big showstopper.
After making a simulator that spawned thousands of audio samples each second inside the game, I managed to get this repro time to down to 5-10 minutes. Still not ideal, but enough to finally find the bug: some core initialization order and a critical section that was thread safe, but not if those threads ran on different cores... oh the joy... I still have nightmares of that garbled audio that was produced by running thousands of samples a second 😊
VLIW in action, pretty much. Mad respect to the team for dealing with all that.
That actually makes a lot of sense and is really cool when you think of the potential through put and modular usage of that pipeline
I just took a class on Assembly and computer organization. The fact that this makes sense to me makes me so happy! 🤓
This was fascinating, it would be great to see more in-depth videos like this just to understand more about how the Saturn hardware works.
Crikey, coding that must have been a real brain twister - what can and cannot be done in parallel, which result do you need before some other calculation uses it, etc. I only hope there was a repeating pattern you could grasp.
p.s. Extra thanks, considering what happened in Malibu 👍
I like how you show the code/diagrams while also giving an oversimplified explanation.
Those who know what they're looking at can stay interested and (i think) those who don't really know can follow
DSPs are very common in mobile phone base stations. I know at Ericsson they had entire achitectures built with many many of these and c code complied into VeryLongInstructionWord assembly to program them.
So yeah, this is not ”acient technology”, it is used every day when you call your mother or whatever.
lol
Absolutely, DSPs have been and are very common in so many fields, medical equipment, communication gear, radar/sonar, plane avionics, and many more!
DSPs aren't ancient tech by any means, though I'd be the first to point out that 3d polygon transform is not what most people would use a DSP for.
Currently looking into DSP coding on Saturn. Thank you so much for making my work much easier!
I would call this process "manual hyperthreading" since you have to program it yourself. Pretty interesting concept for something made in the mid-90s
I'd call it VLIW
Well it's not really hyperthreading, that's using the core for something else while it's waiting for data to come.
I would just call it parallel instruction execution, aka. pie :)
@@mattiviljanen8109 It may also be a position-independent executable too.
An exposed pipeline and Multiply-Accumulate units is pretty standard DSP architecture. Not a lot of revolution has gone on, mostly just evolution. Datasheets will still have a block diagram that looks very similar these days
I'm a third year university student, and I'm just now learning to program in assembly code. Your videos have inspired me to take up a project programming for the SNES. The technical details for things on the channel are always amazing!
Sega Saturn is the console i think of when i think of gaming it captured me in 96, just awesome. its the gift that keeps on giving.
Saturn and N64 have aged with Grace. Mario 64 still looks very colorful and vibrantly polished.
Also, the Duke Nukem 3D Saturn port absolutely DESTROYS the PS1. Has Superior Lighting Effects,too.
i played the hell out of Duke 3d on saturn, Lobotomy software were the 3rd party kings for the saturn who knows what they could have achieved.
@@RallyDon82 IKR? Its a shame on how BADLY Sega of America dropped the ball on Saturn. They NEVER taught people how to properly program and code games for it.
Would LOVE a video on the DSP and the issue with the dev docs! Thank you so much for this!
Are we related?
Nice implementation of a VLIW processor. Maybe they could've implemented Tomasulo's algorithm to simplify programming but still get a good performance. Anyone has any idea why they didn't use it? The algorithm was developed in 67, so it was already around by the time of the Saturn.
Wow great stuff Jon keep it coming, you explained the DSP so well. I’ve not touched Assembly Code since second year at Uni, it was great fun playing around with it.
Single Instruction Multiple Data (short SIMD) stuff was never really easy. Even today compilers sometimes struggle with it. For my job i needed to learn NEON assembly for our ARM processor. It's different as only one instruction is executed which handles a whole matrix of data instead of the DSP which handles multiple instructions at the same team.
But still the the amount of different instructions of NEON was so high that the current GCC was not capable of correctly compiling code for this thing. Also lot's of register magic happened as this thing got 64 Bit Registers and 128 Bit registers which weren't other registers but instead the 64 Bit ones concatenated together. Lot's of thought and documentation was needed to maintain that monster. Keep up the cool videos sir.
But here it is multiple instructions. SIMD is PSX GTE and Intel MMX
The way you visualized it made a difficult thing much easier to understand, so thank you for this great video.
Thank you for this video. I have to say I'm surprised the Saturn requires such complex instructions, even though I know very little about computer programming. I seriously thought you guys had some sorts of tools made by Sega to help your work, but it looks like programmers had to start from scratch.
Justin Fanite
From scratch...I wonder how tough , that was.
On ALL 5th Gen Consoles. Programmers had to start from scratch when writing Assembly for a new Game. It was just easier on PS1, but just because its easier doesn't make it a superior piece of hardware. PS1 is WAY inferior to Saturn and Nintendo 64 all it did was draw basic Polygons better by making them larger in size and with Simultaneous Renders.
@@segaunited3855 Exactly!
I don't comment often, but I absolutely loved this video. The technical aspects of programming video games truly fascinates me, and you've made it fun to watch!
Ouch. That gives me a headache. Managing That level of parallel code is not a pleasant experience. (It's bad enough when you're dealing with a full symmetric multiprocessing system - but this is much like hyperthreading - except a CPU with hyperthreading manages code balancing by itself, where here you have to code it directly.)
The n64 also had a bad reputation for being difficult, but looking at why that was, I don't think it's even remotely comparable.
The problems on the n64 are related to high level optimisation, and some frustrating bottlenecks in the architecture that required careful workarounds.
Also Nintendo absolutely refused to provide microcode documentation for the Graphics processor until quite late into the system's life, meaning you were entirely dependent on the handful of pre-written routines Nintendo provided to get any 3d acceleration at all. - easier, sure, but not conducive to getting the best out of the system.
In hindsight looking at what the system contained it's not at all in the same league of complexity.
You had a MIPS 4300i CPU with a floating point unit, and a Graphics chip that consisted of a DSP with fixed function 3d logic, and a second MIPS 4300i core. This core differed from the main CPU core only in terms of having a large dedicated cache memory, accessing memory differently in general, and having a vector co-processor in place of a floating point co-processor.
The instruction set was obviously much the same.
Leaving aside not being given any low level documentation at all, the difficulty seems to arise not from the complexity of any given part of the system, but the interaction.
The Main memory is RAMBUS ram - which is very fast, but has high access latency. (meaning working with RAM efficiently is tricky), the main CPU is incapable of accessing memory independently, so it has to get the RCP (aka the graphics chip) to do it on it's behalf.
The texturing unit has a 4 kilobyte 'cache'. Which is just about enough to store a single 64x64 texture at 8 bit colour depth. (only 32x32 if you have mip-mapping enabled). This wouldn't be so bad, but despite being called a 'cache' it's manually controlled by the programmer, and in standard microcode implementations isn't used very effectively. (a major factor in why so many games appeared to have such low resolution - since most developers lacked the documentation and skills to write custom micro-code; Many of the games that seemed to defy the odds later in the system's life did so by using custom microcode.)
The cpu core inside the graphics chip, which does such things as geometry transformation and the like cannot run code or work with data directly from main RAM, it has to be loaded into the local cache, which isn't that large, relatively speaking, so the size of code running on this core has to be considered.
The pixel fill rate is about 60 megapixels a second. Which sounds like a lot for a system from that era, but becomes a big problem when you consider the system does multi-texturing, bilinear filtering, perspective correct texturing, anti-aliasing and environment mapping, amongst other things, all of which eat up a LOT of fill rate compared to what the system has to work with. (other systems of that era didn't use such effects, and it's estimated because of this the n64 was typically doing something like 5-8 times as much work drawing a single onscreen pixel as say, a playstation was doing.)
Essentially, the n64 was frequently being used to render a level of graphical effects that objectively speaking were out of it's league, and probably shouldn't have been attempted on a system with such relatively low performance.
So, no individual element of the n64 was that complex, and each part individually was actually quite powerful, but the pieces don't work well together, and the system is full of performance choke points.
Thus, what makes it a pain to work with is that it has very high peak performance, but lots of bottlenecks that drag it's average performance into the gutter.
In other words, the system is a high level optimisation nightmare.
Still, it's pretty clear that it doesn't even come close to the Saturn, and the Saturn's reputation for complexity is well earnt.
Just optimising DSP code alone already looks like a nightmare comparable to optimising n64 code in it's entirety...
Yeah, I do think you're right there. May just be the trickiest chip seen in a mainstream product. (Can't say trickiest EVER, because I'm sure there's some obscure supercomputer chip or something that's worse. XD)
So that's why Quake 2 64 looked like your marine smeared vaseline all over his visor?
Jokes aside, despite the complexity, it all definitely worked out well. Just look at Perfect Dark and compare it with any game in the Saturn or PSX, pretty much no game comes even close to it.
@@TDRR_Gamez Shenmue. VF2. Games that can compete against N64 fairly well.
Thanks for making the time to write that. I'd heard a bit about the N64's design but was missing a few details. It makes a lot more sense now. What an odd design, especially that memory addressing quirk.
oh my god wow im never going to be able to make games for that
Possibly your best video yet. Keep up the coding secrets!
Programming for DSPs is fun! I never touched the Saturn DSP, but in the past spent some years writing code for Texas Instruments C5000 and C6000 DSPs. They have all kind of weird hacks not usually present in regular CPUs, like guard bits in the accumulator, modulo addressing (extremely useful to implement digital filters), specific loop instructions to avoid branching (causing the pipeline to get emptied), instruction buffers inside the DSP to avoid reloading looping code each iteration, duplicated data buses...
If you write C code, the compiler is unable to use many of these (especially when coding for fixed point DSPs), so you end up having to write the computation intensive code in assembly language, and you need a good grasp of the hardware details to take advantage of many of these weird details.
I'm not a programmer, so I couldn't make sense of a lot of what you were showing, but it was still facinating to watch. I still hope your videos benefit the Saturn homebrew community somewhat.
Parallel programming on Assembly ? What kind of sorcery is that ???
Anyone can do it. Is just that in the mid 90s, Dual Core CPUs were almost unheard of.
VLIW
This was always done to some degree, for example on a PS1 you have "double buffering" where by the GPU renders the last "commands" while the CPU is feeding in the current commands. That is concurrency - you had to be careful not to do anything that would require the CPU to wait for the GPU to do something else bye bye performance
@@MrSapps PS1 doesn't have Double Buffering. It only has Basic Frame Buffering with heavy distortion. 48KBs Chache Max.
@@segaunited3855 It absolutely does have double buffering. And the scratch pad cache is 1kb. Here is a hello world with double buffering using Gs lib: github.com/nicolas17/psxdemo/blob/master/main.c
Either you take and precisely organize meticulous notes or you have a crystal clear memory. Either way, thank you for this incredible breakdown of a complicated process.
This is the hardware of my nightmares.
Nah.
This is brilliant, finally i found someone who shows how console optimization actually works. And it's so great to see how each line of code is thought in conjunction with the hardware schemas, .because nowadays when we program, mainly in Pc we need to abstract so many things and pray for everything to work out in hardware, see how values go from one place to another is somewhat comforting.
WOOOOOOOOOOOO!!!!!!!!
This is the kind of GameHut I love most!
Looking at this makes me appreciate all the effort that Saturn developers put into their games, especially 3D ones, I wish I was this good at math.
Wait so you can read from a register and write to it on the same cycle without the data becoming inconsistent? I wonder if the "impossible" bit from the manual is related to that.
Nina Satragno
Wait, what?
You activate the circuits at the same time, but the multiply (for instance) starts at the same time so works with the data that is already there. By the time the new data arrives the operation has already completed.
Good observation. I suspect that there is actually some form of pipelining involved: these instructions do indeed occur at the same time but each one works on the result computed on the previous line, not on the result gotten from the instruction on their left.
This is why the first line only contains instructions loading data into the registers and no actual computations.
These computations happen on the second line with the content loaded into registers while the new "MOV" instructions load the registers with data to be used for computation on the next line/cycle.
That part is completely normal though. This is the magic part of flipflops. Write occurs at the clock edge, read data is available very shortly afterward, and you have the rest of the clock to compute your expression involving the flop output (which in this case is also the new input to the flop). No data races, totally safe.
It's called latching, the cycle can have a number of phases, for example if all units latch the inputs at the rising edge of the cycle pulse, then all data is consistent, and writing happens somewhere later in the cycle. You can also see a form of manual pipelining involved, as the data is pre-read into the corresponding input register of each computational unit. The computational unit itself would be protected from the change in the corresponding input register during its operation with a latch.
Awesome quick overview! I love the more technical videos you put out
6:49 >The Impossible DSP
Will it be a video on DSP's ability to somehow stay afloat despite his bad financial skills and general ineptitude?
That's why I love but also hate acronyms. Once you've made that one link in your brain, it's never gonna go away.
You seen all the boogie2988 fuss? Heard he's the new dsp
Absolutely fantastic video can't wait for the next one!!
Dun-du-du-dunnnnnnnnnnnn! I got you.
Try programming for the Atari Jaguar. It was infamously hard to program back in the day, which is one of the reasons it got next to no 3rd party support. Try seeing why, at the very least.
Nikku4211 Jeff Minter moved to VLIW NUON after the Jaguar BTW.
Nikku4211
That and Wii U, apparently.
@DejaVoodooDoll I'm just going by what I heard about developing for it back in the day.
I actually understood what was going on when you explained it, but I'm pretty sure I would have had a really hard time figuring it out with just the documentation and some examples. Thanks for the really good explanation!
Don't lie, people. You mentally switched off half way through and just clicked the Like button. Didn't you.
Haha, nah. I'm playing with my Zilog so this is fun :D
These are absolutely fascinating. And I thought the PS2's two vector processing units were crazy!
How does the DSP does the multiplication and move a new value into the input registers at the same time and maintain consistency between if the original or new value is used? I only takes one clock cycle but surely the operations can't be truly concurrent? Are they sequential between themselves?
I'd guess so. Having a multi phase clock gen isn't unusual
An Option cycle is almost certainly multiple clock cycles. (it's possible to run various parts of a chip asynchronously and just have the physics of the chip manage timing of certain operations, but it almost never makes sense to do so)
The answer to this lies in the update order of a data flip-flop (building block of a register) and in delay.
The input registers are fronted to a data bus and, after the data is left to settle on the data inputs, is clocked in - however, that doesn't mean it has instantly shown itself on the outputs as that has a time delay associated with it.
At the same time, the data on the output has actually already been multiplied during the same settling period as the data inputs saw. The multiplied data is then clocked into the output register's flipflops. No data crossed streams with other data, and consistency is assured.
No sub-clocking (phased clocks), etc.. just pure logic gate delay and careful engineering. This was actually animated into the diagrams with the movement of the data on the wires (delay and settling time) and flashes (clocking points).
@@RachelMant the downside is that a lot of those operations become exquisitely sensitive to process variation and design changes. Setting various operations to trigger on rising edges and others on falling edges is more robust if possible and only two operation phases are necessary.
@Robert Szasz That is what circuit analysis and process analysis is for - to determine safe max clock speeds so that process variation and physics don't kick your design into instability.
For example, when prototyping to an FPGA, the synthesis tools are able to determine from a known maximum (worst case) per-gate delay time, how long each clock cycle must be to guarantee correct operation of such a design.
For something as fast as this and for when it was designed and constructed, such tools existed and the process analysis had been performed. While either multiple clock cycles per operation for deep pipelining, or subdividing the clock up into segments where certain actions are clocked in different parts of the clock waveform - with how it was described here and with what is happening in this block diagram, I find it unlikely such elaborate schemes would be in use.
For a DSP such as this, they'd only serve to make a larger surface area chip that runs hotter. More likely I think the DSP designers took care to ensure the worst-case timings weren't violated, and clock the simpler design as hard as they safely could. Less silicon required, less heat, faster in a real-time environment.
Awesome, the animations really help to understand the explanation
I'm always impressed by your videos and efforts to tell us how you guys programmed games for SEGA's consoles back in the 90's.
At the same time however, it makes me wonder what made the Saturn so much of a hassle for other studios to develop while there are things such as Sonic R, Tomb Raider and even early development versions of Shenmue and Sonic Adventure running on that thing.
It was not a hassle. Non Japanese programmers were never properly trained.
@@segaunited3855 It was very much a hassle. The manuals were poorly translated and riddled with errors, there was a new 68000 sound driver program every week with a different set of bugs, the development systems were flaky, the hardware itself had a lot of corner cases, and VDP1 was about 60% of the speed of the PS1's GPU at drawing polygons. Yes, the system had a certain personality that Sony's more sterile hardware lacked, but Sony's hardware also was more performant and simpler to deal with. We had to write extensive amounts of assembly on the Saturn to get performance even in the ballpark of what the PS1 gave us with straight C code.
@@ischmidt You're referring to the Western SDKs. Sega of America never provided any proper Sophia SDKs for developers because they completely dropped the ball on Saturn.
Majority of your problems early on came from programmers only using one Core to code C Language Assembly. Many of the games developed for 3D using only one Core suffered from unbalanced and improper programming. It was if the Saturn was performing with one hand tied behind its back.
Another was that many Non Japanese Developers like you weren't schooled on how to code Saturn's graphics and resorted only to using the VDP1 instead of Both VDP1 and VDP2.
Saturn's early SDKs were pretty bad, especially due to Sega of America's utter laziness and the fact that they wasted ALL of 1994 on 32X.
That was really interesting, thanks for this! I'm looking forward to 'the impossible DSP' video too.
Thanks Jon, please do follow-up.
This channel is a gold treasure
5:50 If this is done in parallel, we have a race condition? Moving a value to the X register and at the same time trigger the multiplier sounds dangerous.
Good remark, the execution is actually pipelined and the move will be loaded only right in time for the instruction on the next line to take it into account but not before.
im assuming the chip is just designed to write the result before writing in
Really enjoyed the animation that made the pipeline immediately clear!
I don't care what anyone says programming in assembly is a real blast lol
Anyways, I skimmed through the SCU documentation as well (because what else am I gonna do with my time?) and it seems that there is no issue with loading data from a bank to X&Y while also preforming the multiplication operation like some people seem to speculate; Sega themselves do it in some example code in the document "SCU DSP Assembler User's Manual". I have yet to find the actual "impossible" part of the code, maybe I'll keep trying to find it or I'll leave myself to be surprised in the future, we'll see.
I think i've got it? But i only skimmed, so i might be a bit off.
MOV MC1,X and MOV MUL,P are both X-bus operations so they technically collide in their bit encoding, and i think someone (at least whoever wrote the assembler) knew that this combination is possible and is disambiguated by hardware by setting one additional bit to MOV MCx,X and the hardware disambiguates that to not also mean a MOV MCx,P or something like that. But this knowledge was lost on the way to SEGA's techdoc team.
Y and A registers exhibit a similar non-collision that also doesn't seem accounted for in the manual.
I noted that as well, though looking at the instruction codes I don't think there's any issue. MOV MC1, X is encoded in the 6 bit X-bus control field as 100xxx (the three xs seem to determine which bank to load the data from) while MOV MUL,P is encoded 010xxx (here the xs are "don't cares"). So I don't think that's the problem...
@@PuyoPuyoMan I mean it obviously works, and seeing that it works it's easy enough to surmise why and how it works; however it's not strictly possible according to the word of the manual. The manual is merely wrong or sloppily formulated, this looks like deliberate hardware design, underpinned by the fact that the tool chain supports this.
"it's not strictly possible according to the word of the manual", did they say that it doesn't work in the documentation? I was looking for anything like that for a bit in the User's Manual but again I just skimmed it so I might've missed it if there was a part where it explicitly said "you can't use multiple x-bus commands at once" or something like that.
@@PuyoPuyoMan no, it doesn't go into enough detail and doesn't say that you can't issue two commands at the same time, but it would be implied given that both commands are specified with explicit overlapping bits that are not described as irrelevant. It is what my reading of the bit charts and it being presented there as alternate commands would suggest if I didn't know that it works. If the docteam expected it to work, they should have left the bit that sets the MUL switch out of the bitmask of the other MOVs on the same bus that are combinable with it and vice versa, and similarity with bits that are written as colliding on the other bus too. I read processor datasheets just about daily and i would have never guessed that it's possible from the datasheet, which is a gross miscommunication on their part.
Ohhhh, I see. I never really understood what "stream processing" was and why DSPs could be so fast. But now I see, instead of the "start and stop" of most CPUs where there's usually one operation at a time a DSP can read in more data at the same time and keep the data moving in a _stream_. And I can also see why this is so useful in digital audio applications where data (often from multiple sources) has to be read, mixed and output all at the same time. Using a DSP or multiple DSPs chained together a tiny, relatively slow DSP processor can handle much more data than a processor clocked much higher.
It does look very tricky to program, but for the types of programs it runs it's probably not too bad. I assume most DSP programs perform the same operations over and over, like transforming vertex data through a transformation matrix. That's just a bunch of multiplies in a row with all the data in contiguous memory. I don't see any branches in there, can they even branch?
Have you ever worked with Broadcom DSPs?
Very interesting. Thanks for taking the time to make videos like these.
I assume that the DSP is decoding the 6 instructions simultaneously? In practice is works like a 6 stage pipeline, but in coding, you prime each stage of the pipeline, then issue instructions to perform an action for each stage, is this correct? What frequency did the DSP run at? I think you might get a kick out of the Parallax Propeller microcontroller, it has 8 individual cores with 512 longs per core. In practice, each instruction executes in 4 cycles, with shared memory instructions taking longer if you miss the access window. If written correctly, you can align code and memory accesses to ensure you hit the main memory on every access window, resulting in a true 20MIPS per core. The hub memory is 32KB with another 32KB of ROM containing a bytecode interpreter (SPIN language), SIN/COS/LOG tables, and a bitmap font. I recently got back into doing some coding on a DOS platform and writing inline asm to speed things up, programming the Propeller is a lot like coding for those old processors, but all of the goofy things like "why do I only have 5 general purpose registers when the 8051 has the first 1K as a register", and "why must I load a segment descriptor into a general purpose register first, before loading the segment register", are not present in the Propeller. All 512 longs of COG ram are registers and can be treated as instructions or data. There is no cycle penalty for byte or word memory access, and there are a quite a few non-conventional instructions. Best of all, each core (COG) has a video generator builtin, so you can easily generate high res tiled video output or low res bitmapped output, or high res low-color output (the HUB memory size is the constraint). You can do VGA or NTSC directly from the COG with just a few resistors for support components.
EFormance Engineering Yes, Propeller is great.
Yes, this is correct, this is why the first line only contain register loading instructions.
Absolutely, it is clearly a pipelined design, with pipeline stages all taking a single cycle, contrary to other more complex DSP designs. And yes, the Propeller is certainly interesting :)
@@jollyrogerxp It is very cool how they designed it specifically for realtime bit-banging with no unpredictability. No interrupts, no need to care about preemption priorities and saving context. Instead you have enough "cores" to poll many different inputs.
@@noop9k absolutely, this is what one would do on an FPGA or any dedicated ASIC to process streams of data with guaranteed throughput and latency, which is what DSPs are good at for hard real-time constrained systems!
So glad to see you back at it. Good show as always.
This seems similar to what we called "FMA" today, but in integer only... funny to see today's generic CPUs mimicking DSPs like this in SIMD instructions to make better performance.
What seems most complicated to me is all ALUs in that era are all integer only. My programming skills raised in modern days which CPUs already provided nice FPU functionality, even with robust SIMD instructions. But back then you have to use fixed point numbers which is basically just integers with an imaginary decimal. All these bit shifting, number overflowing things, make me headache just for imagine them... Programmers in this era really had hard days.
It's pretty simple. CPUs with more than One Core were unheard at the time.
Yeah 16:16 fixed point is rather limiting
Thank you! This is getting crazier!
feos? What are you doing here?
-SEGA Saturn TAS's when-
Been following since the start. And I don't have any saturn games that interest me.
Oh I see!
Actually no, no I don't. :)
I know this is a joke but I find rewatching videos on complex topics and sometimes going as far to pause and take notes helps if you really want to understand.
I get it!
.... *I don't get it!*
@@AesculapiusPiranha
Thanks for the advice...quite wasted on me though as I'm 42 and if I haven't got it by now, and I haven't I shouldn't think it'll ever click.
I spent most of the late eighties and early nineties trying to learn to code...but never really progressed past a competent grasp of BASIC.
I'm comfortable with the fact I'm not clever enough...it's okay.
I love video games so much I love listening to basically white noise (to me) even if I don't get it.
@@stoicvampirepig6063 Don't give up! I'm 43 and I'm really enjoying getting into c#. =)
@@stoicvampirepig6063 Tried developing little games in GZDoom? It's quite easy and fun.
I'm just branching out from CPUs to FPGAs (where EVERYTHING runs in parallel), and haven't looked at DSPs yet, but this totally makes sense.
I've always thought of DSP as those magic chips that are somehow really good at math-intensive stuff, but didn't really know what they were doing special. In fact, now some of the (CPU) instructions that are often labeled as "for DSP routines" make more sense too, because they're usually something like multiply, add, and accumulate all in one cycle.
Looking at it from the FPGA angle helps, since you can see it's just cascaded ALU blocks that can run to coherency in the time of one cycle.
So thanks for the useful intro! :-)
DSPs and of course even more so FPGAs are devices that software-only people have a hard time to wrap their heads around, precisely due to the issue of many states changing simultaneously, rather than thinking about a single stream of instructions... :)
@@jollyrogerxp Our friends at Alethiea Games are planning to implement FPGA in The RAZOR.
1st 😎
Saturn was such a nice Machine and i like your Videos alot mr. Game Hut 😊
I love the technical explanation!! Thanks for sharing and keep up the good work.
Since you've also programmed the PS2, I'm curious as to why you feel that the Saturn DSP is harder to program than the PS2 EE.
VU microcode is also hell... actually most of PS2 coding is hell for similar reasons to the saturn.. so many damn bits of hardware all operating independently
I LOVE THIS STUFF! Thank you bud! Has anyone told you how calming your voice is?
If this was difficult to learn, imagine the Atari Jaguar coders doing 3D with so archaic and rare design with its two chipsets: "Tom & Jerry" (yeah, like the animation lol). But anyway, this is simply fascinating! I'm not a programmer but i can understand how difficult is (or was) learn the Saturn infamous hardware in general. It was more powerful than the PSX, no doubt, but it was rarely see that (Radiant Silvergun, Panzer Dragoon Saga, Sonic R, and some more, but not much examples unfortunely).
The sad thing about Saturn is how its actually an unfinished 64-bit Console. And how if only it had those 4 Months it REALLY needed to complete taping and documentation, it would have MURDERED the 5th Gen race at the very start.
Plus the mk68k which is a bottleneck cpu in the jaguar and never works well with 32 and beyond chips.
@@maroon9273Saturn has 68k, too
So I’m interested, given that you seem to know the Saturn hardware well, what you think the most accomplished, commercially released use of the hardware, and therefore the DSP, actually was. You’ve probably answered this question a million times!
No wonder the Saturn failed: it was a nightmare to program!
treasure's CEO back in the 90's (someone who touched pretty much every console up to then) said that n64 was many times worse than the saturn
also, in the 90's... if you wanted to have enough horse power to do math for a 3d game... a dsp was what you had to use, there was no other choice.
unless you fricking Ken Kutaragi.
the big difference between ps1 and all the other consoles from that era was the no dsp for the 3d math
khhnator I don’t quite understand you. PS1 had GTE - Geometry Transformation Engine which is a coprocessor integrated into the CPU.
Irrelevant. PS2 is a "nightmare" too, and it was an extraordinary success! For that matter i gladly believe the reports that N64 was essentially undebuggable, in spite of being, from the software perspective (not the hardware!) less convoluted.
Yeah it is a bit magical how the PS1 came together - integrating a very fast special purpose DSP as processor instruction flow obeying COP was a smooth move. As was the bucketed DMA. The system's a bit crude as far as what it could do, which it pretty much had to be given the time it came out at, but it was laser focused on making it easy and accessible.
@@SianaGearz PS2 was a success mostly due to Sony's marketing and the Playstation brand created by PS1. Also due to actually being fastest at the moment of release. Unlike Saturn which had inferior 3D capabilities vs PS1 straight from the start. Still, it had quite a few bad ports where devs obviously spent most of their attention on GameCube&XBox versions. DC ports with lower res, worse textures, worse audio..
Please do a full hour of this.... its awesome
But can do you do it without collecting a coin
To answer that, we need to talk about parallel universes...
@Hickory Mouse It's actually surprisingly simple.
Nicobbq sucks
That's some beautiful code, multiple assembler instructions on 1 line :)
The Saturn is really powerfull. The weak point is the complexty to program in this board.
miasuke
Please elaborate on its power potential. No one seems to mention it's overall potential.
I can seem why this was difficult to code for.
Strangely, though, I understood every bit of what you were saying and never got confused.
Your BG music is phenomenal.
Love this channel! Always something interesting about programming but with the context of retro games 👌🏻