- Videos: 27
- Views: 189,366
RTL Engineering
USA
Joined: Dec 22, 2017
Educational videos covering various topics in computer engineering and computer architecture, specifically focusing on mid 1990s to early 2000s computers and game consoles. Most topics will be presented with FPGA implementation in mind. These topics will be presented assuming some prior knowledge of computer systems and digital electronics.
GPU Memory, and Clogged Pipes (Part 3 - PS2 and Dreamcast/PowerVR) - #GPUJune2
What is the largest factor that makes GPUs tick, besides the core clock? It's the memory, which can cause a GPU pipeline to flow or grind to a halt.
This video picks up where the last one left off, by going over the memory architecture of the Playstation 2 and Dreamcast/PowerVR GPUs. This video was made as part of #GPUJune2.
Part 1 of this video can be found here: ruclips.net/video/wsEcgNcEObI/видео.html
Part 2 of this video can be found here: ruclips.net/video/7_XOmyrtFrM/видео.html
Chapters:
0:00 Recap
1:15 Playstation 2 Overview
5:34 PS2 Memory Structure
8:39 PS2 Detailed Pipeline
15:20 Bandwidth Limitations and Dreamcast
18:28 Binning and TBDR
21:30 Overdraw and HSR
23:19 Transparency
25:17 Dream...
Views: 4,765
Videos
GPU Memory, and Clogged Pipes (Part 2 - 3dfx Voodoo) - #GPUJune2
Views: 4.1K · 2 years ago
What is the largest factor that makes GPUs tick, besides the core clock? It's the memory, which can cause a GPU pipeline to flow or grind to a halt. This video picks up where the last one left off, by going over the memory architecture of the 3dfx Voodoo GPUs. This video was made as part of #GPUJune2. Part 1 of this video can be found here: ruclips.net/video/wsEcgNcEObI/видео.html Part 3 of thi...
GPU Memory, and Clogged Pipes (Part 1 - PS1 and N64) - #GPUJune2
Views: 9K · 2 years ago
What is the largest factor that makes GPUs tick, besides the core clock? It's the memory, which can cause a GPU pipeline to flow or grind to a halt. This video goes over the basics of why a GPU pipeline is memory intensive and goes over a detailed example with the PlayStation 1 and Nintendo 64. This video was made as part of #GPUJune2. Part 2 of this video can be found here: ruclips.net/video/7...
GPUs, Points, and Lines - #GPUJune2
Views: 2.6K · 2 years ago
How GPUs draw triangles is common knowledge, but have you ever wondered how GPUs draw other primitives like points and lines? This video is part of #GPUJune2, and goes over several methods, with some examples in early 3D accelerators. Correction: The far edge is the line where s + t - 1 = 0. Clarification: When referring to barycentric rasterizers, I am specifically talking about those that use ed...
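The correction above reconstructs to the far-edge line s + t - 1 = 0. Below is a minimal sketch of how that edge test can be applied in a parametric (barycentric-style) point-in-triangle check; the helper function and vertex layout are my own illustration, not the video's exact rasterizer.

```python
# A minimal sketch of the edge test implied by the correction above, assuming
# a triangle parameterized by (s, t): the two near edges are s = 0 and t = 0,
# and the far edge is the line s + t - 1 = 0.

def inside_triangle(p, v0, v1, v2):
    """Solve p = v0 + s*(v1 - v0) + t*(v2 - v0) for (s, t), then apply the
    three edge tests: s >= 0, t >= 0, s + t - 1 <= 0."""
    ax, ay = v1[0] - v0[0], v1[1] - v0[1]
    bx, by = v2[0] - v0[0], v2[1] - v0[1]
    px, py = p[0] - v0[0], p[1] - v0[1]
    det = ax * by - ay * bx          # zero for degenerate triangles
    if det == 0:
        return False
    s = (px * by - py * bx) / det
    t = (ax * py - ay * px) / det
    return s >= 0 and t >= 0 and (s + t - 1) <= 0

print(inside_triangle((1, 1), (0, 0), (4, 0), (0, 4)))  # True
print(inside_triangle((3, 3), (0, 0), (4, 0), (0, 4)))  # False, beyond s + t - 1 = 0
```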
Pentium MMX Predecoding in a FPGA
Views: 959 · 2 years ago
A short followup video going over how one could map a Pentium MMX predecoder to a FPGA, as well as some resource utilization and performance results. Chapters: 0:00 Predecoder to LUT Mapping 2:34 Adding the L1 Cache 4:08 Test Results
Quake, Floating Point, and the Intel Pentium
Views: 77K · 2 years ago
The transition from mostly 2D games into immersive 3D environments was brought on by none other than the original Quake, making computer game history. At the same time, computer architecture began making a pivotal change from single pipelines into super-scalar. These two simultaneous changes shook up the PC processor market with architectural ramifications that still last till today. This video...
MicroOps in the Pentium MMX
Views: 974 · 2 years ago
A short video talking about how the Pentium P5 and Pentium MMX used micro-operations compared with more traditional processors like the AMD K6, and the implications for the length pre-decoder of the Pentium MMX. Chapters: 0:00 Intro and PMMX Overview 1:51 Comparison with K6 4:50 PMMX vs K6 uOP Throughput 6:50 Instruction Queue 7:37 uOp Decoding Analysis 9:46 Parameter Elimination 11:02 Opcode t...
x86 Decoding Simulation in the Pentium MMX
Views: 515 · 2 years ago
Chapters have been marked, feel free to go directly to the simulation, skipping the intro and explanation. A supplementary video for the longer video on x86 front end complexity. This is a toy simulation of how the decoder in the Pentium MMX worked. Chapters 0:00 Introduction and Decoding Rules 1:34 Simulation Overview 4:15 Simulation 4:35 Comparison with other Architectures Attributions: Decod...
x86 Front End Complexity (Part 2 - Pentium MMX)
Views: 1.2K · 2 years ago
The second part of a longer video talking about the complications introduced by the x86 instruction set as it relates to front end architecture design, and how early processors handled this complexity. This video covers the Pentium MMX (P55C). The next part will cover the AMD K6 and Intel P6. Chapters: 0:00 Recap and Problem Statement 2:32 Possible Solutions 6:04 Actual Solution 8:05 Solution ...
x86 Decoding Simulation in the 486
Views: 773 · 2 years ago
Chapters have been marked, feel free to go directly to the simulation, skipping the intro and explanation. A supplementary video for the longer video on x86 front end complexity. This is a toy simulation of how the decoder in the 486 worked. Chapters 0:00 Introduction and Decoding Rules 1:14 Simulation Overview 2:50 Simulation 3:22 Comparison with other Architectures Attributions: Decode icons ...
x86 Decoding Simulation in the Pentium P5
Views: 823 · 2 years ago
Chapters have been marked, feel free to go directly to the simulation, skipping the intro and explanation. A supplementary video for the longer video on x86 front end complexity. This is a toy simulation of how the decoder in the original Pentium worked. Chapters 0:00 Introduction and Decoding Rules 1:50 Simulation Overview 4:28 Simulation 5:03 Comparison with other Architectures Attributions: ...
x86 Front End Complexity (Part 1 - Pentium P5)
Views: 4.1K · 2 years ago
The first part of a longer video talking about the complications introduced by the x86 instruction set as it relates to front end architecture design, and how early processors handled this complexity. This video covers some of the background needed, gives a short historic overview with the 8086, 286, 386/486, and then dives into the original Pentium (P5). The next part will cover the Pentium M...
Channel Update, New Workflow, and Project Teaser
Views: 504 · 2 years ago
A quick channel update given that I have not posted a video in quite a while. Showing off some rasterization demos in the background, with a tease for a project that I am currently working on. The audio will take a bit of experimentation on my part, but so far has been at least 5x faster than using recorded audio. Special thanks to FPGAzumSpass for his framebuffer simulation module and viewing ...
Implementation of a VR4300 ALU Shifter
Views: 1.3K · 5 years ago
A short video detailing a few different implementations for an FPGA-based mixed integer and floating-point ALU shifter to be used in the VR4300 CPU implementation. *Edit* Unfortunately, the lecture notes that I referenced have since been removed, so a link cannot be added. I am not affiliated with any of the companies mentioned in the video. This video is intended for educational purposes.
A High Performance FPGA Game Console Platform
Views: 11K · 5 years ago
A video discussing a potential future replacement to the MiSTer FPGA platform to allow for larger and more performance intensive FPGA cores to be implemented. I am not affiliated with any of the companies mentioned in the video. This video is intended for educational purposes.
Implementation of an Efficient Floating-Point Complementor
Views: 623 · 5 years ago
Implementing an Efficient MIPS III Multi-Cycle Multiplier
Views: 900 · 6 years ago
Designing an Efficient MIPS III Load Store Unit
Views: 1K · 6 years ago
Designing an Efficient MIPS TLB [Part 2]
Views: 625 · 6 years ago
Designing an Efficient MIPS TLB [Part 1]
Views: 1K · 6 years ago
Designing an Efficient Combined Register File
Views: 1.1K · 6 years ago
Designing an Efficient Leading Zero Counter
Views: 2.7K · 6 years ago
Game Console Hardware Architecture (5th and 6th Generation)
Views: 9K · 6 years ago
FPGA Game Console Emulation and its Limitations
Views: 4K · 6 years ago
The narrator needs a speech co-processor. How time heals everything. Just a few years later we could play GLQuake on a budget Celeron. It was darker, but very mean and affordable.
Quick note: the Cyrix 6x86 PR233 was not running at 233 MHz; it was using a Pentium equivalence rating and running at either 188 MHz (2.5 x 75 MHz) or 200 MHz (3 x 66 MHz). The same goes for their MII CPUs, which still used a Pentium Rating naming scheme instead of the actual clock.
Is this guy AI?
Earlier pseudo-3D games didn't use floating point the way full 3D games do, so a weak FPU was enough. Quake is a full 3D game: Doom 1 and 2 used sprites, while Quake has full 3D models.
What I wouldn't give to be an accomplished engineer.
A tip: the synthetic voice is not too bad, but boy, is it fast. At 0.75x playback speed it becomes more comfortable.
This reminds me of Jim Keller's comment that x86 decode 'isn't all that bad if you're building a big chip'.
Doing the lords work
Great argument. One question, though, about the LU OR implementation: can using tristate buffers enabled by the decoder help? The AND-OR stage is essentially doing just that. Which of the two would be more preferable from a first-cut perspective?
Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers. A bfloat16 MAC, one addition and one multiplication per clock: ~100 LUTs + 1 DSP48E2 @ >600 MHz, with the result accumulated in >256 bits. A tensor core needs 64 of these => ~6,400 LUTs + 64 DSP48E2.
It's on Linkedin, eventually on arXiv. YT is not letting me post more, not sure why
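For readers curious what an "exponent indexed accumulator" might look like, here is a minimal sketch of my reading of the idea: exact products are added into integer bins selected by the product exponent, and the wide result is only assembled at read-out. This is my own interpretation with made-up helper names, not the paper's actual design, and it models bfloat16 only loosely (8 significand bits, no rounding or special-case handling).

```python
# Rough sketch (my reading of the comment above, not the paper's design):
# accumulate exact bfloat16-style products into integer bins indexed by the
# product exponent, so each step is one small integer add; the wide result
# is only assembled when the accumulator is read out.
import math
from collections import defaultdict

def bf16_decompose(x):
    """Return (signed integer mantissa, exponent) with x ~= mant * 2**exp,
    keeping 8 significand bits to mimic bfloat16."""
    if x == 0.0:
        return 0, 0
    m, e = math.frexp(x)           # x = m * 2**e, 0.5 <= |m| < 1
    mant = int(round(m * 2**8))    # keep 8 significand bits
    return mant, e - 8

def mac(acc_bins, a, b):
    """One multiply-accumulate: the exact product of the truncated operands
    goes into the bin selected by the product exponent."""
    ma, ea = bf16_decompose(a)
    mb, eb = bf16_decompose(b)
    acc_bins[ea + eb] += ma * mb   # exact integer add, no rounding

def read_out(acc_bins):
    """Collapse the exponent-indexed bins into one exact sum."""
    return sum(m * 2.0**e for e, m in acc_bins.items())

bins = defaultdict(int)
for a, b in [(1.5, 2.0), (0.125, 4.0), (3.0, -0.5)]:
    mac(bins, a, b)
print(read_out(bins))  # 3.0 + 0.5 - 1.5 = 2.0
```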
Where was this video when I was scratching my head with _exactly_ the same problem and had to come up with all this by myself :_(
The N64 fell short on the MiSTer platform, and now we have an idea of what it takes for the PS1, Saturn, and N64 to run. What are your thoughts on the new platforms like the MARS and the Replay2? Are you still making your own FPGA? It's been almost 2 years; I would like for you to upload a new video. What do you think now?
It saddens me to see all the fuss about the speech synth thing. If you are into these kinds of things, this video is outstandingly good, going deep into what's going on. I can understand a difference in opinion about the perceived quality of the video (which is subjective anyway), but the claims I've read that the video is low-effort or that the author is lazy are hurtful and unfair, especially taking into account that it's done for free and publicly available.
I'm curious, now that we did get (almost all of) an N64 core on MiSTer, if you think this platform is more interesting to pursue. In this video you mentioned you'd like to wait for that to become reality before pursuing this further.
Thank you for this
The N64 could also decode lossily compressed audio like MP3 and today even Opus via the RSP. Some later Rare, Factor 5 and Boss Game Studios games implemented MP3.
Oh wow, blud can actually speak! I'm used to the Zoomer tts.
my first PC was a Pentium 100, I got Quake right after it launched......didn't run that great lol
Quake, Floating Point, and the Intel Pentium because it was a pentium and not a pentium 4🤣🤣🤣🤣🤣
I love this video
The most detailed explanation of GPU memory on the net. Thanks a lot for the videos. 👍
Please do a Dreamcast one!
The WinChip did well, but it wasn't widely known.
It was cheaper, but kind of slow.
Why is this done in TTS? I guess if you just like writing powerpoints for RUclips...
Or that I really dislike editing my own audio. AI voice generation took hours rather than days. That is on top of researching, writing the script, and creating the visuals, which took several weeks. Then following that up with tedious spectrogram work is quite an unpleasant experience. The primary content is not the audio, it is only one component to the medium.
I remember buying a Pentium 166mmx chip and thinking it was perfect priced/performance back then.
Funny to think how we see tessellation as triangles when it’s a triangle representing a pyramid, representing points.
I'm not sure clock rate is a good metric to use for GPU speed. Really, it should be transistors x clock speed? It makes the phrase at the end a bit hollow, since GPU compute has generally been about scaling horizontally instead of vertically, and will definitely give people the wrong impression. It just makes it sound like you're trying very hard to justify your original premise of memory being more important than compute, when really it is both. Especially since compute has outstripped memory many times in GPU history, leaving them starved. I otherwise very much enjoy your videos, great work!
Thanks for the feedback! Could you give an example where compute outstripped memory? The only cases I can think of were marketing (i.e. less VRAM was chosen to save cost, which is not a technology / architecture limitation). I disagree with transistors being a good metric; that's similar to comparing software by lines of code. Transistors are used for the compute, but also for on-chip memory, data routing (busses), clock distribution, miscellaneous controllers, and I/O buffers. What you really want to use is operations/second, which for fixed-function GPUs would be fill-rate. Comparing clock speed and fill-rate gives you an indication of where the performance came from. If fill-rate grows faster than clock speed, then the performance comes from scaling horizontally, whereas the contrary is from pipelining or a technology shrink. Bandwidth (memory) does still play a role there, but it's impossible to unlink the two in this domain, as it forms an integral part of the processing pipeline. Also note that none of the memory claims (except for the PS2) account for DRAM overhead, which will necessarily result in degraded performance compared to the ideal (peak) numbers.
@RTLEngineering Sure, transistor count isn't great either, but that doesn't mean clock speed is a good indicator. Current-day GPUs are WAY more than 15x faster than the PS2 GPU in terms of compute. Regardless, memory and compute are as important as each other; one isn't the main contributor vs. the other. And by compute outstripping memory, I mean memory bandwidth becoming the bottleneck. The GeForce 256 was notoriously limited by its memory bandwidth, and they later released the GeForce 256 DDR to unlock its potential. It's simply a matter of balance and bottlenecks. You could possibly chart FLOPs vs memory speed, idk, but anything is better than Hz.
Then I guess you're in agreement with me / the videos. The entire premise was that bandwidth was the driving factor for performance, not memory capacity or clock speed. Every GPU that I am aware of has been limited by the bandwidth in one way or another, the GeForce 256 being no exception. And the GeForce 256 DDR was still limited by the memory bandwidth. Unfortunately we can't plot FLOPs, because most older GPUs didn't execute floating-point operations. Similarly, when considering modern GPUs, FLOPs is not a great metric for render performance since large portions of the pipeline are still fixed-function. So fill-rate remains the better metric, which serves as a proxy for "FLOPs". That's also what I used in the videos, not clock speed; the clock speed was shown to indicate that it was not the major contributing factor. Also, what you were describing (FLOPs vs memory bandwidth) is called a roofline, which is the standard method for comparing performance of different architectures and workloads.
@RTLEngineering My issue is that I'm making a hard distinction between logic units and memory bandwidth, whereas I think you've explicitly shown that they are deeply coupled, proving that the line is effectively a lot blurrier than I previously understood. I'm just smoothbrained from all the hardware reviews making hard distinctions between the two. Thanks for your detailed replies!
@RTLEngineering One-word counterargument: RDNA2
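To make the roofline idea from this thread concrete, here is a minimal sketch of the model. The peak fill-rate and bandwidth figures are made-up placeholders chosen for illustration, not measurements of any GPU discussed in the video.

```python
# Toy roofline model: attainable throughput is the lesser of the compute roof
# and what the memory system can feed (bandwidth * arithmetic intensity).
# The peak numbers below are made up for illustration, not any real GPU.

def roofline(peak_ops_per_s, peak_bytes_per_s, ops_per_byte):
    """Attainable operations/s for a workload that performs ops_per_byte
    operations per byte of memory traffic."""
    return min(peak_ops_per_s, peak_bytes_per_s * ops_per_byte)

peak_fill = 1.0e9   # hypothetical peak fill-rate, pixels/s (the compute roof)
peak_bw   = 6.4e9   # hypothetical memory bandwidth, bytes/s

# A rendered pixel might read a texel, read/modify/write Z, and write color,
# so its "arithmetic intensity" is roughly 1 pixel per N bytes moved.
for bytes_per_pixel in (4, 8, 12, 16):
    attainable = roofline(peak_fill, peak_bw, 1.0 / bytes_per_pixel)
    limit = "compute-bound" if attainable >= peak_fill else "bandwidth-bound"
    print(f"{bytes_per_pixel:2d} B/pixel -> {attainable / 1e6:7.1f} Mpixels/s ({limit})")
```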
These DAMNED computer READ VIDEOS are BULL SHIT!!>. Can't even pronounce shit right and I WILL NOT Reward YOU for stealing someone's work!!
Your engagement by leaving a comment technically is a reward. Luckily, if you spend a few seconds thinking about it or reading the other comments, you will see that your concern is unjustified (i.e. no work was stolen). If you need a hint, this video was posted almost two years ago (pre AI craze), meaning a human would have had to write the script. The AI voice was chosen to save production time on my part, and I did take care to make sure all of the pronunciations were correct. The only issue was "Id" in "Id Software", which is said twice in a 20 minute video. Regardless, you're free to dislike the video and not watch it due to the voice over, but claiming plagiarism is uncalled for!
Up to the GameCube would be great. The Wii and Wii U are modern and will run just fine in software emulation.
Man, I hate to say it, but x86 is ugly. I can see why RISC was a huge deal back in that era. It would be cool to see the architectures compared. Early ARM was very odd with its barrel shifter in every instruction, though MIPS and Power were more popular in the '90s. Even just looking at how the Z80 did its instructions... DJNZ is just a little dirty.
The most exciting thing about emulation in hardware is the ability to modify the graphics hardware to render in higher resolutions, at least that's one of them.
The Cyrix 6x86 PR233 ran at 188 or 200 MHz, depending on the version and bus speed.
I need to make a video on the total disaster my first "real" PC made from actually new parts was. It absolutely hauled for productivity and web browsing (back when page rendering speed mattered even on 56k!) but was an absolute dog at games. I picked pretty much the worst combo I could have back then for performance and stability.... A K6-2, An ALI Aladdin V chipset mobo and an NVIDIA TNT2. I'd have been better off with a PPGA Celeron, 66 MHz FSB and all and the cost difference would have been almost nil. Quake engine titles suffered the worst as expected but Unreal engine stuff wasn't exactly amazing either, though the latter DID benefit from 3DNow! without AMD making a special patch like they did for Quake II. I stayed with AMD for the next rig I built for my 16th birthday....Athlon Tbird 1000 AXIA stepping OC'd to 1400 and a Geforce 2 Pro on a KT133A board. That was a proper rig though it combined with the barely 68% efficient PSUs at the time kept my room rather warm. I learned a lot in between those two rigs.
This looks like an AI generated video.
I wonder how big the die size of the PowerVR 2 GPU inside the DC is. The Voodoo 3 is 74 square millimeters, and the PowerVR 2 looks like 2x the size.
16:55 didn't the Woz design the Apple II video circuitry to do DRAM refresh while drawing the screen, leading to a very unusual framebuffer layout?
What about the 6x86? How does it differ from the K6?
So why don't you just say "matrix operation core" or "matrix multiplication core"? Why make things complicated with differing terminology like "tensor"?
Probably because the association was for AI/ML workloads which work with tensors (matrices are a special case of the more general tensor object). Though I am not sure why "Tensor Core" was chosen as the name since other AI/ML architectures call them "Matrix Cores" or "MxM Cores" (for GEMM). It might just be a result of marketing. I would say "MFU" or "Matrix Function Unit" would be the most descriptive term, but that doesn't sound as catchy.
How much memory is chip-local in the RDP DMEM, to be used as a hardware z-buffer or as a chip-local frame buffer extension?
None. There's a small cache that's controlled by the hardware (to cover bursting), but otherwise the z-buffer and frame buffer are stored in the shared system memory. The DMEM on the RCP can't be used for z-buffer or color directly. It can be used for it indirectly, but you're going to end up copying stuff in and out of main memory which will perform worse than not using it at all. Alternatively, it's possible to program a software renderer using SIMD on the RCP, but it would leave the RDP idle.
@RTLEngineering You can do microcode changes directly, maybe a true hardware z-buffer, using the DMEM/IMEM 4 KB caches.
@RTLEngineering Maybe TMEM could be partially used as a local z-buffer cache, while the other part is used as normal texture memory.
That's what I meant by "software render using SIMD". There's no read/write path between the DMEM and IMEM, nor is there a read/write path between the DMEM and the fixed-function RDP path. All communication between them would need to be done using DMA over the main system bus. Regarding TMEM, it's the same: there's no direct write path, so you can only write to the TMEM using DMA. Worse yet, the DMACs in all cases required that one address be in main memory, so you couldn't DMA between the memories without first going through the main memory.
AI voice overs are unlistenable.
I've been trying to work out how they actually implemented the multiplier in the real r4300i design. The datapath diagram in the "r4300i datasheet" shows they are using a "CSA multiplier block" and feeding its result into the main 64-bit adder every cycle (which saves gates; why use dedicated full adders at the end of the CSA array when you already have one?).

Going back to the r4200, there is a research paper explaining how the pipeline works in great detail, and the r4300i is mostly just an r4200 with a cut-down bus and a larger multiplier. The r4200 uses a 3-bit-per-cycle multiplier, shifting 3 bits out to LO every cycle (or the guard bits for floats) and latching HI on the final cycle (floats use an extra cycle to shift 0 or 1 bits right, then repack). I'm assuming they use much the same scheme, but shifting out more bits per cycle. So it's not that the r4300i has multipliers that take 3 and 6 cycles and then take two cycles to move the result to LO/HI, but that the 24-bit and 54-bit multiplies can finish 1 cycle sooner. So I think the actual timings are: 3 cycles for 24-bit, 4 cycles for 32-bit, 6 cycles for 54-bit, and 7 cycles for 64-bit (though you need an extra bit for unsigned multiplication). To get these timings, the r4300 would need a 10-bit-per-cycle multiplier.

If I'm understanding the design correctly: every cycle, the CSA block adds ten 64-bit-wide partial products. 10 bits are immediately ready, shifting 10 bits out to LO, and the remaining 63 bits of partial sums and shifted carries are latched into the adder's S and T inputs. On the next cycle, the CSA block also takes the reduced partial sum from the adder's result as an 11th input to the CSA array.
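As a sanity check on the cycle counts above, here is a toy model of an iterative multiplier that retires a fixed number of product bits per cycle. The 10-bit-per-cycle radix is the comment's hypothesis, not a documented fact about the r4300i, and the code only models the arithmetic and cycle counting, not the CSA/adder datapath.

```python
# Toy model of an iterative multiplier that retires RADIX_BITS bits of the
# product per cycle, as hypothesized above for the r4300i-style datapath.
# For reasoning about cycle counts only, not a statement about the silicon.

RADIX_BITS = 10

def iterative_multiply(a, b, operand_width):
    """Multiply unsigned a*b by consuming RADIX_BITS bits of b per cycle,
    the way a CSA block feeding a single carry-propagate adder might.
    Returns (product, cycles)."""
    acc = 0
    cycles = 0
    shift = 0
    remaining = b
    for _ in range(0, operand_width, RADIX_BITS):
        digit = remaining & ((1 << RADIX_BITS) - 1)   # next 10-bit digit of b
        acc += (a * digit) << shift                   # partial-product reduction + add
        remaining >>= RADIX_BITS
        shift += RADIX_BITS
        cycles += 1
    return acc, cycles

for width in (24, 32, 54, 64):
    a = (1 << width) - 3
    b = (1 << width) - 7
    product, cycles = iterative_multiply(a, b, width)
    assert product == a * b
    print(f"{width}-bit multiply: {cycles} cycles")
# Matches the comment's estimate: 3, 4, 6, and 7 cycles respectively.
```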
I've always heard people say that TBDRs have a frame of latency. Maybe that was the case for older designs, I'm really not sure, but a lot of the time it felt like people misinterpreting what was happening, because I've never seen anything from Imagination saying that. All that's happening is that, instead of the vertex and pixel shading being interleaved, like in IMRs, it's more like all the vertex shading happens and then all the fragment shading happens. There's nothing about this that requires the pixel shading to happen on the next frame. The two stages don't take the same amount of time either. A triangle more or less just stays three points (three numbers per point) through the whole vertex shading stage. One of those triangles can become hundreds of pixels in the rasterization phase, though, and that's going to take more time to compute and write to memory. In that sense, an IMR may have its whole pipeline backed up by one triangle that turns into a particularly large number of pixels. Since a TBDR keeps the stages separate, it can potentially finish its vertex shading for a frame in far less time with fewer stalls. Then the fragment shading stage gets a huge boost from HSR and its dedicated, fast on-chip buffer. Now you're right in that, while it's fragment shading one frame, it can start vertex shading the next; it's not like it's waiting for the next frame in order to start pixel shading. It's just getting started on the next frame before the current frame is done.
What is meant by "1 frame of latency" comes down to the fact that all triangles must be submitted before rendering can begin, at least with the older GPUs. The new PVR archs (especially those used in the Apple SoCs) can reload previously rendered tiles, but the GPU used in the Dreamcast had no method to load the tile buffer from VRAM. So in practice, you want to pipeline the entire process, which gives you that extra frame of latency that IMRs don't require (since the render tiles can be revisited there). Submit -> Vertex Transform -> Bin -> Render -> Display. While you could do all of those in a single frame, that necessarily reduces the total amount of work you can do, else you will have to re-render the previous frame (introducing latency). For IMR, you don't need the Bin stage, and can instead interleave them, meaning you have...
TBDR: |Submit -> Vertex Transform -> Bin ->| Render ->| Display|. (3 frames)
IMR: |Submit -> Vertex Transform -> Render ->| Display|. (2 frames)
Note that the Dreamcast was specifically modified to reduce this latency under certain scenarios, in which the tiles can be rendered in scanline order, meaning that the next frame can start to be displayed while the pixel visibility is being computed and then shaded.
Dreamcast: |Submit -> Vertex Transform -> Bin ->| Render -> Display|. (2 frames)
@RTLEngineering If the Dreamcast couldn't reload the tile buffer from VRAM, then I don't know how that would be an issue unless the game was trying to use the rendered image as a texture. Outside of that, what gets rendered to the tile buffer and then out to VRAM is the finished tile for the frame. It only needs to be read by the display controller and sent to the TV. It can still read from its tile list in the same frame.
|Submit -> Vertex Transform -> Bin ->| Render ->| Display|
|Submit -> Vertex Transform -> Bin ->| Render -> Display|
What you're saying about the Render and Display steps makes complete sense to me. It's the separation between Bin and Render that makes none. It's not reading from the tile buffer here; the tile buffer is at the end of the pipeline. The Bin -> Render stage is when the tile list is being pulled into the GPU from VRAM to be rendered. There's nothing that would necessitate waiting for the next frame deadline for this to happen. If the GPU can't read the tile buffer from VRAM, then that wouldn't cause an issue, because the tile buffer isn't the tile list / parameter buffer, which is all that needs to be read in that stage. The tile list can obviously be read from VRAM, because that's where they're stored. If it couldn't, then the GPU wouldn't work at all. I could understand it if you're looking at an example where the last triangle is submitted close to the deadline, though. The IMR will have already completed rendering of almost all previous geometry and only needs to finish that up. In that same case, yes, the TBDR will not complete rendering before that deadline, because it was waiting for the last triangle to start rendering. But saying that these two stages always happen in different frames would be incorrect. For example, if you're just rendering a menu on the Dreamcast, then the amount of submitted geometry would be so little that it could be counted by hand. The CPU computation and geometry submission could take, let's say, half of a millisecond. The transform and binning stage would take less than that. At that point it's not going to wait for the next 16 ms before it starts rasterizing and texturing those triangles, though. It's just going to start reading the tile list right after it's done with binning, and it will finish rendering far before the next frame deadline, and there will be no frames of latency.
The issue is that every triangle for a frame must be binned before rendering can begin. So if you have 4M triangles in a frame, you must first submit, transform, and bin all 4M triangles before the first render tile can be touched. If you start rendering a tile before binning is complete, then you may finish visibility testing and rendering before all of the triangles are known for that tile; that would result in dropping triangles over the screen randomly based on submission order. This is a hard deadline which is not necessary for IMR - IMR can accept new triangles until a few cycles before the new frame must be presented to the display. Even the IMR architectures that do a type of render tile binning do so on a rolling submission basis, because they can return to a previously rendered tile. The scenario you described is correct; in that case you have less work to do and therefore the deadline isn't as tight. But in general, a game developer wants to submit as many triangles, with as many textures, with as many effects as possible, per frame. If you combine Submit->Vertex->Bin->Render into a single frame and target 60 fps, then that 16 ms must be divided into the two phases: Submit->Vertex->Bin, and Render. So if Submit->Vertex->Bin takes 10 ms, then you only have 6 ms to render all of the tiles (480p would be 300 tiles, so 20 us per tile), which limits the total triangles per frame. Also keep in mind that Submit->Vertex is done on the CPU (for the Dreamcast) and is interleaved in the game logic itself, so that's going to take longer than if all it were doing is pulling from a preset list in RAM. Binning is done on the GPU, but only handles 1 triangle at a time, so that will be slow if there are too many as well. (It's a write-amplification task, meaning it can be done in bounded but not constant time.) Regardless, if you take that approach during a game, you're likely going to drop every other frame to catch up with rendering. The alternative is to render the tiles as you display them, but that would mean that all 20 horizontal tiles need to be rendered within 1/15 of a frame, or 53 us each. If the row of tiles is not complete by the time they need to be displayed, then you again need to drop the frame or accept screen tearing. While that same number is also true for the entire screen at once, you have 300 tiles to balance out the load rather than relying on 20 (you're more likely to have some tiles that take 2 ms and some that take 2 us in a pool of 300 than 20). In both cases, if you drop a frame, then you get 1 extra frame of latency. And besides, in your menu example, 1 extra frame of latency is not important... you should be thinking about the cases in which both latency and performance matter.
@RTLEngineering I think you're misunderstanding where I'm disagreeing with you. I'm not disputing that submitting and binning all the triangles for the scene before rendering is a hard requirement on a TBDR, or that "chasing the beam" with tile-render order would be required to get the most work done before the frame deadline. Where my issue lies is with saying that the 1 frame of additional latency is a rule that's built into how the hardware works, when it isn't. That's the reason why I mentioned the menu example. It's not representative of the workload of a full 3D game scene, sure, but it demonstrates a real scenario that's not uncommon on the Dreamcast or a PC where the GPU would be needlessly wasting time and adding latency if it were really a hard requirement for the GPU to do its vertex stage and fragment shading across two different frames. That's not how any GPU designer would design their GPUs, and that's not how Imagination designed theirs. You could say that a frame of latency is a side effect of the architecture when triangles get submitted too close to the deadline, and you could even say that that's common, but explaining it as if the hardware literally can't avoid the latency in any scenario and it's an absolute requirement of the hardware... is wrong. If the hardware has enough time before the frame deadline to finish rendering after it's done binning (like in the menu example, or even in the case of a 2D game), then it will do that, and it won't have the latency. This is an architectural video, so it should describe the architecture. If there's a realistic limitation to that architecture when in use, then that should be mentioned too, but it shouldn't be phrased like that limitation is built into the architecture. I mean, you said it yourself: if you target 60 fps, then you have 16.6 ms to render the frame. That could be 10 ms for Submit->Vertex->Bin with 6 ms for ->Render->Display... but you CAN do that. It's not a hard limitation. Also keep in mind that workloads in a game aren't constant; they vary. If that's how one frame works out, then the next frame could be the inverse of that, 6 ms for Submit->Vertex->Bin with 10 ms for Render->Display, and that assumes the CPU waited until the deadline of frame 1 before it started frame 2. If it started the vertex stage for frame 2 right after it was done with the vertex stage of frame 1, then it would be ready to start rendering frame 2 around the time of the deadline for frame 1. Sure, frame times aren't often that erratic, you can argue that the scenario I just mentioned isn't common, and you could say that the hardware would be underutilized in that scenario, but the hardware IS capable of it. Lastly, any game that hits a stable 60 fps likely isn't just hitting its deadline; it's done way before it and is just capped at 60 fps. The same is also true for 30 fps games. Without a cap, they could run at 35 or 40, but they just cap it at 30 fps. That means they have 33.33 ms between frames, but they'll often be finished rendering in 22-28.5 ms.
Sure, although I don't think I ever claimed it was a fundamental hardware constraint. That would be entirely wrong, as the hardware had no interlocks as far as I am aware of - you could have it display a partially drawn tile if you hit the pointer-flip at the right time. Practically, I gave two examples in my previous response in which there would be no extra frame of delay, and mentioned the limitations of doing so. Typically, software running on a TBDR does introduce a second frame of delay in how the software controls the GPU though, for those very reasons. It's also a lot simpler to write the game code to account for that delay than to dynamically adjust to it. Note that even with IMR, you don't need to have any frame delay either; you could just render directly to the display buffer (the PS2 did this in some cases), in which case you would be submitting triangles to the frame as it was being rendered. The Nintendo DS was actually notorious for this, as that was the only way to draw the triangles (chasing the beam). Regardless, that's arguing semantics more than anything, since it's more complicated to say that the extra frame delay was introduced by software, as a result of the TBDR architecture's hard bin deadline requirement (a requirement not imposed by IMR), but could be overridden in cases where the deadline is more relaxed or visual artifacts were tolerable. Some simplifications need to be made for an architecture video / lecture, as it's not reasonable to list all of the nuances. For example, you could use the PVR2 to compute N-body and fluid simulations instead of drawing triangles, same thing with the PS2's GPU (as a hint, you would do so with blending modes). Drawing 3D graphics is not inherent to the architecture itself, but it's the common / primary use case. So the video should discuss the common case, where the rest is left as an exercise to the viewer. I disagree with your last comment about 60 fps. You could easily write a game that continually just barely hits the 60 fps cap, as the GPU has two limits: visibility and shading. So you could have more than enough room in visibility, but be compute-limited by the shading engine, where a poorly ordered texture cache miss causes you to miss the deadline (this is what happened when drawing 2D sprites). The same thing can happen in modern 3D GPUs, but it didn't usually occur in the older 3D ones like the Voodoo, since the rasterizer was tied to the shading pipeline.
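To put rough numbers on the back-and-forth above, here is a minimal sketch of the serialized-versus-overlapped frame timing argument. The millisecond figures are placeholders chosen for illustration, not measurements of the Dreamcast or any other GPU, and the model ignores display scan-out.

```python
# Back-of-the-envelope model of the timing argument above: a TBDR must finish
# binning every triangle before rendering tiles, while an IMR can overlap
# submission with rendering. Millisecond figures are made-up placeholders.

FRAME_MS = 16.6  # 60 fps deadline

def tbdr_frame_time(submit_bin_ms, render_ms):
    """Stages run back to back: Submit/Transform/Bin, then Render."""
    return submit_bin_ms + render_ms

def imr_frame_time(submit_ms, render_ms):
    """Rendering overlaps submission, so the frame time is roughly the longer
    of the two (the tail of the last triangle is ignored)."""
    return max(submit_ms, render_ms)

for submit, render in [(6.0, 8.0), (10.0, 9.0)]:
    t = tbdr_frame_time(submit, render)
    i = imr_frame_time(submit, render)
    print(f"submit={submit}ms render={render}ms -> "
          f"TBDR {t:.1f}ms ({'hits' if t <= FRAME_MS else 'misses'} 60fps), "
          f"IMR {i:.1f}ms ({'hits' if i <= FRAME_MS else 'misses'} 60fps)")

# When the serialized TBDR stages no longer fit in 16.6 ms, the practical fix
# is to pipeline frames, which is where the "extra frame of latency" comes in.
```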
Best hardware channel on RUclips, thanks for the info, my friend!
I cannot overstate how much I like your videos. I think I've watched this series 3 times already. I hope you make more videos like this.
Typo: should be ...+ A[0,3]*B[3,0]... at 1:32
Thanks for pointing that out!
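For readers following along, the corrected term is consistent with the standard definition of a matrix-product entry. Writing out C[0,0] for a 4x4 example (0-based indices, which I am assuming matches the video's notation):

```latex
% One entry of C = AB for 4x4 matrices, 0-based indices; the corrected term
% A[0,3] * B[3,0] is the k = 3 term of C[0,0].
C_{0,0} = \sum_{k=0}^{3} A_{0,k} B_{k,0}
        = A_{0,0}B_{0,0} + A_{0,1}B_{1,0} + A_{0,2}B_{2,0} + A_{0,3}B_{3,0}
```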
damn
I have got a general question (as I couldn't find the answer anywhere). Would this overclocking method known from software emulators, which does not break video and audio speed, be possible on FPGAs? Quotation: "For many years, NES emulators had a method of overclocking where additional scanlines are added for the CPU for each video frame," byuu tells Ars Technica. "Which is to say, only the CPU runs on its own for a bit of extra time after each video frame is rendered, but the video and audio do not do so... This new(ish) overclocking method gives games more processing time without also speeding up the video and audio rates... and so the normal game pace of 60fps is maintained."
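For a sense of scale, here is a rough sketch of what the quoted trick buys, using the usual NTSC NES timing figures (1.789773 MHz CPU, 341 PPU dots per scanline at 3 PPU dots per CPU cycle, 262 scanlines per frame). The function name and the specific scanline counts are just for illustration; real emulators differ in how they schedule the extra CPU time, and whether an FPGA core could do the same is exactly the question being asked.

```python
# Rough sketch of the emulator trick described above: after the normal
# 262-scanline frame, the CPU alone is stepped for some extra scanlines'
# worth of cycles, so games get more CPU time per frame while video and
# audio still run at the original ~60 Hz.

CPU_HZ = 1_789_773                   # NTSC 2A03 clock
CPU_CYCLES_PER_SCANLINE = 341 / 3    # 341 PPU dots per scanline, PPU = 3x CPU
SCANLINES_PER_FRAME = 262

def frame_budget(extra_scanlines):
    """CPU cycles available per frame when the CPU runs for extra
    (video-invisible) scanlines after each rendered frame."""
    normal = SCANLINES_PER_FRAME * CPU_CYCLES_PER_SCANLINE
    extra = extra_scanlines * CPU_CYCLES_PER_SCANLINE
    return normal + extra, (normal + extra) / normal

for extra in (0, 64, 262):
    cycles, speedup = frame_budget(extra)
    print(f"+{extra:3d} scanlines: {cycles:8.0f} CPU cycles/frame "
          f"({speedup:.2f}x), frame rate unchanged at ~60 fps")
```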
I was initially astonished that there were no synthesis results for most of the video, unlike in previous ones. And even more so by the results: usually it was a struggle to approach ~300 MHz, and here even the Altera chip was decent.
🤔limits?
Perhaps the limits are more relative? You can always sacrifice speed - if you can fit a RISCV, you could run a software emulator.
@@RTLEngineering that is a most excellent observation. 😌
Do you know what the tradeoffs are between a CSA and a Wallace / Dadda multiplier?