It is lower power because there is a lower count of transistors AND there are fewer switching transitions per productive computation. Initially, at least. Then yes, the trend to lower voltages and the physical layout of transistors. Still, those initial design choices count. Also, switching to Thumb mode is a way to power down extra circuitry in the chip. Power is burned when a transistor transitions.
54:23 At minute 54, orthogonal memory access still hasn't been mentioned. 1:01:09 Besides the decoders, the orthogonal memory access modes in x86 are why it needs more transistors to implement.
What I take from this is that x86 comes from a very old place where instructions didn't take more than 2 bytes, but as time went by, the need for bigger instructions led to a solution designed for backward compatibility, which made instructions take more clock cycles just to figure out what you're trying to do. ARM, on the other hand, decided (probably due to experience) to keep a fixed size for instructions, large enough that they thought it would be sufficient, thus making them all take the same time to decode, which I would assume is 1 clock cycle. The other thing I take from this is that there's not a big necessity for better CPUs, and the companies are relying on programmers wasting resources so they'll need better products due to that inefficiency, so they can keep the marketing going, which is... concerning.
From my own understanding, the original ARM processor and its design were originally motivated by the ability to engineer it with a small team of talented people as opposed to a large team. This was one of the major pushes for the RISC design compared to Intel's CISC design at the time. Once they reached that stage of development, the project leader then pushed the engineers to reduce the heat output from all of the individual parts of the underlying circuitry and logic. The project leader didn't want any additional cooling components. He wanted it to be manufactured in small, cheap, simple plastic packages without the need for any kind of heat sink. He wanted the cost to be about $0.04-0.6 per chip as opposed to $20.00 per chip. That was also a huge influence on the original design. The engineers then had to go and measure the voltages, amps, and watts for every single path and connected component within the chip. This was a huge task and ended up being an engineering feat in its own right. This is what I know about the history of ARM from the BBC days, which AFAIK did originally use the 6502 as opposed to the Motorola, the Z80, or the 8086 of the early 80s.
Well... x86 has around 1600 instructions, ARM around 150, and RISC-V (GC) around 40... but that's not the sole deciding thing. On RISC-V the instructions are no longer human readable (if that's even possible) in their hexadecimal form, and are optimized so the instruction decoding logic is as simple as it can get. So if we compare those, let's compare comparable things. But other than that detail, fantastic video and great knowledge shared by Casey! Thank you very much!
I'm indeed very lucky to have been born with this awesome family name :) Thanks for the shout out! 😊
What gave you the inspiration to create this tool? And are you a one man team?
@JayDee-b5u it's a story I've talked about before, basically trading software speed. I explain it in the Microarch Club podcast at some point. I'm luckily not a one man team, there's a small group of volunteers who help.
Actual GOAT right here. Love you mate!
Thank you for that tool. In the modern world distance between low-level and high level grow fast and what you did is just amazing.
Hour and a half with Casey? YES!
You sound like an anime girl and I'm all for it 👍
I love Casey: well spoken, knowledgeable, easy to follow even for a non-native English speaker (edit: I am not a native English speaker, sorry for the confusion). Technical enough yet relatively easy to understand
Smart yet humble, good combo and makes for good teachers
@@pablomelana-dayton9221 he's not very humble
yeah i like him for another reason too
@@mattmurphy7030 he actually is.
??? Hearing this guy has been the most infuriating experience this week. He just CAN'T get to the point, holy... he kept rambling; I'm at minute 21 of the video and he STILL hasn't gotten to the point he started making at minute 3. He reminds me of the boomer engineers I work with who just ramble and complain and never get anything done.
I could listen to Casey talk for DAYS and not be bored
DA : Once you know the stuff, you will get bored. It's like a machine on repeat.
@@RealGrandFail I feel like once you know the stuff, the joy comes from teaching others!
@@grimm_gen totally agree 💯
mollyrocket is his youtube handle (his wife writes children's novels, I think, if I remember the lore correctly?), he has several amazing vids on there!
Casey is the best. He's forgotten more than I know. And I'm just a bit behind him staring down reaching 30 years as a Software Engineer.
I am in awe of how verbally articulate he is over such a wide range of knowledge, in depth. Both wide and deep knowledge + articulate is a very rare gift and puts you at the top of the top in Engineering.
I've had the good fortune of working directly with several "Distinguished Engineers" over my career, and Casey has all of the same qualities.
Humble, incredibly articulate to a very detailed level across a wide range of subjects, doesn't talk in absolutes, knows to mention some of the tradeoffs, and knows when he's getting into areas where he might lean on someone else for specific expertise.
They are the best people to work with and know how to work with people at different levels without being patronizing or making you feel imposter syndrome.
Casey is definitely in that class of Engineering, and it's always a treat how well he and Prime work together despite coming from very different backgrounds.
Well done as always, gentlemen! I learned so much from this video that I had to come back and edit my original comment to add much more.
Casey is my favorite of your guests. Always love when he's on
Wanted to put this on as background; turned out I can sit on my toilet for a whole 1.5 hrs just listening to this.
Very informative! Thank you Primeagen and Casey!
hope your legs recovered and you're not wheelchair-bound
Casey is such a great guest! I always learn so much when I watch these videos
I remember back when I was in high school trying to get into game dev, I found Casey's GJK video. Reading the paper was way over my head with academic language and math symbols - but his walkthrough helped me implement it and EPA. It really helped me see that stuff that seemed untouchable (papers, cryptic code, abstract code) was understandable if you broke it down, took it step by step, and tried to visualize it.
I wish I had more teachers like him back in school, or more material like his available back then. Kids these days are really lucky to have content like this available almost effortlessly
It's both a blessing and a curse. Great learning materials are out there and readily available if you know where to look, but knowing where to look is the hard part, with low quality or outright hostile content often winning at SEO and pushing down the gems.
The issue of junk search results is only growing, hopefully soon we get hypergoogle.
As an embedded engineer, this was so great to listen to. It's hard to find good content in the embedded domain.
i thought that, even recently, it was common to just write in assembly for embedded work, and that assembly is still specifically pretty common there; Casey said 'only in specific domains of embedded'
That was FANTASTIC!!! Pretty nostalgic too. I was lucky enough to build my 286, 386, & 486 computers back in the day when they came out. If they kept that naming convention, I wonder if the latest computer would be a 10086 or 20086 by now... I totally had an assembly course in college. It's good to know "nobody" writes that stuff nowadays. If you still do, then consider yourself nobody.
You'd be happy to know that there is a new Intel 285 chip coming out soon! The Core Ultra 9 285 has 24 cores and is among the highest tier of the upcoming Arrow Lake chips.
@@r.k.vignesh7832 Yikes! I almost thought they went backwards. 286 was short for the 80286 processor... looks like the 285 is short for 285K (285,000). Not sure if those numbers are a true apples to apples comparison but at least they are headed in the right direction. 😅
@@Angel-Fish The K is used to distinguish chips w/ unlocked multipliers from the standard ones. There will also be a 285 non-K. This would have been the Core i9 15900(K) with last year's naming scheme, but they changed it for some reason. Probably to confuse us even more.
I love the way Casey explains stuff. I learned so much just from his preamble.
It’s time to reboot the “Jeff and Casey” show with the new “Prime and Casey” show.
I would love to see Jeff interact with Prime too. And throw in Jon Blow there too.
Casey just seems like such a wonderful human being.
People seem to forget that both Intel and AMD already had RISC CPUs in the early '90s. One of Sega's most popular arcade games used the Intel i960 (Sega Rally yeaaaahhhh)
True. I still have a i860 in my NeXTcube. At some point, Intel has also made an ARM CPU: the XScale.
@@MartialBoniou oh, a NeXTcube :O I love that design. Whenever I drew a computer I made it look like a NeXTcube :) . I had forgotten about the XScale actually lol :)
OMG we only have the i9 today and there was already an i960 in the '90s
So the processors that fetch multiple instructions in one cycle are called superscalar. They can be either in-order execution or out-of-order execution. When out of order, instructions undergo register renaming (using a map and a free list of physical registers) to resolve dependencies (other than true dependencies), and get dispatched into a buffer (the Register Update Unit) where they wait until their operands are ready. A group of instructions gets picked from this RUU and executed once all their dependencies are resolved. Then there is an in-order commit for the instruction at the head of the RUU. So we get in-order dispatch, out-of-order execution, and in-order commits.
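A rough C sketch of the renaming step described above (toy sizes, no RUU/reorder-buffer bookkeeping, no handling for an empty free list - just the map plus free list idea):

```c
#include <stdio.h>

#define ARCH_REGS 16
#define PHYS_REGS 64

static int rename_map[ARCH_REGS];          /* arch reg -> current phys reg */
static int free_list[PHYS_REGS], free_top; /* stack of unused phys regs    */

static void rename_init(void) {
    for (int a = 0; a < ARCH_REGS; a++) rename_map[a] = a; /* identity start */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--) free_list[free_top++] = p;
}

/* Rename one instruction "dst = src1 op src2". Sources read the current
   mapping (so they point at whichever instruction last wrote them); the
   destination gets a fresh physical register, which removes WAW/WAR
   hazards and leaves only the true (RAW) dependency. */
static void rename_insn(int dst, int src1, int src2) {
    int p1 = rename_map[src1];
    int p2 = rename_map[src2];
    int pd = free_list[--free_top];        /* allocate a new physical reg */
    rename_map[dst] = pd;
    printf("arch r%d = r%d op r%d  ->  p%d = p%d op p%d\n",
           dst, src1, src2, pd, p1, p2);
}

int main(void) {
    rename_init();
    rename_insn(1, 2, 3);   /* r1 = r2 op r3 */
    rename_insn(1, 1, 4);   /* r1 = r1 op r4 (reads the p-reg just written) */
    return 0;
}
```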
Casey’s performance aware programming course is so rad, this dude rules
Casey is better than wikipedia
No doubt
Most things are
@@grendel6o nah, wikipedia is way better than e.g. most social media, including youtube comments. wikipedia is also way better than many youtube videos, especially when it comes to stuff like accuracy
@@asdfghyter Get a degree in aerospace engineering and try to use Wikipedia for anything related.
@@grendel6o try to use social media for that :P
anyways, i have found the wikipedia articles for the scientific topics i have studied helpful, though not very pedagogical. i have no idea about the situation for aerospace engineering, but in general it's an encyclopedia, not a textbook on advanced topics. if you're looking at wikipedia for course material for any graduate level courses, you're using it wrong
This was a great talk from Casey, especially off the top of his head. There is one thing I would like to add about "the ARM ISA": there is not only one, but a bunch of them. The most important ones are Cortex-A, -M, and -R. Their main difference is how you attack performance requirements from a (discrete) math point of view.
Cortex-A is the general compute approach. They are designed to run an OS and are used as CPUs in phones, mobile devices, or AI clusters. Their goal is pure compute power, even at the cost of determinism or safety, with things like branch prediction, chunkwise caching, etc.
Cortex-R is for realtime applications like the ABS/ESP in a car, a flight controller, or the primary control of a power/production plant. They are designed to guarantee a computation within a certain timeframe, provide redundancy, private memory for certain things, etc.
Cortex-M is for microcontrollers. In very broad terms they are a hybrid of R and A. They can map a few realtime features while still doing some general compute when necessary. They are a great choice for a car door with the window control and a few buttons.
Intel used to have different sets with x8150, x82 etc. but the portfolio narrowed down to what is known as x86 today, while ARM diversified from the original ARMv1 / ARMv2 chip. They are also roughly the same age; they just grew in different industries.
Ian Cutress did an interview with Jim Keller and has a clip that would make a great supplement to this titled "Jim Keller: Arm vs x86 vs RISC-V - Does it Matter?".
One big extra power burn in x86-64 devices is that the platform is desktop and laptop with expandable RAM. You need more voltage to drive big RAM sticks that sit further away, whereas ARM has always been in embedded with soldered-down RAM. Intel just demonstrated with Lunar Lake chips, with RAM soldered on the laminate, that saving the memory controller voltage puts them a LOT closer to Apple Silicon in terms of performance per watt. You could bucket a big thing like RAM config in Casey's business explanation. REALLY good explanation from Casey!
ARM was developed as a desktop CPU though, and that's where it started. On the desktop.
@@-_James_- thanks for the correction. It wasn't until 1992 that the Apple Newton was a mobile device with an ARM CPU in it.
To be fair, mobile Atom CPUs used in cellphones of the era were using embedded DRAM too.
First there are cores developed by ARM UK and GPUs developed by ARM Norway, then there are third party designs, by Qualcomm and Apple.
@@Loanshark753 Intel had some ARM designs for a while too after they acquired them from DEC.
This was a great one. I spent thousands of hours programming the 6502, M68000, and M68020 back in the ’80s and ’90s. It was a lot of fun, but nowadays I’m quite happy to be coding in higher-level languages, especially my favourite - Clojure. Still, I sometimes miss the days of programming in Assembly and C. There was something special about having complete control over everything running on the machine.
Yep, past few years, been filling in and expanding knowledge and capability in assembly, for fun
Assembly is still pretty fun, it's just a lot of instructions to keep track of. I've messed around with doing a basic X11 hello world and it was almost 1000 lines
Same for me... 6502 and 68000. I still prefer lower level coding. Most of my work is with legacy C code and C++
I had no idea Godbolt was named after a Mr. Godbolt!!!! He just took the #1 spot on the "best surnames of all time" list from my friend Mr. Goldhammer
0:58 Prime being hilarious while ruffling a lot of feathers completely by accident.
The RISC-V guys really don't like being called CISC even though it essentially turns into one the moment you include any of the common high-perf extensions.
I think there's not really a solid boundary between RISC and CISC, but I reckon RISC-V at least does it well by splitting the entire ISA into extensions which each have an individual purpose, as opposed to having extensions hacked on with new versions or whatever. I believe the beauty of RISC-V is that you can create tailored chips for a specific application. For example, you might slap on a bunch of vector and parallelisation extensions but leave out stuff like atomics to get a low-power, efficient GPU (ofc the technology isn't really developed to that point, but that's the theory anyway). So RISC-V is really good for specialised chips as opposed to desktop CPUs, which are pretty much always going to devolve into CISC anyway at some point
Finally got time to sit and watch this. I absolutely love these chats with Casey, I always learn so much. He is an amazing teacher and I'm glad there are people out there like him. I'm so glad Prime has him on and that Casey wants to be on as well. Can't wait for the next lesson.
Great talk! A good follow-up topic might be the memory model differences because (1) it's one of the major differences an actual programmer might hit when porting code from x86 to ARM, and (2) I would imagine it has power consumption implications since x86 chips are required to do more possibly useless work to keep caches coherent.
Casey has literally flipped my approach to web performance on its head. Love it!
"I can't believe we're doing all of this just to run JavaScript"
lmao
That intro was the most thoroughly agreed-upon and understandable statement I've heard in a long time lol.. and I've only seen one appearance of this guy.. It is kind of amazing that the real bar for entry, when it comes to being able to distribute this kind of knowledge, is simply having somebody that also knows how to talk to people, or at least wants to try lol.. That's usually where it falls apart.
Used the BBC micro B at school.... It was the business.... The RISC based Archimedes was on the horizon and it was truly from another universe 😊. It was so far ahead it was indescribable in the late 80s... It was a jump from 8 bit to 32... That's pretty massive.... Price tag to match.....
I jumped on getting an MC68K Mac when it came out thinking this was leading edge in both software and choice of CPU (for personal computer market) - little did I know what our cousins across the pond were cooking up (and, of course, Apple today has landed on using the CPU family that they created back then)
Casey and Prime is such a great content combo, i can't describe how much i love that
My favorite episodes literally the ones that include Casey! Thank you both!
I've been following Casey since he started Handmade Hero and I love the dynamic between you two.
Another Casey video, this is just what I needed to make my day.
i am 30 minutes in and i think i can listen to casey 10 hours. 👍🏽
Casey is so powerful Flip actually zoomed in when he said it.
The amount of preamble here was v precisely calibrated - I’ve never looked at assembly at all, but followed every point made, expertly done!!!
Love The Primeagen’s priorities on display! ❤
I didn't think much about ARM until I had to program data transfer using DMA. The ARM DMA subsystem is a marvel to behold, a fine piece of art.
About the ARM chip using 0 power: if memory serves, the anecdote is that the input power of the clock signal for the display was enough to power the rest of the chip
That's how I remember it. Or was it current on the data pins? Something like that. Not electric fields though, never heard of that. And doesn't really make sense, either. :D
@@ControversialOpinion input signals in general most likely yeah, might have a variation of which input depending on where you heard it from haha
It was voltage leakage from the support chips that provided enough power for the first ARM samples to run without any dedicated power supply of their own.
Power consumption is a byproduct of the electronics design (transistor architecture) and NOT ANY firmware or software characteristics. That's why the first ARM chip just happened to be able to operate using stray electric currents from peripheral components on the PCB. That wasn't on purpose but something that was discovered by accident. Well, that sort of discovery now becomes a desired "feature" to pursue on purpose and here we are.
That is true. However energy = power x time. So if a process takes longer to execute it can consume more energy even at a lower power consumption. So for a particular application a lower power device is not guaranteed to be more energy efficient.
@@SimonAyers i heard somewhere that sending a message on facebook is like having a 40-60W lightbulb turned on for 3 hours.
claiming that languages and the code we write have no effect is complete bs....
what consumes more power, a loop which runs in C or a loop which runs in Python :DD
"not any firmware or software"
myths like this are probably why software sucks these days, everything getting slower and slower...
well, software can be written inefficiently
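A quick back-of-the-envelope on the energy = power × time point a few replies up (numbers invented purely for illustration): a lower-power device that takes longer can still cost more energy.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical devices: "fast" draws more power but finishes sooner. */
    double fast_watts = 5.0, fast_seconds = 10.0;  /* 5 W * 10 s = 50 J */
    double slow_watts = 2.0, slow_seconds = 30.0;  /* 2 W * 30 s = 60 J */
    printf("fast: %.0f J\n", fast_watts * fast_seconds);
    printf("slow: %.0f J\n", slow_watts * slow_seconds);
    return 0;
}
```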
You guys really need to just start a podcast. The chemistry is great, Casey is a blackhole of knowledge and Prime keeps the mood lighthearted and fun.
idk, I'd rather he elaborate on the embedded systems use of assembly; seems to be less common now but pretty common
For the variable-length instruction decoding on Intel, the CPU doesn't necessarily need to decode what the compiler generated; it can theoretically decode something else.
The CPU executes what is in the instruction cache, and the move from memory to the instruction cache is slow. In theory you could remove variable-length instructions on the fetch to the instruction cache and give the CPU fixed-length microcode instructions.
That has cons. Intel CPUs are designed to execute legacy x86 instructions, and these are inherently variable length. Converting instructions into fixed-length microcode would require a significant architecture overhaul, impacting compatibility with existing software and instructions. Intel CPUs already have optimizations like the micro-op cache. This cache holds decoded uops for reuse, reducing the need to repeatedly decode instructions from memory. It already achieves a similar goal of reducing decoding overhead by reusing pre-decoded instructions.
> For the variable length instruction decoding on Intel, the CPU doesn’t necessarily need to decode what the compiler generated, it can theoretically decode something else.
No. The incoming instruction stream, regardless of whether it is variable or fixed length, has to be decoded as is.
> The CPU executes what is in instruction cache and the move from memory to instruction cache is slow.
As slow as the memory system can operate at, provided that software does not interfere by making things worse - which sadly is a common case. Without reuse, caching is not faster than directly running off memory.
> In theory you could remove variable length instructions on the fetch to instruction cache and give the CPU fix length microcode instructions.
In practice this is what various platforms did and continue to do in various forms for several decades. What gets fed into the core from the instruction stream perspective is very different to what is actually being acted upon internally.
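A toy illustration of the decode point being argued in this thread, under a made-up encoding (nothing like real x86 length decoding): with variable-length instructions, finding where instruction N+1 starts depends on the length of instruction N, so the boundary scan is serial, while a fixed 4-byte format lets each decoder slot be pointed blindly at offset 4*N.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: the length (1..4 bytes) is derived from the first byte. */
size_t insn_length(uint8_t first_byte) {
    return 1 + (first_byte & 0x03);
}

/* Variable length: each boundary depends on the previous one (serial). */
size_t find_boundaries_variable(const uint8_t *code, size_t n,
                                size_t *starts, size_t max) {
    size_t count = 0, off = 0;
    while (off < n && count < max) {
        starts[count++] = off;
        off += insn_length(code[off]);
    }
    return count;
}

/* Fixed 4-byte length: every boundary is independent (trivially parallel). */
size_t find_boundaries_fixed(size_t n, size_t *starts, size_t max) {
    size_t count = 0;
    for (size_t off = 0; off + 4 <= n && count < max; off += 4)
        starts[count++] = off;
    return count;
}
```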
As someone that did some ARM assembly writing for learning and such, this was really cool to listen to.
Thank you Casey, it's always a treat to learn from you.
i 💜 Casey Muratori's deep dives
Thank you for going slowly to make sure that you don't leave anyone behind, Casey! Thank you!
Love to see Casey, please come on more often!
56:00 There's a great 3-parter video interview with Sophie Wilson on channel "Charbax".
If I remember correctly, she talks about the low power ARM stuff in one of those.
ANOTHER CASEY VIDEO!!! ❤🎉
Thank you for introducing the Godbolt compiler explorer for those of us that didn't know about it. Having done some x86, PIC and other chip assembly programming in school long, long ago (that I hardly remember), this is a great primer for demystifying low-level instructions. There is a small hang-up I'd love to get his take on for clarity: I seem to recall that x86 had a much, much larger instruction set, with machine instructions that would take 10-20 cycles to execute, while the more basic (Motorola etc.) chips did not; the more basic chips used, AFAIR, only the accumulator to perform operations (with few exceptions), while x86 allowed a subset of instructions to perform operations entirely within CPU registers without touching the accumulator value. Even ops like addition to direct memory locations were possible (beyond the CPU registers), whereas basic chips would have to move those values from memory to registers, perform the add op, and then move the result back from the accumulator to the original memory location.
All this to say: the idle power draw of the extra transistors that x86 needs to perform ops on so many working registers was significantly higher, and as a result the x86 arch was not as power efficient over the long periods where it doesn't use those extra functions. Is that still the case, or is the ARM arch now as "bloated" as x86, with a similar transistor count in the same ballpark order of magnitude?
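Since the comment above mentions Godbolt and memory-destination operations, here is a tiny function worth pasting into Compiler Explorer to see that particular difference yourself. The expected output in the comment is a guess at typical -O2 codegen, not a guarantee; exact results depend on compiler and flags.

```c
/* On x86-64 this usually compiles (-O2) to a single read-modify-write
   instruction with a memory operand, e.g. "add dword ptr [rdi], 1".
   On AArch64, a load-store architecture, you should instead see a
   ldr / add / str sequence. */
void bump(int *counter) {
    *counter += 1;
}
```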
Lmao prime bailing to deal w the kid is brilliant. Love it
You cannot understand assembly without understanding the Von Neumann architecture, and most programmers don't go into the detail of how VNA works. That's the big benefit of understanding that learning assembly gives you: a fundamental understanding of how the processor is actually processing. Once that clicks, working with data in registers at an address level through pointers becomes the most natural feeling in the world.
Man. This guy is so good at explaining things that even someone such as myself that doesn't code can understand.
I loved this talk, I learnt a ton, and helps understand everything so much better
Casey my man, still looking great bro. This reminds to go back to Handmade Hero. Thanks Prime for this video man; you made this possible.
this is absolutely fantastic! very informative!
Low level programming but in simple language.
What a treat!
❤❤❤
I wouldn't say LLL talks in an overly complicated way
With power, he touches on the fact that there isn't much of a difference between the ISAs, which I think is true. To expand on that: ARM was built around an SoC design with only the requirements for the device, while x86 was designed for a generalized desktop PC, so the chips also included PCI connections and buses for expansion. Each PCI connection requires die space and consumes energy.
As someone from more of a data science/machine learning background, I always have no idea where Casey is going, but I always love to come along on the adventure and I always learn something new -- pulling up the web tool and following and playing along really helps with this video!
Casey's channel is "molly rocket" btw, it always escapes my brain and then I remember -- in case you are looking for it u.u
Casey is the GOAT. I can't get enough
In the early 80s, if you had a Commodore 20/64 8-bit with a MOS6510 and your programs had to run full speed, there was nothing but assembler.
I bought "Creating Arcade Games on the Commodore 64," and I think I also bought a machine language book, too. Sadly, I didn't get very far with either book. But, I remember the excitement I had finding out that books like that existed, because I really wanted to program games. Too bad I didn't have the skills that others did.
@@michaelday341 Basic was better than nothing.
Exactly this got me into assembler on the C64. Pure performance poverty 😂 Not even a compiler. Just writing code directly in my Power Cartridge monitor.
x86 is like utf8 and ARM is like utf16
About the ARM no-power anecdote: there is an interview with one of the engineers who worked on the first ARM chip at Acorn (ARM used to be Acorn RISC Machine) in which he explains that when they first tested the first prototype of the chip, they measured 0 mA of current going into the power rails. They soon realized it was because the power rails were disconnected, but the chip was working anyway because current was flowing in through other pins on the package. It doesn't mean the chip used virtually no power, only that it used so little that the input signals and capacitors alone were enough for it to work, without the power rail connected to anything.
i'm sure this won't get seen, but around 1:17:39 when discussing instructions getting added bc they are used commonly, etc.: is it possible an adaptive ISA could solve these issues? i guess this is something like the instruction cache that was mentioned, but i'm imagining something that would keep commonly-cached instructions around and somehow build them into the instruction set as it's being used... is this possible? impractical? am i misunderstanding something?
Godbolt sounds like a man who is blazingly fast!
Casey is right that it's not the ISA that's mostly affecting efficiency. Intel Lunar Lake is an example of how x86 can match or even beat ARM in terms of low power - while keeping backwards compatibility.
Intel and AMD just needed to prioritize low power and Apple + Qualcomm finally gave them a real reason to.
Lunar Lake has similar performance, heat, and battery runtime numbers to M3 and Snapdragon. See Just Josh's Lunar Lake video for more about this.
However ARM is better since it's more open and more competition is happening there to get the best performance per watt.
1:17:08 I remember when Intel invented new instructions specifically for XML parsing. I would not be surprised if we see JSON parsing instructions in next i9 or something.
EDIT: I exaggerated quite a bit: SSE4.2 text processing instructions are general purpose, not intended for XML processing only.
Seriously? Tried to google it to found what instructions do this but found nothing. Do you have sources?
@@poteitogamerbr2927 SSE4.2 text processing instructions: PCMPESTRI, PCMPESTRM, PCMPISTRI and PCMPISTRM. I guess when they were introduced, XML was the new hotness, and these were marketed accordingly. Looks like they actually are general purpose and can be used for JSON processing too.
@@KvapuJanjalia Those are really just for string searching. You can use them to implement for example strpbrk. And they have a variant for null-terminated strings.
@@KvapuJanjalia thanks, it seems very cool. I wonder if compilers like gcc actually optimize say C code into those instructions since they are very specific or you must call them directly.
@@poteitogamerbr2927 that might depend on a couple of things.
As far as I understand, if it's a fairly widely supported instruction then your compiled binary may contain it with a fallback for a chip that doesn't support it.
If it's quite specific you might need to let the compiler know through flags to include it.
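For the strpbrk idea mentioned a couple of replies up, here is a rough sketch of calling those instructions directly through the SSE4.2 intrinsics rather than waiting for the compiler to pick them. It assumes NUL-terminated strings, an accept set of at most 16 bytes, and that over-reading both strings in 16-byte chunks is safe (real library code aligns its loads so they never cross into an unmapped page); sse42_strpbrk is just an illustrative name.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics; build with -msse4.2 */
#include <stddef.h>

/* Comparison mode: unsigned bytes, "equal any" (character-set match),
   report the lowest matching index. Kept as a macro so the immediate
   stays a compile-time constant, as the intrinsics require. */
#define SPAN_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

const char *sse42_strpbrk(const char *s, const char *accept) {
    /* Load the accept set; with the implicit-length (PCMPISTRI) form,
       bytes after its NUL are treated as invalid and never match. */
    const __m128i set = _mm_loadu_si128((const __m128i *)accept);
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)s);
        /* Index of the first valid byte of `chunk` equal to ANY set byte. */
        int idx = _mm_cmpistri(set, chunk, SPAN_MODE);
        /* Nonzero when `chunk` contains the string's terminating NUL. */
        int end = _mm_cmpistrz(set, chunk, SPAN_MODE);
        if (idx < 16) return s + idx;   /* match within the valid bytes */
        if (end) return NULL;           /* hit end of string, no match  */
        s += 16;
    }
}
```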
How great can a podcast be! Thx
I adore the Casey streams and the rabbitholes²
Lot of knowledge and history here! Sounds like ARM instructions are a better design, I'll keep it in mind
56:00 A guy in a documentary I saw said they forgot to connect the Vcc rail, but the first Acorn RISC Machine chip was able to run on currents passing through pull-up resistors (stuff that stabilizes bus state).
Ex-system architect here. Instruction sets are not the issue, it's the way it's architected. As an ex-BIOS engineer who worked on APM and ACPI and later specialized in power management on ARM devices, it's just a night-and-day difference in how the two architectures approach designs.
One example of why instructions don't matter: when I was a BIOS engineer, I worked in x86 asm. When I worked on ARM, I mostly used C/C++. Only rarely did I have to use JTAG and debug in asm, and that's almost never the issue.
On the power implementation approach, in x86 it's almost an afterthought. The ARM platforms I worked on literally think of every possible way to try to improve power in every iteration.
Great to come across someone who’s really familiar with it. For Intel - WHY is it an afterthought? Don’t they have as much to gain from the same?
But by the original notion - isn’t it expensive to run all this fancy decode outside the core when modern compilers just aren’t using the breadth of x86? Surely that’s a whole bunch of transistors ARM just doesn’t need to contend with?
@@Freshbott2 I didn't work for Intel, but I suspect it's purely due to politics. They had an ARM license back in the day when they did the PXA270, and they knew how it works. The fact that they sold it off and didn't apply much of it to their architecture (at least from an external POV) suggests they just didn't care for it enough. I'd assume they were making so much money on the server side that they just didn't care about the ARM threat.
On fancy decode, it's not that expensive to run outside (just think how mobile works). It's also not that complex to add these to compilers (maybe harder back in the day if they had to add it to gcc). Or the ISA can have prefetch hints so it sort of knows this is a certain kind of workload that needs to be offloaded to the correct component/core.
Again, just think how mobile works. It has all the features of a PC in an SoC.
But can the layout of the individual instructions themselves be tailored to an optimization in the CPU architecture itself? For instance, I was looking at the layout of some of the RISC-V instructions, and the size and bit layout of the operands, destinations, and the instruction itself seemed kind of random in certain cases. I can't think of one off the top of my head, but I remember reading that the bit layout was chosen because it was somehow beneficial for the physical configuration of the CPU components. With this in mind, are the instruction set and the architecture kind of coupled in terms of performance or power efficiency?
@@lyingcat9022 There are lots of factors behind layout. If you compete in Asia, like MediaTek, you could potentially use that as a competitive advantage; the trade-off is the hours of hard work from your engineers. I'd say these days we're fairly modularized, in the sense that the subsystems are large enough and want less interference from one another, so layout-for-performance probably isn't as high a priority as other factors like thermals... but that's just my observation.
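On the bit-layout part of the question: one thing I'm fairly sure of is that RISC-V keeps the register fields (rd/rs1/rs2) at the same bit positions in every format that has them, which is partly why the immediate bits look scrambled, and it means the decoder can pull the register numbers out before it even knows which format it's looking at. Rough sketch of mine (the add encoding is just an illustration):

/* Sketch: RV32 register fields live at fixed bit positions, so extraction
 * is format-independent. */
#include <stdint.h>
#include <stdio.h>

static uint32_t rd (uint32_t insn) { return (insn >>  7) & 0x1f; }  /* bits 11:7  */
static uint32_t rs1(uint32_t insn) { return (insn >> 15) & 0x1f; }  /* bits 19:15 */
static uint32_t rs2(uint32_t insn) { return (insn >> 20) & 0x1f; }  /* bits 24:20 */

int main(void) {
    uint32_t insn = 0x002081b3;   /* R-type: add x3, x1, x2 */
    printf("rd=%u rs1=%u rs2=%u\n", rd(insn), rs1(insn), rs2(insn));  /* 3 1 2 */
    return 0;
}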
Very nice presentation with an excellent guest appearance. Yet as much as you guys did cover within this, it's still only the tip of the iceberg.
I can’t shake the feeling that this discussion becomes second guessing after some 40 mins. It’d be good to invite Jim Keller on the show.
If you get Casey on again for a similar topic, I think reading through and discussing David Chisnall's article "There's No Such Thing as a General-Purpose Processor: And the belief in such a device is harmful" would be interesting -- he goes into things like the energy impact of complex decoding machinery.
You should have someone on to talk about the difference in memory models (x86 strong, arm/riscv weak).
Also worth touching on how the C11 memory model's adoption has made far more software compatible with weak memory models.
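In case an example helps, here's a minimal sketch (mine, not from the video) of what the C11 model buys you: you state the ordering you need with release/acquire, and the compiler only emits barriers where the target's memory model (strong on x86, weak on ARM/RISC-V) actually requires them. It assumes C11 <threads.h>, which not every libc ships:

/* Sketch: publishing a value from one thread to another with C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>
#include <threads.h>
#include <stdio.h>

static int payload;                 /* plain data handed from producer to consumer */
static atomic_bool ready = false;   /* flag that publishes the payload             */

int producer(void *arg) {
    (void)arg;
    payload = 42;
    /* release: everything written before this is visible to an acquire load
     * that observes `true`; the compiler adds fences only if the target needs them */
    atomic_store_explicit(&ready, true, memory_order_release);
    return 0;
}

int consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                           /* spin until the flag is published */
    printf("%d\n", payload);        /* guaranteed to print 42           */
    return 0;
}

int main(void) {
    thrd_t p, c;
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}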
Love it, the content we need. Thx ❤
Fun fact, the A in ARM originally stood for Acorn, the makers of the BBC Micro... The first ARM chips were literally Acorn asking how they could make a sequel to the BBC Micro [or one of its successors; I'm not British, or a computer historian for that matter] 😊
Love these videos with Casey
i cant get enough of casey talking about computers ❤
I learn so much from this, quality content
"I only look at it occasionally" lol after that knowledge bomb
Thank you, Casey.
i love hearing casey talk about anything
SO glad this is finally up. ARM is on my to 'RUN' list. It's apparently effective at reading Malware. I've been spoiled by Lua, Python, JavaScript and so on.
love Casey! he has a big brain
Love the opening 😂
That was great. I learned a lot - thank you!
Maybe I do not understand all the details, but I think the memory model is way more important in the limit. x86 is much more restrictive in how it can reorder memory accesses (its atomic read-modify-write operations effectively behave like memory_order_seq_cst); in spirit it is a bit like the GIL in Python. ARM is free to do much more reordering, and given how slow memory access is, I can see how this difference could bring a substantial edge in performance.
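For what it's worth, the C11 orderings are also how you opt out of that strictness where you don't need it. A tiny sketch of mine (the hit counter is made up): the relaxed increment promises no ordering at all, which a weakly ordered core is free to exploit, while the seq_cst one has to behave as part of a single global order; on x86 both typically end up as the same locked instruction anyway.

/* Sketch: two ways to bump a shared counter with different ordering guarantees. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_long hits = 0;   /* made-up statistics counter */

void count_hit_relaxed(void) {
    /* atomicity only, no ordering with respect to other memory operations */
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

void count_hit_seq_cst(void) {
    /* participates in the single global order of all seq_cst operations */
    atomic_fetch_add_explicit(&hits, 1, memory_order_seq_cst);
}

int main(void) {
    count_hit_relaxed();
    count_hit_seq_cst();
    printf("%ld\n", atomic_load_explicit(&hits, memory_order_relaxed));  /* 2 */
    return 0;
}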
My favorite comment was "I think you taught me something and I didn't come here to learn" lololol😂
It is lower power because there is a lower transistor count AND there are fewer switching transitions per productive computation. Initially, anyway. Then, yes, the trend toward lower voltages and the physical layout of transistors. Still, those initial design choices count. Also, switching to Thumb mode is a way to power down extra circuitry in the chip. Power is burned when a transistor transitions.
54:23 At minute 54 and orthogonal memory addressing still hasn't been mentioned. 1:01:09 Besides the decoders, the orthogonal memory addressing modes are a reason x86 needs more transistors to implement.
Doesn't the CPU do part of what a compiler did in yesteryear, when it converts complex "assembly" instructions into µops?
Legendary video with a mandatory algorithm boosting comment from me.
Love these discussions
What I take from this is that x86 comes from a very old place where instructions were only a couple of bytes, but as time went by the need for bigger instructions led to a solution built around backwards compatibility, which makes instructions take more work in the decoder just to figure out what you're trying to do. ARM, on the other hand, decided (probably thanks to that experience) to keep a fixed size for instructions, large enough that they figured it would suffice, making them all take the same time to decode, which (I would assume) is 1 clock cycle. There's a toy sketch of the decode difference after this comment.
The other thing I take from this is that there isn't a big need for better CPUs; companies are relying on programmers wasting resources, so that the resulting inefficiency keeps people needing better products and keeps the marketing going, which is... concerning.
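To make the decode-cost point concrete, here's a toy sketch (entirely my own, with a made-up length rule, not real x86 encoding): with a fixed 4-byte width every instruction's start address is known up front, so a wide decoder can work on several at once, while with variable lengths each start depends on the length of the previous instruction.

/* Toy sketch: finding instruction boundaries with fixed vs variable widths. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* made-up length rule for a pretend variable-length ISA (NOT real x86) */
static size_t insn_length(const uint8_t *p) {
    return (p[0] & 1) ? 2 : 4;
}

int main(void) {
    uint8_t code[16] = {0x01, 0x00, 0x02, 0x00, 0x03, 0x00, 0x00, 0x00,
                        0x04, 0x00, 0x00, 0x00, 0x05, 0x00, 0x06, 0x00};

    /* Fixed 4-byte width: every start is known before reading a single byte,
       so several decoders could begin in parallel. */
    for (size_t i = 0; i < 4; i++)
        printf("fixed:    insn %zu starts at byte %zu\n", i, i * 4);

    /* Variable width: each start depends on the previous instruction's length,
       so finding boundaries is inherently serial (unless hardware speculates). */
    size_t off = 0;
    for (size_t i = 0; i < 4 && off < sizeof code; i++) {
        printf("variable: insn %zu starts at byte %zu\n", i, off);
        off += insn_length(&code[off]);
    }
    return 0;
}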
Great stuff! More Casey please :)
From my own understanding, the original ARM processor and its design were motivated by the goal of having it engineered by a small team of talented people rather than a large one. This was one of the major pushes for the RISC design compared to Intel's CISC design of the time. Once they reached that stage of development, the project leader pushed the engineers to reduce the heat output from all of the individual parts of the underlying circuitry and logic. He didn't want any additional cooling components: he wanted it manufactured on small, cheap, simple plastic substrates without the need for any kind of heat sink, and he wanted the cost to be about $0.04-0.60 per chip as opposed to $20.00 per chip. That was also a huge influence on the original design. The engineers then had to go and measure the voltages, amps, and watts for every single path and connected component within the chip. This was a huge task and ended up being an engineering feat in its own right. That is what I know about the history of ARM from the BBC days, which AFAIK originally used the 6502 as opposed to the Motorola, the Z80, or the 8086 of the early '80s.
Well... x86 has around 1600 instructions, ARM around 150, and RISC-V (GC) around 40... but that's not the sole deciding thing. On RISC-V the instruction encodings are no longer human readable (if that's even possible) in their hexadecimal form; they're optimized so the instruction-decoding logic is as simple as it can get. So if we compare those, we should compare comparable things.
But other than that detail, fantastic video and great knowledge shared by Casey! Thank you very much!