Top 10 Craziest Assembly Language Instructions

  • Published: Dec 17, 2024

Comments • 1.3K

  • @davidjohnston4240
    @davidjohnston4240 3 years ago +1922

    RdSeed - It's not always slow. There's a FIFO on the output of the RNG. RdSeed pulls from that FIFO. If you haven't just pulled a bunch of values from the FIFO, the value will be available immediately because the FIFO is not empty. If you try to continuously pull from RdSeed and measure the average time per instruction, it will appear slower because you are limited to the physical rate of generation of full entropy numbers from the RNG, which requires a whole lot of computation: generate 512 bits from the entropy source, AES-CBC-MAC them together to get 128 bits (that's two RdRand results' worth), XOR it with an output from the DRBG (another 3 AES operations, just like SP800-90C describes), and stuff the two 64-bit numbers from the 128-bit result into the output FIFO. How do I know all that? I designed it.
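The burst-versus-sustained behaviour described above can be illustrated with a toy timing model. This is only a sketch: the function name, the FIFO depth of 4 and the 100-cycle generation time are made-up assumptions for illustration, not figures from the comment or from any real processor.

```python
def simulate_back_to_back_reads(n_reads, fifo_depth=4, refill_cycles=100):
    """Toy model: return the stall (in cycles) seen by each back-to-back read.

    fifo_depth and refill_cycles are hypothetical numbers, not hardware data.
    Assumes fifo_depth >= 1.
    """
    now = 0
    buffered = fifo_depth    # generator has been idle long enough to fill the FIFO
    next_value_at = None     # completion time of the value in flight; None = idle
    stalls = []
    for _ in range(n_reads):
        # Values that finished generating since the last read land in the FIFO.
        while next_value_at is not None and next_value_at <= now:
            buffered += 1
            if buffered < fifo_depth:
                next_value_at += refill_cycles
            else:
                next_value_at = None   # FIFO full again, generator goes idle
        if buffered > 0:
            buffered -= 1              # value available immediately
            stalls.append(0)
        else:
            stalls.append(next_value_at - now)  # wait for the value in flight
            now = next_value_at
            next_value_at = None
        if next_value_at is None:
            next_value_at = now + refill_cycles  # (re)start the generator
        now += 1                       # the read itself takes one cycle here
    return stalls
```

With the default numbers, the first four reads return without stalling and every later read waits roughly one full generation period, which is the "fast unless you hammer it" effect the comment describes.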

    • @xelaxander
      @xelaxander 3 years ago +295

      The true gold is down in the comments

    • @LKRaider
      @LKRaider 3 years ago +113

      Oh cool. When did you design it? Care to share some history?

    • @davidjohnston4240
      @davidjohnston4240 3 years ago +815

      @@LKRaider It was around 2009 I started. It ended up first in the Ivy Bridge processors with the RdRand instruction. I had been working on writing cryptographic protocols in standard committees (802.11i, 802.16 etc) and they all needed cryptographically secure random numbers and when I looked at the SP800-90 specification back then, it was not sufficient. It described DRBGs (aka PRNGs) but not entropy extraction or physical entropy sources. A small team of 4 people was assembled: myself, a mathematician, an analog designer and a corporate cat herder. The math guy came up with some of the mathematical principles and identified the best papers describing how to quantify the entropy, the analog guy did the physical entropy source, the cat herder got it into silicon and I designed the digital logic that takes the partially random bits, turns them into full random bits with an entropy extractor and seeds a PRNG/DRBG with that full entropy data to make the resulting stream of random numbers fast enough. Since then the other three left (2 retired and one died) and I've been the main owner of the RNGs since. RdSeed, which gives full entropy output as per SP800-90C and X9.82, was added with Broadwell. This was so you could make arbitrarily large keys from it. Faster and slower versions were created (fast for servers, slower for energy efficient chips). I've also designed a few other types of RNG for specific needs, like super small ones, non uniform ones and floating point ones. I contributed to the development of SP800-90B and SP800-90C and the revision of SP800-90A which now cover most of what you need in a secure RNG. A couple of years ago I finished a book on random numbers which was published (Random Number Generators, Principles and Practices). So getting involved to solve my problem of where do I get random numbers has turned into the defining part of my career. The standards are still changing. Certification requirements are still evolving and the need for new RNGs that fit in different contexts continues apace, so it has become a full time job for myself and a small number of colleagues.

    • @luxsomething
      @luxsomething 3 years ago +79

      Wow that's amazing

    • @ohchristusername
      @ohchristusername 3 years ago +181

      @@davidjohnston4240 What a lovely comment chain to stumble upon, great read!
      May your random continue to prosper!

  • @electroflame6188
    @electroflame6188 3 years ago +1462

    Dot product of packed singles in your area

    • @Rudxain
      @Rudxain 3 years ago +94

      I would like it in my boot sector

    • @TheLightningStalker
      @TheLightningStalker 3 years ago +21

      The probability of finding a project worth uploading commits of my sus code is very low.

    • @molybd3num823
      @molybd3num823 3 years ago +14

      @@TheLightningStalker but never zero

    • @dubbynelson
      @dubbynelson 3 years ago +24

      dot product of deez nuts packed on your chin

    • @sumuduranathunga
      @sumuduranathunga 3 years ago

      I think 🤔 it must be a cross product

  • @NotDwight
    @NotDwight 3 years ago +4179

    TIL I learned there's an audience for top 10 videos about assembly instructions. Cool.

    • @thegrandnil764
      @thegrandnil764 3 years ago +88

      I'm surprised our community is so large

    • @TheActualDP
      @TheActualDP 3 years ago +44

      I'm surprised this has > 10^5 views.

    • @icedragon769
      @icedragon769 3 years ago +42

      having only ever worked with RISC assembly like MIPS in school, seeing the extremes of what you poor poor x86 driver authors have to deal with is entertaining and enlightening.

    • @jimviau327
      @jimviau327 3 years ago +4

      Sojit, in this case it doesn't appear that this video content will ever be of service to the quality of life you are seeking. Did I just write that? I'm not even sure I understand myself. :)

    •  3 years ago +3

      @@TheActualDP It has 2#10_1111_0010_1010_0010# views (I love Ada's based integers :D)

  • @luck3949
    @luck3949 3 years ago +1632

    Wow, so the task I was given in a job interview was actually an assembler one-liner. Good to know.

    • @DOSeater
      @DOSeater 3 years ago +279

      If you'd said that in the job interview you'd get instantly hired

    • @luck3949
      @luck3949 3 years ago +209

      @@DOSeater I wish I knew this 2 months ago. I got that job anyway, but it took a few more interview iterations. Now I'm a happy developer of a delivery robot :)

    • @DOSeater
      @DOSeater 3 years ago +29

      @@luck3949 Nice! I'm happy it worked out for you

    • @guywithknife
      @guywithknife 3 years ago +366

      "Oh, that's easy, you can do it in one cycle using the PSCMPXCHGFMADDRABCXYZUW instruction"

    • @mika2666
      @mika2666 3 years ago +15

      Which one was it?

  • @Requiem100500
    @Requiem100500 3 years ago +709

    I love how hyped this guy is about CPU instructions. Really fun to listen to.

    • @tkeleth2931
      @tkeleth2931 3 years ago +29

      This dude could describe paint drying on a wall and I'd be entertained. I've never seen an assembly instruction before this video lol

    • @ChristopherGray00
      @ChristopherGray00 2 years ago +2

      i don't know why but for me it's quite annoying.

    • @HuntingKingYT
      @HuntingKingYT 1 year ago +1

      I'm also hyped when I learn something truly revolutionary

    • @MichaelMantion
      @MichaelMantion 1 year ago +1

      I am surprised he wasn't more excited.

    • @____________________________.x
      @____________________________.x 1 year ago +1

      Are you kidding me? I hate his voice with every fibre of my being. I've subbed only because he has subtitles and the other videos look interesting. That first 30 seconds was excruciating, I may need a lie down in a dark room

  • @ChildOfTheLie96
    @ChildOfTheLie96 3 years ago +572

    Lol, this guy has that kind of voice that makes it sound like he's constantly on the brink of laughter

    • @KanaalMTS
      @KanaalMTS 3 years ago +6

      The way you write sounds very British 😂😂

    • @douwehuysmans5959
      @douwehuysmans5959 3 years ago +4

      He sounds like BuzzFeed's IT guy

    • @julian-xy7gh
      @julian-xy7gh 3 years ago +5

      I have the same feeling with Tim from the Unmade Podcast. Maybe it's the Australian accent haha

    • @bakedbeings
      @bakedbeings 3 years ago +8

      @@julian-xy7gh Australian here: it's not universal for Aussies, he's just a gem 💎

    • @2112jonr
      @2112jonr 3 years ago +7

      More like madness.
      Assembly language has that effect...
      .

  • @1111757
    @1111757 3 years ago +699

    I can't get over this presentation. That's the kind of nerdy content you expect to find in a recording of a 10 year old talk that was given to 50 people in a tent :D

  • @zrebbesh
    @zrebbesh 3 years ago +655

    "HCF" -- Halt and Catch Fire.
    On a lot of early CPUs (1970s/1980s, yes damnit I am old) the manual gave the bit pattern for each instruction - and the rest of the bit patterns did undocumented things. Some were just a different way to spell NOP, some did deeply bizarre unintended things that happened because the bits randomly activated chunks of the CPU circuitry that mixed and matched chunks that were used in different combinations for other commands, and some did things that were only ever intended to be done in the factory, during QA testing.
    We used to hunt through these "undocumented instructions" looking for anything interesting or cool that we could then figure out uses for. But this was a bit risky. A fair number of CPUs had at least one undocumented instruction that would immediately cause the machine to lock up and, a few seconds later, destroy the CPU. Sometimes they caught fire, sometimes they melted through the PCB. Sometimes they desoldered themselves from the board and fell out. Whenever we found it we called it a "Halt And Catch Fire" instruction and patched the name 'HCF' into our macro assembler for that bit pattern, in order to avoid accidentally finding it again.
    Naturally when I saw the title of this video I figured HCF would be at the top of the list.
    Finding an HCF usually meant a new version of the chip as soon as the company could mask it off. We thought of ourselves as contributing to their QA efforts, although very few of them thanked us for it.

    • @ducksonplays4190
      @ducksonplays4190 3 years ago +69

      That is ridiculous, thank you for this comment.

    • @rty1955
      @rty1955 3 years ago +54

      Write while rewind
      Eject disc
      Read & write while ripping tape
      Disable console
      active emergency power off
      Electrocute operator
      Sense card deck on printer and open cover
      Write past EOT
      Read and scramble data
      I have a huge list of them along with my green cards

    • @Safyire_
      @Safyire_ 3 years ago +24

      Can you give some examples of interesting undocumented instructions you came across?

    • @zrebbesh
      @zrebbesh 3 years ago +146

      @@Safyire_ We found things like 'compare while swapping' that swapped the values in two registers while writing 1 to the comparison bit if the first was higher than the second. That was actually a little bit useful. We found a lot of things that tried to do two or three things at once but did them in a random-ish order because of race conditions. One of those was useful because it consistently did xor before swap if the CPU was hot and swap before xor if the CPU was cold, so we could write code that monitored the CPU and shut things down if it got too hot. We found instructions that connected multiple registers to the bus for output, meaning the result of the instruction would be written to four different registers at once. We also found instructions that connected multiple registers to the bus for input, which was useless and sometimes damaged the CPU. It was a real crapshoot. Also a very expensive hobby if you damaged the machine and your professor wasn't ready to write it off to "research." CPUs were not cheap.

    • @morgwai667
      @morgwai667 3 years ago +12

      ​@@rty1955 ​ @Zrebbesh you crazy old hackers! ;-) you are legends! :)

  • @0xABADCAFE
    @0xABADCAFE 3 years ago +1351

    So the most amazing thing about these instructions to me is the fact so many of them run in single digit cycles. You have to marvel at the engineering effort that has gone into it. Also, a compiler has to basically be sentient to know when and how to use some of these.

    • @MrHaggyy
      @MrHaggyy 3 years ago +128

      Yes, millions of hours of engineering went into getting to the point where you could write Hello World in Python etc.

    • @altaroffire56
      @altaroffire56 3 years ago +506

      No. If the compiler was sentient, it would kill itself.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago +28

      @@altaroffire56 LOL

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago +75

      @@MrHaggyy And billions of hours for a JavaScript hello world. I think capable computer engineers brought this upon themselves by providing layers and layers of abstraction and burying the need for the internal concepts necessary to get something done. No wonder the developers now are too shallow in their concepts, probably not their fault if they get hired only after 6 months of Python for data structures (they have no incentive to learn the deeper internals if they get paid a shitload for sitting at a desk). Hell, I would say most people choose programming or development for making bucks; learning and interest come later. There are only a few people now who are truly interested and curious in the core of things and it might just be that after 10 years understanding these would just be a luxury and not a necessity. Also no wonder why most programmers hate their jobs and want to die after getting one.

    • @MrHaggyy
      @MrHaggyy 3 years ago +82

      @@swarnavasamanta2628 mhm i think the horizon of programmer/developer/engineer in this field got much broader. Yes, there are many abstraction layers we have invented and standardized over the years. I have a mechatronics degree with a microsystem-technology specialization. Most of my field works on improving the hardware for existing assembly code. But we also introduce new things in hardware which we map to assembly or C/C++ code. On that layer, you have the guys who are building assemblers, linkers, and compilers. These are the programs you need to actually execute code on a machine. On top of that, you have the Microsoft, Android, Apple, Linux, etc guys who write an operating system that provides useability with that stuff. And on that foundation, you can start building languages, IDEs, or any program you can open on your computer. And if we finally have these higher-level languages and programs we can start building frameworks or things like python. That field can write very powerful applications that millions of people can use, or that run on many machines at the same time, or all the things these cloud-native guys are doing. The interest in these fields is widely different. I personally love hardware, and the guys I work with love building hardware or building systems with hardware. Systems can be the new Intel i3-i7, over to raspberry pi or smartphone processor, to small controllers like an STM32 which are used in smartwatches, cars, microwaves, freezers down to something like an Arduino which is easy to learn.
      There are a lot of people working on those layers, many of them the stereotypical white European/North American older man. But this field is one of the most global out there, with Korea, Taiwan, Japan and China being the "most" impactful.
      The amount of things you could learn about computer and software layers is way beyond one's reach. 99.99% of all programmers don't have a clue how transistors are formed into bit logic, scaled to 16-32-64-86-128bit wide memory, how this memory became a register with a specific purpose and how you address this register so you can call it. But you don't need to know it in order to write a program. :-) we have you covered in that one :-)
      So even assembly can teach you a lot about how a computer works, you don't need to write it. In fact you shouldn't write it for any used code. Use a compiler and write it in a higher-level language. All the smart people from the compiler department will cover you there. And so on and so on. Until the hip young facebook star engineer can write his php or python code for his next new feature. And if we do something amazing down the layers he will get a new version that will make his software even better than before. And the only thing he needs to do is trust the work of other people.
      The unpleasant truth about why so many programmers want to die or really do it is a mismatch between management, expectations and skills paired with bad working environments. Coding and engineering computers is a mentally very hard and demanding task. You have to know your tools, get to know the problem, which I like to call a puzzle, identify the pieces of your puzzle, sometimes create a new piece that fits, and solve the puzzle. This takes time. A good time is anything from 2 to 4 hours. Less is only sufficient for really easy tasks, longer is better but you need to train for it and you need to go to the toilet, move, eat, sleep etc. In most companies, this deep focus session gets corrupted by meetings, telephone, angry managers, or people that think they are important to the problem. These corruptions drain a lot of willpower and unless you are a (senior) engineer and prepared for this kind of stuff it will depress you. You need to get your routines in place in order to sustain. The other part is once you solved the puzzle your company needs to give you a reward for this. If your management doesn't like your result and lets you feel their dislike, you need someone holding you on the bridge. That's why many companies in this field like Facebook and Intel don't have 9 to 5 jobs. You get paid to work for them. There are recommendations on how you should set up your routines and there are people helping you. But you can come and go as you like. But you get certain tasks and a timeframe. Once the timeframe is over people all over the world are counting on you getting the job done in time.
      So very wide, very different, and very interesting domain. And it's very rewarding if you know that you did something that all of mankind will use and benefit from in a few months after you finished your work.

  • @zactron1997
    @zactron1997 3 years ago +769

    Good lord that poor silicon. I can't even begin to imagine how you'd design chips to implement some of these instructions. I'd love to see a followup video showing some examples of using these instructions, and if they're superseded, what should be used instead!

    • @Noctew
      @Noctew 3 years ago +78

      They committed the cardinal sin in the 1970s with REP MOVx and it went downhill from there.

    • @fake12396
      @fake12396 3 years ago +125

      microcode, lots of microcode

    • @shinyhappyrem8728
      @shinyhappyrem8728 3 years ago +28

      I'd think that there are massive groups of "one circuit per operation", and they all work in parallel. From all the results only the specified one is selected.

    • @Lukas-er4nd
      @Lukas-er4nd 3 years ago +11

      Microcode. Lots and lots of microcode.

    • @polypolyman
      @polypolyman 3 years ago +35

      A long time ago, they actually gave up on x86, and have been making much simpler chips that convert x86 to that simpler system using "microcode"

  • @KazeN64
    @KazeN64 1 year ago +9

    I've used MIPS excessively and never looked at X86 much. This feels like when you were playing yugioh in 1999 and you were summoning and setting 1 card every turn and then you get teleported to 2023 where people play their entire deck in one turn and have cards with effects that are 7 paragraphs

    • @jhgvvetyjj6589
      @jhgvvetyjj6589 11 months ago

      Even when cutting off all SSE and up instructions (making it useful for legacy x86 device targeting) there is still a lot of complexity, including very precise x87 floating point and MMX vectorization. What makes it especially fascinating is how compatible it has become; a 640×480 60fps renderer on a very old x86 processor with MMX might very well be the exact same program that does 3840×2160 60fps on a modern PC.

    • @splits8999
      @splits8999 3 months ago

      huh.... what the fuck

  • @Andrath
    @Andrath 3 years ago +967

    You'd almost think silicon makers like to mess with compiler writers.

    • @kestasjk
      @kestasjk 3 years ago +142

      I doubt these instructions were aimed at people writing compilers, they'd be aimed at people doing things with encryption, low-level synchronization, multimedia.. I think these days people would first try and come up with a GPU based way to tackle these large data-processing problems, but before GPUs were general purpose parallel computers you had to do these single instruction multiple data things on the CPU

    • @toboterxp8155
      @toboterxp8155 3 years ago +67

      @@kestasjk Also, doing stuff with a good CPU instruction is generally more efficient than doing it on the GPU, simply because you have to send across the data and get the result back on a GPU.

    • @kestasjk
      @kestasjk 3 years ago +44

      @@toboterxp8155 Sort of.. The thing is if you’ve got enough data the GPU is so much faster it’s worth the overhead (and the memory space is getting more integrated / unified all the time), and if you’ve not got enough data to make sending to the GPU worthwhile the speed up for processing a small amount of data on the CPU more efficiently probably isn’t worth it. Perhaps for certain encryption or compression tasks where it can’t be parallelised very well on the GPU but it still needs lots of processing power they may still be useful, but I doubt these sorts of instructions are used in modern software very often

    • @toboterxp8155
      @toboterxp8155 3 years ago +21

      @@kestasjk You're generally correct, but those instructions are a standard way of making programs faster, used to this day. If your task isn't easily converted to the GPU, you don't want the extra work, or you don't want the program to require a GPU, using some complex instructions is an easy, fast and simple way to optimize for some extra speed when needed.

    • @kestasjk
      @kestasjk 3 years ago +17

      @@toboterxp8155 True.. but I think you can probably attribute ARM/NVIDIA’s ability to keep improving by leaps and bounds while Intel is reaching a plateau to its need to maintain a library of instructions that aren’t really necessary in modern software. If it gets rid of them old software breaks, if it keeps them any improvement it wants to make to the architecture needs to work with all these. Intel went for making the fastest possible CPU, but we now know a single thread can only go so fast (and the tricks like branch prediction have exposed gaping security holes in CPUs, forcing users to choose a pretence of security or turning branch prediction off and getting a huge performance hit). So parallelism is the future: In the 00s this meant multi-core CPUs, today this means offloading massive jobs to the GPU, but the breakthrough will come with CPUs and GPUs merging into one. Not to an SoC, like we already have, but with GPU-like programmable shaders as a part of the CPU instruction set and compiler chain, so that talking about CPU/GPU will be like talking about CPU/ALU. You’ll be able to do the operations like these instructions do in a single cycle, but by setting up a “CUDA-core” with general purpose instructions that can access the same memory.

  • @flowerpt
    @flowerpt 3 years ago +1259

    Intel: One cycle
    Bioinformaticists: lemme reimplement that in Python and take 300,000 cycles to compute the same thing.

    • @kestasjk
      @kestasjk 3 years ago +134

      Don't worry; as long as computer time remains far more valuable than developer time, and no alternative graphics-based technology appears for custom parallel processing operations, Intel will be just fine

    • @SimonBuchanNz
      @SimonBuchanNz 3 years ago +34

      @@kestasjk eh, emulation of x86 on ARM on both Windows and Mac is apparently good enough now that I'd be seriously worried if I was Intel. AMD at least have their GPUs...

    • @JayOhm
      @JayOhm 3 years ago +21

      @@SimonBuchanNz I think AMD wouldn't mind going ARM too much, if they have to. Maybe even will design dual-instruction-set chips for the transition period. Good thing that China won't let Nvidia buy ARM.
      In general, nowadays there is a tendency towards "crossplatform" software design practices, so the question of "Can it run widespread software fast?" would soon become irrelevant. For example, Adobe Lightroom already works on ARM on Windows and their other products will follow soon. Itanium might not have flopped if it happened a few years from now, at least not for the reason it did, which was poor x86 emulation performance.

    • @codycast
      @codycast 3 years ago +24

      @@JayOhm how exactly can China stop a US company from buying a UK company?
      Should we find out what Italy and Argentina think too?

    • @JayOhm
      @JayOhm 3 years ago +10

      @@codycast The short answer is Qualcomm. They are banned by US so if ARM becomes US-owned, Qualcomm will no longer be able to legally produce ARM chips. Possible political implications of that are just too painful to risk so regulators almost certainly won't allow it.

  • @DukePaprikar
    @DukePaprikar 3 years ago +41

    Yeah, watch-mojo really dropped the ball by not covering this one.

  • @ZILtoid1991
    @ZILtoid1991 3 years ago +148

    PMADDWD is quite useful for fast affine transformation functions. On SSE2, I can even calculate two pixels at once
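As a rough illustration of why PMADDWD suits affine transforms, here is a scalar Python model of the instruction (multiply signed 16-bit lanes, then add adjacent products into 32-bit results) applied to two pixels at once. The helper names, the 8.8 fixed-point format and the lane layout are assumptions made for this sketch, not the commenter's actual code, and the translation term is left out.

```python
def to_int32(value):
    """Wrap a Python int to a signed 32-bit result, as the hardware would."""
    value &= 0xFFFF_FFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def pmaddwd(a, b):
    """Scalar model of PMADDWD: pairwise multiply 16-bit lanes, add neighbours."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("operands must have the same even length")
    if any(not -32768 <= v <= 32767 for v in (*a, *b)):
        raise ValueError("lanes must fit in a signed 16-bit integer")
    return [to_int32(a[i] * b[i] + a[i + 1] * b[i + 1])
            for i in range(0, len(a), 2)]


def affine_two_pixels(matrix, p0, p1):
    """Apply a 2x2 matrix (8.8 fixed point, hypothetical layout) to two pixels.

    Eight 16-bit lanes, as in one SSE2 register, yield x' and y' for both pixels.
    """
    (m00, m01), (m10, m11) = matrix
    coords = [p0[0], p0[1], p0[0], p0[1], p1[0], p1[1], p1[0], p1[1]]
    coeffs = [m00, m01, m10, m11] * 2
    x0, y0, x1, y1 = (v >> 8 for v in pmaddwd(coords, coeffs))  # drop fraction
    return (x0, y0), (x1, y1)


# 90-degree rotation (x, y) -> (-y, x), with 1.0 encoded as 256.
ROTATE_90 = ((0, -256), (256, 0))
```

Each output lane is already a complete `a*x + b*y` term, so one multiply-add covers a full row of the matrix for one pixel.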

  • @icarvs_vivit
    @icarvs_vivit 3 years ago +83

    #1 is the definition of insane and incredibly useful.
    Thank you for translating the Enginese into English.
    Now I can delete my string comparison macros forever.

  • @redsmith9953
    @redsmith9953 3 years ago +289

    I just remembered porting the Torque game engine to the PSP. Out of all that work, one highlight was the CMPXCHG instruction for the mutex: I implemented it with some native PSP intrinsics. Good memories. The best optimization trick came from there too. The game was doing 10 fps at best, and the problem was matrix transposition between the engine and the PSP "OpenGL". So I did the transposition on the fly by changing the order in which the registers were read and written in the VFPU instructions, kicking the Sony engineers' 'axe' ; ), and got 30 fps, enough to pass their performance standards.

    • @KangJangkrik
      @KangJangkrik 3 years ago +9

      Wow you made PSP games?

    • @redsmith9953
      @redsmith9953 3 years ago +33

      @@KangJangkrik , i made the Torque game engine port, and on top of that another team was developing games using it.

    • @DiThi
      @DiThi 3 years ago +1

      Nice, but wouldn't it have been better to change which indices of matrices are used in vector and matrix functions? E.g. using m[4] instead of m[1] and vice versa.

    • @kyrylmelekhin2667
      @kyrylmelekhin2667 3 years ago +1

      Matrix transpose is the dumbest operation ever, you shouldn't be doing that, ever.

    • @redsmith9953
      @redsmith9953 3 years ago +14

      @@DiThi that implementation costs 20 fps on that platform: you need to swap the entire matrix for every calculation. It sounds trivial, but it was not for a 333 MHz processor with slow RAM.
      before was:
      matrix.transpose(); // bloated operation
      vector.mul(matrix);
      after optimization was:
      vector.mul(matrix); // due to the trick no transpose needed
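The general idea behind this trick, folding the transpose into the order in which matrix elements are read, can be sketched in plain Python. This is a generic illustration with made-up function names, not the PSP VFPU code the comment refers to.

```python
def transpose(m):
    """Materialize the transpose: the extra pass the optimization avoids."""
    return [list(row) for row in zip(*m)]


def mul_vec_mat(v, m):
    """Row vector times matrix: out[j] = sum over i of v[i] * m[i][j]."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) for j in range(n)]


def mul_vec_mat_transposed(v, m):
    """Same result as mul_vec_mat(v, transpose(m)), without building it.

    The only change is the read order: m[j][i] instead of m[i][j].
    """
    n = len(v)
    return [sum(v[i] * m[j][i] for i in range(n)) for j in range(n)]
```

Both paths give identical results; the second simply never pays for the separate transposition pass.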

  • @bobbymorelli9763
    @bobbymorelli9763 3 years ago +104

    alright guys let's brainstorm what kind of algorithm could benefit from all 10...maybe search for a specific font in an image by comparing each glyph's bitmap to the image using MPSADBW and search for words within identified glyphs using the last instruction?

    • @AlexanderBukh
      @AlexanderBukh 3 years ago +40

      careful, or you might end up creating another awfully named megainstruction

    • @bakedbeings
      @bakedbeings 3 years ago +9

      @@AlexanderBukh ALRTGYSBSTRM

    • @nyanpasu64
      @nyanpasu64 3 years ago +1

      Needs moar threads.

    • @abebuckingham8198
      @abebuckingham8198 3 years ago +2

      MPSADBW can be used for all sorts of optimization problems as the sum of absolute differences is a metric. It's often faster than using the Euclidean metric which requires a square root and you can substitute one for the other in many situations.
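To make the "sum of absolute differences as a metric" point concrete, here is a small scalar sketch. It is loosely modelled on how MPSADBW slides a 4-byte block across a wider window to produce several SAD scores at once; the function names and the sample data are hypothetical, and the exact operand selection of the real instruction is not reproduced.

```python
def sad(a, b):
    """Sum of absolute differences (the L1 distance) between two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def sliding_sad(block, window):
    """SAD of `block` against every same-sized slice of `window`."""
    width = len(block)
    return [sad(block, window[offset:offset + width])
            for offset in range(len(window) - width + 1)]


def best_match(block, window):
    """Return (offset, score) of the closest slice; no square root needed."""
    scores = sliding_sad(block, window)
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

A 4-element block against an 11-element window yields eight scores, and the smallest one marks the best-matching position.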

    • @gazehound
      @gazehound 10 months ago

      you could feasibly use a good chunk of these by implementing a fancy video encoding

  • @soranuareane
    @soranuareane 3 years ago +37

    CMPXCHG is how mutual-exclusion, locks, and semaphores are implemented in systems like QEMU. I remember having to fix a bug with a race condition in the QEMU Sparc interpreter by adding judicious use of CMPXCHG locking. It's an amazing instruction and, with its guaranteed atomic behavior, can be used to trivialize mutexes.
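A minimal sketch of how a mutex falls out of compare-and-exchange, written as a Python model. `AtomicCell`, `SpinLock` and `run_demo` are hypothetical names, and a `threading.Lock` stands in for the atomicity the hardware provides (on multi-core x86, CMPXCHG is normally used with a LOCK prefix). This is not QEMU's code.

```python
import threading
import time


class AtomicCell:
    """Models one memory word that supports compare-and-exchange."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stand-in for hardware atomicity

    def compare_exchange(self, expected, desired):
        """If the cell holds `expected`, store `desired`.

        Returns (swapped, observed): whether the store happened and the
        value that was seen, mirroring what CMPXCHG reports to the caller.
        """
        with self._guard:
            observed = self._value
            if observed == expected:
                self._value = desired
                return True, observed
            return False, observed


class SpinLock:
    """Mutex built only from compare_exchange: 0 = free, 1 = held."""

    def __init__(self):
        self._state = AtomicCell(0)

    def acquire(self):
        while not self._state.compare_exchange(0, 1)[0]:
            time.sleep(0)  # yield to other threads instead of burning CPU

    def release(self):
        self._state.compare_exchange(1, 0)


def run_demo(threads=4, increments=500):
    """Several threads bump a shared counter under the spin lock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1  # non-atomic read-modify-write, safe under the lock
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return counter[0]
```

Only one thread can ever win the 0-to-1 exchange, which is all a lock needs; everything else is retrying.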

  • @quadroninja2708
    @quadroninja2708 1 year ago +9

    This video has such a unique editing. The topic isn't any less obscure, and it's really cool to hear the author being so enthusiastic about those instructions. It's a really interesting experience

  • @FinaISpartan
    @FinaISpartan 3 years ago +337

    Can't wait till you remake this vid in 10 years with all the custom RISC-V extension instructions. Gonna be pretty wild to see what people come up with.

    • @ritteradam
      @ritteradam 3 years ago +15

      The big mistake Intel made is to create fixed width vector instructions. The V in RISC-V points to the importance of the variable width vector instructions where the assembly code doesn’t need to know the vector register size (V extension), and a similar matrix extension is coming for machine learning I think (though V is already a great improvement)
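The vector-length-agnostic style this comment refers to can be sketched as a strip-mined loop in which the code asks how many elements it may process per pass instead of hard-coding a register width. The function name and the `vlmax` parameter are hypothetical, and the `min(...)` line only roughly models what a `vsetvl`-style instruction returns.

```python
def vector_add_vla(a, b, vlmax):
    """Add two equal-length lists in chunks of at most `vlmax` elements.

    `vlmax` plays the role of the hardware vector length; the loop body
    never assumes a particular value for it.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    i = 0
    n = len(a)
    while i < n:
        vl = min(n - i, vlmax)  # elements granted for this pass
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out
```

The same loop gives the same answer whether the "hardware" handles 4 or 16 elements per pass, which is the property fixed-width SIMD code lacks.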

    • @canaDavid1
      @canaDavid1 3 years ago +35

      @@ritteradam The V in risc-v is a roman numeral standing for 5, as it is the 5th iteration of risc from Berkeley (i think).

    • @ritteradam
      @ritteradam 3 years ago +16

      @@canaDavid1 Officially yes, but you can find videos of the people who developed RISC-V on YouTube, and they mentioned that they originally developed it because they wanted to get the vector extension right, and that's why they called it RISC-V at the start.

    • @bFix
      @bFix 3 years ago +8

      Also it's a reduced instruction set (risc) and not a complex instruction set (cisc) like x86
      So why should risc-v even get some of these?
      just do them in software and let the compiler do its magic.

    • @TheMixedupstuff
      @TheMixedupstuff 3 years ago +18

      The point of risc-v is to have a common set of instructions understood by many cpus and to be extended with application specific extensions where needed. So you can be 100% sure there will be many wild instruction extensions.

  • @lonsbury
    @lonsbury 3 years ago +282

    I feel bad for the CPU engineers who will need to add compatibility for this stuff in 20 years
    Edit: finished watching the video. This was pretty fascinating, and the 3D text made it very nice to watch. I hope you gain more subscribers!

    • @vylbird8014
      @vylbird8014 3 years ago +15

      They'll do it in microcode, I imagine. Apart from the RNG, they can all be done purely in heaps of microcode if you don't care about performance, no dedicated hardware needed.

    • @gorilladisco9108
      @gorilladisco9108 3 years ago +13

      If you ever learn about microprocessors, it's all about microcode. Every assembly instruction is a function call into microcode. The design will basically be the same, with microcode printed in ROM inside the chip. You just have to be creative using that microcode to come up with a new instruction.

    • @johnbrown9181
      @johnbrown9181 3 years ago +16

      @@gorilladisco9108 There's definitely a lot more to it than just microcode. Some things are both easy and compact in hardware, such as a linear-list search or swizzling, and microcode won't get you there.
      Also I'm not aware of any major RISC implementations that use a significant amount of microcode, very much unlike x86.

    • @gorilladisco9108
      @gorilladisco9108 3 years ago +7

      @@johnbrown9181 And that's why you won't see any instructions like the ones listed in this video on any RISC microprocessors. The thing about x86 and other CISC microprocessors is they use microcode liberally.
      Microcode is how a microprocessor works. All you have to do is to have imagination.

    • @Waccoon
      @Waccoon 3 years ago +4

      Depends on how fast it needs to be. Optimizing complex instructions to use all of a core's hardware is difficult, but just getting older instructions to work for the sake of compatibility isn't that hard. Hence, x86 code from a couple decades ago will work fine on a modern x64 chip, while ARM, PowerPC, and other RISC designs have suffered mountains of compatibility issues over time.

  • @ukyoize
    @ukyoize 3 years ago +55

    The string instructions seem like half of a grep implementation.

  • @dkosmari
    @dkosmari 3 years ago +25

    The carryless multiplication is polynomial multiplication modulo 2. It's used to implement things like CRC computation, and Reed-Solomon error correction codes.

    • @jgunther3398
      @jgunther3398 1 year ago

      i was disturbed to find any mul instruction. i loved my homemade multiplication and division routines

    • @gazehound
      @gazehound 10 months ago +1

      Yes, it's useful for all kinds of codes. It's a direct implementation of a field theory concept
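
To make the "polynomial multiplication modulo 2" point in this thread concrete, here is a minimal software sketch. It is not how the hardware instruction is implemented; the names `clmul` and `poly_mod` are made up for illustration. Each integer is read as a polynomial over GF(2), with bit i as the coefficient of x^i.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: schoolbook binary multiplication where the
    partial products are combined with XOR instead of ADD."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # XOR in the shifted copy, no carries
        b >>= 1
        shift += 1
    return result


def poly_mod(value: int, poly: int) -> int:
    """Remainder of value divided by poly, both read as GF(2) polynomials.
    This reduction step is the core of a CRC. Assumes poly is non-zero."""
    poly_len = poly.bit_length()
    while value.bit_length() >= poly_len:
        # Cancel the current top bit by XOR-ing in an aligned copy of poly.
        value ^= poly << (value.bit_length() - poly_len)
    return value
```

For example, `clmul(0b11, 0b11)` is `0b101`, because (x + 1)^2 = x^2 + 1 once the middle term 2x vanishes modulo 2, whereas ordinary multiplication would give 9.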

  • @ishdx9374
    @ishdx9374 3 years ago +27

    the last one seems so damn complex it's unbelievable it takes 3-4 cycles

  • @Kyrelel
    @Kyrelel 3 years ago +66

    Bear in mind that some instructions were not designed, they are a by-product of the design process.
    In essence, take any bit-pattern that is not assigned to an instruction and look at what the processor will do.
    Most often it will do nothing (which is why there are so many NOPs in instruction sets) or it may crash, but sometimes it will do something weird and wonderful and be included as an "official" instruction while the designers pretend it was intentional.

    • @Rudxain
      @Rudxain 1 year ago +12

      That's like exploiting hardware-level undefined-behavior

    • @lPlanetarizado
      @lPlanetarizado 1 year ago +12

      there is a comment that mentions HCF -Halt and Catch Fire- , "undocumented instruction" that sometimes could catch fire...damn, thats amazing lol

    • @appelnonsurtaxe
      @appelnonsurtaxe 1 year ago +7

      @@lPlanetarizado That wouldn't happen today on your PC's x86. Or this would be a terrible security issue. On modern systems userspace processes should be able to (try to) run any instruction they want without the CPU melting down.

    • @NormanVN
      @NormanVN 1 year ago +12

      All of the instructions in this video were quite intentional, but niche. Well, only some are niche. cmpxchg is a _foundational_ instruction whose importance cannot be overstated, while pshufb is going to be in pretty much every vector codebase. dpps is pretty well known, parallel dot product. not a fan of dpps tbh.
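
For readers who have not met pshufb, a rough software model of the 128-bit form may help show why it turns up everywhere. This is only a sketch of the documented semantics; the function name and the use of Python `bytes` are choices made for illustration.

```python
def pshufb(src: bytes, control: bytes) -> bytes:
    """Model of the 128-bit byte shuffle: every output byte is picked from
    src by the low 4 bits of the matching control byte, or forced to zero
    when the control byte has its top bit set."""
    if len(src) != 16 or len(control) != 16:
        raise ValueError("both operands must be exactly 16 bytes")
    out = bytearray(16)
    for i, c in enumerate(control):
        out[i] = 0 if c & 0x80 else src[c & 0x0F]
    return bytes(out)
```

With a control vector of 15, 14, ..., 0 this reverses the register, and with a constant table in `src` it acts as sixteen parallel 4-bit table lookups, which is where much of its usefulness comes from.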

  • @rockercas
    @rockercas 3 years ago +468

    wow, those were 1010 assembly language instructions, not a mere 10!

    • @i_am_aladeen
      @i_am_aladeen 3 years ago +22

      I actually crunched these numbers in my head before I realized what you did. I feel ashamed. +1

    • @bbq1423
      @bbq1423 3 years ago +62

      There are 10 kinds of people in this world. Those who know binary, and those who do not.

    • @threepointonefour607
      @threepointonefour607 3 years ago +61

      @@bbq1423 there are 10 kinds of people in the world: those who understand hexadecimal and F the rest

    • @skilz8098
      @skilz8098 3 years ago +3

      @@threepointonefour607 0000 0000b - 1111 1111b == 0x00 - 0xFF, since each hexadecimal digit is exactly four bits (log2(x) = 4·log16(x))! If you are doing simple programming, then 90% of the time you'll only need hexadecimal. If you are actually building and designing hardware and implementing its data paths, control lines and control bits... You are not going to get very far without binary and Boolean Algebra! If you get into Cryptography, or Signal Analysis you might want to know binary as you'll end up performing a lot of bit manipulation!

    • @alg3n320
      @alg3n320 3 years ago +15

      @@bbq1423 and those who didn't expect a trinary joke

  • @elietheprof5678
    @elietheprof5678 3 years ago +10

    Excellent visualizations btw. Way more straightforward than instruction manuals that try to explain everything with just words.

  • @CrittingOut
    @CrittingOut 1 year ago +1

    one of the assembly instruction video's of all time.

  • @quickstartprojects2162
    @quickstartprojects2162 3 years ago +40

    Finally SSE 4.2 string compare is understandable. I wish we had the Australian version, Creel version, of the intel instruction set manuals.

    • @deppy2165
      @deppy2165 3 years ago +11

      if you're struggling with the intel manuals I personally find the amd manuals more comprehensible

  • @galier2
    @galier2 3 years ago +44

    TMS-9900 also has a very unique instruction: X Rn . Execute the instruction in register n. It's the only CPU I know of that has the equivalent of an eval() function (as the registers are stored in external RAM, it's clear that it's not difficult to implement in that case).

    • @Rudxain
      @Rudxain 3 years ago +6

      It has SEVERE security issues. But hey, at least it can be used for self-modifying programs

    • @galier2
      @galier2 3 years ago +9

      @@Rudxain for a CPU that doesn't have privilege levels or memory protection, I don't think that security is an issue with the X instruction.

    • @peterfireflylund
      @peterfireflylund 1 year ago +2

      S/360 had the EX instruction for that. The instruction wasn’t in a register but in memory (S/360 was variable length, 2/4/6 bytes). This kind of instruction was fairly common in the 50’s and 60’s.

    • @galier2
      @galier2 1 year ago

      @@peterfireflylund interesting. Btw in the TMS-9900 the instruction is also in memory because the register window is in memory.

  • @sasas845
    @sasas845 3 years ago +8

    I've worked with or in close proximity of most of these. If you do high performance number crunching or data crunching, the value logistics (i.e. which value needs to be in what operand in which SIMD position) very quickly becomes a major issue, and for that all these shuffle/rotate/select instructions are a godsend, especially since they tend to be just rewiring of existing ALU functionality so AFAIK should be easy to implement in silicon. Number 1 on the list is the only instruction family I'd put into "space magic" territory, but I might just not have seen its use case yet.

  • @DjVortex-w
    @DjVortex-w 3 years ago +29

    You know that an instruction is complex if implementing it in a higher-level programming language would take literally hundreds of lines of code.

  • @MjuMeli
    @MjuMeli 3 years ago +14

    This getting recommended to people is almost as oddly specific as the sound of sorting algorithms

  • @ProjectPhysX
    @ProjectPhysX 3 years ago +7

    Fantastic video! Such exotic instructions can insanely speed up / shorten certain algorithms. Back when I did MPASM (has only 35ish instructions), there are some rarely used ones that magically do exactly what you can also emulate in 10 more common instructions.
    From the instructions in the video I so far only used cmpxchg to emulate floating-point atomic addition in OpenCL.
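
The trick mentioned here, building a floating-point atomic add out of an integer compare-and-exchange, can be sketched roughly as below. This is a software model, not OpenCL code: `AtomicU32` is a hypothetical stand-in for a 32-bit memory cell, and its internal lock only models the atomicity the hardware instruction provides.

```python
import struct
import threading


class AtomicU32:
    """Hypothetical stand-in for a 32-bit memory cell with compare-and-swap."""

    def __init__(self, value: int = 0):
        self._value = value & 0xFFFFFFFF
        self._lock = threading.Lock()  # models the hardware's atomicity

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_exchange(self, expected: int, desired: int) -> int:
        """Store desired only if the cell still equals expected.
        Always returns the value that was in the cell beforehand."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired & 0xFFFFFFFF
            return old


def float_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def atomic_add_float(cell: AtomicU32, delta: float) -> None:
    """Retry loop: read, compute the new float, publish it only if nobody
    else changed the cell in the meantime."""
    while True:
        old_bits = cell.load()
        new_bits = float_to_bits(bits_to_float(old_bits) + delta)
        if cell.compare_exchange(old_bits, new_bits) == old_bits:
            return
```

The comparison is done on the raw bit pattern rather than on the float value, which is why the integer instruction is enough and why NaN payloads do not cause an endless retry.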

  • @kippers12isOG
    @kippers12isOG 3 years ago +8

    I love your vids mate. You’re such a god dam likeable character

  • @glikar1
    @glikar1 3 years ago +32

    Exciting! Love your enthusiasm. Almost makes C redundant. There is something about machine code that feels right.

    • @bootmii98
      @bootmii98 3 years ago +3

      did you know that ++ and -- were VAX intrinsics?

    • @seneca983
      @seneca983 3 years ago +7

      "There is something about machine code that feels right."
      I dunno. I've not done any actual assembly programming so maybe my opinion doesn't matter but x86 just seems so bloated and inelegant.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago

      @@seneca983 you would be partially right. Bloated or not depends on the way of implementation, if these instructions were to be implemented by microcode, yes absolutely, better let the programmer handle them. But if they are direct on chip Hardware implementation of these instructions then it's a different story, it takes the opposite route of bloat. Takes 1 instruction instead of writing a 100 line function in C and hoping compiler would get the translation right. Also x86 being firmly established the engineers have to make sure they are compatible all the way. Support for languages will drop eventually, while x86 is going to stay.

    • @seneca983
      @seneca983 3 years ago +1

      @@swarnavasamanta2628 One advantage of a simpler and smaller instruction set is that microcoding might not then be necessary and the chip could be simpler.
      Indeed x86 would be rather difficult to supplant. However, it seems possible that ARM could do it though it's uncertain and would probably take a long time if it happened.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago

      @@seneca983 ARM is definitely a beast, and their methodology is completely different from other CISC approaches. It began first as a project to see if a computer really needs large complex instructions, they thought they would come at a halt problem but nothing really came up and they could make everything work with 1 cycle simple instructions (although with a bit of microcode). At this point hard to tell what the future holds, maybe there will be standardization when one architecture has so many advantages that renders other architectures almost useless or unworthy of learning curve. Who knows what the future holds but up until that the architecture land of computers is like wild wild west and i kind of love it that way.

  • @superblaubeere27
    @superblaubeere27 3 years ago +18

    7:30 Btw, the carryless multiply is extremely useful when making parsers

    • @mohammedjawahri5726
      @mohammedjawahri5726 3 years ago

      :o, can u elaborate pls xD

    • @superblaubeere27
      @superblaubeere27 3 years ago +1

      @@mohammedjawahri5726 here is a video about it, you will need the context: ruclips.net/video/wlvKAT7SZIQ/видео.html

    • @mohammedjawahri5726
      @mohammedjawahri5726 3 years ago

      @@superblaubeere27 thanks!

    • @0MoTheG
      @0MoTheG 3 years ago

      @@superblaubeere27 You mean at 35:00 ?

    • @superblaubeere27
      @superblaubeere27 3 years ago

      @@0MoTheG exactly.
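
On the parser question in this thread: one known use, for example in simdjson-style JSON parsers, is that a carry-less multiply by an all-ones word computes a prefix XOR, which turns a bitmap of quote positions into a mask of "inside a string" positions. Whether that is exactly what the linked talk shows is an assumption. A small sketch, with all names invented for illustration:

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less multiply (XOR of shifted copies of a, one per set bit of b)."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def prefix_xor(bits: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of the input."""
    return clmul(bits & MASK64, MASK64) & MASK64


def quote_bitmap(text: str) -> int:
    """One bit per character of the first 64 characters, set where it is '"'.
    Escaped quotes are ignored in this sketch."""
    bits = 0
    for i, ch in enumerate(text[:64]):
        if ch == '"':
            bits |= 1 << i
    return bits
```

For the input `a "bc" d` the quotes sit at positions 2 and 5, and `prefix_xor` sets bits 2, 3 and 4: the opening quote and the string body, but not the closing quote. A real parser would handle backslash escapes before this step.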

  • @T33K3SS3LCH3N
    @T33K3SS3LCH3N 3 years ago +2

    My little brother is doing a similar major as I did and will have a course with some practical work in assembly next year. Your video just gave me the inspiration to help him find some more "creative" solutions to those assignments.

  • @educate9946
    @educate9946 3 years ago +4

    I love this presentation, it fits the weirdness of the ops! Great job!

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 2 years ago

    This and 2 minute papers are the most important channels on my YouTube thank you for your service.

  • @GaryBickford
    @GaryBickford 3 years ago +52

    Don't forget the Motorola 6800 "Halt and catch fire" instruction. It was an unpublished byte code that caused a branch to itself until the chip overheated.

    • @BrianG61UK
      @BrianG61UK 3 years ago +7

      No. en.wikipedia.org/wiki/Halt_and_Catch_Fire_(computing)

    • @GaryBickford
      @GaryBickford 3 years ago +6

      @@BrianG61UK Long ago a computer center I worked in had a list created by IBMers in the 1960s of amusing opcodes, including HCF. But I didn't want to complicate the text, and the MC6800 item is there in the Wikipedia description, though I did have the details incorrect😊.

    • @tomysshadow
      @tomysshadow 3 years ago +2

      This video is about x86 though. Given, it does have the HLT instruction, and if you use it in your user mode application it will catch fire (if by catching fire you mean cause a privileged instruction exception) :0)

    • @rty1955
      @rty1955 3 years ago

      HCF was around in the 60s way before the 6800

    • @GaryBickford
      @GaryBickford 3 years ago +3

      @@rty1955 yes, I recall on the wall of a data center I worked at, a paper list of spoof IBM machine instructions that included this HCF instruction. Iirc there was also BAH, Branch And Hang😂. The only CPU that actually did this that I'm aware of was the early 6800, but it's possible there were others. On the 6800 it was an "unimplemented" instruction bit pattern that unbeknownst to Motorola effectively branched to itself immediately and repeatedly until the heat built up enough to burn the logic.
      I also personally experienced the result of two amusing (to me) episodes - at a college I was attending, a kid running a canned BASIC business program that managed somehow to overwrite the entire disk map, effectively erasing everything, and a kid looking for a job used social engineering to get the guy running jobs to dive and hit the Big Red Halt button. Each of those events caused the Computer Center to be offline for more than a week. And an entire computer center at a company where I worked got completely fried including three mainframes due to a lightning strike right at the pole outside the Center. The senior manager had resisted spending the $5 million required for a motor generator to isolate the computers from the world. We had 400 engineers twiddling thumbs for two weeks. He got a new job.

  • @gabrote42
    @gabrote42 3 years ago +2

    I haven't watched a video like this ever. Saving it for arguments. Thanks!

  • @bbq1423
    @bbq1423 3 years ago +322

    Wouldn’t it be better to call them functions instead of instructions at this point?

    • @jjoonathan7178
      @jjoonathan7178 3 years ago +167

      Needs a RUNDOOM instruction.

    • @allmycircuits8850
      @allmycircuits8850 3 years ago +37

      @@jjoonathan7178 At least IDDQD seems plausible, integer divide quads by double, store results as double :)

    • @oldxuyoutube1
      @oldxuyoutube1 3 years ago +54

      They have their own implementation circuitry, therefore they should be called instructions, and this is also one of the most important features of the x86 ISA: we make a complex operation into an instruction to shorten the execution time and make the program smaller.

    • @yadt
      @yadt 3 years ago +27

      @@oldxuyoutube1 well, there is microcode...

    • @microcolonel
      @microcolonel 3 years ago +1

      No, because they are not functions; maybe you could call them routines but not functions.

  • @MoosesValley
    @MoosesValley 1 year ago

    Appreciate the tour. Did quite a lot of Assembly coding in my earlier years, and quickly grew to love it - it's a lot of fun when you get up and running, but you need to keep so much more information in your brain / at your finger tips compared to higher level languages.

  • @SimGunther
    @SimGunther 3 years ago +49

    EIEIO
    I know it's a PPC instruction, but still...
    Seriously, the craziest ASM instructions are the ones not documented in any of the instruction manuals, but are only found by the sandsifter program (written by xoreaxeaxeax)

    • @sebastiaanpeters2971
      @sebastiaanpeters2971 3 years ago

      Any proof for your second claim?

    • @danyildiabin4953
      @danyildiabin4953 3 years ago +12

      @@sebastiaanpeters2971
      ruclips.net/video/_eSAF_qT_FY/видео.html
      ruclips.net/video/ajccZ7LdvoQ/видео.html
      This guy had a few talks about undocumented instructions or whole undocumented cpu hardware blocks

    • @SimGunther
      @SimGunther 3 years ago +7

      @@sebastiaanpeters2971 Any of Chris Domas' talks around unlocking God Mode or breaking x86 should suffice

    • @StannyObelisk
      @StannyObelisk 3 years ago +12

      Old McDonald had an assembler, EIEIO.

  • @furyzenblade3558
    @furyzenblade3558 3 years ago +1

    Woa, high quality video, I love it! And the 3d visuals really help to represent the instructions

  • @Chrisuan
    @Chrisuan 3 years ago +4

    Found this randomly in my suggestions. Insane content, great stuff. As a C++ programmer this assembly stuff scares me lol

    • @GogiRegion
      @GogiRegion 3 years ago +2

      I’ve never done programming in assembly on any newer hardware, so to me I always thought of assembly operations as stuff like move this to there, add, subtract, compare two registers, so even as someone who’s used assembly this is absurd to me.

  • @NogCube
    @NogCube 3 years ago +1

    I love your style bro! This is a great one. 👌
    Back to 2000.

  • @lx2222x
    @lx2222x 3 years ago +4

    Very cool video with very good animations, pls continue making these videos 👍, I just love ur channel

  • @SaHaRaSquad
    @SaHaRaSquad 3 years ago +49

    Not gonna lie, string comparison on the instruction set level actually sounds pretty useful. Not a fan of the absolutely insane arguments though.

    • @WhatsACreel
      @WhatsACreel 3 years ago +3

      Yes, they are magnificent instructions!! Assembly can be super fiddly to code, but very powerful if you have the time to make sure it is correct.

    • @Gulleization
      @Gulleization 3 years ago

      Yeah, as an accountant by profession I still wonder how mathematical reconciliation of bank statements and checking accounts can be so complicated to program and usually buggy.
      I guess that last instruction combined with machine learning techniques really could speed up the process.

    • @SaHaRaSquad
      @SaHaRaSquad 3 years ago +10

      @@Gulleization You absolutely don't want machine learning near anything that requires accurate numbers. ML has its place but it isn't nearly as useful or reliable as the hype often makes it appear.

    • @somdudewillson
      @somdudewillson 1 year ago

      @@SaHaRaSquad It depends on the type of ML. Neural networks are generally fuzzy, but there are lots and lots of other kinds of machine learning implementations, and some of them work very well for accurate numbers.

    • @jgunther3398
      @jgunther3398 1 year ago

      it would only be four or five instructions in a loop. but if it was four or five times faster and all you did was compare strings, very valuable!

  • @salainen6850
    @salainen6850 3 years ago +13

    PEXT is so useful! I can finally get the correct bits from a 4X 1R 1G 1B 1I 8-bit color buffer to the "layers" in mode 12h easily!

    • @WhatsACreel
      @WhatsACreel 3 years ago +4

      Mode 12h? Are you coding EGA? That's awesome!

    • @salainen6850
      @salainen6850 3 years ago +1

      @@WhatsACreel Yup! I think I should also do something on UEFI though, as it gives higher resolutions.

    • @ivanbrezina7632
      @ivanbrezina7632 3 years ago +1

      Also DES, RC4 and other cyphers based on Feistel's schema would be ridiculously slow without this.
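
To illustrate the PEXT use case that opens this thread, here is a small software model of the instruction plus a hypothetical packed-to-planar helper. The bit layout (intensity in bit 0, blue in bit 1, green in bit 2, red in bit 3 of each pixel byte) and the helper name are assumptions made for the example, and the output bit order is not adjusted to what the VGA hardware expects.

```python
def pext(value: int, mask: int) -> int:
    """Model of parallel bit extract: gather the bits of value selected by
    mask and pack them contiguously into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1           # clear that bit and move on
    return result


def extract_plane(pixels: bytes, bit: int) -> int:
    """Take 8 one-byte pixels and return the 8 bits of one colour plane.
    Pixel 0 ends up in bit 0 of the result (an assumption of this sketch)."""
    word = int.from_bytes(pixels[:8], "little")
    return pext(word, 0x0101010101010101 << bit)
```

One call per plane then replaces a loop of shifts and masks over the eight pixels; for instance `pext(0xABCD, 0x0F0F)` gathers the two selected nibbles into `0xBD`.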

  • @louistournas120
    @louistournas120 1 year ago +1

    It is great having a visual of these operations.
    Intel had once made an app that showed how each SSE instruction worked. I used that to learn and to write assembly code.

  • @OzoneGrif
    @OzoneGrif 3 years ago +48

    I wonder which language compilers are able to detect these patterns and use the ASM operand instead of doing the slow imperative way.

    • @WhatsACreel
      @WhatsACreel 3 years ago +37

      I love Clang! It does a lot of optimisations. You might have to use intrinsics, but these things are available in C++. Best way to know if the compiler is using decent instructions is to disassemble and check what it's doing. Or use the ‘Godbolt Compiler Explorer’ website.
      I don't think there's any compilers that are better at applying these instructions than humans. The gap is narrowing, and maybe one day, we'll get AI compilers that can do these things better.

    • @OzoneGrif
      @OzoneGrif 3 years ago +6

      @@WhatsACreel Right, I guess the best bet would be to use/create libraries providing these functions as interfaced tooling; the libraries making use of ASM internally if possible (since it depends on the CPU type)

    • @Winnetou17
      @Winnetou17 3 years ago +6

      @@WhatsACreel AI compilers that can do things better than humans! NEVER! Maybe just faster... (insecure human signing off)

    • @mthf5839
      @mthf5839 3 years ago +2

      @@Winnetou17 I might be wooshing rn, but there are quite a few examples of AI doing better than humans. Google has some wild stuff for recognizing numbers from blurred photos for its street view stuff.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago +5

      @@OzoneGrif please no more abstraction by library interfaces at low level. It is a nightmare, i say let good programmers handle this.

  • @alienrenders
    @alienrenders 3 years ago +21

    Is it bad that I've used most of these and consider them perfectly normal? Glad you didn't get into OS level instructions that set up descriptors and gates. Now those are weird.

    • @keokawasaki7833
      @keokawasaki7833 3 years ago +8

      bruh that shit fucks with my head, i tried getting into it but then the whole GDT, protected mode, gates and shit just knocked the air out of me by punching my brain in the balls (figuratively)

    • @ethanpayne4116
      @ethanpayne4116 3 years ago

      considering these instructions normal is like knowing the difference between the ruddy northeastern gray-banded ant and the ruddy northeastern gray-striped ant. The world of CISC is truly a jungle

  • @davannaleah
    @davannaleah 3 years ago +4

    I remember the old Intel 8085 had some hidden instructions that we used in our projects. We knew they would not be changed because the instructions were used in some of the development tools for the MDS (Microprocessor Development System). There were instructions like LDHLISP with an 8-bit offset parameter. Basically it was "Load the HL register Indirectly with the Stack Pointer with the offset added"; it was essential for writing re-entrant code (in 8085 assembler!). BTW this was way back in 1980!

  • @gazehound
    @gazehound 10 months ago

    Wow that carryless multiplication instruction took me straight back to my Information & Coding Theory class.

  • @SvetlinAnkov
    @SvetlinAnkov 3 years ago +8

    @Creel, I love how you slipped in DNA nucleotide bases in the string match example 😃

    • @allmycircuits8850
      @allmycircuits8850 3 years ago +6

      As soon as genetic scientists move from Excel to ASM, we are DOOMED!

  • @FlorianEagox
    @FlorianEagox 3 years ago

    I love that I can tell how much fun you were having with this!

  • @intvnut
    @intvnut 3 years ago +3

    Carryless multiplication also comes up in error correcting codes and checksums. And, of course, it can implement INTERCAL's unary bitwise XOR if you multiply by 3.

    • @intvnut
      @intvnut 3 years ago +1

      Hmm... my other comment about PEXT got deleted, probably because I included a link. PEXT implements INTERCAL's _select_ operator. And I believe PDEP can implement INTERCAL's _mingle_ operator. It's good to see Intel catching up with the amazing INTERCAL language!

  • @patrickpinholt
    @patrickpinholt 1 year ago +1

    Fun to hear about the rarely seen instructions 🎉🎉🎉

  • @realhet
    @realhet 3 years ago +75

    PUNPCKLQDQ is sad and disappointed not being able to get on the list ;D

    • @juliankandlhofer7553
      @juliankandlhofer7553 3 years ago +28

      Gesundheit.

    • @WhatsACreel
      @WhatsACreel 3 years ago +21

      I am sorry, PUNPCKLQDQ... :( if we do a follow-up video, I will be sure to include the unpacking instructions in that :)

    • @realhet
      @realhet 3 years ago +18

      @@WhatsACreel I remember doing an 8x8 16-bit matrix transpose for a jpeg decoder with only 8 sse regs and 2 memory temp 'regs' with these crazy-named instructions. It was so satisfying when it finally started working correctly. :D

    • @WhatsACreel
      @WhatsACreel 3 years ago +11

      @@realhet Wow!! Things were certainly tough when we only had 8 regs :)

    • @SuperSmashDolls
      @SuperSmashDolls 3 years ago +13

      At some point that stopped being an x86 instruction and started being a DooM cheatcode.

  • @islandfireballkill
    @islandfireballkill 3 years ago +180

    I wonder how complicated it would be to try to formulate compiler autorecognition for instruction selection for these. That last one is easily a couple hundred lines of C code.

    • @FinaISpartan
      @FinaISpartan 3 years ago +90

      Very complicated. Most of these optimizations are often missed by C compilers and have to be manually implemented in assembly. In some cases (video de/encoding) up to 50% of the codebase has to be rewritten in asm for these reasons.

    • @Abu_Shawarib
      @Abu_Shawarib 3 years ago +56

      Your only hope is to use a library that already has fast paths coded in assembly to do this for you.

    • @jfwfreo
      @jfwfreo 3 years ago +30

      The best way to do this would be to implement these as compiler intrinsics that would then be substituted with the correct ASM instructions.

    • @bootmii98
      @bootmii98 3 years ago +1

      @@jfwfreo what if some other arch doesn't have them? most compiler suites support at least one other architecture.

    • @jfwfreo
      @jfwfreo 3 years ago +10

      @@bootmii98 Most compilers for x86/x64 (including GCC and Microsoft) already support a boatload of compiler intrinsics for SSE and all sorts of things.

  • @desmond-hawkins
    @desmond-hawkins 1 year ago +4

    About *CMPXCHG* being "absolutely bizarre" (6:22), this is not only used for mutexes and semaphores as explained, but is also the most common primitive used for "lock-free" concurrent data structures (see for example Doug Lea's amazing ConcurrentSkipListMap implementation). It is so useful that many languages export it in some core library, like <atomic> in C++ or java.util.concurrent in Java. Most programs you use every day likely rely on it or its equivalent in another architecture, unlike some of the other weird instructions listed in this video.

    • @michaelcederberg7937
      @michaelcederberg7937 1 year ago +1

      And it is not very useful as presented where all operands were registers. You want to execute this on a piece of memory.
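
As a rough illustration of the lock-free pattern discussed in this thread, here is a Treiber-style stack built on a compare-and-swap retry loop. It is a model only: `AtomicRef` is a hypothetical class whose internal lock stands in for the atomic memory operation that a locked CMPXCHG provides, so the sketch shows the algorithm's shape rather than real lock-freedom.

```python
import threading


class AtomicRef:
    """Hypothetical memory cell offering load and compare-and-swap."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()  # models hardware atomicity only

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, desired) -> bool:
        with self._lock:
            if self._value is expected:
                self._value = desired
                return True
            return False


class Node:
    __slots__ = ("value", "next")

    def __init__(self, value, next_node):
        self.value = value
        self.next = next_node


class LockFreeStack:
    def __init__(self):
        self._head = AtomicRef(None)

    def push(self, value) -> None:
        while True:
            old_head = self._head.load()
            node = Node(value, old_head)
            # Publish the node only if the head has not moved meanwhile.
            if self._head.compare_and_swap(old_head, node):
                return

    def pop(self):
        while True:
            old_head = self._head.load()
            if old_head is None:
                return None  # empty stack
            if self._head.compare_and_swap(old_head, old_head.next):
                return old_head.value
```

The compare-and-swap acts on the shared head cell, which matches the point above that the instruction earns its keep on memory operands. In languages with manual memory management the same loop also has to deal with the ABA problem, which this sketch avoids only because Python keeps nodes alive while they are referenced.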

  • @kingtutthefirst
    @kingtutthefirst 3 years ago +3

    I've always loved the absurdity of the PA-RISC2 instructions SET/RESET/MOVE TO SYSTEM MASK and the PSW E-bit. By changing it, you change the endianness of the entire CPU... And, because of pipelining, the instruction has to be followed by 7 palindromic NOP instructions. That's just always cracked me up.

  • @Molotom
    @Molotom 1 year ago

    You have great energy and enthusiasm in this video! Keep it up :)

  • @adamengelhart5159
    @adamengelhart5159 3 years ago +42

    The other day I learned about the POLY instruction on the VAX. That's POLY as in polynomial, so when I heard of it I thought "well, I guess there could be a use for it in numerical apps, maybe? It's not like it's going to be more than a few coefficients. Maybe a cubic; that's only four."
    I was only off by twenty-eight! That's right--the VAX can, with a single terrible opcode, compute the value of up to a thirty-first degree polynomial, to either float or double precision.

    • @TotalImmort7l
      @TotalImmort7l 3 years ago +3

      Isn't assembly strangely awesome?

    • @romannasuti25
      @romannasuti25 3 years ago +2

      ...wouldn't a 31 degree polynomial just smash the value to negative infinity, positive infinity, or zero? What the hell is even the use of that lol

    • @juanthehorse420
      @juanthehorse420 3 years ago +7

      @@romannasuti25 nope, if you need to do some crazy ass Taylor series or something and just look at a certain portion

    • @meneldal
      @meneldal 3 years ago

      @@juanthehorse420 Outside of bragging about computing Pi faster, is there any use for 10+ long Taylor series in practice?

    • @MsHumanOfTheDecade
      @MsHumanOfTheDecade 3 years ago +1

      @@meneldal approximating any function with nicer ones and then being able to calculate that fast on the fly can be useful, though most of those often-used functions have fast instructions themselves at this point.
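
What a POLY-style instruction computes can be sketched with Horner's rule, which needs one multiply and one add per coefficient. The function below is an illustration, not a model of the VAX's exact rounding behaviour; the coefficient order (highest degree first) is chosen for the example.

```python
import math


def poly_horner(x: float, coeffs: list) -> float:
    """Evaluate a polynomial at x. coeffs lists the coefficients from the
    highest-degree term down to the constant term."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


# Degree-7 Taylor polynomial of sin(x) around 0, highest degree first.
SIN_TAYLOR_7 = [
    -1 / math.factorial(7), 0.0,
    1 / math.factorial(5), 0.0,
    -1 / math.factorial(3), 0.0,
    1.0, 0.0,
]
```

Near zero, `poly_horner(0.5, SIN_TAYLOR_7)` tracks `math.sin(0.5)` closely, which is the "approximate a function over a small range" use case raised in the replies: the polynomial only needs to be accurate on the interval you care about, so it does not matter that it blows up far from it.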

  • @SERVOPUNK
    @SERVOPUNK 3 years ago +1

    Ah, now I have a solution for the task of making any x86 compiler author cry in 15 minutes.

  • @soonts
    @soonts 3 years ago +5

    addsubps was probably made for complex numbers packed into these vectors.
    mpsadbw and the similar psadbw indeed were made for video codecs, to estimate errors. You should avoid mpsadbw because it is too slow, but psadbw is good.
    I think the craziest of them are for cryptography, like aeskeygenassist or sha1rnds4. Good luck explaining what they do.
    Other notable mentions are insertps (SSE 4.1; inserts a lane into vector + selectively zeroes out lanes; I used it for lots of things), pmulhrsw (SSSE3; hard to explain what it does but I used it to apply volume to 16-bit PCM audio), and all of those from the FMA3 set (easy to explain what they do, that’s ±(a*b)±c in one instruction for float numbers, but the throughput is so good).

    • @WhatsACreel
      @WhatsACreel 3 years ago +2

      Great points mate! Cheers for watching :)
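
The complex-number remark about addsubps at the top of this thread can be made concrete with a small lane-by-lane model. The instruction subtracts in even lanes and adds in odd lanes, which is exactly the sign pattern of (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The helper names and the use of Python lists for the four lanes are choices made for this sketch; the comments name the instructions that would typically do each step, as an assumption rather than the only possible sequence.

```python
def addsubps(a: list, b: list) -> list:
    """Lane model: even lanes compute a - b, odd lanes compute a + b."""
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]


def complex_mul_packed(p: list, q: list) -> list:
    """Multiply two pairs of complex numbers stored as [re0, im0, re1, im1]."""
    re_dup = [p[0], p[0], p[2], p[2]]       # duplicate real parts (movsldup)
    im_dup = [p[1], p[1], p[3], p[3]]       # duplicate imaginary parts (movshdup)
    q_swapped = [q[1], q[0], q[3], q[2]]    # swap re/im within each pair (shufps)
    t1 = [x * y for x, y in zip(re_dup, q)]          # [a*c, a*d, ...]
    t2 = [x * y for x, y in zip(im_dup, q_swapped)]  # [b*d, b*c, ...]
    return addsubps(t1, t2)                 # [a*c - b*d, a*d + b*c, ...]
```

Without the mixed add/subtract, the same result would need an extra sign-flip or blend step, which is presumably why the instruction exists.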

  • @raz0229
    @raz0229 3 years ago +1

    I like how high-level language programmers are like; _"Wot?! all that to get the product of two numbers?!?"_

  • @spacewolfjr
    @spacewolfjr 3 years ago +3

    Creel, you are most excellent!

  • @Erizo_
    @Erizo_ 1 year ago +1

    I never knew i needed this, until now.

  • @jimviau327
    @jimviau327 3 years ago +82

    I'm no programmer but it appears to me that programming these instructions into a CPU is just about as complicated and fascinating as quantum physics.

    • @BrightBlueJim
      @BrightBlueJim 3 years ago +12

      Then you don't know much about quantum physics. The point is that these instructions were added because doing these operations (which are needed in very specific cases) in software is otherwise very inefficient. In fact, in a microcoded CPU, they aren't that difficult to implement. If you really had to do these things in "hardware" (i.e., dedicated logic gates), that would be a whole lot of square microns.

    • @mage3690
      @mage3690 2 years ago +1

      @@BrightBlueJim man, what a day and age we live in, to have real estate measured in microns! I'm only 20 years old, and I'm already living in the future. Imagine what the _actual_ future holds!

    • @Roxor128
      @Roxor128 1 year ago +1

      @@mage3690 Microns? If you want an actual comparison to real estate in terms of cost for high-end parts, you're going to want something a little bigger. Your unit will be the nanohectare (10mm^2). Your typical big complicated chip will therefore be around 20-40 nanohectares in size and will have cost Intel or AMD the equivalent of buying 20-40 hectares of actual land to develop.

    • @jgunther3398
      @jgunther3398 1 year ago +1

      half of the fun was all the bizarre "words" that mystified everybody else. it made you feel special. it's not as complicated as it looks. abstracting the problem into code is harder

    • @jimviau327
      @jimviau327 1 year ago

      @mage3690 here is a hint. In the future you will be a borg. With NeuralLink, all will be connected to the WEB and our reality will be online. Disconnecting from it would represent another phase of consciousness. Then, you will be able to experiment with 5 phases of consciousness , sleep, awake, dream, WEB and illumination. The latter being the most fantastic of all.

  • @adamwieckowski6082
    @adamwieckowski6082 3 years ago +2

    pmaddwd is my all time favorite instruction. Totally priceless for video coding!
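For readers wondering why pmaddwd is so handy in video coding: it multiplies eight pairs of signed 16-bit values and adds adjacent products into four 32-bit sums, which is the core of a dot product or filter tap. Below is a minimal sketch using the SSE2 intrinsic `_mm_madd_epi16`; the function name `dot_i16` and the multiple-of-8 length restriction are assumptions made for illustration, not anything from the video.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */
#include <stdint.h>

/* Dot product of two int16 arrays.
 * Assumption for this sketch: n is a multiple of 8 and the
 * result fits in 32 bits. */
int32_t dot_i16(const int16_t *a, const int16_t *b, int n)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* pmaddwd: 8 multiplies, then adjacent products are added,
         * giving 4 signed 32-bit partial sums. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* Horizontal sum of the 4 lanes. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Calling `dot_i16(a, b, 8)` on two 8-element arrays does all eight multiplies with a single pmaddwd.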

  • @jerradn
    @jerradn 3 years ago +4

    I felt like I had to clean my glasses several times during this video, haha.

  • @Ale-bj7nd
    @Ale-bj7nd 1 year ago +1

    I always forget how beautiful assembly is.

  • @overcritical304
    @overcritical304 3 years ago +3

    Honestly, 2 days ago I was trying to figure out what the hell MPSADBW does! Love you Creel, I hope you will make videos with in-depth explanations of these instructions.

    • @WhatsACreel
      @WhatsACreel  3 years ago +1

      Hahaha, that's awesome! Thank you for watching :)
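For anyone else puzzling over MPSADBW, a plain scalar model may help. As I understand the 128-bit form, it takes one 4-byte block from the source and computes eight sums of absolute differences against eight overlapping 4-byte windows of the destination, with the immediate selecting the offsets. The sketch below is a reference model written for illustration (the name `mpsadbw_ref` is made up), not the instruction's official pseudocode.

```c
#include <stdint.h>

/* Scalar model of 128-bit MPSADBW (hypothetical helper).
 * imm8 bits 1:0 pick the 4-byte source block (offset 0, 4, 8 or 12);
 * imm8 bit 2 picks the destination start (offset 0 or 4). */
void mpsadbw_ref(const uint8_t dst[16], const uint8_t src[16],
                 int imm8, uint16_t out[8])
{
    int src_off = (imm8 & 3) * 4;
    int dst_off = ((imm8 >> 2) & 1) * 4;

    for (int i = 0; i < 8; i++) {          /* eight sliding windows */
        uint16_t sum = 0;
        for (int j = 0; j < 4; j++) {      /* four bytes per window */
            int diff = (int)dst[dst_off + i + j] - (int)src[src_off + j];
            sum += (uint16_t)(diff < 0 ? -diff : diff);
        }
        out[i] = sum;
    }
}
```

This is the kind of sliding comparison used in block-matching motion search, which is why the instruction looks so oddly specific.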

  • @WhaleMilk
    @WhaleMilk 3 years ago +1

    gonna admit, I don't know a lick of Assembly, but I enjoy trying to decode what anything here means while also listening to this dude's voice. Very entertaining

  • @JohnCLiberte
    @JohnCLiberte 3 years ago +6

    Just imagine pitch meetings to decide which instructions should go in the set :D. I'm surprised they don't have a 'calculate your taxes and clean the house' instruction

    • @boptillyouflop
      @boptillyouflop 3 years ago +3

      These instructions do have a couple really solid selling points: (1) they don't write to multiple registers (2) they don't do special memory accesses (3) they don't cause any weird special interrupts.

  • @HowieDue416
    @HowieDue416 1 year ago

    This video makes no sense to me, but my uncles used to code in assembly language. It just truly gives me awe and appreciation for the pioneers who used this language (WITHOUT DEBUGGING) and makes me see them in a new light as men of math.
    Thanks for humbling me and thanking god that there are higher level languages

  • @colinstu
    @colinstu 3 years ago +5

    that glow around the bright text on dark background is driving my eyeballs crazy.

    • @WhatsACreel
      @WhatsACreel  3 years ago +1

      Noted! Thanks for letting me know and cheers for watching :)

    • @colinstu
      @colinstu 3 years ago +2

      @@WhatsACreel interesting vid / instructions nonetheless. but yeah, the glow reminds me of when my eyes are wet from crying, I kept having to pause and rub my eyes to "dry" them only to see it's still foggy looking lol.

    • @WhatsACreel
      @WhatsACreel  3 years ago +3

      @@colinstu Ha! I felt the same way while making it! I toned down the glow from 6 to 2.5. It was still hard to look at, but I’d already rendered half the animations, so had to settle. I’m hoping to use animations resembling construction paper in the future. They are very easy to look at, but more time consuming to create. We will have to see how we go.

    • @snoozy355
      @snoozy355 3 years ago

      @@WhatsACreel what software did you create your animations in?

  • @saudude2174
    @saudude2174 1 year ago

    PEXT made me laugh for some reason. Don't know if it's the particular tone you explained it in or the absolute (seemingly for my stupid brain) randomness and bizarreness of this operation, but I love it.
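PEXT looks less random once it is written out as a loop: it walks the set bits of a mask from low to high, picks the matching bits out of the source, and packs them together at the bottom of the result. Here is a portable software sketch of that behavior; `pext_soft` is a made-up name, and real code on BMI2 hardware would use the `_pext_u64` intrinsic instead.

```c
#include <stdint.h>

/* Software model of PEXT (parallel bit extract).
 * For every set bit in mask, from least to most significant,
 * copy the corresponding bit of src into the next free low bit
 * of the result. */
uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    uint64_t out_bit = 1;                      /* next output position */

    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit */
        if (src & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                      /* clear that mask bit */
    }
    return result;
}
```

For example, with mask `0xFF00FFF0` the call pulls three nibbles and one byte out of the source and squeezes them into the low 20 bits, which is handy for unpacking bit fields or chess bitboards.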

  • @Nesetalis
    @Nesetalis 3 years ago +4

    Very interesting, though I don't speak Assembly I recognized a lot of the terms (C/C++ here). However, that glow effect you're using drove my eyes batty.

  • @lt3880
    @lt3880 1 year ago

    This was in my recommendations dozens of times in the last year. I finally watched, and I don't know what to do with this information.

  • @xymaryai8283
    @xymaryai8283 3 years ago +17

    god, not even cryptographers would bother figuring these instructions out nowadays. no wonder RISC instruction sets are so much faster for the same electrons, they don't need to snake around the dark winding alleys of the ALU

    • @mduckernz
      @mduckernz 3 years ago +3

      They absolutely do, though. Crypto nearly exclusively is written in assembler, and prioritises code that always takes the same amount of time to execute (to prevent timing attacks), and code that also otherwise doesn't leak state (the amount of time something takes to execute is a leak, but if it's always the same you can't extract any data from it)
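To make the timing-attack point concrete: a naive comparison returns as soon as it finds a mismatch, so its running time reveals how many leading bytes matched. A constant-time comparison touches every byte regardless. The sketch below shows the usual pattern in C; `ct_equal` is a name chosen for illustration, and a compiler is in principle free to transform this code, which is one reason production crypto often drops to assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare two buffers without an early exit.
 * Returns 1 if all len bytes are equal, 0 otherwise.
 * The loop always runs len iterations, so the time taken does not
 * depend on where (or whether) the buffers differ. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);   /* accumulate any difference */
    return diff == 0;
}
```

The same idea, never branching on secret data, is what drives the hand-written assembly mentioned above.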

  • @craigix
    @craigix 3 years ago

    How did the youtube algorithm know that I'd like this combination? I'm legit impressed.

  • @reirei_tk
    @reirei_tk 3 years ago +4

    Honestly it's amazing how much work PCMPxSTRx can do in 3 or 4 clock cycles.

  • @UnkownUnkown01
    @UnkownUnkown01 1 year ago

    As weird as this video is, I never enjoyed a video so much. I think it's just the enthusiasm this guy has... damn, I wish everyone who made videos like this would have that same enthusiasm. If you're reading this, thanks; I can't remember the last time I liked a video this much.

  • @monad_tcp
    @monad_tcp 3 years ago +91

    This made me realize that X86 is more abstract than the C language; each of those instructions is like 4 or 5 lines of C.

    • @andersjjensen
      @andersjjensen 3 years ago +60

      Now imagine having to teach a compiler to take your 5 lines of C code.... and figuring out which of the five thousand different x86 instructions is the perfect fit :P

    • @codahighland
      @codahighland 3 years ago +21

      That's the opposite of more abstract. Being more abstract means you have tools that are more general-purpose in order to handle a variety of different uses. These instructions are not abstract; they are intended for specific purposes and aren't especially useful at all otherwise.
      Consider that these instructions are actually implemented as microcode inside the CPU -- miniature programs built out of primitive building blocks.

      @sunnohh 3 years ago +9
      @sunnohh 3 года назад +9

      @@codahighland i guess what he is really trying to say is that x86 is so bloated you can implement the same thing a billion different ways

      @codahighland 3 years ago +3
      @codahighland 3 года назад +3

      @@davestephens3246 Was the ad hominem even necessary? I wasn't judging. I was just giving information.

    • @monad_tcp
      @monad_tcp 3 years ago +1

      @@codahighland "they are more general-purpose in order to handle a variety of different uses"
      That's why I said what I said: "X86 is more abstract than C".
      x86 has lots and lots of complexity. The instruction set has lots of arguments and things that happen in some states and not in others, and the instructions are variable length.
      So the instructions can be used for lots of different purposes, with different modes, different registers, and so on.
      The fact that the instructions are actually implemented as microcode should be more than enough evidence that assembly is more abstract than the machine itself.
      Assembly is much more complex than the abstract machine that defines C and which you program to.
      C is basically a macro-assembler for the PDP11. X86 is a monster next to it; it can do much more. You can finely control memory load/store ordering and do lots of abstract things that you can't even do in C, like barriers, for example.
      One practical example: there are SIMD instructions where a single instruction will do an entire for loop with a sum and a comparison to a variable, but in a register. 4 or 5 lines of C become a single instruction in x86, and the compilers know how to translate that. But because you can't even declare data parallelism in C, the compilers pretty much have to guess; otherwise the CPU would be idling, because C programs are sequential. What we care about is how data relates to itself, not the control flow of the program; the CPU couldn't care less about it (speculative execution for the win!). All because C has less abstraction power than the machine itself.
      C is really, really outdated.

  • @swoopskee
    @swoopskee 3 years ago +1

    Whoa, this is some premium content right here, thank you! Subbed and notifications on.

  •  3 years ago +4

    It's like watching golden globes for nerds

  • @notamouse5630
    @notamouse5630 1 year ago +1

    CMPXCHG16B is used for atomic operations required by lock free and block free queues.
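The reason a double-width compare-and-swap matters for lock-free structures is that it lets you update a value and a version counter together, so a stale reader cannot mistake a recycled value for an unchanged one (the ABA problem). The sketch below shows the same retry loop scaled down to a single 64-bit word, packing a 32-bit value with a 32-bit tag, so it runs with plain C11 atomics. CMPXCHG16B would play the same role for a 64-bit pointer plus a 64-bit counter. All names here (`pack`, `tagged_store` and so on) are invented for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A 32-bit value and a 32-bit version tag share one 64-bit word. */
static inline uint64_t pack(uint32_t value, uint32_t tag)
{
    return ((uint64_t)tag << 32) | value;
}
static inline uint32_t value_of(uint64_t word) { return (uint32_t)word; }
static inline uint32_t tag_of(uint64_t word)   { return (uint32_t)(word >> 32); }

/* Atomically install new_value and bump the tag.
 * If another thread changes the slot first, the compare-exchange
 * fails, reloads the current word into old, and we retry. */
void tagged_store(_Atomic uint64_t *slot, uint32_t new_value)
{
    uint64_t old = atomic_load(slot);
    uint64_t desired;
    do {
        desired = pack(new_value, tag_of(old) + 1);
    } while (!atomic_compare_exchange_weak(slot, &old, desired));
}
```

Because the tag changes on every successful store, a compare-exchange against an old snapshot fails even if the value part happens to be the same again.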

  • @neolordie
    @neolordie 3 years ago +32

    When the recommendations are about instruction sets, you know you are on another nerd level.

    • @paulmccartney2327
      @paulmccartney2327 4 months ago

      I've never seen more embarrassing artwork in my life

    • @paulmccartney2327
      @paulmccartney2327 4 months ago

      honestly you should just delete everything again for like the third time

    • @paulmccartney2327
      @paulmccartney2327 4 months ago

      @@neolordie >was

  • @douggale5962
    @douggale5962 3 years ago

    To be honest, I 3/4 expected this to be a dumb list, but I was pleasantly surprised that you actually know some stuff!

  • @DogsRNice
    @DogsRNice 3 years ago +5

    Of all the thousands of videos I’ve watched this is the one that went farthest over my head

  • @SmoothCode
    @SmoothCode 3 years ago

    lol you sound so excited about this. I caught your enthusiasm for the subject through the video too. I hope you make more intros to understanding wth assembly language is and how it works in relation to the microprocessor and bitwise operands; that would be really helpful for struggling CS students.

  • @haraldsbaumanis
    @haraldsbaumanis 3 years ago +7

    It would be very interesting to talk to the people who designed these chips

    • @GogiRegion
      @GogiRegion 3 years ago +3

      I’m just imagining that the entire design team for #1 probably goes into extreme PTSD flashbacks any time they see the letters PCMP anywhere near STR. I just can’t imagine what the proposal was like that led to the instruction being considered.

  • @distrologic2925
    @distrologic2925 3 years ago

    Love how excited he is constantly

  • @mojeimja
    @mojeimja 3 years ago +3

    I can not imagine a compiler that utilizes these fully! Use asm, optimize by hand!

    • @soonts
      @soonts 3 years ago

      I agree it’s borderline impossible for compilers to emit them automatically. I saw clang’s auto-vectorizer emitting vpshufb but that was very simple code.
      I disagree about ASM. All these instructions can be used in C or C++ as compiler intrinsics, way more practical.

    • @mojeimja
      @mojeimja 3 years ago

      @@soonts Yes, but if someone can understand and use intrinsics properly, then they can just write the entire function in ASM too (right there inside the C code). So it's not about how exactly to use them; it's about using them efficiently at all.

    • @soonts
      @soonts 3 years ago

      @@mojeimja The code I write often has both SIMD and scalar parts, interleaved tightly.
      Modern compilers are quite good at scalar stuff, they abuse LEA instruction for integer math because it’s faster, and do many more non-obvious things. Just because they suck at automatic vectorization doesn’t mean they suck generally.
      For SIMD code, manually allocating registers, and conforming to the ABI (i.e. which registers to backup/restore when doing function calls) is not fun.
      With intrinsics, the compiler takes care about these boring pieces.

  • @ChrisM541
    @ChrisM541 3 years ago +2

    Fascinating! Thanks for this, a few questions...
    1) How aware are today's compilers of these instructions? - e.g. fully? partially? poorly?
    2) How good a job do they do in making use of them? (will there always be scenario permutations out with the compiler's standard translation/reduction definitions?)
    3) Is it fair to assume that these instructions are always faster than stringing together their equivalent single function, datatype/bit size equivalents? (loaded question?)
    4) Given today's and tomorrow's CPU and GPU speeds, do you think compiler engineers are really all that concerned/have only a passing incentive with the level of CPU and memory optimisations absolutely critically required in an era not that long ago, where e.g. virtually every commercial game was hand coded in assembly using, literally, every CPU and memory optimisation trick possible?