computers suck at division (a painful discovery)

Поделиться
HTML-код
  • Опубликовано: 30 ноя 2022
  • I tried to take on a simple task. I TRIED to do a simple assembly problem. But, the flaws of the ARM architecture ultimately almost defeated me. Computers suck at division, and there's a few reasons for that. Division is so hard for computers, that the ARM processor core didn't have a divide instruction until 2004. Even now, certain ARM Cortex M series processers don't have the divide instruction.
    So then, how do the processors do division? Watch the video and find out ;)
    🏫 COURSES 🏫 Learn to code in C at lowlevel.academy
    📰 NEWSLETTER 📰 Sign up for our newsletter at mailchi.mp/lowlevel/the-low-down
    🛒 GREAT BOOKS FOR THE LOWEST LEVEL🛒
    Blue Fox: Arm Assembly Internals and Reverse Engineering: amzn.to/4394t87
    Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation : amzn.to/3C1z4sk
    Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software : amzn.to/3C1daFy
    The Ghidra Book: The Definitive Guide: amzn.to/3WC2Vkg
    🔥🔥🔥 SOCIALS 🔥🔥🔥
    Low Level Merch!: lowlevel.store/
    Follow me on Twitter: / lowleveltweets
    Follow me on Twitch: / lowlevellearning
    Join me on Discord!: / discord
  • НаукаНаука

Комментарии • 2,4 тыс.

  • @LowLevelLearning
    @LowLevelLearning  8 месяцев назад +58

    Wanna learn MORE awesome stuff while helping out the channel? Go get a FREE month of Skillshare using my link: join.skillshare.com/learn-1/?coupon=AFF30D23

    • @LithiumDeuteride-6
      @LithiumDeuteride-6 5 месяцев назад +1

      The "Degrees" class unambiguously requires floating-point numbers.

    • @user-qh5br9jl9g
      @user-qh5br9jl9g 4 месяца назад

      What about the neon DSP code? That handles div. Wide accumulated Also.

  • @sprytnychomik
    @sprytnychomik Год назад +6817

    Other solution: don't use Fahrenheits.

    • @LowLevelLearning
      @LowLevelLearning  Год назад +844

      Imperial Unit Gang

    • @PRIMEVAL543
      @PRIMEVAL543 Год назад +356

      The good old freedom units that you got forced on by the uk cause nothing stands mor for freedom than the British empire.
      Did you learn about picas and points btw? And can you tell me how many miles 1 pica + 3 points are?

    • @renakunisaki
      @renakunisaki Год назад +694

      @@PRIMEVAL543 about 8 football fields give or take an eagle or two

    • @NinjaRunningWild
      @NinjaRunningWild Год назад +29

      F that!

    • @devinnall2284
      @devinnall2284 Год назад +81

      For Non-Americans an easy way to understand Fahrenheit is to think of them as a percentage anything below 40% is cold 40% to around 75% are nice medium temperatures and anything above that is hot as hell!

  • @snoupix3332
    @snoupix3332 Год назад +2469

    I love how us, programmers, are spending hours and nights to not use an existing external lib just to say "I've made it myself" 🤣

    • @themalaysiandude3903
      @themalaysiandude3903 11 месяцев назад +176

      nothing ever beat than making your own function all day instead of just copy pasting

    • @OhhCrapGuy
      @OhhCrapGuy 10 месяцев назад +294

      I think the real reason we do this isn't because we don't like libraries, it's because we want to understand.
      There's so many times that I've written something myself, fully understood the entire problem, gotten the bugs worked out, and as soon I was entirely satisfied with my understanding, I simply immediately replaced my solution with a library, because I don't want to maintain any of that nonsense!

    • @JonJenkins1982
      @JonJenkins1982 10 месяцев назад +49

      If you do that, you acquire skills that easily translate to other realms of computational problem solving.

    • @OhhCrapGuy
      @OhhCrapGuy 10 месяцев назад +31

      @@JonJenkins1982 not only that, but even for higher level programmers, every time you deal with bit twiddling and such, you learn to think that way even better.
      Case in point, I had mostly figured out the solution for this only a few moments after he pointed out that ARM M0 cores don't have division capabilities, or at least close to it. I was thinking the solution was overflowing the irrelevant bits out of the multiplication register, but this is effectively the same thing, just putting them into a word that we simply ignore.
      It's not that I, or anyone else, have some sort of unique insight into how it works, I never would have thought of that earlier in my career. It's just that being exposed to these sorts of cool hacks over time gives you the tools you need to come up with new bit hacks when you encounter a problem.

    • @ricardocasimiro6424
      @ricardocasimiro6424 10 месяцев назад

      agree

  • @JustAnotherAlchemist
    @JustAnotherAlchemist Год назад +413

    The generalization of this concept is called "division by multiplicative Inverse" or less commonly "multiplication by the reciprocal" and is a relatively common practice when you have a division where the divisor is a constant. It's talked about a lot in low-level books, like "Hacker's Delight." Basically, you precompute the division by encoding it as a reciprocal. It's fatal flaw is you have to know what you are dividing with from the start.
    To go farther, you have to use something like the Newton-Raphson method for division. This is the same as the above, in that it finds the reciprocation quotient. The first major difference is that it is a run time operation. That is, it FINDS the needed reciprocal during execution, no precomputation. The second major difference is that it uses an approximation to division _as a function_ to get close to the correct result, then uses Newton-Raphson method to refine that guess to as many digits as you desire. For 32bits, I find that 3 iterations is plenty.
    Incidentally, all of this is possible because floating point/fixed point number systems encode multiplication and division _into_ the number itself, in the same sense that having a sign bit encodes addition and subtraction into a number. Floating point, specifically, also encodes exponentiation into the number, which is why it is the usual goto number system.
    ... The comment claiming this to be called "Division by Invariant Multiplication" is the first time I have ever heard it called specifically by that name, and has just one research paper with that exact title... Wikipedia calls it "multiplication by [the] reciprocal" which someone responding to said comment linked to as well. multiplication by the reciprocal is a basic math operation, and gets you a lot more hits than "Division by Invariant Multiplication." It's also certainly not complicated to understand, as this ~5 min video has clearly demonstrated.

    • @idonnow2
      @idonnow2 11 месяцев назад +9

      Based informative comment

    • @rickroller1566
      @rickroller1566 5 месяцев назад

      Doesn't Newton's method use division? What am I missing?

    • @JustAnotherAlchemist
      @JustAnotherAlchemist 5 месяцев назад +1

      @@rickroller1566 en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
      ...specifically, you want "Xi =(2-DXi) ... which can be calculated from Xi using only multiplication and subtraction, or using two fused multiply-adds. "

    • @pulakgautam3536
      @pulakgautam3536 4 месяца назад +1

      you rlly are an alchemist

    • @Fujui
      @Fujui 2 месяца назад

      Why dont you divide by 1.8

  • @fireball2275
    @fireball2275 Год назад +2902

    I like how it changed to night with the easy bit at the start

    • @LowLevelLearning
      @LowLevelLearning  Год назад +345

      FINALLY someone noticed that XD

    • @fireball2275
      @fireball2275 Год назад +68

      @@LowLevelLearning legend

    • @Xevos701
      @Xevos701 Год назад +22

      The Firestorm core in Apple M1 has a dedicated Divider ALU.

    • @foobar1500
      @foobar1500 Год назад +22

      @@Xevos701 The Firestorm implementation of integer division is surprisingly efficient, but it's nonetheless 2.5-4 times worse in performance than the corresponding multiplication.

    • @DavidGarcia-se3ko
      @DavidGarcia-se3ko Год назад +9

      Multiply by 555, then bit shift

  • @Finkelfunk
    @Finkelfunk Год назад +6216

    What you discovered here is something called "Division by Invariant Multiplication". That topic is so insanely complicated that a good number of people have written dissertations in computer sciences about this stuff.
    Long story short: It makes division go vrooooom on your processor.

    • @null6482
      @null6482 Год назад +87

      Where can I find more info about this

    • @solomontan1524
      @solomontan1524 Год назад +281

      This comment should be pinned so everyone can learn about the name for this concept. Very interesting stuff

    • @adamrak7560
      @adamrak7560 Год назад +193

      if you want something even more crazy, look up fast inverse square root algorithm

    • @christopherbroms2508
      @christopherbroms2508 Год назад +88

      @@null6482 en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant

    • @okkoheinio5139
      @okkoheinio5139 Год назад +11

      @@MustacheMerlin wow, that sounds cool

  • @jensknudsen4222
    @jensknudsen4222 Год назад +1786

    This brings back painful memories of 8-bit computing. Consider yourself lucky to have a multiply instruction.

    • @svampebob007
      @svampebob007 Год назад +152

      I remember building a 8 bit cpu, the division circuit was by far the longest to figure out.
      I started with obviously +-, *... then tried for months to figure out how to do division (without obviously looking for the answer online) once I realized how you could do division with substractions and additions it took me just a couple of weeks to deal with IO and W/R to ram.

    • @linuxization4205
      @linuxization4205 Год назад +49

      just add the number multiple times if you need multiplication.

    • @stinchjack
      @stinchjack Год назад +49

      I dabble with Z80 assembly. Multiplication is a charm.
      e.g. multiply HL register by 10
      push de
      ld d, h
      ld e, l
      add hl, hl
      add hl, hl
      add hl, de
      add hl, hl
      pop de
      I've written something like that so often by now its just muscle memory :D

    • @daishi5571
      @daishi5571 Год назад +25

      Motorola 6809 (1978) has a multiply command, takes register A and multiplies that by register B then combines the A & B registers to make a register D (which is 8+8 bit so 16 bit) and outputs the multiplication result.

    • @robertherndon4351
      @robertherndon4351 Год назад +12

      Ah, tables of squares and difference-of-squares multiplication methods. And then there are Newton-Raphson approximations for computing reciprocals so you can pipeline divisions using multiplication. All good fun...

  • @filipesaz
    @filipesaz Год назад +184

    For those of you who, like me, had some difficulty understanding his explanation.
    His final calculation is something like: C = (F - 32) * 5 * Const / (2^32); where Const = ceil((2^32) / 9);
    He divides by 2^32 by grabbing the highest 32 bits of a little-endian architecture number, because it would be equivalent to bit shifting the 64-bit positive number 32 times to the right.

  • @chsovi7164
    @chsovi7164 Год назад +1925

    "I didn't want to use an incredibly low level library, because it felt like cheating" Been there before, I know where this road leads. I wish you luck on your future breadboard CPU project!

    • @MayorMcC666
      @MayorMcC666 Год назад +193

      Either breadboard or TempleOS, one of two roads

    • @mikicerise6250
      @mikicerise6250 Год назад +24

      RNJesus speaks to me.

    • @EmptyZoo393
      @EmptyZoo393 Год назад +102

      Seriously, there is a reason high level languages exist. If you can let the compiler and the language take care of your branch tables for you, you can save a lot of development time. These days, the only reasons to work with assembly are for SIMD optimization, extremely optimized library functions (often involving SIMD), and occasionally interrupt routines.
      There are occasional times with small micro-controllers like this where you need to use assembly, but code the compiler creates is typically roughly equivalent to what you'd create, and often better. Don't underestimate the power of auto-loop unrolling and function inlining. The linker can take all sorts of shortcuts that you wouldn't think to.

    • @NoX-512
      @NoX-512 Год назад +37

      @@EmptyZoo393 What's the fun in that?

    • @TheDavidlloydjones
      @TheDavidlloydjones Год назад

      @@mikicerise6250
      Do you speak Aramaic? What has He been telling you?
      Lemme guess. "Donald Trump has been sent to save you all" maybe?

  • @Soupie62
    @Soupie62 Год назад +894

    5 / 9 is roughly 569 / 1024. Dividing by multiples of 2 is an Arithmetic Shift Right.
    Multiply by 569, shift the result Right by ten bits. Multiply by 36409, shift 16 bits...
    and 2330169 with a shift of 32 bits merges nicely with your code.

    • @VictorCampos87
      @VictorCampos87 Год назад +183

      Sir, this is what I call an elegant solution. Easy to understand. Easy to code. Easy to process. And give the most precise results available for this architeture (I think).

    • @Kyle-ke5fx
      @Kyle-ke5fx Год назад +147

      @@VictorCampos87 This is exactly why they don't have these operations on the processor. Because skilled embedded developers don't need it and it would make it too expensive or large for the applications it was built for. People that do this every day generally don't need it.

    • @joshuahudson2170
      @joshuahudson2170 Год назад +20

      @@Kyle-ke5fx What do you do when a variable divide comes up?

    • @TrackedHiker
      @TrackedHiker Год назад +25

      @@joshuahudson2170 do it the same way we do long division by hand in decimal

    • @joshuahudson2170
      @joshuahudson2170 Год назад +3

      @@TrackedHiker That's a guess and check pattern.

  • @kensmith5694
    @kensmith5694 Год назад +50

    As someone who does microcontroller stuff, I often do that sort of thing. There is even a method that allows you to effectively divide by a variable amount. It is based on two observations:
    1) A/B = (N*A)/(N*B) so you can pick an N to make the denominator something easier to deal with
    2) For X very small 1/(1+X) = 1-X this allows you to deal with any number near a power of two without a divide.

  • @Olav_Hansen
    @Olav_Hansen Год назад +46

    As soon as you aren't working with whole numbers, most humans start to break at division as well.
    90/9 is something most people can do, 90/8 becomes a bit tricky to some, but 90/7.6 will require most people to pull out a writing sheet to 'cache partial results', and even makes us struggle past this point.
    And this is while we are the creatures that learn how to divide things evenly with others even before we learn math.

    • @farfa2937
      @farfa2937 10 месяцев назад +16

      I'd say that most people (including myself) would actually no even try to do 90/7.6 by themselves at all, sheet or not.

    • @nerobernardino88
      @nerobernardino88 5 месяцев назад

      @@farfa2937 My solution would be: "90/7.6=Fuckyou"

    • @llllNEOllllchannel
      @llllNEOllllchannel 4 месяца назад +9

      90/7.6=900/76

    • @ss3nm0dn4r8
      @ss3nm0dn4r8 3 месяца назад

      the big difference here is that 8 and 9 have 1 digit while 7.6 has two not the fact its not a whole number if you say divide by .9 now its a cakewalk for most people again

    • @Olav_Hansen
      @Olav_Hansen 3 месяца назад

      @@ss3nm0dn4r8 I wouldn't say most would find it easy again (althought definitely easier then 90/7.6), but that's also because the solution is a whole, natural number again.

  • @razorstone3088
    @razorstone3088 Год назад +844

    bro's just casually programming some ARM assembly. love it man stuff like this makes me motivated to learn low level stuff even though it looks kinda scary to me

    • @LowLevelLearning
      @LowLevelLearning  Год назад +111

      Do it!

    • @Yas-gs8cm
      @Yas-gs8cm Год назад +15

      I think it's a waste of time beyond knowing "what is assembly and how it works + basics"
      (Note: I, myself, know the "basics" I mentioned above. It's logical that "knowing" is always superior comparing to "not knowing"...)

    • @andrewdunbar828
      @andrewdunbar828 Год назад +22

      Learning it will improve your high-level coding. Worth it!

    • @LuskyMJ
      @LuskyMJ Год назад +82

      @@Yas-gs8cm Once you learn assembly your thought process never becomes the same again. Even if you work with higher level languages you still constantly think about: "What happens once this gets converted to assembly". It helps you develop new ways to solve problems that you wouldn't have if you never coded in assembly.

    • @ImperatorZed
      @ImperatorZed Год назад +9

      @@Yas-gs8cm this isn't even the basics though, this video is super simple stuff

  • @richardericlope3341
    @richardericlope3341 Год назад +1327

    Even as late as the the Nintendo DS ARM 9, we needed to use fixed point mathematics. The hardware registers themselves accepted only fixed point values.
    Same with trancendental trig functions. We had to use lookup tables in BRADIAN(Binary Radian) units interpolated on the fly.
    Fun times.

    • @circuit10
      @circuit10 Год назад +25

      Did you program DS games?

    • @raymundhofmann7661
      @raymundhofmann7661 Год назад +40

      Fixed point or fractional integers can in fact do adequate calculations for most practical things and in signal processing, it just gets increasingly burdening on the programmer to decide where the decimal point is put best for each number represented in the program to not lose precision or keep the S/N.
      Consequently, it is easier for the programmer to have this done at runtime with floating points, but also more wasteful and also possibly numerically more unstable/indeterministic when dealing with feed back or iteration over data sets.
      Scientific calculations or a FFT (and other transformations) are examples where floating point calculations may be needed.
      In a FFT for example all data points in the time domain may contribute to only one data point in the frequency domain and vice versa, floating point may be the better choice to deal with the bigger dynamic of the transformed compared to the source data, compared to just increasing integer bit precision overall.

    • @tomcombe4813
      @tomcombe4813 Год назад +45

      Yep one of the hardest problems in computer design has always been finding a good way to implement division.
      There have been some creative solutions, like in the cray-1 supercomputer which would instead compute a reciprocal that you could then multiply to do your division. But often, designers just omit division all together from their CPUs.

    • @grimonce
      @grimonce Год назад +2

      Well this was just division... By constant....

    • @declanmoore
      @declanmoore Год назад +18

      The DS also has a divider unit, which makes up somewhat for the lack of support by the CPU :p

  • @RokeJulianLockhart.s13ouq
    @RokeJulianLockhart.s13ouq Год назад +8

    This is why, whenever you're programming, and realize a solution that seems like cheating, do it. Cheat.

  • @johnnyvsx
    @johnnyvsx Год назад +9

    I used this same technique years ago. Another way of thinking about it is that you’re converting your fraction to a close fraction which has a power of 2 as the denominator since that will just be a bit shift operation. It was an old intel 8088 microprocessor. It had a divide opcode, but execution took over 160 clock cycles. The real-time code couldn’t finish processing one block of data before the next arrived. Converting all the multiply & divides into multiply & shifts saved the day.

  • @watsonwrote
    @watsonwrote Год назад +171

    I love how this highlights how division is fundamentally different from the other basic operations. It opens up a whole new domain of numbers and we have to contend with that.

    • @goldenhate6649
      @goldenhate6649 Год назад +16

      The mathematical proofs for multiplication and division is what made me hate math. I am an chem engineer by education and environmental by trade. I love equations, but screw any sort of proof. Hated every second of every math class I took in college for that reason. I would have loved to complete matrix theory, (still have nightmares there), but proofs made me quit. I think I made it less than a month. Kicker is, the class even said it WASN'T going to have proofs.

    • @Your_choise
      @Your_choise Год назад +27

      In a sense, subtraction also opens up a new domain as it gives negative numbers.

    • @coopergates9680
      @coopergates9680 Год назад +7

      @@goldenhate6649 Pretty much, I do recreational math all day, but forget deep dives into multivar calculus or other long routes by hand just to demonstrate a simple conjecture

    • @weirdlyspecific302
      @weirdlyspecific302 10 месяцев назад +3

      It simply isn't. Subtraction does the same thing.

    • @antoinem3659
      @antoinem3659 6 месяцев назад

      @@weirdlyspecific302 but division open up more

  • @jameshogge
    @jameshogge Год назад +256

    A couple of extra notes:
    - This division method (multiply+shift) is attempted by compilers whenever the divisor is known at compile time. It isn't used to divide by an unknown number as the constants can't be predetermined.
    - Another alternative to implementing these things in hardware is to translate the instruction into micro ops on the cpu. (I can't say for certain which instructions do this but I believe all exponential/trig functions usually do.) I believe the reciprocal is usually calculated in hardware so a division might be converted to this plus a multiplication
    And yes, regular divisions can take 100s of cycles

    • @weetabixharry
      @weetabixharry Год назад +15

      There are a lot of iterative algorithms for computing divisions. My favourite is successive approximation: x = a/b, we rearrange to x.b = a. We then *guess* the value of x 1 bit at a time, by setting each bit to '1' from MSB to LSB and checking if x_guess.b is larger than a. If it is smaller (or equal), then we leave that bit set. If it is larger, then we reset that bit to '0'. This algorithm works nicely for a lot of monotonic functions with simple inverses (e.g. division --> multiplication, square root --> square, and there is even a clever trick for calculating logarithms this way).

    • @weetabixharry
      @weetabixharry Год назад +9

      Of course, that algorithm is slow, but I think it is elegant. In digital hardware, it needs N clock cycles to calculate an N-bit result (but uses an incredibly small amount of silicon). In a CPU, much longer.

    • @absalomdraconis
      @absalomdraconis Год назад

      @@weetabixharry : I wonder if anyone's tried to shorten it with a "binary search" or priority encoder?

    • @weetabixharry
      @weetabixharry Год назад

      @@absalomdraconis There are plenty of ways of doing it faster in hardware, it just costs more transistors. Depending on the clock frequency, it may even be possible in 1 clock cycle. The best solution depends on the use case.

    • @weetabixharry
      @weetabixharry Год назад +3

      @@absalomdraconis And it already is a binary search (because we're setting 1 bit at a time, not incrementing the value +1 for each guess).

  • @mervmartin2112
    @mervmartin2112 5 месяцев назад +5

    At the machine level, computers don't subtract, multiply, or divide. They add, compliment and shift. So, in binary, shift the dividend left past the radix as many times as you want significant digits. Two's compliment the divisor and add it to the dividend until the quotient sign bit is negative. Shift the quotient right the same amount you shifted the dividend left. Viola! The decimal is not called decimal in any other number base but 10. It's the radix in any number base.
    The video was great! Good job digging into the software that deep!!!!

  • @NothingButFish
    @NothingButFish Год назад +11

    Taking me back to applying newton's method for calculating square root in MIPS assembly as a programming assignment in an introductory computer architecture/CS class. I think more than anything else it taught me to respect all of the crazy shit that's happening at the lowest levels (although of course it goes even lower).

    • @jasonenns5076
      @jasonenns5076 2 месяца назад

      MIPS is dead! The company responsible for MIPS now is RISC-V.

  • @gapplegames1604
    @gapplegames1604 Год назад +204

    My favorite trick is to take some huge power of 2, let’s say 1024. Turn 5/9 into the closest fraction with 1024 as the denominator. In this case you would get 570/1024. Now all you have to do is multiply your binary number by 570, and then move the decimal 10 places to the left. I’m sure you guys know way better ways to do this, but this was my trick when I had to write code in assembly for my computer science class.

    • @gapplegames1604
      @gapplegames1604 Год назад +32

      Just realized this is almost exactly what you did in your video!

    • @user-dh8oi2mk4f
      @user-dh8oi2mk4f Год назад +12

      @@gapplegames1604 This is exactly what he did in the video

    • @meneldal
      @meneldal Год назад +4

      @@gapplegames1604 Depending on the architecture and the number of bits you have there are tradeoffs with precision/max value you can accept. Since 32 bit multiplies tend to be faster (on cheap hardware), something with a lower bit shift like your example (but most likely 16 to use tricks with using just high/low word) has its uses in some situations. If you're stuck with an 8-bit AVR, you're definitely not doing 64 bits multiplies (and probably not 32 either).

    • @gregorymorse8423
      @gregorymorse8423 Год назад +2

      Except multiplying by 590 reduces the domain and causes overflow in cases the better method doesn't. You usually like to write crappy buggy code, without consideration of any problems or limitations or domain restrictions. Yes we know. Hopefully nobody is paying you for such garbage tho

    • @rileyalden723
      @rileyalden723 Год назад +44

      @@gregorymorse8423 jesus christ bro chill, he's explaining how he did it "for my computer science class", he's just working with the closest possible precision he can given the limitations of practical time constraints and diminishing returns. For the purposes of understanding how assembly works, the goal of most courses that are going to require a basic understanding of this kind of math, this is a perfectly sufficient method.

  • @Ma1ne2
    @Ma1ne2 Год назад +218

    Loving this type of video format where you just tell a short story about something you have been working on yourself and learning something new! Would love to see more of this format!

    • @LowLevelLearning
      @LowLevelLearning  Год назад +14

      Thank you! I really had fun with this one

    • @enriqueamaya3883
      @enriqueamaya3883 Год назад

      Jesus Ioves youx[]cpzx[c]

    • @MrZcar350
      @MrZcar350 Год назад

      And has the magic number (at least in the 32-bit version) 0x5F3759DF.

  • @EvilidelRio
    @EvilidelRio Год назад +7

    In the not-so-good old days, on an IBM mainframe and before IEEE math, we've got a lot of fun adding 100 events each one with a weight of 1/100... just to discover after many hours reviewing the code that 1/100 is a periodic fraction in binary! So we changed and weighted 128 events.

  • @MrEnyecz
    @MrEnyecz Год назад +94

    It is nice to see young kids discovering the wonders of the world! :) What if I told you that the famous 6502 CPU which powered such machines like the C64 (and many-many others) had not only no division but no multiplication too. And it was still able to rule the 80s.

    • @kc9scott
      @kc9scott 10 месяцев назад +9

      When I was using Apple II, I wrote a macro library for 16- and 32-bit int operations like add/subtract/multiply/divide. But when I got to wanting to do floating point from assembler, I cheated and just bought a floating-point processor board.

    • @whannabi
      @whannabi 5 месяцев назад +1

      You simply have to implement that in software as far as I know. It will just be slower but that's It.

    • @karmatical5837
      @karmatical5837 5 месяцев назад +1

      ​@@whannabiwell, by deciding to work with floating points, you kinda accepted the fact that you will sacrifice memory and performance.
      After some time studying hardwares, specially old ones, i got why its that so fundamental and common.

    • @TheNvipy
      @TheNvipy 5 месяцев назад

      Traditionally, we used log/exp approach on 8bit/16bit. Hitachi and Mips RISC cpu work around the problem. On mips, multiplication and division are carried out in parallel and store the result in specific registers. Hitachi "sh" cpus provides a division by part, we can set the required precision: one instruction to initialize, another to perform a step.

    • @MrEnyecz
      @MrEnyecz 5 месяцев назад

      ​@@whannabiYes, of course, it was Turing complete. I just said that it was even more primitive, but still a CPU. Getting surprised that some CPU has no division shows that you're way too young. :)

  • @evan7721
    @evan7721 Год назад +406

    Once you described how that weird arbitrarily large number was found/represented, it reminded me of the Fast Inverse Square Root function in Quakes engine that tells the computer to cast a float as an int with no conversion

  • @lonegecko
    @lonegecko Год назад +144

    I've got some bad news... looks like the Cortex M0, M0+ & M1 don't support the SMULL instruction (32-bit multiplication with 64-bit result). You only get the lower 32 bits. If you set the compiler flag `-march=armv6-m` in Godbolt, you'll see it calls a library function for division (`__aeabi_idiv`) instead.

    • @nahco3994
      @nahco3994 Год назад +15

      Yup, the same thing tripped me up when I was fooling around with some playful attempts to do baremetal stuff on a Raspberry Pi 3B, which has a Cortex A53 I believe. I didn't bother to include or link any libraries, because I was only doing very basic arithmetic and pointer shit anyway, outputting over UART. I was quite surprised that I was getting an error referencing this function call, in a place where I fully expected a plain integer division to be.

    • @foobar1500
      @foobar1500 Год назад +7

      I strongly doubt if he actually needs to convert millions of Fahrenheit to Celsius. For a reduced range a 32-bit multiplication with a 16-bit constant and a 16-bit shift is enough for correctness.

    • @Carewolf
      @Carewolf Год назад +1

      Dont code for armv6... That stuff is ancient.

    • @ABaumstumpf
      @ABaumstumpf Год назад +13

      @@Carewolf "Dont code for armv6... That stuff is ancient."
      ah yes - the "ancient" 9 years ago... cause nobody would ever use something that was first released that long ago............................................................ yeah no.

    • @renakunisaki
      @renakunisaki Год назад +6

      @@foobar1500 hopefully anyone working with temperatures over 32767 degrees (in any scale) is already using Celsius or Kelvin 🤪

  • @brianfox7067
    @brianfox7067 Год назад +2

    I can't get out of bed and brush my teeth and this guy is doing this with his time

  • @chrisnatale5901
    @chrisnatale5901 Год назад +1

    Nice solution! I remember reading a book back in the 90's called "Game Programmer's Black Book" that detailed workarounds for faster division performance on the 486's and Pentiums of the day.

  • @KillerSpud
    @KillerSpud Год назад +231

    I've been developing on a Motorola/freescale/NXP DSC family chip for a few years now and I'm always having to do multiply shift approximations. A simple divide operation takes a few hundred microseconds, and when I'm running a one millisecond loop that can blow things out quick.
    This is just one of those cases where close enough really is close enough.

    • @LowLevelLearning
      @LowLevelLearning  Год назад +48

      Yeah I’ll probably do another video for this but it’s crazy that even on modern CISC chips DIV instructions take sometimes up to 12-20 clock cycles

    • @johntrevy1
      @johntrevy1 Год назад +19

      @@LowLevelLearning But then the human brain typically struggles with division the most too. I still can't even get my head around long division lol.

    • @slicer95
      @slicer95 Год назад +12

      @@LowLevelLearning That is because Division is one of the hardest problems in Computer Arithmetic.

    • @UnblockMind
      @UnblockMind Год назад +2

      Maybe we should teach them short division method.

    • @andrewdunbar828
      @andrewdunbar828 Год назад +16

      @@LowLevelLearning If you want to realize you're still coding at a high level even in asm, then try to implement a divide yourself in FPGA (-:

  • @rparl
    @rparl Год назад +36

    The first computer which I programmed in Assembly was the IBM 704, in 1960. There were two floating point division instructions, Divide AND Stop & Divide OR Stop. This had to do with the requirement that the result had to be less than one. So the inputs had to be scaled to make sure.

  • @JaseewaJasee
    @JaseewaJasee 4 дня назад

    the depth you go into each topic is remarkable, truly enlightening!

  • @jeffcauhape6880
    @jeffcauhape6880 Год назад

    Awesome! I have recently decided to go back and learn some assembler, and chose the ARM chip. This is gold! Thank you.

  • @hayfahvytsen
    @hayfahvytsen Год назад +16

    Before the prevalence of div instructions and FPUs this was standard practice in embedded systems and decades of products used fixed point math to do incredible things, including keeping your car on the road in the winter!

  • @nandomax3
    @nandomax3 Год назад +7

    Next time I get lost in a lucid dream, I'll try to divide with an old chipset. If division is supported I'm dreaming

  • @daylen577
    @daylen577 Месяц назад +1

    The thing people often don't understand is that every operation eventually leads to an addition or a subtraction; multiplying is just adding the same number multiple times, and dividing is just looping through a bunch of potential numbers to see which one is closest.

  • @rulee9543
    @rulee9543 Год назад

    This is a very great video. Thanks for making it! I am an silicon engineer developed processors and I am surprised with ARM's brilliant solution! That make die smaller and power efficient!

  • @dr.strangelove5622
    @dr.strangelove5622 Год назад +34

    Really cool video!! It feels kind of great when I get to see someone else passionate about assembly language. This video reminded me about the time I was figuring out on how I should multiply and divide a number in the assembly language of ATMega8A microcontroller (and later with an 8085 processor): I had to use repeated addition and subtraction to get the job done. Later on, when I worked with 8086 microprocessor I got to know that division is an expensive operation (just like LOOP and REPNE instruction in x86 processor. Even in x86-64 family of processors, they are just artificates which no one has ever optimized, so everyone favours JMP instead of LOOP).
    Again, this was a great video! 👍👍

  • @henriksundt7148
    @henriksundt7148 Год назад +49

    If you had divided by a variable instead of 9 in the C source, the true face of division would show up. The code would then have to call a runtime division algorithm instead of multiplying by a constant. The ARM assembler code for efficient division is worth a look.

  • @Autotrope
    @Autotrope 23 дня назад

    I started this on the wrong foot, I had "just use a high level language" yelling in my head until I realized that looking at these from the low level is the point of the video and indeed the channel

  • @HansStrijker
    @HansStrijker Год назад +3

    If you need to do many chained calculations containing different operators, it can be faster to do them as rationals, and only do the final division at the end. Summation and division will end up to be three operations minimally, but it can pay off quite a bit for lower complexity with other operations. Though it can quickly run into rollover issues, so might need to keep an eye out for that, and apply a simplify step (gcd algorithm) where necessary.

  • @menotworking
    @menotworking Год назад +16

    Those of us who work with smaller micros do this on an almost daily basis. Create a multiplier/divisor pair with the divisor a power of 2. 2^8, 2^16 and 2^32 are especially nice because they just involve discarding low bytes. The only caution is to make sure your intermediate multiplication cannot overflow. For "normal" temperatures, a 2^8 divisor is all you need. 5/9 = 142.22, so the multiplier can be 142 or 143. Now you work back to determine the maximum input temperature that will avoid overflow, so 65535/142 = 461 and change. The conversion will work with inputs up to 461 degrees, with an error of about 0.16%. You can also add a rounding factor of 256/2 to the multiply before the 2^8 "division".

  • @maksymiliank5135
    @maksymiliank5135 Год назад +5

    That's actually a pretty cool solution. I thought you were going to talk about floating point division and how inaccurate it is, but this was also interesting to watch

  • @arnabbiswasalsodeep
    @arnabbiswasalsodeep Год назад +1

    Booth's multiplier is so great, it can be even used for division for infinite precision given infinite time as it operates in clock cycles & can be bit banged as well.

  • @owensmith7530
    @owensmith7530 Год назад +4

    The ARM1 processor didn't have a multiply instruction either. Multiply was added for the ARM2 not really for maths but because of how often the C compiler had to calculate addresses (eg array members) and multiply is very useful for that.

  • @SansNeural
    @SansNeural Год назад +14

    Old timer here... visiting you from the days before MUL and DIV instructions in microprocessors. I should be well versed in all manner of integer math operations, but thankfully I've been able to forget (or at least not recall and use) all that for years now. Heck, for some of the industrial machines (Z80 based) I programmed in the '80s we splurged and added a math co-processor card!
    Probably the main takeaway in your journey shown here is that ARM is older than we tend to think. Born in the '80s itself, ARM was radically new (RISC) but still influenced by its times. Also, adding hardware DIV might have been considered antithetical to the new RISC concept.

    • @HappyBeezerStudios
      @HappyBeezerStudios Год назад +1

      And nowadays RISC and CISC aren't really meaningful anymore. The most widely used RISC designs have tons of instructions and extensions and the most widely used CISC designs translate CISC instructions into RISC-like µOps

  • @stevenpersoon
    @stevenpersoon Год назад +11

    I encountered such a thing on a school project with FPGA's. With an FPGA you allocate logic units (AND, OR etc.) to build your own logic circuit.
    Using a single division operation used like a thousand logic units, which is a lot. Instead we used an division algorithm which got it down to under 200. It probably used multiplication and bit shifting, but I'm not sure anymore.

  • @kenhaley4
    @kenhaley4 Год назад +3

    I'm old enough to remember the days when all calculators were mechanical (sometimes electrically powered, but not always). Unless they were the more expensive variety, they did not support division. I specifically remember my first hand-held 4-function electronic calculator (circa 1960) and being truly amazed that it could divide! There was actually a short pause (maybe 1/8 second) before it displayed the answer to a division problem.

  • @Kruimeldief
    @Kruimeldief Год назад

    Thank you for not making this video unnecessarily long like so many RUclipsrs do (8+ minute videos with boring history and whatnot).

  • @edgarbonet1
    @edgarbonet1 Год назад +8

    Nice video, thanks!
    It is worth noting that, even when division is supported by the hardware, it can be quite slow. I once timed the basic floating point operations on a Cortex-A5, which has an FPU implementing the VFPv4-D16 instruction set. Multiplications turned out to be fast (one/two cycles in single/double precision respectively), but divisions took more than seven times longer than multiplications.

    • @zlcoolboy
      @zlcoolboy Год назад

      Shows that ARM was right to forego a division module.

    • @watchm4ker
      @watchm4ker Год назад

      ​@@zlcoolboy That's still faster than doing it in software. This method only works because it's a known constant that can be approximated. Once you divide by a variable, you've got to use much more sophisticated approximations, which will likely take multiple times as long as a hardware divide unit.

  • @adailtonjn
    @adailtonjn Год назад +10

    Amazing content. It would also be interesting videos about division of non-constant numbers, the implementation of a floating point unit (showing how to add and multiply floats) and also the implementation of algorithms like sine and square root. All of that comparing C and assembly.

    • @SolomonUcko
      @SolomonUcko Год назад

      It looks like the library function __aeabi_idiv/__aeabi_uidiv is generally used on ARM to handle division by a non-constant denominator.

    • @alexc4924
      @alexc4924 Год назад +1

      Binary long division isn't as complicated as you might think, though it's fiddly

  • @CosmiaNebula
    @CosmiaNebula Год назад +1

    old Babylon style division: use Newton's method to find 1/y, then multiply x * (1/y)
    This is actually often used on the microcode level, so that the ALU doesn't have to implement one-step division. It trades computation-time for chip-space.

  • @philmccavity
    @philmccavity 5 месяцев назад

    I learned about this in a pure math course but this gives another perspective. Well done!

  • @RoseHayes-321
    @RoseHayes-321 Год назад +17

    I strongly recommend the book "Hacker's Delight" by Henry S. Warren, Jr. It's particularly valuable for anyone doing assembly language programming. I wish it had been around in the 80s when I started out.

    • @ShimonDanilov
      @ShimonDanilov Год назад +1

      I also thought of that immediately, because I remember that in Hacker’s Delight there is this exact algorithm. Such a good book.

  • @dickheadrecs
    @dickheadrecs Год назад +3

    this is the real reason america measures temperature in furlongs. before 2004 you needed to write out and mail your division operations in a postage paid envelope to the centre for general compumentrics and applied calculations. in most cases you could expect to receive a response in 7-10 business days

  • @kevintan5497
    @kevintan5497 Год назад +1

    dividing using assembly was one of my least favorite assignments when i took a class with assembly

  • @jasonyesmarc309
    @jasonyesmarc309 Год назад +1

    I think you've just solved my engineering problem. Thank you!

  • @_dekinci
    @_dekinci Год назад +17

    When we manually tried different division approaches on a computer engineering courses I was fascinated by how hard division acually is. It makes sense though - division is also harder to do on paper.
    P.S. there is also a square root

    • @akallio9000
      @akallio9000 Год назад

      That's why they call it a TRIAL divisor, if it's too big you get to start over again.

    • @Mueller3D
      @Mueller3D Год назад +1

      Division in binary is actually not that difficult, since the (appropriately shifted) divisor either does or doesn't go into the dividend (determined using subtraction). There are even shortcuts you can do to make this fairly efficient. However, it still takes a number of cycles equal to the quotient length.

    • @_dekinci
      @_dekinci Год назад

      @@Mueller3D yeah, but it's much less intuitive than multiplication. It's also longer algorithms with more steps and branching. In multiplication you just go with a naive approach. In division I don't think any approach can be considered naive

  • @svenmorgenstern9506
    @svenmorgenstern9506 Год назад +30

    Welcome to 6502 assembler with more bits! 😱
    Seriously, at the assembly level (especially with RISC type instruction sets) you end up really appreciating how much heavy lifting your compiler does for you. I've never written anything in ARM assembler, but this video takes me back. Thanks for sharing! 👍🍺

    • @markday3145
      @markday3145 Год назад +2

      One of my favorite optimizations had to do with converting timestamps from 32-bit number of seconds to separate year/month/day/weekday/hour/minute/seconds fields on a 6502. The typical approach involves multiple 32-bit divisions. I realized that most of those quotient bits were always zero. So I essentially unrolled those divisions and eliminated all of the trial subtractions that were guaranteed to result in a zero. It ended up needing a total of something like 34 subtractions, resulting in a speedup of 7X.

    • @foobar1500
      @foobar1500 Год назад

      On systems with fixed instruction length even stuff like loading immediates (that is, constants) to registers is quite funky, even on symbolic assembler. On x86 with variable instruction length but no widely available barrel shifter logic available to instructions these are quite literal "move immediate" instructions, but on ARM, complicated constants either take several instructions to load (if not using constant islands), or generate constants in interesting ways...

    • @ArneChristianRosenfeldt
      @ArneChristianRosenfeldt Год назад

      JRISC in 1993 ( Atari Jaguar ) and SH2 in 1994 on Sega 32x and Saturn offered 32-bit division. It has some latency, but did not block the processor. Similar how quake did division on CISC Pentium in 1996 .

  • @paradiselost9946
    @paradiselost9946 Год назад

    This is the bit i love.
    Go deep, go to circuit level... half bit and full adders, shift registers...
    We have paper, pencils, and extra ways of accessing memory.
    A puter is stick with whatever figures are in its amu. It "sees" one digit at a time. It cant see the whole number, it cant go back and "carry the one"...
    Subtraction by addition and complements has always fascinated me...

  • @minhazulislam4682
    @minhazulislam4682 Год назад +1

    0:37 "I felt I was on my way to a quick LOW LEVEL victory" nice one.

  • @Ghoetic
    @Ghoetic Год назад +22

    I usually pre-scale a fraction a with an acceptable binary power, eg 2^8 giving (5/9)*256 = 142, and multiply with that then right shifting the result by 8 to fast divide.

  • @elcaciquedev
    @elcaciquedev Год назад +2

    The first 2 minutes of this video are the reason why they usually make you take higher level math classes in most computer science programs. The lower you go, the more theory you'll need to know. Of which I, myself, don't recall any, so here I am. Great video!

    • @bradhaines3142
      @bradhaines3142 Год назад +1

      i mean all computer programming is math and translation. so some language and math classes would be all you need wouldnt it? something about how to learn new languages before you start actually learning to speak machine

  • @fk319fk
    @fk319fk Год назад

    I think the interesting part of this was when I was coding floating point multiplication and division, I found the algorithm was identical! Where you had + in multiplication, you replace it with -. (not exactly, but surprisingly close, well, at the time)

  • @svenbtb
    @svenbtb 9 месяцев назад

    I was just reading about this in one of the Write Great Code books the other day, it's super interesting how difficult fractions actually are for computers.

  • @capability-snob
    @capability-snob Год назад +31

    This was a triumph!
    FWIW most x86_64 CPUs have floating point division hardware but integer division is implemented using micro ops for newton-raphson method. This bugs me to no end because you normally want high throughput for floating division, so you use n/r for that, but for ints it usually is a single division on the critical path. The difference is huge - unpipelined trial subtraction hardware has an 8 cycle typical latency, but n/r is 26 cycles.

    • @soonts
      @soonts Год назад +5

      On modern AMD CPUs, integer division only takes 2 micro-ops. Still slow though, for 64-bit integers the latency is up to 19 cycles.

    • @capability-snob
      @capability-snob Год назад +1

      @@soonts ooh, interesting! DIV and IDIV timings going back to Bulldozer appear to be for a slim hardware division unit with up to 18 cycles of latency. Intel are still in the high 20s though. Thanks, I've been team red for over a decade but somehow never noticed this.

    • @Carewolf
      @Carewolf Год назад +1

      @@soonts It takes a varied amount of time depending on the value you divide with. Anywhere from 2 to 30 cycles.

    • @torbjorngranlund6299
      @torbjorngranlund6299 Год назад

      Not true. Hardware division makes use of SRT, not Newton's method.

    • @lolerie
      @lolerie Год назад

      @@capability-snob divq on ice lake is fast as hell. That is 128 bit division.

  • @phenanrithe
    @phenanrithe Год назад +8

    The RISC philosophy, love it or hate it 😉 Yes, I saw the interesting multiply-to-divide trick in printf routines to display numbers, where it has to divide by 10 (or powers of 10). I was impressed, it was packed with such little techniques to save cycles (on x86_64).

    • @linuxization4205
      @linuxization4205 Год назад

      Literally the only CISC cpu that is halfway alive and not x86 is the Motorola 680x0 architecture.

    • @phenanrithe
      @phenanrithe Год назад

      @@linuxization4205 It depends which categories you are willing to consider. There are microcontrollers and CPUs derived for ex. from the 6809, 8051 or 6502 and so on which are borderline or plain CISC. Though many designs, including the x86, only have CISC instructions but a RISC architecture, so it's not as straightforward as when the term was initially coined. 😀
      I believe Intel started translating the instructions into micro-instructions in their Pentium series, otherwise they wouldn't have achieved the same level of performance. It was bold and brilliant. Nexgen, which was bought by AMD to save their failing K5 architecture, did the same back then.

    • @HappyBeezerStudios
      @HappyBeezerStudios Год назад +1

      @@phenanrithe It was the Pentium Pro, (P6 or i686), which all modern x86 chips still identify as.
      So yeah, the big x86 CISC architecture uses RISC-like tricks under the hood and only displays itself as CISC on the outside for compatibility's sake.
      And that whole thing was also supposed to be replaced. After IA-16 and IA-32 should be IA-64, but that went to nowhere just like i960, i860 or i432. And obviously Itanium which, as P7 was somewhat supposed to be the successor in 98

    • @boptillyouflop
      @boptillyouflop Год назад

      @@HappyBeezerStudios Yeah... what happened was that IA-64/Itanium was a disaster, it tried to optimize all the wrong things and ended up being somehow SLOWER than x86 if you can believe such a thing.

  • @amigalemming
    @amigalemming 8 месяцев назад +1

    Some architectures like the Motorola 56k DSP have an instruction for doing one step of a division. If you call it often enough it will yield the correct quotient, still sufficiently fast. This can be a good compromise between speed and saving chip space.

  • @LaurenGlenn
    @LaurenGlenn Год назад

    Very cool. Reminds me of the video I saw where they used some exponent or something in Doom / Quake to speed up calculation of 3D back in the day.... Gotta love the ingenuity of programmers stuck with resource constraints.

  • @macmcleod1188
    @macmcleod1188 Год назад +3

    5/9 is a constant. As an assembly language programmer, I'd do the math by hand once and then use that resulting value as a constant.
    Heck, I still do it in java today.
    But you have a good point where you are dividing variable values.

    • @FilipArlet
      @FilipArlet Год назад

      Problem is 5/9 is floating point constant, you can multiply by that, but you need FPU (or something that handles floating points)

  • @embeddedroom
    @embeddedroom Год назад +6

    True story. Encounter this frequently when working with code generators! Good job!

  • @marcoronzani7197
    @marcoronzani7197 5 месяцев назад +2

    Yeah, fixed point is a very effective technique to get around divisions, unfortunately you can't always use it when high precision is required. Luckly the problem does not apply when you have an FPU. A very interesting alternative I suggest you to look into is CORDIC, it computes only 1 bit per iteration, but if your architecture is superscalar that is not a big issue. Furthermore I also read something about CORDIC and loop pipelining at a certain point, thus looking that up could help as well!

  • @Paul-sj5db
    @Paul-sj5db Год назад

    I used to program assembler on the Acorn Archimedes which was the first commercially available computer with ARM4 chips.
    There was a really good algorithm for dividing two 32 bit numbers. It involved about 96 instructions, no loops, 3 instructions to set it up, 30 repeated identical sets of 3 instructions and then a final 3 instructions.
    Those chips didn't have a multiply instruction either so you'd have to multiply by 5 by shifting left 2 and adding to the original.

  • @Optimus6128
    @Optimus6128 Год назад +3

    Multiplication by the fixed point reciprocal is very common trick in retro coding. It might be even a faster way even if the hardware did have division.
    It's just surprising the ARM didn't have it even much later, when even the old 8088 had an integer division instruction.

    • @foobar1500
      @foobar1500 Год назад

      Modern compilers convert integer division by a constant invariably to a multiplication and a (possibly implied) shift on all modern architectures. It is not so much about integer division instruction being slow although it's available, but rather that multiplication units have gotten quite fast in comparison to division implementations, well, actually a decade or two ago. Also, how often one really needs to perform an integer division by an arbitrary, non-predetermined value? I'd say that in large scheme of things, not really that often.
      I still believe that the fact that original 8086/8088 even had a "hardware" division instruction was mostly a question of code density on the time of real shortage of memory back in the seventies and easier entry to market for compilers of high level languages - which were not that fancy back then for microprocessors.

  • @m.hosseinmahmoodi
    @m.hosseinmahmoodi Год назад +18

    I remember when I first started learning to use assembly, I started coding for Game Boy. I decided I'll write a few simple functions to get a feel of assembly, so I wrote a function for Fibonacci sequence then I attempted to write a function for “n!” !!!
    Imagine my surprise when I found out that GBZ80(CPU of Game Boy) doesn't have any instruction for multiplication. I had to write a multiplication function using bit shifting.
    I have used a lot of different CPU's assembly and most of the very old ones (6502, 65C816, …) doesn't have a mul\div instruction; and old ones (8086, 8186) only have integer mul\div; and newer ones (from pentium until now) are very hard to program (if you want to make them fast, that is) because of SIMDs.

  • @LuealEythernddare
    @LuealEythernddare 5 месяцев назад

    I don’t think I’ll go into learning assembly at any point. But I started programming in my 3rd year of college. I first switched over from music education when I learned I didn’t want to be a teacher. I first learned some basics in game maker, then the teacher taught us some simple Visual Basic (it was outdated frankly even back in 2019, but it was a good intro language imo, statically typed yet simple to grasp the syntax). From there the next classes taught C#, sql server, then some basic web dev with html, css, and php. I didn’t finish there, but went on to teach myself python, JavaScript, Ruby, and I’m even trying to pick up some C++ and Go. I transferred my credits to WGU and am trying to finish up my degree and get out there as a developer, and I love that you, primeagen, melkey, and others put great content out there and it helps me to keep myself motivated to finish.

  • @svenprigann2631
    @svenprigann2631 Год назад +2

    I am following your videos quite a while now. Me, holding an masters degree in eletrical engineering and quite a dubious amount programming M0,M0+,M3,M7 ...still amazed by the simplicity of your videos about complicated topics. Keep up the great work =)

  • @abdox86
    @abdox86 Год назад +6

    I remember when I first hit by this wall, but mine was bigger, just imagine ( 8-bit processor, and no multiply instructions)
    It was a PIC mcu, and it almost boiled me to write a floating point 64-bit based division and multiplication macros, it was enjoyable and hard challenge to tackle.

    • @cpK054L
      @cpK054L Год назад

      PIC is trash, move to ARM master race.

    • @abdox86
      @abdox86 Год назад

      @@cpK054L design one and then call it trash, for studying fundamentals pic mcus are the best , ARM is more advanced and complex.

    • @cpK054L
      @cpK054L Год назад +2

      @@abdox86 I used PICs for undergraduate programs and for the price to performance ARM gave me a better result as a general purpose MCU

  • @soberbard
    @soberbard Год назад +4

    i remember making batch scripts to automate certain things and needing floating point except batch doesnt support it... i ended up doing fpm after trial and error without knowing that it had an actual name cus i suck at maths. interesting to see that it has real uses outside of my crappy scripts 😭

  • @thomquiri9860
    @thomquiri9860 9 месяцев назад

    ok I'm pretty new to programmation (like I started this year) and this seemed like an interesting exercise. What I would do to divide by a variable without using the division operation would be just as we teach manual divisions in french elementary schools:
    -1: a loop checking if the number to divide (a) is greater than the number by which we divide (b). If it is then substract the number by which we divise. If it isn't then we're left with the int just bellow the actual answer.
    -2: if we need more precision, we do the same loop, but if a divideBy)
    {
    currentDigit++;
    toDivide = toDivide - divideBy;
    }
    cout >> currentDigit;
    if (i == 0)
    cout >> ".";
    currentDigit = currentDigit *10;
    }
    }
    Keep in mind I've never coded in C in my life, much less in assembly, so low level logic like that isn't really my thing, I'm sure this is inefficient af, but it was interesting to exercise my brain during these school holidays.

    • @LowLevelLearning
      @LowLevelLearning  9 месяцев назад +1

      this is a solid solution! and this is actually how the processor implements the divide instruction under the hood (with some optimizations here and there obviously) the problem is that this takes a TON of clock cycles, so sometimes fixed point division is just faster.
      good job!

    • @thomquiri9860
      @thomquiri9860 9 месяцев назад

      ​@@LowLevelLearning thanks, is that actually how it work? Wow that's something I need to take into account in my future codes when I carelessly put divisions inside loops that are themselves inside loops O_o. That's what I find interesting with learning what the processor actually does when you add a single operation that seems so innocent, you actually come to understand WHY everyone teach do to do/avoid this or that. Because damn that looks like a very inefficient way to divide.

  • @bsandoval2340
    @bsandoval2340 Год назад

    Man you know I never thought dividing would be one of this frustrating problems that seem to just take forever

  • @jasonknight1085
    @jasonknight1085 Год назад +6

    As an old assembly hound, it's fun to see something like ARM. I don't do a lot of RISC stuff in ASM because of the old adage "RISC is for people who write compilers. CISC is for people who write programs"
    On the 8086 and 8088 we too couldn't rely on there even being a FPU, and our MUL and DIV statements were painfully slow -- in excess of 150 clocks EACH.
    An old trick for doing percentages in that case was to do a multiply very like this as a fraction of 65536. What made that "fun" was you ended up with the integer result in DX, and the numerator (65536 as the denominator) in AX. We were further hobbled by there being no MUL reg, immed
    ; assumes temperature in AX
    sub ax, 0x20
    mov bx, 0x8E38 ; roughly 5/9ths 0x10000
    mul bx
    ; DX = integer result, AX = numerator over 0x10000
    Becuase it uses only the one multiply, it takes half the time that MUL 5 and DIV 9 would. Which is good when we're talking turning 300+ clocks into 150.

    • @customsongmaker
      @customsongmaker Год назад

      So true, typical MUL reg

    • @ArneChristianRosenfeldt
      @ArneChristianRosenfeldt Год назад

      What are 8086 and 68000 even doing in all those cycles. The factors are 16 bit as is the ALU. This only makes sense to me if MUL and DIV were just an afterthought which had to be cheap. Indeed I have seen a large stretches of code without any of these instructions. This typical IBM-PC business code does not MUL or DIV. Maybe once to calculate taxes.

    • @jasonknight1085
      @jasonknight1085 Год назад

      @@ArneChristianRosenfeldt A hardware multiply requires a lot... no, A LOT ... of silicon. We're talking as much as 20 times the silicon as an addition. To that end a lot of early processors left it out entirely, or implemented it in "microcode"
      In fact that's exactly what the 8086 / 8088 does, is use microcode. A small bit of code on the processor that pretends to be an opcode, that runs on the other instructions of the processor.
      Even with a hardware multiply the operation is intensive enough to require a lot of clock cycles as multiply and divide -- even just as integer -- are expensive operations. A mul reg16 taking anywhere from 25 to 40 clocks depending on CPU. (laugh is the 486 is slower than the 286 or 386), and thus it wasn't uncommon for stuff like vga video routines to still use shifts instead.
      ; ACCEPTS
      ; AX = Y
      ; CORRUPTS
      ; AX
      ; RETURNS
      ; DI = Y * 320
      xchg ah, al ; == y * 256
      mov di, ax
      shl ax, 2 ; == y * 64
      add di, ax ; di == y * 320
      Runs in about half the clock cycles of a hardware multiply right up through the first pentium. Though still not as fast as a XLAT

    • @ArneChristianRosenfeldt
      @ArneChristianRosenfeldt Год назад

      @@jasonknight1085 still does not explain why anybody would need more than 20 cycles in microcode. With a barrel shifter and some look ahead for zeros all those shift special cases would go away. Zero flag running from msb to lsb for early out (like in 386). Silicon usage goes up with the square of bits. Hence it makes sense to implement 8x8 as MUL ( costs like a 64 bit add ), and then run the microcode over this. 4 cycles for 68k like 16x16 -> 32. ARM7 in the GBA has hardware 32x8 and needs 4 cycles for 32x32.

    • @jasonknight1085
      @jasonknight1085 Год назад

      ​@@ArneChristianRosenfeldt Do you realize just how much silicon an ARMv4 has that an 8086 does not? We're talking about a ~29,000 transistor and ~5000 gate processor here. That ARM7TDMI (the 7 is misleading, it's a v4) is what, 280,000 transistors and 90k or so gates?
      What you describe in silicon alone would be a fifth the die space the 8086 used. JUST for the multiple logic alone.
      Remember, this is an age where a DRAM memory access was 4 clocks per bus width (so a word was 8 clocks on the 8088), and even a simple byte shift by fixed one place took two clocks. Memory shift was 15 + and effective address calculation, and shift reg16 by CL was 8+4n.
      But yeah, once transistor counts grew these problems went away. By the time of the 80186 you're looking at 26 to 33 clocks, and the 286 it's 13 to 26 clocks.
      The 186 was 55,000 transistors, the 286 was 134,000.
      Also worthy of mention is the NEC V20 and V30, drop-in replacements for the 8088 and 8086 respectively, with timings closer to and in some places even faster than the 286. See its flat 21 clocks for all 8 bit multiplies and 26 clocks for 16 bit.
      It achieved this by having over 60,000 transistors, more than double the stock intel part.
      The type of infrastructure needed for the microcode you describe by the time you do both 8 bit and 16 bit mul and div is more silicon than many processors even had when the 8086 started being designed in 1976. That's no joke either, look at the 8080 with its whopping 4500 transistors-- and it was considered a chonky boy on release. The Z80 was around 8500, 6502 was 4500. And IF contemporary processors had a multiple opcode (many did not) they were inherently 8 bit with a 16 bit result (in two registers).
      There's a reason the Motorola 68000 with its amazing fast hardware multiply via its 16 bit ALU needed 68000 transistors (hence its name), an absurd number for the time that took five years to even come close to looking viable outside of big iron use... because manufacturing processes for that many transistors reduced yields resulting in elevated prices.

  • @rodneylives
    @rodneylives Год назад +6

    A breakthrough moment for processor-based math for me was, way back in a college Computer Architecture class, realizing that, internally, computers do addition and multiplication exactly as we do on paper, they just do it in base 2. Changing the base simplifies the operation greatly. There is no deep magic.

  • @pretentiousarrogance3614
    @pretentiousarrogance3614 8 месяцев назад

    How did you get your number by dividing it on that program though is it through a different computer or some sort of stuff a layman wouldn’t know?

  • @austinrosenbaum2848
    @austinrosenbaum2848 Год назад +1

    Nice video! I run into this when writing code for FPGAs. Usually there is lots of multiplier hardware, but rarely is there division hardware. Solution? Everything is a (fixed-point) multiply!

  • @jesseparrish1993
    @jesseparrish1993 Год назад +12

    That's why I stick to multiplication by inverses

  • @user-fc8xw4fi5v
    @user-fc8xw4fi5v Год назад +19

    ah yes, fixed point multiplication with bit shifting! i had to do this once to use trigonometric functions in a Gameboy Advance game i was writing; those processors had no builtin floating point instructions either! cool stuff

  • @AgentChick
    @AgentChick 9 месяцев назад +1

    "I didn't want to use a float library because it felt like cheating"
    Ah yes, a programmer's worst enemy, our own ego.

  • @janniklasbertram9436
    @janniklasbertram9436 10 месяцев назад

    Hey
    Nice video.
    One small note, when I had to do a larger assembly project, I would use precompiler instructions to name as much as I could in my assembly code, especially registers. Made for much more readable code.
    Maybe this could add a bit to your code in the videos, since viewers spent very little time reading the code, they (I) may grasp the meaning of the code way quicker if they know what the registers mean

  • @hoej
    @hoej Год назад +3

    This brought me back to my time at uni when we had some somewhat beefy FPGA units. FPGA is basically programmable breadboard, only you can reprogram them on the fly (which NASA does all the time). You just got to remember that even though the functions are written sequentially they are executed simultaneously. Anyway, those beefy units could hold a whopping THREE division circuits. And then only a few gates were left for everything else. Our first unit did not have space for a single division circuit. It could calibrate and control a six axis gimble, but division was too much.

    • @Cooe.
      @Cooe. Год назад +3

      FPGA's have come a long damn way since you were in school lol. You can recreate entire mid-late 90's machines on modern FPGA's, and not even crazy high end FPGA's either. And not simple machines either, but stuff like the Nintendo N64 w/ its beefy 64-bit MIPS CPU + SGI GPU.

    • @hoej
      @hoej Год назад +1

      @@Cooe. this unit was a Cortex-M with FPGA around it. We used it for pipelining computer vision to identify edges via a modified Hough transform supporting splitting the image into 9 parts after converting input from VGA, then sending the results to the Cortex-M. Has it come further than that in 6 years?

  • @thisisreallyme3130
    @thisisreallyme3130 Год назад +9

    YES - more of this please! I’m learning 6502 (cc65 but no stdlib lib emulation), and this kind of trickery totally relates. 😊

    • @54egg
      @54egg Год назад

      Lookup tables are your friend on 6502 where practical. Check this video which has a link to page (as I recall) that shows how to do 8x8 bit multiply on 6502 with 4 table lookups, 2 8 bit add/subtracts, and a 16 but subtract in under 40 clock cycles, ruclips.net/video/WEliEAc3ZyA/видео.html

  • @thepoliticalstartrek
    @thepoliticalstartrek Год назад

    Number 1 thing I learned in CPU design is early on CPUs at the logic level can do one form of math. Addition . It does everything with addition. It uses flags or registers to determine negative numbers. It also uses counters for multiplication and division.

  • @wardyorgason
    @wardyorgason Год назад

    You can also calculate it using subtraction. IE you first multiply by five. Then count how many times you can subtract 9 before it crosses 0.
    I made the exact same thing in college using the LC3 which also doesn’t have a divide instruction.

  • @_--_--_
    @_--_--_ Год назад +6

    Also important to know is that both integer divide and FPU divide are *very* slow even on architectures that supports it in hardware.
    An integer divide is up to 12 cycles on Cortex M4 and 10-50 cycles on modern X86-64 for example, Skylake needed like 40-100 cycles for integer div just a couple generations ago.
    FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4.
    Also noteworthy is that even architectures such as M0/M0+ can implement external division hardware, such as the case for the RP2040 using M0+ architecture which integer division hardware needs only 8 cycles and therefore is actually faster than standard Cortex M4 for integer division.

    • @ewenchan1239
      @ewenchan1239 Год назад

      If integer ADD/SUB/MUL is so fast, then why isn't it used everywhere?

    • @_--_--_
      @_--_--_ Год назад

      @@ewenchan1239 Can you restate your question, as it doesnt really make any sense. Are you asking why a div instruction isnt substituted with a combination of add/sub/mul instructions? They are, the compiler will try to replace a divide either by a shift, a combination of muls and shifts or adds and shifts, depending on if your target supports mul instructions.
      However this only works if the divisor is a compile time constant. For other cases the compiler most likely will use div instruction unless you tell it otherwise. For example you could use libdivide instead, which can be faster than hardware divide on a lot of targets. Or you could tell it to use its default inbuild algorithm, which most likely will be quite a bit slower than a hardware divider *if* the compiler cant optimize it down, which in a lot of cases it can.

    • @ewenchan1239
      @ewenchan1239 Год назад

      @@_--_--_
      "Can you restate your question, as it doesnt really make any sense."
      You state:
      "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4."
      If your statement above is true, then why isn't it used in lieu of using the ACTUAL divide instruction, ALL the time?
      (You've answered it for the most part in the second half of your reply above.)
      But for said second half of your reply to be true, you would have to know:
      a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)
      and b) you explicitly tell the compiler what to do/how to handle the DIV instructions, which typically means that you would have to know the specific purpose or application that you are programming for, AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient.
      (i.e. you're programming for yourself rather than for someone else to use where the importance of the epsilon value is not known nor what someone else's tolerance limit for epsilon is)

    • @_--_--_
      @_--_--_ Год назад

      @@ewenchan1239
      "You state:
      "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4.""
      I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point. As far as I am aware we are talking about integer arithmetics specifically.
      "a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)"
      Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works".
      Also its often a good idea to rethink your code, often a lot of divisions are unnecessary (especially with floating point, but we are talking integer here), the compiler cant do this for you most of the time. This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance.
      "which typically means that you would have to know the specific purpose or application that you are programming for"
      As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion.
      "AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient."
      I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion.
      All compiler optimizations or replacement software algorithms will give the exact same output as the hardware divider.

    • @ewenchan1239
      @ewenchan1239 Год назад

      @@_--_--_
      "I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point."
      But you also the "cost" (in terms of the number of operations) it takes for FP ADD/SUB and FP MUL on said Cortex M4 as well.
      "FPU divide on Cortex M4 is 14 cycles."
      "As far as I am aware we are talking about integer arithmetics specifically."
      My understanding is that we can be talking about both (given that you can represent floating point numbers as a sum-product of pairs of two integers, can you not? (i.e. the approximation for 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1*2^-6 ~= 0.109375)
      And isn't the mul/shift - it's doing floating point math with integers, isn't it? (unless I am missing something here and/or misunderstanding what @Low Level Learning is talking about here, in this video)
      "Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works"."
      The "won't work" here refers to the substitution or the replacement of the slower div operation with the faster combination of other instructions. That's the "work" that you were talking about it doing.
      Therefore; pursuant to your statement, if you aren't dividing by a compile time constant, then said replacement/substitution won't take place - i.e. the "work" that I am referring to, based on what you wrote, won't take place. Ergo; "won't 'work'" (div operation won't get replaced by "combination of other instructions" (which should help to perform the division faster without using the div operation/instruction).)
      "often a lot of divisions are unnecessary"
      Varies, I would think.
      "This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance."
      Again, I think that it varies. (depends on what you are doing/why you are using division in the first place)
      re: mistakes
      I wouldn't necessarily call using divisions a mistake.
      I agree that this is where a skilled programmer comes in, but if you are learning or you are an unskilled programmer, then that's something that they will have to learn.
      Take for example, the inversion of 1/(1/9).
      If you use exact arithmetic, your E(X) is 9.
      If you approximate (1/9) ~= 1*2^-4 + 1*2^-5 + 1*2^-6 ~= 0.109375,
      then 1/0.109375 = 9.142857142857142....
      Therefore; your epsilon would be 9.142857142857142... - 9 = 0.142857142857142....
      or about 1.6% error.
      And therefore; the question that you would have to ask yourself is "is that 1.6% error significant or not?" In some cases, it might not be. In other cases, it can be VERY significant.
      It really depends on what you are doing.
      (Note that often times, if you are replacing a division operation, I have yet to see someone who advocates for the replacement or substitution of the div operations with a combination of other operations - they don't automatically talk about (nor calculate) their epsilon and let the user know what that epsilon is that arises from said replacement/substitution. I think that if you are going to implement such a kind of replacement (of the div operation) with a combination of other operations, where and whenever possible, you SHOULD be REQUIRED to tell the user what your epsilon is, when appropriate.)
      "As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion."
      That's an assumption on your part though.
      There is nothing that explicitly states this.
      As I said, it is important for people NOT to have it as a key takeaway from watching this video that they are going to implement this everywhere they see because they see it as a speed/performance improvement whilst they fail to recognise that this speed/performance improvement comes at the expense of precision and accuracy.
      Even for assembly code (cf. GROMACS/Folding@Home), maintaining the precision and the accuracy can still be vital for the program to perform the calculations that is asked of it properly and correctly.
      "I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion."
      The replacement of a division operation with a combination of other operations is almost always an approximation of the actual division operation.
      (cf. the example that @Low Level Learning shows above with respect to the approximation of 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1 * 2^-6 ~= 0.109375)
      That IS an approximation.
      That IS what he did and that IS what is shown.

  • @a_soulspark
    @a_soulspark Год назад +3

    very interesting video!
    I'm curious, would it be more efficient to use the libc floating point library or the fixed point method you used?

    • @LowLevelLearning
      @LowLevelLearning  Год назад +6

      For just this example the fixed point will definitely be faster. Converting the number to a float, sending it to the FPU, and then bringing it back will take way more cycles than what we’ve done here. The trade off is accuracy though.

    • @andrewdunbar828
      @andrewdunbar828 Год назад +1

      The fixed point method will be fastest when diving by a constant. The FP library will be best in the general case.

    • @SolomonUcko
      @SolomonUcko Год назад

      Specialized integer division functions (__aeabi_idiv/__aeabi_uidiv) are probably faster than floating-point division, I would assume

    • @richardericlope3341
      @richardericlope3341 Год назад

      Some toolchains emulate those calls. Not great if you want ot fast.

  • @kebman
    @kebman Год назад

    This is, hands down, some of the coolest stuff I've ever seen on RUclips. Ok slight exaggeration. But it was pretty cool. You made division errors look cool, and that's a feat on its own!

  • @zmike9831
    @zmike9831 Год назад

    Amazing how you give explanation as to why it don't have a divide instruction,

  • @devtadeo
    @devtadeo Год назад +3

    I have seen in many of your videos you write a platform specific assembly but then you execute it in your machine, how you do that? (I'm not taking about VMs because u run it from the terminal, for example at the end of the MIPS ASM video or even this one)

    • @LowLevelLearning
      @LowLevelLearning  Год назад +1

      Later versions of the Linux Kernel will automatically invoke qemu for non native architectures.

    • @devtadeo
      @devtadeo Год назад

      @@LowLevelLearning thats awesome

    • @christophneumann4990
      @christophneumann4990 Год назад +1

      @@LowLevelLearning Would you happen to have any resources on this? I'd like to learn more, but can't seem to find the right search term

  • @QuantenMagier
    @QuantenMagier Год назад +13

    It's called RISC (Reduced Instruction Set Computing) for a reason, if you want hardware division operations use a CISC (Complex Inst..) processor or a mathematical co-processor. ;)

    • @GeorgeTsiros
      @GeorgeTsiros Год назад +1

      RISC does not mean that it won't have certain instructions, or that it lacks functionality, or that it is a _limited_ instruction set computer. It just means load/store arch, orthogonal reg file and instr set.

    • @QuantenMagier
      @QuantenMagier Год назад +1

      @@GeorgeTsiros Well it's not limited it's just reduced, but it still is general purpose. ;)
      The idea behind RISC was to improve the efficiency per transistor count and a division unit would mean more transistors with almost no benefit regarding the most programs.

  • @thumbwarriordx
    @thumbwarriordx Год назад

    One time on a high school math test I showed my work in complement division as a protest against showing my work.
    As it turns out I'm not great at math, but I like to flex when I can and I'm pretty sure the math teacher knew about it but had to look it up for a refresher to follow it. Great victory.

  • @ambassadorkees
    @ambassadorkees Год назад

    1985 I did this all the time: floating point in assembler (or even direct hex machine code 💪) on a MC6800.
    Graphics card, automatic adaptive lighting with integration calculus, programmable school bell clock, all applications on the same self designed and soldered Eurocard sized controller.
    Basic components a 6800 + PIA I(O for e.g. relays, numeric multiplexed keyboard and lux sensor.