Is this New Quadratic Formula Actually Faster than the OG Quadratic Formula? (For a Computer)

  • Published: 13 Sep 2024
  • In this video we look at how well Po-Shen Loh's new quadratic formula compares to the original, tried-and-true Quadratic Formula in terms of speed. The new formula from Po-Shen Loh certainly looks simpler, but which is faster for a computer?
    Po-Shen Loh's explanation of his formula: • Examples: A Different ...
    Support What's a Creel? on Patreon: / whatsacreel
    FaceBook: / whatsacreel
    Background images from animations from HDRI Haven: hdrihaven.com/
    Image of Lesley 'Twiggy' Lawson from Wikipedia
    Software used to make this vid: Visual Studio 2019 Community: www.visualstud...
    Blender: www.blender.org/
    Audacity: www.audacityte...
    Davinci Resolve 16: www.blackmagic...
    OpenOffice: www.openoffice...
    Gimp: www.gimp.org/

Comments • 373

  • @gideonmaxmerling204
    @gideonmaxmerling204 2 years ago +691

    Would probably be faster to just return 1 for every entry.
    It would be accurate enough at astronomical scales.

    • @rayniac211
      @rayniac211 2 years ago +39

      Unless your quadratics are astronomical in nature ;)

    • @WitherLele
      @WitherLele 2 years ago +10

      @@rayniac211 then it would be accurate on a bigger scale

    • @RobertLoweIsAwesome
      @RobertLoweIsAwesome 2 years ago

      @@WitherLele ruclips.net/video/ICv6GLwt1gM/видео.html

    • @airkami
      @airkami 3 months ago +1

      You remind me of the algorithm I designed to check an array of positive integers and return true for each value divisible by 3.
      My algorithm:
      Return false
      It is fast, it doesn't contain unpredictable failure, and it is correct most of the time given the set of all positive integers.

  • @josemonge4604
    @josemonge4604 2 years ago +269

    Dark background and blue aren't a good combination of colors for readable text. Try white or yellow next time, or something like that. Anyway, great comparison and great job on the optimizations!

    • @HansLemurson
      @HansLemurson 2 years ago +6

      If you want Blue to be seen, add some Green!

    • @firstname4337
      @firstname4337 2 years ago +19

      LOL, he said blue code and I said what code?
      I had to change the brightness of my monitor to see it

    • @sdjhgfkshfswdfhskljh3360
      @sdjhgfkshfswdfhskljh3360 2 years ago +9

      Not only is it bad by itself, it is also further damaged by RUclips's compression algorithms.

    • @NoNameAtAll2
      @NoNameAtAll2 1 year ago +1

      yellow on white?
      ha ha

    • @professortrog7742
      @professortrog7742 1 year ago +1

      @@NoNameAtAll2 no, on the dark background

  • @pyromen321
    @pyromen321 2 years ago +340

    I'm guessing the new version might actually use less power! They should be practically the same speed due to ILP, but the new version requires fewer calculations.

    • @-mwolf
      @-mwolf 2 years ago +9

      what's ILP?

    • @leocomerford
      @leocomerford 2 years ago +31

      @@-mwolf Instruction-level parallelism en.wikipedia.org/wiki/Instruction-level_parallelism

    • @MenkoDany
      @MenkoDany 2 years ago +5

      Thank u for the wisdom, dolan

    • @ABaumstumpf
      @ABaumstumpf 1 year ago

      But it simply does not take less computational power.

    • @kingacrisius
      @kingacrisius 1 year ago +2

      @@ABaumstumpf I believe they mean literal power, like electricity. It might be less expensive in terms of electricity.

  • @47Mortuus
    @47Mortuus 2 years ago +403

    9:35 don't divide by 2, multiply by 0.5. The compiler is not allowed to do this for you, as it will sometimes produce different bitwise results.
    ... Unless you have fast unsafe math enabled, which will also allow for FMA reordering, which is a pretty significant optimization...
    Also, much of the runtime is actually float division. It's not pipelined, and you can only start a division after another has finished, which is 14 clock cycles at best. That is _extremely_ relevant when you can replace an FP division with "RCP_PS" plus some Newton-Raphson, which has _WAY_ higher throughput than FP division (relevant for tight loops such as this one), and can be FMA-fused when multiplying by the reciprocal instead.
    Here's the code for an accurate reciprocal (only _possibly_ worth it if your CPU supports FMA; total latency (fastest possible): 4 (rcp) + 2 * 4 (FMA) = 12 cycles vs 11 cycles (fastest possible division), but way higher throughput and FMA "fusable", meaning up to another 3 clock cycles saved):
    __m128 rcp23_ps(__m128 a)
    {
        const __m128 ONE = _mm_set1_ps(1.0f);
        __m128 r = _mm_rcp_ps(a);
        return _mm_fnmadd_ps(_mm_fmsub_ps(r, a, ONE), r, r);
    }

    • @DDlol01
      @DDlol01 2 years ago +19

      I was looking for this one.
      I also want to add: even if fast unsafe math is enabled, on some compilers (MSVC: /fp:fast) you should also enable optimizations for speed to rewrite constants as needed (done when using /O2 or /Ot but NOT /Ox). /Ot is the default. /O2 does some more afaik.
      (no in-depth knowledge, use at your own risk. Please correct me if I'm wrong)
      P.S.: AFAIK in gcc constant rewriting is auto-enabled when using fast unsafe math (-ffast-math), and optimization levels do not affect constants; -ffast-math takes precedence?

    • @cH3rtzb3rg
      @cH3rtzb3rg 2 years ago +29

      Dividing by powers of two can and will actually be replaced by multiplying with the inverse by the compiler even without fast-math flags, since the result will be exactly the same.
      One could replace the divisions by a with multiplications by (1/a) -- this is a thing the compiler won't do in exact-math mode. In fast-math mode, it may even use the fast rcpps (reciprocal packed single) instruction with a refinement step to do this faster.

    • @JamesEdwards780
      @JamesEdwards780 2 years ago +5

      a bit shift >> may be faster still

    • @vytah
      @vytah 2 years ago +44

      @@JamesEdwards780 You can't just bitshift floats.

    • @PremierSullivan
      @PremierSullivan 2 years ago +57

      @@vytah Not with that attitude!
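The power-of-two claim in this thread is easy to check: dividing by 2.0f and multiplying by 0.5f are both exact scalings of the exponent, so they round identically and give bit-identical results, which is why a compiler may substitute one for the other even in strict-math mode. A minimal sketch (function names are mine, not from the video); this does not hold for non-power-of-two divisors such as 3:

```cpp
// Both forms only adjust the exponent; for any float x they produce
// bit-identical results, unlike division by non-powers-of-two.
float half_by_div(float x) { return x / 2.0f; }
float half_by_mul(float x) { return x * 0.5f; }
```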

  • @bronzecomeshome9517
    @bronzecomeshome9517 2 years ago +103

    My man really putting the work in on the Blender effects 😂 but damn, I was so shocked to see deleting a useless branch cut the processing time in half

    • @protonmaster76
      @protonmaster76 2 years ago +28

      He's done an interesting video on branchless coding.
      In modern CPUs, a branch can be very expensive.

    • @bronzecomeshome9517
      @bronzecomeshome9517 2 years ago +7

      @@protonmaster76 I've seen it and enjoyed it, I just assumed that as a false branch it would have been optimized out by the compiler.

    • @chainingsolid
      @chainingsolid 2 years ago +16

      The combo of a branch + floating point may have resulted in the compiler not realizing it was redundant. And having a branch that probably didn't predict very well (it's based solely on incoming data) may have done major performance damage. Would be cool to see the pattern of which way the branch goes, given it's based solely on the input. I wonder if "sorting" the input to make the branch predictable would allow the CPU's branch predictor to skip the divide for some of the solve attempts and thus maybe go even faster in those cases. @Creel looking at the branch predictability could be an interesting video.

  • @pyromen321
    @pyromen321 2 years ago +161

    26:24 it's worth noting that because you're dividing by a constant power of two for the third div in the new version, all modern compilers will optimize that to a single mul instruction.
    I'm almost certain that -x/y will always become an xor and mul when y is a power of two.
    Edit: I don't want to litter the comment section too much, so I'll just put this here: awesome video, man! Some of my work involves optimization, and I absolutely love when other people make this knowledge super accessible!

    • @WhatsACreel
      @WhatsACreel  2 years ago +47

      I think you're right about the division being swapped for a mul! I can't remember what I did in the ASM, but hopefully a compiler would figure that one out... Cheers for watching mate, have a good one :)

    • @PhilippeVerdy
      @PhilippeVerdy 2 years ago +2

      An actual implementation would distinguish 4 main cases, depending on whether 'a' is zero or not, and on whether 'b' is zero or not:
      - 1. When (a=b=0), if (c=0) every x is a solution (return a representative root r1=r2=0), otherwise there's no solution (r1=r2=NaN).
      - 2. Otherwise when only (a=0), there's a single solution (r1=r2=-c/b).
      - 3. Otherwise when only (b=0), there are two solutions of opposite signs (r2=-(r1=√(-c/a))).
      - 4. Otherwise, we have the two quadratic roots, possibly complex depending on the sign of the discriminant (b²-4ac).
      For the 4th case, you need **numeric stability** (due to the limited precision of the "float", "double" or "long double" types used by the FPU), and the two quadratic solutions should be given by (r{1,2}=[±√(1-4ac/b²)-1]*b/2a), where the square root may be imaginary if the "normalized" discriminant (1-4ac/b²) is negative, i.e. when the usual discriminant (b²-4ac) is negative (both conditions are equivalent because b≠0 in that 4th case).
      This gives a total of 7 cases, where conditionals (if statements) are justified.
      But the test for b=1 in this video is just useless and stupid (it tries to avoid a trivial division by 1, but at the cost of a branch, which disables optimization or parallelization by the compiler: divisions by 1 are trivial for FPUs too, and may cost 1 cycle, LESS than the cost of a branch).

    • @PhilippeVerdy
      @PhilippeVerdy 2 years ago +2

      (-x/y) cannot be "optimized" with a xor and mul: we're working here with floating point values, not integers! Negating a floating point number just flips a single bit, the sign bit; it does not perform any "xor" on the whole value (the exponent and mantissa parts are not modified), and this can be done before the division on either operand x or y, or on the result; in fact the result sign is (sign(x) XOR sign(y)), and the absolute quotient is the separate division abs(x)/abs(y).
      To compute that (positive) quotient of the two positive floating point numbers,
      - the FPU separately computes the quotient of the two mantissas, which are in (0.5, 1.0], so that the non-normalized quotient is in (0.5, 2], along with the difference of the exponent parts;
      - finally it renormalizes the quotient into (0.5, 1.0] and accordingly adjusts the difference of the exponent parts by conditionally subtracting 1.
      The final result may become a denormal (or zero) floating point value, or an "infinite" floating point value, depending on whether the final exponent part underflows or overflows (for its allowed range), but it will be a "NaN" (or would raise a "division-by-zero" exception) if both mantissa parts (including the implicit most significant bit if it's not stored in the floating point format) are zero. In all cases, the result sign computed separately is blindly applied as is (possibly giving "positive or negative zero", "positive or negative NaN", or "positive or negative infinite" values; some libraries may or may not discard the sign distinction for zeroes and NaNs, but at some extra cost using conditional branches).

    • @pyromen321
      @pyromen321 2 years ago +4

      @@PhilippeVerdy, you probably should have checked Godbolt before that whole explanation. I just confirmed that on some compilers it becomes a mul and xor, on others it just becomes a mul by -0.5.
      Remember, in this case we are dividing by a positive constant power of two and negating. This means the compiler can take shortcuts.

    • @PhilippeVerdy
      @PhilippeVerdy 2 years ago

      @@pyromen321 if you are dividing a float by a negative power of two, you don't need a mul, just a decrement of the exponent part and a flip of the sign bit. That takes two instructions, worse than a single FPU instruction performing one multiplication by a constant like -0.5 (consider the data path on the same word containing the sign bit and the exponent; plus you need to test for underflow when decrementing the exponent, or you'll get wrong results!)
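The case analysis described in this thread can be sketched as follows (the function name and the NaN-for-no-real-root convention are mine, not from the video):

```cpp
#include <cmath>
#include <utility>

// Degenerate-case handling for a*x^2 + b*x + c = 0, following the
// enumeration in the comment above. Returns (r1, r2); NaN = no real root.
std::pair<double, double> solve_with_cases(double a, double b, double c) {
    const double nan = std::nan("");
    if (a == 0.0 && b == 0.0)                        // constant equation: c = 0
        return c == 0.0 ? std::make_pair(0.0, 0.0)   // every x works; report 0
                        : std::make_pair(nan, nan);
    if (a == 0.0) {                                  // linear: b*x + c = 0
        const double r = -c / b;
        return {r, r};
    }
    if (b == 0.0) {                                  // pure quadratic: a*x^2 + c = 0
        const double r = std::sqrt(-c / a);          // NaN when roots are complex
        return {r, -r};
    }
    const double s = std::sqrt(b * b - 4.0 * a * c); // general case
    return {(-b + s) / (2.0 * a), (-b - s) / (2.0 * a)};
}
```

Note this sketch uses the plain discriminant form for case 4; the normalized form suggested above would additionally guard against overflow in b².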

  • @larswilms8275
    @larswilms8275 2 years ago +7

    just so everybody knows:
    the OG quadratic formula is the closed form of the Po-Shen Loh algorithm.
    the PSL algorithm is simpler for a mere human to understand and perform, but I would not expect optimized code for the two variants to differ all that much.

  • @sabriath
    @sabriath 2 years ago +5

    CPU latencies of modern computers are roughly:
    fadd 4 (as well as fsub)
    fmul 4
    fdiv 11-25
    fsqrt 35
    At the very least, this puts the original at between 97 and 125 cycles of latency, while the new formula takes between 88 and 130. This means that, naively coded, it is both slower and faster depending on the division requirements.
    The issue I have with your assessment at the end is that you say the new formula takes 3 divisions; it actually takes 2, to normalize the equation by the first coefficient. You are dividing by 2.0f, which wastes 11-25 cycles of latency on chip where you could simply multiply by 0.5f, which is 4 cycles. Even worse, floats are in powers of 2, so dividing by 2 is just decrementing the exponent, which is 1 cycle to perform as an integer decrement prior to loading the float into a register for use (even before the normalizing division in the FPU). Some compilers would have noticed the 2.0f division and possibly switched to an fmul by 0.5f for speed, but they wouldn't have figured out that you could do a dec prior to any of it, which means that you didn't actually test the right code.
    If done correctly, the new method should have used 1xDEC (1), 2xFDIV (22-50), 1xFMUL (4), 1xFSQRT (35), and 3xFADD/FSUB (12), which comes to 74-102, faster than the original by 21%. Also, complex numbers are automatically formatted correctly with the new method; the old method has a division after the sqrt, which takes longer to simplify (squaring it, dividing the sqrt answer by it to get the complex part formatted). You can make the code a little bit faster by taking the reciprocal of the leading coefficient (1xFDIV) and then multiplying it by the other coefficients to speed up the process further, but both methods can make use of that trick... despite that, the new method becomes almost 23% faster.
    The real bottleneck comes down to the square root function; it is the most expensive part in both scenarios. If we could get that down to 15 cycles of latency, the new method would be 28% faster than the old.

  • @alexwolski3344
    @alexwolski3344 2 years ago +46

    What a great idea! Gamifying the benchmarks makes it so much fun. Love the animations and benchmarking song. Really interesting stuff.

  • @FrostMonolith
    @FrostMonolith 2 years ago +5

    I feel that you haven't watched Po-Shen Loh's video in full, because he explicitly mentions this is NOT his own equation. It's his way of interpreting the logic of solving quadratic equations. He even mentions who originally had this idea.

  • @marknn3
    @marknn3 2 years ago +25

    Instead of:
    b[i] /= a[i];
    c[i] /= a[i];
    do:
    float a_inverse = 1.0f / a[i];
    b[i] *= a_inverse;
    c[i] *= a_inverse;
    This removes one division, replacing it with a much faster multiply.

    • @radeklew1
      @radeklew1 2 years ago

      That replaces a division with TWO multiplications, which... might still be faster? You'd have to benchmark, and it may depend on the CPU.

    • @teambellavsteamalice
      @teambellavsteamalice 2 years ago

      I like the *= steps. Someone else remarked: instead of mid = b / 2, do b *= .5.
      Will using r2 instead of u be faster still?
      so r2 = sqrt(mid*mid - c)
      r1 = mid - r2
      r2 += mid

    • @thomasbrotherton4556
      @thomasbrotherton4556 1 year ago

      And even faster to not be doing I/O back to the arrays.

    • @ABaumstumpf
      @ABaumstumpf 1 year ago

      And that change will give slightly different results
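The suggested rewrite in loop form, as a sketch (the array layout follows the video's b[i], c[i] convention; the function name is mine). As the last reply notes, the results can differ from the two-division version in the last bit:

```cpp
#include <cstddef>

// Normalize each equation by its leading coefficient using one division
// and two multiplications instead of two divisions. May differ from
// "b[i] /= a[i]; c[i] /= a[i];" by one ulp, but trades a slow div for muls.
void normalize(float* a, float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float a_inverse = 1.0f / a[i];
        b[i] *= a_inverse;
        c[i] *= a_inverse;
    }
}
```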

  • @derfogel3491
    @derfogel3491 2 years ago +30

    The Po-Shen Loh method is not new. I learned that method over ten years ago in school.

    • @omegaroguelp
      @omegaroguelp 2 years ago

      yea like same

    • @Kebabrulle4869
      @Kebabrulle4869 2 years ago +3

      Same. It's called the pq formula in Swedish schools. It's basically just dividing the polynomial by a and deriving the quadratic formula again.

  • @WolfgangBrehm
    @WolfgangBrehm 2 years ago +9

    4:31 skip the if. Just always divide. If the coefficients are random, the probability of a being equal to 1 is very slim. Also it might be worthwhile to multiply by the precomputed inverse of a.

    • @rrshier
      @rrshier 1 year ago +2

      I was just thinking what you put down here. The if statement is probably his biggest adder of time here! And usually vectorization wants pure math, nothing with conditionals.

  • @InvaderMik
    @InvaderMik 2 years ago +8

    As a tip: at 3:45 the contrast with the blue text is so low that I can’t read it. So I’m following along with your commentary by imagining the route in my head rather than reading along with you!
    EDIT: ok so this happens a lot in this video.

  • @procedupixel213
    @procedupixel213 2 years ago +32

    If the least significant bit of precision can be sacrificed in the new method, the two divisions by a can be replaced by one reciprocal (i.e. 1.0/a) and two multiplications with that reciprocal. Thus one division can be traded for two multiplications, which may be faster.

    • @SmallSpoonBrigade
      @SmallSpoonBrigade 2 years ago +4

      Honestly, that's the real question here: how much accuracy are you allowed to sacrifice in order to speed things up? I'm taking accounting courses right now and there are a bunch of shortcuts taken for similar reasons. The result is somewhat different from what it would be with proper math, but it's usually a small amount, and you're rarely dealing with more than 4 decimal places when dealing with currency anyway. Plus, since people are using one of two sets of rules regarding how these things are done, there usually isn't much trouble.

    • @attilatorok5767
      @attilatorok5767 2 years ago +2

      Accuracy is also why std::complex is usually slow. The C++ standard library has to work in all cases, and for complex numbers not sacrificing precision means a lot slower code. You also can't use a fast-math optimized version because the standard library is precompiled (although you can work around this by compiling a glibc with fast-math and using it with static linking, it would probably still perform worse than hand-written speed-for-accuracy optimized code). I would personally not recommend std::complex in any case unless you are sure you need super-duper accurate answers, but then you would probably go with an arbitrary-precision library which has complex numbers.

    • @teambellavsteamalice
      @teambellavsteamalice 2 years ago +2

      Isn't the step B = b/a and then mid = B/2 the real waste here? You only need mid.
      Also you can use a trick from the quadratic formula, where r2 holds the discriminant, r1 = (-b + r2)/2a and r2 = (-b - r2)/2a.
      For the Po-Shen Loh version it works even better:
      mid = b/2a
      c = c/a
      r2 = sqrt(mid*mid - c)
      r1 = mid - r2
      r2 = mid + r2
      The last step, I imagine, can be faster as it's adding mid in place (r2 += mid?)
      This should already always beat the quadratic formula. Would using reciprocals be even faster?
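The steps above can be put into code as a sketch (my own function; one fix worth noting: mid must be -b/2a rather than b/2a for the roots to come out with the right sign):

```cpp
#include <cmath>

// Po-Shen Loh style solver with the r2-reuse trick from the comment above.
// Assumes real roots; mid is the midpoint -b/(2a) of the two roots.
void solve_psl(float a, float b, float c, float& r1, float& r2) {
    const float mid = -b / (2.0f * a);  // midpoint of the roots
    const float q = c / a;              // product of the roots
    r2 = std::sqrt(mid * mid - q);      // half the distance between the roots
    r1 = mid - r2;
    r2 += mid;                          // reuse r2 in place, as suggested above
}
```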

  • @matthewkolar7560
    @matthewkolar7560 2 years ago +19

    CPUs & compilers can vectorize Array-of-Structures code as well; the difference between this and SoA is more about optimizing cache access patterns for functions that don't use every element in the structure. You should double-check your compiler optimization settings if you are getting wildly differing results from a structure this small.
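For reference, the two layouts being compared, as a sketch (type and function names are mine): with SoA, consecutive SIMD lanes load the same field, which is what makes the solver loop easy to vectorize; AoS interleaves the fields in memory.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: fields interleaved in memory (a0 b0 c0 a1 b1 c1 ...).
struct EquationAoS { float a, b, c; };

// Structure of Arrays: one contiguous array per field (a0 a1 a2 ...), so a
// vector load grabs several consecutive a's (or b's, or c's) at once.
struct EquationsSoA {
    std::vector<float> a, b, c;
};

// Repacking AoS into SoA, the transform discussed in the video.
EquationsSoA to_soa(const std::vector<EquationAoS>& in) {
    EquationsSoA out;
    for (const EquationAoS& e : in) {
        out.a.push_back(e.a);
        out.b.push_back(e.b);
        out.c.push_back(e.c);
    }
    return out;
}
```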

  • @Aras14
    @Aras14 1 year ago +1

    After a little bit of tinkering, I got the Po-Shen Loh method down to:
    2 float subtractions
    2 transfers (a := b)
    1 integer subtraction (from a float, to reduce the exponent by 1, aka dividing by 2)
    1 XOR (to multiply by -1)
    1 float multiplication
    1 float addition
    1 square root
    It also only uses 3 registers.
    It goes as follows (from x² + Ax + B; A, B, C are floats; p is the number of significand bits of the type of float, l is the total size, for example 32 or 64):
    1. Subtract 2^p from A
    2. XOR A with 2^(l-1)
    3. Set C to A
    4. Multiply C by A
    5. Subtract B from C
    6. Take the square root of C
    7. Set B to C
    8. Add A to B
    9. Subtract C from A
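The exponent and sign manipulations in steps 1-2 can be sketched in scalar C++ (my own code, not the commenter's registers): for a 32-bit float, subtracting 2^23 from the bit pattern decrements the exponent field, i.e. halves the value, and XOR with the sign bit 2^31 negates. This only works for normal, nonzero floats; near the denormal range the exponent decrement underflows into garbage.

```cpp
#include <cstdint>
#include <cstring>

// Halve a float by decrementing its exponent field directly.
// Valid only for normal floats (no zeros, denormals, infs, NaNs).
float half_by_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u -= UINT32_C(1) << 23;        // exponent LSB sits at bit 23
    std::memcpy(&x, &u, sizeof u);
    return x;
}

// Negate a float by flipping the sign bit.
float negate_by_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u ^= UINT32_C(1) << 31;        // sign bit
    std::memcpy(&x, &u, sizeof u);
    return x;
}
```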

  • @HDX30
    @HDX30 2 years ago +48

    Super interesting video! At the end, you mention how multiplications versus divisions may affect running time, and list Po-Shen Loh as having 3 divisions. However, one of the divisions in Po-Shen Loh is a simple division by 2, which is effectively a multiplication by 0.5.
    With that said, Po-Shen Loh would have the same additions, 1 fewer subtraction, 3 fewer multiplications, the same divisions, and the same square root! Making it strictly better in terms of operation counts!
    I'd be curious to understand what you think.

    • @WhatsACreel
      @WhatsACreel  2 years ago +32

      I found it very hard to count the operations. The negatives can become subtractions, the divisions can be inverted (as you mentioned), and the FMADD instructions can perform an add/sub along with a multiplication.
      My operation counts were a little hit and miss!
      I think you're right :)

  • @ProjectPhysX
    @ProjectPhysX 2 years ago +24

    For the red code at 3:30, wouldn't it be faster to compute the sqrt(...) part only once into a variable and then use the variable twice for the +/- solutions? Or does the compiler do that automatically?
    Edit: Nevermind, just saw you test this later in the video, and the compiler does indeed optimize it :)
    I personally never trust the compiler :D

    • @Wyoming307
      @Wyoming307 2 years ago +2

      Is this actually a legal optimisation for the compiler though?
      If r1 points to the same address as a, b or c, then the first computation would change the parameters for the second, so optimising this would lead to the wrong result.
      I guess if the function is only called once and it knows all the arguments at compile time, then it knows this can't happen, but I'm still surprised it would do this.
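The aliasing concern above is real: through plain float pointers the compiler must assume writing *r1 might change *b, so it cannot always reuse the first sqrt. One way to promise no aliasing is __restrict (a common GCC/Clang/MSVC extension, restrict in C99); a sketch with my own names:

```cpp
#include <cmath>

// With __restrict the compiler may keep a single sqrt for both roots,
// since writing *r1 is promised not to modify *a, *b or *c.
// (__restrict is a widespread compiler extension, not ISO C++.)
void quadratic(const float* __restrict a, const float* __restrict b,
               const float* __restrict c,
               float* __restrict r1, float* __restrict r2) {
    const float s = std::sqrt(*b * *b - 4.0f * *a * *c);  // computed once
    const float inv2a = 1.0f / (2.0f * *a);
    *r1 = (-*b + s) * inv2a;
    *r2 = (-*b - s) * inv2a;
}
```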

  • @astrobullivant5908
    @astrobullivant5908 2 years ago +6

    When I first saw Po-Shen Loh's method, I thought it looked like he was solving by substitution, just as Tartaglia and Cardano solved cubics at first, before generalizing the formula. Po-Shen Loh's way is cool and nifty even though it's slower than the regular quadratic formula. Plus, there are situations where we'd get solutions in human-friendly form faster using Po-Shen Loh's method than we would with the regular quadratic formula.

    • @duckymomo7935
      @duckymomo7935 2 years ago +1

      It's also just the quadratic formula, but with the fluff removed

  • @FaZekiller-qe3uf
    @FaZekiller-qe3uf 2 years ago +3

    What a great way to pad the runtime

  • @shanehebert396
    @shanehebert396 2 years ago +6

    For the CPU/FPU code, if the bottleneck is reading from RAM, you might try putting in some prefetch loads of the next set of data (particularly for your ASM code). It may not make much difference, though, given the loads are a cache line at a time. I remember way back in the day we were looking at the Fastest Fourier Transform in the West (FFTW) on the UltraSPARC, telling the C compiler to heavily optimize it (including using static profile-guided optimization), and the ASM it generated was pretty slick, including prefetch loads so the data for the next loop iteration was in L1 when it was needed. If the compiler you used didn't use PGO to optimize but has the option, it might be interesting to see if it helps.
    Also, the dark blue text used for the Po-Shen Loh code is pretty hard (for me at least) to read. The red was OK, even if a little dark.

    • @booom4849
      @booom4849 1 year ago +1

      Manual prefetching will be slower because there are now additional instructions to execute. He is just iterating over arrays; the hardware prefetcher likes that pattern.
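For the curious, a manual-prefetch sketch using GCC/Clang's __builtin_prefetch (my own example, not code from the video; as the reply notes, for a plain linear scan the hardware prefetcher usually makes this a wash or a loss, so measure before keeping it):

```cpp
#include <cstddef>

// Sum an array while hinting the cache a few cache lines ahead.
// __builtin_prefetch(addr, 0 /*read*/, 3 /*high locality*/) never faults,
// but we guard the index to avoid pointless hints past the end.
float sum_with_prefetch(const float* data, std::size_t n) {
    const std::size_t ahead = 64;  // distance to prefetch ahead; tune per CPU
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 3);
        total += data[i];
    }
    return total;
}
```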

  • @AMan-xz7tx
    @AMan-xz7tx 2 years ago +12

    Love the video, but some of the text is a little hard to read. Maybe add a centered shadow text effect to make it pop out a bit, and lighten the text colours just a small amount too?

    • @sotrh7974
      @sotrh7974 2 years ago +3

      Yeah, the blue on dark gray in particular is really hard to see

  • @falklumo
    @falklumo 2 years ago +25

    Po-Shen Loh did NOT derive a different quadratic formula; his is the same! He just derives it a little differently. The way he writes the formula just reshuffles common factors.

    • @93hothead
      @93hothead 2 years ago +1

      Nobody said anything about deriving

  • @kasuha
    @kasuha 2 years ago +27

    So, you take an (A ± B)/C expression, separate it into A/C ± B/C, and call it a new revolutionary formula? Seriously?
    The whole comparison is about how smart the optimizer in your compiler is and how efficiently or inefficiently you can put the two into source code. But at the end of the day, it's exactly the same formula. Programmers go to far greater lengths in deviating from formulae optimized for viewing to formulae optimized for execution when it's really important.

    • @Daniel_VolumeDown
      @Daniel_VolumeDown 2 years ago +1

      @@jeremiahbullfrog9288 idk because I did not watch it, but the formula was proposed by someone else whose video is in the description

    • @spencergouw3316
      @spencergouw3316 2 years ago

      The formula is meant for pedagogy, to teach students a more intuitive approach to the quadratic equation in the original video. Your complaint is pretty silly considering no one claimed it was "revolutionary", only that it was "easier on pen and paper", which it is. Po-Shen Loh even derives the quadratic equation from his reformulation in his video.

  • @mikefochtman7164
    @mikefochtman7164 1 year ago

    Worked on old 32-bit mainframes, and we used to pore over our code for hours to hand-optimize the FORTRAN, looking for division operations that could be replaced with multiplication by an inverse. And the bane of our work were those pesky 'implicit type conversions' where someone would mix operand types in an expression. Yeah, those old compilers weren't very good at optimization, so we had to do it ourselves. Love your channel and loved this dive into an interesting bit of math.

  • @falknfurter
    @falknfurter 2 years ago +3

    When doing computations, not only performance is important, but also the accuracy of the results. Quadratic equations have two cases which need to be treated carefully from a numerical point of view, or you might suffer a loss of accuracy: (1) where -b ~ sqrt(b^2 - 4ac) and (2) where b^2 ~ 4ac. In both cases you essentially subtract two almost equal numbers. In combination with the finite precision of floats, the significant digits of the result are reduced. Both algorithms (as shown here) suffer from these problems. Obviously, for numerically stable code these cases should be handled appropriately.

    • @MatthewJacksonTheMuuj
      @MatthewJacksonTheMuuj 2 years ago

      I do remember this coming up in Numerical Analysis class. Avoid subtracting two numbers if the result might be close to zero!
      College may have been a long time ago for me, but that one guideline has stuck with me.

    • 1 year ago

      Totally underrated comment. This is overlooked so often.
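A standard remedy for case (1), sketched below (the textbook rearrangement found in numerical analysis references, not code from the video; names are mine): combine -b and the square root with matching signs so the cancelling subtraction never happens, then recover the other root from the product of roots c/a.

```cpp
#include <cmath>

// Numerically stable real-root solver: b and sqrt(disc) are added with
// matching signs (no catastrophic cancellation); the second root comes
// from r1 * r2 = c/a. Assumes a != 0 and disc >= 0.
void solve_stable(double a, double b, double c, double& r1, double& r2) {
    const double disc = b * b - 4.0 * a * c;
    const double q = -0.5 * (b + std::copysign(std::sqrt(disc), b));
    r1 = q / a;
    r2 = c / q;
}
```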

  • @JamesJones-zt2yx
    @JamesJones-zt2yx 2 years ago +4

    Would really like to see a discussion of whether Po-Shen Loh's method is preferable to the one we all learned from a numerical analysis POV. Not sure error analysis leads to quite as spiffy graphics, though. :)

  • @protonmaster76
    @protonmaster76 2 years ago +18

    It's interesting that std::complex is so much slower than the scalar version. It would have been interesting to see a complex implementation in C that didn't use the std::complex methods.

    • @zactron1997
      @zactron1997 2 years ago +10

      It does make sense that for complex numbers you're looking at much more compute. In the complex domain you have to store 2 values per number for the same level of precision. You'd think that means only double the work, but you actually end up with substantially more for certain operations.
      Consider addition and subtraction: that maps quite nicely to ReC = ReA + ReB, and ImC = ImA + ImB. So double the operations.
      Now consider multiplication: ReC = ReA × ReB - ImA × ImB, and ImC = ReA × ImB + ImA × ReB. Now you have 4 multiplications instead of 1, plus two additions.
      Division and square roots are even worse. Now granted, you could speed this up by working in polar form, but then you need trig functions to handle addition and subtraction.

    • @Mernom
      @Mernom 2 years ago

      @@zactron1997 How fast is conversion between polar and rectangular forms?

    • @protonmaster76
      @protonmaster76 2 years ago

      @@zactron1997 it is true that some math operations in the complex domain require a lot more work than their real counterparts. My point is that his std::complex implementation took 10,000 ms to run as opposed to a few hundred ms, yet his asm real vs complex implementations did not have such a wide time difference between them.

    • @protonmaster76
      @protonmaster76 2 years ago +1

      @@Mernom cartesian/polar conversion requires about the same amount of trig as you're trying to avoid in the add and sub.

    • @monad_tcp
      @monad_tcp 2 years ago +2

      std::complex is pretty terrible
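The operation counts above, as a hand-rolled struct (a sketch of "complex without std::complex"; type and function names are mine): addition is 2 adds, multiplication is 4 multiplies plus 2 adds/subs, exactly as listed.

```cpp
// Minimal complex type illustrating the operation counts discussed above.
struct Cplx { float re, im; };

// 2 additions.
Cplx cadd(Cplx x, Cplx y) { return {x.re + y.re, x.im + y.im}; }

// 4 multiplications and 2 additions/subtractions.
Cplx cmul(Cplx x, Cplx y) {
    return {x.re * y.re - x.im * y.im,    // real part
            x.re * y.im + x.im * y.re};   // imaginary part
}
```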

  • @turun_ambartanen
    @turun_ambartanen 1 year ago +1

    I was excited for a new formula, but it's just the regular one we are taught in school here in Germany :(
    It's called the "pq-Formel", with p=b/a and q=c/a

  • @josephvictory9536
    @josephvictory9536 2 years ago +4

    An idea for future videos:
    Can you put the standard deviation with the speed result if it's less than 2? I found myself really wanting to see it for some of the closer matches.
    It's only us nerds watching, so there is no shame! Awesome video btw!

  • @stevegremory3849
    @stevegremory3849 2 years ago +6

    bloody awesome! never knew this existed, extremely cool, hell it'll be useful for school work too ;)

    • @WhatsACreel
      @WhatsACreel  2 years ago +3

      I reckon! Such a clever formula! Cheers for watching :)

    • @stevegremory3849
      @stevegremory3849 2 years ago

      @@WhatsACreel Cheers :D

  • @Andrew90046zero
    @Andrew90046zero 2 years ago +2

    Really interesting. I was really hoping Po-Shen Loh would be faster across the board :P But the results and the overall lesson of the video were still very interesting.
    It would be cool to see a video where you only talk about the Op Times table you showed at 26:36, maybe showing other examples of algorithms and how they stack up in terms of Op Times.
    On top of that, I also found myself wondering a bit about "compiler vectorization". Idk if you have any videos on that subject, but I am interested in learning more about it, especially the parts where you talked about "Structure of Arrays" and how the compiler can (I guess) more easily take advantage of SIMD instructions.

  • @christopherhume1631
    @christopherhume1631 2 years ago +1

    The two algorithms are practically indistinguishable. One can regard the principal difference between them as an algebraically motivated optimization.
    The main value of this investigation is to underscore the fact that divisions are more expensive than multiplications. This difference would be substantially amplified by the use of double precision math. I haven't looked closely, but it may be possible to reduce the number of divisions in the Po-Shen Loh approach.
    Regardless of approach, trading division for multiplication should improve performance - especially for double precision math. Note, however, I am not referring to constant division by powers of two. Though I am not certain, it is possible those will be converted to arithmetic shifts.

    • @PhilippeVerdy
      @PhilippeVerdy 2 years ago

      That's no longer true with modern FPUs and GPUs, where divisions are now actually a bit FASTER than multiplications! But this is actually visible only with GPUs, not with existing x64 CPUs where both perform the same in one cycle... except when using "vectorized" expressions: AVX2 processor extensions (or newer), if they are used by your C/C++ compiler or its libraries, may show that divisions CAN perform faster than multiplications! You may also get different figures depending on CPU brands and models (how their internal FPU units are internally built). Do not assume things from the time when divisions were still using our "naive" algorithms (with bit-by-bit compares, shifts, and subtraction) and we had no FPU to compute them in a single hardware instruction. As well, there are optimizations in modern FPUs to efficiently compute square roots. Modern FPUs can use "FFT-based" algorithms implemented in hardware (using many more silicon gates, but structured in a much larger "tree" of parallel branches with lower depth, so with faster total propagation time): this is true for multiplication, division, square root, and other trigonometric/exponential/logarithmic functions natively supported by FPU or GPU instructions.

    • @PhilippeVerdy
      @PhilippeVerdy 2 years ago

      As well, there are new types of processing units, notably for AI and signal processing, using optimization of tensor flows. They are starting to appear on CPUs for smartphones and will soon be available on desktop PCs, and they are already used in large computers performing massive computation (think about what CERN needs for the LHC, what Google needs for its AI with big data, or what is needed in astronomy, fluid mechanics, or computer-aided design, notably in the aerospace and defense sectors). These processing units have new specialized instructions for efficiently computing "convolution products" and "FFTs" in hardware. Here also, they can accelerate basic divisions faster than basic multiplications (and the difference is even more visible with massively parallel computing, with very large vectors or matrices rather than individual scalars).

  • @GiveAcademy
    @GiveAcademy 1 year ago

    Love all the content!!! Reminds me of my random coding adventures! And very inspiring!!! Very happy that you are active again! I am struggling to get myself back up and going.., would love to see a continuation of your GraceAI neural network series!! ❤

  • @zaydabbas1609
    @zaydabbas1609 2 years ago +1

    Just to be clear here, these are algebraically identical. There was no 'new math' discovered here as some people claim. This is merely an optimization for computation

  • @michaelwilkes0
    @michaelwilkes0 1 year ago

    just getting close to such a tiny, well-established formula is amazing. Epic.

  • @Mernom
    @Mernom 2 years ago +2

    That branch really muddied the water. I think the first test should've been redone with the optimized code.

  • @syntaxerorr
    @syntaxerorr 2 years ago

    I love your videos. I don't want to critique your style and this is the first I've seen with this style...but if you speed up all the animations (mainly the countdowns a lot, like 5 seconds less) I think it would flow better. I know animation is hard and a trial by error process sometimes.

  • @corscheid
    @corscheid 2 years ago

    This video was absolutely amazing. Beautifully done. WOW.

  • @DOROnoDORO
    @DOROnoDORO 2 years ago +1

    Love the format, very nice animation, and even nicer maths!

  • @beaconofwierd1883
    @beaconofwierd1883 2 years ago +2

    At first I was angry with you for implementing it using a branch, then I was very happy that you did so because it made me feel good spotting the issue from the start :D
    So thank you for making me feel clever ;D

  • @algorithminc.8850
    @algorithminc.8850 2 years ago

    Great channel ... really enjoying it ... thank you. Cheers

  • @josefaguilar2955
    @josefaguilar2955 2 years ago +1

    I absolutely loved this, thanks for making such great content!

  • @Zero-4793
    @Zero-4793 2 years ago

    With Po-Shen Loh's, you don't need to calculate with complex numbers. You just put an i out the front:
    if c^2 < 0:
    c = |c|
    ans = ic

  • @Chris-op7yt
    @Chris-op7yt 1 year ago

    that's the form for finding x-axis intercepts. if just computing the y value for any particular a, b, c and x, then just plug it into y = a*x*x + b*x + c. no square roots and no division.

  • @XfStef
    @XfStef 2 years ago +2

    Glad round 2 took a Data Oriented approach!

  • @badhombre4942
    @badhombre4942 1 year ago

    √(b²-4ac)/(2a) == √((b/(2a))·(b/(2a)) - c/a) == u. Therefore both formulas are the same.
    If a is > 1 in the majority of your dataset, then the if block adds 2 division operations. Your red code can be optimized to make it even faster.

  • @jerryplayz101
    @jerryplayz101 2 years ago

    15:09 - my guess is that the compiler is using branch prediction, meaning that the program is already precomputing "a" and prepares for the branch as part of low-level optimisation

  • @kestans
    @kestans 2 years ago +3

    brain neglects reading dark blue code :D

  • @ExaltedDuck
    @ExaltedDuck 1 year ago

    fascinating. I haven't watched the other video yet to better understand this other method but from the 1-sentence description early on in your video, it doesn't sound like they're doing different things. The description of Po Shen Lo as finding the middle and then adding and subtracting the difference to find the roots... that's exactly what the quadratic formula does. -b/2a is the middle and then the difference is sqrt(b^2-4ac)/2a. You might be able to speed up the quadratic code by a tiny amount by calculating the -b/2a as well as the discriminant. Although in the scalar version, the compiler probably already has. I'm really curious to go see the more explanatory video soon, though. Maybe there's something I'm missing...

  • @effektor2000
    @effektor2000 2 years ago +3

    I would try to calculate 1 / (2.0 * a[i]) and then do multiplications instead of the two divisions.
    Also, aliasing rules prevent further compiler optimizations; making the pointers restrict should help in this case, in particular in Quad2 as it does more memory access.

    • @andersgoude
      @andersgoude 2 years ago

      I am a bit curious about the aliasing rules here. If the compiler was reusing the value for the square root in the first case (at around 7:00), wouldn't this be against the aliasing rules, as if r1 would be the same as a, b or c, it would have changed the values for the square root for the second calculation. So the compiler may already assume that there is no aliased pointers

  • @mt0software
    @mt0software 2 years ago

    I wish this was the kind of game show they'd show on TV. great video

  • @DanikaOliver
    @DanikaOliver 2 years ago +2

    I can't see the code, the contrast is not good. The letters are small.

  • @nyaromeneko4455
    @nyaromeneko4455 2 years ago

    For completeness, I would suggest doing the benchmark calculations in both float and double. I would have liked to see complex. 🙃

  • @felixstuber8046
    @felixstuber8046 1 year ago

    You can substitute one division in the Po-Shen Loh formula with a multiplication if, instead of dividing b by a and then by 2, you simply divide it by 2*a. On the other hand, multiplication and division by 2 shouldn't be a problem at all for a binary machine.

  • @mikeblaszczak5346
    @mikeblaszczak5346 2 years ago

    Dark blue on a grey background? Good going.

  • @TECHN01200
    @TECHN01200 6 months ago

    Once you are doing GPGPU, your limit is becoming the speed of copying to and from the card over the PCI bus rather than the compute.

  • @furyzenblade3558
    @furyzenblade3558 2 years ago

    Your videos are always super interesting!
    Thank you

  • @barakap2372
    @barakap2372 1 year ago

    Hey friend. Why didn't you upload for such a long time? You were a reallllllllly good introduction for me to assembly, C++ and compsci in general. Please reply

  • @TerjeMathisen
    @TerjeMathisen 2 years ago +2

    First, the two algorithms really are not separate: They have exactly the same numerical qualities and effectively do the exact same operations, only in a slightly modified order!
    On a modern CPU, sqrt() and FDIV are both far slower than fadd/fsub/fmul & fmadd, so I would start by calculating the a_reciprocal = (1/a) term: The FDIV in that operation will then be nearly or completely free because it can overlap with all the rest of the calculations. In the PSL version we immediately use the result of that division to scale b and c, which simplifies the logic but generates a latency barrier. If we instead calculate half_a_recip = (0.5/a) then we can multiply the final answers by that value.
    The sqrt term is sqrt(b*b-4*a*c), here the b2=b*b term and the ac=a*c term can run at the same time, then we combine them in tsqrt=sqrt(fmadd(ac,4.0,-b2)).
    By the time that has finished, the fdiv will all be done, so we just have to do the final res0 = (-b + tsqrt) * half_a_recip and res1 = (-b - tsqrt) * half_a_recip calculations, which will also overlap.
    Since sqrt is slightly slower than fdiv, we can in fact delay the half_a_recip calculation a smidgen, starting it just after the b2=b*b; and ac=a*c; operations, it will still finish before the sqrt. On an x64 cpu with at least two fmadd pipelines, but in scalar mode, these calculations should take ~40 clock cycles (I'm assuming 5-cycle fmadd and 20-cycle fsqrt).

    • @cH3rtzb3rg
      @cH3rtzb3rg 2 years ago

      I doubt the latency of the instructions matters at all in case you calculate millions of independent roots -- only instruction throughput and port-usage should matter (and in fact memory throughput).
      Instruction latency would matter, if the output of the previous calculation influences the input of the next.

    • @TerjeMathisen
      @TerjeMathisen 2 years ago

      @@cH3rtzb3rg I mostly agree: When processing huge amounts of data, typically more than last level cache, almost any calculation will be totally dominated by memory bandwidth. OTOH, that is pretty much unimagninable for the case of solving quadratic roots. :-)

    • @cH3rtzb3rg
      @cH3rtzb3rg 2 years ago

      @@TerjeMathisen I also agree. The benchmark in the video is pretty much meaningless for most practical purposes. One could compare the latency of the algorithms by calculating new coefficients based on the previous roots (that way, there would also be no memory bottleneck).

  • @plokki456
    @plokki456 2 years ago +2

    interesting video! Though I find it hard to believe that the whole gpu is only twice as fast as a single cpu core for such a massively parallel task. Is there some cuda optimization flag that you did not turn on? Also the gpu may not have had the time to increase its frequency to the max if the task was too short. Did you compute the effective bandwidth vs theoretical bandwidth as you did for the cpu test? I'm curious about that as well.

    • @aliengenie8896
      @aliengenie8896 2 years ago +2

      When you're doing stuff with GPGPU, if you try to simply time the time from when you "submit" the work to the GPU to when it finishes, you end up also measuring the overhead that is involved with deploying kernels and eventually having all of them hit the synchronization point at the end. There's a good chance that the amount of work that was submitted was not even to keep the GPU busy for long enough, and so the vast majority of the time is spent on that overhead. Depending on how things are done, you may end up also measuring the transfer of the inputs/outputs to/from the GPU, which would make the results look even more similar.
      The "dumb" way to measure how much overhead there is, would be to give it a ridiculously small amount of work, like only one calculation, and see if it ends up being ~75ms again. The better way is that Nvidia supplies tools for measuring how long each part of the process takes. (Launching the kernel vs memory transfers vs the actual work).

  • @mattewlefty991
    @mattewlefty991 2 years ago +7

    From a numerical point of view, the quadratic formula might give you roots which are very distant from the true ones, depending on the values of a, b, c. So we will need to use different methods for computing the roots. Btw the "red and blue tower thing" was really anxiety-inducing 😂

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p 2 years ago +1

      "From a numerical point of view, the quadratic formula might give you roots which are very distant from the true ones, depending on the values of a,b,c"
      Really? Like, in a way that po-shen loh wouldn't?

    • @mattewlefty991
      @mattewlefty991 2 years ago

      @@user-sl6gn1ss8p That's a good question, po's formula divides all the coefficients by a, and then does the calculations (but when abs(a) is small there might be some problems in general). @Creel would you share how you generated the test cases?

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p 2 years ago

      @@mattewlefty991 now that you mention it, comparing the results might be fun as well (if fairly slow)

    • @monad_tcp
      @monad_tcp 2 years ago

      why would it do that ? the roots is basically when the curve crosses the 0. we are talking about a parabola, it must cross the origin

    • @mattewlefty991
      @mattewlefty991 2 years ago +1

      @@monad_tcp Not all parabolas cross the x axis... But even then there are some operations (like subtraction, sum and division) that introduces errors because the computer works with a finite number of digits

  • @PhilippeVerdy
    @PhilippeVerdy 2 years ago

    So suppress the "if", and just use ONLY two intermediate variables "mid" and "u" declared as "long double", storing their complete expression; then you get "r1=mid+u" and "r2=mid-u"; you can even assign the intermediate values of "mid" and "u" within the expression of "r1", and reuse "r1" in the expression of "r2" in a single statement, so you get "r2=(r1=(mid=sqrt(b*b-a*c*4)/a/2)+(u=...))-u*2;"
    As well, avoid converting the "float" values read from your "quad" structure multiple times to "double" or "long double"; just read them once into a local temp variable of type "long double" (this can be done in BOTH "formulas" to drop that overhead). Note that if you use assembly, those temp variables would be FPU registers of type "long double" (but you need a good C/C++ compiler to do the same optimization, and use "IEEE rounding modes" flags in the compiler or macros in the source code to avoid all unnecessary intermediate conversions and roundings).
    You can redo your test as well by changing your "quad" structure to declare "double" or "long double" fields (in both "formulas"). So what you are really testing is how C/C++ compilers optimize the generated code. There's in fact NO difference between the formulas, except that if you incorrectly round intermediate results in your naive "Po-Shen-Loh" implementation, you get DIFFERENT results between the two "formulas", where your "Po-Shen-Loh" implementation is LESS PRECISE than your "original" implementation! You did not even compare the results between the two formulas to see that they have differences (visible here on the 9th significant digit, which can deviate by +/-2!)

  • @phizc
    @phizc 1 year ago

    26:12 One of the divisions in Po-Shen Loh is x/2. It can be turned into a multiplication x*0.5. I'm kinda surprised if the compiler doesn't do that.
    So PSL vs "old" should have 1 less sub, 3 less mul and all the others the same. I.e. 7 clocks less than the "old" in the first table, and 2.5 less in the second table.

  • @BboyKeny
    @BboyKeny 2 years ago +3

    I'm guessing the new 1 is faster because square roots and divisions are expensive. The old 1 does 2 square roots.
    Although this could maybe be optimized with bit manipulation like the infamous Quake engine bit hack.
    Or I'm talking nonsense and should go to bed

    • @BboyKeny
      @BboyKeny 2 years ago

      Aaah reusing the discriminant eliminates the second sqrt and the branch bugged me before but I forgot how bad branches are.

  • @thomasolson7447
    @thomasolson7447 1 year ago

    Suppose every program is just a number. I'm wondering how that number would be calculated. Is it left to right, or right to left? Then there is the stack and heap. Is there a stage before it enters memory where it is one complete number? Suppose we added 1 to it, where would it be? Is there a halt at the end? Does the increment have to take place before the halt?

  • @etopowertwon
    @etopowertwon 2 years ago

    At differences so low you more quickly run into what compiler can optimize better at given flags (O0 vs O1 vs O3, -ffast-math/-fno-fast-math)
    For example: final table is unoptimized. Nobody counts 2a as (2*a). It's (a+a).
    (Unless compiler would decide that 0.5 * X/a is faster than X/(a+a) -- it generally shouldn't, floats have no instruction for quickly halving a value)

    • @ABaumstumpf
      @ABaumstumpf 1 year ago

      Changing a float by a factor of 2 is a single addition/subtraction

  • @gummansgubbe6225
    @gummansgubbe6225 2 years ago +1

    What a nice video!
    In the original quadratic formula you get into trouble if a, c or both are small compared with b. You should compute the roots as q = -1/2 * (b + sgn(b) * sqrt(b*b - 4ac)). The roots are x1 = q/a and x2 = c/q.

  • @rubensramos6458
    @rubensramos6458 1 year ago +1

    To find an analytical solution for ax^2+bx+c = 0 is easy. However, what is the analytical solution for ax^(2+e)+bx+c=0 with ‘e’ being a real number? The solutions are
    x1=(b/(az))Wq(((-c/b)^z)(a/b)z)^(1/z), where z = (1+e) and q = 1-1/z.
    x2 = (-y(a/b)Wq((-1/y)(b/a)((-c/a)^(-1/y))))^(-1/(1+e)) where y = (2+e)/(1+e) and q = 1+y
    Wq is the Lambert-Tsallis function (a generalization of the Lambert function). Sometimes the correct solution is x1, in other cases the correct one is x2 and there are cases where x1 = x2, depending on the values of a, b and c. For example, the solution of
    x^(2.001)-11x+30 = 0 is x1 = 5.0430 (up to 4 decimals).

  • @maxvilla5005
    @maxvilla5005 2 years ago

    Nice vid, wish the code text in red and blue (especially blue) was easier to read though.

  • @JustATempest
    @JustATempest 1 year ago

    I'd be interested and you using the fast square root algorithm for quake as a substitute for the square root for both of these

  • @mikefochtman7164
    @mikefochtman7164 1 year ago

    The 'blue' version @ 9:26, since 'bt' is only used for 'mid', why not eliminate it completely? Just 'float mid = - b[i] / (a[i] + a[i]). You eliminate a division operation as well as eliminate the storage 'bt' usage (and you don't need the constant '2' anymore). Just seems this code could be 'tweaked' a bit further and maybe beat the original quadratic formula.
    Later at 26:20, you would then beat the original by having fewer multiplications and equal number of divisions.

  • @prof.salomonibarra6402
    @prof.salomonibarra6402 2 years ago

    Excellent work! You deserve the gold medal in computer olympics. For real numbers OQ performs very well. I'm going to try this on ARM v8 and compare with similar Intel. Cheers

  • @boelwerkr
    @boelwerkr 2 years ago

    That "if" breaks a lot of processor optimizations, and it's not needed, I think. If "quads[i].a" is 1.0f, the statements inside the if are b /= 1.0f; c /= 1.0f, which don't change the values. Removing it would speed up the code a lot. With that it should have a fighting chance. The slowest part of both codes is the "sqrt" - it is a magnitude slower than any other operation in that code, so using it as little as possible is always preferable. So this code:
    void Po_Shen_Loh_Simple_AoS(QuadStruct* quads, int count) {
        for (int i = 0; i < count; i++) {
            float mid = (quads[i].b / quads[i].a) * -0.5f;
            float u = sqrt(mid * mid - quads[i].c / quads[i].a);
            quads[i].r1 = mid + u;
            quads[i].r2 = mid - u;
        }
    }
    should be faster if I haven't made a stupid logical error. 🙂

  • @PhilippeVerdy
    @PhilippeVerdy 2 years ago +1

    The Po-Shen-Loh implementation is disadvantaged because it is badly implemented: first, it uses a conditional "if" that is not even needed (CPUs don't like conditional branches), because you can still divide by 1 unconditionally. Second, it stores intermediate results into "float" variables, forcing the compiler to round these intermediate results (something that does not happen with the "original", which evaluates the whole expression for each root directly, rounding it to a "float" only at the end).
    And actually FPUs in processors do not even compute with "float", not even with "double", but with their internal maximum precision ("long double"): each time you read from or write into a "float" or "double" variable, the C/C++ compiler will generate instructions for extending/rounding the precision.
    So, drop the "if" condition, and change all your intermediate variables into "long double", and things change radically:
    What Po-Shen-Loh does is just separating and "factorizing" the computation of additive/subtractive differences, so that it can compute the same "square root" only once and reuse it. Fundamentally, "Po-Shen-Loh" is not a different formula than the "original", and it is NOT "new".

    • @enochliu8316
      @enochliu8316 2 years ago

      Actually, his complier, MSVC, does not use the FPU extended format for long double, and will generate native single precision adds and multiplies for the code he is running.

    • @enochliu8316
      @enochliu8316 2 years ago

      Also, the optimised version does remove the branch.

  • @PhilippeVerdy
    @PhilippeVerdy 2 years ago

    In summary, we cannot decide which approach ("original" from maths on real or complex numbers with infinite precision, or "PSL"-optimized) is faster or slower, as it largely depends on how you program your code, the programming language you use (notably how it rounds the results per its semantics), the compiler you use (and its compilation options, notably about rounding of intermediate results, and how it may cache multiple reads from external structures/variables into local registers), the library you use, and the hardware you use: finally you can only decide after running a local benchmark. NEVER forget the language semantics (effects of "implicit" roundings, and of "if" branches) when creating ANY implementation in that language.

  • @camofelix
    @camofelix 2 years ago +3

    I didn’t notice what chip you’re running this on! If it’s AVX512 with FP16 (early alder lake with ecores off) enabled there’s some SIMD fast paths for complex calculations that can do it all much faster, and because you’re using a smaller data type (fp16 instead of fp32) you get much better bang per buck on memory bandwidth

    • @monad_tcp
      @monad_tcp 2 years ago

      its an x86 probably

  • @sharkysharkerson
    @sharkysharkerson 2 years ago +1

    Based on code complexity, I would think Po-Shen Loh might do better on an FPGA. It could probably be worked out that both take the same time, but even so, Po-Shen Loh looks like it would use fewer blocks on the FPGA.

  • @dananskidolf
    @dananskidolf 2 years ago

    I think processor architecture plays a big role in this. The operation times depend on FPU implementation - for example Intel improved division by a factor of two and sqrt by 4 in the Penryn architecture if I remember correctly. More cache or memory bandwidth will target the key bottleneck here - that's probably the main gain from using a GPU. Branch prediction is powerful, but difficult and I think still evolving, so this affects different CPUs differently, as well as depending on the data fed in, and the order it's in. (The branch was only there to demonstrate bad code but I thought I'd include it.)

  • @wolfgangreichl3361
    @wolfgangreichl3361 1 year ago

    The way I learnt it, the expression (double==constant) is a bug anyway. (9:12)
    Before going to assembly I would have replaced std::complex, which seems to me to be the culprit messing up memory access.

  • @cH3rtzb3rg
    @cH3rtzb3rg 2 years ago +1

    Why would the original Po-Shen Loh code check for a!=1, but no code actually checked for a!=0?

  • @BillWangard
    @BillWangard 2 years ago

    "bt" is not needed. Just define "mid = 0.5 * b[i] / a[i];"
    BTW, better also to create ainv = 1/a[i]. Then use
    mid = 0.5 * b[i] * ainv;
    and
    ct = c[i] * ainv;
    Multiplying is much faster than dividing.

  • @PplsChampion
    @PplsChampion 2 years ago +1

    "benchmarking as entertainment" is next to godliness

  • @thrownchance
    @thrownchance 2 years ago +3

    Which CPU and GPU was used? A speedup of only factor 2 to go to the GPU seems to be barely worth it imho.

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p 2 years ago

      specially considering the time it takes to get stuff into the gpu

  • @Zooiest
    @Zooiest 1 year ago

    How does the tie rule work? Is it something like abs(resultsA[i] - resultsB[i]) < stddev(resultsA)? I'm not really a statistician, but I like learning these things

  • @ABaumstumpf
    @ABaumstumpf 1 year ago +1

    The same as nearly every time - no, it is not faster or more efficient; it doesn't need fewer registers, cache or memory - it isn't new either.
    It is really old and basically just a slight refactoring of the "normal" formula with more steps that can introduce inaccuracies.
    And no, the compiler is smarter than basically any of us, and as long as you give it the correct information it will be faster than hand-written assembly - specifically for C pointer parameters you should use "restrict" - or better yet not use pointers as parameters to begin with, as otherwise the compiler must assume that the different addresses are overlapping, so it can not effectively vectorise it; it can not even assume that the values don't influence each other.

  • @definitelynotofficial7350
    @definitelynotofficial7350 2 years ago +3

    I don't think you can call it a "new quadratic formula". It's the same formula really. It simply is a different derivation of it. The mathematical result is the same.

    • @kazedcat
      @kazedcat 2 years ago

      If they are the same formula then the runtime should be the same. The resulting value is the same but the time to execute the formula is not the same. There are different ways to compute Pi some are faster than the other but the result are always Pi. Are you arguing that there is only one formula for calculating Pi?

    • @definitelynotofficial7350
      @definitelynotofficial7350 2 years ago +2

      @@kazedcat The formula is the same. The way the calculation has been implemented here however isn't. Say you have a formula x=a+b+c. Someone may try to implement it by first adding a and b, and then adding c to that. Someone else may try to add b and c, and then add a. These are different ways to implement it but really the formula is the same. Really what is happening here is that something like the proof of the formula has been implemented as an algorithm to get the solution. The formula is 100% the same though.

    • @kazedcat
      @kazedcat 2 years ago

      @@definitelynotofficial7350 Again if the formula was completely the same the runtime should be completely the same. X=(A+B)+C and X=A+(B+C) has exactly the same runtime. But if you have X=4×A and X=A+A+A+A then they have different runtime and depending on hardware they might even give different result so they are different formula.

    • @definitelynotofficial7350
      @definitelynotofficial7350 2 years ago +1

      @@kazedcat They're not. They are, mathematically, entirely equivalent. Whether one formula has a different runtime than another or not does not make it be a different formula somehow, it's a programming language/compiler detail that doesn't matter for the math.
      You brought up the concept of multiple "formulas" for pi. These formulas are usually certain sequences that converge to pi. You can have multiple different sequences like that with some converging faster than others. But this is a completely different situation, there is no different formula. All there is here is a different method to derive essentially the same formula, and the derivation has been implemented as an algorithm. There is no difference.

    • @kazedcat
      @kazedcat 2 years ago

      @@definitelynotofficial7350 The results are equivalent but the formulas are different. Archimedes' method of calculating pi is not the same as Leibniz's formula for pi. You are confusing this with equivalence classes and thinking it applies to everything. No: for sets, two sets are equal if they contain the same elements, but two sets are equivalent if they have the same cardinality. For sequences, two sequences are equal if every term is the same, but they are equivalent if they have the same limit. Since a formula is a sequence of set mappings, two formulas are equal if each set mapping in the sequence is the same, but two formulas are equivalent if they produce the same output. Runtime is very important; that is why there is a branch of mathematics focused on comparing the relative speed of different formulas. This is why the problem of factoring in polynomial time is important. And this is why the Riemann Hypothesis is one of the biggest problems in mathematics.

  • @sykomantis
    @sykomantis 2 years ago

    So I got nerd-sniped for a few minutes and here's what I've come up with.
    If we divide B and C by different amounts, we can beat the Op times of BOTH algorithms by a decent margin, going by EITHER time scoring rule.
    If we let j = -2A, B = jk, and C = Am, then the quadratic formula becomes
    r_1
    = (-B + sqrt(B^2 - 4AC))/(2A)
    = (2Ak + sqrt(4A^2k^2 - 4A^2m))/(2A)
    = (2Ak + 2Asqrt(k^2 - m))/(2A)
    = k + sqrt(k^2 - m)
    and
    r_2
    = (-B - sqrt(B^2 - 4AC))/(2A)
    = (2Ak - sqrt(4A^2k^2 - 4A^2m))/(2A)
    = (2Ak - 2Asqrt(k^2 - m))/(2A)
    = k - sqrt(k^2 - m).
    So, converting things into a bunch of variable assignments, we have
    j = -2.0 * A
    k = B / j
    m = C / A
    n = k * k
    p = n - m
    q = sqrt(p)
    r_1 = k - q
    r_2 = k + q
    Counting everything up, we have 1 addition, 2 subtractions, 2 multiplications, 2 divisions, and 1 square root. Using the first scoring rule, that comes to an Op time of 1+2+4+16+12 = 35, and using the second scoring rule we get a score of 1+2+1+16+12=32.
    So with this schema, we beat the best of the first rule by an Op time of 7 and the second by an Op time of 3.5.
    So the code would look something like:
    j = B / (-2.0 * A)
    k = C / A
    m = sqrt( j*j - k )
    r_1 = j - m
    r_2 = j + m
    which I find easier to write and easier to follow, and which I believe should give a significant speed up over both algorithms.

  • @PaulMurrayCanberra
    @PaulMurrayCanberra 2 years ago +1

    So, PSL moves the memory fetch of a, b and c into auto variables like any sensible person would, and performs the call to sqrt() once. It isn't that the blue code is brilliant, it's that the red code is so shockingly bad. Sure, the compiler can optimize up to a point, but it can't optimise out the two calls to sqrt() unless there's some new hotness in C that lets you declare a function to be idempotent that I'm not aware of.
    It might (might) improve PSL further to move the fetch from memory onto the stack in one operation. Instead of declaring a and b as variables in the loop, have
    QuadStruct qd = quads[i];
    if (qd.a != 0.0f) {
        qd.b /= qd.a;
        qd.c /= qd.a;
    }
    // etc etc
    Doing this for both methods and saving the sqrt in the red code in a local variable (like absolutely anyone would) would give a fairer comparison.
    --EDIT--
    Also produpixel's suggestion - precalculate .5 a and multiply by that rather than dividing twice.

    • @enochliu8316
      @enochliu8316 2 years ago

      `Saving the sqrt in the red code in a local variable.` Um, that is actually done in the video. 8:50

  • @heaslyben
    @heaslyben 2 years ago

    Thrilling stuff 😃

  • @CjqNslXUcM
    @CjqNslXUcM 2 years ago

    awesome video, well done!

  • @graealex
    @graealex 2 years ago

    The font used in the animation always reminds me of the animations from surreal entertainment. Just waiting for "Sheldon is kill" to appear.

  • @tikabass
    @tikabass 2 years ago

    @25:00 Multiplication is 1, division is 19.

  • @Antagon666
    @Antagon666 2 years ago

    Funny thing: if you do the optimized ray-sphere intersection quadratic, it reduces to exactly the same formula as this one (a really old formula from CG)...
    So it might not be as innovative as you might think...

  • @monad_tcp
    @monad_tcp 2 years ago

    In the end, both are good enough; most of the time goes to moving data around and to the square root itself. So it doesn't matter much. But it was a cool thing to know.

  • @bart2019
    @bart2019 1 year ago

    It's not a "new formula"; it may just be a different way of looking at it.
    The top of the parabola is at x = -b/2a, and the distance of both roots from that top is sqrt(b^2 - 4ac)/2a, if it's real.
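Spelling out the observation above (my derivation, just completing the square):

```latex
ax^2 + bx + c \;=\; a\left(x + \frac{b}{2a}\right)^2 + c - \frac{b^2}{4a}
% The vertex sits at x_v = -b/(2a); setting the expression to zero gives
x \;=\; -\frac{b}{2a} \pm \sqrt{\frac{b^2 - 4ac}{4a^2}}
  \;=\; -\frac{b}{2a} \pm \frac{\sqrt{b^2 - 4ac}}{2a}
```

which is exactly the classic formula read as "vertex plus or minus the distance to the roots".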

  • @MrRyanroberson1
    @MrRyanroberson1 2 years ago +2

    4:23 already I see a problem. The +/- part (the square root) should be computed once and reused, not recomputed separately for each root

  • @mhcbon4606
    @mhcbon4606 2 years ago +1

    At 4:30, friend, the code in blue is unreadable unless I use 1080p.

  • @disasterarea9341
    @disasterarea9341 2 years ago

    hard to see the blue code around 4 minutes in, dark text on dark background :x