Branchless Programming: Why "If" is Sloowww... and what we can do about it!

  • Published: 14 May 2024
  • Support What's a Creel? on Patreon: / whatsacreel
    Office merch store: whats-a-creel-3.creator-sprin...
    FaceBook: / whatsacreel
    In this video we look at branchless programming. This is a technique to gain speed in our high- and low-level programming by avoiding branching code as much as possible.
    Software used to make this vid:
    Visual Studio 2019 Community: www.visualstudio.com/downloads/
    OBS: obsproject.com/
    Davinci Resolve 16: www.blackmagicdesign.com/prod...
    OpenOffice: www.openoffice.org/
    Gimp: www.gimp.org/

Comments • 3.3K

  • @jeremyseay
    @jeremyseay 3 years ago +8976

    I love this, but I'm reminded of Kernighan's law: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

    • @microcolonel
      @microcolonel 3 years ago +1059

      Brian misses the big-brain move here: just rewrite your code instead of debugging it.

    • @funkyflames7430
      @funkyflames7430 3 years ago +339

      Work twice as long. Checkmate.

    • @rickarmbruster8788
      @rickarmbruster8788 3 years ago +118

      @this senator Sometimes rewriting is a decent opportunity; you see that your brain varies in quality of output each time :D

    • @mr_confuse
      @mr_confuse 3 years ago +109

      @@rickarmbruster8788 I notice that in most of my projects, I do the basic structure, then add more to it, come back to the basic structure, see what a mess it is, Ctrl+A, delete, and rewrite it...

    • @rickarmbruster8788
      @rickarmbruster8788 3 years ago +53

      @@mr_confuse You will get the hang of nice design eventually ^^

  • @RyanTosh
    @RyanTosh 3 years ago +4107

    When a normal person says something is slow: 5-10 minutes, maybe an hour or two
    When a programmer says something is slow: 2 nanoseconds

    • @AntonioNoack
      @AntonioNoack 3 years ago +627

      Well, normal people wouldn't think about doing it a billion times ;)

    • @soblackismyself
      @soblackismyself 3 years ago +132

      When those instructions run many times throughout the code, that 2 ns starts to make a difference.

    • @spicybaguette7706
      @spicybaguette7706 3 years ago +457

      When you get a cache miss:
      This little manoeuvre is going to cost us 51 nanoseconds

    • @nyrahl593
      @nyrahl593 3 years ago +80

      @@spicybaguette7706 For an android that is an eternity...

    • @SamGarcia
      @SamGarcia 3 years ago +114

      Those build up... and then you have Internet Explorer

  • @Mackinstyle
    @Mackinstyle 2 years ago +1790

    I LOVE how you introduced the concept by showing a way to manually optimize and how it actually fails to be better than the straightforward if-based version. That's such an important point to share up front any time we talk about clever optimization.

    • @edwardmacnab354
      @edwardmacnab354 1 year ago +23

      except for his last example, which was light-years faster hand-coded

    • @damaomiX
      @damaomiX 1 year ago +22

      That just applies to C/C++. For most other languages that don't have these kinds of tricky optimizations, branchless is always faster (JIT languages have JIT optimization, but in real scenarios branch prediction misses a lot). Not to mention that in the video there are circumstances where the branchless version does outperform the branched version.

    • @ocoolwow
      @ocoolwow 9 months ago

      @@edwardmacnab354 oh, you poor fool

    • @go-away-5555
      @go-away-5555 9 months ago +11

      Branchless generally applies when you're talking about assembly. C/C++ compiles to assembly, but so do many other languages, and some languages like C# can be compiled to assembly even though typically they are compiled into a virtual-machine bytecode (using il2cpp and then a C++ compiler).
      For a purely interpreted language like Python, branch prediction and branchless code barely come into play for the developer. When you do any operation in Python, the actual Python runtime is almost always running dozens or hundreds of lines of C (compiled to assembly) that aren't related to your code in any way except for ensuring that you're requesting valid operations. And that itself usually requires tons of branching. If you add 2 numbers, the Python runtime treats those as full objects. First it checks that it's not operating on null objects, then it ensures the wrapped value is a number or a type with the addition operator defined. If it is a number, it will call the C function it has for addition. If it's not a number type, it will evaluate the PyFunctionObject representing the Python code of the function, which itself has like 100 lines of C just dedicated to passing the argument in properly as a Python variable. The most optimized way to write Python code is to minimize your time spent evaluating lines of Python code. When running a purely interpreted language like Python or Ruby, the actual operations you request are about 1% of the actual work going on in the CPU.
      For a JIT'd language like Java, JavaScript, or regular C#, it really depends on how good the bytecode compiler and the JIT are. There are 2 steps here that can potentially optimize your code: the first compilation step to bytecode can definitely do many of the same optimizations a C++ compiler can, and the JIT can help pick the correct CPU instruction that can handle several equivalent lines of a higher-level language (like the first function example). The Java compiler and JIT are quite good, sometimes better than current C or C++ compilers, at least in synthetic benchmarks of calculating prime numbers, sorting arrays, etc. repeatedly. For example, Java is capable of generating code that uses SIMD or AVX like in the final example in the video.
      Basically it doesn't work in a purely interpreted language because the runtime system itself has many branches. And for other languages, it depends on the code. If you're writing a branchless algorithm, it wouldn't hurt to benchmark it against the "naive" branching version of that algorithm.
      And benchmarking these system-specific optimizations may even produce different results on different CPU architectures. For example, some DSP devices have no branch prediction because they need to have strict guarantees on consistent realtime delivery. It's a slower architecture on average, but you can calculate a guaranteed number of clock cycles some sequence of instructions will take. A branch-predicting CPU can't make those same guarantees.

    • @asaskald
      @asaskald 7 months ago

      I'm sure he just LOVEs it too. We all LOVE it.

  • @randyscorner9434
    @randyscorner9434 9 months ago +636

    I've designed CPUs for over 30 years and we pay a LOT of attention to breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that things that seem to avoid branches... don't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat the compiler optimization that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such as in GPUs, but even then compilers are pretty smart. The most efficient code is when tasks are parallelized so that multiple operations can be done based off of one or a small number of decisions.
    In trying to avoid branches, the cost is paid in non-maintainable code, and it is very high. If you really want to improve efficiency, don't rely on multiple software packages which may actually include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, Perl, etc.) wastes much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required.
    In one example, we recoded a numeric-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive BeagleBone Black instead of a Mac. On the Mac it consumed 35% of the processor. On the BBB it used 15% of a much less powerful processor by vectorizing the operations. These are the sorts of changes that matter.

    • @kingacrisius
      @kingacrisius 9 months ago +36

      I mean, he literally looked at the disassembled code to show how the branching and branchless source affected whether or not the assembly was branching. In the first case, he showed that the compiler sometimes understands the purpose of code and makes great optimizations that already remove the branches, but he also showed how in other, more complex code, the branchless version was much faster than the branching version. So I really don't think he or the video missed this.

    • @lobovutare
      @lobovutare 9 months ago +42

      cmov is probably not branchless as claimed in the video. On the microarchitectural level it still needs to branch.

    • @BogdanBaudis
      @BogdanBaudis 9 months ago +43

      @@kingacrisius The thing is that looking at the assembly may or may not tell you what the CPU actually does. There are some cores (old ADI SHARCs, Microchip DSCs and a few others) that have such a predictable, deterministic pipeline that you can actually count the clocks, and if you couple one to SRAM, then you have the whole story.
      But modern IA64 or Cortex-A cores do so many things on so many levels that, without an intimate knowledge of what the core actually does, manual optimization is just a hopeless exercise. Silicon vendors want their wares to shine, so they contribute to compiler development heavily. Unless you are in the know in these matters, the best you can do is optimize for a particular platform, test it, and then freeze the SW AND HW design. Sometimes this is what is needed and acceptable. But if you think such code will survive on other platforms or on future revisions of a given platform, you may be in for a rude surprise.

    • @kingacrisius
      @kingacrisius 9 months ago +15

      @@BogdanBaudis Well, regardless, he timed it too. It is so clear when the branchless programming is better, because he timed it and shows how the assembly is faster. I seriously don't understand where all of this is coming from.

    • @kingacrisius
      @kingacrisius 9 months ago +3

      @@BogdanBaudis And obviously some code may not be portable, but he was coding C++, which is almost always portable. And that doesn't matter anyway for this discussion.

  • @barmetler
    @barmetler 3 years ago +2278

    So what we've learnt: if you write normal code and normal patterns that everybody knows, it will most likely be a pattern that the developers of gcc (or other compilers) thought of. That way you have a higher chance of the compiler doing good things.

    • @anandsuralkar2947
      @anandsuralkar2947 3 years ago +30

      Yup

    • @insomnia20422
      @insomnia20422 2 years ago +216

      Or when in doubt, just measure the performance and use the thing that's faster.

    • @QuantumConundrum
      @QuantumConundrum 2 years ago +80

      @@insomnia20422 Real truth. I tell people on my team this all the time. If it's not hard, actually try it and extrapolate. Rarely (but it happens) you'll get different behavior past a data-size threshold. And even then... usually you can infer (or predict) that this would happen.

    • @user-zn4pw5nk2v
      @user-zn4pw5nk2v 2 years ago +27

      All the code shown misses the fact that you could just remove bit 5 when in range, shaving 3+ additional instructions with a well-placed xor eax, 20h (stored in r8d instead of -1) (times the char's ID in the multi-data case), since you don't really care about the original data, or you would back it up (the more complex code still used sub 32 and some other data-shuffle junk). So sometimes it's better to assemble it yourself.

    • @RobBCactive
      @RobBCactive 2 years ago +15

      In the program I optimised, gcc did a good job of making the bottom-level if statements branchless, but that didn't stop my carefully optimised C from being 100x faster than the natural C++ implementation using the standard library.
      It was well worth increasing processing complexity.

  • @szymoniak75
    @szymoniak75 3 years ago +7900

    DISCLAIMER FOR ALL BEGINNER PROGRAMMERS!
    Do not attempt this at home! Unless you really, really have to. In most cases you won't outsmart your compiler. For high-level programming, you should focus on keeping your code clean and easy to read and maintain. Usually you'll notice speedup in your code by using the correct algorithms and data structures, not by doing micro-optimizations. And if speed is your concern - benchmark first! Do not try optimizing code that may not need it.

    • @alexreeve
      @alexreeve 3 years ago +268

      Well, doesn't the first example in the video show exactly that? I mean, that goes for all optimization methods: only use them when you have verified that you actually need them (and then verify again) :)

    • @davidm.johnston8994
      @davidm.johnston8994 3 years ago +92

      Thank you for saying that. That's what I was thinking. I was very intrigued by all of this additional complexity (from a readability standpoint)... wondering if I could really outsmart a compiler by doing these techniques "by hand".

    • @Merthalophor
      @Merthalophor 3 years ago +245

      Donald Knuth: premature optimization is the root of all evil.
      This technique is only valuable in highly specific situations. If you're not sure whether it applies to you, it doesn't.

    • @brianbarefootburns3521
      @brianbarefootburns3521 3 years ago +116

      This whole video is off limits for beginner programmers and comedy for experienced programmers. That piece of paper under his wired USB mouse is another telling sign.

    • @LeTonyJr1
      @LeTonyJr1 3 years ago +82

      Yes, this. How many times the code needs to run is a big consideration. If a function needs to loop through a dataset that numbers in the billions, it's a good sign that micro-improvements like these might be of interest.

  • @kylek.3689
    @kylek.3689 2 years ago +441

    Cryptographic libraries also use branchless programming a lot. This is to prevent timing side-channel attacks, where the timing of an operation can provide a hint as to what was encrypted.
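
    A minimal sketch of the idea (my illustration, not from the video): a constant-time buffer compare touches every byte no matter where a mismatch occurs, so the running time reveals nothing about where the data differs.

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Constant-time comparison: no early exit, so timing does not reveal
    // the position of the first differing byte. Differences are OR-ed
    // into an accumulator instead of triggering a branch.
    bool equal_constant_time(const uint8_t* a, const uint8_t* b, size_t n) {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; ++i) {
            diff |= static_cast<uint8_t>(a[i] ^ b[i]);
        }
        return diff == 0;
    }
    ```

    Real libraries ship hardened versions of this pattern (for example OpenSSL's CRYPTO_memcmp); the sketch only illustrates why such a loop must not break early.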

  • @frogge6443
    @frogge6443 3 years ago +184

    Your mousepad is a masterpiece.

    • @robertoaguiar6230
      @robertoaguiar6230 2 years ago +10

      High albedo for those max-DPI specs. Don't tell gamers, or the prices will blast up.

    • @evanmagill9114
      @evanmagill9114 2 years ago +2

      @@robertoaguiar6230 I hadn't thought about the albedo consideration there... I've seen a lot of very dark mousepads... Hmmm.
      But I suppose it would be albedo in the specific range the mouse actually uses. Infrared?

    • @kaanozk
      @kaanozk 3 months ago

      Don't look at the finger pointing towards the moon or you will miss the heavenly glory

  • @Caesim9
    @Caesim9 3 years ago +3513

    I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you use conditional statements, sometimes your code runs faster and sometimes slower (specific branches just are faster/slower), and an attacker can get info on your private key. So all cryptographically sound functions must use branchless programming.
    Very interesting topic

    • @Photo-Jay
      @Photo-Jay 3 years ago +9

      Do they use branchless for the most part? Or are they in violation of this?

    • @georgionic
      @georgionic 3 years ago +146

      They could use constant-time operations

    • @flyingsayon
      @flyingsayon 3 years ago +25

      Why would you do that? Can't you use a timer to achieve the same result (not disclosing the computation time)?

    • @thenameipicked
      @thenameipicked 3 years ago +115

      @sayon It is very difficult to actually accomplish this in practice. If your delay is a random value, then you can repeat your request multiple times and use the average time. Delaying so your total time is fixed is really difficult: you have to know in advance the maximum amount of time your algorithm could possibly take, and even with that knowledge, wait() is complicated and imprecise, and it is definitely possible that what you did before calling wait() will still affect the total time.

    • @flyingsayon
      @flyingsayon 3 years ago +11

      @@thenameipicked I am talking about a fixed delay, yes, not a random value. Besides, how are you hoping to achieve fixed time by just balancing the computations when the process can be interrupted at any time? We usually have preemptive multitasking in our systems.

  • @MultiMrAsd
    @MultiMrAsd 3 years ago +1412

    I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU rather than the CPU. GPUs have multiple cores that share an instruction pipeline, where each core runs the same instruction but on different data. That often leads to both sides of a branch being executed and then one being discarded. I have seen performance improvements above 100x by going branchless!

    • @peterbonnema8913
      @peterbonnema8913 3 years ago +40

      Didn't know that. Thanks. Good point.

    • @nistarok123
      @nistarok123 3 years ago +54

      So, vertex/fragment shaders for OpenGL, for example?

    • @Luxalpa
      @Luxalpa 3 years ago +169

      @@nistarok123 Yes, exactly. Avoid branches in GPU code as much as possible. In fact I think there are some shader languages that don't even allow the if statement.

    • @Kino-Imsureq
      @Kino-Imsureq 3 years ago

      @@JeredDanielson me too ;)

    • @scottmichaud
      @scottmichaud 3 years ago +59

      Note: GPUs these days won't take the performance hit if all threads in a warp (NVIDIA), wavefront (AMD), etc. (others) go down the same branch... whichever branch that is. So let's say you have a complicated material depending on whether or not you're shadowed -- you will only traverse both paths if your warp/wavefront corresponds to a block of pixels that overlaps an edge of the shadow. If the block is entirely shadowed or entirely illuminated, then it will only run the shadowed or unshadowed path, respectively. This is 32 threads on NVIDIA, 64 on AMD... although they do reserve the right to change that.

  • @vanlepthien6768
    @vanlepthien6768 1 year ago +469

    As a long-time software developer: in virtually every case, code maintainability is far more important than speed.
    I've written programs that processed millions of transactions - and branching was the least of the performance concerns. Paging and I/O are typically much more of a problem. Except in toy problems, memory management will be a much larger concern.
    Introducing repeatedly executed multiplication is generally not good for performance.
    "And"-ing 01011111 with ASCII lowercase converts to uppercase just as well as subtraction does, and to my mind is much clearer.
    Unless there is a genuine need, don't make your code ugly to eke out a few cycles.

    • @henrikskott
      @henrikskott 10 months ago +27

      I agree, except I'm convinced we all need to learn this the hard way. I haven't met any developer who did not try to be clever in the first 20 years of their career. It took me 30 years to start understanding this myself, but I need to really try all the fun things before I do it the hard way.

    • @gorak9000
      @gorak9000 10 months ago +83

      @@henrikskott I had a CS prof once tell a story of how he spent so much time trying to optimize the code as he wrote it that it took him weeks to write, and when he ran it for the first time, the performance was terrible. He ran it through a profiler, and the part he had optimized took virtually no time to execute anyway; it was some other part he hadn't even thought of that was taking all of the time. His message was: "Just write the code sensibly the first time, then run it, and only then, if there are performance issues, do you optimize anything, once you know exactly what part needs to be optimized." That's definitely a winning strategy! Don't waste time optimizing things that don't matter anyway.

    • @jacobzimmerman3492
      @jacobzimmerman3492 10 months ago +7

      Once you understand the cache hierarchy, you realize a heap access can be equivalent to MANY branches.

    • @akenteva
      @akenteva 10 months ago +8

      Good point. I found the video interesting and informative, but part of me was wondering "in what context would these techniques genuinely be useful?". While I'm sure there are things out there that require this level of optimization, typically it makes more sense to prioritize readability.
      Also, great idea with the AND-ing! Very elegant way to solve it, at least in the case where the string only has letters. If non-letter symbols are present, though, you'd need to add more logic to handle that.

    • @Josus
      @Josus 10 months ago +2

      It could be even better to implement/improve this kind of optimization in the compiler itself; this way you can focus on the elegant code and gain a boost in maintainability while the compiler does the hard job of these very low-level optimizations.
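
    The AND-ing trick discussed above, sketched in C++ (my illustration; the second variant adds the range check for non-letter bytes that the replies mention):

    ```cpp
    // ASCII 'a'..'z' differ from 'A'..'Z' only in bit 5 (0x20).
    // Clearing that bit with AND 0x5F (binary 01011111) uppercases a
    // letter with no branch and no subtraction.
    char to_upper_letter(char c) {
        return static_cast<char>(c & 0x5F);  // assumes c is an ASCII letter
    }

    // Range-checked branchless variant for arbitrary bytes: the comparison
    // yields 0 or 1, which is shifted into bit-5 position and used as a
    // clear mask, so non-letters pass through unchanged.
    char to_upper_ascii(char c) {
        unsigned char u = static_cast<unsigned char>(c);
        unsigned char is_lower = (u >= 'a') & (u <= 'z');  // 1 or 0
        return static_cast<char>(u & ~(is_lower << 5));    // clears 0x20 iff lowercase
    }
    ```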

  • @RFC-3514
    @RFC-3514 2 years ago +447

    For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong. And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P

    • @Theineluctable_SOME_CANT
      @Theineluctable_SOME_CANT 2 years ago +5

      Yep!

    • @WorBlux
      @WorBlux 2 years ago +41

      Multiply by 1 or 0 and you get an early exit on modern ALUs.
      With pipelining and SIMD, you come out ahead if you stack a large enough vector up: 8 ops/cycle + a constant 3-cycle multiply penalty. If you have 800 elements, you are basically doing that multiply in 1/8 of a cycle. So even with 2 multiplies and an add, it's still over twice as fast before any branch time or penalties are involved.
      And then AVX-512 and SVE2/Cray vectors have predicated SIMD ops where you can create a predicate mask and then conditionally perform a vector op: compare V1 and V2 (less-than) to create a mask, then copy V1 into V2 using the mask to predicate each lane. The element from V1 is only copied if the LT mask is true.

    • @Theineluctable_SOME_CANT
      @Theineluctable_SOME_CANT 2 years ago +6

      @@WorBlux Can I hire you for my next big programming job?

    • @WorBlux
      @WorBlux 2 years ago +28

      @@Theineluctable_SOME_CANT Keeping myself informed about CPU microarchitecture and programming is a hobby of mine. Actual programming, not so much.
      Being outside and producing something I can touch is more rewarding for me.

    • @Theineluctable_SOME_CANT
      @Theineluctable_SOME_CANT 2 years ago +7

      @@WorBlux Yes, for me also.
      Machine microarchitecture is quite interesting to me.
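
    For reference, the multiply-by-a-boolean selection being discussed, next to a mask-based alternative that trades the multiplies for AND/OR (a sketch of the general technique, not code from the video):

    ```cpp
    // Branchless max via multiplication: each comparison yields 0 or 1,
    // so exactly one operand survives the sum (ties fall through to b).
    int max_mul(int a, int b) {
        return a * (a > b) + b * (b >= a);
    }

    // Mask-based alternative with no multiplies: negating the 0-or-1
    // comparison result gives an all-zeros or all-ones mask, which then
    // selects one operand per bit via AND/OR.
    int max_mask(int a, int b) {
        int mask = -(a > b);              // 0x00000000 or 0xFFFFFFFF
        return (a & mask) | (b & ~mask);
    }
    ```

    As several comments note, compilers frequently turn the plain ternary `(a > b) ? a : b` into a cmov or a SIMD max anyway, so these forms are mainly useful for understanding what the hardware is being asked to do.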

  • @christianbarnay2499
    @christianbarnay2499 3 years ago +1728

    05:00 The lesson here: never over-optimize your code beforehand. Start by writing natural and simple code. Let the compiler do its tricks. Check the result and only look for optimizations in the parts that the compiler didn't already optimize.

    • @tetramaximum
      @tetramaximum 3 years ago +118

      I think the lesson is: know your hardware.

    • @scottmichaud
      @scottmichaud 3 years ago +181

      @@tetramaximum Both are valuable lessons. Another one is "know your data". Imagine a super-inefficient O(n!)-scaling algorithm. If you can prove that your data must always be n

    • @tetramaximum
      @tetramaximum 3 years ago +8

      @@scottmichaud agree

    • @SleepyMongoose
      @SleepyMongoose 3 years ago +37

      Very true: the more generic your code, the more likely the compiler will understand what you are doing and perform optimization.

    • @scottmichaud
      @scottmichaud 3 years ago +20

      @@SleepyMongoose For the things that it can. It's your job when it comes to effectively using overall data structures, etc.

  • @programaths
    @programaths 3 years ago +1448

    Shaders. Branchless techniques are mandatory in those!

    • @plazmotech5969
      @plazmotech5969 3 years ago +228

      When I first heard people saying "avoid branching in shaders" I had no idea what they were talking about. "How could you ever possibly avoid branching???" I thought. Now I see how. Thanks!

    • @galandilvogler8577
      @galandilvogler8577 3 years ago +214

      It must be said, though, that even shader compilers are getting better and better, and you can use simple ifs which don't end up as branching.
      For example, if you write:
      if(x>=0.8) {y=x} else {y=0}
      one might be tempted to optimize by writing:
      y = x * step(0.8,x)
      And yet, at least in HLSL, the compiler creates the same low-level code for both.

    • @plazmotech5969
      @plazmotech5969 3 years ago +31

      @@galandilvogler8577 Good to know! But still, best to make sure...

    • @clepirelli
      @clepirelli 3 years ago +123

      This is a common misconception. IIRC, uniform branching is 100% fine; the only problem with shader branching is that if a heavy branch is taken by even one invocation, the whole warp will stall. This is problematic, but it does not mean branchless techniques are better, as branchless techniques would just calculate both branches, meaning you'd actually have the whole warp doing those heavy calculations.
      See shadertoy (dot) com/view/wlsGDl and twitter (dot) com/bgolus/status/1235254923819802626

    • @plazmotech5969
      @plazmotech5969 3 years ago +5

      @@clepirelli Neat! Thank you
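
    A scalar C++ analogue of the step() rewrite above, for readers who don't write shaders (my sketch; in GLSL/HLSL, step(edge, x) is 0.0 below the edge and 1.0 at or above it):

    ```cpp
    // Scalar analogue of the shader step() function: the comparison
    // converts to exactly 0.0f or 1.0f, with no branch in the source.
    float step_fn(float edge, float x) {
        return static_cast<float>(x >= edge);
    }

    // y = x if x >= 0.8, else 0 -- the branchless form from the comment above.
    float threshold(float x) {
        return x * step_fn(0.8f, x);
    }
    ```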

  • @connoisseurofcookies2047
    @connoisseurofcookies2047 3 years ago +191

    I remember Bjarne Stroustrup saying something to the effect of "the compiler will write better assembly than a human ever could", and in most cases, especially larger ones, that holds true.

    • @christopherjoseph651
      @christopherjoseph651 3 years ago +43

      The compiler can only write assembly as well as the person who wrote the compiler can write assembly. Your statement is mostly true for people who don't understand assembly and the way CPUs actually work.

    • @CottidaeSEA
      @CottidaeSEA 3 years ago +34

      @@christopherjoseph651 Let any of the people who made the compiler try to write the same program in assembly as in C++ and see which one performs faster and with fewer bugs. It doesn't matter if you understand a CPU; the assembly code from the C++ compiler will be far better than any manually written assembly. Because the people who make the compiler only do translations and optimizations, they do not write entire systems with it. Because that's stupid.

    • @dionyzus2909
      @dionyzus2909 3 years ago +11

      @@christopherjoseph651 Not really; at the largest scales this statement holds absolutely true. There are limitations to what humans can do; a large project in pure assembly would be just plain impossible for a human to do, it's unrealistic.

    • @christopherjoseph651
      @christopherjoseph651 2 years ago +27

      @@CottidaeSEA Did you not watch the entire video? He proved that a human who understands the instruction set of the CPU and the purpose of the high-level code can fully utilize the instruction set and hardware of the CPU. He literally beat the compiler by 45x! If the compiler is so superior, why didn't it write the code using AVX? Exactly: a human wrote the rules that the translations and optimizations follow. So a compiler is only going to write code as well as the person who wrote the compiler. This has nothing to do with program scale. The point is a human wrote the compiler, so the compiler can only write optimized code as well as the person who wrote it. If that person doesn't realize how to perform an optimized task in assembly, then the compiler won't be able to optimize the code either. What, do you think we're living in a world of Skynet where computers build more sophisticated computers and write their own programs?

    • @christopherjoseph651
      @christopherjoseph651 2 years ago +13

      @@dionyzus2909 This is a childish statement written by someone who believes in the new education model in which older languages are outdated and inferior and therefore shouldn't be taught. We don't need new, more complex languages to solve new and more complicated problems. To this day, when a program needs to be optimized for speed, it is written by a human using inline assembly. Even the guy who made this video understands that you should look at your disassembled code and be capable of understanding what the compiler is doing. Do you need to write a large program that is not time- or resource-critical in assembly? No, there's no need for that, but the statement that a compiler will write better assembly than a human ever could is COMPLETELY FALSE!

  • @josiahmanson
    @josiahmanson 3 years ago +291

    A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.

    • @IceExtremeGamers
      @IceExtremeGamers 3 years ago +4

      Doubt it will work with non-ASCII characters, though.

    • @Henrix1998
      @Henrix1998 3 years ago +128

      @@IceExtremeGamers None of this works for non-ASCII.

    • @IceExtremeGamers
      @IceExtremeGamers 3 years ago +5

      @@Henrix1998 sadly

    • @josiahmanson
      @josiahmanson 3 years ago +43

      @@IceExtremeGamers His whole discussion assumed ASCII as a toy example. Yes, the small lookup table will not work for general multi-byte Unicode (especially variable-length encodings like UTF-8), but Unicode is more complex, and none of the other stuff described in the video, like SIMD, would work either. For 8-bit character encodings, the table is actually far more general than anything based on conditionals such as conditional moves. The table approach would also be usable for 16-bit character encodings, as 64 KiB is not a lot of memory nowadays, and the parts of the table that would actually be used in practice would almost certainly fit in L1D$.

    • @igorstasenko9183
      @igorstasenko9183 3 years ago +9

      @@josiahmanson Yeah, I was waiting for the author to present a final version of the algorithm using lookup tables... but the video ended :)
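
    The 256-entry table idea can be sketched like this (an illustration of the technique, not the actual standard-library implementation; requires C++17 for the constexpr table build):

    ```cpp
    #include <array>

    // Build the table once at compile time: 'a'..'z' map to 'A'..'Z',
    // every other byte maps to itself.
    constexpr std::array<unsigned char, 256> make_upper_table() {
        std::array<unsigned char, 256> t{};
        for (int i = 0; i < 256; ++i) {
            t[i] = (i >= 'a' && i <= 'z') ? static_cast<unsigned char>(i - 32)
                                          : static_cast<unsigned char>(i);
        }
        return t;
    }

    constexpr auto kUpper = make_upper_table();

    // Conversion is a single indexed load: no comparisons at run time.
    char to_upper_lut(char c) {
        return static_cast<char>(kUpper[static_cast<unsigned char>(c)]);
    }
    ```

    The conditional lives only in the table construction; at run time, every byte value takes the same path, which is what makes the lookup branchless.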

  • @rorytulip9343
    @rorytulip9343 3 года назад +659

    Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.

    • @andrewsprojectsinnovations6352
      @andrewsprojectsinnovations6352 3 года назад +69

      "I can teach a monkey to do what you do. What I can't teach the monkey is when NOT to do it." - one of my father's chiropractic professors
      But yeah. Branchless programming is just one of many tools in our arsenal, not a silver bullet. Having a hammer doesn't make every problem a nail. Same here.

    • @aviralsood8141
      @aviralsood8141 3 года назад +23

      @@andrewsprojectsinnovations6352 Maybe he should have taught your father not to become a chiropractic, much more helpful.

    • @maxineyoung3236
      @maxineyoung3236 2 years ago +6

      As a student & beginner, this is certainly correct. Was ready and excited to lmao.

  • @MM-24
    @MM-24 3 years ago +973

    I want to buy this guy a mouse pad and maybe a second monitor...

    • @Bobby.Kristensen
      @Bobby.Kristensen 3 years ago +47

      He seems to be stuck in the 90s.

    • @stevecarter8810
      @stevecarter8810 3 years ago +26

      Nobody's stopping you

    • @lour.222
      @lour.222 3 years ago +38

      Goes to show it's what you know, not what you own.

    • @MM-24
      @MM-24 3 years ago +3

      @@stevecarter8810 no they aren't, that's good to know

    • @MM-24
      @MM-24 3 years ago +15

      @@lour.222 I think its a combination, good tools in the hands of a craftsmen sort of thing

  • @bandogbone3265
    @bandogbone3265 3 years ago +258

    I was doing this back in the late 1970's, on a Cray-1 supercomputer with pipelined vector registers. Complex single-instruction operations were broken down into a sequence of simpler operations, each performed by a separate functional subunit. After one clock cycle, the first functional subunit was free to begin operating on the second element on the vector, while the second functional subunit began operating on the intermediate result in the first element of the vector that was left there by the first functional subunit. Complex operations like floating point division could then produce a result every clock cycle, even though the operation could take a total of six or more clock cycles to perform on a single operand beginning-to-end. Like stations on a car assembly line, the game was to keep as many functional subunits busy as possible at any given time. We would carefully map out each tick of each component of the CPU on a big chart before implementing in assembler. The thing was blazing fast, even by today's standards. Nice vid!

    • @duanebonas8630
      @duanebonas8630 2 years ago +9

      Class explanation. I love it. You know what you're going on about.

    • @spx0
      @spx0 2 years ago +4

      Great

    • @treelibrarian7618
      @treelibrarian7618 10 months ago +7

      this is still the game, with 5 logic and 5 memory execution ports available for out-of-order processing (execution is determined by data dependency order), writing code to keep all of them busy with chains of non-dependent code in each loop to mitigate latency over iteration - although division is one thing that isn't pipelined. I guess division is rare enough that it's not worth the extra silicon.

    • @danielstory2761
      @danielstory2761 10 months ago +1

      The iPhone 6 is capable of 20 times more FLOPS than the Cray-1

    • @GregMoress
      @GregMoress 9 months ago +4

      That's called 'Pipelining', and I'm sure you knew that. It's a form of parallel processing.

  • @unformedvoid2223
    @unformedvoid2223 2 years ago +97

    It's useful when you do shader programming, where branching is very expensive. I had to "invent" some of those techniques myself when I did shader programming. Great video!

    • @SeanCMonahan
      @SeanCMonahan 10 months ago +2

      Same! It was an interesting experience refactoring some branched OpenCL code to use algebraic solutions to replace the branches.

  • @brandonchurch6025
    @brandonchurch6025 3 years ago +48

    This is really good fundamental information for some of us "spoiled" high-level programmers who don't typically think beyond compilation

    • @ponypapa6785
      @ponypapa6785 3 years ago +13

      Agreed, but highly dangerous and misleading information for all the others, as a disclaimer that this is basically "experts only" is missing.
      And seeing what inexperienced programmers try to accomplish, this can lead to a lot of technical debt, as it will be misunderstood, badly used, and then mixed with branching techniques. So yeah, fun :D

  • @khhnator
    @khhnator 3 years ago +611

    I'm surprised by the amount of negative comments; he is not telling people to always code like this. There is a difference between time-critical code and not.
    Get the profiler out and find where your program is spending most of its time, and THEN consider doing this or not.
    The only negative thing about optimization is that it costs development time; the idea that "the computers will be fast enough" is just naive.
    Effective optimization can be the difference between a product that reaches some of the market and one that reaches most of the market, or between a program that can be extended for decades to come and one that needs to be redone in the next few years.

    • @PhilippeCarphin
      @PhilippeCarphin 3 years ago +21

      @kingofallcrypto Yeah, I can definitely imagine seeing stuff like this in first-year students code. In one project, they had to keep track of 16 boolean values on a microcontroller that had way more than enough memory for what we were doing. They wanted to encode them in the 16 bits of a uint16_t (16 bit integer type).
      It's natural for them to think that it's a good idea, since we did tell them to use uint8_t over uint16_t when the number they needed to store only went from 0 to 10. But the reason we told them that was mainly as an excuse to talk about architecture, explain that the microcontroller had an 8-bit bus, and so on.
      It did give me a chance to have a good discussion about balancing programmer effort with execution speed. In the uint8_t vs uint16_t case, there are no downsides and there is a theoretical benefit in speed. Storing 16 booleans in a uint16_t, on the other hand, is going to be a complete pain in the a** when storing and retrieving the actual booleans (even ignoring that the memory saving has a runtime cost), we're introducing the possibility for errors, and we had more than enough memory by a wide margin for that project.

    • @dincerekin
      @dincerekin 3 years ago

      time critical code should not be run on a micro

    • @iVTECInside
      @iVTECInside 3 years ago +5

      @kingofallcrypto , I would hope that it would be obvious to anyone that ended up at this video.

    • @wusulus
      @wusulus 3 years ago +8

      You try to make it seem like it is worth it every time, while it certainly isn't. This is the kind of thing you would use if you had used every other trick in the book and needed it for some one-in-a-million project.

    • @webkar
      @webkar 3 years ago +18

      @@dincerekin Think about what you said next time when your ESP/ABS saves you from skidding into a tree.

  • @youuuuuuuuuuutube
    @youuuuuuuuuuutube 2 years ago +51

    Branchless works great for GPU programming (GLSL etc). It's actually worth complexifying the algorithm just to minimize the branching (but not always).

    • @jcm2606
      @jcm2606 1 year ago +5

      Really depends on whether the branch is divergent or not. Modern GPUs have some level of branch prediction built in and will try their best to only take one path through the branch while leaving the other path alone, so if you know for a fact that the branch is very convergent (ie it's using a uniform/constant value, or it's based on data that's convergent in nature), then there's nothing wrong with branching (except on ancient GPUs). However, if the branch is very divergent (ie it's based on data that's divergent in nature), *then* you should try to implement the same logic in a branchless manner, as the GPU is forced to take *both* paths through the branch, with half of the warp being disabled through each branch.

  • @Vacuon
    @Vacuon 2 years ago +50

    "If" is not *necessarily* slow: if your condition goes the same way 98% of the time, your branch predictor will be very accurate, and the pipeline will rarely be flushed.

    • @musaran2
      @musaran2 9 months ago +2

      Even that is not free.
      Branch predictors tend to have limited memory; relying on it in one place hurts branches elsewhere.
      But then cache is limited too, so smaller code is often better.

    • @iXPilot
      @iXPilot 8 months ago

      Funny that I got a recommendation for this video after watching a few presentations about the Elbrus CPU, which doesn't have a branch predictor at all. Now I understand what kind of hoops the developers of its compiler and the porters of the JVM had to jump through...

  • @StructOfArrays
    @StructOfArrays 3 years ago +170

    For whatever it's worth... Smaller_Branchless produces assembly identical to the branched version when using Clang 10 and compiling with -O2.
    In GCC 8.3, Smaller_Branchless is flat-out worse than the branched version, whose branches have been optimized out by the compiler.
    In any case, for the branched versions the awesome compiler optimized away the branches.
    So for the noobs out there: if you're doing something like this, always check the compiler output and measure, measure, measure.

    • @Guztav1337
      @Guztav1337 3 years ago +32

      "In 99% of cases the compiler is smarter."

    • @Henrik_Holst
      @Henrik_Holst 3 years ago +7

      Not only that, but Smaller_Branchless is not actually branchless; conditionals are just hidden branches. Writing a * (a < b) still forces the compiler to add a test and jump, a.k.a. a branch, to evaluate the a < b comparison.

    • @0LoneTech
      @0LoneTech 3 years ago +19

      @@Henrik_Holst No, branches are distinct from conditionals. In this case, we see it transferring a condition flag into a general purpose register value using SETc instructions, because x86 has a condition flag register; not all architectures do. Conditional instructions like x86 CMOV or AVR SBRS operate by cancelling the retire step of a single instruction, which doesn't divert the instruction decoding; that's a cheaper form of branching. What you say would apply to architectures where the condition test is embedded only within branch instructions, as if you removed the CMP* instructions and had to use B* instructions in Nios II. That would be a typical case where a compiler might have a different optimization to avoid the issue, like using the sign bit from a subtraction.

    • @Guztav1337
      @Guztav1337 3 years ago +6

      @Isaac Beizsley It, indeed, won't turn your O(n^2) into O(n). But it will fix your branches better than most people in most cases (the example OP made).
      It indeed won't do any SIMD. However, if you are prepared to go to assembly just to speed up the code, it is fair to assume you know what you are doing, like in the video, where he does indeed outsmart the compiler.

    • @Guztav1337
      @Guztav1337 3 years ago

      @Isaac Beizsley Yeah that's annoying. I want to say that the world would be a little bit greener if it wasn't for all bad code. lol

  • @krytharn
    @krytharn 3 years ago +67

    Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.

    • @abhinoorsinghpannu2993
      @abhinoorsinghpannu2993 3 years ago +1

      Same goes for CUDA programming

    • @WhatsACreel
      @WhatsACreel  3 years ago +18

      So true! GPU's hate branching maybe even more than CPU's! They call the conditional execution 'predicate' in CUDA.

    • @dakrontu
      @dakrontu 3 years ago

      Yeah but that inefficiency of computing both branches and discarding one is the passport to being able to run the many parallel streams so you still gain.

  • @Zeuts85
    @Zeuts85 2 years ago +28

    This sort of optimization seems like what you'd do only for high performance real-time applications--like video games for instance--and only then as a third-pass after you'd first written the code in much more human-readable format. My coding bootcamp instructors drilled into me the mantra: First make it correct, then make it clean, then make it fast. Ideally by the time you reach 'make it clean' you've got good tests that pin down the functionality of the code before you refactor. After that you can come up with fast, clever implementations that still pass the tests. You shouldn't ever need to debug the fast clever implementations if you've restricted yourself to good, pure functions and written your tests correctly.

    • @saudude2174
      @saudude2174 9 months ago +4

      bootcamp instructors prepare people for no brain web dev jobs. This is something you will never use in such a work environment. It's mostly for advanced, low level programmers.

  • @insomnia20422
    @insomnia20422 2 years ago +5

    Boss: "What did you do all day?"
    Me: "Well, optimizing some code, you know..."

  • @psun256
    @psun256 3 years ago +194

    I never really thought about CPU caching and "predictions" affecting conditional statements, this is super cool!

    • @konstantinkh
      @konstantinkh 3 years ago +24

      When you get into serious depths of optimization, you start looking at things like branch predictability. Like, there are situations when you choose to write a less optimized version of an algorithm because it has more consistent depth, and that means the loops are going to repeat a similar number of times, making branches on them more predictable, resulting in faster execution.

    • @psun256
      @psun256 3 years ago +2

      @@konstantinkh So things like for loops?

    • @konstantinkh
      @konstantinkh 3 years ago +26

      @@psun256 Yeah. Because on assembly level, every loop is just a branch that sends the execution to beginning of the loop. So the CPU will hit the same branch over and over again as it's running the loop. And modern CPUs have very clever ways to get a heuristic for these branches. Like, if you have a very tight loop in some algorithm that always runs for, say, 10 iterations, your CPU can actually learn to predict that every 10th time it's not going to jump and start caching instructions from the correct branch. At that point, it's like the branch isn't even there! If the number of iterations varies, then the prediction is worse. Best your CPU can do in these cases is "usually, jump is taken," so it will always cache instructions for repeating the loop, and will miss and have to flush the pipeline whenever the loop finishes, which is still better than not having prediction at all, but not optimal. That's one of the reasons why an algorithm with consistent iteration counts can sometimes perform better than algorithm where iteration count varies. Of course, there are also optimizations your compiler can do if it can "guess" the number of iterations, which can make even bigger difference. We are getting into territory of optimizations where you _really_ need to have a good reason for going after, though, as you'll spend a lot of time testing, timing, and looking at disassembly, and you shouldn't be doing that until you've fixed all other timing problems. The only real world example I have where we've spent time going after this kind of stuff is character animation in games. When you have hundreds of bones in hundreds of skeletons for which you have to update the transform matrices 60+ times per second while leaving enough time for CPU to do everything else it needs to, you start looking at things like cache and branching prediction optimizations.

    • @psun256
      @psun256 3 years ago +3

      @@konstantinkh Huh, that's pretty neat. Thanks for sharing!

    • @A.Martin
      @A.Martin 3 years ago +4

      @@konstantinkh It would be great if you could actually tell your cpu which branch to load by default.

  • @willofirony
    @willofirony 3 years ago +360

    Awesome!
    Branchless programming would be even faster if the CPU had an instruction that filled all the bits of a register with the zero flag.
    These "computer science approach to C++ programming" videos are getting really good and I suspect are unique on youTube. You are improving (possibly, even saving) our mental health with these conceptual challenges and I, for one, am so grateful.

    • @WhatsACreel
      @WhatsACreel  3 years ago +40

      Thanks mate! I certainly wish the instruction set had a few things that are missing too! Cheers for watching :)

    • @vyor8837
      @vyor8837 3 years ago +6

      @@WhatsACreel could ask intel and AMD to add that...

    • @halbgefressen9768
      @halbgefressen9768 3 years ago +8

      setz al
      ror eax, 1
      sar eax, 31

    • @CarrotCakeMake
      @CarrotCakeMake 3 years ago +10

      Nothing in this video had anything to do with computer science. It was all computer architecture.

    • @RAMIdotGG
      @RAMIdotGG 3 years ago +5

      every community needs a Michael Keegan :) refreshing to read a wholesome honest opinion

  • @tolkienfan1972
    @tolkienfan1972 2 years ago +41

    1. Make sure your branch targets are aligned to 16 byte boundaries. 2. If you're comparing to upper AND lower bounds, (whether masking or branching) you can subtract the lower bound and do a single compare instead. 3. When writing assembly pay attention to instruction dependencies, and interleave independent instructions. E.g. load 1st char, load 2nd char, compare 1st, gen mask, compare 2nd char, and so on. 4. You can often apply these techniques directly in C++, possibly with intrinsics. ... Just my 2c

  • @sylviaelse5086
    @sylviaelse5086 3 years ago +44

    In some architectures, multiplication takes more cycles to execute than less demanding arithmetic. In the first example if a < b usually comes out to be the same (whether true or false), then branch prediction will avoid the delays inherent in "if", and the result could be faster than using multiplication.
    The architecture may even be incapable of calculating the true/false result other than by the use of branches.
    In any case, unless the function is called intensively, the benefits, or otherwise, of the branchless approach will be invisible.

    • @notgabby604
      @notgabby604 10 months ago +1

      You can get 4x or 8x the speed for some numeric code using assembly (e.g. fast Walsh-Hadamard transform, neural network code). Sometimes the compiler can find better assembly than you can. If you need performant code, try both ways.
      GCC found much better code than me for this in FreeBASIC:
      sub subrandom1(result as single ptr, x as single ptr, n as ulongint)
          dim as ulong ptr resultptr = cast(ulong ptr, result)
          dim as ulong ptr xptr = cast(ulong ptr, x)
          dim as ulong r = &hE8FE90EB, rs, shift = 18
          for i as ulongint = 0 to n-1 step 4
              rs += r
              resultptr[i] = xptr[i] xor ((rs xor (rs shl shift)) and &h80000000)
              rs += r
              resultptr[i+1] = xptr[i+1] xor ((rs xor (rs shl shift)) and &h80000000)
              rs += r
              resultptr[i+2] = xptr[i+2] xor ((rs xor (rs shl shift)) and &h80000000)
              rs += r
              resultptr[i+3] = xptr[i+3] xor ((rs xor (rs shl shift)) and &h80000000)
          next
      end sub

  • @C5FlyingSquad
    @C5FlyingSquad 3 years ago +63

    I think more people need to see your videos. Although they're specialised, they feel valuable for non-systems programmers like myself.

    • @gangstasteve5753
      @gangstasteve5753 3 years ago +14

      this is a very underrated youtube channel. i learned assembly from him.

  • @donjindra
    @donjindra 3 years ago +15

    In time-critical code know your cpu. I once rewrote a graphics library in Z80 assembler and sped it up 10x by unrolling loops and counting cpu cycles for every bit of code. There are all sorts of tricks you can use to improve speed as any good programmer knows.

    • @donjindra
      @donjindra 2 years ago

      @jshowa o No, tools of the trade don't necessarily make one a good programmer. But pretty much only good programmers know these things. And there is nothing unmaintainable in these designs. Again, one just has to be wise about implementation. But that's the case in all languages, including those that advertise themselves as nearly fool-proof. No language prevents or even hinders unmaintainable code.

    • @donjindra
      @donjindra 2 years ago +1

      @jshowa o Yes, I watched the video. I also have 40 years of experience. You repeat what I've heard all too often. It's true compilers are very good at many things. It does not follow that they are best at everything. Experience has conclusively demonstrated that compiled code is not always the quickest code or the smallest code and there are still cases where maximum efficiency is required.

  • @rongarza9488
    @rongarza9488 2 years ago +12

    Early on in my programming career I noticed that there was usually a relationship between speed and instructions -- time and space. Plus, you can throw in the time it will take to debug clever code.

  • @valtergomes5808
    @valtergomes5808 2 years ago +16

    In my opinion, the main reason to avoid overly branchy code is readability and the time it takes to maintain. But for that, clean code with well-designed patterns will save you a bunch of 'ifs', which automatically makes your code easier to understand and easier for the compiler to optimize. Branchless programming the way it is presented here has a proper place and time; if you need it, you will know.

  • @volchonokilliR
    @volchonokilliR 3 years ago +310

    There's [[likely]] and [[unlikely]] attributes in C++20, they can help in some cases

    • @vyor8837
      @vyor8837 3 years ago +37

      And that's why you update your compiler and SDK...

    • @neur303
      @neur303 3 years ago +4

      Also there is profile guided optimization, but sometimes you have paths that are hard to predict, because the patterns are too elaborate.

    • @user-tr8kr1jd2o
      @user-tr8kr1jd2o 3 years ago +4

      Vyor GCC has had macros for it for years

    • @leo69105
      @leo69105 3 years ago +4

      Have you tried if these help? Afaik, the branch hints are x86 only and no longer available in x86-64 mode. Not sure about arm.

    • @HermanWillems
      @HermanWillems 3 years ago +6

      @@neur303 For example i have a branch that says High Alarm. (So x value is higher than y) I as a programmer know that one of the two almost never gets executed. Does this mean i can "guide" the compiler in to say hey branch predictor. Pls choose this one first for your prediction!! The other branch will almost never hit. Is this possible ? Especially if you need to check this for like hundreds of thousands of signals each cycle ? [[likely]] to be lower, [[unlikely]] to be higher than Y.

  • @TheAguydude
    @TheAguydude 3 years ago +60

    Note that the branch predictor is generally able to predict branches accurately for loops, since they almost always take the same path. Thus, getting rid of a loop branch is usually not of much concern, for purposes of avoiding branch prediction failures. Of course, there's still value in improving loop iteration code, since it still represents instructions that are run on every iteration.

    • @TheGandorX
      @TheGandorX 3 years ago +2

      That's why JIT (Java) is such a good idea. The run time / JIT platform applies the optimizations best suited for the hardware.

    • @garretgang8349
      @garretgang8349 3 years ago +10

      @@TheGandorX Unless you have a limited amount of parallelization that doesn't allow the JIT to run on a separate, physically parallel thread. Then it about doubles your run time.

    • @daantimmer
      @daantimmer 2 years ago +2

      Only if you have sorted data. Once you enter the territory of truly random data, the branch predictor practically doesn't exist anymore, since it simply won't be able to predict.
      Now, some CPUs (don't pin me down on which ones exactly) can execute two branches simultaneously in parallel while still processing the condition. Once the condition's output is known, the output of one parallel path is used and the other data path is thrown away. This has quite some prerequisites, but it works pretty well.

    • @andywolan
      @andywolan 2 years ago +2

      @@daantimmer Ya, I think that behavior was the basis of the Spectre and Meltdown CPU bugs discovered a few years back. (Correct me if I'm wrong.)

    • @daantimmer
      @daantimmer 2 years ago +1

      @@andywolan I won't be able to correct you even if you are wrong because I never really dived in to those issues :-)

  • @fie1329
    @fie1329 10 months ago +3

    I did not expect a character like yours in a programming video! You do a great job of keeping things interesting and a bit funny while making good use of the 20 minutes with practical examples. Really good video!

  • @frigzy3748
    @frigzy3748 3 years ago +13

    You can replace 32 in C with 'a' - 'A' just to not have magic numbers. Great vid - thanks!

    • @shanehebert396
      @shanehebert396 3 years ago +6

      Although not the purpose of this video, the old school trick for toupper is to bitwise AND each byte with a mask (~(0x20)), presupposing the arrays are all ASCII characters in the range of A-Z/a-z. Of course, once out of ASCII, this isn't the right method.

    • @frigzy3748
      @frigzy3748 3 years ago

      @@shanehebert396 I never thought of it - thanks!

  • @Bregylais
    @Bregylais 3 years ago +76

    Somebody gift this man a mouse-pad.

  • @berndeckenfels
    @berndeckenfels 3 years ago +79

    A good example for branchless is in crypto, where data-dependent code execution can lead to side-channel leaks; therefore you look for branchless and constant-time algorithms. So, for example, to compare two byte arrays with hash results, you would add up the XORs of each byte and not break the loop on the first difference.

    • @inx1819
      @inx1819 3 years ago +1

      add random sleeps in the code and you're alright
      /s

    • @yvrelna
      @yvrelna 3 years ago +3

      @@inx1819 Actually you don't want to add random sleeps. If you just sleep randomly, all operations will be extended by roughly the same amount of time, so you can still statistically figure out which operations take longer than others.
      Instead, what you want to do is sleep longer if the operation finished faster and sleep less when the operation was slower. You can do this by deciding beforehand how long an operation should take. For example, you might decide that an operation should take 1 second, so you do the operation, and if there's any remaining time, you sleep for the remainder. That means no matter which branch the code takes, it'll always take 1 second.
      Now, this isn't perfect; there's also power analysis, because doing an actual operation likely costs you more power than sleeping, so this is only enough if your attacker is remote (e.g. a webserver). But if you need to consider a local attacker with full physical access to the hardware (e.g. smart cards), you might have to busy-sleep.

    • @gutoguto0873
      @gutoguto0873 3 years ago +1

      @@yvrelna you didn’t see his /s

  • @Websitedr
    @Websitedr 2 years ago +21

    I can see these techniques being applied for things that require a lot of calculations to be done as fast as possible. The issue is when it comes to other people looking at the code without fully understanding why it's written this way.

    • @eomoran
      @eomoran 2 years ago +5

      Aw jeez, if only you could write a comment annotating it /s

    • @TheLordoftheDarkness
      @TheLordoftheDarkness 1 year ago +1

      Skill issues

  • @jonathanross6260
    @jonathanross6260 2 years ago +8

    Branchless programming is often about taking a step back and finding a simpler approach. In the second example, I would just take d[i] & 0xDF; it keeps the upper case characters intact, but flips the 6th bit for lower case letters making them upper.

    • @WhatsACreel
      @WhatsACreel  2 years ago +6

      Yes, that will change some non-letter characters too. If those characters aren’t in your text, then AND away my friend! It will certainly be faster!!
      The example is not super important, just the technique of masking in SIMD to perform multiple branches without actually branching.
      Cheers for watching :)

    • @thewhitefalcon8539
      @thewhitefalcon8539 8 months ago

      could try something like d[i] ^= ((d[i] - 26) < 'a')

  • @insertoyouroemail
    @insertoyouroemail 3 years ago +19

    Very interesting! Another good example for when you want to write branchless code is when writing shaders for the GPU.

    • @overloader7900
      @overloader7900 3 years ago +1

      Why though?

    • @jarrad2000
      @jarrad2000 3 years ago +6

      PLCs are very often programmed branchless as well, mostly because many PLC programming languages don't support branches, or only rudimentarily.
      They're also typically programmed by people with a background in electrical engineering and little to no programming experience.
      They like to use things like ladder logic.
      There are languages similar to assembly (IL) and Pascal (ST), though.

    • @insertoyouroemail
      @insertoyouroemail 3 years ago +4

      @@overloader7900 Because shader instructions are executed inside an if-then block even if the test doesn't pass for the data that the shader is executed for.

    • @xinaesthetic
      @xinaesthetic 3 years ago +3

      Overloader7 you have many-many threads running parallel and if they don’t all follow the same path that can be problematic.

  • @johnnisshansen
    @johnnisshansen 3 years ago +63

    In the 19 minutes of this video, my computer could have raised to uppercase all books written in the last 10 years.

    • @zwz.zdenek
      @zwz.zdenek 3 years ago +8

      Sounds like you have a pretty neat OCR and also pirated a lot of books.
      But really, unless your corpus is already in plaintext and loaded in RAM, other bottlenecks will make the conversion speed irrelevant.

    • @johnnisshansen
      @johnnisshansen 3 years ago +10

      @@zwz.zdenek Well, my point was that computers are cheap and humans rarely have to care about optimizing the code.
      But OK, reducing branches can sometimes be worthwhile.

    • @A.Martin
      @A.Martin 3 years ago +6

      @@johnnisshansen It's most important to have very optimized code where massive amounts of calculations are being done, like, I dunno, finding large primes.
      For regular programs, less effort is put into optimizing now than was done back in the 80s or 90s, because computing power is so much higher that it is usually not needed. They do stuff near enough instantaneously anyway.

    • @TheSkepticSkwerl
      @TheSkepticSkwerl 3 years ago +3

      Never assume optimization is achieved. It comes down to your use case and needs. If your need to look through 50gb databases or you need to talk to 100k computers any and all tricks can help. Granted when talking to 100k computers, network is a big bottleneck. But I've been able to take 20 day scripts and convert them to a few hours. Optimization can and will matter in certain scenarios.

    • @Argosh
      @Argosh 3 years ago +1

      That kind of thinking led us to page loading times that were worse than what we could achieve in the 90s...

  • @SpiritmanProductions
    @SpiritmanProductions 2 years ago +54

    This will seem trivial to some, but anyone planning to do arithmetic with the result of a boolean expression in .NET should bear in mind that 'True' equates to -1, rather than 1 as in C++.
    EDIT: I'll just add that any non-zero value (positive or negative) equates to 'True' in a conditional statement, and zero equates to 'False'.

    • @TurboXray
      @TurboXray 2 years ago +16

      Someone needs to explain why that was chosen hahah.

    • @Vacuon
      @Vacuon 2 years ago +10

      -1 (int) is just max_uint, aka 0xFFFF... you can do (boolVal & 1), but it gets tedious haha
      And I've never used a system where true is -1 myself but if I had to guess I'd say it's because the designers wanted logical operators to work on boolean values, e.g.:
      not false = true (~0 = -1)
      true or x = true (-1 | x = -1)
      etc, etc, etc.

    • @TurboXray
      @TurboXray 2 years ago +3

      ​@@Vacuon I've worked with plenty of systems where -1 can be true, but it's not the ONLY truth value. As in, 0 is false and everything else is 'truthy'. ~ (BoolValAsNegativeOne) is just BoolValAsNegativeOne XOR -1, vs BoolValAsOne XOR 0x01. It doesn't appear to show negate -1 as an advantage. I mean, for either method you still have to auto-demote to a bool type to perform the operation. And it's the same amount of assembly instructions to invert; load the value, perform the operation. I'm SURE they had a reason for doing it.
      The thing about having a TRUE as a negative -1, means you can't index an array via an evaluation myArray[a>=b].. well I guess it works if your language supports negative indexing (like Python).

    • @SeanCMonahan
      @SeanCMonahan 10 months ago +5

      Oh, god, I bet that has led to more than a few bizarre bugs.

    • @vladimirarnost8020
      @vladimirarnost8020 9 months ago +3

      @@SeanCMonahan Not sure about C# but I wish C/C++ also used -1 (aka ~0) for 'true'. It would make very practical bit masks.
      For example, the video uses the slow integer multiplication (imul) instruction where a simple bitwise AND with a proper mask (0x0...000 or 0xF...FFF) would do instead. Conveniently, 0xF...FFF = -1U. Also, 0x0...000 and 0xF...FFF are just complementary values. You get the opposite one just by flipping all bits. x86 CPUs even have an instruction for that (NOT). ~0x0...000 = 0xF...FFF. All expressions consisting only of 'magic numbers' like this are pre-computed by the compiler.
      So if 'true' was -1, the min(a, b) function could be expressed as: return a & (a < b) | b & !(a < b);
      Since 'true' is 1, we have to flip the sign: return a & -(a < b) | b & -!(a < b);
      Or leave it to the compiler because the compiler writers know well how to compute a minimum of 2 numbers properly on your platform. Use 'std::min()/max()' for branch-less programming.
      The CMOVxx instructions have been with us for a very long time (since Pentium Pro in the mid-1990s) so they can be taken for granted now. :)

  • @ChrisM541
    @ChrisM541 2 years ago +5

    Brilliant explanation. The following bombshell caught my eye...
    12:19 "So it just goes to show that you don't automatically get speed from jumping over to assembly...you have to code it well" - very well said ;-)
    With every language above assembly, it's a shame that we have so little control over the final machine code output, even with C/C++.
    Lesson: 1) Never trust your compiler 2) Become an expert in assembly language and learn to e.g. inline where needed.

  • @vinzer72frie
    @vinzer72frie 3 years ago +285

    "The CIA wants you to believe its only if then else if then else"

    • @gumbo64
      @gumbo64 3 years ago +56

      Run over all the glow-in-the-darks

    • @anant6778
      @anant6778 3 years ago +28

      The FBI wants to have a word about switch statements

    • @VulcanOnWheels
      @VulcanOnWheels 3 years ago +4

      What is this quote from? If you quote something from the video you're commenting on, then please include the time where you found it.

    • @SiisKolkytEuroo
      @SiisKolkytEuroo 3 years ago +14

      @@VulcanOnWheels ruclips.net/video/7uLzaKlZSQQ/видео.html it's in that video at around 5m 40s

    • @SiisKolkytEuroo
      @SiisKolkytEuroo 3 years ago +16

      But it's worth it watching the entire video because Terry Davis is a legend

  • @Walrusbonzo
    @Walrusbonzo 3 years ago +5

    Loved the reference to Al Di Meola, one of my all time favourite guitarists. Great video too, right up my street.

  • @metamud8686
    @metamud8686 2 years ago +57

    The primary rule of optimization is left out here, which is to FIRST MEASURE aka 'profile' to see which part of your code the CPU is spending most of its time in.
    Once you know where that is, you try to find out any algorithm / complexity improvements (take your algorithm from O(N²) to O(N) or even O(1)).
    Then finally, after all that is done, you may think about 'intervening', checking the assembly and seeing if you can "do better yourself".
    Introducing obfuscating code like a*(a>b)+b*(b>=a) leads to bugs and unnecessary rabbit holes for your future self when you get called in to debug the mess you made.

  • @TheCroustade
    @TheCroustade 3 years ago +8

    Some processors have a conditional selection instruction, which at source level corresponds to a>b ? a:b.
    Branchless programming is a must in time-critical applications where worst-case execution time prevails. This skill is appreciated in avionics design companies.

  • @ncot_tech
    @ncot_tech 3 years ago +6

    I’ve been doing Z80 assembly recently, and seeing all these fancy conditional instructions and more than 8 registers is crazy 😀

    • @WhatsACreel
      @WhatsACreel 3 years ago +1

      Long live the ZX Spectrum :)

    • @protonjinx
      @protonjinx 3 years ago

      cheater! 6502 only had 3! spoiled brat.

  • @danilotoebdot5616
    @danilotoebdot5616 3 years ago +6

    Got recommended this, watched it all, now time to find even more good content you have

  • @BlankBrain
    @BlankBrain 3 years ago +9

    Back in the late '70s, I came across a set of algorithms that computed Julian and Gregorian dates using only double precision arithmetic. Leap years and Y2K worked without modification. Operating systems and databases frequently got leap years and daylight savings time calculations wrong. It took a lot of time an effort to recover. The software that I wrote using the arithmetic functions only failed once, and that was when someone compiled them single precision. It made me a big fan of using arithmetic functions to perform logic.

  • @Bunnokazooie
    @Bunnokazooie 28 days ago

    Love when experts have a down-to-earth attitude like this. You rock man.

  • @Toksyuryel
    @Toksyuryel 3 years ago +11

    While these techniques aren't very useful for most programmers, they are extremely useful for compiler programmers. The more patterns they can recognize and replace, the better everything is for everyone.

    • @vladimirarnost8020
      @vladimirarnost8020 9 months ago +1

      Or library developers, e.g. providing the fastest primitives like memcpy, strcpy, strlen, etc. is essential for C/C++ performance.
      However all the ugliness and complexity should stay locked inside the library implementation and covered with extensive unit tests and benchmarks.

    • @peterSobieraj
      @peterSobieraj 9 months ago

      I disagree a little.
      Thinking that most programmers shouldn't care about performance led us to the situation today:
      you need a 2x2 GHz CPU just to start Windows, just to display a desktop background image and some icons, and do nothing while waiting for the user to click something.
      And 10 seconds of website loading is considered normal.
      And a fast-typing programmer can type faster than the average IDE can display text.

    • @Toksyuryel
      @Toksyuryel 9 months ago

      @@peterSobieraj The goal is that top level programmers shouldn't *have* to think about performance because the low level stuff takes care of it for them. I agree that we aren't there yet though.

  • @russellthorburn9297
    @russellthorburn9297 3 years ago +38

    It seems to me that it's a choice of evils:
    1. Easy to read but much slower code.
    2. Optimized but heavily obfuscated code.
    Option 2 is WAY above my pay grade :-) so I'd go with #1 unless I had no other choice. Still, your video was extremely well done. Even I was able to follow what you were saying and I'm merely an intermediate level programmer. That's definitely worth a "Subscribe".

    • @tragedyofwind
      @tragedyofwind 3 years ago +10

      This is why programs from the very old days were really hard to read: scientists optimized them for machines whose clock rates were, like, below a thousand Hz. However, the rise in computation power means there is no need to write optimal code for the machine when the machine can do more work for you, so programmers can spend less time writing less complex code, and we can produce more programs for more purposes. This is a trade-off; it really depends what type of work you're doing.

    • @Dji-gurda_Jdi-druga
      @Dji-gurda_Jdi-druga 3 years ago +6

      I would say you should determine bottlenecks and use #2 in them and #1 everywhere else.

    • @retrozvoc6189
      @retrozvoc6189 3 years ago

      How about a middleware that's readable to you and which gets JIT-compiled to whatever CPU you wanna run it on?

    • @DoomRater
      @DoomRater 3 years ago +4

      I dunno, Option 2 is what compilers are for. Or in my case, precompilers, since I work predominantly in a compiled language without certain niceties and use a precompiler that translates the extra control statements I want to use into standard controls that actually exist. Also look into the movfuscator for more evidence that optimizations can often be offloaded onto the compiler.
      Edit: I said all that BEFORE WATCHING THE VIDEO WHERE HE SAYS COMPILERS HELP

    • @PhilippeCarphin
      @PhilippeCarphin 3 years ago +3

      "Before, we used programmers to save on computing resources, now we use computing resources to save on programmer time." Don't remember who said that.
      And like the others say, that doesn't mean #2 should never be used, just that you should only use it for measured bottlenecks that are worth it. Even if you identify a bottleneck but your program doesn't need to be fast, then you still don't need to optimize.

  • @jakeplonk888
    @jakeplonk888 10 months ago

    Absolutely fascinating. I had never, ever heard of this technique. Thanks for posting!

  • @JosephSaintClair
    @JosephSaintClair 2 years ago

    Glad I saw this. Took me back nearly 30 years. Thanks 🙏

  • @atackhelikopter4303
    @atackhelikopter4303 10 months ago +3

    13:30 improvement idea:
    use an array that has the length of 255 and place basically the whole ASCII table there, but make it so that the portion a-z has the values A-Z
    this should make it faster because you aren't comparing anything, you're just replacing information with existing information

    • @shinyhappyrem8728
      @shinyhappyrem8728 10 months ago

      *256

    • @atackhelikopter4303
      @atackhelikopter4303 10 months ago

      @@shinyhappyrem8728 it stops at null so you never have to take into account the null character lol

  • @Louis_Marcotte
    @Louis_Marcotte 10 months ago +8

    I didn't even know branchless coding was a thing. I just assumed "if" was necessary to test a condition and didn't really question it, seemed reasonable enough. You've opened my eyes. I'm not gonna use branchless coding yet, most of it flew WAY over my head, but knowing why it's bad is still good

    • @LocrianDorian
      @LocrianDorian 10 months ago

      It really isn't that difficult if you think about it, for comparison replacement, a>b itself is a mathematical expression which returns 1 if it's true, so if you do x=a*(a>b), that will mean x=a if a is larger than b. You can use that same principle for anything. That said, I think the main lesson learned from this video is that you shouldn't try to be too clever because most of the time the compiler will do a better job than you ever will.

    • @Louis_Marcotte
      @Louis_Marcotte 10 months ago +1

      ​@@LocrianDorian ye nah I get the algebra, I just don't understand how converting "if(a>b) return a" to "return a * (a>b)" removes a comparison, you're still comparing a to b

    • @LocrianDorian
      @LocrianDorian 10 months ago +1

      @@Louis_Marcotte It doesn't remove the comparison, it removes the branch. The thing that slows down the CPU is not the comparison itself, it's the "if" statement.

  • @duncanbeauch9598
    @duncanbeauch9598 10 months ago +1

    I remember learning all these topics knowing it would speed up code but I didn’t know to what degree. It was very informative to walk through the same example for direct comparison

  • @supericeg5913
    @supericeg5913 5 months ago

    Thanks for showing this! I now use this every day in my game in C and it's a great extra bit of fun problem solving!

  • @markwilliamhumphries
    @markwilliamhumphries 2 years ago +15

    You can toggle case on an ASCII alpha (from lower to upper, or vice versa) by xoring it with the space character (i.e. 0x20).

    • @Reginoid
      @Reginoid 9 months ago +1

      yes, he did it with -32 (0x20) in that example

  • @giladreich810
    @giladreich810 3 years ago +50

    YouTube should pay you 10x more for each view you get for the underrated quality content you publish. Love your channel! ❤ I was recently getting more into this subject myself, especially reading more about how different compilers handle N/RVO (Return Value Optimization) and Copy Elision to reduce copying when all the move semantics were introduced in C++11. It would also be a great idea for the next video to look more at optimization and what we should not worry about when we write our code.

    • @scottmichaud
      @scottmichaud 3 years ago

      C++17 mandated that RVO (not NRVO) is performed. If you're allowed to use C++17 (or later) then the compiler is not allowed to perform intermediate constructors and destructors anymore, so there shouldn't be any differences between compilers or optimization levels. You can see the edge cases here: en.cppreference.com/w/cpp/language/copy_elision

  • @pierrevillemaire-brooks4247
    @pierrevillemaire-brooks4247 2 years ago

    Happy Holidays man !
    ... and thanks for the lesson ;-)

  • @lucisetumbrae
    @lucisetumbrae 3 years ago

    Great video! Loved the random Al di Meola reference in assembly language

  • @joshuap7406
    @joshuap7406 3 years ago +34

    I never liked the if statement, but this takes it to a whole new level...

    • @CottidaeSEA
      @CottidaeSEA 3 years ago +1

      Conditionals are always necessary. The difference is that branchless programming doesn't create two different scenarios.
      return (a * (a > b) + b * (a <= b));

  • @JayVal90
    @JayVal90 3 years ago +40

    I used the vector technique for a mapping algorithm at my last job, using the Accelerate framework on iOS. Got a HUGE performance increase, although I never thought about the branchless part of it. I was just happy that I had an algorithm that could convert tens of thousands of lat/lon pairs to rasterized mercator tiles of 32-bit indexes into an array of values for each mercator tile pixel in the same time that it took the naive, pointer-chasing for each vector method to process 10 lat/lon pairs. I was working with 300x speedups.

    • @flossless200
      @flossless200 3 years ago +8

      Cool story bro

    • @arnoio8355
      @arnoio8355 3 years ago +1

      You must slay at parties

    • @patrickthepure
      @patrickthepure 3 years ago +4

      I observed a similar situation in BVH tree traversal. For a raytracing program, my first approach was storing BVH data with pointers and dynamic memory, and traversing it recursively. It was a very elegant, OOP-styled piece of code. Then I had to make it run on a GPU, so I re-made it in a branchless way: I stored all its data in contiguous memory and used a stackless BVH traversal algorithm. Yep... 100x speedups when run on a CPU, 1000x speedups when run on a GPU.

    • @fyfaenihelvete
      @fyfaenihelvete 3 years ago

      That's pretty cool

    • @JayVal90
      @JayVal90 3 years ago +3

      Garfieluigi Yes. Especially parties full of people with a high appreciation for the technical.

  • @godnyx117
    @godnyx117 10 months ago

    Thank you! I'm writing a low level library to use instead of libc for my programming language and these tricks will come in handy!

  • @bluebukkitdev8069
    @bluebukkitdev8069 2 years ago

    Thank you for this video! RUclips recommended it and I am glad it did. You earned a like.

  • @originaltonywilk
    @originaltonywilk 3 years ago +8

    Good to know some programmers are interested in this stuff... thought it was a dying art :)
    Did a lot of stuff back in late 90's with good old MMX instructions to process video in real time (which was tricky back then).
    Even now some clever optimization can make a huge difference - like being able to adjust photo editing filters in realtime with a slider (e.g. Affinity Photo) or waiting 1/2 a second or more for each tweak.

  • @mohrizkyk
    @mohrizkyk 2 years ago +3

    "If you want to be most people, if you want to be an ordinary dev with common skillset, this video is not for you."
    - Peerple -

  • @Tolg
    @Tolg 9 months ago +8

    This sort of optimization is very hard to get right and often leads to premature optimization. For tight loops in higher-level languages, you should check your bytecode, and for lower-level languages, the disassembly. However, don't forget about the third thing... the CPU execution pipeline. There are also many optimizations each CPU can make. So while one disassembly may look slower than another, that isn't always true, as the CPU might be able to optimize one better than the other. So it's very tricky and really ends up requiring a lot of testing on specific CPUs. I wish there was a platform where you could benchmark your code against hundreds of different CPU types (with emulation) and see the results.

  • @Rudolf215
    @Rudolf215 2 years ago

    I love your videos so much!! Keep up the great work!

  • @code-dredd
    @code-dredd 3 years ago +31

    *Programmer:* I'm going to code this in assembly.
    *C++ Compiler:* So, you have chosen inefficiency...

  • @donald-parker
    @donald-parker 3 years ago +76

    I thought it was funny when he said "The compiler couldn't figure out what we were trying to do". Same as most people trying to read that code. There are lots of compiler tricks for optimizing - removing branches is just one of them. And optimizing for speed is not the only type of optimization a compiler can do. My own intuition is to strive to make your code readable first. If you are not meeting speed requirements, reconsider your algorithm first. As a very last resort, consider using assembler. I have trouble imagining a scenario where trying to come up with some contorted approach in a high level language that is hard to understand, maintain, or predict what the performance implications would be (without looking at disassembled code, which seems kind of like "why bother") would be the preferred approach. BTW - the same high level code can be compiled into different native instructions for various reasons (e.g. different compilers, different target processors, different compiler options, ...), so it seems like you would need to buy into disassembling your code EVERY time one of these variables changes to see if it is still optimized. Of course, it is not uncommon for the writer of the code to not really know when one of these things changes in their build environment. In general, I would say if you start trying to second guess your compiler, you are at best wasting your time, and at worst, creating a maintenance nightmare.

    • @Antoinetheman
      @Antoinetheman 3 years ago +18

      To be fair, I think all of these points are taken into account by the video. He's not saying you should do this stuff everywhere, only in cases where you need extreme speed, and especially in SIMD scenarios. Nobody's doing SIMD unless they need speed.

    • @georganatoly6646
      @georganatoly6646 3 years ago +2

      I believe you may have meant to say assembly instead of "... consider using assembler" to refer to an assembly language.

    • @shiuido359
      @shiuido359 3 years ago +16

      Remember that not everyone is writing code for webservers and PCs. If you work on high-level applications it's hard to imagine, but there are people writing for low-level applications where cycles and bytes are scarce.

    • @christopherjoseph651
      @christopherjoseph651 2 years ago +4

      @@shiuido359 EXACTLY! 99.999% of people that write code are oblivious to the fact that low level programming is still relevant and necessary in certain applications. They live under the belief that assembly is old, outdated, incapable and doesn't need to be taught because they learned Java or Python.

  • @Kyojin743
    @Kyojin743 19 days ago

    I’m an undergraduate in CpE and I was working on an autonomous robot, and I had this really over engineered way to do course correction. I watched this video, and just today I was looking over my system and noticed that the techniques in your video were perfect for this application. I reduced it from a 50 line solution of ugly nested if else’s about 4 deep at max, to 2 lines. Thank you!

  • @protheu5
    @protheu5 10 months ago

    Great video, very comprehensible.

  • @Hydrozoa
    @Hydrozoa 3 years ago +64

    Interesting video! I'll trust the compiler to optimize my readable code over trying to outsmart it, though :b

    • @matsv201
      @matsv201 3 years ago +6

      Well... I think the ultimate lesson is to speed test your code

    • @nguyenvu8262
      @nguyenvu8262 3 years ago +1

      I think there are many things you can do to optimize for speed. This should not be high on the list, but it's good to know that it's there. If code is branch-heavy, this would probably help more. Speeding up code that takes 1 ms by 1000x doesn't have that much impact on performance as a whole.

    • @judewestburner
      @judewestburner 3 years ago +1

      CPUs are cheaper than developers

  • @NetAndyCz
    @NetAndyCz 3 years ago +9

    Interesting stuff. I never dug too deep into programming, so I never even saw the instructions that go to the CPU/GPU; I would love to learn more about actual assembly code.

  • @Hiphopasaurus
    @Hiphopasaurus 3 years ago +1

    Great video! One caveat, the 'CMOVcc' instruction was introduced with the Intel P6 microarchitecture (Pentium Pro+), so if you're building for generic 32-bit platform, gcc will still branch: 'CMP, Jcc' ...

  • @miricaley4988
    @miricaley4988 2 years ago

    Very cool video, thanks a lot. I've been learning a lot from you lately. I hope you can finish your neural network series in the future; that series is actually the one that got me to support you on Patreon. I'm also hoping you can bring up some mobile hacking material in the future. Might be too much to ask, but I am really looking forward to your new stuff, man. You have a good one.

  • @yuryschkatula9026
    @yuryschkatula9026 3 years ago +19

    Fashion: you have to have that glass table
    Optical mouse: I hate this... Not again!

    • @horowitzhill6480
      @horowitzhill6480 3 years ago

      funny thing is, i've used an optical (laser? don't really remember) mouse that had *no problem* working on a glass surface
      really weird

    • @tunahankaratay1523
      @tunahankaratay1523 3 years ago

      @@horowitzhill6480 That mouse probably uses infrared. Although glass is transparent to the visible spectrum, it reflects infrared. This way the heat energy at home can't escape through windows.

    • @horowitzhill6480
      @horowitzhill6480 3 years ago

      @@tunahankaratay1523 pretty sure 1) glass behaves the same with infrared and visible 2) if it did reflect, the mouse would not work, as it would be like using it on a mirror

    • @tunahankaratay1523
      @tunahankaratay1523 3 years ago

      @@horowitzhill6480 No. It's pretty common knowledge that glass reflects infrared. Otherwise it would just let the heat energy as infrared out. But most of the heat loss through windows happen at the edges, because it's hard to perfectly seal the glass. That's why window companies are focusing on the seal, not the glass itself. Also they probably engineered it further to make it work with glass, I don't think they just replaced the LED with an infrared one and called it a day.

    • @christopherpepin6059
      @christopherpepin6059 3 years ago

      @@horowitzhill6480 Not everything that reflects light acts like a mirror. Every object that is not transparent reflects light; it's how you see them in the first place. Glass most definitely reflects IR. If you ever saw one of those IR booths at a science museum, it is why people's glasses appear as dark spots: the glass prevents the IR from passing through it.

  • @adamp9553
    @adamp9553 3 years ago +8

    For unsigned ranges like 'a' to 'z', a smaller range check I use is (value-lower+0u>range), where 0u forces the comparison to be unsigned and range would be 'z'-'a', so that any overflow also results in an unsigned value out of range. Instead of two compares where both bounds have to be accounted for, it's one subtract and one compare.

    • @RMLLcrazy
      @RMLLcrazy 3 years ago

      This seems like a common enough thing that the compiler should be able to provide the optimisation. Isn't that the case?

    • @tolkienfan1972
      @tolkienfan1972 3 years ago +1

      I like explicit casts in such cases, with comments so people reading the code have no cognitive load

    • @rinrin4711
      @rinrin4711 3 years ago +1

      It will probably be slower since you changed one comparison and logical operator to one subtraction. Comparison is a subtraction without memory assignment, so cmp is faster.

  • @rickylefebvre
    @rickylefebvre 10 months ago

    Love the Al Di Meola name drop! Amazing guitarist!

  • @sannekaribo4253
    @sannekaribo4253 2 years ago

    THIS VIDEO BLEW MY MIND

  • @ImHencke
    @ImHencke 3 years ago +685

    Imagine if YandereDev saw this
    EDIT: It's a joke, chill out you angry weebs.

    • @vinzer72frie
      @vinzer72frie 3 years ago +120

      This video showed up on my recommended after watching yanderedev videos youtube is so smart lol

    • @scottmichaud
      @scottmichaud 3 years ago +169

      ​@@vinzer72frie and @Henrik :: I'm a programmer that's helped with optimizing Yandere Simulator.
      ---
      I should note that the types of optimizations mentioned in this video are *hot path* optimizations. You're referring to one script that's executed ~80 times per frame, and its execution time, as measured by the profiler, is under 1ms on a typical, modern ~4 GHz CPU. If this was your bottleneck, then you would be in the ~1000 FPS range (because 1ms = 1/1000th of a second). This measurement is the sum of all 80 NPCs, and it also includes the time that the script is spent blocked on heavy calls into C++ (ex: some when some NPCs need to adjust their animation weights).
      ---
      Performance mostly depends on data... and game logic doesn't have a whole lot of data flowing through it (when compared to rendering thousands of objects, thousands of physics objects interacting, hundreds of animated objects that each have dozens of bones, etc.). People put a lot of emphasis on code, but optimizing your data (how much you load and how it is laid out) is usually where you will get the huge wins.
      ---
      Also, when you're in the realm of Object-Oriented Programming, one L2 cache miss is typically 10x to 100x slower than a branch, depending on whether the branch prediction was a success. This obviously varies from CPU-to-CPU, but it's a good order of magnitude comparison for modern x86-64ish CPUs. A correct branch is ~2 cycles. An incorrect branch is ~15-20 cycles for AMD Ryzen / Intel -Lake CPUs ( See Agner Chapter 3.8 and 3.13: www.agner.org/optimize/microarchitecture.pdf ). A cold memory fetch (reading memory that's not in cache) is ~200 cycles. That's not too big of an issue for something that happens dozens of times per frame, though, because a 4 GHz CPU does ~67 million cycles per frame at 60 FPS (per core). In other words, you would need to do ~1.5 million branches on your main thread, and have them all be incorrectly predicted by the CPU, to eat up just half of your 60FPS frame budget.
      ---
      The actual performance issues with Yandere Simulator are mostly related to animations and physics... and we've been fixing them up based on priority and availability. You can see over a 2x performance increase between January 2019 and July 2020 (assuming you have a discrete GPU and shadows are disabled -- neither of which have anything to do with game scripts). It's not uncommon to see a gaming PC with a good (ballpark ~GTX 1060-or-better) GPU and a gaming CPU get 80-110 FPS when shadows are disabled.
      ---
      Unfortunately that "if/else" meme fools a lot of people, though. It's not, and has never been the issue for that game. I mean if performance got over 2x better, and the game scripts haven't changed, then it clearly could not have been the issue. Over 50% of all execution time, good and wasteful, was removed... yet the if/else statements are still there.

    • @NihongoWakannai
      @NihongoWakannai 3 years ago +84

      @@scottmichaud nice comment, too bad not really anyone is going to see this. I'm a gamedev myself and seeing the way people rag on yandere dev is pretty often kinda cringe. It's very obviously a lot of script kiddies or people with no experience just jumping on a bandwagon without actually knowing what to be mad at.
      Did the game have optimisation problems? Sure. Is it because of some random out of context if/else chain? probably not.

    • @cortexauth4094
      @cortexauth4094 3 years ago +13

      @@scottmichaud I have been studying graphics libraries and trying to make an engine, meanwhile looking at the code of games. It always baffled me what was supposedly wrong with those if/else; I can't imagine an RPG without them. This cleared up the doubts.

    • @scottmichaud
      @scottmichaud 3 years ago +25

      @@cortexauth4094 Yup. No problem. You want to have as much of your hardware at max capacity as possible (until there's no more work for the frame, then idle in one big chunk). When a branch is mispredicted, it needs to flush the bad work. That wastes "low tens" of cycles. ~20 cycles is nothing when you're concerned about work that happens once every 10 milliseconds (100 FPS). That's 0.0000025% (1/40,000,000th) of your budget. That could be a problem if you're in a loop of thousands or millions of objects, though, and the CPU can't figure out your pattern.
      Game engines will mostly have problems with L2 cache misses, though, which are 10x worse than a mispredicted branch. This means that work that needs to operate on many things (ex: thousands) should be laid out linearly in memory so your next calculations are on the same cache line, and, when you leave your cache line, you land on the start of the next cache line. The CPU watches your memory accesses and it will speculatively load cache lines based on what it thinks you will need.
      One of the biggest problems with memory accesses is dynamic inheritance. The way to allow multiple different child classes to fit on an array is to *not* have them contiguous in memory (because they're all different sizes based on the child class). There's a *lot* of different discussions that can go on here... but, yeah, it depends on the data you have flowing through it. Being wasteful on the order of the millionths of a 100 FPS budget might be okay if you're talking about hundreds of things... but, on the other hand, it is wasteful. IMO the biggest part is knowing what your waste is, and making a conscious judgment to waste (rather than being blindsided by the profiler months or years later).

  • @TheEkkas
    @TheEkkas 2 years ago +7

    Very nice. It's not often we see real code optimisation. First time I’ve seen this. Love it.
    Smaller is faster. Not only your data but also your code blocks need to move in and out of CPU caches, so avoiding a possible branch to code outside the L1 size adds up and makes sense (now). This is a great technique for ultra-performance code. The algorithm at the end is a nice added bonus. Subscribed.

  • @nerdydrew6818
    @nerdydrew6818 3 years ago +3

    This reminds me so much of how The Cherno explains C++. Good to know that the compiler at times knows best :D

    • @WhatsACreel
      @WhatsACreel 3 years ago +2

      Cherno is a legend!! We certainly share the accent :)

  • @eterpaykugml4751
    @eterpaykugml4751 3 years ago +1

    Really neat. Optimizing for speed by sacrificing some memory is an excellent trade-off in modern systems.

  • @SaidMetiche-qy9hb
    @SaidMetiche-qy9hb 3 years ago +21

    Really interesting, optimization is often neglected

    • @svnhddbst8968
      @svnhddbst8968 3 years ago +4

      It's fun to look into Python for optimization. Sometimes the most readable code is only mediocre for timing, while really verbose code that's three times as long and looks like a foreign language can be three times as fast.

    • @yScribblezHD
      @yScribblezHD 3 years ago +5

      svnhddbst I think optimizing Python code is more or less a pipe dream, considering anything highly performant should just be written in some other language. Writing Python code that's short and simple is what makes it fun heh.

  • @xarlysx
    @xarlysx 3 years ago +9

    When I started learning four years ago, I had this constant feeling that there was something wrong with conditionals but I didn't know enough to be able to even ask the question. And now finally I have my answer.

    • @thomasmaughan4798
      @thomasmaughan4798 3 years ago +1

      There is nothing wrong with conditionals. The author has shown a way to HIDE the conditionals from the source code, but the fact is that a conditional is inescapable in the object code. It is also very fast, typically 4 clock cycles to execute a conditional. The problem with JUMP instructions is the difficulty of debugging, and removing obvious branches from the source code simply makes it harder to debug.
      Optimizing your code requires some knowledge of how long each instruction takes; logical instructions (and, or, compare) are typically very fast, requiring 4 clocks. Arithmetic instructions like add and subtract take considerably longer, up to 100 clocks, and multiplication becomes non-deterministic but can take hundreds of clocks. So using multiplication to avoid a simple 4-clock compare instruction is incredibly inefficient.
      It is this inefficiency that led to the development of "branch prediction": suppose there's a multiplication followed by a conditional. It is going to take such a long time to calculate the item to be compared that the processor supposes the compare could go either way and starts loading the instructions from main memory that follow both possible branches. The CPU is vastly faster than main memory. Then when it gets there, it figures out which branch to take and discards the branch not taken, and the discarding takes zero time as far as impact on your program goes.
      In that scenario, using the simple approach MIGHT not even be faster, since it means more of your program is going to be waiting on main memory anyway. But the CPU will certainly run cooler and consume less electricity if you use simple instructions rather than these hugely complex ones.

    • @oliver_twistor
      @oliver_twistor 10 months ago +1

      I know this comment is old, and that you probably have learned a lot more since you wrote it, but be aware that this micro-optimisation is for specific use cases only. For the vast majority of production code you'll write, if statements are much better. Your coworkers will thank you for it, because they won't have to spend their precious time trying to understand what you have written. I have worked with many "clever" coders and I usually end up deleting it all and rewriting it myself, because I can't be bothered to take precious time trying to understand what the code does.

  • @lautaro736
    @lautaro736 3 years ago

    Awesome, thanx! I will be patching my assembly code with this from now on

  • @NielMalan
    @NielMalan 3 years ago +1

    So good to see that at least some coders are working on improving performance.

    • @Wylie288
      @Wylie288 10 months ago

      This doesn't really improve performance in the real world lol. It only works on this very specific, bad compiler and a very specific, outdated instruction set. In the real world, you use a modern CPU where comparisons do not take anywhere near this much time. And you use compilers that do this for you IF it's actually faster. Most of the time it isn't.
      This is ONLY useful for GLSL shaders. That's this video's only real-world application.

    • @NielMalan
      @NielMalan 10 months ago

      @@Wylie288 My comment was intended to be read ironically. I don't know enough about processors and compilers to have a conversation with you about it, but I am a victim of slow software. This makes me doubt the ability and willingness of coders to study and improve their code's performance.
      I hate the attitude of "modern CPUs are fast so code performance doesn't matter." If a comparison only takes a microsecond, a million comparisons take a second. If your code could have avoided those million comparisons, you've wasted a second of my time, and I hate you.
      (Why I hate you: I have ADHD, and every time my computer hesitates it disturbs or disrupts my focus, and I pay dearly in energy and patience to regain focus. Some software (notably Evernote) I cannot use at all, because it is simply too slow and therefore takes too much of my energy.)

  • @bobbobbity463
    @bobbobbity463 3 years ago +4

    Very interesting video.
    Another technique you didn't show (and one that's a lot easier for beginners) is caching. In the case of toUpperCase, you could prepopulate an array of 256 elements filled with zeroes and a run of -32s. You sacrifice a bit of memory for great gains in speed.
    Also, depending on the language you use, you can store function references in an array and call funcArray[a>b](a,b) to reduce the amount of branching. Or even funcArray[a%4](args) for more options.

  • @teropiispala2576
    @teropiispala2576 3 years ago +13

    Nice to find out that someone is still interested in performance. I started serious programming in the 80s and did lots of embedded assembler programming in the 90s. Computers are fast today, but it still feels like a crime to write slow code. I have given up assembler for compatibility and readability reasons, but with time-critical code I always check the compiled results and run performance tests.
    Parallel pipelined execution is so complex nowadays that sometimes it takes a long time to understand why some implementation runs as fast as it does.
    Sadly, web-based techniques and scripting languages are spreading into areas they shouldn't. For many modern programmers, any operation is fast if it takes less than a second, even though, done right, the time could be at the nanosecond level.

    • @ImperatorZed
      @ImperatorZed 1 year ago +2

      Optimizing isn't always worth the increased complexity, so it's often a tradeoff. Benchmarking seems to be the way to go; I need to learn how to do that.

    • @teropiispala2576
      @teropiispala2576 1 year ago +5

      @@ImperatorZed I’m not talking about hacker optimizations but about higher-level architectural choices.
      The web technologies I mentioned are extremely slow compared to natively executed code that is tailored to the purpose. They also create hidden complexity, which makes the code unreliable.
      Applications are being made using complex ready-made components whose functionality typically doesn't fit the purpose well. They bring a huge number of new failure paths, which are mostly unhandled, or they make proper error handling extremely complex. Programming errors are typically caught as runtime exceptions.
      Components have a complex dependency network on other components, which are constantly upgraded, partly because of security vulnerabilities in functions that aren't needed in most cases. The full system is never tested, and applications work only on the happy-day-scenario principle. Often the developer's answer to a problem is "try another browser" or "it works on my machine."
      So, my point is that mostly self-written functional code is several orders of magnitude simpler and faster when written in traditional C or C++, compared to current web technologies.
      The evil lies in how easy it is to use those complex components and how difficult it is to take the simple approach. "Use what you've got" applies well here.

  • @collinvisser7108
    @collinvisser7108 1 year ago

    Very useful for shader code - I think I have a bit of a better understanding now. Thx

  • @SuperGeneralCrazy
    @SuperGeneralCrazy 2 years ago

    I love this dude's energy, gotta say it