DISCLAIMER FOR ALL BEGINNER PROGRAMMERS! Do not attempt this at home! Unless you really, really have to. In most cases you won't outsmart your compiler. For high-level programming, you should focus on keeping your code clean and easy to read and maintain. Usually you'll get speedups by choosing the right algorithms and data structures, not by doing micro-optimizations. And if speed is your concern, benchmark first! Don't try to optimize code that may not need it.
Well, doesn't the first example in the video show exactly that? I mean, that goes for all optimization methods: only use them when you have verified that you actually need them (and then verify again) :)
Thank you for saying that. That's what I was thinking, I was very intrigued by all of this additional complexity (from a readability standpoint)... Wondering if I could really outsmart a compiler by doing these techniques "by hand".
Donald Knuth: Premature optimization is the root of all evil. This technique is only valuable in highly specific situations. If you're not sure if it applies to you, it doesn't.
This whole video is off limits for beginner programmers and comedy for experienced programmers. That piece of paper under his wired USB mouse is another telling sign.
Yes, this. How many times the code needs to run is a big consideration. If a function needs to loop through a dataset that numbers in the billions, it’s a good sign that micro-improvements like these might be of interest.
I love this, but I'm reminded of Kernighan's law: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
@@rickarmbruster8788 I notice that in most of my projects I build the basic structure, add more to it, come back to the basic structure, see what a mess it is, then Ctrl+A, delete, and rewrite it...
I LOVE how you introduced the concept by showing a way to manually optimize and how it actually fails to be better than the straightforward if-based version. That's such an important point to share up front any time we talk about clever optimization.
It only applies to C/C++; for most other languages, which don't have these kinds of tricky compiler optimizations, branchless is always faster (JIT languages have JIT optimization, but in real scenarios branch prediction still misses a lot). Not to mention that even in the video there are circumstances where the branchless version does outperform the branched version.
Branchless generally applies when you're talking about assembly. C/C++ compiles to assembly, but so do many other languages, and some languages like C# can be compiled to assembly even though they are typically compiled into virtual-machine bytecode (using il2cpp and then a C++ compiler).

For a purely interpreted language like Python, branch prediction and branchless code barely come into play for the developer. When you do any operation in Python, the runtime is almost always running dozens or hundreds of lines of C (compiled to assembly) that aren't related to your code in any way except for ensuring that you're requesting valid operations, and that itself usually requires tons of branching. If you add 2 numbers, the Python runtime treats them as full objects. First it checks that it's not operating on null objects, then it ensures the wrapped value is a number or a type with the addition operator defined. If it is a number, it calls the C function it has for addition. If it's not a number type, it evaluates the PyFunctionObject representing the Python code of the function, which itself has something like 100 lines of C just dedicated to passing the argument in properly as a Python variable. The most optimized way to write Python code is to minimize your time spent evaluating lines of Python code. When running a purely interpreted language like Python or Ruby, the actual operations you request are about 1% of the work going on in the CPU.

For a JIT'd language like Java, JavaScript, or regular C#, it really depends on how good the bytecode compiler and the JIT are. There are two steps here that can potentially optimize your code: the first compilation step to bytecode can do many of the same optimizations a C++ compiler can, and the JIT can help pick the correct CPU instruction that handles several equivalent lines of a higher-level language (like the first function example). The Java compiler and JIT are quite good, sometimes better than current C or C++ compilers, at least in synthetic benchmarks of calculating prime numbers, sorting arrays, etc. repeatedly. For example, Java is capable of generating code that uses SIMD or AVX like in the final example in the video.

Basically, it doesn't work in a purely interpreted language because the runtime system itself has many branches. And for other languages, it depends on the code. If you're writing a branchless algorithm, it wouldn't hurt to benchmark it against the "naive" branching version of that algorithm. And benchmarking these system-specific optimizations may even produce different results on different CPU architectures. For example, some DSP devices have no branch prediction because they need strict guarantees on consistent realtime delivery. It's a slower architecture on average, but you can calculate a guaranteed number of clock cycles some sequence of instructions will take. A branch-predicting CPU can't make those same guarantees.
I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you use conditional statements, sometimes your code runs faster and sometimes slower (specific branches are just faster or slower), and an attacker can use that to get information about your private key. So all cryptographically sound functions must use branchless programming. Very interesting topic.
@sayon It is very difficult to actually accomplish this in practice. If your delay is a random value, then you can repeat your request multiple times and use the average time. Delaying so your total time is fixed is really difficult: You have to know in advance the maximum amount of time your algorithm could possibly take, and even with that knowledge, wait() is complicated and imprecise, and it is definitely possible that what you did before calling wait() will still affect the total time.
@@thenameipicked I am talking about a fixed delay, yes, not a random value. Besides, how are you hoping to achieve fixed time just by balancing the computations when the process can be interrupted at any time? We usually have preemptive multitasking in our systems.
I've designed CPUs for over 30 years and we pay a LOT of attention to breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that what seems to be avoiding branches ... doesn't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat a compiler optimization that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such as in GPUs, but even then compilers are pretty smart. The most efficient code is when tasks are parallelized so that multiple operations can be done based off of one or a small number of decisions. In trying to avoid branches, the cost is paid in non-maintainable code, and it is very high.

If you really want to improve efficiency, don't rely on multiple software packages which may actually include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, Perl, etc.) wastes much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required. In one example, we recoded a numeric-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive BeagleBone Black instead of a Mac. On the Mac it consumed 35% of the processor. On the BBB it used 15% of a much less powerful processor by vectorizing the operations. These are the sorts of changes that matter.
I mean, he literally looked at the compiler's assembly output to show how the branching and branchless code affected whether or not the generated assembly actually branched. In the first case, he showed that the compiler sometimes understands the purpose of code and makes great optimizations that already remove the branches, but he also showed how, in other, more complex code, the branchless version was much faster than the branching version. So I really don't think he or the video missed this.
@@kingacrisius The thing is that looking at the assembly may or may not tell you what the CPU actually does. There are some cores (old ADI SHARCs, Microchip's DSCs and a few others) that have such a predictable, deterministic pipeline that you can actually count the clocks, and if you couple them to SRAM, then you have the whole story. But modern IA-64 or Cortex-A cores do so many things on so many levels that, without intimate knowledge of what the core actually does, manual optimization is just a hopeless exercise. Silicon vendors want their wares to shine, so they contribute to compiler development heavily. Unless you are in the know in these matters, the best you can do is optimize for a particular platform, test it, and then freeze the SW AND HW design. Sometimes this is what is needed and acceptable. But if you think such code will survive on other platforms or on future revisions of a given platform, you may be in for a rude surprise.
@@BogdanBaudis Well, regardless, he timed it too. It is so clear when the branchless programming is better, because he timed it and showed that the assembly was faster. I seriously don't understand where all of this is coming from.
@@BogdanBaudis And obviously some code may not be portable, but he was coding C++, which is almost always portable. And that doesn't matter anyways for this discussion.
So what we've learnt: If you write normal code and normal patterns that everybody knows, most likely it will be a pattern that the developers of gcc (or other compilers) thought of. That way you have a higher chance of the compiler doing good things.
@InSomnia DrEvil Real truth. I tell people on my team this all the time. If it's not hard, actually try it and extrapolate. Rarely (but it happens) you'll get different behavior past a data size threshold. And even then... usually you can infer that (or predict this would happen)
All the code shown missed the fact that you could just remove bit 5 when the character is in range, shaving 3+ additional instructions with a well-placed xor eax, 20h (with 20h stored in r8d instead of -1, times the char's lane in the multi-data case), since you don't really care about the original data, or you would back it up (the more complex code still used sub 32 and some other data-shuffling junk). So sometimes it's better to assemble it yourself.
In the program I optimised, gcc did a good job of making the bottom-level if statements branchless, but that didn't stop my carefully optimised C from being 100x faster than the natural C++ implementation using the stdlib. It was well worth the increase in complexity.
It still matters. Compilers are great but not perfect. There are still seemingly trivial and arcane situations where the compiler will be confounded and unable to apply proper optimizations to code. It won't be noticeable if you're writing a user application, because those misses will be fairly infrequent. If you are writing a linear algebra library for physics calculations, then you are going to be performing millions of these instructions repeatedly in a very short period, and the misses become noticeable. Advocating against premature optimization doesn't mean all optimization is premature.
I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU rather than the CPU. GPUs have multiple cores that share an instruction pipeline, where each core runs the same instruction but on different data. That often leads to both sides of a branch being executed, with one then discarded. I have seen performance improvements above 100x by going branchless!
@@nistarok123 Yes, exactly. Avoid branches in GPU code as much as possible. In fact I think there are some shader languages that don't even allow the if statement.
Note: GPUs these days won't take the performance hit if all threads in a warp (NVIDIA), wavefront (AMD), etc. (others) go down the same branch... whichever branch that is. So let's say you have a complicated material depending on whether or not you're shadowed -- you will only traverse both paths if your warp/wavefront corresponds to a block of pixels that overlaps with an edge of the shadow. If the block is entirely shadowed or entirely illuminated, then it will only run the shadowed or unshadowed path, respectively. This is 32 threads in NVIDIA, 64 in AMD... although they do reserve the right to change.
Cryptographic libraries also use branchless programming a lot. This is to prevent timing side channel attacks, where the timing of an operation can provide a hint as to what was encrypted.
05:00 The lesson here: never over-optimize your code beforehand. Start by writing natural and simple code. Let the compiler do its tricks. Check the result and only look for optimizations in the parts that the compiler didn't already optimize.
@@tetramaximum Both are valuable lessons. Another one is "know your data". Imagine a super-inefficient O(n!)-scaling algorithm. If you can prove that your data must always keep n tiny, even that algorithm can be perfectly fine.
When I first heard people saying "avoid branching in shaders" I had no idea what they were talking about. "How could you ever possibly avoid branching???" I thought. Now I see how. Thanks!
It must be said, though, that even shader compilers are getting better and better, and you can use simple ifs which don't end up branching. For example, if you write: if(x>=0.8) {y=x} else {y=0} one might be tempted to optimize by writing: y = x * step(0.8,x) And yet, at least in HLSL, the compiler creates the same low-level code for both.
This is a common misconception. IIRC, uniform branching is 100% fine; the only problem with branching in shaders is that if a heavy branch is taken by even one invocation, the whole warp will stall on it. This is problematic, but it does not mean branchless techniques are better, as branchless techniques would just calculate both branches, meaning you'd actually have the whole warp do those heavy calculations. See shadertoy (dot) com/view/wlsGDl and twitter (dot )com/bgolus/status/1235254923819802626
As a long-time software developer, in virtually every case code maintainability is far more important than speed. I've written programs that processed millions of transactions, and branching was the least of the performance concerns. Paging and I/O are typically much more of a problem. Except in toy problems, memory management will be a much larger concern. Introducing repeatedly executed multiplication generally is not good for performance. "And"-ing 01011111 with ASCII lower case converts to uppercase just as well as subtraction does, and to my mind is much clearer. Unless there is a genuine need, don't make your code ugly to eke out a few cycles.
I agree, except I'm convinced we all need to learn this the hard way. I haven't met any developer who did not try to be clever in the first 20 years of their career. Took me 30 years to start understanding this myself, but I need to really try all the fun things before I do it the hard way.
@@henrikskott I had a CS prof once tell a story of how he spent so much time trying to optimize the code as he wrote it that it took him weeks to write, and when he ran it for the first time, the performance was terrible. He ran it through a profiler, and the part he had optimized took virtually no time to execute anyway; it was some other part he hadn't even thought of that was taking all of the time. His message was "just write the code sensibly the first time, then run it, and only then, if there are performance issues, do you optimize anything, once you know exactly what part needs to be optimized." That's definitely a winning strategy! Don't waste time optimizing things that don't matter anyway.
Good point. I found the video interesting and informative, but part of me was wondering "in what context would these techniques genuinely be useful?". While I'm sure there are things out there that require this level of optimization, typically it makes more sense to prioritize readability. Also, great idea with the AND-ing! Very elegant way to solve it, at least in the case where the string only has letters. If non-letter symbols are present though, you'd need to add more logic to handle that.
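One branchless way to add that handling, as a rough sketch (plain ASCII assumed, function name made up): turn the range check into a 0/1 value and shift it into the case bit, so non-letters pass through untouched.

#include <cstdint>

// Clears bit 5 only when c is an ASCII lowercase letter; everything else is unchanged.
inline uint8_t to_upper_ascii(uint8_t c) {
    uint8_t is_lower = (c >= 'a') & (c <= 'z');  // 1 for 'a'..'z', otherwise 0
    return c & ~(is_lower << 5);                 // mask is 0xDF for letters, 0xFF otherwise
}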
It could be even better to implement/improve this kind of optimization in the compiler itself. That way you can focus on the elegant code and gain a boost in maintainability while the compiler does the hard job of these very low-level optimizations.
Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.
"I can teach a monkey to do what you do. What I can't teach the monkey is when NOT to do it." - one of my father's chiropractic professors But yeah. Branchless programming is just one of many tools in our arsenal, not a silver bullet. Having a hammer doesn't make every problem a nail. Same here.
I'm surprised by the amount of negative comments; he is not telling people to always code like this... there is a difference between time-critical code and not. Get the damn profiler out and find where your program is spending most of its time, and THEN consider doing this or not. The only negative thing about optimization is that it costs development time. The idea that "the computers will be fast enough" is just naive. Effective optimization can be the difference between a product that reaches some of the market or most of the market, or a program that can be extended for decades to come versus something that needs to be redone in the next few years.
@kingofallcrypto Yeah, I can definitely imagine seeing stuff like this in first-year students' code. In one project, they had to keep track of 16 boolean values on a microcontroller that had way more than enough memory for what we were doing. They wanted to encode them in the 16 bits of a uint16_t (16-bit integer type). It's natural for them to think that's a good idea, since we did tell them to use uint8_t over uint16_t when the number they needed to store only went from 0 to 10. But the reason we told them that was mainly as an excuse to talk about architecture, point out that the microcontroller had an 8-bit bus, and discuss that sort of thing. It did give me a chance to have a good discussion about balancing programmer effort with execution speed. In the uint8_t vs uint16_t case, there are no downsides and there is a theoretical benefit in speed. In the store-16-booleans-in-a-uint16_t case, it's going to be a complete pain in the a** to store and retrieve the actual booleans (even ignoring that the memory saving has a runtime cost), we're introducing the possibility for errors, and we have more than enough memory by a wide margin for that project.
You try to make it seem like it is worth it every time, while it certainly isn't. This is the kind of thing you would use if you had used every other trick in the book and need it for some one-in-a-million project.
For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong. And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P
Multiply by 1 or 0 and you get an early exit on modern ALUs. With pipelining and SIMD, you come out ahead if you stack a large enough vector up: 8 ops/cycle plus a constant 3-cycle multiply penalty. If you have 800 elements, you are basically doing that multiply in 1/8 of a cycle. So even with 2 multiplies and an add, it's still over twice as fast before any branch time or penalties are involved. And then AVX-512 and SVE2/Cray vectors have predicated SIMD ops where you can create a predicate mask and then conditionally perform a vector op: compare-LT V1 and V2 to create a mask, then copy V1 into V2 using the mask to predicate each lane; an element from V1 is only copied if the LT mask is true.
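For anyone who wants to try the compare-then-select idea above without AVX-512 predication, here is a rough AVX2 sketch (element-wise max of two int32 arrays; the names and the multiple-of-8 length are just assumptions for the example; compile with -mavx2):

#include <immintrin.h>

// Element-wise max of two int32 arrays, 8 lanes at a time (n assumed to be a multiple of 8).
void max_i32(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256i va   = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i vb   = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i mask = _mm256_cmpgt_epi32(va, vb);        // all-ones lanes where a > b
        __m256i vmax = _mm256_blendv_epi8(vb, va, mask);  // take a where the mask is set, else b
        _mm256_storeu_si256((__m256i*)(out + i), vmax);
    }
}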
@@Theineluctable_SOME_CANT Keeping myself informed about CPU microarchitecture and programming is a hobby of mine. Actual programming, not so much. Being outside and producing something I can touch is more rewarding for me.
A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long, and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.
@@IceExtremeGamers His whole discussion was assuming ASCII as a toy example. Yes, the small lookup table will not work for general multi-byte Unicode (especially variable length encodings like UTF-8), but Unicode is more complex and none of the other stuff described in the video, like SIMD, would work either. For 8-bit character encodings, the table is actually far more general than anything based on conditionals such as conditional moves. The table approach would also be usable for 16-bit character encodings, as 64 KiB is not a lot of memory nowadays, and the parts of the table that would actually be used in practice would almost certainly fit in L1D$.
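A sketch of the 256-entry table described above, assuming an 8-bit character set where only 'a'..'z' change (C++14 or later for the lambda initializer):

#include <array>
#include <cstdint>

// Built once; conversion is then a single indexed load that stays hot in L1.
static const std::array<uint8_t, 256> kToUpper = [] {
    std::array<uint8_t, 256> t{};
    for (int c = 0; c < 256; ++c)
        t[c] = (c >= 'a' && c <= 'z') ? uint8_t(c - 32) : uint8_t(c);
    return t;
}();

inline uint8_t to_upper(uint8_t c) { return kToUpper[c]; }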
Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.
Yeah but that inefficiency of computing both branches and discarding one is the passport to being able to run the many parallel streams so you still gain.
@@robertoaguiar6230 I haven't thought about the albedo consideration there... I've seen a lot of very dark color mousepads... Hmmm But I suppose it would be albedo in the specific range the mouse actually uses. Infrared?
For whatever it's worth... Smaller_Branchless produces assembly identical to a branched version when using clang 10 and compiling with -O2. In GCC 8.3, Smaller_Branchless is flat-out worse than the branched version, whose branches were optimized out by the compiler. In any case, for the branched versions the awesome compiler optimized away the branches. So for the noobs out there: if you're doing something like this, always check the compiler output, and measure, measure, measure.
Not only that, but Smaller_Branchless is not actually branchless; conditionals are just hidden branches. Writing a * (a < b) still forces the compiler to add a test and jump, a.k.a. a branch, to perform the a < b comparison.
@@Henrik_Holst No, branches are distinct from conditionals. In this case, we see it transferring a condition flag into a general purpose register value using SETc instructions, because x86 has a condition flag register; not all architectures do. Conditional instructions like x86 CMOV or AVR SBRS operate by cancelling the retire step of a single instruction, which doesn't divert the instruction decoding; that's a cheaper form of branching. What you say would apply to architectures where the condition test is embedded only within branch instructions, as if you removed the CMP* instructions and had to use B* instructions in Nios II. That would be a typical case where a compiler might have a different optimization to avoid the issue, like using the sign bit from a subtraction.
@Isaac Beizsley It, indeed, won't turn your O(n^2) into O(n). But it will fix your branches better than most people in most cases (the example OP made). It indeed won't do any SIMD. However, if you are prepared to go to assembly just to speed up the code, it is worth assuming you know what you are doing. Like in the video, where he does indeed outsmart the compiler.
It's useful when you do shader programming, where branching is very expensive. I had to «invent» some of those examples myself when I did shader programming. Great video!
Brilliant explanation. The following bombshell caught my eye... 12:19 "So it just goes to show that you don't automatically get speed from jumping over to assembly...you have to code it well" - very well said ;-) With every language above assembly, it's a shame that we have so little control over the final machine code output, even with C/C++. Lessons: 1) Never trust your compiler. 2) Become an expert in assembly language and learn to e.g. inline where needed.
Note that the branch predictor is generally able to predict branches accurately for loops, since they almost always take the same path. Thus, getting rid of a loop branch is usually not of much concern for purposes of avoiding branch prediction failures. Of course, there's still value in improving loop iteration code, since it still represents instructions that run on every iteration.
@@TheGandorX Unless you have a limited amount of parallelization that doesn't allow the JIT to run on a separate, physically parallel thread. Then it about doubles your run time.
Only if you have sorted data. Once you enter the territory of truly random data, the branch predictor practically doesn't exist anymore, since it simply won't be able to predict. Now, some CPUs (don't pin me down on which ones exactly) can execute two branches simultaneously in parallel while still processing the condition. Once the condition's result is known, the output of the correct path is used and the other data path is thrown away. This has quite some prerequisites, though, but it works pretty well.
In some architectures, multiplication takes more cycles to execute than less demanding arithmetic. In the first example if a < b usually comes out to be the same (whether true or false), then branch prediction will avoid the delays inherent in "if", and the result could be faster than using multiplication. The architecture may even be incapable of calculating the true/false result other than by the use of branches. In any case, unless the function is called intensively, the benefits, or otherwise, of the branchless approach will be invisible.
You can get 4x or 8x speedups for some numeric code using assembly (e.g. the fast Walsh-Hadamard transform, neural network code). Sometimes the compiler can find better assembly than you can. If you need performant code, try both ways. GCC found much better code than I did for this FreeBasic routine:

sub subrandom1(result as single ptr, x as single ptr, n as ulongint)
    dim as ulong ptr resultptr = cast(ulong ptr, result)
    dim as ulong ptr xptr = cast(ulong ptr, x)
    dim as ulong r = &hE8FE90EB, rs, shift = 18
    for i as ulongint = 0 to n-1 step 4
        rs += r
        resultptr[i] = xptr[i] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+1] = xptr[i+1] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+2] = xptr[i+2] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+3] = xptr[i+3] xor ((rs xor (rs shl shift)) and &h80000000)
    next
end sub
I remember Bjarne Stroustrup saying something to the effect of 'the compiler will write better assembly than a human ever could', and in most cases, especially larger ones, that holds true.
The compiler can only write assembly as good as the person who wrote the compiler can write assembly. Your statement is mostly true for people who don't understand assembly and the way CPUs actually work
@@christopherjoseph651 Let any of the people who made the compiler try to write the same program in assembly as in C++ and see which one performs faster and with fewer bugs. It doesn't matter if you understand a CPU, the assembly code from the C++ compiler will be far better than any manually written assembly. Because the people who make the compiler only do translations and optimizations, they do not write entire systems with it. Because that's stupid.
@@christopherjoseph651 Not really; at the largest scales this statement holds absolutely true. There are limitations to what humans can do; a large project in pure assembly would be just plain impossible for a human to do, it's unrealistic.
@@CottidaeSEA Did you not watch the entire video? He proved that a human who understands the instruction set of the CPU and who understands the purpose of the high-level code can fully utilize the instruction set and hardware of the CPU. He literally beat the compiler by 45x! If the compiler is so superior, why didn't it write the code using AVX? Exactly: a human wrote the rules that the translations and optimizations follow. So a compiler is only going to write code as good as the person who wrote the compiler. This has nothing to do with program scale. The point is that a human wrote the compiler, so the compiler can only write optimized code as well as the person who wrote it. If that person doesn't realize how to perform an optimized task in assembly, then the compiler won't be able to optimize the code either. What, do you think we're living in a world of Skynet where computers build more sophisticated computers and write their own programs?
@@dionyzus2909 This is a childish statement written by someone who believes in the new education model in which older languages are outdated and inferior and therefore shouldn't be taught. We don't need new, more complex languages to solve new and more complicated problems. To this day, when a program needs to be optimized for speed, it is written by a human using inline assembly. Even the guy who made this video understands that you should look at your disassembled code and be capable of understanding what the compiler is doing. Do you need to write a large program that is not time- or resource-critical in assembly? No, there's no need for that, but the statement that a compiler will write better assembly than a human ever could is COMPLETELY FALSE!
Awesome! Branchless programming would be even faster if the CPU had an instruction that filled all the bits of a register with the zero flag. These "computer science approach to C++ programming" videos are getting really good and I suspect are unique on youTube. You are improving (possibly, even saving) our mental health with these conceptual challenges and I, for one, am so grateful.
This will seem trivial to some, but for anyone planning to do arithmetic with the result of a boolean expression in .NET should bear in mind that 'True' equates to -1, rather than 1, as in C++. EDIT: I'll just add that any non-zero value (positive or negative) equates to 'True' in a conditional statement, and zero equates to 'False'.
-1 (int) is just max_uint, aka 0xFFFF... you can do (boolVal & 1), but it gets tedious haha And I've never used a system where true is -1 myself but if I had to guess I'd say it's because the designers wanted logical operators to work on boolean values, e.g.: not false = true (~0 = -1) true or x = true (-1 | x = -1) etc, etc, etc.
@@Vacuon I've worked with plenty of systems where -1 can be true, but it's not the ONLY truth value. As in, 0 is false and everything else is 'truthy'. ~ (BoolValAsNegativeOne) is just BoolValAsNegativeOne XOR -1, vs BoolValAsOne XOR 0x01. It doesn't appear to show negate -1 as an advantage. I mean, for either method you still have to auto-demote to a bool type to perform the operation. And it's the same amount of assembly instructions to invert; load the value, perform the operation. I'm SURE they had a reason for doing it. The thing about having a TRUE as a negative -1, means you can't index an array via an evaluation myArray[a>=b].. well I guess it works if your language supports negative indexing (like Python).
@@SeanCMonahan Not sure about C# but I wish C/C++ also used -1 (aka ~0) for 'true'. It would make very practical bit masks. For example, the video uses the slow integer multiplication (imul) instruction where a simple bitwise AND with a proper mask (0x0...000 or 0xF...FFF) would do instead. Conveniently, 0xF...FFF = -1U. Also, 0x0...000 and 0xF...FFF are just complementary values. You get the opposite one just by flipping all bits. x86 CPUs even have an instruction for that (NOT). ~0x0...000 = 0xF...FFF. All expressions consisting only of 'magic numbers' like this are pre-computed by the compiler. So if 'true' was -1, the min(a, b) function could be expressed as: return a & (a < b) | b & !(a < b); Since 'true' is 1, we have to flip the sign: return a & -(a < b) | b & -!(a < b); Or leave it to the compiler because the compiler writers know well how to compute a minimum of 2 numbers properly on your platform. Use 'std::min()/max()' for branch-less programming. The CMOVxx instructions have been with us for a very long time (since Pentium Pro in the mid-1990s) so they can be taken for granted now. :)
@@neur303 For example, I have a branch that signals a High Alarm (so the x value is higher than y). I as a programmer know that one of the two branches almost never gets executed. Does this mean I can "guide" the compiler, as in: hey branch predictor, please favor this path, the other branch will almost never hit? Is this possible, especially if you need to check this for hundreds of thousands of signals each cycle? [[likely]] to be lower, [[unlikely]] to be higher than y.
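In case it helps: yes, C++20 has [[likely]]/[[unlikely]], and GCC/Clang have long offered __builtin_expect. They steer code layout and the compiler's static assumptions rather than programming the hardware branch predictor directly, so it's still worth measuring. A minimal sketch (the alarm function is made up):

void raise_high_alarm() {}  // hypothetical handler, stubbed out for the sketch

void check_signal(double x, double high_limit) {
    if (x > high_limit) [[unlikely]] {   // C++20 attribute on the rarely taken path
        raise_high_alarm();
    }
    // Pre-C++20 equivalent with GCC/Clang:
    // if (__builtin_expect(x > high_limit, 0)) { raise_high_alarm(); }
}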
Agreed, but this is highly dangerous and misleading information for everyone else, as a disclaimer that it is basically "experts only" is missing. And seeing what inexperienced programmers try to accomplish, this can lead to a lot of technical debt, as it will be misunderstood, badly used and then mixed with branching techniques. So yeah, fun :D
Agreed 100%. As a Javascript programmer, I've noticed many of my peers don't seem to care much about how the language works. I know there are many performance penalties from it being a dynamically/weakly typed language with just-in-time compilation. So whatever edge I can eke out in performance I will take. Plus I eventually want to learn more than one language. But I'm making sure I REALLY understand JS first.
I was doing this back in the late 1970's, on a Cray-1 supercomputer with pipelined vector registers. Complex single-instruction operations were broken down into a sequence of simpler operations, each performed by a separate functional subunit. After one clock cycle, the first functional subunit was free to begin operating on the second element on the vector, while the second functional subunit began operating on the intermediate result in the first element of the vector that was left there by the first functional subunit. Complex operations like floating point division could then produce a result every clock cycle, even though the operation could take a total of six or more clock cycles to perform on a single operand beginning-to-end. Like stations on a car assembly line, the game was to keep as many functional subunits busy as possible at any given time. We would carefully map out each tick of each component of the CPU on a big chart before implementing in assembler. The thing was blazing fast, even by today's standards. Nice vid!
this is still the game, with 5 logic and 5 memory execution ports available for out-of-order processing (execution is determined by data dependency order), writing code to keep all of them busy with chains of non-dependent code in each loop to mitigate latency over iteration - although division is one thing that isn't pipelined. I guess division is rare enough that it's not worth the extra silicon.
When you get into serious depths of optimization, you start looking at things like branch predictability. Like, there are situations when you chose to write a less optimized version of algorithm, because it has more consistent depth, and that means the loops are going to repeat a similar number of times, making branches on them more predictable, resulting in faster execution.
@@psun256 Yeah. Because on the assembly level, every loop is just a branch that sends the execution back to the beginning of the loop. So the CPU will hit the same branch over and over again as it's running the loop. And modern CPUs have very clever ways to get a heuristic for these branches. Like, if you have a very tight loop in some algorithm that always runs for, say, 10 iterations, your CPU can actually learn to predict that every 10th time it's not going to jump, and start caching instructions from the correct branch. At that point, it's like the branch isn't even there!

If the number of iterations varies, then the prediction is worse. The best your CPU can do in these cases is "usually, the jump is taken," so it will always cache instructions for repeating the loop, and will miss and have to flush the pipeline whenever the loop finishes, which is still better than not having prediction at all, but not optimal. That's one of the reasons why an algorithm with consistent iteration counts can sometimes perform better than an algorithm where the iteration count varies. Of course, there are also optimizations your compiler can do if it can "guess" the number of iterations, which can make an even bigger difference.

We are getting into the territory of optimizations where you _really_ need to have a good reason for going after them, though, as you'll spend a lot of time testing, timing, and looking at disassembly, and you shouldn't be doing that until you've fixed all other timing problems. The only real-world example I have where we've spent time going after this kind of stuff is character animation in games. When you have hundreds of bones in hundreds of skeletons for which you have to update the transform matrices 60+ times per second while leaving enough time for the CPU to do everything else it needs to, you start looking at things like cache and branch prediction optimizations.
A good example for branchless is in crypto, where data-dependent code execution can lead to side-channel leaks, so you look for branchless and constant-time algorithms. For example, to compare two byte arrays of hash results you would accumulate the XOR of each byte pair and not break the loop on the first difference.
@@inx1819 Actually, you don't want to add random sleeps. If you just sleep randomly, all operations will be extended by roughly the same amount of time on average, so you can still statistically figure out which operations take longer than others. Instead, what you want to do is sleep longer when the operation finished faster and sleep less when the operation was slower. You can do this by deciding beforehand how long an operation should take. For example, you might decide that an operation should take 1 second, so you do the operation, and if there's any remaining time, you sleep for the remainder. That means no matter which branch the code takes, it'll always take 1 second. Now, this isn't perfect; there's also power analysis, because doing an actual operation likely costs more power than sleeping, so this is only enough if your attacker is remote (e.g. a webserver). But if you need to consider a local attacker with full physical access to the hardware (e.g. smart cards), you might have to busy-sleep.
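A minimal sketch of the constant-time comparison described a couple of comments up: accumulate the differences instead of returning at the first mismatch (equal-length buffers assumed):

#include <cstddef>
#include <cstdint>

// True only if the two buffers are identical; the running time depends on len,
// not on where (or whether) the buffers differ.
bool constant_time_equal(const uint8_t* a, const uint8_t* b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; ++i)
        diff |= a[i] ^ b[i];   // no early exit
    return diff == 0;
}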
This sort of optimization seems like what you'd do only for high performance real-time applications--like video games for instance--and only then as a third-pass after you'd first written the code in much more human-readable format. My coding bootcamp instructors drilled into me the mantra: First make it correct, then make it clean, then make it fast. Ideally by the time you reach 'make it clean' you've got good tests that pin down the functionality of the code before you refactor. After that you can come up with fast, clever implementations that still pass the tests. You shouldn't ever need to debug the fast clever implementations if you've restricted yourself to good, pure functions and written your tests correctly.
bootcamp instructors prepare people for no brain web dev jobs. This is something you will never use in such a work environment. It's mostly for advanced, low level programmers.
"If" is not *necessarily* slow: if your condition goes to the same place 98% of the time, your branch predictor will be very accurate, and the pipeline will rarely be broken.
Even that is not free. Branch predictors tend to have limited memory, and relying on it in one place hurts branches elsewhere. But then cache is limited too, so smaller code is often better.
Funny that I got a recommendation for this video after watching a few presentations about the Elbrus CPU, which doesn't have a branch predictor at all. Now I understand what kind of hoops the developers of its compiler and the porters of the JVM had to jump through...
1. Make sure your branch targets are aligned to 16-byte boundaries. 2. If you're comparing to upper AND lower bounds (whether masking or branching), you can subtract the lower bound and do a single compare instead. 3. When writing assembly, pay attention to instruction dependencies, and interleave independent instructions, e.g. load 1st char, load 2nd char, compare 1st, gen mask, compare 2nd char, and so on. 4. You can often apply these techniques directly in C++, possibly with intrinsics. ... Just my 2c
Branchless works great for GPU programming (GLSL etc). It's actually worth complexifying the algorithm just to minimize the branching (but not always).
Really depends on whether the branch is divergent or not. Modern GPUs have some level of branch prediction built in and will try their best to only take one path through the branch while leaving the other path alone, so if you know for a fact that the branch is very convergent (ie it's using a uniform/constant value, or it's based on data that's convergent in nature), then there's nothing wrong with branching (except on ancient GPUs). However, if the branch is very divergent (ie it's based on data that's divergent in nature), *then* you should try to implement the same logic in a branchless manner, as the GPU is forced to take *both* paths through the branch, with half of the warp being disabled through each branch.
Early on in my programming career I noticed that there was usually a relationship between speed and instructions -- time and space. Plus, you can throw in the time it will take to debug clever code.
I did not expect a character like yours in a programming video! You do a great job of keeping things interesting and a bit funny while making good use of the 20 minutes with practical examples. Really good video!
In my opinion, the main reasons to avoid overly branchy code are readability and maintenance time. But for that, clean code with well-designed patterns will save you a bunch of 'ifs', which automatically makes your code easier to understand and easier for the compiler to optimize. Branchless programming the way it is presented here has a proper place and time... if you need it, you will know.
Branchless programming is often about taking a step back and finding a simpler approach. In the second example, I would just take d[i] & 0xDF; it keeps the uppercase characters intact but clears the 6th bit for lowercase letters, making them uppercase.
Yes, that will change some non-letter characters too. If those characters aren’t in your text, then AND away my friend! It will certainly be faster!! The example is not super important, just the technique of masking in SIMD to perform multiple branches without actually branching. Cheers for watching :)
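For anyone who wants to play with the masking idea, here's a rough SSE2 sketch (not the code from the video): two compares build an "is lowercase" mask per byte, and bit 5 is cleared only in the masked lanes. ASCII input and a length that's a multiple of 16 are assumed.

#include <emmintrin.h>  // SSE2

void to_upper_sse2(char* s, int n) {
    const __m128i lo   = _mm_set1_epi8('a' - 1);  // 96
    const __m128i hi   = _mm_set1_epi8('z' + 1);  // 123
    const __m128i bit5 = _mm_set1_epi8(0x20);
    for (int i = 0; i < n; i += 16) {
        __m128i v    = _mm_loadu_si128((__m128i*)(s + i));
        __m128i mask = _mm_and_si128(_mm_cmpgt_epi8(v, lo),   // v > 'a'-1
                                     _mm_cmpgt_epi8(hi, v));  // v < 'z'+1
        v = _mm_andnot_si128(_mm_and_si128(mask, bit5), v);   // clear bit 5 where masked
        _mm_storeu_si128((__m128i*)(s + i), v);
    }
}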
In time-critical code know your cpu. I once rewrote a graphics library in Z80 assembler and sped it up 10x by unrolling loops and counting cpu cycles for every bit of code. There are all sorts of tricks you can use to improve speed as any good programmer knows.
@jshowa o No, tools of the trade don't necessarily make one a good programmer. But pretty much only good programmers know these things. And there is nothing unmaintainable in these designs. Again, one just has to be wise about implementation. But that's the case in all languages, including those that advertise themselves as nearly fool-proof. No language prevents or even hinders unmaintainable code.
@jshowa o Yes, I watched the video. I also have 40 years of experience. You repeat what I've heard all too often. It's true compilers are very good at many things. It does not follow that they are best at everything. Experience has conclusively demonstrated that compiled code is not always the quickest code or the smallest code and there are still cases where maximum efficiency is required.
I would love to get your email, discord or linkedin or whatever and chat with you about the various assembly optimization tricks you could teach others! I'm an operating systems developer myself with 3 years of experience and I'm just now starting to really get into assembly language (I use an internal closed-source C-like language at work which is so old, it doesn't have an easy to use print function, so often the best way for me to debug the runtime is to freeze the operating system at certified compiler-emitted assembly instructions and look at registers and memory locations hahaha) and hardware architecture and optimizations enabled by a programmer's understanding thereof. My current side project is 6000 lines of mostly algorithmic C code (custom BigInt and cryptography library to be used for a secure chat app) and I'm sure there would be a lot for us to talk about and a ton for me to learn from you!
The primary rule of optimization is left out here, which is to FIRST MEASURE, aka 'profile', to see which part of your code the CPU is spending most of its time in. Once you know where that is, you try to find any algorithmic / complexity improvements (take your algorithm from O(N^2) to O(N) or even O(1)). Then finally, after all that is done, you may think about 'intervening', checking the assembly and seeing if you can "do better yourself". Introducing obfuscating code like a*(a>b)+b*(b>=a) leads to bugs and unnecessary rabbit holes for your future self when you get called in to debug the mess you made.
I can see these techniques being applied for things that require a lot of calculations to be done as fast as possible. The issue is when it comes to other people looking at the code without fully understanding why it's written this way.
Back in the late '70s, I came across a set of algorithms that computed Julian and Gregorian dates using only double-precision arithmetic. Leap years and Y2K worked without modification. Operating systems and databases frequently got leap years and daylight savings time calculations wrong, and it took a lot of time and effort to recover. The software that I wrote using the arithmetic functions only failed once, and that was when someone compiled them single precision. It made me a big fan of using arithmetic functions to perform logic.
It seems to me that it's a choice of evils: 1. Easy to read but much slower code. 2. Optimized but heavily obfuscated code. Option 2 is WAY above my pay grade :-) so I'd go with #1 unless I had no other choice. Still, your video was extremely well done. Even I was able to follow what you were saying and I'm merely an intermediate level programmer. That's definitely worth a "Subscribe".
This is why programs back in the very old days were really hard to read: scientists optimised them for the machine when clock rates were, like, below a thousand Hz. However, the rise in computing power means there is less need to write optimal code for the machine when the machine can do more work for you, so human programmers can spend less time writing complex code and we can produce more programs for more purposes. It's a trade-off; it really depends on what type of work you're doing.
I dunno, option 2 is what compilers are for. Or in my case, precompilers, since I work predominantly in a compiled language without certain niceties and use a precompiler that translates the extra control statements I want to use into standard controls that actually exist. Also look into the MOVfuscator for more evidence that optimizations can often be offloaded onto the compiler. Edit: I said all that BEFORE WATCHING THE VIDEO, WHERE HE SAYS COMPILERS HELP
"Before, we used programmers to save on computing resources, now we use computing resources to save on programmer time." Don't remember who said that. And like the others say, that doesn't mean #2 should never be used, just that you should only use it for measured bottlenecks that are worth it. Even if you identify a bottleneck but your program doesn't need to be fast, then you still don't need to optimize.
Some processors have a conditional selection instruction, which at source level corresponds to a > b ? a : b. Branchless programming is a must in time-critical applications where worst-case execution time prevails. This skill is appreciated in avionics design companies.
You're correct. Neither the video maker nor the adoring commentators understand what a branch is, and are just jerking off trying to eliminate short jumps. Fucking brutal.
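Going back to the conditional-select instruction mentioned two comments up: in C/C++ that's just the ternary form, and compilers will often turn it into cmovcc on x86 or csel on AArch64, though whether they pick the conditional move or a branch is their call, so check the output if it matters:

int select_max(int a, int b) {
    return (a > b) ? a : b;  // frequently compiled to a conditional move rather than a jump
}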
People have done it before; here's an example I found online. If you wanna know if it's smaller, like in the video, just put a ! on the function call:

int isGt(int a, int b)
{
    int diff = a ^ b;
    diff |= diff >> 1;
    diff |= diff >> 2;
    diff |= diff >> 4;
    diff |= diff >> 8;
    diff |= diff >> 16;
    diff &= ~(diff >> 1) | 0x80000000;
    diff &= (a ^ 0x80000000) & (b ^ 0x7fffffff);
    return !!diff;
}
I thought it was funny when he said "The compiler couldn't figure out what we were trying to do". Same as most people trying to read that code. There are lots of compiler tricks for optimizing; removing branches is just one of them. And optimizing for speed is not the only type of optimization a compiler can do.

My own intuition is to strive to make your code readable first. If you are not meeting speed requirements, reconsider your algorithm first. As a very last resort, consider using assembler. I have trouble imagining a scenario where coming up with some contorted approach in a high-level language that is hard to understand, maintain, or predict the performance implications of (without looking at disassembled code, which seems kind of like "why bother") would be the preferred approach.

BTW, the same high-level code can be compiled into different native instructions for various reasons (e.g. different compilers, different target processors, different compiler options, ...), so it seems like you would need to buy into disassembling your code EVERY time one of these variables changes to see if it is still optimized. Of course, it is not uncommon for the writer of the code to not really know when one of these things changes in their build environment. In general, I would say that if you start trying to second-guess your compiler, you are at best wasting your time, and at worst creating a maintenance nightmare.
To be fair, I think all of these points are taken into account by the video. He's not saying you should do this stuff everywhere, only in cases where you need extreme speed, and especially in SIMD scenarios. Nobody's doing SIMD unless they need speed.
Remember that not everyone is writing code for webservers and PCs. If you work on high-level applications it's hard to imagine, but there are people writing for low-level applications where cycles and bytes are scarce.
@@shiuido359 EXACTLY! 99.999% of people who write code are oblivious to the fact that low-level programming is still relevant and necessary in certain applications. They live under the belief that assembly is old, outdated, incapable, and doesn't need to be taught because they learned Java or Python.
Although not the purpose of this video, the old school trick for toupper is to bitwise AND each byte with a mask (~(0x20)), presupposing the arrays are all ASCII characters in the range of A-Z/a-z. Of course, once out of ASCII, this isn't the right method.
For the conversion from lower to upper, wouldn't it be faster to just do an AND on the byte with a register holding the inverse of 32 in binary (6th bit set to 0)? So loop over value & 0b11011111; depending on your architecture this conversion could be done in a single operation.
For unsigned ranges like 'a' to 'z', a smaller range check I use is (value - lower + 0u > range), where 0u forces the comparison to be unsigned and range would be 'z' - 'a'; any underflow wraps around to an unsigned value that is also out of range. Instead of two compares, where both bounds have to be accounted for, it's one subtract and one compare.
It will probably be slower, since you changed one comparison and a logical operator into one subtraction. A comparison is a subtraction without storing the result, so cmp is faster.
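For reference, the single-compare range check from two comments up in plain C++ (ASCII assumed); whether it actually wins over two compares depends on the surrounding code and the compiler, so measure:

inline bool is_ascii_lower(unsigned char c) {
    // Values below 'a' wrap around to huge unsigned numbers, so one compare covers both bounds.
    return static_cast<unsigned>(c - 'a') <= ('z' - 'a');
}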
I didn't even know branchless coding was a thing. I just assumed "if" was necessary to test a condition and didn't really question it; it seemed reasonable enough. You've opened my eyes. I'm not gonna use branchless coding yet, most of it flew WAY over my head, but knowing why branches can be slow is still good.
It really isn't that difficult if you think about it, for comparison replacement, a>b itself is a mathematical expression which returns 1 if it's true, so if you do x=a*(a>b), that will mean x=a if a is larger than b. You can use that same principle for anything. That said, I think the main lesson learned from this video is that you shouldn't try to be too clever because most of the time the compiler will do a better job than you ever will.
@@LocrianDorian ye nah I get the algebra, I just don't understand how converting "if(a>b) return a" to "return a * (a>b)" removes a comparison, you're still comparing a to b
@@Louis_Marcotte It doesn't remove the comparison, it removes the branch. The thing that slows down the CPU is not the comparison itself, it's the "if" statement.
@@LocrianDorian He's right, you're wrong. Eliminating jumps to minimize cache miss is not what branchlessness is about. Whether you're using an if or multiplying with the result of a comparison, you are creating a branch for the CPU's branch predictor: either the comparison will be true or false, and your multiplication result depends on that. It will either predict the branch result correctly or not, resulting in either a speedup or wasted work. (a * (a > b)) creates a branch for the branch predictor no different from an if.
However, if the data is predictable, we can use indirect addressing, hash tables, and other branchless techniques. This is likely the limit of optimization in modern contexts.
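A tiny illustration of trading a branch for indexing, in the spirit of the comment above; whether it beats a plain ternary (which often becomes a conditional move anyway) is for the profiler to say:

// Instead of: if (a > b) return a; else return b;
int max_by_index(int a, int b) {
    const int options[2] = { b, a };
    return options[a > b];  // (a > b) evaluates to 0 or 1 and selects b or a
}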
Good to know some programmers are interested in this stuff... thought it was a dying art :) Did a lot of stuff back in late 90's with good old MMX instructions to process video in real time (which was tricky back then). Even now some clever optimization can make a huge difference - like being able to adjust photo editing filters in realtime with a slider (e.g. Affinity Photo) or waiting 1/2 a second or more for each tweak.
This sort of optimization is very hard to get right and often leads to premature optimization. For tight loops in higher-level languages, you should check your bytecode, and for lower-level languages, the disassembly. However, don't forget about the third thing: the CPU execution pipeline. There are also many optimizations each CPU can make. So while one disassembly looks slower than another, that may not always be true, as the CPU might be able to optimize one better than the other. So it's very tricky and really ends up requiring a lot of testing on specific CPUs. I wish there was a platform where you could benchmark your code against hundreds of different CPU types (with emulation) and see the results.
Ah, branch prediction. Yeah, I see. I guess I spend too much time with ancient hardware that doesn't do anything fancy like that. You do a branch on a 6502, the CPU does that pretty quickly. Then again, since there's little to no pipelining (there is some limited precursor to a pipeline, which is why the cycle counts are so low; basically it does instruction decoding and execution in parallel for multiple instructions), no cache, and nothing like branch prediction... a branch is about as fast as anything else.

Ah well. The 6502 is an evolutionary dead end. Not because it couldn't have been extended to higher bit depths (65816) and faster clock speeds (the current record for a 65816 FPGA core is roughly 280 MHz). No, the problem, which can be seen indirectly with the whole zero page/direct page idea, is that it's a register + memory design. If an instruction takes two operands, in almost all cases one of them is from memory and one is a register. It also does single-cycle memory access (compare with the slightly later and more prolific 68000, which takes 4 cycles to access memory). For a given CPU clock speed, the 6502 requires memory that can be accessed in one cycle. The 65816 does even worse, demanding half-cycle access. On the face of it there's nothing explicitly wrong with this design. The problem is that the trend in technology (even to this day) is that CPU performance has outpaced memory performance (at least, for affordable memory). This is why the CPU cache is a thing. Not because anyone especially wants the added complexity, but because a small amount of very fast cache memory + much slower main memory is much more practical and affordable than having your entire memory system operating at speeds equal to or faster than your CPU.

Branch prediction I guess is a side effect of deep CPU pipelines, which in turn are a way of reducing instruction cycle count without actually making instructions faster. Want a single-cycle multiply instruction? Well, either you make some complex circuit that can actually do that in a single cycle, or you build a pipeline where 10 instructions are in different stages of execution at the same time. The net effect is that each instruction takes 1 cycle... if the pipeline doesn't stall...

At this point I'm curious what the maximum speed of a CPU would currently be if it had no pipelines and no cache memory. I suppose you can predict that from memory speeds. If we can deal with the logic used for DDR, then I guess we get something akin to 2-4 GHz to play with, so it could be worse... but there are almost certainly some major caveats involved here...
KuraIthys, good post. I am a 6502 hobbyist and watched this video for a laugh at the semantics of "branch less programming". I get the feeling tools are so easy now that programmers like to seek some form of separation from the less technically inclined people who write software.
It's not just memory bandwidth, it's latency. A CPU with no cache today would be catastrophically slow, because an L1 cache read takes ~1 ns and a DDR read is ~100 ns. I've actually wondered about doing the opposite: making a CPU with huge amounts of cache that only uses main memory to load entirely new programs. Never have to leave the CPU die.
If you're talking speed of a single instruction without pipelining, of course it wouldn't change from its pipelined version if you run the instruction with the pipeline flushed. Pipelining increases throughput, but there's no reason to expect the latency to be any different.
@@Badspot Good point. I'm certainly no CPU designer, so I'm sure there was more to it. Though as to your idea... With a large enough cache you basically end up back where you started. If your cache is so large that you never need to access main memory, then what you've basically done is turn the cache into the system memory, and system memory into something akin to a RAMdisk... 100 ns latency... hmmh... I admit I didn't think it was still that bad... I mean, a Super Nintendo was already demanding ROM access latency of 110 nanoseconds or more... That would imply for mainstream RAM we've made no real progress with access latency in about 25+ years...
Loop unrolling is very powerful, but the technique is limited to loops with a known/fixed iteration count at compile time. Loops with a dynamic iteration count, or a count based on the size of a dynamically sized container (like a vector), can't be fully unrolled, since the compiler has no way of knowing how many times to copy the code in the body of the loop. Luckily, the branch prediction hardware in modern CPUs is really good (90+% hit rate) at predicting the correct branches for looping statements, since loops are such a commonly encountered structure.
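For illustration, a minimal sketch of what manual unrolling looks like (a hypothetical example with a made-up function name, not code from the video): the loop branch is taken a quarter as often, at the cost of a bigger body.

    #include <cstddef>

    // Sum a fixed-size array, unrolled by 4. The count (1024) is known at compile time,
    // so a compiler could do this itself; the sketch only makes the idea concrete.
    // Note: separate accumulators also shorten the dependency chain, but they
    // reassociate floating-point adds, so results can differ in the last bits.
    float sum_unrolled(const float (&a)[1024]) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (std::size_t i = 0; i < 1024; i += 4) {   // one loop branch per 4 elements
            s0 += a[i + 0];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }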
Crap, should have read the comments *again*. I had the same comment. It's very much a branch-removing method, which was usually the reason to do it. You removed the branch and compare at the end of the loop, which would shave several clock cycles. The drawback is more code bloat, though, and if you have limited memory that could become a real problem after a while. I remember spending weeks trying to tweak a mobile port of some Tom Clancy game (this was back in '01/'02 and I can't remember the title). I kept running out of memory and I ended up having to roll some loops back up :). I had to make some other "creative" changes in the gameplay to compensate for some frame stuttering caused by that, but in the end it was worth it, because the game just didn't want to fit on that device despite me having at least 10k overhead beyond the device limit. This was a problem with a lot of early devices. They might have, say, 200k on the spec sheet, but there is some process somewhere using 10k of that, so you really only have 190k. Yeah, I had to work with that much RAM at times on some form factors, and the philosophy of phone manufacturers was that it's a phone first before anything else, so calls always have priority. This could wreak havoc on your carefully planned memory map when the handset gets a call at the most memory-intensive point. We would have some weird bugs for sure, and they almost always were RAM related. When Apple invented the iPhone I knew mobile gaming was going to change forever, and in a good way.
EDIT: I want to say it was Splinter Cell, maybe? That could have been a later port, but I want to think it was Splinter Cell. If you played that on a Qualcomm handset, or more specifically a BREW handset - Kyocera, LG, and a few other manufacturers were still making BREW phones back then - I might have been the one to port it, so you can blame me for any shitty gameplay as a result :P.
One of the reasons people should try to use range-based for. However, I find it hard to move away from indices if you are doing some mathematical operation, particularly when the index is part of the calculation. Probably choosing a better data structure (maybe a struct) instead of POD would help here, but that has its own tradeoffs.
3:44 - That code is C syntax: a comparison operator evaluates to a number equivalent to a Boolean (0 or 1). However, there is no guarantee of how that will translate to whatever CPU is chosen, especially when using a language whose compiler you didn't study or write. Thank you for showing the concept, but I wish you would explain up front that this is a C-syntax feature, not a universally reliable technique in programming. (It would set the context better and be less misleading to the naive viewer.)
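To make that point concrete, here is a minimal sketch of the language-level guarantee being relied on (a hypothetical function, similar in spirit to the first example discussed in these comments; how it compiles is still entirely up to the compiler):

    #include <cstdio>

    // In C, (a < b) is an int with value 0 or 1; in C++ it is a bool that converts to 0 or 1.
    // That is what makes the multiply-by-a-comparison trick legal at the source level.
    int smaller_branchless(int a, int b) {
        return a * (a < b) + b * (b <= a);
    }

    int main() {
        std::printf("%d\n", smaller_branchless(3, 7));  // prints 3
        std::printf("%d\n", smaller_branchless(9, 2));  // prints 2
    }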
This amounts to obfuscating the clear intent of the program and puts the activity of programming back into the mentality of doing assembly language programming where one knew the underlying CPU with intimacy and coded for the CPU, i.e., programming for the low level hardware instead of programming to human friendly abstractions of high level programming languages. Whereas this problem could (and should) be left to compiler optimization and/or CPU architecture techniques to solve. Pushing hardware problems of this nature back into how the program is written is going the wrong direction. Instead of statistical branch prediction, CPUs can just have two instruction pipelines running in tandem per the branch point and the actual branch result is used to finally determine which pipeline to retain. Set up several of these pipelines and just treat them like circular buffer of instruction decoding pipelines. Every time a branch appears in the instruction scanning, just start using an available pipeline to decode on one of the branch decision paths. Given how many cores are built into CPUs these days, the argument of using more transistors for this kind of solution doesn't hold water any more.
This is so bad... "Premature optimization" in the worst sense I've ever seen it. Sure, it looks cool and smart, but it makes the code unreadable and/or hard to maintain. It might make some sense in a VERY TIGHT domain case, but for general coding, it's more like teaching people how to dig a hole and jump in it...
I get what you're saying, but I'd like to offer a counter argument: Since this is, visually, little more than syntactic sugar, I doubt it'd make the code less readable. We've already gotten used to switch statements and ternary operators as high level alternatives with various pros and cons to simple if branches, what's one more. If you've ever used linear interpolation, you'll recognize this. I agree that anything a compiler should be able to do, humans shouldn't have to deal with by hand, but _if_ you're running a time critical operation I'd still prefer seeing this rather than a deluge of ifs to check simple conditions. Again, I'm not dismissing your point, I'm just saying this isn't meant to replace every branch in every program. It's a tool to be used judiciously if and when it is needed. Like null pointers ;)
Branch prediction is what led to the spectre vulnerability. Also, the structure for the non-branching code he showed us isn't hard to understand. It's actually pretty comparable to a switch statement. Throw in a comment to remind people that conditions evaluate to 0 or 1 and it quickly becomes apparent what the code does. result = newvalue1 * (condition1) + newvalue2 * (condition2) + newvalue3 * (condition3); This also seems like a nifty way to construct byte-sized flags if the newvalues are powers of 2.
Kiddobyte Yep, this original comment is SUPER anal. Speculative execution is trying to emulate what branchless programming does natively - saying "throw more cores at it" or "run code in tandem" ignores the fact that any wasted work is wasted work, and you're condemning a core to execute throwaway instructions EVERY time a branch is encountered. That being said, these optimizations should probably be put into the compiler, rather than user code.
I agree if you'd start writing business logic like this, but for the examples he used this seems very reasonable, as long as you check performance and correctness which can be done easily. After that nobody will have to look at this code. I guess he went a little too far when he generally claimed this would be worth looking into, it's most likely not.
YouTube should pay you 10x more for each view you get for the underrated quality content you publish. Love your channel! ❤ I was recently getting more into this subject myself, especially reading about how different compilers handle N/RVO ((Named) Return Value Optimization) and copy elision to reduce copying when all the move semantics were introduced in C++11. It would also be a great idea for a next video to dig more into optimization and what we should not worry about when we write our code.
C++17 mandated that RVO (not NRVO) is performed. If you're allowed to use C++17 (or later) then the compiler is not allowed to perform intermediate constructors and destructors anymore, so there shouldn't be any differences between compilers or optimization levels. You can see the edge cases here: en.cppreference.com/w/cpp/language/copy_elision
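A small sketch of the distinction (hypothetical Widget type, behaviour as of C++17; the exact output for the named case still depends on the compiler and optimization level):

    #include <cstdio>

    struct Widget {
        Widget()              { std::puts("ctor"); }
        Widget(const Widget&) { std::puts("copy"); }
        Widget(Widget&&)      { std::puts("move"); }
    };

    Widget make_widget() {
        return Widget{};   // prvalue: C++17 guarantees no copy/move here (mandatory elision)
    }

    Widget make_named() {
        Widget w;
        return w;          // NRVO: permitted but not required, so a move may still happen
    }

    int main() {
        Widget a = make_widget();  // prints "ctor" only under C++17
        Widget b = make_named();   // prints "ctor", and possibly "move", depending on the compiler
    }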
ARM32 (not sure about newer ARM64) has a really interesting instruction set that encourages branchless programming. You can predicate every opcode on a flag condition (e.g. LE/GT/EQ/etc).
@@SerBallister That's still a branch, though - just merged into an instruction. That's also why ARM64 got rid of them and only supports them explicitly with branch, select, and compare instructions (for performance reasons, no less!).
@@totalermist I thought they differed by not using branch prediction, so they did not come with misprediction penalties. The non-executed opcodes behave more like NOPs. I didn't study the internals that much. It's an interesting instruction set with some interesting opportunities for optimisation.
Interesting stuff. I never dug too deep into programming, so I never even saw the instructions that go to the CPU/GPU. Would love to learn more about the actual assembly code.
I remember learning all these topics knowing it would speed up code but I didn’t know to what degree. It was very informative to walk through the same example for direct comparison
You can use AND, yes. But you have to make sure to only AND the lower case letters. I don't think it saves any time since VPAND and VPADD both run at the same speed. Excellent idea still, and I'm certain my code is not the fastest possible, so it would be worth exploring other instructions :)
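For illustration, scalar sketches of the two variants (assuming plain ASCII letters only; hypothetical helper functions, not the video's exact code). Clearing bit 5 with AND and subtracting 32 do the same thing for 'a'..'z':

    // Subtract 32 only when the char is a lower-case ASCII letter.
    char to_upper_sub(char c) {
        return c - 32 * (c >= 'a' && c <= 'z');
    }

    // Build a mask that is 0x20 only for lower-case letters, then clear that bit.
    // (~0x20 == 0xDF, i.e. "AND away" bit 5.)
    char to_upper_and(char c) {
        return c & ~(0x20 * (c >= 'a' && c <= 'z'));
    }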
But this correlation between upper and lower case only works for ASCII, which means that for any international solution it does not work at all, since the language-unique chars are not spaced 32 numbers apart.
@@user-zu1ix3yq2w Yes, for example. But when used inside a program it's often expanded to 16-bit chars or 32-bit structures for simplicity; having different sizes for different chars makes optimizations very difficult ;), at least as far as I have seen. C# uses 16-bit chars, but does have surrogate pairs for some chars that do not fit into 16 bits.
Sounds like you have a pretty neat OCR and also pirated a lot of books. But really, unless your corpus is already in plaintext and loaded in RAM, other bottlenecks will make the conversion speed irrelevant.
@@zwz.zdenek Well, my point was that computers are cheap and humans rarely have to care about optimizing the code. But OK, reducing branches can sometimes be worthwhile.
@@johnnisshansen It's most important to have very optimized code where massive amounts of calculations are being done - like, I dunno, finding large primes. For regular programs less effort is put into optimizing now than was back in the 80s or 90s, because computing power is so much higher that it is usually not needed. They do stuff near enough instantaneously anyway.
Never assume optimization is achieved. It comes down to your use case and needs. If you need to look through 50 GB databases or you need to talk to 100k computers, any and all tricks can help. Granted, when talking to 100k computers, the network is a big bottleneck. But I've been able to take 20-day scripts and convert them to a few hours. Optimization can and will matter in certain scenarios.
@@vinzer72frie and @Henrik :: I'm a programmer that's helped with optimizing Yandere Simulator.

I should note that the types of optimizations mentioned in this video are *hot path* optimizations. You're referring to one script that's executed ~80 times per frame, and its execution time, as measured by the profiler, is under 1ms on a typical, modern ~4 GHz CPU. If this were your bottleneck, then you would be in the ~1000 FPS range (because 1ms = 1/1000th of a second). This measurement is the sum of all 80 NPCs, and it also includes the time that the script spends blocked on heavy calls into C++ (ex: when some NPCs need to adjust their animation weights).

Performance mostly depends on data... and game logic doesn't have a whole lot of data flowing through it (when compared to rendering thousands of objects, thousands of physics objects interacting, hundreds of animated objects that each have dozens of bones, etc.). People put a lot of emphasis on code, but optimizing your data (how much you load and how it is laid out) is usually where you will get the huge wins.

Also, when you're in the realm of Object-Oriented Programming, one L2 cache miss is typically 10x to 100x slower than a branch, depending on whether the branch prediction was a success. This obviously varies from CPU to CPU, but it's a good order-of-magnitude comparison for modern x86-64ish CPUs. A correct branch is ~2 cycles. An incorrect branch is ~15-20 cycles for AMD Ryzen / Intel *-Lake CPUs (see Agner, Chapters 3.8 and 3.13: www.agner.org/optimize/microarchitecture.pdf ). A cold memory fetch (reading memory that's not in cache) is ~200 cycles. That's not too big of an issue for something that happens dozens of times per frame, though, because a 4 GHz CPU does ~67 million cycles per frame at 60 FPS (per core). In other words, you would need to do ~1.5 million branches on your main thread, and have them all be incorrectly predicted by the CPU, to eat up just half of your 60 FPS frame budget.

The actual performance issues with Yandere Simulator are mostly related to animations and physics... and we've been fixing them up based on priority and availability. You can see over a 2x performance increase between January 2019 and July 2020 (assuming you have a discrete GPU and shadows are disabled - neither of which has anything to do with game scripts). It's not uncommon to see a gaming PC with a good (ballpark ~GTX 1060-or-better) GPU and a gaming CPU get 80-110 FPS when shadows are disabled.

Unfortunately that "if/else" meme fools a lot of people, though. It's not, and has never been, the issue for that game. I mean, if performance got over 2x better, and the game scripts haven't changed, then it clearly could not have been the issue. Over 50% of all execution time, good and wasteful, was removed... yet the if/else statements are still there.
@@scottmichaud nice comment, too bad not really anyone is going to see this. I'm a gamedev myself and seeing the way people rag on yandere dev is pretty often kinda cringe. It's very obviously a lot of script kiddies or people with no experience just jumping on a bandwagon without actually knowing what to be mad at. Did the game have optimisation problems? Sure. Is it because of some random out of context if/else chain? probably not.
@@scottmichaud I have been studying graphic libraries and trying to make an engine, meanwhile looking at code of games, It always baffled me what's wrong with those if/else, can't imagine a RPG without those. It cleared up the doubts
@@cortexauth4094 Yup. No problem. You want to have as much of your hardware at max capacity as possible (until there's no more work for the frame, then idle in one big chunk). When a branch is mispredicted, it needs to flush the bad work. That wastes "low tens" of cycles. ~20 cycles is nothing when you're concerned about work that happens once every 10 milliseconds (100 FPS). That's about 0.00005% (1/2,000,000th) of a ~40,000,000-cycle budget. It could be a problem if you're in a loop over thousands or millions of objects, though, and the CPU can't figure out your pattern. Game engines will mostly have problems with L2 cache misses, though, which are 10x worse than a mispredicted branch. This means that work that needs to operate on many things (ex: thousands) should be laid out linearly in memory so your next calculations are on the same cache line, and, when you leave your cache line, you land on the start of the next cache line. The CPU watches your memory accesses and it will speculatively load cache lines based on what it thinks you will need. One of the biggest problems with memory accesses is dynamic inheritance. The way to allow multiple different child classes to fit in an array is to *not* have them contiguous in memory (because they're all different sizes based on the child class). There's a *lot* of different discussions that can go on here... but, yeah, it depends on the data you have flowing through it. Being wasteful on the order of millionths of a 100 FPS budget might be okay if you're talking about hundreds of things... but, on the other hand, it is wasteful. IMO the biggest part is knowing what your waste is, and making a conscious judgment to waste (rather than being blindsided by the profiler months or years later).
I think there are many things you can do to optimize for speed. This should not be high on the list, but it's good to know that it's there. If code is branch-heavy, this probably helps more. Speeding up 1 ms of code 1000 times doesn't have that much impact on performance as a whole.
Two things I would add. 1) Conditional move et al still create dependency chains which can slow down throughput. 2) Speculative execution vulnerabilities like Spectre, as well as side-channel attacks on the execution time of a piece of code, can make branchless algorithms more useful.
While these techniques aren't very useful for most programmers, they are extremely useful for compiler programmers. The more patterns they can recognize and replace, the better everything is for everyone.
Or library developers, e.g. providing the fastest primitives like memcpy, strcpy, strlen, etc. is essential for C/C++ performance. However all the ugliness and complexity should stay locked inside the library implementation and covered with extensive unit tests and benchmarks.
I disagree a little. Thinking that most programmers shouldn't care about performance led us to the situation we have today: you need a 2x2 GHz CPU just to start Windows - just to display a desktop background image and some icons, and do nothing while waiting for the user to click something. And 10 seconds of website loading is considered normal. And a fast-typing programmer can type faster than the average IDE can display text.
@@peterSobieraj The goal is that top level programmers shouldn't *have* to think about performance because the low level stuff takes care of it for them. I agree that we aren't there yet though.
I used the vector technique for a mapping algorithm at my last job, using the Accelerate framework on iOS. Got a HUGE performance increase, although I never thought about the branchless part of it. I was just happy that I had an algorithm that could convert tens of thousands of lat/lon pairs to rasterized mercator tiles of 32-bit indexes into an array of values for each mercator tile pixel in the same time that it took the naive, pointer-chasing for each vector method to process 10 lat/lon pairs. I was working with 300x speedups.
I observed a similar situation in BVH tree traversal. For a raytracing program, my first approach was storing BVH data with pointers and dynamic memory, and traversing it recursively. It was very elegant, OOP-styled code. Then I had to make it run on the GPU, so I re-made it in a branchless way: I stored all its data in contiguous memory and used a stackless BVH traversal algorithm. Yep... 100x speedups when run on the CPU, 1000x speedups when run on the GPU.
I love the SIMD part. I never got deep into it, sadly. I believe there is one more thing to mention in connection with the topic: branch prediction. Often, a simple if statement is used while iterating through values to perform different actions based on them. You might find a significant improvement simply by sorting the data so branch prediction can kick in and mispredictions become rare. It would be interesting to see how everything you mentioned performs in this scenario.
Very interesting video. Another technique you didn't show (and it is a lot easier for beginners) is caching results in a lookup table. In the case of toUpperCase, you could prepopulate an array of 256 elements filled with zeroes and a couple of -32s. You sacrifice a bit of memory for great gains in speed. Also, depending on the language you use, you can store function references in an array and call funcArray[a>b](a,b) to reduce the amount of branching. Or even funcArray[a%4](args) for more options.
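The function-array idea might look like this (a hypothetical sketch; the handler names are made up for illustration, and note that the indirect call has its own prediction cost):

    #include <cstdio>

    int on_not_greater(int a, int b) { std::printf("%d <= %d\n", a, b); return b; }
    int on_greater(int a, int b)     { std::printf("%d > %d\n",  a, b); return a; }

    // (a > b) evaluates to 0 or 1, so it can index a table of handlers instead of an if/else.
    int (*const handlers[2])(int, int) = { on_not_greater, on_greater };

    int main() {
        int r = handlers[3 > 7](3, 7);   // index 0: calls on_not_greater
        int s = handlers[9 > 2](9, 2);   // index 1: calls on_greater
        std::printf("%d %d\n", r, s);
    }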
I would like to add that, from my testing with actually benchmarking branchless code - especially with "trivial" examples like the smaller-than function you used as a first example, and with compiler optimizations enabled - there are actually a lot of cases where branchless is "at best equal" to branched code, and it'll even flip entirely based on the compiler used. Staying with the smaller-than example: without optimization the branchless version is 10% slower in clang, but 10% faster in gcc, while with optimization they're exactly equal. The second example works mainly because the branchless form can be vectorized, such that the 1024 elements can be processed in just ~300 cycles.

I've tried to apply this branchless approach to my own project (doing some work with big integers; I changed only the division operator to work branchless) and it increased the overall compile time of my tests 10-fold (from about 20 seconds to over 3.5 minutes) and nearly quadrupled the runtime of my long-running testcase, calculating ten 8-byte prime numbers, from 2259 ms to over 8000 ms. Trying a branchless approach singlehandedly undid all of the optimization I've done since first getting it working; fortunately I have backups.

The reason for this is that, while it's true that mispredicted branches are slow, it only takes about 10-20 cycles to flush the pipeline and switch branches, while on the other hand branchless code needs to perform the work required to evaluate both branches, plus some extra overhead. So even in trivial situations it's quite difficult to actually get a benefit unless your branchless code allows the compiler to vectorize the work, which is what happened for the ToUpper function. In more complex scenarios it only gets worse, since if there is more extensive work to be done in each branch, the work for both branches needs to be fully executed before the return value can be determined.

There are some genuine use cases for branchless programming - as noted in another comment it may be important in cryptography despite being slower - and there are situations in which branchless is actually faster, but unless you can confidently say that the branch predictor mispredicts more than normal, or the branchless form will benefit from vectorization, it will usually perform equal to branching at best. CPUs are sufficiently complex beasts that unless you actually benchmark what you're trying, it's really not possible to predict what will perform better. For instance, the speed gain in the ToUpper function from earlier flips with -O1 on GCC, becoming 50% slower than the naive branched version, whereas phrasing the condition as a bitwise AND rather than a logical AND is 20% faster.
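That last point deserves a concrete illustration (a sketch under the assumption the comment describes; whether either form is actually faster depends entirely on the compiler, flags, and data): && is defined to short-circuit, which the compiler may turn into an extra branch, while & always evaluates both sides.

    // Logical AND: the right-hand comparison is only evaluated if the left one is true,
    // which can be implemented as an extra conditional jump.
    bool is_lower_logical(unsigned char c) {
        return c >= 'a' && c <= 'z';
    }

    // Bitwise AND of the two boolean results: both comparisons always run,
    // no short-circuit, so there is nothing for the branch predictor to guess.
    bool is_lower_bitwise(unsigned char c) {
        return (c >= 'a') & (c <= 'z');
    }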
the places where branchless is faster is usually because it allows the compiler to see optimizations (usually SIMD) that it otherwise wouldn't have found. iirc -O1 doesn't do SIMD optimizations so that's why the results flip. personally I've seen cases where replacing branches with boolean checks allowed the compiler to do some pretty crazy things like replacing entire loops with bitwise operations resulting in over 100x speedups
Very nice. Not often we see real code optimisation. First time I’ve seen this. Love it. Smaller is faster. Not only your data, but your code blocks also need to move in/out of cpu caches so avoiding a possible branch outside of L1 size will add up and makes sense (now). This is a great technique for ultra-performance code. The algorithm at the end is a nice added bonus. Subscribed.
Nice video. But remember, unless something is actually a bottleneck in your program's performance, you never want to waste time and effort squeezing 1ms out of it. Always profile first, optimize last. (Obviously, this doesn't apply to study/benchmark stuff.)
If you look at the assembly code the branches are still there, just using cmovl or setle or setl to handle the condition within the instruction. What the transistors are actually doing within those instructions is just doing the branch in a very efficient way, but the condition/branch is still there.
13:30 improvement idea: use an array of length 256 and place basically the whole ASCII table in it, but make it so that the a-z portion holds the values A-Z. This should make it faster because you aren't comparing anything; you're just replacing information with existing information.
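A minimal sketch of that table idea (assuming unsigned 8-bit chars and plain ASCII; the constexpr table construction shown here needs C++17, and the names are made up for illustration):

    #include <array>

    // Build a 256-entry table once: identity for everything except 'a'..'z'.
    constexpr std::array<unsigned char, 256> make_upper_table() {
        std::array<unsigned char, 256> t{};
        for (int c = 0; c < 256; ++c)
            t[c] = static_cast<unsigned char>(c);
        for (int c = 'a'; c <= 'z'; ++c)
            t[c] = static_cast<unsigned char>(c - 32);
        return t;
    }

    constexpr auto kUpper = make_upper_table();

    // One table load per character, no comparison and no branch at all.
    inline unsigned char to_upper_lut(unsigned char c) {
        return kUpper[c];
    }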
So I thought about it and decided to have a go at doing a similar optimised routine in ARM assembler (I'm not too familiar with x86), using bitwise operators instead of tests and conditional set...

Operation: duplicate the char in the upper half of the register -:- add a constant that brings 'a' over 256 in the upper half and 'z' to 255 in the lower half -:- combine bit 24 and bit 8 with and-not -:- shift and filter so bit 24 becomes #32 -:- subtract from the char and store back

    // char* text is in r0 from function call
    to-upper:
        ldrb  r1, [r0]             // load first byte
        mov.w r3, #0x85            // this constant doesn't fit in a single immediate in ARM code
        orr.w r3, #0x9f0000        // could also be implemented as a load from memory
        orrs  r2, r1, r1, lsl 16   // duplicate in upper halfword, setting flags (zero flag is the important one)
        add   r2, r2, r3           // add the magic number; bit 24 and bit 8 now hold "c >= 97" and "c >= 123" respectively
        bic   r2, r2, r2, lsl 16   // combine those two bits with [bit 24] &&= ![bit 8]
        bxeq  lr                   // if loaded byte was zero, return
        push  {r4, r5, lr}         // r4 & r5 will now be clobbered
        mov.w r4, #32              // keeping this constant in a register allows two instructions to fold into one
    loop:
        //lsr r2, r2, #19          // shift bit 24 down to bit 5 (=32)
        //and r2, r2, #32          // filter off unwanted bits
        and   r2, r4, r2, lsr #19  // use of r4 to keep the constant 32 allows the previous two instructions to become this one
        sub   r5, r1, r2           // subtract the result (0 or 32) from the original character
        ldrb  r1, [r0, #1]         // get next byte. The next instructions are repeats of the ones from before the loop.
        strb  r5, [r0], #1         // ((store the result back, post-incrementing the pointer))
        orrs  r2, r1, r1, lsl 16   // The reason for repeating them is to give the branch predictor warning of the direction
        add   r2, r2, r3           // that will be taken. Optional condition flag setting means that when it sees a conditional
        bic   r2, r2, r2, lsl 16   // branch, it can look at all the instructions in the pipeline and say "nothing is going to
        bne   loop                 // change this flag, so I will be going that way". Z flag is set by orrs, then used by bne.
        pop   {r4, r5, lr}         // unclobber r4, r5 and return

As well as optimising the actual calculation, also optimised are the test and branch for the loop (enables branch prediction) and the memory operations, given the distance from their register use. I have found that translating optimised ASM code back to C generates basically the same result but is more readable - but compilers won't produce the same code from C that is not written the same way. The compiler often clobbers its own registers or does things in weird, expensive ways... perhaps that's just with the arm-embedded compiler I'm used to though.
Nice mate!! C++ compiler in MS does some strange things with the registers too! I guess it's almost impossible to allocate the best regs every time? Well cheers for sharing mate, and cheers for watching :)
It's fun to look into Python for optimization. Sometimes the most readable code is only mediocre for timing, while really verbose code that could be 3 times as long and look like a foreign language could be three times as fast.
svnhddbst I think optimizing Python code is more or less a pipe dream, considering anything highly performant should just be written in some other language. Writing Python code that's short and simple is what makes it fun heh.
You had me at "Assembly Code". Subscribed. This is a great topic for driver, library, and bottleneck-algorithm implementors. Idea zero: surface atomic conditionals (e.g. "mov if") from machine code up to the high-level language (yes, "Branchless C/C++", "Branchless Python", etc.). If the compiler (and libraries) aren't implementing branchless methods, you may simply be optimizing one engine on a 10-engine super tanker. Branchless compilers would deprecate (warn/error) branching where a branchless construct will do. If your compiler pushes you to code branchless, and the Standard Template Library and other supporting libraries are branchless, then all the code will be pulling to the same drumbeat. With or without a "branchless compiler": I'd never code this way for the prototype. I don't like the idea of "try it and THEN see how much it helps". I prefer to respond to ALARMS, or to a post-prototype review of the code with a "branchless" eye. Your code profiler should be able to let you know if the cache is being blown and reloaded too often. If it doesn't, that's a market opportunity.
“99% the compiler is smarter“ - Yes, the compiler tries to be, but the main problem with compiler optimizations is variance, and variance depending on stuff out of your control. And there IS a lot of variance, probably more than you would think. Watch the talk "Performance Matters" by Emery Berger; it goes in-depth on this and is a great talk! As a general rule of thumb, compiler optimization is nice to have, but you should not rely on it to make your code fast. Branchless programming is really valuable if you explicitly know what you are doing. Otherwise it probably won't matter for you (for simpler stuff). Ps. Thanks for the more advanced topics, great video as always 😀
Meanwhile in reality:
Geek A: Look, I sped this code up by 200 times!
Geek B (already disappointed by life): How long did it take to code?
A: Only 3 days!
B: What's it gonna save?
A: 200 times!
B: No, in computing time. How many hours of computing time?
A: Over the whole lifetime of this program (maths) ... maybe 10 seconds?
And yes, I've been both A and B in my career.
Dear teacher, Respect! Awesome delivery. I teach microprocessors and may want to argue about the overhead compared to the low loss in productivity around this topic. The Pentium introduced branch prediction and later processors made it even better. Modern scenarios give 98% accuracy in branch prediction algorithms, so all we lose is basically the flushing due to the 2% of rogue code. Is that worth a whole new way of writing code? To be honest, the first example of the smaller number would actually take more time to execute due to the multiplications involved. Yeah, I teach ARM7 too, and it incorporates conditional execution with every instruction, not just branch instructions. We have conditional mov, conditional add, etc. It does make things more efficient, no doubt, but unless there are too many rogue singular branches that fail during branch prediction, the overhead doesn't weigh up against the loss of efficiency. Hope I've made my point clear. Nonetheless, taking nothing away from your awesome work. Keep them coming. Best wishes, always.
I'm a long-time programming practitioner, and I loved it. I'm at the other end of the chain, if you will: Python, JS, various DBs and assorted web technologies, but this is really interesting. Haven't touched x86 assembler since the 90's, so that was interesting. Loved the SIMD stuff - that's all new. Liked his points about premature optimization. The keyword is *premature*. You don't ignore optimization, but be wary of it.
Nice to find out that someone is still interested in performance. I started serious programming in the 80's and did lots of embedded assembler programming in the 90's. Computers are fast today, but it still feels like a crime to write slow code. I have given up assembler for compatibility and readability reasons, but with time-critical code I always check the compiled results and run performance tests. Parallel pipelined execution is so complex nowadays that sometimes it takes a long time to understand why some implementation runs as fast as it does. Sadly, web-based techniques and scripted languages are spreading into areas they shouldn't. For many modern programmers, any operation is fast if it takes less than a second, even if, done right, the time could be at the nanosecond level.
Optimizing isn't always worth increased complexity, so it's often a tradeoff choice. Benchmarking seems to be the way to go, I need to learn how to do that.
@@ImperatorZed I'm not talking about hacker optimizations but higher-level architectural choices. The web technologies I mentioned are extremely slow compared to natively executed code that is tailored to the purpose. They also create hidden complexity, which makes the code unreliable. Applications are being built from complex ready-made components whose functionality typically doesn't fit the purpose well. They bring a huge number of new failure paths, which are mostly unhandled, or make proper error handling extremely complex. Programming errors are typically caught as runtime exceptions. Components have a complex dependency network on other components, which are constantly upgraded, partly because of security vulnerabilities in functions that are not needed in most cases. The full system is never tested and applications work only on the happy-day-scenario principle. Often the developer's answer to a problem is: try another browser, or it works on my machine. So, my point is, mostly self-written functional code is several orders of magnitude simpler and faster when written in traditional C or C++ compared to current web technologies. The evil lies in how easy it is to use those complex components and how difficult it is to take the simple approach. "Use what you've got" works all too well here.
Brian misses the big-brain move here: just rewrite your code instead of debugging it.
Work twice as long. Checkmate
@this senator Sometimes rewriting is a decent opportunity to see that your brain varies in quality of output each time :D
@@mr_confuse u will get a hang of nice design eventually ^^
Except for his last example, which was light-years faster hand-coded.
@@edwardmacnab354 oh you poor fool
I'm sure he just LOVEs it too. We all LOVE it.
I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you use conditional statements - sometimes your code runs faster, sometimes slower (specific branches are just faster or slower) - the attacker can get info on your private key. So all cryptographically sound functions must use branchless, constant-time programming.
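For illustration, a minimal sketch of the kind of constant-time comparison crypto code uses (a generic example, not taken from any particular library): every byte is inspected regardless of where the first mismatch is, so the running time leaks nothing about the contents.

    #include <cstddef>
    #include <cstdint>

    // Returns true only if the two buffers are equal.
    // No early exit: the loop always touches all n bytes.
    bool constant_time_equal(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
        std::uint8_t diff = 0;
        for (std::size_t i = 0; i < n; ++i)
            diff |= static_cast<std::uint8_t>(a[i] ^ b[i]);   // accumulate differences branchlessly
        return diff == 0;
    }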
Very interesting topic
Do they use branchless for the most part? Or are they in violation of this?
They could use constant time operations
Why would you do that? Can't you use a timer to achieve the same result (not disclosing the computation time)?
@sayon It is very difficult to actually accomplish this in practice. If your delay is a random value, then you can repeat your request multiple times and use the average time. Delaying so your total time is fixed is really difficult: You have to know in advance the maximum amount of time your algorithm could possibly take, and even with that knowledge, wait() is complicated and imprecise, and it is definitely possible that what you did before calling wait() will still affect the total time.
@@thenameipicked I am talking about a fixed delay, yes, not a random value. Besides, how are you hoping to achieve fixed time just by balancing the computations when the process can be interrupted at any time? We usually have preemptive multitasking in our systems.
When a normal person says something is slow: 5-10 minutes, maybe an hour or two
When a programmer says something is slow: 2 nanoseconds
well, normal people wouldn't think about doing it a billion times ;)
When those instructions run multiple times throughout the code, that 2 ns starts to make a difference.
When you get a cache miss:
This little manoeuvre is going to cost us 51 nanoseconds
@@spicybaguette7706 for an android that is an eternity...
those build up... and then you have Internet Explorer
I've designed CPUs for over 30 years and we pay a LOT of attention to breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that what seems to avoid branches... doesn't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat the compiler optimization that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such as in GPUs, but even then compilers are pretty smart. The most efficient code is when tasks are parallelized so that multiple operations can be done based off of one or a small number of decisions.
In trying to avoid branches, the cost is paid in non-maintainable code and is very high. If you really want to improve efficiency, don't rely on multiple software packages which actually may include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, PERL, etc.) waste much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required.
In one example, we recoded a numeric-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive BeagleBone Black instead of a Mac. On the Mac it consumed 35% of the processor. On the BBB it used 15% of a much less powerful processor by vectorizing the operations. These are the sorts of changes that matter.
I mean, he literally looked at the compiler's assembly output to show how the branching and branchless source affected whether or not the generated assembly was branching. In the first case, he showed that the compiler sometimes understands the purpose of the code and makes great optimizations that already remove the branches, but he also showed how, in other, more complex code, the branchless version was much faster than the branching code. So I really don't think he or the video missed this.
cmov is probably not branchless as claimed in the video. At the microarchitectural level it still needs to branch.
@@kingacrisius The thing is that looking at the assembly may or may not tell you what the CPU actually does. There are some cores (old ADI SHARCs, Microchip DSCs and a few others) that have such a predictable, deterministic pipeline that you can actually count the clocks, and if you couple that to SRAM, then you have the whole story.
But modern IA64 or Cortex-A cores do so many things on so many levels that, without intimate knowledge of what the core actually does, manual optimization is just a hopeless exercise. Silicon vendors want their wares to shine, so they contribute heavily to compiler development. Unless you are in the know in these matters, the best you can do is optimize for a particular platform, test it, and then freeze the SW AND HW design. Sometimes this is what is needed and acceptable. But if you think such code will survive on other platforms or on future revisions of a given platform, you may be in for a rude surprise.
@@BogdanBaudis Well regardless he timed it too. Like it is so clear when the branchless programming is better, because he timed it and shows how the assembly is faster. I seriously don't understand where all of this is coming from.
@@BogdanBaudis And obviously some code may not be portable, but he was coding C++, which is almost always portable. And that doesn't matter anyways for this discussion.
So what we've learnt: If you write normal code and normal patterns that everybody knows, most likely it will be a pattern that the developers of gcc (or other compilers) thought of. That way you have a higher chance of the compiler doing good things.
Yup
@InSomnia DrEvil Real truth. I tell people on my team this all the time. If it's not hard, actually try it and extrapolate. Rarely (but it happens) you'll get different behavior past a data size threshold. And even then... usually you can infer that (or predict this would happen)
All the code shown missed the fact that you could just clear bit 5 when the char is in range, shaving 3+ additional instructions with a well-placed xor eax, 20h (stored in r8d instead of -1) (times the char's ID stored in the multi-data case), since you don't really care about the original data - or you would back it up (the more complex code still used sub 32 and some other data-shuffle junk). So sometimes it's better to assemble it yourself.
For the program I optimised, gcc did a good job of making the bottom-level if statements branchless, but that didn't stop my carefully optimised C from being 100x faster than the natural C++ implementation using the stdlib.
It was well worth the increase in code complexity.
It still matters. Compilers are great but not perfect. There are still seemingly trivial and arcane situations where the compiler will be confounded and unable to apply proper optimizations on code.
It won't be noticeable if you're writing a user application, because those misses will be fairly infrequent. If you are writing a linear algebra library for physics calculations, then you are going to be performing millions of these instructions repeatedly in a very short period, and the misses become noticeable.
Advocating against premature optimization doesn't mean all optimization is premature.
I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU rather than the CPU. GPUs have multiple cores that share an instruction pipeline, where each core runs the same instruction but with different numbers. That often leads to both sides of a branch being executed and then one being discarded. I have seen performance improvements above 100x by going branchless!
Didn't know that. Thanks. Good point.
So, vertex/fragment shaders for opengl, for example?
@@nistarok123 Yes, exactly. Avoid branches in GPU code as much as possible. In fact I think there are some shader languages that don't even allow the if statement.
@@JeredDanielson me too ;)
Note: GPUs these days won't take the performance hit if all threads in a warp (NVIDIA), wavefront (AMD), etc. (others) go down the same branch... whichever branch that is. So let's say you have a complicated material depending on whether or not you're shadowed -- you will only traverse both paths if your warp/wavefront corresponds to a block of pixels that overlaps with an edge of the shadow. If the block is entirely shadowed or entirely illuminated, then it will only run the shadowed or unshadowed path, respectively. This is 32 threads in NVIDIA, 64 in AMD... although they do reserve the right to change.
Cryptographic libraries also use branchless programming a lot. This is to prevent timing side channel attacks, where the timing of an operation can provide a hint as to what was encrypted.
05:00 The lesson here: never over-optimize your code beforehand. Start by writing natural and simple code. Let the compiler do its tricks. Check the result and only look for optimizations in the parts that the compiler didn't already optimize.
I think the lesson is: know your hardware.
@@tetramaximum Both are valuable lessons. Another one is "know your data". Imagine a super-inefficient O(n!)-scaling algorithm. If you can prove that your data must always keep n tiny, it can still be perfectly fine.
@@scottmichaud agree
Very true, the more generic your code, the more likely the compiler will understand what you are doing and perform optimization.
@@SleepyMongoose For the things that it can. It's your job when it comes to effectively using overall data structures, etc.
Shaders. Branchless techniques are mandatory in those!
When I first heard people saying "avoid branching in shaders" I had no idea what they were talking about. "How could you ever possibly avoid branching???" I thought. Now I see how. Thanks!
It must be said, though, that even shader compilers are getting better and better, and you can use simple ifs which don't end up branching.
For example, if you write:
if (x >= 0.8) { y = x; } else { y = 0; }
one might be tempted to optimize by writing:
y = x * step(0.8, x);
And yet, at least in HLSL, the compiler creates the same low-level code for both.
@@galandilvogler8577 Good to know! But still, best to make sure...
This is a common misconception. IIRC, uniform branching is 100% fine; the only problem with shader branching is that if a heavy branch is taken by even one invocation, the whole warp will stall on it. This is problematic, but it does not mean branchless techniques are better, as branchless techniques would just calculate both branches, meaning you'd have the whole warp do those heavy calculations every time.
See shadertoy (dot) com/view/wlsGDl and twitter (dot )com/bgolus/status/1235254923819802626
@@clepirelli Neat! Thank you
As a long time software developer, in virtually every case, code maintainability is far more important than speed.
I've written programs that processed millions of transactions - and branching was the least of the performance concerns. Paging and I/O typically are much more a problem. Except in toy problems, memory management will be a much larger concern.
Introducing repeatedly executed multiplication generally is not good for performance.
"And"-ing 01011111 with ASCII lower case converts to uppercase as well as subtraction, and to my mind, is much clearer.
Unless there is a genuine need, don't make your code ugly to eke out a few cycles.
I agree, except I'm convinced we all need to learn this the hard way. I haven't met any developer who did not try to be clever in the first 20 years of their career. Took me 30 years to start understanding this myself, but I need to really try all the fun things before I do it the hard way.
@@henrikskott I had a CS prof once tell a story of how he spent so much time trying to optimize the code as he wrote it that it took him weeks to write, and when he ran it for the first time, the performance was terrible. He ran it through a profiler, and the part he had optimized took virtually no time to execute anyway; it was some other part he hadn't even thought of that was taking all of the time. His message was: "Just write the code sensibly the first time, then run it, and only then, if there are performance issues, do you optimize anything - once you know exactly what part needs to be optimized." That's definitely a winning strategy! Don't waste time optimizing things that don't matter anyway.
Once you understand the cache hierarchy, you realize a heap access can be equivalent to MANY branches.
Good point. I found the video interesting and informative, but part of me was wondering "in what context would these techniques genuinely be useful?". While I'm sure there are things out there that require this level of optimization, typically it makes more sense to prioritize readability.
Also, great idea with the AND-ing! Very elegant way to solve it, at least in the case where the string only has letters. If non-letter symbols are present though, you'd need to add more logic to handle that.
It would be even better to implement/improve these kinds of optimizations in the compiler itself; this way you can focus on the elegant code and gain a boost in maintainability while the compiler does the hard job of these very low-level optimizations.
Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.
"I can teach a monkey to do what you do. What I can't teach the monkey is when NOT to do it." - one of my father's chiropractic professors
But yeah. Branchless programming is just one of many tools in our arsenal, not a silver bullet. Having a hammer doesn't make every problem a nail. Same here.
@@andrewsprojectsinnovations6352 Maybe he should have taught your father not to become a chiropractor, much more helpful.
As a student & beginner, this is certainly correct. I was ready and excited to do exactly that, lmao.
I'm surprised by the amount of negative comments; he is not telling people to always code like this... there is a difference between time-critical code and the rest.
Get the damn profiler out and find where your program is spending most of its time, and THEN consider doing this or not.
The only negative thing about optimization is that it costs development time. The idea that "the computers will be fast enough" is just naive.
Effective optimization can be the difference between a product that reaches some of the market and one that reaches most of it, or between a program that can be extended for decades to come and something that needs to be redone in the next few years.
@kingofallcrypto Yeah, I can definitely imagine seeing stuff like this in first-year students code. In one project, they had to keep track of 16 boolean values on a microcontroller that had way more than enough memory for what we were doing. They wanted to encode them in the 16 bits of a uint16_t (16 bit integer type).
It's natural for them to think that it's a good idea, since we did tell them to use uint8_t over uint16_t when the number they needed to store only went from 0 to 10. But the reason we told them that was mainly as an excuse to talk about architecture, explain that the microcontroller had an 8-bit bus, and discuss related things.
It did give me a chance to have a good discussion about balancing programmer effort with execution speed. In the uint8_t vs uint16_t, there are no downsides and there is a theoretical benefit in speed. In the store 16 booleans in a uint16_t, it's going to be a complete pain in the a** to store and retrieve the actual booleans (if we don't consider that the memory saving has a runtime cost), we're introducing the possibility for errors and we have more than enough memory by a wide margin for that project.
time critical code should not be run on a micro
@kingofallcrypto , I would hope that it would be obvious to anyone that ended up at this video.
You try to make it seem like it is worth it every time, while it certainly isn't. This is the kind of thing you would use if you have used every other trick in the book and need it for some one-in-a-million project.
@@dincerekin Think about what you said next time when your ESP/ABS saves you from skidding into a tree.
For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong. And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P
Yep!
Multiply by 1 or 0 and you get an early exit on modern ALU's.
With pipelining and SIMD, you come out ahead if you stack a large enough vector up. 8 ops/cycle + a constant 3-cycle multiply penalty. If you have 800 elements, you are basically doing that multiply in 1/8 of a cycle. So even with 2 multiplies and an add, it's still over twice as fast before any branch time or penalties are involved.
And then AVX-512 and SVE2/Cray-style vectors have predicated SIMD ops where you can create a predicate mask and then conditionally perform a vector op. Compare V1 and V2 with less-than to create a mask, then copy V1 into V2 using the mask to predicate each lane. An element from V1 is only copied if the LT mask is true for that lane.
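Roughly what that predicated copy looks like with AVX-512 intrinsics, as a sketch (the function name is mine, and it assumes a CPU and compiler with AVX-512F support):
#include <immintrin.h>

// Copy lanes of v1 into v2 only where v1 < v2 (signed 32-bit lanes),
// using a predicate mask instead of a branch.
__m512i copy_where_less(__m512i v1, __m512i v2) {
    __mmask16 lt = _mm512_cmplt_epi32_mask(v1, v2); // one mask bit per lane: v1[i] < v2[i]
    return _mm512_mask_mov_epi32(v2, lt, v1);       // masked lanes take v1, the rest keep v2
}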
@@WorBlux can I hire you on for my next big programming job?
@@Theineluctable_SOME_CANT Keeping myself informed about cpu microarchetecture and programming is a hobby of mine. Actual programming not so much.
Being outside and producing something I can touch is more rewarding for me.
@@WorBlux yes, for me also.
Machine microarchitecture is quite interesting to me.
A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long, and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.
Doubt it will work with non ASCII characters, though.
@@IceExtremeGamers none of this works for non-ascii
@@Henrix1998 sadly
@@IceExtremeGamers His whole discussion was assuming ASCII as a toy example. Yes, the small lookup table will not work for general multi-byte Unicode (especially variable length encodings like UTF-8), but Unicode is more complex and none of the other stuff described in the video, like SIMD, would work either. For 8-bit character encodings, the table is actually far more general than anything based on conditionals such as conditional moves. The table approach would also be usable for 16-bit character encodings, as 64 KiB is not a lot of memory nowadays, and the parts of the table that would actually be used in practice would almost certainly fit in L1D$.
@@josiahmanson yeah, I was waiting for the author to present a final-final version of the algorithm using lookup tables... but the video ended :)
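For anyone curious, here's a minimal sketch of that lookup-table version (plain scalar C++, ASCII only; the table and function names are just for illustration):
#include <array>
#include <cstddef>

// 256-entry table: identity for every byte except 'a'..'z', which map to 'A'..'Z'.
// The whole table fits comfortably in L1 cache.
static const std::array<unsigned char, 256> kToUpper = []() -> std::array<unsigned char, 256> {
    std::array<unsigned char, 256> t{};
    for (int c = 0; c < 256; ++c)
        t[c] = (c >= 'a' && c <= 'z') ? static_cast<unsigned char>(c - 32)
                                      : static_cast<unsigned char>(c);
    return t;
}();

void to_upper_table(unsigned char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        s[i] = kToUpper[s[i]];   // one load per character, no compares in the loop body
}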
Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.
Same goes for cuda programming also
So true! GPUs hate branching maybe even more than CPUs! They call the conditional execution 'predication' in CUDA.
Yeah, but that inefficiency of computing both branches and discarding one is what lets you run so many parallel streams in the first place, so you still gain overall.
Your mousepad is a masterpiece.
high albedo for that max dpi specs. Don't tell gamers, or the prices will blast up
@@robertoaguiar6230 I haven't thought about the albedo consideration there... I've seen a lot of very dark color mousepads... Hmmm
But I suppose it would be albedo in the specific range the mouse actually uses. Infrared?
Don't look at the finger pointing towards the moon or you will miss the heavenly glory
For whatever it's worth... Smaller_Branchless produces assembly identical to the branched version when using Clang 10 and compiling with -O2.
In GCC 8.3, Smaller_Branchless is flat-out worse than the branched version - the branches had already been optimized out of that version by the compiler.
In any case for the branched versions the awesome compiler optimized away the branches.
So for the noobs out there, if you're doing something like this, always check the compiler output and measure measure measure.
“in 99% cases the compiler is smarter“
Not only that, but Smaller_Branchless is not actually branchless; conditionals are just hidden branches. Writing a * (a < b) still forces the compiler to add a test and jump, aka a branch, to perform the a < b comparison.
@@Henrik_Holst No, branches are distinct from conditionals. In this case, we see it transferring a condition flag into a general purpose register value using SETc instructions, because x86 has a condition flag register; not all architectures do. Conditional instructions like x86 CMOV or AVR SBRS operate by cancelling the retire step of a single instruction, which doesn't divert the instruction decoding; that's a cheaper form of branching. What you say would apply to architectures where the condition test is embedded only within branch instructions, as if you removed the CMP* instructions and had to use B* instructions in Nios II. That would be a typical case where a compiler might have a different optimization to avoid the issue, like using the sign bit from a subtraction.
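To make that concrete, here's a small C++ function and, in the comments, the kind of x86-64 sequence a compiler typically emits for it at -O2 (illustrative only; the exact instructions vary by compiler, flags, and target, so check your own output):
int scaled_if_less(int a, int b) {
    // The comparison is materialized into a register via the flags, with no jump:
    //   cmp   edi, esi      ; set flags from a - b
    //   setl  al            ; al = 1 if a < b, else 0
    //   movzx eax, al       ; zero-extend to 32 bits
    //   imul  eax, edi      ; multiply by a
    return a * (a < b);
}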
@Isaac Beizsley It, indeed, won't turn your O(n^2) into O(n). But it will fix your branches better than most people in most cases. (The example OP made).
It indeed won't do any simd. However, if you are prepared to go assembly just to speed up the code, it is worth assuming you know what you are doing. Like in the video, where he does indeed outsmart the compiler.
@Isaac Beizsley Yeah that's annoying. I want to say that the world would be a little bit greener if it wasn't for all bad code. lol
It's useful when you do shader programming, where branching is very expensive. I had to «invent» some of those examples by myself when I did shader programming. Great video!
Same! It was an interesting experience refactoring some branched OpenCL code to use algebraic solutions to replace the branches.
Brilliant explanation. The following bombshell caught my eye...
12:19 "So it just goes to show that you don't automatically get speed from jumping over to assembly...you have to code it well" - very well said ;-)
With every language above assembly, it's a shame that we have so little control over the final machine code output, even with C/C++.
Lesson: 1) Never trust your compiler 2) Become an expert in assembly language and learn to e.g. inline where needed.
I want to buy this guy a mouse pad and maybe a second monitor...
He seems to be stuck in the 90s.
Nobody's stopping you
Goes to show it's what you know, not what you own.
@@stevecarter8810 no they aren't, that's good to know
@@lour.222 I think it's a combination, good tools in the hands of a craftsman sort of thing
Note that the branch predictor is generally able to predict branches accurately for loops, since they almost always take the same path. Thus, getting rid of a loop branch is usually not of much concern for the purpose of avoiding branch prediction failures. Of course, there's still value in improving loop iteration code, since it still represents instructions that are run on every iteration.
That's why JIT (Java) is such a good idea. The run time / JIT platform applies the optimizations best suited for the hardware.
@@TheGandorX Unless you have a limited amount of parallelism that doesn't allow the JIT to run on a separate, physically parallel thread. Then it about doubles your run time.
Only if you have sorted data. Once you enter the territory of true random data the branch predictor practically doesn't exist anymore. Since it simply won't be able to predict.
Now, some CPUs (don't pin me down on which ones exactly) can execute two branches simultaneously in parallel while still processing the condition. Once the condition's outcome is known, the output of the correct path is used and the other data path is thrown away. This has quite some prerequisites though, but it works pretty well.
@@daantimmer Ya, I think that behavior was the basis of the Spector and Meltdown CPU bugs discovered a few years back. (Correct me if I'm wrong.)
@@andywolan I won't be able to correct you even if you are wrong because I never really dived in to those issues :-)
In some architectures, multiplication takes more cycles to execute than less demanding arithmetic. In the first example if a < b usually comes out to be the same (whether true or false), then branch prediction will avoid the delays inherent in "if", and the result could be faster than using multiplication.
The architecture may even be incapable of calculating the true/false result other than by the use of branches.
In any case, unless the function is called intensively, the benefits, or otherwise, of the branchless approach will be invisible.
You can get 4x or 8x the speed for some numeric code using assembly (e.g. the fast Walsh-Hadamard transform, neural network code). Sometimes the compiler can find better assembly than you can. If you need performant code, try both ways.
GCC found much better code than I did for this in FreeBasic.
sub subrandom1(result as single ptr, x as single ptr, n as ulongint)
    ' Reinterpret the float buffers as 32-bit integers so the sign bits can be flipped directly
    dim as ulong ptr resultptr = cast(ulong ptr, result)
    dim as ulong ptr xptr = cast(ulong ptr, x)
    ' rs accumulates the constant r each step and acts as a cheap pseudo-random source
    dim as ulong r = &hE8FE90EB, rs, shift = 18
    for i as ulongint = 0 to n - 1 step 4
        ' Unrolled by 4: each step advances the generator and conditionally flips one sign bit
        rs += r
        resultptr[i] = xptr[i] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+1] = xptr[i+1] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+2] = xptr[i+2] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+3] = xptr[i+3] xor ((rs xor (rs shl shift)) and &h80000000)
    next
end sub
I remember Bjarne Stroustrup saying something to the effect of 'The compiler will write better assembly than a human ever could', and in most cases, especially larger ones, that holds true.
The compiler can only write assembly as good as the person who wrote the compiler can write assembly. Your statement is mostly true for people who don't understand assembly and the way CPUs actually work
@@christopherjoseph651 Let any of the people who made the compiler try to write the same program in assembly as in C++ and see which one performs faster and with fewer bugs. It doesn't matter if you understand a CPU, the assembly code from the C++ compiler will be far better than any manually written assembly. Because the people who make the compiler only do translations and optimizations, they do not write entire systems with it. Because that's stupid.
@@christopherjoseph651 Not really, in the largest scales this statement holds absolute true. There are limitations to what humans can do, a large project in pure assembly would be just plain impossible for a human to do, it's unrealistic.
@@CottidaeSEA Did you not watch the entire video? He proved that a human who understands the instruction set of the CPU and who understands the purpose of the high-level code can fully utilize the instruction set and hardware of the CPU. He literally beat the compiler by 45x! If the compiler is so superior, why didn't it write the code using AVX? Exactly, a human wrote the rules that the translations and optimizations follow. So a compiler is only going to write code as good as the person who wrote the compiler. This has nothing to do with program scale. The point is a human wrote the compiler, so the compiler can only write optimized code as well as the person who wrote the compiler. If that person doesn't realize how to perform an optimized task in assembly, then the compiler won't be able to optimize the code either. What, do you think we're living in a world of Skynet where computers build more sophisticated computers and write their own programs?
@@dionyzus2909 This is a childish statement written by someone who believes in the new education model in which older languages are outdated and inferior and therefore shouldn't be taught. We don't need new, more complex languages to solve new and more complicated problems. To this day, when a program needs to be optimized for speed, it is written by a human using inline assembly. Even the guy who made this video understands that you should look at your disassembled code and be capable of understanding what the compiler is doing. Do you need to write a large program that is not time or resource critical in assembly? No, there's no need for that, but the statement that a compiler will write better assembly than a human ever could is COMPLETELY FALSE!
Awesome!
Branchless programming would be even faster if the CPU had an instruction that filled all the bits of a register with the zero flag.
These "computer science approach to C++ programming" videos are getting really good and I suspect are unique on youTube. You are improving (possibly, even saving) our mental health with these conceptual challenges and I, for one, am so grateful.
Thanks mate! I certainly wish the instruction set had a few things that are missing too! Cheers for watching :)
@@WhatsACreel could ask intel and AMD to add that...
setz al     ; al = 1 if the zero flag is set, else 0 (SETcc can only write an 8-bit register)
ror eax, 1  ; rotate that bit up into bit 31
sar eax, 31 ; arithmetic shift broadcasts bit 31 across the register: all ones or all zeros
Nothing in this video had anything to do with computer science. It was all computer architecture.
every community needs a Michael Keegan :) refreshing to read a wholesome honest opinion
This will seem trivial to some, but anyone planning to do arithmetic with the result of a boolean expression in .NET should bear in mind that 'True' equates to -1, rather than 1 as in C++.
EDIT: I'll just add that any non-zero value (positive or negative) equates to 'True' in a conditional statement, and zero equates to 'False'.
Someone needs to explain why that was chosen hahah.
-1 (int) is just max_uint, aka 0xFFFF... you can do (boolVal & 1), but it gets tedious haha
And I've never used a system where true is -1 myself but if I had to guess I'd say it's because the designers wanted logical operators to work on boolean values, e.g.:
not false = true (~0 = -1)
true or x = true (-1 | x = -1)
etc, etc, etc.
@@Vacuon I've worked with plenty of systems where -1 can be true, but it's not the ONLY truth value. As in, 0 is false and everything else is 'truthy'. ~ (BoolValAsNegativeOne) is just BoolValAsNegativeOne XOR -1, vs BoolValAsOne XOR 0x01. It doesn't appear to show negate -1 as an advantage. I mean, for either method you still have to auto-demote to a bool type to perform the operation. And it's the same amount of assembly instructions to invert; load the value, perform the operation. I'm SURE they had a reason for doing it.
The thing about having a TRUE as a negative -1, means you can't index an array via an evaluation myArray[a>=b].. well I guess it works if your language supports negative indexing (like Python).
Oh, god, I bet that has led to more than a few bizarre bugs.
@@SeanCMonahan Not sure about C# but I wish C/C++ also used -1 (aka ~0) for 'true'. It would make very practical bit masks.
For example, the video uses the slow integer multiplication (imul) instruction where a simple bitwise AND with a proper mask (0x0...000 or 0xF...FFF) would do instead. Conveniently, 0xF...FFF = -1U. Also, 0x0...000 and 0xF...FFF are just complementary values. You get the opposite one just by flipping all bits. x86 CPUs even have an instruction for that (NOT). ~0x0...000 = 0xF...FFF. All expressions consisting only of 'magic numbers' like this are pre-computed by the compiler.
So if 'true' was -1, the min(a, b) function could be expressed as: return a & (a < b) | b & !(a < b);
Since 'true' is 1, we have to flip the sign: return a & -(a < b) | b & -!(a < b);
Or leave it to the compiler because the compiler writers know well how to compute a minimum of 2 numbers properly on your platform. Use 'std::min()/max()' for branch-less programming.
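As a quick sketch of that mask idea in today's C++ (function names are mine; since true is 1, we negate it to get the all-ones mask; this is only a demonstration, and std::min remains the right call in real code):
#include <algorithm>

int min_masked(int a, int b) {
    int mask = -(a < b);             // 0 if a >= b, all ones (0xFFFFFFFF) if a < b
    return (a & mask) | (b & ~mask); // picks a when the mask is all ones, else b
}

// In practice, prefer the library; the compiler knows how to make this branchless:
int min_std(int a, int b) { return std::min(a, b); }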
The CMOVxx instructions have been with us for a very long time (since Pentium Pro in the mid-1990s) so they can be taken for granted now. :)
There's [[likely]] and [[unlikely]] attributes in C++20, they can help in some cases
And that's why you update your compiler and SDK...
Also there is profile guided optimization, but sometimes you have paths that are hard to predict, because the patterns are too elaborate.
Vyor GCC has had macros for it for years
Have you tried if these help? Afaik, the branch hints are x86 only and no longer available in x86-64 mode. Not sure about arm.
@@neur303 For example, I have a branch that raises a High Alarm (so the x value is higher than y). As the programmer, I know that one of the two paths almost never gets executed. Does this mean I can "guide" the compiler into telling the branch predictor: hey, please favour this path for your prediction, the other branch will almost never hit? Is this possible, especially if you need to check this for hundreds of thousands of signals each cycle? [[likely]] to be lower, [[unlikely]] to be higher than y.
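Something along those lines is expressible in C++20 (a sketch; the function and threshold names are made up, and whether it actually helps depends on the compiler and target, so measure):
// The attribute marks the rarely-taken path; the compiler may lay out the code so
// the common path falls straight through without a taken jump.
void check_signal(double x, double high_limit) {
    if (x > high_limit) [[unlikely]] {
        // raise_high_alarm();   // almost never executed
    } else {
        // normal_processing();  // the hot path
    }
}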
This is really good fundamental information for some of us "spoiled" high-level programmers who don't typically think beyond compilation
Agreed, but highly dangerous and misleading information for all the others, as a disclaimer that this is basically "experts only" is missing.
And seeing what inexperienced programmers try to accomplish, this can lead to a lot of technical debt, as it will be misunderstood, badly used and then mixed with branching techniques. So yeah, fun :D
Agreed 100%. As a Javascript programmer, I noticed many of my peers don't seem to care much about how the language works.
I know there are many performance penalties from it being a dynamic, weakly typed language with just-in-time compilation.
So whatever edge I can eke out in performance I will take. Plus I eventually want to learn more than one language. But I'm making sure I REALLY understand JS first.
I was doing this back in the late 1970's, on a Cray-1 supercomputer with pipelined vector registers. Complex single-instruction operations were broken down into a sequence of simpler operations, each performed by a separate functional subunit. After one clock cycle, the first functional subunit was free to begin operating on the second element on the vector, while the second functional subunit began operating on the intermediate result in the first element of the vector that was left there by the first functional subunit. Complex operations like floating point division could then produce a result every clock cycle, even though the operation could take a total of six or more clock cycles to perform on a single operand beginning-to-end. Like stations on a car assembly line, the game was to keep as many functional subunits busy as possible at any given time. We would carefully map out each tick of each component of the CPU on a big chart before implementing in assembler. The thing was blazing fast, even by today's standards. Nice vid!
Class explanation. I love it. You know what you're going on about.
Great
this is still the game, with 5 logic and 5 memory execution ports available for out-of-order processing (execution is determined by data dependency order), writing code to keep all of them busy with chains of non-dependent code in each loop to mitigate latency over iteration - although division is one thing that isn't pipelined. I guess division is rare enough that it's not worth the extra silicon.
The iPhone 6 is capable of 20 times more FLOPS than the cray-1
That's called 'Pipelining', and I'm sure you knew that. It's a form of parallel processing.
I never really thought about CPU caching and "predictions" affecting conditional statements, this is super cool!
When you get into serious depths of optimization, you start looking at things like branch predictability. Like, there are situations when you chose to write a less optimized version of algorithm, because it has more consistent depth, and that means the loops are going to repeat a similar number of times, making branches on them more predictable, resulting in faster execution.
@@konstantinkh So things like for loops?
@@psun256 Yeah. Because on assembly level, every loop is just a branch that sends the execution to beginning of the loop. So the CPU will hit the same branch over and over again as it's running the loop. And modern CPUs have very clever ways to get a heuristic for these branches. Like, if you have a very tight loop in some algorithm that always runs for, say, 10 iterations, your CPU can actually learn to predict that every 10th time it's not going to jump and start caching instructions from the correct branch. At that point, it's like the branch isn't even there! If the number of iterations varies, then the prediction is worse. Best your CPU can do in these cases is "usually, jump is taken," so it will always cache instructions for repeating the loop, and will miss and have to flush the pipeline whenever the loop finishes, which is still better than not having prediction at all, but not optimal. That's one of the reasons why an algorithm with consistent iteration counts can sometimes perform better than algorithm where iteration count varies. Of course, there are also optimizations your compiler can do if it can "guess" the number of iterations, which can make even bigger difference. We are getting into territory of optimizations where you _really_ need to have a good reason for going after, though, as you'll spend a lot of time testing, timing, and looking at disassembly, and you shouldn't be doing that until you've fixed all other timing problems. The only real world example I have where we've spent time going after this kind of stuff is character animation in games. When you have hundreds of bones in hundreds of skeletons for which you have to update the transform matrices 60+ times per second while leaving enough time for CPU to do everything else it needs to, you start looking at things like cache and branching prediction optimizations.
@@konstantinkh Huh, that's pretty neat. Thanks for sharing!
@@konstantinkh It would be great if you could actually tell your cpu which branch to load by default.
A good example for branchless is in crypto, where data-dependent code execution can lead to side-channel leaks, therefore you look for branchless and constant-time algorithms. So for example, to compare two byte arrays with hash results, you would add up the XORs of each byte and not break the loop on the first difference.
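A minimal sketch of that constant-time comparison (the function name is mine, and it assumes both buffers have the same, fixed length, as hash outputs do):
#include <cstddef>

// Compare two equal-length byte arrays without an early exit: the loop always runs
// to the end, so the timing doesn't reveal where the first difference occurs.
bool equal_constant_time(const unsigned char* a, const unsigned char* b, std::size_t n) {
    unsigned char diff = 0;
    for (std::size_t i = 0; i < n; ++i)
        diff |= static_cast<unsigned char>(a[i] ^ b[i]); // accumulate any differences
    return diff == 0;
}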
add random sleeps in the code and you're alright
/s
@@inx1819 Actually you don't want to add random sleeps. If you just sleep randomly, all operations will be extended by roughly the same amount of time, so you can still statistically figure out which operation takes longer than the other.
Instead, what you want to do is sleep longer if the operation finished faster and sleep less when the operation is slower. You can do this by deciding beforehand how long an operation should take. For example, you might decide that an operation should take 1 second, so you do the operation, and if there's any remaining time, you sleep for the remainder. That means no matter which branch the code takes, it'll always take 1 second.
Now, this isn't perfect, there's also power analysis because doing an actual operation likely costs you more power than sleeping, so this is enough if your attacker is remote (e.g. webserver). But if you need to consider a local attacker that has full physical access to the hardware (e.g. smart cards), you might have to busy sleep.
@@yvrelna you didn’t see his /s
This sort of optimization seems like what you'd do only for high performance real-time applications--like video games for instance--and only then as a third-pass after you'd first written the code in much more human-readable format. My coding bootcamp instructors drilled into me the mantra: First make it correct, then make it clean, then make it fast. Ideally by the time you reach 'make it clean' you've got good tests that pin down the functionality of the code before you refactor. After that you can come up with fast, clever implementations that still pass the tests. You shouldn't ever need to debug the fast clever implementations if you've restricted yourself to good, pure functions and written your tests correctly.
bootcamp instructors prepare people for no brain web dev jobs. This is something you will never use in such a work environment. It's mostly for advanced, low level programmers.
An if is not *necessarily* slow: if your condition goes to the same place 98% of the time, your branch predictor will be very accurate, and the pipeline will rarely be broken.
Even that is not free either.
Branch predictors tend to have limited memory, relying on it somewhere hurts branches elsewhere.
But then cache is limited too, so smaller code is often better.
Funny that I got a recommendation for this video after watching a few presentations about the Elbrus CPU, which doesn't have a branch predictor at all. Now I understand what kind of hoops the developers of its compiler and the porters of the JVM had to jump through...
1. Make sure your branch targets are aligned to 16-byte boundaries. 2. If you're comparing to upper AND lower bounds (whether masking or branching), you can subtract the lower bound and do a single compare instead. 3. When writing assembly, pay attention to instruction dependencies, and interleave independent instructions. E.g. load 1st char, load 2nd char, compare 1st, gen mask, compare 2nd char, and so on. 4. You can often apply these techniques directly in C++, possibly with intrinsics. ... Just my 2c
Branchless works great for GPU programming (GLSL etc). It's actually worth complexifying the algorithm just to minimize the branching (but not always).
Really depends on whether the branch is divergent or not. Modern GPUs have some level of branch prediction built in and will try their best to only take one path through the branch while leaving the other path alone, so if you know for a fact that the branch is very convergent (ie it's using a uniform/constant value, or it's based on data that's convergent in nature), then there's nothing wrong with branching (except on ancient GPUs). However, if the branch is very divergent (ie it's based on data that's divergent in nature), *then* you should try to implement the same logic in a branchless manner, as the GPU is forced to take *both* paths through the branch, with half of the warp being disabled through each branch.
I think more people need to see your videos. Although they're specialised, they feel valuable for non-systems programmers like myself.
this is a very underrated youtube channel. i learned assembly from him.
Early on in my programming career I noticed that there was usually a relationship between speed and instructions -- time and space. Plus, you can throw in the time it will take to debug clever code.
I did not expect a character like yours in a programming video! You do a great job on keeping things interesting and a bit funny while making good use of the 20 minutes with practicle examples. Really good video!
In my opinion, the main reason to avoid too branchy code is readability and maintaining time consuming. But for this issue, a clean code with well designed patterns will save you a bunch of 'ifs', which automatically make you code easier to understand and easier to be optimized by the compiler. Branchless programming the way it is presented here has a proper place and time.. if you need it you will know..
Got recommended this, watched it all, now time to find even more good content you have
Branchless programming is often about taking a step back and finding a simpler approach. In the second example, I would just take d[i] & 0xDF; it keeps the upper-case characters intact, but clears the case bit for lower-case letters, making them upper.
Yes, that will change some non-letter characters too. If those characters aren’t in your text, then AND away my friend! It will certainly be faster!!
The example is not super important, just the technique of masking in SIMD to perform multiple branches without actually branching.
Cheers for watching :)
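For what it's worth, if the letters-only assumption holds, the AND version is about as small as it gets. A sketch (the buffer and length names are arbitrary):
#include <cstddef>

// Only valid if every byte is an ASCII letter: clearing the 0x20 bit maps 'a'..'z'
// onto 'A'..'Z', but it would also mangle digits, spaces, and punctuation.
void to_upper_letters_only(unsigned char* d, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        d[i] &= 0xDF;   // 0xDF == ~0x20
}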
could try something like d[i] ^= ((d[i] - 26) < 'a')
I was just about to write this, but it’s obvious someone beat me to it (years ago)! :D
In time-critical code know your cpu. I once rewrote a graphics library in Z80 assembler and sped it up 10x by unrolling loops and counting cpu cycles for every bit of code. There are all sorts of tricks you can use to improve speed as any good programmer knows.
@jshowa o No, tools of the trade don't necessarily make one a good programmer. But pretty much only good programmers know these things. And there is nothing unmaintainable in these designs. Again, one just has to be wise about implementation. But that's the case in all languages, including those that advertise themselves as nearly fool-proof. No language prevents or even hinders unmaintainable code.
@jshowa o Yes, I watched the video. I also have 40 years of experience. You repeat what I've heard all too often. It's true compilers are very good at many things. It does not follow that they are best at everything. Experience has conclusively demonstrated that compiled code is not always the quickest code or the smallest code and there are still cases where maximum efficiency is required.
I would love to get your email, discord or linkedin or whatever and chat with you about the various assembly optimization tricks you could teach others! I'm an operating systems developer myself with 3 years of experience and I'm just now starting to really get into assembly language (I use an internal closed-source C-like language at work which is so old, it doesn't have an easy to use print function, so often the best way for me to debug the runtime is to freeze the operating system at certified compiler-emitted assembly instructions and look at registers and memory locations hahaha) and hardware architecture and optimizations enabled by a programmer's understanding thereof. My current side project is 6000 lines of mostly algorithmic C code (custom BigInt and cryptography library to be used for a secure chat app) and I'm sure there would be a lot for us to talk about and a ton for me to learn from you!
The primary rule of optimization is left out here, which is to FIRST MEASURE aka 'profile' to see which part of your code the CPU is spending most of its time in.
Once you know where that is, you try to find out any algorithm / complexity improvements (take your algorithm from O(N^2) to O(N) or even O(1)).
Then finally, after all that is done, you may think about 'intervening', checking the assembly and seeing if you can "do better yourself".
Introducing obfuscating code like a*(a>b)+b*(b>=a) leads to bugs and unnecessary rabbit holes for your future self when you get called in to debug the mess you made.
"If you want to be most people, if you want to be an ordinary dev with common skillset, this video is not for you."
- Peerple -
I can see these techniques being applied for things that require a lot of calculations to be done as fast as possible. The issue is when it comes to other people looking at the code without fully understanding why it's written this way.
Aw jeez, if only you could write a comment annotating it /s
Skill issues
Back in the late '70s, I came across a set of algorithms that computed Julian and Gregorian dates using only double precision arithmetic. Leap years and Y2K worked without modification. Operating systems and databases frequently got leap years and daylight savings time calculations wrong. It took a lot of time an effort to recover. The software that I wrote using the arithmetic functions only failed once, and that was when someone compiled them single precision. It made me a big fan of using arithmetic functions to perform logic.
You can toggle case on an ASCII alpha (from lower to upper, or vice versa) by xoring it with the space character (i.e. 0x20).
yes, he did it with -32 (0x20) in that example
It seems to me that it's a choice of evils:
1. Easy to read but much slower code.
2. Optimized but heavily obfuscated code.
Option 2 is WAY above my pay grade :-) so I'd go with #1 unless I had no other choice. Still, your video was extremely well done. Even I was able to follow what you were saying and I'm merely an intermediate level programmer. That's definitely worth a "Subscribe".
This is why programs from the very old days were really hard to read: scientists optimised them by hand back when clock rates were, like, below a thousand hertz. However, the rise in computational power means there is often no need to write optimal code for the machine, when the machine can do more work for you; programmers can spend less time writing less complex code, so we can produce more programs for more purposes. It's a trade-off, it really depends on what type of work you're doing.
I would say you should determine bottlenecks and use #2 in them and #1 everywhere else.
How about a middleware that's readable to you and which gets JIT-compiled to whatever CPU you wanna run it on?
I dunno, Option 2 is what compilers are for. Or in my case, precompilers, since I work predominantly in a compiled language without certain niceties and use a precompiler that translates the extra control statements I want to use into standard controls that actually exist. Also look into the MOVfuscator for more evidence that optimizations can often be offloaded onto the compiler.
EDIT: I said all that BEFORE WATCHING THE VIDEO, WHERE HE SAYS COMPILERS HELP
"Before, we used programmers to save on computing resources, now we use computing resources to save on programmer time." Don't remember who said that.
And like the others say, that doesn't mean #2 should never be used, just that you should only use it for measured bottlenecks that are worth it. Even if you identify a bottleneck but your program doesn't need to be fast, then you still don't need to optimize.
Some processors have a conditional selection instruction, which at the source level corresponds to a > b ? a : b.
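For example (a sketch; whether you actually get a conditional move or select depends on the compiler, optimization level, and target, so check the output):
// A simple ternary like this is commonly lowered to a conditional-select instruction
// (CMOV on x86, CSEL on AArch64) at -O2, with no jump at all.
int pick_larger(int a, int b) {
    return a > b ? a : b;
}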
branchless programming is a must in time critical applications where worst case execution time prevails. This skill is appreciated in avionics design companies.
Love when experts have a down-to-earth attitude like this. You rock man.
Loved the reference to Al Di Meola, one of my all time favourite guitarists. Great video too, right up my street.
"The CIA wants you to believe its only if then else if then else"
Run over all the glow-in-the-darks
The FBI wants to have a word about switch statements
What is this quote from? If you quote something from the video you're commenting on, then please include the time where you found it.
@@VulcanOnWheels ruclips.net/video/7uLzaKlZSQQ/видео.html it's in that video at around 5m 40s
But it's worth it watching the entire video because Terry Davis is a legend
You write return a*(a
You're correct. Neither the video maker nor the adoring commentators understand what a branch is, and are just jerking off trying to eliminate short jumps. Fucking brutal.
People have done it before; here's an example I found online. If you want to know if it's smaller, like in the video, just put a ! on the function call:
int isGt(int a, int b)
{
    int diff = a ^ b;    // bits where a and b differ
    diff |= diff >> 1;   // smear the highest differing bit downwards...
    diff |= diff >> 2;
    diff |= diff >> 4;
    diff |= diff >> 8;
    diff |= diff >> 16;  // ...so diff is all ones from that bit down
    diff &= ~(diff >> 1) | 0x80000000;           // keep only the highest differing bit
    diff &= (a ^ 0x80000000) & (b ^ 0x7fffffff); // non-zero only if that bit means a > b (sign bit handled specially)
    return !!diff;       // 1 if a > b, else 0
}
I’ve been doing Z80 assembly recently, and seeing all these fancy conditional instructions and more than 8 registers is crazy 😀
Long live the ZX Spectrum :)
cheater! 6502 only had 3! spoiled brat.
I thought it was funny when he said "The compiler couldn't figure out what we were trying to do". Same as most people trying to read that code. There are lots of compiler tricks for optimizing - removing branches is just one of them. And optimizing for speed is not the only type of optimization a compiler can do.
My own intuition is to strive to make your code readable first. If you are not meeting speed requirements, reconsider your algorithm first. As a very last resort, consider using assembler. I have trouble imagining a scenario where trying to come up with some contorted approach in a high level language that is hard to understand, maintain, or predict what the performance implications would be (without looking at disassembled code, which seems kind of like "why bother") would be the preferred approach.
BTW - the same high level code can be compiled into different native instructions for various reasons (e.g. different compilers, different target processors, different compiler options, ...), so it seems like you would need to buy into disassembling your code EVERY time one of these variables changes to see if it is still optimized. Of course, it is not uncommon for the writer of the code to not really know when one of these things changes in their build environment. In general, I would say if you start trying to second guess your compiler, you are at best wasting your time, and at worst, creating a maintenance nightmare.
To be fair, I think all of these points are taken into account by the video. He's not saying you should do this stuff everywhere, only in cases where you need extreme speed, and especially in SIMD scenarios. Nobody's doing SIMD unless they need speed.
I believe you may have meant to say assembly instead of "... consider using assembler" to refer to an assembly language.
Remember that not everyone is writing code for webservers and PCs. If you work on high level applications, the it's hard to imagine, but there are people writing for low level applications where cycles and bytes are scarce.
@@shiuido359 EXACTLY! 99.999% of people that write code are oblivious to the fact that low-level programming is still relevant and necessary in certain applications. They live under the belief that assembly is old, outdated, incapable and doesn't need to be taught, because they learned Java or Python.
You can replace 32 in C with 'a' - 'A' just to not have magic numbers. Great vid - thanks!
Although not the purpose of this video, the old school trick for toupper is to bitwise AND each byte with a mask (~(0x20)), presupposing the arrays are all ASCII characters in the range of A-Z/a-z. Of course, once out of ASCII, this isn't the right method.
@@shanehebert396 I never thought of it - thanks!
For the conversion from lower to upper, wouldn't it be faster to just do an AND on the byte with a register holding the inverse of 32 in binary (6th bit set to 0)? So loop over value & 0b11011111; depending on your architecture this conversion could be done in a single operation.
String may contain non-alphabetic characters and they should not be 'converted', i suppose.
For unsigned ranges like 'a' to 'z', a smaller range check I use is (value - lower + 0u > range), where 0u forces the comparison to be unsigned and range would be 'z' - 'a'. Any wrap-around also results in an unsigned value that is out of range, so instead of two compares where both bounds have to be accounted for, it's one subtract and one compare.
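Spelled out in C++ (a sketch of the same trick, written as an in-range test; the function name is mine):
// True only when c is in 'a'..'z': subtracting 'a' in unsigned arithmetic sends
// anything below 'a' to a very large value, so a single compare against the
// range width checks both bounds at once.
bool is_lower_ascii(unsigned char c) {
    return static_cast<unsigned>(c) - 'a' <= 'z' - 'a';
}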
This seems like a common enough thing that the compiler should be able to provide the optimisation. Isn't that the case?
I like explicit casts in such cases, with comments so people reading the code have no cognitive load
It will probably be slower since you changed one comparison and logical operator to one subtraction. Comparison is a subtraction without memory assignment, so cmp is faster.
I didn't even know branchless coding was a thing. I just assumed "if" was necessary to test a condition and didn't really question it, seemed reasonable enough. You've opened my eyes. I'm not gonna use branchless coding yet, most of it flew WAY over my head, but knowing why it's bad is still good
It really isn't that difficult if you think about it, for comparison replacement, a>b itself is a mathematical expression which returns 1 if it's true, so if you do x=a*(a>b), that will mean x=a if a is larger than b. You can use that same principle for anything. That said, I think the main lesson learned from this video is that you shouldn't try to be too clever because most of the time the compiler will do a better job than you ever will.
@@LocrianDorian ye nah I get the algebra, I just don't understand how converting "if(a>b) return a" to "return a * (a>b)" removes a comparison, you're still comparing a to b
@@Louis_Marcotte It doesn't remove the comparison, it removes the branch. The thing that slows down the CPU is not the comparison itself, it's the "if" statement.
@@LocrianDorian He's right, you're wrong. Eliminating jumps to minimize cache miss is not what branchlessness is about. Whether you're using an if or multiplying with the result of a comparison, you are creating a branch for the CPU's branch predictor: either the comparison will be true or false, and your multiplication result depends on that. It will either predict the branch result correctly or not, resulting in either a speedup or wasted work. (a * (a > b)) creates a branch for the branch predictor no different from an if.
However, if the data is predictable, we can use indirect addressing, hash tables, and other branchless techniques. This is likely the limit of optimization in modern contexts.
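One more branchless flavour in that family is using the comparison result as an index rather than a multiplier (a sketch with my own names; as with everything here, check what your compiler actually emits and measure):
// Index-based select: the comparison yields 0 or 1, which picks one of the two
// stored values. No jump, though it does cost a couple of stores.
int max_by_index(int a, int b) {
    const int candidates[2] = { a, b };
    return candidates[a < b];   // a < b is 0 or 1
}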
Good to know some programmers are interested in this stuff... thought it was a dying art :)
Did a lot of stuff back in late 90's with good old MMX instructions to process video in real time (which was tricky back then).
Even now some clever optimization can make a huge difference - like being able to adjust photo editing filters in realtime with a slider (e.g. Affinity Photo) or waiting 1/2 a second or more for each tweak.
Somebody gift this man a mouse-pad.
This sort of optimization is very hard to get right and often leads to premature optimization. For tight loops in higher-level languages, you should check your bytecode, and for lower-level languages, the disassembly. However, don't forget about the third thing... the CPU execution pipeline. There are also many optimizations each CPU can apply. So while one disassembly looks slower than another, it may not actually be, since the CPU might be able to optimize one better than the other. It's very tricky and really ends up requiring a lot of testing on specific CPUs. I wish there was a platform where you could benchmark your code against hundreds of different CPU types (with emulation) and see the results.
Ah, branch prediction.
Yeah, I see.
I guess I spend too much time with ancient hardware that doesn't do anything fancy like that.
You do a branch on a 6502, the CPU does that pretty quickly.
Then again, since there's little to no pipelining (there is some limited precursor to a pipeline which is why the cycle counts are so low; basically it does instruction decoding and execution in parallel for multiple instructions), no cache, and nothing like branch prediction...
A branch is about as fast as anything else...
Ah well.
The 6502 is an evolutionary dead end. Not because it couldn't have been extended to higher bit depths (65816) and faster clock speeds (the current record for a 65816 FPGA core is roughly 280 MHz).
No, the problem...
Which can be seen indirectly with the whole Zero page/Direct page idea, is that it's a Register + Memory design.
If an instruction takes two operands, in almost all cases one of them is from memory, and one is a register.
It also does single cycle memory access.
(compare with the slightly later and more prolific 68000 which takes 4 cycles to access memory)
For a given CPU clockspeed, the 6502 requires memory that can be accessed in one cycle.
The 65816 does even worse, demanding half cycle access.
On the face of it there's nothing explicitly wrong with this design.
The problem is that the trend in technology (even to this day) is that CPU performance has outpaced memory performance. (at least, for affordable memory)
This is why the CPU cache is a thing. Not because anyone especially wants the added complexity, but because a small amount of very fast cache memory + much slower main memory is much more practical and affordable than having your entire memory system operating at speeds equal to or faster than your CPU...
Branch prediction I guess is a side effect of deep CPU pipelines, which in turn are a way of reducing instruction cycle count without actually making instructions faster.
Want a single cycle multiply instruction?
Well, either you make some complex circuit that can actually do that in a single cycle, or you build a pipeline where 10 instructions are in different stages of execution at the same time.
The net effect is that each instruction takes 1 cycle...
... If the pipeline doesn't stall...
At this point I'm curious what the maximum speed of a CPU would currently be if it had no pipelines and no cache memory.
I suppose you can predict that from memory speeds.
If we can deal with the logic used for DDR, then I guess we get something akin to 2-4 GHz to play with, so it could be worse...
But there's almost certainly some major caveats involved here...
KuraIthys
good post, I am a 6502 hobbyist and watched this video for a laugh at the semantics of "branchless programming". I get the feeling tools are so easy now that programmers like to seek some form of division from the less technically inclined people who write software.
It's not just memory bandwidth, it's latency. A CPU with no cache today would be catastrophically slow because an L1 cache read takes ~1ns and a ddr read is ~100ns. I've actually wondered about doing the opposite, making a CPU with huge amounts of cache that only uses main memory to load entirely new programs. Never have to leave the cpu die.
If you're talking speed of a single instruction without pipelining, of course it wouldn't change from its pipelined version if you run the instruction with the pipeline flushed. Pipelining increases throughput, but there's no reason to expect the latency to be any different.
@@Badspot Good point. I'm certainly no CPU designer, so I'm sure there was more to it.
Though as to your idea... With a large enough cache you basically end up back where you started.
If your cache is so large that you never need to access main memory, then what you've basically done is turn the cache into the system memory, and system memory into something akin to a RAMdisk...
100 ns latency... hmmh... I admit I didn't think it was still that bad...
I mean, a Super Nintendo was already demanding ROM access latency of 110 nanoseconds or more...
That would imply for mainstream RAM we've made no real progress with access latency in about 25+ years...
@@KuraIthys Some high-speed DDR3 and DDR4 can reach 10ns latency for the first word, though delivering the whole cache line can still take 90-100ns.
Should've mentioned loop unrolling which is also a kind of branch removing method.
He did, but it was really swift!
"Well, there is always the end of the loop except if your code [move and shows code repeated]".
Loop unrolling is very powerful, but the technique is limited to applications where the loops have a known/fixed iteration count at compile time. Loops with a dynamic iteration count, or a count based on the size of a dynamically sized container (like a vector), cannot be fully unrolled since the compiler has no way of knowing how many times to copy the code in the body of the loop. Luckily, the branch prediction hardware in modern CPUs is really good (90+ % hit rate) at predicting the correct branches for looping statements, since loops are such a commonly encountered structure.
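A tiny illustration of what unrolling by four looks like by hand (a sketch with a hypothetical sum; compilers will often do this themselves at higher optimization levels, and real code needs a cleanup loop for leftover elements):
#include <cstddef>

// Unrolled-by-4 sum: one loop-end branch per four elements instead of one per
// element. Assumes n is a multiple of 4 for brevity.
long long sum_unrolled(const int* data, std::size_t n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;
}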
Crap, should have read comments *again*. I had the same comment. It's very much a branch removing method, which was usually the reason to do it. You removed the branching and compare at the end of the loop, which would shave several clock cycles. The drawback is more code bloat though, and if you have limited memory that could become a real problem after a while. I remember spending weeks trying to tweak a mobile port of some Tom Clancy game (this was back in '01/'02 and I can't remember the title). I kept running out of memory and I ended up having to roll back up some loops :). I had to make some other "creative" changes in the gameplay to compensate for some frame stuttering caused by that, but in the end it was worth it because the game just didn't want to fit on that device despite me having at least 10k overhead beyond the device limit. This was a problem with a lot of early devices. They might have say 200k on the spec sheet, but there is some process somewhere using 10k of that and so you really only have 190k. Yeah, I had to work with that much RAM at times on some form factors, and the concept of phone manufacturers was it's a phone first before anything else, so calls always have priority. This could wreak havoc on your carefully planned memory map when the handset gets a call during the most memory-intensive point. We would have some weird bugs for sure, and they were almost always RAM related. When Apple invented the iPhone I knew mobile gaming was going to change forever, and in a good way.
EDIT: I want to say it was Splinter Cell maybe? That could have been a later port, but I want to think it was Splinter Cell. If you played that on Qualcom handset or more specifically a BREW handset - Kyocera, LG, and a few other manufacturers were still making BREW phones back then - I might have been the one to port it, so you can blame me for any shitty gameplay as a result :P.
One of the reasons people should try to use ranged for. However I find it hard to move away from indices if you are doing some mathematical operation, particularly when the index is part of the calculation. Probably choosing a better data structure (maybe a struct) instead of pod would help here, but that has its own tradeoffs.
3:44 - That code is C syntax. The syntax uses a comparison operator to return a number equivalent to Boolean. However, there is no guarantee of how the syntax will translate to whatever CPU is chosen, especially when using a language whose compiler you didn't study or write. Thank you for showing the concept, but I wish you would explain up front that this is a C-syntax feature, not a reliable technique in programming. (it would set the context better and be less misleading to the naive viewer)
This amounts to obfuscating the clear intent of the program and puts the activity of programming back into the mentality of doing assembly language programming where one knew the underlying CPU with intimacy and coded for the CPU, i.e., programming for the low level hardware instead of programming to human friendly abstractions of high level programming languages. Whereas this problem could (and should) be left to compiler optimization and/or CPU architecture techniques to solve. Pushing hardware problems of this nature back into how the program is written is going the wrong direction. Instead of statistical branch prediction, CPUs can just have two instruction pipelines running in tandem per the branch point and the actual branch result is used to finally determine which pipeline to retain. Set up several of these pipelines and just treat them like circular buffer of instruction decoding pipelines. Every time a branch appears in the instruction scanning, just start using an available pipeline to decode on one of the branch decision paths. Given how many cores are built into CPUs these days, the argument of using more transistors for this kind of solution doesn't hold water any more.
This is so bad... "Premature optimization" in the worst sense i've ever seen it. Sure, looks cool and smart, but makes the code ureadable and/or hard to maintain. Might make some sense in a VERY TIGHT domain case, but for general coding, it's more like teaching people how to dig a hole and jump in it...
I get what you're saying, but I'd like to offer a counter argument: Since this is, visually, little more than a syntactic sugar, I doubt it'd make the code less readable. We've already gotten used to switch statements and ternary operators as high level alternatives with various pros and cons to simple if branches, what's one more. If you've ever used linear interpolation, you'll recognize this. I agree that anything a compiler should be able to do, humans shouldn't have to deal with by hand, but _if_ you're running a time critical operation I'd still prefer seeing this rather than a deluge of if's to check simple conditions.
Again, I'm not dismissing your point, I'm just saying this isn't meant to replace every branch in every program. It's a tool to be used judicially if and when it is needed. Like null pointers ;)
Branch prediction is what led to the spectre vulnerability. Also, the structure for the non-branching code he showed us isn't hard to understand. It's actually pretty comparable to a switch statement. Throw in a comment to remind people that conditions evaluate to 0 or 1 and it quickly becomes apparent what the code does.
result =
newvalue1 * (condition1) +
newvalue2 * (condition2) +
newvalue3 * (condition3);
This also seems like a nifty way to construct byte-sized flags if the newvalues are powers of 2.
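For instance, a sketch of that flag-packing idea (the field names and thresholds are invented purely for illustration):
#include <cstdint>

// Pack several conditions into one byte without branching: each comparison
// yields 0 or 1, shifted into its own bit position.
std::uint8_t pack_flags(int temperature, int pressure, int voltage) {
    return static_cast<std::uint8_t>(
        ((temperature > 100) << 0) |   // bit 0: over-temperature
        ((pressure    > 50)  << 1) |   // bit 1: over-pressure
        ((voltage     < 3)   << 2));   // bit 2: under-voltage
}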
Kiddobyte Yep, this original comment is SUPER anal. Speculative execution is trying to emulate what branchless programming does natively - saying "throw more cores at it", or "run code in tandem" ignores the fact that any wasted work is wasted work, and you're condemning a core to execute throwaway instruction EVERY time a branch is encountered. That being said, these optimizations should probably be put into the compiler, rather than user code.
I agree if you'd start writing business logic like this, but for the examples he used this seems very reasonable, as long as you check performance and correctness which can be done easily. After that nobody will have to look at this code. I guess he went a little too far when he generally claimed this would be worth looking into, it's most likely not.
RUclips should pay you 10x more for each view you get for the underrated quality content you publish. Love your channel! ❤ I was recently getting more into this subject myself, especially reading about how different compilers handle N/RVO (Return Value Optimization) and copy elision to reduce copying, now that move semantics were introduced in C++11. It would also be a great idea for the next video to dig more into optimization and what we should not worry about when we write our code.
C++17 mandated that RVO (not NRVO) is performed. If you're allowed to use C++17 (or later) then the compiler is not allowed to perform intermediate constructors and destructors anymore, so there shouldn't be any differences between compilers or optimization levels. You can see the edge cases here: en.cppreference.com/w/cpp/language/copy_elision
Something weirdly satisfying about listening to someone talk Assembly with an Australian accent.
I would be very interested to see a video on the low level differences between x86 and ARM. Especially when it comes to SIMD and such
They are very different! I'd love to make some ARM ASM videos, the Neon instructions are great :)
ARM32 (not sure about newer ARM64) has a really interesting instruction set that encourages branchless programming. You can predicate every opcode on a flag condition (e.g. LE/GT/EQ/etc).
@@SerBallister That's still a branch, though - just merged into an instruction. That's also why ARM64 got rid of them and only supports them explicitly with branch, select, and compare instructions (for performance reasons, no less!).
@@totalermist I thought they differed by not using branch prediction, so they did not come with misprediction penalties. The non-executed opcodes behave more like NOPs. I didn't study the internals that much. It's an interesting instruction set with some interesting opportunities for optimisation.
@@WhatsACreel Bonus points if you make an ASMR version of the vidz
Interesting stuff, I never dug too deep into programming so I never even saw the instructions that go the CPU/GPU, would love to learn more about the actual assembly code
I remember learning all these topics knowing it would speed up code but I didn’t know to what degree. It was very informative to walk through the same example for direct comparison
Since upper and lower case ASCII characters differ only in bit 5, can't you just AND each character with 0xdf?
You can use AND, yes. But you have to make sure to only AND the lower case letters. I don't think it saves any time since VPAND and VPADD both run at the same speed. Excellent idea still, and I'm certain my code is not the fastest possible, so it would be worth exploring other instruction :)
@@WhatsACreel Very good point. I forgot that we might be given non alphabetic characters.
But this correlation between upper and lower case only works for ASCII, which means that for any international solution it does not work at all, since language-specific chars are not spaced 32 code points apart.
@@davidmartensson273 You don't mean like UTF-8, do you?
@@user-zu1ix3yq2w Yes, for example. But when used inside a program it's often expanded to 16-bit chars or 32-bit structures for simplicity; having different sizes for different chars makes optimizations very difficult ;), at least as far as I have seen. C# uses 16-bit chars, but does have surrogate pairs for some chars that do not fit into 16 bits.
In these 19 minutes of video my computer could have uppercased all books written in the last 10 years.
Sounds like you have a pretty neat OCR and also pirated a lot of books.
But really, unless your corpus is already in plaintext and loaded in RAM, other bottlenecks will make the conversion speed irrelevant.
@@zwz.zdenek well, my point was that computers are cheap and humans rarely have to care about optimizing the code.
But ok reducing branches can sometimes be worthwhile.
@@johnnisshansen it's most important to have very optimized code where massive amounts of calculations are being done, like, I dunno, finding large primes.
For regular programs less effort is put into optimizing now than was done back in like the 80s or 90s, because computer power is so much higher that it is usually not needed. They do stuff near enough instantaneously anyway.
Never assume optimization is achieved. It comes down to your use case and needs. If your need to look through 50gb databases or you need to talk to 100k computers any and all tricks can help. Granted when talking to 100k computers, network is a big bottleneck. But I've been able to take 20 day scripts and convert them to a few hours. Optimization can and will matter in certain scenarios.
That kind of thinking led us to page loading times that are worse than what we could achieve in the '90s...
12:50 you mentioned xor
would it not be objectively better to use xor since we're talking about coding it well?
Imagine if YandereDev saw this
EDIT: It's a joke, chill out you angry weebs.
This video showed up on my recommended after watching yanderedev videos youtube is so smart lol
@@vinzer72frie and @Henrik :: I'm a programmer that's helped with optimizing Yandere Simulator.
---
I should note that the types of optimizations mentioned in this video are *hot path* optimizations. You're referring to one script that's executed ~80 times per frame, and its execution time, as measured by the profiler, is under 1ms on a typical, modern ~4 GHz CPU. If this were your bottleneck, then you would be in the ~1000 FPS range (because 1ms = 1/1000th of a second). This measurement is the sum of all 80 NPCs, and it also includes the time the script spends blocked on heavy calls into C++ (ex: when some NPCs need to adjust their animation weights).
---
Performance mostly depends on data... and game logic doesn't have a whole lot of data flowing through it (when compared to rendering thousands of objects, thousands of physics objects interacting, hundreds of animated objects that each have dozens of bones, etc.). People put a lot of emphasis on code, but optimizing your data (how much you load and how it is laid out) is usually where you will get the huge wins.
---
Also, when you're in the realm of Object-Oriented Programming, one L2 cache miss is typically 10x to 100x slower than a branch, depending on whether the branch prediction was a success. This obviously varies from CPU-to-CPU, but it's a good order of magnitude comparison for modern x86-64ish CPUs. A correct branch is ~2 cycles. An incorrect branch is ~15-20 cycles for AMD Ryzen / Intel -Lake CPUs ( See Agner Chapter 3.8 and 3.13: www.agner.org/optimize/microarchitecture.pdf ). A cold memory fetch (reading memory that's not in cache) is ~200 cycles. That's not too big of an issue for something that happens dozens of times per frame, though, because a 4 GHz CPU does ~67 million cycles per frame at 60 FPS (per core). In other words, you would need to do ~1.5 million branches on your main thread, and have them all be incorrectly predicted by the CPU, to eat up just half of your 60FPS frame budget.
---
The actual performance issues with Yandere Simulator are mostly related to animations and physics... and we've been fixing them up based on priority and availability. You can see over a 2x performance increase between January 2019 and July 2020 (assuming you have a discrete GPU and shadows are disabled -- neither of which have anything to do with game scripts). It's not uncommon to see a gaming PC with a good (ballpark ~GTX 1060-or-better) GPU and a gaming CPU get 80-110 FPS when shadows are disabled.
---
Unfortunately that "if/else" meme fools a lot of people, though. It is not, and has never been, the issue for that game. I mean, if performance got over 2x better and the game scripts haven't changed, then the scripts clearly could not have been the issue. Over 50% of all execution time, good and wasteful, was removed... yet the if/else statements are still there.
@@scottmichaud nice comment, too bad not really anyone is going to see this. I'm a gamedev myself and seeing the way people rag on yandere dev is pretty often kinda cringe. It's very obviously a lot of script kiddies or people with no experience just jumping on a bandwagon without actually knowing what to be mad at.
Did the game have optimisation problems? Sure. Is it because of some random out of context if/else chain? probably not.
@@scottmichaud I have been studying graphics libraries and trying to make an engine, meanwhile looking at the code of games. It always baffled me what was supposedly wrong with those if/else chains; I can't imagine an RPG without them. This cleared up the doubts.
@@cortexauth4094 Yup. No problem. You want to have as much of your hardware at max capacity as possible (until there's no more work for the frame, then idle in one big chunk). When a branch is mispredicted, it needs to flush the bad work. That wastes "low tens" of cycles. ~20 cycles is nothing when you're concerned about work that happens once every 10 milliseconds (100 FPS). That's about 0.00005% (roughly 1/2,000,000th) of your budget. That could be a problem if you're in a loop of thousands or millions of objects, though, and the CPU can't figure out your pattern.
Game engines will mostly have problems with L2 cache misses, though, which are 10x worse than a mispredicted branch. This means that work that needs to operate on many things (ex: thousands) should be laid out linearly in memory so your next calculations are on the same cache line, and, when you leave your cache line, you land on the start of the next cache line. The CPU watches your memory accesses and it will speculatively load cache lines based on what it thinks you will need.
One of the biggest problems with memory accesses is dynamic inheritance. The way to allow multiple different child classes to fit in an array is to *not* have them contiguous in memory (because they're all different sizes based on the child class). There's a *lot* of different discussions that can go on here... but, yeah, it depends on the data you have flowing through it. Being wasteful on the order of millionths of a 100 FPS budget might be okay if you're talking about hundreds of things... but, on the other hand, it is wasteful. IMO the biggest part is knowing what your waste is, and making a conscious judgment to waste (rather than being blindsided by the profiler months or years later).
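To make the data-layout point concrete, here's a rough toy sketch (my own example, not the game's code) of the two shapes being contrasted:

#include <cstddef>
#include <memory>
#include <vector>

// OOP / pointer-chasing shape: each entity is its own heap allocation, so a
// per-frame walk can take a cache miss on nearly every element.
struct Entity {
    virtual ~Entity() = default;
    virtual void update(float dt) = 0;
};
void updateAll(std::vector<std::unique_ptr<Entity>>& entities, float dt) {
    for (auto& e : entities) e->update(dt);      // one indirection (and likely a miss) per entity
}

// Data-oriented shape: the fields touched every frame sit contiguously, so each
// cache line fetched is full of useful data and the prefetcher can keep up.
struct Positions {
    std::vector<float> x, y, vx, vy;
};
void integrate(Positions& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += p.vx[i] * dt;                  // linear walk the hardware prefetcher can follow
        p.y[i] += p.vy[i] * dt;
    }
}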
Interesting video! I'll trust the compiler to optimize my readable code over trying to outsmart it, though :b
Well... I think the ultimate lesson is to speed test your code
I think there are many things to do to optimize speed. This should not be high on the list, but it's good to know that it's there. If code is branch-heavy, this would probably help more. Speeding up 1ms of code 1000 times doesn't have that much impact on performance as a whole.
CPUs are cheaper than developers
Two things I would add. 1) Conditional move et al still create dependency chains which can slow down throughput. 2) Speculative execution vulnerabilities like Spectre, as well as side-channel attacks on the execution time of a piece of code, can make branchless algorithms more useful.
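On the side-channel point, a rough sketch of the kind of constant-time select that branchless code enables (illustrative only, ctSelect is a made-up name; real constant-time code also needs checking at the assembly level, because compilers may reintroduce branches):

#include <cstdint>

// Pick a or b without a data-dependent branch: the mask is all-ones or all-zeros,
// so execution time does not depend on the (possibly secret) value of 'choose'.
std::uint32_t ctSelect(std::uint32_t choose, std::uint32_t a, std::uint32_t b) {
    std::uint32_t mask = std::uint32_t(0) - std::uint32_t(choose != 0);  // 0xFFFFFFFF or 0
    return (a & mask) | (b & ~mask);
}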
While these techniques aren't very useful for most programmers, they are extremely useful for compiler programmers. The more patterns they can recognize and replace, the better everything is for everyone.
Or library developers, e.g. providing the fastest primitives like memcpy, strcpy, strlen, etc. is essential for C/C++ performance.
However all the ugliness and complexity should stay locked inside the library implementation and covered with extensive unit tests and benchmarks.
I disagree a little.
Thinking that most programmers shouldn't care about performance is what led us to the situation today.
That you need a 2x2 GHz CPU just to start Windows. Just to display a desktop background image, some icons, and do nothing while waiting for the user to click something.
And 10 seconds of website loading is considered normal.
And a fast-typing programmer can type faster than the average IDE can display text.
@@peterSobieraj The goal is that top level programmers shouldn't *have* to think about performance because the low level stuff takes care of it for them. I agree that we aren't there yet though.
I used the vector technique for a mapping algorithm at my last job, using the Accelerate framework on iOS. Got a HUGE performance increase, although I never thought about the branchless part of it. I was just happy that I had an algorithm that could convert tens of thousands of lat/lon pairs to rasterized mercator tiles of 32-bit indexes into an array of values for each mercator tile pixel in the same time that it took the naive, pointer-chasing one-vector-at-a-time method to process 10 lat/lon pairs. I was working with 300x speedups.
Cool story bro
You must slay at parties
I observed a similar situation in BVH tree traversal. For a raytracing program, my first approach was storing BVH data with pointers and dynamic memory, and traversing it recursively. It was very elegant, OOP-styled code. Then I had to make it run on the GPU, so I re-made it in a branchless way: I stored all its data in contiguous memory and used a stackless BVH traversal algorithm. Yep... 100x speedups when run on the CPU, 1000x speedups when run on the GPU.
That's pretty cool
Garfieluigi Yes. Especially parties full of people with a high appreciation for the technical.
I love the SIMD part. I never got deep into it, sadly.
I believe there is one more thing to mention in correlation with the topic: branch prediction. Often, a simple if statement is used to iterate through values and perform different actions based on them. You might find a significant improvement simply by sorting the data so branch prediction can kick in and lessen the mispredictions. It would be interesting to see how everything you mentioned performs in this scenario.
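A rough sketch of that effect, the classic example (names are mine, numbers are whatever your machine gives you): the same loop run twice, but sorting first makes the branch almost perfectly predictable.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Same loop, same data, same answer; the only difference is whether the branch
// inside the loop follows a pattern the predictor can learn.
std::int64_t sumAboveThreshold(const std::vector<int>& v) {
    std::int64_t sum = 0;
    for (int x : v)
        if (x >= 128) sum += x;      // random 0..255 data: ~50/50, worst case for the predictor
    return sum;
}

int main() {
    std::vector<int> data(1 << 20);
    std::mt19937 rng(42);
    for (int& x : data) x = int(rng() % 256);

    std::int64_t unsortedSum = sumAboveThreshold(data);  // mispredicts constantly
    std::sort(data.begin(), data.end());
    std::int64_t sortedSum = sumAboveThreshold(data);    // predictor is right almost every time
    return unsortedSum == sortedSum ? 0 : 1;             // same result, very different running time
}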
Very interesting video.
Another technique you didn't show (and is a lot easier for beginners) is caching. In the case of toUpperCase, you could prepopulate an array of 256 elements filled with zeroes and a couple of -32s. You sacrifice a bit of memory for great gains in speed.
Also, depending on the language you use, you can store function references in an array and call funcArray[a>b](a,b) to reduce the amount of branching. Or even funcArray[a%4](args) for more options.
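A hedged C++ rendering of the funcArray idea (pickFirst/pickSecond/smallerOf are made-up names): the boolean indexes the table, so there is no conditional jump in our code, though in practice an indirect call is often more expensive than a well-predicted branch or a cmov, so benchmark before committing.

#include <cstdio>

int pickFirst(int a, int)  { return a; }
int pickSecond(int, int b) { return b; }

// Table dispatch instead of an if: index 0 or 1 chooses the function.
// The indirect call still has to be predicted by the CPU's branch target predictor.
int (*const smallerOf[2])(int, int) = { pickFirst, pickSecond };

int main() {
    int a = 3, b = 7;
    std::printf("%d\n", smallerOf[a > b](a, b));   // a > b is 0 here, so pickFirst returns 3
    return 0;
}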
I would like to add that, from my testing actually benchmarking branchless code, especially with "trivial" examples like the smaller-than function you used as a first example, and with compiler optimizations enabled, there are a lot of cases where branchless is "at best equal" to branched code, and the result can even flip entirely based on the compiler used.
Staying with the smaller-than example: without optimization the branchless version is 10% slower in clang, but 10% faster in gcc, while with optimization they're exactly equal.
The second example works mainly because the branchless form can be vectorized, such that the 1024 elements can be processed in just ~300 cycles.
I've tried to apply this branchless approach to my own project (doing some work with big integers; I changed only the division operator to work branchless) and it increased the overall compile time of my tests 10-fold (from about 20 seconds to over 3.5 minutes) and nearly quadrupled the runtime of my long-running test case, calculating ten 8-byte prime numbers, from 2259 ms to over 8000 ms. Trying a branchless approach singlehandedly undid all of the optimization I've done since first getting it working; fortunately I have backups.
The reason for this is that, while it's true that mispredicted branches are slow, it only takes about 10-20 cycles to flush the pipeline and switch branches, while on the other hand branchless code needs to perform the work required to evaluate both branches, plus some extra overhead. So even for trivial situations it's quite difficult to actually get a benefit unless your branchless code allows the compiler to vectorize the work, which is what happened for the ToUpper function.
In more complex scenarios it only gets worse, since if there is more extensive work that needs to be done in each branch, the work for both branches needs to be fully executed before the return value can be determined.
There are some genuine use cases for branchless programming. As noted in another comment, it may be important in cryptography despite being slower, and there are situations in which branchless is actually faster, but unless you can confidently say that the branch predictor mispredicts more than normal, or that the branchless form will benefit from vectorization, it will usually perform equal to branching at best.
CPUs are sufficiently complex beasts that unless you actually benchmark what you're trying, it's really not possible to predict what will perform better. For instance, the speed gain in the ToUpper function from earlier flips with -O1 on GCC, becoming 50% slower than the naive branched version, whereas phrasing the condition with a bitwise AND (&) rather than a logical AND (&&) is 20% faster.
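For reference, a sketch of the two phrasings being compared (function names are mine; exact results depend on compiler and flags):

#include <cstddef>

// Phrasing 1: logical AND. Short-circuit semantics invite a branch for the second
// test (compilers often remove it here, but not always).
void toUpperLogical(char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (s[i] >= 'a' && s[i] <= 'z') s[i] -= 32;
}

// Phrasing 2: bitwise AND of the two comparison results. Both tests always run,
// giving straight-line code that is easier to turn into cmp/setcc/and and to vectorize.
void toUpperBitwise(char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        s[i] -= char(((s[i] >= 'a') & (s[i] <= 'z')) * 32);
}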
the places where branchless is faster is usually because it allows the compiler to see optimizations (usually SIMD) that it otherwise wouldn't have found. iirc -O1 doesn't do SIMD optimizations so that's why the results flip. personally I've seen cases where replacing branches with boolean checks allowed the compiler to do some pretty crazy things like replacing entire loops with bitwise operations resulting in over 100x speedups
Holy fuck I started reading and hit read more, what a fucking paragraph
Very nice. Not often we see real code optimisation. First time I’ve seen this. Love it.
Smaller is faster. Not only your data, but your code blocks also need to move in/out of CPU caches, so avoiding a possible branch to code outside of L1 adds up and makes sense (now). This is a great technique for ultra-performance code. The algorithm at the end is a nice added bonus. Subscribed.
this reminds me so much of how The Cherno explains c++. Good thing to know that the compiler at times knows best :D
Cherno is a legend!! We certainly share the accent :)
Nice video.
But remember, unless something is actually a bottleneck in your program's performance, you never want to waste time and effort squeezing 1ms out of it. Always profile first, optimize last.
(Obviously, this doesn't apply to study/benchmark stuff.)
If you look at the assembly code the branches are still there, just using cmovl or setle or setl to handle the condition within the instruction. What the transistors are actually doing within those instructions is just doing the branch in a very efficient way, but the condition/branch is still there.
13:30 improvement idea:
use an array that has the length of 255 and place basically the whole ASCII table there but make it so that the portion a-z has the values A-Z
this should make it faster because you aren't comparing anything, you're just replacing information with existing information
*256
@@shinyhappyrem8728 it stops at null so you never have to take into account the null character lol
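A hedged sketch of the full-table idea (assuming C++17; makeUpperTable/kUpper/toUpperTable are made-up names). The table is only 256 bytes, four cache lines, so it sits comfortably in L1; index 0 maps to 0 either way, so the null byte takes care of itself.

#include <array>
#include <cstddef>
#include <cstdint>

// Full 256-entry byte->byte table: identity everywhere except 'a'..'z' -> 'A'..'Z'.
// Conversion is then a single indexed load per byte, with no comparisons at all.
constexpr std::array<std::uint8_t, 256> makeUpperTable() {
    std::array<std::uint8_t, 256> t{};
    for (int i = 0; i < 256; ++i) t[i] = std::uint8_t(i);
    for (int c = 'a'; c <= 'z'; ++c) t[c] = std::uint8_t(c - 32);
    return t;
}
constexpr auto kUpper = makeUpperTable();

void toUpperTable(std::uint8_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) s[i] = kUpper[s[i]];
}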
Huh, neat. I think unless I really need to optimize run time I'm going to stick to more human-readable code.
Exactly. Debugging a program can be extremely difficult if the programmer thought he was being clever.
So I thought about it and decided to have a go at doing a similar optimised routine in ARM assembler (I'm not too familiar with x86) using bitwise operators instead of tests and conditional set...
operation: duplicate the char in upper half of register -:- add a constant that brings 'a' over 256 in upper half and 'z' to 255 in lower half -:- combine bit 24 and bit 8 with and-not -:- shift and filter so bit 24 becomes #32 -:- subtract from the char and store back
// char* text is in r0 from function call
to-upper:
ldrb r1, [r0] // load first byte
mov.w r3, #0x85 // this constant doesn't fit in a single immediate in ARM code
orr.w r3, #0x9f0000 // could also be implemented as a load from memory
orrs r2, r1, r1, lsl 16 // duplicate in upper halfword, setting flags (zero flag is the important one)
add r2, r2, r3 // add the magic number. this results in bit 24 and bit 8 having results of "c >= 97" and "c >= 123" respectively.
bic r2, r2, r2, lsl 16 // combine those two bits with [bit 24] &&= ! [bit 8]
bxeq lr // if loaded byte was zero, return
push {r4, r5, lr} // r4 & r5 will now be clobbered
mov.w r4, #32 // keeping this constant in a register allows two instructions to fold into one
loop:
//lsr r2, r2, #19 // shift bit 24 down to bit 5 (=32)
//and r2, r2, #32 // filter off unwanted bits
and r2, r4, r2, lsr #19 // use of r4 to keep the constant 32 allows previous two instructions to become this one
sub r5, r1, r2 // subtract the result (0 or 32) from original character
ldrb r1, [r0, #1] // get next byte. The next instructions are repeats of the ones from before the loop.
strb r5, [r0], #1 // ((store the result back, post-incrementing the pointer))
orrs r2, r1, r1, lsl 16 // The reason for repeating them is to give the branch predictor warning of the direction
add r2, r2, r3 // that will be taken. Optional condition flag setting means that when it sees a conditional
bic r2, r2, r2, lsl 16 // branch, it can look at all the instructions in the pipeline and say "nothing is going to
bne loop // change this flag, so I will be going that way". Z flag is set by orrs, then used by bne.
pop {r4, r5, lr} // unclobber r4, r5 and return
as well as optimising the actual calculation, also optimised are the test and branch for the loop (enables branch prediction), and memory operations given distance from their registers use.
I have found that translating optimised ASM code back to C generates basically the same result but is more readable - but compilers won't produce the same code from C that is not written the same way. The compiler often clobbers its own registers or does things in weird, expensive ways... perhaps that's just with the arm-embedded compiler I'm used to though.
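For what it's worth, here is a rough C-style translation of the bit trick above (my own untested sketch, one byte at a time; the halfword duplication, magic add, and bit-clear map across almost line for line):

#include <cstdint>

// After the add, bit 24 means "c >= 'a'" and bit 8 means "c > 'z'"; the bit-clear
// keeps bit 24 only when both say "lowercase letter", and that bit becomes the 32.
std::uint8_t to_upper_bits(std::uint8_t c) {
    std::uint32_t x = c | (std::uint32_t(c) << 16);   // orrs r2, r1, r1, lsl 16
    x += 0x009F0085u;                                 // add r2, r2, r3 (the magic constant)
    x &= ~(x << 16);                                  // bic r2, r2, r2, lsl 16
    return std::uint8_t(c - ((x >> 19) & 32));        // subtract 32 only for 'a'..'z'
}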
Nice mate!! C++ compiler in MS does some strange things with the registers too! I guess it's almost impossible to allocate the best regs every time? Well cheers for sharing mate, and cheers for watching :)
Really interesting, optimization is often neglected
It's fun to look into Python for optimization. Sometimes the most readable code is only mediocre for timing, whereas really verbose code that could be 3 times as long and look like a foreign language could be three times as fast.
svnhddbst I think optimizing python code is more or less a pipe dream, considering anything highly performant should just be written in some other language. Writing python code that's short and simple is what makes it fun heh.
You had me at :"Assembly Code". Subscribed.
This is a great topic for driver, library, and bottleneck-algorithm implementers.
Idea zero: surface atomic conditionals, e.g. "mov if", from machine code up to the high-level language (yes, "Branchless C/C++", "Branchless Python", etc.)
If the compiler (and libraries) aren't implementing branchless methods, you may simply be optimizing one engine on a 10-engine supertanker.
Branchless compilers would deprecate (warn/error) branching where a branchless construct will do.
If your compiler pushes you to code branchless, and the Standard Template Library and other supporting libraries are branchless, then all the code will be pulling to the same drumbeat.
With or without a "branchless compiler":
I'd never code this way for the prototype. I don't like the idea of "try it and THEN see how much it helps".
I prefer to respond to ALARMS, or to a post-prototype review of the code with a "branchless" eye.
Your code profiler should be able to let you know if the cache is being blown and reloaded too often.
If it doesn't, that's a market opportunity.
“99% the compiler is smarter“
Yes, the compiler tries to be, but the main problem with compiler optimizations is variance, and variance that depends on stuff out of your control.
And there IS a lot of variance, probably more than you would think.
Watch the talk “Performance Matters” by Emery Berger; it goes in-depth on this and is a great talk!
As a general rule of thumb, compiler optimizations are nice to have, but you should not rely on them to make your code fast.
Branchless programming is really valuable if you explicitly know what you are doing. Otherwise it probably won't matter for you (for simpler stuff)
Ps. Thanks for the more advanced topics, great video as always 😀
We need some compiler tutorials then so we know how to use them optimally.
"You should not rely on the compiler to make your code fast." Do you compile all your code with - O0?
Meanwhile in reality:
Geek A: Look, I sped this code up by 200 times!
(Already disappointed by life) geek B: How long did it take to code?
A: Only 3 days!
B: What's it gonna save?
A: 200 times!
B: No, in computing time. How many hours of computing time?
A: Over the whole lifetime of this program (maths) ... maybe 10 seconds?
And yes, I've been both A and B in my career.
"trying to out code the compiler" is one of the most badass things I've ever heard haha
*Programmer:* I'm going to code this in assembly.
*C++ Compiler:* So, you have chosen inefficiency...
For the branchless toUpper, can't you do something like d[i] -= (d[i] >= 'a' && d[i] <= 'z') * 32; ?
was about to comment the same but I guess something is coming up
Thanks for showing this! I now use this every day in my game in C and it's a great extra bit of fun problem solving!
Dear teacher,
Respect! Awesome delivery.
I teach microprocessors and would argue that the overhead of this style doesn't justify the small performance loss it avoids. The Pentium introduced branch prediction and later processors made it even better; modern branch prediction algorithms reach about 98% accuracy. So all we lose is basically the flushing caused by the 2% of mispredicted, rogue branches. Is that worth a whole new way of writing code? To be honest, the first example with the smaller number would actually take more time to execute due to the multiplications involved. I teach ARM7 too, and it incorporates conditional execution along with every instruction, not just the branch instructions: we have conditional mov, conditional add, etc. It does make things more efficient, no doubt, but unless there are a lot of one-off branches that keep failing prediction, the effort doesn't pay off against the small efficiency gain. Hope I've made my point clear.
Nonetheless, taking nothing away from your awesome work. Keep them coming. Best wishes, always.
I'm a long time programming practitioner, and I loved it. I'm at the other end of the chain, if you will: Python, JS, various DBs and assorted web technologies, but this is really interesting. Haven't touched x86 assembler since the '90s, so that was interesting. Loved the SIMD stuff. That's all new.
Liked his points about premature optimization. The keyword is *premature*. You don't ignore optimization, but be wary of it
Nice to find out that someone is still interested in performance. I started serious programming in the '80s and did lots of embedded assembler programming in the '90s. Computers are fast today, but it still feels like a crime to write slow code. I have given up assembler for compatibility and readability reasons, but with time-critical code I always check the compiled results and run performance tests.
Parallel pipelined execution is so complex nowadays that sometimes it takes a long time to understand why some implementation runs as fast as it does.
Sadly, web-based techniques and scripting languages are spreading into areas they shouldn't. For many modern programmers, any operation is fast if it takes less than a second, even if, done right, it could be at the nanosecond level.
Optimizing isn't always worth the increased complexity, so it's often a tradeoff. Benchmarking seems to be the way to go; I need to learn how to do that.
@@ImperatorZed I'm not talking about hacker optimizations but about higher-level architectural choices.
The web technologies I mentioned are extremely slow compared to natively executed code that is tailored to its purpose. They also create hidden complexity, which makes the code unreliable.
Applications are being made from complex ready-made components whose functionality typically doesn't fit the purpose well. They bring a huge number of new failure paths, which are mostly unhandled, or which make proper error handling extremely complex. Programming errors are typically caught as runtime exceptions.
Components have a complex dependency network on other components, which are constantly upgraded, partly because of security vulnerabilities in functions that are not even needed in most cases. The full system is never tested, and applications work only on the happy-day-scenario principle. Often the developer's answer to a problem is "try another browser" or "it works on my machine".
So, my point is, mostly self-written functional code is several orders of magnitude simpler and faster when written in traditional C or C++ compared to current web technologies.
The evil lies in how easy it is to use those complex components and how difficult it is to take the simple approach. You use what you've got, and that works well enough here.