A cool project would be to set up a git repo for everyone to check in their implementations of this algorithm in their favourite language. Then set up a CI pipeline to run them every time someone commits an optimisation. Chart the results.
People will write platform-specific assembly code in C++ to optimize the code. But then it doesn't work on another hardware platform... so it's kinda useless. x86 has a lot of very specialized operations that can speed things up hundreds of times. But it will not work on another CPU.
@@HermanWillems does that make it any less interesting of an exercise? It might make it even more fragmented and difficult in being able to set up appropriate CI pipelines, but I still think it’d be interesting. Aside from that, you could forbid dropping down to assembly if you felt that to be cheating.
The C++ STL does actually have a bit array! It is just unfortunately called std::vector<bool>. Seriously, the standard says this specialization of vector should be implemented as an array of bits!
@@vaualbus this opinion is so tired and old. I work in high performance systems, and we don't avoid the std. More often than not it's better than developing your own solutions or pulling in third party code. We have a very small selection of custom containers and algorithms and use the std mostly everywhere else. Vendors these days are pretty good at keeping things lean.
When writing tight loops in Python, you have to remember two things about the language: 1. Attribute lookups and variable lookups not in the local namespace are slow. 2. Calling functions is slow. Thus I was able to speed up the Python version in the repo from 39 iterations for limit 1_000_000 to ~150 iterations just by inlining the code from GetBits/ClearBits and creating a reference for this.rawbits and this.sieveSize in local variables (and by eliminating the superfluous check for index%2 in the inner loop). This speedup is achieved without any optimizations to the algorithm.
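A minimal sketch of the two points above (function calls and attribute lookups in a tight loop), for anyone who wants to try it. The class and method names here are hypothetical stand-ins, not the code from the repo, and both variants start clearing at factor * factor, which is an assumption made only to keep the sketch short. The two methods run the same sieve; the second one inlines the helpers and caches the attributes in local variables.

```python
import math


class Sieve:
    """Hypothetical stand-in for the repo's class; all names are illustrative."""

    def __init__(self, limit):
        self.limit = limit
        # One flag per odd number: slot i stands for 2*i + 1.
        self.bits = [True] * ((limit + 1) // 2)

    def get_bit(self, n):
        return self.bits[n // 2]

    def clear_bit(self, n):
        self.bits[n // 2] = False

    def run_with_calls(self):
        """Method call plus attribute lookup on every inner-loop step."""
        factor = 3
        while factor <= math.isqrt(self.limit):
            if self.get_bit(factor):
                for n in range(factor * factor, self.limit + 1, 2 * factor):
                    self.clear_bit(n)
            factor += 2

    def run_inlined(self):
        """Same algorithm, helpers inlined and attributes cached in locals."""
        bits = self.bits
        limit = self.limit
        root = math.isqrt(limit)
        factor = 3
        while factor <= root:
            if bits[factor // 2]:
                for n in range(factor * factor, limit + 1, 2 * factor):
                    bits[n // 2] = False
            factor += 2

    def count_primes(self):
        # Slot 0 stands for 1 (not prime) but is never cleared, so it
        # conveniently accounts for the prime 2, which has no slot.
        return sum(self.bits) if self.limit >= 2 else 0
```

Timing both methods with the standard timeit module on your own machine should show the gap; the exact ratio will depend on the interpreter version.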
Spot on - generating the function preamble is maybe 20 cycles or more (on x86 anyway). Also - keep loop conditions as simple as possible (check line 139 of the CPP). But the point of the video really is that in Python or any interpreted language - writing lazy code will cost you exponentially more execution time
@@aouerfelli hi there - when I say "lazy code", I mean, not taking the time to do things properly and obey the basic rules of good programming practice. Be nice to the compiler (or interpreter) and it will be nice to you :) As for "exponentially" - I was referring to the inherent overhead of Python being an interpreted language coupled with "lazy code", will make your code exponentially slower. The trade off here is supposed to be readability and ease of use - but it comes at a VERY large cost.
Oh and btw - for line 139 - implementing that as a "do while" rather than a "while" and performing the calculation in the loop would give the compiler a better chance of optimizing it .... Which leads me to a question - what compiler optimization level were you using Dave? Could make a huge difference....
Reminds me of a similar comparison that Google did a decade ago. It got kind of ridiculous when the Java engineers went "We can do better than 30% of C performance, we just need to hand-tune the VM and allocation settings!"
I noticed in his comments that Dave's reasoning for using sys.stdout was to avoid printing new lines with Python's built-in print function, but there's actually an easier way to do it. When calling the print function you can customize the line ending with the end parameter, e.g. end="whatever you want as the line ending". That way, your line ending could be a comma followed by a space, or whatever else you needed. Other than that small tidbit which I came across by chance, awesome job as always Dave.
We are about the same age; I have enjoyed many of your videos because the products were so important in my career. It's good to put a human face on the digital world and see a programmer who worked on the product.
Massive respect for the systematic and clear approach to this comparison (the experience you gathered over the years is very clearly showing in the methodology and explanation). Instant subscribe! Thanks for this and keep up the great work!
ikr, I was looking at that and thought "is this really python? this guy probably doesn't even know that there is a built-in function to get the length of an array!" Actually, while learning C++ arrays, I had to look up how to get a length of an array, and was unpleasantly surprised to find that there is no built-in way to do that.
@@bluesillybeard std::array has both size() and max_size() as well as the ability to call std::size(array). std::size also works on old-style C arrays (e.g. int x[]), which I assume is what you were looking at, but std::size only entered the standard fairly recently (C++17). The tricky thing about C++ is that there are a lot of out-of-date answers that say it can't do things, or has to do things in gross ways, when there are much better ways of doing it now.
His HS CS class: Algorithm optimization competition with classmates My HS CS class: Creating seemingly never ending popup dialogs that ultimately climax with "You Suck"
It was Visual Basic. The class was called "Computer Applications 2" (Apps 1 was Excel, Access and PP). The teacher had learned that in theory every shuffle of the old card game Freecell is beatable. So she was on a mission to beat all 62,000 (or whatever it is) possible games. Basically we just learned from our books. We would quickly make whatever Fahrenheit to Celsius converter, or pizza topping chooser thing we had to do for class. Then we'd just screw around and do dumb stuff.
About the only thing I remember of my hs prog class is that we used 15 year old apple ][ computers. The teacher may've been good at geometry but not so much at programming. His robotics class was more interesting but he still caused people to drop the class on day 1 😣
@@DavesGarage my buddy Ian McCormick just called me and told me I had to check out this video. Very cool but maybe we can throw Golang and/or Rust into the mix. I'm incredibly interested to know what an OG C engineer thinks of Go vs Rust (vs C and/or C++)
As someone who gets pumped following along to a free code academy tutorial video for Python, I am awestruck by this person's career and his ability to explain it to someone like myself. Keep rocking it!
Beyond the "Hello World" program in C64 basic 30+ years ago, I'm not a coder. So it's a testament to your presentation style that I can more or less follow what you're doing, and enjoy watching the show. Keep it up!
I started out with Atari Basic. It actually wasn't an interpreted language but would compile to P-Code when a line was entered and when run, it ran the P-Code with a software engine.
@@bjbell52 Well, that's how most interpreters at that time were implemented. Even if the program was parsed during input and stored as p-code to save memory, the p-code was still interpreted at runtime but not compiled. Was "Atari-Basic" derived from Extended Microsoft Basic? Well, that's the one I started with in the mid 80's in former East Germany on a KC85/2.
@@Merilix2 Sorry, I hit Enter by mistake so I'll try again. Atari Basic was NOT derived from M.S. Basic but from Cromemco 16K Structured BASIC. The three other home computers in the U.S. at the time used a version of M.S. Basic and NO, they did NOT compile to p-code and HAD to be interpreted at runtime, unlike Atari Basic. So, Atari Basic should have been the fastest of the two Basics since it was pre-compiled, right? Nope: it was the slowest. Why? Because the writers were told they had to do a few things to their version of Basic. 1) It had to check the syntax of a statement when one is ENTERED. You claimed that all the other Basics precompiled, but OBVIOUSLY M.S. Basic didn't. One could write the following line of code: 200 IF X=Y THEN PINT "X & Y ARE EQUAL" M.S. Basic would allow that line to be entered and would NOT find the syntax error until the line is executed (showing that M.S. Basic was NOT pre-compiled). That means the error would show up ONLY if X=Y. Atari Basic would have rejected the line, highlighting the word "PINT" to show where the syntax error was. 2) They had to add graphic commands like PLOT, DRAWTO, LOCATE, POSITION, COLOR to the language. 3) The Basic had to fit inside an 8K cartridge. They bought the source code to M.S. Basic but couldn't do those 3 things. They gave it to another company but they couldn't make M.S. Basic fit into an 8K cartridge and do those 3 things. So they wrote a new Basic for it. So why did it end up so slow? Because in order to fit into an 8K cartridge they didn't write a math package for it. Instead they used the one contained in the O.S. But that package wasn't intended to be used in a language but only to do a few things the O.S. needed to do in BCD. The person writing it had never written a math package before and didn't optimize it. The authors of Atari Basic ended up releasing the source code. Someone wrote a new Basic for it named Turbo Basic (using the source code (I believe)) but replaced the slow BCD package with an optimized one.
I tried a few benchmarks with the original Atari Basic against Turbo Basic and Turbo Basic won every time - sometimes running 3 or 4 times faster than Atari Basic. For the record, some people kept crying that it wasn't M.S. Basic like the other computers. Eventually they did come out with an M.S. Basic for the Atari (I have it) but I never used it that much, partly because Atari Basic was good enough for my needs AND a new language named ACTION came out that was easy to learn and use and compiled to machine code, making it much faster.
@@bjbell52 No, I didn't claim all other Basics were precompiled. I said they were parsed and stored as p-code. By p-code I meant code lines became a kind of linked list and keywords were turned into a short one-byte value with bit 7 set so that they didn't have to be parsed again. But that p-code still had to be interpreted. That was almost the same as what Atari Basic did. One exception: it seems like Atari Basic also converted (tokenized) constant numbers during input, which MS Basic didn't. By the way: your 'PINT "X"' would become <PI>NT "X" on my Basic, as <PI> is the token for the constant PI ;) But yes, M.S. Basic was hard to fit into 8k. On the early KC85/2 machine it had to be loaded from cassette tape and took about 10k of the 16k available RAM. The 85/3 had a switchable ROM instead.
There are optimizations you can do even on assembly level. Like using fancy vector instructions and such. I once optimized a piece of C++ code with some embedded assembly instructions to gain 10x performance just using MMX on a Pentium chip. Usually the hard part is identifying which 0.1% of code really needs to be optimized.
I’ve had one embedded application that needed to be 100% rewritten in assembly language. The compiled C code was about six times as large as the assembly language version, and the assembly language version used all but 15 bytes of 60k of flash memory. No larger memory versions of the processor were available. The C code was not anywhere close to fitting in the given processor. I think it took me about a month and a half to write the whole application in assembly.
That can make a significant difference on the larger tests. Take, for example, factor = 709. Using factor * 3 means you’ll start at 2,127, while factor^2 will start at 502,681, removing 353 unnecessary calls to test/clear that bit from that innermost loop. And that’s testing just one of the 168 prime factors less than 1,000 (which is what would be used looking for primes less than 1M). The number of unnecessary loop iterations skipped by his change grows quickly with the upper limit of primes you’re looking for. It will be almost unmeasurable on the smaller tests, such as primes up to 10,000.
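For anyone who wants to check the arithmetic for factor = 709, here is a small sketch. It assumes the inner loop steps by 2 * factor (odd multiples only) and stops below a limit of 1,000,000; the function name is made up for illustration.

```python
def inner_loop_steps(factor, start, limit):
    """Count inner-loop iterations from `start` up to (not including) `limit`,
    stepping by 2 * factor so that only odd multiples are visited."""
    return len(range(start, limit, 2 * factor))


FACTOR = 709
LIMIT = 1_000_000

steps_from_triple = inner_loop_steps(FACTOR, 3 * FACTOR, LIMIT)
steps_from_square = inner_loop_steps(FACTOR, FACTOR * FACTOR, LIMIT)
steps_saved = steps_from_triple - steps_from_square

print(f"start at factor * 3 = {3 * FACTOR}: {steps_from_triple} steps")
print(f"start at factor ** 2 = {FACTOR * FACTOR}: {steps_from_square} steps")
print(f"steps saved: {steps_saved}")
```

The saving for a single factor does not depend on the limit (as long as the limit is above factor squared); the total saving grows because larger limits bring in more and larger factors.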
Additionally, above factor = 3, you can instead of 'for(int num=factor; num < sieveSize; num++)' write 'num += 6', and have two if-statements, one for "getBit(num - 1)" and second for "getBit(num + 1)", since all primes except 2 & 3 can be written as "p = 6k +- 1"
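A rough sketch of the 6k ± 1 idea, written in Python rather than against the C++ loop quoted above. The helper names are hypothetical; the generator only produces candidates, and a plain trial-division check is included just to confirm that no prime above 3 is missed.

```python
def candidates(limit):
    """Yield 2, 3, then every number of the form 6k - 1 or 6k + 1 up to limit."""
    for small in (2, 3):
        if small <= limit:
            yield small
    k6 = 6
    while k6 - 1 <= limit:
        yield k6 - 1
        if k6 + 1 <= limit:
            yield k6 + 1
        k6 += 6


def is_prime(n):
    """Plain trial division, used here only to check the candidates."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


primes_via_candidates = [n for n in candidates(100) if is_prime(n)]
primes_via_full_scan = [n for n in range(101) if is_prime(n)]
```

Only two out of every six numbers are visited, so the scan for the next factor touches a third of the range instead of all of it (or half of it, if evens were already skipped).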
Although there were no surprises, it is a great video. Many Python programmers know that the best way to achieve performance in Python is not using Python. This means that you should do most of the computation by calling optimized C libraries like numpy, tensorflow, sklearn.
@@retrolabo yes it's still Python, but C/C++ will do the heavy work for it; Python just needs a wrapper to call them, and most languages can make a wrapper to call C/C++ functions
@@nguyentranminhquang2861 yes this is what I mean by glue language. This is a strange distinction: think about it "print" in python is written in C in the interpreter, does this make the print function not python? :)
I loved your charismatic way of explaining the technical details. I was able to pull out a slightly more efficient (codewise) Python version while using the same algorithm you taught us, but still, the performance gap is astounding. I see there are many people here in the comment section sharing their viewpoint about a better Python implementation as if Dave was trying to undervalue Python. Remember, this is a kind of "synthetic test", and depending on the context it might be more reasonable to use one or the other of the languages.
In C++ you have the bitset class and can create a bitset of 1 million bits. And the other even faster method and more memory efficient is to create a vector<bool>. C++ is king.
@@jplflyer Nope. STL defines a special case just for vector<bool>. Sometimes you really want a vector of actual bool type values, whether that's a byte or the same size as int, but good luck defining it due to this special case. What if you want to define a reference to some Nth bool in a vector? There is no reference to a bit in C++.
Since you asked, here are some things about your Python code that I haven't seen commented here yet: • print() has a keyword argument for specifying the character(s) added to the end of the line, so for example you can do print("Hello, world", end="") to avoid printing a newline at the end of the line. • From what I've seen, it's convention to use all-lowercase snake_case for variable and function names, and CamelCase for class names (though the stdlib doesn't always respect that). • You don't need the parentheses on your if statements.
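A small sketch of the print() point, assuming you want a comma-separated list on one line. The function name is made up, and the output goes to an in-memory buffer only so the result is easy to inspect; drop the file argument to print to the console.

```python
import io


def render_primes(primes):
    """Build one comma-separated line using print's `end` keyword argument."""
    buffer = io.StringIO()
    for prime in primes:
        # end=", " replaces the default newline after each value.
        print(prime, end=", ", file=buffer)
    # A final print with the default end terminates the line.
    print("done", file=buffer)
    return buffer.getvalue()


line = render_primes([2, 3, 5, 7])
```

Here `line` holds "2, 3, 5, 7, done" followed by a single newline. Note also the snake_case function name and the if-free, parenthesis-free style mentioned above.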
Thank you for taking me back to the Commodore Pet. Our highschool didn’t have any computers when I started, but in the 4th year they started showing up in shops, and I was one of the “regulars” using the demo model. Eventually a TRS-80 model 1 was bought to help staff plan classes, (Dutch schooling allows you to select a subset of classes for 4th to 6th year, so they have to juggle schedules to ensure optimal planning.) I was then the “gang-leader” of those allowed to use it for the rest of the year. Exciting times.
Respect, that's all I can say other than I am impressed. I also am very happy, C++ has been my go-to language for a long time. Keep these videos coming. The prime number program was one that we were required to code in my algorithms college course; another one that I found difficult was a program to estimate pi to a given decimal place (this was user defined at runtime). Ugh, we were forced to write that one in, of all languages, BASIC. Yes, the professor would allow only BASIC for that one. I had a sadistic algorithms professor I suppose. He allowed me to use C for the prime number program though, so can't complain too much. Love the video and will be watching your others in the future.
Really great channel and video - and fun presentation. And a fair language comparison, I'd say. We've been developing in straight C since the late 80's ... and have "secure/portable" libraries for just about everything. The modern compilers are amazing at yielding reasonable machine code (no need to spend so much time on assembly, aside for maybe some hardware specialties). We wrap our C with whatever language does the job for the client or product development. Some of those higher-level languages might be useful these days for people learning how coding works ... kind of like assembly language and BASIC may have been for us. So great finding your channel ... subscribed and telling friends ... cheers ...
Thank you for showing me a baseline - what a really passionate programmer looks like. Huge difference from many other YouTube participants. No society related talks. Just the code and figuring out things using code.
As a retired CPU designer, I am constantly surprised by the "discovery" that interpreted languages (even those that use a JIT) are so much slower than optimized C or even assembly. There is little appreciation for the massive overhead of many of these script-like languages. As a demonstration to convince a software developer that we could run their massive program on a $35 compute module I recoded their most critical routine in assembly (60 instructions long) and showed that their entire system ran with less than 10% of a very cheap machine rather than 40% of a Mac. The real nightmare, however, is the strato-layering of "packages" one on top of another for minimal additional functionality but a perceived decrease in design time. These chew up CPU cycles in massive overhead, damaging the responsiveness and size of the code generated. As CS schools have stopped teaching even the rudiments of computer architecture this is not likely to change. Great for CPU producers, but a massive waste in time, power, and cost.
@@albertmagician8613 Yeah, but I could not force myself to always work off a stack.... RPN on an HP calculator was great but for general purpose programming?!? I think it would be a great video to show how to profile some code and optimize the slowest pieces for better performance. That's effectively what I ended up doing and it makes a huge difference for any sort of interactive code, such as CAD tools.
Why do you think CS programs have stopped teaching such things? I graduated relatively recently and my average rated US school required two courses on embedded design, one on assembly language, and one computer design which focused on the specifics of logic gates, ALUs, etc. I suspect you're trying to strengthen your position by creating a strawman argument. I think at best you could demonstrate that the worst programs don't teach these things but that was likely always true.
@@nonconsensualopinion I have no need of a strawman as I'm not arguing, but stating an observation. Some programs do indeed require traditional logic and some architecture classes, but increasingly those are being phased out as more emphasis is placed on web programming and large variety of dialects available for layering systems. I could give a list of such programs but it's not my point. Deep layering and package-based design are making systems increasingly inefficient. The good news is that I think we're headed for more actions like the one I stated in my original post, but slowly. I also think it's a massive opportunity as server and system costs can be reduced through breaking down software layers.
Some comments on the Cpp version: 1. There's a native bit array, it's just called "std::vector<bool>". It's slightly horrible that it's a specialization of vector, but it does work and does what you expect. 2. The std::chrono default aliases (seconds, milliseconds etc) are all specified with integers as the base type, for compile-time safety. If you want fractional time amounts, you can define your own alias, e.g. using dseconds = std::chrono::duration<double>; And now you can duration_cast to that, without having to fiddle with getting it in microseconds and then dividing by a million to get more resolution.
If only I could be such an expert... I'm just blown away. With coding each time I stop and think I'm somewhat not bad at it, then each and every time it is happening, I see someone who makes me feel like I know nothing...
For python, what I managed to catch: self instead of this, index//2 instead of int(index/2), use if __name__ == '__main__' for mains, so in case you import the code elsewhere it doesn't run the main, you do not need to retype something as a string in a print, also prettier way is to use f strings, you can use **.5 instead of sqrt, but this is more optional, than previous tips. Also for the sieve, you can actually start with square multiple, instead of 3rd multiple. That was all I caught, happy coding
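A quick sketch of a few of those tips put together. The function names and the report format are made up for illustration and are not taken from the repo.

```python
def half_index(index):
    """Integer division stays in integer arithmetic; int(index / 2) goes via float."""
    return index // 2


def square_root(value):
    """The ** 0.5 form, as an alternative to math.sqrt."""
    return value ** 0.5


def report(limit, count, seconds):
    """f-strings format numbers directly, no str() conversion needed."""
    return f"Limit: {limit}, primes: {count}, time: {seconds:.2f}s"


def main():
    print(report(10, 4, 0.5))


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```

One practical reason to prefer // is that int(index / 2) passes through a float, which loses precision for very large integers, while // never does.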
There are also a number of loops that could be optimized. Generally speaking, in Python for-loops are faster than while-loops, and often more so, list comprehensions are even faster than for-loops. In code like this with possibly many iterations you can often see a very significant performance (and sometimes memory usage) improvement when switching to list comprehensions. Plus when you are used to using them list comprehensions are often more readable. You can go even further with the functools module and using "pure" functions that avoid side-effects (save calls to print or log to places outside the pure functions). This approach can lead to much more readable and succinct code.
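As a sketch of the while-loop versus list-comprehension point, here is one place it could apply: collecting the primes out of a list of flags once the sieve is done. The flag layout (index n is True when n is prime) is an assumption made for this example, and no timing numbers are claimed; use timeit to measure on your own machine.

```python
def primes_with_while(flags):
    """Index-driven while loop: explicit counter, bounds check and append."""
    primes = []
    index = 0
    while index < len(flags):
        if flags[index]:
            primes.append(index)
        index += 1
    return primes


def primes_with_comprehension(flags):
    """Same result; the iteration and the list building happen inside the interpreter's loop machinery."""
    return [index for index, is_prime in enumerate(flags) if is_prime]


# flags[n] is True when n is prime, for n in 0..7 (illustrative data).
example_flags = [False, False, True, True, False, True, False, True]
```

Both functions return the same list for the same flags, so swapping one for the other does not change the program's output.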
Great video! Just want to point out that most python devs wouldn’t try to do something like that in native python. We’d either use a library like numpy or build one in c. Python is really more of a scripting language, I use it to call other bits of code in a readable sort of way. Once you learn/build the main packages that are relevant to your job, it’s crazy how fast you can push out code, which might not be 100% as fast as C, but it’s fast enough and very easy to maintain.
fast enough is kind of relative, isn't it ? I mean, his code ran close to ten thousand times faster in C++/64 compared to Python 😂 I remember writing a small snippet of code a few weeks back just to count from 1 to a billion I think on C, Python, VB, and MATLAB, and I remember the code was much, much slower in Python than C, though not that much slower if memory serves.
@@MoodyG That's the thing. In real life there would never be a requirement to "count to 1 billion". But I would create a dataframe in Pandas that contained the positive integers from 1 to 1 x 10^9. If you're doing that kind of thing in native Python, you're doing it wrong.
@@tomwalker996 I think you totally missed the point. Of course you wouldn't wanna count to a billion in real life. It just serves as a simple example to illustrate the comparative difference in speed between different languages. Counting up to a billion is way, way more trivial a task than what you'd actually wanna do in real life. An actual useful code for some real-life application may very well end up eating through computations many orders of magnitude more than a billion primitive addition operations.
This was wild, and mind blowing. I correctly guessed the order, but the speed of the code is what really floored me. Thanks for the demo, walking us through it briefly, and for displaying the code as well.
In C#, have you compiled it in release mode? For me, this increased the benchmark value from ~1900 to ~5000. Also, updating it to .NET 5 improved the performance on my machine by another 2% to ~5100.
Yeah it looks like it's set to Release mode; I was looking for the same thing. I got similar results switching from Debug to Release, plus the X2 performance from 32bit to 64bit. Very eye opening.
@@vonBlankenburgLP Interesting, I wonder what's making the significant difference for me. In release mode 64-bit I'm getting around 4900 passes every run, but it then drops down to 2800 in 32-bit very consistently. I'm on an 8700K clocked at 4.4 GHz running Windows 10.
Can confirm, I'm getting ~5300 when running the C# code in release mode on my 3700X, which in theory should be slower than Dave's 3970X. The C++ code is reasonably close with ~7300. Seems to me that something is wrong with his C# test.
@@moki5796 My guess is he's a C++ guy?😋 just kidding. Interesting though, but we all know C++ is faster sooo..... C# is my fav language, C++ is great too. I'm going to get hate for this but every time I use Python I just feel like something is missing and it bugs me but I can't put my finger on it. I get the popularity and it's used by around 8 million people, but I'm guessing at least 50% are noobs, whereas around 6 million use C# and I'm guessing less than 10% will be noobs... sooo... don't know, never tried IronPython, maybe that will pique/solidify my interest?
Python is "compiled" into bytecode too (usually saved as .pyc files), but the bytecode interpreter is really just that. So no overhead for parsing, but no optimizations either; that's why it's slow when doing computationally intensive tasks (for which oftentimes you just use optimized packages/libraries). There are versions of Python that JIT this bytecode to increase raw performance, e.g. PyPy. Running on my old laptop, I get 42 passes from your Python program when run in CPython (what most people use as "Python"). Simply switching the interpreter to PyPy I get 561 passes from the exact same source file.
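If you want to see that bytecode for yourself, the standard library's dis module will list it. A minimal sketch, assuming CPython; the snippet being compiled and the helper name are made up for illustration, and the exact opcode names vary between Python versions.

```python
import dis

SOURCE = "total = 0\nfor n in range(10):\n    total += n\n"


def compile_and_run(source):
    """Compile source to a code object, list its opcodes, then execute it."""
    code = compile(source, "<example>", "exec")
    opnames = [instruction.opname for instruction in dis.get_instructions(code)]
    namespace = {}
    exec(code, namespace)
    return code, opnames, namespace


code, opnames, namespace = compile_and_run(SOURCE)
print(f"{len(code.co_code)} bytes of bytecode, {len(opnames)} instructions")
print("total =", namespace["total"])
```

Calling dis.dis(code) instead prints the full human-readable listing, which makes it easy to see how many interpreter steps a single line of the inner loop turns into.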
Another reason why Python code is slower than C code is dynamic typing. Even when JIT compiling, the compiler needs to add checks that the types of the variables are still what the JIT compiler believes them to be. Same for array bounds checks.
@@mihiguy That's interesting. It seems that if we rewrite the Python code so it can be statically typed (e.g. compilers like Cython, Numba, etc. all implement their own type systems), it would remove all that.
@@knowlen For JavaScript, that is what asm.js is doing. Expect a restricted, strongly typed subset of JavaScript and compile directly to machine code. However, this has mostly been obsoleted by WebAssembly.
Wow! Loving C# more every day! Just looking at Python for a project using Raspberry Pi, and have to say, it's a dog. I'm about the same age as the presenter, so his language experience really resonated with me. Great video! Wish I could up-vote more than once. Thanks! 👍👏👏👏
For ex-BASIC coders like myself C# is an ideal halfway house to C++. Learn C# and learn PowerShell and you can address most programmatic needs. Never as fast as C++ of course, but a lot less cryptic, and most times it catches you when you fall and gives you a decent diagnostic message.
Excellent treatment of this topic. The only suggested follow up I would have is some discussion of why 64 bit turns out to be much faster than 32 bit and if the C# implementation uses 64 bit "under the hood". Keep them coming Dave, you've got a new fan!
I mean C++ technically has vector<bool> for dynamic bit arrays. I believe it uses size_t instead of 8-bit chunks, because it's made to dynamically change size even after being created. I know a lot of people don't like it in bigger codebases for various reasons, but in isolation it works fine just to access and change bits.
Wanted to suggest the same. It is actually not specified how it needs to be implemented, it is not even required that it allocates 1/8th of the memory of `vector`. And I agree having a specialization for `vector<bool>` is indeed a mess, since it is technically not fulfilling the requirements of `std::vector`, such as having the elements stored as if they were in a plain C array.
The x64 calling convention is much simpler .. first few args are passed in registers, akin to __fastcall in MS VC++. And there are a lot more registers, which means less juggling values to/from the stack frame, and it also allows for more aggressive function-inlining by the compiler. Also also, compiling for x64 is like an implicit hint to the compiler, "this is a modern CPU so you can use all the new SSE etc instructions.. don't have to worry about compat with a 20 year old Pentium"
Dave, your breakdown of how some of this works is very interesting to me. I'm a Server and Storage Architect and team manager, and I have a BS in Computer Information Systems. By the time I was learning in the mid to late 00's, they didn't teach about bit arrays or any of the way the assembly works; we just learned OO languages like Java, C#, J#, and of course I've done some Python dabbling since. I prefer C#, but getting a peek behind the covers from someone who understands the assembly behind the scenes was interesting.
Believe it or not, I wrote a prime number generator in dBase IV. It saved the prime numbers in the database. When I packed up at the end of the day, I left it running on the IBM AT overnight. Since the prime numbers were in a database, it could pick up where it left off the previous morning. The database was handy for factoring huge numbers.
I did some small tests. Changing to x64 native compilation yielded a 15% speedup for me. Upgrading to .NET5 gave a roughly 2x speedup. So with those results, C# is about 2x slower than optimized C++ in my quick tests. I assume it has to do with C# being unable to do as much aggressive inlining. That's what it seemed like from running the profiler. Turning off inlining in C++ dropped performance by half. I did bump the num primes calculated by x10, though, figuring it might help avoid the allocation overhead that C# has to pay since it's all heap allocated.
@@Spelter Just be sure to do some benchmarks yourself and not just rely on some random person on the internet. While these are the results I got, the results may vary for you.
honestly I really enjoyed this comparison. It was so in-depth and fun and I really liked the anecdote. Also, Dave's diction and oration skills are so good, like ??? the way he speaks is so engrossing **BIG shoutout to Dave's python code using this instead of self. If Guido didn't want us using the **correct** keyword for the current instance, he should not have made it customizable**
Enjoyed the story. I took some time to work on an implementation in R and got some pretty good performance out of it. Major take-away: understanding what is REALLY going on in a programming language can allow us to write clean code. As I tell my Grad students: first get it RIGHT, then make it BETTER.
You look like the Chuck Norris of coders. I am even more amazed by your clear and understandable explanation than by the performance differences. Would have been interesting to see the Java and JavaScript performance - just to finally end all those performance discussions.
And here I struggled with iterative loops and abstract classes within my 6 months C# crash course learning lol, the amount of knowledge you must possess is truly astonishing.
It's been years since I have written code, but having cut my teeth on C, I was curious. Your experiment did not disappoint. Recently, I have been playing with ARM, so when you mentioned the M1, it piqued my interest. If you run it, I will watch.
Hey man, you could use the numpy lib for Python to run functions like "sqrt" even faster. Python wasn't really made with speed in mind but numpy was. Many of numpy's functions were written in C to make it performant. I would like to think that's the reason why Python is used in Machine Learning - numpy + other C libs for Python (speed) + Python (minimal code, readability). Loved your video!
Using Python's abstractions tends to be faster than trying more low-level optimizations. That's because for many of these Python is able to optimize stuff using lower level tools that are not accessible to the user, like C bindings. JS also performs quite a lot of optimizations at the engine level. If you could somehow implement the sieve using numeric methods, you could use something like Numpy and you'd be essentially running C behind the scenes.
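A sketch of what leaning on the built-in abstractions can look like without any third-party library: a bytearray with extended slice assignment, so that each "clear every multiple" pass runs inside the interpreter's C code instead of a Python-level loop. This is a different implementation from the one in the video, the function name is hypothetical, and it counts primes up to and including the limit.

```python
import math


def count_primes(limit):
    """Sieve of Eratosthenes using bytearray slice assignment."""
    if limit < 2:
        return 0
    flags = bytearray([1]) * (limit + 1)   # flags[n] == 1 means "n may be prime"
    flags[0] = flags[1] = 0
    for factor in range(2, math.isqrt(limit) + 1):
        if flags[factor]:
            start = factor * factor
            # Zero out start, start + factor, ... in a single slice assignment.
            # The right-hand side must have exactly as many bytes as the slice.
            multiples = len(range(start, limit + 1, factor))
            flags[start::factor] = bytes(multiples)
    return sum(flags)


print(count_primes(10))         # 2, 3, 5, 7
print(count_primes(1_000_000))
```

The same shape carries over to NumPy boolean arrays, where the slice assignment becomes `flags[start::factor] = False`.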
As a little advice : to compare a value with either 'True' or 'False', use 'is' instead of '=='. Python will compare the objects' id directly and it's the most optimized way to do it.
Yup, do not do this. Never compare against True or False using is, unless there are other "truthy" or "falsy" values that are anticipated, e.g. use of None as a placeholder, canary, or NULL equivalent. Identity comparisons may seem to operate the way you expect (True and False being literal singletons in CPython), however other situations will be far stranger. 27 is 27 → True. Some larger numbers that end up not being interned will not have this identity match, a similar problem to using this to compare strings for equality. Philip Münch has it right. Just use the "if" statement itself as the mechanism of casting and boolean comparison. The right tool for the job. Combined with "exit early" patterns, my special case of using None should be accounted for first, then the remainder of the function exited from if needed, so that the subsequent truthy or falsy comparison doesn't worry about None at all. In JS there is the paradigm of double-inversion to cast to a boolean, e.g. !!foo; Python simply does not have this silly need.
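A short sketch of the pattern described above: handle None first with an identity check, then let the if statement do the truthiness test. The function name and return strings are made up for illustration.

```python
def classify(value):
    """Exit early on None, then rely on plain truthiness."""
    if value is None:        # identity check is appropriate for the None singleton
        return "missing"
    if value:                # no '== True' and no 'is True' needed
        return "truthy"
    return "falsy"


one = 1
equal_to_true = (one == True)       # equality: 1 compares equal to True
identical_to_true = (one is True)   # identity: 1 is not the True object

# Two equal integers built at runtime need not be the same object, which is
# why 'is' is unreliable for comparing numbers or strings.
big_a = int("100000")
big_b = int("100000")
same_value = (big_a == big_b)
```

Whether `big_a is big_b` holds is an implementation detail of the interpreter, so code should never depend on it.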
What a cool comparison! I learned something too. I recently started learning Julia, so I copied the program more or less to test the advertised speed of Julia. For 1 million upper limit, I got 4228 passes in 10 seconds, an average of 2.3 milliseconds. I was quite astonished! This is comparable to C# and C++.
I like benchmarks too. I converted the C# version to PHP. CPU: i7-2600k - 5sec of iterations Python 3.8.5: 23 passes PHP v7.4.3: 50 passes C# .NETCore 3.1 compiled with VS 2019: 1623 passes
@@JannisAdmek Has anyone tried it in Rust? Seems to be the language everyone wants to talk about these days. Be interesting to see what, if any, hit there is. Also be more interested in time for n cycles rather than cycles in t time.
I'm curious about why there's a count of 1 prime for the limit of 10 in the historical list: surely there should be 4 (2, 3, 5, 7)? Or am I missing something?
Typo, never encountered because no one ran the test with a limit of 10. Or maybe it’s deliberate for testing that the validation code correctly flags errors. Test to 10 and it’s always “wrong”.
Have you considered doing a version of this video that uses Python, Python with Numpy, and Cython as well? Numpy and especially Cython can speed things up quite a bit. (Or at least that's what I recall from uni)
😅 that would be a bit of cheating. Especially Numpy just wraps python calls around highly optimized c or fortran code. So you would expect the c or c++ performance plus one function call with the associated type checks python does.
And Cython has it already in the name. It's just inline C in a python framework. Which is absolutely genius as you can have performance over flexibility where needed.
Omg... This is going to be soooo awesome 😎. I already decided my religious and no changeable attitude towards all these platforms, but damn it'll be fun to either be right or rectified :p More content like this... It's getting really geeky 👾
You should have tried Delphi (or the free Lazarus). I worked at a large bank and tried to get someone to look at Delphi but nobody did. We were a Windows shop. I was hired to rewrite their Paradox for DOS system but I tried to convince them how good Delphi was. Finally they decided after one of our best programmers had calculated he could write a trading application in VB in one year after we got a new department head who believed in free software and chose Java instead. It was way too slow and after 2 years his final work was rejected totally (and he had a 6 figure salary). I tried many times to explain that we were a Windows shop and should use Windows development tools but it fell on deaf ears. Our new boss decided to write everything else in PERL. Goodbye Windows GUI. Finally the good programmer convinced them to try C# and in our meeting I was told not to mention Delphi again because C# had so many great features. I fought off the urge to scream and explained to them that C# was Delphi with a C syntax that was designed and programmed by the same people who wrote Delphi and Delphi had all those great features many many years earlier. It didn't matter, the company went out of business a little later on.
How good it is to have a view of start to finish of this situation. I guess this must have gone back and forth for a couple, let me say about 3 years. I wish they had given you a chance and tried out Delphi, lol. But then, life goes on.
Hi Dave, I realized that your Python code style looks more like C++ than Python, which is fine, but since you asked for comments about your Python coding style I can provide some insight.
1. When printing something you can use formatting, because it simplifies things and allows for extra options. E.g. instead of writing print('Passes: ' + str(passes) + ', Time: ' + str(duration)) you can use formatting like print('Passes: {}, Time: {}'.format(passes, duration)), which is much clearer, or for Python >= 3.6 you can use f-strings like print(f'Passes: {passes}, Time: {duration}'). This allows for extra formatting options. For example, if you want to trim duration to 2 decimals you can write print(f'Passes: {passes}, Time: {duration:.2f}'), which is very clear and concise in code.
2. Instead of importing stdout to avoid printing newlines you can use Python's built-in print. The print function takes some extra keyword arguments like "sep" and "end". "sep" denotes the separator (default " ") and "end" denotes the end character (default "\n"). So in your case, instead of importing stdout, you can use print('.....', end='').
3. There is no need to wrap conditions in parentheses. E.g. if (condition1 == True) and (condition2 == True) can be written as if condition1 == True and condition2 == True, or if condition1 and condition2. You can always omit the == True if you know that the condition is a boolean, which you would expect it is, otherwise there would be some bug I guess.
And here are some coding style conventions (you can read more about it by googling pep8):
4. Function/method names and variables are by convention snake_case (except maybe cases with backwards compatibility).
5. Class names are by convention PascalCase.
That's all, keep up the good work, I really like your videos.
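Point 1 above can be sketched in a few lines. The helper name format_result and the sample numbers are made up for the illustration; only the f-string syntax itself comes from the comment.

```python
def format_result(passes: int, duration: float) -> str:
    """Build the result line with an f-string, trimming duration to 2 decimals."""
    return f"Passes: {passes}, Time: {duration:.2f}"


# Sample values are invented for the example.
print(format_result(23, 5.0042))  # prints "Passes: 23, Time: 5.00"
```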
That Python bar should be almost twice as thick. You're using the floating point division operator / with a conversion to integers every time you index. Try Python's integral division operator //.
Isn't floating point division quite a bit faster than integer division on modern x86 architecture? I vaguely remember something like 11 cycles vs 26 cycles on Skylake CPUs.
@@nahco3994 On my somewhat old Inspiron 3847, i5-4400 3.1GHz machine, replacing "int(index/2)" with "index//2" on lines 51 and 64 (of 0cb3ff5 commit to Primes/PrimeSievePY/PrimePY.py) resulted in average runtimes being improved from about 0.385 to about 0.254; the integral division version being about 1.52 times faster. (Windows 10 Pro, ~2 year old python 3.6 in cygwin). (Note Dave's github link in the description).
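A rough way to try this yourself, assuming nothing about the repo beyond the two expressions quoted above. The function names are hypothetical. For the index sizes a sieve uses, both forms give the same answer; for very large integers the float route also loses precision, which is a second reason to prefer //.

```python
import timeit


def half_via_float(index: int) -> int:
    """The original form: float division, then conversion back to int."""
    return int(index / 2)


def half_via_floor(index: int) -> int:
    """The suggested form: integer floor division."""
    return index // 2


# Same result for every non-negative index a 100,000-element sieve would use.
same_for_sieve_sizes = all(
    half_via_float(i) == half_via_floor(i) for i in range(100_001)
)

# Beyond 2**53 a float cannot hold every integer, so the two forms diverge.
big = 10**17 + 3
differs_for_big = half_via_float(big) != half_via_floor(big)

if __name__ == "__main__":
    for fn in (half_via_float, half_via_floor):
        seconds = timeit.timeit(lambda: fn(999_999), number=200_000)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

The timings will vary by machine and Python version, so treat the printed numbers as a local measurement only.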
I tried that and got an immediate 30% speedup. Eliminating the whole clearBit and getBit functions (directly accessing self.rawbits) and holding the whole list, not just the evens gave a 290% speedup. The problem is that the things that are good optimisations in a low level language may not be in a high level one.
And that goes up to 465% if you eliminate the class which is doing nothing useful. I'm sure it could be completely rewritten in a more pythonic way to be far faster, but interesting to just make some edits from the original program. That is still obviously nowhere near the C# time - but now more like a factor of 20 rather than 100.
Sometimes compilers and interpreters are surprising! I wrote a Monte Carlo simulation in VBA under Excel, coded as creating and destroying objects for each simulated item. Then I recoded it to do all the memory allocation and deallocation on a static array and it ran at near enough exactly the same speed! That OO stuff was implemented well!
@@stke1982 Well sometimes you have to do what the client wants! I’ve written MC simulations in a variety of languages from Fortran IV onwards! My point was that the hand crafted stack operation version was indistinguishable in performance from the OO version - to my surprise.
1st C++, 2nd C#, and then Python will be a distant 3rd. The performance delta between C++ and C# will likely be heavily influenced by how much garbage is generated that the GC feels the need to clean up. If you're very careful about your memory, then C# can get really close to C++. But if you write idiomatic C# (e.g. using Linq) then it's going to be well behind C++ but still a country mile ahead of Python.
Dr. Dobbs had an article where the guy boots DOS, runs a program, it takes something like a second, then runs it again, and it's 300x slower. Rebooting gets the 1 sec timing, rerun gets the 300x slower time. The author had no idea what caused it. This was on a 386. What DOS was doing was setting up the virtual to physical memory map on boot. It was a simple 1:1 map. Virtual block 1 is physical block 1. However, after you run something, the physical memory is freed, and the 1:1 mapping is changed. Virtual block 1 might now be physical block 7. The processor bit that maps virtual to physical is the TLB - the Translation Lookaside Buffer. It's a small table on the chip. When you ask for a virtual block that isn't listed in the TLB, there's a system interrupt, and the correct mapping is added. This happens because the TLB isn't big enough for even fairly modest sized programs. For speed, when a new entry for a virtual block is needed, it doesn't go just anywhere (like the least recently used spot), but instead, a hash function is used to figure out where to put it. If Function A and Function B happen to hash to the same TLB location, then you get a TLB interrupt when A calls B and when B returns to A. This is TLB thrashing. There's also a version with data addresses. And TLB thrashing can give you a factor of 300x slowdown. At least, that's the slowdown last I looked. On modern operating systems, the virtual to physical mapping is set up for every new process. You should get similar performance for similar input for the same code. However, large code, but especially large data, can suffer from this in wonky unpredictable ways. I'm unaware of a computer architecture where the TLB hashing function is public. But MIPS (1980s) had a code analysis tool that built a function calling tree and could rearrange your modules to avoid TLB thrashing for your code. I've not seen anything like this recently.
I really like your pointing out that the algorithm can greatly affect the run time. Back in the 8080 days, I had a TRS-80 computer. Basically a 1Mhz Z-80 chip. I wrote a poor algorithm in assembler and a smart one in interpreted BASIC. The BASIC code ran faster. I don't code much anymore. I'm pretty much maintaining an obsolete system, which will be retired soon, followed by me. Now, I mainly game on a console. I can feel my brain rotting.
Would be interesting to do this with C and gcc as well and see if the compiler makes a huge difference. Also keeping track of the results as a bitmap is a great space savings, but is it really a time saving? With a modern system you could store everything in a byte array, or even an integer array.
Yeah, the bitmap seems like a downgrade. On the 64-bit compiler, an int64 is probably the fastest array type (I'm guessing.) However, the point is to establish relative performance between different platforms, so perhaps super-optimization is less important than using the same code on multiple machines. Actually, come to think of it, that's the best possible reason to skip the bitmap array, since some platforms may have trouble with it.
This was an excellent video, and you are fantastic teacher. I use C# in my daily job, and for what we are doing, its performance is acceptable. But the capabilities of C++, as you have demonstrated, are amazing.
Cool! I got into coding on programmable calculators (TI-57). A buddy and I would compete to see who could write the same projects within the limited memory constraints of the calculator, and with the least number of steps. It was a great way to learn by trying different methods of coding to reduce the step count. Later I went on to become a telecommunications engineer and he went on to become a hard core programmer from assembly up. He would write code, then compile and link, in different languages, and benchmark sections of each by counting clock cycles, and would often rewrite the problem parts in assembly code to speed things up (he specialized in optimizing video drivers). These days I design and build test systems that have to communicate with other equipment that have different response characteristics so to help optimize the code I put timer marks that I can turn off and on at startup that help me find the bottlenecks. It would seem this is a lot of work for a little slice of time but since the code runs thousands and thousands of time every day, those little slices add up to a lot of time by the end of a year worth of testing. That said, your Software Drag Race video should be the poster child of software development for anybody running code in a high volume production environment. Thank you sir!
Amazing video. Btw would highly recommend taking a look at other python interpreters like pypy, at libraries like numba, or at cython. The last two options will impact the flexibility that made me love python but the results are totally worth it. Especially cython.
the only thing I would like to say in accordance with your request for information on better practices to use in your code would be to: "stop" + variable.doing + "this" with your strings. in both Python 3.6 and up as well as in C# you can prefix your strings to add variables directly into them. I do also believe it is a little faster execution wise, also being easier to read. The prefixes are as follows. for Python it is f or F, and in C# it is $. in C# it would look like this: Console.WriteLine($"Hello, {name}! Today is {date.DayOfWeek}, it's {date:HH:mm} now."); and in Python it is much the same, except that dates use strftime-style format codes: print(f"Hello, {name}! Today is {date:%A}, it's {date:%H:%M} now.") i am unsure if C++ has something like this though 🤷♂️ hope i was not being mean 😅
I created a similar program using assembly language on a ZX Spectrum with 48 kB of memory to see how fast it could fill the memory (up to about 260000), I was surprised it only took a few seconds. Note that this was almost 40 years ago!
Yep very interested to see it tested on other chips, especially the M1. Now I'm curious about how Rust will perform as well as something functional like F#
In Rust we will then write a module that will call special cpu specific optimisations and beat F# the fuck out of the water. That's what's so nice about system programming languages. They can go deep into the cpu while bytecode languages cannot do that so easily.
std::vector<bool> does pretty much exactly what you're trying to do with bit arrays. It will pack 8 values per byte and do some bit manipulation to access them as if it was a standard C array.
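The "8 values per byte plus some bit manipulation" idea can be sketched in Python as well. This is only an illustration of the packing technique, not how any particular C++ standard library implements it; the class name BitArray and its methods are hypothetical.

```python
class BitArray:
    """Fixed-size array of bits packed 8 per byte, all initially set."""

    def __init__(self, size: int):
        self.size = size
        # Round up to whole bytes; 0xFF sets all eight bits in each byte.
        self.data = bytearray(b"\xff") * ((size + 7) // 8)

    def get(self, index: int) -> bool:
        # index >> 3 selects the byte, index & 7 selects the bit inside it.
        return bool(self.data[index >> 3] & (1 << (index & 7)))

    def clear(self, index: int) -> None:
        # AND with a mask that has every bit set except the target bit.
        self.data[index >> 3] &= ~(1 << (index & 7)) & 0xFF


bits = BitArray(20)
bits.clear(9)
print(bits.get(8), bits.get(9), bits.get(10))  # True False True
print(len(bits.data))                          # 3 bytes hold 20 bits
```

The trade-off raised elsewhere in the thread applies here too: the packed form saves memory, but every access pays for a shift and a mask.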
You can optimize the c++ code even further by prepending ++ instead of appending it to iterators. This way an unnecessary copy creation will be avoided. Also building the program as a release build will give a performance boost. Instead of % 2 using & 1 may also result in a performance improvement since it only requires checking the least significant bit to know about evenness :)
When I got started in computing, the "drag race" would have been Assembler vs. FORTRAN vs. interpreted HP BASIC. For a one-off problem, BASIC was usually the correct choice, because the program could be written, debugged and would have solved the problem at hand before the other two had been debugged, even though it executed at an order of magnitude slower speed than the other two choices. In the real world of engineering as a profession, one's employer doesn't care how elegant the solution is. All they care about is how quickly you can produce the correct result with the tools at hand, and the total time to the solution includes how long it takes to set up the tools.
That's why most engineering coding is in Matlab with a mature 20y JIT compiler. Cross-checking the codegen C results within the Matlab environment is the key.
@""this" in python feels so wrong to me". You are in luck, you can use _any_ name for the class instance that you want to. 'this' is just a convention.
I am a beginner in Python. Pls explain to me then what "self" does mean in python? I did not find "this" keyword in python, what does it do? I found it in c#, Java, JS.
@@AIII_Brainwashed 'self' and 'this' are not keywords in Python; they are just names for the first parameter of a method, which refers to the instantiated object. 'self' is the standard convention in Python, but it can be any valid identifier you want.
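A tiny sketch of what that means in practice. The class Counter is invented for the example; the point is only that the first parameter of a method receives the instance, whatever it is called.

```python
import keyword


class Counter:
    # The first parameter is named "this" here purely to show it is allowed;
    # the usual Python convention would be "self".
    def __init__(this, start: int = 0):
        this.value = start

    def increment(this) -> int:
        this.value += 1
        return this.value


counter = Counter(5)
print(counter.increment())        # 6
print(keyword.iskeyword("self"))  # False: self is an ordinary name
```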
Best video I have watched in a long time.
Yes for sure, he brings some inspiration :-)
time to redo this
Bad@$$ Studio, 🎤 🎙 with your 🧰 Toolbox & LEDs Projects🎉❤
22:45 Please No More "FANCY " @$$ GRAPHS 😅😅
A cool project would be to set up a git repo for everyone to check in their implementations of this algorithm in their favourite language. Then set up a CI pipeline to run them every time someone commits an optimisation. Chart the results.
Nice! Reminds me of the fire routine speed challenge
That would be Hella cool
Time to rewrite everything in Rust again...
People will write platform specific assembly code in C++ to optimize the code. But then it doesn't work on another hardware platform..... so it's kinda useless. X86 has a lot of very specialized operations that can speed things up hundreds of times. But it will not work on another cpu.
@@HermanWillems does that make it any less interesting of an exercise? It might make it even more fragmented and difficult in being able to set up appropriate CI pipelines, but I still think it’d be interesting.
Aside from that, you could forbid dropping down to assembly if you felt that to be cheating.
The C++ STL does actually have a bit array! It is just unfortunately called std::vector<bool>. Seriously, the standard says this specialization of vector should be implemented as an array of bits!
all STL is just bad if you can avoid it you better do that.
@@vaualbus this opinion is so tired and old. I work in high performance systems, and we don't avoid the std. More often than not it's better than developing your own solutions or pulling in third party code. We have a very small selection of custom containers and algorithms and use the std mostly everywhere else. Vendors these days are pretty good at keeping things lean.
I was under the impression that the bool type was one byte
@@KleptomaniacJames It may, or may not, be, but here @juvenal1228 wanted a bit array, not a byte array!
Let us then use the RtlInitializeBitMap() and friends from ntdll !
When writing tight loops in Python, you have to remember two things about the language:
1. Attribute lookups and variable lookups not in the local namespace are slow.
2. Calling functions is slow.
Thus I was able to speed up the Python version in the repo from 39 iterations for limit 1_000_000 to ~150 iterations just by inlining the code from GetBits/ClearBits and creating a reference for this.rawbits and this.sieveSize in local variables (and by eliminating the superfluous check for index%2 in the inner loop).
This speedup is achieved without any optimizations to the algorithm.
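The two rules can be sketched like this. It is a simplified stand-in, not the repo's PrimePY.py: the class Sieve, its one-byte-per-number storage, and the method names are assumptions for the illustration. Both methods run the same algorithm; the second one inlines the accessors and keeps self.flags and self.limit in local variables.

```python
class Sieve:
    def __init__(self, limit: int):
        self.limit = limit
        self.flags = bytearray([1]) * (limit + 1)  # 1 = "may be prime"

    def get_bit(self, index: int) -> int:
        return self.flags[index]

    def clear_bit(self, index: int) -> None:
        self.flags[index] = 0

    def run_with_calls(self) -> None:
        """Method call plus attribute lookup on every inner-loop step."""
        factor = 3
        while factor * factor <= self.limit:
            if self.get_bit(factor):
                for num in range(factor * factor, self.limit + 1, factor * 2):
                    self.clear_bit(num)
            factor += 2

    def run_inlined(self) -> None:
        """Same algorithm with accessors inlined and attributes held locally."""
        flags = self.flags
        limit = self.limit
        factor = 3
        while factor * factor <= limit:
            if flags[factor]:
                for num in range(factor * factor, limit + 1, factor * 2):
                    flags[num] = 0
            factor += 2

    def count(self) -> int:
        # Count 2 plus every odd number still flagged.
        return 0 if self.limit < 2 else 1 + sum(self.flags[3::2])


a, b = Sieve(1000), Sieve(1000)
a.run_with_calls()
b.run_inlined()
print(a.count(), b.count())  # both 168
```

Timing the two methods with timeit on a larger limit is the way to see how much the calls and lookups cost on a given interpreter.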
Spot on - generating the function preamble is maybe 20 cycles or more (on x86 anyway)
Also - keep loop conditions as simple as possible (check line 139 of the CPP)
But the point of the video really is that in Python or any interpreted language - writing lazy code will cost you exponentially more execution time
@@mrroryc What is lazy code?
And, what do you mean by "exponentially"?
@@aouerfelli hi there - when I say "lazy code", I mean, not taking the time to do things properly and obey the basic rules of good programming practice. Be nice to the compiler (or interpreter) and it will be nice to you :)
As for "exponentially" - I was referring to the inherent overhead of Python being an interpreted language coupled with "lazy code", will make your code exponentially slower.
The trade off here is supposed to be readability and ease of use - but it comes at a VERY large cost.
Oh and btw - for line 139 - implementing that as a "do while" rather than a "while" and performing the calculation in the loop would give the compiler a better chance of optimizing it ....
Which leads me to a question - what compiler optimization level were you using Dave? Could make a huge difference....
Be wary that you may have different hardware which would impact your results.
Reminds me of a similar comparison that Google did a decade ago. It got kind of ridiculous when the Java engineers went "We can do better than 30% of C performance, we just need to hand-tune the VM and allocation settings!"
I noticed in his comments that Dave's reasoning for using stdout was to avoid printing new lines with Python's built-in print function, but there's actually an easier way to do it.
When calling the print function you can customize the end-of-line character by passing end="whatever you want as the line ending" as a keyword argument.
That way, your endline character could be a comma followed by a space, or whatever else you needed.
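A short sketch of that, capturing the output into a string so the effect is visible. The function name print_progress is made up for the example; print's end keyword argument is the only thing taken from the comment.

```python
import io
from contextlib import redirect_stdout


def print_progress(steps: int) -> None:
    """Print one dot per step on a single line, then finish the line."""
    for _ in range(steps):
        print(".", end="")  # end="" suppresses the default newline
    print(" done")


# Capture stdout to show exactly what was written.
buffer = io.StringIO()
with redirect_stdout(buffer):
    print_progress(3)
output = buffer.getvalue()
print(repr(output))  # '... done\n'
```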
Other than that small tidbit which I came across by chance, awesome job as always Dave.
We are about same age, i have enjoyed many of your videos because the products were so important in my career. Its good to put a human face on the digital world and see a programmer who worked on the product.
Massive respect for the systematic and clear approach to this comparison (the experience you gathered over the years is very clearly showing in the methodology and explanation).
Instant subscribe! Thanks for this and keep up the great work!
Incredible comparison!
I'd also like to add how -- this man has successfully managed to write the most C++ looking script in Python 😂
ikr, I was looking at that and thought "is this really python? this guy probably doesn't even know that there is a built-in function to get the length of an array!"
Actually, while learning C++ arrays, I had to look up how to get a length of an array, and was unpleasantly surprised to find that there is no built-in way to do that.
@@bluesillybeard std::array has both size() and max_size() as well as the ability to call std::size(array). std::size also works on old style C arrays (e.g. int x[]), which I assume is what you were looking at, but std::size only entered the standard fairly recently (C++17). Tricky thing about C++ is there are a lot of out of date answers that say it can't do things, or has to do things in gross ways, when there are much better ways of doing it now.
@@flamingmonky111 good to know, thanks!
I looked for this comment earlier. Yes, I had to do a double take at that Python. :D
@@bluesillybeard I have to learn c++ for work. Ive mostly used java and python. im scared lol
I LOVE your channel. I learn so much by watching you. Thank you!
His HS CS class: Algorithm optimization competition with classmates
My HS CS class: Creating seemingly never ending popup dialogs that ultimately climax with "You Suck"
Ha! It was a great class... but I've been in the one you describe too I think :-)
At least you had CS in HS. I had to learn it in College.
It was Visual Basic. The class was called "Computer Applications 2" (Apps 1 was Excel, Access and PP). The teacher had learned that in theory every shuffle of the old card game Freecell is beatable. So she was on a mission to beat all 62,000 (or whatever it is) possible games.
Basically we just learned from our books. We would quickly make whatever Fahrenheit to Celsius converter, or pizza topping chooser thing we had to do for class. Then we'd just screw around and do dumb stuff.
About the only thing I remember of my hs prog class is that we used 15 year old apple ][ computers. The teacher may've been good at geometry but not so much at programming. His robotics class was more interesting but he still caused people to drop the class on day 1 😣
@@markp5726 We used Apple ]['s as well but they were brand new! And the students were still generally better at programming than the teachers.
I rarely see well-executed language comparisons. I love these performance/comparison types of videos.
I really enjoyed it, thank you!
Best original Content on YouTube right now. Killing the Game Dave!!!
Thanks! Tell a friend! :-)
@@DavesGarage my buddy Ian McCormick just called me and told me I had to check out this video.
Very cool but maybe we can throw Golang and/or Rust into the mix. I'm incredibly interested to know what an OG C engineer thinks of Go vs Rust (vs C and/or C++)
Agreed!
As someone who gets pumped following along to a free code academy tutorial video for Python. I am awe struck by this persons career and his ability to explain it to someone like myself. Keep rocking it!
Beyond the "Hello World" program in C64 basic 30+ years ago, I'm not a coder. So it's a testament to your presentation style that I can more or less follow what you're doing, and enjoy watching the show. Keep it up!
I would hope your C64 version of Hello World was the evergreen:
10 PRINT "HELLO WORD"
20 GOTO 10
RUN
;)
I started out with Atari Basic. It actually wasn't an interpreted language but would compile to P-Code when a line was entered and when run, it ran the P-Code with a software engine.
@@bjbell52 Well, that how most interpreters that time were implemented. Even if the program was parsed during input and stored as p-code to save memory, the p-code was still interpreted at runtime but not compiled.
Was "Atari-Basic" derived from Extended Microsoft Basic? Well, the same I started with in mid 80's in former East Germany on a KC85/2.
@@Merilix2 Sorry, I hit by mistake so I'll try again.
Atari Basic was NOT derived from M.S. Basic but Cromemco 16K Structured BASIC.
The three other home computers in the U.S. at the time used a version of M.S. Basic and NO, they did NOT compile to p-code and HAD to be interpreted at runtime, unlike Atari Basic.
So, Atari Basic should have been the fastest of the two Basics since it was pre-compiled, right. Nope : it was the slowest.
Why? Because the writers were told they had to do few things to their version of Basic.
1) It had to check the syntax of a statement when one is ENTERED. You claimed that all the other basics precompiled but OBVIOUSLY M.S. wasn't. One could write the following line of code :
200 IF X=Y THEN PINT "X & Y ARE EQUAL"
M.S. Basic would allow that line to be entered and would NOT find the syntax error until the line is executed (showing that M.S. Basic was NOT pre-compiled). That means the error would show up ONLY if X=Y. Atari Basic would have rejected the line, highlighting the word "PINT" to show where the syntax error was.
2) They had to add graphic commands like PLOT, DRAWTO, LOCATE, POSITION, COLOR to the language.
3) The Basic had to fit inside an 8K cartridge.
They bought the source code to M.S. Basic but couldn't do those 3 things. They gave it to another company but they couldn't make M.S. Basic fit into an 8K cartridge and do those 3 things.
So they wrote a new Basic for it. So why did it end up so slow? Because in order to fit into an 8K cartridge they didn't write a math package for it. Instead they used the one contained in the O.S.. But that package wasn't intended to be used in a language but only to do a few things the O.S. needed to do in BCD. The person writing it never wrote a math package before and didn't optimize it.
The authors of Atari Basic ended up releasing the source code. Someone wrote a new Basic for it named Turbo Basic (using the source code (I believe)) but replaced the slow BCD package with an optimized one. I tried a few benchmarks with the original Atari Basic against Turbo Basic and Turbo Basic won every time - sometime running 3 or 4 times faster than Atari Basic.
For the record, some people kept crying that it wasn't M.S. Basic like the other computers. Eventually they did come out with a M.S.Basic for the Atari (I have it) but I never used it that much, partly because Atari Basic was good enough for my needs AND a new language named ACTION came out that was easy to learn and use and compiled to machine code making it much faster.
@@bjbell52 No, I didn't claim all other Basics were precompiled. I said they were parsed and stored as p-code.
By p-code I meant code lines became kind of linked lists and keywords were turned into a short one byte value with bit7 set such that they didn't have to be parsed again. But that p-code still had to be interpreted.
That was almost the same as Atari Basic did. One exception: It seems like Atari Basic also converted (tokenized) constant numbers during input which MS Basic didn't.
By the way: your 'PINT "X"' would become πNT "X" on my Basic, as π is the token for the constant PI ;)
But yes, M.S. Basic was hard to fit into 8k. On early KC85/2 machine it had to be loaded from cassette tape and took about 10k of 16k available RAM. 85/3 had a switchable ROM instead.
There are optimizations you can do even on assembly level. Like using fancy vector instructions and such. I once optimized a piece of C++ code with some embedded assembly instructions to gain 10x performance just using MMX on a Pentium chip. Usually the hard part is identifying which 0.1% of code really needs to be optimized.
I’ve had one embedded application that needed to be 100% rewritten in assembly language. The compiled C code was about six times as large as the assembly language version, and the assembly language version used all but 15 bytes of 60k of flash memory. No larger memory versions of the processor were available. The C code was not anywhere close to fitting in the given processor. I think it took me about a month and a half to write the whole application in assembly.
All the major C++ compilers vectorise if optimisation is turned on.
This line here: `for (int num = factor * 3; num
That can make a significant difference on the larger tests. Take, for example, factor = 709. Using factor * 3 means you’ll start at 2,127, while factor^2 will start at 502,681, removing 353 unnecessary calls to test/clear that bit from that innermost loop. And that’s testing just one of the 168 prime factors less than 1,000 (which is what would be used looking for primes less than 1M). The number of unnecessary iterations skipped by this change grows with the upper limit of primes you’re looking for.
This will be almost unmeasurable on the smaller tests. Such as primes up to 10,000.
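The arithmetic above can be checked with a short sketch. This is a simplified odd-only sieve written for the illustration, not the code from the video; the function name count_primes_from and its start_at_square flag are hypothetical. Both starting points give the same prime count, since the multiples below factor * factor were already cleared by smaller factors.

```python
def count_primes_from(limit: int, start_at_square: bool = True) -> int:
    """Odd-only sieve; the inner loop starts at factor*factor or factor*3."""
    if limit < 2:
        return 0
    flags = bytearray([1]) * (limit + 1)
    factor = 3
    while factor * factor <= limit:
        if flags[factor]:
            start = factor * factor if start_at_square else factor * 3
            for num in range(start, limit + 1, factor * 2):
                flags[num] = 0
        factor += 2
    return 1 + sum(flags[3::2])  # 2 plus every odd number still flagged


# Inner-loop steps between factor*3 and factor*factor for factor = 709.
skipped = len(range(709 * 3, 709 * 709, 709 * 2))
print(skipped)  # 353

print(count_primes_from(10_000, True), count_primes_from(10_000, False))
```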
Note this
Additionally, above factor = 3, you can instead of 'for(int num=factor; num < sieveSize; num++)' write 'num += 6', and have two if-statements, one for "getBit(num - 1)" and second for "getBit(num + 1)", since all primes except 2 & 3 can be written as "p = 6k +- 1"
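One way to read the 6k ± 1 observation, sketched under my own assumptions: generate 2, 3 and then only numbers of the form 6k - 1 and 6k + 1, and use those as the only candidates worth testing. The names candidates and is_prime are hypothetical, and the trial-division check is there only to verify the claim, not as a replacement for the sieve.

```python
def candidates(limit: int):
    """Yield 2, 3, then every number of the form 6k-1 or 6k+1 up to limit."""
    if limit >= 2:
        yield 2
    if limit >= 3:
        yield 3
    k = 1
    while 6 * k - 1 <= limit:
        yield 6 * k - 1
        if 6 * k + 1 <= limit:
            yield 6 * k + 1
        k += 1


def is_prime(n: int) -> bool:
    """Trial division using only the 6k±1 candidates as divisors."""
    if n < 2:
        return False
    for divisor in candidates(n):
        if divisor * divisor > n:
            break
        if n % divisor == 0:
            return False
    return True


primes = [n for n in range(1000) if is_prime(n)]
# Every prime above 3 leaves remainder 1 or 5 when divided by 6.
print(all(p % 6 in (1, 5) for p in primes if p > 3))
print(len(primes))
```

Only about a third of all integers are candidates this way, compared with half when just the even numbers are skipped.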
Although there were no surprises, it is a great video. Many Python programmers know that the best way to achieve performance in Python is not using Python. This means that you should do most of the computation by calling C optimized libraries like numpy, tensorflow, sklearn.
But those are still python IMHO. The power of python is to be a glue language like that!
@@retrolabo yes it's still python but C/C++ will do the heavy work for it; python just needs a wrapper to call them, and most languages can make a wrapper to call C/C++ functions
@@nguyentranminhquang2861 yes this is what I mean by glue language. This is a strange distinction: think about it "print" in python is written in C in the interpreter, does this make the print function not python? :)
Using numpy or pandas to implement a sieve is the equivalent of robbing a bank with an atom bomb.
@@BrunodeSouzaLino you need to win a race, whatever it takes lol... wait a minute who said we cannot use a GPU for that? :P
Would love to see them run on the M1 and ARM on a Pi4
Mee too.
Would also love to see them run on Sparc 64 and MIPS
Oh, I didn't even think of the raspberry PI. If Dave needs one, I can send him a PI to have so he can do this test. I have a few laying around.
Me too
Cool, I will add for the Pi3 and Pi4 if I get a moment!
this is such a sick video. I love this channel now, its so cool to get these comparisons.
I loved your charismatic way of explaining the technical details. I was able to pull out a little bit more efficient (codewise) Python version while using the same algorithm you taught us, but still, the performance gap is astounding. I read there are many people here in the comment section sharing their viewpoint about a better Python implementation as if Dave was trying to undervalue Python. Remember, this is a kind of "synthetic test" and depending on the context it might be more reasonable to use one or the other of the languages.
In C++ you have the bitset class and can create a bitset of 1 million bits. And the other even faster method and more memory efficient is to create a vector<bool>.
C++ is king.
Wouldn't the vector<bool> allocate a byte per boolean? So that's not memory efficient, which was one of his criteria.
@@jplflyer Nope. STL defines a special case just for vector<bool>. Sometimes you really want a vector of actual bool type values, whether that's a byte or the same size as int, but good luck defining it due to this special case. What if you want to define a reference to some Nth bool in a vector<bool>? There is no reference to a bit in C++.
The first 30 seconds of this earned you a subscription
Why thanks!
Since you asked, here some things about your Python code that I haven't seen commented here yet:
• print() has a keyword argument for specifying the character(s) added to the end of the line, so for example you can do print("Hello, world", end="") to avoid printing a newline (\n) at the end of the line.
• From what I've seen, it's convention to use all-lowercase snake_case for variables and function names, and CamelCase for class names (though the stdlib doesn't always respect that)
• You don't need the parentheses on your if statements
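The three points above could look roughly like this. It is only a sketch: `show_progress` and `pass_count` are made-up names for illustration and do not come from the code in the video.

```python
import io
from contextlib import redirect_stdout


def show_progress(pass_count):  # snake_case for function and variable names
    # No parentheses are needed around the condition.
    if pass_count > 0:
        # end="" suppresses the "\n" that print() appends by default.
        print("Passes: ", end="")
        print(pass_count)


# Capture stdout to confirm that both print() calls end up on one line.
buffer = io.StringIO()
with redirect_stdout(buffer):
    show_progress(3)
captured = buffer.getvalue()
```

Here `captured` holds a single line, because only the second print() call emits a newline.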
tbf naming conventions aren't language specific, they're programmer specific.
First time I've ever subscribed to a channel after just one video. This was great!
Welcome aboard! Thanks for joining!
Thank you for taking me back to the Commodore Pet. Our highschool didn’t have any computers when I started, but in the 4th year they started showing up in shops, and I was one of the “regulars” using the demo model. Eventually a TRS-80 model 1 was bought to help staff plan classes, (Dutch schooling allows you to select a subset of classes for 4th to 6th year, so they have to juggle schedules to ensure optimal planning.) I was then the “gang-leader” of those allowed to use it for the rest of the year. Exciting times.
Respect, that's all I can say other than I am impressed. I also am very happy, C++ has been my go to language for a long time. Keep these videos coming. The prime number program was one that we were required to code in my algorithms college course, also one that I found difficult was a program to estimate pi to a given decimal place(this was user defined at runtime). Ugh, we were forced to write that one in of all languages basic. Yes the professor would allow only basic for that one. I had a sadistic algorithms professor I suppose. He allowed me to use C for the prime number program though so can't complain too much. Love the video and will be watching your others in the future.
Thank you for that Simpsons reference. I’ve literally been doing that voice for 20 some odd years!
You and me both :-)
Really great channel and video - and fun presentation. And a fair language comparison, I'd say. We've been developing in straight C since the late 80's ... and have "secure/portable" libraries for just about everything. The modern compilers are amazing at yielding reasonable machine code (no need to spend so much time on assembly, aside for maybe some hardware specialties). We wrap our C with whatever language does the job for the client or product development. Some of those higher-level languages might be useful these days for people learning how coding works ... kind of like assembly language and BASIC may have been for us. So great finding your channel ... subscribed and telling friends ... cheers ...
This is so awesome sir! I really really wonder how come I didn't find your channel so far!
Thank you for showing me a baseline of what a really passionate programmer looks like. Huge difference from many other RUclips participants. No society related talks. Just the code and figuring out things using code.
As a retired CPU designer, I am constantly surprised by the "discovery" that interpreted languages (even those that use a JIT) are so much slower than optimized C or even assembly. There is little appreciation for the massive overhead of many of these script-like languages. As a demonstration to convince a software developer that we could run their massive program on a $35 compute module I recoded their most critical routine in assembly (60 instructions long) and showed that their entire system ran with less than 10% of a very cheap machine rather than 40% of a Mac.
The real nightmare, however, is the strato-layering of "packages" one on top of another instead for minimal additional functionality but a perceived decrease in design time. These chew up CPU cycles in massive overhead damaging the responsiveness and size of the code generated. As CS schools have stopped teaching even the rudiments of computer architecture this is not likely to change. Great for CPU producers, but a massive waste in time, power, and cost.
@randy You discovered the secret of Forth. Forth allows easy replacement of a single bottle neck.
@@albertmagician8613 Yeah, but I could not force myself to always work off a stack.... RPN on an HP calculator was great but for general purpose programming?!?
I think it would be a great video to show how to profile some code and optimize the slowest pieces for better performance. That's effectively what I ended up doing and it makes a huge difference for any sort of interactive code, such as CAD tools.
Yes. It's a revolting waste...
Why do you think CS programs have stopped teaching such things? I graduated relatively recently and my average rated US school required two courses on embedded design, one on assembly language, and one computer design which focused on the specifics of logic gates, ALUs, etc. I suspect you're trying to strengthen your position by creating a strawman argument. I think at best you could demonstrate that the worst programs don't teach these things but that was likely always true.
@@nonconsensualopinion I have no need of a strawman as I'm not arguing, but stating an observation. Some programs do indeed require traditional logic and some architecture classes, but increasingly those are being phased out as more emphasis is placed on web programming and large variety of dialects available for layering systems. I could give a list of such programs but it's not my point. Deep layering and package-based design are making systems increasingly inefficient. The good news is that I think we're headed for more actions like the one I stated in my original post, but slowly. I also think it's a massive opportunity as server and system costs can be reduced through breaking down software layers.
Some comments on the Cpp version:
1. There's a native bit array, it's just called "std::vector<bool>". It's slightly horrible that it's a specialization of vector, but it does work and does what you expect.
2. The std::chrono default aliases (seconds, milliseconds etc) are all specified with integers as the base type, for compile-time safety. If you want fractional time amounts, you can define your own alias, e.g.
using dseconds = std::chrono::duration<double>;
And now you can duration_cast to that, without having to fiddle with getting it in microseconds and then dividing by a million to get more resolution.
If only I could be such an expert... I'm just blown away. With coding each time I stop and think I'm somewhat not bad at it, then each and every time it is happening, I see someone who makes me feel like I know nothing...
Very cool, Dave. Thanks for the great content. It strikes a nice balance between educational and entertaining. Beauty
For python, what I managed to catch: self instead of this, index//2 instead of int(index/2), use if __name__ == '__main__' for mains, so in case you import the code elsewhere it doesn't run the main, you do not need to retype something as a string in a print, also prettier way is to use f strings, you can use **.5 instead of sqrt, but this is more optional, than previous tips. Also for the sieve, you can actually start with square multiple, instead of 3rd multiple. That was all I caught, happy coding
f strings are not only prettier but also faster, since they don't require an expensive function call
There are also a number of loops that could be optimized.
Generally speaking, in Python for-loops are faster than while-loops, and often more so, list comprehensions are even faster than for-loops. In code like this with possibly many iterations you can often see a very significant performance (and sometimes memory usage) improvement when switching to list comprehensions. Plus when you are used to using them list comprehensions are often more readable.
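One way to check the while / for / comprehension claim on your own machine is a small timeit comparison. This is a hedged sketch, not a benchmark of the sieve itself: the three `squares_*` functions are invented for illustration, and the relative timings depend on the interpreter and version, so no numbers are claimed here.

```python
import timeit


def squares_while(n):
    # Manual counter: every iteration runs several bytecode instructions.
    result = []
    i = 0
    while i < n:
        result.append(i * i)
        i += 1
    return result


def squares_for(n):
    # range() drives the loop, so the counter bookkeeping leaves Python code.
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    # The comprehension also avoids the repeated result.append lookup.
    return [i * i for i in range(n)]


if __name__ == "__main__":
    for func in (squares_while, squares_for, squares_comprehension):
        seconds = timeit.timeit(lambda: func(100_000), number=20)
        print(f"{func.__name__}: {seconds:.3f}s")
```

All three functions return the same list, so any difference in the printed times comes from the loop construct alone.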
You can go even further with the functools module and using "pure" functions that avoid side-effects (save calls to print or log to places outside the pure functions). This approach can lead to much more readable and succinct code.
@@pjkirkham yeah. I saw some of those, but since I did not have the code, I didn't want to tell him about it. But you are correct
These programming videos have been some of my favorites you've done! Please do the M1!
Great video! Just want to point out that most python devs wouldn’t try to do something like that in native python. We’d either use a library like numpy or build one in c. Python is really more of a scripting language, I use it to call other bits of code in a readable sort of way. Once you learn/build the main packages that are relevant to your job, it’s crazy how fast you can push out code, which might not be 100% as fast as C, but it’s fast enough and very easy to maintain.
fast enough is kind of relative, isn't it ? I mean, his code ran close to ten thousand times faster in C++/64 compared to Python 😂
I remember writing a small snippet of code a few weeks back just to count from 1 to a billion I think on C, Python, VB, and MATLAB, and I remember the code was much, much slower in Python than C, though not that much slower if memory serves.
@@MoodyG That's the thing. In real life there would never be a requirement to "count to 1 billion". But I would create a dataframe in Pandas that contained the positive integers from 1 to 1 x 10^9. If you're doing that kind of thing in native Python, you're doing it wrong.
@@tomwalker996 I think you totally missed the point. Of course you wouldn't wanna count to a billion in real life. It just serves as a simple example to illustrate the comparative difference in speed between different languages. Counting up to a billion is way, way more trivial a task than what you'd actually wanna do in real life. An actual useful code for some real-life application may very well end up eating through computations many orders of magnitude more than a billion primitive addition operations.
This was awesome. Thanks for the explanation. It's interesting to see how far we've come. I started on TRS-80 basic.
This was wild, and mind blowing. I correctly guessed the order, but the speed of the code is what really floored me. Thanks for the demo, walking us through it briefly, and for displaying the code as well.
In C#, have you compiled it in release mode? For me, this increased the benchmark value from ~1900 to ~5000. Also, updating it to .NET 5 improved the performance on my machine by another 2% to ~5100.
Yeah it looks like its set to Release mode; I was looking for the same thing. I got similar results switching from Debug to Release, plus the X2 performance from 32bit to 64bit. Very eye opening.
@@DOSdaze If you compile the c# project to x86 only, the speed drops from ~5000 to ~4500 on my machine.
@@vonBlankenburgLP Interesting, I wonder whats making the significant difference for me. In release mode 64bit I'm getting around 4900 passes every run, but then drops down to 2800 in 32bit very consistently. I'm on an 8700K clocked at 4.4GHZ running Windows 10.
Can confirm, I'm getting ~5300 when running the C# code in release mode on my 3700X, which in theory should be slower than Dave's 3970X. The C++ code is reasonably close with ~7300. Seems to me that something is wrong with his C# test.
@@moki5796 My guess is he's a C++ guy?😋 just kidding. Interesting though but we all know C++ is faster sooo..... C# is my fav language, C++ is great too. I'm going to get hate for this but every time I use Python I just feel like something is missing and it bugs me but I can't put my finger on it. I get the popularity and it's used by around 8 Million people but I'm guessing at least 50% are noobs, whereas around 6 Million use C# and I'm guessing less than 10% will be noobs...sooo...don't know, never tried IronPython, maybe that will pique/solidify my interest?
Python is "compiled" into bytecode too (usually saved as .pyc files), but the bytecode interpreter is really just that. So no overhead for parsing, but no optimizations either, that's why it's slow when doing computationally intensive tasks (for which you often just use optimized packages/libraries). There are versions of python that jit this bytecode to increase raw performance, e.g. pypy.
Running on my old laptop, I get 42 Passes from your Python Program when run in CPython (what most people use as "Python"). Simply switching the Interpreter to pypy I get 561 Passes from the exact same source file.
Another reason why Python code is slower than C code is dynamic typing. Even when JIT compiling, the compiler needs to add checks that the types of the variables are still what the JIT compiler believes them to be. Same for array bounds checks.
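The bytecode mentioned above can be inspected with the standard `dis` module. A minimal sketch, assuming CPython; `half_index` is a made-up function, and the instruction names printed differ between Python versions.

```python
import dis


def half_index(index):
    return index // 2


# The function body is already compiled; the raw bytecode lives on the
# function's code object.
bytecode = half_index.__code__.co_code

# dis.dis prints one line per bytecode instruction. The interpreter executes
# these one at a time and checks operand types at run time.
dis.dis(half_index)
```

Even this one-line function expands into several instructions, each dispatched by the interpreter loop, which is where much of the per-operation overhead comes from.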
That's not what "compiled" means. That's simply syntax tokenized reduction. It's been around since the days of BASIC.
@@mihiguy That's interesting. It seems that if we rewrite the Python code so it can be statically typed (e.g. compilers like Cython, Numba, etc. all implement their own type systems), it would remove all that.
@@knowlen For JavaScript, that is what asm.js is doing. Expect a restricted, strongly typed subset of JavaScript and compile directly to machine code. However, this has mostly been obsoleted by WebAssembly.
Python is just the "flavor of the month"; at one time it was Javascript. C++ is closer to the "metal" and you have dozens of toolkits for optimization.
Wow! Loving C# more every day!
Just looking at Python for a project using Raspberry Pi, and have to say, it's a dog. I'm about the same age as the presenter, so his language experience really resonated with me.
Great video! Wish I could up-vote more than once. Thanks! 👍👏👏👏
For ex-BASIC coders like myself C# is an ideal halfway house to C++. Learn C# and learn Powershell and you can address most programmatic needs. Never as fast as C++ of course, but a lot less cryptic and most times it catches you when you fall and gives you a decent diagnostic message.
Excellent treatment of this topic. The only suggested follow up I would have is some discussion of why 64 bit turns out to be much faster than 32 bit and if the C# implementation uses 64 bit "under the hood". Keep them coming Dave, you've got a new fan!
I mean C++ technically has vector<bool> for dynamic bit arrays. I believe it uses size_t instead of 8-bit chunks, because it's made to dynamically change size even after being created. I know technically a lot of people don't like it in bigger codebases for various reasons, but in isolation it works fine just to access and change bits.
Wanted to suggest the same. It is actually not specified how it needs to be implemented, it is not even required that it allocates 1/8th of the memory of `vector`. And I agree having a specialization for `vector<bool>` is indeed a mess, since it is technically not fulfilling the requirements of `std::vector`, such as having the elements stored as-if they were in a plain C-array.
The 64-bit result is shocking; I'd have never thought you'd get that much of a delta over 32-bit!
64-bit supports some game changing new instructions which can be used to optimise a program
The x64 calling convention is much simpler .. first few args are passed in registers, akin to __fastcall in MS VC++. And there are a lot more registers, which means less juggling values to/from the stack frame, and it also allows for more aggressive function-inlining by the compiler.
Also also, compiling for x64 is like an implicit hint to the compiler, "this is a modern CPU so you can use all the new SSE etc instructions.. don't have to worry about compat with a 20 year old Pentium"
std::vector<bool> is size optimized for something like a "dynamic bitset"
You should add golang and rust into the mix
I like to watch that
And maybe Java
@@SulemanAsghargoion _All of them_
and the fastest scripting language: LuaJIT
@@August0Moura This would actually be really interesting, luajit is really quite fast
Dave, your breakdown of how some of this works is very interesting to me. I'm a Server and Storage Architect and team manager, and I have a BS in Computer Information Systems. By the time I was learning in the mid to late 00's, they didn't teach about bitarrays or any of the way the assembly works, we just learned OO languages like Java, C#, J#, and of course I've done some python dabbling since. I prefer C# but getting a peek behind the covers from someone who understands the assembly behind the scenes was interesting.
Nice to see good stuff from another former Blue Badge!!! Just found your channel and I'm enjoying your stuff! 😊👍
Believe it or not, I wrote a prime number generator in dBase IV. It saved the prime numbers in the database. When I packed up at the end of the day, I left it running on the IBM AT overnight. Since the prime numbers were in a database, it could pick up where it left off the previous morning. The database was handy for factoring huge numbers.
I'd love to see comparisons of .NET Framework vs 5 vs AoT, as well as 32 vs 64 in C# if it makes any difference at all
Attack On Titan wins 😅👌
I still would like to have .net5 tests if it's worth bragging my boss about patching VS and install 5.0
I did some small tests. Changing to x64 native compilation yielded a 15% speedup for me. Upgrading to .NET5 gave a roughly 2x speedup. So with those results, C# is about 2x slower than optimized C++ in my quick tests. I assume it has to do with C# being unable to do as much aggressive inlining. That's what it seemed like from running the profiler. Turning off inlining in C++ dropped performance by half. I did bump the num primes calculated by x10, though, figuring it might help avoid the allocation overhead that C# has to pay since it's all heap allocated.
@@Sayuri998 Nice. I have to ask my boss for an upgrade then. In company networks, you don't upgrade your stuff yourself, it's what they give you.
@@Spelter Just be sure to do some benchmarks yourself and don't just rely on some random person on the internet. While these are the results I got, the results may vary for you.
honestly I really enjoyed this comparison. It was so in-depth and fun and I really liked the anecdote. Also, Dave's diction and oration skills are so good, like ??? the way he speaks is so engrossing
**BIG shoutout to Dave's python code using this instead of self. If Guido didn't want us using the **correct** keyword for the current instance, he should not have made it customizable**
Enjoyed the story. I took some time to work on an implementation in R and got some pretty good performance out of it. Major take-away: understanding what is REALLY going on in a programming language can allow us to write clean code. As I tell my Grad students: first get it RIGHT, then make it BETTER.
You look like the Chuck Norris of coders. I am even more amazed by your clear and understandable explanation than by the performance differences. Would have been interesting to see the java and java script performance - just to finally end all those performance discussions.
You can probably speed up the Python code Cythoning some of it (Cython is still Python, well... Sort of...). Great video and excellent test!
Basically me watching this channel: i like your funny word, magic man
Hahaha, Clone High reference?
And here I struggled with iterative loops and abstract classes within my 6 months C# crash course learning lol, the amount of knowledge you must possess is truly astonishing.
In C++ there is std::vector<bool>, which is typically implemented as 1 bit per element, though it likely would be slower.
It's been years since I have written code, but having cut my teeth on c, I was curious. Your experiment did not disappoint.
Recently, I have been playing with arm, so when you mentioned the M1, it piqued my interest.
If you run it, I will watch.
Hey man, you could use the numpy lib for Python to run functions like "sqrt" even faster. Python wasn't really made with speed in mind but numpy was. Many of numpy's functions were written in C to make it performant. I would like to think that's the reason why Python is used in Machine Learning - numpy + other C libs for Python (speed) + python (minimal code, readability). Loved your video!
Was thinking the same thing
Using Python's abstractions tends to be faster than trying more low-level optimizations. That's because for many of these Python is able to optimize stuff using lower level tools that are not accessible to the user, like C bindings. JS also performs quite a lot of optimizations at the engine level. If you could somehow implement the sieve using numeric methods, you could use something like Numpy and you'd be essentially running C behind the scenes.
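As an illustration of leaning on built-in abstractions without any third-party library, here is a sketch that clears all multiples of a factor with a single bytearray slice assignment. It is not the code from the video; `count_primes` and its one-byte-per-number layout are assumptions made for this example.

```python
def count_primes(limit):
    """Count the primes <= limit using one byte per number."""
    if limit < 2:
        return 0
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    factor = 2
    while factor * factor <= limit:
        if is_prime[factor]:
            start = factor * factor
            # Number of multiples of factor in [start, limit].
            count = len(range(start, limit + 1, factor))
            # One slice assignment clears every multiple; the loop over the
            # multiples runs inside the interpreter's C code, not in Python.
            is_prime[start::factor] = bytes(count)
        factor += 1
    return sum(is_prime)


if __name__ == "__main__":
    print(count_primes(1_000_000))
```

The design choice here is to trade memory (a byte instead of a bit per number) for far fewer interpreted operations per pass.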
As a little advice : to compare a value with either 'True' or 'False', use 'is' instead of '=='. Python will compare the objects' id directly and it's the most optimized way to do it.
Better yet, don‘t compare with True or False altogether, i.e.:
if this.getBit(num):
    pass
@@anonanon3066 Python is a real language.
@@anonanon3066 that makes you sound like you're not a real programmer
@@philipmunch3547 I agree, this is even better, but I have not coded in Python for a long time, I've been more into Rust recently.
Yup, do not do this. Never compare against True or False using is, unless there are other "truthy" or "falsy" values that are anticipated, e.g. use of None as a placeholder, canary, or NULL equivalent. Identity comparisons may seem to operate the way you expect (True and False being literal singletons in CPython), however other situations will be far stranger. 27 is 27 → True. Some larger numbers that end up not being interned will not have this identity match, a similar problem with using this to compare strings for equality.
Philip Münch has it right. Just use the "if" statement itself as the mechanism of casting and boolean comparison. The right tool for the job. Combined with "exit early" patterns, my special case of using None should be accounted for first, then the remainder of the function exited from if needed, so that the subsequent truthy or falsy comparison doesn't worry about None at all.
In JS there is the paradigm of double-inversion to cast to a boolean, e.g. !!foo, Python simply does not have this silly need.
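The difference between the three kinds of check discussed in this thread can be seen in a few lines. A small sketch; `describe` and `classify` are hypothetical helpers written only for this illustration.

```python
def describe(value):
    """Show how the three checks treat the same value."""
    return {
        "truthy": bool(value),         # what a plain `if value:` tests
        "equals_true": value == True,  # equality: 1 == True holds, 2 == True does not
        "is_true": value is True,      # identity with the True singleton only
    }


def classify(flag):
    """Handle the None placeholder first, then rely on truthiness."""
    if flag is None:  # exit early for the special case
        return "unknown"
    if flag:          # no comparison against True needed
        return "prime"
    return "composite"
```

For example, `describe(1)` is truthy and equal to True but not identical to it, while `describe(2)` is truthy only, which is why a plain `if` is usually the least surprising choice.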
What a cool comparison! I learned something too. I recently started learning Julia, so I copied the program more or less to test the advertised speed of Julia. For 1 million upper limit, I got 4228 passes in 10 seconds, an average of 2.3 milliseconds. I was quite astonished! This is comparable to C# and C++.
Awesome video, eh. Loved it--definitely a new subscriber here. Thanks for your efforts Dave!
Thanks and welcome aboard!
I like benchmarks too. I converted the C# version to PHP.
CPU: i7-2600k
- 5sec of iterations
Python 3.8.5: 23 passes
PHP v7.4.3: 50 passes
C# .NETCore 3.1 compiled with VS 2019: 1623 passes
* rolls eyes *
Implementing that same prime number generator in a new language is a fantastic idea, I'll try out something similar next time!
someone do a java2k version please!
@@derkeksinator17 omg ok, I'll put it on github should I succeed, but that's rough
@@JannisAdmek Has anyone tried it in Rust? Seems to be the language everyone wants to talk about these days. Be interesting to see what, if any, hit there is. Also be more interested in time for n cycles rather than cycles in t time.
I'm curious about why there's a count of 1 prime for the limit of 10 in the historical list: surely there should be 4 (2, 3, 5, 7)? Or am I missing something?
I wanna know too
I think it was redefined along with pi in the indiana bill ;)
I was going to ask that as well.
Typo, never encountered because no one tested it up to 10. Or maybe it’s deliberate for testing that the validation code correctly flags errors. Test to 10 and it’s always “wrong”.
"I somehow managed to get into the list of students who would start programming in high school".Are you kidding,Mr. Dave?You are super bright!
I've never heard of you before now. That intro was legendary. I'm building a breadboard computer this year.
I've never heard of you either, but I'd be interested in hearing more about the breadboard computer, what kind, what CPU, etc!
@@DavesGarage I'm not entirely sure. 🤔 I was inspired by Ben Eater. I'll keep you updated.
Have you considered doing a version of this video that uses Python, Python with Numpy, and Cython as well? Numpy and especially Cython can speed things up quite a bit. (Or at least that's what I recall from uni)
😅 that would be a bit of cheating. Especially Numpy just wraps python calls around highly optimized c or fortran code. So you would expect the c or c++ performance plus one function call with the associated type checks python does.
And Cython has it already in the name. It's just inline C in a python framework. Which is absolutely genius as you can have performance over flexibility where needed.
Omg... This is going to be soooo awesome 😎.
I already decided my religious and no changeable attitude towards all these platforms, but damn it'll be fun to either be right or rectified :p
More content like this... It's getting really geeky 👾
You should have tried Delphi (or the free Lazarus). I worked at a large bank and tried to get someone to look at Delphi but nobody did. We were a Window's shop. I was hired to rewrite their Paradox for DOS system but I tried to convince them how good Delphi was. Finally they decided after one of our best programmers had calculated he could write a trading application in VB in one year after we got a new department head who believed in free software and chose Java instead. It was way too slow and after 2 years his final work was rejected totally (and he had a 6 figure salary). I tried many times to explain that we were a Window's shop and should use Windows development tools but they fell on deaf ears. Our new boss decided to write everything else in PERL. Goodbye Windows' GUI. Finally the good programmer convinced them to try C# and in our meeting I was told not to mention Delphi again because C# had so many great features. I fought off the urge to scream and explained to them that C# was Delphi with a C syntax that was designed and programmed by the same people who wrote Delphi and Delphi had all those great features many many years earlier. It didn't matter, the company went out of business a little later on.
How good it is to have a view from start to finish of this situation. I guess this must have gone back and forth for a couple of, let me say about 3, years. I wish they had given you a chance and tried out Delphi, lol.
But then, life goes on.
Why look at half-dead technology. What could be the reason for this.
Hi Dave, I realized that your python code style looks more like C++, rather than python, which is fine but since you asked about any comments about your python coding style I can provide some insight.
1. When printing something you can use formatting, because it simplifies things and allows for extra options. E.g. instead of writing print('Passes: ' + str(passes) + ', Time: ' + str(duration)) you can use formatting like print('Passes: {}, Time: {}'.format(passes, duration)), which is much clearer, or for python >= 3.6 you can use f-strings like: print(f'Passes: {passes}, Time: {duration}'). This allows for extra formatting options. For example, if you want to trim duration to 2 decimals you can write print(f'Passes: {passes}, Time: {duration:.2f}'), which in code is very clear and concise.
2. Instead of importing stdout to avoid printing newlines you can use python's built in print. The print function takes some extra keyword arguments like "sep" and "end". "sep" denotes the separator (default " ") and "end" denotes the end character (default "\n"). So in your case instead of importing stdout, you can use print('.....', end='')
3. There is no need to wrap conditions in parentheses. E.g. if (condition1 == True) and (condition2 == True) can be written as if condition1 == True and condition2 == True or if condition1 and condition2. You can always omit the == True if you know that the condition is a boolean, which you would expect it is; otherwise there would be some bug I guess.
And here are some coding style conventions (you can read more about it by googling pep8)
4. Function/method names and variables are by convention snake_case (except maybe cases with backwards compatibility).
5. Class names are by convention PascalCase.
That's all keep up the good work, I really like your videos.
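The formatting options from point 1 side by side, as a quick sketch. The values of `passes` and `duration` are made up for illustration.

```python
passes = 42          # illustrative value, not a real benchmark result
duration = 10.03456  # illustrative value

# String concatenation needs explicit str() conversions.
concatenated = "Passes: " + str(passes) + ", Time: " + str(duration)

# str.format fills the {} placeholders in order.
formatted = "Passes: {}, Time: {}".format(passes, duration)

# f-strings (Python >= 3.6) embed the expressions directly.
f_string = f"Passes: {passes}, Time: {duration}"

# A format spec after the colon trims duration to two decimals.
rounded = f"Passes: {passes}, Time: {duration:.2f}"

print(rounded)
```

The first three produce the same text; only the last one changes the output, by applying the `.2f` format spec.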
As optimization instead of starting from 3rd multiple, the start can be done from the square of the unmarked factor.
All the remaining smaller multiples will already have been crossed out in earlier passes; the Wikipedia article on the Sieve of Eratosthenes can be cited for this.
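To see that starting at the square changes the amount of work but not the result, here is a hedged sketch that counts the marking steps for both starting points. It is a simplification: it sieves all numbers rather than only the odd ones, so the "early" start is the 2nd multiple instead of the 3rd, and `sieve` is a name chosen for this example.

```python
def sieve(limit, start_at_square):
    """Return (list of primes <= limit, number of marking steps).

    Assumes limit >= 2.
    """
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    steps = 0
    factor = 2
    while factor * factor <= limit:
        if is_prime[factor]:
            # Multiples below factor * factor have a smaller prime factor,
            # so an earlier pass has already crossed them out.
            first = factor * factor if start_at_square else 2 * factor
            for multiple in range(first, limit + 1, factor):
                is_prime[multiple] = False
                steps += 1
        factor += 1
    primes = [n for n, flag in enumerate(is_prime) if flag]
    return primes, steps


primes_square, steps_square = sieve(100, start_at_square=True)
primes_early, steps_early = sieve(100, start_at_square=False)
print(len(primes_square), steps_square, steps_early)
```

Both calls return the same primes, and the square start needs fewer marking steps; the saving grows with the limit.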
That Python bar should be almost twice as thick. You're using the floating point division operator / with a conversion to integers every time you index. Try Python's integral division operator //.
Isn't floating point division quite a bit faster than integer division on modern x86 architecture? I vaguely remember something like 11 cycles vs 26 cycles on Skylake CPUs.
He could also just use the shift operator instead of dividing by 2, 4, 8 etc.
@@nahco3994
On my somewhat old Inspiron 3847, i5-4400 3.1GHz machine, replacing "int(index/2)" with "index//2" on lines 51 and 64 (of 0cb3ff5 commit to Primes/PrimeSievePY/PrimePY.py) resulted in average runtimes being improved from about 0.385 to about 0.254; the integral division version being about 1.52 times faster. (Windows 10 Pro, ~2 year old python 3.6 in cygwin).
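The two spellings can be compared directly. Besides speed, there is a correctness angle: `int(index / 2)` goes through a 64-bit float, so it stops being exact for very large integers. A sketch with made-up values; timings will vary by machine and Python version, so none are claimed here.

```python
import timeit

index = 999_983               # arbitrary illustrative value
small_float = int(index / 2)  # float division, then truncation to int
small_floor = index // 2      # integer floor division, no float involved

# Beyond 2**53 a float can no longer represent every integer exactly.
big = 2**62 + 2
exact = big // 2              # stays in integer arithmetic
via_float = int(big / 2)      # rounded to the nearest float first

if __name__ == "__main__":
    t_float = timeit.timeit(lambda: int(index / 2), number=200_000)
    t_floor = timeit.timeit(lambda: index // 2, number=200_000)
    print(f"int(index / 2): {t_float:.3f}s  index // 2: {t_floor:.3f}s")
    print(f"difference for the large value: {exact - via_float}")
```

For sieve-sized indices both give the same answer, so the replacement is safe there; the large-value case is only included to show why `//` is the more robust habit.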
(Note Dave's github link in the description).
I tried that and got an immediate 30% speedup. Eliminating the whole clearBit and getBit functions (directly accessing self.rawbits) and holding the whole list, not just the evens gave a 290% speedup. The problem is that the things that are good optimisations in a low level language may not be in a high level one.
And that goes up to 465% if you eliminate the class which is doing nothing useful. I'm sure it could be completely rewritten in a more pythonic way to be far faster, but interesting to just make some edits from the original program. That is still obviously nowhere near the C# time - but now more like a factor of 20 rather than 100.
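A rough sketch of the kind of edit described in this thread: the same odd-only marking loop, once going through small accessor methods and once inlined on a local list. This is not the repository's code. `SieveWithMethods`, `count_primes_with_methods` and `count_primes_inlined` are invented for illustration, and, as in the comment, the list covers every number rather than only the odd ones.

```python
class SieveWithMethods:
    """Every bit access pays for an attribute lookup and a method call."""

    def __init__(self, limit):
        self.limit = limit
        self.rawbits = [True] * (limit + 1)

    def get_bit(self, index):
        return self.rawbits[index]

    def clear_bit(self, index):
        self.rawbits[index] = False

    def run(self):
        factor = 3
        while factor * factor <= self.limit:
            if self.get_bit(factor):
                # Odd multiples only, starting at the square.
                for multiple in range(factor * factor, self.limit + 1, factor * 2):
                    self.clear_bit(multiple)
            factor += 2


def count_primes_with_methods(limit):
    sieve = SieveWithMethods(limit)
    sieve.run()
    # Count 2 separately, then every odd number still marked True.
    return (1 if limit >= 2 else 0) + sum(sieve.rawbits[n] for n in range(3, limit + 1, 2))


def count_primes_inlined(limit):
    bits = [True] * (limit + 1)  # local name: no self. lookup per access
    factor = 3
    while factor * factor <= limit:
        if bits[factor]:
            for multiple in range(factor * factor, limit + 1, factor * 2):
                bits[multiple] = False  # direct indexing, no method call
        factor += 2
    return (1 if limit >= 2 else 0) + sum(bits[n] for n in range(3, limit + 1, 2))
```

Both functions implement the same algorithm and return the same count, so timing them against each other (for example with timeit) isolates the cost of the method calls and attribute lookups.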
Sometimes compilers and interpreters are surprising! I wrote a Monte Carlo simulation in VBA under Excel, coded as creating and destroying objects for each simulated item. Then recoded it to do all the memory allocation and deallocation on a static array and it ran at near enough exactly the same speed! That OO stuff was implemented well!
or the other stuff really bad.
main mistake was using vba and excel ;)
@@stke1982 Well sometimes you have to do what the client wants! I’ve written MC simulations in a variety of languages from Fortran IV onwards! My point was that the hand crafted stack operation version was indistinguishable in performance from the OO version - to my surprise.
1st C++, 2nd C#, and then Python will be a distant 3rd. The performance delta between C++ and C# will likely be heavily influenced by how much garbage is generated that the GC feels the need to clean up. If you're very careful about your memory, then C# can get really close to C++. But if you write idiomatic C# (e.g. using Linq) then it's going to be well behind C++ but still a country mile ahead of Python.
@Tudi20 It will be a drag race miracle
...because Python is interpreted, its really not fair to compare it to 2 compiled languages.
@@nolram c# is also interpreted from CLR too, so...
Unless you use .net native, that's not an excuse really
@@honguyenminh Nonsense, the CLR is not an interpreter.
JIT != interpreter
@@honguyenminh C# is absolutely not interpreted, it's compiled by, for example, Roslyn.
Dr. Dobb's had an article where the guy boots DOS, runs a program, it takes something like a second, then runs it again, and it's 300x slower. Rebooting gets the 1 sec timing, rerun gets the 300x slower time. The author had no idea what caused it. This was on a 386. What DOS was doing was setting up the virtual to physical memory map on boot. It was a simple 1:1 map. Virtual block 1 is physical block 1. However, after you run something, the physical memory is freed, and the 1:1 mapping is changed. Virtual block 1 might now be physical block 7. The processor bit that maps virtual to physical is the TLB, the Translation Lookaside Buffer. It's a small table on the chip. When you ask for a virtual block that isn't listed in the TLB, there's a system interrupt, and the correct mapping is added. This happens because the TLB isn't big enough for even fairly modest sized programs. For speed, when a new entry for a virtual block is needed, it doesn't go just anywhere (like the least recently used spot), but instead, a hash function is used to figure out where to put it. If Function A and Function B happen to hash to the same TLB location, then you get a TLB interrupt when A calls B and again when B returns to A. This is TLB thrashing. There's also a version with data addresses. And TLB thrashing can give you a factor of 300x slowdown. At least, that's the slowdown last I looked. On modern operating systems, the virtual to physical mapping is set up for every new process. You should get similar performance for similar input for the same code. However, large code, and especially large data, can suffer from this in wonky unpredictable ways. I'm unaware of a computer architecture where the TLB hashing function is public. But MIPS (1980s) had a code analysis tool that built a function calling tree and could rearrange your modules to avoid TLB thrashing for your code. I've not seen anything like this recently.
Holy cow, i always heard c# and c++ were faster than python, but this is way faster than i thought, this video is awesome
I really like your pointing out that the algorithm can greatly affect the run time. Back in the 8080 days, I had a TRS-80 computer. Basically a 1Mhz Z-80 chip. I wrote a poor algorithm in assembler and a smart one in interpreted BASIC. The BASIC code ran faster.
I don't code much anymore. I'm pretty much maintaining an obsolete system, which will be retired soon, followed by me. Now, I mainly game on a console. I can feel my brain rotting.
Would be interesting to do this with C and gcc as well and see if the compiler makes a huge difference. Also, keeping track of the results as a bitmap is a great space saving, but is it really a time saving? With a modern system you could store everything in a byte array, or even an integer array.
Yeah, the bitmap seems like a downgrade. On the 64-bit compiler, an int64 is probably the fastest array type (I'm guessing.)
However, the point is to establish relative performance between different platforms, so perhaps super-optimization is less important than using the same code on multiple machines. Actually, come to think of it, that's the best possible reason to skip the bitmap array, since some platforms may have trouble with it.
Space saving helps speed on modern CPUs because it improves cache locality: more of the array fits in cache at once.
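To make the bitmap versus byte array trade-off in this thread concrete, here is a minimal sketch of both storage schemes. The class name `BitFlags` and its methods are made up for illustration. The packed version uses one eighth of the memory but pays for a shift and a mask on every access; which one wins on time depends on the language and the cache sizes involved, so measure rather than guess.

```python
class BitFlags:
    """Fixed-size array of booleans packed 8 per byte (hypothetical helper)."""

    def __init__(self, size):
        self.size = size
        # Round up to whole bytes; all flags start cleared.
        self.data = bytearray((size + 7) // 8)

    def get(self, index):
        # Pick the byte (index // 8), then the bit inside it (index % 8).
        return (self.data[index >> 3] >> (index & 7)) & 1

    def set(self, index):
        self.data[index >> 3] |= 1 << (index & 7)

    def clear(self, index):
        # Mask with 0xFF so the inverted value stays in byte range.
        self.data[index >> 3] &= ~(1 << (index & 7)) & 0xFF


n = 1_000_000
packed = BitFlags(n)      # one bit per flag
unpacked = bytearray(n)   # one byte per flag, plain indexing
print(len(packed.data), len(unpacked))  # 125000 vs 1000000 bytes of flag storage
```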
From Australia, keep these coming.
This was an excellent video, and you are fantastic teacher. I use C# in my daily job, and for what we are doing, its performance is acceptable. But the capabilities of C++, as you have demonstrated, are amazing.
Glad it was helpful!
Cool! I got into coding on programmable calculators (TI-57). A buddy and I would compete to see who could write the same projects within the limited memory constraints of the calculator, and with the least number of steps. It was a great way to learn by trying different methods of coding to reduce the step count. Later I went on to become a telecommunications engineer and he went on to become a hard core programmer from assembly up. He would write code, then compile and link, in different languages, and benchmark sections of each by counting clock cycles, and would often rewrite the problem parts in assembly code to speed things up (he specialized in optimizing video drivers). These days I design and build test systems that have to communicate with other equipment that have different response characteristics so to help optimize the code I put timer marks that I can turn off and on at startup that help me find the bottlenecks. It would seem this is a lot of work for a little slice of time but since the code runs thousands and thousands of time every day, those little slices add up to a lot of time by the end of a year worth of testing. That said, your Software Drag Race video should be the poster child of software development for anybody running code in a high volume production environment. Thank you sir!
Yes, please do the M1 Mac!
This is such a great channel.
Amazing video. Btw would highly recommend taking a look at other Python interpreters like PyPy, or at libraries like Numba, or at Cython. The last two options will impact the flexibility that made me love Python, but the results are totally worth it. Especially Cython.
I'm currently running from the police. I just broke out of jail to watch your video, and I have to say, it was worth it.
The old programmers had the brain, I felt like what do I know about how things work, Thanks a lot for sharing your experience sir.
the only thing I would like to say in accordance with your request for information on better practices to use in your code would be to: "stop" + variable.doing + "this" with your strings.
in both Python 3.6 and up as well as in C# you can prefix your strings to add variables directly into them. I do also believe it is a little faster execution wise, also being easier to read.
The prefixes are as follows. for Python it is f or F, and in C# it is $
in C# it would look like this: Console.WriteLine($"Hello, {name}! Today is {date.DayOfWeek}, it's {date:HH:mm} now.");
and in Python it is much the same, except that dates use strftime codes: print(f"Hello, {name}! Today is {date:%A}, it's {date:%H:%M} now.")
i am unsure if C++ has something like this though 🤷♂️
hope i was not being mean 😅
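A small runnable version of the Python side, for comparison with concatenation. The values of `name` and `date` are placeholders, and note that Python's `datetime` takes strftime codes such as `%A` and `%H:%M` in the format spec rather than the C# style `HH:mm`.

```python
from datetime import datetime

name = "Dave"                         # placeholder value
date = datetime(2021, 3, 25, 14, 30)  # fixed date so the output is reproducible

# Concatenation: every non-string piece needs an explicit conversion.
concatenated = (
    "Hello, " + name + "! Today is " + date.strftime("%A")
    + ", it's " + date.strftime("%H:%M") + " now."
)

# f-string: expressions and format specs go inline inside the braces.
formatted = f"Hello, {name}! Today is {date:%A}, it's {date:%H:%M} now."

print(formatted)
```

Both variables end up holding the same text; the f-string is just easier to read and harder to get wrong.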
C++ has printf for inline variables carried over from C
I created a similar program using assembly language on a ZX Spectrum with 48 kB of memory to see how fast it could fill the memory (up to about 260000), I was surprised it only took a few seconds. Note that this was almost 40 years ago!
ROFL. Insane! Insane how much compute resource is getting wasted.
Yep very interested to see it tested on other chips, especially the M1. Now I'm curious about how Rust will perform as well as something functional like F#
In Rust we will then write a module that will call special CPU-specific optimizations and beat F# the fuck out of the water. That's what's so nice about systems programming languages. They can go deep into the CPU, while bytecode languages cannot do that so easily.
std::vector<bool> does pretty much exactly what you're trying to do with bit arrays. It will typically pack 8 values per byte and do some bit manipulation to access them as if it was a standard C array.
Yup, next episode does exactly that!
I feel like you could get the C# faster by not using the heap and going stack only and Span slicing or something.
You can optimize the c++ code even further by prepending ++ instead of appending it to iterators. This way an unnecessary copy creation will be avoided. Also building the program as a release build will give a performance boost. Instead of % 2 using & 1 may also result in a performance improvement since it only requires checking the least significant bit to know about evenness :)
LOL compilers are not that stupid not to optimize that.
You're right about release build instead of debug build.
@@johnatan_does do they also optimize the postfix ++ for primitive types?
When I got started in computing, the "drag race" would have been Assembler vs. FORTRAN vs. interpreted HP BASIC. For a one-off problem, BASIC was usually the correct choice, because the program could be written, debugged and would have solved the problem at hand before the other two had been debugged, even though it executed at an order of magnitude slower speed than the other two choices.
In the real world of engineering as a profession, one's employer doesn't care how elegant the solution is. All they care about is how quickly you can produce the correct result with the tools at hand, and the total time to the solution includes how long it takes to set up the tools.
That's why most engineering coding is in Matlab, with its mature 20-year-old JIT compiler. Cross-checking the codegen C results within the Matlab environment is the key.
Oh my god, referring to class instance as "this" in python feels so wrong to me
I hate the indentation of python. In any language....
@@HermanWillems Why?
@""this" in python feels so wrong to me".
You are in luck, you can use _any_ name for the class instance that you want to. 'this' is just a convention.
I am a beginner in Python. Pls explain to me then what "self" does mean in python? I did not find "this" keyword in python, what does it do? I found it in c#, Java, JS.
@@AIII_Brainwashed 'self' and 'this' aren't actually keywords in Python. The first parameter of a method receives a reference to the instantiated object. 'self' is the standard name for it in Python, but it can be any valid identifier you want.
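A quick way to see this for yourself. The `Counter` class is just a made-up example: one method names the instance parameter `this`, the other uses the conventional `self`, and both work the same.

```python
import keyword


class Counter:
    # The first parameter receives the instance; "this" works, "self" is the convention.
    def __init__(this, start):
        this.value = start

    def increment(self):
        self.value += 1
        return self.value


c = Counter(5)
print(c.increment())              # prints 6
print(keyword.iskeyword("self"))  # prints False: "self" is an ordinary name
```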
I was hooked by Pascal's wager at the end. I subscribed. :-)
Dave, very nice production values: camera, lighting and sound.