If it lowers the difficulty in being able to make code that can run on GPUs and mixed use cases, I'm all for it. Still, it being signup-only feels very weird right now.
I think Mojo is basically a proof of concept/best showcase for MLIR, and what better accessible lang to superset, to be honest. Very exciting project, and also very curious and excited about what MLIR can accomplish for other languages.
Python is chosen because of its ease of use and libraries which take care of things for us. If we add all these specialist language constructs back into it, have we just undone that ease of use; is it still easily understandable; or does it provide a reasonable pathway from noob to expert?
Probably the idea is that it can be used by those creating the languages and libraries. Currently many Python libraries are implemented in C, C++ and Fortran. Therefore, if it is possible to write fast Mojo code, the library could just be written in that, reducing the hurdle of linking different programming languages.
He's a well-known person in the deep learning community, but I would say in order to compare, you could compare numpy vs Mojo for matrix multiplications, dot products, etc.
Kind of reminds me of Common Lisp with one of the several approaches to integrate Python. For integrating Python in a faster environment ofc, not syntax. SBCL even has a lot of nice SIMD stuff, native threads and green threads, and has nice interactive development and debug tools. You can also optionally declare types, which does impact performance. It could use a better package manager and some better project tools.
This is awesome and I'm really looking forward to when it gets released. But it is a marketing stunt. You should compare it to something like numpy with multithreading. Probably still 10-50x faster, but no one who has the slightest idea about numerical calculations in python uses for loops.
I can't wait for the next stage of new programming languages, like Writescript, a superset of Typescript; C+++, a superset of C++; and let's not forget GoGo, a superset of Go.
@13:00 I felt like there was a code that just entered your being. And you had that revelation of macro expanding life altering script that subtly changes ones life. Not all in one call but gently nudgingly like a hash or crc or plane like CUDA where the code of the Jedi master is shared with everyone everywhere with everything .. EVERYTIME
I don’t get the criticism that they picked a bad feature of Python to compare against (i.e. the for loop). In my mind, it’s fantastic that it improves on what Python does badly. I don’t see why anyone would use it if it just improved on libraries that already use C or C++ under the hood like numpy or pandas. The whole idea for me is that, by using Mojo, you get a better version of Python in all cases and especially the most basic ones (e.g. just a simple for loop) without having to learn any new syntax. I think Mojo will be great for already good Python developers who already use type hints. Although I am a bit salty that Mojo doesn’t use the same case for types as Python (e.g. Int vs int). I don’t think Mojo is trying to replace Rust or C++. The jobs most current Python users do simply aren’t the same as what Rust and C++ users do (unless for some reason you work at a company that uses Python for backend or game engine development). Mojo is supposed to make data engineering, data analysis, data science and ML work better. No one was really using Rust for that.
I actually work on my backend in Python! Our web server uses Flask and is pretty good! The DX of making a backend with Python is amazing, and IMO a lot better than JavaScript. Also, the production web servers, if I am correct, are written in C++, so you don't get as much of a performance penalty.
@@fueledbycoffee583 I would prefer a backend in python than JS too, but I know that doesn’t scale well and the dynamic typing is problematic. Not saying your company is bad or anything. It’s just that massive companies aren’t likely to extensively use Python on the backend.
@@playea123 we do have a rule: all Python must be written in a typed way. We extensively use data classes, enums and validators so we don't shoot ourselves in the foot as much as possible. Since our backend is a big thing, we must do it that way, because without it, it would be a tangled mess.
Ironically, we don't use TypeScript because we arrived at the conclusion that TypeScript is not a good type system for JS. It's highly subjective, but we don't enjoy the type system of TS. We get along pretty OK with vanilla JS.
It is only very fast if you have very fast hardware for it. Auto-tune may work in the bootstrap code that measures which settings give you the fastest result. And maybe this could be changed dynamically during the running of your project.
7:20 I have used it a lot, through a custom Postgres extension written in Rust using the awesome PGRX framework. When you've got a good fit for SIMD and the tools to easily apply it the performance improvement is like going from Python to C#
I’m sure it’s already been pointed out, but SIMD instructions are sized specific to the registers they can handle, and some architectures aren’t actually flexible: if you don’t have data that fills the register when issuing on a GPU, then you pad with 0s.
5:45 most operations you do with just floats (if you are actually writing low-level fast code) are probably going to be memory bottlenecked, so even if in theory an operation would take a few extra machine cycles, from my understanding it could still potentially be faster to use f32s because they take up half the memory bandwidth of f64s.
This is so funny. There are languages like Rust and Julia which have bindings to all sorts of neural network frameworks which are really fast. I don't know why these people think it's a great idea to reinvent the wheel without using these other languages first. Julia is almost as fast as C in some cases, and it's got all sorts of really cool symbolic math features in it.
I think for the 32/64-bit operations, it all depends what the compiler does. On a 64-bit machine the compiler may simply use a 64-bit float underneath and be done with it. I remember being surprised at work that going from 16 to 32 bit was little or no problem, but when moving to 64-bit machines *EVERYTHING* had to be on a 64-bit word boundary; when it wasn't... BAM! EXCEPTION!!!! So the easiest solution was to tell the compiler to automatically align everything internally on 64-bit alignments, and most everything would work. If I'm remembering correctly, there were some places the auto-align didn't work, more often when doing bit operations, and most of the problems were with structs/unions/classes (I was working in C/C++ at the time). I would think Mojo or other things these days might treat this as a hidden detail under the covers that they simply fix up for you automagically.
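For anyone curious, the alignment/padding behavior described above can be observed from Python's stdlib `struct` module (a small sketch; exact native sizes depend on the platform ABI):

```python
import struct

# '=' means standard layout with no alignment padding: 4-byte int + 8-byte double.
unpadded = struct.calcsize("=id")   # always 12

# '@' means native layout: on typical 64-bit ABIs the double must sit on an
# 8-byte boundary, so padding bytes are inserted after the int.
padded = struct.calcsize("@id")     # 16 on x86-64 Linux/macOS, for example

print(unpadded, padded)
```

The padding is exactly the compiler "automatically aligning everything internally" that the comment describes.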
That's a VERY sexy product they got there. I really need my manual memory management and the syntax is still stupid, but it's a step in the right direction I think.
True, but numpy is actually written in C, C++, Cython and Fortran, and this is the point: how can you author fast Python libraries/code without using these languages?
Hey Prime, serious question (forgive me for asking before watching the video): I've heard Mojo's aiming to replace Python as the AI "vehicle" language, but what's the point if the heavy lifting is done by the CUDA/GPU stuff? How much realistically (5, maybe 10%?) can you speed up by replacing the non-GPU-related things?
@@TCH534 yes, the "heavy lifting" I referenced. All the big bulky matrix multiplication stuff is done on GPUs, and is the vast majority of any workload. Python is there just as a high-level script for ease of use.
it's not free to call other languages, esp from python, nor are the type conversions (which is why the 'serious' python libraries force you to commit to types). The beauty of mojo will be for researchers to set fire to fewer trees, with less effort.
@@jereziah that's not gonna make the whole thing thousands or even hundreds of times faster. It's gonna make 5-10% tens, maybe hundreds of times faster.
@@markusmachel397 I feel like if anyone cared about that sort of margins(and I don't think it's 10% always... it's like... 1-5% if I think really hard about it, and it's skewed heavily towards... 1%, maybe 1.5% being at the mean), we'd see a lot more implementations being... C/C++/Rust with pure compute shaders. If we're talking squeezing every little drop of performance. I don't have hard data for this, but looks like convenience and just "roll it and shove it at hardware" is the approach. ...like with so many things these days... Seems to work not that all bad though. (though who can tell how it would work at the peak perf, def. not me)
The only thing I dislike about Mojo is the name. But that being said, the fact that it is a superset of Python makes it so that even if I only use it 1% of the time, that it is still worth learning the extra bit of syntax. It does what Julia promised, but it actually understood the assignment: you are competing to get the Python folks, who often are not programmers but scientists, so make the transition as easy as possible. In this case, the transition is instantaneous, as you can still use all your tools. That actually makes me sad a bit, because I like the name Julia much more than I like Mojo, but oh well.
I'm not a CPU expert either. But with regards to float32, the FP is done in the SIMD registers. Most likely Mojo will convert things to use packed SIMD where possible, and you can fit twice as many FP32s in a SIMD register as FP64s. Loading a single FP32 or FP64 is likely memory bound, so using FP32s means you can have more in cache at the same time. I guess an expert (i.e. not me) doing proper benchmarks will give a better picture.
Did I miss something, where in the code examples did it imply a +35,000X speedup, I'm only seeing +4,000x at most, not a dig, but just where is it? Also, does the +14x speedup imply that the differences in hardware between Fireship and the other dev's computers at compilation affected the outcome of their code tests?
The autotune feature was low-hanging fruit. I'm actually stunned it's a new concept to people because it's such an old one for me. I just assumed the compiler was already doing it.
This is the first I'm hearing about mojo. I wonder how it compares in performance to julia, which is really supposed to target the same audience (in most ways) and has been around for a little bit longer. I have used julia for a while and it is very easy to write incredibly performant code, often the compiler is good enough to do some basic simd on the arrays you pass in. I would love to see how they go toe to toe.
The techniques in Mojo will proliferate into other, more popular languages, but I don't buy this Apple hype train one bit. If they changed how we think about how we can more easily reason about programs and prototype, I'd buy into it.
If Mojo can sustain these speedups, even just the base 8x gains, it could save lots of money when it comes to heavy computations when training big models.
The only semi-useful part of the demo is where they did the same exact code in Python vs. Mojo, and even that was biased against Python. Let's see the most performant Python vs. the most performant Mojo. After all, Mojo isn't competing with Python, it's competing with things like Cython. Still, I'm curious about this language and what it brings to the table in terms of drag-and-drop performance improvements to Python code and an excuse to write Rust code while tricking my employers into thinking I'm writing Python.
I have done this parallelized matmul in C and this looks nicer but also not as hardcore, and I'm all about that hardcore. I don't want to call 'simd', i want __m512d _mm512_mask_mul_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int r) and nothing less
The other thing to consider is that by restricting this stuff to a subset of Python (in terms of what can be optimised), and not allowing precise low-level stuff like e.g. C, there are potentially more optimisations available since Mojo doesn't need to worry about pointers and other stuff. (Roughly: The more your language can do, the less the compiler can assume about a program's behaviour.) The average data scientist using this could quite likely end up with something faster than what your average C programmer could do in C (as said average C programmer likely knows less about optimising numerical code than the authors of Mojo). I look forward to seeing where this ends up.
float32 and float64 are computed on amd64 in the same FPU with 80 bits of precision, so the time (latency) to compute is the same, but, if aligned properly, the CPU can parallelize internally, so you can always do twice as many float32s as float64s. Thus float32 generally gives higher throughput, but both have the same latency.
plus the point of SIMD is to do identical operations on multiple values, and if you are using a f64 rather than f32 that takes up twice the amount of slots for the SIMD operation
A major part of making Python 35,000x faster had to do with just how awfully slow it is to begin with. It seems like, without using the integrated multithreading/parallelize stuff in Mojo, it was still only getting Python up into the realm of JS speed.
Every time Prime streams I just imagine a ranch with a bench and Prime sitting on it, caught in a 2-hour heated debate with himself, trying to desperately convince a field of grazing donkeys what the best software engineering paradigm is. Only kidding, but the "unga bunga" chats made me think of that, love your streams my man!
Imagine if you did that level of optimization in assembly 😂 the processor will be chilling at cold temperatures in the corner and you brain and fingers will catch on fire 😂 , imagine doing it in native binary instead 💀 but it will be rewarding in terms of performance
Mojo is using MLIR and LLVM under the hood. The authors of Mojo are compiler experts, so it's not surprising MOJO is so much faster. The real reason why MOJO has auto-tune is it makes it trivial to target different hardware. NVidia 4 series cards have thousands of cuda cores, so the speed up is going to be even higher than CPU cores. Then there's TPU and the other AI accelerators.
Maybe "MLIR" is something else, but I would guess that this is already happening with most languages, and ESPECIALLY it's already happening - and unavoidable - on GPUs. Rust, and also clang and most other modern compiled languages I know of, has its own intermediate representation. It does some optimizations on that, and then that goes to LLVM, another intermediate representation, and LLVM then compiles to machine code. That's 2 intermediate representations already. Then, when you want to run stuff on a GPU, modern APIs like Vulkan, Metal or DX12 are targets from some default shader language that has a first-party compiler (often C dialects), though there are also other frontends, like e.g. Rust (an experimental community project). That then gets compiled to an intermediate language that is part of the API - for Vulkan that is SPIR-V - and that then usually gets compiled again to a vendor-specific representation that actually gets executed.
It's a good thing. Give better tools to the noobs who only understand Python. At least they can learn about types and see that you get vastly improved performance by actually understanding data types. It is all machine code in the end. This is a good stopgap or learning language before systems programming.
(I'm at 14:09) The one benefit I see is that it's Python but has all the quality-of-life features of C. Even if it isn't for AI, it's basically Python that isn't as gimped, if you've used lower-level languages.
Well, according to the playground and documentation, it works with all Python libraries and isn't only for machine learning, so we will see what the future holds.
Thanks @ThePrimeagen I just burned all my rust lang books and am eagerly awaiting your mojo merch and future everything becomes a mojo convo, also can we agree that we measure dicts with the same measurement we use for horses. Hands, how many hands is your dict?
Python is versatile. If you need speed you have Cython; if you need a full-stack web-app platform then you have Anvil to run Python in the browser and server side; you have Numpy, TF, PyTorch, Pandas, Plotly. I hardly use C/C++/VB. SQL stood the test of time, still using that since the 90s...
4:11 I've heard that fancy new algorithms, in spite of having better big-O complexity, aren't really cache-friendly, so in practice the standard algorithm won't be slower, at least. But not 100% sure that it's true.
Mojo Python is compiled; at the moment you need WSL if you have Windows. Waiting for a Windoze version, but RW read rights twins... now have X as a buddy (what a Twatter MuskRat?)
Autotune feels like it's essentially JIT? Also, once you understand the superset, is it really Python? Type systems, structs, etc. aren't just some syntactical sugar and take a bit of learning to truly understand.
It seems like the main difference between JIT and autotune, as I understand them, is that JIT will do extra compiling work and cache the results at runtime, based on what parts of the code are being run in the interpreter most often and thereby using up redundant processing by being interpreted over and over, whereas autotune is actually compiling a given section of code a few different ways at compile time, measuring what the performance is like on that particular system, and including that in the rest of the compiled code. I’m not an expert in either feature though, and the Just In Time compiler implementation probably varies across languages.
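A toy sketch of the distinction described above (plain Python, just to illustrate the concept; all names here are made up and this is not how Mojo actually implements it): an "autotune" step times each candidate implementation once on this machine, then commits to the winner, instead of re-deciding at runtime like a JIT would.

```python
import timeit

def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

def autotune(candidates, sample):
    # Measure each variant once on a sample input, then pick the
    # fastest one up front for the rest of the program's life.
    return min(candidates,
               key=lambda f: timeit.timeit(lambda: f(sample), number=50))

sum_fast = autotune([sum_loop, sum_builtin], list(range(1000)))
print(sum_fast(list(range(10))))  # 45, whichever variant won
```

Mojo's autotune does this kind of search over things like SIMD widths and tile sizes at compile time; the principle is the same.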
I think you're wrong on the float32 thing. At least generally. A compiler can actually recognize that the numbers are 32-bit and pack 2 into a single 64-bit register, which can improve performance.
In the intel world at least the FPU does 80bit internal operations and you read / write either 32bit or 64bit in a single instruction either way. SIMD is packed stuff.
If you want any language have good performance numbers, then compare it to Python.
classic w
Lol facts
MY language (slj) is 4x faster than python (really)
(it's 18x slower than C tho)
Python is not fast lol, do better research
@@suarezlifestyle exactly why you gotta compare it with python
Can’t wait for python programmers to evolve into Mojo programmers who just never use any of the new stuff, but now can say the language they write in is using modern process optimisations and cache-efficient data structures.
Kinda like C++
LMAO why violate them like that!
Did people forget about Julia & LISP? You could have easily used macros (compilers) to turn one high-level S-expression format into an intermediate S-expression format into C S-expressions. Even in Python today you have @numba(nopython) to turn static-looking Python that looks like C code into C code and compile it.
Why don't people write their code in terms of relational constraints? Like: I need this and this and this, then use chatbots and solvers to generate code and custom hardware for your application. A good way of transforming your constraints to code is Lisp's s-expressions and quasi-quoting, which sounds like Julia.
@@aoeu256 Too complex, high-level scripting is the closest you'll get to that. You're talking about something that exceeds semantic language and symbolic language, conceptual language. Doable but will likely involve probability, don't think LLM are it though.
From Pythonistas -> Mojicians.
I'm looking forward to mojo. Anything Latner touches turns to gold.
Mhm how does it shake you?
It’s not about the size of your SIMD it’s what you do with it
facts
As long as it can measure dicts, I’m happy.
@@SpaceChicken Look let’s not turn this comparison video into a dict measuring contest.
If it performs better in the worst case, like for loops, isn’t that a benefit?
@@SpaceChicken Salutations, extraterrestrial avian
Cache-line-sized vectors being their own type is a pretty brilliant idea. It probably allows even better performance compared to manually doing that, but also just reduces typing.
Aligning types to the size of caches in general is a great optimization. People most often think of optimizations in terms of algorithmic big-O things, but the reality is that making something 8x faster is a substantial speedup. Making something even 10 or 30% faster will shave roughly 6 to 18 minutes off something that takes an hour.
This new watchmojo language is looking really cool, wish I could use it to compile rust
Numba, a JIT compiler package for python, seems to do a good portion of what Mojo promises. I regularly get big speedups over numpy using it, particularly because it can auto-parallelize both native python loops and many numpy function calls.
That is basically Cython with some vectorization steroids, which could be implemented in Cython given engineering resources.
@@ckmichael8 But Numba doesn't require as high IQ. If you can use numpy you can get C-ish performance in a single function with just a decorator. It's finicky with any argument not a boolean, numerical or a numpy vector thereof though.
@@yeetdeets Yes, you are right. I think the use case for Cython and Mojo alike is for things that Numpy does not support yet, like new algorithms that cannot be efficiently expressed in existing Numpy functions. If there is a numpy way of doing it, then Numba is certainly the better way. But then for research things like new ML algorithms, there may be no existing implementation available at all, so a Cython/Mojo implementation would be required.
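For reference, the "just a decorator" Numba workflow mentioned in this thread looks roughly like this (a hedged sketch: the fallback shim is mine so the snippet still runs where Numba isn't installed; with Numba present, `njit` compiles the loop to machine code):

```python
try:
    from numba import njit          # real decorator if Numba is installed
except ImportError:
    def njit(func):                 # hypothetical no-op stand-in otherwise
        return func

@njit
def dot(a, b):
    # A plain element-wise loop of the kind Numba can compile to C-like speed.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

print(dot((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)))  # 32.0
```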
Good stuff dude, I find your content in the land of devs on YouTube very unique. Keep it up!
Great content as always, keep up the good work man!
ty
did some 5min optimizations by using numpy and got it to be 1400-1800x faster than the example he provided. Still, if i can continue to code in python and make it faster, and have strong types, then i see this as an absolute win lol
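That kind of "5-minute numpy optimization" usually just means replacing an interpreted per-element loop with one vectorized call (a generic sketch of the pattern, not the example from the video):

```python
import numpy as np

def scale_loop(xs, k):
    # One interpreted Python iteration per element.
    out = []
    for x in xs:
        out.append(x * k)
    return out

def scale_numpy(xs, k):
    # One call into numpy's compiled C loop.
    return (np.asarray(xs) * k).tolist()

data = list(range(5))
print(scale_loop(data, 3))   # [0, 3, 6, 9, 12]
print(scale_numpy(data, 3))  # same result, far faster on large inputs
```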
Anyone else notice that their Python performance benchmarks are for Python 3.10? Python 3.11 is supposed to have some major speed improvements.
For Python 2 you could have used Psyco; there's Numba, and Julia and stuff.
In Rust, you can pin a shared buffer, and dispatch slices from it to each core. That’s basically what I’m expecting that Mojo code to actually be doing.
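A rough Python analogue of that dispatch pattern, for illustration only (names are made up, and CPython's GIL means this shows the shape of the approach rather than a real speedup; the point is that workers on disjoint slices need no locks):

```python
from concurrent.futures import ThreadPoolExecutor

def square_chunk(buf, start, stop):
    # Each worker owns a disjoint slice of the shared buffer,
    # so no synchronization between workers is needed.
    for i in range(start, stop):
        buf[i] = buf[i] * buf[i]

data = list(range(8))
workers = 4
step = len(data) // workers
with ThreadPoolExecutor(max_workers=workers) as pool:
    for w in range(workers):
        pool.submit(square_chunk, data, w * step, (w + 1) * step)
# Leaving the with-block waits for all workers to finish.
print(data)  # [0, 1, 4, 9, 16, 25, 36, 49]
```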
interesting
ngl as someone learning python as part of my topology degree mojo looks really tempting
especially once they busted out the Mandelbrot
f32 is directly supported in almost all SIMD ISAs. f64 reduces the number of components (in 128 bits, you can have four 32-bit floats, but only two 64-bit floats).
ah, very interesting
@@ThePrimeTimeagen a lot of ML programs use F16 as well but that might be more related to memory savings than speed
@@fakenameforgoogle9168 While it's obvious that it's smaller, the real savings is in terms of speed. GPUs commonly use F16 for both reasons.
@@fakenameforgoogle9168 Even f8 recently
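The lane-count arithmetic from this thread, spelled out (assuming a 128-bit register as the example; AVX2 is 256 bits and AVX-512 is 512, which just scales the counts up):

```python
import numpy as np

REGISTER_BITS = 128  # e.g. SSE or NEON vector width
for dtype in (np.float64, np.float32, np.float16):
    lanes = REGISTER_BITS // (np.dtype(dtype).itemsize * 8)
    print(f"{np.dtype(dtype).name}: {lanes} lanes")
# float64: 2 lanes, float32: 4 lanes, float16: 8 lanes
```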
Amazing stuff. Still I wonder...how is it fair to compare Mojo with plain python when numpy is basically a part of python itself at this point? Numpy often outperforms even Julia (for large arrays).
They also have comparisons to optimized numpy implementations and still achieve 2.5x over numpy. Also note that Mojo is being built as a heterogeneous language, meaning that it should be pretty straightforward to utilize GPUs or other accelerators. Having all this in one single coherent package is a very big deal.
Hey, on tiling: this is necessary to keep the processor cache hot. The classical example is inverting the index of the two loops in a matrix-vector multiplication. The parallel algorithms for the same operation can be tuned by sizing the chunk of the matrix you are operating on. This becomes even more critical when you add another level of locality by using an accelerator like a GPU or when working in an MPI cluster.
You gotta feed the beast, especially when you're extracting all the juice out of your CPU with SIMD
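To make the tiling idea concrete, here's a pure-Python sketch (helper names are mine; interpreter overhead swamps any cache effect in Python, so this only illustrates the blocked loop structure and that it computes the same answer as the naive version):

```python
def matmul_naive(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):          # k before j: streams through B's rows
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t):
    # Operate on t x t blocks so each block can stay resident in cache.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + t, n)):
                            C[i][j] += aik * B[k][j]
    return C

n = 4
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float((i + j) % n) for j in range(n)] for i in range(n)]
assert matmul_tiled(A, B, n, 2) == matmul_naive(A, B, n)
```

In a compiled language the tiled version wins once the matrices outgrow the cache; tuning `t` per machine is exactly what autotune-style features automate.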
No, you are wrong about f32 and f64. On the CPU side of things (which the example is on), float is always faster, and all games, all CAD, all high-perf code always use float instead of double unless precision errors crop up. This is even true without SIMD, but with SIMD things are even worse, because you can either do 2x as many operations on floats or 1x as many on doubles, which literally halves the speed!
Also, in the past there was a case where 8-bit loads were slower than 64-bit ones, because certain (in most cases RISC) CPUs could not even address smaller memory. But even there this has usually stopped at 32 bits, so 32-bit values you can manipulate even as integers just like you would 64-bit ones, even on ARM usually. So even for integers it is worthwhile to use 32 bits instead of 64 - for example, there are builds of 64-bit Linux kernels that enable 32-bit pointers, and those are much better if you don't have more than 4GB of memory but otherwise want to do 64-bit ops. Also, in many high-perf codes of mine I tend to store indices instead of pointers, because indices can be stored in 32 bits, and thus the "pointer-ish" part of the data eats literally half the data cache - and yes, memory is plentiful, but caches are very limited.
What you say had some merit in the past, but in current state-of-the-art high-performance codebases it is actually bad advice to use double, unless of course float errors kill your algorithm.
The story also differs a lot on GPUs, but traditionally GPUs also massively favor float over double, so in most cases it's faster there too. I don't follow every GPU architecture on the CUDA side and all that, but there, sometimes using the bigger type is better: on GPUs that do not support float16, it gets emulated with float32, which is bad. If the GPU does support float16 and similar formats, however, those can be immensely faster for machine learning, if float errors allow it, so code usually just asks the API whether 16-bit floats are supported and uses them if so.
It's good to have this language because Python is extraordinarily slow... Extremely... Only good for glue code and fast hacking, but sometimes even the glue part is slow, so this is a good development. I don't know what the results would be, though, if he compared to, say, Cython or something that is compiled ahead of time...
From what I know about GPUs, FP32 is the main focus in gaming while FP64 is for compute applications. Gaming cards often lock down FP64 performance, so you need to buy a workstation card to get full performance. Sometimes different architectures are used: right now AMD's RDNA is optimized for gaming with slow FP64, while CDNA is optimized for compute with very fast FP64. I think some consumer GPUs, like Intel's, have no hardware FP64, so it isn't used much in client applications.
It is my understanding that lower precision is becoming more and more important thanks to its use in machine learning with architectures getting improved performance for FP16, INT8 and even FP8 on Nvidia Hopper.
25:00 "will work on exciting projects like Excel spreadsheets, data entry, and *building hyper-intelligent armed robots* "
Basically, if you exclude a chunk of what Python can express, what remains can be made very efficient. So add a little syntax to allow you to ring-fence stuff that you want optimised. Makes a lot of sense.
If it lowers the difficulty of writing code that can run on GPUs and mixed use cases, I'm all for it. Still, it being signup-only feels very weird right now.
You're signing up for the playground right now, if I read the site correctly, not the language. It made more sense to me after trying to jump in myself
hopefully not gonna be proprietary
i think mojo is basically a proof of concept/best showcase for MLIR, and what better accessible lang to superset, to be honest. very exciting project, and also very curious and excited about what MLIR can accomplish for other languages.
Python is chosen because of its ease of use and libraries which take care of things for us. If we add all these specialist language constructs back into it, have we just undone that ease of use; is it still easily understandable; or does it provide a reasonable pathway from noob to expert?
Probably the idea is that it can be used by those creating the languages and libraries. Currently many Python libraries are implemented in C, C++ and Fortran. If it is possible to write fast Mojo code, those libraries could just be written in Mojo instead, reducing the hurdle of linking different programming languages.
He's a well-known person in the deep learning community, but I would say in order to compare you could benchmark numpy vs Mojo for matrix multiplications, dot products, etc.
And try comparing to a cython compile too.
@@EwanMarshall yea its always easier said than done, but lets hope that works
Also @numba.jit(nopython=True), @rpython, @julia (if it exists)...
Kind of reminds me of Common Lisp with one of the several approaches to integrate Python. For integrating Python in a faster environment ofc, not syntax. SBCL even has a lot of nice SIMD stuff, native threads and green threads, and nice interactive development and debug tools. You can also optionally declare types, which does impact performance. It could use a better package manager and some better project tools.
This is awesome and I'm really looking forward to when it gets released. But it is a marketing stunt. You should compare it to something like numpy with multithreading. Probably still 10-50x faster, but no one who has the slightest idea about numerical calculations in python uses for loops.
I can't wait for the next stage of new programming languages, like Writescript, a superset of Typescript; C+++, a superset of C++; and let's not forget GoGo, a superset of Go.
@13:00
I felt like there was a code that just entered your being. And you had that revelation of macro expanding life altering script that subtly changes ones life. Not all in one call but gently nudgingly like a hash or crc or plane like CUDA where the code of the Jedi master is shared with everyone everywhere with everything .. EVERYTIME
at the beginning of this video i'm really hoping this makes me want to use my mojo playground access, but there is fear in my heart
yeah, it seems neat
i am pleasantly surprised
U friends with the president or smthing?
I don't get the criticism that they picked a bad feature of Python to compare against (i.e. the for loop). In my mind, it's fantastic that it improves on what Python does badly. I don't see why anyone would use it if it just improved on libraries that already use C or C++ under the hood, like numpy or pandas. The whole idea for me is that, by using Mojo, you get a better version of Python in all cases, and especially the most basic ones (e.g. just a simple for loop), without having to learn any new syntax. I think Mojo will be great for already-good Python developers who already use type hints. Although I am a bit salty that Mojo doesn't use the same case for types as Python (e.g. Int vs int). I don't think Mojo is trying to replace Rust or C++. The jobs current Python users mostly do simply aren't the same as what Rust and C++ users do (unless for some reason you work at a company that uses Python for backend or game engine development). Mojo is supposed to make data engineering, data analysis, data science and ML work better. No one was really using Rust for that.
I actually work on my backend in Python! Our web server uses Flask and is pretty good! The DX of making a backend with Python is amazing, and IMO a lot better than JavaScript. Also, the production web servers, if I'm correct, are written in C++, so you don't get as much of a performance penalty
@@fueledbycoffee583 I would prefer a backend in python than JS too, but I know that doesn’t scale well and the dynamic typing is problematic. Not saying your company is bad or anything. It’s just that massive companies aren’t likely to extensively use Python on the backend.
It's closed source
@@playea123 we do have a rule: all Python must be written in a typed way. We extensively use data classes, enums and validators, so we shoot ourselves in the foot as little as possible. Since our backend is a big thing we must do it that way, because without it, it would be a tangled mess.
Ironically we don't use TypeScript, because we arrived at the conclusion that TypeScript is not a good type system for JS. It's highly subjective, but we don't enjoy the type system of TS. We get along pretty OK with vanilla JS.
as a web developer, this video seemed like Egyptian hieroglyphs to me ngl
My team works on Tammy AI. Does Mojo has an API we can test?
"What kind of BS measurement are they doing?" best question ever.
It is only very fast if you have very fast hardware for it.
Auto-tune may work as bootstrap code that measures which settings give you the fastest result. And maybe this could even be changed dynamically while your project is running.
I also wrote a ton of heuristic optimization algorithms like 8 years ago, but mine were in Matlab...
since mojo is just python.
Python devs can safely put 10+ years of Mojo experience on their resume.
7:20 I have used it a lot, through a custom Postgres extension written in Rust using the awesome PGRX framework.
When you've got a good fit for SIMD and the tools to easily apply it the performance improvement is like going from Python to C#
You know the saying, "if it's too good to be true..."
I'm a simple one I just wonder will we ever run out of names for new programming languages?
cat ~/Politics/new_genders.txt >> ~/Programming/language_names.txt
Just recycle them, as these junk languages will never catch on.
Can't wait for this sounds awesome, and preciate your videos dude always a fun watch.
If it has good type system, reasonably fast (compared to C) and doesn't have a bunch of features in it then it should be a fine language
I'm sure it's already been pointed out, but SIMD instructions are sized specifically to the registers they can handle, and some architectures aren't actually flexible: if you don't have data that fills the register when issuing on a GPU, then you pad with 0's.
5:45 most operations you do with floats (if you are actually writing low-level fast code) are probably going to be memory-bottlenecked, so even if in theory an operation takes a few extra machine cycles, from my understanding it could still be faster to use f32s because they take up half the memory bandwidth of f64s.
This is so funny. There are languages like Rust and Julia which have bindings to all sorts of neural network frameworks and are really fast. I don't know why these people think it's a great idea to reinvent the wheel without trying these other languages first. Julia is almost as fast as C in some cases, and it has all sorts of really cool symbolic math features
I think for the 32/64-bit operations it all depends on what the compiler does. On a 64-bit machine the compiler may simply use a 64-bit float underneath and be done with it. I remember being surprised at work: going from 16 to 32 bit was little or no problem, but when moving to 64-bit machines EVERYTHING had to be on a 64-bit word boundary, and when it wasn't: BAM! EXCEPTION! So the easiest solution was to tell the compiler to automatically align everything internally on 64-bit boundaries, and most things then worked, though if I'm remembering correctly the auto-align didn't help in some places, more often when doing bit operations on structs/unions/classes. I was working in C/C++ at the time; I'd think Mojo and other modern tools just handle this as a hidden detail under the covers and fix it up for you automagically.
The Fireship video just came out a few days ago, and we're already reuploading it :(
That's a VERY sexy product they got there. I really need my manual memory management and the syntax is still stupid, but it's a step in the right direction I think.
We can use numpy and do that matrix multiplication without using any for loops. Also, numpy arrays are faster than normal lists.
True, but numpy is actually written in C, C++, Cython and Fortran, and this is the point: how can you author fast Python libraries/code without using those languages?
@@DataPastor yes
Hey Prime, serious question(forgive me for asking before watching the video): I've heard Mojo's aiming to replace Python as the AI "vehicle" language - but what's the point if the heavy lifting is done by the CUDA/GPU stuff? How much realistically(5, maybe 10%) can you speed up by replacing the non-GPU related things?
CUDA is going to be for the AI work in python.
@@TCH534 yes, the "heavy lifting" I referenced. All the big bulky matrix multiplication stuff is done on GPUs, and is the vast majority of any workload. Python is there just as a high-level script for ease of use.
it's not free to call other languages, esp from python, nor are the type conversions (which is why the 'serious' python libraries force you to commit to types). The beauty of mojo will be for researchers to set fire to fewer trees, with less effort.
@@jereziah that's not gonna make the whole thing thousands or even hundreds of times faster.
It's gonna make 5-10% tens, maybe hundreds of times faster.
@@markusmachel397 I feel like if anyone cared about that sort of margins(and I don't think it's 10% always... it's like... 1-5% if I think really hard about it, and it's skewed heavily towards... 1%, maybe 1.5% being at the mean), we'd see a lot more implementations being... C/C++/Rust with pure compute shaders. If we're talking squeezing every little drop of performance.
I don't have hard data for this, but looks like convenience and just "roll it and shove it at hardware" is the approach. ...like with so many things these days...
Seems to work not that all bad though. (though who can tell how it would work at the peak perf, def. not me)
The only thing I dislike about Mojo is the name.
But that being said, the fact that it is a superset of Python makes it so that even if I only use it 1% of the time, that it is still worth learning the extra bit of syntax.
It does what Julia promised, but it actually understood the assignment: you are competing to get the Python folks, who often are not programmers but scientists, so make the transition as easy as possible. In this case, the transition is instantaneous, as you can still use all your tools.
That actually makes me sad a bit, because I like the name Julia much more than I like Mojo, but oh well.
I'm not a CPU expert either. But with regards to float32, the FP is done in the SIMD registers. Most likely Mojo will convert things to use packed SIMD where possible, and you can fit twice as many FP32s in a SIMD register as FP64s. Loading a single FP32 or FP64 is likely memory bound, so using FP32s means you can have more in cache at the same time. I guess an expert (i.e. not me) doing proper benchmarks will give a better picture.
Did I miss something, where in the code examples did it imply a +35,000X speedup, I'm only seeing +4,000x at most, not a dig, but just where is it? Also, does the +14x speedup imply that the differences in hardware between Fireship and the other dev's computers at compilation affected the outcome of their code tests?
it's kinda fishy but the 35,000x speedup was for the mandelbrot set algorithm which they glossed over
15:33 "The servo mechanisms in my neck are designed to approximate Human movements. I did not realize the effect was so distracting." Data
The autotune feature was low hanging fruit. I'm actually stunned its a new concept to people because it's such an old one for me. I just assumed the compiler was already doing it.
Loved 'Mojo programmer, must have 10 years experience'.
This is the first I'm hearing about mojo. I wonder how it compares in performance to julia, which is really supposed to target the same audience (in most ways) and has been around for a little bit longer. I have used julia for a while and it is very easy to write incredibly performant code, often the compiler is good enough to do some basic simd on the arrays you pass in. I would love to see how they go toe to toe.
The techniques in Mojo will proliferate other more popular languages, but I don't buy this Apple hype train one bit.
If they changed how we think about how we can more easily reason about programs and prototype, I'd buy into it.
fifty years and no one can beat C still 😂
@@theinsane102 LOL so true!
You are fun. Thank you for bringing a smile to my face.
22:53 just to note, Python has __slots__ for static classes.
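For anyone who hasn't seen the `__slots__` feature the comment above mentions: it replaces the per-instance `__dict__` with fixed storage, shrinking instances and rejecting attributes that weren't declared. A minimal sketch (class names are invented for this example):

```python
class Point:
    pass  # ordinary class: every instance carries a __dict__

class SlottedPoint:
    __slots__ = ('x', 'y')  # fixed attribute set, no per-instance __dict__

p = Point()
p.x, p.y = 1.0, 2.0
p.color = 'red'  # dynamic attributes still allowed on ordinary classes

s = SlottedPoint()
s.x, s.y = 1.0, 2.0
try:
    s.color = 'red'  # rejected: 'color' is not in __slots__
except AttributeError:
    pass
```

Dropping the dict is what makes slotted instances both smaller and a bit faster to access, which is the "static class" flavor the comment is pointing at.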
If Mojo can support these speedups even just with the base 8x gains. It could save lots of money when it comes to heavy computations when training big models.
The only semi-useful part of the demo is where they did the same exact code in Python vs. Mojo, and even that was biased against Python. Let's see the most performant Python vs. the most performant Mojo. After all, Mojo isn't competing with Python, it's competing with things like Cython. Still, I'm curious about this language and what it brings to the table in terms of drag-and-drop performance improvements to Python code and an excuse to write Rust code while tricking my employers into thinking I'm writing Python.
I have done this parallelized matmul in C and this looks nicer but also not as hardcore, and I'm all about that hardcore. I don't want to call 'simd', i want __m512d _mm512_mask_mul_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int r) and nothing less
This is really cool! I might finally have a reason to use "python" again
The other thing to consider is that by restricting this stuff to a subset of Python (in terms of what can be optimised), and not allowing precise low-level stuff like e.g. C, there are potentially more optimisations available since Mojo doesn't need to worry about pointers and other stuff. (Roughly: The more your language can do, the less the compiler can assume about a program's behaviour.) The average data scientist using this could quite likely end up with something faster than what your average C programmer could do in C (as said average C programmer likely knows less about optimising numerical code than the authors of Mojo). I look forward to seeing where this ends up.
"Why are the robots looking at the keyboard?"
yes austin power keep coming back to my head in the last week or so
float32 and float64 are computed on amd64 in the same FPU with 80 bits of precision, so the time (latency) to compute is the same, but, if aligned properly, the CPU can parallelize internally, so you can always do twice as many float32s as float64s. Thus float32 generally gives higher throughput, but both have the same latency.
plus the point of SIMD is to do identical operations on multiple values, and if you are using an f64 rather than an f32, it takes up twice as many slots in the SIMD operation
It’s as if Ron Burgundy had a programming channel and I love it.
This got my mojo on. I've got 10 years experience with this language, and I'm looking for a worthy project.
Oh wow, I remember looking at neural-js 8 or 10 years ago. It was beyond me so I put it back.
hah!!!
A faster Python with structs, var and let -- I'm ready. This is killer.
its pretty cool tbh, both swift and llvm are huge & cool in their own right too.
float 32 and float 64 are handled by the FPU, no masking operations required
16:15 Yeah robots can just relay the text wirelessly from their "brain" to any other system.
A major part of making python 35,000x faster had to do with just how awfully slow it is to begin with.
It seems like not using the integrated multithreading/parallelize stuff in Mojo was still only getting Python up into the realm of JS speed.
Every time Prime streams I just imagine a ranch with a bench and Prime sitting on it, caught in a 2-hour heated debate with himself, trying to desperately convince a field of grazing donkeys what the best software engineering paradigm is. Only kidding, but the "unga bunga" chats made me think of that, love your streams my man!
Imagine if you did that level of optimization in assembly 😂 the processor will be chilling at cold temperatures in the corner and you brain and fingers will catch on fire 😂 , imagine doing it in native binary instead 💀 but it will be rewarding in terms of performance
Mojo is using MLIR and LLVM under the hood. The authors of Mojo are compiler experts, so it's not surprising MOJO is so much faster.
The real reason why MOJO has auto-tune is it makes it trivial to target different hardware. NVidia 4 series cards have thousands of cuda cores, so the speed up is going to be even higher than CPU cores. Then there's TPU and the other AI accelerators.
Maybe "MLIR" is something else, but I would guess that this is already happening with most languages, and ESPECIALLY it's already happening - and unavoidable - on GPUs?
Like Rust, and also clang and most other modern compiled languages I know of, has its own intermediate representation. It does some optimizations on that, and then that goes to LLVM, another intermediate representation, LLVM then compiles to machine code.
Thats 2 intermediate representations already.
Then, when you want to run stuff on a GPU, modern APIs like Vulkan, Metal or DX12 are targeted from some default shader language that has a first-party compiler (often a C dialect), but there are also other frontends, like e.g. Rust (an experimental community project). That then gets compiled to an intermediate language that is part of the API (for Vulkan that is SPIR-V), and that usually gets compiled again to a vendor-specific representation that actually gets executed.
ThePrimeTime: "Classic.. EVERYONE KNOWS ABOUT THAT"
It's a good thing. Give better tools to the noobs who only understand Python. At least they can learn about types and see that you get vastly improved performance by actually understanding data types. It is all machine code in the end. This is a good stopgap or learning language before systems programming.
(I'm at 14:09), the one benefit I see is that it's Python but has all the quality of lifes of C. Even if it isn't for AI, it's basically Python that isn't as gimped if you've used lower level languages.
"...I always hated rust anyway..." I laughed so hard at that🤣
gotem
If it really does work with all of Python, it makes Python capable of anything
Well, according to the playground and documentation, it works with all Python libraries and isn't only for machine learning, so we will see what the future holds.
I've never felt so craving for a language. Python with struct, but also without ; and {.
Thanks @ThePrimeagen I just burned all my rust lang books and am eagerly awaiting your mojo merch and future everything becomes a mojo convo, also can we agree that we measure dicts with the same measurement we use for horses. Hands, how many hands is your dict?
Python is versatile. If you need speed you have Cython; if you need a full-stack web-app platform you have Anvil to run Python in the browser and server side; and you have Numpy, TF, PyTorch, Pandas, Plotly. I hardly use C/C++/VB. SQL stood the test of time too; still using that since the 90s...
4:11 I've heard that fancy new algorithms, in spite of having better big-O complexity, aren't really cache-friendly, so in practice the standard algorithm won't be slower, at least.
But I'm not 100% sure that's true.
Key Question: Does this mean that we can train our neural nets much faster? Is GPT-5 training about to get faster?!!!
Mojo, someday but for now Cython 3.1 or C are very fast to write all the computational extension modules for Python.
6:22 machine learning works better with float32 than with float16 or float64 because it balances precision and memory usage.
Mojo Python is compiled; at the moment it's WSL-only if you have Windows. Waiting for a Windoze version, but RW read rights twins... now have X as a buddy (what a Twatter MuskRat?)
(Scrolling) HelloWorld runs so much faster... lol...
Mojo is basically statically typed python
keep it softcore baby. showtime after dark over here
By the end of this year, there will be new language called bojo
Mojo has that big simd energy
ohhh babe, nothing like big simd energy
Autotune feels like it's essentially JIT?
Also, once you understand the superset, is it really python? Type systems, structs, etc aren't just some syntactical sugar and take a bit of learning to truly understand.
If it provides an 8x speed boost without modifying the original code, as they said, then there's not much barrier to transitioning into Mojo imo
It seems like the main difference between JIT and autotune, as I understand them, is that JIT will do extra compiling work and cache the results at runtime, based on what parts of the code are being run in the interpreter most often and thereby using up redundant processing by being interpreted over and over, whereas autotune is actually compiling a given section of code a few different ways at compile time, measuring what the performance is like on that particular system, and including that in the rest of the compiled code.
I’m not an expert in either feature though, and the Just In Time compiler implementation probably varies across languages.
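The "measure variants, keep the fastest" idea described above can be mimicked at runtime with a toy sketch (all names here are invented for illustration; Mojo's actual autotune works on compiled parameterizations at build time, not on Python closures):

```python
import timeit

def make_summer(chunk):
    # One "variant" per chunk size: same result, different access pattern.
    def summer(xs):
        total = 0
        for i in range(0, len(xs), chunk):
            total += sum(xs[i:i + chunk])
        return total
    return summer

def autotune(variants, sample):
    # Time each candidate on a representative input and keep the fastest,
    # analogous to how an autotuner picks among parameterizations.
    best, best_time = None, float('inf')
    for fn in variants:
        t = timeit.timeit(lambda: fn(sample), number=5)
        if t < best_time:
            best, best_time = fn, t
    return best

variants = [make_summer(c) for c in (64, 512, 4096)]
fast_sum = autotune(variants, list(range(10_000)))
```

The key contrast with a JIT is visible even in this sketch: the selection happens once, up front, on the target machine, rather than being driven by which code paths happen to get hot during execution.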
I think you're wrong on the float 32 thing.
At least generally. A compiler can actually recognize that the numbers are 32-bit and pack 2 into a single 64-bit register, which can improve performance.
In the Intel world at least, the FPU does 80-bit internal operations, and you read/write either 32 bits or 64 bits in a single instruction either way. SIMD is the packed stuff.
Recruiters are hiring for Mojo developers with 10 years of experience 😅
I always knew python had something in it
Yeah, it's full of it.