They are not the same, but having async tasks is a powerful functionality that isn't available in all languages. It is correct he wasn't comparing the same, but you could argue that he was comparing how you would achieve the same thing if you wrote it in each language
@@MikyLestat Depends, because with Python you will run on a single thread, but with Go, for example, you will use multiple threads. If you are actually computing anything, this will make a significant difference.
@@lozanov95 Exactly. I think that the reason for the comparison is to get an indication of how much memory (minimally) each programming language will use to achieve the same thing. Achieving the same thing in each language is translated to using the features and constructs of each language. Python is a great language, but it isn't the fastest. The global-interpreter lock (in addition to Python being interpreted in CPython) causes it to be slow. Just because Python doesn't really have multi-threading, it doesn't mean we shouldn't use multi-threading/tasks in other languages and then profile the memory footprint.
@@MikyLestat I think this is the wrong way to compare: measuring the memory required to run those tasks in a language that only runs on a single thread versus multi-threaded ones. Garbage collectors have a feature to queue overloaded threads, so a faster process means lower memory. And for tasks with a wide range of sizes, say the first task is 20KB and the 70th is 1MB, a larger initial heap gives a better response than setting the initial size to 50KB and reallocating. It all depends on the user's hardware whether to favor CPU or memory. If memory is cheaper than CPU, spend memory; if CPU is cheaper, choose something like Go or Rust that reallocates frequently.
Using Python's asyncio for this test was the wrong thing to do. It's similar to what was done with NodeJS. Asyncio is an event loop, not a thread. Python has threading libs for threads.
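To make the event-loop point concrete, here is a minimal Python sketch (not the code from the article; `idle_task` and `run_tasks` are names made up for illustration). It starts 10,000 asyncio tasks and records which OS thread each one runs on.

```python
import asyncio
import threading

async def idle_task(seen_threads: set) -> None:
    # Record which OS thread this coroutine runs on, then wait without blocking.
    seen_threads.add(threading.get_ident())
    await asyncio.sleep(0.01)

async def run_tasks(count: int) -> set:
    seen_threads = set()
    await asyncio.gather(*(idle_task(seen_threads) for _ in range(count)))
    return seen_threads

thread_ids = asyncio.run(run_tasks(10_000))
print(len(thread_ids))  # number of distinct OS threads that ran the tasks
```

Every task reports the same thread ID, so a benchmark built this way measures the cost of task objects on one event loop, not the cost of threads.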
If I'm not wrong, C# uses a thread pool behind the scenes when using async/await, and what it does is recycle threads. That's why in the first test it was way higher than the others. I think that was the thread pool being initialized with a bunch of threads.
Yup. It always allocates a fixed size pool of managed threads depending on the system it's running on, unless you set the size yourself, which is possible and would be separately interesting for this benchmark.
@@3ventic The ThreadPool default is much smaller, it shouldn't take 120 MB at idle. I'm betting he wasn't distinguishing between allocated and committed memory.
as far as I know, C# also compiles the async methods to stateful classes, so it generates the states of each “step” of processing beforehand, when you create that amount of tasks you are basically creating a list of super small instances in a queue to the threadpool to consume until the next state (await) and throw again in the end of the queue
@@MikyLestat I was a bit mistaken, but there is a fixed minimum number of threads (ThreadPool.GetMinThreads). On my system it's 32 by default and the equivalent program on my system (1 task) takes up 195M RES 108M SHR while a million tasks is using 52 threads and 472M RES 23M SHR.
As a Java apologist, it first got virtual threads in 1997 with version 1.1 (edit: later removed and recently re-added in v 19). Also, Java (and presumably .NET) pre-allocates a bunch of memory by default. That's why memory looks high for small numbers of threads and doesn’t increase until you hit bigger numbers.
Creating a million concurrent "tasks" (or spawning processes as we call them in Erlang/Elixir) and allowing them to remain idle is one thing, while making those processes actually do something, such as each one of them having a persistent connection to a client and feeding it, is something entirely different. In practical terms, when it comes to real-time apps, the BEAM (Elixir/Erlang) outperforms all other languages by a significant margin. This is precisely why Brian Acton and Jan Koum chose Erlang for WhatsApp after years of experience with Yahoo Messenger and Yahoo Chat Rooms. If someone hasn't had the opportunity to work with any BEAM language, the above statement may appear to them as an empty boast, and I can't blame them for that.
But then this example needs to be done and shown to the world, since this is what Primeagen is reacting to. I'm surprised by Elixir's performance here... in a bad way.
The BEAM is pretty quick, but it won't "outperform all other languages by a significant margin". Ran several huge elixir services in production with lots of traffic and our Go services were much more performant.
@@-rate6326 Yes, goroutines aren't threads. But they do need to run at some point and the ones that aren't running are just waiting and we aren't talking about them
One piece of information not mentioned in the video: async functions in C# are state machines, and Tasks (which are part of the Task Parallel Library) are automatically run on thread pools. So the only internal state these async functions have is the time they need to wake up, and all Tasks could theoretically have the same wakeup time. I would've loved to see a C# Thread implementation. I suspect the C# compiler is optimizing redundant Tasks away since they lack any side effects.
Thread pool has like 512 preallocated threads, hence the high memory usage at idle. Tasks are actually running, but the max degree of parallelism is 8 (8-thread CPU), so there is practically nothing to allocate.
@@vitskr1 you can tune this if you know your workload, though. Some languages, I feel, didn’t get the best showing here as the author isn’t an expert in each one, which is understandable
@@vitskr1 Exactly what I suspected ruclips.net/video/WjKQQAFwrR4/видео.html . It's using the Server tuning; I think on Desktop the default is Number of Cores * 2.
@@monad_tcp i agree. and irl if you plan to launch 1M concurrent tasks you probably have the RAM to match. i still don't think many people do this in a single process anyway. probably better to distribute the workload to multiple servers. i recommend orleans 7 for c# devs. 😅
There's also the memory vs. speed tradeoff. Sometimes keeping more things in memory can also make it faster. The managed environments with a higher starting point in memory usage may already have a bunch of kernel threads lying dormant in a thread pool, which takes up memory but speeds up spawning threads.
Yeah. Bun.js was priding itself on being faster than Rust in its beta. Then when it came out and people started benchmarking, it was slightly faster than Rust by like a few percent, but used 40 times more memory on average.
The Rust forums are just clogged with unproductive / outdated discussions that lead nowhere and make it harder to get anywhere as a community. The mods should simply go through all the threads once in a while and nuke the ones that are no longer relevant or helpful so the good stuff can get more space and everything would run smoother. Maybe they could even automate this with an LLM agent? They could call it “RustScheduledGarbageRemover”
Each Elixir process spawns with a 50k heap, and garbage collection happens at the per-process level (you don't stop the world, you stop a process). This is because the way processes are used in Elixir is like how microservices are used: each process does a small amount of work, then sends a message on to another service. The Erlang VM that Elixir runs on launches one scheduler per CPU and does pre-emptive multitasking. So if you had 1M processes doing stuff, each process would execute for a few ms, then be switched out and added back into the queue that the schedulers pull from. If you have more cores you get more parallelism; if you only have 1 core you still get concurrency. Whereas async runtimes tend to be cooperative and require some form of explicit yielding from a running task, Elixir will just swap stuff out. That makes it good for soft-realtime stuff. If you want to do CPU-intensive things, you can delegate to NIFs (native implemented functions) written in C or Rust. The Rust ones tend to be safer, since panics are caught and raised as errors in Elixir, whereas a crash in C will take down the whole VM.
You can also specify the memory usage of a process on the BEAM VM, thus significantly reducing the amount of memory a process uses when it's spawned and doesn't really allocate anything, like in this case
And to do a test closer to what some of the other runtimes are doing, just call :timer.send_after(10000, :done) a million times, and then do a loop to receive :done 1 million times. Takes about 200mb instead.
I wouldn't compare it to microservices. I would just say Elixir processes are independent and don't share memory. Which really makes it unique (I don't know of another runtime like this except Node.js webworkers).
C# has the lowest memory usage because it is using the threadpool, that recycles blocking threads, like when calling Task.Delay. So there aren’t actually a million threads created but rather they are queued into the threadpool. To avoid this create the threads explicitly
@@user-qu5cc5oe2h ROTFL. As a first time viewer I asked myself if ThePrimeTime is always on that level of cocaine? Well, its something different than other coding channels. A fresh breeze, so to say .... **g**
No shit, Sherlock, all of the languages were using threadpools except Java and Rust with real worker threads. So you've failed to uniquely qualify C# altogether.
@@ThisIsMaddockid argue that C is more influential but yeah, saying no one cares about the language most used in most performance critical applications, that also need low level access to memory, is a really big stretch.
This guy reminds me of yongyea. Parrots other's work and makes more than the authors combined. He has no insight or original opinions or educated insight (from experiences academic or otherwise). I hate how people raise this guy up. Agreed on c++. That's my personal preference as I like the syntax being I learned it the same term I took cobol, Java (when it was new), visual basic and oop was still being defined. I've never worked in industry as a programmer but keep up to a middling ability. One thing I do know is that bullshit always smells like bullshit and this dude is full of it. People that talk during react videos do so only to fall under fair use, I see the same here transposed to a topic he is novice. Want for choice as mediocrity's excuse is no less evident than an untrained hand on display for no person's betterment or an opiate of excuse to be subject for one not turning to their purpose. I'm as wrong as apt to be right so there's that as well.
The Go results are not surprising. It's a well-documented feature that each goroutine starts with a pre-allocated initial stack. Prior to Go 1.2 it was 4 KB, then it went to 8 KB, and I believe it's now 2 KB for Go 1.4+. So 2 KB × 10k means an additional 20 MB on start. At 100k, it means a minimum of 200 MB on start. The math seems pretty consistent with the results we see for Go, although they seem to suggest that the initial stack size may be closer to 2.7 KB than 2 KB. We also have to keep in mind that there is a garbage collector running in there, and we didn’t account for how much memory it requires to keep track of everything going on.
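The back-of-the-envelope math above is easy to reproduce. A small Python sketch, assuming a 2 KiB initial stack per goroutine (the figure from the comment, not something measured here) and ignoring garbage collector and scheduler overhead; `min_stack_mib` is a made-up helper name:

```python
STACK_BYTES = 2 * 1024  # assumed initial goroutine stack (2 KiB), per the comment above

def min_stack_mib(goroutines: int, stack_bytes: int = STACK_BYTES) -> float:
    """Lower bound on stack memory, ignoring GC and scheduler overhead."""
    return goroutines * stack_bytes / (1024 * 1024)

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} goroutines -> at least {min_stack_mib(n):.1f} MiB")
```

At a million goroutines this lower bound is just under 2 GiB, which is in the same range as the Go numbers discussed in this thread.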
Also, he wasn't using ValueTask, which reduces memory consumption considerably. But I hate tests like this because a compiler could remove everything, since the code isn't doing anything.
9:30 - In the 19th century the German mathematician Georg Cantor proved that there must be more than one kind of infinity, such as the infinity of the natural numbers, and the infinity of real numbers and so on, and that some infinities are larger than others. The smallest infinity is that of the natural numbers, and it's called Aleph Zero. So yes, Buzz can indeed go to infinity and beyond, so long as it is mathematical infinity.
pretty cool, i remember studying this part of set theory and how Alef (first letter of the alphabet in Arabic) is used. the idea is that the set of natural numbers (1, 2, 3, ...) has the smallest infinite cardinality and is denoted as Aleph Zero (ℵ₀)
Nothing "and so on". That is not clear. In fact it can neither be proven nor disproven with standard mathematics. It is called the continuum hypothesis
@@drtfsghdfghdgfshdgfhdgfhdg The continuum hypothesis is that there are no intermediary infinities between "infinity of integers" and "infinity of reals". It is, indeed, but an axiom. However, the power set of a set ALWAYS has a higher cardinality than the set itself (Cantor's theorem), so infinitely many distinct infinities can be constructed by taking power sets repeatedly.
About the usage of Task.async in Elixir: it comes with a lot of boilerplate wrapped on top of GenServer. If the test is meant to measure concurrent tasks, one could go with primitives like spawn, send and receive in order to see the true potential. Just my opinion on why Elixir used a lot of memory.
It's not doing anything. The erlang process concept has nothing to do with threading. Sure it explains the memory usage, but there are ways to pool it so a maximum amount of processes could be spawned at any time.
@@Eirenarch No it should not. If you did it the way you describe, the work (in this case represented by Task.Delay) would not be scheduled on the TaskScheduler and would instead be done on the thread that this code is running on, thus blocking it and not using the CPU cores to their fullest. If anything, it should be Task task = Task.Run(Task.Delay(TimeSpan ...)); tasks.Add(task); This would save some memory while still scheduling the work on worker threads. I am not sure if there would be any benefits if you used TaskFactory and Scheduler directly, or whether it would be more performant, but I highly doubt it. Task itself is glorified coroutine and job child. It's just a premise of an action that can wait for other actions to complete. Task.Delay does not do anything with scheduling or threading. It just writes a timestamp and deposits the Task to run later, when the proper time has come. But it would not start a new thread/virtual thread/Task/Coroutine. Since they are trying to figure out how costly scheduling a new thread/virtual thread/Task/Coroutine is, this would not do the work.
I was looking for this comment. Guy who created that blog clearly knows nothing since he is using chatGPT and chatGPT also knows nothing if it outputs that kind of code... But hey, even my 'senior' coworker used to write async code like that so who am I to judge.
As likely already pointed out, C# uses a thread pool, and will definitely not create a gazillion threads in this test, and the memory required to house all of these insignificant tasks will be very small, which is apparent in the test results. I tried it out in LinqPad, but with one additional task whose only purpose was to keep track of the number of simultaneous threads actually in use. For 1 million tasks, the actual active thread count peak never even exceeded 50 on my system (usually much lower). No wonder, when all that the tasks are "doing" is async-waiting on a delay. This benchmark is broken in the sense that it doesn't really do what the author thinks it does, i.e. it does NOT create a lot of threads (virtual or otherwise) in all languages/runtimes, and measuring the memory usage is thus close to pointless.
There is some important information not mentioned in the article. Goroutines are compared to threads, whether real or virtual, but they are not compared to an event loop. Go has event loop libraries, and since the author of the article has used the event loop in other languages, he should also use it in Go to ensure an unbiased comparison. Additionally, the advantage of goroutines over threads is their portability; they do not depend on the operating system. If your application requires low-level operation, such as with chips or microcontrollers that do not have an operating system, a goroutine can still be executed. This is not possible with threads, as the language does not perform the task; the operating system does. Where there is no operating system, there are no threads. One last thing: when an application uses system threads, the system reserves memory. The question is: Did the author of the article account for the memory reserved by the system?
I wonder why Kotlin wasn't included. I guess it does share similarities with Java and Go, but its implementation of coroutines is supposed to be different from that in Go. I guess testing it would also have to include both JVM and Native compile targets because you never know.
@@avalagum7957 the suspend keyword and channels are part of the standard Kotlin library. The coroutines package includes coroutine builders and stuff like flows. For some reason Prime just ignores Kotlin altogether :/ But i'd really like to watch some quality Kotlin roast.
@@DeliOZzz cause it's not a popular choice for backends, a lot of people still think kotlin is only for android, im afraid this stigma will stick around for the time being
15:11 Python, by default, only uses one worker thread. When writing asyncio code you do need to be careful that you don't block. My understanding is that each event loop may have only one worker, but I'm not experienced enough to be confident in saying that.
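The "be careful that you don't block" point can be seen in a few lines. This is a rough sketch with made-up names (`polite`, `blocking`, `timed`), assuming a default asyncio event loop: five coroutines that await `asyncio.sleep` overlap their waits, while five that call `time.sleep` run one after another because they hold the loop's only thread.

```python
import asyncio
import time

async def polite(delay: float) -> None:
    await asyncio.sleep(delay)   # yields to the event loop while waiting

async def blocking(delay: float) -> None:
    time.sleep(delay)            # blocks the loop's thread; nothing else can run

async def timed(worker, count: int, delay: float) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(worker(delay) for _ in range(count)))
    return time.perf_counter() - start

polite_elapsed = asyncio.run(timed(polite, 5, 0.1))
blocking_elapsed = asyncio.run(timed(blocking, 5, 0.1))
print(f"await asyncio.sleep: {polite_elapsed:.2f}s, time.sleep: {blocking_elapsed:.2f}s")
```

The awaiting version finishes in roughly one delay; the blocking version takes roughly five.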
Elixir reserves 4kiB of RAM for each of its processes. Each process in Elixir has its own separate heap to eliminate the possibility of stop-the-world-GC.
@@carlinhos10002 Now that I've re-read the definition of green threads, I'm not sure how they aren't. They are not OS managed. They are lightweight thread-like primitives managed by the runtime. What are they missing? Wikipedia also lists them as such on en.wikipedia.org/wiki/Green_thread Not sure if this is as important though, every language in the lists was using their concurrency primitive built on top of some managed pool anyway.
@@pavelyeremenko4640 he’s just making things up. Most implementations are using some abstraction over OS thread. Only one of Java and Rust versions dont do that.
C# tasks use a threadpool to execute. But one thread can have multiple tasks waiting simultaneously and the code this guy used had each thread sleeping for several seconds
You actually pointed this out early on. In the Java and C# version, he uses "ArrayList" without specifying the size. ArrayList in both these languages holds an actual Array object. It's why the lookup time for "get" is a memory address lookup time. When Java needs to expand the array size, it creates a larger array that is twice the size of the current array size. I believe the default is 10. Java also doesn't run the garbage collector unless it needs to be run or is specifically invoked with System.gc. Because the JRE doesn't plan ahead for your bad code, it just looks for a new place to put the object in memory, leaving all the old references that need to be deleted alone, because the GC will deal with them as needed. Just to recap, there are several ArrayList objects each holding an array of size n (below) in memory, and if the JVM is given enough memory, all 11 of these will still be there. So that means there are 20510 threads in memory on the test. While his approach to joining all the threads was barbaric, it's also the accepted answer on StackOverflow, and we are not measuring the speed of the execution, just the memory of it. If you were not trying to measure the memory performance of threading in different languages, I would actually give Java more threads to manage the threads (parallelize stream). Final thoughts: we aren't concerned about thread space in production equipment, we are concerned about execution time, and if my entire program hangs because one calculation couldn't be done, I'm missing out on something important. It could be a trade, moving a servo for a robot (self-driving cars) or producing an input for a chess game. Collecting the information that I can allows me to implement an algorithm that is capable of making educated guesses based on what was calculated. If we do care about thread space, we would be better off doing single-threaded applications since we don't have an overhead associated with the effing cost of the thread.
TL;DR Something something short equal something something int because the JVM go fast blah blah addresses blah blah blah 4. (primitive array blah blah addresses, blah blah)
Man I am allergic to empty catch blocks in Java - always. After looking for exceptions that have never been rethrown or really handled, I am really on the fence. Empty catch blocks should not exist or even be allowed...
You are allergic to using your brain, yes we know. Maybe if you knew what checked and unchecked exceptions are and stopped making dumb comments. This is why you should stop the drugs and go back to school, fool
I'm allergic to exceptions. I will wrap all my code with empty catch blocks to further mayhem and until everyone else is conditioned to hate exceptions too. MWAHAHAHAHAHAHHAH
The Elixir solution has a LOT of room to squeeze out. I can get it running in about 990mb with some tweaks. Main thing is the default heap size. Passing `+hms 1` as part of `erl` options sets default size to 1 4-byte word. Also, using plain spawn calls instead of Task (which accumulates results, and adds extra memory and GC and processing overhead) reduces it further.
True, but as long as the "threads" don't actually do anything it is a useless comparison. The constructs on these platforms all provide a different feature set, so comparing performance is bogus. I mean a C# Task is just one or a few objects waiting in several queues to be invoked by native threads in the thread pool with a job stealing algorithm. NodeJs and Python are single threaded with a single event loop. I don't know what the others do and give you for free, but this isn't apples to apples. (Edit: I automatically type thread with a capital T)
@@mennol3885 Yup. The comparison is pretty meaningless. The "cheap", non-idiomatic Elixir way to do this would be to start 1,000,000 timers and wait for them to finish, effectively doing the same thing as some other platforms. I just tried that: it uses about 200mb of memory in total. If all it's doing is starting something that sits there idly for 10 seconds, there isn't much difference. No point carting round a whole isolated separate stack and heap for each process, and associated housekeeping. Elixir processes are cheap, but they're not *that* cheap.
My personal hate for it came from the pain of trying to use it in my SW dev course on Linux, compared to the Windows users who have first-class support for everything, and from missing a bunch of the things I love about Rust when doing C# (e.g. immutable by default, f, u, i (though byte is fine, and I guess using "long", "short", etc. isn't really bad; it's more just personal preference, and more efficient), match, traits, enums, macros!). True, some of this stuff is available to a decent extent in C#, but the... culture doesn't use it primarily like Rust does. But the language itself genuinely looks pretty nice, and has some nice features even over Rust. I'm definitely comfortable calling the language "better Java", and would be okay programming in it professionally or even as a hobby.
@@MH_VOID Yeah. Rust is very intriguing language (excluding the dramas and BS). Also things should be a lot better than before. Although there still is some windows/Microsoft bias in the language.
tbh the C# number kind of makes sense, it scales incredibly well, especially in later .NET versions. Some C#-based fancy Unity optimizations can beat out GCC in raw speed and memory.
Granted, there is probably some optimization going on in Release mode, since it's not doing anything. I'd expect the memory consumption to be higher, but not 4GB high.
@@marcossidoruk8033 yeah, the optimizations are made by the compiler. He meant the C language, but specifically with GCC. If you used the microsoft compiler or other options you would have different performances.
@@CorvinhoDoMal No way C# is going to beat carefully written C code in any imaginable benchmark ever, its just impossible. Plus what he said makes no sense, "unity optimizations" how do you compare C# unity performance with C unity performance if you can't do unity scripts in C? Am I going crazy or what. And if he means the engine that is written almost in its entirety in C++
Go is definitely not a memory hog, at least for IO-intensive tasks. The main thing is that the Go libraries are always very careful to stream large inputs rather than buffer them in memory. Java itself doesn't really have major memory issues beyond spawning threads, but in any large Java project, the code will be full of things being buffered into arrays rather than being streamed. I tried rewriting netty to make it stop doing dumb things, and just switched (permanently) to Go. Part of Java's problem is also the legal issues of shipping a JVM, and the existence of Oracle thumb-breakers and lawyers to come punish you for shipping.
C# code was not written correctly. Code snippet wraps one task into another `Task.Delay(...)` into `Task.Run(...)`, creating 2 million tasks and every 2nd task wrapped into another task. Correctly written code would have had consumption ~176MB on .NET 6. This was enough to create singular task: `tasks.Add(Task.Delay(TimeSpan.FromSeconds(10)));`
You're 100% right about the complexity of the task. But also, I would have stopped reading after they said they used ChatGPT to come up with the code. You need to have these contributed by people that actually write this language and that actually understand this language. The ambiguity between what the code was actually doing in all of these was horrible, as other commenters have also pointed out.
Erlang, a language used in telecommunications, still seems to be the concurrency champion (according to a book by Röhrl and Schmiedl called »Produktiver programmieren«, I've read it in German a while ago).
The C# implementation is completely bogus compared to the others. It's using a small thread pool (Task.Run) to set a bunch of timers (Task.Delay); that's why it shows low memory usage. This is not demonstrating concurrency. If the implementation did a Thread.Sleep or used real threads, the results would be completely different and probably worse than Java, since C# doesn't have virtual threads. In the real world, Go runtimes will have considerably less memory overhead than C# or Java
@@_daniel.w Go has a delay() function that looks similar to what's used in the C# impl. Rework the Go implementation to use this and I suspect it will perform drastically better
@@cethien I've been developing c# on linux and macos for a couple of years now using Rider (I just like it more but the Visual Studio is also fully cross platform). I don't personally enjoy the language as much nowadays but the tooling is great whatever platform you pick.
He said he launched 1 Task. As soon as you start one async task, C# (in .NET 6) already sets up all the thread pool stuff and Access control. For such simple instances you should use threads in C#. Afaik it greatly improved with .NET 7. But in exchange you are prepared to scale incredibly. Also, yeah, the .NET runtime does some incredibly smart magic in the background, e.g. have a look at LINQ performance in .NET 7.
@@rroscop on my hardware no problemo, remember that they are way more like go routines than like hardware threads, so only a dozen is actually working in parallel, the rest is just queued.
@@boredstudent9468 nice. Are you talking about System.Threading.Thread's? Or tasks run via Task.Run()? my understanding was that Task.Run() used a thread pool under the hood, but real Threads were more heavyweight. I'm not a C# developer though, just dabbled
Go's minimum stack size is (I think) 4KB per goroutine and it grows/shrinks as needed. Not sure what the minimum stack size is. Therefore the ~2GB in Go is not surprising. So in 3GB of memory, you can put 1mil/10mil and probably even 20/30 million goroutines; they will just shrink in size. You can probably do even more with the example from Piotr, since these are very simple routines that consume almost no memory. But as I said, I'm not sure what minimum stack size a goroutine will consume. It's less than 4KB for sure (in your example 2.8GB/1_000_000 = 2.8KB). My guess is that it's not shrinking below this since there is enough memory available. Anyway, you put it nicely: this is not a real-world test; a TCP/WebSocket connection would be much better
yea node example is not spawning threads, it's just placing tasks on the timeout callback queue of the eventloop to be executed later using the main thread.
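Python's asyncio event loop works along similar lines, so here is a sketch of the "timers on a callback queue" pattern in Python rather than Node (`wait_for_timers` and `on_timer` are illustrative names, and the counts are arbitrary). No task object or thread is created per timer; the loop just holds a scheduled callback for each one and runs it on the main thread when it is due.

```python
import asyncio

async def wait_for_timers(count: int, delay: float) -> int:
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    fired = 0

    def on_timer() -> None:
        # Runs on the loop's thread once this timer's deadline has passed.
        nonlocal fired
        fired += 1
        if fired == count:
            done.set_result(fired)

    for _ in range(count):
        loop.call_later(delay, on_timer)  # queues a callback; no thread is created
    return await done

fired_count = asyncio.run(wait_for_timers(100_000, 0.05))
print(fired_count)
```

This is why timer-based versions of the benchmark use so little memory compared with versions that give every unit of work its own stack.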
Infinity and beyond is mathematically sound because there are some infinities that are larger than others. The most trivial example would be the set of odd or even natural numbers, and the set of natural numbers. They're both countably infinite, but because the odd or even natural number sets can be mapped one to one to their values in the natural numbers, there will always be double the numbers in the natural numbers, as in a larger infinity. There are likely more important infinities to consider, and I might have explained that wrong or poorly, but most definitely there is more than just a single, simple infinity.
The issue with the Java threads, I feel, is not preallocating the ArrayList: every append checks the capacity, and whenever the backing array is full a new, larger array is allocated and the contents copied over. In this case that would leave a whole lot of discarded arrays in memory for the GC to collect.
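A quick Python simulation of that growth pattern, as a sketch only: `reallocations` is a made-up helper, the starting capacity of 10 and the doubling policy follow what an earlier comment in this thread assumed, and the alternative branch reflects OpenJDK's ArrayList, which as far as I know grows by about 1.5x.

```python
def reallocations(target: int, initial: int = 10, doubling: bool = True) -> tuple[int, int]:
    """Return (number of new backing arrays, final capacity) needed to hold `target` items."""
    capacity, count = initial, 0
    while capacity < target:
        # doubling=True is the assumption from the thread; the else branch is ~1.5x growth.
        capacity = capacity * 2 if doubling else capacity + (capacity >> 1)
        count += 1
    return count, capacity

print(reallocations(10_240))                  # grown step by step from 10 slots
print(reallocations(10_240, initial=10_240))  # pre-sized: no reallocation needed
```

Pre-sizing the list removes every intermediate array, which is the point of the comment above; whether those discarded arrays matter for the measured footprint depends on when the GC runs.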
Since this is a Linux system it’s using the completely fair scheduler (cfs) which means each thread runs at the same priority (as opposed to the mlfq (multilevel feedback queue) that windows uses). The issue then is that the OS is processing at the same priority as each of the threads created so the computer just freezes up. There’s also a minimum time spent in each thread so you rarely get to execute an action.
.Net pre-allocates a thread-pool at startup, though the memory shouldn't be quite that high. Pretty sure it also utilizes a work stealing scheduler under the hood for continuations and its async/await behavior. Also, if you want to further optimize for memory, the ValueTask struct will do some caching cleverness to dodge Task allocations if the work is either already done or can be done synchronously. Given how simple the test is, the GC probably won't kick in as it can recycle a lot of those Task objects.
If you want node to actually use multiple threads, you need to tell libuv to use multiple threads. There is an env variable for this: UV_THREADPOOL_SIZE. Like you said, node has an event loop. That's not multi-threaded. It's single threaded with callbacks. That's why setTimeout is more a 'minimum' guideline and not precise at all (under heavy loads). Just make a busy-wait program in node and you'll see it only filling up a single core on your CPU
back in the JDK 1.3 days, the JVM would allocate 1MB per thread, but it was changed around 1.6/1.8, I forget exactly which release they fixed that. It's also important in Java to get the memory used, not memory allocated. The biggest issue with java for me is once the JVM allocates memory, it doesn't release it until you stop the JVM process.
No, C# Task implies no threads whatsoever. It uses the thread pool by default for CPU work, yes, but that can easily be just the part of the job that says "this task is finished" (e.g. handling the async I/O response). Creating an explicit thread (_not_ a hardware thread, _not_ an OS thread - you don't have control over those natively in .NET) is something completely different, and very rarely used in modern C#. It negates the whole point of using asynchronous I/O in the first place, which is avoiding the overhead of threads that do nothing but wait for something to complete (whether that's a timer or an HTTP request). Which, let's not forget, was part of the point of the original article - showing how expensive "real" threads are, and that different approaches to handling asynchronous code have vastly different results. But that article is very flawed anyway. It would make sense to compare multi-threaded code with other ways of doing asynchronous I/O... but instead, we get an arbitrary choice of one or the other for each platform. You can have promises in any language. Many have commonly used or outright built-in APIs for that. Seeing the difference between, say, Java threads and Java Futures would be a bit illuminating, at least... though it still needs to be noted that you have a lot of control over things that absolutely crush this comparison anyway. The default stack size of a new thread on modern .NET is usually 1 MiB. Windows doesn't really allow you to go very small with thread stack sizes (you're supposed to use a few threads, not thousands). Linux is designed around multiple processes/threads using the same memory for as long as possible, so a thousand threads each with 1 MiB memory can actually occupy just a few megabytes (until you actually start to modify the memory). Every performance benchmark needs to have a goal. This one doesn't really seem to have one, apart from a simplistic "weird that memory usage in async stuff can vary wildly"...
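As one illustration of how much control you have over real threads, here is a Python sketch (Python rather than C#, purely for illustration) that sets the stack size for newly created OS threads with `threading.stack_size`. The helper name `run_with_stack`, the 256 KiB figure, and the thread count are arbitrary assumptions, and some platforms may reject a given size with a `ValueError`.

```python
import threading

def run_with_stack(count: int, stack_bytes: int) -> int:
    previous = threading.stack_size(stack_bytes)  # affects threads created after this call
    release = threading.Event()
    threads = [threading.Thread(target=release.wait) for _ in range(count)]
    try:
        for t in threads:
            t.start()
        alive = sum(t.is_alive() for t in threads)  # every thread is parked on the event
    finally:
        release.set()
        for t in threads:
            if t.ident is not None:  # join only the threads that actually started
                t.join()
        threading.stack_size(previous)  # restore the earlier setting
    return alive

print(run_with_stack(200, 256 * 1024))
```

The requested size is what each thread reserves; how much of it is resident depends on the OS, which is the reserved-versus-used distinction made in the comment above.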
I mean, pretty much every platform out there allows you to pre-allocate as much unused memory as you want, but it'd be a weird way to compare different platforms, right?
In C# when you use Tasks with async/await, the default implementation creates a state machine that uses the pre-existing thread pool to schedule execution of your tasks on the threads in the thread pool. Not only that, but it can even detect if the task is small enough to be executed synchronously - in that case it won't even end up in the thread pool - it will just execute and return as a normal function call. To test how much memory threads consume in C#, you can't use Tasks with async/await - you have to use the Thread class directly - that way you circumvent all of the optimizations done in the runtime and in the Task scheduler.
🤔 I concur with you Big P...let's look at some more real use cases. Going outside of the process itself will complicate analysis with other elements (e.g. DB, ORM, etc.) that should be held constant; however, there are good use cases to eliminate as much of the 7 layer stack as we can: 1. Storage - with the good old random file manipulation, etc. 2. Network - doing something more like a UDP listener to eliminate possible contamination with socket handling 3. Memory - malloc, 😮multi-threaded data manipulation, release (to watch garbage collection) 4. Compute - not all compute operations are math-based, but do some string parsing, concatenation, etc. I'm thinking we want to eliminate math computations because most of those operations will come down to the underlying math implementation vs. actual performance (e.g. Fortran being fast, etc.), but network issues could have the same impact. Consider the history of Java IO vs. NIO.
Hypothesis: .NET is up-front creating a Heap which it looks like is ~128MB perhaps? And also a thread pool. And then everything up to 100K tasks fits within those limits so the memory consumption stays the same. Then going to 1M tasks is exhausting that Heap so it has to be expanded. Guessing it could probably manage 250K tasks within that initial allocation? Anyway, .NET and C# are better than you think they are these days.
12:30 Per default tokio creates worker threads equal to the amount of cpu cores. Though thinking about it, if you only use timers having a single threaded runtime would likely be just as fast and more efficient.
Not a good choice. You often have long-running threads that also block. In fact all the systems where the kernel is not controlling the worker threads suck. This means: Linux, Android and the BSDs. The other systems have kernel-driven thread pools for much better handling, making sure that blocking I/O doesn't prevent utilisation.
@llothar68 I explicitly meant that for the case of using only timers, which are neither CPU-intensive nor use blocking APIs. When using an async runtime like tokio you shouldn't use blocking APIs anyway, and if you have to there is tokio::task::spawn_blocking, which runs the work on a dedicated pool of blocking threads.
C# Uses a thread pool behind the scenes with a default config of #X amount of threads depending on the system it's running, it's usually 20 if I remember correctly from my .NET days. What's interesting to me is how it can spin up more if required and scales correctly.
Why were they using the newest Rust from last month and Node.js from like 4 years ago? Like, AWS doesn't support the version they used. Or the 3 major versions after it.
You can go "beyond infinity" in the paradigm of transfinite numbers. You manipulate an "infinite number" called omega (the Greek letter) and then you have the numbers omega + 1, omega + omega, omega to the power omega and so on. This was primarily developed to compare the cardinality of infinite sets (ex: card(N) < card(R) even though they're both infinite)
Yeah, the C# example is not real threads. The code is just adding tasks to the scheduler, similar to "setTimeout" in JS. Which might be fine for most things, but each "Task" is taking up memory and then waiting to run. IMO, these tests are not good overall. I agree the Java one is probably not a good example either with the synchronous join.
Not full threads but not just tasks either. Tasks use a threadpool to manage execution and the .net runtime will decide how many threads are in that threadpool.
Just for fun, I created threads in C++ in a similar fashion: `static std::atomic<int> toInc = 0; { std::vector<std::jthread> threads; for (int i = 0; i < 1'000'000; ++i) { threads.emplace_back(std::jthread{ []() { toInc++; } }); } }` Running on a CPU with 8 cores, it took forever (we're talking about 15 minutes) to allocate the thread handles, and the resulting max memory consumed was 75 MB. Deallocating the thread handles took the same amount of time as creating them. So this test case highly depends on what kind of platform/OS is in use. Also, it's not advised to use more threads than your hardware can handle on native cores; on my system the highest multithreaded performance was at 32 threads (including an if < 1'000'000 inside each thread's lambda), and the peak performance for the simple task was single-threaded (I guess because no locking on the atomic was necessary) --- everything just observations and measurements
If you are creating a new Elixir "process" per task it will scale up pretty linearly with the number of tasks, hence why it's high. High memory usage is not really a bad thing, per se. Likewise, the same with Go and goroutines, whereas other runtimes with a fixed threadpool or Node.js with its single event loop won't keep climbing linearly. I would be more interested in CPU usage. You're welcome for this insight! 🤜🤛
This. The BEAM VM was designed to prioritise latency and predictable scalability. Copy-on-write and other memory consumption optimisations can produce latency spikes.
Comparing memory usage of VMs is tricky. They usually behave differently based on how much system memory is available/installed and configuration/mode. There's also the JIT compilation in most of these, which potentially adds a spike in memory usage, which might never be returned to the OS. It's just hard to say what's happening exactly, and the one number at the end is kinda pointless.
@@kippers12isOG You're missing the point. It may be doing that because it's configured to not return the memory to the OS. You can configure it differently. It behaves differently depending on the machine you're running it on, the memory configuration, the VM and GC configurations. For example in Java, you can just tell it to reserve a large amount of memory at the start, and also put a cap on it. It will generally not allocate any system memory beyond that, and it'll not run GC until you actually need more than you reserved. If you run it with defaults, on a system with tons of free memory (let's say >50iB free), do you want the default to be having a tiny footprint, but having to run GC more often? Or do you just say, use a few 100MiB, and almost never run the GC. If you ran this same program on a system that's starved for memory, the VM can decide to collect 20x as frequently, and keep its overall memory footprint 10x lower. A much better way to test would be to limit its container's RAM and see how low you can go until it starts to malfunction.
In Rust, the default stack size for an OS thread on all tier 1 platforms is 2MB. Not sure if it's allocated up front, but that's probably something to do with when all the memory went.
@@FinnBender Aww man! Yeah, that makes sense, Haskell is infamous for its high memory consumption because of thunks and stuff like that. I'm surprised it's that bad for 100k though, damnnn!
Maybe C# is doing something like Julia, that is, postponing execution until it actually needs to do something. Or maybe Roslyn has some under-the-covers optimizations. Any CLR experts care to comment?
`Task.Run` uses the ThreadPool by default, which is very conservative when spinning up new threads. The benchmark would pretty much make the ThreadPool never spin up new threads since each task completes immediately. It waits a good long while before deciding it actually should spin up a new one, which is why you see the memory increase at 1 million.
It creates n (depending on the CPU) managed threads for the default scheduler. If he wants to optimize for memory allocation, he should have used ValueTask and reduced the max managed threads of the default scheduler. But then again he should have measured threads instead of a higher level concept.
Task.Run(()=>{}); does not create a thread, but will instead schedule work on the thread pool. Task.Delay() halts execution, and 'await' returns the thread to the thread pool. The benchmark is extremely useless for C#, since all you are doing is juggling the same handful of threads back and forth: starting a task, then doing no work until the delay is up and the Task is discarded. You don't need many threads when your Task doesn't actually do any computation or I/O work.
@@TheTim466 It's true for any true asynchronous I/O. You can do it in Windows with a C program, no need for fancy async languages. I/O doesn't need threads, and `Task.Delay` is just I/O - you get a notification from the system timer at a given time in the future, then a threadpool thread is used to handle the continuation (which in this case essentially just signals that the task is completed). That's also why the C# version doesn't need much space for the tasks at all - just a few pointers, a cancellation token and a tiny state machine. It fits in a few dozen bytes on x64 per task. You could trim it even lower if you wanted.
The Task.Delay() in C# does not actually occupy a real thread for the wait, it just subscribes to a kernel event and relies on the kernel to fire it back after the time is passed. It does however create a bunch of objects like Task and internally Timer and some more which all have to be GCed eventually. chhhhh tfu!
18:18 C# is doing what is expected. C# async/await is pretty similar to goroutines and virtual threads as it can run in parallel. I think the low memory usage of C# is due to the tools you have in C# to write memory-efficient code. Unlike most other managed languages (Java, Go), C# has structs, which are not generally allocated on the heap. Those structs are not usually used that much in user code, but in runtime code they are used to optimize performance and memory. Also, C# pre-allocates some memory at first so it doesn't allocate much after that. Also, C# caches memory heavily. That's why a small C# program uses more memory than in most languages, but as the program gets bigger it catches up with other languages.
Yes, he could've used workers to create threads for concurrent tasks. By using setTimeout you're still on a single thread, so all those setTimeout calls will be queued inside the callback queue.
There is huge confusion between async tasks and threads in this whole article. Green threads != hardware threads, and async tasks are a separate concept that doesn't necessarily imply any threading model; the tasks can just yield on the same thread, or be distributed on a thread pool... the JS version is not even threaded at ALL, it's single-threaded, and the C# version is... probably threaded, but it depends on the synchronization context; that code in a blank UI application will actually only join on the UI thread, see SynchronizationContext and ConfigureAwait.
AOT and tree shaking business has come a long way with c#. I would assume actual minimums an order of magnitude or less, but he did say default release configurations.
Elixir reserves a separate stack AND heap space for each process. This has some advantages, but you need to pay for it with memory. So you need to do realistic tasks for it to be able to compare. Elixir might run faster and create less fragmentation than go for example. It can remain more consistent in response times theoretically than go. But it depends on the scenario. If you are doing a benchmark, you can always be working around the problems that naturally arise in each.
Idk how he compiled the C# program, but I rewrote his program line by line in .NET 6 and running it gave these results (checked with Process Hacker 2, as Task Manager doesn't report all memory):
- 1 task = ~7.3 MB
- 10000 tasks = ~13.5 MB
- 1000000 tasks = ~430 MB
- Compiled with .NET SDK 6.0.408
- CPU: AMD Ryzen 9 7900
- OS: Windows 10 build 19045
I assume that either C# cheats on Windows by having Windows preload the runtime into memory and then reusing it for all C# programs, so it simply doesn't report the memory consumed by the runtime, or some other shenanigans are going on in that article. I also tried building the app in "self-contained" mode, where the whole runtime is included in the output and doesn't need to be installed, and the footprint didn't change.
@@stefano_schmidt I tried allocating actual threads with new Thread(() => {Thread.Sleep(10000);}) and that started using *a lot* of memory. Million threads took ~4 GB of memory and shutting them down with Thread.Join took forever. But considering that Threads are really not recommended by anyone these days they might be lacking the optimizations over the years.
@@stefano_schmidt The whole point of asynchronous I/O is to avoid wasting threads for things that do not need threads. If you're working with well-written modern .NET code, you don't really need more threads than you have logical CPU cores, so why pay the cost? If anything, this shows exactly one of the reasons why spinning up new threads for every task you want to do is painfully wasteful. The article tries to compare different ways of handling asynchronous code. Threads are just one of those ways, and explicit threads should be really rare in any modern codebase. The article doesn't talk about creating a million threads - it talks about a million _asynchronous tasks_ . It's the YT video that claims this is about a million threads, which is silly - there's very few platforms where the overhead from the language/runtime will be remotely comparable to the overhead from having a thread in the first place. The default stack size of a Windows thread is usually 1 or 4 MiB. It will never take less than 64 kiB (or more exactly, the page size). Now compare that to the ~230 B a C# task takes, or the ~600B (in a pre-allocated structure of at least 2 kiB) of a goroutine. When you change the code to create threads... your memory usage comes entirely from thread stacks. Which means... what exactly? We know threads are expensive, that's why we want to avoid them! :D That's where async comes from (mostly). The real failure of the article is that it doesn't even attempt to find async tasks in each of those platforms - though that isn't all that surprising given the code was written by GPT :D
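To put the per-unit figures from the comment above side by side, here is a rough back-of-the-envelope sketch in Python. The sizes are the ones quoted in the thread (not measured here), and `total_mib`, `PER_UNIT_BYTES`, and `report` are made-up names for illustration only.

```python
MIB = 1024 * 1024

def total_mib(count, bytes_each):
    """Total memory in MiB for `count` units of `bytes_each` bytes each."""
    return count * bytes_each / MIB

# Per-unit sizes as quoted in the comment above; assumptions, not measurements.
PER_UNIT_BYTES = {
    "C# task (~230 B)": 230,
    "goroutine (2 KiB pre-allocated)": 2 * 1024,
    "thread, 64 KiB minimum stack": 64 * 1024,
    "thread, 1 MiB default stack": MIB,
}

def report(count=1_000_000):
    """Map each kind of unit to the total MiB needed for `count` of them."""
    return {name: total_mib(count, size) for name, size in PER_UNIT_BYTES.items()}

if __name__ == "__main__":
    for name, mib in report().items():
        print(f"{name:35s} {mib:>12,.0f} MiB")
```

Keep in mind that thread stacks are typically reserved address space rather than committed memory, so the thread rows are an upper bound on what a process monitor would actually show.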
If C# has memory available it will swallow a lot for optimizations. Once I experimented with Docker and performance-tested my simple API endpoint with Bombardier (a tool written in Go), bombarding it with thousands of requests. My app used 1.5 GB of RAM (!). But then I started limiting my container's available memory (-m parameter), and guess what, I went down to 15 MB and it still worked. The Go equivalent required at least 16 MB to work. The C# API with so little memory available performed almost the same as when using 1.5 GB anyway. (The Go one was like 2% faster though, not gonna lie)
The nodejs example is off point. You need to choose worker threads for staying in line with all of the other examples. The same goes for the Python AsyncIO example.
Tbh the c# example doesn't create 1 mill thread, async/await in c# is implemented in a way that it doesn't just throw everything on threads, it has a bunch of internal logic that does the async state without actually spawning thread unless they are necessary and even then, it has a threshold on how many threads can be created, so if the thread limit is not increased, it won't ever spawn that many threads, the allocations on the other hand, 1mill tasks is a LOT of memory xD
At 9:33 it's said "I hate that phrase, to infinity and beyond" and that it's nonsensical, but it's not: according to mathematics, there are many different kinds of infinities and they are not all the same size; some infinities are actually bigger than others. See Georg Cantor's work and particularly the diagonal argument, e.g. the size of the interval [0,1] alone is bigger than that of all natural numbers combined.
The big flaw in this test is that the main memory footprint of threads (no matter what kind) is the amount of thread-local data that has to be duplicated. And like most things in software, there is a trade-off between memory and speed (or latency). The sorts of things that thread-local data is used for are memory allocations, garbage collection, I/O, and maybe task scheduling. Usually more thread-local data means less contention between threads on those operations. So, a benchmark that does none of those things is only looking at the cost side without the benefit side. Long running tasks that allocate a lot of memory or do lots of operations like that will benefit greatly from the lower contention associated with running in a "real" thread. Short-lived tasks that have a small footprint or mostly statically allocated memory and sit idle for a significant portion of their run-time have far less potential to contend with each other in harmful ways, so the lower memory footprint and faster spawn-time of something like a plain event-loop or coroutines wins handily. And systems using thread-pools are basically trying to find a happy medium for tasks that are a bit of both. As always, the right tool for the right job. Obviously, if the "job" is to wait for 10 seconds 1M times concurrently, then a plain event-loop should win hands down because any decent event-loop implementation would boil that down to a single 10 second wait and then flushing an array of 1M small event objects.
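To make the last point concrete, here is a minimal Python sketch of a timers-only event loop. It is a toy under stated assumptions, not how any real runtime implements timers: `TimerLoop`, `call_later`, and `run_demo` are hypothetical names, and the clock is simulated so nothing actually sleeps. It shows N timers with the same deadline collapsing into one wait followed by a flush of N small callbacks.

```python
import heapq
import itertools

class TimerLoop:
    """Toy event loop that only knows about timers (illustrative only)."""

    def __init__(self):
        self._heap = []                # entries: (deadline, sequence number, callback)
        self._seq = itertools.count()  # tie-breaker so callbacks are never compared
        self.now = 0.0                 # simulated clock; no real sleeping happens

    def call_later(self, delay, callback):
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), callback))

    def run(self):
        """Run until no timers remain; return how many separate waits were needed."""
        waits = 0
        while self._heap:
            deadline = self._heap[0][0]
            if deadline > self.now:
                self.now = deadline    # a real loop would block once here
                waits += 1
            # Flush every timer that is due at the current time.
            while self._heap and self._heap[0][0] <= self.now:
                _, _, callback = heapq.heappop(self._heap)
                callback()
        return waits

def run_demo(n_tasks, delay=10.0):
    """Schedule n_tasks identical timers; return (waits, completed callbacks)."""
    loop = TimerLoop()
    done = []
    for i in range(n_tasks):
        loop.call_later(delay, lambda i=i: done.append(i))
    return loop.run(), len(done)

if __name__ == "__main__":
    print(run_demo(100_000))
```

The per-task cost here is one small heap entry plus a closure, which is the kind of footprint the comment is contrasting with a full thread stack.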
Comparing actual threads with async tasks seems kinda weird
And workers and a plain event loop. Terrible all around.
They are not the same, but having async tasks is a powerful functionality that isn't available in all languages. It is correct he wasn't comparing the same, but you could argue that he was comparing how you would achieve the same thing if you wrote it in each language
@@MikyLestat Depends, because with Python you will run on a single thread, but with go for example you will use multiple threads. If you are actually computing anything this will make a significant difference.
@@lozanov95 Exactly. I think that the reason for the comparison is to get an indication of how much memory (minimally) each programming language will use to achieve the same thing. Achieving the same thing in each language is translated to using the features and constructs of each language. Python is a great language, but it isn't the fastest. The global-interpreter lock (in addition to Python being interpreted in CPython) causes it to be slow.
Just because Python doesn't really have multi-threading, it doesn't mean we shouldn't use multi-threading/tasks in other languages and then profile the memory footprint.
@@MikyLestat I think this is the wrong way to compare: measuring the memory required to run the tasks in a language that only runs on a single thread against multi-threaded ones. Garbage collectors have a feature to queue work when threads are overloaded, so a faster process means lower memory. And for tasks whose sizes vary widely, let's say the first task is 20 KB and the 70th is 1 MB, a larger initial heap gives a better response than setting the initial size to 50 KB and re-allocating. It all depends on the user's hardware whether to favor CPU or memory: if memory is cheaper than CPU, then spend memory; if CPU is cheaper, then choose something like Go or Rust that re-allocates frequently.
Using Python's asyncio for this test was the wrong thing to do. It's similar to what was done with NodeJS. Asyncio is an event loop, not a thread. Python has threading libs for threads.
multiprocessing xD no need for a benchmark, it would be just atrocious
@@Kobrar44 Yeah just run "multiprocessing.Pool(int(1e6))" and you are good to go :D Argh I hate python, but it is still my main language.
@@nikonyrh Just curious, why do you hate Python?
But asyncio is faster because Python's multithreading is so bad, so it's what people use. And it accomplishes the same things.
this is an IO-Task so asyncio is the good solution!
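For anyone wanting to check the "event loop, not a thread" point themselves, a small sketch along these lines should do. It is not the article's benchmark: `sleeper`, `spawn_and_wait`, and `measure` are hypothetical names, and `tracemalloc` only tracks allocations made by the Python interpreter, so the peak it reports is not the process RSS.

```python
import asyncio
import threading
import tracemalloc

async def sleeper(delay):
    # Suspends on a timer; no thread is blocked while waiting.
    await asyncio.sleep(delay)
    return threading.get_ident()

async def spawn_and_wait(n_tasks, delay):
    tasks = [asyncio.create_task(sleeper(delay)) for _ in range(n_tasks)]
    thread_ids = await asyncio.gather(*tasks)
    return set(thread_ids)

def measure(n_tasks, delay=0.01):
    """Return (distinct threads that ran the tasks, peak traced bytes)."""
    tracemalloc.start()
    thread_ids = asyncio.run(spawn_and_wait(n_tasks, delay))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(thread_ids), peak

if __name__ == "__main__":
    threads, peak = measure(10_000)
    print(f"threads used: {threads}, peak traced memory: {peak / 1e6:.1f} MB")
```

Every task reports the same thread id, which is the whole argument: the tasks are entries in one loop, so the memory being measured is per-task bookkeeping rather than per-thread stacks.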
If I'm not wrong, C# uses a thread pool behind the scenes when using async/await, and what it does is recycle threads. That's why in the first test it was way higher than the others. I think that was the thread pool being initialized with a bunch of threads.
this.
Yup. It always allocates a fixed size pool of managed threads depending on the system it's running on, unless you set the size yourself, which is possible and would be separately interesting for this benchmark.
@@3ventic The ThreadPool default is much smaller, it shouldn't take 120 MB at idle. I'm betting he wasn't distinguishing between allocated and committed memory.
as far as I know, C# also compiles the async methods to stateful classes, so it generates the states of each “step” of processing beforehand, when you create that amount of tasks you are basically creating a list of super small instances in a queue to the threadpool to consume until the next state (await) and throw again in the end of the queue
@@MikyLestat I was a bit mistaken, but there is a fixed minimum number of threads (ThreadPool.GetMinThreads). On my system it's 32 by default and the equivalent program on my system (1 task) takes up 195M RES 108M SHR while a million tasks is using 52 threads and 472M RES 23M SHR.
As a Java apologist, it first got virtual threads in 1997 with version 1.1 (edit: later removed and recently re-added in v19). Also, Java (and presumably .NET) pre-allocates a bunch of memory by default. Hence mem looks high for small numbers of threads and doesn't increase until you hit bigger numbers.
Yep, rare prime L
Yes, but I ran the same code AOT-compiled for C# and it's only a 5 MB baseline. The blog author misrepresented C# badly.
@@elraito no doubt, but I don’t expect someone to be proficient at all those langs/runtimes.
That's why they should have used Kotlin coroutines
@@elraito That's the variation introduced by running it locally.
Creating a million concurrent "tasks" (or spawning processes as we call them in Erlang/Elixir) and allowing them to remain idle is one thing, while making those processes actually do something, such as each one of them having a persistent connection to a client and feeding it, is something entirely different. In practical terms, when it comes to real-time apps, the BEAM (Elixir/Erlang) outperforms all other languages by a significant margin.
This is precisely why Brian Acton and Jan Koum chose Erlang for WhatsApp after years of experience with Yahoo Messenger and Yahoo Chat Rooms. If someone hasn't had the opportunity to work with any BEAM language, the above statement may appear to them as an empty boast, and I can't blame them for that.
But then this example needs to be done and showed to the world as this primeagen is reacting. I'm surprised with Elixir performance here... in a bad way.
@@ThugLifeModafocah I'm not. Erlang processes are completely isolated. COMPLETELY. Every "task" has a separate GC, memory space, everything.
@@xbmarx so if things crash, only those things crash; that's a feature in itself
The BEAM is pretty quick, but it won't "outperform all other languages by a significant margin". Ran several huge elixir services in production with lots of traffic and our Go services were much more performant.
@@Aaku13 I can agree only for CPU-bound tasks. For I/O-bound tasks, Golang doesn't come close in performance to Elixir.
Go reserves 4K of memory for each thread's stack so you could do quite a bit of work on each of those threads without incurring further costs.
Makes sense
goroutines aren't threads.
@@-rate6326 Yeah, Go actually creates all threads at startup and just assigns goroutines to them.
All of this to say: it's a thread pool lol
@@-rate6326 Yes, goroutines aren't threads. But they do need to run at some point, and the ones that aren't running are just waiting, and we aren't talking about them.
programming languages assuming that you would use the threads to do actual work
One piece of information not mentioned in the video: async functions in C# are state machines, and Tasks (which are part of the Task Parallel Library) are automatically run on the thread pool. So the only internal state these async functions have is the time they need to wake up, and all Tasks could theoretically have the same wakeup time.
I would've loved to see a C# Thread implementation. I suspect the C# compiler is optimizing redundant Tasks away since they lack any side effects.
The thread pool has like 512 preallocated threads, hence the high memory usage at idle. Tasks are actually running, but the max degree of parallelism is 8 (8-thread CPU), so there is practically nothing to allocate.
@@vitskr1 you can tune this, knowing your workload though. Some languages I feel didn't get the best showing here as the author isn't an expert in each one, which is understandable
@@vitskr1 Exactly what I suspected ruclips.net/video/WjKQQAFwrR4/видео.html . It's using the Server tuning; I think on Desktop the default is Number of Cores * 2 .
@@vitskr1 512 threads * 512 KB = 256 MB. It's not that big of a deal for servers with lots of cores.
@@monad_tcp I agree. And IRL if you plan to launch 1M concurrent tasks you probably have the RAM to match. I still don't think many people do this in a single process anyway. Probably better to distribute the workload to multiple servers. I recommend Orleans 7 for C# devs. 😅
There's also the memory vs. speed tradeoff. Sometimes keeping more things in memory can also make it faster. The managed environments that have a higher starting point in memory usage may already have a bunch of kernel threads lying dormant in a thread pool; that takes up memory but speeds up spawning of threads.
if my hello world doesnt use 27 gigabytes of ram i wont write it
Yeah. Bun.js was priding itself on being faster than Rust in its beta. Then when it came out and people started benchmarking, it was slightly faster than Rust by like a few percent, but used 40 times more memory on average.
You forgot the extra Rust thread it takes to track all the bullshit drama in the Rust community
oopsy! is it new feature of crablang in 2023?
dammit why did I laugh so hard on this
The Rust forums are just clogged with unproductive / outdated discussions that lead nowhere and make it harder to get anywhere as a community. The mods should simply go through all the threads once in a while and nuke the ones that are no longer relevant or helpful so the good stuff can get more space and everything would run smoother. Maybe they could even automate this with an LLM agent? They could call it “RustScheduledGarbageRemover”
@@JensRoland Garbage Collector? BAN
@@juniuwu banning people is just garbage collection for communities ;-)
Each Elixir process spawns with a 50k heap, and garbage collection happens on a per-process level (you don't stop the world, you stop a process). This is because the way processes are used in Elixir is like how microservices are used. Each process does a small amount of stuff then sends a message on to another service.
The Erlang VM that Elixir runs on will launch 1 scheduler per CPU and does pre-emptive multitasking. So if you had 1M processes doing stuff you would get each process executing for a few ms then being switched out and added back into the queue that the schedulers pull from. So if you have more cores you get more parallelism; if you only have 1 core you still get concurrency.
Whereas async runtimes tend to be cooperative and require some form of explicit yielding from a running task, Elixir will just swap stuff out. Makes it good for soft realtime stuff; if you want to do CPU-intensive things you can delegate to NIFs (native implemented functions) written in C or Rust. The Rust ones tend to be safer since panics are caught and raised as errors in Elixir, whereas a crash in a C NIF will take down the whole VM.
You can also specify the memory usage of a process on the BEAM VM, thus significantly reducing the amount of memory something will use when it's spawned and doesn't really allocate anything, like in this case.
And to do a test closer to what some of the other runtimes are doing, just call :timer.send_after(10000, :done) a million times, and then do a loop to receive :done 1 million times. Takes about 200mb instead.
Elixir / Erlang processes have far less memory by default. More like 256 bytes but depends on word size on your system iirc.
really smart GC model! Elixir was very well designed
I wouldn't compare it to microservices. I would just say Elixir processes are independent and don't share memory. Which really makes it unique (I don't know of another runtime like this except Node.js webworkers).
C# has the lowest memory usage because it is using the threadpool, that recycles blocking threads, like when calling Task.Delay. So there aren’t actually a million threads created but rather they are queued into the threadpool. To avoid this create the threads explicitly
pff... everyone knows that c# offloads 50% of tasks on Azure servers
@@user-qu5cc5oe2h ROTFL.
As a first time viewer I asked myself if ThePrimeTime is always on that level of cocaine?
Well, its something different than other coding channels. A fresh breeze, so to say .... **g**
@@user-qu5cc5oe2h free compute hack
@@user-qu5cc5oe2h😂
No shit, Sherlock, all of the languages were using threadpools except Java and Rust with real worker threads. So you've failed to uniquely qualify C# altogether.
Where is c++?
None cares😅
@@ErickBuildsStuff Ah yes, no one cares about one of the most important and influential programming languages of all computing history
@@ThisIsMaddock I'd argue that C is more influential, but yeah, saying no one cares about the language most used in performance-critical applications that also need low-level access to memory is a really big stretch.
This guy reminds me of yongyea. Parrots other's work and makes more than the authors combined. He has no insight or original opinions or educated insight (from experiences academic or otherwise).
I hate how people raise this guy up.
Agreed on c++. That's my personal preference as I like the syntax being I learned it the same term I took cobol, Java (when it was new), visual basic and oop was still being defined.
I've never worked in industry as a programmer but keep up to a middling ability.
One thing I do know is that bullshit always smells like bullshit and this dude is full of it. People that talk during react videos do so only to fall under fair use, I see the same here transposed to a topic he is novice. Want for choice as mediocrity's excuse is no less evident than an untrained hand on display for no person's betterment or an opiate of excuse to be subject for one not turning to their purpose.
I'm as wrong as apt to be right so there's that as well.
@@jstro-hobbytech I personally use Rust as it keeps some of the cpp syntax and adds on top of it to prevent common mistakes.
The Go results are not surprising. It's a well-documented feature that each goroutine starts with an initially pre-allocated stack. Prior to Go 1.2 it was 4 KB, then it went to 8 KB, and I believe it's now at 2 KB for Go 1.4+.
So 2 KB × 10k means an additional 20 MB on start. At 100k, it means a minimum of 200 MB on start.
The math seems pretty consistent with the results we see for go, although they seem to suggest that initial stacksize may be closer to 2.7kb than 2kb.
We also have to keep in mind that there is a garbage collector running in there, and we didn’t account for how much memory it requires to keep track of everything going on.
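The arithmetic above is easy to replay; a quick Python sketch, assuming the 2 KB initial stack mentioned in the comment and the roughly 2.7 KB it guesses from the results (`initial_stack_mb` is just an illustrative helper name):

```python
def initial_stack_mb(goroutines, stack_kb=2.0):
    """Memory for initial goroutine stacks, in decimal MB (1 MB = 1000 KB)."""
    return goroutines * stack_kb / 1000

if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        # Second column uses the ~2.7 KB per goroutine guessed from the results.
        print(n, initial_stack_mb(n), initial_stack_mb(n, stack_kb=2.7))
```

This only covers the stacks themselves; as noted, scheduler and garbage collector bookkeeping come on top.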
C# was the winner, clearly everybody was expecting this
yes
Of course, kudos to .NET runtime team! 😎
Clearly they fucked their setup
[Insert cope here]
To be fair, they did fuck it but...
Running as AOT has even smaller footprint
Also he wasn't using ValueTask, they reduce the memory consumption considerably. But I hate tests like this because a compiler could remove everything, because the code isn't doing anything.
Intro: let's not compare apples to potatoes
The rest of the video: compares making threads with maintaining an event queue
9:30 - In the 19th century the German mathematician Georg Cantor proved that there must be more than one kind of infinity, such as the infinity of the natural numbers, and the infinity of real numbers and so on, and that there are larger infinities than others. The smallest infinity is that of the natural numbers, and it's called Aleph Zero.
So yes, Buzz can indeed go to infinity and beyond, so long as it is mathematical infinity.
pretty cool, i remember studying this part of set theory and how Alef (the first letter of the alphabet in Arabic) is used: the idea is that the set of natural numbers (1, 2, 3, ...) has the smallest infinite cardinality and is denoted as Aleph Zero (ℵ₀)
Thanks. Was thinking the same.
Nothing "and so on". That is not clear. In fact it can neither be proven nor disproven with standard mathematics. It is called the continuum hypothesis
@@drtfsghdfghdgfshdgfhdgfhdg The continuum hypothesis is that there are no intermediary infinities between "infinity of integers" and "infinity of reals". It is, indeed, but an axiom. However, the cartesian product of a set with itself ALWAYS yields a set with higher cardinality, so infinitely many distinct infinities can be constructed by the repeated usage of it.
@@mykhailonikolaichuk6392 That is just wrong. Infinite cartesian products of natural numbers, for example, are "just" rational numbers.
To be fair, Elixir is spawning new processes with their own memory and PID (inside the VM).
And also providing stuff for graceful restarts and an entire message queue
And preemptive scheduling, if any one of them fails or blocks indefinitely it cannot take the rest down with it.
About the usage of Task.async in elixir: it comes with a lot of boilerplate that is wrapped on top of GenServer. If the test has to be performed for concurrent tasks, one could go with primitives like spawn, send and receive in order to know the true potential. Just my opinion on why elixir used a lot of memory.
It's not doing anything. The erlang process concept has nothing to do with threading. Sure it explains the memory usage, but there are ways to pool it so a maximum amount of processes could be spawned at any time.
That C# method has 2 extra layers, the code inside the for loop should just be tasks.Add(Task.Delay(TimeSpan.FromSeconds(10)));
This 👆
They created threads to run their threads inside
@@Eirenarch No it should not. If you did it the way you describe, the work (in this case represented by Task.Delay) would not be scheduled on the TaskScheduler and would instead be done on the thread that this code is running on, thus blocking it and not using the CPU cores to their fullest.
If anything, it should be Task task = Task.Run(() => Task.Delay(TimeSpan ...)); tasks.Add(task); This would save some memory while still scheduling the work on worker threads.
I am not sure if there would be any benefits, if you used TaskFactory and Scheduler directly, whether it would be more performant, but I highly doubt so.
Task itself is glorified coroutine and job child. Its just a premise of an action, that can wait for other actions to complete. Task.Delay does not do anything with scheduling, or threading. It just writes a timestamp, and deposits the Task to run later, when the proper time has come. But it would not start new thread/virtual thread/Task/Coroutine. Since they are trying to figure out, how costly scheduling a new thread/virtual thread/Task/Coroutine is, this would not do the work.
c# and you are the 2 most useless stuffs
Also I don't see value tasks and the list doesn't have a buffer set.
I was looking for this comment. Guy who created that blog clearly knows nothing since he is using chatGPT and chatGPT also knows nothing if it outputs that kind of code... But hey, even my 'senior' coworker used to write async code like that so who am I to judge.
As likely already pointed out, C# uses a thread pool, and will definitely not create a gazillion threads in this test, and the memory required to house all of these insignificant tasks will be very small, which is apparent in the test results.
I tried it out in LinqPad, but with one additional task whose only purpose was to keep track of the number of simultaneous threads actually in use. For 1 million tasks, the actual active thread count peak never even exceeded 50 on my system (usually much lower). No wonder, when all that the tasks are "doing" is async-waiting on a delay.
This benchmark is broken in the sense that it doesn't really do what the author thinks it does, i.e. it does NOT create a lot of threads (virtual or otherwise) in all languages/runtimes, and measuring the memory usage is thus close to pointless.
There is some important information not mentioned in the article. Goroutines are compared to threads, whether real or virtual, but they are not compared to an event loop. Go has event loop libraries, and since the author of the article has used the event loop in other languages, he should also use it in Go to ensure an unbiased comparison.
Additionally, the advantage of goroutines over threads is their portability; they do not depend on the operating system. If your application requires low-level operation, such as with chips or microcontrollers that do not have an operating system, a goroutine can still be executed. This is not possible with threads, as the language does not perform the task; the operating system does. Where there is no operating system, there are no threads.
One last thing: when an application uses system threads, the system reserves memory. The question is: Did the author of the article account for the memory reserved by the system?
I wonder why Kotlin wasn't included, I guess it does share similarities with Java and Go but its implementation of Coroutines is supposed to be different from that in Go. I guess testing it would also have to include both JVM and Native compile targets because you never know.
If you include kotlinx library, you should add Scala Actor, ZIO ... too.
@@avalagum7957 suspend keyword and channels are part of the standard kotlin library. Coroutines package includes coroutines' builders and stuff like flows.
For some reason Prime just ignores Kotlin altogether :/ But i'd really like to watch some quality kotlin roast.
@@DeliOZzz cause it's not a popular choice for backends, a lot of people still think kotlin is only for android, I'm afraid this stigma will stick around for the time being
It runs in the same VM. At most it would be equal to a competent implementation in Java only.
It should've been "To infinity and NaN" as an homage to JavaScript.
15:11 Python, by default, only uses one worker thread. When writing asyncio code you do need to be careful that you don't block. My understanding is that each event loop may have only one worker, but I'm not experienced enough to be confident in saying that.
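For anyone who wants to check the "one thread" part on their own machine, here is a minimal sketch. The helper names are invented for the example; it only covers plain asyncio with no executors or extra threads in play.

```python
import asyncio
import threading


async def napping_task(delay: float) -> int:
    await asyncio.sleep(delay)      # yields to the event loop, does not block it
    return threading.get_ident()    # id of the OS thread that resumed this task


async def threads_used(n: int, delay: float = 0.01) -> set[int]:
    """Run n sleeping tasks and collect the ids of the threads they ran on."""
    idents = await asyncio.gather(*(napping_task(delay) for _ in range(n)))
    return set(idents)


if __name__ == "__main__":
    used = asyncio.run(threads_used(10_000))
    print(f"10,000 tasks ran on {len(used)} thread(s)")
```

All the tasks report the thread that called `asyncio.run`, which is why a single blocking call inside any one of them stalls the rest.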
Elixir reserves 4kiB of RAM for each of its processes. Each process in Elixir has its own separate heap to eliminate the possibility of stop-the-world-GC.
Each Linux kernel thread needs 32kb (28kb of it are non swappable physical kernel stack space) + 1kb for kernel structures.
C# has threads. Benchmarking Tasks instead is just confusing, because those aren't threads.
As you may have noticed, he's benchmarking green threads(tasks in c#, goroutines in go, etc.) across the languages.
C# does not have green threads. Tasks are not green threads
@@carlinhos10002 Now that I've re-read the definition of green threads, I'm not sure how they aren't. They are not OS managed. They are lightweight thread-like primitives managed by the runtime. What are they missing?
Wikipedia also lists them as such on en.wikipedia.org/wiki/Green_thread
Not sure if this is as important though, every language in the lists was using their concurrency primitive built on top of some managed pool anyway.
@@pavelyeremenko4640 he’s just making things up. Most implementations are using some abstraction over OS thread. Only one of Java and Rust versions dont do that.
C# tasks use a threadpool to execute. But one thread can have multiple tasks waiting simultaneously and the code this guy used had each thread sleeping for several seconds
You actually pointed this out early on. In the Java and C# version, he uses "ArrayList" without specifying the size.
ArrayList in both these languages hold an actual Array object. It's why the lookup time for "get" is a memory address lookup time.
When Java needs to expand the array size, it creates a larger array that is twice the size of the current array size. I believe the default is 10.
Java also doesn't run the garbage collector unless it needs to be run or specifically invoked with System.gc.
Because the JRE doesn't plan ahead for your bad code, it just looks for a new place to put the object in memory, leaving all the old references that need to be deleted alone - because the GC will deal with it as needed.
Just to recap there are several arraylist objects each holding an array of size n (below) in memory - and if the JVM is given enough memory, all 11 of these will still be there.
So that means there are 20510 threads in memory on the test.
While his approach to joining all the threads was barbaric, it's also the accepted answer on StackOverflow, we are not measuring the speed of the execution, just the memory of it.
If you were not trying to measure the memory performance of threading on different languages, I would actually give java more threads to manage the threads (parallelize stream).
Final thoughts,
We aren't concerned about thread space in production equipment, we are concerned about execution time and if my entire program hangs because one calculation couldn't be done, I'm missing out on something important - it could be a trade, moving a servo for a robot (self driving cars) or producing an input for a chess game. Collecting the information that I can allows me to implement an algorithm that is capable of making educated guesses based on what was calculated.
If we do care about thread space, we would be better off doing single threaded applications since we don't have an overhead associated with the effing cost of the thread.
TL;DR
Something something short equal something something int because the JVM go fast blah blah addresses blah blah blah 4. (primitive array blah blah addresses, blah blah)
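The ArrayList growth described above is easy to simulate. This is only a sketch with a made-up helper name. One caveat stated as an assumption: as far as I know, OpenJDK's ArrayList grows by roughly 1.5x (old + old >> 1) rather than 2x, so both factors are included. Also note the slots hold references, not thread objects, so the discarded arrays are far smaller than the threads themselves.

```python
def growth_history(target: int, initial: int = 10, double: bool = False) -> list[int]:
    """Capacities of every backing array allocated while appending `target` items."""
    capacities = [initial]
    while capacities[-1] < target:
        old = capacities[-1]
        # double=True models the 2x growth from the comment;
        # otherwise use old + old // 2, the 1.5x rule.
        capacities.append(old * 2 if double else old + (old >> 1))
    return capacities


if __name__ == "__main__":
    for label, flag in (("2x", True), ("1.5x", False)):
        caps = growth_history(10_000, double=flag)
        print(f"{label}: {len(caps)} arrays, {sum(caps)} reference slots in total")
```

With doubling, 10,000 appends go through 11 arrays totalling 20,470 slots; with the 1.5x rule it is 19 arrays. Either way, passing the expected size to the constructor avoids all but one of them.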
Man I am allergic to empty catch blocks in Java - always. After looking for exceptions that have never been rethrown or really handled, I am really on the fence. Empty catch blocks should not exist or even be allowed...
You are allergic to using your brain, yes we know. Maybe if you knew what checked and unchecked exceptions are and stopped making dumb comments. This is why you should stop the drugs and go back to school, fool
I have no problems with empty catch blocks, as long as my compiler is allowed to optimize them away.
I'm allergic to exceptions. I will wrap all my code with empty catch blocks to further mayhem and until everyone else is conditioned to hate exceptions too.
MWAHAHAHAHAHAHHAH
The Elixir solution has a LOT of room to squeeze out. I can get it running in about 990mb with some tweaks. Main thing is the default heap size. Passing `+hms 1` as part of `erl` options sets default size to 1 4-byte word. Also, using plain spawn calls instead of Task (which accumulates results, and adds extra memory and GC and processing overhead) reduces it further.
True, but as long as the "threads" don't actually do anything it is a useless comparison. The constructs on these platform all provide a different feature set, so comparing performance is bogus. I mean a C# Task is just one or a few objects waiting in several queues to be invoked by native threads in the thread pool with a job stealing algorithm. NodeJs and Python are single threaded with a single event loop. I don't know what the others do and give you for free, but this isn't apples to apples.
(Edit: I automatically type thread with a capital T)
@@mennol3885 Yup. The comparison is pretty meaningless. The "cheap", non-idiomatic Elixir way to do this, would be to start 1,000,000 timers, and wait for them to finish. Effectively doing the same thing as some other platforms. I just tried that - uses about 200mb in total of memory.
If all it's doing is starting something that sits there idly for 10 seconds, there isn't much difference.
No point carting round a whole isolated separate stack and heap for each process, and associated house keeping. Elixir processes are cheap, but they're not *that* cheap.
i'm ready for the C# arc let's go, it has a really bad reputation that is totally undeserved these days
True.
My personal hate for it came from the pain of trying to use it in my SW dev course on linux compared to those windoze fags who have first class support for everything, and from missing a bunch of the things I love about Rust when doing C# (e.g. immutable by default, f, u, i (though byte is fine and I guess using "long", "short", etc. isn't really bad. more just personal preference and more efficient), match, traits, enums, macros! True some of these stuff are to a decent extent available in C#, but the.. culture doesn't use them primarily like Rust does). But the language itself genuinely looks pretty nice, and has some nice features and shit even over Rust. I'm definitely comfortable calling the language "better Java", and would be okay programming in it professionally or even hobbyistically.
@@MH_VOID Yeah. Rust is very intriguing language (excluding the dramas and BS). Also things should be a lot better than before. Although there still is some windows/Microsoft bias in the language.
I think C# is great honestly. Not the best in anything, but it’s good in many areas
@@sohn7767 Yeah agree. And I think that it is its main strength. That it can be used for everything.
tbh the C# number kind of makes sense, it scales incredibly well, especially in later .NET versions. Some C#-based fancy Unity optimizations can beat out GCC in raw speed and memory.
Granted, there is probably some optimization going on in Release mode, since it's not doing anything. I'd expect the memory consumption to be higher, but not 4GB high.
What do you mean by "beating GCC" last I checked GCC was a compiler.
@@marcossidoruk8033 yeah, the optimizations are made by the compiler. He meant the C language, but specifically with GCC. If you used the microsoft compiler or other options you would have different performances.
@@CorvinhoDoMal No way C# is going to beat carefully written C code in any imaginable benchmark ever, its just impossible.
Plus what he said makes no sense, "unity optimizations" how do you compare C# unity performance with C unity performance if you can't do unity scripts in C? Am I going crazy or what.
And if he means the engine that is written almost in its entirety in C++
@@marcossidoruk8033 Google the Unity Burst compiler. Faster than GCC in fibonacci and NBody simulation.
Go is definitely not a memory hog, at least for IO-intensive tasks. The main thing is that the Go libraries are always very careful to stream large inputs rather than buffer them in memory. Java itself doesn't really have major memory issues beyond spawning threads, but in any large Java project, the code will be full of things being buffered into arrays rather than being streamed. I tried rewriting netty to make it stop doing dumb things, and just switched (permanently) to Go. Part of Java's problem is also the legal issues of shipping a JVM, and the existence of Oracle thumb-breakers and lawyers to come punish you for shipping.
C# code was not written correctly. Code snippet wraps one task into another `Task.Delay(...)` into `Task.Run(...)`, creating 2 million tasks and every 2nd task wrapped into another task. Correctly written code would have had consumption ~176MB on .NET 6.
This was enough to create singular task: `tasks.Add(Task.Delay(TimeSpan.FromSeconds(10)));`
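The double-wrapping effect is not specific to C#. As a rough analogy only (this is Python's asyncio, not the .NET API, and the helper names are invented), wrapping each timer in an outer task that does nothing but await an inner one doubles the number of live task objects:

```python
import asyncio


async def wrapped_delay(seconds: float) -> None:
    # Outer task that exists only to create and await an inner task.
    inner = asyncio.create_task(asyncio.sleep(seconds))
    await inner


async def count_live_tasks(n: int, wrapped: bool, seconds: float = 0.2) -> int:
    """Start n delays and report how many task objects are alive at once."""
    make = wrapped_delay if wrapped else asyncio.sleep
    tasks = [asyncio.create_task(make(seconds)) for _ in range(n)]
    await asyncio.sleep(0)                 # let every outer task take its first step
    live = len(asyncio.all_tasks()) - 1    # exclude the task running this function
    await asyncio.gather(*tasks)
    return live


if __name__ == "__main__":
    print("direct :", asyncio.run(count_live_tasks(1000, wrapped=False)))
    print("wrapped:", asyncio.run(count_live_tasks(1000, wrapped=True)))
```

The wrapped variant keeps twice as many tasks alive for the same amount of waiting, which is the same shape of overhead the comment describes for `Task.Run` around `Task.Delay`.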
Apparently a lot of it was written using ChatGPT, so it makes sense.
C# master race. Lets go.
.NET team is optimizing the fu*k out of the stack for a few years.
Hands down the best api backend language to work with. 🥰
I hope that it becomes so good that it could be used as a full stack language.
@@reddragon2358 It does work fairly well together with HTMX
@@BosonCollider Oh, glad to hear, but for example with Java could be used for full stack development with the help of Java frameworks.
@@reddragon2358 That produces horrendous UI. Could be future using WASM.
@@mishikookropiridze5079 I heard that C# has UI frameworks. I hope that they get better with time.
You're 100% right about the complexity of the task.
But also, I would have stopped reading after they said they used ChatGPT to come up with the code.
You need to have these contributed by people that actually write this language and that actually understand this language.
The ambiguity between what the code was actually doing in all of these was horrible, as other commenters have also pointed out.
C# fan bois are eating good these days
Yup
Thank you for sharing and commenting on this one. I would love to see C# with AOT compile. I believe it would make a huge difference.
Erlang, a language used in telecommunications, still seems to be the concurrency champion (according to a book by Röhrl and Schmiedl called »Produktiver programmieren«, I've read it in German a while ago).
C# now has native aot and would have significantly improved the memory footprint of this
Yeah, 7,4mb for just a standalone release mode app.
Also trimming
@@sgbench ValueTasks and adding a buffer size to the list will help.
@@FilipCordas That's a good point
I'm curious about C, C++ & Zig.
Also, I love Go. What happened, why did it end up using so much memory? Kinda sucks
@nósferratu Oh, alright.
I was watching chat go by and someone mentioned Go is stackbased or something along those lines.
Thanks for the info 👍
@nósferratu right I was going to comment the same and found this
The C# implementation is completely bogus compared to the others. It's using a small thread pool (task.run) to set a bunch of timers (task.delay) that's why it shows low memory usage. This is not demonstrating concurrency.
If the implementation did a Thread.Sleep or used real threads the results would be completely different and probably worse than Java since C# doesn't have virtual threads.
In the real world Go runtimes will have considerably less memory overhead than C# or Java
@@_daniel.w Go has a delay() function that looks similar to what's used in the C# impl. Rework the Go implementation to use this and I suspect it will perform drastically better
My man hates C# so much, it's hilarious! To be fair though I agree with everything you said and would love to see your benchmarks about this topic.
@@cethien I love C# but hate MS. I use Rider and Linux to code in my personal time and I like it. I think it's very good for API development.
@@cethien VS sucks, Rider rules. I do also hate Microsoft but it’s a good language nonetheless
@@cethien I've been developing c# on linux and macos for a couple of years now using Rider (I just like it more but the Visual Studio is also fully cross platform).
I don't personally enjoy the language as much nowadays but the tooling is great whatever platform you pick.
@@pavelyeremenko4640 last time I used visual studio on mac it was only for Xamarin
@@cethien I loooove writing Razor components 🤓
// MyComponent.razor
@using Microsoft.AspNetCore.Components
@Title
@Message
@code {
[Parameter]
public string Title { get; set; }
[Parameter]
public string Message { get; set; }
}
the fuck is this shit
He said he launched 1 Task, as soon as you start one async task C# (in .NET 6) already sets up all the thread pool stuff and Access control. For such simple instances you should use threads in C#. Afaik it greatly improved with .NET 7. But in exchange you are prepared to scale incredibly, also yeah the .NET runtime does some incredibly smart magic in the background, e.g. have a look at LINQ performance in .NET 7.
CAS is not a thing anymore in dotnet core world.
@@metaltyphoon CAS?
Can you really run 1 million C# threads?
@@rroscop on my hardware no problemo, remember that they are way more like go routines than like hardware threads, so only a dozen is actually working in parallel, the rest is just queued.
@@boredstudent9468 nice. Are you talking about System.Threading.Thread's? Or tasks run via Task.Run()?
my understanding was that Task.Run() used a thread pool under the hood, but real Threads were more heavyweight. I'm not a C# developer though, just dabbled
Go's minimum stack size is (I think) 4KB per goroutine and it grows/shrinks as needed. Not sure what the minimum stack size is. Therefore the ~2GBs in Go is not surprising. So in 3GB of memory, you can put 1mil/10mil and probably even 20/30 million goroutines, they will just shrink in size. You can probably do even more with the example from Piotr, since these are very simple non-memory-consuming routines. But as I said, not sure what the minimum stack size consumed by a goroutine is. But it's less than 4KB for sure (in your example 2.8GB/1_000_000 = 2.8KB). My guess is that it's not shrinking below this since there is enough memory available.
Anyway you put it nicely, this is not a real world test, TCP/Websocket connection would be much better
Go test here was completely misrepresented by non optimized garbage collection settings and not profiling how much of that was colored for deletion.
yea node example is not spawning threads, it's just placing tasks on the timeout callback queue of the eventloop to be executed later using the main thread.
Infinity and beyond is mathematically sound because there are some infinities that are larger than others. The most trivial example would be the set of odd or even natural numbers, and the set of natural numbers. They're both countably infinite, but because the odd or even natural number sets can be mapped one to one to their values in the natural numbers, there will always be double the numbers in the natural numbers, as in a larger infinity.
There's likely more important infinities to consider, and I might have explained that wrong or poorly, but most definitely there is more than just a single, simple infinity.
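A hedged footnote on the example above: in standard set theory the even numbers and the natural numbers have the same cardinality, because n ↦ 2n pairs them off one to one. The textbook example of a strictly larger infinity is Cantor's: the power set of the naturals, which has the same size as the reals. In the usual notation:

```latex
\[
  f:\mathbb{N}\to\{0,2,4,\dots\},\quad f(n)=2n \text{ is a bijection}
  \quad\Longrightarrow\quad
  |\{0,2,4,\dots\}| = |\mathbb{N}| = \aleph_0
\]
\[
  |\mathbb{N}| = \aleph_0 \;<\; |\mathcal{P}(\mathbb{N})| = 2^{\aleph_0} = |\mathbb{R}|
\]
```

So the conclusion that there is more than one infinity still holds; it is just the even-versus-natural example that does not show it.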
The issue with the java threads i feel like is not preallocating the array list, every time an arraylist gets appended it checks for the size and generates a new array. Which in this case would be a whole lot of arrays in memory for the gc to collect.
Since this is a Linux system it’s using the completely fair scheduler (cfs) which means each thread runs at the same priority (as opposed to the mlfq (multilevel feedback queue) that windows uses). The issue then is that the OS is processing at the same priority as each of the threads created so the computer just freezes up. There’s also a minimum time spent in each thread so you rarely get to execute an action.
.Net pre-allocates a thread-pool at startup though the memory shouldn't be quite that high. Pretty sure it also utilizes a work stealing scheduler under the hood for continuations and its async/.await behavior. Also if you want to further optimize for memory the ValueTask struct will do some caching cleverness to dodge Task allocations if the work is either already done or can be done synchronously. Given how simple the test is, the GC probably won't kick in as it can recycle a lot of those Task objects.
10:08 why that old .NET version? 7.0.6 was current back in May 2023.
If you want node to actually use multiple threads, you need to tell libuv to use multiple threads. There is an env variable for this: UV_THREADPOOL_SIZE. Like you said, node has an event loop. That's not multi-threaded. It's single-threaded with callbacks. That's why setTimeout is more a 'minimum' guideline and not precise at all (under heavy loads). Just make a busy-wait program in node and you'll see it only filling up a single core on your CPU
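The "timer is only a minimum" point applies to any single-threaded event loop. Here is a small sketch of it using Python's asyncio as a stand-in for node (an analogy, not node's actual behaviour in detail; the function names are invented):

```python
import asyncio
import time


async def timer(delay: float) -> float:
    """Sleep for `delay` seconds and report how long it really took."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start


async def blocked_timer_demo(delay: float = 0.05, block: float = 0.2) -> float:
    pending = asyncio.create_task(timer(delay))
    await asyncio.sleep(0)   # let the timer task start its clock
    time.sleep(block)        # blocking call: the loop cannot run anything meanwhile
    return await pending


if __name__ == "__main__":
    elapsed = asyncio.run(blocked_timer_demo())
    print(f"asked for 0.05 s, actually waited {elapsed:.2f} s")
```

The timer cannot fire until the blocking call returns control to the loop, so the requested delay is a lower bound and nothing more.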
back in the JDK 1.3 days, the JVM would allocate 1MB per thread, but it was changed around 1.6/1.8, I forget exactly which release they fixed that. It's also important in Java to get the memory used, not memory allocated. The biggest issue with java for me is once the JVM allocates memory, it doesn't release it until you stop the JVM process.
C# Task is an abstraction using the threadpool. He should use the Thread class which instantiates a real thread.
*managed thread
No, C# Task implies no threads whatsoever. It uses the thread pool by default for CPU work, yes, but that can easily be just the part of the job that says "this task is finished" (e.g. handling the async I/O response).
Creating an explicit thread (_not_ a hardware thread, _not_ an OS thread - you don't have control over those natively in .NET) is something completely different, and very rarely used in modern C#. It negates the whole point of using asynchronous I/O in the first place, which is avoiding the overhead of threads that do nothing but wait for something to complete (whether that's a timer or a HTTP request). Which, let's not forget, was part of the point of the original article - showing how expensive "real" threads are, and that different approaches to handling asynchronous code have vastly different results.
But that article is very flawed anyway. It would make sense to compare multi-threaded code with other ways of doing asynchronous I/O... but instead, we get an arbitrary choice of one or the other for each platform. You can have promises in any language. Many have commonly used or outright built-in APIs for that. Seeing the difference between, say, Java threads and Java Futures would be a bit illuminating, at least... though it still needs to be noted that you have a lot of control over things that absolutely crush this comparison anyway. The default stack size of a new thread on modern .NET is usually 1 MiB. Windows doesn't really allow you to go very small with thread stack sizes (you're supposed to use a few threads, not thousands). Linux is designed around multiple processes/threads using the same memory for as long as possible, so a thousand threads each with 1 MiB memory can actually occupy just a few megabytes (until you actually start to modify the memory).
Every performance benchmark needs to have a goal. This one doesn't really seem to have one, apart from a simplistic "weird that memory usage in async stuff can vary wildly"... I mean, pretty much every platform out there allows you to pre-allocate as much unused memory as you want, but it'd be a weird way to compare different platforms, right?
C# uses loads of thread pools and I think the issue is they likely didnt trim the assemblies etc so it kept a bunch of unused crap
The name is the C-sharpagen.
It looks as though c sharp is creating a thread pool by default instead of actually launching threads.
crab is fast and fox is slow
do a barrel roll
In C# when you use Tasks with async/await, the default implementation creates a state machine that uses pre-existing thread pool to schedule execution of your tasks on the threads in the thread pool. Not only that, but it can even detect if the task in the thread is small enough to be executed synchronously - in that case it won't even end up in the thread pool - it will just execute and return as normal function call.
To test how much memory threads consume in C#, you can't use Tasks with async/await - you have to use the Thread class directly - that way you circumvent all of the optimizations done in the runtime and in the Task scheduler.
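The same distinction exists in most runtimes. As a rough illustration in Python rather than C# (helper names are invented, and the thread count is kept small on purpose), this is what "use the thread class directly" looks like: every unit of work gets its own OS thread, which is what a thread-memory benchmark would need to measure.

```python
import threading


def spawn_parked_threads(n: int):
    """Start n real OS threads that just wait; return them plus the release switch."""
    release = threading.Event()
    threads = [threading.Thread(target=release.wait) for _ in range(n)]
    for t in threads:
        t.start()
    return threads, release


def release_and_join(threads, release) -> None:
    """Wake every parked thread and wait for all of them to exit."""
    release.set()
    for t in threads:
        t.join()


if __name__ == "__main__":
    before = threading.active_count()
    threads, release = spawn_parked_threads(200)
    print("extra live threads:", threading.active_count() - before)
    release_and_join(threads, release)
```

Measuring the process's resident memory while the threads are parked gives a per-thread cost that task-based code never pays, because a waiting task does not hold a stack of its own.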
🤔 I concur with you Big P...let's look at some more real use cases. Going outside of the process itself will complicate analysis with other elements (e.g. DB, ORM, etc.) that should be held constant; however, there are good use cases to eliminate as much of the 7 layer stack as we can:
1. Storage - with the good old random file manipulation, etc.
2. Network - doing something more like a UDP listener to eliminate possible contamination with socket handling
3. Memory - malloc, 😮multi-threaded data manipulation, release (to watch garbage collection)
4. Compute - not all compute operations are math-based, but do some string parsing, concatenation, etc.
I'm thinking we want to eliminate math computations because most of those operations will come down to the underlying math implementation vs. actual performance (e.g. Fortran being fast, etc.), but network issues could have the same impact. Consider the history of Java IO vs. NIO.
Hypothesis: .NET is up-front creating a Heap which it looks like is ~128MB perhaps? And also a thread pool. And then everything up to 100K tasks fits within those limits so the memory consumption stays the same. Then going to 1M tasks is exhausting that Heap so it has to be expanded. Guessing it could probably manage 250K tasks within that initial allocation? Anyway, .NET and C# are better than you think they are these days.
12:30 Per default tokio creates worker threads equal to the amount of cpu cores.
Though thinking about it, if you only use timers having a single threaded runtime would likely be just as fast and more efficient.
Not a good choice. You often have long running threads that also do block. In fact all the systems where the kernel is not controlling the worker threads sucks. This means: Linux,Android and the BSDs. The other systems have kernel driven thread pools for much better handling making sure that IO blocks don't prevent utilisation.
@llothar68 I explicitly meant that for the case of using only timers, which are neither cpu intensive nor use blocking APIs.
When using an async runtime like tokio you shouldn't use blocking APIs anyway, and if you have to there is tokio::task::spawn_blocking, which spawns a thread/uses a thread pool.
C# Uses a thread pool behind the scenes with a default config of #X amount of threads depending on the system it's running, it's usually 20 if I remember correctly from my .NET days. What's interesting to me is how it can spin up more if required and scales correctly.
Should be equal to number of cores you have available on the machine.
Why were they using the newest rust from last month and nodejs from like 4 years ago? Like AWS doesn't support the version they used. Or 3 major versions after it.
This is very good question. Looks like manipulation
You can go "beyond infinity" in the paradigm of transfinite numbers. You manipulate an "infinite number" called omega (the greek letter) and then you have the number omega + 1, omega + omega, omega power omega and so on.
This was primarily developed to compare the cardinal of infinite sets (ex: card(N) < card(R) even though they're both infinite)
Yeah, the C# example is not real threads. The code is just adding tasks to the scheduler, similar to "setTimeout" in JS. Which might be fine for most things, but each "Task" is taking up memory and then waiting to run. IMO, these tests are not good overall. I agree the Java one is probably not a good example either with the synchronous join.
Dude… only one of the Java and Rust versions was real threads. All the other tasks use a pool abstraction. I think Elixir uses actual processes.
Not full threads but not just tasks either. Tasks use a threadpool to manage execution and the .net runtime will decide how many threads are in that threadpool.
just for fun, did creating threads in c++ in a similar fashion:
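#include <atomic>
#include <thread>   // std::jthread requires C++20
#include <vector>

static std::atomic<int> toInc = 0;
// the block below runs inside a function such as main()
{
    std::vector<std::jthread> threads;
    for (int i = 0; i < 1'000'000; ++i)
    {
        // each jthread joins automatically when `threads` goes out of scope
        threads.emplace_back(std::jthread{ []() {
            toInc++;
        } });
    }
}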
Running on a CPU with 8 cores, it took forever (we're talking about 15 minutes) to allocate the thread handles;
the resulting peak memory consumption was 75 MB.
Deallocating the thread handles took about as long as creating them.
So this test case depends heavily on which platform/OS is in use.
Also, it's not advised to use more threads than your hardware has native cores for.
On my system the highest multithreaded performance was
at 32 threads (including an if < 1'000'000 inside each thread's lambda),
and the peak performance for this simple task was single-threaded (I guess because no locking on the atomic was necessary).
--- everything here is just observations and measurements
Prime will now worship at the altar of Anders (creator of C# and Typescript) /s
XD
If you are creating a new Elixir "process" per task, memory will scale up pretty linearly with the number of tasks, which is why it's high. High memory usage is not really a bad thing per se. The same goes for Go and goroutines, whereas runtimes with a fixed thread pool, or Node.js with its single event loop, won't keep climbing linearly. I would be more interested in CPU usage. You're welcome for this insight! 🤜🤛
This. The BEAM VM was designed to prioritise latency and predictable scalability. Copy-on-write and other memory consumption optimisations can produce latency spikes.
I wanna see Nick Chapsas's reaction on this 🤣
x2
Comparing memory usage of VMs is tricky. They usually behave differently based on how much system memory is available/installed and configuration/mode.
There's also the JIT compilation in most of these, which potentially adds a spike in memory usage, which might never be returned to the OS.
It's just hard to say what's happening exactly, and the one number at the end is kinda pointless.
If the JIT takes memory doing this, it will probably take similar memory in real use cases, no? That would mean that it's valid to include it.
That's why some languages provide their own memory library/tool, since it knows exactly how much actual memory is being used.
@@kippers12isOG You're missing the point. It may be doing that because it's configured to not return the memory to the OS. You can configure it differently. It behaves differently depending on the machine you're running it on, the memory configuration, the VM and GC configurations.
For example in Java, you can just tell it to reserve a large amount of memory at the start, and also put a cap on it. It will generally not allocate any system memory beyond that, and it'll not run GC until you actually need more than you reserved.
If you run it with defaults on a system with tons of free memory (let's say >50 GiB free), do you want the default to be a tiny footprint, at the cost of running GC more often? Or do you just say, use a few 100 MiB and almost never run the GC?
If you ran this same program on a system that's starved for memory, the VM can decide to collect 20x as frequently, and keep its overall memory footprint 10x lower.
A much better way to test would be to limit its container's RAM and see how low you can go until it starts to malfunction.
Where is C++?
In Rust, the default stack size for an OS thread on all tier 1 platforms is 2 MiB. Not sure if it's allocated up front, but that probably has something to do with where all the memory went.
I would have loved to see Haskell tested like this, it'd be so good
It's surprisingly bad :(
1 thread: 5.0 MB
10 threads: 4.9 MB
100 threads: 4.9 MB
1k threads: 8.3 MB
10k threads: 63.1 MB
100k threads: 803.8 MB
@@FinnBender Aww man! Yeah, that makes sense, Haskell is infamous for its high memory consumption because of thunks and stuff like that. I'm surprised it's that bad for 100k though, damnnn!
I think the C# compiler just optimized the clearly only-idling tasks away.
Is this the first time in history he turned off the notifications before starting the video?
Don't tell anyone...
9:28 "would you not be at infinity, If you can go beyond?" Loved it
Maybe C# is doing something like Julia, that is, postponing execution until it actually needs to do something. Or maybe Roslyn has some under-the-covers optimizations. Any CLR experts care to comment?
`Task.Run` uses the ThreadPool by default, which is very conservative when spinning up new threads. The benchmark would pretty much make the ThreadPool never spin up new threads since each task completes immediately. It waits a good long while before deciding it actually should spin up a new one, which is why you see the memory increase at 1 million.
It creates n (depending on the CPU) managed threads for the default scheduler. If he wants to optimize for memory allocation, he should have used ValueTask and reduced the max managed threads of the default scheduler. But then again he should have measured threads instead of a higher level concept.
Task.Run(()=>{}); does not create a thread, but will instead schedule work on the Thread pool. Task.Delay() halts execution, and 'await' returns the thread to the threadpool.
The benchmark is extremely useless for C#, since all you are doing is juggling the same handful of threads back and forth: starting a task, then doing no work until the delay is up and the Task is discarded.
You don't need many Threads when your Task doesn't actually do any computation or IO work.
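The "same handful of threads" point is easy to see in any language with a thread pool. Here's a minimal sketch in Python rather than C#, using the standard ThreadPoolExecutor as a stand-in for the .NET ThreadPool (the helper name run_on_pool and the pool size of 4 are just assumptions for illustration). It submits lots of trivial jobs and records which threads actually ran them:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_on_pool(num_tasks: int, max_workers: int) -> set:
    """Submit num_tasks trivial jobs and return the ids of the threads that ran them."""
    seen = set()
    lock = threading.Lock()

    def job() -> None:
        # Record which pool thread picked up this job.
        with lock:
            seen.add(threading.get_ident())

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(num_tasks):
            pool.submit(job)
    # Leaving the with-block waits for every submitted job to finish.
    return seen


if __name__ == "__main__":
    threads_used = run_on_pool(num_tasks=10_000, max_workers=4)
    print(f"10,000 tasks ran on {len(threads_used)} thread(s)")
```

The number of distinct threads can never exceed max_workers, no matter how many tasks you queue up. What grows with the task count is the queue of pending work items, not the thread count.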
@@mikestiver9000 Wouldn't that be the same for any language with async capabilities?
@@TheTim466 It's true for any true asynchronous I/O. You can do it in Windows with a C program, no need for fancy async languages. I/O doesn't need threads, and `Task.Delay` is just I/O - you get a notification from the system timer at a given time in the future, then a threadpool thread is used to handle the continuation (which in this case essentially just signals that the task is completed). That's also why the C# version doesn't need much space for the tasks at all - just a few pointers, a cancellation token and a tiny state machine. It fits in a few dozen bytes on x64 per task. You could trim it even lower if you wanted.
The Task.Delay() in C# does not actually occupy a real thread for the wait, it just subscribes to a kernel event and relies on the kernel to fire it back after the time has passed. It does however create a bunch of objects like Task and internally Timer and some more, which all have to be GCed eventually. chhhhh tfu!
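You can poke at the same idea with Python's asyncio, which is a different runtime than .NET, so take this only as an analogy. In this rough sketch (run_sleepers and sleeper are made-up names), a pile of tasks park themselves on a timer, tracemalloc gives an approximate per-task allocation figure, and every task ends up resuming on the one event-loop thread:

```python
import asyncio
import threading
import tracemalloc


async def sleeper(delay: float, thread_ids: set) -> None:
    # Waiting on the timer does not block the thread; control returns to the event loop.
    await asyncio.sleep(delay)
    thread_ids.add(threading.get_ident())


async def run_sleepers(num_tasks: int, delay: float):
    """Return (approx. bytes allocated per sleeping task, ids of threads that ran tasks)."""
    thread_ids = set()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    tasks = [asyncio.create_task(sleeper(delay, thread_ids)) for _ in range(num_tasks)]
    # Yield once so the tasks get a chance to start and park on their timers.
    await asyncio.sleep(0)
    during, _ = tracemalloc.get_traced_memory()
    await asyncio.gather(*tasks)
    tracemalloc.stop()
    return (during - before) / num_tasks, thread_ids


if __name__ == "__main__":
    per_task, thread_ids = asyncio.run(run_sleepers(100_000, 1.0))
    print(f"~{per_task:.0f} bytes per sleeping task, {len(thread_ids)} thread(s)")
```

The per-task number only counts Python-level allocations that tracemalloc can see, so it's a lower bound on what the OS reports for the process, not a replacement for measuring RSS.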
Now imagine giving Tom a C#
18:18 C# is doing what is expected. C# async/await is pretty similar to goroutines and virtual threads in that it can run in parallel. I think the low memory usage of C# is due to the tools C# gives you for writing memory-efficient code. Unlike most other managed languages (Java, for example), C# has structs, which are generally not allocated on the heap. Structs are not used that much in user code, but the runtime uses them to optimize performance and memory. C# also pre-allocates some memory up front, so it doesn't allocate much after that, and it caches memory heavily. That's why a small C# program uses more memory than most languages, but as the program gets bigger it catches up with the others.
Let's rewrite Elasticsearch, Kafka, and Cassandra in C# and get free performance
Wohooo. Let's go
Look up ScyllaDB as a Cassandra replacement. It's written in C++
keep microsoft crap to its ecosystem.
I have a feeling they're testing apples to oranges to melons to avocados to onions
More interested in seeing Node.js 20 with worker threads, as they claim that there are a lot of perf improvements in Node 20.
Yes, he could've used workers to create threads for the concurrent tasks. By using setTimeout you're still single-threaded, so all those setTimeout callbacks get queued in the callback queue.
The threads used by C# async/await come from a pool and are different from System.Threading.Thread.
I wouldn't be surprised if some of this were optimised away in release mode by Roslyn. The test case is not valid.
It specifically said there was no significant difference between debug and release...
There is a huge confusion between async tasks and threads in this whole article. Green threads != hardware threads, and async tasks are a separate concept that doesn't necessarily imply any threading model; the tasks can just yield on the same thread, or be distributed across a thread pool... The JS version is not threaded at ALL, it's single-threaded, and the C# version is... probably threaded, but that depends on the synchronization context; the same code in a blank UI application will actually resume only on the UI thread, see SynchronizationContext and ConfigureAwait.
AOT and tree shaking business has come a long way with c#. I would assume actual minimums an order of magnitude or less, but he did say default release configurations.
Another win for the Rust Team...
you spit on C# or Python ? i don't get it...
He had a bad experience with C# in the past IIRC. I wouldn't read anything into it tbh. Both are good technologies in their own way.
And the spider says: "Do I really need eight legs?", and God answers: "Nobody needs eight of anything".
C# to the moon! Haven't finished yet. Drum roll.
Let's go.
Elixir reserves a separate stack AND heap space for each process. This has some advantages, but you pay for it with memory. So you need realistic tasks to make a fair comparison. Elixir might run faster and create less fragmentation than Go, for example. It can theoretically stay more consistent in response times than Go. But it depends on the scenario. If you are doing a benchmark, you can always work around the problems that naturally arise in each.
Idk how he compiled the C# program, but I rewrote his program line by line in .NET 6, and running it gave these results (checked with Process Hacker 2, as Task Manager doesn't report all memory):
- 1 task = ~7.3 MB
- 10000 tasks = ~13.5 MB
- 1000000 tasks = ~430 MB
- Compiled with .NET SDK 6.0.408
- CPU: AMD Ryzen 9 7900
- OS: Windows 10 build 19045
I assume that either C# cheats on Windows by having Windows preload the runtime into memory and then reusing it for all C# programs and it simply doesn't report the memory consumed by the runtime, or some other shenanigans are going on in that article.
I also tried building the app in the "self-contained" mode, where it includes the whole runtime in output, not requiring it to be installed and the footprint hasn't changed.
you should try creating the actual threads, instead of re-using ThreadPool (Tasks) as shown in the article
@@stefano_schmidt I tried allocating actual threads with new Thread(() => {Thread.Sleep(10000);}) and that started using *a lot* of memory. A million threads took ~4 GB of memory, and shutting them down with Thread.Join took forever. But considering that raw Threads are really not recommended by anyone these days, they might have missed out on the optimizations made over the years.
You can try to compile AOT
@@stefano_schmidt it's a useless test. The OS will just spend more time context switching than doing real work (thrashing).
@@stefano_schmidt The whole point of asynchronous I/O is to avoid wasting threads for things that do not need threads. If you're working with well-written modern .NET code, you don't really need more threads than you have logical CPU cores, so why pay the cost? If anything, this shows exactly one of the reasons why spinning up new threads for every task you want to do is painfully wasteful.
The article tries to compare different ways of handling asynchronous code. Threads are just one of those ways, and explicit threads should be really rare in any modern codebase. The article doesn't talk about creating a million threads - it talks about a million _asynchronous tasks_ . It's the YT video that claims this is about a million threads, which is silly - there's very few platforms where the overhead from the language/runtime will be remotely comparable to the overhead from having a thread in the first place. The default stack size of a Windows thread is usually 1 or 4 MiB. It will never take less than 64 kiB (or more exactly, the page size). Now compare that to the ~230 B a C# task takes, or the ~600B (in a pre-allocated structure of at least 2 kiB) of a goroutine.
When you change the code to create threads... your memory usage comes entirely from thread stacks. Which means... what exactly? We know threads are expensive, that's why we want to avoid them! :D That's where async comes from (mostly). The real failure of the article is that it doesn't even attempt to find async tasks in each of those platforms - though that isn't all that surprising given the code was written by GPT :D
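To put the numbers from this comment side by side, here's a back-of-the-envelope calculation. The per-item sizes are the ones quoted above, not something I measured, and the helper name total_bytes is made up. Also keep in mind that a thread's default stack size is typically reserved address space rather than committed memory, so real measurements can come out lower than the last line.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

N = 1_000_000  # number of concurrent tasks/threads in the benchmark


def total_bytes(count: int, bytes_each: int) -> int:
    """Total footprint if every item costs bytes_each."""
    return count * bytes_each


# Per-item sizes quoted in the comment above (assumptions, not measurements).
task_total = total_bytes(N, 230)                # ~230 B per C# task
goroutine_total = total_bytes(N, 2 * KIB)       # 2 KiB structure per goroutine
min_stack_total = total_bytes(N, 64 * KIB)      # 64 KiB minimum thread stack
default_stack_total = total_bytes(N, 1 * MIB)   # 1 MiB default thread stack

if __name__ == "__main__":
    print(f"C# tasks:        {task_total / MIB:10.1f} MiB")
    print(f"goroutines:      {goroutine_total / MIB:10.1f} MiB")
    print(f"threads, 64 KiB: {min_stack_total / GIB:10.1f} GiB")
    print(f"threads, 1 MiB:  {default_stack_total / GIB:10.1f} GiB")
```

Even the smallest possible thread stack is a few hundred times the size of a task, which is the whole argument for not spawning a thread per task.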
If C# has memory available, it will swallow a lot of it for optimizations. I once experimented with Docker and performance-tested my simple API endpoint with Bombardier (a tool written in Go), bombarding it with thousands of requests. My app used 1.5 GB of RAM (!). But then I started limiting my container's available memory (the -m parameter), and guess what, I got down to 15 MB and it still worked. The Go equivalent required at least 16 MB to work. With that little memory available, the C# API performed almost the same as when using 1.5 GB anyway. (The Go one was like 2% faster though, not gonna lie)
I will most likely need to use C# as my primary language at my next job
Wish you all the best
c# is life, c# is love
@@dziarskihenk8798 XD
dont use maui
.NET 8 supports Native AOT now, which compiles ahead of time like Rust or Go, and the overall performance can beat all of the others; no competitor left.
The nodejs example is off point. You need to choose worker threads for staying in line with all of the other examples.
The same goes for the Python AsyncIO example.
Agree
Tbh the C# example doesn't create 1 million threads. async/await in C# is implemented in a way that doesn't just throw everything onto threads: it has a bunch of internal logic that tracks the async state without actually spawning threads unless they are necessary, and even then there is a threshold on how many threads can be created, so unless the thread limit is increased it won't ever spawn that many. The allocations, on the other hand... 1 million tasks is a LOT of memory xD
My wife says you yell too much. I tried to prove she is wrong.
My argument didn't last a second.
At 9:33 it's said "I hate that phrase, to infinity and beyond" and that it's nonsensical, but it's not: according to mathematics, there are many different kinds of infinities and they are not all the same size; some infinities are actually bigger than others. See Georg Cantor's work, particularly the diagonal argument: e.g. the interval [0,1] alone contains more numbers than there are natural numbers.
Using .NET 8 instead of .NET 6 would result in better perf.
Agreed, especially since .NET 8 will also support native AOT compilation
The big flaw in this test is that the main memory footprint of threads (no matter what kind) is the amount of thread-local data that has to be duplicated. And like most things in software, there is a trade-off between memory and speed (or latency). The sorts of things that thread-local data is used for are memory allocations, garbage collection, I/O, and maybe task scheduling. Usually more thread-local data means less contention between threads on those operations. So, a benchmark that does none of those things is only looking at the cost side without the benefit side. Long running tasks that allocate a lot of memory or do lots of operations like that will benefit greatly from the lower contention associated with running in a "real" thread. Short-lived tasks that have a small footprint or mostly statically allocated memory and sit idle for a significant portion of their run-time have far less potential to contend with each other in harmful ways, so the lower memory footprint and faster spawn-time of something like a plain event-loop or coroutines wins handily. And systems using thread-pools are basically trying to find a happy medium for tasks that are a bit of both. As always, the right tool for the right job.
Obviously, if the "job" is to wait for 10 seconds 1M times concurrently, then a plain event-loop should win hands down because any decent event-loop implementation would boil that down to a single 10 second wait and then flushing an array of 1M small event objects.
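Here's a toy version of that last point, as a sketch only: a made-up TimerLoop class with a simulated clock, not how any real event loop is implemented. Timers sit in a min-heap keyed by deadline, and when N timers share the same deadline the loop "sleeps" once and then flushes all of them.

```python
import heapq


class TimerLoop:
    """Toy event loop: timers live in a min-heap keyed by deadline."""

    def __init__(self):
        self._heap = []   # entries are (deadline, sequence, callback)
        self._seq = 0     # tie-breaker so callbacks are never compared
        self.now = 0.0    # simulated clock (an assumption of this sketch)
        self.waits = 0    # how many times the loop had to "sleep"

    def call_later(self, delay, callback):
        heapq.heappush(self._heap, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._heap:
            deadline = self._heap[0][0]
            if deadline > self.now:
                # A real loop would block here until the deadline (e.g. in epoll).
                self.now = deadline
                self.waits += 1
            # Flush every timer that is due at this point in time.
            while self._heap and self._heap[0][0] <= self.now:
                _, _, callback = heapq.heappop(self._heap)
                callback()


def demo(num_timers: int):
    """Schedule num_timers 10-second timers; return (number of waits, callbacks fired)."""
    fired = []
    loop = TimerLoop()
    for i in range(num_timers):
        loop.call_later(10.0, lambda i=i: fired.append(i))
    loop.run()
    return loop.waits, len(fired)


if __name__ == "__main__":
    print(demo(100_000))
```

The memory cost here is one small heap entry per timer, and the waiting cost does not depend on the number of timers at all.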