Ah the standard pattern:
- Write JS code with a memory leak
- Complain that GC is bad
- Switch to language without GC
Unlike what the article asserts, the problem was never the GC, other than that it let them get away with a memory leak until they decided to do something about their performance issues. They could have fixed the leak fairly easily, but instead chose to write a completely different algorithm in Rust (which would have worked for JS too) rather than trying to actually write good JS code.
Yes, this was a memory leak. Sure, a decent GC usually means that you don't have "leaks" in the traditional sense, i.e. losing track of a block of memory so you can't free it, but a GC can still keep memory around you weren't expecting, because there was a reference to it you didn't realise was there.
In this case, even though the files were being processed "line-by-line" the code was keeping every line in memory by virtue of the small portions being kept in the permanent list of records. As a result, the entire file was read into memory and kept there for the entire operation. The GC could do nothing to help because the code was keeping valid references to each line. Sure it would all be correctly disposed at the end, so not technically a "leak", but the whole point of processing line-by-line is to avoid having to keep the entire file in memory, yet the code ended up doing exactly that.
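Roughly the difference, as a sketch (lineReader is a hypothetical async line source, and the pathname/referrer/status column layout is assumed from the video):

// what the code effectively did: a piece of every line is kept in the permanent records array,
// so memory grows with the file and the GC can't free any of it until the very end
const records = [];
for await (const line of lineReader) {
  const cols = line.split('\t');
  records.push({ pathname: cols[0], referrer: cols[1], status: cols[2] });
}

// what line-by-line processing is meant to enable: aggregate as you go and let each line die
const counts = new Map();
for await (const line of lineReader) {
  const status = line.split('\t')[2];
  counts.set(status, (counts.get(status) ?? 0) + 1); // only the running totals survive the loop
}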
Amen 🤌
Could you fix this by deep copying the data instead of referencing it? And if so how?
17:40 that's what I always say, the problem isn't using GC, it's the fact that having it makes it so easy to create trash. If you manually manage memory and do a lot of malloc/free, you are also going to be slow.
Using the heap that way is bad for performance.
great minds think alike, kudos.
Congratulations y'all finally discovered the Python equivalent of C extensions
Exactly.
Shitty languages mastery unlocked.
nodejs supported C modules from the beginning
@@Disorrder And they've changed 20 times and suck compared to python. You forgot that part.
@@justdoityourself7134 that’s fair
21:00 rust intuitively pushed him to a better solution. that's pretty powerful
Gotta keep in mind that Rust's lifetime notation being explicit enabled OP to find the right data structure whereas in JS it wasn't upfront how memory was being managed behind the scenes.
My thought too. Rust solved the problem by helping them write it better rather than by running it better.
Sounds like a skill issue
Best take.
Love your stuff
JS allows the developer to mostly ignore memory management, and when performance issues are spotted, they can fix them by improving the hot spots in the code. Overall it seems pretty simple if they just stuck with JS and were more careful in their hot path.
Who would have thought that replacing JavaScript by literally anything else makes it faster
Not 90% of web developers, i'll tell you that much.
DHH on his way to rewrite the entire web in ruby
even Python?
enter Bun 1!!!
my former team lead thought of JS as the cream of the crop and its performance as irrelevant.
(writes it in the slowest js possible)... "Boss, it looks like there's no other option but to write it in Rust 😇"
well... is he working at a day job? or does he do video streaming 24/7 at this point?
Both bro lol he is staff arch at Netflix
@@perc-ai what's staff arch?
He's moonlighting Netflix 😂
Bro is so handsome he can do both
People like to bash Stack Overflow, but these guys could have just gone on Stack Overflow, asked "my js code runs slow, how to optimize it", pasted their code, and in a day they'd have a few tweaks to their JS code that gets it to run 2000% faster - maybe not as good as 2500% Rust but good enough - and they would have saved many, many hours spent spinning up 25 containers and rewriting code in Rust.
They probably would have had it flagged as a duplicate and downvoted to oblivion, but then they could have followed the link to the duplicate question. It's quicker than searching ;)
Noob like to bash SO
Nah , the real problem was between the keyboard and the chair.
As a Node developer, I can't agree more. It is so easy to create garbage in Node that it fills up the memory.
For any intensive task, other tools must be considered before starting any project. For these I mainly use Go (love it).
I may try Rust again soon
yeah same. so many people brush off performance and consider "choosing a fast language" an "early optimization". using more resources than needed for anything on a computer should be frowned upon; even a naive implementation in a decently performant language is faster than these interpreted languages people shove everywhere nowadays.
they seem to forget that no program runs in isolation; if everyone makes everything in these languages, eventually we'll have thrown away all the improvements in CPU speeds we've had throughout the years...
Rust is simply amazing when coupled with Node/Web using Napi/WASM.
Definitely worth learning
I'm at a weird junction in my career with 7 YOE in JS & TS. Can you please shed some light on whether I should seriously start learning GoLang, Zig or Rust?
I mostly know how to work with threads and pass data between them, and while learning GoLang it felt like it grew on me more naturally.
But I'd still like to know what I should choose to make myself more impactful and better paid!
@@pbdivyesh Zig has no industry presence yet; it's a niche and the compiler is not even 1.0 yet. There are still some rough edges like interfaces and its async runtime, and there's also a plan to refactor the standard library and change naming conventions.
Go has a lot of use in industry, a lot of server infrastructure tooling is in it now, the modern internet kinda "relies" on some tools made in it to run, it's a very traditional language tho, it's pretty minimal too, I like it but some people don't because it's not as strongly typed as rust.
rust is very nice, but it has a bit of a steep learning curve. it's still finding its place even though it's very popular amongst open source projects rn and big companies have partially written their code in it.
Go is still much more used than Rust I think. You're more likely to get a Go than a Rust job too imo. But Rust, Go and Zig are kinda slightly different in ways that matter to make all of them worth learning (tho learning zig right now is not for everyone as it's still not 1.0, maybe learn C instead for the time being if u want something in the low level minimalist language niche)
@@pbdivyesh payable? Out of the three you’ve posted? Go.
it's nice that in rust, if he wanted to do the same approach as the JS one, he would have had to explicitly clone the line to put it into a record to keep after reading the file, which should sound some alarm bells
It reminds me of the best advice I got in CS:
use the right data structure and the program will write itself.
(Plus: use the right tool for the job, and JS is not it in this case)
that is the case
JS maybe isn't the best choice, but it isn't terrible either: it's a fast, well-optimized, garbage-collected language, and as mentioned in the article they already have a Node.js ecosystem (even their Rust code ended up embedded in JS), so Node.js makes sense. But what they did with the Rust module was just completely unnecessary overkill. It was so unnecessary that I suspect they intentionally didn't show the hashmap version of the JavaScript, because their managers would have been asking some tough questions about why they had to spend all this time working on a Rust version that is just some marginal % better, all while only introducing a more complicated toolchain and build process. But written the way it is, they can just sell it as a big win - shit like this happens all the time when managers in tech don't have strong grounding in computer science and can't read through bullshit.
Even JS should have no problem parsing some TSV files? How is this not just disk/network limited?
This, so true
@@karakaaa3371 I did the work. It's a bit painful to parse TSV as a stream in Node, but with a bit of work I got it processing 2GB in about 20s, and with about 2 hours of work optimized that down to 7.7s using the least natural code possible to avoid any allocation: all indices into typed arrays.
The trivial version in Rust - a baby could write it, and it's a third the size - was 7.2s.
So yeah, realistically shouldn't have been *so* much slower, but if you care about both performance at all and maintainable code, jumping to Rust is hardly dumb.
And viola. Won't somebody think of the cellos?
This entire thing looks like it would be a very straightforward Spark job
Straightforward once you learn spark's ecosystem, which isn't simple.
Yea this is already a solved problem and there are many tools that can do this job easily.
One tool I've used before to do this is Azure data lake analytics. Ingesting some TSV using some U-SQL on a schedule and writing it to a database is easy.
You should never try to reinvent the big data wheel.
Guys it’s literally a for loop
Spark or Hive or Impala or any schema-on-read distributed analytics tool. This was my thought too.
Lets be real, he could've just used chat gpt
17:00 That's exactly what I was wondering when seeing line.split being used. Why are they not just using a simple buffer/fread approach and reading it line by line instead? No promises, just a single hashmap, a buffer, and the processed line, that's it.
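Something like this is probably what's meant - a sketch with Node's readline and a single Map (the file name and column index are made up):

const fs = require('fs');
const readline = require('readline');

const counts = new Map();
const rl = readline.createInterface({ input: fs.createReadStream('logs.tsv') });
rl.on('line', (line) => {                             // plain callback, no per-line promise
  const status = line.split('\t')[2];                 // assumed column position
  counts.set(status, (counts.get(status) ?? 0) + 1);
});
rl.on('close', () => console.log(counts));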
Appreciate the article bcs it gave me some ideas regarding plugins, but I feel like this was a case of not being overly proficient in js+node+profiling and using a document database for structured data you want to aggregate, which just doesn't make sense. 😅
And on top of that, having a TSV/CSV etc is basically the best case, since you can go line by line. Imagine formatted XML or JSON.
The question becomes: what would the JS times be if the data structures were altered to conform to the Rust solution? Did they need to add an extension to do this - and if so, would finding a new solution become a critical "must do now" task? How much time would have been saved by fixing a few lines of JS? Granted, writing the Rust extension was a win, but it appears the problem was found but not recognized during reimplementation - which is the flip side of the "now I understand why they did it that way" realizations during reimplementations.
"not even the simplest projects are that simple"
totally feel that!
reworking a simple application, that uses a node package to "screenshot" what user data a website stores
you can enter some custom url's, it saves those url's and the output paths to the browsers local storage
now it should run on a server with node and storing it to a postgres database instead of storing all the url's and output paths to the local storage
I estimated 4 hours for this
I had only worked with MySQL and didn't even know there are small but meaningful syntax differences (like wtf, both are using SQL??)
had to fight permission issues on the linux system (it was designed to work on windows)
now it's round about 6 hours in and it's like 80-90% done
> There might be certain issues that JavaScript simply can't solve efficiently
Damn, I laughed so hard on this line 🤣
He's reading 240gb, splitting each line, and looking at the HTTP response code. What he should do is read the line character by character, find the nth tab where the code is, and decide if the code is >200, which presumably is a tiny fraction of the total. Now he would eliminate 99.9% of the memory garbage for valid responses, and only process the error lines with split, and it might just work in node.js.
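A sketch of that idea - scan to the nth tab with indexOf instead of splitting every line (the column index is a guess):

function statusField(line, n = 2) {
  let start = 0;
  for (let i = 0; i < n; i++) {
    start = line.indexOf('\t', start) + 1;
    if (start === 0) return '';                       // fewer columns than expected
  }
  const end = line.indexOf('\t', start);
  return end === -1 ? line.slice(start) : line.slice(start, end);
}

console.log(statusField('GET\t/home\t404\tsome-referrer'));  // "404"
// only the rare non-200 lines would then get a full split() and be recorded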
This is a classic Map Reduce problem. What about Apache Spark or big machine and DuckDB?
I approve of this message. Data structures is ALWAYS where it's at.
09:15 doesn't async iterator create promise and {value:something,done:something} object on every iteration?
yes
@@ThePrimeTimeagen that's stupid, even C# has a state machine object that's kept for the entire duration of the async and they just replace a field value with the object returned from async.
Why people use NodeJS instead of C# or Java is beyond me....
@@monad_tcp promises are state machines
The Node vs Bun speed wars are ON !!!
you can also import rust into bun using some plugin AFAIK
@8:18 why did prime say "don't use a promise" to eliminate idle time? is it cause it halts the program to wait for the response and instead it should just start running the queries in parallel?
callback. there is no extra event loop iteration required
performance critical paths should avoid promises
@@ThePrimeTimeagen not exactly sure what you mean by event loop iteration: promise callbacks go on the microtask queue, so technically they all get cleared before the event loop continues.
It's slower than immediately running if only because of the allocation and VM call stack, sure, but it shouldn't be polling for IO slow.
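For anyone following along, the two styles being compared look roughly like this (a sketch; rl is a readline interface and handle is a placeholder):

// async iterator: each iteration allocates a promise plus a { value, done } result object
for await (const line of rl) {
  handle(line);
}

// callback: readline just invokes the handler per line, no per-line promise
rl.on('line', handle);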
When seeing this I’m a bit saddened that people would even think of all these crazy solutions like scaling it up to 25 servers using docker containers, while the solution is so simple: use the right tool for the job. I’ve the feeling that a lot of programmers out there are missing some core fundamentals and are just throwing more CPU and memory at a problem, when it doesn’t perform well. As opposed to truly understanding what’s happening.
0:17 9/11 wasn't 3 days away from sep 6th
Honestly this seems like something relatively easily solvable with node and some lambdas consuming a queue way better than what they had. Just use some rxjs...
Definitely feels like they could have spent 30 minutes profiling the JS version lol
The sun emoji had me smiling the entire time
23:43 - Celebrities Explain DevOps has Flavor Flav explaining Kubernetes at around the 25sec mark
"I Ditched javascript and typescript for backend and desktop apps and my code runs as fast as C++ and Rust"
No shit Sherlock
If swapping one Node module from C++ to Rust gave such a kick, where did the developers of Deno (which is written in Rust) go wrong?
It wasn't swapping a module from C++ to Rust, it was taking a nearly completely unoptimized Node-only loop and putting some functionality in a Rust-based module and rewriting the Node-based side a bit to compensate.
Rule of thumb, the inherent speed and efficiency of a programming language is rarely enough to overcome a poorly written algorithm/data structure.
Porting your bad JS code to rust will only create bad rust code.
Unfortunately it is enough. Your print inside a for loop will take you further in c than in python.
I have done this in JS and the memory issues are easy to solve: MySQL temp table. Our analytics are stored in MySQL instead of Mongo because I'm smart, so I can read through a CSV from S3 50MB at a time, parse the lines from it (yes I use .split("\n"), sue me), and store the results in a temp table.
This way, if something goes wrong, the main data isn't polluted with partial data we need to query for and clear; we also can have the SQL server copy the data to the main table and drop the temp once everything is kosher.
I kinda want to see if i can get a 200GB CSV to try to see how long it would take to parse, but I also would probably migrate to Rust at that stage lol
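A sketch of that temp-table flow (db.query, csvBatches, and the table/column names are all made up placeholders, not the actual code):

// stage partial results in a temp table, then promote them in one server-side copy
await db.query('CREATE TEMPORARY TABLE analytics_staging LIKE analytics');
for await (const batch of csvBatches) {               // e.g. ~50MB of parsed rows at a time
  await db.query('INSERT INTO analytics_staging (path, status, hits) VALUES ?', [batch]);
}
await db.query('INSERT INTO analytics SELECT * FROM analytics_staging');  // main data stays clean until this point
await db.query('DROP TEMPORARY TABLE analytics_staging');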
It's okay to use JavaScript for its convenience and ability to be isomorphic for resumability, and it's also okay to do any expensive algorithm in another language.
AWS glue is a serverless solution that could’ve eliminated most of these infrastructure complexity and speed up the batch job. Given they already use s3 to store the data, they could’ve used an event driven pattern.
Seeing stuff like this really makes you wonder what else they don't understand... Imagine believing that the reason memory went down ~220x is due to switching to rust lol.
8:41 i come here to get a daily boost of laughter. moist balls, upper section.
Single node computer… This man is living in the year 2099. What’s next, statically linking your serverless functions into a single binary executable?
Polars + Python + VPS with enough RAM/CPU and done 😂 25 instances!!
26:40: esbuild does not replace TSC. If you want type checking not just compiling, then you need to run tsc explicitly.
How I would love for Prime to explain why he does not like something, provide an alternative, and explain why that is better; I feel I'm only getting half the story with his narrative style.
It’s fun, but it is also patronizing and pretentious, he could change easily, but he seems happy with his current style, so I’ll keep searching for useful bits from his frat talks and research on my own, thanks for the pointers Agen.
I feel like, since his reactions are from a livestream after all, it'd just get tiring if he had to re-explain his stance over and over again every time it came up, and that's the reason he doesn't do it.
Bruh that query 😂
I tried wix once. It was horrible. The platform had limited functionality. It was slow. The tools they provided were slow. Now I kinda get why.
Nice contributions. I love your style of rhythm to information. However, my question: did somebody investigate using a Set instead of a Map? At least a set forbids duplicates, which could even save you from problems. Thanks and best regards, Tim Susa.
200GB / 25 machines / 3 hours = 740.74 KB of data per second, per machine...
Holy smokes that's bad. simdjson claims to parse _gigabytes_ per second, as a frame of reference.
I love your take here! Good job
you got that wrong: the idle time comes from the nodejs event loop. to be honest it is libuv, which always waits for additional devices to respond. if you ran your code directly on v8 or another engine than node, e.g. es4x, which can use the so-called epoll (epoll is the kernel wait cycle), it would be much faster. hope that gives you some insights.
This is probably a case to break out your c compiler, I'm sure you could blow away these rust numbers. C code could use basically no memory, and run even faster still, and the processing is simple.
So I just tried writing a basic version of this with pretty good, streaming Node code that avoided any allocation, and the straightforward Rust code.
I created a 2GB TSV, 100M rows, with some garbage first two columns and a third column being the index % 200.
After a bunch of careful iteration and lots of debugging and optimization, including manually parsing the number to avoid allocating a string, I got a 70-line version that ran in about 20s. Pretty good, that's 100MB/s!
The Rust code ran in 7s the first time, and was only 24 lines.
Hey, good news! After a bunch of profiling and tuning the buffer size, I got the node version down to only 7.7 seconds, to Rust's 7.2 seconds!
It required replacing every natural API with direct indexes and typed arrays, but you *can*, with a lot of effort, get JavaScript close in performance to incredibly basic Rust code!
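For the curious, "direct indexes and typed arrays" means roughly this kind of thing - a sketch, not their actual code, with the file name and column index assumed:

const fs = require('fs');

const TAB = 9, NL = 10;
const counts = new Uint32Array(600);                  // index by status code: no Map, no strings
let col = 0, status = 0;                              // state persists across chunk boundaries
const stream = fs.createReadStream('logs.tsv', { highWaterMark: 1 << 20 });
stream.on('data', (chunk) => {
  for (let i = 0; i < chunk.length; i++) {
    const b = chunk[i];
    if (b === NL) { counts[status]++; col = 0; status = 0; }
    else if (b === TAB) col++;
    else if (col === 2 && b >= 48 && b <= 57) status = status * 10 + (b - 48);  // parse digits in place
  }
});
stream.on('end', () => console.log('200s:', counts[200], '404s:', counts[404]));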
5:20 got me 💀💀💀
Mongo aggregation is a gamechanger. It's stages; of course if you get the order wrong it screws you - it's linear. Derp.
Streaming data would have improved it significantly without the need for Rust. Unfortunately streaming in Node is kind of a PIA.
Tell me, what should I learn for backend? I know React.js and Rust - do I need to learn Node.js? Or tell me any good path.
It looks like they found different tutorials for JS and Rust and followed them literally. Then they measured and wrote this blog post. By chance the Rust implementation won.
Dart has Macros, Flutter rendering Engine renders 3D.
I don't know rust and I admit that I'm a caveman programmer, so let me know if I'm missing something. I feel that they are blaming the tools more than they are blaming themselves. The performance gain is still remarkable. I'm sure rust is great, but something smells. I have used GC languages to parse big files in the past, not 200GB, but big enough. Back in the day, when computers were way slower, one company used to give us a bunch of hotel info in one giant file. I have used java, php, node and go, all GC languages.
Birthdays are a great piece of information for those with ill intent. Happy birthday! I may be missing the point. I am a Sept baby too.
I would love a video on performance related to data structures.
I'd still like to know if and how this would have been possible in JS. Just blaming JS before finding the exact cause seems a bit lazy to me.
Very possible. Maybe not 25x optimizable but perhaps 15-20x without spending a huge amount of time on a complete rewrite
Lol. I had a data funneling job at Paramount. TBs of data were downloaded, then processed and stored in a new file, and then uploaded to a DB 😂 we would get new data files once a week. It became problematic when the process started taking 8 days 😂
I started streaming it and suddenly it was taking hours, not days. Then I split it across multiple servers and suddenly the 8-day job was taking 15 minutes 😂
For processing this amount of data, Apache Spark would have been a good solution. PySpark with Python will be the easiest path.
Yeah, this was mainly a bad algorithm in JS, this sort of processing would be completely IO limited, even in JS.
At 21:55 he types "vmrss" - is that a custom script? I haven't found any such tool online and it's not available by default on my Ubuntu system either.
Surely this would be trivial to do in Spark / Databricks. Feels like a lot of time was wasted....
would be interested in a cost benchmark comparison between bun / fastify / go
8:41 is why I live here now.
I would like to understand how many sprints they spent on it and why they simply didn't use EMR or AWS Glue 😅
As someone who stopped learning Rust when getting to lifetimes, I think the statement "Rust makes it easy to write memory-safe applications" is a bit misleading and one of the reasons people get turned off of the language, besides the crazy community. The truth is that it's harder to write memory-unsafe code, but the simple things are unnecessarily complicated in Rust, and that's really annoying when you don't drink the Kool-Aid.
Seems like this would be a perfect use case for awk maybe?
2:24 getting padded down by the csv... .
viola 😂 i can't believe you never heard of the word
To be fair it probably should be spelt: violà.
This whole thing could have been a one-liner in awk and it would have performed just as well.
It's JS, where S is "script" - as in "a movie script". Nobody puts every wrinkle in a movie script. It's JavaScript, not JavaCompute or JavaProcess... People just can't read. With JS one should only outline what the actors do in order, but not how exactly.
10:36 Couldn't he create a variable with the object
`let foo = { pathname: x, referrer: x }`
and then after the push do
`delete foo`
😂 I always have a good time watching this guy.
Damn I had no idea about vmrss, that's so useful!
I learn so much with these videos 😂 it's amazing
vooh - a - lah
Maybe I am missing something, but doesn't JS have iterators?
Why not have some iterator over the files and lines, extract the values, and reduce by counting?
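A sketch of that approach - iterate the files and lines and reduce into a count map (the paths and column index are assumptions):

const fs = require('fs');
const readline = require('readline');

async function countStatuses(path) {
  const counts = new Map();
  for await (const line of readline.createInterface({ input: fs.createReadStream(path) })) {
    const status = line.split('\t')[2];               // assumed column
    counts.set(status, (counts.get(status) ?? 0) + 1);
  }
  return counts;
}

async function countAll(paths) {
  const totals = new Map();
  for (const path of paths) {
    for (const [status, n] of await countStatuses(path)) {
      totals.set(status, (totals.get(status) ?? 0) + n);  // reduce across files
    }
  }
  return totals;
}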
what happened to the elixir video?
Is voila pronounced as "vyola" or "vwala"?
I would guess that most websites and SaaSes dont need to parse a 200GB file every day
what is that profiler tab he has in the browser. I don't see it on my browser.
what command did u run? VmRSS share ;(
"2500% performance improvement on Node" without even knowing how to program using Node basically LOL
It seems that our friend in the article could have done everything without Rust, with 1% of the resources used, if he knew devops and good computing practices LOL.
x25 != +2500%
After many years I still don't understand why node.js devs don't use the thing that the runtime is good at: streams
The most preoccupying thing about all this is that the person writing the article probably has a CS degree.
Considering the files were in a fixed format, could he not have used a precompiled regex to match at least the status code part and not split until necessary???
Probably the percentage of non-successful responses was too low to bother
could be, but the article never provided much info; like mentioned in the video, there was no actual profiling that could've given us some idea of what exactly was causing the issue
We gonna do high memory / high CPU / high IO tasks in ... NodeJS. Hahahaha ... sorry but Node just has to GO ..
Joke aside, I don't think there are even many OSS tools that do that kind of thing in NodeJS. One example would be the good old Apache log parsing tools ... they are usually written in C/C++; the newer one we tried was written in Go. One could assume there is a good reason for that.
It shouldn't even be too hard to write something like that in Rust nowadays.
> "Gotta keep in mind that Rust's lifetime notation being explicit enabled OP to find the right data structure" - @FaraazAhmad, elsewhere itt
Sometimes you WANT to think about lifetimes. Sometimes you don't, but sometimes you DO. Rust enables you to make that choice ahead of time and guides you as you discover how to deal with the consequences of that choice.
I doubt this I/O intensive workload is really impacted by node/js performance. If the guy merely understood how to optimize node for large file workloads and maybe use the event queue to efficiently do compute while concurrently doing i/o we would see totally different numbers.
Array vs Map
Bro needs to go back to school
no one mentioned this vmrss tool he is using. I cannot verify that it exists. So is it his own cli tool or a script, or does it actually exist on linux somewhere? I wrote a bash script that looks to do the same thing here.
#!/bin/bash
while [ -f "/proc/$1/status" ]; do
  current_kb=$(grep VmRSS "/proc/$1/status" | grep -Eo "[0-9]+")
  current_mb=$(echo "scale=1; $current_kb / 1024" | bc)   # the original comment was cut off here; bc just converts KB to MB
  echo "VmRSS: ${current_mb} MB"
  sleep 1
done
Most SQL databases allow for importing CSV/TSV FAST!
I wanna be like you when I grow up Mr Primeagen
This guy made an Apache Spark with nodejs???