OMG ITS MY FAVOURITE PROFESSIONAL YAPPER!
5 DOLLARS A MONTH 🗣🗣🗣🗣🗣✋✋✋✋✋
"professional yapper" what a good job description for a streamer lmao
nl clears
nl my goat @@charlesyoung601
Why do I feel this is every so-called tech YouTuber right now
Asking Flip to take something out seems like the most reliable way to ensure that it absolutely does not get taken out.
is flip even real?
Yes @@istasi5201
Narrator:
Flip did, in fact, not take that out (16:00)
Flip. Take this anti-flip propaganda out.
I stand against the establishment
@@flipmediaprod Truly an upstanding and forward-thinking editor. You kept it in for the people 👏🤯🤯🤯
He sounds like he is begging :D
Tip for remembering stalagmites and stalactites. "Stalagmites have a g for ground and stalactites have a c for ceiling", it's how I remember which is which. It was a tip in a Xanth novel by Piers Anthony. I think it was "Man from Mundania", but I'm not sure because I haven't read them in 20+ years. Gosh, that makes me feel old. :)
Ahaha, the true mnemonic is actually just the etymology of the word. I don't know if it's Latin or Greek, but in French, for example, it's m for monter (rise) and t for tomber (fall). Simple.
Stalagmite sounds like dynamite, and you don't want to put that on the ceiling, is how I've always remembered it
Stalactites stick tight to the ceiling. Stalagmites might grow upwards
the JDSL implementation would be 10x faster. Tom's a genius!
JDSL would have melted the CPU from how fast it would be parsing those rows.
FYI: Buffer size of 1024 is terrible, because most modern disks use 4kB sectors nowadays. So some multiple of 4kB is immediately better.
True, but I don’t really know why everyone is doing buffered reads. The challenge says that the file is put in a RAM backed fs (some tmpfs) before running the program. Best way is to just mmap it which is zero-copy and zero-alloc.
@@LtdJorge yep. The file reading and parsing can be improved quite a lot. Even the original Java guys did a lot more.
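For what it's worth, a minimal Go sketch of the buffer-size sweep this thread is discussing (the file name is a hypothetical stand-in, and on a warm run this mostly measures the page cache, not the disk):

package main

import (
    "fmt"
    "io"
    "os"
    "time"
)

func main() {
    // try a few sizes; multiples of 4 KiB tend to be the sweet spot
    for _, size := range []int{1 << 10, 4 << 10, 64 << 10, 1 << 20} {
        f, err := os.Open("measurements.txt") // hypothetical input file
        if err != nil {
            panic(err)
        }
        buf := make([]byte, size)
        start := time.Now()
        for {
            _, err := f.Read(buf)
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
        }
        fmt.Printf("buffer %8d: %v\n", size, time.Since(start))
        f.Close()
    }
}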
Wow: a 4.7HGz with 6000mhz memory. Those millihertz come in handy with the HenryGigaz processor...
It drives me bonkers how they used 10 instead of '\n', and even went so far as to describe the magic integers at 28:21 with comments like "if b == 45 { // 45 == '-' signal"
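A tiny runnable sketch of the byte-literal version the comment is asking for ('\n' is 10 and '-' is 45, resolved at compile time, so the readable form costs nothing; the row is made up):

package main

import "fmt"

func main() {
    buf := []byte("Porto;-9.2\n") // hypothetical row
    for _, b := range buf {
        switch b {
        case ';': // field separator
            fmt.Println("separator")
        case '-': // minus sign, no need for 45
            fmt.Println("negative temperature")
        case '\n': // newline, no need for 10
            fmt.Println("end of record")
        }
    }
}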
the author has a brazilian name. brazil mentioned
A dev from FalleN's Gamers Club (2:06)
let's go!!!
It is also a Portuguese name
Nobody lives in Portugal 😂 @@user-zg2bx4oz2p
I love these posts, there's a lot of tidbits of information to learn.
"Managers be like push it to prod! We're done... Good enough!" @ 20:16. Lol, like every non-technical manager ever.
My guess on the read buffer and diminishing returns: I bet you get max performance when the buffer size aligns with the underlying hardware's size. Like it's best when you read a sector at a time (or however SSDs are addressed/broken down in firmware).
Or file system block size. Typically reading in multiples of the block size is most efficient.
4:35 Assuming Go's `map` is a self-growing (via reallocation) array (like C++ `vector` or C# `List`), then as the `map` grows you'd have to memcpy the whole underlying array, so a bunch of pointers would be way cheaper than a `struct`
You can do vec.reserve(n) in C++; it eliminates the need for expensive reallocation
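A sketch of the pattern this thread is describing, with hypothetical names: storing pointers in the map means updates mutate in place instead of copying the struct on every write, and a capacity hint in make() trims some growth work:

package main

import "fmt"

type stats struct {
    min, max, sum float64
    count         int
}

func record(m map[string]*stats, name string, temp float64) {
    s, ok := m[name]
    if !ok {
        m[name] = &stats{min: temp, max: temp, sum: temp, count: 1}
        return
    }
    if temp < s.min {
        s.min = temp
    }
    if temp > s.max {
        s.max = temp
    }
    s.sum += temp // mutates through the pointer, no copy of the struct
    s.count++
}

func main() {
    m := make(map[string]*stats, 512) // capacity hint up front
    record(m, "Porto", 21.4)
    record(m, "Porto", 18.0)
    fmt.Println(m["Porto"].sum / float64(m["Porto"].count))
}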
mmap
Split the memory space into the number of Cores
Hand out pointers start/end to threads
For every pointer except the first, walk the start forward until just past the next newline (or EOF).
Start ripping from there.
Profit.
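A minimal sketch of that recipe (Linux/macOS; error handling trimmed to panics, and process is a hypothetical stand-in for the real parsing work):

package main

import (
    "bytes"
    "os"
    "runtime"
    "sync"
    "syscall"
)

func process(chunk []byte) {
    // parsing and aggregation would go here
}

func main() {
    f, err := os.Open("measurements.txt")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }
    size := int(fi.Size())
    data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(data)

    n := runtime.NumCPU()
    var wg sync.WaitGroup
    start := 0
    for i := 1; i <= n; i++ {
        end := size
        if i < n {
            end = size / n * i
            // walk forward past the next newline so no record is split
            if off := bytes.IndexByte(data[end:], '\n'); off >= 0 {
                end += off + 1
            }
        }
        wg.Add(1)
        go func(b []byte) { defer wg.Done(); process(b) }(data[start:end])
        start = end
    }
    wg.Wait()
}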
Dammit, now I feel like I will have to do this in Zig or something... But great article. Really shows the experimentation and learning process.
French gang:
Stalag-mite (M like "monter" in French, to go UP)
Stalac-tite (T like "tomber", to fall)
Always remembered it from the C in stalactite being “ceiling” lol.
'might go up, tights come down.'
@@OnStageLighting giggity
Stalag (ground), Stalac (ceiling)
Primeagen's reactions in this video: "wow, that's a lot slower than I would have thought... well I GUESS it is a BILLION items" x1 billion
On his hardware he's I/O bound, so any optimization is useless.
Dude has 32 gigs of RAM. Meaning that, on an idle enough system, most of that memory will be used for file system cache, into which a 13GB file fits quite neatly.
I'm probably not exaggerating much if I say he only read the file from disk once: the first time he ran his program. If not once, then by the fifth run the entire file would be up in RAM for sure. All the rest of the "I/O" tests were performed against memory, which just checked how fast memory copies in chunks of different sizes, with different numbers of allocations, can be performed. Had he been performing actual I/O, there's no way he'd be getting >13GB/s (which a time of ~0.98s suggests.)
In fact, his drive is rated at 497MB/s (manufacturer spec), so on that hardware it's useless to play with the buffer size, since you won't be reading the file in under ~27 seconds, as the first file-read test with the 1024 buffer suggests: 13*1024/497 = 26.78, and I'm pretty sure all the allocations were done during iowait, so it's safe to assume the file isn't exactly 13GB but more like 13.3-13.5 :D
This article is written by someone who probably doesn't understand storage or operating systems too well (using Windows for development - first hint... jk,) but it's a nice experiment in how well you can optimize such an algorithm if your disk bandwidth is infinite.
Oh damn, for-loops are now considered boomer loops? What about while(true)/break loops? Are those dinosaur loops?
Also, back when I was doing that obscure shift-organizer program for hospitals, I used my own fixed-point package to optimize stuff: everything was a single digit of precision anyway, so I just worked with ints representing tenths of hours (another "problem"). Worked well, fast, and didn't use as much space as those pesky floats. I did this before the FP coprocessor was included in Intel processors (i.e. before the 486; my actual development machine was an original IBM PC XT, running an 8088 at 4.77MHz! I needed all the speed I could get).
nah, the dinosaur loops are the asm branch loops 🦖
Well, Go doesn't have while loops, so yes, dinosaur loop for me
Those are biblical times loops
A goroutine is Go's syntax for Tony Hoare's Communicating Sequential Processes (CSP, not like the browser's CSP though). Fun fact: the creator of Go had made several previous languages, all with CSP baked in. In Clojure[script], the simple syntax for CSP was enabled via a library.
CSP has been implemented in JS via generators, but there are implementations with more usage (e.g. for Clojure).
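For anyone curious, the CSP flavor in Go is just goroutines talking over channels instead of sharing memory; a tiny sketch:

package main

import "fmt"

func main() {
    results := make(chan int)
    for i := 1; i <= 4; i++ {
        go func(n int) { results <- n * n }(i) // each goroutine sends, never shares
    }
    for i := 0; i < 4; i++ {
        fmt.Println(<-results)
    }
}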
The current top Java implementation reaches 300ms, but measurements are done on reference hardware (32 cores / 64 threads), so results might differ from wherever the Go guy was running it.
I did some basic aggregation with Node on a 2GB INI file: from memory, with a bunch of work I got it down from 40s written somewhat naturally to about 7s written by a crazy person. The dumb 10-line Rust code took 3s or something.
I like how it's from 95s to 1.96s, whilst inside the article a sub-second result is mentioned.
For one second I read that as millihertz of RAM and was like, why is your RAM going only 6 Hz, are you manually clocking that thing
I am sorry, but you are wrong.
Boomer loops are GOTO and CONTINUE loops.
The simulation code that we use at work was written in modern FORTRAN (FORTRAN 77, not 66) and is full of
GOTO 1000
Do stuff
1000 CONTINUE
I must have missed something. The SSD (Kingston SSD SV300S37A/120G) has a maximum read rate of 450MB/s, so reading the 13GB should take 28.88 seconds minimum. wat. Can someone explain?
Well, I guess the complete 13GB file is cached in RAM by Windows.
@@jhk940 Yup, every OS keeps hot files in RAM; the Java one actually had a final implementation with a ramdisk instead, so the SSD overhead didn't matter
Stalactite - remembered by "it has to hold on tight" :)
In Java it is 1.3 sec
I just clicked on the first search engine result for the one billion row challenge in C, and the guy's result beats the "official" Java winner.
Not surprised.
Not that surprising, the first result is likely to be the best linked (highest ranked) when everyone is talking about fastest implementation in language X.
The word "buffer" CRIES for underoptimised implementation with data being copied between kernel memory and user space process memory.
I think I'd start by doing an mmap of the whole data on disk, assuming it's already in the fs cache.
Can't wait for the "what is your 1 billion row challenge time" question in interviews. (Actually though, taking a legit stab at the challenge for myself sounds super fun and I really wanna see if work will greenlight letting me work on it as part of my training hours)
This video gives me huge flashbacks
Devin's out there taking notes. This whole article is honestly like an AI overtraining on a specific dataset. Its language capabilities even degrade as it reaches its max context window
Unfortunately it produces corrupt data. If you run it multiple times over the same 13GB dataset, it'll produce a different result each time. Some temperature values end up in the tens of thousands, and new locations appear. Signs of race/memory corruption issues.
What? Your program or the program in the video?
@@anon1963 The solution in the video.
@@RenThraysk ah ye, they probably ran the finished program once and were like: "good enough!"
"mutex is a spin lock" technically mutex is just the semantics, not an implementation, and there's a few ways to do it, with different trade-offs.
They generally *start* with a spin lock, but that's just an optimization assuming the lock time is short. They then need a way to put the aquiring thread to sleep, and there's a bunch of ways to implement that. You can do it in user space with just thread sleep and wake functions, which can be good for "fair" locks, but you can also use events or explicit kernel mutexes, which might be better for thread residency.
I’m going to give you a like based purely on the amount of text. I’m happy for you though, or sorry that happened.
technically, anything is just the semantics
I remember having to parse 600TB databases in the gamedev industry; I ended up using Python and the Windows copy buffer to just snapshot the file into memory
Interesting, I have a few questions.
Obviously it can't load 600TB into memory at once: did you chunk your reads, or were the underlying DB files split up naturally?
Were you using a network file system?
Did you run multiple processes and map/reduce or just a single process? I'm curious how long it took in either case
@dv_xl The first layer was using Perforce, so any previous work or code could compare against a locally synced cached version of all unchanged files.
Next you need to break up parallel loops based on file types. ASCII files are super easy to write regex logic for (think file mirroring). I would quickly build a list of all file dependencies (if I was parsing a game map, I listed all the models; if it was a 3D model, it connected what maps and textures used it, etc. etc.)...
Now for the copy trick: depending on file size, when having to parse through larger 1GB+ files you can choose to copy either an entire folder or individual files, and for binary formats you need to do the painful thing of writing a custom binary parser for the data now copied into memory.
I remember back on Wolfenstein a couple of times having to check out the entire repo because German lawyers were like "nein! You cannot have any file names with verboten naming on disk", and when you need to edit file names across an entire project that is weeks away from gold master... not a lot of wiggle room :P
@@dv_xl So the data was all stored in Perforce, so I would store a snapshot with a Perforce timestamp, so I could choose a cached or fresh mapping
depending on folder/file size. Sometimes you could copy entire folders to parse through larger files... it really depended on file types, or single files at a time with a custom binary interpreter, so you could skip entire sections of files and pull out relevant info (I was tracking all assets, where they showed up in engine or in a map, and then all the related textures, models, audio, etc.). It was a reflection system across data formats :P
All done in parallel, and a weird reason to do batches of folders and not file-by-file is the limited number of threads Python would spin up before hitting some arbitrary per-machine number of threads Windows can keep track of :P (also, early exits everywhere: I don't need to parse a 3D model's vertices, or the animation sequence in a skeleton!)
At some point I was checking out the entire project because German lawyers were like "Nein! Verboten! You cannot have nazi-named file folders on the shipped disc"
"but it's Wolfenstein?" -> glad I added the "find and replace" option so I could do mass edits while it was parsing through :D
Timing-wise I had it at around a few seconds, under 1s if the Perforce cache existed (the DB was stored as an SQL file with no read/write locks in Perforce)
The mighty stalags rise, while the other stalags hold tight is my way of remembering which is which hahah
Etymology: m for monter (rise), t for tomber (fall)
Prime, what about Redis changing its licensing model, and Garnet (by Microsoft), written in C#, outperforming Redis, written in C? Help us make sense of it.
These times are too good to be true. Heavy caching through the page cache. He should flush the page cache before every try: 13GB in 1.96s is ~6.6GB per second. No way in hell with the mentioned SSD. Flushing the cache for honest numbers on the same system is benchmarking 101. Did he ever run the Java implementation on his own system to set a baseline, or did he just take the other benchmarker's results? Do people even know how to benchmark?
why would you want a software optimization benchmark to be limited by your disk speed, that’s literally pointless
@@arden6725 why? For reproducibility. His results could now easily be skewed from run to run if, for example, Chrome is having a bad day and filling his memory, thereby flushing his page cache during some runs but not others. If you are unaware of this, you draw wrong conclusions about which changes made your program faster. If you want to take I/O out of the equation, then the benchmark should've stated to use a ramdisk or to generate the data in-process
@@ytdlgandalf this kind of code isn't meant to be run on a workstation but on a server, meaning it'd be able to take full advantage of the machine. On a workstation, all of these low-level impls will fall behind the general impl, because there's no way to predict the amount of resources the environment is willing to give the program in question in order to complete it at the fastest time possible.
@javierflores09 this is about reproducibility. Doesn't matter if its your workstation or a "server".
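For reference, a sketch of the flush step this thread is asking for (Linux only, needs root; equivalent to running sync and then writing 3 to /proc/sys/vm/drop_caches):

package main

import (
    "os"
    "syscall"
)

func main() {
    syscall.Sync() // write dirty pages to disk first
    // drop page cache, dentries, and inodes so the next run hits the disk
    if err := os.WriteFile("/proc/sys/vm/drop_caches", []byte("3\n"), 0o200); err != nil {
        panic(err)
    }
}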
I am wondering why you need a mutex when reading from the file. Why not open the file x times for reading, and use seek to start reading from the right position? The right positions can be computed in the main thread at the beginning, a sort of index. Didn't test it, but I suppose it would remove a lot of the merge logic from the end of the article
I was also wondering this
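A sketch of the idea in this thread: each worker reads its own precomputed byte range via ReaderAt-style access, so no mutex is needed (the file name and offsets here are hypothetical):

package main

import (
    "io"
    "os"
)

func readChunk(path string, off, length int64) ([]byte, error) {
    f, err := os.Open(path) // each worker can open its own handle
    if err != nil {
        return nil, err
    }
    defer f.Close()
    buf := make([]byte, length)
    // SectionReader wraps ReaderAt, so this never touches a shared file offset
    if _, err := io.ReadFull(io.NewSectionReader(f, off, length), buf); err != nil {
        return nil, err
    }
    return buf, nil
}

func main() {
    // the boundaries would be computed once up front by scanning for newlines
    chunk, err := readChunk("measurements.txt", 0, 1<<20)
    if err != nil {
        panic(err)
    }
    _ = chunk
}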
Renato Pereira alone sounds like a cool secret agent driving a very fast classical car
It's a common Brazilian name
Java has proven to be the fastest lang on earth with this challenge ! No other lang can compete
Firstly, this statement is inherently false: it can never be as fast as the fastest asm or C. But more importantly, where did you get that idea? I looked up the Java results from the test and they were 6 seconds. It's not clear what hardware was used for the testing, but it doesn't look to me like there's a good cross-language comparison table anywhere
@@dv_xl fastest java took 1.4 sec
Easy way to know stalagmite vs stalactite: the M is pointing upwards
I don't get it, the file read buffer took only 0.98s!!!! Why is everyone ignoring it!!!
MMAP?
They should have used their "4.7HGz" PC to run a spell checker
16:00 flip did NOT take that out
Flip, more like Slip, cuz he be slipppppin
I'm having some problems solving this in HTML
PORTO MENTIONED!
I guess the best Java solution used mmap.
someone should do the 1 billion row challenge using vim
I'm thinking one way to convert the temp (float) is have a hash map for all 100 possible different values i.e. map("99.9") simply return 99.9....
There are 2000 values, cos of the decimals.
The hash and lookup would be a lot slower than just parsing the numbers directly.
Don't hash it! Just make a 2000 element array, use the raw bits as an index, and it's gonna be fast.
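A sketch of the direct parse the replies are pointing at: readings always have exactly one decimal digit, so "-12.3" can go straight to the int -123 (tenths of a degree) without any float work:

package main

import "fmt"

func parseTemp(b []byte) int {
    neg := b[0] == '-'
    if neg {
        b = b[1:]
    }
    var v int
    if len(b) == 3 { // "d.d"
        v = int(b[0]-'0')*10 + int(b[2]-'0')
    } else { // "dd.d"
        v = int(b[0]-'0')*100 + int(b[1]-'0')*10 + int(b[3]-'0')
    }
    if neg {
        return -v
    }
    return v
}

func main() {
    fmt.Println(parseTemp([]byte("-99.9"))) // prints -999
}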
I actually prefer specific syntax for multiple return parameters 🤷‍♂️ The language is almost certainly creating an anonymous struct under the hood anyway, so I'd rather it be more obvious they're connected/contiguous. Plus you have the option of passing around the entire tuple or destructuring into the components, depending on what's most convenient, which just seems objectively better to me. I love Go, but that's up there with the lack of sum types on the list of things that bother me
we do a little struct { int a, b, c; } fn(int in) { /* ... */ return (typeof(fn(0))){ a, b, c }; }
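In Go terms, the two styles the thread contrasts look like this (divmod is just a toy example):

package main

import "fmt"

func divmod(a, b int) (quot, rem int) { // tuple-style multiple returns
    return a / b, a % b
}

type DivMod struct{ Quot, Rem int } // explicit named struct

func divmodStruct(a, b int) DivMod {
    return DivMod{Quot: a / b, Rem: a % b}
}

func main() {
    q, r := divmod(7, 2) // destructure at the call site
    fmt.Println(q, r, divmodStruct(7, 2))
}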
Probably could do it pretty fast with Bun. Bun.file has some good ability to read file partials, so you could see how big the file is, spawn a ton of threads, and have each handle only its part...
JavaScript also has cool things like SharedArrayBuffers that could enable some more low-level-style memory control...
I got the 1BRC down to 5.5 sec with Node.js. Bun has a bug with the highWaterMark option that makes it less performant than Node (at least in my test)
Remember Amdahl's law.
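For reference: if a fraction p of the runtime parallelizes perfectly across n cores, Amdahl's law caps the speedup at

S(n) = 1 / ((1 - p) + p/n)

so even with p = 0.95, infinite cores only ever buy a 20x speedup; the serial part (reading, merging) dominates eventually.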
One billion comments, lets go!
does flip even watch the videos or just use the markers, seeing as he misses every cut request ;p
I still don't understand how these BILLION row challenges are not entirely IO limited... I mean, even in JS, how do you spend more CPU time than it takes to read that much data? :/
This is my question. How are they getting millisecond solutions? What are they running on? My NVMe drive tops out at around 1500MBps, so I couldn't even process the file in less than 10 seconds...
@@Treslag it's cached in RAM
Java did it very fast with GraalVM native compilation, not with the JVM. GraalVM is very interesting.
Graal doesn't differ that much. One of the main principles behind Graal is AOT compilation, which practically knocks the start-up period of Java programs out of the water.
However, what Graal gains in performance it sacrifices in things like meta-programming. Something like reflection is blocked out and becomes impossible at runtime.
That said, Graal is an impressive piece of software, written in Java, that only gets better with time.
I’m waiting for a cloud vendor to suggest just running all billion in serverless - scale up to what you need to scale down when you’re done bro, e.z.
2 business days: from Friday to Monday.
forEach is faster than boomer loops in newer versions of node and in bun.
Pretty wacky, but true.
Actually not true
This test was done in 20.x, 18.x, and 16.x
By the very definition they cannot be faster. They can be of equal speed if extremely clever compiler stuff happens.
This would require jit to take place as well
@@ThePrimeTimeagen Mmmh, I tested Node.js 21 and actually found it was faster:
const array = Array.from({ length: 1_000_000 }).fill(1);
let time = performance.now(); array.forEach((e) => e); console.log(performance.now() - time);
// run was between 10 - 14ms
compared with
time = performance.now(); for (const e of array) { e; } console.log(performance.now() - time);
// run was between 14 - 20ms
Wonder why it's faster
Prime, do it. Just do it.
Boomer loops sounds like a great cereal, now with fiber.
Can anyone suggest a streamer that is as good at SWE but on the other side of the spectrum, temperament-wise? I'm more of an Uncle Bob kind of guy.
Would be awesome to see you do it in ts/js
whoa! really enjoyed!
2 business days got me)))
forEach, map, etc. are the devil in JS
Before even watching the video I'll guess the biggest gains will come from reducing allocations
How is it done in Java in 1.5 seconds? Now you have to read the Java version)
they used GraalVM native image, it compiles Java to machine code
@@lazyh0rse this wasn't the only reason; sure, it reduced the time by removing the startup cost, but there are many tricks that led to the 1.5 seconds (and even 323ms when using all 32 cores of the test machine instead of just 8). There is a great blog post by QuestDB that explains the tricks used in the top solutions in detail.
KOTLIN mentioned!!!
This cannot be true, the Kingston SSD SV300S37A is not capable of transferring 13GB/sec
Windows caches file reads in RAM when it can, so it's plausible that not all of the reads are hitting the disk
5:50 Why didn't they start with profiling?
I can’t believe I have to point this out but his SSD can’t do 13GBps and so this is all coming from his page cache in RAM. Don’t expect anything close to these results if you flush the cache. In light of that, he should be seeing a much better score if implemented correctly since he has so many threads.
Flip ain't taking it out brother.
the one guy in your chat spamming "hardly know her" jokes
13GB in one second? I think the SSD couldn't even be that fast, right?
how can you read 13GB from disk in 1.5 seconds even :/ I need to watch the rest of the video lol, the timer must've been started while the 13GB was in mem
RAM disk possibly?
That Go person used tabs (8 spaces)?
All Go code uses tabs. The reason it looked excessive was because the default browser styling for the tab-size property is 8 spaces, and apparently they didn't change it with css.
@@Yawhatnever Oh, thank you. I didn't know that.
I just want to say his RAM won't run at 6,000MHz. I found out the hard way after getting 128GB: it down-clocks the rate, because AMD chips can't handle faster memory like Intel chips can.
Overall I chose AMD, but clearly there's more nuance than they all advertise 🤯
flip did not take that out
flip did not cut it out
i'm calling it, multithread/multiprocess overhead is going to show that his single process/thread solution is actually faster
not for I/O operations with 1 billion rows
no lol
I just found your channel and you're the Dr Disrespect of software, get some sunglasses
Could you please do a video on Pocketbase?
Renato Pereira is a Brazilian name soooooo...
BRAZIL MENTIONED LWSGOOOOOOOOOO BRAZIL!!11!1!1!1!!1!1!1!11!1!1!1!1!!!!1!!1!1!1!!1!
Nobody is wondering how he can read 13GB in under a second? Really?
Flip didn't take it out
GOD DAMN IT FLIP
"I have very little experience in these kinds of investigations"
Me: Oh, word, he and I will be talking on the same level
...
Me: Oh, shit, I understand none of this
15:55 - Ignored
Stalagmite - *might* reach the ceiling one day
Stalactite - holding on *tight* so it doesn't fall
I want someone to try this in javascript 😂😂😂
There are two kinds of great professionals who show off their skills: 1) those who make you inspired, and 2) those who throw you into despair. For me, Prime is the second kind. But he's funny, I'll give him that. And his boasting about how he ruined everyone's day when he finished that calc test way ahead of the others back in his uni days is just proof of this.
how can I insert a yo mama joke here, or an insult involving your mom?
8:00 The reason is the buffer size of your HDD/SSD.
Joelang
Gopoutine
Why Windows? That's gotta count for half the slowdown. You want to optimise? Get rid of Windows.