Article includes optimization steps that reference relevant assembly
Theo: This is chaos, I hope no rust devs are actually ever doing this
Me: This is sick, I wish my job would let me do this sort of shit
Same. I got to do some manual parsing of a huge XML file in Go many years back and it was so fun and so satisfying to get it down by like 95%.
Embrace chaos! Yes, PhDing is sick :)
For him, anything he doesn't understand: "this is chaos".
This is the only kind of thing I'd like to be doing 😅
"824 kb, that's not bad for a billion rows"
I was like, what the fuck? That's not even a billion bytes.
Yeah that was one hell of a weird sentence to say. :D
Exactly
@@Snollygoster- God tier compression algorithm obviously
That wasn't the real file, the real one is 12gb
@@Snollygoster- Theo is overrated, he's more of a successful YouTuber than a great programmer.
6:14 “Fuck, am I a bottom?” -Theo
what does the joke mean
@@gerkim62 it means you are the int and someone else is the BigInt
👀👀👀
@@gerkim62 Google it while not at work
the quote of all time
NO, I'M RANTING
12:30 - dude starts looking for where splitting is happening - has PARSELOOP VISIBLE ALL THE TIME
13:02 - dude gets to the function, finally
13:12 - AFTER 10 SECONDS DUDE DROPS "ESOTERIC JAVA BULLSHIT" AND REFUSES TO ELABORATE FURTHER
IF THATS HOW JS/TS DEVS WORK THEN OMFG
This guy is why low-level developers mock and shit on Javascript developers. I'm sure there are way better ambassadors of the Js community, but this guy is not helping at all.
Dunning-Kruger is strong on this one.
"not a CSV if it doesn't use comma separators" -- absolute nonsense, in practice csv files are so unstandardized you're lucky if they have separators and values.
In practice, especially in Europe, CSV most of the time means you'll find a semicolon as the column separator.
Even Excel does it when saving as CSV, since localized number formatting means commas might be used as decimal separators.
CSV literally means comma separated values?
@@okmarshall the abbreviation is used for both Comma-Separated Values and Character-Separated Values.
It would be comma if you go by RFC 4180, but since nobody bothered writing that down until 2005, in practice CSV can be pretty much anything.
Or as the PostgreSQL documentation puts it:
"Many programs produce strange and occasionally perverse CSV files, so the file format is more a convention than a standard."
As a Data Engineer TBH, unless you're planning to do this daily, you'd just throw it at whatever data ingest or database tools you have set up and eat the time it takes as a one off. More time would be spent checking that the data was all 'nice' and dealing with any weird errors that are usually inevitable in real world data.
I'd just be thankful that the data in the test is all UTF-8 and not some weird set of codepages from 40 years ago, where you have to figure out if it's big or little endian.
Just be thankful the data is in a digital form at all.
For a while in the 2010's I was working for BP on their pipeline corrosion data, and it was all stored as a combination of paper documents and pdfs.
Honestly, finding efficient ways to convert hand-drawn reports in a variety of different formats into usable digital data was one of the more fun projects I ever did. But it was certainly a step up from any other data sanitization I've done in my career.
@@JeremyAndersonBoise as someone who's done this sort of thing in the real world (heavy streaming processing of >1TB of data in a shared-nothing processing situation), I wish people's takeaway was, "how can I apply this to the real world" and "why is this faster?" rather than, "this was fun, but useless."
@@thewiirocks Unfortunately, the takeaway will be people writing this type of obscure code everywhere to process 100 lines of data. Aka the leetcode symptom: people take things to the extreme.
@@doc8527 if only they'd apply it to designing ETL systems. We could save YEARS of processing time and DECADES of man-hours by handling processing better. Per system.
For my app I have to parse like 1 GB of JSON in about half a second without consuming more than 100 MB of memory and without using more than 1 thread.
Watching Theo not open the file with less was painful
Oooo, yeah, I less everything
I wrote myself a hex-viewer that uses mmap(). It can handle any size of file like nothing (on 64-bit architectures). But it's of course displayed as hex, not as text. I can still search text and everything. Maybe I should also write a text-viewer that can do that; it needs a bit more logic for finding line starts/ends.
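For the curious, a minimal sketch of the same mmap idea, in Python since its mmap module wraps the same syscall (the file name and search string are placeholders):

import mmap

with open("measurements.txt", "rb") as f:          # hypothetical file name
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # the OS pages bytes in on demand, so a multi-GB file "opens" instantly
    pos = mm.find(b"Hamburg")                      # search without reading the file up front
    if pos != -1:
        nl = mm.find(b"\n", pos)                   # line boundaries are just newline scans
        end = nl if nl != -1 else len(mm)          # handle a missing trailing newline
        print(mm[pos:end].decode())
    mm.close()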
He kept trying to figure out how to count lines in a file. And I was like, "please use 'wc -l' already". But he stubbornly pretended to not know what less or wc commands do.
@@mubashir3 What makes you think he pretended?
@@blenderpanzi > What makes you think he pretended?
:)
I was being charitable.
If anyone ever needs proof that a little knowledge is a dangerous thing, they should subscribe to Theo's channel.
for those asking about copilot, this video was recorded multiple months ago
Enlighten me, is something wrong now with Copilot?
I think he does not use Copilot anymore, he uses another one @redyau_
@@redyau_ I assume people asked because he has moved from Copilot to Supermaven
9:51 "I f-ing hate JavaScript." I like to hate on JavaScript like the next guy, but that is just IEEE floating point math! That's the same in *every* language that implements IEEE floating point! C, Java, C#, Rust, Go, ...
It's not the languages that implement IEEE754. It's in the hardware. If you try to compare two FP numbers in 8086 assembly and at least one of them is NaN, the FPU will always return false.
Like, unless you purposefully move those values from FPU registers into regular CPU registers and then XOR them, NaN will never == NaN.
@@igorordecha Indeed, although there can be software implementations independent of the hardware. But I don't know of any language that does that. It is conceivable that a compiler might do that for constant folding when cross-compiling, or to be really sure that no floating point bug lands in the compiled binary. Not sure if any actually do that. I think usually they just don't constant fold floats (maybe trivial things like 1.0 * 0.0).
The reason why I said languages that implement that is because those languages specify that their floats are IEEE floats (not sure about C). If the language is ported to an architecture that doesn't do IEEE floats they have to emulate them in software.
@@igorordechadepends some dont support fp natively
AFAIK, it almost always goes down to the CPU. Usually where you hit a wall is with small CPUs, embedded and such. Some compilers would compile FPU operations in software (with terrible performance), or some might not even support it at all. At least that's what I remember from my school days
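The behavior is trivial to reproduce outside JS too; here's the same check in Python, purely as an illustration of the IEEE 754 rule:

import math

nan = float("nan")
print(nan == nan)       # False: IEEE 754 defines NaN as unequal to everything, itself included
print(nan != nan)       # True, for the same reason
print(math.isnan(nan))  # True: the portable way to actually test for NaN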
I like how Theo sees a scanner and says it's obscure java lol
I loved that he somehow thought a BILLION rows would fit in 824 KB, uncompressed. 1 billion newlines alone would necessarily HAVE TO be 1 GB. Adding a name, a separator, and a number makes it dozens of GB. Still laughing when he did that.
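The back-of-envelope arithmetic, assuming a typical row like "Hamburg;12.3" at roughly 13 bytes plus a newline:

rows = 1_000_000_000
print(rows * 1 / 1e9)   # 1.0 GB for the newline bytes alone
print(rows * 14 / 1e9)  # ~14 GB for realistic rows, in line with the ~12 GB real file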
I write a bunch of Rust and one of the reasons I love it so much is actually because parallelism is so easy in it. Arc and Mutex can be a bit to wrap your head around, but that's only if you need to share info between threads; otherwise it's a breeze.
Theo: "If I know anything about Rust, it's not easy to parallelize." Alright, so he knows nothing about Rust then...
yeah fr
just another comment that immediately tells you that he has zero clue what he's talking about when it comes to rust. honestly super annoying at this point.
@@remosenekowitsch2609 lol even if you didn't know about Arc
spoken like someone who truly does not understand concurrency, in rust or elsewhere
It's faster, safer, and slightly more accurate when calculating means to save any division until the end. Just calculate visited like you were, and also calculate sum. Have mean be a calculated value - not a stored one - whenever you want to print it out.
The only catch is that you need to store the sum in a type that won't overflow. But the upsides are numerous.
No expensive division and multiplication. (Only one division to display final result.)
When multithreading with a single point of storage for results, you don't get incorrect answers when two threads both increment visited before either updates mean. All you need to guarantee is that all the increments and additions get processed exactly once, and order doesn't matter.
Fewer operations to compound floating point errors. Average error magnitude is approximately sqrt(N)*P where N is the number of operations capable of introducing a rounding error, and P is the precision of the stored value (i.e. 2^-[mantissa length]). This one is admittedly a nit, as the primitive number types all have way more precision than is required for the task. But if you were creating a custom number class to optimize for these calculations, this would allow you to save half a bit of precision in the mantissa.
Probably most importantly though, you can store total temperature as an integer value measured in decidegrees, avoiding the hassle of floating point arithmetic altogether.
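For concreteness, a minimal Python sketch of that bookkeeping, with temperatures kept as integer tenths so floats only appear at print time (names are illustrative):

stats = {}  # city -> [count, total_in_tenths]

def record(city, temp_text):
    tenths = int(temp_text.replace(".", ""))  # "12.3" -> 123, "-5.0" -> -50: no float math
    entry = stats.setdefault(city, [0, 0])
    entry[0] += 1
    entry[1] += tenths

def mean(city):
    count, total = stats[city]
    return total / count / 10                 # the only division happens at output time

The adds commute, which is exactly what makes merging per-thread results safe.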
What do you mean "They are not calculating average the traditional way, they are calculating it via the sum plus count." That's literally the definition of average. Also, your solution doesn't prevent overflow, since `mean * visited` is the sum. You are just repeatedly dividing and multiplying the sum for some reason.
Don't forget the merry hell of floating point accuracy with that many divides and multiplies.
17:51 > why is it stored as an SVG when it's not actually an SVG?
but it is SVG, what do you mean?
He doesn't know programming. Only influencing
@@twitchizle Do you watch actual programmers? There's plenty of people out there who do exclusively long-form programming content. The problem is that hardly anyone actually watches those people. It's not a viable way to be a streamer or YouTuber. It simply doesn't get enough views to be viable.
@@Bobbias yes, i didnt say otherwise
It's just really rare to encounter SVG files with embedded JavaScript.
@@LightTheMars, no, it is now the go-to standard way to hack things, which Eva has used to hack S3 upload buckets, grabbing session cookies from the same domain and getting user data.
6:15 😂 “bottom submission? Fuck, am I a bottom?” Came for 1 billion lines…stayed for this one.
10:40 well, it’s test data…you aren’t supposed to do your final code on test data, you test it with a subset of the data, maybe a few rows to get the parsing right, and then up to a million to stress test it, and then use the program on the full data when it’s working as expected.
BigInt cannot overflow. They are arbitrary-precision integers. The only "limit" is that Safari sets a theoretical cap at 1 million bits, but there's nothing in the spec that specifies a maximum limit.
1 million bits is pretty small for a bigint
@@incription Given that a bigint has 'infinite' size, yeah that is pretty small. But also who uses a million bits in a browser to store an integer.
@@dealloc it's important for fractal simulations
I wonder where it sits in terms of performance, tho. I mean, a BigInt would have to be checking for overflows and such. Isn't a 64-bit integer the better choice here?
@@hevad Yes, I am not sure why BigInt was chosen. I think the idea is that theoretically the accumulated number could overflow 64 bits given many records, if each record could be a 64-bit integer as well. But I am not entirely sure that was the case, and it's very theoretical given the max value of a 64-bit signed integer, not to mention unsigned.
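The overflow worry is easy to bound with the challenge's constraints (values capped at 99.9, i.e. 999 in tenths):

rows = 10**9
worst_sum = rows * 999        # every row at the maximum, stored as integer tenths
print(worst_sum)              # 999000000000, about 1e12
print(worst_sum < 2**63 - 1)  # True: int64 tops out near 9.2e18, so BigInt isn't needed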
A developer not being able to open or search a 14GB file is peak humor for me as SysOp.
I abandoned SysOp'ing to become a developer and never looked back
How would you do it?
opening or searching such large files is not really the day to day task of most developers..
Whatever you do @@codeChuck, it needs to be async unless you have infinite RAM.
@@privateagent These are Linux tools. What do u recommend for ur average Windows developer who has like never seen a Linux CLI? Ofc most stuff is easy when u know what to do
“If I know one thing about Rust then that parallelizing things is hard with it”
One of the main advantages of Rust is fearless parallelization, it says so on the webpage and all (and if you check the code, it’s true: It’s very straightforward)
uhhh no lol. it does suck, and you lose a ton of flexibility for a few minor gains in data races (which aren't.... really even a thing in Java)
@@chop098 CVE-2013-5512, CVE-2015-1882, CVE-2020-17534, CVE-2022-33915, and others disagree.
Mean * visited + temp will 100% introduce rounding errors
maybe this would not happen if one were to do the multiplication as log adds, e.g. exp(log(mean) + log(visited)) + temp
i realize this is golfy but perhaps that is ok, since the whole challenge strikes me as a celebration of golf
@@flight_risk if mean is negative, Math.log(mean) will be NaN
multiplication using logarithms is less accurate:
; Math.exp(Math.log(10)+Math.log(10)) == 100.00000000000004
; 10*10 == 100
@@flight_risk I'd say that would take care of limiting the range (i.e. not having the values grow wildly out of control), but would actually introduce more precision errors.
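Both points are one REPL session away; the same numbers in Python, mirroring the JS snippet above:

import math

print(math.exp(math.log(10) + math.log(10)))  # 100.00000000000004: the log-space multiply rounds twice
print(10 * 10)                                # 100, exact
# and a negative mean is a domain error outright:
# math.log(-5.0) raises ValueError (JS's Math.log returns NaN instead)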
16:52 yes, but your solution has floating point issues that will be much more prevalent since you're multiplying and dividing so much more often
Not only that, the multiplication `mean * count` (which he does every fucking step) is LITERALLY pushing his numbers to the same levels as just doing the sum. Still he says "his solution keeps the numbers lower" 🤣🤣🤣🤣🤣🤣
It's not even his solution, it's just what Copilot spat out!!
Seriously, Dunning-Kruger is strong on this one.
The Elixir solution at 24:42 is very clean, although it doesn't do any manual parsing or structuring; still, that seems like a nice and useful pattern to know when in need.
18:27 "just use the browser"
this is why we can't have nice things
Java is still kicking? Are you high bro - of course Java is still kicking, it's currently the best it has ever been.
"currently the best it has ever been"... Don't ask what JRE to install.
@@Maxjoker98 any of them? I mean, Oracle's is akin to Chrome (except paid) while all the OSS ones are builds of Chromium. Pick one. Any one. Or just run the package install and be happy.
Java honestly doesn't seem that bad, but its development tools seem archaic or nonexistent. However, please note I really don't know; I've only coded Java from Neovim and Zed for CS 1&2, so I haven't looked too far into the ecosystem. Getting an LSP set up for Neovim is not easy; you have to link and use the Eclipse development tools.
@@Maxjoker98 The JRE doesn't exist anymore. Now it's just the JDK.
@@Mig440 So, can you even install security updates for decades-old Java software in the future?
Or is a self-contained deployment optional?
Esoteric Java? How? I’m not a Java dev and that just looks like using the language to me.
Thank you, Theo's editor, for reducing the number of "Fuck, what's this?!" moments to an acceptable level.
I love these kinds of competitions. I always learn so many tricks that I would never have found on my own. It's crazy how much knowledge is out there; some of this stuff I still don't fully understand, what they did or why it helped, but that's almost the best part, because I can go find out.
16:56 what? There's literally no difference in the size of the number between a sum and (count * mean), which is what you're doing. Your approach isn't any less susceptible to overflowing.
And in addition it introduces many rounding errors.
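You can watch the drift accumulate; a quick Python comparison of the per-row running-mean update against plain sum-and-count (synthetic data, so the exact digits vary run to run):

import random

vals = [random.uniform(-99.9, 99.9) for _ in range(1_000_000)]

mean, n = 0.0, 0
for x in vals:
    mean = (mean * n + x) / (n + 1)  # the multiply-then-divide done on every row
    n += 1

exact = sum(vals) / len(vals)
print(abs(mean - exact))             # small but nonzero: a million extra roundings add up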
He prides himself that "his solution keeps the numbers lower". It's not even HIS solution!! 🤣🤣🤣🤣 It's just what Copilot spewed out!
Seriously, Dunning-Kruger is strong on this one.
I'm over here wondering why you didn't just use wc and head to validate the number of rows (and make sure there wasn't some header text).
The flamegraph is literally an SVG lmfao
"It uses semi colon making it not a CSV" is so funny when living in a predominently french place cause ALL csv is semi colon delimited
Worked with a German PLC that saved data in comma CSV. Problem being that the data also had commas.
@@philipbotha6718 I just hope it was only the values that had commas, and not randomly sprinkled in the key names
@@mikemhz That's one of the issues. There aren't any key names (so at least I don't have that separator issue in the key names...); one has to count the columns. And the rows have variable lengths (i.e. variable numbers of semicolons) in the semicolon-separated values...
The software is in German as well.
saying that the elixir solution is cool is just saying that polars is cool. which it is, sure, but it's not elixir, it's a native rust library. here's the python solution using polars, takes around 10 seconds (i presume the elixir one takes around the same without having to compile it or whatever):
import polars as pl
df = pl.scan_csv("measurements.txt", separator=";", has_header=False,
                 with_column_names=lambda cols: ["city", "value"])
grouped = (
    df.group_by("city")
    .agg(
        pl.min("value").alias("min"),
        pl.mean("value").alias("mean"),
        pl.max("value").alias("max"),
    )
    .sort("city")
    .collect(streaming=True)
)
for data in grouped.iter_rows():
    print(f"{data[0]}={data[1]:.1f}/{data[2]:.1f}/{data[3]:.1f},", end=" ")
You can make it lazy
We do read the assembly if we're writing core libraries. But that's probably (my guess)
824k for 1 billion rows?
"Not too bad" seems a little... understated.
"All things considered" ? What was considered? 😂
🤣🤣🤣 The first brain fart that comes to mind
12:18 If you specify the maximum possible value, you can immediately see if there is a problem (an impossibility); if you use a small maximum value as the initialization, we'll only know that the current maximum is greater than the previous one, without knowing if it should be possible at all.
*Edit: 13:43 You can even see this behavior in action: the output max with a value of 3415.9 is obviously wrong, since the maximum possible value is 999, but the way it was set up, this case slipped through.
17:00 Neither BigInt nor a normal int64 would overflow, as 10^9 * 999 is only 9.99 * 10^11 and int64 can hold values ranging from minus to plus 9.2 * 10^18. The float32 is probably fine, as not all values are max values, but potentially not fine, since a single-precision float can only exactly represent integers below 2^24 ≈ 1.7 * 10^7 (IEEE 754).
I always consider CSV to mean character separated values. CSVs become harder to parse quickly when the text inside contains commas. E.g.
First entry,"second entry","third,entry"
I laughed out loud more and harder than is reasonable for a tech video. This was hilarious all the way through, very well done 😂
19:40 You still have to look at the generated assembly if you want to get the runtime from 40 seconds to less than 1 second!
I am really curious about the trillion row challenge.
I am fascinated by how much a problem changes at different scales. How seemingly simple things become very complicated when there are *many* of them.
Unsafe calls make me feel like my machine is unsafe
"Java is still kicking"
Is this what an American programmer is? They are not aware of anything outside their border
Is it good fun, should I watch? ) About five years ago I shipped a streaming "xml-like-but-not-xml" parser for 5 GB incoming files in Node.js TypeScript. Worked alright.
What I've noticed - practically nothing in ecosystem was designed around streaming. Most projects just assume the file is small enough to fit in memory.
There's no SAX for Javascript? 🤔
Love videos like this one where you dig into code and eff with your own solution. Would love more features like this dude!
33:25 Always look at the total wall-clock time. User time ignores all the memory allocation and IO waits, which are important for these kinds of tasks.
1 trillion next :)
Thanks for video, was great to see comparisons :)
Oh yea, JavaScript is plenty fast!
Puh, I'm a software engineer in the scientific community. I had to parse multiple 2 TB text files (because scientists are dumb). I do have a supercomputer though, so it wasn't that bad.
Try the line stream transformer from Deno's standard library (available for Node and Bun via JSR)!
The real challenge is doing it in powerpoint
CSV= colon separated values
at my first job i stored user avatars as base64 in a postgres column
Glad you left!
Thanks editors! Looks great 😊
I had to do something like this for work: get the directory contents of some millions of files in sometimes thousands of subfolders. Just the directory paths alone were 7 GB in a txt file.
Can you add link of rust post to description?
head -n 200 measurements.txt > sample.txt
Always do this when writing code like what's happening here. It'll run in milliseconds and allow you to capture core behaviors in a viable functional test suite if that's more your style.
so why copilot again?
another comment says this video was recorded months ago
You can see at 2:38 that the stream is from February 14th.
So cool. During this video I implemented my own solution in C. It uses an unbalanced binary search tree, because I don't know how to balance a tree or implement a hash map. Generating the data took 11 minutes, meanwhile my solution took 18 minutes and 30 seconds to run 😅. Definitely not trying the one trillion row challenge though.
"your mission, should you decide to accept it, is deceptively simple: write a Java program-..."
this is where the issues started, using Java instead of using C, C++ or CPython
Use “less” to open large files.
Also, this video was fun to watch.
Rust compile-time macro could be crazy
Esoteric Java bullshit? It's just some token parsing...
Perl is the king of text processing.
I think mc (midnight commander) file viewer (F3, not F4) can handle files of any size since it does not read files into memory.
Python going all in on the Trillion Row challenge = how fast can I access vectors = how many parameters can I query in an LLM for an input. More parameters = smarter results.
You were multiplying current average by current count in every loop step. It can overflow just as easily as the sum.
I am a bioinformatic scientist.
Loading and parsing >20G files, and comparing them to other >20G files is a daily event.
I just allocate a terabyte of RAM if I know I am doing some heavy lifting.
I am very tempted to try this in R just to play with some easy data
chuck norris can do it in assembly in negative time
What is negative time?
Is copilot back?
The author wrote some articles about Debezium
One of my professors did a kind of map reduce in terminal one time with just terminal commands. Not 1 billion rows but it was still cool
As an old timer, it's hilarious watching Theo rail against Java in the same way Java engineers used to rail against C/C++.
Of course, Java engineers sometimes had a point. Typescript, OTOH, isn't a real language. Change my mind.
"Real language" had no objective meaning in the first place.
So it's a meaningless comment on both subjects: programming languages, and the English language too.
@@HappyCheeryChap a reasonably nice try. But "English" can be understood by pretty much anyone who speaks English. So can Ebonics. But many would have you believe that Ebonics is its own language rather than a mode of English speaking. (i.e. accent + slang + local culture)
It's like saying that Australian is a different language. Not really. Overcoming some of the accent and local terms might take a bit of work, but it's perfectly possible to communicate.
Typescript is just Javascript Ebonics.
@@thewiirocks The point was that saying something "isn't a real language" doesn't actually communicate any objective point that people could "change your mind" on, or even really discuss at all.
All it tells us is that you don't like it. And refusing to expand when asked implies that maybe you can't actually articulate why that is.
My only guess is that it might be called "not real" because it's a transpiler? But who is going to debate that it isn't a transpiler?
Is there actually some technical opinion you have on it, that somebody could attempt to "change your mind" on?
Or even some minor clarification on what "real" is meant to mean?
@@HappyCheeryChap so I'm a bit late replying. Apologies for that. I didn't expect to catch anyone looking to have a serious conversation on the matter. I honestly appreciate that you're taking this seriously.
Let me break this down from a technical perspective: The problem is not that Typescript is a transpiler. Transpilers are very common in this space with several far more effective examples on hand. (e.g. GWT, WASM, DART, etc.)
The problem is that Typescript is *not* a transpiler. At best it's a pre-processor. And not a terribly good one at that.
When I write code in GWT, C/C++ compiled to WASM, or DART, I am presented with semantics that are fundamentally different from Javascript. Accounting for these different semantics means that the Transpiler has to perform significant work to maintain those semantics. e.g. If I'm able to assign an incorrect type to a variable in Java and then make duck-type calls on those types, that's a serious problem for the GWT Transpiler. It's a bug that needs to be fixed.
Typescript offers the illusion of transpiling. It provides a type system that does compile time checking, but none of the checking semantics are maintained at runtime. So it becomes far too easy to perform type violations and duck-type invocations on accident. And no one is going to fix this!
*Because what you're really doing is writing Javascript with a type-checking pre-processor*
Ok, but that's still better than writing pure Javascript, right? Clearly we're solving a common problem and making things better.
*Wrong*
I'll skip over the most abstract arguments about code behaving unexpectedly being a problem. That is an issue, but it's not the strongest argument against what Typescript provides. The strongest argument I have is that Typescript semantics strongly encourage OOP-style procedural code.
I'm sure right now you have a very confused look on your face, and you're thinking, "what's wrong with that?"
What's wrong is that we've already established that you're really writing Javascript. Just with a pre-processor. And while C++ started as a pre-processor, it relied on C semantics. It didn't try to fundamentally change them at the time.
Typescript is the opposite. It pretends not to rely on Javascript semantics, but those semantics are still very much there. And the problem with writing OOP/procedural code in Javascript is that Javascript is a _terrible_ procedural language.
Javascript was designed as a functional language with list-processing (LISP) style semantics. It is truly fantastic when you successfully hook into its functional nature. This is the reason for the dynamic typing of the language and many (though not all) of the design choices that your average programmer pooh-poohs. And since they're convinced that Javascript is a terrible language, they get frustrated that they're writing so much code to work around what they perceive to be Javascript's failings.
And thus we come to the fundamental failure of Typescript as a language: It's an unenforced semantic layer on top of Javascript that encourages the developer to write objectively worse Javascript. Thus it is not a "real" language.
QED.
@@thewiirocks The part where you were "sure" about me "having a very confused look" re "what's wrong with that" couldn't have been more wrong. Pretty random assumptions.
Pretty much all the use of TS I see is far more FP than OOP. Pretty much any thread on OOP/procedural style code/libs in TS (becoming less and less common) will be full of comments suggesting it be written more FP/Haskell style. Discriminated unions / result-types etc are pretty much the norm over things like OOP inheritance, or even using classes at all, aside from those who haven't learnt FP. And for them, TS + its modern ecosystem is more likely to encourage them to learn it than plain JS anyway I think.
Dunno what you mean re TS "pretends not to rely on Javascript semantics". Anyone who understands TS, knows what it "is" and "isn't". If some people are confused about it, that's just their own misguided interpretation. If you mean that Microsoft employees or some other specific person/people are making inaccurate claims or something, maybe say that. TS just "is" what it is. Whether it "pretends"... only comes from what people write about it, or how they misinterpret what it "is".
As a general net rule/effect "encourages the developer to write objectively worse Javascript" seems pretty off base to me. Opposite to what I see every day. Most plain-JS written libs are full of stupid decisions that likely never would have been made if they were using TS. From everything I see... on average, TS written code is better than plain JS.
Likewise I could try to claim that non-TS "encourages the developer to write objectively worse Javascript", but neither you or I are being objective on that, we have no metrics, and we're just giving our own anecdotal summations.
But of course with the amount of libs in the NPM ecosystem, you can point to many exceptions to the rule. And we tend to notice the bad stuff more than the good.
If you're seeing different, and you think that's the norm, then maybe depends on what you're focusing on. Maybe with like NestJS or something, I'd agree with you... on a case-by-case basis for specific libs/code bases. But that's not the norm these days, vast majority of us are abandoning OOP for FP, and TS has been a great catalyst for that. It was my "gateway drug" for getting into FP, and some other languages like Haskell, F# & Rust... and has been for many others too.
I've never heard of TS being something that gets people more into OOP/procedural than they already were coming in. But does it help someone who only knows OOP from other languages sticking with OOP in JS/TS? I guess so. But it does that just as much for FP or any other style too, being pretty flexible in doing any of them. But I don't see encouraging OOP as the overall net effect.
Appreciate you taking the time to explain your thoughts though. Much more useful than the initial vague/subjective "real language" thing.
Does seem you're seeing quite a different net effect though, if these are your conclusions.
Lacking runtime safety certainly sucks. And there's things from other languages that would be good to have. But seeing TS as a net-detriment -vs- plain JS seems very subjective to me, and the other 99% of informed TS users who experience all its benefits in our daily use.
They all use map/reduce on file chunks. That's fast enough for most needs. No need to optimize single-thread performance beyond the most obvious; unless you are in a competition, those optimizations offer diminishing returns.
Just doing that, your trivial JavaScript solution will already be at sub-5-second speeds, depending on your cores.
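A rough Python sketch of that map/reduce shape, purely to show the structure (multiprocessing stands in for JS workers; the file name and the newline-snapping assume the challenge's line-based format):

import os
from multiprocessing import Pool

FILE = "measurements.txt"  # hypothetical path

def chunk_bounds(path, n):
    # cut the file into n byte ranges, snapping each cut to the next newline
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, n + 1):
            end = size if i == n else size * i // n
            if end < size:
                f.seek(end)
                f.readline()      # skip ahead to the end of the current line
                end = f.tell()
            bounds.append((start, end))
            start = end
    return bounds

def process(bounds):
    # the "map" step: per-chunk min/max/total/count, assuming well-formed "city;value" lines
    start, end = bounds
    stats = {}                    # city -> [min, max, total, count]
    with open(FILE, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            city, _, val = line.partition(b";")
            t = float(val.decode())
            s = stats.get(city)
            if s is None:
                stats[city] = [t, t, t, 1]
            else:
                s[0] = min(s[0], t); s[1] = max(s[1], t); s[2] += t; s[3] += 1
    return stats

if __name__ == "__main__":
    with Pool() as pool:
        parts = pool.map(process, chunk_bounds(FILE, os.cpu_count() or 1))
    merged = {}                   # the "reduce" step: fold per-chunk stats together
    for part in parts:
        for city, (mn, mx, tot, cnt) in part.items():
            s = merged.get(city)
            if s is None:
                merged[city] = [mn, mx, tot, cnt]
            else:
                s[0] = min(s[0], mn); s[1] = max(s[1], mx); s[2] += tot; s[3] += cnt
    for city in sorted(merged):
        mn, mx, tot, cnt = merged[city]
        print(f"{city.decode()}={mn:.1f}/{tot/cnt:.1f}/{mx:.1f}")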
11:25 bookmark i am high
26:20 the compression 😂
I do really wanna see a comparison of the new RISC-V v1.0 vector instructions vs a traditional ARM implementation.
At least be honest and compare it to what Intel and PowerPC have had for decades
@@stefanalecu9532 What are you talking about? I want to see if the vector instructions have a big performance gain. ARM is the correct comparison for this class of operation; of course large x86 server or desktop form-factor chips will do better. But sure, I guess I can underclock a PC and use a Raspberry Pi or something.
33:50 CSV with semicolons is just the Microsoft variant of CSV. They use C as in semiColon.
What would be really nice, especially on Linux, is if any text-related file (not just .txt; .log and so on as well) were by default compressed at the RAM/CPU level, still readable in real time, with transparent compression and decompression for any supporting CLI command or application, and of course compressed when saved to the hard drive/SSD on shutdown or reboot. I'm also sick of these really large log files, especially when rotation isn't correctly deleting the older files.
"If there's one thing I know about rust, it's that making things parallel is really hard" the most wrong Theo has ever been
this is the most I've heard theo swear until now
WOOO LTT dropped a new video
thanks editor
Stochastic "algorithms" go brrr
I related more to Ragnar in this video
aw thanks :P
Now find the median ...
Thought that too, but then thought some more. Say you have a (generous) -1000.00..1000.00°C range. Pre-multiply by 100 so it becomes -100000..100000. Offset by 100000 so you have 0..200000. That's "only" a 200k-element array that stores a count of how many of each measurement you got. At the end, "iterate with counts" until you're at the middle point. Take the index, unshift by -100000, divide by 100, done. Median would be bad if you naively stored 1B values and then sorted, but not that bad if you don't. You can also make this "as many threads as you want" with little performance penalty by doing a memory/speed tradeoff: have each thread keep its own 200k array, then the main thread sums up the counts at the end before determining the midpoint.
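A minimal Python sketch of that counting trick, using the comment's range and scale, and returning the upper middle for even counts to keep it short:

counts = [0] * 200_001  # one bucket per hundredth of a degree across -1000.00..1000.00

def record(value):
    counts[round(value * 100) + 100_000] += 1

def median(total):
    # walk the buckets to the midpoint: no sort, no storing a billion values
    mid = total // 2
    seen = 0
    for idx, c in enumerate(counts):
        seen += c
        if seen > mid:
            return (idx - 100_000) / 100
    raise ValueError("total exceeds the recorded count")

Per-thread arrays then merge by element-wise summing the counts, as the comment says.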
0:56 Ahh Bulawayo my mother's home town
McLaughlin Ferry
It makes me cringe a lot that this guy gatekeeps SW engineering for people that don't know git, but blames JavaScript for `NaN === NaN` being `false`. Don't schools have Computer Org classes anymore or what? Also, for not using 'head' or 'wc' or 'vi' to explore the file.
Cry about it
@@ClowdyHowdy Ok, will do 😂
"master class on debugging" bro never used the debugger
26:17 rip bitrate
CSVs despite the name don't always come with commas, actually I've seen semicolons more often
why bigint when you expect it to go to a billion?
bro, sometimes you just need the custom number parser
Theo what typeface/font are you using here? It looks very familiar.
22:05 Huh? Parallelizing in Rust is much, much easier than in most other languages, if and only if you are doing everything with safe code and using a mutex of some kind.
Would it still be a CSV if it used colons(:) instead of semicolons(;)? (Colon Separated Values -> CSV)
Fun challenge!
My best JavaScript solution does it in 6s. Bun is bugged (it does not allow customising the read highWaterMark, making it very slow).
I want a text editor that lets me edit this type of file.
Not watching this video yet, but just gonna leave this popup for Windows Notepad here...
"The xxx.txt file is too large for Notepad. Use another editor to edit the file." (3 million rows)
Edit: Watched it now
Theo offhandedly questioning if he's a bottom is one of the most chaotic things I've seen all day.