15 years ago I made a "Windows XP Easter Eggs" video featuring this bug, and now I feel this strange sense of satisfaction finally knowing why it happens. Thanks FlyTech!
Note: What Windows Write calls "unicode" is UCS-2, a now-obsolete 16-bit encoding. What today we call unicode is usually UTF-8, a variable-width encoding that conveniently matches US-ASCII, though there's also UTF-16 and UTF-32.
There's not just those variants either. There's the obsolete UTF-7 too (for some reason), and the multi-byte encodings come in both Little Endian and Big Endian flavours.
@@weakspirit_ internally, yes. but when saving text it does default to utf8, since a text in 7-bit ascii and in utf8 are the same, and since in english you rarely need characters beyond that, using utf8 is more space efficient
Technically ANSI Windows-1252 (what is functionally always used when Windows refers to "ANSI") is incompatible with UTF-8, as there are numerous byte values which UTF-8 reserves (for multi-byte characters) but which Windows-1252 uses as printable single-byte characters. If any of those bytes exist in the string when it is read as UTF-8, the decoding will break, which in a well developed system will merely produce a few erroneous characters. Now since Windows-1252 extends standard ASCII, most of the bytes in Windows-1252 text will be read well as UTF-8, specifically all the characters commonly used in American English. The problem only occurs when you have non-ASCII characters, either in Windows-1252 or in UTF-8, in which case if you try to read UTF-8 as Windows-1252 or Windows-1252 as UTF-8, you will get problems. Experienced this exact issue when dealing with a game that reads and writes Windows-1252 but which has resources written in UTF-8, causing all sorts of weird problems.
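To make the mismatch concrete, here is a minimal Python sketch. The sample word "café" is just an illustrative pick (it is not from the game mentioned above); `cp1252` is Python's name for the Windows-1252 codec.

```python
# Illustrative only: push one non-ASCII character through the wrong codec.
text = "café"

cp1252_bytes = text.encode("cp1252")  # é is the single byte 0xE9
utf8_bytes = text.encode("utf-8")     # é is the two bytes 0xC3 0xA9

# UTF-8 bytes read as Windows-1252: each byte maps to some character,
# so nothing fails, but the result is mojibake.
utf8_read_as_cp1252 = utf8_bytes.decode("cp1252")

# Windows-1252 bytes read as UTF-8: 0xE9 announces a multi-byte sequence
# that never completes, so strict decoding raises an error.
try:
    cp1252_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# A lenient reader substitutes U+FFFD (the replacement character) instead.
cp1252_read_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# Pure ASCII text is byte-identical in both encodings.
ascii_is_safe = "cafe".encode("cp1252") == "cafe".encode("utf-8")

print(utf8_read_as_cp1252, cp1252_read_as_utf8, strict_decode_failed, ascii_is_safe)
```

The asymmetry is the interesting part: reading UTF-8 as Windows-1252 silently produces garbage, while reading Windows-1252 as UTF-8 can at least be detected as invalid.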
@@laurentverweijen9195 UCS-2 can only represent ~6% of the reserved Unicode codepoints and ~69% of the ones already assigned , whereas UTF-16 can represent them all through surrogates. Don't get me wrong, UTF-16 is the worst Unicode encoding, but at least it *is* a Unicode encoding, unlike UCS-2.
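For anyone who wants to check the ~6% figure and see what a surrogate pair looks like, a small Python sketch follows. The helper name `to_surrogates` is made up for illustration, and U+1F600 is just an arbitrary code point above U+FFFF.

```python
# Unicode code points run from U+0000 to U+10FFFF; a single 16-bit unit
# (all that UCS-2 has) can only address U+0000..U+FFFF.
total_code_points = 0x110000
single_unit_range = 0x10000
share = single_unit_range / total_code_points  # a bit under 0.06


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates(0x1F600)
encoded = "\U0001F600".encode("utf-16-be")  # Python's own UTF-16 encoder

print(f"{share:.1%}", hex(high), hex(low), encoded.hex())
```

The manual computation and Python's encoder agree on the same four bytes, which is the part UCS-2 has no way to express.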
@@npoaccount9154 as soon as i read that, i tried breaking it. i ended up creating a document filled with a creepy smiley face image with the size of 2,306,727,936 bytes (that's 2.3 gb). i cant open it because it's too big, and i think i corrupted the file by closing wordpad while it was saving the file. i still dont think this counts as breaking it, i think its just the file thats broken.
@@chri-k To keep It simple, C64 BASIC has a random number generator that, every time you turn on the machine, it always produces the same "random" sequence of floating-point numbers, all between 0 and 1. Using this "feature" someone wrote a small program (4 lines of code) that print the sentence "Bill Gates Sucks" on screen
Aside from the statistical check, another heuristic uses the newline as a way to rule out Unicode since a word-aligned newline has the bytes 0D 0A, but U+0A0D is not assigned in Unicode. Also it apparently only detects based on the first 256 bytes or so, which might make the longest string challenge futile beyond that point.
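The newline observation is easy to check in Python. This sketch only verifies the byte arithmetic and the Unicode assignment status; it does not claim to reproduce what Notepad actually does with that information.

```python
import unicodedata

crlf = b"\r\n"  # the bytes 0D 0A of an ANSI Windows line ending

# Read as one little-endian 16-bit unit, the pair becomes U+0A0D.
as_utf16le = int.from_bytes(crlf, "little")

# General category "Cn" means the code point is unassigned.
category = unicodedata.category(chr(as_utf16le))

# A genuine UTF-16LE line ending has a zero high byte after each character.
real_newline = "\r\n".encode("utf-16-le")

print(hex(as_utf16le), category, real_newline.hex(" "))
```

So a 0D 0A pair sitting on a 16-bit boundary cannot be a meaningful UTF-16LE character, while real UTF-16LE text would show 0D 00 0A 00 instead.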
4:19 for anyone wondering why fly wrote it as 0x75 0x42 when the hex editor shows 0x42 0x75: it's because the file is read as "little endian", which means that within each 2-byte unit, the byte shown last in the hex editor is the most significant one when the computer reads it as unicode. i don't know why the computer does this, nor am i an expert in these kinds of things, but i just wanted to share in case someone wants to know
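A quick way to see the byte swap for yourself, sketched in Python with the two bytes from the timestamp above:

```python
data = b"Bu"  # bytes 0x42 0x75, in the order a hex editor shows them

# Little endian: the first byte is the least significant one.
little = int.from_bytes(data, "little")
# Big endian: the first byte is the most significant one.
big = int.from_bytes(data, "big")

# Decoding the same two bytes as UTF-16LE yields the single character U+7542.
char_le = data.decode("utf-16-le")

print(hex(little), hex(big), char_le)
```

Read as little endian the pair is 0x7542, which is why it gets written as 0x75 0x42 when talking about the resulting 16-bit value.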
Write a simple program converting from integer to string and you will find out why little endian is a thing. Btw, the numbers we write every day are big endian.
@@RedstoneNguyen Both storage directions make perfect sense. Little endian for decimal digits is how Arabs write Arab numerals, big endian is how westerners write the same numbers with the same digits. Because computers gulp up entire binary numbers in one memory clock cycle it's entirely cultural for them too. The x86 and x80 CPU families belong to little endian design cultures, the 68000 and SPARC families belong to big endian design cultures. ARM and PowerPC hardware is bilingual in this matter.
The "oracle" not guessing correctly on newlines might be due to differences in newline coding; given it's a Python script it may be using only LF line endings, but famously Windows always uses CRLF as its line endings.
"This comment has the challenge shown for the longest strings that triggers the Windows glitch from the video you recorded. The video's bug shows that it's difficult for doing accidentally. Specially for the challenge proposed. I try using workarounds and the hardest one probably is the odd character words I have to usually put for those requirements. But sometimes I don't use odd length, which usually happens because the symbols are joined with the word. I created a small story for the text: The bus was not there for a bad reason, probably. Maybe someone found the reasons but I don't know? Which increases the words variety I can use for this! The current character counter is at 690 and I think I could add non-sense but that wouldn't get interesting enough. I'm a bit close here, 200 character distance. So, I could get another story using odd character count words only. There was a man named "e_g.". The challenge was waiting for him!! And so he broke the world record! FlyTech himself saw the text, and got amazed! The comment had too many characters and broke the record! I could not believe it! The story ended and thank you for reading this." this whole text has 1156 characters, which beats the previous record of 1016. the remaining balance was -399, which means you could add 2 more characters (following the rules from the challenge) without failing. a proof was made in a windows xp vm, and you can see the video proof in my channel or test it by yourself.
Have you heard about the Russian city: Seversk? It has a humid continental climate maintaining a low temperature and receiving 530mm precipitation every year. Through its presence, nuclear weapons have been assembled there and stored. One serious nuclear catastrophe would occur in 1993 because a container holding a dangerous and radioactive substance exploded. Character count: 362 Edit: I didn't actually test this on Windows XP notepad, but I used a script, and the script gave 7640 and 2542, and 7640 is just barely greater than 3*2542
It uses the 2 bytes FF and FE shown at 5:24. To make the glitch happen on modern Windows you put ÿþ into the beginning of a text file, then save it, and voila! You don't even need to input any special text after it.
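Here is a small Python sketch of why typing ÿþ works. The trailing "hi" is an arbitrary payload chosen for illustration, and `cp1252` stands in for "ANSI" on a Western Windows install.

```python
import codecs

# The UTF-16 byte order marks, shown the way an ANSI (Windows-1252) viewer renders them.
le_bom_as_ansi = codecs.BOM_UTF16_LE.decode("cp1252")  # FF FE
be_bom_as_ansi = codecs.BOM_UTF16_BE.decode("cp1252")  # FE FF

# Saving "ÿþhi" as ANSI therefore writes FF FE followed by 68 69.
saved = "ÿþhi".encode("cp1252")

# A reader that honours the BOM drops FF FE and treats 68 69 as one 16-bit unit.
decoded = saved.decode("utf-16")

print(le_bom_as_ansi, be_bom_as_ansi, saved.hex(" "), decoded)
```

The two ASCII letters collapse into a single CJK character, the same kind of result as the original bug, only this time the BOM forces it.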
@@lunafoxfire They did that back then, too, but that's not nearly enough. The _presence_ of the BOM _confirms_ a file is unicode. The _absence_ of the BOM _does not_ mean a file is _not_ Unicode. That is to say, if there is no BOM, you still have to check if it's Unicode.
The 00 padding on non-unicode characters explains the fact that in a lot of files that are not text, some strings have spaces between each character... I've had this question since 2017 XD
but most modern applications ( one exception being Windows itself ) use UTF-8 and not UTF-16. UTF-8 is fully backwards-compatible with ASCII, so this may not be the only reason.
It's not padding, it's the page number. Page 00 is mostly the same as Western ANSI code, there are about 7000 other pages to keep track of, including the ones with smiley faces.
I was never able to reproduce this bug when I was a little kid.. But well, my boxed copy of XP was Service Pack 2 so I guess that's a given :P But I also don't remember what OS I was using at the time either.
for my whole life i thought bush hid the facts was an intentional easter egg so this was a very interesting video to me personally just because of that
2:05 you don't have to disappoint them. there also was another Bush who was president between '89 and '93. In fact, he's the father of the Bush we all know and hate. I have no explanations for "~Flytech" though. 😅
When 7/11 (I say 7/11 because if I wrote the right date my comment won't post) happened, the very next day in school someone came in to school and showed a group of us in the computer lab if you typed 9 and then 11 into Word using wingdings, it was a plane flying into two buildings. Wingdings has been changed since then.
@@FlyTechVideos a different commenter said that "11:44 I can't do that as any link (sometimes even links to RUclips) cause the immediate deletion of the comment, meaning it's visible to its author until page reload, edits fail ("Unknown error") and reloading the page makes it disappear. Only the creator of a video can post links in their own comments section without needing to be worried" youtube likes censoring the comments
the correct way to do this is, of course:
1. is there a BOM? If so, respect it.
2. try to decode it as UTF-8. If it worked, you're done, it's UTF-8 (or US-ASCII, but that's a proper subset so whatever)
3. If you get here, complain to the user and make them figure it out.
(if you're loading a document with more metadata you may of course use that too, I'm assuming plain text)
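Those three steps can be sketched in a few lines of Python. This is one possible reading of the procedure, not anyone's actual implementation; `decode_text` and `_BOMS` are hypothetical names, and the codec names are Python's.

```python
import codecs

# Known BOMs, longest first, because the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(data: bytes) -> str:
    # Step 1: a BOM is present, so respect it (these codecs strip the BOM themselves).
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return data.decode(codec)
    # Step 2: no BOM, so try strict UTF-8 (which also accepts plain ASCII).
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Step 3: give up and make the caller choose an encoding.
        raise ValueError("no BOM and not valid UTF-8; please specify the encoding") from None


print(decode_text(b"Bush hid the facts"))
```

Note that this never guesses: BOM-less UTF-16 and Windows-1252 text with non-ASCII bytes both end up in step 3, which is the whole point of the proposal.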
UTF-8 didn’t exist when IsTextUnicode() was written. The Unicode encoding in use was UTF-16 (based on the earlier UCS-2). So this is legacy code from the early days of Unicode that was never updated.
This also explains what happens when you load Unicode-encoded files such as .lnk and .url in Notepad or Vim etc, and it displays the weird spaced out lettering as the program assumes ANSI plaintext. Cool!
NT4 Notepad runs on NT 3.51, though it will abruptly close at times. On these older versions, you can change Notepad's global font, and in some cases you may even be able to read the erroneous Unicode characters!
I have moved to Linux completely a long time ago, and every time I stumble upon Windows, I seriously don't understand how the most popular desktop operating system can STILL have significant issues with encoding. Today I was on a video call with my coworker and had the pleasure of witnessing a modern (2021 version) app with cyrillic text display umlauts and diacritics instead of actual text on English-configured Windows. Look, mum, we've had Unicode for 32 years now! P.S. yes, I know Windows uses UTF-16, I refer to UTF-8, which is used practically everywhere on the web.
Kinda expected that it's just another encoding issue. Even nowadays Notepad still has similar encoding issues, like when you write a script and it contains some special characters in it... and then you realize your script is broken. Aside from that, you should never use Notepad for scripting anyway...
UTF-16. Then they wanted some nice 🥵emojis and even ♔chess. Then UTF-32 appeared. yet UTF-8 is variable length which saves the space but not the nervecells of c++ devs
In the old Windows World's favorite unicode encoding. Which they got stuck with, even though it was a bad idea, because they were too eager to use unicode and more sensible unicode encodings hadn't caught on yet.
It looks like they encoded the code points directly. They did not use UTF-8 encoding or else ASCII characters would be only one byte, and they did not use UTF-16 encoding either because UTF-16 is not padded with NULL bytes. In other words, they managed to mess up their Unicode implementation and invent a new encoding while Unicode was supposed to unify everything. Oh the irony...
@@cl00e9ment All ASCII characters do have null byte in high byte when represented in 16-bit integer though. 0x20 in 8-bit becomes 0x0020 in 16-bit, which becomes 0x20 0x00 in little endian, which is the correct little endian representation of space in UTF-16.
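This is easy to confirm with Python's standard UTF-16 codecs (a minimal check, nothing Windows-specific; the sample string "BCD" is arbitrary):

```python
# A space is 0x20 in ASCII and U+0020 in Unicode.
space_le = " ".encode("utf-16-le")  # low byte first, then the zero high byte
space_be = " ".encode("utf-16-be")  # zero high byte first

# Any run of ASCII letters gets a zero byte after every character in UTF-16LE.
word = "BCD".encode("utf-16-le")
high_bytes = word[1::2]  # every second byte, i.e. the high byte of each unit

print(space_le.hex(" "), space_be.hex(" "), word.hex(" "))
```

So the "padding" is simply the high byte of each 16-bit code unit, which happens to be zero for everything in the ASCII range.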
4:16 Saying that everything in "Unicode encoding" is 2 bytes is a bit misleading. This applies only to the implementation of Unicode used in very old versions of Windows (UCS-2), and does not apply to any modern, variable-width Unicode encoding. Notepad received support for UTF-8 and UTF-16 with Windows 7.
I can see where this comes from: if someone types in Chinese (or other Unicode-only) language, the characters will be very close to each other in the encoding space. Therefore, the most significant byte would be very close to each other, while the least significant byte would essentially be random.
While if someone types with just a few Unicode characters thrown in, the ASCII values will produce the null bytes seen at 5:11, which also won't add to the most-significant-byte-difference counter.
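Here is a rough Python sketch of the statistical check as it is described in these comments: compare how much the low bytes jump around with how much the high bytes do, using the factor of 3 that other commenters quote. The function names are made up, and the real IsTextUnicode() does more than this, so treat it as an approximation rather than the actual Windows code.

```python
def byte_variation(data: bytes, start: int) -> int:
    """Sum of absolute differences between consecutive bytes at start, start+2, ..."""
    column = data[start::2]
    return sum(abs(b - a) for a, b in zip(column, column[1:]))


def guesses_utf16le(data: bytes) -> bool:
    """Assumed rule: low-byte variation must exceed 3x the high-byte variation."""
    low = byte_variation(data, 0)   # first byte of each 16-bit unit (least significant)
    high = byte_variation(data, 1)  # second byte of each unit (most significant)
    return low > 3 * high


sample = b"Bush hid the facts"
print(byte_variation(sample, 0), byte_variation(sample, 1), guesses_utf16le(sample))
```

In "Bush hid the facts" every space lands on a low-byte position, so the low column keeps jumping between 32 and a lowercase letter while the high column stays among lowercase letters. Shift the spaces onto the other column (for example b"Hello world") and the rule no longer fires.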
How about this one? He erected later a great monastery in which he lived forty years and had eight hundred and eight followers--they bound him tightly and carried him between them on their shoulders (-9)
Learnt something new in your video: Mojibake! I never knew this phenomenon had a specific name. I recall seeing it a lot in 1998 to 2001 on WebTV here in the US when viewing Japanese, Chinese, & Korean web pages. (Really, any pages that didn't use the Latin alphabet.)
They did - 0xFF 0xFE was consistently recognized as Unicode. The problem is that they assumed that text without this prefix can be Unicode as well, and they used the presented heuristic to guess
(4:15) Incorrect, Unicode itself doesn't require 2 bytes per character. Unicode is just a list of characters. It depends on what encoding you use. UTF-16 is what Windows uses, which requires 2 or 4 bytes per character, and there's also UTF-32 which requires 4 bytes per character, as well as UTF-8 which is a variable number of bytes per character. (5:25) FF FE doesn't mark it as Unicode exactly, it does, but it also marks it as UTF-16, which is why it's 2 bytes per character. B is 42 00 as you show, but FE FF is also UTF-16 but reversed, where B is 00 42 instead. EF BB BF is UTF-8, for example. (6:40) Since UTF-8 (Unicode) is variable length, why can't this be an odd length? Characters can be variable length from 1 to 4 bytes.
So I ran a few tests and found the exact letters and words don't influence notepad's unicode detection algorithm too significantly. By far the biggest factor is the location of the space and punctuation symbols. If the space and punctuation symbols occur primarily at an odd index, then the unicode detection algorithm can get a large bonus towards the odd/lower bytes since the ascii distances between the punctuation symbols and lowercase letters greatly exceeds the ascii distances between any two lowercase letters. This means that if you use words with an odd number of letters, most space symbols end up on the odd indices. However, writing sentences using exclusively odd number of letters for all the words isn't easy. You can therefore sometimes use pairs of even words for a nicer structure. By using these tips, you can write quite lengthy sentences which sound almost completely natural without having to recalculate a new score every other character. PS. This whole comment would get censored by notepad
Character count: 1016 3 * Higher Diff - Lower: -85 Tested successfully on Windows 2000 pro With the heuristic outlined, it's fairly easy to make arbitrarily long strings, especially if you aren't overly concerned about clarity, word flow, or long term sentence structures. I wrote the comment without significant scripting/code assistance (only checking the isUnicode value every sentence or so).
Did you know conspiracy and conspiracy theory were words in 16th and 17th century, but conspiracist and conspiracy theorist weren't words until the 60s and 70s.
The fact I just remembered that Microsoft dropped support for WordPad, this video gives me yet another reason for me to be infuriated for that. WordPad has been faithful to me for other reasons, but this is a reminder of something I could have still benefitted when migrating to different OSes. Plus, seeing that even W11 crashed while using Notepad just tells me that it was a bad decision in the first place as it furthers the incompetence. Plus, I'm certain I remember this old bug from the days of the drink holder prank (which they've since patched over). It's nice this still can be done. Nostalgic, at least.
generally self-published content is not supposed to be used as a reliable source for a citation, but in this case, i guess it could be used as a showcase of a behavior mentioned in the article
@@FlyTechVideos I've been working on that exact same question for a different video lately (a Karl Jobst documentary). The short version is that the video can not be used unless it is published by a reliable source. Since RUclips videos are self-published, they don't count. An exception can be made if the person who posts the video is considered a subject matter expert. We've discussed that for Karl Jobst, but determined he doesn't qualify. For it to work, Jobst would have to have published articles about his work in trusted sources, outside of RUclips, and he hasn't done so. What that means for your video: Have you published journal articles about PC bugs, under your name? (Just being cited by them is not enough.) If the answer is "yes" then that's great! Please give us a link to that. With some luck, that will make you pass as a subject matter expert, and THEN we can start thinking about citing your RUclips video. P.S. For Jobst's video, we could avoid citing the video in the end. Jobst presented all his sources in the video. References to reliable sources that demonstrate that his conclusions were correct. Can you share such a source for your topic? In that case, we can start working on the Wikipedia article as well.
FlyTech, after he held a Microsoft employee in his basement for the past year: _"I am legally not allowed to tell you how I figured out. Let's say, I consulted some trustworthy sources for this."_
I live in ukraine and i've seen many university pages full of them (idiot devs, no ) After making essays in vscode i can see some mysterious п»ї too. its shit from BOM as i know
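That п»ї can be reproduced in Python: it is what the three UTF-8 BOM bytes look like when a page is read as Windows-1251 (Cyrillic). The word "essay" below is just a stand-in for the file contents.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom_as_cp1251 = codecs.BOM_UTF8.decode("cp1251")  # how a Cyrillic ANSI reader shows it
bom_as_cp1252 = codecs.BOM_UTF8.decode("cp1252")  # how a Western ANSI reader shows it

# Decoding with "utf-8-sig" strips the BOM instead of displaying it.
clean = (codecs.BOM_UTF8 + "essay".encode("utf-8")).decode("utf-8-sig")

print(bom_as_cp1251, bom_as_cp1252, clean)
```

So the stray characters mean the file was saved as UTF-8 with a BOM and then read with a legacy single-byte code page.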
In unicode, a character isn't 2 bytes. Unicode itself is not an encoding, just a standard. UTF-8 is unicode where each character uses 1 byte or more, UTF-16 uses 2 bytes or more and UTF-32 always uses 4 bytes for each character.
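A short Python check of those sizes, using four sample characters picked for illustration (one from each UTF-8 length class):

```python
# One character from each UTF-8 length class: ASCII, Latin-1, BMP, beyond the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀

# Bytes needed per character in (UTF-8, UTF-16, UTF-32); the -le codecs add no BOM.
sizes = {
    ch: tuple(len(ch.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le"))
    for ch in samples
}

for ch, (u8, u16, u32) in sizes.items():
    print(f"U+{ord(ch):04X}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

Only UTF-32 is fixed width; UTF-16 needs four bytes for anything above U+FFFF, and UTF-8 ranges from one to four.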
Try "联通", iirc this single word also caused issue therefore similar "rumors", or rather, memes, arose amongst Chinese communities about how China Unicom had vendetta with Microsoft whatsoever.
So, fun fact: As the video says, UTF-16 ("Unicode" encoding according to Notepad) text files always start with either 0xFFFE or 0xFEFF (to indicate endianness). 0xFFFE and 0xFEFF don't make *any* sense as ANSI-encoded text (they display as ÿþ and þÿ respectively), meaning it's far safer to just look for those patterns to detect whether or not a file is encoded in UTF-16.
What you are referring to is called "UTF-16 with BOM". The BOM (Byte Order Mark), however, is not mandatory (in fact it is discouraged). Read more here en.m.wikipedia.org/wiki/Byte_order_mark
A unicode string with "simple" characters usually has a lot of null bytes, e.g. 42 00 43 00 44 00 ... and the heuristic is engineered to detect exactly this. As we can see, this leads to false positives
The windows world started using unicode before utf8 was invented. The Java world too. Sometimes it pays to be slow (although I remember switching my old redhat/mandrake systems over to default to utf8 was not fun either).
Unicode was created in the late 1980s. Microsoft and Java chose the early 16 bit Unicode and then had to use UTF-16 to encode the next few thousand pages. Then someone decided that any characters not handled by UTF-16 should be banned from all other encodings.
I posted an analysis of the algorithm as an answer to the post "how did Windows developers come up with...". TLDR; it can distinguish the patterns found in multi-byte ANSI vs 16-bit Unicode for Asian encoded text, and other languages fall into subsets of those distributions.
@@maximiliano_sv Basically Microsoft does not allow everybody to see the Windows source code or disassemble Windows to get the closest thing to the original source code, ChatGPT explains those things better than me: The source code of Windows is a set of instructions and programs written by Microsoft developers, forming the foundation of the operating system. While there are several reasons why an ordinary person cannot legally read the source code of Windows, here are some of the main ones: - Intellectual property rights: The source code of Windows is owned by Microsoft. The company invests significant resources, time, and effort in developing and improving their operating system. As a result, they have intellectual property rights over the code and can control who has access to it. This means they cannot allow just anyone to access and read their code without authorization. - Protection of trade secrets: The source code of Windows contains valuable and strategic information for Microsoft. Disclosing that code to unauthorized individuals could compromise the security, stability, and competitiveness of their operating system. Protecting their trade secrets is crucial for maintaining their market position and competitive advantage. - Risk of vulnerabilities and hacking: If the source code of Windows were available to the general public, it could be scrutinized by malicious individuals, including hackers and malware developers. This would increase the risk of discovering vulnerabilities and creating targeted attacks on the operating system, jeopardizing the security of millions of users. - License and end-user agreements: By using Windows, users agree to the terms and conditions set by Microsoft in their license and end-user agreement. These documents clearly state that users do not have the right to access, modify, or distribute the source code of the operating system. 
- In summary, the source code of Windows is owned by Microsoft and is protected by intellectual property rights and trade secrets. Its widespread access is not permitted to safeguard the security, competitiveness, and legal rights of Microsoft, as well as to prevent potential risks to users of the operating system. Disassembling parts of Windows to figure out the source code presents legal and practical challenges. Here are the reasons why an individual cannot easily disassemble Windows to obtain the source code: - Legal restrictions: Reverse engineering, which involves disassembling software to understand its underlying code, is subject to legal restrictions in many jurisdictions. Companies like Microsoft have legal protections in place to prevent unauthorized access, modification, and distribution of their software. Engaging in reverse engineering without proper authorization or a specific legal exception can infringe upon intellectual property laws. - Technical complexity: Disassembling software like Windows is a complex process that requires expertise in low-level programming languages, assembly code, and system architecture. It involves converting machine code (binary instructions) back into human-readable form. Understanding the disassembled code and piecing it together to reveal the original source code is a challenging and time-consuming task that requires significant skill and knowledge. - Incomplete information: Disassembling Windows and examining the disassembled code may provide insights into specific functions or algorithms, but it does not yield the complete source code. The source code of a complex software system like Windows consists of millions of lines of code, libraries, and dependencies, which cannot be fully reconstructed through disassembly alone. - Trade secrets and obfuscation: Software developers often use techniques such as code obfuscation to make disassembly and reverse engineering more difficult. 
Obfuscation intentionally complicates the disassembled code by adding irrelevant instructions, removing meaningful variable names, or applying other transformations. These measures aim to protect trade secrets and intellectual property by making it harder for unauthorized individuals to understand and reproduce the original source code. - In summary, legal restrictions, technical complexity, incomplete information, and code obfuscation make it challenging and legally risky for an individual to disassemble parts of Windows and obtain the complete source code. Reverse engineering is a highly specialized field that requires expertise, and engaging in it without proper authorization can have legal consequences.
I’m guessing as for why it detected Unicode like this, the idea is that legitimate Unicode text will mostly contain symbols from a single alphabet, which are close together and thus will have high bytes that are close, so the sum of differences in high bytes will be low. If it isn’t, the high bytes (made of every second letter) will be all over the place, making the sum of differences higher. Would be good to have an explanation like that in the video.
I grew up thinking this was an intentional easter egg! I guess people just thought it was intentional because Bush has had conspiracy theories about him and XP came out around back then. Thanks for showing us how this really works, was very cool
I think i know why line termination stops the bug from occurring. In Windows, lines are terminated using both the carriage return '\r' character and the newline '\n' character. If the characters ever appear together without a separating byte, then notepad can instantly deduce that the file is ansi. If they were separated by a byte, then it was unicode. Of course, this is an unconfirmed theory of mine. However, it makes a lot of sense.
@@Alexander-oh8ry Petition to you to 1. do not send me messages if you're a alexander (yes, it's on purpose) and 2. not to make me do things that I shouldn't do today or after 2 weeks
@@CBFNetworksArchive dam somebody sounds insulted because they cant force a youtuber to do more videos and is called out for it (and because they cant do math)
you can do this with almost every CLSID, as long it has a "ShellFolder" tag on it. Try "{ED834ED6-4B5A-4bfe-8F11-A626DCB6A921}" and you want to punch MS.
11:44 I can't do that as any link (sometimes even links to RUclips) cause the immediate deletion of the comment, meaning it's visible to its author until page reload, edits fail ("Unknown error") and reloading the page makes it disappear. Only the creator of a video can post links in their own comments section without needing to be worried.
I can only point to this 2 decades old article: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)": *The Single Most Important Fact About Encodings* If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII. *_There Ain’t No Such Thing As Plain Text._* If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
It would be interesting to find out how they fixed it in newer versions of Windows (if that encoding is still relevant). What is a quick fix or did they choose a totally different method ?
I am pretty sure Windows XP notepad messed up this half the time if you are using CJK languages. There's a reason we wait a minute for word to load since it is slightly less prone to this but still not immune. The widespread adoption of emoji really made western developers aware of unicode and text encoding problem is rarely a problem anymore
I love computers.
Same
same
Same
Same
same
The question we should be asking is "how did Windows developers come up with the worst way to detect Unicode?"
- So... how we are going to detect unicode?
- -Meth- Math
@@ОннокорОктябрь Meth?
@@RDCSTi was gonna _crack_ a joke, but that wouldn't be funny
I mean it was good enough because it "rarely" occurred, still the amount of false positives is indeed too much and yeah it was pretty bad
It acts like a detector for windows's "Unicode" files that only have English characters, but with a lot of false positives.
I can't believe it took me this long to realise: this channel is a fly explaining bugs.
wait what
@@Axcyantol A fly (the type of insect) explaining bugs (computer errors, but the word also means insect)
He also... Well creates bugs too
(By that I mean destroying windows)
@@amberwingthefairycat no i understood it, i was just surprised about it being a fly explaining bugs
@@Axcyantol Oops, sorry, haha.
15 years ago I made a "Windows XP Easter Eggs" video featuring this bug, and now I feel this strange sense of satisfaction finally knowing why it happens. Thanks FlyTech!
That video is best video.
i think you made the video a little too popular, still, good one 👍
Note: What Windows Write calls "unicode" is UCS-2, a now-obsolete 16-bit encoding.
What today we call unicode is usually UTF-8, a variable-width encoding that conveniently matches US-ASCII, though there's also UTF-16 and UTF-32.
There's not just those variants either.
There's the obsolete UTF-7 too (for some reason), and the multi-byte encodings come in both Little Endian and Big Endian flavours.
UCS-2 and UTF-16 is more or less the same (and what windows / C-sharp incorrectly call "unicode encoding".)
@@laurentverweijen9195 UCS-2 can only represent ~6% of the reserved Unicode codepoints and ~69% of the ones already assigned , whereas UTF-16 can represent them all through surrogates. Don't get me wrong, UTF-16 is the worst Unicode encoding, but at least it *is* a Unicode encoding, unlike UCS-2.
I can't believe George W. Bush would do this.
"W." stands for WordPad
Amazing how he started hiding the facts at least 7 years before it happened
Now its George L. Bush, 'cause he hid the facts
George Walker Bush Sr. ruled 1989 to 1993, and did secret government stuff with Nixon and Reagan.
do you know what i can believe bush did?
I almost completely forgot WordPad exists, despite it being more powerful than Notepad.
WordPad is too powerful for us mortals
@@FlyTechVideosyea, nothing can break it
Notepad is a descendant of wordpad
@@npoaccount9154 wasn't there a video about corrupting the heck out of windows 10 and everything was broken except for wordpad
Dude really dug into the Windows API's assembly to uncover a strange bug from 1994
Good job!
I think he didn't dive in, he just wrote his own that works similarly...
Maybe, idk
@@thevorkman_6551 He went into the code to figure out how it worked, and then wrote his own that worked the same
Don't forget the special... _sources_ ;)
@@mstech-gamingandmore1827 Yes)))
XP source leak probably
The amount of times i've seen this spewed about as an "easter egg" is nuts.
Just like the "Bill Gates Sucks" "easter egg" in C64 BASIC.
what’s that?
yeah what’s that?
@@chri-k To keep it simple: C64 BASIC has a random number generator that always produces the same "random" sequence of floating-point numbers, all between 0 and 1, every time you turn on the machine. Using this "feature", someone wrote a small program (4 lines of code) that prints the sentence "Bill Gates Sucks" on screen
Aside from the statistical check, another heuristic uses the newline as a way to rule out Unicode since a word-aligned newline has the bytes 0D 0A, but U+0A0D is not assigned in Unicode. Also it apparently only detects based on the first 256 bytes or so, which might make the longest string challenge futile beyond that point.
oh wait, but that does mean that a non-word aligned newline could technically trigger it, right? should have spent more time researching nooooo
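The newline argument can be checked with a short Python sketch. It only illustrates the byte arithmetic; whether Notepad's detection really applies this rule is the commenter's claim, not something the code verifies:

```python
import unicodedata

crlf = b"\r\n"                              # bytes 0D 0A, as in an ANSI file

# Read as one 16-bit little-endian word, the pair becomes U+0A0D.
as_utf16 = crlf.decode("utf-16-le")
code_point = ord(as_utf16)
category = unicodedata.category(as_utf16)   # "Cn" means "not assigned"

# A genuine UTF-16 LE newline has a zero high byte after each character.
real_utf16_newline = "\r\n".encode("utf-16-le")
print(hex(code_point), category, real_utf16_newline.hex(" "))
```

The pair 0D 0A only forms U+0A0D when the 0D sits at an even offset. At an odd offset the two bytes fall into different 16-bit words, which is the loophole raised in the follow-up.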
just make this comment your string so you can win the meta award
Ah yes. As a developer, I love to see people finding bugs in our lazy work
I'm a programmer in my free time, and this is actually facts. 😂
@@AlphaFruit-hx4cw which Bush hid
you're spitting FACTS
@@nt-authority-system666 bush wasn't
God, when I tell you I thought Bush was a literal bush from nature and I couldn't figure it out until this day...
Same
Sounds like it'd be related to homer simpson creeping out of that bush gif.
Same
My wife's bush hid some facts. I'm catholic, so I discovered the truth only after the marriage
Yeah, for people not from USA, or not familiar with USA's presidents, the name "Bush" is likely going to refer to an actual bush instead.
4:19 for anyone wondering why fly wrote it as 0x75 0x42 when the hex editor shows 0x42 0x75: it's because the file is read as "little endian", which means that in each pair of bytes the second one in the hex editor is the most significant, so it goes first when the value is written out as a number.
i dont know why the computer does this, nor am i an expert in these kinds of things but i just wanted to share in case someone wants to know
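The byte swap can be reproduced in a few lines of Python (a small sketch using only the two bytes mentioned above):

```python
ansi_bytes = "Bu".encode("ascii")          # 42 75, as shown in the hex editor

# Little endian: the first byte is the low half, the second byte the high half.
code_unit = int.from_bytes(ansi_bytes, byteorder="little")
print(hex(code_unit))                      # 0x7542

# Big endian would combine the same two bytes the other way round.
big = int.from_bytes(ansi_bytes, byteorder="big")
print(hex(big))                            # 0x4275

# Decoding the pair as UTF-16 little endian gives the single character U+7542.
char = ansi_bytes.decode("utf-16-le")
```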
Windows uses little endian because the x86 CPUs do so.
because x86 (which windows and every single os that supports x86 runs on) use little endian, so fuck intel i guess
Write a simple program converting from integer to string and you will find out why little endian is a thing. Btw, the numbers we write every day are big endian.
@@RedstoneNguyen Both storage directions make perfect sense. Little endian for decimal digits is how Arabs write Arab numerals, big endian is how westerners write the same numbers with the same digits. Because computers gulp up entire binary numbers in one memory clock cycle it's entirely cultural for them too. The x86 and x80 CPU families belong to little endian design cultures, the 68000 and SPARC families belong to big endian design cultures. ARM and PowerPC hardware is bilingual in this matter.
@@johndododoe1411 i didnt say anything about culture. My idea is, little endian is mathematically simpler to implement than big endian.
The "oracle" not guessing correctly on newlines might be due to differences in newline coding; given it's a Python script it may be using only LF line endings, but famously Windows always uses CRLF as its line endings.
"This comment has the challenge shown for the longest strings that triggers the Windows glitch from the video you recorded. The video's bug shows that it's difficult for doing accidentally. Specially for the challenge proposed. I try using workarounds and the hardest one probably is the odd character words I have to usually put for those requirements. But sometimes I don't use odd length, which usually happens because the symbols are joined with the word. I created a small story for the text: The bus was not there for a bad reason, probably. Maybe someone found the reasons but I don't know? Which increases the words variety I can use for this! The current character counter is at 690 and I think I could add non-sense but that wouldn't get interesting enough. I'm a bit close here, 200 character distance. So, I could get another story using odd character count words only. There was a man named "e_g.". The challenge was waiting for him!! And so he broke the world record! FlyTech himself saw the text, and got amazed! The comment had too many characters and broke the record! I could not believe it! The story ended and thank you for reading this."
this whole text has 1156 characters, which beats the previous record of 1016.
the remaining balance was -399, which means you could add 2 more characters (following the rules from the challenge) without failing.
a proof was made in a windows xp vm, and you can see the video proof in my channel or test it by yourself.
Have you heard about the Russian city: Seversk? It has a humid continental climate maintaining a low temperature and receiving 530mm precipitation every year. Through its presence, nuclear weapons have been assembled there and stored. One serious nuclear catastrophe would occur in 1993 because a container holding a dangerous and radioactive substance exploded.
Character count: 362
Edit: I didn't actually test this on Windows XP notepad, but I used a script, and the script gave 7640 and 2542, and 7640 is just barely greater than 3*2542
Saves And Does Not Corrupt On Windows XP Media Center Edition 2005.
@@AdachiVlogsFIN Confirmed on mine
whats M69 doing here
NO WAY LMAO.
tried it on my windows XP vm it worked
now i kind of want to know how the current unicode detection works
I tried disassembling it, but it seems to have been moved to the kernel (RtlIsTextUnicode)
It uses the 2 bytes FF and FE shown at 5:24. To make the glitch happen on modern Windows you put ÿþ into the beginning of a text file, then save it, and voila! You don't even need to input any special text after it.
they probably do the sane standard thing and look for 0xFFFE at the start
@@Milennium1902**Icelandic obtains ÿ**
@@lunafoxfire They did that back then, too, but that's not nearly enough.
The _presence_ of the BOM _confirms_ a file is unicode.
The _absence_ of the BOM _does not_ mean a file is _not_ Unicode.
That is to say, if there is no BOM, you still have to check if it's Unicode.
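A short Python sketch of that point. The helper name `has_utf16_bom` is made up for illustration; the BOM constants come from the standard `codecs` module:

```python
import codecs

def has_utf16_bom(data: bytes) -> bool:
    """True if the data starts with FF FE (little endian) or FE FF (big endian)."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

payload = "Hi".encode("utf-16-le")           # 48 00 69 00
with_bom = codecs.BOM_UTF16_LE + payload     # FF FE 48 00 69 00
without_bom = payload

print(has_utf16_bom(with_bom))     # True
print(has_utf16_bom(without_bom))  # False, yet the bytes are still valid UTF-16
```

Both byte strings carry the same text, so a detector that only looks for the BOM would misread the second one, which is why a fallback guess was used at all.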
The 00 padding on non-unicode characters explains the fact that in a lot of files that are not text, some strings have spaces between each character... I've had this question since 2017 XD
By not text I mean archives, executables, etc.
but most modern applications ( one exception being Windows itself ) use UTF-8 and not UTF-16. UTF-8 is fully backwards-compatible with ASCII, so this may not be the only reason.
It's not padding, it's the page number. Page 00 is mostly the same as Western ANSI code, there are about 7000 other pages to keep track of, including the ones with smiley faces.
All characters are Unicode characters. Some characters are also in ANSI, but they are also in Unicode.
I was never able to reproduce this bug when I was a little kid..
But well, my boxed copy of XP was Service Pack 2 so I guess that's a given :P
But I also don't remember what OS I was using at the time either.
does that string really work?
Your videos are always entertaining and informative. Keep it up!
for my whole life i thought bush hid the facts was an intentional easter egg so this was a very interesting video to me personally just because of that
When I was a kid I used to open up exe files in a text editor. I thought programmers had to remember a lot of characters and had special keyboards.
mans casually drops "my windows crashes when opening notepad" like HUH
ruclips.net/user/shortsAtu7atNw-kw
what more can you expect from windows 11
I mean, it's an insider build. You would expect bugs
"we are warned not to change it" "So let's change it" I love this channel.
2:05 you don't have to disappoint them. there also was another Bush who was president between '89 and '93. In fact, he's the father of the Bush we all know and hate. I have no explanations for "~Flytech" though. 😅
He hid the fact that he personally assassinated JFK
When 7/11 (I say 7/11 because if I wrote the right date my comment won't post) happened, the very next day in school someone came in to school and showed a group of us in the computer lab if you typed 9 and then 11 into Word using wingdings, it was a plane flying into two buildings. Wingdings has been changed since then.
i'm pretty sure nothing stops you from posting 9/11 in this comment section
@@FlyTechVideos a different commenter said that "11:44 I can't do that as any link (sometimes even links to RUclips) cause the immediate deletion of the comment, meaning it's visible to its author until page reload, edits fail ("Unknown error") and reloading the page makes it disappear. Only the creator of a video can post links in their own comments section without needing to be worried"
youtube likes censoring the comments
Congratulations on being an official Wikipedia source! :D
Very interesting bug, Fly! Thanks for the video
69th like
flies.sh/discord
No
No
Fly!
@@r0fael_programmer and the no chain is already broken
@@busymaan cause its kind of cringe
the correct way to do this is, of course:
1. is there a BOM? If so, respect it.
2. try to decode it as UTF-8. If it worked, you're done, it's UTF-8 (or us-ascii but that's a proper subset so whatever)
3. If you get here, complain to the user and make them figure it out.
(if you're loading a document with more metadata you may of course use that too, I'm assuming plain text)
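The three steps above could be sketched in Python roughly like this. The function name `decode_plain_text` and the error message are invented for illustration, and including the UTF-32 BOMs is an assumption that goes slightly beyond what the comment lists:

```python
import codecs

# BOMs are checked longest first, so UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_plain_text(data: bytes) -> str:
    # Step 1: if there is a BOM, respect it.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # Step 2: try strict UTF-8 (plain ASCII passes as a subset).
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Step 3: no guessing; hand the decision back to the user.
        raise ValueError("unknown encoding, please choose one explicitly") from None
```

With this approach the bytes of "Bush hid the facts" simply decode as UTF-8, because nothing ever tries to guess UTF-16 from statistics.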
UTF-8 had only just been invented and was not in common use when IsTextUnicode() was written. The Unicode encoding in use was UCS-2 (the fixed-width predecessor of UTF-16). So this is legacy code from the early days of Unicode that was never updated.
This also explains what happens when you load Unicode-encoded files such as .lnk and .url in Notepad or Vim etc, and it displays the weird spaced out lettering as the program assumes ANSI plaintext. Cool!
Great Video! Very detailed explanation of the bug! :D
Thanks! 😀
The first time I found out about the "Bush hid the facts" bug, I thought the text was referring to a literal bush.
You can hide facts in a bush ... ... ..
@@OrbitalCookie Ah yes, my favourite hobby.
NT4 Notepad runs on NT 3.51, though it will abruptly close at times. On these older versions, you can change Notepad's global font, and in some cases you may even be able to read the erroneous Unicode characters!
All these years later, this explains why XP notepad was such a pee-pee about opening random text docs.
Pairing a text-generating transformer with a minimizing function for the unicode check could be funny to see
Exactly what I thought!
Even though Windows called it "Unicode", the less confusable and more accurate name is UTF-16.
But what if...in Windows 3.5, Bush was referring to George *H.W.* Bush?
You didn't dismiss the conspiracy. Bush, the old one, was US president from 1989 to 1993
Nice video keep it up! Will you upload more creepypasta videos like you did before?
Thank you! I only ever uploaded 2 of them, and no, I am not planning to continue them as I came to dislike even the 2 videos that I already made.
I moved to Linux completely a long time ago, and every time I stumble upon Windows, I seriously don't understand how the most popular desktop operating system can STILL have significant issues with encoding. Today I was on a video call with my coworker and had the pleasure of witnessing a modern (2021 version) app with cyrillic text display umlauts and diacritics instead of actual text on English-configured Windows. Look, mum, we've had Unicode for 32 years now!
P.S. yes, I know Windows uses UTF-16, I refer to UTF-8, which is used practically everywhere on the web.
Microsoft is well known to preserve traditional bugs. Even the Win10 installer still could not select a partition to install.
please explain in detail.
Kinda expected that it's just another encoding issue. Even nowadays Notepad still has similar encoding issues like when you write a script and it contains some special characters in it... and then you realize your script is broken. Aside that you should never use Notepad for scripting anyway...
Could be wrong on this but I think Windows uses CRLF line endings, so you would need to put \r\n into the oracle to replicate the Notepad newline
I tried it with
after the video, and the oracle says that it's censored while Notepad still doesn't break
encoder? I hardly know 'er!
"in unicode encoding, each character is 2 bytes"
Not exactly,but close enough explanation...
UTF-16. Then they wanted some nice 🥵emojis and even ♔chess. Then UTF-32 appeared. yet UTF-8 is variable length which saves the space but not the nervecells of c++ devs
In the old Windows World's favorite unicode encoding. Which they got stuck with, even though it was a bad idea, because they were too eager to use unicode and more sensible unicode encodings hadn't caught on yet.
@@Gameplayer55055 ♔ fits in 16-bit character though
It looks like they encoded the code points directly. They did not use UTF-8 encoding or else ASCII characters would be only one byte, and they did not use UTF-16 encoding either because UTF-16 is not padded with NULL bytes. In other words, they succeed to mess up their Unicode implementation and invent a new encoding while Unicode was supposed to unify everything. Oh the irony...
@@cl00e9ment All ASCII characters do have null byte in high byte when represented in 16-bit integer though. 0x20 in 8-bit becomes 0x0020 in 16-bit, which becomes 0x20 0x00 in little endian, which is the correct little endian representation of space in UTF-16.
2:15 Bush Sr predates Clinton though as the president.
The Bush that "hid the facts" refers to the junior one, doesn't it? (Iraq war?)
@@FlyTechVideos Bush Sr. hid two facts though: That his son is a liar, and that he was a madman that almost started the WW3.
Plot twist: Turns out, the entire video was just a secret message from aliens, and they were trying to communicate in their own funky beatbox language
4:16 Saying that everything in "Unicode encoding" is 2 bytes is a bit misleading. This applies only to the implementation of Unicode used in very old versions of Windows (UCS-2), and does not apply to any modern, variable-width Unicode encoding. Notepad received support for UTF-8 and UTF-16 with Windows 7.
en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
en.wikipedia.org/wiki/Windows_Notepad
I can see where this comes from: if someone types in Chinese (or other Unicode-only) language, the characters will be very close to each other in the encoding space. Therefore, the most significant byte would be very close to each other, while the least significant byte would essentially be random.
While if someone types with just a few Unicode characters thrown in, the ASCII values will produce the null bytes seen at 5:11, which also won't add to the most-significant-byte-difference counter.
How about this one?
He erected later a great monastery in which he lived forty years and had eight hundred and eight followers--they bound him tightly and carried him between them on their shoulders
(-9)
2:58 This variant of BSOD is caused by using VMware SVGA 3D
I'd imagine Raymond Chen would write about this embarrassing algorithm implementation in his blog soon.
He already did: devblogs.microsoft.com/oldnewthing/20070417-00/?p=27223
Learnt something new in your video: Mojibake! I never knew this phenomenon had a specific name. I recall seeing it a lot in 1998 to 2001 on WebTV here in the US when viewing Japanese, Chinese, & Korean web pages. (Really, any pages that didn't use the Latin alphabet.)
If Windows is smart enough to remove the first two unicode bytes (0xFF, 0xFE), why the hell didn't they use it to detect unicode as well?
They did - 0xFF 0xFE was consistently recognized as Unicode. The problem is that they assumed that text without this prefix can be Unicode as well, and they used the presented heuristic to guess
@@FlyTechVideos sounds like somebody in Microsoft didn't read the documentation all the way 🤣
Microsoft detects Unicode even without the byte order mark by design since other programs and/or platforms may save text files like that.
(4:15) Incorrect, Unicode itself doesn't require 2 bytes per character. Unicode is just a list of characters. It depends on what encoding you use. UTF-16 is what Windows uses, which requires 2 or 4 bytes per character, and there's also UTF-32 which requires 4 bytes per character, as well as UTF-8 which is a variable number of bytes per character.
(5:25) FF FE doesn't mark it as Unicode exactly, it does, but it also marks it as UTF-16, which is why it's 2 bytes per character. B is 42 00 as you show, but FE FF is also UTF-16 but reversed, where B is 00 42 instead. EF BB BF is UTF-8, for example.
(6:40) Since UTF-8 (Unicode) is variable length, why can't this be an odd length? Characters can be variable length from 1 to 4 bytes.
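The byte counts per encoding are easy to check in Python. This is a small sketch; the four sample characters are arbitrary picks, and the helper name `encoded_sizes` is made up:

```python
def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32 (without BOM)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# B, e-acute, euro sign, and an emoji outside the 16-bit range.
for ch in ["B", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only UTF-32 is fixed width. UTF-16 needs a surrogate pair (4 bytes) for anything above U+FFFF, and a UTF-8 file can have any length, odd or even.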
I was hoping you'd then follow with how they fixed the bug. How does Notepad in Windows 7 detect Unicode?
by checking if the file starts with 0xFFFE
Dang, I thought the "Bush hid the facts" thing was an easter egg until I saw this
So I ran a few tests and found the exact letters and words don't influence notepad's unicode detection algorithm too significantly. By far the biggest factor is the location of the space and punctuation symbols. If the space and punctuation symbols occur primarily at an odd index, then the unicode detection algorithm can get a large bonus towards the odd/lower bytes since the ascii distances between the punctuation symbols and lowercase letters greatly exceeds the ascii distances between any two lowercase letters. This means that if you use words with an odd number of letters, most space symbols end up on the odd indices. However, writing sentences using exclusively odd number of letters for all the words isn't easy. You can therefore sometimes use pairs of even words for a nicer structure. By using these tips, you can write quite lengthy sentences which sound almost completely natural without having to recalculate a new score every other character. PS. This whole comment would get censored by notepad
Wow
Character count: 1016
3 * Higher Diff - Lower: -85
Tested successfully on Windows 2000 pro
With the heuristic outlined, it's fairly easy to make arbitrarily long strings, especially if you aren't overly concerned about clarity, word flow, or long term sentence structures. I wrote the comment without significant scripting/code assistance (only checking the isUnicode value every sentence or so).
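For anyone who wants to score their own strings, here is a rough Python sketch of the statistical check as it is described in these comments: sum the byte-to-byte differences at even offsets (low bytes) and at odd offsets (high bytes), then compare low against 3 times high. This is an assumption pieced together from the thread, not the actual Windows code, and the function names are invented. It also ignores the other checks mentioned elsewhere in the thread (newlines, the 256-byte limit):

```python
def byte_diff_sums(data: bytes) -> tuple[int, int]:
    """Return (low_sum, high_sum): summed absolute differences between
    neighbouring bytes at even offsets and at odd offsets."""
    low = data[0::2]    # would be the low bytes of UTF-16 LE words
    high = data[1::2]   # would be the high bytes
    low_sum = sum(abs(b - a) for a, b in zip(low, low[1:]))
    high_sum = sum(abs(b - a) for a, b in zip(high, high[1:]))
    return low_sum, high_sum

def looks_like_unicode(data: bytes) -> bool:
    # Assumption: an odd byte count cannot be whole 16-bit words.
    if len(data) % 2:
        return False
    low_sum, high_sum = byte_diff_sums(data)
    return low_sum > 3 * high_sum

print(byte_diff_sums(b"Bush hid the facts"))      # (506, 68)
print(looks_like_unicode(b"Bush hid the facts"))  # True
```

In the famous sentence every space lands on an even offset, so the low-byte sum is dominated by jumps between 0x20 and lowercase letters, which matches the tip about words with an odd number of letters.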
@@liniarc I’ll try making one I think
Did you know conspiracy and conspiracy theory were words in 16th and 17th century, but conspiracist and conspiracy theorist weren't words until the 60s and 70s.
Bush didn't hide the facts, but Windows certainly did.
(not part of the challenge, just a joke)
Visual C++ 2008 Express Edition? I see a man of culture!
Least broken microsoft product:
💀
The fact I just remembered that Microsoft dropped support for WordPad, this video gives me yet another reason for me to be infuriated for that. WordPad has been faithful to me for other reasons, but this is a reminder of something I could have still benefitted when migrating to different OSes. Plus, seeing that even W11 crashed while using Notepad just tells me that it was a bad decision in the first place as it furthers the incompetence.
Plus, I'm certain I remember this old bug from the days of the drink holder prank (which they've since patched over). It's nice this still can be done. Nostalgic, at least.
Can videos be used as citation? I thought it had to be academic articles or something like that
Not sure if they _can_ , but I think I've seen some. Don't take my word for it though
even blog posts can.
I think so
generally self-published content is not supposed to be used as a reliable source for a citation, but in this case, i guess it could be used as a showcase of a behavior mentioned in the article
@@FlyTechVideos I've been working on that exact same question for a different video lately (a Karl Jobst documentary).
The short version is that the video can not be used unless it is published by a reliable source. Since RUclips videos are self-published, they don't count. An exception can be made if the person who posts the video is considered a subject matter expert. We've discussed that for Karl Jobst, but determined he doesn't qualify. For it to work, Jobst would have to have published articles about his work in trusted sources, outside of RUclips, and he hasn't done so.
What that means for your video:
Have you published journal articles about PC bugs, under your name? (Just being cited by them is not enough.) If the answer is "yes" then that's great! Please give us a link to that. With some luck, that will make you pass as a subject matter expert, and THEN we can start thinking about citing your RUclips video.
P.S. For Jobst's video, we could avoid citing the video in the end. Jobst presented all his sources in the video. References to reliable sources that demonstrate that his conclusions were correct. Can you share such a source for your topic? In that case, we can start working on the Wikipedia article as well.
FlyTech, after he held a Microsoft employee in his basement for the past year:
_"I am legally not allowed to tell you how I figured out. Let's say, I consulted some trustworthy sources for this."_
Yes.
More likely he used the leaked XP kernel source
Wait, is that my uncle?
9:45 if you were to spam a lot of these blocks, could you write secret messages?
If you consider mojibake secret, then yes
@@FlyTechVideos Well no because it's fixed in newer versions of windows.
I live in ukraine and i've seen many university pages full of them (idiot devs, no )
After making essays in vscode i can see some mysterious п»ї too. its shit from BOM as i know
@@nothing-lo8lh You can cause it by adding ÿþ to the start of a .txt file
In unicode, a character isn't 2 bytes. Unicode itself is not an encoding, just a standard. UTF-8 is unicode where each character uses 1 byte or more, UTF-16 uses 2 bytes or more and UTF-32 always uses 4 bytes for each character.
Try "联通", iirc this single word also caused issue therefore similar "rumors", or rather, memes, arose amongst Chinese communities about how China Unicom had vendetta with Microsoft whatsoever.
"Feel free to use this video as a citation! :)" bruh i audibly CACKLED when i saw that 💀
Before watching , let me guess is it some kind of Unicode auto detect mode bug
correct
So, fun fact: As the video says, UTF-16 ("Unicode" encoding according to Notepad) text files always start with either 0xFFFE or 0xFEFF (to indicate endianness). 0xFFFE and 0xFEFF don't make *any* sense as ANSI-encoded text (they display as ÿþ and þÿ respectively), meaning it's far safer to just look for those patterns to detect whether or not a file is encoded in UTF-16.
What you are referring to is called "UTF-16 with BOM". The BOM (Byte Order Mark), however, is not mandatory (in fact it is discouraged). Read more here en.m.wikipedia.org/wiki/Byte_order_mark
But why was that “low > 3*high” heuristic chosen?
A unicode string with "simple" characters usually has a lot of null bytes, e.g. 42 00 43 00 44 00 ... and the heuristic is engineered to detect exactly this. As we can see, this leads to false positives
@@FlyTechVideos Is that logic true across most languages or is this an English bias? Are there any countries where it's not true?
Real unicode detection is pretty expensive, so this NT bug got to live for quite a long time...
i completely forgot some insane systems dont use utf8
The windows world started using unicode before utf8 was invented. The Java world too. Sometimes it pays to be slow (although I remember switching my old redhat/mandrake systems over to default to utf8 was not fun either).
Unicode was created in the late 1980s. Microsoft and Java chose the early 16 bit Unicode and then had to use UTF-16 to encode the next few thousand pages. Then someone decided that any characters not handled by UTF-16 should be banned from all other encodings.
I posted an analysis of the algorithm as an answer to the post "how did Windows developers come up with...".
TLDR; it can distinguish the patterns found in multi-byte ANSI vs 16-bit Unicode for Asian encoded text, and other languages fall into subsets of those distributions.
"Legally not allowed to tell"
Let me guess, reading leaked source code?
Idk. You can't probably reply to this anyway.
Or maybe it was a joke.
the word "sources" is italic
@@Milenakos Ahhhhhhh ok I'm dumb
@@the-Gammaron I still don't understand that about the sources, can someone explain it to me? is it some joke or something?
@@maximiliano_sv Basically Microsoft does not allow everybody to see the Windows source code or disassemble Windows to get the closest thing to the original source code, ChatGPT explains those things better than me:
The source code of Windows is a set of instructions and programs written by Microsoft developers, forming the foundation of the operating system. While there are several reasons why an ordinary person cannot legally read the source code of Windows, here are some of the main ones:
- Intellectual property rights: The source code of Windows is owned by Microsoft. The company invests significant resources, time, and effort in developing and improving their operating system. As a result, they have intellectual property rights over the code and can control who has access to it. This means they cannot allow just anyone to access and read their code without authorization.
- Protection of trade secrets: The source code of Windows contains valuable and strategic information for Microsoft. Disclosing that code to unauthorized individuals could compromise the security, stability, and competitiveness of their operating system. Protecting their trade secrets is crucial for maintaining their market position and competitive advantage.
- Risk of vulnerabilities and hacking: If the source code of Windows were available to the general public, it could be scrutinized by malicious individuals, including hackers and malware developers. This would increase the risk of discovering vulnerabilities and creating targeted attacks on the operating system, jeopardizing the security of millions of users.
- License and end-user agreements: By using Windows, users agree to the terms and conditions set by Microsoft in their license and end-user agreement. These documents clearly state that users do not have the right to access, modify, or distribute the source code of the operating system.
- In summary, the source code of Windows is owned by Microsoft and is protected by intellectual property rights and trade secrets. Its widespread access is not permitted to safeguard the security, competitiveness, and legal rights of Microsoft, as well as to prevent potential risks to users of the operating system.
Disassembling parts of Windows to figure out the source code presents legal and practical challenges. Here are the reasons why an individual cannot easily disassemble Windows to obtain the source code:
- Legal restrictions: Reverse engineering, which involves disassembling software to understand its underlying code, is subject to legal restrictions in many jurisdictions. Companies like Microsoft have legal protections in place to prevent unauthorized access, modification, and distribution of their software. Engaging in reverse engineering without proper authorization or a specific legal exception can infringe upon intellectual property laws.
- Technical complexity: Disassembling software like Windows is a complex process that requires expertise in low-level programming languages, assembly code, and system architecture. It involves converting machine code (binary instructions) back into human-readable form. Understanding the disassembled code and piecing it together to reveal the original source code is a challenging and time-consuming task that requires significant skill and knowledge.
- Incomplete information: Disassembling Windows and examining the disassembled code may provide insights into specific functions or algorithms, but it does not yield the complete source code. The source code of a complex software system like Windows consists of millions of lines of code, libraries, and dependencies, which cannot be fully reconstructed through disassembly alone.
- Trade secrets and obfuscation: Software developers often use techniques such as code obfuscation to make disassembly and reverse engineering more difficult. Obfuscation intentionally complicates the disassembled code by adding irrelevant instructions, removing meaningful variable names, or applying other transformations. These measures aim to protect trade secrets and intellectual property by making it harder for unauthorized individuals to understand and reproduce the original source code.
- In summary, legal restrictions, technical complexity, incomplete information, and code obfuscation make it challenging and legally risky for an individual to disassemble parts of Windows and obtain the complete source code. Reverse engineering is a highly specialized field that requires expertise, and engaging in it without proper authorization can have legal consequences.
Thank you lol
21x johnsparks still does the trick.
Source: "Windows Notepad - an Old Problem Surfaces Again"
great video. your explanation was perfect!
This would be an interesting game mechanic for secret codes
Bush hid the facts
畋样凭摩琠映捡獴
Picking up mongoose
I’m guessing as for why it detected Unicode like this, the idea is that legitimate Unicode text will mostly contain symbols from a single alphabet, which are close together and thus will have high bytes that are close, so the sum of differences in high bytes will be low. If it isn’t, the high bytes (made of every second letter) will be all over the place, making the sum of differences higher. Would be good to have an explanation like that in the video.
You're right... thanks for the suggestion
I'm just wondering, how can the windows notepad crash windows?
buggy gpu driver?
When I used RTM version of Win 11 on VMware after release, while using paint crashed into BSOD in few seconds so could be a VM problem
10gb+ files
Win11 notepad is very buggy, it can't even display actual text files.
@@johndododoe1411 yes, microsoft made notepad slower than notepad++
same with cmd.exe (old one still exists as conhost.exe)
I grew up thinking this was an intentional easter egg! I guess people just thought it was intentional because Bush has had conspiracy theories about him and XP came out around back then. Thanks for showing us how this really works, was very cool
you can't make a video showing something in windows xp without dreamscape and bandicam
dreamscape is content-id protected and because i am not desiring to donate all my income to 007 sound system i would rather not
but yes
lol
I think i know why line termination stops the bug from occurring. In Windows, lines are terminated using both the carriage return '\r' character and the newline '\n' character. If the characters ever appear together without a separating byte, then notepad can instantly deduce that the file is ansi. If they were separated by a byte, then it was unicode.
Of course, this is an unconfirmed theory of mine. However, it makes a lot of sense.
Petition to make his videos way frequent:
👇
That will overwork him
@@thepikachugamer Why? He is making videos every 4 months! Is it too rare?
@@CBFNetworksArchive Petition for you to 1. not decide that for another person and 2. to do your math
@@Alexander-oh8ry Petition to you to 1. do not send me messages if you're a alexander (yes, it's on purpose) and 2. not to make me do things that I shouldn't do today or after 2 weeks
@@CBFNetworksArchive dam somebody sounds insulted because they cant force a youtuber to do more videos and is called out for it
(and because they cant do math)
Fun fact: the "godmode" shortcut "easter egg" does work, but it doesn't require the words "godmode." It is just a control panel with more stuff
you can do this with almost every CLSID, as long as it has a "ShellFolder" tag on it.
Try "{ED834ED6-4B5A-4bfe-8F11-A626DCB6A921}" and you want to punch MS.
Great info on how Unicode encoding is detected. I've always wondered about this.
The bug is also included in Windows Longhorn pre-reset.
11:44 I can't do that, as any link (sometimes even links to YouTube) causes the immediate deletion of the comment, meaning it's visible to its author until page reload, edits fail ("Unknown error") and reloading the page makes it disappear. Only the creator of a video can post links in their own comments section without needing to worry.
I can only point to this two-decade-old article:
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)":
*The Single Most Important Fact About Encodings*
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.
*_There Ain’t No Such Thing As Plain Text._*
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
I love how this was made almost exactly like a classic video from 2008. With the exception of the YouTube Anthem and a Fraps watermark...
It would be interesting to find out how they fixed it in newer versions of Windows (if that encoding is still relevant).
Was it a quick fix, or did they choose a totally different method?
Doesn't work on Windows 10... but it does on notepad2.
And that's why everybody just uses UTF-8 nowadays
Which is easily mixed up with all the other 8-bit encodings leading to more glitches!
@@jhgvvetyjj6589 for text, really only plain ASCII or UTF-8 is used, at least in sensible systems
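A quick Python sketch of the mix-up being described, for the curious. It uses Windows-1252 (`cp1252`) as the stand-in 8-bit encoding, which is an assumption for the example; the sample word is made up.

```python
text = "café"

# UTF-8 bytes misread as Windows-1252: the two-byte "é" turns into two characters.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)

# Windows-1252 bytes misread as UTF-8: the lone 0xE9 byte is not valid UTF-8.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

# A lenient decoder swaps the bad byte for U+FFFD instead of failing.
print(cp1252_bytes.decode("utf-8", errors="replace"))

# Pure ASCII text is byte-for-byte identical in both encodings.
print("Bush hid the facts".encode("utf-8") == "Bush hid the facts".encode("cp1252"))
```

As long as the text stays within plain ASCII, the two encodings agree, which is why the problem only shows up once accented or other non-ASCII characters appear.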
Finally, a video that explains everything! tysm, FlyTech
Lol, this video is now a citation on the "Bush hid the facts" wiki page
I am pretty sure Windows XP notepad messed this up half the time if you were using CJK languages. There's a reason we'd wait a minute for Word to load, since it is slightly less prone to this but still not immune.
The widespread adoption of emoji really made western developers aware of Unicode, and text encoding is rarely a problem anymore
1:50
It's true that WordPad is unbreakable.
I remember when this bug was new when I was young, cool to see why it happened years later when I'd forgotten about it.