Code Pages and Kohuepts: The Chaos of 8 Bit Extended ASCII

  • Published: 10 Sep 2024
  • "But it's plain text! What do you mean it looks weird?"
    Friends, let's take a walk around the wonderful world of code pages, the cause of more encoding headaches, bizarre punctuation and inventive workarounds than just about anything else in IT history - and along the way, we'll meet some other things which aren't really encoding standards, but which have cast their long shadow on the way folks interact with digital data in the age of the World Wide Web.

Comments • 156

  • @mrmimeisfunny
    @mrmimeisfunny 29 days ago +45

    Neat fact about the keyboard layout thing. In the 2002 movie "The Bourne Identity" the protagonist assumes the fake identity of a Russian citizen named "Foma Kiniaev". He gets a fake Russian passport but his Russian passport in Cyrillic says "Ащьф Лштшфум". The prop department just set their keyboard to Russian and wrote "Foma Kiniaev" as if it was a qwerty keyboard.
    Turns out it was actually quite realistic. A few years back a guy tried to present a fake Israeli passport in Barbados under the name "Assulin Hormoz", but instead of "Hormoz" his surname in the passport in Hebrew was also typed as if it was Latin so it became "יםרצםז", which was further mangled by being rendered backwards as "זםצרםי" (bidirectional text is something you haven't covered and it's a whole other can of worms). There were also several other Hebrew mistakes in the passport such as text rendered upside down or similar-looking letters being mixed up.
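
    The wrong-layout effect in the passport prop can be reproduced with a small sketch. It assumes the standard US QWERTY and Russian ЙЦУКЕН layouts and covers only the main letter keys; the function name is made up for illustration.

```python
# Assumption: US QWERTY and Russian ЙЦУКЕН layouts, main letter keys only.
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"

# Map each physical key's Latin legend to its Cyrillic legend.
_KEYMAP = dict(zip(QWERTY, JCUKEN))


def type_on_russian_layout(text: str) -> str:
    """Return what appears if `text` is typed with the Russian layout active."""
    out = []
    for ch in text:
        mapped = _KEYMAP.get(ch.lower())
        if mapped is None:
            out.append(ch)              # keys outside the table pass through
        elif ch.isupper():
            out.append(mapped.upper())  # Shift still produces a capital
        else:
            out.append(mapped)
    return "".join(out)


print(type_on_russian_layout("Foma Kiniaev"))
```

    Typing "Foma Kiniaev" this way gives the same string quoted from the movie passport.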

  • @vektracaslermd743
    @vektracaslermd743 28 days ago +19

    Dylan is easily one of the best presenters I've ever seen. Fantastic work.

  • @Cmanorange
    @Cmanorange 29 days ago +40

    that google tidbit is hilarious 😂

    • @realGBx64
      @realGBx64 28 days ago +3

      Same thing works in Korean, too.
      The funny thing was when they used this strategy in the first Bourne movie to write the main character’s name in Cyrillic lol

  • @GeraldWaters
    @GeraldWaters 25 days ago +3

    In the early 1980s, during one of my unemployment phases, I read various books in the nearby university science library - thus I happened to read about the committee processes for establishing ASCII. No idea now what the exact book was. Some of the contentions were whether and what to have for things like logical NOT and OR (as AND was obviously covered by &). I have a more vague memory that collation order was also much debated. Like some other commenters here, I've also written about character encoding history and issues, see my bio for links.

  • @MusicEngineeer
    @MusicEngineeer 29 days ago +22

    These little anecdotes from the world of computer history are super entertaining!

  • @clasqm
    @clasqm 29 days ago +16

    This brought back memories of typing Romanized Sanskrit letters into my dissertation. I had Wordperfect macros typing the letter, then backspacing one position and typing the diacritical mark.

  • @TeVolt805
    @TeVolt805 29 days ago +36

    Excellent. Can't wait to see what you say about UTF-8.

    • @pleappleappleap
      @pleappleappleap 29 days ago +6

      Or UTF-7 even.

    • @thetj8243
      @thetj8243 29 days ago +3

      There is an excellent talk from Dylan about "plain text" that is, as he said in this video, the basis for this video ... And you can find a recording of the talk on YouTube

    • @mrmimeisfunny
      @mrmimeisfunny 29 days ago +4

      Probably something about having Chinese in the event logs.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago +1

      @@pleappleappleap What about UTF-9 and UTF-18? :) :)

    • @enterrr
      @enterrr 25 days ago +2

      UTF-7 is a crime against humanity!
      And so are UTF-16 and UTF-32, while we are at it
      UTF-8 FTW!

  • @DragoniteSpam
    @DragoniteSpam 26 days ago +4

    Growing up in Zimbabwe was not a piece of Dylan Beattie Lore I expected to learn today

  • @WilliamHostman
    @WilliamHostman 27 days ago +6

    The octothorpe (#) was used in the late 19th and early 20th C for the pound avoirdupois (weight) in the US, especially for pounds of goods sold by the pound... So while it may not have been a Pound sign in the UK, in the US, as a postfix, it indicated weight pounds (not to be confused with pounds force, pounds thrust aka poundals, pounds mass, nor pounds sterling aka £), and when prefixed, it starts a numeric sequence.
    I encountered this use a lot in late 19th and early 20th C US federal records from the then Territory of Alaska (now a US State) and Hawaii (also now a US state). Especially for the pounds of supplies ordered and delivered. It is still used in the US to indicate either numericity of the following characters, or to indicate a weight in pounds of the preceding digits.
    As for Cyrillic, it is used in Alaska for Russian, and some dialects of Yupic and Inungan... (most Alaska Natives have now switched to using accented Latin...).

    • @billwall267
      @billwall267 21 days ago +1

      You'll still see it occasionally to this day at smaller retailers like farmers markets etc.

    • @Roxor128
      @Roxor128 21 days ago

      That whole "not to be confused with" section had me wincing and oh, so very glad Australia finished switching to metric before I started school!

    • @billwall267
      @billwall267 21 days ago

      @@Roxor128 so very glad the US never switched to that backward system

  • @eliavrad2845
    @eliavrad2845 29 days ago +7

    It's not just forgetting to change keyboards: Sometimes it doesn't switch, sometimes you try to switch but it was already on the right language, sometimes the operating system gives you an extra keyboard or two for fun...

  • @mrJety89
    @mrJety89 29 days ago +12

    Well, you've been ASCIIing for it

    • @edgeeffect
      @edgeeffect 29 days ago

      Uuuugh! Dad Joke! :)

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago

      ASCII kind of sounds like someone sneezing :D

    • @TheUtuber999
      @TheUtuber999 26 days ago +2

      That joke is as bad ASCIIt gets.

  • @bauckrob
    @bauckrob 29 days ago +7

    There was also ISO 646, which to us Norwegians meant that we could find words like bl}b{rsyltet|y.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago

      ISO 646 was super common in all "West bloc" European countries that had additional characters.
      R{ksm}rg\s!
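
      For readers who never met ISO 646: the national variants reused a handful of ASCII punctuation codes for local letters. A minimal sketch of reading the Norwegian example above, assuming the Norwegian variant assigns Æ Ø Å to [ \ ] and æ ø å to { | } (the function name is hypothetical):

```python
# Assumption: Norwegian ISO 646 variant, where six ASCII punctuation
# positions carry national letters instead.
NORWEGIAN_646 = str.maketrans("[\\]{|}", "ÆØÅæøå")


def read_as_norwegian_646(text: str) -> str:
    """Show ASCII-looking text the way a Norwegian ISO 646 terminal would."""
    return text.translate(NORWEGIAN_646)


print(read_as_norwegian_646("bl}b{rsyltet|y"))  # blueberry jam
print(read_as_norwegian_646("a[i] = {1, 2};"))  # what C source looked like
```

      The second line shows the flip side: source code full of brackets and braces turned into a string of vowels on the same terminal.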

  • @edgeeffect
    @edgeeffect 29 days ago +4

    That WordStar screenshot is such a goldmine of nostalgia, I used a lot of different CP/M and DOS machines back in the olden days and they all had their differences and "killer apps"... but WordStar was the ONE constant. At college we had realised that the CP/M text editor was the ninth circle of hell and some bright spark realised you could use WordStar in "non document mode" as quite a decent text editor and so, until Microsoft put a cut down version of QBASIC in DOS-5 and called it `EDIT`, WordStar followed me around for many years.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago +1

      I never used Wordstar but Turbo Pascal and the other Turbo products used the same commands. In particular the old Turbo Pascal 3 really required that you knew the Wordstar commands. The end result is that they stuck in my brain and for casual text mode editing in *ix systems I use JOE, which uses Wordstar editing commands.

    • @edgeeffect
      @edgeeffect 26 days ago

      @@Thesecret101-te1lm yeah... Turbo C++ and Delphi had lovely editors.

    • @nickwallette6201
      @nickwallette6201 25 days ago +2

      IIRC, Edit wasn't a cut-down QBASIC, because QBASIC didn't exist until MS-DOS 5 either. The Edit executable required QBASIC because the latter actually contained the code for the text editing functionality, and EDIT.COM was just a stub that launched it in text-editing mode. This changed with the release of Win9x, where, I guess, they decided the extra few dozen KB didn't matter anymore, and having a BASIC interpreter wasn't high on the priority list either.

  • @MeriaDuck
    @MeriaDuck 29 days ago +11

    Of all your stories, that Harry Potter one is one of the very best 😂

    • @edgeeffect
      @edgeeffect 29 days ago +2

      I used to have a whole collection of pictures of "mojibake" and that one was always my favourite.

  • @Wishbone1977
    @Wishbone1977 26 days ago +6

    Ah, encodings... I work in integration, and let me tell you, the chaos is still with us to this day. I have written lengthy articles about various encoding problems, but here I will just touch on a single issue, the "MS Office character replacement problem".
    First, a bit of history. ISO-8859-1 is a single-byte text encoding scheme which extends the 7-bit ASCII character set with most of the characters used in western European languages. When Microsoft made Windows 1.0, they decided to copy this encoding, but rename it Windows-1252. Ever since then, this has been the default encoding on most Windows machines in the western world. Since Windows was for many years only a workstation OS (there was no server version of Windows initially), this led to a situation where a lot of text data was being produced on Windows machines but would eventually have to be processed by other operating systems. Since Windows-1252 was not an official international standard encoding, other operating systems did not have support for it. However, since Windows-1252 was initially identical to ISO-8859-1 which other operating systems _did_ support, it became common for data written in Windows-1252 to be marked as ISO-8859-1. This allowed other OS's to read Windows-1252 data with no problems, and it seemed like a good idea at the time...
    Now, ISO-8859-1 has a gap in its printable character definitions. The byte values 7F-9F (33 characters in all) are undefined. When Microsoft developed Windows 2.0, someone had the thought that it would be great to have a few more characters available, and wouldn't you know it, here were a bunch of character codes that weren't used for anything. So they added a few more character definitions in the space unused by ISO-8859-1. Then they did it again for Windows 3.1 and a final time for Windows 98, so that today all but 5 of the original 33 unused byte values in ISO-8859-1 have character definitions in Windows-1252. As a result, text data written in Windows-1252 can now potentially contain quite a lot of byte values which are undefined in ISO-8859-1. So what are these extra characters? Well, they are mostly typographical characters, by which I mean characters meant to make text "prettier" than the standard characters in ISO-8859-1 allows for. These are things like left and right hand versions of both single and double quotes, two dash characters of different lengths, a bullet point and an ellipsis (three dots). Recall that Windows-1252 encoded data has historically often been intentionally mislabeled as being ISO-8859-1 data, and we begin to see how this could potentially lead to problems.
    Then one glorious day, someone at Microsoft had the brilliant idea of helping end users write prettier text. How, you ask? By having all the MS Office programs (Word, Excel, Outlook, etc.) automatically replace some of the characters the users were typing _as they typed them_ with the "prettier" versions added to Windows-1252. And not as a function people had to switch on if they decided they _wanted_ this to happen to their text, no they did it as a function which was switched on by default when Office was installed and you had to manually find the setting and switch it off if you _didn't_ want it. Unsurprisingly, this aggravated the problem enormously, since so much text data was produced using MS Office. Instead of there being a mere _possibility_ that Windows-1252 encoded data might be decoded as ISO-8859-1 _and_ might contain characters not present in that encoding, it now became _highly probable_ that this would happen. And it did. A lot. And still does. All the time. And then I'm the one who has to fix it 😞
    There is _a lot_ more I could say on this subject, but I think this is probably enough for a YouTube comment 😀

    • @nickwallette6201
      @nickwallette6201 25 days ago +1

      You touched on this, but it bears mentioning explicitly: Outlook, at least for a long time, uses/used the Word engine for the email editor. (Maybe it still does? I dunno, I use a Mac for my day-to-day business needs. While Office technically exists on Mac, it's basically "the version of Office that our intern wrote as a summer project" with enormous chunks missing. Same price though. So that's fun. Anyway...)
      Working in a technical field, I cannot begin to count the number of times someone would send code or configuration snippets that had been through the "pretty text" filter. Naturally, C compilers, bash script interpreters, and network appliances configuration parsers have absolutely no idea what to do with complementary opening and closing quote marks, or command line switches with em-dashes, or passwords with at symbols converted to email addresses. On that topic, why is it every Office application copies email address with a mailto prefix, despite no Office application being smart enough to remove the mailto prefix when you paste it somewhere an email address is expected?

    • @pjl22222
      @pjl22222 25 days ago

      Another thing you could say about it is why those characters were unassigned. They were unassigned because they were the same as the control codes but with the high bit set. So if your text goes through a seven bit only system that strips the high bit you now might have a bunch of random control codes in your text instead of just getting the wrong characters like what would happen with the assigned code points.

    • @Wishbone1977
      @Wishbone1977 25 days ago

      @@nickwallette6201 Yes, the automatic character replacement functionality of Office has wreaked havoc in many different contexts over the years.
      Interestingly, in order to combat this specific issue, the official HTML5 specification explicitly calls for all pages that state they use ISO-8859-1 to be decoded using Windows-1252. As such, for internet browsers the problem has now been permanently "fixed".
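
      The mislabeling problem in this thread is easy to see with Python's built-in codecs. This is only a sketch: it assumes Python's `cp1252` and `latin-1` codecs are fair stand-ins for Windows-1252 and ISO-8859-1, and the helper names are made up for illustration.

```python
# Text that Office-style autocorrect might produce, saved as Windows-1252.
raw = "“quoted”…".encode("cp1252")

as_cp1252 = raw.decode("cp1252")   # what the author saw
as_latin1 = raw.decode("latin-1")  # what a reader trusting the label gets


def undefined_in_cp1252() -> list[int]:
    """List byte values 0x80-0x9F that Python's cp1252 codec cannot decode."""
    missing = []
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears the top bit of every byte."""
    return bytes(b & 0x7F for b in data)


print(repr(as_cp1252))                          # curly quotes and an ellipsis
print(repr(as_latin1))                          # invisible C1 control characters
print([hex(v) for v in undefined_in_cp1252()])  # the remaining gaps
print(strip_high_bit(raw))                      # control codes in the low range
```

      Decoded as Latin-1, the quotes and ellipsis come out as control characters rather than anything printable, which is why the text looks broken instead of merely different. The last line illustrates the point about 7-bit systems: with the high bit cleared, the same bytes land among the ASCII control codes.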

  • @euromicelli5970
    @euromicelli5970 29 days ago +12

    I had never encountered “Kohuept” until I heard of it in Tom Scott’s “Lateral”. Now I can’t _unsee it_ and it seems to pop up somewhere at least once a month

  • @kupferdrachevideosfurdich8733
    @kupferdrachevideosfurdich8733 29 days ago +6

    It is nearly as hilarious as working with dates and timestamps.

    • @realGBx64
      @realGBx64 28 days ago +3

      And mixing dot and comma as the decimal separator in the two languages you usually use.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago +1

      Time zones.
      Every time some software doesn't handle time zones correctly, I as a European feel some schadenfreude as Americans feel some pain that Europeans don't, as a contrast to every American software that has trouble with international characters :)

  • @pleappleappleap
    @pleappleappleap 29 days ago +30

    "Kohuept" reminds me of people calling "Moscow" as "Mockba".

    • @mrJety89
      @mrJety89 29 days ago +3

      Moszkva /hungarian/

    • @musiqtee
      @musiqtee 29 days ago +3

      Moskva /Norwegian/

    • @pleappleappleap
      @pleappleappleap 29 days ago +3

      @@mrJety89 Yes. The point being that the Cyrillic letter that looks like "C" sounds like "S", and the letter that looks like "B" sounds like "V".

    • @sponge1234ify
      @sponge1234ify 28 days ago +7

      The joke I've heard is American/British wondering why are there so many PECTOPAHs around.

    • @musiqtee
      @musiqtee 28 days ago +1

      @@sponge1234ify Suddenly hungry, wonder why…🤓

  • @cdreesbach
    @cdreesbach 29 days ago +9

    Man, I do _NOT_ miss the old codepage mess AT ALL! Thx for a great trip down memory lane. ;]
    Also, did not know about kohuept - neat! 😂

  • @ThomasKnott
    @ThomasKnott 29 days ago +9

    And still today in Germany many systems tell you to not use Umlauts (ä, ö, ü) in your username or even your password. Even more weird when they also require the password to contain special characters

    • @martinba9629
      @martinba9629 29 days ago +2

      And they're right. On Windows you still quite often end up fighting Win1252 - UTF-8 mismatches.

    • @Pystro
      @Pystro 29 days ago +1

      Special characters in passwords is just so people don't use "password1" or "pa$$word1" or other similar things that are essentially a single word.
      The more annoying part about that is not that it forces me to put a symbol into passphrases that are already secure enough without symbols, but that it still doesn't force people to use secure passwords.
      And forcing actually secure passwords wouldn't really be that difficult. Your browser would need access to a dictionary sorted by word "frequency" (or really the probability of being in a password, in every language you might type a password in), plus a globally valid character replacement dictionary. And then any password prompt in the browser would just display a prompt "if you are SETTING a password click here" that then sums up how many bits of entropy are in the password.
      And finally, avoiding codepage-dependent or keyboard-layout-dependent symbols is also useful for when you have to log into things from an internet cafe in a foreign country, or from the computer of a host company on a business trip.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago +2

      In Sweden it's super common to encounter character encoding problems with everything that has anything remotely to do with embedded systems, label printers and whatnot.
      Receipts even at large chains like hamburger chains and whatnot commonly have incorrect encoding...

    • @johnrehwinkel7241
      @johnrehwinkel7241 24 days ago +2

      I've run into that "special characters, but not TOO special" more than once. Often it isn't even documented. What's wrong with “⁋assword”?

  • @Posiman
    @Posiman 14 hours ago

    When my father studied chemistry in 1970s Czechoslovakia he had an assignment he did not know how to do. The teacher told him to look up a book by American physicist Walter D. Knight. My father returned desperate that no library in the whole city has that book.
    "Oh, you were looking under K?"
    The book was not translated to Czech, only to Russian (which every student spoke) and by Russian standards they transliterated the name phonetically as "Найт," therefore every library sorted it according to Czech transliteration under N as in "Najt"

  • @ArduinoRR
    @ArduinoRR 29 days ago +3

    Lovely historical info trove. In the 1960's I grew up on Dartmouth Timesharing Basic on a TeleType ASR 33, so I got to know 6-bit ASCII pretty well. Fast forward to 2004 trying to maintain a Spanish website on JDK1.4, which didn't support UTF-8 in property files. Had to copy and paste UTF-8 from Word documents into an app that converted UTF-8 to Unicode backslash escape characters. You've nicely covered quite an historical odyssey from Baudot to ASCII to EBCDIC to Code Pages and finally Unicode. Thank you, sir!

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago

      6-bit???

    • @ArduinoRR
      @ArduinoRR 27 days ago

      @@Thesecret101-te1lm Actually, yes. The TeleType ASR-33 didn't print the lowercase letters. I also programmed the 12-bit PDP-8, which packed two 6-bit characters in a word. Seems strange now that you mention it.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 27 days ago

      @@ArduinoRR Interesting that it counts as 6-bit, as the teletype needed 7 bits to handle control characters.

  • @Merrinen
    @Merrinen 29 days ago +3

    I'm just happy I never have to do a latin-1 to utf-8 database conversion again.
    I'm also happy I never have to fix utf-8 stored with latin-1 connection to utf-8 stored with utf-8 connection again.

    • @edwardallenthree
      @edwardallenthree 16 days ago

      I still find problems in an old database. Every time it gets backed up and restored, I'll find a new place where the Unicode has finally devolved into an error. I think my record was 12 characters between the "n" and the "t" in a "don't."
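
      The way one apostrophe grows into a dozen characters can be sketched in a few lines. The assumption here is that each backup and restore cycle writes UTF-8 bytes and reads them back as Windows-1252 (which is what many systems mean by "latin-1"); the function names are hypothetical.

```python
def mangle(text: str, rounds: int = 1) -> str:
    """Write as UTF-8, read back as cp1252, `rounds` times over."""
    for _ in range(rounds):
        # Note: this can raise for the few byte values cp1252 leaves undefined.
        text = text.encode("utf-8").decode("cp1252")
    return text


def repair(text: str, rounds: int = 1) -> str:
    """Undo `mangle` by reversing each round, if no bytes were lost."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text


word = "don’t"
for n in range(3):
    damaged = mangle(word, n)
    print(n, len(damaged), damaged)

print(repair(mangle(word, 2), 2))
```

      Each round turns every non-ASCII character into two or three new ones, so the damage compounds. It stays reversible only as long as every intermediate byte survives, which is rarely guaranteed in a real database.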

  • @Kobold666
    @Kobold666 29 days ago +6

    It still is a problem with compilers and editors if you use non-ASCII characters (like ä, ö, ü, ß, copyright, trademark etc.) in string constants (or comments even). The editor might automatically switch to UTF-8 (Notepad++ does that), which the compiler takes for standard ASCII and chokes. Usually you get garbage at some point. I got used to embedding such characters as hexadecimal escape codes to avoid that pitfall.

    • @rogo7330
      @rogo7330 29 days ago

      I believe most parsers today are capable of reading UTF-8 automatically since UTF-8 just looks like ASCII + bytes with value 128 to 255. You just search for your special ASCII characters and treat all other characters as "alphabet" (even if they are invalid UTF-8 since it does not matter and it is your fault).

  • @edwardallenthree
    @edwardallenthree 16 days ago +1

    The Russian postal worker who translated that code page mistake was doing the Lord's work.

  • @JanMichalSzulew
    @JanMichalSzulew 29 days ago +10

    9:07 you threw in an extra "T" between "N" (Н) and "Ts" (Ц) that isn't there

    • @DylanBeattie
      @DylanBeattie 27 days ago +4

      D'oh. This is what happens when you're concentrating so hard on pronunciation your brain throws in extra letters which aren't there. My bad.

  • @pebbleschan6085
    @pebbleschan6085 29 days ago +3

    Wordstar also worked with non-document text files without affecting the MSBit. It was used often for source code.

    • @edgeeffect
      @edgeeffect 29 days ago +1

      Yay... a fellow "non doc mode" aficionado!

  • @YoutubeBorkedMyOldHandle_why
    @YoutubeBorkedMyOldHandle_why 23 days ago

    This is great. I've been programming computers since the 1970's. Looks like after a few more of these videos, I might finally start to understand some of this stuff.

  • @sponge1234ify
    @sponge1234ify 28 days ago +3

    On the keyboard trickery, the video game _Library of Ruina_ has a mid-lategame boss whose name, in the original Korean, is a jumbled mess of Latin characters. And in English, their name is a jumbled mess of Korean letters. As you can guess, their name is concealed using the "type in the wrong keyboard" method, with the other translations, Chinese and Japanese, _also_ using the same method in their own keyboards, so that their name is gibberish-but-"typable" in _all_ languages.

    • @realGBx64
      @realGBx64 28 days ago +1

      This is the coolest thing I heard in the last 20 minutes!

  • @DrCoomerHvH
    @DrCoomerHvH 29 days ago +3

    I love these miniature bites of your talks

  • @ralfbaechle
    @ralfbaechle 29 days ago +8

    Well done, Dylan!
    ASCII was a good solution considering the technical constraints of the time. It just shouldn't have lived that long. We went through Commodore ASCII (aka not really ASCII at all) and a few other proprietary variants, more official extensions such as the three dozen variants of ISO-8859-random_number plus a bunch of national standards and of course the code pages, Amiga ASCII, ATASCII aka Atari ASCII, EBCDIC (punch card compatible but not ASCII-like) and more. After having survived Baudot code and what not in the teleprinter age. ASCII was always a standard that was typographically impoverished, just barely good enough - it doesn't even fully cover the character set used in an average newspaper, such as proper “quotes”, the cent symbol ¢ and more.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago +1

      Commodore/"PETSCII" IS actually ASCII, but it's 1963 ASCII rather than the way more common 1967 ASCII that almost everything else is based on. This is the reason for having an up arrow and a left arrow symbol.

    • @greggoog7559
      @greggoog7559 27 days ago

      The best AND worst thing about Baudot code was the integrated code page switching (can't remember what it's called). If you missed such a character, the rest of the transmission was garbage 🥴

    • @ralfbaechle
      @ralfbaechle 27 days ago

      @@greggoog7559 The Baudot issue was (is!) very annoying with radio teletypes. Somebody among the UTF-8 designers may have been aware of this issue. UTF-8 is designed to quickly resync after a byte has been lost or missed.
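
      The resync property is easy to try out. A minimal sketch, assuming a single byte is lost in transit and the receiver decodes with Python's `errors="replace"` policy (the helper name is made up):

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a transmission that loses exactly one byte."""
    return data[:index] + data[index + 1:]


original = "naïve café"
encoded = original.encode("utf-8")  # 'ï' and 'é' each take two bytes

damaged = drop_byte(encoded, 2)     # lose the lead byte of 'ï'
recovered = damaged.decode("utf-8", errors="replace")

print(recovered)
```

      Only the damaged character becomes U+FFFD; the later "é" still decodes correctly, because lead bytes and continuation bytes use distinct bit patterns and the decoder can find the next character boundary on its own. A missed shift character in Baudot, by contrast, garbles everything up to the next shift.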

  • @AxlefublrMain
    @AxlefublrMain 4 days ago

    I'm Russian, and am really happy at you actually pronouncing Russian correctly! incredibly rare :D

  • @lforlight
    @lforlight 21 days ago

    That last example with Taylor Swift is very relatable. I had a friend with whom I chatted a lot via text. He noticed that writing the Hebrew word for "correct" - נכון - in the English layout by mistake makes "bfui". One day I asked him a question and he answered "bfui"... but transliterated to Hebrew - בפוי. After a bit I figured out that he meant "correct", transliterated from the wrong keyboard layout.
    Nowadays, knowing Google does accept these layout mix-ups, when it happens to me and I notice it halfway through, I need to stop myself from deleting the query and writing it all over again, and continue writing it wrong. It's not perfect, and many times it'll either not suggest the Hebrew layout equivalent, or it'll swap the layouts despite searching an obscure English string such as an error code or a mysterious executable's name. It may also provide gibberish results of pages where Hebrew is written backwards because it goes right to left and that's a whole can of worms...

  • @imarioable
    @imarioable 29 days ago +2

    Looking forward to your Unicode series now! 😅

  • @Dominik-K
    @Dominik-K 29 days ago +1

    That Google and Taylor Swift tidbit had me 😂 so much, it just makes so much sense haha

  • @DanielHauser
    @DanielHauser 26 days ago

    At work we're fighting with an old system that has its own bespoke character encoding. It encodes various other charsets, such as one of the ISO-8859 subsets or corresponding Windows codepages. It even has specific codes for various text formatting operations, like bold, italic, underline and even blinking. But that's all single byte. The encoding also supports asian languages - Japanese, Chinese, Korean - half and full width. Those are encoded in multiple bytes, but sadly the charsets those are based on are not documented. Since this encoding is proprietary and there are no libraries to tame it, good luck converting it to UTF-8 and back.

  • @FredrikHistherRasch
    @FredrikHistherRasch 22 days ago

    Just FYI, for Norwegian and Danish IBM 437 was actually sufficient. æåÆÅ are present in IBM 437, and the Greek phi was printed as a circle with a stroke through it, making it very similar to øØ
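
    The claim about code page 437 can be checked against Python's bundled codecs. This is a sketch that assumes `cp437` and `cp865` in Python match the original IBM tables; `fits` is a hypothetical helper.

```python
def fits(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(fits("æåÆÅ", "cp437"))  # present in the original IBM PC code page
print(fits("øØ", "cp437"))    # missing, hence the phi workaround
print(fits("øØ", "cp865"))    # the Nordic code page added them
print(bytes([0xED]).decode("cp437"))  # the Greek letter used as a stand-in
```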

  • @TakeTheRedPill_Now
    @TakeTheRedPill_Now 19 days ago

    Superb! Thank you.

  • @louisreinitz5642
    @louisreinitz5642 27 days ago +2

    PECTOPAH == RESTORAN (Restaurant)

    • @edwardallenthree
      @edwardallenthree 16 days ago

      Your comment has a translate to English button (for me, US English user), which ironically only normalizes the space around the double equals.

  • @mynameisben123
    @mynameisben123 20 days ago

    I’m sure when making ASCII they didn’t envision such a grand scope, but rather they probably just focused on their application at the then present time.

  • @dj196301
    @dj196301 29 days ago +1

    Riveting!
    2024 and I'm mired in mojibake (文字化け).

    • @MaddTheSane
      @MaddTheSane 29 days ago

      Blame that on the three different encoding standards used by the Japanese computer industry. Where it's easier to fax a document than have to worry about the code page used by the other computer.
      You'd think Japan would move to Unicode…

  • @pquirk99
    @pquirk99 26 days ago +1

    A couple of gaps that are worth covering in another video. 1. You didn't cover how applications switched code pages. 2. You made a brief reference to ISO 8859 but didn't discuss the locales that were added to this family of single-byte encodings, and the machinations to standardize collating sequences. In Spanish, the ll and ch digraphs had to be treated as single characters for collating purposes. This is worth discussing as you prepare the audience for Unicode.

  • @TheUtuber999
    @TheUtuber999 26 days ago

    Your Microline 320 printer from 2001 should have been perfectly capable of printing the UK Pound symbol (£). All you needed to do is hold the Alt key on your keyboard and then type 156 on the numeric keypad in your word processor. Then it should just print normally and if not, you could have sent the control codes "ESC ! 0" to your printer to select the standard character set.

  • @setlonnert
    @setlonnert 29 days ago +1

    Yes, before codepages we had some "adjustments" to e.g. Swedish. One such was ISO ESC 2/8 4/7 which actually included the Swedish variants "inside" of ASCII, in the Swedish computer ABC80. And if I remember correctly the same code was used in what was equivalent to the English Prestel, videotext or whatever they were called. Yes, that also had consequences. Early in the internet when we still had that awful ugly quoted printable hack, the mess got so bad that we gave up and spelled our texts with a and o instead of å, ä and ö. Mostly text were still readable, due to context ...

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago

      Also known as ISO 646
      In addition to being used in all sorts of computers predating the IBM PC, it was also used by teletext (text-tv in Swedish). ABC 80 used a character/font ROM intended for teletext.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 days ago

      Btw a pet peeve re ISO 646 is that it put the Swedish alphabet in the wrong order. (I assume this was true for Norwegian and Danish too?)
      I assume that it was in order to have Ä and Ö on the same codes as for German. TBH it would have been better if the Germans had had to suck it up and have ß between Z and Ä, and have it share code with Å in the Scandinavian languages.
      Perhaps not the biggest problem in the world, but still annoying to have to have a special case for alphabetical sorting.

    • @pjl22222
      @pjl22222 25 days ago

      But then whose alphabetical sorting are you referring to? Some languages sort accented letters after their non-accented versions, some mix them in with their non-accented versions as if they didn't have accents, others put them at the end of the alphabet.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 25 days ago

      @@pjl22222 Well, the variant of ISO 646 that has åäöÅÄÖ was only used in Finland and Sweden and those countries sort åäöÅÄÖ after the regular letters. (Also they aren't accented but rather umlaut and ring).

  • @SojournerDidimus
    @SojournerDidimus 20 days ago

    Yay for utf-8!

  • @TheJamesM
    @TheJamesM 26 days ago

    I’m guessing the Swedish passport/US plane ticket fact will be that Swedish passports have a kind of canonical spelling for names using only the modern Swedish alphabet, whereas in everyday life Swedes will often use the traditional spelling of a name. For example, a person with the surname Wallberg would have it appear as Vallberg on their passport. They might also use the fallback spellings for the additional vowels: å = aa, ä = ae, ö = oe (as those were the letter combinations those characters originally represented).
    In countries unfamiliar with these conventions, it appears that the name the ticket is registered under doesn’t match the passport, which obviously can cause issues.

  • @maxmuster7003
    @maxmuster7003 2 дня назад

    I like to use the extended ASCII characters to display 1-bit pixel art animation on a VGA text screen.

  • @m4rt_
    @m4rt_ 29 дней назад +2

    3:45, Actually Swedish doesn't have æ Æ, they use ä Ä
    but ø Ø is missing, but you could use the Swedish version, ö Ö

  • @I.____.....__...__
    @I.____.....__...__ 29 дней назад +1

    11:11 On a related note, some systems (eg search-engines, auto-correct, etc.) can detect other errors like when you type something with your fingers shifted a key to the left or right. It's a lot of work to encode all of the possible mistakes that could be made to accommodate errors seamlessly, but perhaps it's a job well-suited for machine-learning. 🤔 (On the other hand, look at the mess that Microsoft made by making IE correct for sloppy web-developers. 😒)

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 дней назад

      My impression is that autocorrect on iOS sucks at handling some non-US-ASCII characters compared to US ASCII characters. At least it's clueless if you make a typo involving the Swedish åäöÅÄÖ, even though it handles typos well for other characters in Swedish words.

  • @FlameRat_YehLon
    @FlameRat_YehLon 27 дней назад

    So you are the plain text guy that shows up multiple times in my feed😂. Anyway, I've recently heard a new story about this. For some context, there used to be a popular game genre on the Chinese-language internet called 魔塔 (Magic Tower), and one of the famous games in mainland China was made in Hong Kong. Because the mainland mostly used the GBK encoding back then while HK used Big5, the text turned into gibberish, and somehow someone on the internet was dedicated enough to just memorize and read that gibberish in order to progress through the game.

  • @joe_z
    @joe_z 29 дней назад

    I learned about Kohuept from Tom Scott's *Lateral* podcast!

  • @McDuffington
    @McDuffington 29 дней назад

    Great stuff! Hope more is to come.

  • @probablypablito
    @probablypablito 29 дней назад

    Amazing! How do you find all of this? This is an incredible amount of knowledge being presented (and very well I might add!)

    • @DylanBeattie
      @DylanBeattie  27 дней назад

      Travel the world, go to a lot of tech conferences, ask people how computers specifically don't work in their part of the world. 30 years of staring at computer screens going "...but why doesn't it work?!?" also helped.

  • @diakritika
    @diakritika 24 дня назад

    Thank god for Unicode…

  • @greasedweasel8087
    @greasedweasel8087 28 дней назад +1

    4:38 I was excavating old* servers and ran into one with a BIOS from 2001 running contemporary BSD. Tried to ctrl-C a command and got a smiley face instead

  • @orbik_fin
    @orbik_fin 25 дней назад

    4:47 More likely garbage would be written directly to video memory (A0000h-AFFFFh) than standard output which is part of DOS API (Int 21h).

  • @WooShell
    @WooShell 29 дней назад

    holy cow.. I knew that non-US ASCII was a mess, but I had no idea it went to such extents

    • @MaddTheSane
      @MaddTheSane 29 дней назад +1

      Unicode was a gift. UTF-8 even more so.

  • @pnadk
    @pnadk 22 дня назад

    The Japanese have a word for what happened to the Russian address typed in France, they call it Mojibake
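
Mojibake is easy to reproduce. The sketch below assumes the text was written in KOI8-R and then displayed as Latin-1. The comment does not say which encodings were actually involved, so both are assumptions for illustration.

```python
original = "Россия"

# Write the text as KOI8-R bytes, then misread those bytes as Latin-1.
garbled = original.encode("koi8_r").decode("latin-1")

# The damage is reversible as long as no byte was lost:
# re-encode with the wrong codec, decode with the right one.
restored = garbled.encode("latin-1").decode("koi8_r")

print(garbled)
print(restored)
```

Because Latin-1 assigns a character to every byte value, the round trip is lossless here. With other codec pairs some bytes may be undefined and the original cannot always be recovered.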

  • @andreydeev4342
    @andreydeev4342 29 дней назад

    Как всегда жжёшь! You rock it, as usual =)

  • @TheRowi62
    @TheRowi62 26 дней назад

    There is even a 7Bit ASCII Table for German, where []{}\| and ~ have been used for the umlauts and ß
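
A small sketch of that substitution, assuming the DIN 66003 assignment as I understand it (`[` `\` `]` for Ä Ö Ü, `{` `|` `}` for ä ö ü, `~` for ß). The function name `decode_din_66003` is made up for this example.

```python
# Assumed mapping of the seven reassigned code points.
DIN_66003 = str.maketrans("[\\]{|}~", "ÄÖÜäöüß")


def decode_din_66003(raw: bytes) -> str:
    """Interpret 7-bit bytes the way a German terminal would have shown them."""
    return raw.decode("ascii").translate(DIN_66003)


print(decode_din_66003(b"Gr}~e"))     # German text stored in 7 bits
print(decode_din_66003(b"array[i]"))  # the same table mangles C-style source code
```

The second call shows the flip side: any program text that uses brackets or braces becomes unreadable on a terminal set up this way.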

  • @thejonte
    @thejonte 27 дней назад

    Love your work! You're a very good speaker.
    I've however got one question: why do you spend your time recording your NDC talks in a studio? Can't you get permission to repost the talks on here as well?

  • @pjl22222
    @pjl22222 25 дней назад

    When the USSR fell apart and Russia decided to start using the standard European license plates they had a problem: you have to use Latin, not Cyrillic, letters but most of their computer systems were only set to handle Cyrillic. Their solution: realizing that a lot of Cyrillic letters look just like Latin letters, they only assigned license plates with those letters. Still Cyrillic for the Russian computers but looks like Latin for all the other European countries they might drive to.
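
The lookalike trick can be sketched with a small table. The twelve-letter set below is the one commonly cited for Russian plates and is an assumption here, as are the helper names.

```python
# Cyrillic capitals that have a visually matching Latin capital (assumed set).
CYRILLIC = "АВЕКМНОРСТУХ"
LATIN = "ABEKMHOPCTYX"
LOOKALIKE = dict(zip(CYRILLIC, LATIN))


def is_plate_safe(text: str) -> bool:
    """True if every character is a Cyrillic letter with a Latin twin or a digit."""
    return all(ch in LOOKALIKE or ch.isdigit() for ch in text)


def to_latin_lookalike(text: str) -> str:
    """Swap each Cyrillic letter for the Latin letter that looks the same."""
    return "".join(LOOKALIKE.get(ch, ch) for ch in text)


plate = "\u0410123\u0412\u0421"  # А123ВС, written with Cyrillic code points
print(is_plate_safe(plate))
print(to_latin_lookalike(plate) == "A123BC")
print(plate == "A123BC")  # identical on screen, different code points
```

The last comparison is the catch: the two strings look the same but do not compare equal, so a database lookup has to normalise to one script first.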

  • @soumen_pradhan
    @soumen_pradhan 27 дней назад

    Wait a min, Dylan is a Rhodesian! New lore drop.

    • @DylanBeattie
      @DylanBeattie  27 дней назад +1

      Nah, Dylan's British. Born in Kenya, lived in Zimbabwe 1981-1988, never set foot in any of the various bits of the world that were known as Rhodesia while they were still called that. (But Mum was born in Zambia while it was still called Northern Rhodesia, so maybe that counts.)
      You ever notice that if they'd named the country after Rhodes' first name instead of his last name, it would have been called Cecilia?

    • @soumen_pradhan
      @soumen_pradhan 27 дней назад

      @@DylanBeattie Well, I stand corrected.

    • @DylanBeattie
      @DylanBeattie  27 дней назад +1

      @@soumen_pradhan There is a whole gnarly tangled mess of what databases should do with people whose place of birth is a country that no longer exists. That'll probably make a fun topic for another video. :)

  • @kevinmcnamee6006
    @kevinmcnamee6006 28 дней назад

    Great video. These problems are still with us. I recently bought a new laptop and during the installation process it seemed to think I had a UK keyboard and rendered the @ as a ", which made it very difficult for me to enter my email address. I figured it out.

  • @cmyk8964
    @cmyk8964 27 дней назад

    Wow, I wonder how much Google will convert between JCUKEN and QWERTY. I’m guessing it’s just a few common words and names, and I’d be surprised if it were a thing for every word.
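
What Google does internally isn't known here, but the per-key remapping itself is small enough that a lookup table covers every word. A minimal sketch, assuming the standard US QWERTY and Russian JCUKEN key positions; the helper names are made up for this example.

```python
# Physical key positions, row by row: US QWERTY vs. Russian JCUKEN.
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"


def build_table(src: str, dst: str) -> dict:
    """Map each key in src to the key at the same position in dst."""
    table = {ord(s): d for s, d in zip(src, dst)}
    # Letters also get an upper-case entry; punctuation keys are left unshifted.
    table.update({ord(s.upper()): d.upper()
                  for s, d in zip(src, dst) if s.isalpha()})
    return table


TO_JCUKEN = build_table(QWERTY, JCUKEN)
TO_QWERTY = build_table(JCUKEN, QWERTY)


def to_jcuken(text: str) -> str:
    """What you get when typing Latin text with the Russian layout active."""
    return text.translate(TO_JCUKEN)


def to_qwerty(text: str) -> str:
    """Undo the mistake: recover the Latin text from the Cyrillic keystrokes."""
    return text.translate(TO_QWERTY)


print(to_jcuken("Foma Kiniaev"))
print(to_qwerty("Ащьф Лштшфум"))
```

A search engine would still need some way to decide when to apply the table, for example by checking whether the remapped query is a far more common search than the original; that part is not sketched here.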

  • @Huntracony
    @Huntracony 29 дней назад +2

    I must admit, I had no idea Cyrillic had capital letters, I thought that was unique to the Latin alphabet. I should learn about the history of capital letters sometime, it's kinda weird that (at least?) two alphabets have two versions of every letter.

    • @0LoneTech
      @0LoneTech 29 дней назад +3

      It didn't surprise me since Greek also does.

    • @fullfungo
      @fullfungo 29 дней назад +5

      Actually, lowercase letters are a very recent invention. They did not exist for centuries until the printing process became widespread.
      So you got it the wrong way round (twice)

    • @sponge1234ify
      @sponge1234ify 28 дней назад +1

      @@Huntracony a lowercase-uppercase system is actually pretty rare, numerically speaking. It's mostly Latin, Greek, and alphabets descended from them, with the rest showing a high level of correlation that suggests influence (Old Hungarian Runes, Zaghawa, Adlam, Warang Citi, etc.)

  • @redoktopus3047
    @redoktopus3047 28 дней назад

    Had no idea Dylan Beattie's parents were Whenwes

    • @DylanBeattie
      @DylanBeattie  27 дней назад

      Dad was. Mum was born and grew up in Zambia.

  • @taskfailedsuccesfully738
    @taskfailedsuccesfully738 29 дней назад

    I just got done watching the previous video lol

  • @danielrhouck
    @danielrhouck 28 дней назад

    Is this series going to build up to Pike Matchbox?

  • @ALeXKazik
    @ALeXKazik 25 дней назад

    Luckily I had an Amiga (with ISO-8859-1) and later Mac OS X (with Unicode) and never had to deal with those PC code pages.

  • @butwhytho6522
    @butwhytho6522 28 дней назад +3

    Windows and the Byte Order Mark. Open the text file - looks normal. Open the text file on Linux - hey there's extra bytes at the start. Thanks again Microsoft!

    • @kalleguld
      @kalleguld 26 дней назад +1

      What do you mean "On Windows"? Which editor? It's the editor that inserts a BOM, not the OS
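
The "extra bytes" are the three-byte UTF-8 encoding of U+FEFF. A short sketch of how a reader can tolerate them, using Python's built-in `utf-8-sig` codec; the sample content is made up.

```python
import codecs

# A file as some editors save it: UTF-8 BOM followed by the text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as an invisible first character.
plain = data.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
stripped = data.decode("utf-8-sig")

print(data[:3])
print(repr(plain))
print(repr(stripped))
```

`utf-8-sig` also decodes input that has no BOM, so it is a reasonable default when reading files of unknown origin.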

  • @Colaholiker
    @Colaholiker 24 дня назад

    And you'd think that with UTF-8 these days all of this was just a funny note in history, right?
    Nope. At least not at my workplace.
    Being someone who prefers efficiency over fancy presentation, I have set my email client to send plain-text mails by default. Mostly because Outlook totally messes up when I paste any code snippet from VS Code into an email and forget to paste it as text only. Of course it is set to UTF-8 encoding, as anything should be today. Being a German person working with Germans who all speak German, I use all the letters that the German alphabet has. With UTF-8, this shouldn't be a problem, right?
    Enter our IT department (some Star Wars Imperial March would work here)
    They put some tool on our mail server to add our signature - for internal purposes just name and contact information, for external purposes also the required legal boilerplate. Apparently, this widget must be from the code page days, because once my mail passes through this filter, it totally messes up all characters that you wouldn't find in 7-bit ASCII. And of course this is what any recipient of the email, both internal and external would get to see...🤣
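
How that tool actually works isn't stated, so the following is only a guess at the kind of failure involved: UTF-8 bytes being reinterpreted as a single-byte code page. Windows-1252 is assumed here purely for illustration.

```python
message = "Grüße"

# Correct UTF-8 bytes, wrongly decoded as Windows-1252:
# each non-ASCII letter turns into two unrelated characters.
garbled = message.encode("utf-8").decode("cp1252")

# While nothing has been dropped, the damage can still be undone.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

If this is indeed what happens, every character outside 7-bit ASCII would show up as a two-character sequence, which matches the symptom described.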

  • @david.mcmahan
    @david.mcmahan 29 дней назад

    Totally understand your view on "pound sign". But as an American of Gen X age and having had a grandmother who worked for a "baby" Bell telephone company, # is the "pound key" to me. AT&T and the Bell system instructed us to use the "pound key" for specific dialing situations with touch tone phones.

    • @caerphoto
      @caerphoto 29 дней назад +1

      A lot of UK companies use American phone menu systems that sometimes tell us to press the Pound key, which obviously doesn't make much sense here. I think most people are aware enough of American culture to figure out what it means, though.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 28 дней назад

      TBH there is a certain imperial aura to having a special symbol for money. Afaik it's only the dollar, pound, ruble and yen that have special symbols. And then there is the international symbol ¤
      This also leads to some ridiculous things like the screen keyboard on an iPhone, when set to Swedish, having a "kr" key that just prints the letters k and r, which is the standard abbreviation for "kronor". Zero need for that button, but someone at Apple decided that there should be a button for money at that place I guess?
      P.S. in Sweden # is "square" or uncommonly "lumber yard". :D

    • @pjl22222
      @pjl22222 25 дней назад

      Back in the day people writing, for instance, a list of goods bought or sold by weight would use # to mean pounds. Like if you bought three pounds of apples then maybe one line of your handwritten receipt would say "apples 3#" and if they were 50¢ per pound it might even say "apples 3# @ 50¢ $1.50"

  • @edgeeffect
    @edgeeffect 28 дней назад

    Hi @Dylan... I've heard you talk on this subject a couple'a times now... but you never use the term "Mojibake" (文字化け) and I've always thought you should... 'cus it's a cool word. ;)

  • @PixelOutlaw
    @PixelOutlaw 29 дней назад +12

    They took away my beloved box drawing characters on Linux because some European needed 16 versions of the letter 'e'.

  • @jtsiomb
    @jtsiomb 29 дней назад +1

    For Greek we had codepage 737, or ISO 8859-7, but since the whole thing was a mess (you had to use a Greek font and run a VGA glyph-replacement program, and earlier computers were ASCII-only anyway), my generation of computer geeks used to type Greek with Latin characters. We called it "greeklish" and that's what we de facto used online to communicate. In later years subsequent generations started taking offence at people using greeklish on forums, since they grew up with Unicode and never had to get used to reading greeklish, but for some of us, having to switch keyboard layouts mid-sentence to type an English term is just unbearable, so we keep using greeklish :) Also I type at half the rate in Greek, since I never got used to it.... unbearable.

  • @marloelefant7500
    @marloelefant7500 29 дней назад

    Actually, that last part I'm using for some of my passwords. I'm typing something in with the English keyboard layout, but assuming another layout in my mind. The result is a more complicated, but easy to remember password (btw, my passwords also comprise multiple words).

    • @pihungliu35
      @pihungliu35 29 дней назад +3

      But be careful: do not use a common underlying phrase for your password, as many other people have had the same idea with the same layout. An (in)famous example is a common password-related phrase in Chinese which, when typed assuming the Taiwanese IME layout, produces a random-looking combination of letters and digits. It is used so much by my fellow Taiwanese people that it appears near the top of commonly used password lists like HIBP.
      (An addendum since other comment mentioned Tom Scott's Lateral: this thing also showed up as a question once!)

    • @peterwmdavis
      @peterwmdavis 29 дней назад +1

      Right, this is just a basic (and predictable) substitution cypher. A password manager and truly random, long passwords would be much better. Or even a long but memorable multi-word sentence.

    • @marloelefant7500
      @marloelefant7500 28 дней назад +1

      @@pihungliu35 My passwords usually comprise 2-3 words, together more than 30 characters. I'm pretty sure, not many people have passwords of similar length.

  • @Chris-op7yt
    @Chris-op7yt 29 дней назад

    v. good

  • @Thesecret101-te1lm
    @Thesecret101-te1lm 28 дней назад

    Two things re code pages:
    You need at least EGA graphics on a PC to use code pages.
    Microsoft/IBM forced users in Sweden to do all the "load code page" crap to be able to select the correct keyboard layout, even though the default "CP437" already had all characters that we really needed. Everything about code pages in DOS has an aura of "pointy haired boss"...

  • @vanhetgoor
    @vanhetgoor 23 дня назад

    Clumsy amateurs at IBM, they'd better drop the I from their name and carry on as Local American Business Machines. This cheapskate solution traumatised the whole computer industry for many years. Luckily the Apple Mac had an international setup that worked from the very beginning. If this had not been done, international publishing with the help of DTP would have been seriously delayed by five to ten years. This stupid mistake by IBM was the glorious moment of triumph for the Mac.

  • @people9178
    @people9178 29 дней назад

    Ш щаеут ащкпуе ещ срфтпу дфтпгфпу уізусшфддн црут Ш іуфкср штащкьфешщт щт Пщщпду. Щр тщ тще фпфшт!

  • @marloelefant7500
    @marloelefant7500 29 дней назад

    This is the fourth comment.

    • @DylanBeattie
      @DylanBeattie  29 дней назад

      Sixth. But who's counting, eh? 😉

    • @marloelefant7500
      @marloelefant7500 28 дней назад

      @@DylanBeattie YouTube only showed me 3 other comments at that time, but I guess that's eventual consistency. Or I'm an LLM, who knows.