Which side of a napkin is the back?
This was shot at the Marriott St Pancras Renaissance in London - kind thanks to them for allowing us to film there! >Sean
This guy just radiates enthusiasm
It's rare to see a person who is knowledgeable, passionate and able to explain in a linear and easy to understand manner.
You forgot to mention that the great hacker behind the great hack is Ken Thompson, the genius behind unix
There's a saying that UTF-8 was successful because USA did not need to understand it. (Explanation: they could just keep using ASCII and magically they are compatible with UTF-8).
3 years later, still quality.
-Well, give-or-take a few leap seconds-
Tom Scott explaining UTF-8 in some hotel lobby 9 years ago. Very nice!
While designing ASCII they also chose "00110000" (48) for character zero. This is even more impressive than "a is 1" since you can then XOR any character with the value of character zero to find out if it's a decimal digit ('0' through '9')! :)
In code, for example (note the added parentheses: in C, < binds more tightly than ^, so the check must be written (x ^ '0') < 10):
char x = random(0, 128);
if ((x ^ '0') < 10) {
    // x is a decimal digit character ('0' through '9')
} else {
    // x is NOT a decimal digit character
}
Minor goof by Tom: at 6:25 he writes 0110 0001 and labels it 'A' when it should be 'a'. But it's a great video, and perhaps it was a deliberate mistake to see who was awake in class. I remember being blown away when I first read how Unicode works, but Tom's explanation is so much better than how I learnt it.
Where is he presenting all this? That place looks rather pleasant.
Just want to mention, not that people probably care, that Korean actually has a phonetic alphabet, unlike Chinese and Japanese. The letters do arrange into syllable blocks (e.g. ㅎ[h]+ㅏ[a]+ㄴ[n]+ㄱ[g]+ㅜ[u]+ㄱ[k]=한국[Hanguk, meaning Korea]), so I'm not sure if individual letters are encoded or if entire syllable blocks are encoded, but it is an alphabet nonetheless.
I didn't know that. I remember studying Korea in world history and how it was very different from Japan and China. I guess I never thought about the language being that different. That's cool, and I'm sure it makes keyboards easy for you guys :)
Yeah, it's pretty cool. I'm a Korean-language learner, and I mastered Korean touch-typing (on an American keyboard, no less) in about a month. :)
The Korean alphabet, called Hangeul, was invented by a team of scholars led by King Sejong the Great in 1443 so Koreans wouldn't have to use Chinese characters to write anymore. Whenever I talk to a Korean and the topic of Chinese characters comes up, I always tell them, "I'm very grateful for King Sejong!"
amykathleen2 You might not care, but Japanese texts have a large number of phonetic "letters" as well, unlike Chinese. Although it's technically not an alphabet but a syllabary. (Each "letter" signifying a syllable, rather than a "sound")
Japanese uses a mix of phonetic and non-phonetic characters, and for a significant number of words both phonetic and non-phonetic spellings are common. It's also entirely possible to write any Japanese sentence fully in phonetic characters, but it's practically impossible to make a proper sentence without them. (Although it should be noted most sentences, especially more complex ones, would be significantly harder to read were they written fully phonetically.)
In a modern Japanese sentence such as this:
これは日本語での例文である。
all the curly characters (これは での である) are phonetic, and the more rigid/angled characters (日本語 例文) are usually non-phonetic characters, often identical to characters used in Chinese (汉语 / 漢語). Although there's also a type of angled phonetic characters (カタカナ), which is usually reserved for loan words and foreign names and such.
It's likely you already knew this, but I felt the need to clarify for interested uninformed passersby.
Raizin Yes, I did know the basics. But I didn't know that the two syllabaries had different uses and different "kinds" of shapes, that's really interesting! Some of those angled phonetic characters really look a lot like Chinese characters - like 力 and 夕. I think if that syllabary was the more common one, I wouldn't be able to tell Chinese and Japanese writing apart, as my personal rule is "Japanese is the one with the squiggly characters," haha. Thank you for sharing that information! :D
***** The point I was trying to make is that, since not long after the Korean War, Korean has been written almost *exclusively* using a phonetic *alphabet*. Japanese usually uses a mix of Chinese characters and syllabic characters, while Chinese usually uses Chinese characters exclusively. In modern Korean, Chinese characters are only used in high-level texts, such as medical or legal journals. Everything else is written using the Korean alphabet (which, again, is *not* a syllabary, unlike bopomofo and kana, and is *not* based on borrowed letters, unlike pinyin). Many Koreans can't even write their own names using Chinese characters. So I made my comment because, in the video, he listed several alphabets (English, Cyrillic, Arabic), and then said, "Japanese, Chinese, and Korean characters." This is wrong; Korean uses an alphabet and should have been listed with the alphabets if it was to be listed at all.
Another nice feature: Sorting UTF-8 strings under the assumption they are ASCII strings will sort them correctly in ascending codepoint order. For proper sorting in the context of a language you need of course much more complicated methods, but having some kind of sort that somehow makes sense for some technical applications that can be performed by something that was written for ASCII is already very nice.
6:30 -- 01100001 is not 'A' and it's not 65; it's 97, i.e. 'a'. Or am I wrong?
Are you listening to me Neo, or are you distracted by "Woman in the red jeans" 5:40
Great explanation!
I've never seen a guy explain UTF-8 so well and with such excitement as this fellow here - really great job
Hey, this video actually helped me fix a bug! I was trying to pass an ANSI filename to a function, and it would always fail. When I looked at the variable watch, the string showed up as a bunch of Chinese characters, so I was immediately able to recognize it was being reinterpret-casted to Unicode, rather than the proper typecast I assumed would happen!
This video showed up for me in Dec 2023, 10 years after it launched. And I'm still amazed at how this guy explained it 👍👍
I remember watching this video for the first time back in 2018; it didn't make any sense to me. Now I can appreciate how beautifully he explains the complete journey from ASCII to UTF-8.
The original version of UTF-8 was invented by Thompson and Pike for use in Plan 9 from Bell Labs. There were already ISO standards for character encoding; ISO 10646 is the master character compendium and assigns codes throughout a 31-bit range. I was impressed enough with the Plan 9 scheme that I promoted it in my C Standards column in the Journal of C Language Translation. The advantages of UTF-8 covered in this video helped its adoption by many applications needing to support an international character set. By the way, Plan 9 only implemented the 16-bit range, although the full scheme can encode any 31-bit pattern. The current IETF RFC 3629 unnecessarily constrains UTF-8 to the range reachable by UTF-16 (up to U+10FFFF). I'm at the beginning of the process of trying to undo those restrictions.
This is interesting Doug. I plan to watch this later. Happy 4th to you too.
There will be more with Tom :) >Sean
Tom Scott is the James Grime of Computerphile!
Thank you for providing Korean subtitles. You explained it so well that I could understand it well. Thank you.
Love this guy's enthusiasm and this type of video converting the odd bits of computing like how number phill covers the odd bits of maths rather than teach a full course in those subjects
See the "extra bits" film for a further explanation! (link in the description) >Sean
This was an excellent presentation. Thank you for making it so understandable!
I do have a very minor quibble. At 7:18, there's an error: in a 2-byte Unicode character, having 11 bits available (5 from the header byte and 6 from the continuation byte) only allows 2^11 = 2048 values, not 4096.
I watched this video like 5 times over a long period now. Keep coming back to it, I so love the explanation and the storytelling!
This guy is brilliant at explaining things, please feature him more often!
For a restaurant setup, this is BIZARRELY informational and useful. So strange!!!
It's always interesting to listen to someone who's that passionate, or at least sounds passionate. Even if you don't care about the subject at hand, it somehow becomes interesting when person speaking is passionate about it!
Holy shit, this guy is freaking enthusiastic about it. But he has a point.... I only recently learned the way UTF-8 works and I gotta say, this is some freaking genius hack.
It's always nice when you're watching one of Brady's channels and someone from a completely unrelated channel you subscribe to turns up.
Finally, someone who loves UTF-8 and Unicode as much as me!
cameraman, please take a seat
UTF-8 is love, UTF-8 is life.
Such an incredible enthusiasm just for UTF-8! I’d like to hear you speaking about quantum entanglement 🥴
You talk with so much passion about the subject. I think that's really beautiful. I bet even someone who doesn't understand a thing about computers would come away knowing how important it was.
Definitely one of my favorite Tom Scott videos!
I've never seen someone that passionate about encoding characters.
historical note: Before ASCII there was 5 bit teletype code (upper case only), binary coded decimal (BCD), which was a 6 bit code, and extended BCD interchange code (EBCDIC), an 8 bit code. BCD and EBCDIC were IBM standards adopted by the industry. All used the "trick" of having the letters in collating order; it was the basis for punch card computing.
I am certain that 99.99999 percent of all humans don't care about this at all. Nice video. Good camera work, a competent presenter, no unnecessary music. Good job.
This was unironically riveting for me. I'm amazed at the incredibly clever solutions that make up the foundations of mundane computer operation.
"[...] we don't have mojibake, [...] we have something that nearly works" - Tom Scott, 2013.
I absolutely adore this "nearly" thing.
If you'd asked me to watch a video about UTF-8, I would have given it a pass... but with this guy, I could not stop watching.
This guy is a LOT of fun. He's so enthusiastic! Please have him on again!
This is one of the best computerphile videos. This is the sort of topic explained at the right level to be interesting to most people who (I suspect) subscribe here. Good work!
Another tiny correction: at 1:53, he says a space is all zeros; actually, a space is 0100000 = 32 = 0x20. As he mentions later, all zeros is "end of string".
Very interesting & informative video.
Explained in detail and still very easy to understand.
Thanks for uploading.
Keep up the good work guys...
bingeing computerphile on halloween is a mood
Was this filmed in the St Pancras Hotel?
Yep
***** actually this is just their public bar, our filming location fell through and they were kind enough to let us film there. Anyone can go in >Sean
5:38 i see what you did there xD
This guy's personal channel is in the description. I just checked it out and it's really amazing. You should too.
If you could do more with Tom Scott that would be amazing. I love his videos and these videos, so combining them is just awesome!
This is the first computerphile video I didn't hate. Well done.
Very interesting video explainer. I learnt something new today! (Particularly like the crash zoom at 5:42 to see girl in red pants!)
7:17 Why does he say 4096? You can use 5+6 = 11 bits so wouldn't that be 2^11 = 2048?
These videos are great contributions to human knowledge
This is the simplest explanation of UTF-8 I've heard. Thank you!
I really like Tom Scott's way of explaining.
I like UTF-8 too. It's very useful. I quite like UTF-16 for encoding foreign words in RAM. I wrote a special text editor for writing in different languages and I found UTF-8 to be perfect for saving the text files.
This makes me think of the error-checking header used in PNG files, really a quite clever piece of work that I'd love to see a video on. =)
UTF-8 has to be one of the most beautiful solutions I've ever seen. Loved it since I translated the Unicode page.
In practice, you’ll never really encounter UTF-8 byte sequences with 4 or 5 continuation bytes. In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences, but it’s still enough to represent every possible Unicode symbol ever.
The escape arrow (←) isn't from ASCII; it's in Code Page 437 (the MS-DOS or OEM font) and shares the same position as ASCII's ESC code. In Unicode that arrow symbol is mapped to U+2190, which is 3 bytes in UTF-8: 11100010:10000110:10010000, and it's kept separate from the ESC control character (27, or 0x1B)
this was one of the best videos on this channel, i loved it
One of the best parts of the video is that guy at the very end pondering whether that dude actually would or wouldn't be the last man on the moon.
Thank you so much for sharing so much detailed information!
I always thought of bits and bytes as something I'd never be interested in, but frankly, this stuff gets really interesting the more you get into it.
Greetings and all the best!
Hacks can also be seen as a trick to circumvent limitations. The most important part is the fact that it was not intended to happen as you said. In code it usually refers to code that fixes an issue in a weird way. This can be good or bad just like any other piece of code. In this case it circumvents the limitations of backwards compatibility with ASCII in a very elegant way.
Another advantage of UTF-8 that wasn't mentioned is that if you want to sort strings by Unicode value, you can just treat it as though each byte were a separate character, and it'll just work.
The only real downside to UTF-8 is that you can't seek out a character at a specific index without walking the entire string character by character.
I don't even want to think about the stress of assigning every symbol a number, but great video!
Nothing short of amazing.
Would like to see more videos with Tom! BTW, I love these videos, but one thing I don't like so much is the camera movement at times, so if you don't use one, could you put the camera on a tripod for some parts of an interview? I know it wouldn't work when you have to look at something someone is writing or holding or doing, but for when the camera is focused on someone, it to me would be better to have the camera more stable.
I have been waiting for something on Unicode/UTF-8. THANK YOU, COMPUTERPHILE!
For the people wanting to know where this vid was taken: it's in a cafe called the Booking Office in St Pancras station. I know because I've been there once; it's pretty popular.
Thanks for the history lesson. It is always interesting to remember how we got to where we are today.
I learn more here than my software lessons
In depth explanation. He also shares a cool way to remember what A's and a's codepoints are.
Thanks, Computerphile, for explaining UTF-8. This user tried to understand it from the wiki but couldn't; you make everything simple.
Man... I love these videos, I love all the videos by you.
homework is to watch this
Love the enthusiasm! :)
By the way, the character he encodes into UTF-8 at 7:12 (Unicode character #434) is: Ʋ (U+01B2: LATIN CAPITAL LETTER V WITH HOOK)
I prefer the way he does it now because it seems more realistic and more like a conversation. Zooming in on his face though is funny :)
Hey Tom! Fancy seeing you here :) Great to see you on ye olde Computerphile.
Maybe catch you at the next TDC! Loving the video empire Brady, thanks for bringing us a slice of Tom - Jim.
Patiently watched twice and understood it very well, thanks!
Unicode is a beautiful beacon of hope for a united world and a celebration of the variety of writing systems our species developed. Go Unicode! ツ
Errm, C99 and C11 support Unicode (more or less). C99 added wide characters and variable-length arrays, and C11 was the one that added generics and similar.
Granted, when a lot of people think C, they think C89/C90.
Likewise, a certain major compiler still lacks support for many C99 and pretty much any C11 features.
Simply Elegant … clear explanation
From Wikipedia: In modern computing terminology, a kludge (or often a "hack") is a solution to a problem, doing a task, or fixing a system that is inefficient, inelegant, or even unfathomable, but which nevertheless (more or less) works.
That is actually a very good explanation of UTF-8! I had wondered how the continuation bytes worked for a long time.
It's a hack specifically because it makes 7-bit ASCII into just another valid UTF-8 encoding. You can run a 7-bit plaintext file through a UTF-8 parser and it won't complain, because it's not a special case. It's a hack because it assigns a meaning to the high (eighth) bit in 8-bit ASCII encoding, which had just been an extraneous zero. Since that leading '0' in old ASCII files is a valid UTF-8 header, it makes all 7-bit ASCII files ever written instantly into valid UTF-8 files as well.
You pronounced mojibake pretty well! For anyone who wants google translate or a Japanese English dictionary: もじばけ
Unicode is a miracle in this sense: a lot of the time, standards don't get created, or a de facto standard with lots of ugly warts arises, or a bunch of companies try to create their own standards and everything becomes a huge mess.
But in another sense, Unicode's success seems obvious in hindsight. If there's one thing that needs a clean, universal standard that everyone uses, it's text-data interchange. And the Web has become so important to modern civilization that a standard had to arise.
Simply put, UTF-8 uses an indexing system, which is an old concept familiar to me and any other programmer :)
All 0s in ASCII is NUL. 32 (0100000) is Space.
Brady, please bring this guy on computerphile many more times! and nice restaurant btw :)
Another quibble: he accidentally damns this with faint praise when he says "you have something that nearly works" right at the very end. He meant it nearly *works perfectly*, but it does work excellently.
Great video, I really like the close-discussion format!
Thanks for this! Character encoding always confused me; this video explained UTF pretty well to me.
Great explanation, love his enthusiasm