HOLY SHT!!! This tutorial is so clear, and the visuals make it even easier to understand. Amazing voice as well :)
We Have The Same Color!
This is such a great and thorough explanation of Unicode! I really started to understand. Clear voice - good at explaining - nice visuals. Great job 👍
This is the BEST Unicode tutorial. It explains really well how Unicode and UTF-8 work. I was wondering how UTF-8 knows where a character starts and where it ends, and you explained it very well. Thank you so much.
Wow!!! I fully agree that this is the best tutorial on unicode utf-x.
Thanks!!!
This was the most concise video on Unicode I have seen. Thank you sooo much.
This is absolutely one of the best Unicode tutorials I have seen anywhere. I was thinking that it was an error to say "before 2000, Unicode was 16 bits" as I memorized July 1996 as the point that changed, but the RFC only came out in 2000.
The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996.[8] It is fully specified in RFC 2781, published in 2000 by the IETF.[9][10]
It’s worth rewatching. Good stuff.
Amazing explanation of concepts. Subscribed, liked, and Favorited for the algorithm!
Unfortunately, videos like this are not in high demand, which really sucks. Your explanation was flawless. I'm at a loss for words.
Explanation
very good and clear explanation. perfect for refreshing my knowledge of utf encodings
hands down best tutorial I watched in a while
You explained to me what my teacher couldn't in hours?! You're fire, bro. Keep going! You deserve a lot more attention!!!
Awesome explanation man, very clear and visual. I love it. Greetings from Argentina!
This video confirmed to me that Unicode is just as complex as people make it out to be. 🙂 However, it does make logical sense.
Amazingly done. Definitely added clarity for me.
This a fantastically accessible explanation -- thank you!
That has to be the best hidden yet amazing tutorial on YouTube. I'm shocked this doesn't have more views/subscribers.
A great contribution to the Unicode tutorial space. Thanks a lot, brother.
Just a gem on this topic, definitely!
Thanks for the amazing explanation
man I'm so thankful ... this helped me a lot
Thank you for this! Really helped me understand :)
Thank you. It's very clear and easy to understand.
I spent quite some time learning about Unicode, character sets and such. One thing I couldn't find was a comprehensive description on how surrogate pairs work. This video just made the perfect, easy to understand example, then went on and explained UTF-8 perfectly. I wish I started this whole topic with this video.
Three months in, it's an even better refresher.
Thank You So Much! I was trying to figure out the same thing... How do we fit so many languages and emojis in UTF-16 (which is 16 bit)... I searched a lot and finally came to this video. This is absolute gold!
So well put together.
what an amazing explanation
Thank you very much sir
Thank you so much for this. Really clarified everything for me.
That is all I need; it's clear and super easy to understand. Thank you a lot.
So well explained! Subbed!
awesome explanation!
Amazing.
I learned a lot from this video, thank you.
This is so good, thanks!
Thanks so much for this tutorial much appreciated
great video, thank you so much!
really good explanation thank you very much
For anyone struggling with the concept of UTF-8, it might help to encode some characters manually the following way:
1. Take a character and find its binary representation (the index in binary).
2. See how many bits it's taking.
Time to play with UTF-8. Let's start with the '110' header. Two ones mean two bytes total. In the first byte, you have 5 positions left for your character.
In the second, you have the '10' at the start, meaning it's a continuation byte that continues some previously started character. There's 8 - 2 = 6 bits left in the second byte, which amounts to 11 available bits in total. If you find yourself having leftover bits, just pad the character with some zeroes at the start until it's 11 bits. The same goes for higher numbers.
Too little space? Let's go "1110", three bytes. The available positions are now (8 - 4) + (8 - 2) + (8 - 2) = 4 + 6 + 6 = 16 bits. And so on. You might even write a simple computer program that does this for you with some loops. Whatever feels intuitive.
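To make that manual exercise concrete, here is a minimal TypeScript sketch (the name encodeUtf8Manually is just made up for illustration) that follows the same steps: pick a header based on how many bits the code point needs, then fill the payload and continuation bytes.

```ts
// Minimal sketch of the manual process described above.
// Encodes a single code point into UTF-8 bytes by hand.
function encodeUtf8Manually(codePoint: number): number[] {
  if (codePoint < 0x80) {
    // 1 byte: 0xxxxxxx (plain ASCII range)
    return [codePoint];
  }
  if (codePoint < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx -> 5 + 6 = 11 payload bits
    return [0b11000000 | (codePoint >> 6), 0b10000000 | (codePoint & 0b111111)];
  }
  if (codePoint < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 = 16 payload bits
    return [
      0b11100000 | (codePoint >> 12),
      0b10000000 | ((codePoint >> 6) & 0b111111),
      0b10000000 | (codePoint & 0b111111),
    ];
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -> 3 + 6 + 6 + 6 = 21 payload bits
  return [
    0b11110000 | (codePoint >> 18),
    0b10000000 | ((codePoint >> 12) & 0b111111),
    0b10000000 | ((codePoint >> 6) & 0b111111),
    0b10000000 | (codePoint & 0b111111),
  ];
}

// 'é' is U+00E9 -> [0xC3, 0xA9]; '💩' is U+1F4A9 -> [0xF0, 0x9F, 0x92, 0xA9]
console.log(encodeUtf8Manually('é'.codePointAt(0)!).map(b => b.toString(2).padStart(8, '0')));
console.log(encodeUtf8Manually('💩'.codePointAt(0)!).map(b => b.toString(16)));
```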
Awesome video man, thanks for the help!
Best content 💯
You have made me feel like a natural 🦄 happyguy
great video
Thank you SO SO much!!!!! I'm Russian and couldn't find any good video about UTF-8 in my language, but you saved this student's ass 😆😆
Thank you! 🎉
Can you please send me the link for the article shown at the end of the video?
Thank you so much, it's an amazing tutorial 😍
Why is John Patrick Lowrie teaching me about text characters?
This is such a great video. Thank you for the time and effort you put into it.
Why subtract 0x10000 for the poop emoji?
Thank you so much :) :)
Thanks for a very good video. If the current Unicode system allows a maximum of 21 bits for 4-byte UTF-8, that means 16 bits (65,536 code points) for every plane, with the remaining 5 bits identifying the plane. So a maximum of 32 planes would be possible. Right now, planes 0 to 16 are defined. Planes 0, 1, 2, 3, and 14 are assigned, planes 4 to 13 are not assigned, and 15 and 16 are private use. So does it mean that another 15 planes (planes 17 to 31) are not even defined? We may not need them even for the next 10,000 years (or maybe never). But what do others think?
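For anyone curious, a small sketch of the arithmetic behind this question (the name planeOf is just made up here): a plane is the code point divided by 0x10000, and 21 raw bits could in principle address 32 such planes, but the Unicode codespace is deliberately capped at U+10FFFF, i.e. 17 planes, so it stays representable in UTF-16.

```ts
// Which plane does a code point live on? Each plane holds 0x10000 (65,536) code points.
function planeOf(codePoint: number): number {
  return codePoint >> 16; // same as Math.floor(codePoint / 0x10000)
}

console.log(planeOf(0x0041));   // 0  -> Basic Multilingual Plane
console.log(planeOf(0x1F4A9));  // 1  -> Supplementary Multilingual Plane (emoji, etc.)
console.log(planeOf(0x10FFFF)); // 16 -> last valid code point

// 21 raw bits could address 2^21 / 2^16 = 32 planes...
console.log(2 ** 21 / 2 ** 16); // 32
// ...but Unicode stops at U+10FFFF, i.e. 17 planes (0 through 16), because of UTF-16.
console.log((0x10FFFF + 1) / 0x10000); // 17
```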
Amazing!
What did you use to create that document?
This is such an excellent video. The one question that I had was would it make sense to focus even more on hexadecimal rather than decimal (called denary by some pedants because there are no decimal points, we just have integers, but I saw the term used in one great video)...when working with Unicode shouldn't we be mostly just thinking about hexadecimal representations, the actual character names (and block names and script names and whatever) and the glyphs when they get rendered? I realize that not everyone grew up converting binary images drawn on graph paper to hex codes to make video games, but -- isn't it easier to just forget about decimal/denary entirely when your mind is in Unicode-space? It is way easier to see binary-hex relationships than binary-decimal ones. Just a question, the video is awesome either way. Even in ASCII, I normally prefer to think of 'A' as 0x41 rather than 64, and 'a' as 0x61 rather than 97.
Thanks for the kind words!
Yes, that's a very good point. It's definitely easier to see the relationship between binary and hexadecimal than binary and decimal. When dealing with binary data, we also almost always view it in hex format.
I decided to focus on decimal/binary here primarily to get people outside of the "think in memory" space. My teaching philosophy is based on the idea that there is often a lot of assumed information in tutorials/instructions and it would be best to pretend someone has very little background in the topic when explaining it. Hexadecimal is definitely easier to translate between the two, but I wanted to get the point across about them being real numbers corresponding to this massive Unicode lookup table, rather than just some memory concept.
To elaborate, you mentioned that you prefer to think of 'A' as 0x41 rather than 64, but it is in fact 65, not 64. I wanted to strengthen the relationship between the memory representation and the human readable one. I don't personally count in hexadecimal, so it's easier for me to see that 0xA3 is larger than 0x56 when we translate them into decimal (163 vs 86).
Of course at the end of the day, I don't think I would have lost an enormous amount of pedagogical power either way. I could have used hexadecimal and not too much would have changed.
Thanks for the comment!
@@EmNudge I totally get it now. I don't think of hex numbers as only about addresses and encodings, but also as numbers like code points and other real-world values, so I didn't see it that way. I am about equally good in addition in either hex or decimal. When subtracting I have to convert to binary (or even for complementing) and can only do small ones in my head without scratch space...while going forward I think I will continue to always think of the hex values of code points, rather than the decimal ones (most unicode.org materials seem rather hex based) but I learned some more from your answer.
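To tie the two views in this thread together, here is a tiny sketch showing the same code point in decimal, hex, and binary (nothing beyond standard JavaScript number formatting):

```ts
// The same value, three ways: 'A' is code point 65, which is 0x41 in hex and 0b1000001 in binary.
const a = 'A'.codePointAt(0)!;
console.log(a);              // 65
console.log(a.toString(16)); // "41"
console.log(a.toString(2));  // "1000001"

// Comparisons work the same regardless of how you write the literal: 0xA3 is 163 and 0x56 is 86.
console.log(0xA3 > 0x56, 0xA3, 0x56); // true 163 86
```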
are u the dude from HL2 Best Impressions ??? Your Voice Is SO DOPE
In theory you would only need one rule to make something similar to UTF-8: a character's last byte has a 0 in the most significant bit, and all the previous bytes have a 1 in the most significant bit. So a 4-byte character would look something like 1XXXXXXX 1XXXXXXX 1XXXXXXX 0XXXXXXX, a 2-byte character would look like 1XXXXXXX 0XXXXXXX, and a 1-byte character would look like 0XXXXXXX, where the Xs represent any value of 0 or 1. This is interesting because you can extend it to any bit width. For instance, if most of the numbers you store will be 64 bits wide, you just make a variation where a 0 in the most significant position is followed by a final block of 64 bits, and a 1 means there is at least one more 64-bit block (plus its own flag bit) after this one.
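Here's a rough TypeScript sketch of that single-rule scheme (not real UTF-8; it's essentially the varint/VLQ idea used in some formats, and the function names are made up): the top bit of every byte except the last is 1, the last byte's top bit is 0, and each byte carries 7 payload bits.

```ts
// Sketch of the scheme described above (NOT real UTF-8): the top bit of each byte says
// "another byte follows" (1) or "this is the last byte" (0); the low 7 bits carry payload.
function encodeVarWidth(value: number): number[] {
  // Collect 7-bit groups, least significant first.
  const groups: number[] = [];
  do {
    groups.push(value & 0x7f);
    value = Math.floor(value / 128); // avoid bitwise shifts so larger values still work
  } while (value > 0);
  // Emit the most significant group first; all but the final byte get the continuation bit.
  return groups
    .reverse()
    .map((g, i) => (i === groups.length - 1 ? g : 0x80 | g));
}

function decodeVarWidth(bytes: number[]): number {
  let value = 0;
  for (const b of bytes) {
    value = value * 128 + (b & 0x7f);
    if ((b & 0x80) === 0) break; // top bit clear -> last byte of this value
  }
  return value;
}

const bytes = encodeVarWidth(128169);
console.log(bytes.map(b => b.toString(2).padStart(8, '0'))); // 10000111 11101001 00101001
console.log(decodeVarWidth(bytes)); // 128169
```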
Awesome explanation🔥. Where can I find the diagram?
Good question! Decided to upload it since you asked. I should have done it a long time ago.
Here you go:
resources.emnudge.dev/files/Unicode.pdf
Thank you
Amazing video. Thanks so much. However, I still have a doubt. What about UTF-8 encoding vs UTF-8 with a BOM? How exactly does the BOM work when encoding a file, and why do some systems or platforms need it to display everything correctly while others get messed up when the BOM is there? This is something I deal with frequently and never got to understand how it works.
Thanks again!
From what I can tell, Windows is one of the few, if not the only, systems that doesn't default to UTF-8. Unicode themselves don't recommend using the Byte Order Mark (BOM) to indicate that a file is UTF-8 and instead just say to assume UTF-8 until proven otherwise.
The BOM is more useful with UTF-16 and UTF-32 since it can signal to the OS whether the text was encoded in big endian or little endian, which refers to the order that the bytes within a 16-bit or 32-bit block are stored. UTF-8 only has 1-byte blocks, so there's no need for determining the ordering. (There's only 1 way to order 1 byte.)
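As a rough illustration of this reply, here's a sketch that sniffs the first few bytes of a buffer for the common BOMs (the byte sequences are the standard ones; the function name detectBom is just made up for this example, and real code would handle short buffers more carefully).

```ts
// Sniff a byte order mark at the start of a buffer. Returns the detected encoding,
// or null if there is no BOM (which is the normal case for UTF-8 text).
function detectBom(bytes: Uint8Array): string | null {
  const [b0, b1, b2, b3] = bytes;
  if (b0 === 0xef && b1 === 0xbb && b2 === 0xbf) return 'UTF-8 (with BOM)';
  // Check UTF-32 before UTF-16, since the UTF-32 LE BOM also starts with FF FE.
  if (b0 === 0x00 && b1 === 0x00 && b2 === 0xfe && b3 === 0xff) return 'UTF-32 BE';
  if (b0 === 0xff && b1 === 0xfe && b2 === 0x00 && b3 === 0x00) return 'UTF-32 LE';
  if (b0 === 0xfe && b1 === 0xff) return 'UTF-16 BE';
  if (b0 === 0xff && b1 === 0xfe) return 'UTF-16 LE';
  return null;
}

// The BOM itself is just U+FEFF written out in the file's own encoding.
console.log(detectBom(new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]))); // "UTF-8 (with BOM)"
console.log(detectBom(new Uint8Array([0xff, 0xfe, 0x41, 0x00])));       // "UTF-16 LE"
console.log(detectBom(new TextEncoder().encode('hi')));                 // null (no BOM)
```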
Hey, awesome presentation! How did you make that PDF and the drawings? What tools did you use?
Thanks! I used Notability on the iPad for this and just exported to a PDF.
excellent
7:40
It only holds 65,000 code points, so what's the point of a surrogate that's only in the 50,000s?? Don't you need 128,000?
Good question!
Instead of thinking about it in terms of numbers, try thinking about it as "bits".
We're using 2 ranges of 1,024 numbers each, but let's look at what the starting numbers of these ranges are.
55296 -> d800 (in hex) -> 1101100000000000 (in binary).
56320 -> dc00 (in hex) -> 1101110000000000 (in binary).
These each have about 10 zeros at the end. If we store one big number in those 10 zeros, then we can use 20 bits in total across the two surrogates in a pair! We just need a way of adding a number into these surrogates and extracting it out later.
10 bits only gets us 1,024 values, but by combining these two sets of 10 bits together, we can express numbers as large as 20 bits, giving us 1,048,576 possible values!
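To make that bit trick concrete, here's a small sketch of the surrogate-pair math: subtract 0x10000, then split the remaining 20 bits into two groups of 10 (the function names are just illustrative).

```ts
// Split a supplementary code point (>= 0x10000) into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint: number): [number, number] {
  const offset = codePoint - 0x10000;    // now a 20-bit number
  const high = 0xd800 + (offset >> 10);  // top 10 bits go into the high surrogate
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits go into the low surrogate
  return [high, low];
}

function fromSurrogatePair(high: number, low: number): number {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [hi, lo] = toSurrogatePair(0x1f4a9); // 💩
console.log(hi.toString(16), lo.toString(16));      // "d83d" "dca9"
console.log(fromSurrogatePair(hi, lo) === 0x1f4a9); // true
// JavaScript strings store exactly these two code units:
console.log('💩'.charCodeAt(0).toString(16), '💩'.charCodeAt(1).toString(16)); // "d83d" "dca9"
```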
you should make a video on "yield"
But how does UTF-8 translate back to the poop emoji? Is there no code point to check it against?
We do the same thing in reverse.
* We read 11110000, which tells us this is the start of a 4-byte chunk (since there are four leading 1s)
* We read the next 3 bytes (10011111 - 10010010 - 10101001) and remove the first 2 bits (the '10' prefix) from each
* We stitch those payload bits together (along with the 000 left over in the first byte) to get 000011111010010101001, which is our code point: 128169
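A sketch of that reverse process in code (the helper name is made up, and a real decoder would also validate the continuation bytes):

```ts
// Decode one UTF-8-encoded code point from a list of bytes, following the steps above.
function decodeUtf8CodePoint(bytes: number[]): number {
  const first = bytes[0];
  if (first < 0x80) return first; // single byte, plain ASCII
  // Count the leading 1s in the first byte to learn how many bytes this character uses.
  let length = 0;
  while ((first << length) & 0x80) length++;
  // Start with the payload bits of the first byte, then append 6 bits from each continuation byte.
  let codePoint = first & (0xff >> (length + 1));
  for (let i = 1; i < length; i++) {
    codePoint = (codePoint << 6) | (bytes[i] & 0b111111);
  }
  return codePoint;
}

console.log(decodeUtf8CodePoint([0b11110000, 0b10011111, 0b10010010, 0b10101001])); // 128169 (💩)
console.log(decodeUtf8CodePoint([0xc3, 0xa9])); // 233 (é)
```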
We can't apply Huffman here, right?
We might apply this sort of compression after already encoding characters, but we'll eventually have to decompress and read the characters into memory. We can't index into a Huffman-encoded binary, unfortunately, and that's one of the requirements for these encoding formats.
Huffman encoding also prevents us from appending characters to a file or list without needing to re-encode the whole blob each time. It's great as a compression algorithm, but not as an encoding format.
Lovely
thanks!
Good video
Hey, how does the computer render 65 as 'A' to the screen and not the number 65? Awesome video by the way.
In a word: context
If the next byte is in a context where we are expected to read text, we read text. If not, we read numbers. Maybe the numbers even refer to an instruction.
At the end of the day, all we have are bits. 1s and 0s. We can only ever know the meaning of some stream based on the context we see it in. We need to know beforehand to expect text and also what encoding format we're using.
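As a tiny illustration of that context point (just standard JavaScript APIs, nothing from the video): the bits are identical; only the interpretation we choose differs.

```ts
// One byte, two interpretations.
const byte = 0b01000001; // 65

console.log(byte);                      // 65  -- read as a number
console.log(String.fromCharCode(byte)); // "A" -- read as a character code
console.log(new TextDecoder('utf-8').decode(new Uint8Array([byte]))); // "A" -- read as UTF-8 text
```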
@@EmNudge Thanks for a great video. Our existence and experience of this world is also based on context. Now we are thinking we are our bodies, and the objects in this world belong to or are related to us. When we see in greater context, we shall realise that we have nothing to do with this gross physical (and one more subtle physical) body, or the objects of this material world; and we are living in a sort of virtual reality. It does not mean that our existence is false; it is just that our current experience of existence in this world is temporary. Our existence in terms of our real identity is eternal (we are eternal non-material spiritual entities or 'sparks', eternally a part and parcel of the Supreme Personality of Godhead (in short, God)). Thanks if you read all the way till here.
Wow, I am having nightmares getting the encoding wrong with my server. 24:10 Yes, my server works with files in different languages. Let me tell you, UTF-8 is a NO NO for working with languages other than English. Yesterday I upgraded my file processing logic to work with UTF-16 encoding, and my nightmares seem to have gone away.
Great video!
One note on UTF-16 taking up less space with Asian letters: utf8everywhere.org/#asian
I haven't tested this myself, but it would seem reasonable that if Latin and Asian letters are used together (like in websites with HTML tags), UTF-16 doesn't take up less space.
And it's when the text only contains Asian letters that UTF-16 does take up less space.
Again, haven't tested and confirmed for myself. But this leaves UTF-16 still useful in some cases :)
UTF-8 really does seem like the best encoding method tho.
Especially since, when it was originally developed by Ken Thompson, it allowed for 1-6 byte long encodings (which would allow it to encode all 31 bits).
It wasn't until 2003, I believe, that the UTF-8 standard changed so it would be at most 4 bytes.
I think this was done to match the limitations of UTF-16, which can at most encode up to code point 0x10FFFF, and that would be 4 bytes in UTF-8.
Correct me if I am wrong.
Edit: Took me AGES to notice I made a typo.
Originally, I wrote "Which would allow it to encode all 21 bits".
When I meant: "Which would allow it to encode all 31 bits".
Now that the standard is 4 bytes max, it can encode 21 bits.
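If anyone wants to actually test the "Asian text plus markup" claim, here's a rough sketch. It counts real UTF-8 bytes with TextEncoder and approximates UTF-16 as two bytes per code unit (ignoring any BOM); the function name byteSizes is made up for this example.

```ts
// Compare the storage cost of a string in UTF-8 vs UTF-16.
function byteSizes(text: string): { utf8: number; utf16: number } {
  const utf8 = new TextEncoder().encode(text).length; // actual UTF-8 byte count
  const utf16 = text.length * 2; // JS strings are UTF-16 code units, 2 bytes each
  return { utf8, utf16 };
}

console.log(byteSizes('こんにちは'));        // pure Japanese: UTF-8 loses (15 vs 10)
console.log(byteSizes('<p>こんにちは</p>')); // with ASCII markup it already flips (22 vs 24)
console.log(byteSizes('hello'));            // pure ASCII: UTF-8 wins easily (5 vs 10)
```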
Agreed with the first parts: if you had pure blocks of text without a lot of markup, that is a use case. And of course, some people really want to be able to index quickly; there were reasons that a fixed-width representation was preferred even when it would take more space... I know for sure that they state that UTF-16 will never be expanded, no matter what; I don't remember if they explicitly say that about UTF-8. They go on to explain that they are enumerating characters, not glyphs, so the current 17 * 64K code points they have should be more than enough "forever". We have already seen some people joking with "640K should be more than enough for anybody" paraphrases, but it seems reasonable.
@@jvsnyc I don't understand why some people want a fixed-width encoding so they can "index faster", because... Well... Unicode has combining characters and all sorts of magic in it, so that even if you used UTF-32, you wouldn't be guaranteed to be able to index to a specific "character"... Unicode is a mess because human languages (and symbols) are a mess :')
Yeah, the limit we have now is massive; as of version 13.0, there are 143,859 "characters" defined (stole that from a Google search)...
That's about 12% of all the possible encodings... And this includes symbols, all languages used today (almost), emojis, and even stuff that isn't used anymore, like Nordic runes...
We're not even halfway. Plus, when you look at how Unicode handles country "flag" emojis (among other things), it's clear they are not that wasteful either; they're fairly clever about avoiding too much waste.
If the day ever comes and we NEEED MOAR SPACC!!!
By that time, I imagine UTF-16 having been abandoned completely, and the artificial limitation of UTF-8 could be lifted... Tho I doubt that will ever happen... I will eat my socks if I ever get to see that day come true :'D
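Since combining characters came up: here's a tiny sketch of why even counting code points doesn't give you "characters"; the same visible letter can be one code point or two (only standard string methods are used here).

```ts
// "é" precomposed (U+00E9) vs "e" + combining acute accent (U+0301). They render the same.
const precomposed = '\u00e9';
const combined = 'e\u0301';

console.log(precomposed, combined);                          // é é
console.log(precomposed.length, combined.length);            // 1 2  (UTF-16 code units)
console.log([...precomposed].length, [...combined].length);  // 1 2  (code points)
console.log(precomposed === combined);                       // false
console.log(precomposed.normalize('NFC') === combined.normalize('NFC')); // true
```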
@@J_Tanzanite I had a super-long reply that covered all kinds of stuff, and accidentally lost it before I posted. I will just say that I should have said "code unit" rather than "character" and that people were worried because pre-Unicode DBCS/MBCS solutions were nightmares in that it was very easy to "get lost" and to not know what you were looking at, if you were in the middle of a character or at the start of one, etc. etc. etc. -- Unicode learned from all that and many of its features represent making it impossible to encounter problems that were encountered all the time with non-fixed-width systems of the past. In particular, both UTF-8 and UTF-16 have ingenious features that solved all of those.
It left remaining, due to its passion for respecting pre-existing tradition, the issues that there are many different code points for some things that to every rational (non-programmer) person seem to be the same character, and numerous different ways to represent the same thing either as combinations of code points or single all-in-one code points. They weren't starting from scratch, and I think they did a great job of solving problems that used to make programmers envy clowns at children's birthday parties, those that worked in bowling alleys, and pretty much any other job. It was SO much harder writing code that worked in both Japanese and English and... before Unicode, even if it remains non-trivial now.
@@jvsnyc Sorry for late reply... I just have to say, what a great comment you made.
I am spoiled being born when I was and learning programming when I did. I do not envy the people of the past having to deal with worse solutions than what we have now.
I'm still fairly new to learning about Unicode and encodings, but yeah, while having multiple ways to represent a "character" is kinda dumb (and wasteful) - it also has one interesting effect...
Latin-1 can more or less be interpreted as "single byte codepoints" for the Latin-1 Supplement in Unicode... Almost... It's weird.
It still bothers me how I sometimes see websites or programs that - for some reason - still think Latin-1 is in use... Like, trying to read UTF-8 as Latin-1... Or worse...
I'm Norwegian, and we have three extra letters: ÆØÅ (and ofc æøå).
And for some reason, there is this Norwegian government website that doesn't handle those correctly in ALL situations; it's inconsistent and will (somehow) convert them to Latin-1 and then re-encode them to UTF-8, but horribly botched (skipping every second character... I'm thinking it has to do with those letters being two bytes in UTF-8... But I can't make sense of it)...
Starting to wonder if there is a bug there, that can be exploited :P
@@J_Tanzanite Possibly. If something works for all the characters that are one byte in UTF-8 and not for the others, that seems a good candidate that some stage in their processing makes a lame-brained assumption that text is one byte...if they handle other UTF characters that are more than one byte correctly it would be something more specific to those letters possibly.
There was a skit on Saturday Night Live called "Toonces, the Cat Who Could Drive a Car". Every skit showed badly faked video of a cat driving a car, with obvious fake paws, etc. -- and every one ended with him driving the car off a cliff and the car exploding in a fireball. All that varied was the guests in the car and the specific comments they made, usually including a recurring character saying "See, he CAN drive a car - just not very well!!"
The whole world has gone Unicode now, but it is a little like Toonces - there are many places where pre-Unicode mindset causes problems on edges and in corners. In about 2010 thru 2015, I was absolutely blown away by how many bug fixes there were in text-processing software that had been available since the 90's or even the 80's. It seemed literally impossible, millions of people had been using these libraries for decades, no way, right? When I looked at them in detail, almost all involved peculiarities of proper Unicode support. It turns out it took years to find and fix all the little places people made 8-bit or 7-bit assumptions somewhere, as these were marked nowhere, it was just how we did things. When Unicode came along, people corrected these as they were noticed, and it is taking a long, long time to get to 100%. So in the last 15 years the world went from almost no Unicode support to "almost everywhere" but we still see a lot of cats driving cars off cliffs and exploding into fireballs, as in the case you mentioned and so very many more.
I consider it part of the responsibility of the "Unicode Aware" to help find and fix these. Unicode is somewhat complex and involved, but nothing compared to the nightmares that came before it.
Why does JavaScript use UTF-16??? Please answer.
Very cool tutorial except for 1 part.
17:21 is incorrect. It takes 2 seconds to realize why honestly.
Yep, minor mistake there: it should be that binary grows slower than decimal, since powers of 2 are smaller than powers of 10. Otherwise, incredible tutorial!
thx
AmAzInG
The first bits in the first byte for UTF-8 represent the number of bytes you are using. For 1 byte you start with 0, and the next 7 bits are basically ASCII. For 2 bytes you start with 110, and you need the 0 after. The next byte, and any more after it for the other sizes, always starts with 10. So 2 bytes would be 110? ???? 10?? ????, and 3 bytes would be 1110 ???? 10?? ???? 10?? ????. And 4 would start with four 1s and a 0, then the same format for the other bytes.
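As a sketch of that rule, here's a small function (the name utf8SequenceLength is made up for illustration) that reads a leading byte and reports how many bytes the character occupies, using only the header bits:

```ts
// How many bytes long is the UTF-8 character that starts with this byte?
function utf8SequenceLength(firstByte: number): number {
  if ((firstByte & 0b10000000) === 0) return 1;          // 0xxxxxxx -> ASCII
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  throw new Error('continuation byte (10xxxxxx) or invalid leading byte');
}

console.log(utf8SequenceLength(0x41)); // 1 ('A')
console.log(utf8SequenceLength(0xc3)); // 2 (start of 'é')
console.log(utf8SequenceLength(0xf0)); // 4 (start of '💩')
```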
How are 21 bits possible when the smallest memory size is 8 bits?
I mean, 2*8 is 16
and 3*8 is 24
We could still pack things oddly if we'd like.
If we're trying to represent the number 1224841 in 21 bits, we get 100101011000010001001. Split into 8-bit chunks, we get '10010101', '10000100', '01001XXX', where X is nothing. Of course, if we're using 8-bit chunks then we need to fill this with SOMEthing, so we can just put the next number there. As long as we know to account for this, we can do some bit shifting and retrieve the true number.
Systems often don't work like this, however. There have existed computers with architectures that use odd bit widths, like IBM's 31-bit machines in 1983, but these days we usually don't do this.
In the video I believe I did mention this briefly and how instead of going to 21, the next step was UTF-32.
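Here's a rough sketch of that "pack it anyway and shift it back out" idea, padding the three leftover bits with zeros rather than with the next number to keep it short (pack21/unpack21 are made-up names):

```ts
// Pack a 21-bit number into 3 whole bytes by left-shifting it 3 bits,
// then recover it later with the reverse shift.
function pack21(value: number): [number, number, number] {
  const shifted = value << 3; // 21 bits + 3 padding zeros = 24 bits = 3 bytes
  return [(shifted >> 16) & 0xff, (shifted >> 8) & 0xff, shifted & 0xff];
}

function unpack21(bytes: [number, number, number]): number {
  return ((bytes[0] << 16) | (bytes[1] << 8) | bytes[2]) >> 3;
}

const packed = pack21(1224841);
console.log(packed.map(b => b.toString(2).padStart(8, '0'))); // 10010101 10000100 01001000
console.log(unpack21(packed)); // 1224841
```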
Benstokes!! I didn't understand anything at all.
Lost me at surrogate pairs 😢
bad...
Great video
Great video