This alone is the best Unicode video explanation on all of YouTube, 100x better than the (maybe) second place from Computerphile.
Computerphile's explanation was very concise as well. I don't understand why you are blatantly throwing shade on them like that lol
@@energy-tunes Computerphile's presentation appears cluttered to me, while Alex's seems tidy and to the point, with a flow reminiscent of a river.
@@ankitchabarwal6814 Nothing about it is "cluttered". The difference is that Computerphile covers the most important aspect of UTF-8: how it knows how many bytes to read for multibyte code points, while this one leaves it out. I am not blaming this video, as that part is a little confusing to explain, but this video is by no means the best.
@@rz2374 It is best in one respect: it keeps to the general encoding itself. The only improvement I'd suggest regarding UTF-8 is pointing viewers elsewhere for the UTF-8 encoding strategy…
I didn't understand it clearly... Can you explain it to me?
Another fun fact about the way letters are laid out in ASCII: A capital letter's corresponding lowercase counterpart is exactly 32 values ahead. This is because 32 is a power of 2, and makes it so that you only need to flip the 6th bit (from the right) of a byte to change the case of a letter.
ASCII digits are also nicely laid out; the low nibbles of characters '0' to '9' are 0 to 9.
Yup. You only need to XOR to flip back and forth
@@Michael75579 yup. You can subtract from '0' to get decimal equivalents.
This is how most assembly programmers do it.
NOT ONLY ASCII
I was going to say that it lost its use in Unicode because it doesn't work with other alphabets, but I went to check and… it works for Cyrillic and Greek as well!
1101000010010000 is А (cyr)
1101000010110000 is а
1100111010010001 is Α (greek)
1100111010110001 is α
1101000010010110 is Ж
1101000010110110 is ж
Edit: Although it doesn't work for every character,
1101000010101111 is Я and
1101000110001111 is я
@@bororobo3805 That's not true; most good programmers will use a bitwise AND, OR, or XOR rather than an arithmetic operation. With a barrel shifter, which most CPUs have, it will always be at least as efficient as an arithmetic operation.
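A quick Python sketch of the bit tricks discussed in this thread (just an illustration, not from the video):

print(chr(ord('a') ^ 0x20))         # 'A' : flipping the 6th bit (value 32) toggles ASCII case
print(chr(ord('G') ^ 0x20))         # 'g'
print(ord('7') & 0x0F)              # 7   : the low nibble of an ASCII digit is its value
print(ord('7') - ord('0'))          # 7   : same result via subtraction

# The Cyrillic observation above also checks out: in UTF-8, А (U+0410) and а (U+0430)
# differ in a single bit of the second byte, so XOR-ing that byte with 0x20 flips case too.
b = bytearray('А'.encode('utf-8'))  # bytes d0 90
b[1] ^= 0x20
print(b.decode('utf-8'))            # 'а'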
This is the type of content that YouTube should suggest to everyone. Great explanation, thank you.
Great explanation, thank you! For anyone who may still not understand: UTF-8 CAN get up to 32 bits and be as big as UTF-32, but only if it has to. Otherwise, it just uses the minimum amount of space (8 bits) and expands as required, depending on the code point.
So I would still allocate 32 bytes, right?
I use a typed systems language.
@@friedrichmyers You mean 32 bits. And it depends on your use case. What do you mean by typed systems language? If you use 32 bits at all times you will be using a lot of memory. If the programming language you are using is modern, it should be able to handle UTF-8 just fine and expand variably, as required.
7:50 Worth pointing out that Python 2 is way past its end of life now, and in Python 3 all strings are Unicode-aware and the u modifier does nothing (it's only there for backwards compatibility). Using Python 2 is a good way to showcase the difference between Unicode-aware and unaware functions, but it could confuse some beginners trying to replicate what you're doing, who will likely be using Python 3 and might not be aware of the difference.
This also caught me off guard. Python 2 has been deprecated for such a long time by now.
Only officially, and even then another point release was considered. For some platforms there is no py3 port, and plenty of py2 code is still around being used.
@@mokovec It's definitely still out there, but the official end of life was almost 3 years ago by now and it's finally dying for real. The consideration of "another point release" has long passed. On my distro, Python 2 isn't installed by default anymore, and on Windows you need to really go out of your way to install it. Platforms where 2 is still the only option are few and far between, and certainly not the ones beginners are using. And there's definitely no reason to encourage starting to use it now.
@@1vader I don't see anyone encouraging it.
@@1vader This video is not a programming course, it's an explainer on UTF-8.
I assume he chose Python 2 on purpose, specifically to demonstrate the differences in a simple language (so that a beginner wouldn't get too distracted by the syntax) that has an explicit delineation between Unicode and non-Unicode strings. Of course we shouldn't be writing production-ready code based off this video lol, and the video isn't encouraging it (he literally says "check out the applicable string behaviour and libs in your own language"); it's just a teaching tool.
I've listened to quite a few Unicode tutorials. This one blows the others out of the water. Clear. Concise. Good tempo. Thx
FINALLY someone is able to explain this clearly. Most other videos complicate this so much. Thank you!
It took a lot of searching for me to understand Unicode (also UTF-8 and the others). I did, but some things were still ambiguous to me; this guy literally taught me the whole topic in only 10 minutes.
One small correction: a grapheme is a part of a particular writing system. Writing systems are always language-related. Unicode does not reflect any particular writing system. Unicode, and this is probably the smartest choice that could ever be made, maps numbers (code points) to character descriptions or names. This way Unicode is detached from any font and therefore from any particular shapes. This results in a code point relating not to what we want to see, but to what we mean, making it more abstract.
A good example is U+0067 and U+0261. In most writing systems based on Latin script, they are allographs, variants of the same grapheme, so if Unicode were to contain graphemes, there should be only one character. But this is not the case. A writing system may prefer a particular glyph (a shape of a letter), and that glyph could be the main variant (allograph) of the grapheme in that system. In most Latin-based writing systems, the ordinary 'g' (U+0067) is the main variant, but in the International Phonetic Alphabet it is the script 'ɡ' (U+0261), because of its correlation with other similar characters.
Man this channel is golden. The most difficult topics here are explained so easily and in such a unique way
This video is like the sum of the most important things about unicode and ascii, very well done.
👍👍🏾
You earned a subscriber today.
I've worked with non-English characters in the past, scratching my head.
This alone sums up all the concepts in detail. Very good use of examples and video production 💌
at 1:49 the arabic is rendered incorrectly. it shows each letter in the isolated form instead of the connected forms.
and it's written from left to right instead of right to left
happens every time lol, i remember some big project did the same thing recently
I keep coming back to this video to refresh the concept, thank you :)
Traditionally, major Chinese encoding methods would just consider a Chinese grapheme as 2 characters, because they used 2-byte coding while the 128 ASCII code points only use 1-byte coding. Also, traditionally a Chinese grapheme would always take up double the width of an ASCII grapheme in a fixed-width console font. This kind of made everything neatly aligned (the number of storage bytes needed is the same as the amount of character printing space needed), but it basically fell apart once the Internet and UTF-8 became more popular.
And that's also basically the very reason there are double-width Latin letters in Unicode. Traditionally they're used to improve readability of English words in vertical text arrangement for Chinese and Japanese, and they're called full-width letters: full width as in the full width of a Chinese character.
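A small Python check of the width/byte-count point above (a sketch; it uses Python's built-in 'gbk' codec as the legacy double-byte encoding):

print(len('中'.encode('gbk')))     # 2 bytes in the legacy double-byte encoding
print(len('中'.encode('utf-8')))   # 3 bytes for the same grapheme in UTF-8
print('Ａ')                        # U+FF21, the full-width Latin letter A
print(len('Ａ'.encode('utf-8')))   # 3 bytes, and it renders as wide as a CJK character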
this is such an appealing video due to the detail, music, subtle animations, even the colour theme. thanks for the video!
Thanks, a ton!!
Here is what I learnt from the video (I have added a few things that I knew earlier):
- UTF-8 (Unicode Transformation Format, 8-bit) is an encoding scheme for representing Unicode characters.
- In UTF-8, ASCII characters are represented using a single byte, which means that any valid ASCII text is also valid UTF-8 text.
- Therefore, UTF-8 is backward compatible with ASCII.
- In UTF-8, characters that can be represented using a single byte (i.e., ASCII characters) are represented as themselves.
- Characters that require more than one byte are encoded using a combination of multiple bytes.
- A code point refers to a numerical value assigned to each character or symbol in the Unicode standard.
- Code points are represented using hexadecimal notation and are typically prefixed with "U+" to distinguish them from other numerical values.
- For example, the character "é" can be written as a single code point (U+00E9, Latin Small Letter E with Acute), which UTF-8 encodes as the bytes 0xC3 0xA9, or as two code points: the base character "e" (U+0065) followed by the combining acute accent (U+0301), which UTF-8 encodes as 0x65 0xCC 0x81.
- A grapheme refers to a visual unit of a written language. It represents a single user-perceived character or a combination of characters that are displayed together.
- In Python 2, len() on a plain (Unicode-unaware) string returns the number of bytes, not the number of characters.
- In Python 2, len() on a Unicode-aware (u"...") string returns the number of code points; in Python 3, all strings behave this way (see the Python snippet below).
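A short Python 3 sketch of those last few bullets (illustrative only; the video itself uses Python 2):

import unicodedata

s = 'é'                                            # precomposed, a single code point U+00E9
d = unicodedata.normalize('NFD', s)                # decomposed: 'e' (U+0065) + U+0301
print(len(s), len(d))                              # 1 2  (code point counts differ)
print(s.encode('utf-8').hex(' '))                  # c3 a9
print(d.encode('utf-8').hex(' '))                  # 65 cc 81
print(len('héllo'), len('héllo'.encode('utf-8')))  # 5 6  (characters vs. bytes)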
thanks for posting the summary here
The notes on Python's functions describe a much earlier version that has not been in common use for half a decade.
You have an amazing talent for teaching m8! Keep it up, this video like your others is so helpful. I love that you cover basic concepts of coding, not "how do I implement a server in node.js" but what really is the essence of becoming a good coder.
The clearest Unicode has ever been in my career!
The atmosphere in the video is so chill, and the video is informative. YouTube, recommend this to more people.
Wow being the messy programmer that I am I always got encoded and decoded mixed up... Much clearer now, thanks
Definitely the best explanation of Unicode I could ask for. Checked with Elixir and it appears that Elixir strings are Unicode aware by default! Amazing!
Terrific video dude, definitely deserves far more views
Trying to get better in IT and dreaming of being an awesome programmer! I had always just skipped learning Unicode and never cared, due to laziness. But now I realize how important it is! Thank you so much for this video.
honestly thought you had a million subs. quality is top notch.
This is so awesome yet somehow has so few views.
These kind of videos are amazing for people self-studying CS. I hope you have lots more videos like this to make it easier.
Keep up the great work 🤘
Glad you liked it! Any topic suggestions for future videos?
Where have you been all my life?! 🤩 What a great explanation! Thank you so very much! 🥰
Been using Unicode for so long but never really understood what it means. Thanks for the explanation.
8:15 In Python 3, at least, you can now do that without any problem:
>>> a="你好"
>>> a[0]
'你'
In Python 3, every string is Unicode-aware (a sequence of code points); UTF-8 only comes in when you encode or decode. The u"" prefix is a Python 2 thing.
Technically, the internal representation of strings in CPython is not UTF-8, because strings need to be efficiently indexable, and that's easy to do with a fixed-width encoding. With UTF-8, looking up a given character in a string would take longer, because the whole string up to that point would need to be traversed, as bytes are not evenly allocated to each character.
Since PEP 393, CPython actually picks the narrowest fixed-width form that fits: strings that only contain Latin-1 characters use 1 byte per character to reduce waste, and others use 2 (UCS-2) or 4 (UCS-4/UTF-32). That's also why appending to a string creates an entirely new string object: the characters the new string has to hold may be outside the original representation's range.
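You can see that flexible storage indirectly with sys.getsizeof; a rough sketch (exact numbers vary by CPython build, but the per-character growth is visible):

import sys

ascii_s  = 'a' * 100    # stored with 1 byte per character internally
bmp_s    = '€' * 100    # U+20AC needs 2 bytes per character
astral_s = '😀' * 100   # U+1F600 needs 4 bytes per character
print(sys.getsizeof(ascii_s))
print(sys.getsizeof(bmp_s))      # noticeably larger
print(sys.getsizeof(astral_s))   # larger still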
Absolutely love this explanation. As someone else mentioned, this is even better than Tom Scott's explanation on the Computerphile channel.
I have waited for this type of video for years, thank goodness you made one!
I hit the (U+1F44D) button because this video was so easy to understand!
I've watched a few videos today trying to understand these concepts and this is by far the best. Good job
Very useful and applicable explanation of ASCII, UTF and UNICODE
This is the best explanation of unicode and UTF-8. Super comprehensive & helpful! Thanks!
Best explanation with a fantastic presentation. Thank you so much
This video packed so much information into such a short amount of time, amazing. Thank you for this content and please keep uploading.
this is the best video I've seen for how unicode works! Thanks!
This is such a great explainer video. As somebody who's run into the code point slicing problem in Python a lot, this explains so much! Thank you again!
Wow, your Chinese pronunciation is pretty good there @2:24
I've got a paper that tells exactly how you can convert a code point to UTF-8 bytes and vice versa. It's actually a pretty good piece of knowledge, and it explains exactly how UTF-8 encodes code points. Had I known that before, I could have simply converted the Unicode symbols by hand myself.
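The conversion is mechanical enough to do in a few lines. Here's a hand-rolled Python sketch of the same rules (for illustration only; codepoint_to_utf8 is a made-up helper, ch.encode('utf-8') already does this for you):

def codepoint_to_utf8(cp):
    if cp < 0x80:
        return bytes([cp])                                    # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])  # 110xxxxx 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])                    # 1110xxxx + 2 continuation bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])                        # 11110xxx + 3 continuation bytes

print(codepoint_to_utf8(ord('€')).hex(' '))   # e2 82 ac
print('€'.encode('utf-8').hex(' '))           # e2 82 ac, matches the built-in encoder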
Beautifully Explained
Videos like this are better than 100 bad books; I can't stress enough how good this video is. There are always treasure videos like this on YouTube. I hope the OP keeps working, and keeps letting their knowledge and legacy influence more people.
wow, best explanation i've seen... short and accurate. Bravo.
You deserve a fucking award, man, I swear. This is the number one explanation on YouTube. NONE of the others are even close to yours.
Thanks, this video really helped me understand text encoding better.
I'm a programmer so this knowledge is really useful :)
WHY? WHY? WHY HAVEN'T I SEEN YOUR CHANNEL? YOU ARE THE BEST, HANDS DOWN. Definitely sub + bell (and like).
Just the way you explain… it's so good and clear.
Yeah, I don't know what is going on with my channel. Is YouTube ripping me off? I know the Freedom Fighters [world-wide] are.
Just one thing: the Arabic at 2:04 is written backwards; the correct form is the following:
اَلْعَرَبِيَّةُ
The reason I'm going to watch this one is that it explains new stuff like emojis without expecting you to already have knowledge of Unicode. Bad informative material usually either lacks the new stuff or expects too much knowledge of the old.
Fun fact: most writing systems can express their characters in 2 or 3 UTF-8 bytes (only 2 for the most used ones: extended Latin, Arabic, Cyrillic, Greek…). The big outlier is CJK ideographs (Chinese, Japanese and Korean). There is an insane number of them, and each CJK grapheme carries more meaning (often a whole word), so it makes sense that they take more bytes in UTF-8 (3 bytes for the common ones in the Basic Multilingual Plane, 4 for the rarer extension blocks).
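A quick Python check of those byte counts (illustrative):

for ch in ['e', 'é', 'Я', '中', '😀']:
    print(f'{ch} {ord(ch):#x} -> {len(ch.encode("utf-8"))} UTF-8 byte(s)')
# e 0x65 -> 1 UTF-8 byte(s)
# é 0xe9 -> 2 UTF-8 byte(s)
# Я 0x42f -> 2 UTF-8 byte(s)
# 中 0x4e2d -> 3 UTF-8 byte(s)   (common CJK ideographs live in the BMP)
# 😀 0x1f600 -> 4 UTF-8 byte(s)  (only supplementary-plane characters need 4)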
Hey, thanks for the explanation; I spent some time trying to understand Unicode with the ASCII tables, but this explanation is really good. Thanks.
Best tutorial on Unicode
Thanks for showing actual code.. it made it very easy to understand
So here's my question: when decoding a UTF-8 string, how do we know whether the next byte is a one-byte character or the first byte of a multi-byte code point?
Here's how you can determine whether a byte in UTF-8 is a single-byte code point or part of a multi-byte code point (see the Python sketch after this list):
1. Examining the Leading Bits:
Single-byte codepoints:
Start with 0x00 to 0x7F (binary 0xxxxxxx)
Represent ASCII characters directly.
Multi-byte codepoints:
Begin with 0xC0 to 0xF7 (binary 11xxxxxx)
Require 2-4 bytes to represent characters beyond the ASCII range.
(Bytes 0x80 to 0xBF, binary 10xxxxxx, are continuation bytes and never start a character.)
2. Byte Sequence Patterns:
2-byte codepoints:
Start with 0xC0 to 0xDF
Followed by a byte in the range 0x80 to 0xBF.
3-byte codepoints:
Start with 0xE0 to 0xEF
Followed by two bytes in the range 0x80 to 0xBF.
4-byte codepoints:
Start with 0xF0 to 0xF7
Followed by three bytes in the range 0x80 to 0xBF.
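A minimal Python sketch of that rule (no validation, just the length logic; utf8_sequence_length is a made-up helper name):

def utf8_sequence_length(lead_byte):
    # Number of bytes in the sequence that starts with this byte.
    if lead_byte < 0x80:
        return 1      # 0xxxxxxx : plain ASCII
    if lead_byte < 0xC0:
        return 0      # 10xxxxxx : continuation byte, never starts a character
    if lead_byte < 0xE0:
        return 2      # 110xxxxx
    if lead_byte < 0xF0:
        return 3      # 1110xxxx
    return 4          # 11110xxx

data = 'aЖ中😀'.encode('utf-8')
i = 0
while i < len(data):
    n = utf8_sequence_length(data[i])
    print(data[i:i + n].decode('utf-8'), n, 'byte(s)')
    i += n
# prints: a 1 / Ж 2 / 中 3 / 😀 4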
This one is really good, among all the other basics videos.
Beautiful man, you're the only one who helped me
ok i finally understand why i need to specify utf 8 when writing to files
The greatest unicode video ever
Cleared all my doubts. Best explanation in the entire universe :)
Amazing explanation about Unicode
I really appreciate your help with downloading this software.
Thank you so much Alex, great and lucid explanation.
I kind of hoped this video would teach me how to use all the Unicode characters in 11 minutes, but it turns out it's just a general overview of how UTF-8 works. This is like going to a math class and learning that addition exists, but not learning how to use the +, man.
loved this video! So so clear ✨
This is the best explaination of Unicode.👏🏼👏🏼👏🏼
No videos were explaining clearly the step of converting the "grapheme" to a "code point" and then to binary; they were just assuming that middle step was understood. That's what I was missing, apparently. Thanks, great video.
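For anyone else missing that middle step, here is the whole chain in Python (just an illustration):

ch = '€'                             # the grapheme as you type it
cp = ord(ch)                         # its Unicode code point, as a plain number
print(f'U+{cp:04X}')                 # U+20AC
print(bin(cp))                       # 0b10000010101100, the raw binary value
print(ch.encode('utf-8').hex(' '))   # e2 82 ac, the bytes actually stored or sent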
Very good explanation. Clearly deserves more views.
Studying with Alex, the PRO!
Excellent Explanation Man... Understood every bit of it... Many Many Thanks😊❤️
Hi! This is gold! Please do more videos like this, especially the terminal colors.
Such a clean video format for explaining a topic clearly! Thank you^^ (from a newcomer to the field of informatics)
May you have a nice day with cookies and tea/milk 🥛🍪
Good, good. I really suffered on YouTube before I met you.
This is so nicely explained
Excellent video! Explained everything in a way that I could understand very well, thank you :)
Thanks for the incredible explanation, it's all clearer now. There are just 2 points that remain blurry: what is UTF-16, and how do you decode UTF-8 (how do you know if the code point you're decoding takes 1, 2, 3 or 4 bytes)?
UTF-16 is an encoding that uses 16 bits (2 bytes) for most characters. Code points from U+0000 to U+FFFF are stored literally. Code points from U+10000 upward are represented by surrogate pairs, which use two 16-bit units from an intentionally reserved range, U+D800 to U+DFFF (there's a quick Python check after the list below).
UTF-8 decoding is easy. As Tom Scott’s video on Computerphile explains, the first few bits tell you what role each byte has:
0−−−−−−− = U+00 to U+7F (ASCII)
10−−−−−− = Carries data. Can’t be the first byte of a letter.
110−−−−− = Followed by 1 data byte.
1110−−−− = Followed by 2 data bytes.
11110−−− = Followed by 3 data bytes.
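To make the surrogate-pair part concrete, a quick Python check (illustrative):

print('A'.encode('utf-16-be').hex(' '))    # 00 41 : BMP characters take one 16-bit unit
print('€'.encode('utf-16-be').hex(' '))    # 20 ac
print('😀'.encode('utf-16-be').hex(' '))   # d8 3d de 00 : U+1F600 becomes a surrogate pair
print('😀'.encode('utf-16-le').hex(' '))   # 3d d8 00 de : same pair, opposite byte order
print('A'.encode('utf-16').hex(' '))       # begins with the BOM (ff fe on little-endian builds)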
Not sure you got a reply on this. Basically, UTF-16 encodes each Unicode code point in one or two 16-bit (2-byte) units. Code points in the Basic Multilingual Plane, which covers the writing characters for pretty much every language on Earth, fit in a single unit; everything above U+FFFF is split into a surrogate pair of two units, so UTF-16 can still represent every defined code point, it just isn't fixed-width like UTF-32. Several Unicode-ready programming languages (Java, JavaScript, C#) actually use UTF-16 internally. UTF-16 also has the interesting characteristic that the two bytes of each unit might be stored in either order (optimized for certain computer systems), which is why UTF-16 text often starts with a 2-byte Byte Order Mark telling the text reader which order the bytes appear in.
@@epsi Thx for the answer it is very useful!
Best video about unicode
Brilliant explanation Alex, thank you
1:49 Arabic is spelled backwards; it should be written (العربية). I think it's because the tooling doesn't handle right-to-left text (I see this very often in Western / Latin resources). Basically, it's as if you wrote hsilgnE for English.
Thank you for the videos though^^
Great video!
Note:
In new Python versions, len("👍") actually returns 1.
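That's because 👍 on its own is a single code point. A quick Python 3 illustration of where the counts diverge:

print(len('👍'))                   # 1 : one code point
print(len('👍🏾'))                  # 2 : base emoji + skin-tone modifier are two code points
print(len('👍'.encode('utf-8')))   # 4 : the byte count, which is what Python 2's len gave you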
Hi, can someone please tell me where I can find info on 8:39, i.e. why the terminal can't render it correctly? Where can I learn how terminals render strings? More generally, how does software 'render' strings, and why can some programs render combined graphemes correctly while others cannot? I desperately need to learn about this. Thanks in advance.
Such an informative video. Thank you for creating it, and also a salute to your effort in making this video. 👍 ❤
Great video, it finally made me understand.
It would have been nice if you had explained how each collection of graphemes is organized and how a UTF-8 parser traverses each set.
Excellent, thank you - helpful and interesting!
2:08 Is it just me, or is "عربي" (Arabic) written backwards there (Left to right instead of right to left)?
Not that I know how to read or write Arabic, but did work with it a bit. I can kind of recognize when the ligatures go missing (all glyphs ending up separated) due to the characters being written in the wrong direction.
Easy to digest explanation! Good job, thanks 👍
This was very informative and clear. Thank you.♥
Thanks man!! You've earned my respect
Absolutely amazing video. Good Job 👍.
Recently I was rewriting my Python program which prints all the Unicode characters (without the combining ones) within a range of hexadecimal values or chosen Unicode blocks. You can see it on my channel, but know that I am a beginner; I am aware of how I can improve it, because I created it a long time ago.
Quality content, awesome. Post more of these kinds of great explainer videos.
great content, it's much more clear now. Thxxxxxxx
Sitting here getting the equivalent of a 3-hour course at university.
Very good explanation , thank you for your work!
1:56 For the Arabic script, the characters are rendered disconnected and out of order.
Classic video editing software not supporting Arabic text
They are not connected because the video editor doesn't support that.
Very good video! Thank you a lot!