As a programmer I never had an issue; I found it quite straightforward. UTF-8/16/32 are formats that Unicode is stored as, and UTF-8 is the popular one because it varies in size depending on the code point, allowing backward compatibility with 7-bit ASCII. It's all pretty much self-explanatory and easy to understand.
This video actually really helped me understand unicode and UTF-8! You deserve way more subscribers!
i have watched too many videos trying to understand this, and i finally got it from your video, thank you so much!
2:05 tbh, as a noob, this was a MAJOR eye-opener for me. Thanks!
Super well made video, not gonna lie I wasn't expecting to watch through a video about Unicode and UTF-8, but it actually ended up clearing some stuff up for me. Great work!
Yo bro, this was amazing. You have a talent for explaining concepts in ways that make sense. Bravo!!!
Nicely explained in simple terms
i was ashamed that i couldn't grasp this concept despite being a DB admin for 12 years, until someone asked about creating a db as a unicode db or non-unicode. I took the red pill and went deep into the rabbit hole.
Where were you man all this time 😭 I was doing a data migration that had characters from all languages; I was working on Windows while the programs ran on Linux, and I was facing many issues with encoding and decoding.
To add to that, we were trying to zip and store some of the data, so bytes were being compressed, decompressed, and converted back into text in different languages.
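For what it's worth, as long as every step agrees that the bytes are UTF-8, the compress/decompress round trip is lossless. A tiny Python sketch of the idea (the sample text and the choice of zlib are just for illustration):

```python
import zlib

text = "данные, 数据, data"                       # mixed-language sample text
payload = zlib.compress(text.encode("utf-8"))     # compress the bytes, not the string

restored = zlib.decompress(payload).decode("utf-8")
print(restored == text)                           # True: the round trip is lossless
```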
Nine minutes and ten seconds packed with knowledge 😊👍
Fantastic explanation and teaching style. TY.
Love this video! This really makes me feel I am not alone ❤️
God (just youtube) gave me subtitles, I'm grateful
And *Thank You*
3:23 - That's ASCII along with an 8-bit extension. Basic ASCII by itself is only 7-bit, ranging from 0-127.
3:54 - That is not the GB 18030 binary representation of the number 六; 0xC1 0xF9 is, which is only two bytes.
You're also overcomplicating this. Just say that Unicode defines a set of characters and assigns each one a unique number. UTF-8 and UTF-16 (and others) are simply ways of representing those numbers in memory. It is furthermore possible to convert between UTF-16 and UTF-8, as both can represent all possible Unicode values. Also mention that UTF-8 was developed to be compatible with basic ASCII, and that UTF-16 was designed to be compatible with UCS-2.
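To make that concrete, here is a minimal Python sketch (purely an illustration, not from the video): one character, one code point, different byte representations under different encodings, and conversion between encodings going through the code point.

```python
# One character, one Unicode code point, several byte representations.
ch = "六"                       # U+516D
print(hex(ord(ch)))             # 0x516d -- the number Unicode assigns to it

print(ch.encode("utf-8"))       # 0xE5 0x85 0xAD (3 bytes)
print(ch.encode("utf-16-le"))   # 0x6D 0x51      (2 bytes)
print(ch.encode("gb18030"))     # 0xC1 0xF9 per the correction above (2 bytes)

# Converting between encodings just goes through the code point:
utf16 = ch.encode("utf-16-le")
print(utf16.decode("utf-16-le").encode("utf-8") == ch.encode("utf-8"))  # True
```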
In my Windows 10 20H2 version, I see the encoding choices listed as:
ANSI
UTF-16 LE
UTF-16 BE
UTF-8
UTF-8 with BOM
That list would no longer lead to as much confusion as what you showed. Microsoft arguably made things much worse by prepending UTF-8 encoded files with a BOM, which makes no sense to those who know UTF-8: there is no such thing as BE or LE in UTF-8, it is unambiguously just UTF-8, and byte order doesn't apply to it. So it is good to see UTF-8 by default, without the meaningless BOM, show up as the simpler-looking of the two nowadays, and not to see one named "Unicode". They were trying to disambiguate between ANSI and UTF-8 up front with the BOM, but because nobody else did that, a whole lot of people got very confused by it. I have seen that happen this year: people couldn't understand what the unexpected bad data at the top of their UTF-8 file was, because they knew that "UTF-8 doesn't have a BOM", and it doesn't, unless it came from some Microsoft programs (there's a small sketch of stripping it after this comment).
It is definitely true that a full understanding of all character sets and encodings fills not just a book, but an encyclopedia.
The basics were confused because early adopters of Unicode "knew" that every character in Unicode fit in 16 bits, because for many years it did. Famous examples were Microsoft Windows and Java. So calling UCS-2 "Unicode" through 1995 was just fine. Where I think the most confusion came from is that people continued to use the term "Unicode" to mean UCS-2 after Unicode 2.0 came out, when UCS-2 (stuck on the Basic Multilingual Plane, the BMP) was superseded by UTF-16 and UTF-8 as two of the encodings that could represent the full Unicode 2.0 and later character sets. I am not sure how THAT happened, and it confused me as well. It might be reluctance to change anything that might break backwards compatibility...
So I too watched this thinking "the whole thing is silly, UTF-8 is just by far the most popular encoding for the Unicode character set!" but realized that the twisted history of Unicode support and naming conventions in various languages has left a lot of people confused.
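On the BOM point above: the UTF-8 BOM is just the three bytes EF BB BF (U+FEFF encoded in UTF-8) at the start of the file. A minimal Python sketch of detecting and stripping it (the file name is hypothetical):

```python
# The UTF-8 "BOM" is the code point U+FEFF encoded in UTF-8: bytes EF BB BF.
BOM = b"\xef\xbb\xbf"

with open("some_file.txt", "rb") as f:    # hypothetical file name
    raw = f.read()

if raw.startswith(BOM):
    raw = raw[len(BOM):]                  # drop the marker some Microsoft tools prepend

text = raw.decode("utf-8")

# Python can also do this in one step with the "utf-8-sig" codec,
# which decodes UTF-8 and silently removes a leading BOM if present:
# text = open("some_file.txt", encoding="utf-8-sig").read()
```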
There is a simple solution for this. Don't use bad operating systems
@@godnyx117 Simple in technology, but there are millions of good developers who are developing totally or partially on and for Windows, if that's all you meant. Some by choice, many not.
@@jvsnyc I know. I kinda said it as a joke tbh. But I'm definitely getting pissed off by how stupid this OS is, though of course that's just my choice, like you said.
Great video! Maybe next time, record another track for the voiceover once the script is complete :D That way the voice doesn't feel like it's hopping all over the place with edits.
It's not nearly as complicated as all that. Unicode started coming together in ~1988, the first published standard was 1991. It wasn't until July 1996 that it stopped being the case that every single character fit into 16-bits. Everyone was using Unicode as a synonym for UCS-2 up until then. Even today, there are a lot of systems (fortunately fewer) that are still stuck on UCS-2 rather than UTF-8 or UTF-16, both of which can represent all the characters, not just the ones on the Basic Multilingual Plane (BMP) that UCS-2 can.
So calling UCS-2 "Unicode" was historically correct in 1988, 1989, ... 1995, but since Unicode 2.0 in 1996 is now wrong and an anachronism. That is a subset of what we have today both in terms of which characters you can express and in features...
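A small illustration of why UCS-2 stopped being enough (Python, purely as a sketch): anything outside the BMP needs a surrogate pair in UTF-16, which UCS-2 simply cannot express.

```python
ch = "😀"                        # U+1F600, outside the Basic Multilingual Plane
print(hex(ord(ch)))              # 0x1f600

# UTF-16 encodes it as a surrogate pair (two 16-bit units):
print(ch.encode("utf-16-be"))    # 0xD8 0x3D 0xDE 0x00 -> units 0xD83D 0xDE00

# UTF-8 uses four bytes:
print(ch.encode("utf-8"))        # 0xF0 0x9F 0x98 0x80

# A BMP character like 'A' (U+0041) fits in a single 16-bit unit,
# which is why UCS-2 worked fine until characters beyond U+FFFF were added.
print("A".encode("utf-16-be"))   # 0x00 0x41
```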
AMAZING explanation! Thanks a lot!
ok what is the difference between utf8 and utf16le, i know what LE is but what is utf16
EDIT: just looked it up, the number at the end represents the minimum space each character would take
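Roughly right. A quick Python sketch of both points: the 8/16 is the size of the code unit, and LE/BE is the byte order within each 16-bit unit.

```python
# UTF-8 uses 8-bit code units: ASCII characters stay one byte each.
print("A".encode("utf-8"))       # 0x41            (1 byte)
print("€".encode("utf-8"))       # 0xE2 0x82 0xAC  (3 bytes)

# UTF-16 uses 16-bit code units, so even 'A' takes two bytes,
# and LE/BE says which byte of each 16-bit unit comes first.
print("A".encode("utf-16-le"))   # 0x41 0x00
print("A".encode("utf-16-be"))   # 0x00 0x41
```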
this actually did help a lot thanks
Is it just me, or was the video of Ballmer, Gates and others cheering and dancing funny enough to distract you?
Thanks pal, useful material. Go ahead.
what does UTF-8 stand for?
8-bit Universal Character Set Transformation Format
Yeah I thought he was talking about encoding for real
My brain got fucked up today by this stuff, looking at UTF-8 encoding, hex to decimal; I didn't even try binary. I was just trying to do a simple XOR encryption in C++ and have no clue how to convert the XOR garbage buffer output to a BYTE integer array... the ASCII letters match but not the Latin letters and some operators & emojis.
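If it helps, the usual trap there is XOR-ing text instead of bytes: XOR the UTF-8 bytes, keep the result as a raw byte array (it's generally not valid UTF-8 anymore), and only decode after XOR-ing back. A rough Python sketch of the idea (the same logic carries over to C++ with a buffer of uint8_t):

```python
key = 0x5A
text = "ASCII ok, Füße, emoji 😀"

data = text.encode("utf-8")                # work on bytes, not on characters
cipher = bytes(b ^ key for b in data)      # XOR'd buffer: raw bytes, generally NOT valid UTF-8

# Decoding `cipher` as UTF-8 here would fail or garble the non-ASCII parts --
# exactly the "ASCII matches but Latin letters/emoji don't" symptom.

plain = bytes(b ^ key for b in cipher)     # XOR again with the same key
print(plain.decode("utf-8") == text)       # True
```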
Too much info, can anyone please tell me (( why do some people use UTF-8 in web scraping??? ))
Thanks man! Really helped
So I have a strange question for you 🤔
Would you call artwork made in unicode or utf32 the same way old-school text art people call it ascii art and ansi art for bbs boards? What would you call it if you decided to make art in unicode or utf8?
It would be character art. The term "Text Art" is questionable, because it would include thousands and thousands and thousands of 😂 🤡 👾 🧝 🍉 🥑 🍩 🐘, and clearly these are neither ASCII nor ANSI text. Everything in Unicode is a character (or a combination of characters)... so it is character art. Of course, I don't think saying "texting" for SMS is strictly true anymore if you are using emoji, but most people use that term in that way. I would say "messaging" is still strictly true, but are all these emojis text? I call all of these things characters, whether they are letters/text-for-languages-that-don't-use-letters or emojis. Strings are now strings of characters, some of which are letters, some are whole words (for example Chinese), some are pictures/emojis.
interesting question
Ma man just made life easier JAJAJA
Got it, thanks!
Thanks!
Yeah, sometimes it works, sometimes it doesn't.
This is still so confusing :P
I don't understand. Why do you initially keep pointing to "your favorite" as if it changes anything?
This video was beautiful, thank you!!
So I am taking content from an old PDF and creating a manual using a commercial "SD1000"-compliant, FrameMaker-ripoff software made by some Russian douchebag. The source files are some unknown PDF. In Acrobat Pro I go into text edit mode, copy the info I need and paste it into the publishing software. Once I finish, I publish my new document to PDF, but there is a major issue... wherever there is a word containing the letter combinations 'fi' or 'fl', a # symbol appears. I contact Drago and tell him his software is failing. He goes on to explain to me about Unicode and UTF-8 blah blah blah. IDGAF, but he says that he would have to charge us a consulting fee to "fix" this issue. It's like I'm on his toll road, and in the middle of the road there is a cow, and he wants me to pay him to move his own cow off the road I'm paying him to be on. WTF!!! I can take the text to Notepad and then move it to the publishing software, but that sounds daunting. Is there a software that can "clean" the source PDF so this won't happen?
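For what it's worth, this sounds like the copied PDF text uses the single-character ligatures ﬁ (U+FB01) and ﬂ (U+FB02), which the publishing tool can't map, so it substitutes #. Running the copied text through Unicode compatibility normalization replaces them with plain "fi"/"fl". A rough Python sketch of that cleanup (file names are hypothetical):

```python
import unicodedata

text = open("copied_from_pdf.txt", encoding="utf-8").read()   # hypothetical file name

# NFKC "compatibility" normalization turns the single-code-point ligatures
# ﬁ (U+FB01) and ﬂ (U+FB02) into the plain two-letter sequences "fi" and "fl".
cleaned = unicodedata.normalize("NFKC", text)

with open("cleaned.txt", "w", encoding="utf-8") as out:
    out.write(cleaned)
```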
odm unite
Hangul
characters are bloated, ascii ftw
Am I missing something or are you trying to be funny? Ten million UTF-8 characters take 0.0000000000000000000000000000000000000% more space than ten million ASCII characters, unless they are representing characters that aren't in ASCII. Are you saying that everyone in Japan, China, Korea, India, Russia, etc. etc. etc. etc. etc. etc. etc. should all just grow up and learn English? Every character that can be represented in ASCII (0 thru 0x7F) takes the same space in UTF-8, so what else could the point be?
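A tiny check of that claim in Python (illustrative only): ASCII-only text is byte-for-byte the same length in UTF-8, and only non-ASCII characters cost extra.

```python
ascii_text = "plain ascii ftw"
print(len(ascii_text.encode("ascii")))   # 15 bytes
print(len(ascii_text.encode("utf-8")))   # 15 bytes -- identical for ASCII-range text

print(len("六".encode("utf-8")))          # 3 bytes -- only non-ASCII characters cost more
```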
@@jvsnyc i'm not serious, but i do think a lot of languages are not computationally friendly. My first language is Chinese and it's hard to use in digital products, e.g. you have to support the encoding, it's almost impossible to make custom fonts, etc. (having existing encoding libs / fonts doesn't make them inherently less complicated). We only really use 0.1% of the hundreds of thousands of characters in the Unicode library. Not saying these languages are bad in any way, just saying they happen to not be computation friendly; plain ascii is just easy.
@@slmjkdbtl lol, no arguments there. Plain ASCII was way, way, way easier for sure. But monochrome graphics were easier than 24-bit color too...yeah, 21 bits is a bit much, and there was much to love about all chars being the same size (in bits)...life got harder.
@@jvsnyc yeah exactly, personally i'm keeping everything easy myself, including graphics stuff: software rendering, a small canvas. Having no "hi-res" needs for personal projects makes me a happy programmer.
I’m a bit confused with two questions:
At 4:10, you call UTF-8 an ‘encoding’ (aka function) that maps bytes to unicode.
Then at 7:10, you say to map unicode code points to bytes is to ‘encode’ while the reverse is to ‘decode’. Shouldn’t UTF-8 be considered a… ‘decoding’ instead of an ‘encoding’? Or maybe using the word ‘encoding’ as a synonym for ‘function’ intrinsically leads to confusion.
My second question is… is UTF-8 a *well-defined* function?? Like, a sequence of bytes maps to exactly one unicode value via UTF-8? I think not because the link below says a unicode character can consist of MULTIPLE code points:
riptutorial.com/unicode/topic/6485/characters-can-consist-of-multiple-code-points
A single UTF-8 sequence maps to exactly one code point. Multiple code points are sometimes needed to represent what we consider a single character. Therefore, such characters will consist of more than one UTF-8 sequence. Note that UTF-8's original design (sequences of up to 6 bytes) could theoretically represent about 2.1 billion code points, which is far more than Unicode currently has. As such, not all possible UTF-8 sequences map to valid Unicode code points.
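A short Python illustration of that reply: each code point gets its own UTF-8 sequence, and some user-perceived characters are made of several code points.

```python
# "é" written as two code points: 'e' (U+0065) + combining acute accent (U+0301)
s = "e\u0301"
print([hex(ord(c)) for c in s])     # ['0x65', '0x301'] -- two code points
print(s.encode("utf-8"))            # 0x65, then 0xCC 0x81 -- one UTF-8 sequence per code point
print(len(s))                       # 2, even though it renders as one character

# The precomposed form "é" (U+00E9) is a single code point with its own sequence:
print("\u00e9".encode("utf-8"))     # 0xC3 0xA9
```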