Unicode and Byte Order

  • Published on 26 Nov 2024

Comments • 68

  • @RepertoireSix
    @RepertoireSix 3 years ago +12

    I must have watched half a dozen videos about encoding text over the years, but this one is easily the best one! Very easy to understand with all the examples. Now to hope I don't ever have to deal with anything other than UTF-8...

    • @ComputerScienceLessons
      @ComputerScienceLessons 3 years ago +3

      Thank you. I think UTF-8 will be around until we start communicating with extra-terrestrials. :)KD

  • @mhn147
    @mhn147 1 year ago +1

    I'm a CS graduate and I needed a refresher about this subject and this explanation had everything that I needed. Thanks.

  • @lostcarpark
    @lostcarpark 2 years ago +1

    Probably the clearest explanation I've seen on the topic. There are lots that cover the basics pretty well, but completely skip over the byte order aspect.

  • @paulfontaine7819
    @paulfontaine7819 3 years ago +4

    I have been waiting for this clear and precise explanation for 20 years I believe. Thanks so much.

  • @paulg687
    @paulg687 1 year ago +2

    Seriously underrated video. Should have a LOT more views than this.

  • @Moe5Tavern
    @Moe5Tavern 3 months ago +1

    This was super interesting and well done, thank you so much! I had an issue with a BOM yesterday that broke my script because the shebang wasn't being recognized. I loved discovering Hex Editor Neo; it gave me a lot of peace of mind actually seeing the "ef bb bf" in front of my code, haha.
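
A minimal sketch of the problem described above, assuming the script was saved as UTF-8 with a byte order mark. The UTF-8 BOM is the three bytes EF BB BF; when they sit in front of `#!`, the file no longer starts with the shebang. The helper name `strip_utf8_bom` and the sample script contents are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical script contents, saved with a BOM in front of the shebang.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

print(script[:5].hex(" "))         # ef bb bf 23 21
print(strip_utf8_bom(script)[:2])  # b'#!'
```

When reading text rather than raw bytes, Python's `utf-8-sig` codec drops a leading BOM on decode, which is another way to get the same effect.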

  • @booboo-oh1vd
    @booboo-oh1vd 3 years ago +2

    Watched all the videos in this playlist. Can't thank you enough.

  • @parthodasporag9885
    @parthodasporag9885 3 years ago +1

    Best video on youtube about Unicode.

  • @smallerz6486
    @smallerz6486 3 years ago +4

    As a Japanese speaker, the fact that the first four glyphs in your thumbnail are "I❤日本" made me wonder if you're a fan of Japan! (日本 means Japan in Japanese)

    • @ComputerScienceLessons
      @ComputerScienceLessons 3 years ago +2

      Konnichiwa - Japan is at the top of my places to visit, one day. :)KD

  • @jvsnyc
    @jvsnyc 3 years ago +1

    Quite excellent. So many little places it is possible, even easy to trip up and give sub-par or confusing information on this topic, and you just tap dance right across that minefield. Beautiful.

  • @surman1816
    @surman1816 3 years ago +4

    Another good one. Keep up the great work, man.

  • @TheKurama9
    @TheKurama9 3 years ago +11

    23:15 that's the kind of nonchalant joke I like lmao

  • @Mro0Ali0o
    @Mro0Ali0o 3 months ago

    Thanks for the great video.
    What is the default Endianness for most Windows file types? Is it always Big-Endian unless otherwise specified?

  • @jvsnyc
    @jvsnyc 3 years ago +2

    One question at the end. There are 17 * 256 * 256 possible characters across the 17 planes, not including the various blackout or restricted sections like those used for surrogates... isn't that more like 1,114,112 than well over 2 million? They do require 21 bits, rather than 20, because they left all but 2048 of the original BMP to lie where they were, and added another 1024 * 1024 possible new ones on top of that, so it is (256 * 256) * 16 + (256 * 256) - 2048... either way, way more than we should ever need if we don't go nuts... as they point out, Unicode is for characters, not glyphs or fonts, so if there are 500 ways you want to write A, that is still just one code point...
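
The arithmetic in this comment can be checked directly. This is a small sketch, not something from the video: 17 planes of 65,536 code points give 1,114,112 in total, the highest being U+10FFFF (1,114,111 in decimal), and removing the 2,048 surrogates leaves 1,112,064. The constant names are illustrative.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 256 * 256   # 65,536
SURROGATES = 0xDFFF - 0xD800 + 1    # 2,048 values reserved for UTF-16 surrogates

total = PLANES * CODE_POINTS_PER_PLANE

print(total)                      # 1114112
print(total - 1 == 0x10FFFF)      # True: the highest code point is U+10FFFF
print(total - SURROGATES)         # 1112064
print((total - 1).bit_length())   # 21
```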

    • @ComputerScienceLessons
      @ComputerScienceLessons 3 years ago

      To be honest, I think Unicode is one of those topics which, if you keep digging, you may well go nuts. I made a conscious decision not to overthink it (i.e. "tap dance"). Perhaps Unicode will prove inadequate if SETI ever come up with the goods. :)KD

    • @jvsnyc
      @jvsnyc 3 years ago

      @ComputerScienceLessons Yeah, one thing that is unfortunate is that we went from a situation where almost everything "normally" used was in BMP, to one where lots of 🤦‍♂️🤷‍♂️🤞super-popular stuff ain't. Too bad!

  • @kaos092
    @kaos092 1 year ago +1

    So why is UTF16 the only standard not compatible with ASCII?

  • @sadBytes
    @sadBytes 3 years ago +1

    thanks a lot for the awesome explanation!

  • @EMEKC
    @EMEKC 3 years ago +2

    To be or not to be... good one! ;)

  • @alexthegreek1085
    @alexthegreek1085 3 years ago +1

    Amazing video

  • @obeid_s
    @obeid_s 2 years ago +1

    Oh, I thought the UTF-16LE byte order would apply across all 4 bytes, but it only applies within each 2-byte unit.
    Thank you man ..
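
A short sketch of the point made here, using the honeybee emoji (U+1F41D) as an arbitrary example of a character that needs two UTF-16 code units. The bytes are swapped inside each 16-bit unit, while the high surrogate still comes before the low surrogate.

```python
ch = "\U0001F41D"  # honeybee, a code point outside the BMP

be = ch.encode("utf-16-be")  # big-endian, no BOM
le = ch.encode("utf-16-le")  # little-endian, no BOM

print(be.hex(" "))  # d8 3d dc 1d
print(le.hex(" "))  # 3d d8 1d dc

# Little-endian swaps the two bytes of each unit; it does not reverse all four.
print(le == bytes([be[1], be[0], be[3], be[2]]))  # True
print(le == be[::-1])                             # False
```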

  • @NaifAlqahtani
    @NaifAlqahtani 3 years ago +2

    1:19 nice

  • @jefersonwillian5579
    @jefersonwillian5579 3 years ago

    Great lecture!
    I had a question: what are those weird characters we see in the hex editor? I ask because Notepad recognizes the encoding used, while the hex editor doesn't.

    • @iparadoxg
      @iparadoxg 3 years ago

      A hex editor displays the raw data present in the file. It could have shown it to us in binary too, but it displays the raw binary data in the simpler hexadecimal format.
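
A rough illustration of what a hex view shows, using a made-up two-character string. The helper `hex_view` is hypothetical. The last line rests on an assumption: that the editor's text pane interprets each byte with a single-byte code page (Latin-1 is used here as a stand-in), which would account for odd-looking characters next to multi-byte sequences.

```python
def hex_view(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")


sample = "A\u00e9"  # "Aé", hypothetical file contents

print(hex_view(sample, "utf-8"))      # 41 c3 a9
print(hex_view(sample, "utf-16-le"))  # 41 00 e9 00
print(hex_view(sample, "utf-16-be"))  # 00 41 00 e9

# Reading the UTF-8 bytes one byte at a time gives two characters for "é".
print(sample.encode("utf-8").decode("latin-1"))  # AÃ©
```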

  • @charlesklein7232
    @charlesklein7232 2 years ago

    What website do I get these things from? You talk about it at 16:00, but where are they? How do I find them? What do I search for? This would make a good video.

  • @kqvanity
    @kqvanity 3 years ago +1

    May I know why this video was re-uploaded? If you've made any changes, it would be useful to know what to look for.

    • @ComputerScienceLessons
      @ComputerScienceLessons 3 years ago +2

      It was brought to my attention by one of my viewers that the bit sequence for the Greek letter Phi, in the table of UTF-8 values, did not match the bit sequence that was (correctly) derived below the table (thank you!). All of the other information was correct. I fixed the problem and took the opportunity to de-ess the sound at the same time :)KD

  • @olivercordingley776
    @olivercordingley776 3 years ago +2

    I am convinced that you are actually javidx9 in disguise

  • @ankitchabarwal6814
    @ankitchabarwal6814 2 years ago

    @23:18 "… the heart is broken" lol

  • @NoName-tj8dm
    @NoName-tj8dm 2 years ago

    Dear Sir,
    I have tried to convert '€', which in pure binary is 10000000 and in denary is 128, to UTF-8 format. It is a 1-byte character, but it also starts with 1, so I am confused about how to convert it to UTF-8 format.
    Please help.
    Thanks in Advance.

    • @angeldude101
      @angeldude101 2 years ago

      It doesn't fit in 7 bits, so it's not 1 byte in UTF-8. It does however fit in 11 bits, so it's a 2-byte code point. Pad to 11 bits and shove it in the two bytes with the proper headers:
      00010_000000 => 110_00010 10_000000
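
The padding-and-headers step in this reply can be sketched as below. The function name `utf8_two_byte` is made up for illustration and only handles the two-byte range. One caveat that is not in the thread: 128 (0x80) is the euro sign's byte value in Windows-1252, whereas its Unicode code point is U+20AC, which takes three bytes in UTF-8.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not use the two-byte form")
    lead = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 of the 11 bits
    trail = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, trail])


encoded = utf8_two_byte(0x80)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000010 10000000

# The euro sign itself is U+20AC and needs the three-byte form.
print("\u20ac".encode("utf-8").hex(" "))      # e2 82 ac
```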

  • @justcurious1940
    @justcurious1940 2 years ago

    4:56 In UTF-8 we have control bits which are necessary to identify the number of bytes used to represent a code point, while in UTF-16 we don't have such a thing, because low and high surrogates are recognised by their range of values. So we can tell whether a code point is encoded using 1 or 2 UTF-16 code units just from the range of values. Can someone correct me, please?
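
The range check described in this comment can be sketched as follows. The ranges D800 to DBFF (high surrogates) and DC00 to DFFF (low surrogates) are the standard ones; the function names and the sample string are illustrative only.

```python
def classify_code_unit(unit: int) -> str:
    """Classify one 16-bit UTF-16 code unit purely by its value range."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP code point"


def utf16_code_units(text: str) -> list[int]:
    """Split text into its UTF-16 code units (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


for unit in utf16_code_units("A\U0001F41D"):
    print(f"{unit:04X} {classify_code_unit(unit)}")
# 0041 BMP code point
# D83D high surrogate
# DC1D low surrogate
```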

    • @justcurious1940
      @justcurious1940 2 years ago

      2🐝⊕¬2🐝?

    • @angeldude101
      @angeldude101 2 years ago

      @justcurious1940 That doesn't look quite right... Did you mean to say 2🐝∨¬2🐝?
      Edit: oh, I guess that was the binary string shown in the video? Then it's possible the video had a typo or something.

    • @justcurious1940
      @justcurious1940 2 years ago

      @angeldude101 Did you convert it to binary Unicode, then to characters?

    • @angeldude101
      @angeldude101 2 years ago

      @justcurious1940 No, just noticed that ⊕ looked more like a direct/tensor sum than an or, so I just replaced it with the actual or symbol.

    • @justcurious1940
      @justcurious1940 2 years ago

      @angeldude101 Yeah, I see. I think you have to do it yourself, because I couldn't find a website that does it directly.

  • @eenteghadi
    @eenteghadi 2 years ago

    @24:05 UTF-8 encoding for Greek character Phi is 11001110 10100110 and not 11000011 10100110

    • @ComputerScienceLessons
      @ComputerScienceLessons 2 years ago

      That's a capital Phi in my video :)KD

    • @eenteghadi
      @eenteghadi 1 year ago

      @ComputerScienceLessons I also meant the capital letter. Or maybe I am missing something. Capital Phi: U+03A6 and small letter Phi: U+03C6
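
The two bit patterns discussed in this thread can be compared with a few lines of Python. This is only a check of the encodings themselves and says nothing about what the video currently shows; the helper name `utf8_bits` is made up.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


print(utf8_bits("\u03a6"))  # capital Phi, U+03A6: 11001110 10100110
print(utf8_bits("\u03c6"))  # small phi,   U+03C6: 11001111 10000110

# The other pattern from the comment, 11000011 10100110, decodes to U+00E6.
other = bytes([0b11000011, 0b10100110]).decode("utf-8")
print(f"U+{ord(other):04X}")  # U+00E6
```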

  • @foo0815
    @foo0815 2 years ago +1

    And there are even more stupid encodings like Oracle's CESU-8, which mostly is UTF-8, but encodes chars outside the BMP as UTF-16 and then those surrogates *again* as UTF-8, so it becomes 6 bytes... WTF?
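
To make the six-byte result concrete, here is a rough sketch of the scheme as the comment describes it: take the UTF-16 code units, then encode each unit, surrogates included, with the UTF-8 bit layout. Python has no built-in CESU-8 codec, so `cesu8_encode` is a hand-written illustration that relies on the `surrogatepass` error handler; it is not a reference implementation.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style: UTF-16 code units, each written as UTF-8."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # surrogatepass lets a lone surrogate be written in the 3-byte UTF-8 form
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bee = "\U0001F41D"  # a character outside the BMP

print(bee.encode("utf-8").hex(" "))  # f0 9f 90 9d
print(cesu8_encode(bee).hex(" "))    # ed a0 bd ed b0 9d
```

For characters inside the BMP the two encodings produce the same bytes, which is why the difference tends to go unnoticed until an emoji or other supplementary character turns up.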

  • @justcurious1940
    @justcurious1940 2 years ago +2

    2🐝⊕¬2🐝?

    • @ComputerScienceLessons
      @ComputerScienceLessons 2 years ago +1

      THAT is the question! :)KD

    • @justcurious1940
      @justcurious1940 2 years ago

      @ComputerScienceLessons I could not find an online website that converts UTF-8 data to text directly. I had to convert it to Unicode code points first, then convert each code point to the corresponding character
      😅
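
For anyone attempting the same exercise, the conversion can be done in one step with a few lines of Python instead of a website. This sketch assumes the bits are written as space-separated groups of eight; both helper names are hypothetical.

```python
def bits_to_text(bit_string: str) -> str:
    """Decode space-separated 8-bit groups as UTF-8 text."""
    raw = bytes(int(octet, 2) for octet in bit_string.split())
    return raw.decode("utf-8")


def text_to_bits(text: str) -> str:
    """Encode text as UTF-8 and return space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in text.encode("utf-8"))


message = "2\U0001F41D\u2295\u00ac2\U0001F41D?"  # the string quoted in the comments
bits = text_to_bits(message)

print(bits[:17])                      # 00110010 11110000
print(bits_to_text(bits) == message)  # True
```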

  • @rl_gamer15
    @rl_gamer15 1 year ago

    U+0045

  • @kavinkumar1384
    @kavinkumar1384 21 days ago

    Not simple but simply waste