The Mötley Crüe Guide to Unicode Normalization

  • Published on 2 Nov 2024

Comments • 126

  • @Dorgrin
    @Dorgrin 2 months ago +60

    This is an excellent series on how the world is held together by chewing gum and paddle pop sticks 😂

    • @nurmr
      @nurmr 2 months ago +1

      and "A project some random person in Nebraska has been thanklessly maintaining since 2003"? (ref: xkcd #2347)

    • @arisweedler4703
      @arisweedler4703 2 months ago +2

      This dude doesn’t miss. It’s so fun to learn this stuff. And I 100% agree w/ u 😂

  • @gustavohman5811
    @gustavohman5811 2 months ago +57

    Dylan: Normalization.
    YouTube CC: Normalisation.
    Case proven...

    • @Raiment57
      @Raiment57 2 months ago +2

      So did YouTube CC use the British spelling because of Dylan's accent?!

    • @mtarek2005
      @mtarek2005 2 months ago +2

      @@Raiment57 he's the one who added the CC

    • @Raiment57
      @Raiment57 2 months ago +2

      @@mtarek2005 Oh, inconsistent spelling. He's such an anarchist! 😊

    • @TheJamesM
      @TheJamesM 2 months ago +1

      @@Raiment57 Presumably Dylan set the language of the video as "English (United Kingdom)" when uploading (or in his channel defaults), rather than any kind of accent detection shenanigans. As for his written use of "normalize", that could be down to one of two things:
      a) Exposure to the US spelling in programming languages, technical documentation, etc. (e.g. his use of the JavaScript `.normalize()` method).
      b) Oxford spelling. The Oxford University Press style guide calls for use of "-ize" rather than "-ise" spellings (but "-yse" rather than "-yze"), alongside what would be thought of as traditional British spelling for other words. This convention is also followed by a number of international academic journals, so it's not too unusual to see. It's also a pretty good example of how a single monolithic standard for a language is a bit of a myth: dig deep enough, and you'll find all sorts of variance between publications, such as the New Yorker's idiosyncratic use of diereses in words like "coöperation". I've seen it claimed that Microsoft Word is to blame for a lot of contemporary linguistic bugbears: when you choose your dictionary language, you plant your flag in the ground and trust that everything it tells you is the definitive truth of your version of the language, when in actuality there's a lot more variance than that.
      I'd guess that it's the latter.

    • @DylanBeattie
      @DylanBeattie 2 months ago +30

      @@gustavohman5811 ah, welcome to the neurone-pickling linguistic nightmare of being a British developer in a world where most programming languages are based on American English.
      Dylan writing video scripts, which then get uploaded to YouTube’s subtitle engine starting with this video: Normalisation. British English.
      Dylan writing JavaScript: Normalization, otherwise it won’t work.
      Dylan writing title slides… is apparently still in JavaScript mode.
      And don’t even get me started on color vs colour… 🤣

  • @R.B.
    @R.B. 2 months ago +23

    At Swype, I got to learn a lot about composition and decomposition. Korean has normalized and denormalized forms. Something like "한국" is two syllable glyphs, '한' and '국'. This decomposes into the jamos 'ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', and 'ᆨ'. Since there wasn't a way to put every composed syllable on the keyboard, the keyboard showed the consonant and vowel jamos. Swyping from 'ᄒ' to 'ᅡ' and then to 'ᆫ' would output '한' and repeating the same Swyping for the next triad of jamos would output '국.' Together you've written the composed word for Korea, occupying two Unicode code points built from six decomposed jamos. I was responsible for testing that Swype would do the right thing.
    To help with comprehension, the jamos are like letters on a conventional 104 key keyboard. The jamos can be combined as consonant-vowel or consonant-vowel-consonant syllables (there may be other forms as well, but as someone who doesn't speak Korean, those are the ones with which I'm familiar), and one or more of those syllables create a word, such as "한국," meaning Korea.
    To test Swype, I had the list of words which we supported in our dictionary. I would decompose the syllables into a string of jamos and generate Swype trace events which would mimic a finger trace, and then I'd compare the result Swype generated with the expected result to see how well we did.
    Long story short, Korea built their keyboard so that the most common consonants and vowels are on the home row. Like "put," "pot," "out," and "par," there wasn't a lot of distinction between traces, with Koreans rapidly sweeping their finger back and forth along the home row. While this was an advantage for typists on a full keyboard, it complicated Swyping unless the Swypist would curve their trace to clearly indicate each intended jamo. In that role I did learn a lot about Unicode, Korean, composition/decomposition, and gained a lot of respect for the considerations made by the Unicode Consortium.

    • @WilliamHostman
      @WilliamHostman 2 months ago +1

      Hangul is interesting, as it's a phonetic alphabet giving the appearance of an alpha-syllabary due to symbol stacking rules.
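
The composition and decomposition described in this thread can be reproduced with the standard `String.prototype.normalize()` method. This is only a minimal sketch: the `codePoints` helper and the variable names are illustrative and have nothing to do with Swype's actual code.

```javascript
// Precomposed Hangul syllables: two code points.
const composed = "\uD55C\uAD6D"; // 한국

// NFD splits each syllable into its conjoining jamo.
const decomposed = composed.normalize("NFD");

// Illustrative helper: list code points as U+XXXX strings.
const codePoints = (s) =>
  [...s].map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

console.log(codePoints(composed));
// [ 'U+D55C', 'U+AD6D' ]
console.log(codePoints(decomposed));
// [ 'U+1112', 'U+1161', 'U+11AB', 'U+1100', 'U+116E', 'U+11A8' ]

// NFC recombines the jamo into the original syllables.
console.log(decomposed.normalize("NFC") === composed); // true
```

Two code points in, six out, and the round trip through NFC restores the original string.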

  • @timseguine2
    @timseguine2 2 months ago +37

    For mainframes there is UTF-EBCDIC, which is basically a complicated UTF-8 that is EBCDIC compatible instead of ASCII compatible. And as far as I know every major mainframe database system supports it. I also think UTF-8 support is available on mainframes these days (although that doesn't help legacy customers who have old EBCDIC code).

    • @dennydravis8758
      @dennydravis8758 2 months ago +9

      Things I am never telling my clients that are thinking about transitioning off of mainframes for 500 Alex.

    • @timseguine2
      @timseguine2 2 months ago +2

      @@dennydravis8758 It definitely falls in the category of cursed. And I tend to prefer PC technology across the board. But I think most people's aversion to mainframe technology is mostly just unfamiliarity.

    • @AllanSavolainen
      @AllanSavolainen 2 months ago

      Probably similar to UTF-5

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 2 months ago +1

      @@timseguine2 Isn't the aversion partially due to mainframes = IBM rather than mainframes being mainframes?
      Like compare with DEC stuff, which technically were mini computers rather than mainframes, but the largest DEC systems were comparable to the IBM systems, and people loved the DEC stuff.

    • @timseguine2
      @timseguine2 2 months ago

      @@Thesecret101-te1lm Maybe. It is hard to say, since I don't have that aversion.
      My general feeling is that it is a lot to do with the parallel and isolated culture surrounding mainframes. The number of mainframe installations is several orders of magnitude smaller than other platforms (despite being much more comparable in terms of total invested assets in dollars).
      So there is less opportunity for an "average" computer user to be involved with mainframe technology and it has always been that way. Add to that the fact that they have their own terminology and jargon that is alien and incompatible to standard hacker jargon. All that means that from the very beginning there was a (deserved or undeserved) reputation of mainframe technology as being elitist.
      And that minicomputers and microcomputers were computing for the masses.
      From my perspective that perception never softened over the years. And since the mainframe enclave has more or less existed happily in parallel for decades now, many of the old stereotypes have become fact for most people considering they still have no reason or opportunity to have exposure to it.
      Just my two cents. Full disclosure, I used to work at IBM as a firmware engineer for their mainframe products. Take that as you will. But I have been all over the place in terms of work outside of IBM, so I don't really see myself as biased.

  • @edwardallenthree
    @edwardallenthree 2 months ago +2

    Putting things in alphabetical order is a spectacular challenge. Think about what happens when you add requirements for alphabetizing books. We don't alphabetize by first name, but by last name. What name is the last name is not readily determined by a regular expression, because it's not always the last word after a space. We also organize titles alphabetically, but we ignore certain words in certain languages (in English "a," "an," and "the."). Excited for your next video.

    • @stephenspackman5573
      @stephenspackman5573 2 months ago

      And the diacritics in different languages have different effects on the collation sequence. And what do the Japanese do? And (IIRC from my linguistics classes) there are languages in which the first letters of a word carry agreement information, not meaning inherent to the word, and where the first letter that is part of the root doesn't stay the same throughout the paradigm.
      Not to mention that your index may want to include references to backspaces and linefeeds (ok, it was due to a bug, but I did once work on a programming language that allowed significant backspaces in identifiers).
      Civilians, they have no idea.
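
Two of the points in this thread, ignoring leading articles and language-specific treatment of diacritics, can be sketched with the built-in `Intl.Collator`. The `sortKey` helper and the sample titles are illustrative assumptions, and the language-specific results depend on the locale data shipped with the JavaScript runtime.

```javascript
// Illustrative helper: drop a leading English article before comparing.
const sortKey = (title) => title.replace(/^(a|an|the)\s+/i, "");

// Intl.Collator compares strings using language-aware rules.
const english = new Intl.Collator("en", { sensitivity: "base" });

const titles = ["The Hobbit", "A Tale of Two Cities", "Emma", "An Ideal Husband"];
const sorted = [...titles].sort((a, b) => english.compare(sortKey(a), sortKey(b)));
console.log(sorted);
// [ 'Emma', 'The Hobbit', 'An Ideal Husband', 'A Tale of Two Cities' ]

// The same pair of strings can collate differently by language:
// German treats "ä" as a variant of "a", while Swedish
// places "ä" as a separate letter after "z".
const german = new Intl.Collator("de");
const swedish = new Intl.Collator("sv");
console.log(german.compare("\u00E4", "z") < 0);
console.log(swedish.compare("\u00E4", "z") > 0);
```

A regular expression handles the English articles here, but it does nothing for the harder cases raised above, such as deciding which word in a name is the surname.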

  • @indyztech
    @indyztech 2 months ago +3

    Gee. I thought I had my head around Unicode, but here I am, learning useful stuff. Thank you for this excellent series.

  • @natalie5947
    @natalie5947 2 months ago +1

    I've been really enjoying this series.

  • @TheGeoffable
    @TheGeoffable 2 months ago +8

    I've previously seen the last few text encoding vids as parts of your longer live talks on it, but there's always something here that is either new or, quite possibly, that I missed the first time because I was giggling too much ;) 👍🤘

    • @DylanBeattie
      @DylanBeattie 2 months ago +5

      I think the 1-hour talk is going to end up as 2, maybe 2.5 hours of video. There's a lot of stuff that ends up getting cut when you're editing to fit a conference schedule.

  • @MeriaDuck
    @MeriaDuck 2 months ago +3

    I played snippets of your talks in my classes, these are a great length to include and respond to in class! And as I'm giving lessons about databases, I love to see a video on collation and sorting 😂

  • @CaioAguida
    @CaioAguida 2 months ago

    This is an excellent video. An anecdote on how important it is: recently in my PhD research (linguistics of Ancient Greek), I discovered that every NLP model was underperforming and messing up my work because the datasets were not normalized, so that each word could have 5+ different Unicode representations. This forced me to switch from an NLP-assisted data collection to a manual one.
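
As a hedged illustration of how one Greek letter can end up with several representations, the sketch below uses three encodings of an accented alpha. The specific code points are chosen for illustration and are not taken from the datasets mentioned above.

```javascript
// Three encodings of Greek alpha with an acute accent.
const variants = [
  "\u03AC",       // GREEK SMALL LETTER ALPHA WITH TONOS
  "\u1F71",       // GREEK SMALL LETTER ALPHA WITH OXIA
  "\u03B1\u0301", // alpha followed by COMBINING ACUTE ACCENT
];

// Compared raw, they are three distinct strings.
console.log(new Set(variants).size); // 3

// After NFC they collapse to a single form.
const normalized = variants.map((s) => s.normalize("NFC"));
console.log(new Set(normalized).size); // 1
console.log(normalized[0] === "\u03AC"); // true
```

Normalizing every token once, before counting or training, is usually cheaper than trying to teach each downstream tool about the variants.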

  • @Jakub1989YTb
    @Jakub1989YTb 2 months ago +3

    You're one of the few (now-to-be) YouTubers, if not the only one, whose opening and closing "jingles" I believe are sincere.
    We indeed need to look after each other more. And a great week to you too.

  • @andrerenault
    @andrerenault 2 months ago +1

    I’ve seen virtually all of this in your talks before, but I’m still rewatching it because it’s fascinating stuff

  • @VRchitecture
    @VRchitecture 2 months ago +16

    You know it’s an old school dev when one uses *var* in JavaScript☝🏻

    • @DylanBeattie
      @DylanBeattie 2 months ago +23

      @@VRchitecture one does not worry about scope when one’s development environment is the Chrome DevTools console…

    • @VRchitecture
      @VRchitecture 2 months ago +5

      @@DylanBeattie True, though I wanted to emphasize here the fact of usage per se, not its implications (which you’re obviously aware of as an experienced web developer) :)

    • @nickwallette6201
      @nickwallette6201 2 months ago +2

      var things_to_look_up_later = ["JS variable scoping and conventions"];

    • @cod3r1337
      @cod3r1337 2 months ago

      Maybe not, but every time you use "var" in JS, God kills a little kitten. You don't want the kittens to die, do you? ;)

    • @stephenspackman5573
      @stephenspackman5573 2 months ago

      @@cod3r1337 The developers of interpreted languages often seem to hate kittens.

  • @AartBluestoke
    @AartBluestoke 2 months ago +2

    my most recent run-in with code pages was a finance export - been running fine for ages, then someone decided to add a pound symbol to one of the account names, and one system sent £ (char 163) and another threw a wobbly over incorrectly encoded UTF8 characters ...
    and to make matters worse, opening and saving in excel "handles" it by silently promoting it to the unicode encoded version, which (in ascii) is Â£ - which breaks other things, which assume ascii ...
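
Both halves of that failure can be reproduced in a few lines. This is a sketch under the assumption that one system wrote single-byte Latin-1 and the other expected UTF-8; the actual export pipeline is not described in the comment.

```javascript
// UTF-8 encodes "£" (U+00A3) as two bytes.
const utf8Bytes = new TextEncoder().encode("\u00A3");
console.log([...utf8Bytes].map((b) => b.toString(16))); // [ 'c2', 'a3' ]

// Read one byte per character (Latin-1 style) and the pair becomes mojibake.
const misread = String.fromCharCode(...utf8Bytes);
console.log(misread); // Â£

// Going the other way, a lone Latin-1 byte 0xA3 (decimal 163) is not valid UTF-8.
const strict = new TextDecoder("utf-8", { fatal: true });
let rejected = false;
try {
  strict.decode(new Uint8Array([0xa3]));
} catch (err) {
  rejected = true; // a strict decoder throws a TypeError here
}
console.log(rejected); // true
```

A non-strict decoder would substitute U+FFFD instead of throwing, which hides the problem rather than fixing it.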

  • @darrennew8211
    @darrennew8211 2 months ago +1

    I'm rocking this series. This is such fun info.

  • @JamieBainbridge
    @JamieBainbridge 2 months ago

    You are great at explaining difficult things. I try to follow a similar structure with my work stuff - history (why we are here), and then build knowledge a step at a time to combine steps, bring the audience along with you.

  • @lostcarpark
    @lostcarpark 2 months ago

    I thought I knew Unicode pretty well, but I learned a lot from your video.

  • @jonduke4472
    @jonduke4472 2 months ago

    Love the coherent timeline for all of this.
    Work in technical manual generation and kinda just absorbed most of what this series covers over the years. Thanks

  • @DCaseyTucker
    @DCaseyTucker 2 months ago +1

    I love how just after 8:37 when he says "combining" it sounds like he's speaking in Zalgo

  • @viccie211
    @viccie211 2 months ago

    Thanks for these amazing insightful videos Dylan. I found you through recordings of your NDC talks and am enjoying this series very much :)

  • @leyasep5919
    @leyasep5919 2 months ago

    I was waiting for this video and it's awesome! I hope there will be more on this very subject, as I would like to dive a bit in custom Unicode implementations.

  • @Wolfeur
    @Wolfeur 2 months ago +5

    Tony the Pony is not a reference I was expecting to see pop up here

  • @Scott-i9v2s
    @Scott-i9v2s 2 months ago

    I remember a Dutch colleague having a not-Dutch family-name that at 1st sight seemed to consist of 2 words. But it was actually a single word where the space was what now would be represented with a non-breaking space ( ). This broke quite a few Dutch databases back in the 1970s. I wish that I remembered which language used such things...

  • @conodigrom
    @conodigrom 2 months ago +6

    Yeah, confront their statement to the UTF versions...and most annoyingly of all, variable length encodings. Thank you Unicode.

  • @brynyard
    @brynyard 2 months ago

    Booking hotels and checking in in the US is quite "interesting" with a surname like Øvergård...
    Some booking systems simply won't let you book, some use "internationalized" variants (æ => ae, ø => oe, å => aa, and capitalization becomes a mystery), some do a simple replace (a,o,a), some simply remove them, and some "eat" their neighboring characters (i.e. double UTF-8 decode), with various results and often straight-out crashes.
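
Two of the strategies listed above can be sketched as follows. The mapping table is a hypothetical example of the "internationalized" approach, not the rule set of any particular booking system.

```javascript
const surname = "\u00D8verg\u00E5rd"; // Øvergård

// "Simple replace" approach: decompose, then drop combining marks.
// å decomposes to a + combining ring, but Ø has no decomposition and survives.
const stripped = surname.normalize("NFD").replace(/\p{M}/gu, "");
console.log(stripped); // Øvergard

// "Internationalized" approach: an explicit, illustrative mapping table.
const table = {
  "\u00C6": "Ae", "\u00E6": "ae", // Æ æ
  "\u00D8": "Oe", "\u00F8": "oe", // Ø ø
  "\u00C5": "Aa", "\u00E5": "aa", // Å å
};
const transliterated = surname
  .normalize("NFC")
  .replace(/[\u00C6\u00E6\u00D8\u00F8\u00C5\u00E5]/g, (ch) => table[ch]);
console.log(transliterated); // Oevergaard
```

Stripping combining marks is a common shortcut, and the surviving Ø shows why it gives inconsistent results across letters of the same alphabet.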

  • @WilliamHostman
    @WilliamHostman 2 months ago +1

    Diaereses were used in American English to indicate a separate pronunciation of the second of two adjacent vowels. They're mostly used by a couple of snooty magazines these days, but I've seen them in handwritten texts from the late 19th C. (government records, to be specific), before the magazines in question first entered circulation. In a few typed docs it was an overstruck double quote, usually over o or e, for separation. I've also seen an acute accent used in handwriting, and the single quote as an overstrike in typed documents from the same era (again, government records). I've seen coordinate, co-ordinate, coördinate, and coórdinate. Gotta love the lack of standardization... but it's wrong to claim English doesn't use the diaeresis; it's more correct to say it doesn't use it anymore. The same goes for the ash (Ææ), oek (Œœ), and thorn (Þþ). Eth/Edh (Ðð) was also used in a few dialects. In the times when there was a distinction, thorn (þ) was used for the soft th as in with, thick, or thin, and eth (ð) for the voiced th, as in the, thee, this, or that.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 2 months ago

      Afaik it's also used in some other languages for writing certain names, like Chloë

    • @mandolinic
      @mandolinic 2 months ago

      I was just saying the same thing to my friend Noël yesterday.

  • @DragoniteSpam
    @DragoniteSpam 2 months ago +1

    There's a blog post somewhere on the Internet called "falsehoods programmers believe about names," which would probably be a fun (for some definition of "fun") place to go after this. (◡‿◡✿)

    • @tokeivo
      @tokeivo 2 months ago +1

      There's also a great talk by some guy - I'm terrible at names, and saw him live at a conference, sorry - talking about right-to-left writing.
      It has much of the same energy as the old "names" blog post. (Spoilers: It boils down to "People have at least 1 name, and they include at least 1 character. You can't make more assumptions than that if you want to include everyone")

  • @LyleAshbaugh
    @LyleAshbaugh 2 months ago

    Looking forward to a video about UTF-7, UTF-8, UTF-16 (LE & BE)

  • @garethlagerwall
    @garethlagerwall 9 days ago

    Oh no! It's Tony the Pony! I feel like there is more of a story to this...

  • @Scott-i9v2s
    @Scott-i9v2s 2 months ago

    @DylanBeattie Serious question about the digits 0-9 (as we know them in the languages that use THAT version of digits):
    Are they (in Unicode) defined for lowercase or defined for uppercase or defined for neither or defined for both or defined as something else? (Where 'lowercase' & 'uppercase' are meant as they are used in said languages.)
    (As I write this, I am definitely aware of the question NOT being properly stated...)

  • @RoamingAdhocrat
    @RoamingAdhocrat 2 months ago

    ah I wish I was going to NDC. I just wanted to tell you both, good luck. We're all counting on you.

  • @dj196301
    @dj196301 2 months ago +1

    Enlightening and entertaining!
    There seems to be some amateur thinking going on with some (okay, many) code points. I want to write 101 in circled numbers. No can do. 0x2776 is circled 1 and 0x277F is circled... 10? It only specifies 9 digits!
    Thank you for this great content!

    • @cmyk8964
      @cmyk8964 2 months ago

      You can blame some very old Japanese standards for that. Shift-JIS, an old Japanese character encoding, had circled numbers for list items from 1 to 10, as well as pre-composed Roman numerals from 1 to 12 and other weirdness.

    • @brighthades5968
      @brighthades5968 2 months ago

      0 exists at 0x24ff but it's still weird that it isn't found with all the other numbers 1-10

    • @dj196301
      @dj196301 2 months ago +1

      @@brighthades5968 So it is! Thank you--and I agree it's oddly placed. I'll have to hunt for the other circled zeros for the others (0x2780, 0x278a) but, truth be told, I'm just splittin' hairs over here!

    • @brighthades5968
      @brighthades5968 2 months ago

      @@dj196301 hmm i checked in "unicode character inspector" and it turns out there are only two circled zeros! (normal and negative.) kinda makes sense because a sans-serif and a serif zero look the same (almost) so they didn't bother with making a sans-serif zero (probably in some old jp codepage)

  • @petergerdes1094
    @petergerdes1094 2 months ago +5

    Has anyone ever published math using those Unicode conventions?
    Mathematicians tend to just use TeX, either for publishing or by writing the TeX commands in plain text to be clear.

    • @timseguine2
      @timseguine2 2 months ago +6

      math usually requires typesetting to be legible, not just text (regardless of available symbols). So the Unicode characters can be useful for typesetting math, but are not a replacement for the typesetting.

    • @chaosflaws
      @chaosflaws 2 months ago +4

      The unicode-math package uses Unicode symbols if a font containing glyphs for the relevant code points is available, and I've seen it recommended on the internet. So Unicode has an impact even if you use (certain derivatives of) TeX.

    • @petergerdes1094
      @petergerdes1094 2 months ago

      @@chaosflaws It does? For things like mathbb? Are you telling me that the STIX fonts actually use a standard encoding instead of random tex fuckery for things like mathbb? Finally? Thank God if so!
      I'm sure it does for standard Greek characters but mathbb and other weird characters have always had weird TeX representations and I assumed they still did because I wasn't seeing that when I tried to copy mathbb symbols from a PDF built with lualatex and unicode-math with STIX but maybe it's a package I was loading or something.
      But unless Tex actually bothers to use the unicode to represent mathbb etc symbols it seems like a waste (but I really hope they do...now or in the future).

    • @petergerdes1094
      @petergerdes1094 2 months ago

      @@timseguine2 Certainly, but unless someone actually uses those characters it's kinda pointless. I assumed Tex still wasn't doing that but see discussion below. If that's true it's great.

    • @chaosflaws
      @chaosflaws 2 months ago

      @@petergerdes1094 You got me curious. To be clear, I am talking about Lua/XeLaTeX's unicode-math package, which does *not* use Stix fonts. To my understanding, the file "unicode-math-table.tex" in that package provides a mapping from LaTeX math commands to unicode code points.

  • @AzureFlash
    @AzureFlash 2 months ago

    Bröther I need some lööps

  • @niclash
    @niclash 1 month ago

    I am sure it was a small effort for Magnus to travel to USA. Try China, where the border people are uninterested, unknowledgeable and unreasonable... Yeah, almost missed the flight, but an hour later, after several supervisors' input, eventually they let me through.
    Speaking of China: don't have too many characters in your full name. Many banks take 10-20 characters maximum in their systems, and within the same bank, different lengths apply. Then combine that with different handling of Swedish characters (a, ae and ä existed within the bank in my case), and I think I had about 10 different variants of my "full name" within that bank. And just about every interaction with the bank ended up being a crime novel, either treating me like I tried to defraud the account, or that time when it stopped accepting salary payouts when they upgraded the system, since the new TT system accepted a different number of characters in names than the previous system, and it took a while to work that out and get my employer to change that.

  • @bart1439
    @bart1439 2 months ago +1

    Umlauts do look cül

  • @HenryLoenwind
    @HenryLoenwind 2 months ago

    Looking forward to all the examples of programs breaking because upper(i) != I...
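
The classic case behind `upper(i) != I` is Turkish, where dotted and dotless i are separate letters. A minimal sketch follows; `sameIgnoringCase` is a hypothetical helper, and the locale-specific results assume the runtime ships Turkish case-mapping support, which standard Node.js and browser builds do.

```javascript
// Default (locale-independent) mapping: i <-> I.
console.log("i".toUpperCase()); // I

// Turkish has dotted and dotless i as separate letters.
console.log("i".toLocaleUpperCase("tr")); // İ (U+0130)
console.log("I".toLocaleLowerCase("tr")); // ı (U+0131)

// A naive case-insensitive comparison therefore depends on the locale.
const sameIgnoringCase = (a, b, locale) =>
  a.toLocaleUpperCase(locale) === b.toLocaleUpperCase(locale);
console.log(sameIgnoringCase("file", "FILE", "en")); // true
console.log(sameIgnoringCase("file", "FILE", "tr")); // false

// Case mapping can also change length: German sharp s uppercases to "SS".
console.log("\u00DF".toUpperCase()); // SS
```

Code that uppercases identifiers, header names, or file extensions using the user's locale is where this tends to surface.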

  • @vibujicilemi
    @vibujicilemi 2 months ago

    Mötley Crüe, Mötley Crüe!

    • @AubreyBarnard
      @AubreyBarnard 2 months ago +1

      I hope that you made the bits of those be the two different encodings in the video (even if you are just chanting)!

    • @vibujicilemi
      @vibujicilemi 2 months ago

      @@AubreyBarnard I didn't but I'll do it now, let's see if YouTube normalizes them.
      Mötley Crüe, Mötley Crüe!

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 2 months ago

      Side track: While Mötley Crüe afaik doesn't really mean anything in other languages, the band called Tröja is really funny for anyone speaking Swedish as it just means sweatshirt :D :D

  • @JW-uC
    @JW-uC 1 month ago

    There is still one outstanding issue with unicode... while all the letters of every language [will eventually] be within it, the vast majority of fonts fail to include every single letter and this is then made even worse by the fact that some fonts will contain only partial supersets of common other language letters/glyphs. @R.B.'s comment only shows some of the Korean characters on my setup (debian/firefox) with others replaced by the square blocks with 4 hex numbers (such as D5 5C, AD 6D, etc.). So while the encoding might be there, it's not just EBCDIC that has issues... the font(1) used to communicate might also not be compatible with GDPR even if the backend database is. Also, as a brief note, the AS/400 can happily store UTF8/16/32 and had been able to store variable character length code pages (Katakana characters for example) long before the idea was even a twinkle in a PC OS writer's eye; that was much less a hardware issue and much more a "we didn't think of that" application software issue.
    (1) I have no idea if font handling has "fall back" fonts for extended character sets (like it does if a font is completely missing) or if it's a requirement that the individual font must include every single additional character to allow correct rendering because I can't see that happening for the vast majority of fonts.

    • @JW-uC
      @JW-uC 1 month ago

      OK, let me just take back the part about fonts/unicode... it seems that there is some kind of fallback process going on in linux. I added *every* font available from the repositories for debian and now the characters that were showing as hex boxes are now showing correctly (but are obviously from a different font than the one selected for rendering webpages). Somehow the system (it seems to also work on the console) knows there is a font that will show the characters and it's using them instead of the selected font which didn't have those characters. Ok, that is seriously cool.

  • @James2210
    @James2210 2 months ago

    I'm going to call it mutley cruh from now on, lol

  • @xugro
    @xugro 2 months ago

    the letters üe next to each other look soo cursed

  • @LeeSmith-cf1vo
    @LeeSmith-cf1vo 2 months ago

    Huh. I didn't realise únicode was quite that old.
    Makes me wonder why the early web wasn't in unicode!

    • @R.B.
      @R.B. 2 months ago

      Well the Internet existed long before the Web, but the services back then were mostly dealing with 7-bit text or in the case of FTP, transferring a binary file which already contained the relevant code page text as though it were written locally. Of course that's expecting that the receiving machine was set up with the right code page as well, but it isn't all that different from the sneaker net at that point. The Consortium recognized the impending problem and sought to fix it before the Web was invented, but that took time and no computers would even know what to do with Unicode until a few years after the Web was commercialized. The fact that there are some rough edges doesn't convey how amazing it is that we've been making the transition with no significant problems.

    • @nickwallette6201
      @nickwallette6201 2 months ago

      UTF-8 is a godsend.

  • @stephenspackman5573
    @stephenspackman5573 2 months ago

    This _is_ Spın̈al Tap.

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 1 month ago

    11:15 0x10FFFF is excluded

  • @ABaumstumpf
    @ABaumstumpf 2 months ago

    For air traffic - that is also a result of the FAI and other rather ancient organisations, and the aviation industry being too hung up on their custom software so they can claim certain things like launches are only available if you book a trip directly with their portal. Flights still use 3 letter abbreviations cause the ancient API for air traffic is based on "human readable" text queries where each airport is only 3 letters and only ASCII is supported. (Of course there has been an open standard for many years already but shortly after airlines started adopting it they started to abandon it again).
    And Unicode TRIED to create a universal encoding - and they failed spectacularly several times as can be seen for example with Windows: The only OS at the time that was built around using unicode.... and then unicode completely abandoned that encoding (UCS2) and then UTF8/16/32 came along and they still have their fair share of problems.
    Normalisation would be good - if there was ONE way to normalise text - but there are multiple which means that any and all optimisations for text-conversions or comparison? Yeah they need to convert the text - potentially multiple times.
    But unicode is the best we have and very likely so close to being "perfect" for what we need that there will be no replacement in the foreseeable future: It works and is well known. That is all it takes to stick around.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 2 months ago

      On the other hand, it seems like air traffic is the only thing that has an almost worldwide compatible booking system.
      A classic problem and a classic complaint is that railways and other public transit lack this. There are various work arounds, like different seats reserved for different booking systems for international trains, but it's not great.
      Three letters of ASCII isn't that bad for airports though.
      Btw at least in some places the railways use similar abbreviations. In Sweden there is at least one case where the abbreviation reflects an older spelling, Ck is the abbreviation/signature for the station in the city called Karlskrona, where a spelling reform changed it from Carl to Karl.

    • @ABaumstumpf
      @ABaumstumpf 2 months ago

      @@Thesecret101-te1lm "On the other hand, it seems like air traffic is the only thing that has an almost worldwide compatible booking system."
      It just never works correctly. It worked "ok" some 40 years ago but it is one of the main sources for all the large delays flights have - cause the system is fundamentally flawed.
      There are entire sections of the industry tasked with creating workarounds for those problems.
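
On the normalisation point raised at the top of this thread: Unicode defines four normalization forms, and they can give four different results for the same input. A small sketch with an illustrative string, chosen here because it contains both a ligature and an accented letter:

```javascript
// "fiancé" written with the fi ligature (U+FB01) and a precomposed é (U+00E9).
const text = "\uFB01anc\u00E9";

// Record the length in UTF-16 code units for each normalization form.
const lengths = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  lengths[form] = text.normalize(form).length;
}
console.log(lengths);
// { NFC: 5, NFD: 6, NFKC: 6, NFKD: 7 }

// Two strings compare reliably only once both use the same form.
const plain = "fiance\u0301"; // plain letters plus a combining acute accent
console.log(text === plain); // false
console.log(text.normalize("NFKC") === plain.normalize("NFKC")); // true
```

The canonical forms (NFC, NFD) keep the ligature, while the compatibility forms (NFKC, NFKD) replace it with plain letters, so the choice of form depends on whether that distinction matters to the application.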

  • @phyphor
    @phyphor 2 months ago +1

    One thing I think Unicode did wrong was letting things that look the same be created with different code points. I understand why they did it, I just think it was the wrong decision. The fact that we have to be wary of links to sites that "clearly" have a safe name is just one reason, let alone the whole kludgy comparison mess.

    • @Tsudico
      @Tsudico 2 months ago +1

      If my understanding is correct, then I agree with why they did it. It is a way to allow languages to change without requiring Unicode to change at the same time. I think it is far better to future-proof your design choices, especially if those choices might have wide or far reaching impacts.

    • @lubricustheslippery5028
      @lubricustheslippery5028 2 months ago

      It's the URL standard that is in the wrong. They should never have allowed full Unicode in URLs or e-mail addresses. The same goes for usernames everywhere. There are just too many safety issues around how different the characters need to be, and the Unicode standard doesn't specify the font.
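
The look-alike problem discussed in this thread is not solved by normalization, since the characters involved are distinct letters in different scripts. One partial mitigation is to flag names that mix scripts. The sketch below is illustrative only: `scriptsUsed`, the three-script list, and the sample names are assumptions, and real browsers apply more elaborate rules.

```javascript
// Illustrative check: which of a few scripts appear in a string?
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

const scriptsUsed = (s) =>
  Object.keys(SCRIPTS).filter((name) => SCRIPTS[name].test(s));

const genuine = "apple.com";
const spoofed = "\u0430pple.com"; // first letter is CYRILLIC SMALL LETTER A

console.log(genuine === spoofed); // false
// Normalization does not merge look-alikes from different scripts.
console.log(genuine === spoofed.normalize("NFKC")); // false
console.log(scriptsUsed(genuine)); // [ 'Latin' ]
console.log(scriptsUsed(spoofed)); // [ 'Latin', 'Cyrillic' ]
```

A mixed-script result is a signal worth surfacing to the user, not proof of an attack, because some legitimate names do combine scripts.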

  • @JBC518
    @JBC518 2 months ago

    The closed caption for "Ukrainian" is translated to "Russian" for CC English (Great Britain)!?

    • @TheGeoffable
      @TheGeoffable 2 months ago

      Huh, yeah, and "Polish" becomes "Swedish", and "Japanese" is dropped entirely. Odd.

    • @DylanBeattie
      @DylanBeattie 2 months ago +7

      No, that’s me using the wrong draft of the script when I uploaded the subtitle file. Oops. Thanks for the heads-up.

  • @allenng2348
    @allenng2348 2 months ago +1

    The important question: where did you get that shirt? I must have it.

    • @DylanBeattie
      @DylanBeattie  2 months ago

      Custom print, my own design. You think there's a market for them?

    • @allenng2348
      @allenng2348 2 months ago

      @@DylanBeattie At least a market of one!

  • @hlynurjensson
    @hlynurjensson 2 months ago +2

    The Icelandic letter ó is correctly stored as a single character in Windows (U+00F3) while in macOS it's incorrectly stored as an o (U+006F) with an accent (U+0301). Same with the other vowels.
    I've had problems because of this with files created on MacOS with Icelandic characters in the filename and then transferred to a Windows system and my software can't find the files because ó != ó.
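A minimal Python sketch of the usual workaround, assuming the file names are available as strings: normalize both sides to the same form (NFC here) before comparing. The helper names `same_name` and `find_name` are hypothetical.

```python
import unicodedata

precomposed = "\u00f3"  # ó as one code point (U+00F3)
decomposed = "o\u0301"  # o (U+006F) followed by COMBINING ACUTE ACCENT (U+0301)

def same_name(a, b):
    """Compare two file names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def find_name(wanted, names):
    """Return the stored spelling in `names` that is canonically equal to `wanted`, or None."""
    for name in names:
        if same_name(wanted, name):
            return name
    return None

print(precomposed == decomposed)           # raw comparison fails
print(same_name(precomposed, decomposed))  # normalized comparison succeeds
```

Returning the stored spelling rather than the normalized one matters when the result is passed back to the file system, which may only recognize the bytes it stored.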

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 2 months ago

      Oh, I feel your pain.
      Here in Sweden we also suffer from a difference in how Apple and how everyone else (Windows, Linux, Android) does things.
      It shows up now and then in surprising places. For example, someone posts on Instagram using an iPhone and has that account linked so the posts show up on Facebook, and then someone else views that post using a web browser on a computer running Windows, and the åäö becomes aao. (This was eventually fixed, but the fact that it was a problem at all is sad.)
      If there is an official standard for things like this in Icelandic I congratulate you. Afaik no official standardization organization has ever defined which way is the correct way for åäöÅÄÖ in Swedish. I wish that there at least were legal requirements to do this a specific way for products somehow financed by the public sector (direct purchases but also grants). Of course exceptions would be ok for systems not primarily intended for data processing, like for example a computer controlled CNC machine, and also for legacy systems not intended for processing new data.
      Speaking of wanting stronger government regulations: I really wish that a similar regulation were in place that forced OS/software vendors to treat week numbers as equally important as dates and month numbers. In particular I wish it were possible to display time, day of week and week number instead of time, date and month in the systray in Windows.
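One plausible way åäö can turn into aao (an assumption; the actual cause of the Instagram/Facebook bug is not stated here) is text arriving in decomposed form and the combining marks being dropped somewhere along the way. A small Python sketch, with a hypothetical `strip_marks` helper that mimics the lossy step:

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop every combining mark (a lossy step)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def recompose(text):
    """Recompose to NFC, which keeps the diacritics intact."""
    return unicodedata.normalize("NFC", text)

print(strip_marks("åäö"))                  # marks lost
print(recompose("a\u030a") == "\u00e5")    # a + combining ring becomes å
```

A pipeline that recomposes to NFC before display avoids the loss; one that cannot render or store combining marks produces the bare letters.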

  • @user-tc2ky6fg2o
    @user-tc2ky6fg2o 2 months ago

    I think ASCII was chaotic even without code pages. More specifically, using ASCII in computers as a representation of text was an imperfect but probably necessary choice.
    The chaos comes from the fact that ASCII was designed not just as a representation of printable characters, which Unicode does better, but also as a transmission protocol. There are transmission-control codes and some terminal-control codes in ASCII as well.
    Choosing #0 as the representation of '0', #1 for '1', and so on would be obvious for a completely new code system, if compatibility with legacy (virtually all) systems weren't necessary. But we know that #0 has a special meaning in C, and sending 7 or 8 low bits over a serial line and trusting that they mean '0' is probably not a good idea.
    But if we ignore the characteristics of data transmission and look purely at representation, computers reliably handle zero (all bits low) as #0.
    Imagine how much energy has gone into converting ASCII numeric values to actual binary values and back over the history of computing because of that ASCII design decision. Billions and billions of computers handle numeric text millions of times, and each has to add or subtract an offset every time.
    Not to mention the code page chaos we had, the unsolved problems, the programming effort, and the user effort. (I just noticed you have a video about the ASCII chaos in the description.)
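The offset in question is small but ever-present. A rough Python sketch of the conversion in both directions (the function names are made up for illustration; real code would just use `int()` and `str()`):

```python
def digits_to_int(text):
    """Convert a string of ASCII digits to an integer, one offset subtraction per digit."""
    value = 0
    for ch in text:
        if not "0" <= ch <= "9":
            raise ValueError(f"not an ASCII digit: {ch!r}")
        value = value * 10 + (ord(ch) - ord("0"))  # '0' is code 0x30, not 0
    return value

def int_to_digits(value):
    """Convert a non-negative integer back to ASCII digits, adding the offset each time."""
    if value < 0:
        raise ValueError("only non-negative values are handled in this sketch")
    if value == 0:
        return "0"
    out = []
    while value:
        value, digit = divmod(value, 10)
        out.append(chr(ord("0") + digit))
    return "".join(reversed(out))

print(digits_to_int("1234"), int_to_digits(1234))
```

Note that most of the work is the multiplication and division by ten, which would still be needed even if '0' were code 0.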

    • @ABaumstumpf
      @ABaumstumpf 2 months ago +1

      "Choosing #0 as a representation of the '0', #1 for the '1', and so on values would be obvious for a completely new code system"
      And it would fail in so many ways, because while that might seem logical to the average human, as an encoding it would be terrible.

    • @lubricustheslippery5028
      @lubricustheslippery5028 2 months ago

      Null ≠ 0 is true in so many places, not just C. And how would you represent stuff like 10? Strings and numbers can't be represented in the same way in binary, so I don't see why '0' should be 0. For floating point values it gets complicated, especially as you can't represent the same floating point numbers in binary and decimal. You definitely have other non-standard problems such as line breaks: Windows "\r\n", Unix "\n", old Mac "\r". ASCII is rarely a problem, and that it's 7 bit is a blessing, so it's possible to extend.
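The three line-break conventions mentioned (CR LF on Windows, LF on Unix, CR on classic Mac OS) can be folded into one with two replacements. A minimal Python sketch; `normalize_newlines` is a made-up name:

```python
def normalize_newlines(text):
    """Convert CR LF and lone CR line breaks to LF."""
    # Replace CR LF first so the pair does not turn into two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = "windows\r\nclassic mac\runix\n"
print(normalize_newlines(mixed).split("\n"))
```

The order of the two replacements is the only subtle part: handling the lone CR first would split every Windows line break in two.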

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 2 months ago

      Most legacy encodings at least have the four low bits set to 0-9 for the characters 0-9. But there are exceptions, like the Sinclair ZX81 which on the other hand has the alphabet directly next to the digits, resulting in it not needing any extra processing for converting between binary and hex as compared to between binary and decimal.
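The low-four-bits point can be checked for two encodings that Python ships codecs for: ASCII, and the EBCDIC variant `cp037`. This sketch says nothing about the ZX81 character set, which Python has no codec for; `digit_value` is a hypothetical helper.

```python
def digit_value(byte):
    """Return the low four bits of an encoded digit character."""
    return byte & 0x0F

ascii_digits = "0123456789".encode("ascii")   # bytes 0x30..0x39
ebcdic_digits = "0123456789".encode("cp037")  # bytes 0xF0..0xF9

print([digit_value(b) for b in ascii_digits])
print([digit_value(b) for b in ebcdic_digits])
```

In both encodings the high four bits act as a fixed prefix, so masking gives the digit's value without a subtraction.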

  • @redoktopus3047
    @redoktopus3047 2 months ago +7

    my dream job would be to work at unicode.
    i think the only culturally non-neutral thing they've done regarding the encoding is that they have not added a slot for all human body parts.
    it's not inherently sexual to have a number associated with "penis emoji" or "breast emoji", but you can tell it's mostly americans doing it.
    maybe one day we'll get it. but unicode is such a success that many people don't even know it exists.

    • @tiagorodrigues3730
      @tiagorodrigues3730 2 months ago

      Well, I don't know about emoji, but we do have “breast hieroglyph” (U+13091) and “penis hieroglyph” (U+130B8) available for use if someone cares for it (of course, font availability is a bitch).

    • @Kenionatus
      @Kenionatus 2 months ago

      There are a bunch of Japanese people upset about Han unification (the combining of characters from different languages that use Han-origin characters, much like letters are shared between English and German). It just wasn't possible to be neutral there. Either you upset the nationalists and the people with font chaos, or you upset the... internationalists(?) and people worrying about file size.

    • @maybenat
      @maybenat 2 months ago

      Han unification does annoy me a bit whenever I input Japanese in HTML, because my browser renders the Simplified Chinese character variants until I add in a lang="ja" somewhere lol. It is kind of funny how you get multiple font-like variants of Latin characters in Unicode, but for Chinese characters it's apparently not doable. I get some of the reasoning, but it's not exactly ideal.

    • @redoktopus3047
      @redoktopus3047 2 months ago

      @@maybenat i didn't know about this! thanks for the information :D

    • @nickwallette6201
      @nickwallette6201 2 months ago

      But we do have at least one of those. Have you not seen the eggplant emoji? ;-)

  • @mattiviljanen8109
    @mattiviljanen8109 2 months ago +1

    Kompatibility Composition? Morͯ͠tal Ko̻m⃘̝b᷿͇⃜a̦ͥ̆͋t͍ͮ̂ͪͅ


  • @mrmimeisfunny
    @mrmimeisfunny 2 months ago +1

    10:54 𝓣𝓱𝓪𝓽 𝔀𝓪𝓼 𝓽𝓱𝓮 𝓲𝓷𝓽𝓮𝓷𝓽𝓲𝓸𝓷 𝓪𝓽 𝓵𝓮𝓪𝓼𝓽. 𝓘𝓽'𝓼 𝓶𝓸𝓼𝓽𝓵𝔂 𝓾𝓼𝓮𝓭 𝓽𝓸 𝓪𝓭𝓭 𝓯𝓵𝓪𝓲𝓻 𝓽𝓸 𝓹𝓮𝓸𝓹𝓵𝓮'𝓼 𝓼𝓸𝓬𝓲𝓪𝓵 𝓶𝓮𝓭𝓲𝓪 𝓫𝓲𝓸𝓼.
    ᏗᏖ ᏝᏋᏗᏕᏖ ᎥᏖ'Ꮥ ᏰᏋᏖᏖᏋᏒ ᏖᏂᏗᏁ ᏖᏂᏋ ᎮᏋᎧᎮᏝᏋ ᏇᏂᎧ ᏬᏕᏋ Ꮧ ᏁᎧᏁ ᏝᏗᏖᎥᏁ ᏕፈᏒᎥᎮᏖ ᏕᏬፈᏂ ᏗᏕ ፈᏂᏋᏒᎧᏦᏋᏋ. (Please stop doing it, it's ugly, breaks screen readers and makes it impossible to read for people who can read the script you used.)
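The difference between the two styles shows up under compatibility normalization. A short Python sketch, assuming NFKC is the form applied (the `plain_text` name is made up): the mathematical script letters fold back to plain Latin, while Cherokee letters used as look-alikes are ordinary letters of another script and stay as they are.

```python
import unicodedata

# "That" written with MATHEMATICAL BOLD SCRIPT letters.
fancy = "\U0001D4E3\U0001D4F1\U0001D4EA\U0001D4FD"
# Two letters from the Cherokee block, used here only as look-alikes.
cherokee = "\u13D7\u13D6"

def plain_text(text):
    """Apply compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

print(plain_text(fancy))                 # folds to plain Latin letters
print(plain_text(cherokee) == cherokee)  # unchanged by NFKC
```

This is why search and screen readers can often recover the first style but not the second.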

    • @brighthades5968
      @brighthades5968 2 months ago

      i don't read cherokee and i struggled hard with reading that so…