Plain Text - Dylan Beattie - NDC Oslo 2021

  • Published on 29 Aug 2024
  • Software is complicated. Machine learning, microservice architectures, message queues... every few months there's another revolutionary idea to consider, another framework to learn. And underneath so many of these amazing ideas and abstractions is text. When you work in software, you spend your life working with text. Some of those text files are source code, some are configuration files, some of them are documentation. Editors, revision control systems, programming languages - everything from C# and HTML to Git and VS Code is based on the idea of "plain text files". But... what if I told you there's no such thing?
    When we say something is a "plain text file", we're relying on a huge number of assumptions - about operating systems, editors, file formats, language, culture, history... and, most of the time, that's OK. But when it goes wrong, "plain text" can lead to some of the weirdest bugs you've ever seen... why is there Chinese in the event logs? Why is the city of Aarhus in the wrong place? And why does Magnus Mårtensson always have trouble getting into the USA? Join Dylan Beattie for a fascinating look into the hidden world of text files - from the history of mechanical teletypes to encodings, collations and code pages. We'll look at some memorable bugs, some golden rules for working with plain text - and we'll even find out the story behind the mysterious phrase "pike matchbox" and what it has to do with driving in Belarus.
    Check out more of our featured speakers and talks at
    ndcconferences...
    ndcoslo.com/

Comments • 553

  • @ElfinaAshfield
    @ElfinaAshfield 2 years ago +1028

    There's a text encoding "joke" here in China. MSVC debug builds fill uninitialized stack memory with 0xCC bytes, so a run of them decodes in GBK as "烫烫烫烫" (GBK uses two bytes for each of these characters). "烫" means hot. So when the program is "hot", it means you've hit a wild pointer.
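
A minimal Python sketch of the decoding described above, using the standard `gbk` codec. The fill pattern is taken from the comment; nothing here touches a real debugger.

```python
# Four bytes of the 0xCC fill pattern described in the comment.
fill = bytes.fromhex("cccccccc")

# GBK pairs the bytes up: each 0xCC 0xCC pair is one character.
decoded = fill.decode("gbk")
print(decoded, len(decoded))
```

Four fill bytes give two characters, so the four-character string in the comment corresponds to eight bytes of fill.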

    • @jordanmcconnon6214
      @jordanmcconnon6214 2 years ago +54

      That's funny as fuck

    • @yuriythebest
      @yuriythebest 2 years ago +39

      that's actually a very lucky and convenient meaning, good thing it's not something like "gardening"

    • @pihungliu35
      @pihungliu35 2 years ago +59

      Coincidentally, 0xCCCC in Big5 encoding is an archaic character 昍, which is composed of two suns (日 is "sun"), and roughly means bright. Which means this "uninitialized" memory reads out as "bright" in a Big5 environment.

    • @Lodinn
      @Lodinn 2 years ago +18

      @@yuriythebest And "gardening" repeated two times would also mean "river bank", and three times "past-noon murder", as it does.

    • @CottidaeSEA
      @CottidaeSEA 2 years ago +1

      @@Lodinn Well... there's also three women. I'll let you figure this one out yourself; 姦
      Although the radical for woman is kind of wild in general.

  • @NicolasChanCSY
    @NicolasChanCSY 2 years ago +267

    In case anyone is interested, the Chinese characters at 38:50 mean the following:
    䐀 (U+4400): to dismember body of the livestock
    攀 (U+6500): to climb
    氀 (U+6C00): a category of textile and fabrics
    琀 (U+7400): gems/pearls/jade formerly put into the mouth of a corpse (note: this seems to be related to an ancient Chinese tradition/belief, that putting something, likely valuable, into the mouth of the corpse will lead the deceased to a better afterlife.)
    Only 攀 is still frequently used in modern Chinese.
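
All four code points listed above end in 00, which is what ASCII letters look like when UTF-16 text is read with the wrong byte order. A minimal sketch, assuming that is the mechanism behind the characters on the slide; the input string `"Delt"` is a hypothetical example chosen only so that the code points match the list.

```python
# Hypothetical ASCII input: D=0x44, e=0x65, l=0x6C, t=0x74.
text = "Delt"

# Write as little-endian UTF-16 (44 00 65 00 ...), then misread as big-endian.
misread = text.encode("utf-16-le").decode("utf-16-be")

for ch in misread:
    print(f"U+{ord(ch):04X} {ch}")
```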

    • @Kenionatus
      @Kenionatus 2 years ago +10

      The first one sounds a lot like "butcher".

    • @NicolasChanCSY
      @NicolasChanCSY 2 years ago +6

      @@Kenionatus Yes because they are referring to similar actions, if you are using it and "butcher" as verbs.

    • @KaneYork
      @KaneYork 2 years ago

      Is that supposed to be "formally" like, as part of a formal procedure, or formerly, as in the gems were put in in the past?

    • @NicolasChanCSY
      @NicolasChanCSY 2 years ago +2

      @@KaneYork I am not an expert so I am not sure about the answer to your question.
      Formerly or not, only god knows what happened to the billions of people who have passed away. Some articles on the web claim that a queen in the Qing dynasty had some gems or pearls placed in her mouth after her death. But that was near the end of the 19th century. I personally don't think this tradition is still a thing in the modern world, at least to the best of my knowledge.
      Formal or not, I think it is logical to deduce that funerals can take many forms and are largely affected by the social class and wealth of the deceased. Only wealthy families could afford gems and pearls in the past, right? As in many cultures, this practice originates from the superstition or belief that putting valuable items next to the deceased allows the deceased to have a better afterlife or second life. It is probably not part of any formal procedure, though I would like to reiterate that I am not 100% sure, if you are looking for an academically accurate answer.

    • @Spookspek
      @Spookspek 2 years ago +5

      Butcher climb textile climb amulet climb!

  • @_fudgepop01
    @_fudgepop01 2 years ago +341

    Omg it’s the “The Art of Code” guy!! I can tell I’m going to enjoy this one too~

    • @KarolHaltenberger
      @KarolHaltenberger 2 years ago +4

      @@WhiteStripesStripiestFan I'm starting to love clay tablets :)

    • @MarcCastellsBallesta
      @MarcCastellsBallesta 2 years ago +1

      Every single one of his talks is awesome.

    • @kayakMike1000
      @kayakMike1000 2 years ago +1

      Yeah, that talk was FANTASTIC

    • @RupertReynolds1962
      @RupertReynolds1962 2 years ago +5

      He made me a Rockstar developer.
      What more can I ask? :-)

    • @RupertReynolds1962
      @RupertReynolds1962 2 years ago +2

      And don't forget The Web That Never Was, another Dylan Beattie classic :-)

  • @FGB64
    @FGB64 2 years ago +437

    I started laughing when I heard "Chinese in the event logs". A few years back the company I worked for at the time received a whole bunch of data in XML files. No problem, we had an XML import tool for that. Except that the importer choked on every single file we tried it on, complaining about invalid characters. When I opened a file in a text editor, it was a mish-mash of valid XML with strings of Chinese text thrown in. So I opened the file in a hex editor. It had the characteristic "every other byte is 00" pattern that you see with UCS-2 or UTF-16. It had a valid BOM which matched the endianness of the characters, so the XML parser we were using should have had no problems reading it. The files looked perfect... until I finally noticed that not all characters were 2 bytes wide. CRLF sequences were encoded as 0d 0a instead of 00 0d 00 0a. Once I fixed that, both the text editor and XML parser were happy again.

    • @Gersberms
      @Gersberms 2 years ago +9

      What a mess... I still run into text that has been converted from UTF-8 to ASCII a lot, and you can't easily translate it back. You just get 3 ascii characters for every UTF character they intended to use and you just have to hunt them all down before it works. At least I think that's what it is...
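
A small sketch of what this kind of damage often turns out to be, assuming the text was UTF-8 that got decoded as Windows-1252 rather than literally converted to ASCII (the commenter is not sure of the exact cause, and neither is this example). In that case a three-byte UTF-8 character shows up as three characters, and the damage can sometimes be reversed.

```python
original = "€"  # encodes to three bytes in UTF-8: E2 82 AC

# Misread the UTF-8 bytes with a single-byte code page.
mojibake = original.encode("utf-8").decode("cp1252")

# Reverse the mistake: recover the bytes, then decode them correctly.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(len(mojibake), repaired == original)
```

The round trip only works while every byte survives the wrong decoding. Windows-1252 leaves a few byte values undefined, so some characters cannot be recovered this way.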

    • @MyBinaryLife
      @MyBinaryLife 2 years ago +2

      How did you fix it? Hope my asking doesn't cause any trauma-induced flashbacks.

    • @FGB64
      @FGB64 2 years ago +17

      @@MyBinaryLife That was the easy part. I wrote a small program to replace 0d 0a byte sequences with 00 0d 00 0a in the affected files.
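
The commenter's program is not shown, so the following is only a sketch of one way such a repair could look. It assumes big-endian UTF-16 (as the 00 0d 00 0a target implies) and walks the data two bytes at a time so that a 0d 0a pair is only rewritten when it sits where a code unit should start. The function name `repair_crlf` is hypothetical.

```python
import codecs

def repair_crlf(data: bytes) -> bytes:
    """Expand single-byte CR LF pairs inside UTF-16BE data to 00 0D 00 0A."""
    out = bytearray()
    i = 0
    while i < len(data):
        unit = data[i:i + 2]
        if unit == b"\r\n":
            # A raw CR LF where one 16-bit unit should be: widen it to two units.
            out += b"\x00\r\x00\n"
        else:
            out += unit
        i += 2
    return bytes(out)

# Build a broken sample: valid BOM and text, but a narrow CR LF in the middle.
broken = (codecs.BOM_UTF16_BE + "<a>".encode("utf-16-be")
          + b"\r\n" + "</a>".encode("utf-16-be"))
repaired = repair_crlf(broken)
print(repaired.decode("utf-16"))
```

One limitation of this sketch: the genuine character U+0D0A has the same two bytes in UTF-16BE and would be rewritten too, so it only suits files known not to contain it.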

    • @kiseitai2
      @kiseitai2 2 years ago +12

      To me, the Chinese characters in the logs hit home. I spotted a ransomware attack at one of our hospitals and all of the files had Chinese writing. Now, I knew a bunch of characters from studying Japanese in college. However, in retrospect and thanks to this presentation, I think I was looking at the same utf issue, which I mistook for actual Chinese which led me to believe we were under attack which ironically was a real attack…

    • @tekvax01
      @tekvax01 2 years ago

      i had a similar thing happen in the logs at my shop too.

  • @michaelklaus_de
    @michaelklaus_de 2 years ago +300

    This talk should be part of the curriculum in any IT related field! I can't count the times this information would have saved me from sleepless nights - either by knowing quicker how to deal with a problem, or by just giving peace of mind because you know that the inventors of this stuff were actually doing a great job.

    • @MarkUKInsects
      @MarkUKInsects 2 years ago +1

      Every tester certainly should - great way to find bugs

    • @richardokeefe7410
      @richardokeefe7410 2 years ago

      This talk should not be part of ANY curriculum because it has too many easily avoided factual errors.

    • @YT-Observer
      @YT-Observer 2 years ago

      @@richardokeefe7410 I noticed some of them

  • @akirachisaka9997
    @akirachisaka9997 2 years ago +76

    As an Asian person, yeah, "Plain Text" was never easy growing up.
    I remember as a kid, I kept needing to struggle with CJK stuff on computers.
    Like, first you need to get the encoding right. And of course there are multiple ways to encode characters in all three languages...
    Then, you got the encoding right, but it's still gibberish because you don't have the correct font.
    Then, you installed CJK fonts, but the fonts treat Japanese Kanji and Chinese characters the same way, yet use a different font for each character. So some of your characters are in Japanese fonts and some are in Chinese fonts...
    Yeah, it's pretty fun.

    • @theodiscusgaming3909
      @theodiscusgaming3909 2 years ago +3

      Unicode *still* treats kanji and hanzi as the same thing

    • @FlameRat_YehLon
      @FlameRat_YehLon 2 years ago +2

      Then you installed the font specifically for the correct region and it's in the writing convention of another region since it's not locally produced, and every character would look just a bit off?

    • @billyswong
      @billyswong 2 years ago +19

      @@FlameRat_YehLon Yeah, Unicode unifies the CJK text in a messy way. For example 吳 is "simplified" into 吴 in China and 呉 in Japan. They got their own code points for historical reasons. Now look at the same character with an additional radical: 誤. This one doesn't have a separate code point for the variant of the right portion (there is 误 for China though, but that doesn't keep the left portion in traditional form), also for historical reasons. Now how the glyph is shown is totally up to what the font likes to do.
      When I see the demonstration of difficulty in SQL text search for Latin character text in this video, my thinking is, "Oh wait till you see the horrible situation of CJK text." For people that live outside the CJK culture: the same word in CJK can be presented by multiple glyph variants. Think of it like color vs colour, program vs programme, check vs cheque. One can see that sometimes "check" means "cheque" but sometimes "check" just means "check" and shouldn't be replaced by "cheque". Imagine this kind of scenario multiplied by hundreds or even thousands of times. 🤮
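
A short sketch of why a plain text search treats the three variants mentioned above as unrelated strings: each has its own code point, and Unicode normalization does not fold one into another. It uses only the standard `unicodedata` module.

```python
import unicodedata

variants = ["吳", "吴", "呉"]

# Each variant is a separate code point.
code_points = [f"U+{ord(c):04X}" for c in variants]

# Even compatibility normalization (NFKC) leaves each one as it is.
unchanged = all(unicodedata.normalize("NFKC", c) == c for c in variants)

print(code_points, unchanged)
```

Matching the variants against each other therefore needs an explicit variant table or a collation designed for it, which normalization alone does not provide.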

    • @deelkar
      @deelkar 2 years ago +5

      @@billyswong I have used CJK encoding for almost 25 years now, and just now realised it stood for Chinese Japanese Korean

    • @RoastLambShanks
      @RoastLambShanks 2 years ago

      @@billyswong "Check" is not the same as "cheque", ever. That's just people who can't spell.

  • @treyquattro
    @treyquattro 2 years ago +129

    who knew a talk about text could be so entertaining? I didn't really learn anything new about character encoding, having gone through the whole development from ASCII to Unicode via code pages and various levels of UTF, but the little anecdotes about the Harry Potter letter and Billy Joel album etc. were fascinating (and new to me)

    • @keithrobertson7579
      @keithrobertson7579 2 years ago +1

      I'm just wondering why the topic of Billy Joel in St. Petersburg (Leningrad at the time) is accompanied by an image of him in Moscow.

  • @NestorCustodio
    @NestorCustodio 2 years ago +13

    I love this kind of low-level "this is how we got here" type stuff. Fantastic talk!

  • @barspinoza
    @barspinoza 2 years ago +98

    It's 2 in the morning as I finished watching this terrific talk, and I feel like giving a standing ovation.

    • @skilz8098
      @skilz8098 2 years ago +1

      I just got done watching it and it's 3:20 am where I'm at.

    • @JACKHARRINGTON
      @JACKHARRINGTON 1 year ago

      Did you do it?

    • @rumbust7793
      @rumbust7793 1 year ago

      1:08 am here.

  • @DmytroKo
    @DmytroKo 2 years ago +46

    hi from Ukraine! brilliant talk!

  • @brickviking667
    @brickviking667 2 years ago +53

    Loved the talk. Even stranger to have heard the terms "shibboleth" and "pike matchbox" in the same talk. Dylan seems to have a way to make text encoding entertaining instead of dry, maybe the architects of the UTF-8 and Unicode specs could learn from this.

    • @Gersberms
      @Gersberms 2 years ago +11

      He's extremely well educated on cultures, for some reason it surprised me. Hating on Motley Crue was totally deserved too, I can't stand the abuse of umlauts.

    • @AffidavidDonda
      @AffidavidDonda 2 years ago +7

      @@Gersberms unless it is Spın̈al Tap of course

    • @JivanPal
      @JivanPal 2 years ago

      @@AffidavidDonda Hey, where did that "n" steal the second tittle from?!

  • @MateHegyhati
    @MateHegyhati 2 years ago +38

    I can't recall the last time I watched such a long video on yt without any breaks. Thank you.

    • @metaltyphoon
      @metaltyphoon 2 years ago +2

      Watch another one by the same guy called The Art of Code.

  • @KurtisRader
    @KurtisRader 2 years ago +18

    I love this video. I was a UNIX support engineer (for Sequent Computer Systems) during the period when ISO 8859 (i.e., Windows "code pages" for UNIX) was introduced. It was a nightmare that resulted in a huge increase in my team's workload. ISO 8859 is easily my choice for the worst ISO standard to ever be introduced.

    • @lhpl
      @lhpl 2 years ago

      You haven't had enough opportunity to use ISO 646-xx then :-)

    • @RyJones
      @RyJones 2 years ago +1

      I got started on Sequent S/220. Good machines

  • @gogobram
    @gogobram 2 years ago +22

    Great talk. If I had to add just one thing, it's the BOM prefix :) The Byte Order Mark, which tells you whether a file is little-endian or big-endian.
    And you could go from this topic and expand towards MIME types.
    And the statistical analysis which web browsers use to "guess" encodings.
    Which would lead us to discussing why HTML isn't actually XML.
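
A minimal sketch of BOM sniffing with the constants from Python's `codecs` module. The helper name `sniff_bom` is hypothetical. The UTF-32 marks are checked first because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs
from typing import Optional

# Longest marks first, so UTF-32LE is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xff\xfeh\x00i\x00"))  # UTF-16, little-endian
print(sniff_bom(b"hello"))               # no BOM at all
```

A missing BOM says nothing either way, which is where the statistical guessing mentioned above comes in.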

    • @highcollector
      @highcollector 2 years ago

      Browsers use statistical analysis to guess encodings? My mind is blown!

  • @bzdirt
    @bzdirt 2 years ago +10

    Excellent talk! Just 2 things:
    - æ isn't a ligature (you can write both ae and æ, it's not the font that decides)
    - and it could/should have talked about BOM markers. It's basically a file header for plain text!! (specifies the type of UTF, can make 2 different files look identical)

    • @peterhindes56
      @peterhindes56 2 years ago

      Yeah bom markers are fun. Had to remove them for gcode files on a cnc machine once

    • @calmeilles
      @calmeilles 2 years ago

      In Old English Æ | æ is a distinct letter, æsc or ash, at code points 00C6 and 00E6. Æthelstan is not Aethelstan, although mediæval, mediaeval and medieval are matters of style (and country).
      Also Ossetian Cyrillic Æ | æ is similarly a single letter, but a different glyph being at 04D4 and 04D5.
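
The Latin and Cyrillic letters can be told apart with Python's `unicodedata` module, which reports the formal Unicode name for each code point. This is only a lookup sketch. The lowercase Latin letter is U+00E6.

```python
import unicodedata

# Latin capital/small AE, then the Cyrillic pair used in Ossetian.
letters = ["\u00c6", "\u00e6", "\u04d4", "\u04d5"]

names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in letters}

for code_point, name in names.items():
    print(code_point, name)
```

The Latin pair is named as a letter and the Cyrillic pair as a ligature, even though the glyphs usually look identical.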

  • @blueskyredkite
    @blueskyredkite 2 years ago +13

    I'm amazed Dylan managed to make the (what I thought) incredibly dull subject of Plain Text interesting enough for me to spend almost an hour watching. Great video, thanks.

    • @alexeynezhdanov2362
      @alexeynezhdanov2362 2 years ago

      Well, I'd not consider UTF-8 (or whatever else encoding except for 7-bit ascii) to be a plain text. So he kinda drove off-topic pretty quick :)

    • @nsyne
      @nsyne 2 years ago +2

      @@alexeynezhdanov2362 that's interesting... Why not?

  • @christianemden7637
    @christianemden7637 2 years ago +31

    An absolutely fantastic presentation about something everybody, including me, thought to be trivial.

  • @anhadjha2362
    @anhadjha2362 2 years ago +32

    This guy is just amazing at explaining real and applicable topics with such entertaining use cases🤩

    • @MarcCastellsBallesta
      @MarcCastellsBallesta 2 years ago +2

      Hey! My youtube for android is not rendering your last symbol!
      Really appropriate for this talk.

    • @DaVinc-hi7hd
      @DaVinc-hi7hd 7 months ago +1

      @@MarcCastellsBallesta that's the smiling emoji face with red stars as eyes emoji !!

    • @MarcCastellsBallesta
      @MarcCastellsBallesta 7 months ago

      @@DaVinc-hi7hd Now I see it. Some update may have patched this.

    • @DaVinc-hi7hd
      @DaVinc-hi7hd 7 months ago +1

      @@MarcCastellsBallesta great !! now, that you have no trouble seeing these -->> 🥰😅☺

  • @MeriaDuck
    @MeriaDuck 2 years ago +32

    14:30 I loved that the m68k processor had an instruction called 'abcd' which was add binary coded decimal (with extend). Using the dbra (decrement and branch) instruction one could add strings of ascii decimals in a two instruction loop.
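
A rough Python model of what such a loop computes, assuming packed BCD (two decimal digits per byte) rather than ASCII digits, and assuming well-formed input. The function names are hypothetical and this is not m68k code; `extend` stands in for the carry that the instruction passes from one byte to the next.

```python
from typing import Tuple

def abcd(a: int, b: int, extend: int) -> Tuple[int, int]:
    """Add two packed-BCD bytes plus a carry; return (result byte, carry out)."""
    lo = (a & 0x0F) + (b & 0x0F) + extend
    carry = 0
    if lo > 9:
        lo -= 10
        carry = 1
    hi = (a >> 4) + (b >> 4) + carry
    carry_out = 0
    if hi > 9:
        hi -= 10
        carry_out = 1
    return (hi << 4) | lo, carry_out

def add_packed_bcd(a: bytes, b: bytes) -> bytes:
    """Add equal-length packed-BCD numbers, least significant byte last."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    out = bytearray(len(a))
    extend = 0
    for i in range(len(a) - 1, -1, -1):  # the counted loop, one byte per pass
        out[i], extend = abcd(a[i], b[i], extend)
    return bytes(out)

total = add_packed_bcd(bytes.fromhex("1299"), bytes.fromhex("0001"))
print(total.hex())
```

A carry out of the most significant byte is dropped in this sketch.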

    • @heartache5742
      @heartache5742 2 years ago +15

      >dbra
      take me to dinner first

    • @VAXHeadroom
      @VAXHeadroom 2 years ago +2

      The PowerPC instruction set has 'eieio' (Enforce In-order Execution of I/O), a barrier that tells the processor to keep earlier loads and stores ordered ahead of later ones...

    • @sadasulna6056
      @sadasulna6056 2 years ago +1

      I had a friend at college who would snigger every time anyone mentioned dbra. He was a virgin.

  • @SymTrkl
    @SymTrkl 2 years ago +15

    I did not expect a one hour talk on plain text, of all things, to hold my undivided attention. I paused to tell my wife something cool I learned three separate times. I call hax.

  • @BradenBest
    @BradenBest 2 years ago +8

    Of course there is such a thing as plain text. Plain text is text in the text/plain MIME type, or more generally, text without special formatting, using only printable 7-bit ASCII characters with few exceptions (so UTF-8 also counts as plain text as long as it only uses U+0020 .. U+007E; newlines muddy the waters a bit, since the web uses CRLF while unix-likes use LF). Base64 is a fundamentally plain text encoding scheme, which is exactly the reason why it's desirable. Using base64, you can transmit non-plaintext over media that is hostile to it, like an HTTP response body with the Content-Type field set to text/plain (if you want to transmit a raw binary, you have to use an appropriate MIME type, such as image/png or application/force-download), or a clipboard, or, and this is where they appear most often, an address bar in the form of a querystring/hash or data url. You cannot transmit a raw PNG over a YouTube comment, for example (hell, you can't even reliably copy a source code file into a rich text document without it doing _something_ to mangle it--rich text _hates_ duplicate spaces, for example). The contents will be mangled beyond all recognition. The MD5 digest will not match. Random unicode characters will be inserted, from weird little non-breaking spaces to the U+FEFF Byte Order Mark reserved by UTF-16/32, random linefeed and/or carriage returns, other control characters that may or may not make it into the clipboard on copy/paste. However, if you run it through a b64 encoder and _then_ transmit it, so long as the text itself isn't corrupted, it can be decoded back into the original binary with no corruption. Base64 uses a carefully-chosen subset of the 94 (whitespace doesn't count) printable ASCII-7 characters, A-Za-z0-9+/=.
    Only three of the characters in the charset are non-alphanumeric, they aren't anything that can be construed as a bracket or quotation delimiter, and one of them is used for padding and technically isn't even required, as the length mod 4 of a base64 string can only be 2, 3, or 0, mapping to a source size mod 3 of 1, 2, and 0 bytes, respectively.
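
The length arithmetic above can be checked with the standard `base64` module. A small sketch; the helper names `encode_unpadded` and `decode_unpadded` are hypothetical, and the decoder rebuilds the padding from the length before calling the library.

```python
import base64

def encode_unpadded(data: bytes) -> str:
    """Base64-encode and drop the trailing '=' padding."""
    return base64.b64encode(data).decode("ascii").rstrip("=")

def decode_unpadded(text: str) -> bytes:
    """Restore the padding implied by the length, then decode."""
    return base64.b64decode(text + "=" * (-len(text) % 4))

# Every byte value survives the trip through a text-only channel.
blob = bytes(range(256))
round_trip = decode_unpadded(encode_unpadded(blob))

# Source size mod 3 -> unpadded encoded length mod 4.
remainders = {n % 3: len(encode_unpadded(bytes(n))) % 4 for n in range(1, 10)}

print(round_trip == blob, remainders)
```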
    There is a clear distinction between plain text and binary. And that distinction is that plain text is designed for humans and thus can be heavily tampered with, while binary is for computers, the format is precisely defined down to the byte, is fragile and must be handled with care. Attempting to transmit a binary through a plain text channel will almost always result in a bad time. This is even true with text editors, as plain text files tend to be formatted in UTF-8 and have a newline appended at the end of the file, something that the text editor just arbitrarily enforces and doesn't give you a choice in. Go ahead, in Vim, try opening and saving a PNG without doing anything else and watch it get corrupted to the point of being rejected by image viewers. And then try telling me that plain text isn't a thing.
    I know this might be confusing, as Vim proponents will often tell you that Vim has a "built in" hex editor (they're referring to the xxd hexdump program that comes with Vim), but what people neglect to mention is that `:set binary` by itself will not handle binaries correctly. I know, right? It's called binary mode but it doesn't handle binaries correctly 🥴 You can `:%!xxd` and `:%!xxd -r` whenever you want, but before you write to the file, make sure to `:set binary noeol fenc=latin-1 enc=latin-1`. Binary sets vim in "binary mode", which is ill-defined. Docs say it mostly just turns off some text editing stuff like expandtab and ignores fenc, but I have enough experience with vim mangling binaries to know better. noeol prevents vim from inserting that stupid newline at the end of the file (which it will still do if you have binary mode on), and then fenc and enc are both set to latin-1 to ensure that both the file and buffer are encoded correctly before writing back to the file and ruining it. latin-1 is the official name of "8-bit ascii", or basically just using single bytes for characters, 1:1, without doing any extra encoding, which is vital for not mangling binaries--you don't want FF (1111-1111) to be interpreted as U+00FF and get encoded in UTF-8 as C3BF (1100-0011-1011-1111), for example. If your binary is a ROM for a virtual machine where FF is the instruction for pushing the A register to the stack and C3 is the instruction for making the programmer a beverage, then by transforming it into C3BF, you've just corrupted it, as the A register no longer gets pushed to the stack. 
    Instead the VM makes you a vanilla latte, which is nice and all, aside from the fact that the bug it introduces combined with a network vulnerability in the VM that won't be known about for another 12 years results in a long Final Destination-esque chain of events that ultimately results in Arnold Schwarzenegger going back in time to kill your parents, which somehow causes WWIII (I guess people _really_ liked that meme you posted on twitter a few years ago). All because you chose to put on a foil hat and die on the "plain text isn't real, plain text can't hurt you" hill.
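
The FF to C3 BF corruption described in this comment can be reproduced outside Vim. A minimal Python sketch of the two behaviours: Latin-1 maps every byte to one character and back unchanged, while re-encoding as UTF-8 changes any byte above 7F.

```python
# Latin-1 is a 1:1 mapping, so all 256 byte values round-trip unchanged.
raw = bytes(range(256))
lossless = raw.decode("latin-1").encode("latin-1") == raw

# Reading FF as the character U+00FF and writing it out as UTF-8 corrupts it.
corrupted = b"\xff".decode("latin-1").encode("utf-8")

print(lossless, corrupted.hex())
```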

    • @iXPilot
      @iXPilot 2 years ago +1

      Congratz on getting into next revision of this talk th-cam.com/video/gd5uJ7Nlvvo/w-d-xo.html ;)

    • @BerniePunished
      @BerniePunished 1 year ago

      It's honestly incredible that this comment got into NDC 2022 plain text talk.

    • @BradenBest
      @BradenBest 1 year ago

      @@BerniePunished It's probably because of the last part about corrupting a ROM causing a VM to initiate WWIII, which I wrote because I thought it was funny.
      But yes, I'm as surprised as you are.

    • @ac4740
      @ac4740 1 year ago

      you dropped this 👑

  • @MeriaDuck
    @MeriaDuck 2 years ago +8

    22:00 that's a wholesome dedication of a postal worker!

  • @Shogoeu
    @Shogoeu 2 years ago +6

    Didn't expect to hear "Bulgarian" here. This video is gold - explains so many things. There was recently a bug in VS Code with an empty symbol (not visually shown, zero-length) causing issues.

    • @meghanto
      @meghanto 2 years ago +3

      Same, I am trying to build a compiler and then I almost went insane because of a zero width space
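
Invisible characters like the ones in these two comments can be located by their Unicode general category. A minimal sketch using `unicodedata`; the helper name `find_format_chars` is hypothetical, and it only reports the "Cf" (format) category, which includes the zero width space and the byte order mark but not every invisible character.

```python
import unicodedata
from typing import List, Tuple

def find_format_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, name) for every invisible format character in text."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

source_line = "let x\u200b = 1"  # looks like "let x = 1" in most editors
print(find_format_chars(source_line))
```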

  • @user-mw5gw3us5e
    @user-mw5gw3us5e 2 years ago +5

    Back in the 1980s, I was a salesperson of dot-matrix printers manufactured in Japan. I still remember the first Chinese "word processor" on the IBM-AT was actually a generator of a single Chinese character picked from Guojia Biaozhun (National Standard) Level 1 and 2. In that era, surely WordStar was the de facto standard. Taiwanese software makers were more advanced: a few PC manufacturers provided Big5 code, and practical word-processing software was on the market, but it was stored on a set of several "floppy disks".
    My company released a 15-inch printer (24 dot-matrix with a Chinese-character generating IC chip) in 1989 but the turmoil in Beijing caused all the shipment from Japan to China mainland, while I was in Hong Kong.
    Now in the 21st century, I am so glad to type in Japanese, Chinese, Greek, Hebrew, English and French in a SINGLE FILE, all thanks to the Unicode Consortium.

  • @lhpl
    @lhpl 2 years ago +55

    There are some (I initially thought and wrote "many", but really most are merely natural and fair simplifications or omissions) details that are wrong, but when I got to around 00:28:48 or so, and saw the names 'Aachen' and 'Aarhus', I guessed what was coming up, and you were immediately forgiven.
    I live in Århus (and I insist on spelling it like that!) and character sets have been a pet peeve of mine - in Danish we would say one of my kæpheste ("stick horses" or hobby horses) - since I learned programming in 1982. I used to put "Min k{phest har f}et et f|l" in my Usenet signature. ("My hobbyhorse had a foal", "having a foal" in Danish is an idiom for getting upset or shocked.)
    You mentioned CP437 - it actually failed to include _all_ Nordic letters, so we would often see ¢ in place of ø back in the 80s. Then the 90s came, and gradually ISO 8859 began replacing ISO 646. Of course with new problems. I had to hack my favorite Mac newsreader, John Norstad's NewsWatcher, which originally cleared the high bit of all characters. At about the same time, the MIME standards, Internet RFCs 1341 and 1342 came out, having their own problems. For example having some mail servers inexplicably cutting off e-mail messages. (Because sendmail would use /bin/mail to deliver messages, but some Unix implementations of it would treat "." alone on a line as End of message. Something that wouldn't occur normally, but suddenly became frequent, when quoted-printable encoding expanded a line beyond the max, resulting in a period becoming wrapped onto a line by itself.) Ah, those were fun times.
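
Both effects mentioned so far can be reproduced in a few lines of Python. This is a sketch: the lowercase half of the substitution table is taken from the signature quoted above, the uppercase half is an assumption about the Danish ISO 646 variant, and the name `ISO646_DK` is hypothetical. The second part reads the same byte under code page 437 and under the Nordic code page 865.

```python
# Danish ISO 646 reused six ASCII punctuation positions for national letters.
ISO646_DK = str.maketrans("[\\]{|}", "ÆØÅæøå")

def read_as_danish(text: str) -> str:
    """Show 7-bit text the way a Danish ISO 646 terminal would have."""
    return text.translate(ISO646_DK)

signature = read_as_danish("Min k{phest har f}et et f|l")

# One byte, two code pages: a cent sign in CP437, the letter in CP865.
as_cp437 = b"\x9b".decode("cp437")
as_cp865 = b"\x9b".decode("cp865")

print(signature, as_cp437, as_cp865)
```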
    The Danish spelling reform introduced the Å (not Æ and Ø, we had them before), but made it optional in names, so people could choose whether to write their name as Aagaard or Ågård for example. Some towns also mostly insisted on keeping the Aa (yes, that is how the _letter_ looks in uppercase), such as Aalborg and Aabenraa. The change for Århus in 2011 was a really stupid idea of the mayor at the time.
    The official collation rules are indeed such that it depends on whether the Aa is used in a Danish name and equivalent to Å, meaning it is the _last_ letter in the alphabetical ordering, or not (putting 'Aachen' at the beginning.) Yes, it is outright perverse!
    But having had my middle name mangled by computers and people in other countries so many times (ACM replaced the ø in it with a '?' on my membership card, long after Unicode had become common!), I consider it fair revenge.
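
The collation rule described in this comment cannot be decided from the spelling alone, because it depends on whether a name is Danish. A toy sketch, assuming the caller supplies that fact as a flag; `danish_key` and the sample list are hypothetical, and real collation (for example via ICU) handles far more cases.

```python
from typing import List, Tuple

# Danish alphabet order: æ, ø, å come after z.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"

def danish_key(name: str, is_danish: bool) -> List[int]:
    """Sort key where 'aa' in a Danish name counts as the letter å."""
    s = name.lower()
    if is_danish:
        s = s.replace("aa", "å")
    # Characters outside the alphabet sort after everything else.
    return [ALPHABET.index(c) if c in ALPHABET else len(ALPHABET) for c in s]

places: List[Tuple[str, bool]] = [
    ("Aarhus", True), ("Aachen", False), ("Berlin", False), ("Odense", True),
]
ordered = [name for name, _ in sorted(places, key=lambda p: danish_key(*p))]
print(ordered)
```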

    • @djchrisi
      @djchrisi 2 years ago +1

      Thanks for the explanation. Would you mind briefly pointing out what other details you found to be incorrect?

    • @lhpl
      @lhpl 2 years ago +7

      @@djchrisi It's really just minor and insignificant details, such as the first telegraph being "trinary" (which doesn't quite make sense, as the explanation says only two lines of five are active per letter.) Not mentioning the Mac using CR for newline, calling CP/M computers "minicomputers" (that was the term for the previous generation of PDPs, System38...) etc.

    • @orbital1337
      @orbital1337 2 years ago +10

      @@lhpl The telegraph encoding is trinary because each line can be in one of three states (off, pointing left, pointing right).

    • @lhpl
      @lhpl 2 years ago +3

      @@orbital1337 yes, but a 5-"trigit" encoding would be able to encode 3⁵ = 3×3×3×3×3 = 243 different characters (more than ASCII even), yet the telegraph didn't even support the full alphabet, so clearly the trinary potential wasn't used.

    • @DylanBeattie
      @DylanBeattie 2 years ago +15

      @@lhpl If we're going to get down to *that* level of detail, I'd argue that the encoding was absolutely trinary - five symbols, each of which could have one of three possible values - and the fact that only a limited subset of these was used was a restriction of the available hardware, not of the encoding itself...
      I like the word "trigit", though. I might borrow that next time I give this talk. 😉

  • @RaymondHinton
    @RaymondHinton 2 years ago +12

    If the DuckDuckGo translation is to be believed, then that first Chinese character (4400) actually means "Yikes", which is somehow even better than the translation given in the talk! :)

  • @allanrichardson9081
    @allanrichardson9081 2 years ago +8

    Excellent video, except for one minor point. The ASCII code was never intended to be used on punched cards (except in the 360ff mainframe equivalent of “binary,” which is described below). It was an expanded replacement for the 5-bit Baudot code, which used a case latching printer. All but 6 of the 32 possible combinations of 5 bits were used to encode the 26 letters of the Latin alphabet when in LTRS case, or the 10 Arabic digits and the most used punctuation marks (also the BEL, which rang a bell). The other 6 were case independent: no holes (00000) was used for paper tape leaders; all five holes (11111) to correct typos on the fly by backspacing and overpunching the bad character; two other codes, FIGS and LTRS, to flip the case shift (one code to toggle would be too dangerous, since a misread case toggle would leave every character in the rest of the message printed with the wrong case; with this system, such a single error would only mess up the text to the next FIGS or LTRS); and two other codes for CR and LF.
    ASCII made the printer much faster, even though it expanded the characters from 5 to 7 or 8, got rid of the case shift, and allowed lower case letters and lots more punctuation marks.
    IBM had the Hollerith code for punched cards, in which (originally) zero holes (space), 1 hole (& or +, -, and 0 through 9), two holes (but no more than one of the top (12 and 11) and no more than one of 0 through 9) to give 12+0 (flagged digit 0), 11+0 (flagged digit 0), 12+1 through 9 (A through I, upper case only), 11+1 through 9 (J through R), 0+1 (“/“), 0+2 through 9 (S through Z); and a few combinations of 8+2, 8+3, 8+4, 8+5, 8+6, and 8+7, alone or with 12, 11, or 0, for punctuation marks.
    Then IBM developed the 8-bit EBCDIC (Extended Binary Coded Decimal Interchange Code), allowing many “unprintable” card codes, more punctuation, and lower case letters, to be mapped into a subset of the 4096 possible combinations of 12 punches: any combination of 12, 11, 0, 8, or 9 (5 bits) and no more than one out of 1 through 7 (3 effective bits), with some ranges of exceptions and a number of individual exceptions. Thus, the “binary” deck of a program, if printed, would show embedded text directly.
    IBM also specified that ASCII-8 could be used in computation, but as it happened, it was only used as input or output on non-IBM devices, and was translated to EBCDIC on input and from EBCDIC on output, for those devices. A single instruction (TR), with a 256-byte read-only translation table, could translate up to 256 bytes.
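
    The table-driven translation described above can be sketched in Python, where `bytes.translate` plays a role loosely similar to TR with a 256-byte table. This is only an illustration: the choice of Python's `cp037` codec as "the" EBCDIC variant and the helper names are assumptions, not anything from the comment.

```python
def build_ascii_to_ebcdic_table() -> bytes:
    """Build a 256-byte lookup table: index = input byte, value = cp037 byte."""
    table = bytearray(256)
    for value in range(256):
        try:
            # Treat the input byte as a Latin-1 code point and encode it as cp037.
            table[value] = chr(value).encode("cp037")[0]
        except UnicodeEncodeError:
            table[value] = 0x6F  # EBCDIC '?' as a fallback (assumption)
    return bytes(table)


ASCII_TO_EBCDIC = build_ascii_to_ebcdic_table()


def to_ebcdic(data: bytes) -> bytes:
    """Translate every byte through the table in a single call."""
    return data.translate(ASCII_TO_EBCDIC)
```

    With this table, `to_ebcdic(b"A")` gives the single byte 0xC1, the EBCDIC code for upper case A.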

  • @user-tx6or7kh4w
    @user-tx6or7kh4w 2 ปีที่แล้ว +15

    On PIKE MATCHBOX thing: that one doesn't work for Russia as we don't have 'i'. Instead, we use lowercase 'У' (ie 'у', reads as 'u' in 'urban', I think) which looks pretty much like lowercase 'Y'. So it's PyKE MATCHBO >_

  • @softy8088
    @softy8088 2 ปีที่แล้ว +32

    35:04 I think that's wrong. The K stands for "compatibility" and is for "are these vaguely the same?", while the version without the K is the canonical one, which means, "are these 100% the same for all meaningful purposes?"
    Edit: Yeah, watching further along, the speaker has confused "canonical" with "compatibility". Ⓟ is not canonically equivalent to P, but it is compatibility-equivalent.
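
    A quick way to see the difference is Python's standard `unicodedata` module. The sketch below is only an illustration; the two helper names are made up for this example.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """True if the strings match after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def compatibility_equal(a: str, b: str) -> bool:
    """True if the strings match after compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


circled_p = "\u24c5"   # CIRCLED LATIN CAPITAL LETTER P
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

print(canonically_equal(circled_p, "P"))         # False
print(compatibility_equal(circled_p, "P"))       # True
print(canonically_equal(decomposed, precomposed))  # True
```

    The circled letter only collapses to a plain P under the K forms, while the two spellings of é already match under plain NFC.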

    • @DylanBeattie
      @DylanBeattie 2 ปีที่แล้ว +22

      Yep, you're right - my mistake entirely. K = compatible, not canonical.

    • @softy8088
      @softy8088 2 ปีที่แล้ว +3

      @@DylanBeattie It was a pretty good talk overall, but having done a dive into this topic myself, my inner pedant won't shut up. :)

    • @NicholasShanks
      @NicholasShanks 2 ปีที่แล้ว +1

      @@DylanBeattie but at least you present the lie with convincing conviction.

  • @ittixen
    @ittixen 2 ปีที่แล้ว +5

    Came in expecting a talk about infosec and cryptography, but this completely different subject was just as fascinating.

  • @TheMultiminded
    @TheMultiminded 2 ปีที่แล้ว +3

    Ah yes. The old UTF-16 BE/LE. Reminds me of a system I was working on way back where we were receiving both UTF-8 and UTF-16 XML files and for some reason the BOM was sometimes (but not always) missing. Our XML parser had a few things to say about that. Customers were screaming so our just-get-it-done "solution" was to take the first four bytes and try every possible encoding of a "
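
    For the easy half of that problem, when a BOM is actually present, a minimal sniffing sketch could look like the following. It does not reconstruct the fallback the comment describes for BOM-less files, and the function name and the returned codec labels are assumptions made for this example.

```python
import codecs

# Longer BOMs are checked first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

    A `None` result is exactly the awkward case from the comment: the bytes have to be guessed at some other way.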

  • @Atabascael
    @Atabascael 2 ปีที่แล้ว +6

    28:49 - the native spelling of "Zurich" is actually "Zürich" :)

    • @Yora21
      @Yora21 2 ปีที่แล้ว

      Normally I wouldn't care. But when talking about mistranscribing language-specific characters, I was waiting the whole time for him to get to the point where he explains why he wrote it as Zurich in this example.

    • @astupidnickname
      @astupidnickname 2 ปีที่แล้ว +1

      @@Yora21 So happy I am not the only one being triggered by this 🙈😂

  • @larryd9577
    @larryd9577 2 ปีที่แล้ว +12

    If you use `Achen` as an overwrite for `Aachen` in sorting you get it sorted after `Abchen` (albeit it doesn't, it suffices for proving a point), which should actually follow `Aachen`. `A_chen` would be a better choice.
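
    The effect is easy to reproduce with a sort-key override. This is a rough sketch assuming a plain code-point comparison and a hypothetical `overrides` mapping; real collation would use locale-aware rules.

```python
def sort_with_overrides(names, overrides):
    """Sort names by code point, using an override key where one is given."""
    return sorted(names, key=lambda name: overrides.get(name, name))


names = ["Aachen", "Abchen"]

# "Achen" compares greater than "Abchen", so Aachen ends up second.
print(sort_with_overrides(names, {"Aachen": "Achen"}))   # ['Abchen', 'Aachen']

# "_" (0x5F) sorts before every lower case letter, so Aachen stays first.
print(sort_with_overrides(names, {"Aachen": "A_chen"}))  # ['Aachen', 'Abchen']
```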

  • @robbertwethmar5612
    @robbertwethmar5612 2 ปีที่แล้ว +9

    nice talk, ranging from (ancient) history to modern encoding problems, thanks!

  • @HarmonicaMustang
    @HarmonicaMustang 2 ปีที่แล้ว +1

    As a side note, International Morse Code is still used today by ham/amateur radio operators in a mode called CW (continuous wave). Although no longer a requirement for a license, it is by far the most efficient method of communication for language barriers, power and distance.
    My first radio station was a little radio I could run off of a power bank, it was capable of producing 5 Watts of power, and that was hooked up to about 9 meters of regular speaker wire I'd string up in a tree. I was able to communicate with someone in Italy from the UK, that's about 2000 km/1200 miles away on less power than it takes to charge a smartphone.
    Though it has the steepest learning curve for an operator, the radio itself can be very simple (to the point some amateurs have built one into an Altoids tin) and although we now take global communications for granted, it is still very satisfying to speak to someone so far away unaided by modern technology.

  • @sbrazenor2
    @sbrazenor2 2 ปีที่แล้ว +144

    He's like, "You should go to Ukraine if you have the chance." A week later, Russia invades. I don't think that was what he meant. Apparently the Russians got a different version of the message. 🤣

    • @dvircohen2465
      @dvircohen2465 2 ปีที่แล้ว +24

      It really aged awkwardly lol

    • @leap123_
      @leap123_ 2 ปีที่แล้ว

      The message that Russians get is "You should *invade* Ukraine if you have the chance"

    • @carina_akaia
      @carina_akaia ปีที่แล้ว +3

      Nothing funny though

  • @christophe3d
    @christophe3d 2 ปีที่แล้ว +9

    I loved the reference to François Bordes (aka Francis Carsac, one of my favorite French Sci-Fi authors, and an inspiration for me for parts of my own "Informagie" book).
    In terms of anecdotes, he mentioned the £ sign vs. $, but he could also have spoken about the Apple II Europlus (I believe) which had a physical switch under the keyboard to switch between the US and "European" character set. IIRC, switching like this would change $ to £. While I'm not entirely sure about which character converted to £ (it might have been @ or something else), what I also recall is that, at the time, Apple insisted on spelling Apple II as Apple ][, which was cute, except that in the European version, ] and [ characters were remapped, so it shifted to something nonsensical like Apple éè. I believe that this was country-dependent, i.e. German and French character ROMs were different.

  • @stan.rarick8556
    @stan.rarick8556 2 ปีที่แล้ว

    I coded in and used plain text files for 46 years and I prefer it. For multiple reasons.

  • @super9mega
    @super9mega 2 ปีที่แล้ว +8

    The only reason I recognize the European one, is because it's the same tile set that dwarf fortress uses. I just happen to notice all of the symbols in use because I've played the game for so long

  • @StreuPfeffer
    @StreuPfeffer 2 ปีที่แล้ว +2

    As Tom Scott said in his emoji talks[ "Nobody" cared about Unicode until: "Ha! I can send my friend piles of poo" and suddenly the world cared about unicode.] or something along those lines :D

  • @joaomelo6642
    @joaomelo6642 2 ปีที่แล้ว +10

    Amazing talk. Learning and Entertainment at the same time!

  • @VAXHeadroom
    @VAXHeadroom 2 ปีที่แล้ว +2

    One of my jobs early in my career (1988ish) was to write a routine for creating a mag tape to be read by an IBM mainframe...from a DEC VAX8340. Kind of mind-bending to write EBCDIC out from a VAX (using ASCII).

  • @_nikeee
    @_nikeee 2 ปีที่แล้ว +5

    39:00
    They use 16 bit encodings for historical reasons. Early on, they were using UCS-2 in the 90s, which was a block-code like ascii, only with 16 bits. Turns out that this wasted much space, so UTF-8, UTF-16 and UTF-32 were invented. UCS-2 is forward-compatible with UTF-16 and those systems switched to UTF-16 instead.
    New programming languages and systems (Rust, Go, etc) use UTF-8 internally, because it's much more efficient for mostly English text (also, UTF-8 is the most used encoding on the internet).
    Today, the fact that C# uses UTF-16 instead of UTF-8 is actually causing performance issues because when writing web servers, they have to re-encode all UTF-8 code to UTF-16, so the program can work with that text.
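
    The size difference is easy to measure. A small sketch in Python (the helper name is made up for this example, and the BOM-less `-le` codecs are used so only the text itself is counted):

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


print(encoded_sizes("Plain text"))  # {'utf-8': 10, 'utf-16': 20, 'utf-32': 40}
print(encoded_sizes("烫烫"))         # {'utf-8': 6, 'utf-16': 4, 'utf-32': 8}
```

    For mostly-ASCII text UTF-8 is half the size of UTF-16, while for CJK text the advantage flips the other way.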

  • @superhawk6105
    @superhawk6105 2 ปีที่แล้ว +1

    I love that shibboleth is, itself, a shibboleth (I've only ever heard it mentioned in dev circles, at least).

  • @jeanotzubler2477
    @jeanotzubler2477 2 ปีที่แล้ว +2

    I love / hate the fact, that when talking about umlauts Zürich is written as Zurich the whole time.

    • @DylanBeattie
      @DylanBeattie 2 ปีที่แล้ว +3

      Yeah. My bad... I've fixed that in the slide deck. Sorry!

  • @wobuntu
    @wobuntu 2 ปีที่แล้ว +5

    It has been a long time since I saw such an entertaining yet informative talk, well done

  • @samwilson2926
    @samwilson2926 2 ปีที่แล้ว +2

    As a network guy I’m wondering what sort of glitch in a network switch can lose a single byte from a packet and not hit a checksum problem of some kind. I’m also slightly disappointed you left out the TELEX and Baudot parts of the history, and the various computer architectures that used non-8 numbers of bits for characters, e.g. DEC’s 36-bit systems. Apart from those niggles this is just wonderful!

  • @CyReVolt
    @CyReVolt 2 ปีที่แล้ว +8

    Just like time zones, text is a simple non-issue in tech... LOL :D

  • @thewhitefalcon8539
    @thewhitefalcon8539 ปีที่แล้ว +1

    The K in NFKD and NFKC actually stands for kompatibility (because C already stands for composed). It's a stronger normalization which replaces some characters that are different but have similar semantic meanings. It's good for searching.

  • @LarsOestreicher
    @LarsOestreicher 2 ปีที่แล้ว +1

    I will definitely give this TH-cam video as required viewing in my UI Programme course!
    Thanks for a great lecture!

  • @gedece
    @gedece 2 ปีที่แล้ว +1

    Nowadays, when interfacing via webservices you are asked to do plain text XML, and the question that follows from the developer is: "which encoding?" and if the other person doesn't know we define for them and tell them to set it to be UTF-8 which is native in our server machine. We can accommodate other encodings, but nothing beats native.

  • @boblangill6209
    @boblangill6209 ปีที่แล้ว

    "Same shape but actually different characters". That bit me the first time I tried to enter a simple program. Fortunately I had the listing from the teletype style terminal I was using. The first two people who looked at it didn't see anything wrong. The third guy asked me "Are you using lower case L for one?" Of course I was, I knew how to touch type. Standard typewriters didn't have a key for the number one so that's how you did it.

  • @clayz1
    @clayz1 ปีที่แล้ว

    I love text, and ASCII. Having grown up in a print shop I love fonts and type faces. Turns out I know a thing or two about 90’s Windows and DOS character sets. But this documentary just blew my mind and explained, at last, what code pages are (just stick to ASCII and you won’t get hurt!). I am from the past. Great talk. Thank you.

  • @90hijacked
    @90hijacked 2 ปีที่แล้ว

    Probably the most competent windows user i've ever seen

  • @alexanderilin8720
    @alexanderilin8720 2 ปีที่แล้ว +11

    This is an amazing talk, I love it!

  • @tslivede
    @tslivede 2 ปีที่แล้ว +17

    Really nice and interesting talk!
    Sorry, but my inner pedant requires me to mention, that at 38:25 you say "They use 16 bit for everything", which is not true because utf16 is a variable width encoding (see utf16 surrogate pairs).
    Also AFAIK utf16 is not used because of efficiency, but because of backwards compatibility, as unicode was still only 16bit wide, when these apis were designed.
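
    To see the variable width directly, here is a small Python sketch that lists the 16-bit code units of a string (the function name is invented for this example):

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print([hex(u) for u in utf16_code_units("A")])           # ['0x41']
print([hex(u) for u in utf16_code_units("\U0001F4A9")])  # ['0xd83d', '0xdca9']
```

    A character outside the Basic Multilingual Plane, such as U+1F4A9, takes two code units (a surrogate pair), so "one char" is not always "one character".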

    • @tslivede
      @tslivede 2 ปีที่แล้ว +7

      Minor detail at 42:17
      UTF-8 stops at 4 bytes, not because utf32 restricts it to 32 bits (even 6 "UTF-8" bytes could only encode 31 bits, so it would be compatible with utf32)
      The reason utf8 stops there, is because that's the maximum that utf16 surrogate pairs can represent

    • @eekee6034
      @eekee6034 2 ปีที่แล้ว +1

      @@tslivede Does the 4-byte limit restrict UTF-8 to encoding only 21 bits? I recall some 21-bit limit on Unicode being lifted years ago.

    • @tslivede
      @tslivede 2 ปีที่แล้ว +2

      @@eekee6034 In section "2.4 Code Points and Character" of
      "The Unicode® Standard
      Version 14.0 - Core Specification"
      it says:
      "In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF"

    • @tslivede
      @tslivede 2 ปีที่แล้ว +2

      @@eekee6034 0x10ffff is the maximum value representable as a utf16 surrogate pair and has a width of 21 bits.
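
      The arithmetic can be checked in a few lines of Python. This is just a sketch of the standard surrogate-pair formula; the function name is made up for the example.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high (0xD800-0xDBFF) and low (0xDC00-0xDFFF) surrogate."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


largest = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(largest))          # 0x10ffff
print(largest.bit_length())  # 21
```

      Each surrogate carries 10 bits of payload, so a pair covers 2^20 code points on top of the first 0x10000.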

    • @tslivede
      @tslivede 2 ปีที่แล้ว +2

      @@eekee6034 If you want to find the standard, just google for "unicode latest standard" which leads to the page "Unicode 14.0.0" on the unicode org website. There you can find the PDF on the left side with the link "Full Text pdf for Viewing".
      Alternatively: English Unicode Wikipedia article → External links (at bottom of article) → Official technical site → Latest Version → Full Text pdf for Viewing

  • @matthewcane0
    @matthewcane0 2 ปีที่แล้ว +6

    What a brilliant talk, super interesting and engaging!

  • @HrHaakon
    @HrHaakon 2 ปีที่แล้ว +1

    It is actually true that Ø is a separate letter in Norwegian, Danish and Swedish, but the Swedes draw that character as Ö. But it's the SAME LETTER.
    I am so happy that I never had to be the people in the Unicode consortium having to untangle that horrible mess.

  • @dexterBlanket
    @dexterBlanket 2 ปีที่แล้ว +4

    Thank you for this "introduction to text encoding history" - good, funny, and very educational.
    YUSCII encoding, "compatible" to a DOS code page for East European languages with Latin alphabet, was used in former Yugoslavia. It was "one of those" DOS text encodings that was mashed together and lived (or should I say limped) long after its expiration date when Windows 95 introduced "correct" support.
    I still remember the DOS console prompt "C:ĐWindowsĐsystem>" 😂
    But YUSCII fonts for Windows were the worst, the letters "ŠĐČĆŽ" were scattered across the keyboard (they are "now" grouped on the right hand side), and they managed to mix up the case - Shift+š got you "š" instead of "Š" and vice versa. You could still find Word documents that are using YUSCII fonts.

    • @JivanPal
      @JivanPal 2 ปีที่แล้ว

      "YUSCII" just made me think that Russia should've called their encoding "RUSCII", would've been an amazing pun! Wikipedia suggests that it's a nickname for codepage 1125 (RST 2018-91).

  • @Olfan
    @Olfan 2 ปีที่แล้ว

    @32:10 - hate to nitpick, but this way "Aachen" would be sorted after "Abenberg" (ab < ac), so that's better but still not ideal. For this to actually work, a double a would have to be sorted as an actual double a, and the aa in Aarhus would have to become something else entirely, maybe the Å it was before.

  • @TheFinagle
    @TheFinagle 2 ปีที่แล้ว +29

    This talk is like programming in a nutshell. "We have a problem." "OK, here's a solution." "Sure that solution is great, but here's a new problem it causes." "Sure we solved it like this." "Thanks, here's a special case that doesn't quite cover." "Oops, we'll get that with this." And it never ends

  • @pihungliu35
    @pihungliu35 2 ปีที่แล้ว +5

    There is a quirk in Windows notepad that (sort-of) converts "Bush hid the facts" into Chinese characters, when it tries to interpret ascii text as UTF16-LE. (There's an article in Wikipedia describing this under this string.) I was thinking the Chinese character log is this kind of misinterpretation, but it's weirder than this.
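
    The misreading itself is easy to reproduce in Python by decoding the ASCII bytes as UTF-16-LE. This sketch only shows the reinterpretation, not Notepad's detection heuristic, and the function name is made up for the example.

```python
def misread_as_utf16le(ascii_text: str) -> str:
    """Reinterpret ASCII bytes as UTF-16-LE; needs an even number of bytes."""
    return ascii_text.encode("ascii").decode("utf-16-le")


garbled = misread_as_utf16le("Bush hid the facts")
print(len(garbled))                        # 9
print([hex(ord(ch)) for ch in garbled][:3])  # ['0x7542', '0x6873', '0x6820']
```

    Every pair of ASCII letters lands in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which is why the result looks like Chinese.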

  • @SamTheEnglishTeacher
    @SamTheEnglishTeacher 2 ปีที่แล้ว +11

    Reminds me of doing data migration taking unstructured text in SQL Server over to a hadoop cluster... spent a few days trying to figure out why there were these question marks inside of squares popping up all over the place. Smart quotes. Why is everything especially dumb so often called "smart"?

    • @evannibbe9375
      @evannibbe9375 2 ปีที่แล้ว +1

      Yeah, the publisher that came up with the idea of using smart quotes should have been prohibited from touching a computer

    • @rosiefay7283
      @rosiefay7283 2 ปีที่แล้ว +1

      @@evannibbe9375 You have it the wrong way around. Opening and closing quotation marks have always been different. They're different in handwriting, they're different in print. The problem was caused by ASCII using one code point for both.

    • @SamTheEnglishTeacher
      @SamTheEnglishTeacher 2 ปีที่แล้ว

      @@rosiefay7283 what use is there for italicised quotation marks as distinct from non-italicised quotation marks? If you only want them to appear a certain way, that's what fonts are for

  • @TheBuilder
    @TheBuilder 2 ปีที่แล้ว +4

    You really did your research with this one

  • @Mskvaer
    @Mskvaer 2 ปีที่แล้ว +1

    A correction for his next presentation: æ is a letter, at least in Denmark. If Dylan wants to present ligatures, the common one is ff which also is representable as a character ﬀ (U+FB00) - thus keeping us all confused!
    The Aarhus and Aachen is a brilliant example - the fact you can't have it both ways.
    ö is a letter in Sweden, and just an umlaut (an o with a ¨ on top) in German - which affects sorting order.
    The sharp-s character ß (U+00DF) is not the Greek beta β (U+03B2) - they usually look distinct. Furthermore in German it stands for SS, but in Austrian it stands for SZ. Sorting is not "plain text", but is just as fascinating: In Spanish C is before D, but the letter pair CH comes after C but before D (ie CA, CM, CX, CH, DA)
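
    That CH ordering can be imitated with a custom sort key. The sketch below is a deliberately crude trick, not real Spanish collation: it ignores accents and other digraphs, and the function name and sentinel character are assumptions made for this example.

```python
def traditional_spanish_key(word: str) -> str:
    """Sort key that places the digraph 'ch' after every other 'c...' pair."""
    # "\uffff" compares greater than any ordinary letter, so "c\uffff"
    # sorts after "cz" but still before "d".
    return word.lower().replace("ch", "c\uffff")


words = ["DA", "CH", "CX", "CA", "CM"]
print(sorted(words, key=traditional_spanish_key))
# ['CA', 'CM', 'CX', 'CH', 'DA']
```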

    • @YT-Observer
      @YT-Observer 2 ปีที่แล้ว

      sort order in Castilian Spanish YMMV c,ch,d sometimes sorts differently depending on whether it starts the word or is embedded

  • @dennisk.6988
    @dennisk.6988 2 ปีที่แล้ว +26

    The Ukraine is an awesome country, you should definitely go there… Didn't know Putin watched your talks

    • @negirno
      @negirno 2 ปีที่แล้ว +4

      I assume TH-cam's algorithm boosted all videos mentioning Ukraine because of the current war events. I even got Langfocus's video about the difference of Russian and Ukrainian in my recommendation feed.

    • @mattsadventureswithart5764
      @mattsadventureswithart5764 2 ปีที่แล้ว +3

      Ukrainians prefer people calling their country "Ukraine", not "the Ukraine".
      Just an FYI.

  • @AndreasHontzia
    @AndreasHontzia 2 ปีที่แล้ว +3

    I have rooted machines with Unicode. The RTL features and canonicalization are really handy to bypass filters.

  • @glum_hippo
    @glum_hippo 2 ปีที่แล้ว +1

    Amazing presentation. A bit of history is always a welcome thing!

  • @Graham_Rule
    @Graham_Rule 2 ปีที่แล้ว +1

    I'm old enough to know why the DEL is all bits set although I'd have said it was from paper tape rather than punched cards (I thought they used a different encoding), and I remember the problems getting email through ASCII / EBCDIC translators (eg the IBM based BITNET network). But I loved this presentation. The speaker's style and ability to raise so many of the problems that come up when everyone assumes that the rest of the world (and history) do just the same as they do was great. I can't help wondering if there was a Q&A after this - or even better a discussion in the bar.

  • @Aaron628318
    @Aaron628318 2 ปีที่แล้ว +1

    A lot of this I knew already, but somehow this presentation makes me feel more comfortable about what is - partly due to history but also partly of necessity - something of a mess.

  • @AdrX003
    @AdrX003 2 ปีที่แล้ว

    It's been a while since I found a good talk like this, thanks!
    edit: the first line was before the thing start, but now i just ended it like this
    "HOLY F... THIS S.. WAS AWESOME!"

  • @Maldito011316
    @Maldito011316 2 ปีที่แล้ว

    Props for using the Zalgo meme (Zalgo he comes) and not just calling it Zalgo like I'd probably do

  • @SIRBOB102
    @SIRBOB102 2 ปีที่แล้ว +5

    thank you you have given the devs of the world an excuse to be on twitter

  • @JonGretarB
    @JonGretarB 2 ปีที่แล้ว +2

    Ææ is not a ligature in Icelandic. It’s its own letter. Used to be in English as well before the printing press which was German. Þ disappeared because of the same reason and now is “Th” but first usually written with “y” which they thought looked similar. So when you see “Ye olde” it’s supposed to be “Þe olde” and pronounced “The old”

    • @YT-Observer
      @YT-Observer 2 ปีที่แล้ว

      it's not just Icelandic, several languages had such characters

  • @Yora21
    @Yora21 2 ปีที่แล้ว +4

    During the whole segment about city names, it really bugged me that it's "Zürich", not "Zurich".
    Normally I'd let that slide in an English language presentation, but that's exactly the kind of thing that segment was all about.

  • @billyleask
    @billyleask 2 ปีที่แล้ว

    Thank you Mr. Beattie for such an interesting talk, and for all your other great talks!

  • @franciscovarela7127
    @franciscovarela7127 2 ปีที่แล้ว +2

    Yeah, was waiting for the collating sequence aspect of text handling. By the way Zurich is correctly spelled Zürich.

  • @jurjenbos228
    @jurjenbos228 2 ปีที่แล้ว +1

    Actually, the first version of ASCII was 6 bits, what is now characters 0x20 - 0x5F (upper case, digits, and most symbols). Then an extra bit was added to be the reverse of the initial bit, so the halves were swapped and the lower case letters could follow the upper case (60-7f). This code was used a lot on 7 bit tape (not cards). The characters 00-1f were then filled with the control characters. They tried to make them as useful as possible, but it never caught on.
    The IBM people, who were using 6-bit EBCDIC derived from cards, extended it to 8 bits and of course did not include the same set of characters.

  • @oskrm
    @oskrm 2 ปีที่แล้ว +11

    Anyone knows Pike Matchbox?

  • @artsmith1347
    @artsmith1347 2 ปีที่แล้ว +2

    24:50 We have been unhappy about the Tower of Babel (Genesis 11:7) since languages were divided. So our partial workaround is to make a unique representation for each of thousands of glyphs. It treats the symptom without trying to recognize the cause.

  • @JulianSloman
    @JulianSloman 2 ปีที่แล้ว +4

    PIKE MATCHBOX! Nice - the visiting Ukraine part... bad timing

  • @FreyrDev
    @FreyrDev 2 ปีที่แล้ว +2

    49:27 This is wrong though, æ is a separate character not a ligature, U+00E6 in unicode, it's a separate letter to ae in most Scandinavian languages and the IPA

    • @YT-Observer
      @YT-Observer 2 ปีที่แล้ว

      @freyr it's another example where ligature and separate symbol overlap - "English" actually has both in historical spellings.

  • @scifino1
    @scifino1 2 ปีที่แล้ว +1

    28:47 Nice demonstration of an encoding problem in "Zürich" there.

  • @economicist2011
    @economicist2011 2 ปีที่แล้ว

    23:30 Came for "plain text" for some reason. Got an interesting piece of modern historical linguistics trivia and more.

  • @andytroo
    @andytroo 2 ปีที่แล้ว +9

    I currently have an appreciation for the difference between a code point and a char :).
    yesterday i was fighting c#, wanting my xml to be output in utf8, and declared as such (not utf-16, with predominantly 1 byte chars, which is what you get by default)

    • @edwardfanboy
      @edwardfanboy 2 ปีที่แล้ว +2

      Did you know that if you use System.Text.Encoding.UTF8 when writing a text file, it will output a byte order mark? That gave me a headache.

    • @AyCe
      @AyCe ปีที่แล้ว

      @@edwardfanboy new UTF8Encoding(false), I defined it as a constant and use it everywhere
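
      The same trap exists outside .NET. As a rough Python analogue (not the C# API itself), the `utf-8-sig` codec writes a byte order mark and plain `utf-8` does not:

```python
import codecs

text = "<a/>"

with_bom = text.encode("utf-8-sig")  # prepends EF BB BF
without_bom = text.encode("utf-8")   # no BOM

print(with_bom == codecs.BOM_UTF8 + without_bom)     # True
print(with_bom.decode("utf-8").startswith("\ufeff"))  # True: the BOM leaks through
print(with_bom.decode("utf-8-sig") == text)           # True: the BOM is stripped
```

      A reader that decodes with plain UTF-8 sees an extra U+FEFF at the start of the text, which is the kind of thing parsers complain about.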

  • @johnterpack3940
    @johnterpack3940 ปีที่แล้ว

    I did not expect to be fascinated by a talk on text.

  • @leonardchan2638
    @leonardchan2638 2 ปีที่แล้ว

    Very interesting talk, never thought I'd be entertained this much by the "plain text" I've been using all this time. Changed the way I look at them next time

  • @IllidanS4
    @IllidanS4 2 ปีที่แล้ว

    Raymond Chen recently had an article about a similar error in text conversion, actually related to not one but two of the "features" here: line endings and UTF-16 endianness. A misconfigured server was auto-converting between \r\n and \n but didn't account for UTF-16, and so every other line was in Chinese since the bytes were shifted, and shifted back into the correct alignment the next line.
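
    The shifting is easy to reproduce. The sketch below assumes the conversion went from LF to CRLF and was applied byte-wise to UTF-16-LE data; the direction and the function name are assumptions for illustration, not details from the article.

```python
def naive_lf_to_crlf(data: bytes) -> bytes:
    """Byte-level newline conversion that knows nothing about UTF-16."""
    return data.replace(b"\n", b"\r\n")


original = "A\nB\nC"
mangled = naive_lf_to_crlf(original.encode("utf-16-le")).decode("utf-16-le")

print([hex(ord(ch)) for ch in mangled])
# ['0x41', '0xa0d', '0x4200', '0xd00', '0xa', '0x43']
```

    The first inserted byte knocks everything out of alignment, so "B" turns into U+4200 (a CJK ideograph); the second inserted byte restores the alignment, so "C" comes out intact.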

  • @AmbiqMercury
    @AmbiqMercury 5 หลายเดือนก่อน

    Adding to the historical part of text encoding: In 1792, France began operating an optical telegraph which used moving indicator arms to encode the messages. Line transmission speed depended on the weather.

  • @stan.rarick8556
    @stan.rarick8556 2 ปีที่แล้ว +1

    @9:21 you are incorrect. The punched cards you show are IBM cards and are not punched in ASCII. They are a modified form of BCD (Binary Coded Decimal) and the follow on for IBM mainframes was EBCDIC which is still in use today.

  • @richardokeefe7410
    @richardokeefe7410 2 ปีที่แล้ว

    In the early drafts of ASCII, the character that is now LineFeed was *intended* to be NextLine (= CR+LF, = C1 control NEL 0x85).
    The simple explanation of ASCII = network controls + format controls + plain text designed for overstriking = remotely operated Teletype.

  • @filker0
    @filker0 2 ปีที่แล้ว

    There are additional complications from the 8-bit encoding days that this talk did not cover. Graphic sets, coding regions, code extensions, single shifts, locking shifts, NRCS (National Replacement Character Sets), and so forth. Every time you are using a terminal emulator (including xterm), and you accidentally cause a character with the hex byte value of 0x0E to be received by the terminal, all lower case characters and most punctuation (and in some cases, all graphic characters) are displayed out of a different character set until a 0x0F byte is received by the terminal program or the terminal is reset. 0x0E is "SO" (Shift Out) aka "LS1" (Locking Shift 1), 0x0F is "SI" (Shift In) aka "LS0" (Locking Shift 0). This is the old ANSI/ISO code extension model that was used by "ANSI" terminals starting with the DEC VT100. Combined with an additional set of code extension techniques, it was possible to display glyphs for up to 4 different character sets, each with 94 or 95 distinct printable characters, on a screen or printed (via a conforming terminal) on a page. The 4 graphic sets (G0 through G3) could each be assigned to hold one of a number (greater than 4) of different character sets, and not all terminals or printers supported the same collection of graphic sets. Unless you knew what character sets were assigned to each of the graphic sets, and tracked LS1 and LS0 (and LS2 and LS3, which were multi-byte sequences), you could often not work out what the text in the file was.
    Unicode has its own code extension techniques though text encoded in it is far less ambiguous than ANSI 3.4 and ISO/IEC 2022 conventions. On the other hand, UTF8, UTF16, and UTF32 require 8-bit safe transmission and storage. Both are actually elegant solutions to similar (if not the same) problem.

  • @lawrencelee3624
    @lawrencelee3624 2 ปีที่แล้ว

    That was brilliant! I'm going to save the URL, and I'm going to download and save the entire video!

  • @davidkantor7978
    @davidkantor7978 2 ปีที่แล้ว +1

    DEC had a better idea about the end of a line: there’s no end-of-line character. Instead, each line begins with 2 bytes that encode the length. Those 2 bytes are hidden from the user’s view; the O/S knows how to deliver a line from a text file.
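
    The idea can be sketched in Python as a toy length-prefixed record format. This is not the actual DEC on-disk layout: the little-endian 2-byte prefix, the lack of padding, and the function names are all assumptions made for illustration.

```python
import struct


def write_records(lines):
    """Serialize byte strings as records: 2-byte length, then the content."""
    out = bytearray()
    for line in lines:
        # "<H" is an unsigned 16-bit little-endian integer, so a line is
        # limited to 65535 bytes; struct.pack raises struct.error beyond that.
        out += struct.pack("<H", len(line)) + line
    return bytes(out)


def read_records(data):
    """Parse the records back into a list of byte strings."""
    lines, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        lines.append(data[pos:pos + length])
        pos += length
    return lines
```

    Because the length lives outside the content, a record such as `b"a\nb"` round-trips unchanged: no byte value has to be reserved as a line terminator.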

    • @kc9scott
      @kc9scott 2 ปีที่แล้ว +1

      That’s a worse idea… what if you want more than 64K chars in a line?

    • @davidkantor7978
      @davidkantor7978 2 ปีที่แล้ว

      @@kc9scott well okay. It could be expanded to 4 bytes. The point is that the concept of a line is abstracted, and separate from its content. You can have any characters inside a line.

    • @calmeilles
      @calmeilles 2 ปีที่แล้ว

      @@kc9scott 64K ought to be enough for anybody… 😀

  • @happyemi23
    @happyemi23 2 ปีที่แล้ว

    Impressive talk on a (usually) boring topic. I was half expecting a nod to PETSCII at the beginning of the talk, but it wouldn't have brought anything new to the table. Great Job!

  • @Kwolf448
    @Kwolf448 2 ปีที่แล้ว +1

    Thank you for teaching me where Zalgo comes from!

  • @maxheadrom3088
    @maxheadrom3088 2 ปีที่แล้ว +2

    Carriage return was also used to see the text since there was no display and even daisy wheels or those IBM balls will cover the text. I believe that's really why CR exists.

  • @jeremysaklad6703
    @jeremysaklad6703 2 ปีที่แล้ว +3

    In my mind, *real* plain text is UTF-8 with LF line endings, and everything else is obsolete.

  • @Bolpat
    @Bolpat 2 ปีที่แล้ว +2

    Unicode decided against having single-line and double-line separately. In some South American countries, single-line means US-Dollar and double line means Peso.

    • @MichaelPohoreski
      @MichaelPohoreski 2 ปีที่แล้ว

      That’s a Unicode blunder.
      The US _used_ to have the dollar sign with two lines as early as 1790. In 1869 an U and S were used. (Some theories say the two lines represent gold and silver but I believe that is conjecture.)
      I don’t know when it got switched to having only one line in the 1960s.

    • @Bolpat
      @Bolpat 2 ปีที่แล้ว

      @@MichaelPohoreski From what I've heard, the dollar sign is derived from an abbreviation for pesos: P and S in one place.

    • @MichaelPohoreski
      @MichaelPohoreski 2 ปีที่แล้ว

      @@Bolpat If you check Wikipedia’s _Dollar sign_ entry that indeed is one theory.
      > It is still uncertain, however, how the dollar sign came to represent the Spanish American peso.

  • @samiraperi467
    @samiraperi467 2 ปีที่แล้ว +2

    Heh, being a Finn I was aware of encoding issues by the time I got online at the latest. ISO 8859-1 vs SF7/PC8/whatever the PC world used. Learned to read different charsets quite fast. :D
    5:16 That's 20 letters, not 24. Missing are C, J, Q, V, X, Z. 7:05 Shirley you mean 1865?