Looking forward to next week's video! Reading ascii codes in decimal hurts my poor lil brain though, I was taught early on in hexadecimal, and it always made more sense to me that way :)
I learned BASIC at school using an ASR-33 TeleType dialling in to an HP 2000F, saving my programs to paper tape. Sometimes, classmates would want to know which program was on their paper tape that they forgot to write the name on. This was easy enough if the terminal wasn't being used, but I could read the holes and tell them :)
I've always liked how caret notation makes clever use of the ascii scheme. If you ever hit backspace in a terminal and see ^H^H^H or cat -A a text file written in windows notepad and see a bunch of ^Ms (or see the programmers use them in comments here), it's because the display has taken the non-printing character, flipped one bit, and is presenting it as its corresponding alphabetic block character. So NUL (00000000) becomes ^@ (01000000), TAB (00001001) becomes ^I (01001001), etc. It also works in reverse to enter these characters, as the Control-C bit in the video explained. Very clever.
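For anyone who wants to see the bit-flip in action, here is a minimal Python sketch of the caret-notation rule described above (the only assumption is the usual XOR-with-0x40 mapping):

```python
# Minimal sketch of caret notation: a control character maps to the printable
# character whose code differs by one bit (XOR with 0x40 flips bit 6).
def caret_notation(byte: int) -> str:
    if byte < 0x20 or byte == 0x7F:        # control characters and DEL
        return "^" + chr(byte ^ 0x40)      # NUL -> ^@, TAB -> ^I, DEL -> ^?
    return chr(byte)                       # printable characters pass through

for b in (0x00, 0x03, 0x08, 0x09, 0x0D, 0x1B, 0x7F):
    print(f"{b:#04x} -> {caret_notation(b)}")
# 0x00 -> ^@, 0x03 -> ^C, 0x08 -> ^H, 0x09 -> ^I, 0x0d -> ^M, 0x1b -> ^[, 0x7f -> ^?
```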
You skipped over 16-31 very fast. I think the Escape character at least deserves a mention! You mention Morse code, but there were several other digital codes that predate even computers. Baudot was developed in France in the 1870s for telegraph machines as a 5-bit digital code. The early consoles used a piano-like keyboard and required operators to press keys together to make chords, so the code was designed to be easier for operators, with more common letters in single-bit positions, and even the numbers weren't contiguous. This was later adapted into Murray code in the early 20th century, with the development of teletype terminals and teleprinters that let operators use a QWERTY-style keyboard. As they were mechanical, the code was designed to minimise wear on the machinery. Finally, fully electronic machines started appearing in the 1930s, leading to the development of ITA2 (which at least put the numbers back in a contiguous block). Having been developed for one purpose and evolved and tweaked for others, the code was quite messy, so we can probably be grateful that the designers of ASCII decided to go with a clean-sheet design. There probably is a universe in which they decided to take Baudot/ITA2 and extend it into a 7-bit code. ASCII effectively has four 5-bit "pages". I could imagine taking the "letter" and "figure" modes of ITA2 as two of those pages, then adding lower case and control codes as the other two. Then, your video would be explaining why the ASCII letters weren't in alphabetical order.
That was awesome! Quite an excellent wrap-up of lots of things I have been learning over the past 50 years or so. Thanks a lot!!! When I was porting a piece of software to an Amdahl machine back in 1993 I was driven crazy trying to test the s/w (BTW: compiling the pure C code went through without a glitch). I made lots of attempts at entering the license key. After launching the debugger it turned out that a character was missing. Finally the system admin asked which characters were in the license key. It turned out that was the culprit: the '#' (a.k.a. hash or pound) was being used as a 'DEL'/delete character to 'X' out unwanted input. Typewriter-style software at its best ...
Subscribed! No dumb-ass stock footage, no tangent shots, just an entertaining and informative chap talking about cool stuff. Looking forward to "Why UTF-8 is Actually Very Clever" - unless you've done one and I just haven't seen it. Thank you.
Very nice video. I have worked with computers since the '80s and never thought about ASCII. Now I know how a Python progress bar is built, and other clever ideas. Well done Dylan!
They even did take care to support foreign western languages to some degree. ASCII includes the grave accent `, circumflex accent ^ and tilde ~, and you could backspace and print them over a letter (on a real teletype, not on a video screen). The single-quote/apostrophe character 0x27 ' did triple duty as an acute accent, and in some old fonts it looks like a mirror image of the grave accent. The double-quote character " could be used as an umlaut/diaeresis in a pinch. The double-quote and single-quote characters were also common on typewriters, and these did not have separate opening and closing quotes. The underscore character was meant to be overprinted on other text as well, just doing a CR without LF.
You can do it on a video screen too. It's called "Compose" and you just press the Compose key (whichever key you've assigned for that purpose) and then for example 'a' and '^'.
@@greggoog7559 That has nothing to do with ASCII as such. Compose combinations are substituted with codepoints for accented letters (formerly in your favorite 8-bit code page, today in Unicode). I was talking about old printers that only had 7-bit ASCII and could print a letter, then backspace then the accent.
EOT (Ctrl+D) is still used in Unix/Linux to end a terminal session. I also find it odd that 28-31 aren't used more, they are perfect for use in CSV(like) files to avoid needing to do escaping etc.
The utility of CSV is that you can edit it in pretty much any text editor in a pinch and it still remains (fairly) human readable. Once you introduce control codes that won't be visible at all in some editors and require special settings in others, you might as well develop a binary format that is more efficient. That said, if you can't influence the design of a data format and need an extra set of delimiters they are useful, but probably not best practice.
Control D doesn't end a terminal session. It flushes the keyboard buffer without adding anything to it. If you're at the start of a line, then you flush zero bytes. A read from a file of zero bytes indicates an end of file in Unix. So the shell reads zero bytes, thinks its input is closed, and exits. Write a program that sits in a loop reading stdin and writing what it gets, without any buffering. Then type "ABC" and hit ^D, and you'll see that instead of exiting it just prints ABC.
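A rough Python rendering of that experiment, assuming a Unix terminal in its normal "cooked" mode (this is a sketch of the behaviour described above, not part of any particular tool):

```python
# Read the raw stdin file descriptor in a loop and echo whatever arrives.
# Typing ABC then ^D delivers b"ABC" (no EOF); ^D on an empty line delivers
# zero bytes, which programs interpret as end-of-file.
import os

while True:
    data = os.read(0, 1024)   # blocks until the tty flushes its line buffer
    if not data:              # zero-byte read: ^D at the start of a line
        print("zero-byte read - treated as EOF")
        break
    os.write(1, data)         # echo the bytes back; the ^D itself never appears
```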
11:50 The rest contains one really important character: ESC, the Escape character. It is used with ANSI escape codes to generate all the wonderful color and other formatting in terminals even to this day. Maybe that is worth a video.
13:14 I'm pretty sure it wasn't the creators of ASCII who threw all the hyphens and quotes onto a couple of piles - it was the teletype makers, who from around 1900 to the 1960s had no 1, only an i without a dot, a separate dot that doubled as a single quote, and no separate characters for o and 0. That meant that ASCII adding back these additional characters would force mechanical changes to the devices that were supposed to use the new standard. Since computers need a distinction between a letter and a number, the 1/i and 0/O issue had to be solved, but the start and end quotes have no functional meaning to a computer.
Great video and very well explained! The point about why certain commands are still in use today and their origins was very interesting. I learned something new-thanks for sharing
Fun Fact: Some of us remember the key-strokes Ctrl-S and Ctrl-Q. They are the ASCII codes to stop and resume display output. They use the codes for Device Control 3 (ASCII 13 hex) and Device Control 1 (11 hex) to tell the sending device to stop and resume sending data.
Thanks a bunch for this video. I've known most of these things already, but in my programming career knowing those fundamental bit layouts and tricks has been so valuable for writing efficient and understandable code.
When I was first introduced to computers in 1977, I used an ASR-33 Teletype complete with paper tape punch/reader. The ASR-33 only had uppercase letters, so it was with a sense of wonder I discovered that some more advanced terminals could also do lowercase! And everyone wrote the obligatory program that scanned through codes 0 to 127 and printed them out to see what they would do. Sending a string of ^G characters to an ASR-33 produced a sound never equaled by later devices, especially since they never seemed to insert a gap between the beeps.
Between Morse code and ASCII there was also ITA2 (sometimes incorrectly called Baudot code), a five-bit code for mechanical teletypes. It used control codes (letters and figures shifts) to switch between letters and digits/punctuations. ASCII still has SO/SI control codes to make it possible to temporarily switch to a different character set. ITA2 has a Null character, CR and LF and even Bell and "Who are you" (similar to the ENQ control code in ASCII).
I've actually used 0x1F instead of commas when I needed to save something with the sheer simplicity of a CSV file while not having to figure out the logic of how to handle data with commas or quotes in them. Works great. You know, since that's what it's for, haha
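A sketch of that approach in Python, using the ASCII unit and record separators (the field values here are made up for illustration):

```python
# "CSV without escaping": fields are joined with US (0x1F) and records with
# RS (0x1E), so commas, quotes and newlines in the data never collide with
# the delimiters.
US = "\x1f"   # unit separator, between fields
RS = "\x1e"   # record separator, between records

rows = [["Widget, large", 'say "hi"', "3"],
        ["Gadget", "plain", "7"]]

blob = RS.join(US.join(fields) for fields in rows)

# Parsing needs no quoting rules at all.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```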
It is mostly forgotten that we have SOH, STX, ETX, EOT, ENQ, ACK, SYN, ETB, FS, GS, RS, US, and particularly EM: end of medium. This was primarily designed for data transmission, like Baudot, and not for use on the computers themselves (memory and files), as the very name states: "for Information Interchange". It is interesting to analyze those systems by their purpose (a teleology if you want): Morse made the most-used characters shorter (he went to a printing press and looked at the sizes of the type cases - the most common letters had the bigger compartments; yes, this is why we say uppercase and lowercase); Baudot was first designed to minimize wear on the mechanical parts of the telegraph (not the modern Baudot); and in ASCII, well, we see a few hints of a protocol attached to a machine, such as those mentioned plus DC1, DC2, DC3 and DC4. I always wonder if it was used this way or whether that part of the standard was simply ignored. Yeah, a teleprinter used many of them, but certainly not FS, GS, RS and US: they are for separating files being sent, not just for use inside files - you do not need FS inside a file (except maybe a file like a TAR), but you do need it on a data stream that carries several files, like a paper tape, a magnetic tape or something like that.
I got distracted at 3:50 and reimplemented morse code as a canonical Huffman code. By hand, in Excel, for fun. 😅 Each character is 3-9 bits long but it's a binary prefix code so no need for gaps in transmission.
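For the curious, here is a minimal Python sketch of the same idea; the letter weights are rough, made-up English frequencies for just ten letters, so the code lengths will differ from the full 3-9 bit version mentioned above:

```python
# Build a Huffman (prefix) code: frequent letters get short codes, and because
# no code is a prefix of another, no inter-letter gap is needed.
import heapq
from itertools import count

freqs = {"E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0,
         "N": 6.7, "S": 6.3, "H": 6.1, "R": 6.0, "D": 4.3}

tiebreak = count()  # keeps heap entries comparable when weights are equal
heap = [(w, next(tiebreak), {ch: ""}) for ch, w in freqs.items()]
heapq.heapify(heap)

while len(heap) > 1:
    w1, _, left = heapq.heappop(heap)    # merge the two lightest subtrees
    w2, _, right = heapq.heappop(heap)
    merged = {ch: "0" + code for ch, code in left.items()}
    merged.update({ch: "1" + code for ch, code in right.items()})
    heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))

codes = heap[0][2]
print(sorted(codes.items(), key=lambda kv: len(kv[1])))  # E and T come out shortest
```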
One more thing, following on from how upper and lower case are separated by a single bit: look at the number keys on a keyboard and the symbols on them. Starting from 1 you’ll notice the codes for the numbers and symbols are also separated by a single bit. It goes a bit wrong about half way along, but on old keyboards (pre IBM PC) this usually works for the whole set. Now look at the keys for the non alphabetic symbols in those two alphabetic ‘blocks’. You’ll find the symbol in the low case block is on the same key as the equivalent symbol in the upper case block. Thus, the symbols and numbers on most keys differ only by a single bit. Why? Because taking a keyboard scan code and converting it to ASCII requires a bunch of code and a look up table. Old computers were very slow and had very little memory. So old keyboards generated ASCII codes in hardware, to be returned to the processor. Arranging the keys so the symbols on them were one bit apart made the hardware much simpler. To be fair, it’s probably fair to say that the ASCII codes were derived from existing typewriter layouts. So it’s actually the ASCII code ordering being chosen to match the keyboard layout rather than the layout being designed to match the ASCII. But that just makes the ASCII design even smarter. (And I suspect the same is true for teletypes and the symbol pairings on the hammers - which were probably inherited from typewriters anyway).
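A quick Python check of the single-bit claim against the ASCII table; the digit/symbol pairings below are the old teletype/typewriter ones (e.g. Shift-2 = "), which is where the pattern holds - modern US PC layouts break it for some keys:

```python
# Case differs by bit 5 (0x20); the old shifted-digit pairs differ by bit 4 (0x10).
pairs = [("a", "A"), ("z", "Z"),
         ("1", "!"), ("2", '"'),
         ("3", "#"), ("5", "%")]

for lo, hi in pairs:
    print(f"{lo!r} {ord(lo):#04x}   {hi!r} {ord(hi):#04x}   differ by {ord(lo) ^ ord(hi):#04x}")
```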
There was also a design for conversion between EBCDIC and ASCII that required only a handful of transistors. The two standards were developed together. (IBM 026 and 029 card code preceded EBCDIC.)
The ESC ASCII character is often used in various APIs. With the old BIOS interrupts for reading the keyboard, for example, you can grab "scan" codes or ASCII codes. Most people who wrote games went for scan codes, as did much other software, but even though no proper ASCII is returned for the arrow keys, the Escape key does generate the ESC character in the BIOS - just an example.
The Device Control characters are still very important for configuring barcode scanners. How do you change the settings of a barcode scanner, e.g. whether to append a carriage return, a line feed, or nothing after scanning a barcode? You send combinations of device control characters followed by alphanumerics. The exact combinations are device specific. Also, just last year we migrated away from a 1980s Unix program (still a very popular program) that uses a database of literal ASCII strings, each field separated by the Record Separator character.
@@darrennew8211 More likely he does not "feel" that NUL (\0) is a character in earnest. But gut feelings (or C ASCIIZ hang-ups) are irrelevant - ASCII is defined as 128 7-bit characters.
Reminds me of the MCP in Tron with his "End of Line". Ctrl-d can be used anywhere you want to end a file like `cat - >my_file.txt` - type a line, type another line, ctrl-d
Ctrl-D is used a lot on Linux in general. Anytime you use a pipe it takes one process's stdout and connects it to another's stdin, and the convention to say that the stdout is empty is to send Ctrl-D.
@@pidgeonpidgeon No, ctrl-d for end of transmission is in the terminal (tty) layer. Between processes end of file is indicated by closing the connection, see shutdown or close system calls. The terminal in cooked mode also permits using ctrl-d to input an unterminated line without ending the file, similar to fflush, or actually transmitting EOT with ctrl-v ctrl-d. More details in e.g. stty(1); try "stty -a".
It's more than bash. *nix uses ^D to mean EOF. Any program reading from STDIN getting an EOF would exit as it can no longer read any input; eg:
$ cat > hello
World
^D
$ cat hello
World
$
Thus, when you put an EOF (as the first character) to bash, it gets an EOF and exits, as do sh, csh, tsh, etc.
You missed one very important use of characters in the control block: character 27 (ESC) is used by terminal emulators as part of the "control sequence introducer" ("CSI") to do things such as changing foreground/background color, setting bold/italics/underline, etc. Although this is more prevalent in the UNIX world, even DOS (and the Windows command prompt) had a device driver (ANSI.SYS) supporting these ANSI escape codes.
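A small Python sketch of those CSI sequences, assuming a VT100/ANSI-compatible terminal (most Unix terminal emulators, or a recent Windows console with VT processing enabled):

```python
ESC = "\x1b"
CSI = ESC + "["        # "Control Sequence Introducer"

print(CSI + "31m" + "red text" + CSI + "0m")                # 31 = red foreground, 0 = reset
print(CSI + "1;4m" + "bold underline" + CSI + "0m")         # several attributes at once
print(CSI + "3B" + CSI + "10C" + "moved 3 down, 10 right")  # cursor movement
```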
Another neat thing about the way the digits are organized in ASCII is that if you convert to hex, you just look at the lower nibble and you get the number. Also I like how the alphabet characters start with the low bit set, because it makes more sense to have A = 1 rather than A = 0.
Great vid, thanks. A follow up vid could be a similar explainer about how utf-8 uses multiple bytes and what happens when that is read using a single byte encoding.
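As a taste of what such a follow-up would cover, here is a short Python sketch of UTF-8 bytes being (mis)read with a single-byte encoding:

```python
# 'é' is two bytes in UTF-8; decode them with a one-byte-per-character
# encoding and each byte becomes its own (wrong) character.
text = "café"
raw = text.encode("utf-8")
print(raw.hex(" "))            # 63 61 66 c3 a9
print(raw.decode("latin-1"))   # cafÃ©  <- classic mojibake
print(raw.decode("utf-8"))     # café   <- decoded correctly
```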
Good video, nice refresher of a topic I haven't really thought about directly since university - except for bloody Windows crlf when working with cross platform code
I like this, but what about the earlier threads like Jacquard Looms? There's some fascinating stuff in the first APL books (I forget if it's in A Programming Language or Automatic Data Processing) about how to design encodings for punch cards with various numbers of holes.
My first programming was over a dialup teletype at 110 Baud or 10 characters per second. I was in high school in the '70s and dial up time share systems running BASIC cost $6.00 per hour, so connect time was precious. You wrote your program offline on paper, then entered it on the teletype, punching it on tape as you typed, and if you made a mistake, the DEL key was like digital White out. Of course, it did not speed up data transmission. Once you had it all typed onto paper tape, you dialed the number with a Touch-Tone keypad, logged in and then played the paper tape back to upload your program. Then you ran it, you could also renumber, and list it back and re-punch it for later. When I told my mom I needed money to learn BASIC programming, she asked what I did on the computer. I told her games. I love her: she didn't complain. I became an Electrical Engineer/Computer Science guy.
EOT (End of Transmission) is Ctrl+D and can still be used today. Ctrl+D in Linux (and other similar systems) will flush the current buffer. If this buffer is empty, it will result in a zero-byte read. A zero-byte read means end of file/end of input in most contexts. For example, using it at a shell prompt will cause the shell to exit with exit code zero. If that was a login shell, it causes a logout. I use it every day. Also ESC is widely used to decorate Linux console output (colors etc).
Control character 4 (EOT), that is, Ctrl-D, still lives on in terminal emulators of Unix-derived system like Linux as the end-of-file character (although technically it's the flush-input-buffer character, but returning an empty input is interpreted as end of file on Unix-derived systems, therefore it effectively acts as end-of-file for terminal input and also is commonly referred to as such; the difference can be seen if you try to use it on a non-empty line).
Great talk, thanks. I always suspected something about DEL: the way it sat in the ASCII table didn't look right - a control character all by itself, as if it was an afterthought. So a program reading from a stream would just ignore DEL characters.
One great thing about these blocks described is that one can see that like using Ctrl-C for ASCII 3 (ETX), one can also use Ctrl-[ (ESC) instead of lifting hands off the home row for Escape. Great for increasing TUI speed and efficiency.
More about the ASCII graveyard, please! For instance, RS, Record Separator, now used in the application/json-seq format to separate JSON objects, e.g. in a streaming event log that will never finish. Lots of goodies in the graveyard...
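A small Python sketch of that RS-delimited JSON idea (per RFC 7464, each JSON text is preceded by a Record Separator and followed by a newline; the event objects here are made up):

```python
import json

RS, LF = "\x1e", "\n"

def encode_seq(objs):
    return "".join(RS + json.dumps(o) + LF for o in objs)

def decode_seq(stream):
    for chunk in stream.split(RS):
        if chunk.strip():
            yield json.loads(chunk)

events = [{"event": "start"}, {"event": "tick", "n": 1}]
print(list(decode_seq(encode_seq(events))))   # round-trips the objects
```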
ASCII being 7 bit covered most of the generic characters including accented characters via overprinting. If the inventors had wanted to include all possible characters across the world, they would have needed at least 2 bytes per character to be able to handle Chinese and Japanese ideographs. Leaving the remaining 128 values of a byte unspecified allowed different countries to add country specific characters. In the IBM PC world these were implemented as “code pages”, and were a bit of a problem when talking between countries. Unicode eventually resolved this communication problem, but it requires 32 bits or 4 bytes to encode the over 140,000 characters, and there are visually identical Unicode characters that are logically different, which makes it easier for scammers to fake internet addresses. And something as large as Unicode wasn’t practical in the early days of computing, when every single byte saved was significant. EBCDIC had the advantage that numbers were readable on hex crash dump printouts, but numbers and letters shared the same character codes (C1 represented either A or positive signed 1, depending on what the data type was.)
7:12 Speak for yourself! I still use Control-D, to close terminals and exit SSH sessions, quit python or node.js and the like. Edit: Love the video btw, can't wait for the next one :)
4:51 While eight-bit bytes were already common when work on ASCII began in the early 1960s, they did not become ubiquitous until the mid-to-late 1970s.
Praised be the Algorithm ... it happens VERY rarely that I want to upvote a video and notice that I have already done so. Guess I'll have to subscribe ...
The extra bit also provided parity. CR and LF were separate because going to the next line on a teletype took two character times. Multics chose LF as the NL because CR could be considered as not doing anything. _ was originally a left arrow.
I like how you've recycled some of the points from your talks into their own little videos, especially when the video topics are directly interactive with the community or fans.
You missed the cleverness behind codes 33-41. These punctuation marks come in the same order as they do on the number keys of an (American) keyboard; this means that, similar to how lower case letters were converted to upper case by resetting a single bit (toggled by the shift key), the same was actually true of pressing shift + a number key.
In old Apple ][ word processors you'd enter control characters to teach the word processor how to work with your new printer (instead of drivers). Also they were used for modems. We had to type in weird characters to get the modem transmitting.
Apple ][ basic also used the VT52 arrow key ESC sequences to move the cursor around the screen ready for copying - you listed a line and then had to ESC-D-ESC-D etc to get to the start of what you wanted to copy, use the -> key to copy, and use lots of ESCs to skip over blank characters - the Apple ][ was aggressive in printing out blank spaces and word wrapping. To fix the excessive spaces we set the right hand edge of the window to be one less character than it needed before it word wrapped (indenting from line number) and so it would not put in the excessive end and start of line spaces. The next thing I did was to write a character output trap which ignored spaces outside quotes and showed control characters in reverse (particularly for the DOS ^D prefix character, but it also showed up any other CTRL codes) to make it easier when editing such lines.
It's worth mentioning that CR LF is also still in use when it comes to HTTP. HTTP is a text / line based protocol and each line is terminated by a CR LF pair. So even in the Unix / Linux world you have to deal with that line ending. There's a similar issue when it comes to little vs big endian. While little endian pretty much dominates the PC world nowadays, when it comes to network protocols most of them use big endian. That's why it's often called "network order". Big endian makes it easier to create hardware decoders, and since a huge chunk of the network world is hardware this actually makes sense. From the programming point of view I generally prefer little endian.
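Both points in a short Python sketch (the host name is just a placeholder):

```python
# HTTP/1.1 header lines end in CR LF on every platform, and multi-byte
# integers on the wire are conventionally big endian ("network order").
import struct

request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"                     # a blank CR LF line ends the header block
)
print(request.encode("ascii").hex(" "))   # note the recurring 0d 0a pairs

port = 8080
print(struct.pack(">H", port).hex())   # 1f90 - big endian, network order
print(struct.pack("<H", port).hex())   # 901f - little endian, as in x86 memory
```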
Fun facts: The ASCII underscore character was originally a left-pointing arrow, which is why Smalltalk (from around 1976) uses "_" as the assignment operator, and why Pascal (designed to work with EBCDIC also) uses ":=" instead, to look as much like an arrow as you can get on punched cards. EBCDIC has the same sort of bitwise feature for letters that the upper/lower trick in ASCII uses, except it's designed for punched cards. So with a card 12 rows high, the letters are in "contiguous" numbers if you ignore the proper holes on the card rather than ignoring the proper bits in the byte.
Actually, Pascal had several digraphs to be used when certain characters were not available. For example, Pascal comments were written in curly braces {like this}, but in case curly braces were not available, you could also use parentheses with asterisks (*like this*). Now the only character used by Pascal that was not available in ASCII was the left arrow, whose digraph replacement was :=, which is why that one became commonly known as the Pascal assignment operator.
@@__christopher__ Were '" not available? "" for assignment, eg: RA -> VARLOC means the contents of the A register are stored in the location pointed to by VARLOC (effectively a variable).
@@__christopher__ Interesting how all the BASICs I've used overload the '=' operator to mean both "assign" and "compare equal" - the meaning based on context. How about "<-"? (That looks more like an arrow than ":=".)
@@cigmorfil4101 that is already a less-than followed by a unary minus operator. Also, := was already in use in mathematics for definitions, so it fits quite well. Note also that a proper assignment statement in BASIC was LET var = value. A lot of BASIC interpreters (in particular Microsoft's) allowed omitting the LET though.
Sadly lost and not mentioned here, the FS, GS, RS, and US characters (28-31). Meant to serve as distinct bytes that wouldn't be part of text data, and therefore could easily be used to delineate it. But alas instead we just totally forgot they existed and therefore ended up with formats like CSV, which gave double meaning to commas, newlines, quotes, etc. With special escaping rules and incompatibilities between systems. And we've spent generations figuring out how to handle that properly and handle all the edge cases. Just because we didn't have and didn't bother to come up with a few symbols to represent those 4 characters. Some of those other low code points were perfect for networking, sending a single byte to communicate something that now we need an entire packet to communicate the same thing.
Interestingly Pick uses characters 252-254 as markers in dynamic arrays (and file items) between the "elements":
FE - 254 - Attribute mark
FD - 253 - Value mark
FC - 252 - Sub Value mark
The whole dynamic array is a string with the elements separated by the marks. If an element is required that doesn't exist, Pick adds enough of the relevant marks to create it when setting the value of the "element", or returns a null. This means you get to access things like:
Data = ''
Data<1> = 'attr 1'
Data<2,1,3> = 'at 2, v 1, sv 3'
Data<2,3> = 'at 2, v 3'
Data<4,2> = 'at 4, v 2'
CopyData = Data
Element2 = Data<2>
The strings Data and CopyData contain:
attr 1[am][sm][sm]at 2, v 1, sv 3[vm][vm]at 2, v 3[am][am][vm]at 4, v 2
And Element2 contains:
[sm][sm]at 2, v 1, sv 3[vm][vm]at 2, v 3
Where [am] is char(254), [vm] is char(253) and [sm] is char(252).
Pick is a multi-value DBMS OS with all fields of variable length and type (though as the whole is stored as a string they're effectively all strings which are converted to the relevant type at time of use).
The use of CSV is to _avoid_ non-printing control characters (other than a line break) so that the data is easily edited as plain text by a plain text editor. A plain text editor generally only understands line breaks; how control characters are displayed depends upon their programming: some may display ^c, some may display a '?' regardless of the character, some may let the display driver decide what to do (hence the smiley faces, musical notes, etc, that the original IBM PCs displayed for control characters). As there was no consensus on how to handle control codes, CSVs avoided them and stuck to plain text, using commas (hence the name: _Comma_ Separated Values), requiring some sort of escape for commas - enclose a field containing commas within quotes - and a mechanism to handle the quoting character within fields.
@@cigmorfil4101 I always found this argument bizarre. ASCII was invented well before any "plain text editor" was, so saying "we changed this because plain text editors couldn't handle ASCII" sounds like working around the problems in tools rather than just fixing the tools. There was also an image format called NetPBM which was great, and one of the options was to represent all the bytes with decimal digits. Like, you could read it with BASIC even. Red would literally be "255 0 0" with nothing other than ASCII digits and spaces.
I despise the EU political project, but I learned a LOT about computing from the Euro symbol, and having to work on its introduction. It must have been weird to show up as the (eg Saudi) delegate to one of the conferences in the 90s and have to talk to nerds about how other writing goes right to left, and how the character set is continuous
Technically not. It's "send the buffered input without sending the ctrl-D". If there's no buffered input, the program gets a read length of zero, which is treated as end of file. But if you type something first and hit control D, it just sends what you typed.
As someone who learned to type on a Smith Corona, and later used a Model 33 Teletype, I did not know there were different characters for left and right quote marks. There are just single and double quotes.
As someone who was taught to write by hand at school, we were taught to use 66 quotes at the start and 99 quotes at the end. Having only " on a computer was a bit of a shock. These days LibreOffice Writer and Word automagically "correct" with smart quotes to 66 or 99 quotes (depending upon surrounding spaces).
My browser decided to buffer at the perfect moment, in the Morse Code section. "The code for A is dot-dash, but if you leave a gap, ..." and then it just started spinning. It was a VERY long gap 😂
Once upon a time, I worked in the office next door to Bob Bemer, the editor of the first ASCII standard. Which, by the way, also specified EBCDIC. IBM was the only manufacturer that embraced EBCDIC rather than ASCII because EBCDIC was more punched-card friendly, and IBM virtually owned the market in 80-column card equipment.
Single newline came from the B programming language. Multics used LF.
X-ON and X-OFF are misidentified in your table. They're DC1 and DC3 respectively.
ETB was the standard 'file mark' that separated multiple files on a magnetic tape. EM, 'end of medium', was a mark that meant, 'this file spans multiple reels, time to switch to the next reel.'
NAK - negative acknowledgment - is the ^U that you use to cancel the stuff you're typing at the command line.
IBM literally invented the punched data card via the Hollerith Company in 1889…
Punched cards controlling machines however dates from 1798.
@@allangibson8494 It's crazy that modern IDEs still have markings at 80 characters. That technology was so far ahead of its time.
@@lbgstzockt8493 80 Characters was what was determined by the U.S. census as being adequate to store the population information as a line item…
I didn't even know ^U existed, I've always just done ^C (and shells are smart so they know to catch that and not just terminate)
@@thezipcreator Shells terminate with ^D, not ^C. ^D is End of Transmission (i.e. connection). ^C is passed on from the shell to the program that is running at the time.
I always wondered how terminal progress bars and such worked! This also explains why these often kinda break when there's an error or warning during the progress bar. Thanks, this was entertaining _and_ useful.
On modern systems, since roughly the '80s (support added in Windows 11, which previously used a *very* different method), it's all done using VT100 and its successors. Here you'll find ways to encode things like "move to row X column Y" and "set color to red" (there are hundreds of commands). The trade-off is that commands are no longer single bytes. This was used for the first digital displays, especially for dumb terminals.
Windows has even added support for these newer control codes to their console host. Before that, and what still works, is to send commands to the console host driver (condrv.sys) via device IO controls.
Great video. Though I think what deserves to be mentioned is backspace. On paper terminals, you can't delete a character you've already written, so all that backspace did was going back one space, allowing you to print over the previous character. This was useful for making text bold - as you mentioned when discussing carriage return - but also for creating combined characters. Want to type "café"? Just type "cafe", hit backspace, and type an apostrophe. The fonts used for paper terminals were carefully designed to make this look good. Likewise, o with " on top was a "good enough" approximation of ö. Some ASCII characters were included specifically for this reason: the tilde, the acute/backtick, the caret. But most importantly, the underscore. The only reason why it was included was to underline words to highlight them.
Important context!
There was a phase of my life when I was using a 5-bit teleprinter as an I/O device for my homebrew 8-bit system. It unfortunately didn't have any backspace ability, which was very annoying when I wanted to print zeroes with slashes through them. I ended up doing a CR and then going over the whole line again to fill in the slashes.
As an American I will proudly ignore all further episodes as I now have everything I need. /s 😂
so you don't need that emoji over there, right? :)
It's kind of funny, but that decision became a self-fulfilled prophecy. Because it wasn't a given that you would have consistently-mapped upper ASCII characters to represent even the most common international letters, it got to be fairly commonplace to see letters with accent marks dropped back to their un-accented variants.
Granted, I'm a native English speaker, and so 26 letters ought to be enough for anybody. ;-) But, it didn't seem to have much of an effect on the intelligibility of words that used those letters. I recall seeing discussions on this where, specifically, Spanish-language and German speakers shrugged it off as, "eh... we knew what it meant." And, again, as a native English speaker, I have rarely considered the word "jalapeno" spelled with anything other than a plain 'n', and yet I recognize it easily enough in either form.
On a related note, I got a crash-course in the peculiarities of various languages when I started writing a driver for FAT filesystems. Plain FAT (as in, pre-LFN) is case-insensitive, and meant to only consider letters in the low-ASCII range A to Z. All lowercase alpha chars are converted to uppercase with that toggling of bit 5. But, when LFN support was added, well... now we're dealing with Unicode characters in UTF-16 form, and ... _technically_ ... we should be case-folding everything (I think?) to uppercase to store the 8-dot-3 compatibility entry, and when searching for or comparing filenames in either 8.3 or LFN form.
I say "I think" because the official FAT LFN spec is a bit quiet on what to do about chars with ordinals above 127, probably for the most obvious reason: It's kind of a pain to handle those. You have languages that case-fold differently depending on context, and when you're converting from Unicode to local code-pages, that character might not even exist. While (IIRC) the common US English DOS code page has upper- and lower-case variants of all the accented characters >127, not all of the code pages do, making it impossible to represent properly uppercased versions of any given filename entered with lowercase chars. And, of course, if you change code pages (by either changing the local code page, or moving a file to a system with a different code page), the filename might "change" completely, resulting in lowercase chars in the filename, potentially making it inaccessible by normal means, or causing match collisions with files that get created with uppercasing applied. I think some (or maybe most?) implementations just continue cas-folding lower-ASCII chars, and letting everything else slide.
All of this because the original implementations were designed in a relatively simple (cultural) language with straightforward rules, and when -- or if -- the thought occurred to anyone about how to handle other languages, they just shrugged and thought, "eh... that's a problem for future developers."
It took me longer than I care to admit to figure out that ASCII is also BCD with extra "tag" nibbles in between. You can read numbers off easily in a hex dump just by ignoring the extra 3s everywhere. Well and if you get used to it, you can also read letters pretty easily from the hex dump but that feels more like using one of those old cereal box decoder rings.
The thing with the numbers is even prettier if you look closely at the bits with hexadecimal in mind.
0x30 to 0x39 are '0' to '9'.
So if you are doing embedded programming and plan your decimals accordingly, you can see what each one is without having to dissect the bits.
This is cool, although it's an inevitable side effect of the "& 0b1111" thing. In order to get string-to-int using only an AND, the digits have to be LSB-aligned, and because there are 10 digits they need 4 bits.
This is a side effect of it being backwards compatible with BCD. If you wanted to you could actually do arithmetic directly in string form because of that.
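A small Python sketch of that trick, using only the "& 0x0F" and "| 0x30" relationships mentioned in the surrounding comments:

```python
# Add two numbers directly in their ASCII-digit form: mask each code with
# 0x0F to get the BCD value, add with carry, then OR with 0x30 to get back
# an ASCII digit.
def add_ascii(a: str, b: str) -> str:
    width = max(len(a), len(b))
    a, b = a.rjust(width, "0"), b.rjust(width, "0")
    out, carry = [], 0
    for ca, cb in zip(reversed(a), reversed(b)):
        carry, digit = divmod((ord(ca) & 0x0F) + (ord(cb) & 0x0F) + carry, 10)
        out.append(chr(0x30 | digit))
    if carry:
        out.append(chr(0x30 | carry))
    return "".join(reversed(out))

print(add_ascii("1987", "456"))   # 2443
```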
And I still have burned into my brain that there are 7 character codes between '9' and 'A', from all the hexadecimal to binary conversion routines I wrote in machine language.
I've sometimes fancied that if I were to design an improved character encoding I would make the first 36 codes be all the digits followed by all the uppercase letters. It just makes sense. To a programmer, at least.
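That 7-code gap is exactly what a classic hex-digit conversion routine has to step over; here is a Python sketch of the usual approach:

```python
# 'A' is 0x41 but should mean 10, so after subtracting '0' we subtract another
# 7 for letters to skip the :;<=>?@ gap between '9' and 'A'.
def hex_digit_value(ch: str) -> int:
    c = ord(ch)
    if ord("a") <= c <= ord("f"):
        c -= 0x20                # fold lower case to upper case (clear bit 5)
    c -= ord("0")                # '0'..'9' -> 0..9, 'A'..'F' -> 17..22
    if c > 9:
        c -= 7                   # skip the seven codes between '9' and 'A'
    return c

print([hex_digit_value(c) for c in "0fA9"])   # [0, 15, 10, 9]
```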
A lot of the "skipped" codes, like ACK, NAK, and SYN was used in a lot of early communication-protocols, like XMODEM and the likes. And for some reason I don't understand, DC1 and DC3 was used for XON and XOFF that I think we all remember from the old modem-days. I don't know why SO and SI are called X-On and X-Off in some ASCII-tables ... maybe some other protocols used those?
Ah, the days of RS-232 ASCII-based protocols!
STX (start text) and ETX (end text) are used sometimes for framing purposes.
DC1 and DC3 turned on and off the paper tape reader on Teletypes. DC2 and DC4 turned on and off the paper tape punch. When you were sending a paper tape down the line, if you were threatening to overrun a buffer, the other end would send DC3 to say, 'hold on there, tiger', and DC1 again when it was ready to slurp down more. ^S and ^Q still work that way on most Unix terminal emulators.
Shift in and shift out controlled what you might think of as the typeface.
In your chart, you've got x-On and X-off as 14,15 SO,SI "control-N" / "control-O" but, in any system I remember, XOFF and XON are DC1/DC3 17,19 "Control-S" / "control-Q" ... and that takes me back to writing printer handshaking diagnostics for the repair centre at work and saying, "Oh that's why some of the old 8-bit machines had control-S to pause scrolling".
My old manual typewriter didn't even have an exclamation mark... because you could make one out of single-quote, backspace, full stop.
Control-O was commonly used to throw away the rest of the current terminal output.
Clicked on this video only to discover it's by the guy who made the Rockstar programming language, lmao wasn't expecting that. Loved the video btw!
I enjoyed this one very much. I still remember discovering an ASCII table in one of my father's handbooks when I was a kid. This video took me back.
First time viewing you channel - this was excellent.
Before HTML was a thing, I worked for an organisation selling structured news (sports results &c)
We used record separators (RS, ascii 30) and file separators (FS ascii 28) to split up our rows and fields.
It took me a long time to realise we were redefining the acronyms.
RS was right to separate records. The fields should have been separated with US, unit separator. GS and FS were higher level.
ASCII 27 (ESC, generally written in source code as \x1b, \033 or \e) is still used a lot by terminal applications for things more complex than a plain CR or LF can do, including changing the colour of the text or background!
There was a whole ANSI standard that came later for what the various escape codes were supposed to do. (Nobody implemented the whole thing, and no two vendors implemented the same parts.)
Don't forget *^[*
😉
ESC is a pretty important character. Not as important as 0x0A or 0x0D, though.
VT100 and later ANSI escape sequences made BBS pages colorful and graphical (boxes, symbols, etc.). DEC added ReGIS graphics to the escape sequences, and graphic primitives could be drawn on the screen, enabling interactive graphics terminals, all using 7-bit ASCII.
I still remember hitting my first EBCDIC files (about 1985) and being amazed that the A-Z characters were scattered around at what looks like random.
If I remember correctly, EBCDIC was designed to be backward compatible with IBM's punchcard systems, which were still relevant at the time. I think there were considerations for efficient electromechanical sorting and also for not producing too many consecutive holes in the card which could clog the reader or the hole punch machine.
Back when it was invented, IBM was almost superstitious about punchcards because they were a huge reason for their financial success, and continuing financial success.
In hindsight they don't seem so important of course.
@@timseguine2 Right. The punchcards didn't use a binary encoding of the digits 0-9. Instead they had 1 row for each digit. So it made sense to use only the digits 0-9 in the lower nibble for the letters, too. There is a picture of a punchcard in the Wikipedia article about EBCDIC. It looks quite neat and not random at all.
They're lined up properly if you ignore the right holes on the punch card, rather than ignoring the right bits in a byte.
EBCDIC was just a newer, fancier version of BCDIC, binary coded decimal interchange code, which itself was more a group of similar but different encodings. BCDIC was a 6 bit encoding where the numbers 0-9 were encoded as the values 0-9 and everything else was distributed basically randomly. The letters (uppercase only) were divided into three groups which were backwards, S-Z was encoded with smaller numbers than J-R which were smaller than A-I. EBCDIC is an 8-bit encoding (although many code points were left undefined) which didn't fix the noncontiguous problem but it did fix the order of the letter groups.
^D isn't dead and gone. in most CLI/TUI contexts it's a semi-standard way to close out into the parent shell, and still works well in cases where ^C is taken (e.g. in the python shell, where it will raise KeyboardInterrupt).
incidentally if you're using the python shell more than very very occasionally... install ptpython
infinitely nicer python shell
Except on Windows where it's usually Ctrl-Z
What ^D does is it sends any buffered data, including whatever you've already typed, but not the ^D. If you type it with nothing buffered, then it sends zero bytes. And Unix treats a read of zero length as an end-of-file.
^Z is the character that CP/M decided to put inline to mark the end of a text file, because all files in CP/M were a multiple of 128 characters long. You never saw a file that was like 74 bytes long, so if you had a 74-byte text string in a file, you tacked ^Z on as byte 75.
@@darrennew8211 I don't know why YouTube sent me a notification about your comment but I'm glad it did.
Huh. While I knew the conventions of the above, I did not know the reasons why. This has been educational.
Excellent video. It certainly brought back memories. My first job as a programmer (1975) was working on code that allowed IBM mainframes to communicate with ASCII terminals. This involved translating ASCII to EBCDIC and of course worrying about how all the control characters worked, like CR, LF, TAB, NULL, etc. On the old Teletype 33 terminals you even had to worry about how long it would take for the carriage to return to the left margin after printing a long line, and insert enough NULLs to allow it time to happen before the next printable character arrived. We referred to them as dumb-ASCII terminals. One thing that made things more tricky was that the guy who wrote the specs for the communications controller on the IBM mainframe got the bit order reversed, so the low order bit from the IBM system was sent as the high order bit on the wire. Another difference was sort order. In ASCII, digits sort first, followed by upper case letters, and then lower case. In EBCDIC, lower case letters sort first, followed by upper case, and then digits.
One of the frustrations that I remember from the early 1980s was the occasional mangling of data when going between the EBCDIC and ASCII worlds. Alphabetics and digits were OK, as was most of the punctuation. Mangled were things such as horizontal tabs, circumflex, backslash, curly braces and square brackets (apparently some versions of EBCDIC had these and some did not, and those that did sometimes had them in different locations). E-mail and general text would generally pass through OK (or if it was "mangled" in translation, it was still understandable). What was not so great was when you tried to transfer some source code in languages like C or Pascal.
Learned quickly NOT to use TAB characters for indentation, due to the inconsistent translation -- sometimes a tab translated into a single space, other times it got "expanded" to a sequence of spaces, but inconsistently: if you were lucky it expanded to the right number of spaces to preserve the indentation, but more often than not it didn't. Sticking to spaces helped preserve the indentation of the code, allowing easier recovery when the curly braces got lost (you had a better chance of guessing the location of those missing curly braces correctly).
The loss of curly braces, square brackets and backslashes would render C source code unusable -- but a "somewhat obscure" feature of trigraphs became quite useful in this case. Downside is they make your code *really ugly*.
For Pascal code, I found some of the alternates used in Pascal/VS on the IBM useful -- such as the "(." and ".)" aliases for the square brackets, and the "->" alias for the caret.
My first encounter with double-byte character set was on the Control Data mainframe -- where a double-byte system was used to get beyond the limitation of 6-bit bytes.
It was also on the Control Data systems that I finally understood why Pascal had an eoln() function (rather than looking at the character value and checking for carriage return or linefeed) -- end-of-line was a very specific pattern (iirc it was something like a word-aligned sequence of contiguous zero bytes, where there were ten 6-bit bytes in a 60-bit word).
Looking forward to next week's video!
Reading ascii codes in decimal hurts my poor lil brain though, I was taught early on in hexadecimal, and it always made more sense to me that way :)
I learned BASIC at school using an ASR-33 TeleType dialling in to an HP 2000F, saving my programs to paper tape.
Sometimes, classmates would want to know which program was on their paper tape that they forgot to write the name on.
This was easy enough if the terminal wasn't being used, but I could read the holes and tell them :)
If you have ever punched a Hollerith card, EBCDIC makes a certain amount of sense.
I've always liked how caret notation makes clever use of the ascii scheme. If you ever hit backspace in a terminal and see ^H^H^H or cat -A a text file written in windows notepad and see a bunch of ^Ms (or see the programmers use them in comments here), it's because the display has taken the non-printing character, flipped one bit, and is presenting it as its corresponding alphabetic block character. So NUL (00000000) becomes ^@ (01000000), TAB (00001001) becomes ^I (01001001), etc.
It also works in reverse to enter these characters, as the Control-C bit in the video explained. Very clever.
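A minimal Python sketch of that caret-notation trick (XOR with 0x40 flips the one bit in question):

def caret(code):
    # Caret notation: flip bit 6 of the control character's code (XOR 0x40)
    # and print the result after a '^'.
    return "^" + chr(code ^ 0x40)

print(caret(0x00))   # ^@  NUL
print(caret(0x09))   # ^I  TAB
print(caret(0x0D))   # ^M  CR
print(caret(0x7F))   # ^?  DEL (0x7F ^ 0x40 = 0x3F, '?')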
Excellent. Thanks.
One of my favorite subjects! Looking forward to the follow up parts!
You skipped over 16-31 very fast. I think the Escape character at least deserves a mention!
You mention Morse code, but there were several other digital codes that predate even computers. Baudot was developed in France in the 1870s for telegraph machines as a 5-bit digital code. The early consoles used a piano-like keyboard, and required operators to press keys together to make chords, so the code was designed to be easier for operators, with more common letters in single-bit positions, and even the numbers weren't contiguous. This was later adapted into Murray code in the early 20th century, with the development of teletype terminals and teleprinters that let operators use a QWERTY-style keyboard. As they were mechanical, the code was designed to minimise wear on the machinery. Finally, fully electronic machines started appearing in the 1930s, leading to the development of ITA2 (which at least put the numbers back in a contiguous block).
Having been developed for one purpose and evolved and tweaked for others, the code was quite messy, so we can probably be grateful that the designers of ASCII decided to go with a clean-sheet design. There probably is a universe in which they decided to take Baudot/ITA2 and extend it into a 7-bit code. ASCII effectively has four 5-bit "pages". I could imagine taking the "letter" and "figure" modes of ITA2 as two of those pages, then adding lower-case and control codes as the other two. Then, your video would be explaining why the ASCII code letters weren't in alphabetical order.
That was awesome ! Quite an excellent wrap-up of lots of things I had been learning in the past 50 years or so. Thanks a lot !!!
When I was porting software to an Amdahl machine back in 1993, it drove me crazy when I tried to test it (BTW: compiling the pure C code went through without a glitch). I made lots of attempts at entering the license key. After launching the debugger it turned out that a character was missing. Finally the system admin asked which characters were in the license key, and it turned out that was the culprit: the '#' (a.k.a. hash or pound) was being used as a DEL/delete character to 'X' out unwanted input. Typewriter-style software at its best ...
Subscribed! No dumb-ass stock footage, no tangent shots, just an entertaining and informative chap talking about cool stuff. Looking forward to "Why UTF-8 is Actually Very Clever" -- unless you've done it and I just haven't seen it.
Thank you.
@@dj196301 thank you! UTF-8 is coming in a few weeks. Got some other stuff to talk about first :)
Very nice video. I've worked with computers since the 80s, and never thought much about ASCII. Now I know how a Python progress bar is built, and other clever ideas. Well done Dylan!
They even did take care to support foreign western languages to some degree. ASCII includes the grave accent `, circumflex accent ^ and tilde ~ and you could backspace and print it over a letter (on a real teletype, not on a video screen). The single-quote/apostrophe character 0x27 ' did triple duty as an acute accent, and in some old fonts it looks like a mirror image of the grave accent. The double quote character " could be used as an umlaut/diaeresis in a pinch. The double-quote and single-quote characters were also common on typewriters, and these did not have separate opening and closing quotes. The underscore character was meant to be overprinted on other text as well, just doing a CR without LF (see the byte-level sketch below).
You can do it on a video screen too. It's called "Compose" and you just press the Compose key (whichever key you've assigned for that purpose) and then for example 'a' and '^'.
@@greggoog7559 That has nothing to do with ASCII as such. Compose combinations are substituted with codepoints for accented letters (formerly in your favorite 8-bit code page, today in Unicode). I was talking about old printers that only had 7-bit ASCII and could print a letter, then backspace then the accent.
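For the curious, here's a minimal sketch of roughly what those overprinting byte sequences looked like on the wire (assumed examples; on a modern screen the overtyped character simply replaces the first one instead of combining):

import sys

# 0x08 is BS (backspace). On a paper terminal both impressions land on the
# same spot, so these sequences combine; a screen just shows the last glyph.
cafe = b"cafe\x08'"                        # e, backspace, apostrophe used as an acute accent
bold = b"b\x08bo\x08ol\x08ld\x08d"         # strike each letter twice to get "bold" in bold
underlined = b"_\x08w_\x08o_\x08r_\x08d"   # underscore, backspace, letter = underlined "word"

sys.stdout.buffer.write(cafe + b"\n")      # on a screen this just comes out as "caf'"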
EOT (Ctrl+D) is still used in Unix/Linux to end a terminal session. I also find it odd that 28-31 aren't used more, they are perfect for use in CSV(like) files to avoid needing to do escaping etc.
The utility of CSV is that you can edit it in pretty much any text editor in a pinch and it still remains (fairly) human readable. Once you introduce control codes that won't be visible at all in some editors and require special settings in others, you might as well develop a binary format that is more efficient. That said, if you can't influence the design of a data format and need an extra set of delimiters they are useful, but probably not best practice.
Control D doesn't end a terminal session. It flushes the keyboard buffer without adding anything to it. If you're at the start of a line, then you flush zero bytes. A read from a file of zero bytes indicates an end of file in Unix. So the terminal reads zero bytes, thinks its input is closed, and exits.
Write a program that sits in a loop reading the stdin and writing what it gets without any buffering. Then type "ABC" and hit ^D, and you'll see instead of exiting it just prints ABC.
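A minimal Python sketch of exactly that experiment, reading the raw file descriptor so our own buffering doesn't get in the way (run it in a terminal, type ABC then Ctrl-D, then try Ctrl-D on an empty line):

import os, sys

while True:
    data = os.read(0, 1024)          # fd 0 is stdin; returns as soon as the tty flushes
    if not data:                     # zero-byte read: the ^D was typed on an empty line
        print("zero-byte read -- that's what programs treat as end-of-file")
        break
    sys.stdout.write("got " + repr(data) + "\n")   # "ABC" followed by ^D arrives here as b'ABC'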
11:50 The Rest contains one really important character: The ESC, or Escape-Character. It is used with ANSI Escape Codes to generate all the wonderful color and other formatting in terminals even to this day. Maybe that is worth a video.
13:14 I'm pretty sure it was not the creators of ASCII who threw all the hyphens and quotes onto a couple of piles - it was the teletype makers, who from around 1900 to the 1960s had no 1, only an i without a dot, a separate dot that doubled as a single quote, and no separate characters for o and 0. That meant that ASCII adding back these additional characters would force mechanical changes to the devices that were supposed to use the new standard. Since computers need a distinction between a letter and a number, the 1/i and 0/O issues had to be solved, but start and end quotes have no functional meaning to a computer.
Not just teletype. That was pretty common on typewriters too.
Loved every second of it
Some of this I knew, but I didn't realise the deliberate design elements. Good job.
This was a really captivating video, well presented, interesting stuff.
Lol I didn't expect that little shower thought to turn into a whole video, good fun!
Great video and very well explained! The point about why certain commands are still in use today and their origins was very interesting. I learned something new-thanks for sharing
Fun Fact: Some of us remember the keystrokes Ctrl-S and Ctrl-Q. They are the ASCII codes to stop and resume display output: Ctrl-S sends Device Control 3 (13 hex) to tell the sending device to stop sending data, and Ctrl-Q sends Device Control 1 (11 hex) to tell it to resume.
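A quick way to see why those keys land on DC1 and DC3: the Ctrl key clears the top two bits of the letter's code. A minimal Python sketch:

# Ctrl-<letter> is the letter's code with bits 6 and 5 cleared (AND 0x1F).
print(hex(ord("Q") & 0x1F))   # 0x11 = DC1 / XON  (resume output)
print(hex(ord("S") & 0x1F))   # 0x13 = DC3 / XOFF (stop output)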
This seems to be an abridged version (or possibly the first episode) of Dylan's 'No such thing as plain text' talk, which is well worth a watch.
Thanks for the DEL story!
Thanks a bunch for this video. I've known most of these things already, but in my programming career knowing those fundamental bit layouts and tricks has been so valuable for writing efficient and understandable code.
Thanks. That was very informative and insightful. I like your delivery and the small jabs/jokes you put in. I am looking forward to your next video!
When I was first introduced to computers in 1977, I used an ASR-33 Teletype complete with paper tape punch/reader. The ASR-33 only had uppercase letters, so it was with a sense of wonder I discovered that some more advanced terminals could also do lowercase! And everyone wrote the obligatory program that scanned through codes 0 to 127 and printed them out to see what they would do. Sending a string of ^G characters to an ASR-33 produced a sound never equaled by later devices, especially since they never seemed to insert a gap between the beeps.
Between Morse code and ASCII there was also ITA2 (sometimes incorrectly called Baudot code), a five-bit code for mechanical teletypes. It used control codes (letters and figures shifts) to switch between letters and digits/punctuations. ASCII still has SO/SI control codes to make it possible to temporarily switch to a different character set. ITA2 has a Null character, CR and LF and even Bell and "Who are you" (similar to the ENQ control code in ASCII).
I've actually used 0x1F instead of commas when I needed to save something with the sheer simplicity of a CSV file while not having to figure out the logic of how to handle data with commas or quotes in them.
Works great. You know, since that's what it's for, haha
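A minimal Python sketch of that approach (the record contents are made up), using US (0x1F) between fields and RS (0x1E) between records so commas, quotes and newlines in the data need no escaping:

US, RS = "\x1f", "\x1e"      # Unit Separator and Record Separator

rows = [["Beattie, Dylan", 'says "hello"'], ["O'Brien", "line one\nline two"]]
blob = RS.join(US.join(fields) for fields in rows)        # serialise
parsed = [record.split(US) for record in blob.split(RS)]  # parse it back
assert parsed == rows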
It is mostly forgotten that we have SOH, STX, ETX, EOT, ENQ, ACK, SYN, ETB, FS, GS, RS, US, and particularly EM: end of medium. This was primarily designed for data transmission, like Baudot, and not for use on the computers themselves (like memory and files), as the very name states: "for Information Interchange".
It is interesting to analyze those systems by their purpose (a teleology, if you want): Morse made the most-used characters shorter (he went to a printing shop and looked at the type cases -- the most common letters had the biggest compartments; and yes, those cases are why we call letters uppercase and lowercase); Baudot was designed first of all to minimize the wear on the mechanical parts of the telegraph (not the modern Baudot); and in ASCII, well, we see hints of a protocol attached to a machine, in codes like those mentioned and DC1, DC2, DC3 and DC4.
I always wonder whether they were really used this way, or whether that part of the standard was simply ignored. Yeah, a teleprinter used many of them, but certainly not FS, GS, RS and US: they are for sending files, not just for use inside files. You do not need FS inside a file (except maybe something like a TAR), but you do need it on a data stream that carries several files, like a paper tape or a magnetic tape or something like that.
I've always loved your talk on ascii.
Love seeing more stuff from your brain!
keep it commin!
That was great. Excellent tour through the origins. Just incredible.
I got distracted at 3:50 and reimplemented morse code as a canonical Huffman code. By hand, in Excel, for fun. 😅
Each character is 3-9 bits long but it's a binary prefix code so no need for gaps in transmission.
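For anyone who wants to try the same exercise in code rather than a spreadsheet, here's a minimal Python sketch of plain (not canonical) Huffman coding over a few made-up letter weights; the point is just that the result is a prefix-free binary code, so no gaps are needed between letters:

import heapq

# Made-up relative frequencies for a handful of letters.
freq = {"E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "N": 6.7, "S": 6.3, "Q": 0.1}

# Each heap entry is (weight, tie-breaker, {letter: code-so-far}).
heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(freq.items())]
heapq.heapify(heap)
tie = len(heap)
while len(heap) > 1:
    w1, _, c1 = heapq.heappop(heap)                       # the two lightest subtrees...
    w2, _, c2 = heapq.heappop(heap)
    merged = {c: "0" + code for c, code in c1.items()}    # ...get prefixed with 0 and 1
    merged.update({c: "1" + code for c, code in c2.items()})
    heapq.heappush(heap, (w1 + w2, tie, merged))
    tie += 1

codes = heap[0][2]
print(codes)   # common letters get short codes, rare ones long; no code is a prefix of another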
Good lawd, this was awesome. Kinda hilarious how everyone else tried to ensure IBM was out there in the wind.
One more thing, following on from how upper and lower case are separated by a single bit: look at the number keys on a keyboard and the symbols on them. Starting from 1 you’ll notice the codes for the numbers and symbols are also separated by a single bit. It goes a bit wrong about half way along, but on old keyboards (pre IBM PC) this usually works for the whole set. Now look at the keys for the non alphabetic symbols in those two alphabetic ‘blocks’. You’ll find the symbol in the low case block is on the same key as the equivalent symbol in the upper case block.
Thus, the symbols and numbers on most keys differ only by a single bit. Why? Because taking a keyboard scan code and converting it to ASCII requires a bunch of code and a look-up table. Old computers were very slow and had very little memory. So old keyboards generated ASCII codes in hardware, to be returned to the processor. Arranging the keys so the symbols on them were one bit apart made the hardware much simpler (see the sketch below).
To be fair, the ASCII codes were probably derived from existing typewriter layouts. So it's actually the ASCII code ordering being chosen to match the keyboard layout rather than the layout being designed to match ASCII. But that just makes the ASCII design even smarter.
(And I suspect the same is true for teletypes and the symbol pairings on the hammers - which were probably inherited from typewriters anyway).
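To make that bit-pairing concrete, here's a minimal Python sketch; the single bit in question is 0x10:

# On bit-paired keyboards, shift on a digit key flipped one bit of the code.
for d in "123456789":
    print(d, hex(ord(d)), "->", chr(ord(d) ^ 0x10))
# '1' (0x31) -> '!', '2' -> '"', '3' -> '#', '4' -> '$', '5' -> '%',
# '6' -> '&', '7' -> "'", '8' -> '(', '9' -> ')'.
# '0' is where it "goes a bit wrong": 0x30 ^ 0x10 is 0x20, the space character.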
There was also a design for conversion between EBCDIC and ASCII that required only a handful of transistors. The two standards were developed together. (IBM 026 and 029 card code preceded EBCDIC.)
The ESC ASCII character also shows up in various APIs. With the old BIOS keyboard interrupts, for example, you can grab either "scan" codes or ASCII codes. Most people who wrote games went for scan codes, as did plenty of other software, but even though keys like the arrows don't return proper ASCII, the Escape key does generate the ESC character in the BIOS - just an example.
The Device Control characters are still very important for configuring barcode scanners. How do you change the settings of a barcode scanner, e.g. whether to insert a CR, an LF, or nothing after scanning a barcode? You send combinations of device control characters followed by alphanumerics. Exact combinations are device specific.
Also, we just last year migrated away from a 1980s unix program (still a very popular program) that uses a database of literal ascii strings, each field separated by the Record Separator character.
Correction to the last frame: ASCII has 128 characters, not 127 😏
I bet you could argue that DEL is not a character. :-) I saw that too, and then thought about it.
@@darrennew8211 more likely he does not "feel" NUL (\0) is a character in the earnest. But gut feeling or C ASCIIZ hangups) are irrelevant - the ASCII is defined as 128 7-bit characters
@@enterrr Granted that NUL on paper tape is arguably less of a character than DEL is. :-)
@@darrennew8211 that's like calling 0 less of a number than 1, hehe
@@enterrr Not really. I mean, unless you want to say the tape comes pre-filled with NUL characters, right? :-)
Ctrl-d is still used a little with Bash. If you want to quit a user session fast (and can't be bothered with "exit"), ctrl-d will end it.
Reminds me of the MCP in Tron with his "End of Line".
Ctrl-d can be used anywhere you want to end a file like `cat - >my_file.txt` - type a line, type another line, ctrl-d
And CTRL-L to "clear" the screen. (Maps to "Form Feed" … which shifted the paper to the start of the next - blank - page).
Ctrl D is used a lot on Linux in general. Anytime you use a pipe it takes one processes stdout and connects it to another's stdin and the convention to say that the stdout is empty is to send Ctrl D
@@pidgeonpidgeon No, ctrl-d for end of transmission is in the terminal (tty) layer. Between processes end of file is indicated by closing the connection, see shutdown or close system calls. The terminal in cooked mode also permits using ctrl-d to input an unterminated line without ending the file, similar to fflush, or actually transmitting EOT with ctrl-v ctrl-d. More details in e.g. stty(1); try "stty -a".
It's more than bash.
*nix uses ^D to mean EOF. Any program reading from STDIN getting an EOF would exit as it can no longer read any input; eg:
$ cat > hello
World
^D
$ cat hello
World
$
Thus, when you put an EOF (as the first character) to bash, it gets an EOF and exits, as do sh, csh, tsh, etc.
You missed one very important use of characters in the control block: character 27 (ESC) is used by terminal emulators as part of the "control sequence introducer" ("CSI") to do things such as changing foreground/background color, setting bold/italics/underline, etc. Although this is more prevalent in the UNIX world, even DOS (and the Windows command prompt) had a device driver (ANSI.SYS) supporting these ANSI escape codes.
Another neat thing about the way the digits are organized in ASCII is if you convert it to hex, you just look at the lower half and you'll get the number.
Also I like how the alphabet characters start with bit 0 as 1, because it makes more sense that A = 1 rather than A = 0.
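A minimal Python sketch of that lower-half trick for the digits:

# The low nibble of an ASCII digit is the digit's value, so masking with
# 0x0F converts the character code straight to a number.
for ch in "0123456789":
    assert ord(ch) & 0x0F == int(ch)
print(hex(ord("7")))   # 0x37 -- the "lower half" of the hex value is 7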
Ctrl+d is still commonly used on Linux. It's the way to logout of a shell, and also the way to get out of a REPL like Python.
Great vid, thanks. A follow up vid could be a similar explainer about how utf-8 uses multiple bytes and what happens when that is read using a single byte encoding.
Good video, nice refresher of a topic I haven't really thought about directly since university - except for bloody Windows crlf when working with cross platform code
I like this, but what about the earlier threads like Jacquard Looms? There's some fascinating stuff in the first APL books (I forget if it's in A Programming Language or Automatic Data Processing) about how to design encodings for punch cards with various numbers of holes.
My first programming was over a dialup teletype at 110 Baud or 10 characters per second. I was in high school in the '70s and dial up time share systems running BASIC cost $6.00 per hour, so connect time was precious. You wrote your program offline on paper, then entered it on the teletype, punching it on tape as you typed, and if you made a mistake, the DEL key was like digital White out. Of course, it did not speed up data transmission. Once you had it all typed onto paper tape, you dialed the number with a Touch-Tone keypad, logged in and then played the paper tape back to upload your program. Then you ran it, you could also renumber, and list it back and re-punch it for later. When I told my mom I needed money to learn BASIC programming, she asked what I did on the computer. I told her games. I love her: she didn't complain. I became an Electrical Engineer/Computer Science guy.
One of my friend's dad had a 300 baud terminal/printer, and we used to dialup GE's free modem line and just print out stuff in order to watch it work.
I remember doing port-a-punch cards in EBCDIC for my first computer programmes at grammar school! 10-6-8 everyone (or was it 11-6-8;-).
EOT (End of Transmission) is Ctrl+D and can still be used today. Ctrl+D in Linux (and other similar systems) will flush the current buffer. If this buffer is empty, it results in a zero-byte read, and a zero-byte read means end of file/end of input in most contexts. For example, using it at a shell prompt will cause the shell to exit with exit code zero. If that was a login shell, it causes a logout. I use it every day.
Also ESC is widely used to decorate Linux console output (colors etc).
Control character 4 (EOT), that is, Ctrl-D, still lives on in terminal emulators of Unix-derived system like Linux as the end-of-file character (although technically it's the flush-input-buffer character, but returning an empty input is interpreted as end of file on Unix-derived systems, therefore it effectively acts as end-of-file for terminal input and also is commonly referred to as such; the difference can be seen if you try to use it on a non-empty line).
Great talk, thanks. I always suspected DEL, because of how it sat in the ASCII table: it didn't look right, a control character all by itself as if it was an afterthought. So a program reading from a stream would just ignore DEL characters.
ASCII 27 still maps to the Escape key.
One great thing about these blocks described is that one can see that like using Ctrl-C for ASCII 3 (ETX), one can also use Ctrl-[ (ESC) instead of lifting hands off the home row for Escape. Great for increasing TUI speed and efficiency.
Yep, ESC is essential for CSI (and SGR in particular), so without it there would be no ANSI terminal colors!
More about the ASCII graveyard, please! For instance, RS, Record Separator, now used in the application/json-seq format to separate JSON objects, e.g. in a streaming event log that will never finish. Lots of goodies in the graveyard...
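A minimal Python sketch of that json-seq framing as described in RFC 7464 (the event payloads are made up): each JSON text is preceded by RS and followed by a newline, so a stream can be split without parsing any JSON.

import json

RS = "\x1e"                                           # ASCII 30, Record Separator
events = [{"id": 1, "msg": "started"}, {"id": 2, "msg": "done"}]

stream = "".join(RS + json.dumps(e) + "\n" for e in events)                   # producer side
decoded = [json.loads(chunk) for chunk in stream.split(RS) if chunk.strip()]  # consumer side
assert decoded == events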
As one who lived and died on TTY33/35 devices this was very interesting. Programmed in SLEUTH (Univac assembler) later BAL. Lived in ASCII land.
ASCII being 7 bit covered most of the generic characters including accented characters via overprinting.
If the inventors had wanted to include all possible characters across the world, they would have needed at least 2 bytes per character to be able to handle Chinese and Japanese ideographs.
Leaving the remaining 128 values of a byte unspecified allowed different countries to add country specific characters. In the IBM PC world these were implemented as “code pages”, and were a bit of a problem when talking between countries.
Unicode eventually resolved this communication problem, but it needs up to 4 bytes per character to encode the more than 140,000 characters, and there are visually identical Unicode characters that are logically different, which makes it easier for scammers to fake internet addresses.
And something as large as Unicode wasn’t practical in the early days of computing, when every single byte saved was significant.
EBCDIC had the advantage that numbers were readable on hex crash dump printouts, but numbers and letters shared the same character codes (C1 represented either A or positive signed 1, depending on what the data type was.)
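For anyone curious what that dual use looks like, here's a minimal sketch of decoding EBCDIC signed zoned decimal, assuming the usual convention that each digit sits in a byte's low nibble and the last byte's high nibble carries the sign (C or F positive, D negative), which is why 0xC1 is both 'A' and +1:

def unzone(data):
    # Digits come from the low nibbles; the sign from the final high nibble.
    digits = "".join(str(b & 0x0F) for b in data)
    sign = "-" if (data[-1] >> 4) == 0xD else "+"
    return sign + digits

print(unzone(b"\xF1\xF2\xC3"))   # +123  (0xC3 is also the EBCDIC letter 'C')
print(unzone(b"\xF4\xD5"))       # -45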
so happy i got this as a suggested vid
7:12 Speak for yourself! I still use Control-D, to close terminals and exit SSH sessions, quit python or node.js and the like.
Edit: Love the video btw, can't wait for the next one :)
On Windows the Python REPL only accepts ^Z, and the Enter key needs to be pressed for it to work.
Dammit Windows, this is why we can't have nice things!
@@Tweekism86 In this case you can blame CP/M, in particular where file length in bytes was not recorded.
useful and well presented.
New to the channel, this is wonderful
4:51 While eight-bit bytes were already common when work on ASCII began in the early 1960s, they did not become ubiquitous until the mid-to-late 1970s.
Thanks for bringing the GIF/JIF debate to EBCDIC ;D
The first c in EBCDIC is pronounced like the c in "Pacific Ocean" - what's the problem? 🤣
@@DylanBeattie ;p luckily I have yet to see someone argue that those 256-color images are pronounced "SHIF" :D
I don't normally comment on the clothing style of YouTube creators - but that t-shirt rocks. 🤣
Thanks; now I know how that blasted vertical tab got into that text field that then didn't serialize to XML CDATA, but in fact errored out completely
Ctrl-D in the terminal is great, it will exit most REPLs or shells
Amazing video, loved it.
Praised be the Algorithm ... it happens VERY rarely that I want to upvote a video and notice that I have already done so. Guess I'll have to subscribe ...
The extra bit also provided parity. CR and LF were separate because going to the next line on a teletype took two character times. Multics chose LF as the NL because CR could be considered as not doing anything. _ was originally a left arrow.
I'm very much looking forward to the next episode (kohuept and èÁÒÉ ðÏÔÅÒ).
I'd love to hear more about the characters 16 through 31 and their uses, and whether there are still any uses for them today.
I like how you've recycled some of the points from your talks into their own little videos, especially when the video topics are directly interactive with the community or fans.
Incredible video!
Excellent video!
You missed the cleverness behind codes 33-41. These punctuation marks come in the same order as they did on the number keys of older (American) keyboards; this means that, just as lower case letters were converted to upper case by resetting a single bit (toggled by the shift key), the same was actually true of pressing shift + a number key.
In old Apple ][ word processors you'd enter control characters to teach the word processor how to work with your new printer (instead of drivers).
Also they were used for modems. We had to type in weird characters to get the modem transmitting.
Apple ][ basic also used the VT52 arrow key ESC sequences to move the cursor around the screen ready for copying - you listed a line and then had to ESC-D-ESC-D etc to get to the start of what you wanted to copy, use the -> key to copy, and use lots of ESCs to skip over blank characters - the Apple ][ was aggressive in printing out blank spaces and word wrapping.
To fix the excessive spaces we set the right-hand edge of the window to one character less than it needed before it word-wrapped (indenting from the line number), so it would not put in the excessive end-of-line and start-of-line spaces. The next thing I did was to write a character output trap which ignored spaces outside quotes and showed control characters in reverse video (particularly the DOS ^D prefix character, but it also showed up any other CTRL codes) to make it easier when editing such lines.
awesome content. subscribbbed. also props for the def leppard shirt
Very interesting! Thank you
Love the DL inspired shirt.
Also, Ctrl-D is still used to mark the end of streams on Unix. So it is not just Ctrl-C and Ctrl-G that have survived to this day.
Great video!
I can't believe you just called ^[ and ^D unimportant
It's worth mentioning that CR LF is also still in use when it comes to HTTP. HTTP is a text / line based protocol and each line is terminated by a CR LF pair. So even in the Unix / Linux world you have to deal with that line ending. There's a similar issue when it comes to little vs big endian. While little endian kind of dominates the PC world nowadays, when it comes to network protocols most of them use big endian. That's why it's often called "network order". Big endian makes it easier to create hardware decoders, and since a huge chunk of the network world is hardware this actually makes sense. From the programming point of view I generally prefer little endian.
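A minimal Python sketch of the byte-order difference, using the standard struct module (the value is arbitrary):

import struct

value = 0x0A0B0C0D
print(struct.pack("<I", value).hex())   # 0d0c0b0a  little-endian, typical PC memory layout
print(struct.pack("!I", value).hex())   # 0a0b0c0d  big-endian, "network order" used in protocol headers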
Fun facts: The ASCII underscore character was originally a left-pointing arrow, which is why Smalltalk (from around 1976) uses "_" as the assignment operator, and why Pascal (designed to work with EBCDIC also) uses ":=" instead, to look like an arrow as close as you can get on punched cards.
EBCDIC has the same sort of bitwise feature for letters that the upper/lower trick in ASCII uses, except it's designed for punched cards. So with a card 12 rows high, the letters are in "contiguous" numbers if you ignore the proper holes on the card rather than ignoring the proper bits in the byte.
Actually, Pascal had several digraphs to be used when certain characters were not available. For example, Pascal comments were written in curly braces {like this}, but in case curly braces were not available, you could also use parentheses with asterisks (*like this*). Now the only character used by Pascal that was not available in ASCII was the left arrow, whose digraph replacement was :=, which is why that one became commonly known as the Pascal assignment operator.
@@__christopher__
Were '<' and '=' not available?
"<=" for assignment, eg:
RA -> VARLOC
means the contents of the A register are stored in the location pointed to by VARLOC (effectively a variable).
@@cigmorfil4101 that's already the less-equal operator.
@@__christopher__
Interesting how all the BASICs I've used overload the '=' operator to mean both "assign" and "compare equal", with the meaning based on context.
How about "<-"? (That looks more like an arrow than ":=".)
@@cigmorfil4101 that is already a less-than followed by a unary minus operator.
Also, := was already in use in mathematics for definitions, so it fits quite well.
Note also that a proper assignment statement in BASIC was
LET var = value
A lot of BASIC interpreters (in particular Microsoft's) allowed omitting the LET though.
8:18: Dylan taking a stance on tabs vs. spaces 😂
Thank you for this! I've been working with a lot of regex, and been wondering a lot about this. Would you have any resources for further reading?
Sadly lost and not mentioned here, the FS, GS, RS, and US characters (28-31). Meant to serve as distinct bytes that wouldn't be part of text data, and therefore could easily be used to delineate it.
But alas instead we just totally forgot they existed and therefore ended up with formats like CSV, which gave double meaning to commas, newlines, quotes, etc. With special escaping rules and incompatibilities between systems. And we've spent generations figuring out how to handle that properly and handle all the edge cases. Just because we didn't have and didn't bother to come up with a few symbols to represent those 4 characters.
Some of those other low code points were perfect for networking, sending a single byte to communicate something that now we need an entire packet to communicate the same thing.
The number of self-taught computer programmers who reinvent the wheel because they were never taught what already works always astounds me.
Interestingly, Pick uses characters 252-254 as markers in dynamic arrays (and file items) between the "elements" (there's a Python sketch of the idea after this comment):
FE - 254 - Attribute mark
FD - 253 - Value mark
FC - 252 - Sub Value mark
The whole dynamic array is a string with the elements separated by the marks. If an element is required that doesn't exist, Pick adds enough of the relevant marks to create it when setting the value of the "element" or returns a null.
This means you get to access things like:
Data = ''
Data<1> = 'attr 1'
Data<2,1,3> = 'at 2, v1, sv 3'
Data<2,3> = 'at 2, v 3'
Data<4,2> = 'at 4, v2'
CopyData = Data
Element2 = Data<2>
The strings Data and CopyData contain:
attr 1[am][sm][sm]at 2, v 1, sv 3[vm][vm]at 2, v 3[am][am][vm]at 4, v 2
And Element2 contains
[sm][sm]at 2, v 1, sv 3[vm][vm]at 2, v 3
Where [am] is char(254), [vm] is char(253) and [sm] is char(252)
Pick is a multi-value DBMS OS with all fields of variable length and type (though as the whole is stored as a string they're effectively all strings which are converted to the relevant type at time of use).
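For readers without a Pick system handy, here's a minimal Python sketch of the same idea; extract and nth here are hypothetical helpers, not Pick's own functions, and the data matches the example above:

AM, VM, SM = chr(254), chr(253), chr(252)   # attribute, value and sub-value marks

def extract(s, a, v=0, sv=0):
    # Emulate Pick-style angle-bracket extraction: 1-based, 0 means "don't descend".
    def nth(text, sep, n):
        parts = text.split(sep)
        return parts[n - 1] if n - 1 < len(parts) else ""
    out = nth(s, AM, a)
    if v:
        out = nth(out, VM, v)
        if sv:
            out = nth(out, SM, sv)
    return out

data = ("attr 1" + AM + SM + SM + "at 2, v 1, sv 3" + VM + VM + "at 2, v 3"
        + AM + AM + VM + "at 4, v 2")
print(repr(extract(data, 2)))        # the whole of attribute 2
print(repr(extract(data, 2, 1, 3)))  # 'at 2, v 1, sv 3'
print(repr(extract(data, 4, 2)))     # 'at 4, v 2'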
The use of CSV is to _avoid_ non-printing control characters (other than a line break) so that the data is easily edited as plain text by a plain text editor.
A plain text editor generally only understands line breaks; how control characters are displayed depends upon their programming: some may display as ^c, some may display a '?' regardless of the character, some may let the display driver decide what to do (hence the smiley faces, musical notes, etc., that the original IBM PCs displayed for control characters).
As there was no consensus on how to handle control codes, CSVs avoided them and stuck to plain text, using commas (hence the name: _Comma_ Separated Values), requiring some sort of escape for commas - enclose a field containing commas within quotes - and a mechanism to handle the quoting character within fields.
@@cigmorfil4101 I always found this argument bizarre. ASCII was invented well before any "plain text editor" was, so saying "we changed this because plain text editors couldn't handle ASCII" sounds like working around the problems in tools rather than just fixing the tools.
There was also an image format called NetPBM which was great, and one of the options was to represent all the bytes with decimal digits. Like, you could read it with BASIC even. Red would literally be "255 0 0" with nothing other than ASCII digits and spaces.
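A minimal sketch of what such a plain (P3) NetPBM file looks like, written from Python; the 2x2 image and file name are made up for illustration:

# P3 header: magic number, width and height, maximum sample value,
# then one R G B triple per pixel -- all plain ASCII digits and whitespace.
ppm = "P3\n2 2\n255\n255 0 0  0 255 0\n0 0 255  255 255 255\n"
with open("tiny.ppm", "w") as f:
    f.write(ppm)     # the first pixel is pure red: "255 0 0"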
@@cigmorfil4101 Wow. It has been *ages* since I heard anyone else who ever used Pick. :-) Blast from the past there.
I despise the EU political project, but I learned a LOT about computing from the Euro symbol, and having to work on its introduction. It must have been weird to show up as the (eg Saudi) delegate to one of the conferences in the 90s and have to talk to nerds about how other writing goes right to left, and how the character set is continuous
Ctrl-D is still "End-of-File" in UNIX tty land.
Technically not. It's "send the buffered input without sending the ctrl-D". If there's nothing buffered, the program gets a read length of zero, which Unix treats as end-of-file. But if you type something first and hit ctrl-D, it just sends what you typed.
The eighth bit was often used for parity checking.
As someone who learned to type on a Smith Corona, and later used a Model 33 teletype, I did not know there were different characters for left and right quote marks. There are just the single and double quote.
As someone who was taught to write by hand at school, we were taught to use 66 quotes at the start and 99 quotes at the end.
Having only " on a computer was a bit of a shock. These days LibreOffice Writer and Word automagically "correct" with smart quotes to 66 or 99 quotes (depending upon surrounding spaces).