str vs bytes in Python

  • Published Oct 1, 2024

Comments • 196

  • @lawrencedoliveiro9104 · 1 year ago · +191

    1:34 Fun fact: separating bytes from strings was the most important major breaking change between Python 2 and Python 3. Trying to keep strings as byte-encoded led to all kinds of unfortunate trouble in Python 2, which could not be fixed without sacrificing backward compatibility.
    And they thought, while they were breaking things anyway, they might as well fix a few other things in a cleaner, non-backward-compatible way while they were at it.
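
A minimal Python 3 sketch of the separation described above (values are illustrative):

```python
# In Python 3, str (text) and bytes (raw data) are distinct types.
text = "héllo"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: one particular encoding of that text

print(type(text), type(data))  # <class 'str'> <class 'bytes'>
print(data)                    # b'h\xc3\xa9llo'

# Mixing the two is a TypeError in Python 3, where Python 2 silently "worked":
try:
    text + data
except TypeError as e:
    print(e)
```

In Python 2, `"abc" + b"abc"` concatenated silently, which is exactly the class of bug the split removed.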

    • @kadeemaustin1259 · 1 year ago · +2

      That’s actually super interesting 😮 thanks for the info

  • @BenjaminWheeler0510 · 1 year ago · +145

    When I started learning Rust, this was something that actually came up quite a bit, since you can't iterate over a string object (you don't necessarily know its encoding at compile time). It was the first time I realized that the difference between ASCII, UTF-8, and others is actually really important!

    • @0LoneTech · 1 year ago · +18

      This seems mistaken. Any decodable encoding is iterable, and Rust proponents keep bragging about compile-time checks and zero-overhead abstractions. A quick lookup of Rust's std::string::String shows it is defined as UTF-8 encoded, and it inherits the iteration method chars() from str. So it appears you're talking about something else, perhaps an equivalent of Python's bytes type.

    • @mr.bulldops7692 · 1 year ago · +11

      ​@@0LoneTech Remember not everything is an object in Rust. Rust has two "strings" to be aware of. The first is a stack "str" which is a primitive type stored on (you guessed it) the stack as bytes. This is the default behavior when declaring a string in Rust. It might be "iterable" at this point, but I don't think the memory safety checks can hold true if you start manipulating values on the stack. Rather, Rust makes you create a slice of "str" on the heap as a "String" struct before doing manipulation.

    • @tomiesz · 1 year ago · +16

      I think the comment is just weirdly stated. String itself does not implement the Iterator trait (i.e. there is no "default" way to iterate a String); instead it makes you choose between characters or bytes via the chars() or bytes() methods, which return the appropriate iterator.
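
Python's analogue of that chars()/bytes() choice, as a sketch: iterate the str for characters, or encode first to iterate raw bytes.

```python
s = "héllo"

chars = list(s)                       # iterate characters (code points)
byte_vals = list(s.encode("utf-8"))   # iterate raw UTF-8 bytes

print(chars)      # ['h', 'é', 'l', 'l', 'o']
print(byte_vals)  # [104, 195, 169, 108, 108, 111]
print(len(s), len(s.encode("utf-8")))  # 5 6
```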

    • @0LoneTech · 1 year ago

      @@mr.bulldops7692 You're talking about mutability, a separate subject in both Rust and Python. This is distinct in turn from owned or borrowed, the prime difference between Rust's String and str (with lifetimes guarding stack frame ownership, iirc). &mut str does let you modify the data, irrespective of where it is stored, but one should take care not to violate UTF-8 encoding when doing so. The borrow checker should prevent breaking character boundaries while slices depend on them.
      I expect string literals to be in rodata, as 'static str, neither mutable nor in stack or heap.

    • @VivekYadav-ds8oz · 1 year ago · +2

      The way you phrased it makes it seem like Rust doesn't know the encoding of a String/str at compile time, which is bonkers. The type itself enforces the invariant that it holds UTF-8 data. It's not the encoding that's unknown at runtime, it's the grapheme clusters. Indexing is supposed to be O(1) for example, but you don't know what character str[5] will be in O(1), since maybe an emoji is in between, which takes more than one byte, so there's no direct mapping between byte index and character position. If you just want to assume a byte array, just call .as_bytes() on it.

  • @japedr · 1 year ago · +9

    Windows encodings are a real nightmare.
    There are the OEM/MS-DOS codepages used by the console, which make it almost impossible to consistently write non-English characters from a .bat script.
    Then there are the "ANSI" codepages which are used by the Win32 functions accepting strings as char pointers (e.g. MessageBoxA). It is usually Windows-1252 in western countries which is a slightly incompatible variant of ISO 8859-1 (also known as "Latin1").
    Then there are the "Unicode" strings/MBCS/wchar_t pointers which are actually UTF-16 (even MS documentation states wrongly that "Unicode is a 16-bit character encoding"), meaning that Emojis will probably work in some places and not in others (try calling MessageBoxW with an emoji...). Except not really because in some cases it is UCS-2 instead of UTF-16 (another slightly incompatible variant). BTW, at least until recently you needed to add the BOM character to make stuff like notepad to recognize a UTF-16 file.
    Note that NONE of those encodings are UTF-8.
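
The incompatibility is easy to reproduce in Python; byte 0x80 is chosen here purely for illustration:

```python
raw = b"\x80"

print(raw.decode("cp1252"))         # '€' in Windows-1252
print(repr(raw.decode("latin-1")))  # '\x80', an unprintable control character in ISO 8859-1
try:
    raw.decode("utf-8")             # a lone 0x80 is not valid UTF-8
except UnicodeDecodeError as e:
    print(e)
```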

    • @lawrencedoliveiro9104 · 1 year ago · +1

      Remember that MS-DOS was originally created for the IBM PC, so it had to incorporate the whole IBM “code page” concept.

  • @nollix · 1 year ago · +44

    Perhaps some of the confusion comes from this: Bytes are interpreted as bytes, but you type them as if they were string literals. So then, how does the string get transformed into bytes? Doesn't it have some sort of implicit encoding when you type it in your IDE?

    • @jacobgoldsmith7651 · 1 year ago · +1

      yes, ascii. a=1, b=2, etc

    • @cyrilsli · 1 year ago · +43

      @@jacobgoldsmith7651 that’s… not the ascii table?

    • @lawrencedoliveiro9104 · 1 year ago · +1

      Everything is Unicode these days. When you convert between Unicode strings and bytes, the default decoding/encoding is “utf-8”.

    • @Howtheheckarehandleswit · 1 year ago · +14

      @@jacobgoldsmith7651 In ASCII, a = 97, b = 98, etc, and A=65, B=66, etc.
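
A quick check in Python:

```python
print(ord("a"), ord("b"))  # 97 98
print(ord("A"), ord("B"))  # 65 66

# ASCII text encodes to exactly those byte values:
print("ab".encode("ascii") == bytes([97, 98]))  # True
```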

    • @Howtheheckarehandleswit · 1 year ago

      I don't know for sure, but I'd imagine that it uses the literal bytes that whatever text editor you used decided to encode the bytes literal as

  • @mrtnsnp · 1 year ago · +11

    Looking forward to "from __future__ import default_encoding".

    • @mCoding · 1 year ago · +6

      A very likely possibility. Maarten called it first! My guess is `from __future__ import utf8`

  • @cleverclover7 · 1 year ago · +23

    It's crazy how much you come across decoding/encoding issues in the wild. I sometimes work with large text datasets with mixed encodings, sometimes even in the same line! The worst is that if you try and decode with the wrong encoding it can raise a runtime error, so I ended up writing a short program with a bunch of try/excepts for the different possibilities (utf-8 first of course). I did the same thing when I worked in C and Tcl. Gotta be a better way...
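
A sketch of that try/except fallback chain (the candidate list and function name are made up for illustration):

```python
def decode_best_effort(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Try candidate encodings in order; latin-1 never fails, so it acts as a last resort."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the list, but kept for safety:
    return raw.decode("utf-8", errors="replace")

print(decode_best_effort("café".encode("utf-8")))   # café
print(decode_best_effort("café".encode("cp1252")))  # café (utf-8 fails, cp1252 succeeds)
```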

    • @mCoding · 1 year ago · +9

      *Mixed* encodings! That's nightmare fuel for developers if I ever heard it!

    • @Bobbias · 1 year ago · +3

      Chardet is a helpful library for trying to guess the most likely encoding. However, if you've got a single string with mixed encodings, then that might not be helpful.

    • @0LoneTech · 1 year ago · +2

      Nightmare indeed. One rare program I've seen handling it decently is mlterm with mixes like ISO 8859-1 and EUC-JP.

    • @AntonioZL · 1 year ago

      I have dealt with that just recently. Absolutely terrible.

    • @cleverclover7 · 1 year ago

      @@Bobbias thank you I'll check it out!

  • @sleeper789 · 1 year ago · +13

    5:24 "UTF-8 is by far the most common encoding across all programming languages." I don't think this is actually true. UTF-8 is the most common encoding for on-disk and web text, but programming language implementations will often work internally in a different encoding than the wire/disk encoding. In both Java and .NET, the String type is internally implemented using UTF-16, not UTF-8.

    • @lawrencedoliveiro9104 · 1 year ago · +3

      UTF-16 is an unfortunate hangover from the early days of Unicode. Nobody uses it voluntarily any more.

    • @benhetland576 · 1 year ago

      @@lawrencedoliveiro9104 Voluntary or not, utf-16 is now deeply embedded in every NT-derived Windows computer out there. The ubiquitous windows-1252 et al. is only what we see on "the surface" within some GUI apps and the command window. NT used to have "16-bit Unicode", but after Unicode expanded past the BMP they redefined it to be utf-16 instead. I wonder how many bugs are still hiding in there that don't actually handle the utf-16 surrogate pairs correctly and just assume every character is 16 bits...

    • @lawrencedoliveiro9104 · 1 year ago · +2

      @@benhetland576 And that’s why you wouldn’t choose to use it.

  • @playerguy2 · 1 year ago · +3

    Instructions unclear: Slapped the like button an even number of times.

  • @timogden9681 · 1 year ago · +19

    Wow, really informative. I wrote most of a project on Windows, then started using it in a Linux Google Cloud VM, but I realized some of my data in a csv file was invalid.
    In the interest of getting a proof of concept out quick, I just quickly wrote a script in the VM that opens the file as a pandas dataframe, removes the invalid rows, and stores it as a csv file again. Except when I went to open this new file before giving it to my ML algorithm, it kept telling me the file was corrupted. I couldn't understand it, I was at a total loss, and I ended up just writing another hacky solution in which if I encountered an error loading one of the rows during the training process, I would just default to loading the first row instead.
    This makes total sense that this could have been the problem. Thanks James!

    • @mCoding · 1 year ago · +11

      A true war story from the field! This is exactly what can happen with mixed encodings and I'm glad this helped figure out the problem!

    • @lawrencedoliveiro9104 · 1 year ago

      I’m sure Pandas has ways to hook into the loading/saving process. Python has options in decoding as to how to treat invalid byte sequences: for example, you could ignore them, or replace them with some marker character.
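
Those decode options, sketched in plain Python:

```python
raw = b"caf\xe9 ok"  # latin-1/cp1252 bytes, invalid as UTF-8

print(raw.decode("utf-8", errors="ignore"))   # 'caf ok'   (bad byte dropped)
print(raw.decode("utf-8", errors="replace"))  # 'caf\ufffd ok' (bad byte -> replacement marker)
```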

    • @0LoneTech · 1 year ago

      ​​​​@@lawrencedoliveiro9104 Yep, in this case pandas.read_csv has arguments encoding, encoding_errors and on_bad_lines.
      One guess at a cause of the corruption might be how Windows NT Notepad silently injects an invalid BOM into UTF-8 files.

  • @lawrencedoliveiro9104 · 1 year ago · +14

    5:25 Not just the most popular, but some languages, including Python, have embraced Unicode to the point that identifiers can contain any Unicode characters that are classed as “letters”. So for example while “in” is a reserved word, “ın” is not, and can be used as an identifier.
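
This is easy to verify:

```python
import keyword

print(keyword.iskeyword("in"))  # True: reserved word
print("ın".isidentifier())      # True: dotless i (U+0131) counts as a letter
print("in" == "ın")             # False: different code points

# It even works as a variable name:
ın = 42
print(ın)  # 42
```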

    • @BenjaminWheeler0510 · 1 year ago

      Does it warn you about doing this? This rings a bell... I think some language out there actually does warn you if you do silly stuff like this. Maybe it was Rust or C++? Not sure.

    • @lawrencedoliveiro9104 · 1 year ago · +1

      @@BenjaminWheeler0510 It’s a feature, not a bug.

    • @fltfathin · 1 year ago

      @@BenjaminWheeler0510 it only warns you if it's not a "letter" character (emoji, etc.); katakana/hiragana/etc. work as variable names without warning

    • @benhetland576 · 1 year ago

      It can be fun, then, to mix otherwise identical characters from the Latin, Greek and Cyrillic alphabets. You can write dеf instead of def, for example.

    • @TheAnonymmynona · 1 year ago · +2

      @@BenjaminWheeler0510 Some IDEs warn you about it; for example, VS Code has a warning about non-standard characters

  • @anon-fz2bo · 1 year ago · +2

    Yeah, newer languages such as Go ([]byte) and Zig ([]const u8) use a slice of bytes to represent strings,
    which makes sense from a C/C++ perspective, considering that strings in C/C++ are essentially just an array of characters,
    and characters are essentially uint8_t (bytes)

    • @mCoding · 1 year ago · +1

      Yeah this is an important choice that lower level languages make. "Strings" in those languages are more like the bytes object in Python, a contiguous container of bytes with stringlike functions. If you want true unicode support you have to use some external lib, which makes sense in performance driven languages because parsing utf8 at runtime is a huge performance penalty.

  • @aceae4210 · 1 year ago · +4

    so this solved a thing that I didn't think about
    so you know the *base64.b64decode* function (import base64)
    so when you decode a base 64 string the output is b'(decoded content)'
    which is as I just found out a byte formatted string
    before what I was doing was this (mind the naming schema)
    base64_decoded= b'some|text|here'
    str_base64_decoded= str(base64_decoded)
    and then str_base64_decoded[2:-1] (slicing, like slice(start, stop, step))
    so what that did was remove the *b'* and also the ending *'*
    to give *some|text|here*
    so yeah knowing byte formatted strings exist helps as instead I can just do this
    base64_decoded.decode()
    which will get me the same output
    *some|text|here*
    thanks for reading my weird experience, have a good day
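
A sketch of the two approaches side by side (the base64 payload is illustrative):

```python
import base64

raw = base64.b64decode("c29tZXx0ZXh0fGhlcmU=")  # -> b'some|text|here'

# Slicing str(raw) only strips the b'...' wrapper; any escape sequences stay as text:
print(str(raw)[2:-1])  # some|text|here (works here, but only for printable ASCII data)

# Decoding actually interprets the bytes as text:
print(raw.decode("utf-8"))  # some|text|here
```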

    • @AssemblyWizard · 1 year ago · +1

      Won't always give the same output, try decoding the base64 "4oKs" (that's a lowercase O not a zero), and then compare str with slicing vs decode

    • @aceae4210 · 1 year ago

      @@AssemblyWizard so doing a test I see what you mean
      the first row is with the byte to string (with str()) and the bottom one is .decode()
      "\xe2\x82\xac" byte to string then cut
      "€" using built in func
      which the main difference is .decode() being able to properly represent characters that byte strings can't
      thanks for letting me know
      the code I used is down below:
      import base64
      base_decode = base64.b64decode("4oKs")
      str_decoded_cut = str(base_decode)[2:-1]
      base_decode_builtin_func = base_decode.decode()
      print(f'"{str_decoded_cut}" byte to string then cut\n"{base_decode_builtin_func}" using built in func')

  • @gloweye · 1 year ago · +1

    Huh, didn't know about that system encoding. Very much agree with PEP 686.

  • @finnthirud · 1 year ago · +6

    Decoded the mystery in a few minutes, thank you! ☺

  • @WalterVos · 1 year ago · +4

    When you're completely unsure what the encoding of any file that you're processing is, the chardet package is really helpful.

    • @mCoding · 1 year ago · +1

      Great tip!

  • @MithicSpirit · 1 year ago · +17

    Discord gang

    • @mCoding · 1 year ago · +4

      Best gang!

    • @swizice · 1 year ago

      > Stack Overflow?

    • @w花b · 1 year ago · +1

      @@swizice yuuuup

  • @jedpittman6739 · 1 year ago · +2

    mcoding != encoding. Amazing. 😂

  • @silverKirilljedi · 1 year ago · +16

    Great video! But isn't it weird that Python's string encode() and open() use different default encodings? I asked ChatGPT and it says Python 3.10 has 17 built-in functions and class methods that take an encoding parameter, with 5 different default values: utf-8, None (system default), latin-1, ascii, and utf-16. This is bad, right?

    • @NathanHedglin · 1 year ago · +2

      Sounds like an absolute mess

    • @hemerythrin · 1 year ago · +13

      Why ask ChatGPT instead of reading the documentation?

    • @Plajerity · 1 year ago · +8

      ChatGPT is the best storyteller humankind has ever seen. Distinguishing the fake from the truth may not be possible. Don't put your faith in it if your question is less popular; it generalizes everything.

    • @mCoding · 1 year ago · +12

      Yes, it is very weird, and there were historical reasons for it that are somewhere between no longer very relevant and a mistake. That's why PEP 686 is finally switching the default to utf-8, but since this is a big change they have to wait until 3.15!

    • @lawrencedoliveiro9104 · 1 year ago

      Checking help(open), it says the default encoding is taken from your locale. But I always set my locale to something UTF-8-based anyway, so no biggie.
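
A sketch of checking the locale default and opting out of it (the temp-file path is illustrative):

```python
import locale
import os
import tempfile

# The locale supplies the default text-mode encoding
# (locale.getencoding() on 3.11+; getpreferredencoding(False) before that):
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly makes the file portable across systems:
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("smile: \N{GRINNING FACE}")
with open(path, encoding="utf-8") as f:
    assert f.read() == "smile: \N{GRINNING FACE}"
```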

  • @bartlomiejodachowski · 1 year ago · +2

    5:50 if a utf-8 character has 4 bytes, shouldn't there have been padding bytes after/before every byte from 65 to 68? 0,0,0,65, 0,0,0,66, 0,0,0,67, 0,0,0,68, 201,184,240,159 ...

    • @lawrencedoliveiro9104 · 1 year ago · +3

      No. UTF-8 is variable-length. In particular, all the values in the range 0 .. 127 fit in a single byte.

    • @bartlomiejodachowski · 1 year ago · +1

      @@lawrencedoliveiro9104 variable length explains it. I don't get how it can be variable length and still work, but I will google it. Thx

    • @mCoding · 1 year ago · +2

      Great question and indeed the solution is that utf-8 is a variable-length encoding. The way this works is by encoding the number of total bytes in the character within the first byte. If the first byte starts like:
      0xxxxxxx -> 1 total byte
      110xxxxx -> 2 total bytes
      1110xxxx -> 3 total bytes
      11110xxx -> 4 total bytes.
      In particular, since ascii values are 0-127, they all start with 0xxxxxxx, and hence all ascii values are encoded as a single byte in utf-8. Clever! Read more here: en.wikipedia.org/wiki/UTF-8
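
The lead-byte table above can be turned into a small checker (the helper name is made up for illustration):

```python
def utf8_length_from_lead_byte(b: int) -> int:
    """Number of bytes in a UTF-8 sequence, determined from its first byte."""
    if b < 0b10000000:
        return 1  # 0xxxxxxx
    if b >= 0b11110000:
        return 4  # 11110xxx
    if b >= 0b11100000:
        return 3  # 1110xxxx
    if b >= 0b11000000:
        return 2  # 110xxxxx
    raise ValueError("continuation byte, not a lead byte")

for ch in "Aé€😀":
    encoded = ch.encode("utf-8")
    # The length promised by the lead byte matches the actual encoded length:
    assert utf8_length_from_lead_byte(encoded[0]) == len(encoded)
    print(ch, len(encoded))  # A 1, é 2, € 3, 😀 4
```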

    • @quillaja · 5 months ago · +1

      @@bartlomiejodachowski Hopefully you found your answer, but if not, there was a very good Computerphile video featuring Tom Scott about UTF8

    • @bartlomiejodachowski · 5 months ago · +1

      @quillaja I didn't delete my comment in case someone had a similar question. I have already studied encodings, but thanks for your response.

  • @lethalantidote · 1 year ago · +4

    I absolutely love your videos. Regardless of my familiarity with a topic, every video seems to have some piece of information that I would not have discovered on my own. I never knew that files were encoded with the system encoding unless specified. It has never been an issue, but I know that one day it will be and without this knowledge, I would have really struggled to identify the issue. Future me really appreciates your hard work.

    • @mCoding · 1 year ago · +3

      Great to hear! And I was also surprised when I ran the example and found out that my default encoding was not utf-8. Then I remembered I record my videos on Windows!

    • @Bobbias · 1 year ago · +1

      @@mCoding yep, the "we do things differently because... (usually) bad reasons" OS. As a longtime Windows user, it's so damn frustrating sometimes.

  • @Mr.Beauregarde · 1 year ago · +2

    Thank God for UTF-8

  • @lawrencedoliveiro9104 · 1 year ago · +5

    4:19 Just a note that, in Python, the len() function is counting “code points”, not characters. So strings are really being interpreted as sequences of code points, not of actual Unicode characters (which can be encoded in multiple code points).

    • @mCoding · 1 year ago · +3

      This isn't particularly a video about the specifics of Unicode, but since you mention it, a "character" as defined by the Unicode standard is the same as a code point, and this is the same way that I use the term in this video. You may be confusing the term "character" with a "glyph", which is a shape that is rendered as a representation of one or more characters. Various relationships may exist between character and glyph: a single glyph may correspond to a single character or to a number of characters, or multiple glyphs may result from a single character. Python's len function counts both characters and code points because these are the same, but it does not count glyphs. I refer you to Section 2.2 "Characters, Not Glyphs" in the Unicode standard for further explanation.

    • @lawrencedoliveiro9104 · 1 year ago · +2

      @@mCoding No. Consider a character with diacritic marks, like for example “ä”. This has its own code point U+00E4, but it can also be represented as U+0061 (“a”) followed by the combining diacritic mark U+0308.
      The most common character-plus-diacritic combinations have their own assigned code points, but not every combination can be represented this way. Hence the need for multiple code points to represent a character.
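
The two representations of "ä" from the comment, in Python:

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
combining = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

print(precomposed, combining)            # render identically
print(len(precomposed), len(combining))  # 1 2  <- len() counts code points
print(precomposed == combining)          # False

# Unicode normalization maps between the two forms:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```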

  • @SeanCrites · 1 year ago · +1

    As I was searching the interwebs as to what a type 'byte' was and how to convert it to a string, my YT refreshed and there was this video at the top of my subscription, 4 minutes old. This timing was apropos.

  • @nitishvirtual4745 · 1 year ago · +1

    Yet another informative and well put video. Thanks!

  • @shukterhousejive · 1 year ago

    With all the stuff Python gladly broke in the 2to3 switch I'm shocked bytestrings stayed around, all they do is confuse people for minimal convenience. Shoulda swapped it out with a fixed-length bytearray implementation, that way nobody gets confused about the intended purpose.

  • @AngryArmadillo · 1 year ago · +1

    Hey James, I’d love to see a video showcasing how to use the Textual package. It’s really neat, and fits your style.

  • @stiliyangoranov5518 · 3 months ago

    How does Python store a string, e.g. test: str = "some-test-string", in memory? If it were of bytes type and encoded in UTF-8, it would be clear what bytes get written to memory, but I can't logically explain to myself how Python handles non-encoded strings. They must live somewhere in memory while the program is executing, but how are they stored? For instance, in the example where you store an emoji in a variable, how is it stored in memory before you encode and print it?

  • @quintencabo · 1 year ago · +1

    One thing that I feel is missing is that a bytes object is really just a sequence of ints between 0 and 255, nothing more. It makes sense, but it was an aha moment for me
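
A quick demonstration:

```python
data = b"AB\xff"

print(list(data))  # [65, 66, 255]
print(data[0])     # 65  <- indexing a bytes object yields an int
print(bytes([65, 66, 255]) == data)  # True

# Values outside 0..255 are rejected:
try:
    bytes([256])
except ValueError as e:
    print(e)
```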

  • @ConstantlyDamaged · 1 year ago

    Instructions not specific, slapped like button 256 times.

  • @kyleaustin7768 · 1 year ago · +2

    It's crazy how one day I am wondering about something and a week later you have a great video on it. Thanks for another great one!

    • @mCoding · 1 year ago

      Great to have you watching!

  • @creed404 · 1 year ago

    What I know is that utf-8 is also 8-bit based, so how does it know to interpret the 4 bytes as one emoji instead of four separate 8-bit characters? Shouldn't we use utf-32?

  • @Veptis · 9 months ago

    I have been using tree-sitter for a language model dataset. I use the start_byte and end_byte to cut out a function and replace it with a generation for the benchmark.
    I spent a few hours hunting down some offset issues... and it was due to the difference in len for str vs bytes, and also in indexing. So I do a lot of encode, slice, decode, and it's awful. I would love to simply use the byte index to slice a str.
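
A sketch of that encode/slice/decode dance (the offsets are hypothetical, standing in for tree-sitter's start_byte/end_byte):

```python
source = "def f():\n    return '😀'\n"
data = source.encode("utf-8")

# Byte offsets index the encoded form, so slice the bytes, then decode the slice:
start_byte, end_byte = 0, 8  # hypothetical offsets covering "def f():"
snippet = data[start_byte:end_byte].decode("utf-8")
print(snippet)  # def f():

# Slicing the str with the same numbers is only safe for pure-ASCII text;
# a byte slice that lands inside a multi-byte character fails to decode.
```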

  • @zachb1706 · 1 year ago

    The encoding type changing depending on the system's configuration is nightmare fuel.

  • @yash1152 · 1 year ago

    6:12 oh, that was the thing pylance/mypy was shouting at me about: "encoding not specified"

  • @GameSmilexD · 1 year ago

    but how do you convert to binary shellcode in python3? i have a chrooted python2 version for that

  • @MattSpaul · 1 year ago · +2

    Am I right in assuming the storage size is the same between string and bytes?

    • @mCoding · 1 year ago · +3

      For the actual size in memory it's actually up to the implementation, they can differ due to things like small string optimization, ascii-only optimization, cached properties, and a few other things. When you write a string to disk, it is always converted to bytes first (it is done automatically in text mode) so in that sense the storage size is the same. However, the "length" of a string is the number of characters, which can differ from the number of bytes because some characters can take multiple bytes (like the smiley).

    • @lawrencedoliveiro9104 · 1 year ago

      Unicode “code points” can officially have any value in [0 .. 0x10FFFF]. That means a single code point could fit in 21 bits.

    • @lawrencedoliveiro9104 · 1 year ago

      Let me amend that. The valid ranges for Unicode code points are [0 .. 0xD7FF] and [0xE000 .. 0x10FFFF]. The values in the gap are called the “High Surrogates” and “Low Surrogates”, and are reserved for representing UTF-16 encodings. Which nobody should be using any more.
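
Python enforces those limits:

```python
# The full Unicode range fits in a str:
assert ord(chr(0x10FFFF)) == 0x10FFFF

try:
    chr(0x110000)  # beyond the Unicode range
except ValueError as e:
    print(e)

# Lone surrogates may sit in a str, but cannot be encoded to UTF-8:
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("surrogates not allowed in UTF-8")
```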

  • @tbonethechamp · 1 year ago

    wow, you learn something new every day! had no idea that python has a built-in way to do intersections 1:00

  • @4647540 · 1 year ago

    very good explanation, Cleared my head :)

  • @TanUv90 · 1 year ago

    2:17 Most humble ad read ever lol

  • @bp56789 · 1 year ago

    A message from our sponsor: (quiet voice) me.

  • @denyspisotskiy75 · 1 year ago · +1

    interesting theme. waiting for your next video :)

  • @Kingofgnome · 1 year ago

    One question I still have: x = b"Hello World 😉" will then automatically convert my string into bytes using the system encoding as default?

    • @mCoding · 1 year ago

      The bytes literal syntax b"...." only allows ascii characters, so non-ascii characters like "😉" are a syntax error in the literal. If you want bytes outside the ascii range (0-127 allowed directly), you can write \x escape sequences like b"\xff", or feed an iterable of integers (0-255) like bytes([255, 255, 255]).
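
A sketch of those rules (note that \x escape sequences are accepted inside a bytes literal):

```python
# ASCII characters may appear directly in a bytes literal:
a = b"Hello"

# Non-ASCII byte values need \x escapes or an iterable of ints:
b1 = b"\xf0\x9f\x98\x89"             # the UTF-8 bytes of 😉, written as escapes
b2 = bytes([0xF0, 0x9F, 0x98, 0x89])
assert b1 == b2 == "😉".encode("utf-8")

# b"😉" itself would be a SyntaxError: bytes literals are ASCII-only.
```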

  • @Yotanido · 1 year ago · +3

    This makes me glad I only ever work on Linux systems. Utf-8 everywhere.
    I would have never even considered python using anything other than utf-8 when opening a file in text mode.
    Although, I also didn't know encode and decode could be used without an argument. I always specified utf-8 and will continue to do so.

  • @Unpug · 1 year ago

    Incredible explanation

  • @pdmkdz · 1 year ago

    I needed this explanation 3y ago :/

  • @francescoferazza9341 · 1 year ago · +1

    One of the best explanations ever.

  • @volbla · 1 year ago · +4

    Be warned! The standard library json.dump() function has an "ensure_ascii" parameter that for some reason defaults to True. If you want to save data that's not just standard latin text you have to set it to False.
    I guess we have to wait for 3.15 for that to change...

    • @0LoneTech · 1 year ago · +2

      ensure_ascii generates \u escape sequences. It makes the JSON ASCII compatible, it does not alter the data contained within strings.
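
Both behaviors in one sketch:

```python
import json

print(json.dumps("café"))                      # "caf\u00e9"  <- escaped, ASCII-safe
print(json.dumps("café", ensure_ascii=False))  # "café"       <- raw non-ASCII text

# Either way the string data round-trips unchanged:
assert json.loads(json.dumps("café")) == "café"
assert json.loads(json.dumps("café", ensure_ascii=False)) == "café"
```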

    • @volbla · 1 year ago

      @@0LoneTech And that makes it unreadable in a text editor. I guess it's cool that all the data is still there, but is there a reason to not keep the data _and_ have it readable? What doesn't support UTF-8 these days?

    • @0LoneTech · 1 year ago

      @@volbla 6:08 is one example, likely Windows-1252. The default ensure_ascii format will function through that, utf-8 won't.
      JSON does not support indicating encoding like e.g. HTML or HTTP. So if it was stored in a file, the default system encoding is the only suggestion.

  • @therabidpancake1 · several months ago

    Are those bytes like megabytes?

  • @tusharsnn · 1 year ago · +1

    Heads up:
    Unicode: it's like a dictionary of characters. Each character has a unique entry and a value which identifies it, aka a code point.
    An encoding (there are several) maps this code point to a sequence of bytes (variable sized), so as to save space.
    E.g. a utf-8 encoded character can use 1/2/3/4 bytes depending on its code point. Similarly, a utf-16 encoded character can use 2/4 bytes.
    Why is utf8 so popular, you might ask? The reason is backward compatibility with ASCII: ascii characters keep the same value when encoded to utf8, e.g. 'A' is 65 in both encodings, and all ascii chars use only 1 byte in utf8.
    Why utf16? Well, utf8 cannot represent all the unicode chars; there are some chars that have code points outside the range of utf8.

    • @benhetland576 · 1 year ago · +1

      News to me. Which Unicode code points do you claim cannot be encoded in utf8, then?

    • @tusharsnn · 1 year ago

      @@benhetland576
      Just checked, and it looks like utf8 does support all code points according to Wikipedia, but I'm not sure if that's correct. I saw this warning when I was working with a powershell script: it needed input encoded specifically in utf-16LE, since it mentioned that this supports 'all' code points. Again, not sure why it might say that.

    • @mCoding · 1 year ago · +8

      Maybe I should make a video not just on str vs bytes, but Unicode specifically. There are lots of interesting (and dark) corners in there!

    • @benhetland576 · 1 year ago · +2

      @@tusharsnn The encoding that utf-8 uses theoretically allows a max of 7 octets (or bytes if you like), in which case the first octet would start with 7 ones followed by a 0, i.e, 0xFE. The next 6 octets each encode 6 bits for a total of 36 bits, and only 21 bits are needed to cover all possible Unicode codepoints (17 "planes" of 65536 codepoints each). 4 octets encode (8 - 5) + 3 × 6 = 21 bits, so that is the longest octet sequence ever needed to encode a single Unicode codepoint. There are byte sequences that are not valid utf-8 for several reasons (even single octets like a 0xFF), but not vice versa.

    • @0LoneTech · 1 year ago · +4

      There are also characters with multiple code points, such as latin V and roman numeral V, and characters with distinct glyphs such as Han unified ones, and combined characters that may be separable like ä, and the classification and ordering of characters is language dependent. Text is messy. Unicode does not provide the tools needed to compare a Chinese and Japanese phrase in one text, by design. History and politics (not necessarily national) are involved.

  • @Dmittry · 1 year ago

    The best integration I've ever seen.

  • @mattholden5 · 1 year ago

    Thanks, James. Very concise, well-informed, and well-executed. I especially like the grounded references to Python 3.1x. My inbox needs a vaccine for "Python 4.*" titles. I might take a look at the meta on this vid to see if I can spot such effusion for my personal ytube feed.

  • @b4ttlemast0r · 1 year ago

    This seems to be a pretty good way to handle it. Meanwhile in C++ I'm still trying to figure out how to work with unicode characters at all..

  • @che_kavo · 1 year ago

    Thank you! I've always struggled to understand the difference between str and bytes and what encoding is. And now I finally understand! Thank you 😊

  • @leesweets4110
    @leesweets4110 ปีที่แล้ว

    So how does a sequence of four bytes get interpreted as a smiley face?

    • @mCoding
      @mCoding  ปีที่แล้ว

      When you ask Python to decode using utf-8 (whether you specify the encoding explicitly or whether Python just does that by default), you are asking Python to interpret the bytes according to the Unicode standard, and the Unicode standard specifies that that exact sequence of bytes means "smiling face" or some other similar description. It is then up to the author of the font you are using to create a glyph (the picture of an actual smiley) to draw whenever you try to display the smiley as a character. This allows for different sets of smileys, like the Apple smileys vs the Microsoft smileys; choosing between them is similar to choosing a different font.
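
A minimal sketch of that decoding step (the four bytes here are the UTF-8 encoding of U+1F600, whose official Unicode name is GRINNING FACE):

```python
import unicodedata

# The UTF-8 byte sequence F0 9F 98 80 decodes to the code point U+1F600.
data = b"\xf0\x9f\x98\x80"
smiley = data.decode("utf-8")
assert smiley == "\U0001f600"

# The Unicode standard names the character; the font supplies the picture.
print(unicodedata.name(smiley))  # GRINNING FACE
```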

  • @Lestibournes
    @Lestibournes ปีที่แล้ว

    Yesterday I wrote a self-extracting installer script. Today I see this.
    I found it easiest to write the installer file as a string and then write the files it contains as bytes encoded as utf-8, especially the binary files. Writing the whole installer to file as bytes caused me trouble with the Python interpreter.

  • @PetrSzturc
    @PetrSzturc ปีที่แล้ว

    Thanks for this.

  • @jasonhenson7948
    @jasonhenson7948 ปีที่แล้ว

    Excellent video, thank you.
    I've had a couple of issues where I've had to use the IO and locale libraries to "fix" encoding shenanigans, but I think if I revisited those lines I'd now have an actual understanding of what was happening, how the changes worked and, most importantly, how /to do it better/.

    • @mCoding
      @mCoding  ปีที่แล้ว +1

      Thank you for your kind words! I'm glad this was able to help you understand and you will know how to fix it when it crops up again.

  • @GabrielEdu
    @GabrielEdu ปีที่แล้ว

    Thank you so much, you helped me a lot!!!

    • @mCoding
      @mCoding  ปีที่แล้ว

      You're welcome!

  • @jullien191
    @jullien191 ปีที่แล้ว

    Wow, thank you. The best, haha.

  • @yash1152
    @yash1152 ปีที่แล้ว

    1:21 which IDE?

  • @DK-eo9vj
    @DK-eo9vj ปีที่แล้ว +1

    always great vids. thanks a lot!

    • @mCoding
      @mCoding  ปีที่แล้ว +1

      Glad you enjoyed!

  • @broccoloodle
    @broccoloodle ปีที่แล้ว

    Really comprehensive explanation

  • @petrskupa6292
    @petrskupa6292 ปีที่แล้ว

    Great! So thankful, it cleared up my confusion (and I still didn't have to go to Stack Overflow for it 😅😂)
    ... May I just ask one curious question at the end? What might be the reason for anyone's system not having UTF-8 as the default? (why not?)

    • @lawrencedoliveiro9104
      @lawrencedoliveiro9104 ปีที่แล้ว +1

      Legacy reasons.
      Before Unicode--indeed, for a long while after--there were these things called “national character sets”. In fact, there is likely still a large collection of text stored in these legacy encodings.
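
A small illustration of the legacy-encoding problem: the same byte decodes to different (or no) text depending on which code page you assume. Here latin-1 and cp1252 stand in for the "national character sets" mentioned above:

```python
# The single byte 0xE9 is "é" in several Western legacy encodings...
data = b"\xe9"
assert data.decode("latin-1") == "é"   # ISO 8859-1
assert data.decode("cp1252") == "é"    # Windows Western European

# ...but on its own it is not a valid UTF-8 sequence:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("not utf-8:", e)
```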

  • @pkgo1122
    @pkgo1122 หลายเดือนก่อน

    Nice!

  • @Zifox20
    @Zifox20 ปีที่แล้ว

    Always there to teach me life ahah, thanks!

    • @mCoding
      @mCoding  ปีที่แล้ว +1

      Any time!

  • @jeffkevin3
    @jeffkevin3 ปีที่แล้ว

    What a coincidence! I just tried to survey the difference between them and found this video that just came out! 😀
    So... why isn't your computer in UTF-8? 🤣

    • @yomajo
      @yomajo ปีที่แล้ว

      Ask Billy Jeans

  • @tiagomacedo7068
    @tiagomacedo7068 ปีที่แล้ว

    That was the best message from a sponsor I've ever seen.

  • @AssemblyWizard
    @AssemblyWizard ปีที่แล้ว

    UTF8 ≠ Unicode
    I was expecting you to explain the difference, since it's super related to bytes vs. str, but instead you said these terms commonly refer to the same thing 😓

    • @AssemblyWizard
      @AssemblyWizard ปีที่แล้ว +1

      FWIW here's the difference:
      Unicode usually refers to the conversion between characters and numbers (ord/chr in python), and UTF8 is the conversion between these numbers and bytes.
      (although Unicode is technically the name for the entire standard, including both UTF8 and the mapping between characters and numbers)

    • @mCoding
      @mCoding  ปีที่แล้ว +3

      I completely agree that UTF-8 != Unicode. However, it is widespread and commonplace for developers (and others) to colloquially use the word "Unicode" interchangeably with UTF-8 as most people don't bother with technical distinctions and it is usually clear from context which conversion is meant.
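
The distinction can be made concrete in Python: ord/chr give the character-to-number mapping, while .encode picks one particular serialization of that number to bytes (UTF-8 being just one option):

```python
# The code point mapping (what "Unicode" colloquially often refers to):
assert ord("é") == 0xE9
assert chr(0xE9) == "é"

# UTF-8 is one particular way to serialize that code point to bytes:
assert "é".encode("utf-8") == b"\xc3\xa9"

# Other encodings serialize the very same code point differently:
assert "é".encode("utf-16-le") == b"\xe9\x00"
assert "é".encode("latin-1") == b"\xe9"
```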

  • @albogdano
    @albogdano ปีที่แล้ว

    Very interesting, thank you

  • @eternlyytc7300
    @eternlyytc7300 ปีที่แล้ว

    New subscriber here. Just wanted to say that I love your videos. Very informative and fun to watch! Keep up the good work

    • @mCoding
      @mCoding  ปีที่แล้ว +1

      Thanks for your kind words! Welcome to the channel and I hope you learn a lot!

    • @eternlyytc7300
      @eternlyytc7300 ปีที่แล้ว

      @@mCoding thanks man! See you around

  • @qexat
    @qexat ปีที่แล้ว +1

    strife horde 🤙🤙🤙

  • @mjdevlog
    @mjdevlog ปีที่แล้ว

    so insightful for me as a beginner!

  • @lionkg81
    @lionkg81 ปีที่แล้ว

    Great video as always, thanks! But still not really clear when to use each of these types.

    • @volbla
      @volbla ปีที่แล้ว +1

      Bytes are mostly just useful if you mean to interpret them as something very specific that's not text. For example encryption keys or raw image data. If you're writing or handling text you should go with strings.

  • @PouriyaJamshidi
    @PouriyaJamshidi ปีที่แล้ว

    Fantastic explanation!

  • @trag1czny
    @trag1czny ปีที่แล้ว +4

    discord gang 🤙🤙🤙🤙

    • @mCoding
      @mCoding  ปีที่แล้ว +1

      Always appreciated!

  • @zd2600
    @zd2600 ปีที่แล้ว

    By default, we all know why we use strings in Python. But is there a practical use case for using bytes? That might help us differentiate the uses of string and bytes here.

    • @denisfrunza1040
      @denisfrunza1040 ปีที่แล้ว +1

      Many low-level libraries will make you use bytes. A good example: try to write a web server from scratch.

    • @0LoneTech
      @0LoneTech ปีที่แล้ว +2

      It's simply one level less of abstraction. Bytes hold arbitrary data and can be used in I/O, like storing or transmitting. String is for when data is text, while other formats could be handled with e.g. struct or ctypes. This video had a couple of examples decoding some bytes as little or big endian integers. It wouldn't make sense to pass a string to zlib.decompress() for instance.
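
For instance, zlib (mentioned above) is a bytes-only API: text has to be encoded before it can be compressed, and passing a str is rejected outright:

```python
import zlib

# zlib works on bytes, not str: compress/decompress round-trips raw data.
payload = "héllo wörld".encode("utf-8")   # text must be encoded first
packed = zlib.compress(payload)
assert zlib.decompress(packed) == payload
assert zlib.decompress(packed).decode("utf-8") == "héllo wörld"

# Passing a str raises a TypeError:
try:
    zlib.compress("héllo wörld")
except TypeError as e:
    print("str not accepted:", e)
```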

    • @lawrencedoliveiro9104
      @lawrencedoliveiro9104 ปีที่แล้ว +1

      For example, the struct module lets you convert between various Python numeric/string types and strings of bytes.
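
A short struct sketch, matching the little-/big-endian integer examples from the video ("<" means little-endian, ">" big-endian):

```python
import struct

# Pack a Python int into 4 bytes in either byte order:
assert struct.pack("<I", 1) == b"\x01\x00\x00\x00"
assert struct.pack(">I", 1) == b"\x00\x00\x00\x01"

# The same 2 bytes unpack to different ints depending on byte order:
assert struct.unpack("<H", b"\x00\x01") == (256,)
assert struct.unpack(">H", b"\x00\x01") == (1,)
```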

  • @eliseuantonio6652
    @eliseuantonio6652 ปีที่แล้ว

    Why isn't your machine's default utf-8 if it's so popular?

    • @NostraDavid2
      @NostraDavid2 ปีที่แล้ว +1

      Because UTF-16 is older than UTF-8, and Windows decided on the UTF-16 standard, back in the day.

    • @NostraDavid2
      @NostraDavid2 ปีที่แล้ว

      And Microsoft is VERY big on backwards compatibility, which means they won't replace UTF-16 with UTF-8, unless they find a way to stay compatible.

    • @lawrencedoliveiro9104
      @lawrencedoliveiro9104 ปีที่แล้ว +2

      UTF-16 is an unfortunate hangover from the early days of Unicode. Nobody uses it voluntarily any more.
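
For comparison, a sketch of how UTF-16 sizing differs from UTF-8 (the byte values shown are the standard encodings of these characters):

```python
# UTF-16 uses at least 2 bytes per character, so ASCII text doubles in size:
assert "a".encode("utf-16-le") == b"a\x00"
assert "a".encode("utf-8") == b"a"

# Code points above U+FFFF need a 4-byte surrogate pair in UTF-16:
assert "😀".encode("utf-16-le") == b"\x3d\xd8\x00\xde"
assert len("😀".encode("utf-16-le")) == len("😀".encode("utf-8")) == 4
```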

  • @avinoamkugler2720
    @avinoamkugler2720 ปีที่แล้ว

    Great video😊

  • @maxwellsmart3156
    @maxwellsmart3156 ปีที่แล้ว

    Is there a command to tell you the system default encoding?

    • @briannormant3622
      @briannormant3622 ปีที่แล้ว

      On Linux you would set the locale (e.g. the LANG variable) to language.UTF-8, but no clue if you can do that on Windows

    • @nirvana8145
      @nirvana8145 ปีที่แล้ว

      python3 -c "import sys; print(sys.getdefaultencoding())"
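
Note that two different "defaults" are involved here. On Python 3, sys.getdefaultencoding() is always utf-8; the one that actually varies between systems, and that open() uses when no encoding is given, is the locale's preferred encoding:

```python
import locale
import sys

# Python 3's internal str default is always utf-8:
assert sys.getdefaultencoding() == "utf-8"

# open() without an explicit encoding follows the locale preference,
# which is what differs between systems:
print(locale.getpreferredencoding(False))  # e.g. "UTF-8" or "cp1252"
```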

  • @nathanoy_
    @nathanoy_ ปีที่แล้ว

    YAY new video!

  • @norude
    @norude ปีที่แล้ว

    Can you make a video on sets? How are they implemented? What is the time/space complexity of adding elements, creating one from a list, removing elements, and so on? Why is it in the language at all? Is using lists faster in specific situations?

  • @phantomzkarma7633
    @phantomzkarma7633 ปีที่แล้ว

    Looks interesting

  • @jerrylu532
    @jerrylu532 ปีที่แล้ว

    Small hint: Instead of writing `encoding="utf-8"`, you can just write `encoding="u8"`, which saves you up to 3 keystrokes! Check the Python doc and you can see that `u8` is just another name for `utf-8`.

    • @mCoding
      @mCoding  ปีที่แล้ว +1

      Nice tip! I think i still prefer writing out the long form just for readability, but I didn't know about this shortcut before!

    • @AntonioZL
      @AntonioZL ปีที่แล้ว

      Just don't use u8 instead of utf-8 and then go on about your day writing for i in range(len(x)) 😁
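
The alias can be verified via the codecs module, which resolves "u8" to the utf-8 codec:

```python
import codecs

# "u8" is a documented alias for the utf-8 codec:
assert codecs.lookup("u8").name == "utf-8"

# So both spellings produce the exact same bytes:
assert "é".encode("u8") == "é".encode("utf-8") == b"\xc3\xa9"
```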

  • @RazeVX
    @RazeVX ปีที่แล้ว

    Even I knew this already, since I literally had to Google it a hundred times because I found it funny to even ask the question ^^ Bytes are just a string of 1s and 0s, except that every character of a str takes at least 8 of those 1s and 0s, and thus at least 8 times the memory with utf-16 or utf-32. And let me tell you, if you get the idea of just using a str of '1's and '0's (like '10010011') to work with actual bits because you're more comfortable with them than bytearrays: don't, it's so painfully slow. I know because I tried it -.-

  • @angryman9333
    @angryman9333 ปีที่แล้ว

    Make JS videos now

  • @angryman9333
    @angryman9333 ปีที่แล้ว

    JS better

    • @NathanHedglin
      @NathanHedglin ปีที่แล้ว

      JS is trash 🤮

    • @briannormant3622
      @briannormant3622 ปีที่แล้ว

      ​@@anon8510c better

    • @lawrencedoliveiro9104
      @lawrencedoliveiro9104 ปีที่แล้ว

      Consider why JavaScript (and PHP) need a “===” operator, and Python doesn’t.
      (Sorry, C++, this isn’t your league.)

    • @eeriemyxi
      @eeriemyxi ปีที่แล้ว +1

      JS is an unfortunate existence that I wish to alter.