Elegant Compression in Text (The LZ 77 Method) - Computerphile

แชร์
ฝัง
  • เผยแพร่เมื่อ 21 ธ.ค. 2024

ความคิดเห็น • 311

  • @jakovmatosic4890
    @jakovmatosic4890 ปีที่แล้ว +6

    I love it so much when there's a computerphile episode for something I need to learn, makes it so much better than a random ppt tutorial.

  • @AlexanderKrivacsSchrder
    @AlexanderKrivacsSchrder 11 ปีที่แล้ว +41

    I remember in my computer science classes on compression, we were taught Lempel-Ziv-Welch (LZW). I thought that one was pretty cool, because instead of looking for specific things it has seen, it just constantly makes a dictionary of combinations of characters it has already seen and can refer back to those as it goes along. The best part is that it doesn't have to store the dictionary anywhere, because the decompressor can rebuild it while it's decompressing, based on the compressed data.

  • @TheDave000
    @TheDave000 9 ปีที่แล้ว +91

    Can I please have an evening in a pub with Professor Brailsford and just have him explain things to me? I wish he were my grandfather, or at the very least mentor of some sort. He's just such and all round excellent bloke. I think his computer science pupils are very lucky to have such a clear speaking and warm hearted teacher.
    Professor Brailsford is the best!

  • @grayswandir47
    @grayswandir47 9 ปีที่แล้ว +262

    I did some experimentation. If you open a zip or gz file with a hex editor there isn't anything human readable. You have to compress using an older format such as .lzo (Not sure anything in Windows will do this but it's a compress option in Fedora Linux.) Opening a .lzo file with a hex editor (I used ghex) shows much of the original text and it's easy to see where text is replaced by pointers as discussed in the video. You can see more frequent pointers as you move farther along in the document. The pointers in the .lzo file were two, three and four bytes long. I could not see any telltale that identifies a pointer versus text. There has to be some method for the decompression algorithm to distinguish between original text and a pointer. Thanks for posting the video. It stirred my curiousity.

  • @zandaya
    @zandaya 11 ปีที่แล้ว +361

    What's with all the hate here?? Can't you just appreciate his work? He's better than my university lecturers and I thank him for that.

  • @seanld444
    @seanld444 9 ปีที่แล้ว +5

    It's hilarious that people at school always say: "you shouldn't have to use your brain during summer!", when I am watching Computerphile, and enjoying learning about compression and stuff. I love this!

  • @Computerphile
    @Computerphile  11 ปีที่แล้ว +3

    Hopefully fixed now, thanks for spotting! >Sean

  • @ffggddss
    @ffggddss 8 ปีที่แล้ว +160

    Actually, wouldn't that stored string - "computer" - include the leading space?
    (" computer")
    Because that, too, occurs in both instances.
    Also, it occurs to me that you'll never want to tag & remember a repeated string of 0, 1, or even 2 characters, because the "compressed" version of any string in this scheme, is 2 characters (bytes) long, so you're not actually saving anything until you have a string of 3 or more characters.
    This allows "encrypting" the 4 bits that encode length, to mean the numbers from 3 - 18, rather than the usual 0 - 15.
    These, if they are implemented in L-Z, are refinements that would properly be omitted from this overview, but I thought they might be worth bringing up here in the comments.
    And BTW, thank you for a very clear and concise explanation of how this scheme works!

  • @fredhair
    @fredhair 7 ปีที่แล้ว +1

    I think it speaks volumes of prof Brailsford that he can simplify some of the most complex topics and explain them in a way that is a joy to listen to. However when pressed for more details and information he will gladly oblige you and probably tell you all there is to know on the subject! Brilliant man, could listen to him talking on computers all day.

  • @ruddyrdx
    @ruddyrdx 9 ปีที่แล้ว +20

    Oh, how beautifully he explained the concept! Totally impressed! Now I wish only if our professor in college had explained LZW this elaborately, but he did not.

  • @Computerphile
    @Computerphile  11 ปีที่แล้ว +6

    You can thank Sean for that!
    >Brady

  • @Doniazade
    @Doniazade 9 ปีที่แล้ว +88

    Wouldn't it be more efficient to have j be the distance to the END of the string you are referencing? That way you can reference strings (l-1) characters farther back with the same number of bits available for j.
    Basically, go j characters back and then take that character and the preceding (l-1) ones.

  • @isaac10231
    @isaac10231 11 ปีที่แล้ว +2

    I am absolutely loving this channel. Anywhere I look online for an explination for something computer or network related, like TCP, I can't understand any of it. But this channel gives me a foundation to work from, to understand stuff like this.

  • @grandadie
    @grandadie 9 ปีที่แล้ว +200

    Could a pointer point to a pointer instead of having to remember another instance of the word?

  • @etmax1
    @etmax1 9 ปีที่แล้ว +47

    People often say that learning times tables etc in this day and age is a waste of time, but I disagree. If you don't need to use a calculator as but it means that you both intrinsically understand the numbers and you can be much faster. I new an accountant that could add up 3 digit columns as fast as I can mentally do single digits. In his job that would have been worth gold. Knowing powers of 2 in computer science is similar

  • @MrCOPYPASTE
    @MrCOPYPASTE 11 ปีที่แล้ว

    First of all I would like to thank you and your peers for the time you invest in this videos.
    One thing I didn't grasp was if the implementation of the search of a given match is biased(it's prone to a given design) or it's a "closed analytic solution" and what was the creators approach, thank you for your time.

  • @neey3832
    @neey3832 หลายเดือนก่อน

    i love how visual and easy to understand this video is, even 10 years later!

  • @felineboy
    @felineboy 11 ปีที่แล้ว +4

    Actually, along with the jump/length tuple, the LZ77 algorithm outputs an optional literal character. So, for example, the very first output of the compressor would be a jump/length tuple with a length of 0, followed by the first byte of the decompressed data.
    Implementations differ on how they specify when there's a literal character or how the length/pair tuple is encoded.

  • @Bovineprogrammer
    @Bovineprogrammer 11 ปีที่แล้ว

    I absolutely love LZ compression algorithms. Beautiful to code, beautiful to see in motion, and pretty darn effective too. Not only that, but it's easily tweaked to work with whatever data you're using just by changing a few constants.
    Fantastic.

  • @bobbyaldol
    @bobbyaldol 9 ปีที่แล้ว +130

    Feels like I am listening to David Attenborough teach Computer Science.

  • @fredericweisbecker845
    @fredericweisbecker845 9 ปีที่แล้ว +24

    I want to go back to university with this man as my professor!

  • @BGBTech
    @BGBTech 11 ปีที่แล้ว +2

    typically by flagging them somehow.
    in many variants, this is done by symbol range, like a certain range of symbols will be treated as special;
    some others stuff in extra bits to distinguish between raw characters and runs;
    a few also make raw characters the special case, where you either have a dictionary reference, or 1 to N raw characters.
    often, LZ77 is combined with Huffman, as in Deflate (example: ZIP and GZIP), or arithmetic coding (7zip / LZMA, at least sort of...).

  • @phookadude
    @phookadude 3 ปีที่แล้ว +6

    One important thing missing there, you need some sort of identification that the 2 bytes are a reference and not just more characters. You could set the highest bit of the 2 character string to on but that kind of shows another problem: most text only uses the lower 7 bits of an 8 bit byte. So if you want to compress things it seems odd to leave that because every character is 12% bigger than it needs to be.

  • @Humineral
    @Humineral 11 ปีที่แล้ว

    This is such a quality channel.

  • @styleisaweapon
    @styleisaweapon 11 ปีที่แล้ว

    More compression either requires more time or more auxiliary memory. The most common compression method in use today is called DEFLATE (a combination of LZ77 and Huffman) popularized by the ZIP archive format. In the case of DEFLATE the "extra compression" flag is dealing with the huffman part of the algorithm and doesnt have a significant time cost but does have a significant memory cost because it uses a forest of huffman binary trees rather than a single huffman binary tree.

  • @ChrisWalshZX
    @ChrisWalshZX 11 ปีที่แล้ว

    I've not looked into LZ but I am familiar with a lot of compression techniques.
    You are of course correct that the has to be some way of differentiation. This can be designed in two different ways.
    METHOD 1: Define a header exactly as you say which marks the following (say 16 bits) as a pointer. You will however need to ''escape' any chance of the same header from appearing in a normal run of the text string. Methods of escape can be done with a different header.

  • @luketimothy
    @luketimothy 11 ปีที่แล้ว +2

    I hope you have more of these Information Theory episodes coming! This was my favourite area of study at Uni. (I actually do something related to it for a Job now, but I still find these fun to watch) :)

  • @MECKENICALROBOT
    @MECKENICALROBOT 11 ปีที่แล้ว +3

    This sounds literally like how we use our language and even our creative process. Constantly referencing previous instances of a given word, idea, or concept... Im really enjoying this channel

  • @iabervon
    @iabervon 11 ปีที่แล้ว +1

    There's actually a whole family of Lempel-Ziv-based techniques. The basic technique is to have a table of character sequences from the document that you can reuse when they come up again. LZ77 is a particular way of constructing the table, but there are others. Some use relative positioning, some absolute (with a "start anew from here" code, in case the file switches languages in the middle). There are also clever tricks with how you write lengths, and how you store the codes.

  • @fllthdcrb
    @fllthdcrb 11 ปีที่แล้ว

    Good point. It illustrates a difference between the way we humans perceive the text and how the computer perceives it. We tend to see whole words and things starting on word boundaries, but a compressor usually matches things a naive person doesn't expect.

  • @lucdestrube
    @lucdestrube 11 ปีที่แล้ว +2

    Love the addition of the forward slash at the end :)

  • @stephenhorton
    @stephenhorton 11 ปีที่แล้ว +1

    Thanks for the video! Very interesting! To make the pointer even better: You'd never point to something less than 3 bytes, so the length part of the pointer could be seen as the length of the string minus 3, so you can represent up to 10 in 3 bits. Also why point to the start of the string? You're always going backwards, so point to the end, the rest can be worked out, meaning you can stretch 8202 characters back (13 bits + max length).

  • @iabervon
    @iabervon 11 ปีที่แล้ว

    In general, LZ produces a bit stream with longer-than-8-bit codes, of which some are 9-bit codes for 8-bit source bytes and some are 17-bit codes for "this is a pointer" plus 16 bits of pointer value. Then it takes the bit stream and chunks it into bytes for putting in compressed files. This is why LZ-compressed files look like random noise if you try looking at them as characters, while UTF-8 files look only slightly wrong if you don't know the encoding.

  • @DigGil3
    @DigGil3 11 ปีที่แล้ว

    Pay attention to the text at 8:23 ! In the HTML coding, tags end with «

  • @MathAndComputers
    @MathAndComputers 11 ปีที่แล้ว +2

    You can take a bitmap image file and put it in a ZIP file to try it out, yourself. :) Some image files, like logos or diagrams, tend to compress quite well with this method, which is why PNG files use a similar compression method, but some image files, like photographs, tend to compress quite poorly with this method, which is why JPG files use a very different (and "lossy") compression method.

  • @Ferrus91
    @Ferrus91 11 ปีที่แล้ว

    I was taught LZW at uni, it was kind of interesting to see its predecessor explained

  • @steveohbandito
    @steveohbandito 11 ปีที่แล้ว +9

    Learning your powers of 2 and 16 times table will help immensely. Also, to all the highschool students going to university/college for computer science, PLEASE, if you do anything, start programming assignments ASAP. CS assignments are usually never something that can be done the night before and quite often take many hours to complete. I'm only a 2nd year CS student but it's something that I've learned is very important to do.

  • @Celrador
    @Celrador 11 ปีที่แล้ว +3

    At a certain length of the document, the amount of bytes needed to address the jump, would be too big.
    You want to compress afterall. So with relative positioning you can still have infinite long documents.
    If you take absolute positioning you will at some point have to start using jump-addresses that get bigger than the words you want to compress with it.

  • @ItsThe1300
    @ItsThe1300 11 ปีที่แล้ว

    I love the way Professor Brailsford says "compression"!

  • @Vazzible_gaming
    @Vazzible_gaming 4 หลายเดือนก่อน

    guys who are in highschool preparing for this kind of stuff are so lucky, I didn't know I wanted to get into tech. I was out of highschool after I had taken wood shop every year, (biggest mistake ever) now I want to write operating systems programs in C. I literally screamed when I realized I didn't get shit out of going to highschool, I learned how to glue wood togeather when I should've been learning pointers and java, I still feel ignorant and stupid for not taking any tech classes or preparing for college in any way. Last 2 years of college now, it feels like im climbing up a cliff thats sets at a negative slope, no one ever pushed me into learning technology buy myself, I basically have no support structure.

  • @Arkalius80
    @Arkalius80 10 ปีที่แล้ว +6

    The common Deflate algorithm used in the popular zip and gzip compression formats uses LZ77, followed by Huffman coding, which essentially encodes more common byte values as shorter bit strings, with a trade-off of uncommon byte values having longer than 8-bit bit strings, but ultimately leading to a net savings. This is why a zip file doesn't have any readable content in it. All of the data has been re-coded in a way that isn't readable.

  • @Salabar_
    @Salabar_ 11 ปีที่แล้ว

    We can add single indicator bit before each symbol. So use "0%UNICODE" to represent characters (they will actually 17 bit characters) or "1%POINTER". It will add insignificant overhead, but for the most of texts, compression will go pretty well.

  • @powerhouseofthecell9758
    @powerhouseofthecell9758 10 ปีที่แล้ว +16

    What Edawan said, could you make an episode about distinguishing pointers from the rest of the text?

  • @BGBTech
    @BGBTech 11 ปีที่แล้ว

    short answer: better? it depends.
    also short answer: LZMA essentially is LZ77, just with a fairly large window and arithmetic/range coding.
    basically, many variants of LZ77 exist, differing in specifics like encoding, maximum length and pointer distance, and whether or not entropy coding is used. these can be used to balance speed, memory use, compression ratio, and other properties.
    usually, one tries to strike a good balance for the specific use case.

  • @fllthdcrb
    @fllthdcrb 11 ปีที่แล้ว

    Incidentally, a little off-topic, but there's another trick here. If we allow the copied string to overlap the current position, we can compress short repeated patterns fairly well. That is to say, the decompressor, when finding a backreference, can simply look back in its output buffer, start reading bytes, and copy them onto the end. If it runs over what was the current position when it started, it will pick up the start of the string and copy it again, thus repeating it for a while. (cont...)

  • @Chris_Dare
    @Chris_Dare 9 ปีที่แล้ว +27

    Question: what's stopping you from referencing the previous pointer if the 'jump' gets too long? Couldn't you point to the previous pointer and grab whatever that is pointing to? Assuming you want maximum compression and don't care about how long it takes to uncompress.

  • @fllthdcrb
    @fllthdcrb 11 ปีที่แล้ว

    In the end, your needs make a difference. If you're desperate to squeeze as much information into a given space as possible and don't care if it takes a while/a lot of computing power, you should try all sorts of tricks. Otherwise, you should trade off compression for speed.
    LZMA (see 7-Zip) and its derivatives are an example of really complex techniques to gain compression, although it's also more normally usable. RAR too, I suppose, but that's proprietary, so few know just what it does.

  • @TimVerweij
    @TimVerweij 11 ปีที่แล้ว

    If each composite pointer is 2 bytes, you can assume that the word you are pointing to has a minimum length of 3 bytes, otherwise there's point in creating a pointer there. So in the 4 bits you're using to specify the length, it could mean a length of 3-18 bytes. (instead of 0-15 or 1-16)

  • @JohnSmith-sw1ng
    @JohnSmith-sw1ng 11 ปีที่แล้ว +11

    I like it how he doesn't want to waste the paper with the holes on the side which hasn't been used since the 80's....

  • @AlexSimpsonsPage
    @AlexSimpsonsPage 11 ปีที่แล้ว

    Not always the case, e.g. zip, gzip, etc. is lossless.
    There is usually a memory and computational overhead to the higher compressions, so the higher compressions for a particular algorithm will take longer to run.
    For the example in this video, the dictionary might be bigger and cover a longer range of characters and combinations. Looking everything up to see if it matches in a very very large dictionary could take longer.

  • @Morturious
    @Morturious 11 ปีที่แล้ว

    Good question. Next step -- google the difference between UTF-8 encoding and LZ compression. Generally speaking, this information is (as far as I am aware, anyway) in some way stated in the header of the file.

  • @styleisaweapon
    @styleisaweapon 11 ปีที่แล้ว

    Relative is better because there are strong correlations in locality. Consider an encyclopedia: very few articles will use the term "Space Shuttle" but the ones that do will often use it frequently. As the encoder/decoder progresses, Space Shuttle is more likely to be seen if Space Shuttle was recently seen but very unlikely to be seen if its been a long time since it was last seen. Exploiting correlations in locality is part of the broader topic of Adaptive Compression which is often superior.

  • @fllthdcrb
    @fllthdcrb 11 ปีที่แล้ว +1

    Well, one must make a distinction between the compressed stream and the uncompressed stream. In the scheme presented, everything is in terms of the uncompressed stream, where there are no pointers (I'll call them backreferences). It's possible, at least theoretically, to reference things in terms of the compressed stream, or at least some form of it. But this would require the compressor to recursively resolve who knows how many backreferences. It might be fairly slow. (cont...)

  • @ThePC007
    @ThePC007 2 ปีที่แล้ว

    @Vix
    Not necessarily. By referencing the left end of the string you're referring to, you basically get run-length encoding for free.
    If I were to encode the string “AAAAA”, I'd then be able to do it as follows: “A”
    The algorithm would then copy the newly placed characters and decode the string properly.

  • @VamPDX
    @VamPDX 11 ปีที่แล้ว

    it's also possible to include the blanks in the compression so short words like "a" with a blank in the front and in the back could be used to save space

  • @styleisaweapon
    @styleisaweapon 11 ปีที่แล้ว

    In classic LZ77 a literal always follows a [j,l] pair. Multiple literals in a row require an [j, l] pair with a length of 0. LZ78 solves the problem by pre-initializing the dictionary with every symbol in the alphabet and specifically protecting this dictionary block (so no literals needed, only [j, l] pairs.) The LZH variant is based off of LZ77 rather than LZ78, and instead uses standard huffman codes for each literal plus a unique code to indicate that an [j, l] pair is to follow.

  • @Tasarran
    @Tasarran 11 ปีที่แล้ว

    With lossy compression, the computer isn't trying to match the match string exactly, it just tries to get close.
    That percentage is usually the amount of similarity allowed, the looser a 'match' is allowed to be, the more compression you will have, and the more loss.
    Its usually acceptable in things like images, where the human eye can't really tell the difference between, for example, 255,255,255 (white) and 253,254,255 (nearly white)...

  • @BGBTech
    @BGBTech 11 ปีที่แล้ว

    Deflate: also used in GZIP, PNG, various network protocols, several video codecs, Minecraft region files, ... so, yeah, pretty common.
    internally, it uses 0-255 for literal bytes, and (IIRC) 256-285 for special things (EOB, also LZ77 runs). for runs, the Huffman-coded symbol (essentially) works as a prefix indicating how many bits follow, with the prefix+bits encoding a value (run length and distance).
    and, also nifty: it Huffman encodes the Huffman table.

  • @nex
    @nex 11 ปีที่แล้ว

    Pretty much, except the flag bit goes either before a character or a pointer+length, the latter of which are usually longer than a byte, so it won't end up being one bit per byte.
    In the meantime I got around to looking it up: LZSS does what I just described. The original LZ77 _always_ outputs the next literal character after a pointer+length (which is a null pointer in the case of no match found), so it's always clear what's what and no flags are needed.

  • @JerehmiaBoaz
    @JerehmiaBoaz 11 ปีที่แล้ว

    There are different ways of distinguishing a literal from a pointer during compression. The easiest way is to always put out a pointer even if it points to a zero length sequence, followed by the first byte that wasn't part of the match. While decompressing you read the pointer, extract the sequence if its length is greater than zero, then you read the literal byte after the pointer, copy to its destination after the extracted sequence, and you're good to read the next 3-byte pointer + literal.

  • @ivansanchez1988
    @ivansanchez1988 10 ปีที่แล้ว +1

    Thank your professor, great explanation... Good Advice, learn your powers of 2 and 16...

  • @eideticex
    @eideticex 11 ปีที่แล้ว

    Well out of the 256 possible characters in a 1 byte character, there are at most 127 that will actually appear in the document. For example one would likely not use the bell character in a text file, so that's 1 value we have that can serve as the header for a pointer.

  • @vkillion
    @vkillion 7 ปีที่แล้ว +1

    I wonder if you could use one of these pointers to refer to a previous pointer, in the case where the referred to entry is farther than the maximum jump possible (4096 in his example), but another reference using a pointer was within range. This would in essence be recursive pointers.

  • @deividux12
    @deividux12 11 ปีที่แล้ว

    because when you "write down" the start and the length you can even reference to the middle of the word for example the word press could be referenced in the word compression (comp press ion)

  • @taldavid906
    @taldavid906 7 ปีที่แล้ว

    Great video to explain the LZ77 method! Thanks!

  • @LeethLee1
    @LeethLee1 6 ปีที่แล้ว

    This was amazing!! Thank you again for sharing all this wonderful knowledge. This is just a good thing for the world, and individuals, specially as a launching off point to learn more about certain areas in detail.

  • @KelvinShadewing
    @KelvinShadewing 8 ปีที่แล้ว +8

    So, what if you were to use a sort of nested pointer system where a pointer refers to a previous one and so on until the first word that got compressed? Is that what would be done if the original word is more than 4096 spaces away? Like, say that first pointer to "computer" is decoded, then the next one is out of reach of the first instance but can reach the second one. Would it just take the word from the second one? Or read its location to jump further back to the first?
    Because I'm wondering, if you were to do a nested pointer like this, would the length of the pointer being shorter than the length of the original word affect the data of previous pointers? Or maybe it goes through, counts the words, and then replaces them with pointers later, so that when they're decompressed in order, the string lengths match up again?

  • @KubGov
    @KubGov 8 ปีที่แล้ว +6

    What is the meaning of the 77 in LZ77?

  • @IceMetalPunk
    @IceMetalPunk 11 ปีที่แล้ว +10

    This is interesting :) The only thing I wondered throughout this video is, when reading the compressed file, how do you know whether the byte you've read is the start of a 2-byte pointer or just a literal 1-byte character?

  • @Grarrgle
    @Grarrgle 11 ปีที่แล้ว +1

    This professor is so lovely. What a nice fella.

  • @typograf62
    @typograf62 9 ปีที่แล้ว +19

    The IBM System/370 Reference Summary from 1984 has an error in the Hex conversion table. It says 4069 rather than 4096.
    If one notices that then it is probably time for a vacation.

  • @johanhendriks
    @johanhendriks 9 ปีที่แล้ว

    5:50 ~ Dutch will create massively long words, because when describing one thing (a noun) it doesn't use spaces. For example "Waste water purification plant" would be "waterzuiveringsinstallatie". Imagine how a document on purification of water might repeat this word many times.

    • @allanfloyd8103
      @allanfloyd8103 9 ปีที่แล้ว

      johanhendriks And LZ would do that.

  • @Kyuubi840
    @Kyuubi840 11 ปีที่แล้ว

    HTML tags start with their name inside < » and end with the same name, but with an added slash:

  • @stupossibleify
    @stupossibleify 9 ปีที่แล้ว +18

    I'm not a computer scientist, but I still see a problem here which isn't dealt with in the video: you would surely have to discriminate the pointer against the other information in the file. In this case, is the binary encoding ASCII or a LZ jump vector, for example? It's an obvious question which needs answering. Great video nonetheless.

  • @griv2000
    @griv2000 11 ปีที่แล้ว

    The implementation is left as an exercise to the reader, but most implementations use dictonaries and hashs to find the matches, it does not always give the best compression (because of hash collisions), but it is a lot fasyer than brute-force

  • @SnowRaptor
    @SnowRaptor 11 ปีที่แล้ว

    Essentially compressing and decompressing speed. Lower compression deompresses faster. Depending on the application, you prefer a file that is 20% larger but that can be decoded 20% quicker

  • @mandolinic
    @mandolinic 9 ปีที่แล้ว +19

    Replacing some characters by a two byte combined pointer is only half of the design challenge - the decoder still needs to know whether it's processing text or processing a pointer. This means that you've got to put a special byte marker before pointer, effectively changing a two byte pointer to three bytes and reducing the compression ratio, or you've got to sacrifice some of the bits in the pointer to make a pattern that isn't going to occur normally in the text which has further implications on the compression ratio.

  • @sayur54321
    @sayur54321 11 ปีที่แล้ว

    The closing tag made me smile :)

  • @Mackinstyle
    @Mackinstyle 11 ปีที่แล้ว

    I love your passion for mathematics.

  • @jonoandrew6821
    @jonoandrew6821 11 ปีที่แล้ว

    exactly what i was wondering, the only 2 solutions i can see is having a unique byte/s to indicate that it is a pointer, which is fine for a text file where you know that byte wont occur in text, but not so for pretty much any other file, so i would assume they store a list of the location of all pointers somewhere in the file (the start probably, otherwise we have the same problem of knowing when the list starts). I would google it, but im too tired right now, but that seems pretty logical

  • @gigantomaquia
    @gigantomaquia 9 ปีที่แล้ว +1

    im glad that my supositions about compression was right! :D thanks for your great videos computerphile! :D

  • @AsbjornGrandt
    @AsbjornGrandt 10 ปีที่แล้ว +111

    He forgot the space character preceding "computer"
    But this is a brief, and simplified explanation on how compression can work.
    The devil is in the details of course.

  • @666Tomato666
    @666Tomato666 11 ปีที่แล้ว

    a proper ending xml tag! Props Brady!

  • @luketimothy
    @luketimothy 11 ปีที่แล้ว

    Well, when you decompress the data, you will read it as bytes... so it's not UTF-8 at all.
    Also, I believe this assumes the text is ASCII encoding... in which case you have an extra bit, as ASCII only makes use of 7 bits. This extra bit is usually a parity bit, but in this case, you can say that it denotes a character or pointer data.
    Set this extra bit to a 0 for a character, and 1 for pointer data. Of course, this means you would lose possible dictionary/string size.

  • @OisínMcColgan
    @OisínMcColgan 11 ปีที่แล้ว

    Very good video. Any chance we might see one of Huffman coding?

  • @henke37
    @henke37 10 ปีที่แล้ว +1

    I am disappointed that the trick to do RLE wasn't brought up, it is a quite important trick in LZ77.

  • @JanCRefsgaard
    @JanCRefsgaard 11 ปีที่แล้ว

    I think it depends on the type of data, if your data is written by a human, words would probably cluster into sections such that relative pointers could capture most words with few bits
    if your data is the human genome, then absolute pointers might be best because the patterns will be scatted all over your document

  • @AbdulKhaleed
    @AbdulKhaleed 11 ปีที่แล้ว

    More of Tom Scott please!

  • @reristavi
    @reristavi 6 ปีที่แล้ว

    Very nice one. You really need a perfect mind to think about compression like this.

  • @BlueRavenGT
    @BlueRavenGT 11 ปีที่แล้ว

    Even then you can't tell just by looking that two bytes is a pointer instead of two characters. The simplest solution is to just place a character or sequence of characters before the pointer that doesn't otherwise occur in the resulting bitstream. Look up "escape character" to learn more.

  • @maartendj2724
    @maartendj2724 8 ปีที่แล้ว +1

    Is it fair to say, that for every combination of at least 3 characters that appears in the text for at least 2 times, a pointer should be created? Because the pointer 'costs' 2 (uneven) bytes of storage, and obviously a 3 character combination costs 3 bytes, so this should be the minimal length for which creating a pointer translates to less memory in use. So every 3 letter word like 'the' or 'and' should be pointed to as soon as it reappears, but in that case, even ' or' (note I put a space in front) should be, because a space is a character with a byte code as well.

  • @xZise
    @xZise 11 ปีที่แล้ว

    Although this will reduce the width of the pointer to 14 bits (with the ASCII method) or 12 bits (with the UTF-8 method). But you can tweak this, as you could allow all possibilites for the second byte. So for ASCII it would be 1xxxxxxx xxxxxxxx and for UTF8 10yyyyyy yyyyyyyy. This gives you 15 bits for ASCII and 14 bit for UTF-8.

  • @JanCRefsgaard
    @JanCRefsgaard 11 ปีที่แล้ว

    yep, you are the second to point that flaw out :), but it's nice that the internet is full of smart people :)

  • @dailydols
    @dailydols 10 ปีที่แล้ว +1

    Can you make a bit about BZIP? Or the background of Burrows-Wheeler-Transformation?

  • @arayapokey
    @arayapokey 11 ปีที่แล้ว

    Love the lecture, also this camera man kicks ass!

  • @nex
    @nex 11 ปีที่แล้ว

    Yeah I kept waiting for the professor to address this issue ...
    You don't need a whole byte, however; a single bit is enough. Another variation is to use a jump distance of 0 combined with a literal character in place of the string length. In that case, 0x00 does indeed signify the start of a literal string, but it doesn't take up an extra byte.
    Over the years quite a few variations have accumulated; unfortunately I haven't got the time right now to look up how the original LZ77 does it ;\

  • @ChrisWalshZX
    @ChrisWalshZX 11 ปีที่แล้ว

    METHOD 2: Normal runs of text could have their own header that says how long the text run is. That way, only two headers are required and no escaping.
    NOTE that a header may only need to be a few bits (even as little as 1 bit in the case of method 2) and not a whole byte.

  • @Tasarran
    @Tasarran 11 ปีที่แล้ว

    He specifically says in the video that there would not really be enclosing brackets, that it would be a binary string.

  • @aleksandarrusinov9974
    @aleksandarrusinov9974 9 ปีที่แล้ว +2

    Can this be cascading i.e pointing to a pointer, that is ho one may achieve transition in a longer text?

  • @jamesgrimwood1285
    @jamesgrimwood1285 11 ปีที่แล้ว

    The Unicode website has a section on this. You want to search for "Unicode Technical Note #31"