Elegant Compression in Text (The LZ 77 Method) - Computerphile

  • Published: 12 Dec 2024

Comments • 311

  • @AlexanderKrivacsSchrder
    @AlexanderKrivacsSchrder 11 years ago +40

    I remember in my computer science classes on compression, we were taught Lempel-Ziv-Welch (LZW). I thought that one was pretty cool, because instead of looking for specific things it has seen, it just constantly makes a dictionary of combinations of characters it has already seen and can refer back to those as it goes along. The best part is that it doesn't have to store the dictionary anywhere, because the decompressor can rebuild it while it's decompressing, based on the compressed data.
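The scheme this comment describes can be sketched in a few lines. This is a minimal, illustrative LZW implementation in Python (not the video's LZ77, and not any particular production codec): the decompressor rebuilds the same dictionary the compressor built, so the dictionary is never stored.

```python
def lzw_compress(text):
    # Seed the dictionary with every single-byte character; new codes grow from 256.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    out = []
    for ch in text:
        combined = current + ch
        if combined in dictionary:
            current = combined
        else:
            out.append(dictionary[current])
            dictionary[combined] = next_code  # remember the new combination
            next_code += 1
            current = ch
    if current:
        out.append(dictionary[current])
    return out

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The one special case: the code refers to the entry still being built.
            entry = prev + prev[0]
        out.append(entry)
        dictionary[next_code] = prev + entry[0]  # rebuild the compressor's dictionary
        next_code += 1
        prev = entry
    return "".join(out)

codes = lzw_compress("TOBEORNOTTOBEORTOBEORNOT")
assert lzw_decompress(codes) == "TOBEORNOTTOBEORTOBEORNOT"
```

A real LZW (as in GIF or old Unix `compress`) also packs the codes into variable-width bit fields, which this sketch omits.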

  • @jakovmatosic4890
    @jakovmatosic4890 1 year ago +6

    I love it so much when there's a Computerphile episode for something I need to learn; it makes it so much better than a random ppt tutorial.

  • @grayswandir47
    @grayswandir47 9 years ago +261

    I did some experimentation. If you open a zip or gz file with a hex editor there isn't anything human readable. You have to compress using an older format such as .lzo (Not sure anything in Windows will do this but it's a compress option in Fedora Linux.) Opening a .lzo file with a hex editor (I used ghex) shows much of the original text and it's easy to see where text is replaced by pointers as discussed in the video. You can see more frequent pointers as you move farther along in the document. The pointers in the .lzo file were two, three and four bytes long. I could not see any telltale that identifies a pointer versus text. There has to be some method for the decompression algorithm to distinguish between original text and a pointer. Thanks for posting the video. It stirred my curiosity.

  • @TheDave000
    @TheDave000 9 years ago +91

    Can I please have an evening in a pub with Professor Brailsford and just have him explain things to me? I wish he were my grandfather, or at the very least a mentor of some sort. He's just such an all-round excellent bloke. I think his computer science pupils are very lucky to have such a clear-speaking and warm-hearted teacher.
    Professor Brailsford is the best!

  • @fredhair
    @fredhair 7 years ago +1

    I think it speaks volumes of prof Brailsford that he can simplify some of the most complex topics and explain them in a way that is a joy to listen to. However when pressed for more details and information he will gladly oblige you and probably tell you all there is to know on the subject! Brilliant man, could listen to him talking on computers all day.

  • @seanld444
    @seanld444 9 years ago +5

    It's hilarious that people at school always say: "you shouldn't have to use your brain during summer!", when I am watching Computerphile, and enjoying learning about compression and stuff. I love this!

  • @ffggddss
    @ffggddss 8 years ago +160

    Actually, wouldn't that stored string - "computer" - include the leading space?
    (" computer")
    Because that, too, occurs in both instances.
    Also, it occurs to me that you'll never want to tag & remember a repeated string of 0, 1, or even 2 characters, because the "compressed" version of any string in this scheme, is 2 characters (bytes) long, so you're not actually saving anything until you have a string of 3 or more characters.
    This allows "encoding" the 4 bits that encode length to mean the numbers from 3-18, rather than the usual 0-15.
    These, if they are implemented in L-Z, are refinements that would properly be omitted from this overview, but I thought they might be worth bringing up here in the comments.
    And BTW, thank you for a very clear and concise explanation of how this scheme works!
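The refinement discussed in this comment (biasing the 4-bit length field so it means 3-18, since shorter matches aren't worth a 2-byte pointer) can be sketched as a packing scheme. This is a hypothetical layout for illustration, assuming 12 bits of jump (1-4096) and 4 bits of biased length; real LZ77 variants choose their own field widths.

```python
def pack_pointer(jump, length):
    # 12-bit jump stored as jump - 1 (so 1..4096 fits), 4-bit length stored as length - 3.
    assert 1 <= jump <= 4096 and 3 <= length <= 18
    value = ((jump - 1) << 4) | (length - 3)
    return value.to_bytes(2, "big")

def unpack_pointer(two_bytes):
    value = int.from_bytes(two_bytes, "big")
    return (value >> 4) + 1, (value & 0xF) + 3

# Round trip at both extremes of the ranges.
assert unpack_pointer(pack_pointer(1, 3)) == (1, 3)
assert unpack_pointer(pack_pointer(4096, 18)) == (4096, 18)
```

Because a pointer costs 2 bytes, only matches of 3+ characters save space, so nothing is lost by making 3 the smallest representable length.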

  • @zandaya
    @zandaya 11 years ago +361

    What's with all the hate here?? Can't you just appreciate his work? He's better than my university lecturers and I thank him for that.

  • @isaac10231
    @isaac10231 11 years ago +2

    I am absolutely loving this channel. Anywhere I look online for an explanation of something computer- or network-related, like TCP, I can't understand any of it. But this channel gives me a foundation to work from, to understand stuff like this.

  • @ruddyrdx
    @ruddyrdx 9 years ago +19

    Oh, how beautifully he explained the concept! Totally impressed! Now I only wish our professor in college had explained LZW this elaborately, but he did not.

  • @Doniazade
    @Doniazade 9 years ago +88

    Wouldn't it be more efficient to have j be the distance to the END of the string you are referencing? That way you can reference strings (l-1) characters farther back with the same number of bits available for j.
    Basically, go j characters back and then take that character and the preceding (l-1) ones.

  • @Computerphile
    @Computerphile  11 years ago +3

    Hopefully fixed now, thanks for spotting! >Sean

  • @MrCOPYPASTE
    @MrCOPYPASTE 11 years ago

    First of all I would like to thank you and your peers for the time you invest in these videos.
    One thing I didn't grasp is whether the implementation of the search for a given match is biased (prone to a given design) or a "closed analytic solution", and what the creators' approach was. Thank you for your time.

  • @etmax1
    @etmax1 9 years ago +47

    People often say that learning times tables etc in this day and age is a waste of time, but I disagree. If you don't need to use a calculator, it means that you both intrinsically understand the numbers and you can be much faster. I knew an accountant who could add up 3-digit columns as fast as I can mentally do single digits. In his job that would have been worth gold. Knowing powers of 2 in computer science is similar

  • @neey3832
    @neey3832 1 month ago

    i love how visual and easy to understand this video is, even 10 years later!

  • @grandadie
    @grandadie 9 years ago +200

    Could a pointer point to a pointer instead of having to remember another instance of the word?

  • @MECKENICALROBOT
    @MECKENICALROBOT 11 years ago +3

    This sounds literally like how we use our language and even our creative process. Constantly referencing previous instances of a given word, idea, or concept... I'm really enjoying this channel

  • @Bovineprogrammer
    @Bovineprogrammer 11 years ago

    I absolutely love LZ compression algorithms. Beautiful to code, beautiful to see in motion, and pretty darn effective too. Not only that, but it's easily tweaked to work with whatever data you're using just by changing a few constants.
    Fantastic.

  • @Computerphile
    @Computerphile  11 years ago +6

    You can thank Sean for that!
    >Brady

  • @felineboy
    @felineboy 11 years ago +4

    Actually, along with the jump/length tuple, the LZ77 algorithm outputs an optional literal character. So, for example, the very first output of the compressor would be a jump/length tuple with a length of 0, followed by the first byte of the decompressed data.
    Implementations differ on how they specify when there's a literal character or how the length/pair tuple is encoded.
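The classic output format this comment describes, where each step emits a (jump, length) pair followed by one literal character, can be sketched as follows. This is an illustrative toy in Python, not any specific production implementation; real compressors use hashing rather than this brute-force match search.

```python
def lz77_compress(data, window=4096, max_len=15):
    i, out = 0, []
    while i < len(data):
        best_j, best_l = 0, 0
        start = max(0, i - window)
        # Brute-force search the window for the longest earlier match.
        for j in range(start, i):
            l = 0
            while (l < max_len and i + l < len(data) - 1
                   and data[j + l] == data[i + l]):
                l += 1
            if l > best_l:
                best_j, best_l = i - j, l
        # Always follow the (jump, length) pair with one literal character;
        # when no match is found the pair is (0, 0).
        out.append((best_j, best_l, data[i + best_l]))
        i += best_l + 1
    return out

def lz77_decompress(triples):
    out = []
    for jump, length, literal in triples:
        for _ in range(length):
            out.append(out[-jump])  # copy one character from `jump` back
        out.append(literal)
    return "".join(out)

s = "compression and decompression"
assert lz77_decompress(lz77_compress(s)) == s
```

Because a literal always follows each pair, the decompressor never needs a flag to tell pointers apart from raw text, which is exactly the point the comment makes.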

  • @fredericweisbecker845
    @fredericweisbecker845 9 years ago +24

    I want to go back to university with this man as my professor!

  • @ChrisWalshZX
    @ChrisWalshZX 11 years ago

    I've not looked into LZ but I am familiar with a lot of compression techniques.
    You are of course correct that there has to be some way of differentiating the two. This can be designed in two different ways.
    METHOD 1: Define a header exactly as you say which marks the following (say 16 bits) as a pointer. You will however need to 'escape' any chance of the same header appearing in a normal run of the text string. Escaping can be done with a different header.

  • @styleisaweapon
    @styleisaweapon 11 years ago

    More compression either requires more time or more auxiliary memory. The most common compression method in use today is called DEFLATE (a combination of LZ77 and Huffman) popularized by the ZIP archive format. In the case of DEFLATE the "extra compression" flag is dealing with the Huffman part of the algorithm and doesn't have a significant time cost, but does have a significant memory cost because it uses a forest of Huffman binary trees rather than a single Huffman binary tree.

  • @BGBTech
    @BGBTech 11 years ago +2

    typically by flagging them somehow.
    in many variants, this is done by symbol range, like a certain range of symbols will be treated as special;
    some others stuff in extra bits to distinguish between raw characters and runs;
    a few also make raw characters the special case, where you either have a dictionary reference, or 1 to N raw characters.
    often, LZ77 is combined with Huffman, as in Deflate (example: ZIP and GZIP), or arithmetic coding (7zip / LZMA, at least sort of...).

  • @luketimothy
    @luketimothy 11 years ago +2

    I hope you have more of these Information Theory episodes coming! This was my favourite area of study at Uni. (I actually do something related to it for a Job now, but I still find these fun to watch) :)

  • @phookadude
    @phookadude 3 years ago +6

    One important thing missing there: you need some sort of identification that the 2 bytes are a reference and not just more characters. You could set the highest bit of the pointer's first byte to on, but that kind of shows another problem: most text only uses the lower 7 bits of an 8-bit byte. So if you want to compress things it seems odd to leave that alone, because an eighth of every character (12.5%) is wasted.

  • @stephenhorton
    @stephenhorton 11 years ago +1

    Thanks for the video! Very interesting! To make the pointer even better: You'd never point to something less than 3 bytes, so the length part of the pointer could be seen as the length of the string minus 3, so you can represent up to 10 in 3 bits. Also why point to the start of the string? You're always going backwards, so point to the end, the rest can be worked out, meaning you can stretch 8202 characters back (13 bits + max length).

  • @Humineral
    @Humineral 11 years ago

    This is such a quality channel.

  • @iabervon
    @iabervon 11 years ago +1

    There's actually a whole family of Lempel-Ziv-based techniques. The basic technique is to have a table of character sequences from the document that you can reuse when they come up again. LZ77 is a particular way of constructing the table, but there are others. Some use relative positioning, some absolute (with a "start anew from here" code, in case the file switches languages in the middle). There are also clever tricks with how you write lengths, and how you store the codes.

  • @fllthdcrb
    @fllthdcrb 11 years ago

    Good point. It illustrates a difference between the way we humans perceive the text and how the computer perceives it. We tend to see whole words and things starting on word boundaries, but a compressor usually matches things a naive person doesn't expect.

  • @lucdestrube
    @lucdestrube 11 years ago +2

    Love the addition of the forward slash at the end :)

  • @bobbyaldol
    @bobbyaldol 9 years ago +130

    Feels like I am listening to David Attenborough teach Computer Science.

  • @iabervon
    @iabervon 11 years ago

    In general, LZ produces a bit stream with longer-than-8-bit codes, of which some are 9-bit codes for 8-bit source bytes and some are 17-bit codes for "this is a pointer" plus 16 bits of pointer value. Then it takes the bit stream and chunks it into bytes for putting in compressed files. This is why LZ-compressed files look like random noise if you try looking at them as characters, while UTF-8 files look only slightly wrong if you don't know the encoding.

  • @steveohbandito
    @steveohbandito 11 years ago +9

    Learning your powers of 2 and 16 times table will help immensely. Also, to all the highschool students going to university/college for computer science, PLEASE, if you do anything, start programming assignments ASAP. CS assignments are usually never something that can be done the night before and quite often take many hours to complete. I'm only a 2nd year CS student but it's something that I've learned is very important to do.

  • @MathAndComputers
    @MathAndComputers 11 years ago +2

    You can take a bitmap image file and put it in a ZIP file to try it out, yourself. :) Some image files, like logos or diagrams, tend to compress quite well with this method, which is why PNG files use a similar compression method, but some image files, like photographs, tend to compress quite poorly with this method, which is why JPG files use a very different (and "lossy") compression method.
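The experiment this comment suggests is easy to run with Python's standard `zlib` module, which implements DEFLATE (LZ77 + Huffman, the same family of methods discussed here). Repetitive data shrinks dramatically; random data, which has no repeated strings to point back to, barely shrinks at all.

```python
import os
import zlib

# Highly repetitive input: lots of strings for LZ77 pointers to reuse.
repetitive = b"computer science " * 1000        # 17000 bytes
# Incompressible input: random bytes contain almost no repeats.
random_ish = os.urandom(17000)

small = len(zlib.compress(repetitive))   # far smaller than 17000
large = len(zlib.compress(random_ish))   # roughly the size of the input
assert small < large
```

This mirrors why logos and diagrams (repetitive pixels) compress well in PNG, while photographs do not and get a lossy codec (JPEG) instead.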

  • @Ferrus91
    @Ferrus91 11 years ago

    I was taught LZW at uni, it was kind of interesting to see its predecessor explained

  • @Salabar_
    @Salabar_ 11 years ago

    We can add a single indicator bit before each symbol. So use "0%UNICODE" to represent characters (they will actually be 17-bit characters) or "1%POINTER". It will add insignificant overhead, but for most texts, compression will go pretty well.
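The indicator-bit scheme described above can be sketched at the bit level. This is a hypothetical illustration (the function names and field widths are made up for the example): a 0 bit precedes each 8-bit literal, a 1 bit precedes each 16-bit pointer.

```python
def emit_literal(bits, byte):
    # Flag bit 0, then the literal byte, most significant bit first.
    bits.append(0)
    bits.extend((byte >> i) & 1 for i in range(7, -1, -1))

def emit_pointer(bits, pointer16):
    # Flag bit 1, then the 16-bit jump/length pointer.
    bits.append(1)
    bits.extend((pointer16 >> i) & 1 for i in range(15, -1, -1))

bits = []
emit_literal(bits, ord("A"))
emit_pointer(bits, 0x0123)
assert len(bits) == 9 + 17  # 9 bits per literal, 17 per pointer
```

So each literal costs one extra bit (about 12.5% overhead), which is quickly repaid once pointers start replacing multi-byte strings.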

  • @Celrador
    @Celrador 11 years ago +3

    At a certain length of the document, the number of bytes needed to address the jump would be too big.
    You want to compress, after all. So with relative positioning you can still have infinitely long documents.
    If you take absolute positioning, you will at some point have to start using jump addresses that get bigger than the words you want to compress with them.

  • @Vazzible_gaming
    @Vazzible_gaming 3 months ago

    guys who are in high school preparing for this kind of stuff are so lucky; I didn't know I wanted to get into tech. I was out of high school after I had taken wood shop every year (biggest mistake ever), and now I want to write operating-systems programs in C. I literally screamed when I realized I didn't get anything out of high school; I learned how to glue wood together when I should've been learning pointers and Java. I still feel ignorant for not taking any tech classes or preparing for college in any way. Last 2 years of college now, and it feels like I'm climbing up a cliff that's set at a negative slope; no one ever pushed me into learning technology, I had to do it by myself, and I basically have no support structure.

  • @Arkalius80
    @Arkalius80 10 years ago +6

    The common Deflate algorithm used in the popular zip and gzip compression formats uses LZ77, followed by Huffman coding, which essentially encodes more common byte values as shorter bit strings, with a trade-off of uncommon byte values having longer than 8-bit bit strings, but ultimately leading to a net savings. This is why a zip file doesn't have any readable content in it. All of the data has been re-coded in a way that isn't readable.

  • @fllthdcrb
    @fllthdcrb 11 years ago

    In the end, your needs make a difference. If you're desperate to squeeze as much information into a given space as possible and don't care if it takes a while/a lot of computing power, you should try all sorts of tricks. Otherwise, you should trade off compression for speed.
    LZMA (see 7-Zip) and its derivatives are an example of really complex techniques to gain compression, although it's also more normally usable. RAR too, I suppose, but that's proprietary, so few know just what it does.

  • @styleisaweapon
    @styleisaweapon 11 years ago

    Relative is better because there are strong correlations in locality. Consider an encyclopedia: very few articles will use the term "Space Shuttle" but the ones that do will often use it frequently. As the encoder/decoder progresses, Space Shuttle is more likely to be seen if Space Shuttle was recently seen but very unlikely to be seen if its been a long time since it was last seen. Exploiting correlations in locality is part of the broader topic of Adaptive Compression which is often superior.

  • @BGBTech
    @BGBTech 11 years ago

    short answer: better? it depends.
    also short answer: LZMA essentially is LZ77, just with a fairly large window and arithmetic/range coding.
    basically, many variants of LZ77 exist, differing in specifics like encoding, maximum length and pointer distance, and whether or not entropy coding is used. these can be used to balance speed, memory use, compression ratio, and other properties.
    usually, one tries to strike a good balance for the specific use case.

  • @ThePC007
    @ThePC007 2 years ago

    @Vix
    Not necessarily. By referencing the left end of the string you're referring to, you basically get run-length encoding for free.
    If I were to encode the string “AAAAA”, I'd then be able to do it as follows: “A”
    The algorithm would then copy the newly placed characters and decode the string properly.

  • @DigGil3
    @DigGil3 11 years ago

    Pay attention to the text at 8:23! In the HTML coding, tags end with «

  • @ItsThe1300
    @ItsThe1300 11 years ago

    I love the way Professor Brailsford says "compression"!

  • @fllthdcrb
    @fllthdcrb 11 years ago

    Incidentally, a little off-topic, but there's another trick here. If we allow the copied string to overlap the current position, we can compress short repeated patterns fairly well. That is to say, the decompressor, when finding a backreference, can simply look back in its output buffer, start reading bytes, and copy them onto the end. If it runs over what was the current position when it started, it will pick up the start of the string and copy it again, thus repeating it for a while. (cont...)
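The overlap trick described above is why the decompressor must copy one character at a time rather than as a block. A minimal sketch (the helper name is made up for illustration): a single literal 'A' plus a (jump=1, length=4) backreference decodes to "AAAAA", giving run-length encoding for free.

```python
def copy_back(out, jump, length):
    # Copy byte by byte; if the reference overlaps the current position,
    # we re-read characters we just wrote, repeating the pattern.
    for _ in range(length):
        out.append(out[-jump])

buf = list("A")
copy_back(buf, jump=1, length=4)
assert "".join(buf) == "AAAAA"

# A two-character pattern repeats the same way:
buf = list("AB")
copy_back(buf, jump=2, length=4)
assert "".join(buf) == "ABABAB"
```

A naive block copy (e.g. a plain `memcpy` in C) would break here, because the source and destination ranges overlap.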

  • @mandolinic
    @mandolinic 9 years ago +19

    Replacing some characters with a two-byte combined pointer is only half of the design challenge: the decoder still needs to know whether it's processing text or processing a pointer. This means either you've got to put a special byte marker before the pointer, effectively changing a two-byte pointer to three bytes and reducing the compression ratio, or you've got to sacrifice some of the bits in the pointer to make a pattern that isn't going to occur normally in the text, which has further implications for the compression ratio.

  • @Morturious
    @Morturious 11 years ago

    Good question. Next step -- google the difference between UTF-8 encoding and LZ compression. Generally speaking, this information is (as far as I am aware, anyway) in some way stated in the header of the file.

  • @powerhouseofthecell9758
    @powerhouseofthecell9758 10 years ago +16

    What Edawan said, could you make an episode about distinguishing pointers from the rest of the text?

  • @Chris_Dare
    @Chris_Dare 9 years ago +27

    Question: what's stopping you from referencing the previous pointer if the 'jump' gets too long? Couldn't you point to the previous pointer and grab whatever that is pointing to? Assuming you want maximum compression and don't care about how long it takes to uncompress.

  • @AlexSimpsonsPage
    @AlexSimpsonsPage 11 years ago

    Not always the case, e.g. zip, gzip, etc. is lossless.
    There is usually a memory and computational overhead to the higher compressions, so the higher compressions for a particular algorithm will take longer to run.
    For the example in this video, the dictionary might be bigger and cover a longer range of characters and combinations. Looking everything up to see if it matches in a very very large dictionary could take longer.

  • @TimVerweij
    @TimVerweij 11 years ago

    If each composite pointer is 2 bytes, you can assume that the word you are pointing to has a minimum length of 3 bytes; otherwise there's no point in creating a pointer there. So the 4 bits you're using to specify the length could mean a length of 3-18 bytes (instead of 0-15 or 1-16).

  • @fllthdcrb
    @fllthdcrb 11 years ago +1

    Well, one must make a distinction between the compressed stream and the uncompressed stream. In the scheme presented, everything is in terms of the uncompressed stream, where there are no pointers (I'll call them backreferences). It's possible, at least theoretically, to reference things in terms of the compressed stream, or at least some form of it. But this would require the compressor to recursively resolve who knows how many backreferences. It might be fairly slow. (cont...)

  • @JohnSmith-sw1ng
    @JohnSmith-sw1ng 11 years ago +11

    I like how he doesn't want to waste the paper with the holes on the side, which hasn't been used since the '80s....

  • @nex
    @nex 11 years ago

    Pretty much, except the flag bit goes either before a character or a pointer+length, the latter of which are usually longer than a byte, so it won't end up being one bit per byte.
    In the meantime I got around to looking it up: LZSS does what I just described. The original LZ77 _always_ outputs the next literal character after a pointer+length (which is a null pointer in the case of no match found), so it's always clear what's what and no flags are needed.

  • @JerehmiaBoaz
    @JerehmiaBoaz 11 years ago

    There are different ways of distinguishing a literal from a pointer during compression. The easiest way is to always put out a pointer even if it points to a zero length sequence, followed by the first byte that wasn't part of the match. While decompressing you read the pointer, extract the sequence if its length is greater than zero, then you read the literal byte after the pointer, copy to its destination after the extracted sequence, and you're good to read the next 3-byte pointer + literal.

  • @BGBTech
    @BGBTech 11 years ago

    Deflate: also used in GZIP, PNG, various network protocols, several video codecs, Minecraft region files, ... so, yeah, pretty common.
    internally, it uses 0-255 for literal bytes, and (IIRC) 256-285 for special things (EOB, also LZ77 runs). for runs, the Huffman-coded symbol (essentially) works as a prefix indicating how many bits follow, with the prefix+bits encoding a value (run length and distance).
    and, also nifty: it Huffman encodes the Huffman table.

  • @typograf62
    @typograf62 9 years ago +19

    The IBM System/370 Reference Summary from 1984 has an error in the Hex conversion table. It says 4069 rather than 4096.
    If one notices that then it is probably time for a vacation.

  • @deividux12
    @deividux12 11 years ago

    Because when you "write down" the start and the length, you can even reference the middle of a word; for example, the word "press" could be referenced within the word "compression" (comp press ion).

  • @eideticex
    @eideticex 11 years ago

    Well, out of the 256 possible values of a 1-byte character, at most 127 will actually appear in the document. For example, one would likely not use the bell character in a text file, so that's 1 value we have that can serve as the header for a pointer.

  • @styleisaweapon
    @styleisaweapon 11 years ago

    In classic LZ77 a literal always follows a [j,l] pair. Multiple literals in a row require an [j, l] pair with a length of 0. LZ78 solves the problem by pre-initializing the dictionary with every symbol in the alphabet and specifically protecting this dictionary block (so no literals needed, only [j, l] pairs.) The LZH variant is based off of LZ77 rather than LZ78, and instead uses standard huffman codes for each literal plus a unique code to indicate that an [j, l] pair is to follow.

  • @VamPDX
    @VamPDX 11 years ago

    It's also possible to include the blanks in the compression, so short words like "a" with a blank in front and in back could be used to save space.

  • @Tasarran
    @Tasarran 11 years ago

    With lossy compression, the computer isn't trying to match the match string exactly, it just tries to get close.
    That percentage is usually the amount of similarity allowed, the looser a 'match' is allowed to be, the more compression you will have, and the more loss.
    It's usually acceptable in things like images, where the human eye can't really tell the difference between, for example, 255,255,255 (white) and 253,254,255 (nearly white)...

  • @LeethLee1
    @LeethLee1 6 years ago

    This was amazing!! Thank you again for sharing all this wonderful knowledge. This is just a good thing for the world, and for individuals, especially as a launching-off point to learn more about certain areas in detail.

  • @Kyuubi840
    @Kyuubi840 11 years ago

    HTML tags start with their name inside < > and end with the same name, but with an added slash:

  • @IceMetalPunk
    @IceMetalPunk 11 years ago +10

    This is interesting :) The only thing I wondered throughout this video is, when reading the compressed file, how do you know whether the byte you've read is the start of a 2-byte pointer or just a literal 1-byte character?

  • @vkillion
    @vkillion 7 years ago +1

    I wonder if you could use one of these pointers to refer to a previous pointer, in the case where the referred to entry is farther than the maximum jump possible (4096 in his example), but another reference using a pointer was within range. This would in essence be recursive pointers.

  • @griv2000
    @griv2000 11 years ago

    The implementation is left as an exercise to the reader, but most implementations use dictionaries and hashes to find the matches. It does not always give the best compression (because of hash collisions), but it is a lot faster than brute force.

  • @johanhendriks
    @johanhendriks 9 years ago

    5:50 ~ Dutch will create massively long words, because when describing one thing (a noun) it doesn't use spaces. For example "Waste water purification plant" would be "waterzuiveringsinstallatie". Imagine how a document on purification of water might repeat this word many times.

  • @Mackinstyle
    @Mackinstyle 11 years ago

    I love your passion for mathematics.

  • @JanCRefsgaard
    @JanCRefsgaard 11 years ago

    I think it depends on the type of data, if your data is written by a human, words would probably cluster into sections such that relative pointers could capture most words with few bits
    if your data is the human genome, then absolute pointers might be best because the patterns will be scattered all over your document

  • @SnowRaptor
    @SnowRaptor 11 years ago

    Essentially compressing and decompressing speed. Lower compression decompresses faster. Depending on the application, you may prefer a file that is 20% larger but can be decoded 20% quicker.

  • @KelvinShadewing
    @KelvinShadewing 8 years ago +8

    So, what if you were to use a sort of nested pointer system where a pointer refers to a previous one and so on until the first word that got compressed? Is that what would be done if the original word is more than 4096 spaces away? Like, say that first pointer to "computer" is decoded, then the next one is out of reach of the first instance but can reach the second one. Would it just take the word from the second one? Or read its location to jump further back to the first?
    Because I'm wondering, if you were to do a nested pointer like this, would the length of the pointer being shorter than the length of the original word affect the data of previous pointers? Or maybe it goes through, counts the words, and then replaces them with pointers later, so that when they're decompressed in order, the string lengths match up again?

  • @ivansanchez1988
    @ivansanchez1988 10 years ago +1

    Thank you, professor, great explanation... Good advice: learn your powers of 2 and 16...

  • @stupossibleify
    @stupossibleify 9 years ago +18

    I'm not a computer scientist, but I still see a problem here which isn't dealt with in the video: you would surely have to discriminate the pointer from the other information in the file. In this case, is the binary encoding ASCII or an LZ jump vector, for example? It's an obvious question which needs answering. Great video nonetheless.

  • @taldavid906
    @taldavid906 6 years ago

    Great video to explain the LZ77 method! Thanks!

  • @xZise
    @xZise 11 years ago

    Although this will reduce the width of the pointer to 14 bits (with the ASCII method) or 12 bits (with the UTF-8 method). But you can tweak this, as you could allow all possibilities for the second byte. So for ASCII it would be 1xxxxxxx xxxxxxxx and for UTF-8 10yyyyyy yyyyyyyy. This gives you 15 bits for ASCII and 14 bits for UTF-8.

  • @luketimothy
    @luketimothy 11 years ago

    Well, when you decompress the data, you will read it as bytes... so it's not UTF-8 at all.
    Also, I believe this assumes the text is ASCII encoding... in which case you have an extra bit, as ASCII only makes use of 7 bits. This extra bit is usually a parity bit, but in this case, you can say that it denotes a character or pointer data.
    Set this extra bit to a 0 for a character, and 1 for pointer data. Of course, this means you would lose possible dictionary/string size.

  • @jonoandrew6821
    @jonoandrew6821 11 years ago

    Exactly what I was wondering. The only 2 solutions I can see are having a unique byte (or bytes) to indicate that it is a pointer, which is fine for a text file where you know that byte won't occur in the text, but not so for pretty much any other file; so I would assume they store a list of the locations of all pointers somewhere in the file (the start, probably, otherwise we have the same problem of knowing where the list starts). I would google it, but I'm too tired right now; that seems pretty logical.

  • @JanCRefsgaard
    @JanCRefsgaard 11 years ago

    1), probably uses 1xxxxxxx for ASCII and 10xxxxxx for UTF-8 (see my other comment)
    2) yes, it sees the documents as ones and zeros, it doesn't go word by word
    3) very good idea! I don't know if the original implementation can, but it could easily be implemented into an algorithm; your pointer could look like this:
    UTF-8: 10xxxxxx(xxxxxxxx)10xxxxxx
    ASCII: 1xxxxxxx(xxxxxxxx)1xxxxxxx

  • @BlueRavenGT
    @BlueRavenGT 11 years ago

    Even then you can't tell just by looking that two bytes is a pointer instead of two characters. The simplest solution is to just place a character or sequence of characters before the pointer that doesn't otherwise occur in the resulting bitstream. Look up "escape character" to learn more.

  • @maartendj2724
    @maartendj2724 8 years ago +1

    Is it fair to say that for every combination of at least 3 characters that appears in the text at least 2 times, a pointer should be created? Because the pointer 'costs' 2 bytes of storage, and obviously a 3-character combination costs 3 bytes, so this should be the minimal length for which creating a pointer translates to less memory in use. So every 3-letter word like 'the' or 'and' should be pointed to as soon as it reappears; in fact, even ' or' (note I put a space in front) should be, because a space is a character with a byte code as well.

  • @Grarrgle
    @Grarrgle 11 years ago +1

    This professor is so lovely. What a nice fella.

  • @nex
    @nex 11 years ago

    Yeah I kept waiting for the professor to address this issue ...
    You don't need a whole byte, however; a single bit is enough. Another variation is to use a jump distance of 0 combined with a literal character in place of the string length. In that case, 0x00 does indeed signify the start of a literal string, but it doesn't take up an extra byte.
    Over the years quite a few variations have accumulated; unfortunately I haven't got the time right now to look up how the original LZ77 does it ;\

  • @JuddMan03
    @JuddMan03 11 years ago

    Because maximum can mean the difference between 30 minutes and 2 hours when compressing enormous files, while the difference in output size between normal and maximum may be really small, such as 5%.

  • @henke37
    @henke37 10 years ago +1

    I am disappointed that the trick to do RLE wasn't brought up; it is quite an important trick in LZ77.

  • @OisínMcColgan
    @OisínMcColgan 11 years ago

    Very good video. Any chance we might see one on Huffman coding?

  • @fllthdcrb
    @fllthdcrb 11 years ago

    That's a sensible possibility. In reality, you'd want to add/subtract offsets on the bitfields. Consider: is it useful to copy a 1-character string? No, because it costs 2 bytes to encode it; that's no longer compression. 2 characters is also pointless. So, why not copy strings of minimum 3 characters at a minimum offset of 3. 0 for the offset field is still reserved to signal literal 0, so subtract 2 from the offset and 3 from the length, and you gain a little range.

  • @reristavi
    @reristavi 6 years ago

    Very nice one. You really need a perfect mind to think about compression like this.

  • @ChrisWalshZX
    @ChrisWalshZX 11 years ago

    METHOD 2: Normal runs of text could have their own header that says how long the text run is. That way, only two headers are required and no escaping.
    NOTE that a header may only need to be a few bits (even as little as 1 bit in the case of method 2) and not a whole byte.

  • @KubGov
    @KubGov 8 years ago +6

    What is the meaning of the 77 in LZ77?

  • @sayur54321
    @sayur54321 11 years ago

    The closing tag made me smile :)

  • @RepublikSivizien
    @RepublikSivizien 11 years ago

    For a 2-byte (j,l) pointer you actually need 3 bytes, because your unpacker has to know
    whether it's reading a word or a jump; in strings, it's the 0x00.

  • @666Tomato666
    @666Tomato666 11 years ago

    a proper ending xml tag! Props Brady!

  • @AsbjornGrandt
    @AsbjornGrandt 10 years ago +111

    He forgot the space character preceding "computer"
    But this is a brief and simplified explanation of how compression can work.
    The devil is in the details of course.

  • @gigantomaquia
    @gigantomaquia 9 years ago +1

    I'm glad that my suppositions about compression were right! :D Thanks for your great videos, Computerphile! :D

  • @Tupster
    @Tupster 11 years ago

    Pretty sure this code was invented to compress 7-bit ASCII; to extend it to other encodings you would have to invent some way to escape that encoding, similarly to how you can escape ASCII by setting the 8th bit.

  • @JanCRefsgaard
    @JanCRefsgaard 11 years ago

    The C++ one. I know basic C++ and program in Python at my day job, but I am pretty sure that 4 videos in I will start learning new/forgotten things :)

  • @Tasarran
    @Tasarran 11 years ago

    He specifically says in the video that there would not really be enclosing brackets, that it would be a binary string.