Understanding Deflate

ingve | 53 points

But I think the bits must be the opposite "endian" here, i.e., 2 instead of 1.

I remember asking the zlib developers many years ago why DEFLATE chose its unusual bit order, which makes table-based decoding slightly more complex than necessary, and being told that only Phil Katz knew.
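
(For the curious: the unusual part is that DEFLATE packs its bit stream into each byte LSB-first but stores the Huffman codes themselves MSB-first, so a table-based decoder that indexes with the low bits of its bit buffer has to bit-reverse every canonical code while building the table. A minimal Python sketch; the helper names here are my own invention, not zlib's:)

  # Toy illustration of the wrinkle: each canonical code must be
  # bit-reversed before it can index a lookup table that is consulted
  # with the low bits of an LSB-first bit buffer.
  def reverse_bits(code, length):
      out = 0
      for _ in range(length):
          out = (out << 1) | (code & 1)
          code >>= 1
      return out

  def fill_entries(table, table_bits, code, length, symbol):
      rev = reverse_bits(code, length)
      # Every table index whose low `length` bits equal the reversed
      # code decodes to this symbol; the high bits are don't-cares.
      for hi in range(1 << (table_bits - length)):
          table[(hi << length) | rev] = (symbol, length)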

Also of note is that LHA/LZH uses an extremely similar data format where Huffman is also used to encode both literals and lengths in one "alphabet": http://justsolve.archiveteam.org/wiki/LHA

Given that LHA very slightly predated ZIP, I suspect the latter was developed based on the former's algorithms.

userbinator | 30 minutes ago

> That said, even in this simple case, decoding by hand was a pain.

Well, doing it by hand at this level is mostly decoding quasi-ASCII, and the compression isn't making it significantly more painful.

If you reorder the static Huffman table and byte-align things, the encoded data could look more like:

  T O B E O R N O T T 0x85 0x09 T 0x88 0x0F 0xFF

So short lengths become 0x80 + length, and short distances encode as themselves.

This version is pretty easy to decode by hand, and still 16 bytes. I'm arguably cheating by removing the 3-bit header, but even 17 bytes would be good.

The encoder's decision to repeat those Ts doesn't make much sense, though; shouldn't this be compressed to literal("TOBEORNOT") copy(6,9) copy(9,15)?
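
(For what it's worth, here's a throwaway Python decoder for the toy byte-aligned format above, based on my reading of it: the trailing 0xFF is an end-of-block marker, 0x80+N starts a copy of N bytes whose distance is the next byte, and anything else is a literal. Decoding both the 16-byte stream shown and the literal("TOBEORNOT") copy(6,9) copy(9,15) parse, which would be only 14 bytes in this encoding, yields the same text:)

  def toy_decode(data):
      out, i = bytearray(), 0
      while data[i] != 0xFF:
          b = data[i]
          if b & 0x80:
              length, dist = b & 0x7F, data[i + 1]
              for _ in range(length):  # byte-by-byte so overlapping copies work
                  out.append(out[-dist])
              i += 2
          else:
              out.append(b)            # plain literal byte
              i += 1
      return bytes(out)

  shown = b"TOBEORNOTT" + bytes([0x85, 0x09]) + b"T" + bytes([0x88, 0x0F, 0xFF])
  alt = b"TOBEORNOT" + bytes([0x86, 0x09, 0x89, 0x0F, 0xFF])  # 14 bytes
  assert toy_decode(shown) == toy_decode(alt) == b"TOBEORNOTTOBEORTOBEORNOT"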

Dylan16807 | 5 hours ago

Not illustrated here, but the vast majority of deflated data uses a dynamic Huffman code, which compresses the Huffman tree itself with another Huffman tree (one whose own shape is in turn sent as plain 3-bit code lengths rather than being fixed). This sort of nested compression is in fact moderately popular; JPEG XL, for example, uses the same technique.
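
(Concretely, the dynamic tree is transmitted as its list of code lengths, run-length encoded with a small extra alphabet: symbol 16 repeats the previous length 3-6 times, 17 encodes a run of 3-10 zeros, and 18 a run of 11-138 zeros; those symbols are then Huffman-coded in turn, per RFC 1951 section 3.2.7. A simplified sketch of just the run-length step, with a function name of my own; a real encoder would then Huffman-code the resulting symbols:)

  def rle_code_lengths(lengths):
      out, i = [], 0                       # (symbol, extra-bits value) pairs
      while i < len(lengths):
          cur, run = lengths[i], 1
          while i + run < len(lengths) and lengths[i + run] == cur:
              run += 1
          if cur == 0 and run >= 11:
              n = min(run, 138)
              out.append((18, n - 11))     # long run of zeros
              i += n
          elif cur == 0 and run >= 3:
              out.append((17, run - 3))    # short run of zeros
              i += run
          elif run >= 4:
              out.append((cur, None))      # send the length once...
              n = min(run - 1, 6)
              out.append((16, n - 3))      # ...then "repeat previous" 3-6 times
              i += 1 + n
          else:
              out.append((cur, None))      # too short for a repeat code
              i += 1
      return out

  print(rle_code_lengths([0] * 5 + [3] * 6 + [0] * 20))
  # [(17, 2), (3, None), (16, 2), (18, 9)]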

lifthrasiir | 6 hours ago