Understanding Deflate
> That said, even in this simple case, decoding by hand was a pain.
Well doing it by hand at this level is mostly decoding quasi-ASCII by hand, and the compression isn't making it significantly more painful.
If you reorder the static huffman table and byte align things, the encoded data could look more like:
T O B E O R N O T T 0x85 0x09 T 0x88 0x0F 0xFF
So short lengths become 0x80 + length and short distances encode as themselves.This version is pretty easy to decode by hand, and still 16 bytes. I'm arguably cheating by removing the 3 bit header, but even 17 bytes would be good.
Though the encoder deciding to repeat those Ts doesn't make much sense, shouldn't this be compressed to literal("TOBEORNOT") copy(6,9) copy(9,15)?
Not illustrated here, but the vast majority of deflated data uses a dynamic Huffman code which compresses the Huffman tree itself with another (fixed) Huffman tree. This sort of nested compression is in fact moderately popular; JPEG XL for example uses the same technique.
but I think the bits must be the opposite "endian" here, ie 2 instead of one
I remember asking the zlib developers many years ago why DEFLATE's bit order was unusual in that it made table-based decoding slightly more complex than necessary, and was told that only Phil Katz knew.
Also of note is that LHA/LZH uses an extremely similar data format where Huffman is also used to encode both literals and lengths in one "alphabet": http://justsolve.archiveteam.org/wiki/LHA Given that LHA very slightly predated ZIP, I suspect the latter was developed based on the former's algorithms.