A student’s desire to get out of an exam led to a compression algorithm

the-mitr | 358 points

There was a series of articles on lossless and lossy compression techniques in, I think, PC Magazine that I read as a kid, and it made a big impression on me. I didn't, like, end up ready to write code to build Huffman trees or anything, but it did change compression from a total black box into a bunch of smaller pieces, each of which a mere mortal can understand.

The compression rabbit hole can be a rewarding one to go down if you haven't. Besides the tech itself making a lot of things work better (like a Web built on human-readable formats, or more specialized stuff like Parquet or other columnar formats, or video on the lossy side), it can give you some tools or perspective that apply elsewhere, e.g. to other probabilistic stuff like caching and predictions for lossless compression, or other signal-processing stuff for lossy compression.

twotwotwo | a year ago

The only CS class I ever took was Huffman's "Cybernetics" (undergrad UCSC CS class) and it was a real mind-blower. The very first day he dove right into sphere packing (https://en.wikipedia.org/wiki/Sphere_packing) and the whole course was a great introduction to information theory.

I remember him describing how he wrote the original solution for huffman compression, crumpled it up, threw it away, and then retrieved it from the trash.

I failed the class which led to a huge imposter syndrome in me that pushed me to learn far more CS than I ever needed to know. Huffman was definitely an arrogant bastard, but he certainly taught me a lot of interesting math.

dekhn | a year ago

Interesting refresher. I do remember in college we had to build the lookup tree for some word as an exercise. Obviously the name Huffman stuck in my brain. But for the love of god, I can't even remember if the lecture mentioned Fano. Seems he was just as important in the design process of what we only refer to as Huffman encoding today.

iforgotpassword | a year ago

Caches drive the internet.

They are orders of magnitude more important than compression. 99.99% of requests hit a local cache. (1)

Compression is important too.

(1) I worked in a telco. Check it out for yourself!

sgt101 | a year ago

> The first step involves yet another clever strategy for identifying repetition and thereby compressing message size, but the second step is to take the resulting compressed message and run it through the Huffman process.

I wonder if this "first step" is Burrows-Wheeler Transform?

Side note: In Silicon Valley (the show), I'm pretty sure that Richard has a picture of David Huffman by his bedside.

denvaar | a year ago

Please pardon my ignorance...

My understanding is that there are thousands of different compression algorithms, each with its own pros/cons depending on the type and characteristics of the file. And yet we still try to pick the "generically best" codec for a given file type (e.g. PNG) and then use that everywhere.

Why don't we have context-dependent compression instead?

I'm imagining a system that scans objects before compression, selects the optimal algorithm, and then encodes the file. The selected compression algorithm could be prefixed for easy decompression.

Compare a single black image that's 1x1 to one that's 1000x1000. The PNGs are 128 bytes and 6 KB, respectively. However, run-length encoding would compress the latter to a size comparable to the former.
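The scheme denvaar describes can be sketched in a few lines. This is a hypothetical illustration, not a real file format: it tries a few stdlib codecs, keeps whichever output is smallest, and prefixes a one-byte algorithm ID so the decoder knows which one to undo.

```python
import bz2, lzma, zlib

# Codec table: ID -> (compress, decompress). ID 0 stores the data
# uncompressed, which wins for tiny or incompressible inputs.
CODECS = {
    0: (lambda d: d,   lambda d: d),
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress,  bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}

def smart_compress(data: bytes) -> bytes:
    # Try every codec and keep the smallest result.
    best_id, best_out = min(
        ((cid, enc(data)) for cid, (enc, _) in CODECS.items()),
        key=lambda pair: len(pair[1]),
    )
    return bytes([best_id]) + best_out

def smart_decompress(blob: bytes) -> bytes:
    # The first byte tells us which codec to undo.
    _, dec = CODECS[blob[0]]
    return dec(blob[1:])
```

In practice this is roughly what some container formats already do: Zip, for instance, records a compression method per entry, so the idea exists, just usually at coarser granularity than per-object scanning.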

joshsabol46 | a year ago

Great read for beginners. I've read about Robert Fano being the creator of Project MAC that turned into MIT's AI lab (CSAIL).

I'm sure AI will end up the winner in compression in the end.

nextmove | a year ago

missed opportunity to explain how compression and prediction are related: the better you can predict the next token, the better your compression gets, and then the article would even get to mention GPT
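The prediction/compression link gfody mentions can be made concrete with cross-entropy: if you code a source using a model q of its true distribution p, you pay an expected -log2(q) bits per symbol, so a better predictor (q closer to p) means fewer bits. The distributions below are made-up numbers for illustration.

```python
import math

def cross_entropy_bits(p, q):
    # Expected bits per symbol when the source follows p but the
    # code is optimized for model q.
    return sum(pi * -math.log2(qi) for pi, qi in zip(p, q))

true_dist  = [0.7, 0.2, 0.1]
bad_model  = [1/3, 1/3, 1/3]   # no prediction: uniform guess
good_model = [0.7, 0.2, 0.1]   # perfect prediction

print(cross_entropy_bits(true_dist, bad_model))   # ~1.585 bits/symbol
print(cross_entropy_bits(true_dist, good_model))  # ~1.157 bits/symbol (the entropy)
```

The perfect predictor achieves the source's entropy, the theoretical floor; any worse model pays extra bits, which is exactly why better next-token prediction means better compression.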

gfody | a year ago

Huffman trees are really cool and a neat little example program to put together.
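For a sense of scale, here is a minimal sketch of the kind of "neat little example program" hackcasual means: build the Huffman tree with a heap of (frequency, subtree) pairs, then walk it to read off the codes.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    # Heap entries: (frequency, tiebreak, tree), where a tree is
    # either a leaf symbol (str) or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least-frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, str):
            codes[node] = prefix or "0"  # single-symbol edge case
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2])
    return codes
```

Rarer symbols end up deeper in the tree and so get longer codes, and because codes are leaves, no code is a prefix of another.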

hackcasual | a year ago

> to the ubiquitous software utility PKZip

Huh? Ubiquitous?

verall | a year ago

In the article the compression doesn’t make sense

If you are only sending one word, and the recipient already needs to know the word, then you only need 1 bit, essentially just signaling that you are saying that specific word

If you want a richer vocabulary, you could create an index of about 300k words (from the English dictionary), shared between the parties

Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits, for any word in the index (2^19 is around 500k)

That’s without even sorting the index by frequency of appearance/usage

27 bits for just one word seems wasteful
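The shared-index scheme nico describes is easy to sketch: both sides hold the same sorted wordlist, and a word is transmitted as its index at a fixed width of ceil(log2(N)) bits. The four-word list below is a stand-in for a real ~300k-entry dictionary.

```python
import math

# Tiny stand-in for a shared ~300k-word dictionary.
WORDS = sorted(["bookkeeper", "cat", "dog", "zebra"])
WIDTH = math.ceil(math.log2(len(WORDS)))  # would be 19 for a 300k-word list

def encode(word: str) -> str:
    # Send the word's index, zero-padded to the fixed width.
    return format(WORDS.index(word), f"0{WIDTH}b")

def decode(bits: str) -> str:
    return WORDS[int(bits, 2)]

print(math.ceil(math.log2(300_000)))  # 19: since 2^19 = 524,288 >= 300,000
```

Sorting the index by word frequency and using a variable-length code (which is exactly what Huffman coding does, at the word level) would do even better than a fixed 19 bits.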

nico | a year ago

I thought Huffman coding was obsolete, replaced by arithmetic coding. Arithmetic coding allows a fractional number of bits per symbol, so it is more efficient.

jhallenworld | a year ago

On a related note, I got an email today saying that Cloudflare will automatically turn on Brotli compression later this month. Does anyone know if Brotli really makes a difference beyond gzip?

yawaramin | a year ago

What would be interesting is something like an LLM attempting to compress some piece of data, and see how it gets better at it over time.

amelius | a year ago

gpt uses huffman coding for tokenization

ftxbro | a year ago

Quanta magazine and all these pop science websites need to be stopped.

(Not because popular science is bad, but because they do it badly, and the clickbait is insufferable)

fguerraz | a year ago