Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)

DrFrugal | 30 points

It looks like your solutions so far expose the result as a plain, accessible filesystem. If you can relax this constraint, you might have an easier time:

"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.

Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest.

Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.
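For illustration, a minimal Python sketch of that second idea. It assumes a hypothetical `read_streams()` helper standing in for whichever MPQ reader you end up using, and it only records names and hashes; a real reversible expander would also have to preserve the header, hash/block tables, and per-block compression flags so the archive can be rebuilt byte-for-byte.

```python
import hashlib, json, os

def expand_mpq(mpq_path, store_dir, manifest_path):
    """Split one MPQ into content-addressed inner streams plus a manifest
    describing how to stitch the original archive back together."""
    os.makedirs(store_dir, exist_ok=True)
    manifest = {"source": mpq_path, "streams": []}
    # read_streams() is a placeholder: it should yield (name, raw_bytes)
    # for every stream stored inside the MPQ.
    for name, data in read_streams(mpq_path):
        digest = hashlib.sha256(data).hexdigest()
        blob = os.path.join(store_dir, digest)
        if not os.path.exists(blob):           # store each unique stream once
            with open(blob, "wb") as f:
                f.write(data)
        manifest["streams"].append({"name": name, "sha256": digest})
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
```

Identical streams shared by many MPQs then collapse to a single file in the store, and each manifest stays tiny.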

mappu | 7 months ago

Options:

- `zstd --train` a dictionary over all the files, then compress everything using that dictionary (see the sketch just after this list). Decompress on demand. Make sure you don't lose the dictionary, or you'll lose all of it.

- OR stick it into a Restic/Borg/Kopia/Duplicacy repository. These use a rolling hash to deduplicate any subsections of files that are the same, before applying compression. You can try it with `--dry-run` first to see how much space you would save. Use their `mount` command to access the files as a directory.
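A minimal sketch of the dictionary route, using the python-zstandard bindings (assumed to be installed); the paths and sizes are made up, and for 10+ TiB you would train on a representative sample rather than reading everything into memory.

```python
import os
import zstandard as zstd

def collect_samples(root):
    """Read files under root into memory as training samples."""
    samples = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            with open(os.path.join(dirpath, name), "rb") as f:
                samples.append(f.read())
    return samples

samples = collect_samples("archive/")                 # hypothetical path
dictionary = zstd.train_dictionary(112_640, samples)  # ~110 KiB dictionary

# Keep the dictionary safe: without it the compressed files are unreadable.
with open("mpq.dict", "wb") as f:
    f.write(dictionary.as_bytes())

cctx = zstd.ZstdCompressor(level=19, dict_data=dictionary)
dctx = zstd.ZstdDecompressor(dict_data=dictionary)

compressed = cctx.compress(samples[0])             # compress one file
assert dctx.decompress(compressed) == samples[0]   # decompress on demand
```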

I would not be surprised if it gets the total size down close-ish to the size of only a single copy of the game.

See the other comment about `precomp` if compression gives you issues.

Personally I would just go with the `restic` route, as that's what it's meant for.

Intralexical | 7 months ago

Instead of duperemove, bees may be more appropriate for your goals.

https://github.com/Zygo/bees

Bees does things at a lower level than just reflinking full files like duperemove does (it deduplicates at the level of extents). Be sure to read the links in the "Recommended Reading" section of the above link to hopefully avoid issues.

sillystuff | 7 months ago

If the MPQ files are not too big and they match across languages, architectures and stages of development (e.g. same file names), the simplest solution is just to compress related files together. Let the compression algorithms figure out the symbols of your alphabet (eventually they will match your compressed files within MPQ archives, or their components). Restoring a specific version of the game then becomes "extract this list of files from archives and place them at these locations."

Random idea (I haven't tested it): if you need to figure out how to cluster together related MPQ files, you could compute Merkle trees (or multiple Merkle trees shifted by prime-number offsets to identify common subsequences at arbitrary offsets) and use selected hashes for similarity search.
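As a rough illustration of the clustering part (not the commenter's exact scheme): fingerprint each file with a set of fixed-size chunk hashes and compare the sets with Jaccard similarity. The chunk size and threshold are arbitrary, and fixed-offset chunks will miss content shifted by arbitrary offsets, which is what the prime-offset (or a content-defined/rolling) chunking idea would address.

```python
import hashlib
from itertools import combinations

CHUNK = 64 * 1024  # arbitrary chunk size

def chunk_hashes(path):
    """Hash each fixed-size chunk of a file; the set acts as a cheap fingerprint."""
    hashes = set()
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            hashes.add(hashlib.sha256(block).digest())
    return hashes

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def similar_pairs(paths, threshold=0.5):
    """Return pairs of MPQ files whose chunk fingerprints overlap enough
    to be worth compressing together."""
    fp = {p: chunk_hashes(p) for p in paths}
    return [(a, b) for a, b in combinations(paths, 2)
            if jaccard(fp[a], fp[b]) >= threshold]
```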

rlupi | 7 months ago

I think you should tighten up your goals - state specific benchmarks as goals:

- bring the size down - you're already more than 90% smaller than the original; where do you stop? The low-hanging branches have been pruned via jdupes, and it looks like trying to slim down MPQ redundancy will require more compute/runtime. How much more are you willing to spend to gain an extra few percent?

- easy shareable format - BTRFS doesn't sound like an easily shareable format, especially with the hardlinks. How will you share it? Even several tens or hundreds of GB can be expensive if people are downloading it often.

- what's your definition of a lower-end machine? Is an SSD required, or is a spinning-rust hard disk allowed? How much RAM? What minimum CPU? Linux/Unix only?

canucker2016 | 7 months ago

This seems like the perfect use case for DwarFS: "DwarFS is a read-only file system with a focus on achieving very high compression ratios in particular for very redundant data."

https://github.com/mhx/dwarfs

tricked | 7 months ago

It looks like you are basically outlining the solution with the description of the file format.

If it is { header, files[], hash_table[], block_table[] }, then you likely can de-duplicate based on the block_table or the hash_table (I'm not familiar with the format itself). It may be prudent to check how many MPQ files have the same blocks inside them (checksumming or hashing, comparing on multiple checksums). If there is some overlap and some blocks are present in multiple files, you can extract those blocks and allow any version of the file to be reconstituted back into an MPQ file - as long as you know which blocks to combine and in which order.
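A small sketch of that check, assuming a hypothetical `iter_blocks()` helper that parses one archive's block_table and yields the raw bytes of each stored block:

```python
import hashlib
from collections import defaultdict

def duplicate_block_report(mpq_paths):
    """Count how often identical blocks appear across a set of MPQ archives.
    iter_blocks() is a placeholder for a block_table parser that yields
    the raw bytes of each stored block."""
    seen = defaultdict(list)             # block hash -> [(archive, index), ...]
    for path in mpq_paths:
        for index, block in enumerate(iter_blocks(path)):
            digest = hashlib.sha256(block).hexdigest()
            seen[digest].append((path, index))
    shared = {h: locs for h, locs in seen.items() if len(locs) > 1}
    total = sum(len(locs) for locs in seen.values())
    print(f"{len(shared)} of {len(seen)} unique blocks occur in more than "
          f"one place ({total} block occurrences in total)")
    return shared
```

If the shared fraction turns out to be high, storing each unique block once plus an ordered list of block hashes per archive is enough to rebuild any MPQ.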

julik | 7 months ago

> jdupes

This might solve a problem I have! Whenever I use Google Takeout, it generates a new copy of a photo in Google Photos for every album that it's in, which means it takes up significantly more space than it should!

I've used fdupes to calculate just how much duplication is happening, and apparently out of 650+ gigabytes of photos, fully 250 gigabytes are duplicates.

pavel_lishin | 7 months ago

It sounds like you're trying to make a more efficient back end for something like eXoDos[0].

You could try reaching out to the developers there and see if they have any solutions.

[0] https://www.retro-exo.com/exodos.html

Teever | 7 months ago

There are open source libraries for interacting with MPQ, e.g., https://github.com/mbroemme/mpq-tools, if you’re not already using one.

rgovostes | 7 months ago

Interesting, though a bit outside what you want (not random access): https://gwern.net/sort

btdmaster | 7 months ago

It seems that the filesystem-level deduplication approach with BTRFS conflicts with your goal of having an easily shareable format. A filesystem is not an ideal archive format, especially one that is essentially only compatible with Linux systems. Also, over a long enough time horizon, filesystems become obsolete and difficult to mount.

Secondly, it seems like it is going to take contortions to utilize BTRFS deduplication with MPQ archives: hacking archives apart into separate extents, fighting with wasted space, etc. You're talking about low-level filesystem manipulations to optimize it for this peculiar use case. And then you would need custom tooling anyway to be able to reconstitute the MPQ files from this filesystem of exploded fragments.

As a simpler, filesystem-agnostic approach, I would move all of the file data into a large content-addressable storage blob, which is by construction deduplicated. Then, strip the file data out of the MPQ archives and leave pointers into this blob. The process to reconstitute an archive is straightforward.

This approach does not rely on filesystem-specific deduplication, is portable to other platforms, and largely avoids the problem of wasted space due to filesystem structures. However, it would be a non-standard archive format, and you would have to make sure your scripts are well-documented enough that they will be usable in the future.

In more detail, to prepare your collection, you would write a script to do these steps:

1. Enumerate each file in each MPQ archive and compute the SHA-256 hash of its data.

2. If this hash has never been seen before, concatenate the file to your large storage blob. Record the hash => (offset, length) mapping to an index file.

3. Strip the file data out of the MPQ archive, leaving behind the SHA-256 hash as a pointer you will use later to look up the (offset, length) in the index and recover the file.

4. Record the SHA-256 hash of the entire original MPQ archive, so you can later verify you've recreated it without loss of data integrity.

To reverse the process, you just need to pull pieces out of the storage blob and reconstitute the original MPQ archive. Check the hash of the recreated file against the original.

The stripped archives should be of negligible size and no further attempt to compress or deduplicate them is necessary. That said, you should try to compress your storage blob with something like the Zstandard Seekable Format. This will work best if you decompress the file data before hashing/storing it, and then re-compress it later when reconstituting the MPQ archive.
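A minimal sketch of steps 1-4, assuming a hypothetical `iter_mpq_members()` reader that yields (name, raw_bytes) for each file inside an archive. It records the pointers in JSON rather than rewriting stripped MPQ containers, which your real tooling would still need to do for byte-exact reconstruction:

```python
import hashlib, json

def pack(mpq_paths, blob_path="store.blob", index_path="store.idx.json"):
    """Steps 1-3: hash each inner file, append unseen data to the blob,
    and write a hash -> (offset, length) index. iter_mpq_members() is a
    placeholder for an MPQ reader yielding (name, raw_bytes)."""
    index = {}                                   # sha256 -> [offset, length]
    archives = {}                                # archive -> recorded pointers
    with open(blob_path, "wb") as blob:
        for mpq in mpq_paths:
            pointers = []
            for name, data in iter_mpq_members(mpq):
                digest = hashlib.sha256(data).hexdigest()
                if digest not in index:          # step 2: store each blob once
                    index[digest] = [blob.tell(), len(data)]
                    blob.write(data)
                pointers.append({"name": name, "sha256": digest})
            # step 4: hash of the whole archive, for later verification
            with open(mpq, "rb") as f:
                whole = hashlib.sha256(f.read()).hexdigest()
            archives[mpq] = {"archive_sha256": whole, "members": pointers}
    with open(index_path, "w") as f:
        json.dump({"index": index, "archives": archives}, f)

def fetch(digest, blob_path="store.blob", index_path="store.idx.json"):
    """Reverse direction: look up a hash and read its bytes back out of the blob."""
    with open(index_path) as f:
        offset, length = json.load(f)["index"][digest]
    with open(blob_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Reconstitution then walks an archive's member list, calls `fetch()` for each hash, rebuilds the MPQ, and verifies the result against the recorded archive hash.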

rgovostes | 7 months ago

I'm surprised to see MPQ come up. What game is it?

zzlk | 7 months ago

Look up the FICLONERANGE ioctl.
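For anyone curious, a hedged sketch of what that looks like from Python on Linux: it asks the filesystem to share (reflink) a byte range between two files, so duplicate MPQ blocks can point at the same extents without copying. The ioctl number and struct layout below are the usual x86-64 Linux values, the call only succeeds on reflink-capable filesystems such as btrfs or XFS, and offsets/lengths generally need to be block-aligned.

```python
import fcntl, struct

# FICLONERANGE = _IOW(0x94, 13, struct file_clone_range) on Linux.
FICLONERANGE = 0x4020940D

def clone_range(src_path, src_offset, length, dst_path, dst_offset):
    """Share a byte range of src with dst instead of duplicating the data."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        # struct file_clone_range { s64 src_fd; u64 src_offset;
        #                           u64 src_length; u64 dest_offset; }
        arg = struct.pack("qQQQ", src.fileno(), src_offset, length, dst_offset)
        fcntl.ioctl(dst.fileno(), FICLONERANGE, arg)
```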

moonshadow565 | 7 months ago

Deduplication is a poor archival strategy. Storage is cheap. 10TiB is a couple of thousand dollars with reasonable redundancy.

Write Once is how to manage an archive.

Displaying a curated subset is a good interface. Good luck.

brudgers | 7 months ago