This seems like the perfect use case for DwarFS: "DwarFS is a read-only file system with a focus on achieving very high compression ratios in particular for very redundant data."
Instead of duperemove, bees may be more appropriate for your goals.
Bees works at a lower level than just reflinking full files like duperemove does (it deduplicates at the level of extents). Be sure to read the links in the "Recommended Reading" section of the above link to hopefully avoid issues.
If the MPQ files are not too big and they match across languages, architectures and stages of development (e.g. same file names), the simplest solution is just to compress related files together. Let the compression algorithm figure out the symbols of your alphabet (eventually those symbols will end up matching the compressed files within your MPQ archives, or their components). Restoring a specific version of the game then becomes "extract this list of files from the archives and place them at these locations."
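To make that concrete, here is a minimal sketch in Python, assuming a hypothetical `releases/<version>/...` layout where files sharing the same relative path count as "related"; each group is tarred together and the tar is compressed, so the compressor sees all the copies side by side:

```python
import tarfile
from collections import defaultdict
from pathlib import Path

import zstandard  # pip install zstandard

RELEASES = Path("releases")   # assumed layout: releases/<version>/<relative path>.mpq
OUT = Path("grouped")
OUT.mkdir(exist_ok=True)

# Group MPQ files that share the same relative path across releases.
groups = defaultdict(list)
for release_dir in RELEASES.iterdir():
    for mpq in release_dir.rglob("*.mpq"):
        groups[mpq.relative_to(release_dir)].append(mpq)

cctx = zstandard.ZstdCompressor(level=19)
for rel_path, members in groups.items():
    target = OUT / (rel_path.as_posix().replace("/", "_") + ".tar.zst")
    with open(target, "wb") as fh, cctx.stream_writer(fh) as zfh:
        # "w|" streams the tar into the (non-seekable) zstd writer, so
        # repeated content across releases compresses against itself.
        with tarfile.open(fileobj=zfh, mode="w|") as tar:
            for member in members:
                tar.add(member, arcname=str(member))  # keep per-release paths
```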
Random idea (I haven't tested it): if you need to figure out how to cluster related MPQ files together, you could compute Merkle trees (or multiple Merkle trees shifted by prime-number offsets, to identify common subsequences at arbitrary offsets) and use selected hashes for similarity search.
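A rough, equally untested sketch of that idea: hash fixed-size chunks at a few prime-shifted offsets and compare the resulting hash sets with Jaccard similarity. It skips the tree structure and just keeps the leaf chunk hashes, which is enough for similarity search; chunk size, the primes and the threshold are arbitrary choices:

```python
import hashlib
from itertools import combinations
from pathlib import Path

CHUNK = 64 * 1024            # arbitrary chunk size
SHIFTS = (0, 4099, 8209)     # prime offsets, to catch matches at shifted positions

def chunk_hashes(path: Path) -> set[bytes]:
    """Hashes of fixed-size chunks taken at several shifted chunkings of the file."""
    data = path.read_bytes()
    hashes = set()
    for shift in SHIFTS:
        for start in range(shift, len(data), CHUNK):
            hashes.add(hashlib.sha256(data[start:start + CHUNK]).digest())
    return hashes

def jaccard(a: set[bytes], b: set[bytes]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Brute-force pairwise comparison; fine for a few thousand files.
files = list(Path("mpq_collection").rglob("*.mpq"))   # placeholder location
sigs = {f: chunk_hashes(f) for f in files}
for a, b in combinations(files, 2):
    score = jaccard(sigs[a], sigs[b])
    if score > 0.3:                                    # arbitrary threshold
        print(f"{score:.2f}  {a}  {b}")
```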
Options:
- `zstd --train` a dictionary over all the files, then compress everything using the dictionary and decompress on demand (see the sketch after this list). Make sure you don't lose the dictionary, or you won't be able to decompress any of it.
- OR stick it into a Restic/Borg/Kopia/Duplicacy repository. These use a rolling hash to deduplicate any subsections of files that are the same, before applying compression. You can try it with `--dry-run` first to see how much space you would save. Use their `mount` command to access the files as a directory.
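For the dictionary route, a rough sketch with the `zstandard` Python bindings (directory names are placeholders, everything is read into memory for brevity, and dictionaries pay off most for many smallish, similar files):

```python
from pathlib import Path

import zstandard  # pip install zstandard

files = sorted(Path("mpq_collection").rglob("*.mpq"))   # placeholder location
samples = [f.read_bytes() for f in files]

# Train a shared dictionary over all samples (112 KiB is zstd's default size).
dictionary = zstandard.train_dictionary(112 * 1024, samples)
Path("mpq.dict").write_bytes(dictionary.as_bytes())      # don't lose this file!

cctx = zstandard.ZstdCompressor(level=19, dict_data=dictionary)
for f in files:
    Path(str(f) + ".zst").write_bytes(cctx.compress(f.read_bytes()))

# Decompression needs the same dictionary:
dctx = zstandard.ZstdDecompressor(
    dict_data=zstandard.ZstdCompressionDict(Path("mpq.dict").read_bytes())
)
# dctx.decompress(...) then restores the original bytes of each .zst file.
```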
I would not be surprised if it gets the total size down close-ish to the size of only a single copy of the game.
See the other comment about `precomp` if compression gives you issues.
Personally I would just go with the `restic` route, as that's what it's meant for.
There are open source libraries for interacting with MPQ, e.g., https://github.com/mbroemme/mpq-tools, if you’re not already using one.
It seems that the filesystem-level deduplication approach with BTRFS conflicts with your goal of having an easily shareable format. A filesystem is not an ideal archive format, especially one that is (more or less) only compatible with Linux systems. Also, over a long enough time horizon, filesystems become obsolete and difficult to mount.
Secondly, it seems like it is going to take contortions to utilize BTRFS deduplication with MPQ archives: hacking archives apart into separate extents, fighting with wasted space, etc. You're talking about low-level filesystem manipulations to optimize it for this peculiar use case. And then you would need custom tooling anyway to be able to reconstitute the MPQ files from this filesystem of exploded fragments.
As a simpler, filesystem-agnostic approach, I would move all of the file data into a large content-addressable storage blob, which is by construction deduplicated. Then, strip the file data out of the MPQ archives and leave pointers into this blob. The process to reconstitute an archive is straightforward.
This approach does not rely on filesystem-specific deduplication, is portable to other platforms, and largely avoids the problem of wasted space due to filesystem structures. However, it would be a non-standard archive format, and you would have to make sure your scripts are well-documented enough that they will be usable in the future.
In more detail, to prepare your collection, you would write a script to do these steps (a sketch follows the list):
1. Enumerate each file in each MPQ archive and compute the SHA-256 hash of its data.
2. If this hash has never been seen before, concatenate the file to your large storage blob. Record the hash => (offset, length) mapping to an index file.
3. Strip the file data out of the MPQ archive, leaving behind the SHA-256 hash as a pointer you will use later to look up the (offset, length) in the index and recover the file.
4. Record the SHA-256 hash of the entire original MPQ archive, so you can later verify you've recreated it without loss of data integrity.
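Here is what steps 1–4 could look like in Python. The `iter_mpq_members` and `strip_mpq` helpers are hypothetical stand-ins for whatever MPQ library or tool you end up using (e.g. something built on mpq-tools); only the content-addressable bookkeeping is shown concretely:

```python
import hashlib
import json
from pathlib import Path

BLOB = Path("storage.blob")
INDEX = Path("index.json")        # hash -> (offset, length)
ARCHIVES = Path("stripped")       # stripped archives + their original hashes

index: dict[str, tuple[int, int]] = {}

def store(data: bytes) -> str:
    """Append data to the blob once per unique SHA-256; return the hash."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in index:
        offset = BLOB.stat().st_size if BLOB.exists() else 0
        with open(BLOB, "ab") as blob:
            blob.write(data)
        index[digest] = (offset, len(data))
    return digest

for mpq_path in Path("mpq_collection").rglob("*.mpq"):        # placeholder location
    # Step 4: hash of the untouched archive, for later verification.
    original_hash = hashlib.sha256(mpq_path.read_bytes()).hexdigest()

    # Steps 1-3: enumerate members, store each one's data, keep only pointers.
    pointers = {}
    for member_name, member_data in iter_mpq_members(mpq_path):   # hypothetical helper
        pointers[member_name] = store(member_data)
    strip_mpq(mpq_path, ARCHIVES / mpq_path.name, pointers)       # hypothetical helper

    (ARCHIVES / (mpq_path.name + ".sha256")).write_text(original_hash)

INDEX.write_text(json.dumps(index))
```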
To reverse the process, you just need to pull pieces out of the storage blob and reconstitute the original MPQ archive. Check the hash of the recreated file against the original.
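Pulling a member back out of the blob is then a couple of lines; rebuilding the MPQ container around it again depends on the same (hypothetical) MPQ tooling:

```python
import hashlib
import json
from pathlib import Path

index = json.loads(Path("index.json").read_text())

def fetch(digest: str) -> bytes:
    """Read one deduplicated member back out of the storage blob and verify it."""
    offset, length = index[digest]
    with open("storage.blob", "rb") as blob:
        blob.seek(offset)
        data = blob.read(length)
    assert hashlib.sha256(data).hexdigest() == digest
    return data

# A rebuild step would then reassemble the archive from the stripped file's
# pointers and the bytes returned by fetch(), and compare the result's SHA-256
# against the recorded original hash.
```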
The stripped archives should be of negligible size, and no further attempt to compress or deduplicate them is necessary. The storage blob itself, though, is worth compressing with something like the Zstandard Seekable Format. This will work best if you decompress the file data before hashing/storing it, and then re-compress it later when reconstituting the MPQ archive.
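If you'd rather not depend on a seekable-zstd implementation, a simplified stand-in (not the official Seekable Format) is to compress the blob as a series of independent fixed-size zstd frames and record each frame's compressed offset; you get the same "decompress only the region you need" property, just with coarser granularity:

```python
import json
from pathlib import Path

import zstandard  # pip install zstandard

FRAME = 4 * 1024 * 1024      # uncompressed bytes per independent frame (arbitrary)
cctx = zstandard.ZstdCompressor(level=19)

frames = []                   # (uncompressed_offset, compressed_offset) per frame
with open("storage.blob", "rb") as src, open("storage.blob.zst", "wb") as dst:
    uoff = 0
    while chunk := src.read(FRAME):
        frames.append((uoff, dst.tell()))
        dst.write(cctx.compress(chunk))   # each call emits one standalone frame
        uoff += len(chunk)
Path("frames.json").write_text(json.dumps(frames))

# To read a (offset, length) region back, decompress only the frame(s) that
# cover it, located via frames.json.
```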
I'm surprised to see MPQ come up. What game is it?
Look up the FICLONERANGE ioctl.
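For reference, invoking it from Python looks roughly like this (Linux-only, needs a reflink-capable filesystem such as Btrfs or XFS; the constant and struct layout are taken from linux/fs.h, so double-check against your headers, and offsets/lengths generally have to be block-aligned):

```python
import fcntl
import struct

FICLONERANGE = 0x4020940D   # _IOW(0x94, 13, struct file_clone_range) from linux/fs.h

def clone_range(src_fd: int, src_offset: int, length: int,
                dest_fd: int, dest_offset: int) -> None:
    """Share (reflink) length bytes from src into dest without copying data."""
    # struct file_clone_range { __s64 src_fd; __u64 src_offset, src_length, dest_offset; }
    arg = struct.pack("qQQQ", src_fd, src_offset, length, dest_offset)
    fcntl.ioctl(dest_fd, FICLONERANGE, arg)

# Example: share the first 1 MiB of a.mpq into b.mpq at offset 0.
with open("a.mpq", "rb") as src, open("b.mpq", "r+b") as dst:
    clone_range(src.fileno(), 0, 1024 * 1024, dst.fileno(), 0)
```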
Deduplication is a poor archival strategy. Storage is cheap. 10TiB is a couple of thousand dollars with reasonable redundancy.
Write Once is how to manage an archive.
Displaying a curated subset is a good interface. Good luck.
It looks like your solutions so far expose the result as a plain, accessible filesystem. If you can relax this constraint you might have an easier time:
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest,
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.