Monday, December 05, 2005

Format tradeoffs for backups on DVD

I have been working lately on improving the backups that I maintain of my own data. In order to be able to make offline backups (beyond reach of malware and most kinds of human error), I burn DVDs. My home directory has recently started to approach the limits of what will fit uncompressed onto a single DVD, so over the weekend I did a little experimenting with compression options.

Ideally, I want to use a format for which the data is:
  • suitable for random access, which eliminates b/gzip'd tar/cpio. I suspect that pkzip would still be in the running (each contained file is compressed independently and much of the metadata is at the end of the archive; see the sketch after this list), but there are no well-supported kernel or userspace filesystem implementations, and I can't put a >4GB pkzip onto an ISO9660.
  • compressed. Even though more than half of the data being backed up consists of my (already compressed) photos, the rest is largely compressible. Standard ISO9660 doesn't do this, although there is a Linux extension (zisofs) for doing so, which is good enough for me.
  • capable of supporting arbitrarily large files (within the capacity of the media)
  • optimally packed; no fixed-minimum-size allocations (these run into the tens of KB for DVD-sized ISO9660 images), but rather a tar-, pkzip- or cpio-like sequential layout. This would be very bad for read-write media, but for write-once copies it seems desirable
  • complete with all of its ext3fs metadata (user, group, permission bits, ACLs where appropriate) and special files (block/character specials, named pipes, symlinks, correct handling of hardlinks, ...)
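
To make the random-access point concrete (as referenced in the first bullet), here is a sketch of per-file random access using Python's zipfile module as a stand-in illustration; the archive and member names are made up. Opening the archive reads only the central directory at the end of the file, after which a single member can be decompressed without touching the rest:

    import zipfile

    # Opening the archive reads only the central directory,
    # which pkzip stores at the end of the file.
    with zipfile.ZipFile("backup.zip") as zf:
        # Browse members without touching their compressed data.
        for info in zf.infolist()[:5]:
            print(info.filename, info.file_size, info.compress_size)

        # Random access: seek to one member's local header and
        # decompress just that member.
        with zf.open("home/photos/img_0001.jpg") as member:
            data = member.read()

    print(len(data), "bytes read without unpacking the archive")
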
Some (but not all) of my objectives could be reached by not placing an ISO9660 image onto the disk in the first place. Going without a filesystem irks me, though; I'd need a really good reason to do it.

ISO9660's biggest problem for me is that few implementations will cope with >2GB files (so I can't even stick a single .tar.bz2 onto a DVD) and none will cope with >4GB files. For a single large archive, that writes off the last ~0.7GB of a 4.7GB single-layer DVD (~15% of its capacity) and the last ~4.5GB of an 8.5GB dual-layer one (~53%). That said, absent a compelling alternative, it's an interesting option.

That pkzip appears to be capable of fast random access intrigues me. In principle it can support 64-bit offsets and therefore large files (i.e. much larger than 4GB); however, the absence of a Linux fs driver for it is an issue. Sadly, most pkzip tools that I've seen are built to handle pkzip-and-cpio-and-tar-and-... and therefore take an extract-to-/tmp approach (because most other archive formats in widespread use do not support random access). A pkzip extractor need not work this way, but all that come to hand do (notwithstanding the text-mode unzip tool, which doesn't allow me to browse, and browsing is rather the point). It turns out that there is another implementation limitation in the Debian/Sarge "zip" package: it can't generate archives which are more than 2GB in size. Some grepping through the source suggests that a macro can be turned on which would allow archives to be up to 4GB in size, but (a) this is not enough, and (b) the config/build system is labyrinthine ("do not run ./configure, just type make; oh, and stay away from GNU make, use my own custom make utility, ..."), so it's not clear how to get this option turned on.
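
To make the "metadata at the end" layout concrete, here is a small sketch (again Python as the illustration language, not one of the tools above) that finds pkzip's End of Central Directory record by scanning backwards from the end of the file. The 32-bit central-directory offset unpacked here is precisely the field that the Zip64 extensions widen to 64 bits; in a Zip64 archive it is pegged at 0xFFFFFFFF and the real offset lives in a separate Zip64 EOCD record, which is how offsets beyond 4GB are expressed:

    import struct

    EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature

    def find_eocd(path):
        with open(path, "rb") as f:
            f.seek(0, 2)
            size = f.tell()
            # The EOCD sits within the last 22 + 65535 bytes (fixed
            # part plus an optional comment), so only the tail of
            # the archive needs to be read.
            tail_len = min(size, 22 + 65535)
            f.seek(size - tail_len)
            tail = f.read(tail_len)
        i = tail.rfind(EOCD_SIG)
        if i < 0:
            raise ValueError("no EOCD record found: not a pkzip archive?")
        (_sig, _disk, _cd_disk, _n_disk, total_entries,
         cd_size, cd_offset, _comment_len) = struct.unpack(
            "<IHHHHIIH", tail[i:i + 22])
        return total_entries, cd_size, cd_offset

    entries, cd_size, cd_offset = find_eocd("backup.zip")
    print(entries, "entries; central directory is", cd_size,
          "bytes at offset", cd_offset)

Everything a reader needs in order to browse or randomly access members lives in that directory at the tail, which is why an extractor could serve files directly rather than unpacking to /tmp.
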

So, I decided to explore the zisofs option (the pipeline is sketched after the list below) but, mindful of the wastage in per-file compression and fixed block allocations, decided to compare the results with other approaches. Here are the results (all sizes are as reported by du with the "-h" option, i.e. binary multiples despite the metric-looking suffixes; it's "--si" that gives true metric prefixes):

  • 4.5GB of source material in a full tree
  • 4.5GB as a standard ISO9660
  • 3.4GB as a .tar.bz2
  • 3.7GB of mkzftree-compressed material in a full tree
  • 3.6GB as a "compressed" ISO9660 (i.e. zisofs, via "mkzftree" pre-processing; note that the space consumed actually dropped compared to the 3.7GB mkzftree tree it was built from)
  • ?.?GB as a pkzip (this was how I discovered the 2GB limit; it aborted after writing 2GB)
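
For completeness, the zisofs pipeline behind the two compressed figures above looks roughly like the following, sketched as Python subprocess calls so the commands are explicit. The paths and the /dev/dvd device are placeholders, and the flags are the standard mkzftree/mkisofs/growisofs ones:

    import subprocess

    src = "/home/me"             # tree to back up (placeholder path)
    ztree = "/scratch/ztree"     # mkzftree's per-file-compressed copy
    iso = "/scratch/backup.iso"

    # 1. Compress each file individually into a parallel tree.
    subprocess.run(["mkzftree", src, ztree], check=True)

    # 2. Master an ISO9660 image; -z emits the zisofs (ZF) records and
    #    -R adds the Rock Ridge metadata (owner, group, permissions,
    #    symlinks) that plain ISO9660 lacks.
    subprocess.run(["mkisofs", "-z", "-R", "-o", iso, ztree], check=True)

    # 3. Burn the finished image.
    subprocess.run(["growisofs", "-dvd-compat", "-Z", "/dev/dvd=" + iso],
                   check=True)
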
So, as predicted, the block allocation and per-file compression in the zisofs case do perform worse than the .tar.bz2 case, more than 8% worse in fact. However, until someone writes (or even I write) a performant pkzipfs driver for Linux, I'll live with this.