
Tuesday, December 27, 2005

Java, XML and pretty-printing

I have a need to pretty print some XML fragments. I figured that, since j2se5 already has a fairly thorough XML API implementation, this would be straightforward. Needless to say, I was mistaken.
Document doc = ... // the document containing the fragment
Element element = ... // the particular element that is to be printed

DOMImplementationLS dils = (DOMImplementationLS) doc.getImplementation();

LSSerializer ser = dils.createLSSerializer();
ser.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE); // the value must be a Boolean, not the string "true"

LSOutput lso = dils.createLSOutput();
lso.setByteStream(System.out);

ser.write(element, lso);
Setting aside the awful design of the API, this appears to be how this is supposed to be done. Inconveniently, the XML implementation in Sun's j2se5 does not support the format-pretty-print configuration parameter. As pretty printing is the entire point of this exercise, I needed to find another way.
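Incidentally, rather than discovering this by trial and error, one can ask the DOMConfiguration up front whether it would accept the parameter at all. A minimal sketch, against the same doc as above (canSetParameter merely reports whether the value would be accepted; it doesn't change the configuration):
Document doc = ... // the document containing the fragment

DOMImplementationLS dils = (DOMImplementationLS) doc.getImplementation();
LSSerializer ser = dils.createLSSerializer();

// Ask before setting: the implementation reports whether it would accept
// the value, without actually changing its configuration.
if (ser.getDomConfig().canSetParameter("format-pretty-print", Boolean.TRUE)) {
    ser.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
} else {
    System.err.println("format-pretty-print is not supported here");
}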

There is another DOM API (there appear to be several) implemented in some versions of Xerces which revolves around a much cleaner Serializer interface (not to be confused with an LSSerializer, of course).
Document doc = ... // the document containing the fragment
Element element = ... // the particular element that is to be printed

OutputFormat of = new OutputFormat();
of.setIndenting(true);
of.setOmitXMLDeclaration(true);

DOMSerializer ser = SerializerFactory.getSerializerFactory(Method.XML)
        .makeSerializer(System.out, of).asDOMSerializer();

ser.serialize(element);
This actually achieves the desired result. Of course, I'd rather have a solution which didn't require stepping outside of the APIs included in j2se5 and, it turns out, I don't have to. The above code is written against Xerces, but not by adding a xerces.jar to my classpath. Rather, I merely require the following import:
import com.sun.org.apache.xml.internal.serialize.*;
That's right; Sun has slipped a copy of Xerces into j2se5. This is fine, but why oh why does the standard API not expose the pretty-printing functionality that's already present in the implementation that they're using?

(Obviously there's a portability problem with this approach. A portable implementation can of course be realised by embedding a version of Xerces in the application jar and, fortunately, Sun's renaming of the packages means that this can be done without getting into namespace clashes. However, for situations where portability to other JVM+library implementations is not a concern, this is a handy shortcut.)
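One further aside: it is possible to get indented output without leaving the javax.* APIs at all, by abusing the TrAX identity transform. I can't vouch for how pretty the result is under every JDK build, and the indent-amount key below is a Xalan-specific hint rather than a standard output property, so treat this as a sketch rather than a recommendation (imports come from javax.xml.transform and its subpackages; checked exceptions are elided, as in the fragments above):
Element element = ... // the particular element that is to be printed

Transformer t = TransformerFactory.newInstance().newTransformer(); // identity transform
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
// Xalan-specific hint; namespace-qualified properties are passed through
// (and ignored) by implementations that don't recognise them.
t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

t.transform(new DOMSource(element), new StreamResult(System.out));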

Tuesday, December 13, 2005

Alexa hosting and deja vu

Alexa is now offering to host search applications which, if I understand correctly, are able to access multiple crawls, not just a "most recent" crawl.

This was a definite headspin because (a) just 10 days ago I had proposed on a closed mailing list that Google do something analogous to this (in response to a Google insider's request for suggestions) and (b) when I punched my camera model (Minolta Dimage EX) into the sample app, the first 10 photos that appeared (out of ~300) were mine!

(That they aren't published on my site added to my astonishment. Charlie published them with my consent, as they were taken when a group stayed at his Dad's cottage, but I've not seen them for a while. When the results page first appeared I did a double take: "those photos look quite similar to some of mine...")

Thursday, December 08, 2005

The Banana Protocol

When I worked in Berlin, I made my first ever use of CVS in a moderately sized team (twelve people) of fulltime employees. We were using it to develop a wholesale wine marketplace on an awful, and rather fragile, e-Commerce platform; such things were popular at the time. One consequence of this was a need to perform integration work and testing at each commit. Rather than implement branching and a gatekeeper, I allowed each developer to commit directly to the trunk. CVS's lack of atomic commits combined with the hour or so that it took to do a successful integration (longer for an unsuccessful one) meant that the potential for conflicting commits, and the corresponding nightmarish debugging sessions, was high.

One lunchtime I discussed this with Ryan Shelswell (friend and, at the time, colleague) and reflected on the need for some sort of physical token, possession of which should be a precondition for starting a commit. No software needed to be integrated, this was purely a human protocol, about allowing developers to avoid treading on each other's toes. I was thinking about a debugging mallet (a "hammer" made entirely of foam, about a metre long, used primarily to harmlessly thump an uncooperative computer during a debugging session in order to vent frustration) because much of a commit session was spent on debugging, but Ryan told me that he had a better idea and that, indeed, he had just the thing. The next day, Ryan appeared with an inflatable B1 about 35cm tall which the team readily accepted for the purpose.

Roll forward a couple of years to working at CounterSnipe, using CVS and encountering a similar problem. So, I described the use of B1 to my colleagues and set about finding a worthy successor. Sadly I failed (we ended up using a rubber ball), but it did solve the problem that we were having with trampling on each other's updates. (We've subsequently switched to SVN which has its own set of problems but, notably, has atomic commits.) One colleague, Jon Mann, was so impressed by the approach, and the obvious absurdity of using an inflatable banana to implement it, that he wrote it up as an RFC which, with his kind permission, I now present for your enjoyment.

Banana Working Group J. Mann
Request for Comments: 9999 Countersnipe
Category: Informational 1st April 2004


The Banana Protocol


Status of this Memo

This memo provides information for the Internet community.
This memo does not specify an Internet standard of any kind.
Distribution of this memo is unlimited.


Abstract

CVS fails to provide atomic transaction guarantees. This can lead
to an unexpected state within the repository. We define a
protocol to ensure that any one commit is fully atomic.


Terminology

The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in
this document, are to be interpreted as described in RFC 2119.


Introduction

The Concurrent Versions System (CVS) is a popular multi-user,
network-transparent version control system.

A "commit" is a CVS transaction that updates the repositories
codebase.


Motivation

"CVS failure to provide atomic transaction guarantees is widely
considered a bug." [1]


The Banana

A single Banana exists for each repository implementing the protocol.

The Banana is a physical entity chosen by the administrator of the
CVS repository.

The Banana represents a token that authorises the holder to action a
commit on the associated repository.

A holding location is chosen to locate the Banana when not held by
a CVS user. This location is chosen by the CVS administrator.

A user of the repository SHALL only commit if they have possession of
the Banana.


Exchange

A CVS user may obtain the Banana via any method they deem necessary.

Quake duels, deception or mindless violence are all acceptable methods.


New User

When a user is added to a repository they are instructed to adhere to
the Banana protocol.

This is the responsibility of the CVS administrator and occurs when
the login for the new user is created.

Ignorance is no excuse.


Lost Banana

This situation occurs when no user claims to hold the Banana and the
Banana is not at its holding location.

The Banana is considered "Missing In Action".

The administrator has the responsibility of replacing the Banana.


Virtual Banana

A situation may occur when a remote user needs to make a commit, but
cannot obtain physical access to the Banana. For example, they may be
working remotely.

In these cases the "Virtual Banana Clause" is invoked.

The user arranges to place the physical Banana in escrow via
another CVS user. They MAY then commit until they authorise the
Banana's release.


Credits

http://www.cvshome.org

The folks who developed CVS for an otherwise excellent product.


References

[1] http://cvsbook.red-bean.com/cvsbook.html#My_commits_seem_to_happen_in_pieces_instead_of_atomically
[2] http://www.cvshome.org/docs/manual/cvs-1.11.6/cvs_10.html#SEC88


Security Considerations

Security issues are not discussed in this memo.


Author's Address

J. Mann

Full Copyright Statement

Copyright Countersnipe (2002). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain
it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction
of any kind, provided that the above copyright notice and this
paragraph are included on all such copies and derivative works.

However, this document itself may not be modified in any way, such
as by removing the copyright notice or references to Countersnipe
or other Internet organizations, except as needed for the purpose
of developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be followed,
or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by Countersnipe or its successors or assigns.

This document and the information contained herein is provided on
an "AS IS" basis and Countersnipe DISCLAIMS ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Monday, December 05, 2005

Format tradeoffs for backups on DVD

I have been working lately on improving the backups that I maintain of my own data. In order to be able to make offline backups (beyond reach of malware and most kinds of human error), I burn DVDs. My home directory has recently started to approach the limits of what will fit uncompressed onto a single DVD, so over the weekend I did a little experimenting with compression options.

Ideally, I want to use a format for which the data is:
  • suitable for random access, which eliminates b/gzip'd tar/cpio. I suspect that pkzip would still be in the running (each contained file is compressed independently and much of the metadata is at the end of the archive), but there are no well-supported kernel or userspace filesystem implementations, and I can't put a >4GB pkzip onto an ISO9660.
  • compressed. Even though more than half of the data being backed up consists of my (already compressed) photos, the rest is largely compressible. Standard ISO9660 doesn't do this, although there is an extension for doing so on Linux, which is good enough for me.
  • capable of supporting arbitrarily large files (within the capacity of the media)
  • optimally packed; no fixed-minimum-size allocations (into the tens of KB for DVD-sized ISO9660 images), but rather a tar-, pkzip- or cpio-like sequential layout. This would be very bad for read-write media, but for write-once copies, it seems desirable
  • complete with all of its ext3fs metadata (user, group, permission bits, ACLs where appropriate) and special files (block/character specials, named pipes, symlinks, correct handling of hardlinks, ...)
Some (but not all) of my objectives could be reached by not placing an ISO9660 image onto the disk in the first place. Doing so irks me, though; I'd need a really good reason to do it.

ISO9660's biggest problem for me is that few implementations will cope with >2GB files (so I can't even stick a single .tar.bz2 onto a DVD) and none will cope with >4GB files (there goes ~15% of single layer DVDs and ~53% of dual-layer ones). That said, absent a compelling alternative, it's an interesting option.

That pkzip appears to be capable of fast random access intrigues me. In principle it can support 64-bit offsets and therefore large files (i.e. much larger than 4GB); however, the absence of a Linux fs driver for it is an issue. Sadly, most pkzip tools that I've seen are built to handle pkzip-and-cpio-and-tar-and-... and therefore take an extract-to-/tmp approach (because most other archive formats in widespread use do not support random access). A pkzip extractor need not work this way, but all that come to hand do (notwithstanding the text-mode unzip tool, which doesn't allow me to browse, which is rather the point). It turns out that there is a further implementation limitation in the Debian/Sarge "zip" package: it can't generate archives which are more than 2GB in size. Some grepping through the source suggests that a macro can be turned on which would allow archives of up to 4GB, but (a) this is not enough and (b) the config/build system is labyrinthine ("do not run ./configure, just type make, oh, and stay away from Gnu make, use my own custom make utility, ..."), so it's not clear how to get this option turned on.
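As an aside, the random-access behaviour is easy to demonstrate from Java: java.util.zip.ZipFile reads the archive's central directory and then pulls out individual entries on demand, with no extract-everything-first step. A minimal sketch (and note that, as far as I know, the J2SE 5 class is subject to the same sub-4GB format limits discussed above):
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipPeek {
    public static void main(String[] args) throws Exception {
        ZipFile zip = new ZipFile(args[0]);         // opens the archive via its central directory
        ZipEntry entry = zip.getEntry(args[1]);     // locate a single entry by name
        if (entry == null) {
            System.err.println("no such entry");
            return;
        }
        InputStream in = zip.getInputStream(entry); // decompress just this one entry
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            System.out.write(buf, 0, n);
        }
        System.out.flush();
        zip.close();
    }
}
Running it as "java ZipPeek archive.zip some/member" (the names here are purely illustrative) streams just that one member to stdout, which is essentially the operation a pkzipfs read() would need to perform.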

So, I decided to explore the zisofs option but, mindful of the wastage in per-file compression and fixed block allocations, decided to compare the results with other approaches. Here are the results (all sizes are as reported by du with the "-h" option):

  • 4.5GB of source material in a full tree
  • 4.5GB as a standard ISO9660
  • 3.4GB as a .tar.bz2
  • 3.7GB of mkzftree-compressed material in a full tree
  • 3.6GB as a "compressed" ISO9660 (or zisofs i.e. via "mkzftree" pre-processing; note that the space consumed dropped compared to the full tree)
  • ?.?GB as a pkzip (this was how I discovered the 2GB limit; it aborted after writing 2GB)
So, as predicted, the block allocation and per-file compression in the zisofs case do perform worse than the .tar.bz2 case, more than 8% worse in fact. However, until someone (perhaps even I) writes a performant pkzipfs driver for Linux, I'll live with this.

Saturday, December 03, 2005

A History of Tom Baker's Scarves

...and here was me thinking that it was the same scarf all the time! Not so.

(via Ego Food)

Slowing of the Gulf Stream/North Atlantic Conveyor, measured

For some time scientists have talked about a possible slowing/stopping of the ocean current which warms Western Europe. It now appears that it may actually be happening, and not just a little bit:
From the amount of water in the subtropical gyre and the flow southwards at depth, they calculate that the quantity of warm water flowing north had fallen by around 30%.