
Tuesday, July 25, 2006

Human names as identifiers

A problem that comes up frequently in the creation of information systems, particularly those which are implemented with computers, is the assignment and use of identifiers to refer to things in the real world. Using real-world identifiers is often problematic, but sometimes unavoidable. Using human names has some special additional problems; this article provides some examples.

A common approach is to assign an internal identifier (integer, UUID, ...) which has no meaning in the outside world. This typically simplifies implementation in two ways. User errors in data entry can be dealt with smoothly: correcting the spelling of someone's name in a Land Registry's database doesn't suddenly cause all real estate registered in his/her name to be orphaned. And corner cases where unique identifiers become not so unique do not arise: courier companies using as an identifier the phone number presented by caller-ID will occasionally confuse multiple tenants of the same serviced office with each other.

Occasionally, however, using internal identifiers, or depending exclusively upon them, is inadequate. An excellent example is the use of cryptographic certificates for communication over an untrusted network, in which external names, whether computer derived (email addresses, domain names) or not (human names), must be handled and matched, rather than simply treated as opaque data.

As all domain names and email addresses have distinguished/canonical representations (indeed, all can be represented uniquely, if not sensibly, in US-ASCII), correct matching is trivial: convert the two addresses to be compared into their canonical representations and then perform a byte-by-byte comparison.
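As a rough illustration of the canonicalise-then-compare approach for domain names only, here is a minimal Python sketch. The function names `canonical_domain` and `same_domain` are made up for this example, and it assumes that Python's built-in `idna` codec (which follows the older IDNA 2003 rules) is an acceptable canonical form. Email addresses are left out, since the part before the `@` has its own rules.

```python
def canonical_domain(name: str) -> bytes:
    """Return an ASCII canonical form of a domain name (sketch only)."""
    # Drop a trailing root dot and fold case, then encode any non-ASCII
    # labels to their "xn--" form using Python's built-in IDNA codec.
    return name.rstrip(".").lower().encode("idna")


def same_domain(a: str, b: str) -> bool:
    # Once both names are canonical, matching is a byte-by-byte comparison.
    return canonical_domain(a) == canonical_domain(b)


if __name__ == "__main__":
    # The Unicode form and its ASCII-compatible encoding name the same domain.
    print(same_domain("Bücher.example", "xn--bcher-kva.example"))
    # Dropping the umlaut gives a different domain: no guessing is involved.
    print(same_domain("bücher.example", "bucher.example"))
```

The point of the sketch is that no linguistic judgement is needed: the canonical form is defined mechanically, so two names either match or they don't.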

A quite different problem arises when trying to match human names. I encountered this problem in a discussion on a CAcert mailing list a while ago. I wrote:

> On Fri, 2006-02-17 at 13:21 +0100, Danilo Buerger wrote:
>
>> The Problem still stands. Cacert should treat the following as the same:
>>
>> ae <=> ä
>> oe <=> ö
>> ue <=> ü
>
> Oh, quite. (My mistake, I was "solving" the wrong problem.)
>
> The problem is even more subtle. While those three identities (plus ß
> <=> ss/sz) can safely be applied to the names of Germans and Austrians,
> this is not necessarily the case for all languages which use the umlaut
> and as the characters themselves (whether Unicode or 8859-*) do not
> specify a language, the question can't be answered without reference to
> the language in which the name is ordinarily used, and even then a
> correct answer will usually require a knowledgable local. (Or multiple
> conflicting correct answers will require multiple locals.) It may even
> be the case that multiple transliterations are used by a single person
> for his/her own name depending upon what language is being used.



(On re-reading this, it occurs to me that my own name goes through this, contrary to my desire, without even a character set change. Britons frequently assume Rowland. Francophones, in both Quebec and France, assume, or even insist (!), Rolland.)

> (Further examples include è,é, and ê, which, when transliterating from
> French to ASCII tend all to end up as "e", although in some rare cases ê
> becomes "es". If we move on to Greek, most of the consonants have
> ambiguous transliterations, for instance Φ transliterations are split
> near uniformly between "f" and "ph" depending upon the age of the word.
> Dare I even mention south-east Asian languages? Chinese has one written
> form, two different spoken forms and multiple machine character set
> representations.)

I went on to suggest that CAcert's only way to handle this entire problem, if it were inclined to do so, would be to permit assurers to assure multiple spellings of an assuree's name; in other words, that facilitating computerised matching of human names for languages beyond English requires that each real world "thing" (person) be able to have multiple real world identifiers associated with it.
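To make the suggestion concrete, here is a minimal Python sketch of one possible shape for such a record: an opaque internal identifier plus a set of spellings that have each been explicitly assured. This is not how CAcert works; the `Person` class, its method names and the example name are all hypothetical. The only automatic step is Unicode NFC normalisation, which merges different encodings of the same characters and makes no linguistic guesses such as ü <=> ue.

```python
import unicodedata
import uuid
from dataclasses import dataclass, field


def _nfc(name: str) -> str:
    # Merge composed and decomposed encodings of the same characters.
    return unicodedata.normalize("NFC", name)


@dataclass
class Person:
    # Opaque internal identifier with no meaning in the outside world.
    person_id: uuid.UUID = field(default_factory=uuid.uuid4)
    # Every spelling here was checked by a human; none is derived by rule.
    assured_names: set = field(default_factory=set)

    def assure_name(self, name: str) -> None:
        self.assured_names.add(_nfc(name))

    def matches(self, name: str) -> bool:
        # Exact match against any assured spelling; no transliteration.
        return _nfc(name) in self.assured_names


if __name__ == "__main__":
    person = Person()
    person.assure_name("Jürgen Müller")    # hypothetical assuree
    person.assure_name("Juergen Mueller")  # second spelling, also assured
    print(person.matches("Juergen Mueller"))
    print(person.matches("Jurgen Muller"))  # never assured, so no match
```

The design choice is that the decision about which spellings are equivalent stays with a knowledgeable human (the assurer), while the computer only performs exact comparison.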

During my recent trip to Prague, I learned of a further complication (and indeed, further support for my suggestion): whereas for English speakers our given and family names are essentially fixed, for Czech speakers their names are subject to change for gender and case. On gender: an Australian friend of Czech descent gets odd looks from Czech border officials when she presents a passport which has a Czech family name that is clearly in its male form (Hrouda rather than Hroudová); it is written this way because, as she was born in Australia, she took her father's family name as-is. On case: see Wikipedia's Czech name article for examples; essentially, as the family name is used as an adjective, full declension is required. I suspect that this may be true of other languages.

(N.B. The declension problem can be solved, at least in Czech, by always using a particular form, which is presumably how Czechs deal with it. The gender issue may still cause problems in some cases though; automated name matching in genealogical systems for example.)

Traps for the unwary, or the parochial...