Sunday, April 27, 2008

The Trouble with Names

The trouble with people as subjects is that they have names, and that personal names are hard.
  • Names in the text may be illegible or incomplete, so that Reese _____ and Mr ____ Edmonds require special handling.
  • Names need to be remembered by the scribe during transcription. I discovered this the hard way.

    After doing some research in secondary documents, I was able to "improve" the entries for Julia's children. Thus Kate Harvey became Julia Katherine Brumfield Harvey and Mollie Reynolds became Mary Susan Brumfield Reynolds.

    The problem is that while I'm transcribing the diaries, I can't remember that "Mary Susan" == Mollie. The diaries consistently refer to her as Mollie Reynolds, and the family refers to her as Mollie Reynolds. No other person working on the diaries is likely to have better luck than I've had remembering this. After fighting with the improved names for a while, I gave up and changed all the full names back to the common names, leaving the full names in the articles for each subject.

  • Names are odd ducks when it comes to strings. "Mr. Zach Abel" should be sorted before "Dr. Anne Zweig", which requires either human intervention to break the string into component parts or some serious parsing effort. At this point my subject list has become unwieldy enough to require sorting, and the index generation code for PDFs is completely dependent on this kind of separation.

I'm afraid I'll have to solve all of these problems at the same time, as they're all interdependent. My initial inclination is to have subject articles for people allow the user to specify a full name in all its component parts. If none is supplied, I'll populate the parts via a series of regular expressions. This will probably also require a hard look at how both TEI and DocBook represent names.
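
To make the regular-expression fallback concrete, here is a minimal sketch in Python. It is not the code behind the subject list: the honorific list, the `parse_name` and `sort_key` helpers, and the rule that the last word is the surname are all assumptions made for illustration.

```python
import re

# Hypothetical honorific list; a real subject list would need its own.
HONORIFICS = r"(?:Mr|Mrs|Miss|Ms|Dr|Rev|Capt|Col)"

# Optional honorific, optional given name(s), then a final word
# that is assumed (naively) to be the surname.
NAME_RE = re.compile(
    rf"^(?:(?P<prefix>{HONORIFICS})\.?\s+)?"
    r"(?:(?P<given>.+)\s+)?"
    r"(?P<surname>\S+)$"
)


def parse_name(display_name):
    """Split a display name into prefix, given and surname parts."""
    cleaned = display_name.strip()
    match = NAME_RE.match(cleaned)
    if match is None:
        # Nothing recognisable: keep the whole string as the surname.
        return {"prefix": "", "given": "", "surname": cleaned}
    return {part: (value or "") for part, value in match.groupdict().items()}


def sort_key(display_name):
    """Sort by surname, then given name, ignoring the honorific."""
    parts = parse_name(display_name)
    return (parts["surname"].lower(), parts["given"].lower())


subjects = ["Dr. Anne Zweig", "Mr. Zach Abel", "Mollie Reynolds"]
ordered = sorted(subjects, key=sort_key)
```

With this key, "Mr. Zach Abel" sorts ahead of "Dr. Anne Zweig". Illegible or incomplete names would still need a human to fill in the parts, which is why the user-specified parts should take precedence over anything the expressions guess.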


Gavin Robinson said...

One thing that might help is a way of marking up a name to indicate that it's a name without making any claim about who it is. In TEI the <persName> tag just indicates the name of a person. It's the key attribute which identifies it and links it to other instances of the same person or an external data source.

With the regimental history I found that there are likely to be huge economies of scale by making the de-duping a separate phase. I identified 325 individuals, and it would have been really impractical to try to link them as I went along. Much easier to pull them all out into a database or spreadsheet, sort them, add keys, then feed the values back into the XML.
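
As a rough illustration of that separate phase (not the workflow actually used for the regimental history), the Python sketch below pulls every persName out of a document, sorts the distinct names for review, and writes keys back into the key attribute. The sample text, the key values, and the helper names are invented, and the XML is left un-namespaced for brevity, which real TEI would not be.

```python
import xml.etree.ElementTree as ET

# Invented sample; real TEI would carry the TEI namespace.
SAMPLE = """<text>
  <p><persName>Mollie Reynolds</persName> visited
     <persName>Kate Harvey</persName>.</p>
  <p><persName>Mollie Reynolds</persName> came again.</p>
</text>"""


def extract_names(root):
    """Return the distinct persName strings, sorted for human review."""
    return sorted({(el.text or "").strip() for el in root.iter("persName")})


def apply_keys(root, keys):
    """Write reviewed keys back onto each matching persName element."""
    for el in root.iter("persName"):
        name = (el.text or "").strip()
        if name in keys:
            el.set("key", keys[name])


root = ET.fromstring(SAMPLE)
names = extract_names(root)

# In practice this mapping would come back from the spreadsheet
# or database after an editor has assigned the keys.
keys = {
    "Kate Harvey": "harvey_julia_katherine",
    "Mollie Reynolds": "reynolds_mary_susan",
}
apply_keys(root, keys)
keyed = [(el.text, el.get("key")) for el in root.iter("persName")]
```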

One problem with this approach, especially for a small volunteer project, is that it takes away some of the fun. Transcribers are more likely to feel that they're working on a production line. On the other hand, transcribing documents and identifying people need different skills and experience and involve different kinds of responsibility. Record linkage is more of an editorial thing and often requires arbitrary subjective decisions.

With the Wenham letters (and it's fine for you to mention me and this project - I definitely do endorse what I've seen of FtP so far) it's easy because there are so few names and because I've already done enough background research to know who they are. I suspect this is an unusual situation. In most cases collection owners might not know what they're going to find in the manuscript, so defining all the people in advance is going to be impossible. Even with all the background research I've done on the regimental history, there are still some name keys which might change as I don't know their first names yet and won't be able to find out without some digging in the archives.

Another advantage of being able to mark a name in some semantic way is that it makes it easier to add microformats or other metadata to the HTML.

Ben W. Brumfield said...

I think you're right about doing disambiguation/identification/annotation in a separate step. I've done a little of that with my secondary reference, and it does resolve a lot. The problem is that I still need to build a subject combining tool, as I mentioned in the previous post.

Thanks for the microformat suggestion -- looks like I'll need to be prepared to export in hCard, TEI, and DocBook. None of these really have provisions for the lifespan suffixes we've both been using, nor for the number (serial?) you've got after Wenham, William in FtP. Any ideas on those?

Gavin Robinson said...

The regularized forms of names with dates and service numbers are just a continuation of what I've been doing in TEI with the regimental history. These are all regularized names that I've created myself to use in the key attribute of the persName tag. They don't ever appear in this form in the original text.

This kind of thing doesn't quite fit into hCard for a couple of reasons. First, the "humans first, machines second" maxim limits the amount of extra metadata that can be put in. The basic purpose of microformats is to mark up the semantic meaning of information which is already present on the page. In terms of historical manuscripts and digital editions of books, that pretty much limits you to what's in the original text. I think the scope for adding database keys or regularized forms is quite limited.

Second, when a service number appears in the text, hCard has no mechanism for representing it. Any metadata standard which is capable of dealing with soldiers is going to need specific markup for service numbers, but so far I don't think there is one. When I've had more time to think about use cases, advantages and disadvantages, I might try recommending to the microformat people that they add a service number extension.

Dealing with service numbers in TEI is no problem at all. Where they occur in the original text I've marked them with <name> tags with attribute value type="servicenumber". I think the new biographic extensions in TEI P5 can probably handle dates of birth and death too.

Ben W. Brumfield said...

Gavin, would you mind e-mailing me a chunk of TEI-markup text from the regimental history that includes persName tags? I think it would really help me out with the design of this thing.

Regarding hCard, I'm not talking about using it for internal representation of name structure. Like TEI and DocBook, I'm looking at it as an output format. Internally, I'll be breaking the name parts into separate fields in the RDBMS for sorting purposes, then converting them for print/export/display. It's the display version where I was thinking about using hCard. Do you think that adding microformats to the subject/reader views would be valuable? I'm not familiar enough with how microformats work to know how appropriate they are for historical data.
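
A sketch of what I mean by separate fields feeding different outputs, in Python. The `to_hcard` and `index_form` functions are hypothetical names, not anything in the current code; the class names (vcard, fn, n, honorific-prefix, given-name, family-name) are the standard hCard ones.

```python
from html import escape


def to_hcard(prefix="", given="", family=""):
    """Render separately stored name parts as an hCard fragment."""
    fields = (
        ("honorific-prefix", prefix),
        ("given-name", given),
        ("family-name", family),
    )
    # Skip empty parts so incomplete names still render cleanly.
    spans = [
        f'<span class="{css_class}">{escape(value)}</span>'
        for css_class, value in fields
        if value
    ]
    return (
        '<span class="vcard"><span class="fn n">'
        + " ".join(spans)
        + "</span></span>"
    )


def index_form(given, family):
    """Render the same parts in surname-first order for a printed index."""
    return f"{family}, {given}" if given else family


display = to_hcard("Dr.", "Anne", "Zweig")
index_entry = index_form("Anne", "Zweig")
```

The point is only that the display and index forms are both derived from the stored parts, so neither has to be parsed back out of the other.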

Gavin Robinson said...

XML is on its way.

I'm not very familiar with microformats, and maybe it's too early to tell how useful they're going to be. I suspect that for hardcore historical purposes RDF might be better. Right now historians who do web scraping seem to be in a small minority, but maybe The Programming Historian will change that. But for anyone who wants to pull out all the names in a document, microformats would certainly make it a lot easier.