Sunday, April 27, 2008

The Trouble with Names

The trouble with people as subjects is that they have names, and that personal names are hard.
  • Names in the text may be illegible or incomplete, so that Reese _____ and Mr ____ Edmonds require special handling.
  • Names need to be remembered by the scribe during transcription. I discovered this the hard way.

    After doing some research in secondary documents, I was able to "improve" the entries for Julia's children. Thus Kate Harvey became Julia Katherine Brumfield Harvey and Mollie Reynolds became Mary Susan Brumfield Reynolds.

    The problem is that while I'm transcribing the diaries, I can't remember that "Mary Susan" == Mollie. The diaries consistently refer to her as Mollie Reynolds, and the family refers to her as Mollie Reynolds. No other person working on the diaries is likely to have better luck remembering this than I've had. After fighting with the improved names for a while, I gave up and changed all the full names back to the common names, leaving the full names in the articles for each subject.

  • Names are odd ducks when it comes to strings. "Mr. Zach Abel" should be sorted before "Dr. Anne Zweig", which requires either human intervention to break the string into component parts or some serious parsing effort. At this point my subject list has become unwieldy enough to require sorting, and the index generation code for PDFs depends entirely on this kind of separation.
I'm afraid I'll have to solve all of these problems at the same time, as they're all interdependent. My initial inclination is to let subject articles for people specify a full name in all its component parts. If the user doesn't supply them, I'll populate the parts via a series of regular expressions, as in the sketch below. This will probably also require a hard look at how both TEI and DocBook represent names.
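
Here's a minimal sketch of the kind of regular-expression fallback I have in mind. The honorific list and the patterns are illustrative assumptions, not a finished parser; real names will defeat any single regex, which is exactly why user-supplied parts should win whenever they exist.

```python
import re

# Hypothetical fallback parser: derive name parts from a display name
# when the user hasn't supplied them. Honorifics and patterns here are
# illustrative, not exhaustive.
NAME_PATTERN = re.compile(
    r"^(?:(?P<title>Mr|Mrs|Miss|Dr|Rev)\.?\s+)?"  # optional honorific
    r"(?:(?P<given>[A-Z][a-z]+|_+)\s+)?"          # given name, or a blank like "____"
    r"(?P<surname>[A-Z][a-z]+|_+)$"               # surname, or a blank
)

def parse_name(display_name):
    """Return a dict of name parts; unparseable names become bare surnames."""
    match = NAME_PATTERN.match(display_name.strip())
    if not match:
        return {"title": None, "given": None, "surname": display_name}
    return match.groupdict()

def sort_key(display_name):
    """Sort by surname, then given name, ignoring honorifics."""
    parts = parse_name(display_name)
    return ((parts["surname"] or "").lower(), (parts["given"] or "").lower())

names = ["Dr. Anne Zweig", "Mr. Zach Abel", "Mollie Reynolds"]
print(sorted(names, key=sort_key))
# ['Mr. Zach Abel', 'Mollie Reynolds', 'Dr. Anne Zweig']
```

Once the surname is isolated, sorting the subject list and generating the PDF index both fall out of the same key function.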

Friday, April 25, 2008

Feature: Data Maintenance Tools

With only two collections of documents, fewer than a hundred transcriptions, and only a half-dozen users who could be charitably described as "active", FromThePage is starting to strain under the weight of its data.

All of this has to do with subjects. These are the indexable words that provide navigation, analysis, and context to readers. They're working out pretty well, but frequency of use has highlighted some urgent features to be developed and intolerable bugs to be fixed:
  • We need a tool to combine subjects. Early in the transcription process, it was unclear to me whether "the Island" referred to Long Island, Virginia -- a nearby town with a post office and railroad station -- or someplace else. Increasing familiarity with the texts, however, has shown that "the Island" is indeed "Long Island".

    The best interface for this sort of deduplication is LibraryThing's, which is so convenient that it has inspired the ad-hoc formation of a group of "combining" enthusiasts -- an astonishing development, since this is usually the dullest of chores. A similar interface would invite the user viewing "the Island" to consider combining that subject with "Long Island". This requires an algorithm to suggest candidates for combination, itself no trivial task (see the sketch after this list).

  • We need another tool to review and delete orphans. As identification improves, we've been creating new subjects like "Reese Smith" and relinking previous references to "Reese" to the improved subject. This leaves the old, incomplete subject without any referents, but also without any way to prune it.
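
As a starting point for that suggestion algorithm, here's a sketch using nothing more than standard-library string similarity. The threshold and the containment heuristic are guesses to be tuned against real subjects, not a settled design.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two subject titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest_combinations(title, all_titles, threshold=0.6):
    """Suggest other subjects that might be duplicates of `title`."""
    candidates = []
    for other in all_titles:
        if other == title:
            continue
        score = similarity(title, other)
        # Treat one title containing the other as a strong signal --
        # a guess based on how duplicates like "Reese" / "Reese Smith"
        # have arisen in the diaries so far.
        if title.lower() in other.lower() or other.lower() in title.lower():
            score = max(score, 0.9)
        if score >= threshold:
            candidates.append((score, other))
    return [t for _, t in sorted(candidates, reverse=True)]

subjects = ["the Island", "Long Island", "Reese", "Reese Smith", "chickens"]
print(suggest_combinations("the Island", subjects))  # ['Long Island']
print(suggest_combinations("Reese", subjects))       # ['Reese Smith']
```

A LibraryThing-style screen would then show these candidates beside the subject being viewed and let the user confirm or dismiss each one.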

Autolink has become utterly essential to transcriptions, since it allows the scribe to identify the appropriate subject as well as the syntax to use for its link. Unfortunately, it has a few serious problems:

  • Autolink looks for substrings within the transcription without paying attention to word boundaries. This led to some odd suggestions before this week, but since "ink" and "hen" were added to subjects' link texts, autolink has started behaving egregiously: the word "then" is unhelpfully expanded to "t[[chickens|hen]]". I'd still like to match display text regardless of inflectional endings, but it's time to move the matching algorithm to a more conservative heuristic (see the sketch after this list).
  • Autolink also suggests matches from subjects that reside in different collections. It's wholly unhelpful for autolink to suggest a tenant farmer in rural Virginia within the context of UK soldiers in a WWI prison camp. This is a classic cross-talk bug, and it needs fixing.
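
Here's a sketch of the more conservative matching I have in mind, addressing both bugs at once. The subject shape (a title, some link texts, and a collection id) is an assumption for illustration, not the actual schema.

```python
import re

def autolink_suggestions(text, subjects, collection_id):
    """Suggest [[subject|display]] links for whole words only,
    restricted to subjects in the scribe's own collection."""
    suggestions = []
    for subject in subjects:
        # Cross-talk fix: never suggest subjects from other collections.
        if subject["collection_id"] != collection_id:
            continue
        for link_text in subject["link_texts"]:
            # Word-boundary fix: "hen" no longer matches inside "then".
            pattern = re.compile(r"\b" + re.escape(link_text) + r"\b",
                                 re.IGNORECASE)
            for match in pattern.finditer(text):
                suggestions.append(
                    (match.start(), f"[[{subject['title']}|{match.group()}]]"))
    return sorted(suggestions)

subjects = [{"title": "chickens", "link_texts": ["hen", "hens"],
             "collection_id": 1}]
print(autolink_suggestions("And then the hen pecked the ink.", subjects, 1))
# [(13, '[[chickens|hen]]')]
```

Inflectional endings could then be handled by listing "hens" as an explicit link text, as above, rather than by loosening the substring match.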

Monday, April 14, 2008

Collaborative transcription, the hard way

Archivalia has kept us informed about the many manuscript projects that went online in Europe last week, offering commentary along the way. Perhaps my favorite exchange was about the Chronicle of Sweder von Schele on the Internet:
With this project, the initial aim is to supplement and improve the transcription. To that end, new transcriptions can be sent by e-mail to the institutions participating in the project. After editorial review, the pages will be swapped out.

My goodness, how antediluvian. Has no one ever heard of a wiki?
That's right -- the online community may send in corrections and additions to the transcription by mail.

Friday, April 11, 2008

Who do I build for?

Over at ClioWeb, Jeremy Boggs is starting a promising series of posts on the development process for digital humanities projects. He's splitting the process into five steps, which may happen at the same time but still follow a rough sort of order. Step one, obviously, is "Figure out what exactly you’re building."

But is that really the first step? I'm finding that what I build is dependent on who I'm building for. Having launched an alpha version of the product and shaken out many of the bugs in the base functionality, I'm going to have to make some tough decisions about what I concentrate my time on next. These all revolve around who I'm developing the product for:

My Family: Remember that I developed the transcription software to help accomplish a specific goal: transcribe Julia Brumfield's diaries to share with family members. The features I should concentrate on for this audience are:
  • Finish porting my father's 1992 transcription of the 1918 diary to FromThePage.
  • Fix zoom.
  • Improve the collaborative tools for discussing the works and figuring out which pages need review.
  • Build out the printing feature, so that the diaries can be shared with people who have limited computer access.
THATCamp attendees: Probably the best feedback I'll get on analytical tools like subject graphs will come from THATCamp. This means that any analytical features I'd like to demo need to be presentable by the middle of May.

Institutional Users: This blog has drawn interest from a couple of people looking for software for their institutions to use. For the past month, I've corresponded extensively with John Rumm, editor of the Digital Buffalo Bill Project, based at the McCracken Research Library at the Buffalo Bill Historical Center. Though our conversations are continuing, it seems the main features his project would need are:
  • Full support for manuscript transcription tags used to describe normalization, different hands, corrections/revisions to the text, illegible text and other descriptors used in low-level transcription work. (More on this in a separate post)
  • Integration with other systems the project may be using, like Omeka, Pachyderm, MediaWiki, and such.
My Volunteers: A few of the people who've been trying out the software have expressed interest in using it to host their own projects. More than any other audience, their needs would push FromThePage towards my vision of unlocking the filing cabinets in family archivists' basements and making these handwritten sources accessible online. We're in the very early stages of this, so I don't yet know what requirements will arise.

The problem is that there's very little overlap between the features these groups need. I will likely concentrate on family and volunteers, while doing the basics for THATCamp. I realize that's not a very tight focus, but it's much clearer to me now than it was last week.

Monday, April 7, 2008

Progress Report: One Month of Alpha Testing

FromThePage has been on a production server for about a month now, and the results have been fascinating. The first few days' testing revealed some shocking usability problems. In some places (transcription and login most notoriously) the code was swallowing error messages instead of displaying them to the user. Zoom didn't work in Internet Explorer. And there were no guardrails that kept the user from losing a transcription-in-progress.

After fixing these problems, the family and friends who'd volunteered to try out the software started making more progress. The next requests that came in were for transcription conventions. After about three requests for these, I started displaying the conventions on the transcription screen itself. This seems to have been very successful, and is something I'd never have come up with on my own.

The past couple of weeks have been exciting. My old college roommate started transcribing entries from the 1919 diary, and entered about 15 days of January -- all in two days' work. In addition to his technical feedback, two things I'd hoped for happened:
  • We started collaborating. My roommate had transcribed the entries on a day when he'd had lots of time. I reviewed his transcriptions and edited the linking syntax. Then my father edited the resulting pages, correcting names of people, events, and animals based on his knowledge of the diaries' context.
  • My roommate got engaged with the material. His technical suggestions were peppered with comments on the entries themselves and the lives of the people in the diaries. I've seen this phenomenon at Pepys Diary Online, but it's really heartening to get a glimpse of that kind of engagement with a manuscript.
I've also had some disappointments. It looks like I'll have to discard my zoom feature and replace it with something built on the Google Maps API. People now expect to pan by dragging the image, and my home-rolled system simply can't support that.

I had really planned on developing printing and analytical tools next, but we're finding that the social software features are becoming essential. The bare-bones "Recent Activity" panel I slapped together one night has become the most popular feature. We need to know who edited what, which pages remain to be transcribed, and which transcriptions need review. I've resuscitated this winter's comment feature, polished the "Recent Activity" panel, and am working on a system for displaying each page's transcription status in the work's table of contents.
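
For that status display, something as simple as an enumeration per page may suffice. Here's a sketch under assumed states -- these are not FromThePage's actual statuses or data model:

```python
from enum import Enum

class PageStatus(Enum):
    """Assumed transcription states; the real app may need more."""
    UNTRANSCRIBED = "needs transcription"
    TRANSCRIBED = "needs review"
    REVIEWED = "complete"

def toc_entry(page_title, status):
    """Render one table-of-contents line with its transcription status."""
    return f"{page_title:<20} [{status.value}]"

pages = [("January 1, 1919", PageStatus.REVIEWED),
         ("January 2, 1919", PageStatus.TRANSCRIBED),
         ("January 3, 1919", PageStatus.UNTRANSCRIBED)]
for title, status in pages:
    print(toc_entry(title, status))
```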

All of these developments are the result of several hours of careful, thoughtful review by volunteers like my father and my roommate. There is simply no way I could have invited the public into the site as it stood a month ago, though I did not know that at the time. There's still a lot to be done before FromThePage is "ready for company", but I think it's on track.

If you'd like to try the software out, leave me a comment or send mail to alpha.info@fromthepage.com.

Friday, April 4, 2008

Review: IATH Manuscript Transcription Database

This is the second of two reviews of similar transcription projects I wrote in correspondence with Brian Cafferelli, an undergraduate working on the WPI Manuscript Transcription Assistant. In this correspondence, I reviewed systems by their support for collaboration, automation, and analysis.

The IATH Manuscript Transcription Database was a system for producing transcriptions, developed for the Salem Witch Trials court records. It allowed full collaboration within an institutional setting: an administrator assigned transcription work to a transcriber, who then pulled that work from the queue, edited it, and submitted a transcription. Presumably there was some sort of review performed by the admin, or a proofreading step done by comparison against another user's transcription. As near as I can tell from the manual, no dual-screen transcription editor was provided. Rather, transcription files were edited outside the system and passed back and forth using the MTD.

I'm a bit hazy on all this because after reviewing the docs and downloading the source code, I sent an inquiry to IATH about the legal status of the code. It turns out that while the author intended to release the system to the public, this was never formally done. The copyright status of the code is murky, and I view it as tainted IP for my purpose of producing a similar product. I'm afraid I deleted the files from my own system, unread.

For those interested in reading more, here is the announcement of the MTD on the Humanist list. The full manual, possibly with archived source code, is accessible via the Wayback Machine. They've pulled this from the IATH site, presumably because of the IP problems.

So the MTD was another pure-production system. Automation and collaboration were fully supported, though the collaboration was designed for a purely institutional setting. Systems of assignment and review are inappropriate for a volunteer-driven system like my own.

My correspondent at IATH did pass along a piece of advice from someone who had worked on the project: "Use CVS instead". What I gather they meant by this was that the system for passing files between distributed transcribers, collating those files, and recording edits is a task that source code repositories already perform very well. This does nothing to replace transcription production tools akin to mine or the TEI editors, but the whole system of editorial oversight and coordination provided by the MTD is a subset of what a source code repository can do. A combination of Subversion and Trac would be a fantastic way to organize a distributed transcription effort, absent a pure-hosted tool.

This post contains a lot more speculation and supposition than usual, and I apologize in advance to my readers and the IATH for anything I've gotten wrong. If anyone associated with the MTD project would like to set me straight, please comment or send me email.