Friday, June 26, 2009

Connecting With Readers

While editing and annotating Julia Brumfield's 1919 diary, I've tried to do research on the people who appear there. Who was Josie Carr's sister? Sites like FindAGrave.com can help, but the results may still be ambiguous: there are two Alice Woodings buried in the area, and either could be a match.

These questions could be resolved pretty easily through oral interviews -- most of the families mentioned in the diaries are still in the area, and a month spent knocking on doors could probably flesh out the networks of kinship I need for a complete annotation. However, that's really not time I have, and I can't imagine cold-calling strangers to ask nosy questions about their families -- I'm a computer programmer, after all.

It turns out that there might be an easier way. After Sara installed Google Analytics on FromThePage, I've been looking at referral log reports that show how people got to the site. Here's the keyword report for June, showing what keywords people were searching for when they found FromThePage:

Keyword                     Visits   Pages/Visit   Avg. Time on Site (sec)
"tup walker"                    21      12.47619                  992.8571
"letcher craddock"               7      12.42857                  890.1429
julia craddock brumfield         3      28                        624.3333
juliacraddockbrumfield           3      74.33333                 1385
"edwin mayhew"                   2       7                        117.5
"eva mae smith"                  2       4                         76.5
"josie carr" virginia            2       6.5                      117
1918 candy                       2       4                         40
clack stone hubbard              2      55.5                     1146

These website visitors are fellow researchers, trying to track down the same people that I am. I've got them on my website, they're engaged -- sometimes deeply so -- with the texts and the subjects, but I don't know who they are, and they haven't contacted me. Here are a couple of ideas that might help:
  1. Add an introductory HTML block to the collection homepage. This would let collection editors explain their project, solicit help, and provide whatever contact information they choose.
  2. Add a 'contact us' footer displayed on every page of the collection, whether the user is viewing a subject article, reading a work, or viewing a manuscript page. Since people are finding the site via search engines, they're navigating directly to pages deep within a collection, so we need to display 'about this project', 'contact us', or 'please help' messages on those pages too -- see the sketch below.
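
For the footer (idea 2), here's a minimal sketch of the sort of thing I mean -- in Python for illustration, with hypothetical names rather than FromThePage's actual code:

    FOOTER_TEMPLATE = """
    <div class="collection-footer">
      <p>This page is part of <a href="%(url)s">%(title)s</a>.
         Researching these families yourself? Please
         <a href="mailto:%(email)s">email the editor</a>.</p>
    </div>
    """

    def with_contact_footer(page_html, collection):
        # Append the collection's contact block to any rendered page --
        # subject article, work, or manuscript page -- so visitors who
        # deep-link in from a search engine always see it.
        return page_html + FOOTER_TEMPLATE % {
            "url": collection["url"],
            "title": collection["title"],
            "email": collection["contact_email"],
        }

The introductory HTML block (idea 1) is really just the same hook, placed once at the top of the collection homepage.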
One idea I think would not work is building comment boxes or a 'contact us' form. I'm trying to establish a personal connection with these researchers, since in addition to asking "Who is Alice Wooding?", I'd like to locate other diaries or hunt down other information about local history. That's really best handled through email, where the barriers to participation are low.

Tuesday, June 2, 2009

Interview with Hugh Cayless

One of the neatest things to happen in the world of transcription technology this year was the award of an NEH ODH Digital Humanities Start-Up Grant to "Image to XML", a project exploring image-based transcription at the line and word level. According to a press release from UNC, the grant will fund development of "a product that will allow librarians to digitally trace handwriting in an original document, encode the tracings in a language known as Scalable Vector Graphics, and then link the tracings at the line or even word level to files containing transcribed texts and annotations." This is based on Hugh Cayless's work developing Img2XML, which he has described in a presentation at Balisage, demonstrated in a static demo, and shared in a GitHub repository.

Hugh was kind enough to answer my questions about the Img2XML project and has allowed me to publish his responses here in interview form:


First, let me congratulate you on img2xml's award of a Digital Humanities Start-Up Grant. What was that experience like?

Thanks! I've been involved in writing grant proposals before, and sat on an NEH review panel a couple of years ago. But this was the first time I've been the primary writer of a grant. Start-Up grants (understandably) are less work than the larger programs, but it was still a pretty intensive process. My colleague at UNC, Natasha Smith, and I worked right down to the wire on it. At research institutions like UNC, the hard part is not the writing of the proposal, but working through the submission and budgeting process with the sponsored research office. That's the part I really couldn't have done in time without help.

The writing part was relatively straightforward. I sent a draft to Jason Rhody, who's one of the ODH program officers, and he gave us some very helpful feedback. NEH does tell you this, but it is absolutely vital that you talk to a program officer before submitting. They are a great resource because they know the process from the inside. Jason gave us great feedback, which helped me refine and focus the narrative.

What's the relationship between img2xml and the other e-text projects you've worked on in the past? How did the idea come about?

At DocSouth, they've been publishing page images and transcriptions for years, so mechanisms for doing that had been on my mind. I did some research on generating structural visualizations of documents using SVG a few years ago, and presented a paper on it at the ACH conference in Victoria, so I'd had some experience with it. There was also a project I worked on while I was at Lulu, where I used Inkscape to produce a vector version of a bitmap image for calendars, so I knew it was possible. When I first had the idea, I went looking for tools that could create an SVG tracing of text on a page, and found potrace (which is embedded in Inkscape, in fact). I found that you can produce really nice tracings of text, especially if you do some pre-processing to make sure the text is distinct.
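
To make that concrete, here's a rough sketch of the tracing step -- my own illustration, not Hugh's code -- assuming potrace is installed. Its -s flag selects SVG output, and it reads bitmap formats like BMP rather than JPEG, so the scan has to be converted first:

    import subprocess
    from PIL import Image

    def trace_page(image_path, svg_path):
        # potrace accepts PBM/PGM/PPM/BMP, so convert the scan to a
        # 1-bit black-and-white bitmap first. Disable dithering --
        # speckles would become thousands of tiny traced shapes.
        bw = Image.open(image_path).convert("1", dither=Image.NONE)
        bw.save("page.bmp")
        # -s selects potrace's SVG backend; -o names the output file.
        subprocess.check_call(["potrace", "-s", "page.bmp", "-o", svg_path])

    trace_page("diary_page.jpg", "diary_page.svg")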

What kind of pre-processing was necessary? Was it all manual, or do you think the tracing step could be automated?

It varies. The big issue so far has been sorting out how to distinguish text from background (since potrace converts the image to black and white before running its tracing algorithm), particularly with materials like papyrus, which is quite dark. If you can eliminate the background color by subtracting it from the image, then you don't have to worry so much about picking a white/black cutover point--the defaults will work. So far it's been manual. One of the agendas of the grant is to figure out how much of this can be automated, or at least streamlined. For example, if you have a book with pages of similar background color, and you wanted to eliminate that background as part of pre-processing, it should be possible to figure out the color range you want to get rid of once, and do it for every page image.
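
Here's one guess at what that pre-processing might look like -- again an illustration of the idea rather than the project's code: estimate the dominant background shade from a histogram once, clear it, and let potrace's defaults handle the rest.

    from PIL import Image, ImageOps

    def remove_background(image_path, out_path, fuzz=10):
        gray = Image.open(image_path).convert("L")
        # The most common pixel value in a page scan is a fair guess
        # at the background shade (paper, papyrus).
        histogram = gray.histogram()
        background = histogram.index(max(histogram))
        # Push the background and anything lighter to white, leaving
        # the darker ink strokes intact.
        cleaned = gray.point(lambda p: 255 if p >= background - fuzz else p)
        # Re-stretch the contrast so the default white/black cutover
        # point works.
        ImageOps.autocontrast(cleaned).save(out_path)

    # As Hugh suggests, the background value computed once could be
    # reused for every page of the same book.
    remove_background("papyrus_page.jpg", "papyrus_page_clean.png")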

I've read your Balisage presentation and played around with the viewer demonstration. It looks like img2xml was at the proof-of-concept stage in mid-2008. Where does the software stand now, and how far do you hope to take it?

It hasn't progressed much beyond that stage yet. The whole point of the grant was to open up some bandwidth to develop the tooling further, and implement it on a real-world project. We'll be using it to develop a web presentation of the diary of a 19th century Carolina student, James Dusenbery, some excerpts from which can be found on Documenting the American South at http://docsouth.unc.edu/true/mss04-04/mss04-04.html.

This has all been complicated a bit by the fact that I left UNC for NYU in February, so we have to sort out how I'm going to work on it, but it sounds like we'll be able to work something out.

It seems to me that generating the SVGs could be automated pretty easily. In the Dusenbery project, you're working with a pretty small set of pages and a traditional (i.e. institutionally-backed) structure for managing transcription. How well suited do you think img2xml is to larger, bulk digitization projects like the FamilySearch Indexer efforts to digitize US census records? Would the format require substantial software to manipulate the transcription/image links?

It might. Dusenbery gives us a very constrained playground, in which we're pretty sure we can be successful. So one prong of attack in the project is to do something end-to-end and figure out what that takes. The other part of the project will be much more open-ended and will involve experimenting with a wide range of materials. I'd like to figure out what it would take to work with lots of different types of manuscripts, with different workflows. If the method looks useful, then I hope we'll be able to do follow-on work to address some of these issues.

I'm fascinated by the way you've cross-linked lines of text in a transcription to lines of handwritten text in an SVG image. One of the features I've wanted for my own project was the ability to embed a piece of an image as an attribute for the transcribed text -- perhaps illustrating an unclear tag with the unclear handwriting itself. How would SVG make this kind of linking easier?

This is exactly the kind of functionality I want to enable. If you can get close to the actual written text in a referenceable way then all kinds of manipulations like this become feasible. The NEH grant will give us the chance to experiment with this kind of thing in various ways.
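
As a sketch of why that referenceability matters, suppose the tracing groups each written line as <g id="line-N"> -- my assumption about the markup, not img2xml's settled format. Pulling out the shapes behind an unclear reading then becomes a simple lookup:

    import xml.etree.ElementTree as ET

    SVG_NS = "{http://www.w3.org/2000/svg}"

    def extract_line(svg_path, line_id):
        # Find the traced shapes for one written line, assuming each
        # line of handwriting is grouped as <g id="line-N">.
        tree = ET.parse(svg_path)
        for group in tree.iter(SVG_NS + "g"):
            if group.get("id") == line_id:
                return group
        return None

    # A transcription's unclear tag could then carry a pointer like
    # facs="page1.svg#line-7", and a viewer could render just that
    # group as the snippet illustrating the doubtful reading.
    line = extract_line("page1.svg", "line-7")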

Will you be blogging your explorations? What is the best way for those interested in following the project's development to stay informed?

Absolutely. I'm trying to work out the best way to do this, but I'd like to have as much of the project happen out in the open as possible. Certainly the code will be regularly pushed to the github repo, and I'll either write about it there, or on my blog (http://philomousos.blogspot.com), or both. I'll probably twitter about it too (@hcayless). I expect to start work this week...


Many thanks to Hugh Cayless for spending the time on this interview. We're all wishing him and img2xml the best of luck!