Tuesday, June 2, 2009

Interview with Hugh Cayless

One of the neatest things to happen in the world of transcription technology this year was the award of an NEH ODH Digital Humanities Start-Up Grant to "Image to XML", a project exploring image-based transcription at the line and word level. According to a press release from UNC, this will fund development of "a product that will allow librarians to digitally trace handwriting in an original document, encode the tracings in a language known as Scalable Vector Graphics, and then link the tracings at the line or even word level to files containing transcribed texts and annotations." This is based on the work of Hugh Cayless in developing Img2XML, which he has described in a presentation to Balisage, demonstrated in a static demo, and shared in a GitHub repository.

Hugh was kind enough to answer my questions about the Img2XML project and has allowed me to publish his responses here in interview form:


First, let me congratulate you on img2xml's award of a Digital Humanities Start-Up Grant. What was that experience like?

Thanks! I've been involved in writing grant proposals before, and sat on an NEH review panel a couple of years ago. But this was the first time I've been the primary writer of a grant. Start-Up grants (understandably) are less work than the larger programs, but it was still a pretty intensive process. My colleague at UNC, Natasha Smith, and I worked right down to the wire on it. At research institutions like UNC, the hard part is not the writing of the proposal, but working through the submission and budgeting process with the sponsored research office. That's the part I really couldn't have done in time without help.

The writing part was relatively straightforward. I sent a draft to Jason Rhody, who's one of the ODH program officers, and he gave us some very helpful feedback. NEH does tell you this, but it is absolutely vital that you talk to a program officer before submitting. They are a great resource because they know the process from the inside. Jason gave us great feedback, which helped me refine and focus the narrative.

What's the relationship between img2xml and the other e-text projects you've worked on in the past? How did the idea come about?

At DocSouth, they've been publishing page images and transcriptions for years, so mechanisms for doing that had been on my mind. I did some research on generating structural visualizations of documents using SVG a few years ago, and presented a paper on it at the ACH conference in Victoria, so I had some experience with it. There was also a project I worked on while I was at Lulu where I used Inkscape to produce a vector version of a bitmap image for calendars, so I knew it was possible. When I first had the idea, I went looking for tools that could create an SVG tracing of text on a page, and found potrace (which is embedded in Inkscape, in fact). I found that you can produce really nice tracings of text, especially if you do some pre-processing to make sure the text is distinct.
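
For readers who want to experiment with this themselves, here is a minimal sketch of what that tracing step can look like, assuming the ImageMagick convert tool and potrace are installed; the filenames and threshold value are placeholders of my own, not part of img2xml:

    # Sketch: binarize a page scan, then trace it to SVG with potrace.
    # Assumes the ImageMagick `convert` and `potrace` command-line tools
    # are on the PATH; filenames and threshold are illustrative only.
    import subprocess

    def trace_page(scan_path, svg_path, threshold="60%"):
        pbm_path = scan_path.rsplit(".", 1)[0] + ".pbm"
        # potrace only traces black-and-white bitmaps, so threshold first.
        subprocess.check_call(["convert", scan_path, "-threshold", threshold, pbm_path])
        # Trace the bitmap, emitting SVG paths for the regions of ink.
        subprocess.check_call(["potrace", "--svg", pbm_path, "-o", svg_path])

    trace_page("page001.jpg", "page001.svg")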

What kind of pre-processing was necessary? Was it all manual, or do you think the tracing step could be automated?

It varies. The big issue so far has been sorting out how to distinguish text from background (since potrace converts the image to black and white before running its tracing algorithm), particularly with materials like papyrus, which is quite dark. If you can eliminate the background color by subtracting it from the image, then you don't have to worry so much about picking a white/black cutover point--the defaults will work. So far it's been manual. One of the agendas of the grant is to figure out how much of this can be automated, or at least streamlined. For example, if you have a book with pages of similar background color, and you wanted to eliminate that background as part of pre-processing, it should be possible to figure out the color range you want to get rid of once, and do it for every page image.
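
To make the background-subtraction idea a little more concrete, here is a rough sketch of one way it could be approached with the Python Pillow library; the sampling strategy and tolerance are my own illustrative assumptions, not Hugh's code:

    # Sketch: estimate a page's dominant background color and push pixels
    # near it to white, so a later black/white cutover (as potrace applies)
    # can use its defaults. Illustrative only; not img2xml code.
    from PIL import Image

    def flatten_background(in_path, out_path, tolerance=40):
        img = Image.open(in_path).convert("RGB")
        # Treat the most common color in the image as the background.
        counts = img.getcolors(maxcolors=img.width * img.height)
        _, background = max(counts)
        px = img.load()
        for y in range(img.height):
            for x in range(img.width):
                pixel = px[x, y]
                # Pixels within the tolerance of the background become white.
                if all(abs(c - b) < tolerance for c, b in zip(pixel, background)):
                    px[x, y] = (255, 255, 255)
        img.save(out_path)

    flatten_background("papyrus.jpg", "papyrus-clean.png")

For a whole book with consistent paper, the same estimated background color could simply be reused across every page, which is the kind of streamlining Hugh describes.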

I've read your Balisage presentation and played around with the viewer demonstration. It looks like img2xml was at the proof-of-concept stage back in mid-2008. Where does the software stand now, and how far do you hope to take it?

It hasn't progressed much beyond that stage yet. The whole point of the grant was to open up some bandwidth to develop the tooling further and implement it on a real-world project. We'll be using it to develop a web presentation of the diary of a 19th-century Carolina student, James Dusenbery, some excerpts from which can be found on Documenting the American South at http://docsouth.unc.edu/true/mss04-04/mss04-04.html.

This has all been complicated a bit by the fact that I left UNC for NYU in February, so we have to sort out how I'm going to work on it, but it sounds like we'll be able to work something out.

It seems to me that you can automate generating the SVGs pretty easily. In the Dusenbery project, you're working with a pretty small set of pages and a traditional (i.e., institutionally backed) structure for managing transcription. How well suited do you think img2xml is to larger, bulk digitization projects like the FamilySearch Indexer efforts to digitize US census records? Would the format require substantial software to manipulate the transcription/image links?

It might. Dusenbery gives us a very constrained playground, in which we're pretty sure we can be successful. So one prong of attack in the project is to do something end-to-end and figure out what that takes. The other part of the project will be much more open-ended and will involve experimenting with a wide range of materials. I'd like to figure out what it would take to work with lots of different types of manuscripts, with different workflows. If the method looks useful, then I hope we'll be able to do follow-on work to address some of these issues.

I'm fascinated by the way you've cross-linked lines of text in a transcription to lines of handwritten text in an SVG image. One of the features I've wanted for my own project was the ability to embed a piece of an image as an attribute for the transcribed text -- perhaps illustrating an unclear tag with the unclear handwriting itself. How would SVG make this kind of linking easier?

This is exactly the kind of functionality I want to enable. If you can get close to the actual written text in a referenceable way then all kinds of manipulations like this become feasible. The NEH grant will give us the chance to experiment with this kind of thing in various ways.
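
As a purely hypothetical illustration of that referenceability (the line-group ids and the facs-style pointer here are my own assumptions, not img2xml's actual output), a traced page in which each manuscript line is wrapped in its own SVG group could be queried for just the strokes behind one transcribed line:

    # Sketch: pull one traced line out of a page SVG as a standalone fragment,
    # so a transcription element (e.g. one carrying facs="#line-3") could point
    # at it or embed it beside an unclear reading. The id scheme is hypothetical.
    import xml.etree.ElementTree as ET

    SVG_NS = "http://www.w3.org/2000/svg"
    ET.register_namespace("", SVG_NS)

    def extract_line(svg_path, line_id):
        root = ET.parse(svg_path).getroot()
        for group in root.iter("{%s}g" % SVG_NS):
            if group.get("id") == line_id:
                # Wrap the group in a fresh <svg> element so it renders on its own.
                fragment = ET.Element("{%s}svg" % SVG_NS, root.attrib)
                fragment.append(group)
                return ET.tostring(fragment, encoding="unicode")
        raise ValueError("no group with id %r" % line_id)

    print(extract_line("page001.svg", "line-3"))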

Will you be blogging your explorations? What is the best way for those interested in following the project's development to stay informed?

Absolutely. I'm trying to work out the best way to do this, but I'd like to have as much of the project happen out in the open as possible. Certainly the code will be regularly pushed to the github repo, and I'll either write about it there, or on my blog (http://philomousos.blogspot.com), or both. I'll probably twitter about it too (@hcayless). I expect to start work this week...


Many thanks to Hugh Cayless for spending the time on this interview. We're all wishing him and img2xml the best of luck!
