One of the neatest things to happen in the world of transcription technology this year was the award of an NEH
ODH Digital Humanities Start-Up Grant to
"Image to XML", a project exploring image-based transcription at the line and word level. According to
a press release from UNC, this will fund development of "a product that will allow librarians to digitally trace handwriting in an original document, encode the tracings in a language known as Scalable Vector Graphics, and then link the tracings at the line or even word level to files containing transcribed texts and annotations." This is based on the work of
Hugh Cayless in developing
Img2XML, which he has described in a
presentation to Balisage, demonstrated at
this static demo, and shared at this
github repository.
Hugh was kind enough to answer my questions about the
Img2XML project and has allowed me to publish his responses here in interview form:
First, let me congratulate you on img2xml's award of a Digital Humanities Start-Up Grant. What was that experience like?

Thanks! I've been involved in writing grant proposals before, and sat on an NEH review panel a couple of years ago. But this was the first time I've been the primary writer of a grant. Start-Up grants (understandably) are less work than the larger programs, but it was still a pretty intensive process. My colleague at
UNC, Natasha Smith, and I worked right down to the wire on it. At research institutions like
UNC, the hard part is not the writing of the proposal, but working through the submission and budgeting process with the sponsored research office. That's the part I really couldn't have done in time without help.
The writing part was relatively straightforward. I sent a draft to Jason Rhody, who's one of the
ODH program officers, and he gave us some very helpful feedback. NEH does tell you this, but it is absolutely vital that you talk to a program officer before submitting. They are a great resource because they know the process from the inside. Jason gave us great feedback, which helped me refine and focus the narrative.
What's the relationship between img2xml and the other e-text projects you've worked on in the past? How did the idea come about?

At
Docsouth, they've been publishing page images and transcriptions for years, so mechanisms for doing that had been on my mind. I did some research on generating structural visualizations of documents using
SVG a few years ago, and presented a paper on it at the
ACH conference in Victoria, so I had some experience with it. There was also a project I worked on while I was at Lulu where I used
Inkscape to produce a vector version of a bitmap image for calendars, so I knew it was possible. When I first had the idea, I went looking for tools that could create an
SVG tracing of text on a page, and found
potrace (which is embedded in
Inkscape, in fact). I found that you can produce really nice tracings of text, especially if you do some
pre-processing to make sure the text is distinct.
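To make that tracing step concrete, here is a minimal sketch of the kind of pipeline Hugh describes: binarize a page image and hand it to potrace's SVG backend. This is not code from img2xml; it assumes Pillow and the potrace command-line tool are installed, and the file names and threshold value are illustrative only.

```python
# Sketch of a potrace-based tracing step (illustrative, not from img2xml).
import subprocess
from PIL import Image

def trace_page(image_path, svg_path, threshold=160):
    """Binarize a page image, then trace the ink with potrace's SVG backend."""
    img = Image.open(image_path).convert("L")                      # greyscale
    bw = img.point(lambda p: 255 if p > threshold else 0).convert("1")
    bw.save("page.pbm")                                            # potrace reads PBM bitmaps
    # -s selects SVG output; potrace traces the black (ink) regions
    subprocess.run(["potrace", "-s", "page.pbm", "-o", svg_path], check=True)

trace_page("manuscript-page.png", "manuscript-page.svg")
```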
What kind of pre-processing was necessary? Was it all manual, or do you think the tracing step could be automated?

It varies. The big issue so far has been sorting out how to distinguish text from background (since
potrace converts the image to black and white before running its tracing algorithm), particularly with materials like papyrus, which is quite dark. If you can eliminate the background color by subtracting it from the image, then you don't have to worry so much about picking a white/black
cutover point--the defaults will work. So far it's been manual. One of the goals of the grant is to figure out how much of this can be automated, or at least streamlined. For example, if you have a book with pages of similar background color, and you want to eliminate that background as part of
pre-processing, it should be possible to figure out the color range you want to get rid of once, and do it for every page image.
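As a rough illustration of that measure-once, apply-to-every-page idea, the sketch below estimates a background grey level from one representative page and then pushes anything near it to white, so the later black/white cutover can use the defaults. It assumes Pillow and NumPy, and the sampling strategy and margin value are guesses that would need tuning per collection; this is not the project's code.

```python
# Illustrative background-removal sketch, not from img2xml.
import numpy as np
from PIL import Image

def estimate_background(image_path):
    """Estimate the background grey level from one representative page."""
    pixels = np.asarray(Image.open(image_path).convert("L"))
    return int(np.median(pixels))        # most of a page is background, not ink

def remove_background(image_path, out_path, background, margin=30):
    """Push anything near the background level to pure white."""
    pixels = np.asarray(Image.open(image_path).convert("L")).copy()
    pixels[pixels > background - margin] = 255
    Image.fromarray(pixels).save(out_path)

# Measure the background once, then apply it to every page in the book.
bg = estimate_background("page-001.png")
for name in ["page-001.png", "page-002.png"]:
    remove_background(name, name.replace(".png", "-clean.png"), bg)
```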
I've read your Balisage presentation and played around with the viewer demonstration. It looks like img2xml was in the proof-of-concept stage back in mid-2008. Where does the software stand now, and how far do you hope to take it?

It hasn't progressed much beyond that stage yet. The whole point of the grant was to open up some bandwidth to develop the tooling further, and implement it on a real-world project. We'll be using it to develop a web presentation of the diary of a 19th-century Carolina student, James
Dusenbery, some excerpts from which can be found on Documenting the American South at
http://docsouth.unc.edu/true/mss04-04/mss04-04.html.
This has all been complicated a bit by the fact that I left
UNC for NYU in February, so we have to sort out how I'm going to work on it, but it sounds like we'll be able to work something out.
It seems to me that you can automate generating the SVGs pretty easily. In the Dusenbery project, you're working with a pretty small set of pages and a traditional (i.e. institutionally-backed) structure for managing transcription. How well suited do you think img2xml is to larger, bulk digitization projects like the FamilySearch Indexer efforts to digitize US census records? Would the format require substantial software to manipulate the transcription/image links?
It might.
Dusenbery gives us a very constrained playground, in which we're pretty sure we can be successful. So one prong of attack in the project is to do something end-to-end and figure out what that takes. The other part of the project will be much more open-ended and will involve experimenting with a wide range of materials. I'd like to figure out what it would take to work with lots of different types of manuscripts, with different
workflows. If the method looks useful, then I hope we'll be able to do follow-on work to address some of these issues.
I'm fascinated by the way you've cross-linked lines of text in a transcription to lines of handwritten text in an SVG image. One of the features I've wanted for my own project was the ability to embed a piece of an image as an attribute for the transcribed text -- perhaps illustrating an unclear tag with the unclear handwriting itself. How would SVG make this kind of linking easier?
This is exactly the kind of functionality I want to enable. If you can get close to the actual written text in a
referenceable way, then all kinds of manipulations like this become feasible. The NEH grant will give us the chance to experiment with this kind of thing in various ways.
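Purely as an illustration of the kind of manipulation being described, here is a small sketch that assumes a hypothetical convention in which each traced manuscript line is grouped in the SVG as a g element with an id like "line-3", and the transcription points at those ids TEI-style (e.g. facs="#line-3"). None of this is the project's actual markup; it just shows how a referenceable tracing could be pulled out to sit next to an unclear reading.

```python
# Sketch of line-level SVG/transcription linking (hypothetical conventions).
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

def extract_line(svg_path, line_id, out_path):
    """Copy one traced line's <g> group into a small standalone SVG snippet."""
    tree = ET.parse(svg_path)
    group = tree.getroot().find(f".//{{{SVG_NS}}}g[@id='{line_id}']")
    if group is None:
        raise ValueError(f"no traced group with id {line_id!r}")
    snippet = ET.Element(f"{{{SVG_NS}}}svg")
    snippet.append(group)
    ET.ElementTree(snippet).write(out_path)

# e.g. show the tracing behind an <unclear> reading that cites facs="#line-3"
extract_line("manuscript-page.svg", "line-3", "line-3-detail.svg")
```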
Will you be blogging your explorations? What is the best way for those interested in following its development to stay informed?

Absolutely. I'm trying to work out the best way to do this, but I'd like to have as much of the project happen out in the open as possible. Certainly the code will be regularly pushed to the
github repo, and I'll either write about it there, or on my blog (
http://philomousos.blogspot.com), or both. I'll probably twitter about it too (
@hcayless). I expect to start work this week...
Many thanks to Hugh
Cayless for spending the time on this interview. We're all wishing him and
img2xml the best of luck!