Thursday, April 5, 2007

Gavin Robinson's Project Wenham

On his indispensable Investigations of a Dog, Gavin Robinson describes digitizing his grandfather's letters from a German POW camp in WWI:
The text will be transcribed and marked up with TEI compliant XML, and published on the web, along with background information written by me. There will be an index of people, and possibly places. Another optional extra will be selections of relevant documents from other sources, such as battalion war diaries.
Robinson's approach to his project is almost identical to mine:
There is no possibility of using OCR for handwritten text, so the letters will have to be transcribed. Although we have the originals to work from, there might be some need to work from digital images. Apart from saving wear and tear on the documents, digital images are more flexible. Difficult text can be enlarged on screen, and contrast can be adjusted to bring out faded text. However, there is the added problem of viewing an image and typing at the same time, which might require specialised software. I’ve found that Zotero notes can be very useful for transcribing text from images or PDFs and might be adequate to start with. I could also use HTML/PHP/MySQL to cobble together something like the Distributed Proofreaders interface for my own use. The front end is a simple web based interface, and although I don’t know how their back end works, a local version just for my own use could be much simpler.
The differences between Robinson's needs and mine are small but not insignificant:
  1. I'm dealing with books whose many pages need consistent titles. In the course of my digitization, I've found the effort involved in reviewing and labelling images to be immense.

    Say I've got just under 200 JPG files that are supposed to represent every even-numbered page of my great grandmother's 1925 diary. These images were taken in bulk, and the haste involved means that some pages may be missing and need to be filled in, some are duplicates, and a few may even be out of focus and require re-shoots. Even after a set of images is labelled and cleaned up, it needs to be interleaved with a similarly processed set of odd-numbered pages.

    I had to halt development on the transcription feature several months ago in order to concentrate on automating this process. I doubt that a project in which each work could be represented by only a handful of images would need similar tools. In fact, for such a project, automation might be slower than manually reviewing and ordering the images.
  2. The shortness of letters may make their images easy to title, but they're more likely to need sophisticated categorization. The initial versions of FromThePage listed works (then called "books") on a single page, ordered alphabetically by title. I'm pretty sure that this would be completely inadequate for Robinson's or Susan Kitchens' needs.
  3. Project Wenham includes Robinson's grandfather's photographs and (from what I gather) the front side of postcards. I've not given any thought to including images in the transcribed works.
  4. Robinson apparently does not have my requirement for offline access to the transcribed works.
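
To make the interleaving step from the first point concrete, here is a minimal sketch in Python. It is not how FromThePage does it; the function names, the file names, and the idea of representing each cleaned-up set as a mapping from page number to image file are all assumptions made for illustration. It assumes duplicates and out-of-focus shots have already been weeded out, and it only merges the two sets and reports the gaps.

```python
from typing import Dict, List, Optional


def interleave_pages(
    even: Dict[int, str], odd: Dict[int, str], last_page: int
) -> List[Optional[str]]:
    """Merge even- and odd-page image sets into reading order.

    Both arguments are hypothetical page-number -> filename mappings.
    A None in the result marks a page with no image in either set.
    """
    for number in even:
        if number % 2 != 0:
            raise ValueError(f"page {number} is in the even set but is odd")
    for number in odd:
        if number % 2 != 1:
            raise ValueError(f"page {number} is in the odd set but is even")

    merged = {**even, **odd}
    # Walk every page number in order so that gaps stay visible as None.
    return [merged.get(number) for number in range(1, last_page + 1)]


def missing_pages(ordered: List[Optional[str]]) -> List[int]:
    """Return the 1-based page numbers that still need to be shot."""
    return [index + 1 for index, name in enumerate(ordered) if name is None]


if __name__ == "__main__":
    # Made-up file names; page 3 was skipped during the odd-page shoot.
    even_set = {2: "even_001.jpg", 4: "even_002.jpg"}
    odd_set = {1: "odd_001.jpg", 5: "odd_003.jpg"}

    ordered = interleave_pages(even_set, odd_set, last_page=5)
    print(ordered)
    print(missing_pages(ordered))
```

Keeping the gaps as explicit placeholders rather than silently closing them up means a re-shoot can be dropped into the right slot later without renumbering everything after it.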
