Monday, April 30, 2007

Feature: Subject Indexes

One of the main goals of the diary digitization project is to create printed versions of Julia Brumfield's diaries to distribute to family with little internet access. The links between manuscript pages and articles described below provide the raw material for a print index. Storing page-to-article links in RDBMS records makes indexing almost, but not quite, trivial.

The "not quite" part comes from the possible need for index sub-topics. Linking every instance of "tobacco" is not terribly useful in a diary of life on a tobacco farm. It would still be nice to group planting, stripping, or barn-building under "Tobacco" in a printed index, however.

One way to accomplish this is to categorize articles by topic. This would provide an optional overlay to the article-based indexing, so that the printed index could list all references to each article, plus references to articles for all topics. If Feb 15th's "planting tobacco" were linked to an article on "Planting", but that Planting article was categorized under "Tobacco", the algorithm to generate a printed index could list the link in both places:
Planting: Feb. 15
Tobacco: Planting: Feb 15.

Pseudocode to do this:

indexable_subjects = merge all category titles with all article titles
for each subject in indexable_subjects {
  print the subject title
  find any article with title == subject
  for each link from a page to that article {
    print the page in a reference format
  }
  find any category with title == subject
  for each article in that category {
    print the article title
    for each link from a page to that article {
      print the page in a reference format
    }
  }
}
Using an N:N categorization scheme would make this very flexible, indeed. I'd probably still want to present the auto-generated index to the user for review and cleanup before printing. Since this is my only mechanism for classification, some categories ("Personal Names", "Places") could be broken out into their own separate indexes.

Wednesday, April 25, 2007

Feature: Article Links

Both Wikipedia and Pepys Diary Online allow hyperlinks to be embedded into articles. These links are tracked bi-directionally, so that by viewing an article, you can see all the pages that link to it. This is an easy enough thing to do, once you allow embedded links.


The text markup to insert a link needs to be simple and quick, yet flexible if the scribe has the need and the skills to do complex operations. The wiki link syntax used in Wikipedia is becoming common, so it seems like as good a starting place as any: double square brackets placed around text make that text a link, e.g. [[tobacco]] links to an article Tobacco. In situations where the link target is different from the displayed text, an optional display text value is supplied after a pipe, so that [[Benjamin Franklin Brumfield|Ben]] links "Ben" to the article on Benjamin Franklin Brumfield. This user-friendly markup can be converted to my intermediate XML markup for processing on save and display.
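The wiki-to-XML conversion can be sketched with a single regular expression. The `<link>` element and its `target` attribute are assumptions for illustration, not the project's actual intermediate markup.

```ruby
# Convert [[Target]] and [[Target|display]] wiki links into a
# hypothetical intermediate XML <link> element.
def wikify(text)
  text.gsub(/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/) do
    target  = Regexp.last_match(1).strip
    display = Regexp.last_match(2) || target
    # Capitalize the first letter so [[tobacco]] points at "Tobacco"
    target = target[0].upcase + target[1..-1]
    %(<link target="#{target}">#{display}</link>)
  end
end

puts wikify("went to see [[Benjamin Franklin Brumfield|Ben]] about [[tobacco]]")
```

The piped form keeps the display text separate from the target title, which is exactly the distinction the RDBMS link record needs later.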

Another mechanism for inserting links is a sort of right-click dropdown: select the linked text, right click it (or click some icon), and you see a pop-up single-select of the articles in the system, with additional actions of creating new articles with the selected text as a title, or creating a new article with a different title. Choosing/creating an article would swap out the selected text with the appropriate markup. It is debatable whether such a feature would be useful without significant automation (below).


When text is saved, we walk through the markup looking for link tags. When one is encountered, a record is inserted into the RDBMS, recording the page the link is from, the article the link is to, and the actual display text. Any link target articles that do not already exist will be created with a title, but blank contents. Reading this RDBMS record makes indexing almost trivial — more on this later.
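The on-save pass described above might look like the following sketch, with hashes and arrays standing in for the articles and links tables; the `<link>` markup and field names are illustrative assumptions.

```ruby
# Walk saved markup for link tags: find-or-create each target article
# (blank contents for new ones) and record the page-to-article link
# together with its display text.
def record_links(page_id, markup, articles, links)
  markup.scan(/<link target="([^"]+)">([^<]+)<\/link>/) do |target, display|
    articles[target] ||= ""   # find-or-create with blank contents
    links << { page: page_id, article: target, display_text: display }
  end
end

articles = {}
links = []
record_links("1918-02-15",
             %(went <link target="Planting">planting</link> with ) +
             %(<link target="Benjamin Franklin Brumfield">Ben</link>),
             articles, links)
```

After the pass, both target articles exist (one freshly created, empty) and each link row carries the from-page, to-article, and display text needed for indexing and autolinking.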

Automated linking

The reason for storing the display text in the link is to automate link creation. Diaries in particular include a frequently-repeated core of people or places that will be referred to with the same text. In my great-grandmother's diaries, "Ben" always refers to "Benjamin Franklin Brumfield Sr.", while "Franklin" refers to "Benjamin Franklin 'Pat' Brumfield Jr.". Offering to replace those words with the linked text could increase the ease of linking and possibly increase the quality of indexes.

I see three possible UIs for this. The first one is described above, in which the user selects the transcribed text "Ben", brings up a select list with options, and chooses one. The record of previous times "Ben" was linked to something can winnow down that list to something useful, a bit like auto-completion. Another option is to use a Javascript observer to suggest linking "Ben" to the appropriate article whenever the scribe types "Ben". This is an AJAX-intensive implementation that would be ungainly over dial-up connections. Worse, it could be incredibly annoying, and since I'm hoping to minimize user enragement, I don't think it's a good idea unless handled with finesse. Maybe append suggested links to a sidebar while highlighting the marked text, rather than messing with popups? I'll have to mull that one over. The third scenario is to do nothing but server-side processing. Whenever a page is submitted, direct the user to a workflow in which they approve/deny suggested links. This could be made optional rather than integrated into the transcription flow, as Wikipedia's "wikify" feature does.
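Whichever UI wins, the underlying suggestion query is the same: tally past links by display text and offer the most frequent targets first. A sketch, again with an assumed in-memory link-table layout:

```ruby
# Given a display text the scribe has selected or typed, return the
# article titles it has previously been linked to, most frequent first.
def suggest_targets(display_text, past_links)
  past_links.select  { |l| l[:display_text] == display_text }
            .group_by { |l| l[:article] }
            .sort_by  { |_article, ls| -ls.size }
            .map(&:first)
end

past_links = [
  { display_text: "Ben",      article: "Benjamin Franklin Brumfield Sr." },
  { display_text: "Ben",      article: "Benjamin Franklin Brumfield Sr." },
  { display_text: "Ben",      article: "Ben Houston" },
  { display_text: "Franklin", article: "Benjamin Franklin 'Pat' Brumfield Jr." },
]
p suggest_targets("Ben", past_links)
```

Because the list is ranked by frequency, the select-list UI can show a short, useful menu instead of every article in the system.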


In my data model and UI, I'm differentiating between manuscript pages and article pages -- the former actually representing a page from the manuscript with its associated image, transcription, markup and annotation, while the latter represents a scribe-created note about a person, place, or event mentioned in one or more manuscript pages. The main use case of linking is to hotlink page text to articles. I can imagine linking articles to other articles as well, as when a person has their family listed, but it seems much less useful for manuscript pages to link to each other. Does the word "yesterday" really need to link March 14 to the page on March 13?

Friday, April 20, 2007

Risks: Why trust me with your stuff?

Why should someone trust me with their data? Why should someone trust me with the fruits of their effort? By "trust", I'm not talking about the possibility that I'll misuse the transcriptions -- I presume that someone's using the software in order to distribute their transcriptions more widely, so keeping it all a secret isn't that big a deal. Rather, I'm talking about the possibility that the computers the software is running on take a dive and never return. Or that the computers keep running but Julia ends up turning into a stagnant project riddled with comment spam the content creators are powerless to fight.

The only answer I can give is that the software must make it unnecessary for a work owner or scribe to trust in the future of that service. I can do this by making sure that the work owner can get their data back out. So at any time, any content authored on the site can be exported in a lossless format -- including transcription source, intermediate transcription documents, annotations, and the RDBMS structure that links it all. (This list does not include the original images, since I presume they have an independent existence on the machines of whoever uploaded them originally, and transferring them would be costly.) If the software itself is open-source, this could be loaded on another server with no dependencies on my own technical or personal reliability.

So in addition to features for transcription and printing, I need a full export feature.
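One lossless export format that satisfies the requirement is a single structured document per work. The sketch below serializes a work to YAML; every field name here is an illustrative assumption, not the real schema.

```ruby
require 'yaml'

# Serialize a work -- pages, transcripts, articles, and the links that
# tie them together -- into one YAML document that another install of
# the software could re-import. (Original images are deliberately
# excluded, as discussed above.)
def export_work(work)
  YAML.dump(work)
end

work = {
  "title"    => "Julia Brumfield 1918 Diary",
  "pages"    => [{ "title" => "February 15",
                   "source" => "went planting with Ben" }],
  "articles" => { "Planting" => "" },
  "links"    => [{ "page" => "February 15", "article" => "Planting" }],
}
exported = export_work(work)
puts exported
```

A round-trip load of the YAML reproduces the original structure, which is the property "lossless" demands.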

Progress Report: Auto-titling

One of the first things I discovered when I started work on this project is that it's a lot of effort just to get the page images ready to transcribe. In my case, this means rotating each image (either 90 or 270 degrees, depending on recto or verso), shrinking the rotated images by 1/4 to get to the minimum legible size, shrinking the originals again by 1/2 to get to a zoom size, and then attaching titles to them.

This titling is itself quite difficult. While it's trivial to generate consistently-formatted dates to apply to carefully reviewed, consistently named page images, the results of scanning real-world manuscripts are much messier. In my case, I'm taking pictures of the odd numbered pages in order, then following with the even numbered pages. The resulting lists of files have duplicate images of the same pages, are missing pages, or have re-do image pairs in which I discovered my camera was on the wrong setting and had to re-shoot a series of pages. In one diary the titles in the original aren't even sequential, since every two months includes a "Memoranda" page. In another, a separate sheet has been glued in over an earlier diary entry.

The only solution I've come up with is a titling feature, which I completed this month. I upload a set of pictures I believe to be a reasonably-coherent series of pages, choose the proper orientation, and point to the spot on a sample image where the page number is located. This launches a series of background jobs that shrink each image to the minimum legible size, rotate the images correctly, and crop the part of the image that contains the page number. It then automatically generates titles for the images based on user-entered data, then presents the tops of each page along with the generated title for review. The user can delete pages, bump titles into synch with their images, or override titles for cases like my "Memoranda" pages.
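The title-generation step can be sketched independently of the image processing. Assuming a diary with one page per day and a recto-only batch (so every other day), the sketch below pairs each filename with a consistently formatted date title; the parameters are illustrative, and the real feature also handles the review-and-override pass.

```ruby
require 'date'

# Generate consistently formatted date titles for a series of page
# images, stepping by a configurable number of days (2 for an
# odd-pages-only batch of a one-page-per-day diary).
def generate_titles(filenames, start_date, step_days: 1)
  filenames.each_with_index.map do |file, i|
    [file, (start_date + i * step_days).strftime("%B %-d, %Y")]
  end
end

titles = generate_titles(%w[img001.jpg img002.jpg img003.jpg],
                         Date.new(1925, 1, 1), step_days: 2)
p titles
```

The human review pass then corrects exactly the cases the automation can't know about: duplicates, gaps, re-shoots, and non-sequential pages like "Memoranda".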

The next step is a feature to collate recto and verso image sets, as well as one to fill in skipped page images. The UI for collation is going to be difficult.

Thursday, April 5, 2007

Gavin Robinson's Project Wenham

On his indispensable Investigations of a Dog, Gavin Robinson describes digitizing his grandfather's letters from a German POW camp in WWI:
The text will be transcribed and marked up with TEI compliant XML, and published on the web, along with background information written by me. There will be an index of people, and possibly places. Another optional extra will be selections of relevant documents from other sources, such as battalion war diaries.
Robinson's approach to his project is almost identical to mine:
There is no possibility of using OCR for handwritten text, so the letters will have to be transcribed. Although we have the originals to work from, there might be some need to work from digital images. Apart from saving wear and tear on the documents, digital images are more flexible. Difficult text can be enlarged on screen, and contrast can be adjusted to bring out faded text. However, there is the added problem of viewing an image and typing at the same time, which might require specialised software. I’ve found that Zotero notes can be very useful for transcribing text from images or PDFs and might be adequate to start with. I could also use HTML/PHP/MySQL to cobble together something like the Distributed Proofreaders interface for my own use. The front end is a simple web based interface, and although I don’t know how their back end works, a local version just for my own use could be much simpler.
The differences between Robinson's needs and mine are small but not insignificant:
  1. I'm dealing with books whose many pages need consistent titles. In the course of my digitization, I've found the effort involved in reviewing and labelling images to be immense.

    Say I've got just under 200 JPG files that are supposed to represent every even-numbered page of my great grandmother's 1925 diary. These images were taken in bulk, and the haste involved means that some pages may be missing and need to be filled, some are duplicates, and a few may even be out of focus and require re-shoots. Even after a set of images is labelled and cleaned up, it needs to be interleaved with a similarly processed set of odd numbered pages.

    I had to halt development on the transcription feature several months ago in order to concentrate on automating this process. I doubt that a project in which each work could be represented by only a handful of images would need similar tools — in fact, automation might be slower than manually reviewing and ordering the images.
  2. The shortness of letters may make their images easy to title, but they're more likely to need sophisticated categorization. The initial versions of FromThePage listed works (then called "books") in a single page, ordered alphabetically by title. I'm pretty sure that this would be completely inadequate for Robinson's or Susan Kitchens' needs.
  3. Project Wenham includes Robinson's grandfather's photographs and (from what I gather) the front side of postcards. I've not given any thought to including images in the transcribed works.
  4. Robinson apparently does not have my requirement for offline access to the transcribed works.

Wednesday, April 4, 2007

Picking a Name

When I started working on this piece of software back in 2005, I referred to it as "the diary project." Realizing that this would almost certainly result in a collision, I settled on "The Julia Project" as a far more elegant name.

A quick Google search reveals that most variations of "Julia" with "Project" or "System" are taken, most recently by a film production company. So now I'm left with either the unlovely CMTAS or referring to the system by a name I know I'll have to abandon. Fortunately, I'll only have to decide after signing up for hosting or starting a formal RubyForge project, so I can put off the decision.

Susan Kitchens' Letter Project

Susan Kitchens at Family Oral History Using Digital Tools [and I thought "Collaborative Manuscript Transcription" was a mouthful!] has a need that's very similar to my own. She's got a bunch of old letters and she wants to "scan them all and somehow make sense of them digitally". Her post on the subject outlines a plan to embed metadata into the scanned images themselves. This would allow her to use image viewing software -- she's looking at MemoryMiner -- to navigate the letter images.

Her project differs from mine in that
  1. She's not trying to distribute compact hardcopy versions, a core end-product of FromThePage.
  2. She needs more structured, analytical metadata than a freeform wiki-style "what links here" index can provide.
  3. She doesn't have the collaborative proofreading/correction/annotation needs I've seen in transcribing my great-great grandmother's diaries.
Kitchens' needs suggest enhancements to my design in structuring articles. I'd thought about differentiating general articles on subjects like "cutting match" or "tobacco" from those on people, but maybe further categorization would be worth investigating during early testing.

Paper: Computational Manuscript Indexing

The 2006 Family History Technology Workshop archives are online. One presentation ("Towards Searchable Indexes for Handwritten Documents") dealt with the difficulties of automating OCR. The conclusion: it's not impossible to pragmatically digitize manuscripts for the purpose of searching. Partial matches between search terms and recognized manuscript letters mean that so long as the user can tolerate imperfect search results, the manuscripts need not be fully transcribed in order to be indexed. Even this requires extensive training and consistent handwriting in the source texts, however.

Here are links to the paper and the slides.

Tuesday, April 3, 2007

Planning for GA

Sara and I had a discussion last night over supper, hashing out the business plan for the project. The most important conclusion was that I need to use the app with a couple of small user communities before any sort of general release. That helps me focus my efforts on some very specific features.
  1. Get a start-to-finish set of transcription features and the basics for annotation.
  2. Install the app on DreamHost.
  3. Load up the already-transcribed 1918 diary and the untranscribed 1932 diary.
  4. Begin work with various relatives on transcribing the 1932 diary and annotating/indexing the 1918 diary.
  5. Fix any bugs this reveals.
  6. Develop any bare-bones social features like to-do lists and progress bars needed by that user community.
  7. Photograph and load a document recording minutes from a Primitive Baptist church my great-great grandparents belonged to. This is a document of interest to local historians and genealogists, so should be a good opportunity to get a diverse and large-ish set of users.
  8. Ask for volunteers from the Pittsylvania County Historical Society and VAPITTSY-L to transcribe the minutes document.
  9. Incorporate application feedback from those users.
  10. Analyze usage patterns to try to predict hosting costs.
  11. Build GA features like work management tools.
  12. Begin limited hosting for works owned/managed by other people.
The immediate development lesson to draw from this plan is that transcription features need to be built first, then printing and annotation tools, then social features, and last the works management features.

Monday, April 2, 2007

Feature: Regularization

One of the many editorial decisions that must be made while transcribing a manuscript is whether or not to preserve the document's original spelling and punctuation. Happily, TEI has a mechanism for preserving both versions while typing the transcript, so the choice of which one to display is delegated to the reader/printer. Unhappily, the eierlegende Wollmilchsau ("egg-laying wool-milk-sow") approach of TEI means their mechanism is pretty hokey:

  • <reg> (regularization) contains a reading which has been regularized or normalized in some sense.

  • <orig> (original form) contains a reading which is marked as following the original, rather than being normalized or corrected.

  • <choice> groups a number of alternative encodings for the same point in a text.

The reason they've made <reg> and <orig> freestanding elements is that they want to be able to show a word as having been corrected without providing an alternative, the same way that one uses sic. This is perfectly reasonable, though I do not think it applies to my application. Less defensible is their choice of <choice> to enclose orig/reg elements. <choice> is used elsewhere to encode variant readings marked with the <unclear> tag. As a result, any XSL transform attempting to normalize (or originalize) a TEI-encoded document is stuck peeking within every <choice> element it encounters to search for the <reg>/<orig> pair.

Since my transcription source will have to use a different, per-page DTD, I'll probably create an <irreg> tag to use instead of <choice> here.
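Concretely, the two encodings would differ only in the wrapper element. The sample word below is invented for illustration; the TEI form follows the `<choice>`/`<orig>`/`<reg>` pattern from the P5 Guidelines, while the second form shows the proposed `<irreg>` wrapper.

```xml
<!-- TEI's mechanism: both readings survive in the transcript -->
<choice>
  <orig>planten</orig>
  <reg>planting</reg>
</choice>

<!-- The proposed per-page DTD would use a dedicated wrapper, so a
     transform can match on the element name instead of peeking
     inside every <choice> -->
<irreg>
  <orig>planten</orig>
  <reg>planting</reg>
</irreg>
```

With a dedicated wrapper, an XSL transform that normalizes (or originalizes) the text can select `irreg/reg` or `irreg/orig` directly.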

What I'm Building

I'm working on a piece of software for collaborative manuscript transcription and annotation. That's a bit of a mouthful, but what it boils down to is this: I've got temporary access to several family documents which I am trying to transcribe and distribute. Being a software engineer by trade, it seems to me that the easiest way to do this is to write a system that allows me and other volunteers to write down, annotate, proofread and print the page images I've scanned. This would allow those of us who are more comfortable with the technology (or are perhaps merely quicker typists) to do the bulk of the complex work, while other (older) volunteers with more context for the manuscripts make interpretive decisions about names and events.

My proximate goal is for the online system to allow wiki-like hyperlinking to (at minimum) proper names, which -- when backed by a relational database -- would completely automate creation of indices in a "final" printed copy. I'd also like to allow the reader/printer to choose whether to preserve original line-breaks and/or spelling, mark sections of text as sensitive (so that they would not be visible to the general public), and include images of illegible text to appear in print versions as footnotes.

My eventual goal is to release the system as (possibly) Open Source and open up fee-based hosting on my own servers to fund whatever hosting bills my own project incurs. The grandiose vision is to try to get people in the family history community to direct their efforts into making primary texts accessible to the public.