Monday, March 23, 2009

Feature: Mechanical Turk Integration

At last week's Austin On Rails SXSW party, my friend and compatriot Steve Odom gave me a really neat feature idea. "Why don't you integrate with Amazon's Mechanical Turk?" he asked. This is an intriguing notion, and while it's not on my own road map, it would be pretty easy to modify FromThePage to support that. Here's what I'd do to use FromThePage on a more traditional transcription project, with an experienced documentary editor at the head and funding for transcription work:

Page Forks: I assume that the editor using Mechanical Turk would want double-keyed transcriptions to maintain quality, so the application needs to present the same untranscribed page to multiple people. In the software world, when a project splits, we call this forking, and I think that the analogy applies here. This feature needs to track an entirely separate edit history for each fork of a page. This means a new attribute on the master page record describing whether more than one fork exists, and a separate edit history for each fork that's created. There's no reason to limit these transcriptions to only two forks, even if that's the most common use case, so I'd want to provide a URL that will automatically create a new fork for a new transcriber to work in. The Amazon HIT (Human Intelligence Task) would have a link to that URL, so the transcriber need never track which fork they're working in, or even be aware of the double keying.
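To make the data model a little more concrete, here is a minimal sketch in plain Python rather than FromThePage's actual Rails code. The class and method names (Page, Fork, new_fork, is_forked) are made up for illustration, and the sample text is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fork:
    """One transcriber's independent copy of a page, with its own edit history."""
    transcriber: str
    versions: List[str] = field(default_factory=list)  # oldest first

    def save(self, text: str) -> None:
        # Each save appends a version, so the fork keeps a full history.
        self.versions.append(text)

    @property
    def latest(self) -> str:
        return self.versions[-1] if self.versions else ""


@dataclass
class Page:
    """Master page record (hypothetical); the reconciled text lives here."""
    title: str
    transcription: str = ""
    forks: List[Fork] = field(default_factory=list)

    @property
    def is_forked(self) -> bool:
        # The "more than one fork exists" attribute on the master record.
        return len(self.forks) > 1

    def new_fork(self, transcriber: str) -> Fork:
        # What a hypothetical "create a fork" URL would do on each visit.
        fork = Fork(transcriber)
        self.forks.append(fork)
        return fork


# Invented sample data: two transcribers keying the same page.
page = Page("Page 1")
fork_a = page.new_fork("worker_a")
fork_b = page.new_fork("worker_b")
fork_a.save("Cold and clear.")
fork_a.save("Cold and clear today.")
fork_b.save("Cold and clean today.")
```

Nothing here caps the number of forks at two, and each transcriber only ever touches the fork that was created for them.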

Reconciling Page Forks: After a page has been transcribed more than one time, the application needs to allow the editor to reconcile the transcriptions. This would involve a screen displaying the most recent version of two transcriptions alongside the scanned page image. There's likely a decent Rails plugin already for displaying code diffs, so I could leverage that to highlight differences between the two transcriptions. A fourth pane would allow the editor to paste the reconciled transcription into the master page object.
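As a rough illustration of the highlighting step, the sketch below uses Python's standard difflib module as a stand-in for whatever Rails diff plugin would actually be used. The function name word_differences is hypothetical, and comparing word by word (rather than line by line) is my own assumption about what suits prose transcriptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_differences(first: str, second: str) -> List[Tuple[str, str, str]]:
    """Return (kind, words_in_first, words_in_second) for each disagreement.

    kind is one of difflib's opcode tags: 'replace', 'delete' or 'insert'.
    """
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # both transcribers agree on this run of words
        differences.append(
            (tag, " ".join(a[a_start:a_end]), " ".join(b[b_start:b_end]))
        )
    return differences


# Invented sample text: the two keyings disagree on a single word.
diffs = word_differences("Cold and clear today.", "Cold and clean today.")
```

The editor's screen would only need to highlight the spans reported here; everything else is text both transcribers already agree on.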

Publishing MTurk HITs: Since each page is an independent work unit, it should be possible to automatically convert an untranscribed work into MTurk HITs, with a work item for each page. I don't know enough about how MTurk works, but I assume that the editor would need to enter their Amazon account credentials to have the application create and post the HITs. The app also needs to prevent the same user from re-transcribing the same page in multiple forks.
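The two bookkeeping pieces in that paragraph can be sketched without touching Amazon's API at all. The example below is plain Python with no MTurk calls; the URL pattern, the copies field, and the ForkRegistry class are all assumptions made up for illustration.

```python
from typing import Dict, Iterable, List, Set


def build_work_items(page_ids: Iterable[int], base_url: str,
                     copies: int = 2) -> List[Dict]:
    """One work item per untranscribed page.

    Each item links to a hypothetical fork-creating URL; 'copies' is how
    many independent transcriptions the editor wants (2 for double keying).
    """
    return [
        {
            "page_id": page_id,
            "url": f"{base_url}/pages/{page_id}/forks/new",
            "copies": copies,
        }
        for page_id in page_ids
    ]


class ForkRegistry:
    """Remembers who has already worked on which page."""

    def __init__(self) -> None:
        self._workers_by_page: Dict[int, Set[str]] = {}

    def claim(self, page_id: int, worker_id: str) -> bool:
        # Refuse a second fork of the same page for the same worker.
        workers = self._workers_by_page.setdefault(page_id, set())
        if worker_id in workers:
            return False
        workers.add(worker_id)
        return True


items = build_work_items([1, 2, 3], "https://example.org")
registry = ForkRegistry()
first_try = registry.claim(1, "worker_a")   # allowed
second_try = registry.claim(1, "worker_a")  # same worker, same page: refused
```

Actually posting these items as HITs would still need the editor's Amazon credentials and the real MTurk API, which this sketch deliberately leaves out.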

In all, it doesn't sound like more than a month or two's worth of work, even performed part-time. This isn't a need I have for the Julia Brumfield diaries, so I don't anticipate building this any time soon. Nevertheless, it's fun to speculate. Thanks, Steve!
