Collaborative Manuscript Transcription: January 2008

Monday, January 21, 2008

Four steps to deployment

Here are the things I need to do to deploy the 0.5 app on my shared hosting provider:

Install Capistrano and get it working
Upgrade my application stack to Rails 2.0
Switch my app from a subdirectory deep within another CVS module to its own Subversion module
Move the app to Dreamhost

But what order should I tackle this in? My temptation is to try deploying to Dreamhost via Capistrano, since I'm eager to get the app on a production server. Fortunately for my sanity, however, I read part of Cal Henderson's Building Scalable Websites this weekend. Henderson recommends using a staging site. While he probably had something different in mind, this seems like a perfect way to isolate these variables: get Capistrano scripts working on a staging location within my developent box, then once I really understand how deployment automation works, then point the scripts at the production server.

As for the rest, I'm not really sure when to do them. But I will try to tackle them one at a time.

Friday, January 18, 2008

What's next?

Now that I'm done with development driven only by my sense of what would be a good feature, it's time to move to step #2 in my year-old feature plan: deploying an alpha site.

I'm no longer certain about the second half of that plan -- other projects have presented themselves as opportunities that might have a more technically sophisticated user base, and thus might present more incremental enhancement requests. But getting the app to a server where I can follow Matt Mullenweg's advice and "become [my] most passionate user" seems more sensible now than ever.

Chris Wehner's SoldierStudies.org

This is the first of two reviews of similar transcription projects I wrote in correspondence with Brian Cafferelli, an undergraduate working on the WPI Manuscript Transcription Assistant. In this correspondence, I reviewed systems by their support for collaboration, automation, and analysis.

SoldierStudies.org is a non-academic/non-commercial effort like my own. It's a combined production-presentation system with simple but effective analysis tools. If you sign up for the site, you can transcribe letters you possess, entering metadata (name and unit of the soldier involved) and the transcription of the letter's text. You may also flag a letter's contents as belonging to N of about 30 subjects using a simple checkbox mechanism. The UI is a bit clunky in my opinion, but it actually has users (unlike my own program), so perhaps I shouldn't cast stones.

Nevertheless, SoldierStudies has some limitations. Most surprisingly, they are doing no image-based transcription whatsoever, even though they allow uploads of scans. Apparently those uploaded photos of letters are merely to authenticate that the user hasn't just created the letter out of nothing, and only a single page of a letter may be uploaded. Other problems seem inherent to the site's broad focus. SoldierStudies hosts some WebQuest modules intended for K-12 pedagogy. It also keeps copies of some letters transcribed in other projects, like letter from James Booker digitized as part of the Booker Letters Project at the University of Virginia. Neither of these seem part of the site's core goal to "to rescue Civil War letters before they are lost to future generations".

Unlike the pure-production systems like IATH MTD or WPI MTA, SoldierStudies transcriptions are presented dynamically. This allows full-text searching and browsing the database by metadata. Very cool.

So they've got automation mostly down (save the requirement that a scribe be in the same room as a text), analysis is pretty good, and there's a stab at collaboration, although texts cannot be revised by anybody but the original editor. Most importantly, they're online, actively engaged in preserving primary sources and making them accessible to the public via the web.

Thursday, January 17, 2008

Feature Triage for v1.0

I've been using this blog to brainstorm features since its inception. Partly this is to share my designs with people who may find them useful, but mainly it's been a way to flush this data out of my brain so that I don't have to worry about losing it.

Last night, I declared my app feature-complete for the first version. But what features are actually in it?

Let's do some data-flow analysis on the collaborative transcription process. Somebody out in the real world has a physical artifact containing handwritten text. I can't automate the process of converting that text into an image — plenty of other technologies do that already — so I have to start with a set of images capturing those handwritten pages. Starting with the images, the user must organize them for transcription, then begin transcription, then view the transcriptions. Only after that may they analyze or print those transcriptions.

Following the bulk of that flow, I've cut support for much of the beginning and end of the process, and pared away the ancillary features of page transcription. The resulting feature set is below, with features I've postponed till later releases ~~struck out~~.

Image Preparation
- Single image upload/replacement
- Image orientation/resolution controls
- ~~Upload several images~~
- Single-image titling
- Automatic title generation for several images
- Recto/verso image set collation
- Conversion of a set of titled images into pages of a transcribable work
Page Transcription
- Zoom
- Transcription text entry
- Subject links
- Auto-linking
- Request review
- ~~Unclear tags~~
- ~~Sensitive tags~~
Transcription Display
- Table of Contents for a work
- Page display
- Multi-page transcription display
- Page annotations
~~Printable Transcription~~
- ~~PDF generation~~
- ~~Typesetting~~
- ~~Table of contents page~~
- ~~Title/Author pages~~
- ~~Expansion footnotes~~
- ~~Subject article footnotes~~
- ~~Index~~
Text Analysis
- Subject Articles
- "What links here" indexes
- ~~Relatedness graphs~~ (Implemented, but turned off for v1.0)
- Subject categories
- Category-based navigation
- ~~Categorized relatedness graphs~~

Wednesday, January 16, 2008

Whew

I've just finished coding review requests, the last feature slated for release 1.0.

I figure I've completed 90% of the work to get to 1.0, so I'll dub this check-in Version 0.5.

The next step is to deploy to a production machine for alpha testing, which I expect will involve at least a few more code changes. But from here on out, every modification is a bug fix.

I'll try to post a bit more on my feature triage, my experience mis-applying the Evil Twin plugin design pattern, and my plans for the future.

Collaborative Manuscript Transcription

Monday, January 21, 2008

Four steps to deployment

Friday, January 18, 2008

What's next?

Chris Wehner's SoldierStudies.org

Thursday, January 17, 2008

Feature Triage for v1.0

Wednesday, January 16, 2008

Whew

New Blog Posts are at FromThePage

Posts from the FromThePage Blog

Pages

Upcoming Conference Schedule

Past Conference Talks

Blog Archive

Subjects

Papers

Transcription Systems

Digital Family History