Collaborative Manuscript Transcription

Monday, February 9, 2009

GoogleFight Resolves Unclear Handwriting

I've spent the last couple of weeks as a FromThePage user working seriously on annotation. This mainly involves identifying the people and events mentioned in Julia Brumfield's 1918 diary and writing short articles to appear as hyperlinked pages within the website, or be printed as footnotes following the first mention of the subject. Although my primary resource is a descendant chart in a book of family history, I've also found Google to be surprisingly helpful for people who are neighbors or acquaintances.

Here's a problem I ran into in the entry for June 30, 1918:

In this case, I was trying to identify the name in the middle of the photo. Bo__d Dews. The surname is a bit irregular for Julia's hand, but Dews is a common surname and occurs on the line above. In fact, this name is in the same list as another Mr. Dews, so I felt certain about the surname.

But what to make of the first name? The first two and final letters are clear and consistent: BO and D. The third letter is either an A or a U, and the fourth is either N or R. We can eliminate "Bourd" and "Boand" as unlikely phonetic spellings of any English name, leaving "Bound" and "Board". Neither of these are very likely names... or are they?

I thought I might have some luck by comparing the number and quality of Google search results for each of "Board Dews" and "Bound Dews". This is a pretty common practice used by Wikipedia editors to determine the most common title of a subject, and is sometimes known as a "Google fight". Let's look at the results:

"Bound Dews" yields four search results. The first two are archived titles from FromThePage itself, in which I'd retained a transcription of "Bound(?) Dews" in the text. The next two are randomly-generated strings on a spammer's site. We can't really rule out "Bound Dews" as a name based on this, however.

"Board Dews" yields 104 search results. The first page of results contains one person named Board Dews, who is listed on a genealogist's site as living from 1901 to 1957, and residing in nearby Campbell County. Perhaps more intriguing is the other surnames on the site, all from the area 10 miles east of Julia's home. The second page of results contains three links to a Board Dews, born in 1901 in Pittsylvania County.

At this point, I'm certain that the Bo__d Dews in the diary must be the Board Dews who would have been a seventeen-year-old neighbor. But I'm still astonished that I can resolve a legibility problem in a local diary with a Google search.

Thursday, February 5, 2009

Progress Report: Eight Months after THATCamp

It's been more than half a year since I've updated this blog. During that period, due to some events in my personal life, I was only able to spend a month or so on sustained development, but nevertheless made some real progress.

The big news is that I announced the project to some interested family members and have acquired one serious user. My cousin-in-law, Linda Tucker, has transcribed more than 60 pages of Julia Brumfield's 1919 diary since Christmas. In addition to her amazing productivity transcribing, she's explored a number of features of the software, reading most of the previously-transcribed 1918 diary, making notes and asking questions, and fighting with my zoom feature. Her enthusiasm is contagious, and her feedback -- not to mention her actual contributions -- has been invaluable.

During this period of little development, I spent a lot of time as a user. Fewer than 50 pages remain to transcribe in the 1918 diary, and I've started seriously researching the people mentioned in the diary for elaboration in footnotes. It's much easier to sustain work as a user than as a developer, since I don't need an hour or so of uninterrupted concentration to add a few links to a page.

I've also made some strides on printing. I jettisoned DocBook after too many problems and switched over to using Bruce Williamson's RTeX plugin. After some limited success, I will almost certainly craft my own set of ERb templates that generate LaTeX source for PDF generation. RTeX excels in serving up inline PDF files, which is somewhat antithetical to my versioned approach. Nevertheless, without RTeX, I might have never ventured away from DocBook. Thanks go to THATCamper Adam Solove for his willingness to share some of his hard-won LaTeX expertise in this matter.

Although I'm new to LaTeX, I've got footnotes working better than they were in DocBook. I still have many of the logical issues I addressed in the printing post to deal with, but am pretty confident I've found the right technology for printing.

I'm also working on re-implementing zoom in GSIV, rather than my cobbled-together solution. The ability to pan a zoomed image has been consistently requested by all of my alpha testers, the participants at THATCamp, and its lack is a real pain point for Linda, my first Real User. I really like the static-server approach GSIV takes, and will post when the first mock-up is done.

Saturday, June 21, 2008

Workflow: flags, tags, or ratings?

Over the past couple of months, I've gotten a lot of user feedback relating to workflow. Paraphrased, they include:

How do I mark a page "unfinished"? I've typed up half of it and want to return later.
How do I see all the pages that need transcription? I don't know where to start!
I'm not sure about the names or handwriting in this page. How do I ask someone else to review it?
Displaying whether a page has transcription text or not isn't good enough -- how do we know when something is really finished?
How do we ask for a proofreader, a tech savvy person to review the links, or someone familiar with the material to double-check names?

In a traditional, institutional setting, this is handled both through formal workflows (transcription assignments, designated reviewers, and researchers) and through informal face-to-face communication. None of these methods are available to volunteer-driven online
projects.

The folks at THATCamp recommended I get around this limitation by implementing user-driven ratings, similar to those found at online bookstores. Readers could flag pages as needing review, scribes could flag pages in which they need help, and volunteers could browse pages by quality to look for ways to help out. An additional benefit would be the low barrier to user-engagement, as just about anyone can click a button when they spot an error.

The next question is what this system should look like. Possible options are:

Rating scale: Add a one-to-five scale of "quality" to each page.
- Pros: Incredibly simple.
- Cons: "Quality" is ambiguous. There's no way to differentiate a page needing content review (i.e. "what is this placename?") from a page needing technical help (i.e. "I messed up the subject links"). Low quality ratings also have an almost accusatory tone, which can lead to lots of problems in social software.
Flags: Define a set of attributes ("needs review", "unfinished", "inappropriate") for pages and allow users to set or un-set them independently of each other.
- Pros: Also simple.
- Cons: Too precise. The flags I can think of wanting may be very different from those a different user wants. If I set up a flag-based data-model, it's going to be limited by my preconceptions.
Tags: Allow users to create their own labels for a page.
- Pros: Most flexible, easy to implement via acts_as_taggable or similar Rails plugins.
- Cons: Difficult to use. Tech-savvy users are comfortable with tags, but that may be a small proportion of my user base. An additional problem may be use of non-workflow based tags. If a page mentions a dramatic episode, why not tag it with that? (Admittedly this may actually be a feature.)

I'm currently leaning towards a combination of tags and flags: implement tags under the hood, but promote a predefined subset of tags to be accessible via a simple checkbox UI. Users could tag pages however they like, and if I see patterns emerge that suggest common use-cases, I could promote those tags as well.

Sunday, June 8, 2008

THATCamp Takeaways

I just got back from THATCamp, and man, what a ride! I've never been to a conference with this level of collaboration before -- neither academic nor technical. Literally nobody was "audience" -- I don't think a single person emerged from the conference without having presented in at least one session, and pitched in their ideas in half a dozen more.

To my astonishment, I ended up demoing FromThePage in two different sessions, and presented a custom how-to on GraphViz in a third. I was really surprised by the technical savvy of the participants -- just about everyone at the sessions I took part in had done development of one form or another. The feedback on FromThePage were far more concrete than I was expecting, and got me past several roadblocks. And since this is a product development blog, here's what they were:

Zoom: I've looked at Zoomify a few times in the past, but have never been able to get around the fact that their image-preparation software is incompatible with Unix-based server-side processing. Two different people suggested workarounds for this, which may just solve my zoom problems nicely.

WYSIWYG: I'd never heard of the Yahoo WYSIWYG before, but a couple of people recommended it as being especially extensible, and appropriate for linking text to subjects. I've looked over it a bit now, and am really impressed.
Analysis: One of the big problems I've had with my graphing algorithm is the noise that comes from members of Julia's household. Because they appear on 99 of 100 entries, they're more related everything, and (worse) show up on relatedness graphs for other subjects as more related than the subjects that's I'm actually looking for. Over the course of the weekend, while preparing my DorkShorts presentation, discussing it, and explaining the noise quandary in FromThePage, both problem and solution clarified.
The noise is due to the unit of analysis being a single diary entry. The solution is to reduce the unit of analysis. Many of the THATCampers suggested alternatives: look for related subjects within the same paragraph, or within N words, or even (using natural language toolkits) within the same sentence.
It might even be possible to do this without requiring markup of each mention of a subject. One possibility is to automate this by searching the entry text for likely mentions of the same subject that has occurred already. This search could be informed by previous matches -- the same data I'm using for the autolink feature. (Inspiration for this comes from Andrea Eastman-Mullins' description of how Alexander Street Press is using well-encoded texts to inform searches of unencoded texts.)
Autolink: Travis Brown, whose background is in computational linguistics, suggested a few basic tools for making autolink smarter. Namely, permuting the morphology of a word before the autolink feature looks for matches. This would allow me to clean up the matching algorithm, which currently does some gross things with regular expressions to approach the same goal.

Workflow: The participants at the Crowdsourcing Transcription and Annotation session were deeply sympathetic to the idea that volunteer-driven projects can't use the same kind of double-keyed, centrally organized workflows that institutional transcription projects use. They suggested a number of ways to use flagging and ratings to accomplish the same goals. Rather than assigning transcription to A, identification and markup to B, and proofreading to C, they suggested a user-driven rating system. This would allow scribes or viewers to indicate the quality level of a transcribed entry, marking it with ratings like "unfinished", "needs review", "pretty good", or "excellent". I'd add tools to the page list interface to show entries needing review, or ones that were nearly done, to allow volunteers to target the kind of contributions they were making.
Ratings also would provide an non-threatening way for novice users to contribute.

Mapping: Before the map session, I was manually clicking on Google's MyMaps, then embedding a link within subject articles. Now I expect to attach latitude/longitude coordinates to subjects, then generate maps via KML files. I'm still just exploring this functionality, but I feel like I've got a clue now.

Presentation: The Crowdsourcing session started brainstorming presentation tools for transcriptions. I'd seen a couple of these before, but never really considered them for FromThePage. Since one of my challenges is making the reader experience more visually appealing, it looks like it might be time to explore some of these.

These are all features I'd considered either out-of-reach, dead-ends, or (in one case) entirely impossible.

Thanks to my fellow THATCampers for all the suggestions, correction, and enthusiasm. Thanks also to THATCamp for letting an uncredentialed amateur working out of his garage attend. I only hope I gave half as much as I got.

Saturday, May 17, 2008

Progress Report: De-duping catastrophe and a host change

After a very difficult ten days of coding, I'm almost where I was at the beginning of May. The story:

Early in the month, I got a duplicate identifier feature coded. The UI was based on LibraryThing's, which is the best de-duping interface I've ever seen. Mine still falls short, but it's able to pair "Ren Worsham" up with "Wren Worsham", so it'll probably do for now. With that completed, I built a tool to combine subjects: if you see a possible duplicate of the subject you're viewing, you click combine, and it updates all textual references to the duplicate to point to the main article, then deletes the duplicate. Pretty simple, right?

Enter DreamHost's process killer. Now I love DreamHost, and I want to love them more, but I really don't think their cheap shared hosting plan is appropriate for a computationally intensive web-app. In order to insulate users from other, potentially clueless users, they have a daemon that monitors running processes and kills off any that look "bad". I'm not sure what criteria constitute "bad", but I should have realized the heuristic might be over-aggressive when I wasn't able to run basic database migrations without running afoul of it. Nevertheless, it didn't seem to be causing anything beyond the occasional "Rails Application failed to start" message that could be solved with a browser reload.

However. Killing a de-duping process in the middle of reference updates is altogether different from killing a relatedness graph display. Unfortunately, I wasn't quite aware of the problem before I'd tried to de-dup several records, sometimes multiple times. My app assumes its data will be internally consistent, so my attempts to clean up the carnage resulted in hundreds more duplicates being created.

So I've moved FromThePage from DreamHost to HostingRails, which I completed this morning. There remains a lot of back-end work to clean up the data, but I'm pretty sure I'll get there before THATCamp.

Sunday, April 27, 2008

The Trouble with Names

The trouble with people as subjects is that they have names, and that personal names are hard.

Names in the text may be illegible or incomplete, so that Reese _____ and Mr ____ Edmonds require special handling.
Names need be remembered by the scribe during their transcription. I discovered this the hard way.
After doing some research in secondary documents, I was able to "improve" the entries for Julia's children. Thus Kate Harvey became Julia Katherine Brumfield Harvey and Mollie Reynolds became Mary Susan Brumfield Reynolds.
The problem is that while I'm transcribing the diaries, I can't remember that "Mary Susan" == Mollie. The diaries consistently refer to her as Mollie Reynolds, and the family refers to to her as Mollie Reynolds. No other person working on the diaries is likely to have better luck than I've had remembering this. After fighting with the improved names for a while, I gave up and changed all the full names back to the common names, leaving the full names in the articles for each subject.
Names are odd ducks, when it comes to strings. "Mr. Zach Abel" should be sorted before "Dr. Anne Zweig", which requires either human intervention to break the string into component parts or some serious parsing effort. At this point my subject list has become unwieldy enough to require sorting, and the index generation code for PDFs is completely dependent on this kind of separation.

I'm afraid I'll have to solve all of these problems at the same time, as they're all interdependent. My initial inclination is to have subject articles for people allow the user to specify a full name in all its component parts. If none is chosen, I'll populate the parts via a series of regular expressions. This will probably also require a hard look at how both TEI and DocBook represent names.

Friday, April 25, 2008

Feature: Data Maintenance Tools

With only two collections of documents, fewer than a hundred transcriptions, and only a half-dozen users who could be charitably described as "active", FromThePage is starting to strain under the weight of its data.

All of this has to do with subjects. These are the indexable words that provide navigation, analysis, and context to readers. They're working out pretty well, but frequency of use has highlighted some urgent features to be developed and intolerable bugs to be fixed:

We need a tool to combine subjects. Early in the transcription process, it was unclear to me whether "the Island" referred to Long Island, Virginia -- a nearby town with a post office and railroad station -- or someplace else. Increasing familiarity with the texts, however, has shown "the Island" to be definitely the same as "Long Island".
The best interface for doing this sort of deduping is implemented by LibraryThing, which is so convenient that it has inspired the ad-hoc creation of a group of "combining" enthusiasts -- an astonishing development, since this is usually the worst of dull chores. A similar interface would invite the user viewing "the Island" to consider combining that subject with "Long Island". This requires an algorithm to suggest matches for combination, which is itself no trivial task.
We need another tool to review and delete orphans. As identification improves, we've been creating new subjects "Reese Smith" and linking previous references to "Reese" to that improved subject. This leaves the old, incomplete subject without any referents, but also without any way to prune it.

Autolink has become utterly essential to transcriptions, since it allows the scribe to identify the appropriate subject as well as the syntax to use for its link. Unfortunately, it has a few serious problems:

Autolink looks for substrings within the transcription without paying attention to word boundaries. This led to some odd autolink suggestions before this week, but since the addition of "ink" and "hen" to the subjects link text, autolink has started behaving egregiously. The word "then" is unhelpfully expanded to "t[[chickens|hen]]". I'll try to retain matches for display text regardless of inflectional markings, but it's time to move the matching algorithm to a more conservative heuristic.
Autolink also suggests matches from subjects that reside within different collections. It's wholly unhelpful for autolink to suggest a tenant farmer in rural Virginia within a context of UK soldiers in a WWI prison camp. This is a classical cross-talk bug, and needs fixing.

Monday, April 14, 2008

Collaborative transcription, the hard way

Archivalia has kept us informed of lots of manuscript projects going online in Europe last week, offering commentary along the way. Perhaps my favorite exchange was about the Chronicle of Sweder von Schele on the Internet:

Mit dem Projekt wird zunächst bezweckt, die Transkription zu ergänzen und zu verbessern. Hierzu können neue Abschriften per Mail an die am Projekt beteiligten Institutionen geschickt werden. Nach redaktioneller Prüfung werden die Seiten ausgetauscht.

Meine Güte, wie vorsintflutlich. Hat man noch nie etwas von einem Wiki gehört?

That's right -- the online community may send in corrections and additions to the transcription by mail.

Friday, April 11, 2008

Who do I build for?

Over at ClioWeb, Jeremy Boggs is starting a promising series of posts on the development process for digital humanites process. He's splitting the process up into five steps, which may happen at the same time, but still follow a rough sort of order. Step one, obviously, is "Figure out what exactly you’re building."

But is that really the first step? I'm finding that what I build is dependent on who I'm building for. Having launched an alpha version of the product and shaken out many of the bugs in the base functionality, I'm going to have to make some tough decisions about what I concentrate my time on next. These all revolve around who I'm developing the product for:

My Family: Remember that I developed the transcription software to help accomplish a specific goal: transcribe Julia Brumfield's diaries to share with family members. The features I should concentrate on for this audience are:

Finish porting my father's 1992 transcription of the 1918 diary to FromThePage
Fix zoom.
Improve the collaborative tools for discussing the works and figuring out which pages need review.
Build out the printing feature, so that the diaries can be shared with people who have limited computer access.

THATCamp attendees: Probably the best feedback I'll be able to receive on the analytical tools like subject graphs will come from THATCamp. This means that any analytical features I'd like to demo need to be presentable by the middle of May.

Institutional Users: This blog has drawn interest from a couple of people looking for software for their institutions to use. For the past month, I've corresponded extensively with John Rumm, editor of the Digital Buffalo Bill Project, based at McCracken Research Library, at the Buffalo Bill Historical Center. Though our conversations are continuing, it seems like the main features his project would need are:

Full support for manuscript transcription tags used to describe normalization, different hands, corrections/revisions to the text, illegible text and other descriptors used in low-level transcription work. (More on this in a separate post)
Integration with other systems the project may be using, like Omeka, Pachyderm, MediaWiki, and such.

My Volunteers: A few of the people who've been trying out the software have expressed interest in using it to host their own projects. More than any other audience, their needs would push FromThePage towards my vision of unlocking the filing cabinets in family archivists' basements and making these handwritten sources accessible online. We're in the very early stages of this, so I don't yet know what requirements will arise.

The problem is that there's very little overlap between the features these groups need. I will likely concentrate on family and volunteers, while doing the basics for THATCamp. I realize that's not a very tight focus, but it's much clearer to me now than it was last week.

Monday, April 7, 2008

Progress Report: One Month of Alpha Testing

FromThePage has been on a production server for about a month now, and the results have been fascinating. The first few days' testing revealed some shocking usability problems. In some places (transcription and login most notoriously) the code was swallowing error messages instead of displaying them to the user. Zoom didn't work in Internet Explorer. And there were no guardrails that kept the user from losing a transcription-in-progress.

After fixing these problems, the family and friends who'd volunteered to try out the software started making more progress. The next requests that came in were for transcription conventions. After about three requests for these, I started displaying the conventions on the transcription screen itself. This seems to have been very successful, and is something I'd never have come up with on my own.

The past couple of weeks have been exciting. My old college roommate started transcribing entries from the 1919 diary, and entered about 15 days in January -- all in two days work. In addition to his technical feedback, two things I'd hoped for happened:

We started collaborating. My roommate had transcribed the entries on a day when he'd had lots of time. I reviewed his transcriptions and edited the linking syntax. Then my father edited the resulting pages, correcting names of people, events, and animals based on his knowledge of the diaries' context.
My roommate got engaged. His technical suggestions were peppered with comments on the entries themselves and the lives of the people in the diaries. I've seen this phenomenon at Pepys Diary Online, but it's really heartening to get a glimpse of that kind of engagement with a manuscript.

I've also had some disappointments. It looks like I'll have to discard my zoom feature and replace it with something using the Google Maps API. People now expect to pan by dragging the image, and I my home-rolled system simply can't support that.

I really planned on moving developing printing and analytical tools next, but we're finding that the social software features are becoming essential. The bare-bones "Recent Activity" panel I slapped together one night has become the most popular feature. We need to know who edited what, what pages remain to be transcribed, and which transcriptions need review. I've resuscitated this winter's comment feature, polished the "Recent Activity" panel, and am working on a system for displaying a page's transcription status in the work's table of contents.

All of these developments are the results of several hours of careful, thoughtful review by volunteers like my father and my roommate. There is simply no way I could have invited the public in to the site as it stood a month ago, which I did not know at the time. There's still a lot to be done before FromThePage is "ready for company", but I think it's on track.

If you'd like to try the software out, leave me a comment or send mail to alpha.info@fromthepage.com.

Friday, April 4, 2008

Review: IATH Manuscript Transcription Database

This is the second of two reviews of similar transcription projects I wrote in correspondence with Brian Cafferelli, an undergraduate working on the WPI Manuscript Transcription Assistant. In this correspondence, I reviewed systems by their support for collaboration, automation, and analysis.

The IATH Manuscript Transcription Database was a system for producing transcriptions developed for the Salem Witch Trials court records. It allowed full collaboration within an institutional setting. An administrator assigned transcription work to a transcriber, who then pulled that work from the queue and edited and submitted a transcription. Presumably there was some sort of review performed by the admin, or a proofreading step done by comparison with another user's transcription. Near as I can tell from the manual, no dual-screen transcription editor was provided. Rather, transcription files were edited outside the system, and passed back and forth using MTD.

I'm a bit hazy on all this because after reviewing the docs and downloading the source code, I sent an inquiry to IATH about the legal status of the code. It turns out that while the author intended to release the system to the public, this was never formally done. The copyright status of the code is murky, and I view it as tainted IP for my purpose of producing a similar product. I'm afraid I deleted the files from my own system, unread.

For those interested in reading more, here is the announcement of the MTD
on the Humanist list.
The full manual, possibly with archived source code is accessible via the wayback machine. They've pulled this from the IATH site, presumably because of the IP problems.

So the MTD was another pure-production system. Automation and collaboration were fully supported, though the collaboration was designed for a purely institutional setting. Systems of assignment and review are inappropriate for a volunteer-driven system like my own.

My correspondent at IATH did pass along a piece of advice from someone who had worked on the project: "Use CVS instead". What I gather they meant by this was that the system for passing files between distributed transcribers, collating those files, and recording edits is a task that source code repositories already perform very well. This does nothing to replace transcription production tools akin to mine or the TEI editors, but the whole system of editorial oversight and coordination provided by the MTD is a subset of what a source code repository can do. A combination of Subversion and Trac would be a fantastic way to organize a distributed transcription effort, absent a pure-hosted tool.

This post contains a lot more speculation and supposition than usual, and I apologize in advance to my readers and the IATH for anything I've gotten wrong. If anyone associated with the MTD project would like to set me straight, please comment or send me email.

Thursday, March 27, 2008

Rails: Logging User Activity for Usability

At the beginning of the month, I started usability testing for FromThePage. Due to my limited resources, I'm not able to perform usability testing in control rooms, or (better yet) hire a disinterested expert with a background in the natural sciences to conduct usability tests for me. I'm pretty much limited to sending people the URL for the app with a pleading e-mail, then waiting with fingers crossed for a reply.

For anyone who finds themselves in the same situation, I recommend adding some logging code to your app. We tried this last year with Sara's project, discovering that only 5% of site visitors were even getting to the features we'd spent most of our time on. It was also invaluable resolving bugs reports. When a user complains they got logged off the system, we could track their clicks and see exactly what they were doing that killed their session.

Here's how I've done this for FromThePage in Rails:

First, you need a place to store each user action. You'll want to store information about who was performing the action, and what they were doing. I was willing to violate my sense of data model aesthetics for performance reasons, and abandon third normal form by combining these two distinct concepts into the same table.

# who's doing the clicking?
browser
session_id
ip_address
user_id #null if they're not logged in

Tracking the browser lets you figure out whether your code doesn't work in IE (it doesn't) and whether Google is scraping your site before it's ready (it is). The session ID is the key used to aggregate all these actions -- one of these corresponds to several clicks that make up a user session. Finally, the IP address give you a bit of a clue as to where the user is coming from.

Next, you need to store what's actually being done, and on what objects in your system. Again, this goes within the same table.

# what happened on this click?
:action
:params
:collection_id #null if inapplicable
:work_id #null if inapplicable
:page_id #null if inapplicable

In this case, every click will record the action and the associated HTTP parameters. If one of those parameters was collection_id, work_id, or page_id (the three most important objects within FromThePage), we'll store that too. Put all this in a migration script and create a model that refers to it. In this case, we'll call that model "activity".

Now we need to actually record the action. This is a good job for a before_filter. Since I've got a before_filter in ApplicationController that set up important variables like the page, work, or collection, I'll place my before_filter in the same spot and call it after that one.

before_filter :record_activity

But what does it do?

def record_activity
    @activity = Activity.new
    # who is doing the activity?
    @activity.session_id = session.session_id #record the session
    @activity.browser = request.env['HTTP_USER_AGENT']
    @activity.ip_address = request.env['REMOTE_ADDR']
    # what are they doing?
    @activity.action = action_name # grab this from the controller
    @activity.params = params.inspect # wrap this in an unless block if it might contain a password
    if @collection
       @activity.collection_id = @collection.id
    end
    # ditto for work, page, and user IDs
end

For extra credit, add a status field set to 'incomplete' in your record_activity method, then update it to 'complete' in an after_filter. This is a great way for catching activity that throws exceptions for users and presents error pages you might not know about otherwise.

P.S. Let me know if you'd like to try out the software.

Wednesday, March 26, 2008

THATCamp 2008

I'm going to THATCamp at the end of May to talk about From The Page and a few dozen other cool projects that are going on in the digital humanities. If anybody can offer advice on what to expect from an "unconference", I'd sure appreciate it.

This may be the thing that finally drives me to use Twitter.

Monday, March 17, 2008

Rails 2.0 Gotchas

The deprecation tools for Rails 2.0 are grand, but they really don't tell you everything you need to know. The things that have bitten me so far are:

The built-in pagination has been removed from the core framework. Unlike tools like acts_as_list and acts_as_tree, however, there's no obvious plugin that makes the old code work. This is because the old pagination code was really awful: it performed poorly and hid your content from search engines. Fortunately, Sara was able to convert my paginate calls to use the will_paginate plugin pretty easily.
Rails Engines, or at least the restful_comments plugin built on top of them, don't seem to work at all. So I've had to disable the comments and proofreading request system I spent November through January building.
Rails 2.0 adds some spiffy automated code to prevent cross-site-scripting security holes. For some reason this breaks my cross-controller AJAX calls, so I've had to add
protect_from_forgery :except => [my old actions]
to those controllers after getting InvalidAuthenticityToken exceptions.
The default session has been changed from a filesystem-based storage engine to one that shoves session data into the browser cookie. So if you're persisting large-ish objects across requests in the session, this will fail. Sadly, basic tests may pass, while serious work will break: I found my bulk page transformation
code to work fine for 20 pages, but break for 180. The solution for this is to add
config.action_controller.session_store = :p_store
in your environment.rb file.

Sunday, March 9, 2008

Collaborative Transcription as Crowdsourcing

Yesterday morning I saw Derek Powazek present on crowdsourcing -- user-generated content and collaborative communities. While he covered a lot of material that (users will do unexpected things, don't exploit people, design for the "selfish user"), there was one anecdote I thought especially relevant for FromThePage.

A publishing house had polled a targeted group of people to figure out whether they'd be interested in contributing magazine articles. The response was overwhelmingly positive. The appropriate studies were conducted, and the site was launched -- A blank page, ready for article contributions.

The response from those previously enthusiastic users was silence. Crickets. Tumbleweeds. The editors quickly changed tack and posted a list of ten subjects who'd agreed to be interviewed by the site's contributors,
asking for volunteers to conduct and write up the interviews. This time, people responded with the same enthusiasm they'd shown at the original survey.

The lesson was that successful editors of collaborative content endeavorshave less in common with traditional magazine/project editors thanthey do with community managers. Absent thecommand-and-control organizational structure, a volunteer community still needs to have its effortsdirected. However, this must be done through guidance andpersuasion through concrete suggestions, goal-setting, and feedback. In future releases, I need to add featuresto help work owners communicate suggestions and rewards^* to scribes.

(Powazek suggests attaboys, not complex replacements for currency here)

Wednesday, March 5, 2008

Meet me at SXSWi 2008

I'll be at South by Southwest Interactive this weekend. If any of my readers are also attending, please drop me a message or leave a comment. I'd love to meet up.

Thursday, February 7, 2008

Google Reads Fraktur

Yesterday, German blogger Archivalia reported that the quality of Fraktur OCR at Google Books has improved. There are still some problems, but they're on the same order of those found in books printed in Antiqua. Compare the text-only and page-image versions of Geschichte der teutschen Landwirthschaft (1800) with the text and image versions of Antigua Altnordisches Leben (1856).

This is a big deal, since previous OCR efforts produced results that were not only unreadable, but un-searchable as well. This example from the University of Michigan's MBooks website (digitized in partnership with Google) gives a flavor of the prior quality: "Ueber den Ursprung des Uebels." ("On the Origin of Evil") results in "Us-Wv ben Uvfprun@ - bed Its-beEd."

It's thrilling that these improvements are being made to the big digitization efforts — my guess is that they've added new blackletter typefaces to the OCR algorithm and reprocessed the previously-scanned images — but this highlights the dependency OCR technology has on well-known typefaces. Occasionally, when I tell friends about my software and the diaries I'm transcribing, I'm asked, "Why don't you just OCR the diaries?" Unfortunately, until someone comes with a OCR plugin for Julia Brumfield (age 72) and another for Julia Brumfield (age 88), we'll be stuck transcribing the diaries by hand.

Monday, February 4, 2008

Progress Report: Four N steps to deployment

I've completed one of the four steps I outlined below: my Rails app is now living in a SubVersion repository hosted somewhere further than 4 feet from where I'm typing this.

However, I've had to add a few more steps to the deployment process. These included:

Attempting to install Trac
Installing MySql on DreamHost
Installing SubVersion on DreamHost
Successfully installing BugZilla on DreamHost

None of these were included in my original estimate.

Name Update: FromThePage.com

I've finally picked a name. Despite its attractiveness, "Renan" proved unpronounceable. No wonder my ancestors said "REE-nan": it's at least four phonemes away from a native English word, and nobody who was shown the software was able to pronounce its title.

FromThePage is the new name. It not as lovely as some of the ones that came out of a brainstorming session (like "HandWritten"), but at least there are no existing software products that use it. I went ahead and registered fromthepage.com and fromthepage.org under the assumption that I'd be able to pull off the WordPress model of open-source software that's also hosted for a fee.

Monday, January 21, 2008

Four steps to deployment

Here are the things I need to do to deploy the 0.5 app on my shared hosting provider:

Install Capistrano and get it working
Upgrade my application stack to Rails 2.0
Switch my app from a subdirectory deep within another CVS module to its own Subversion module
Move the app to Dreamhost

But what order should I tackle this in? My temptation is to try deploying to Dreamhost via Capistrano, since I'm eager to get the app on a production server. Fortunately for my sanity, however, I read part of Cal Henderson's Building Scalable Websites this weekend. Henderson recommends using a staging site. While he probably had something different in mind, this seems like a perfect way to isolate these variables: get Capistrano scripts working on a staging location within my developent box, then once I really understand how deployment automation works, then point the scripts at the production server.

As for the rest, I'm not really sure when to do them. But I will try to tackle them one at a time.

Friday, January 18, 2008

What's next?

Now that I'm done with development driven only by my sense of what would be a good feature, it's time to move to step #2 in my year-old feature plan: deploying an alpha site.

I'm no longer certain about the second half of that plan -- other projects have presented themselves as opportunities that might have a more technically sophisticated user base, and thus might present more incremental enhancement requests. But getting the app to a server where I can follow Matt Mullenweg's advice and "become [my] most passionate user" seems more sensible now than ever.

Chris Wehner's SoldierStudies.org

This is the first of two reviews of similar transcription projects I wrote in correspondence with Brian Cafferelli, an undergraduate working on the WPI Manuscript Transcription Assistant. In this correspondence, I reviewed systems by their support for collaboration, automation, and analysis.

SoldierStudies.org is a non-academic/non-commercial effort like my own. It's a combined production-presentation system with simple but effective analysis tools. If you sign up for the site, you can transcribe letters you possess, entering metadata (name and unit of the soldier involved) and the transcription of the letter's text. You may also flag a letter's contents as belonging to N of about 30 subjects using a simple checkbox mechanism. The UI is a bit clunky in my opinion, but it actually has users (unlike my own program), so perhaps I shouldn't cast stones.

Nevertheless, SoldierStudies has some limitations. Most surprisingly, they are doing no image-based transcription whatsoever, even though they allow uploads of scans. Apparently those uploaded photos of letters are merely to authenticate that the user hasn't just created the letter out of nothing, and only a single page of a letter may be uploaded. Other problems seem inherent to the site's broad focus. SoldierStudies hosts some WebQuest modules intended for K-12 pedagogy. It also keeps copies of some letters transcribed in other projects, like letter from James Booker digitized as part of the Booker Letters Project at the University of Virginia. Neither of these seem part of the site's core goal to "to rescue Civil War letters before they are lost to future generations".

Unlike the pure-production systems like IATH MTD or WPI MTA, SoldierStudies transcriptions are presented dynamically. This allows full-text searching and browsing the database by metadata. Very cool.

So they've got automation mostly down (save the requirement that a scribe be in the same room as a text), analysis is pretty good, and there's a stab at collaboration, although texts cannot be revised by anybody but the original editor. Most importantly, they're online, actively engaged in preserving primary sources and making them accessible to the public via the web.

Thursday, January 17, 2008

Feature Triage for v1.0

I've been using this blog to brainstorm features since its inception. Partly this is to share my designs with people who may find them useful, but mainly it's been a way to flush this data out of my brain so that I don't have to worry about losing it.

Last night, I declared my app feature-complete for the first version. But what features are actually in it?

Let's do some data-flow analysis on the collaborative transcription process. Somebody out in the real world has a physical artifact containing handwritten text. I can't automate the process of converting that text into an image — plenty of other technologies do that already — so I have to start with a set of images capturing those handwritten pages. Starting with the images, the user must organize them for transcription, then begin transcription, then view the transcriptions. Only after that may they analyze or print those transcriptions.

Following the bulk of that flow, I've cut support for much of the beginning and end of the process, and pared away the ancillary features of page transcription. The resulting feature set is below, with features I've postponed till later releases ~~struck out~~.

Image Preparation
- Single image upload/replacement
- Image orientation/resolution controls
- ~~Upload several images~~
- Single-image titling
- Automatic title generation for several images
- Recto/verso image set collation
- Conversion of a set of titled images into pages of a transcribable work
Page Transcription
- Zoom
- Transcription text entry
- Subject links
- Auto-linking
- Request review
- ~~Unclear tags~~
- ~~Sensitive tags~~
Transcription Display
- Table of Contents for a work
- Page display
- Multi-page transcription display
- Page annotations
~~Printable Transcription~~
- ~~PDF generation~~
- ~~Typesetting~~
- ~~Table of contents page~~
- ~~Title/Author pages~~
- ~~Expansion footnotes~~
- ~~Subject article footnotes~~
- ~~Index~~
Text Analysis
- Subject Articles
- "What links here" indexes
- ~~Relatedness graphs~~ (Implemented, but turned off for v1.0)
- Subject categories
- Category-based navigation
- ~~Categorized relatedness graphs~~

Wednesday, January 16, 2008

Whew

I've just finished coding review requests, the last feature slated for release 1.0.

I figure I've completed 90% of the work to get to 1.0, so I'll dub this check-in Version 0.5.

The next step is to deploy to a production machine for alpha testing, which I expect will involve at least a few more code changes. But from here on out, every modification is a bug fix.

I'll try to post a bit more on my feature triage, my experience mis-applying the Evil Twin plugin design pattern, and my plans for the future.

Tuesday, November 13, 2007

Podcast Review: Matt Mullenweg at 2006 SF FOWA

Matt Mullenweg is an incredible speaker -- I've started going through the audio of every talk of his I can find online. His presentation at the 2006 Future of Web Apps (mp3, slides) is a review of the four projects he's worked on, and the lessons he's learned from them. It's a fantastic pep talk, and it's so meaty that I sometimes thought editing glitches had chopped out the transitions from one topic to another.

I see five take-aways for my project:

Don't think you aren't building an API. If things go well, other developers will be integrating your product into other systems, re-skinning it, writing plug-ins, or all of these. You'll be exposing internals of your app in ways you'll never imagine. If your functions are named poorly, if your database columns are inconsistently titled, they'll annoy you forever.

While you shouldn't optimize prematurely, try to get your architecture to "two" as soon as you can. What does "two" mean? Two databases, two appservers, two fileservers, and so on. Once that works, it's really easy to scale each of those to n as needed.

Present a "Contact Us" form somewhere the public can access it. Feedback from users who aren't logged in exposes a wide range of problems you don't get through integrated feedback tools.

Avoid options wherever possible.
Presenting a site admin with an option to do X or Y is a sort of easy way out -- the developer gets to avoid hard decisions, the team gets to avoid conflict. But options slow down the user, and most users never use a fraction of the optional features an app presents. Worse, the development effort is diffused forever forward -- an enhancement to X requires the same effort to be spent on Y for consistency's sake.
Plugins are the solution for user customization. Third-party plugins don't diffuse the development effort, since the people who really care about the feature spend their time on the plugin, while the core app team spend their efforts on the app.
The downside of plugins and themes is that you need to provide a way for users to find them. If you create a plugin system, you need to create some sort of plugin directory to support it.

Sunday, October 21, 2007

Podcast Review: "When Communities Attack!"

As predicted earlier, real life has consumed 100% of my energy for the last month or so, so I haven't made any progress on RefineScript/The Straightstone System/curingbarn.com/ManuscriptForge. I have, however, listened to some podcasts that I think are relevant to anyone else in my situation^*, so I think I'll start a series of reviews.

"When Communities Attack!" was presented at SXSWi07 by Chris Tolles, marketing director for Topix.net. You can download the MP3 at the SXSW Podcast website.

The talk covered the lessons learned about online communities through running a large-scale current-events message board system. Topix.net really seemed like it had been a crucible for user-created content, as Danish users argued with Yemenis about the Mohamed cartoons, or neighbors bickered with each other about cleaning up the local trailer park.

These were the points I thought relevant to my comment feature:

Give your site a purpose and encourage good behavior. Rather than discourage bad behavior, if your site has a purpose (like wikipedia) users have a metric to use to police themselves. Even debatable behavior can be tested against whether it helps transcribe the manuscript. This also prevents administrators from only acting as nags.
Geo-tag posts. Display the physical location where the comment comes from: "The commentary now autotags where the commenter is from. . . . The tone of the entire forum got a little more friendly once you start putting someone's town name next to [a post], because it turns out that no-one wants to bring shame to their town.
Anonymity breeds bad behavior. If people think their mother will read what they're writing, they're less likely to fly off the handle.
Don't erect time-consuming barriers to posting. It turns out that malevolent users have more free time than constructive users, and are actually more likely to register on the site and jump through hoops.
Management needs a presence. Like a beat cop, just making yourself visible encourages good behavior.
Expose user information like IP address. This can help the community police itself through shaming posters who use sock-puppet accounts.

[*] that is, any other micro-ISV building on collaboration software for the digital humanities.

Sunday, October 14, 2007

Matt Unger in the New York Times

Papa's Diary Project got a nice write up in today's New York Times.

I especially like that it's filed under "Urban Studies | Transcribing".

Tuesday, October 2, 2007

Name Update: Renan System pro and con

I've been referring to the software as the "Renan System" for the past few months. The connection to Julia Brumfield's community works well, and Google returns essentially nothing for it. The name is both generic and unique, so I registered renansystem.com.

There's just one problem: nobody can pronounce it.

The unincorporated community of Renan, Virginia is pronounced /REE nan/ by the locals, which occasionally gets them some kidding. I now understand why they say it that way - it's just about the only way to anglicize the word. /ray NAW/ is hard to say even if you don't attempt the voiced velar fricative or nasalized vowel. Since the majority of my hypothetical users will encounter the word in print alone, I have no idea how the pronunciation would settle out.

So unless there's some standard for saying "Renan" that I'm just missing, I'll have to start my name search again.

Friday, September 21, 2007

Progress Report: Printing

I just spent about two weeks doing what's known in software engineering as a "spike investigation." This is somewhat unusual, so it's worth explaining:

Say you have a feature you'd like to implement. It's not absolutely essential, but it would rank high in the priority queue if its cost were reasonably low. A spike investigation is the commitment of a little effort to clarify requirements, explore technology choices, create proof-of-concept implementations, and (most importantly) estimate costs for implementing that feature. From a developer's perspective you say "Pretend we're really doing X. Let's spend Y days to figure out how to do it, and how long that would take." Unlike other software projects, the goal is not a product, but 1) a plan and 2) a cost.

The Feature: PDF-download
According to the professionals, digitization projects are either oriented towards preservation (in which case the real-life copy may in theory be discarded after the project is done, but a website is merely a pleasant side-effect) or towards access (in which distribution takes primacy, and preservation concerns are an afterthought). FromThePage should enable digitization for access — after all, the point is to share all those primary sources locked away in somebody's file cabinet, like Julia Brumfield's diaries. Printed copies are a part of that access: when much of the audience is elderly, rural, or both, printouts really are the best vehicle for distribution.

The Plan: Generate a DocBook file, then convert it to PDF
While considering some PDF libraries for Ruby, I was fortunate enough to hear James Edward Gray II speak on "Ruby as a Glue Language". In one section called "shelling out", he talked about a requirement to produce a PDF when he was already rendering HTML. He investigated PDF libraries, but ended up piping his HTML through `html2ps | ps2pdf` and spending a day on the feature instead of several weeks. This got me looking outside the world of PDF-modifying Ruby gems and Rails plugins, at other documenting scripting languages. It makes a lot of sense — after all, I'm not hooking directly to the Graphviz engine for subject graphs, but generating .dot files and running neato on them.

I started by looking at typesetting languages like LaTeX, then stumbled upon DocBook. It's an SGML/XML-based formatting language which only specifies a logical structure. You divide your .docbook file into chapters, sections, paragraphs, and footnotes, then DocBook performs layout, applies typesetting styles, and generates a PDF file. Using the Rails templating system for this is a snap.

The Result:
See for yourself: This is a PDF generated from my development data. (Please ignore the scribbling.)

The Gaps:

Logical Gaps:
- The author name is hard-wired into the template. DocBook expects names of authors and editors to be marked up with elements like surname, firstname, othername, and heritage. I assume that this is for bibliographic support, but it means I'll have to write some name-parsing routine that converts "Julia Ann Craddock Brumfield" into "<firstname> Julia </firstname> <othername type="middle"> Ann </othername> <othername type="maiden"> Craddock </othername> <surname> Brumfield </surname>".
- There is a single chapter called "Entries" for an entire diary. It would be really nice to split those pages out into chapters based on the month name in the page title.
- Page ranges in the index aren't marked appropriately. You see "6,7,8" instead of "6-9".
- Names aren't subdivided (into surname, first name, suffix, etc.), and so are alphabetized incorrectly in the index. I suppose that I could apply the name-separating function created for the first gaps to all the subjects within a "Person" category to solve this.
Physical Layout: The footnote tags are rendering as end notes. Everyone hates end notes.
Typesetting: The font and typesetting betrays DocBook's origins in the world of technical writing. I'm not sure quite what's appropriate here, but "Section 1.3.4" looks more like a computer manual than a critical edition of someone's letters.

The Cost:
Fixing the problems with personal names requires a lot of ugly work with regular expressions to parse names, amounting to 20-40 hours to cover most cases for authors, editors, and indices. The work for chapter divisions is similar in size. I have little idea how easy it will be to fix the footnote problem, as it involves learning "a Scheme-like language" used for parsing .docbook files. Presumably I'm not the first person to want footnotes to render as footnotes, so perhaps I can find a .dsssl file that does this already. Finally, the typesetting should be a fairly straightforward task, but requires me to learn a lot more about CSS than the little I currently know. In all, producing truly readable copy is about a month's solid work, which works out to 4-6 months of calendar time at my current pace.

The Side-Effects:
Generating a .docbook file is very similar to generating any other XML file. Extending the printing code for exporting works in TEI or a FromThePage-specific format will only take 20-40 hours of work. Also, DocBook can be used to generate a set of paginated, static HTML files which users might want for some reason.

The Conclusions:
It's more important that I start transcribing in earnest to shake out problems with my core product, rather than delaying it to convert end notes to footnotes. As a result, printing is not slated for the alpha release.

Thursday, September 20, 2007

Money: Projected Costs

What are the costs involved making FromThePage a going concern? I see these five classes of costs:

DNS
Initial hosting bills
Marginal hosting fees associated with disk usage, cpu usage, or bandwidth served
Labor by people with skills neither Sara nor I possess
Labor by people with skills that Sara or I do possess, but do not have time or energy to spend

I can predict the first two with some degree of accuracy. I've already paid for a domain name, and the hosting provider I'm inclined towards costs around $20/month. When it comes to the cost of hosting other people's works for transcription, however, I have no idea at all what to expect.

I have started reading about start-up costs, and this week I listened to the SXSW panel "Barenaked App: The Figures Behind the Top Web Apps" (podcast, slides). What I find distressing about this panel is that the figures involved are so large: $20,000-$200,000 to build an application that costs $2000-$8000 per month for hardware and hosting! It's hard to figure out how comparable my own situation is to these companies, since I don't even have a paid host yet.

This big unknown is yet another argument for a slow rollout — not only will alpha testers supply feedback about bugs and usability, the usage patterns for their collections will give me data to figure out how much an n-page collection with m volunteers is likely to increase my costs. That should provide about half of the data I need to decide on a non-commercial open-source model versus a purely-hosted model.

Wednesday, September 19, 2007

Progress Report: Subject Categories

It's been several days since I updated this blog, but that doesn't mean I've been sitting idle.

I finished a basic implementation of subject categories a couple of weeks ago. I decided to go with hierarchical categories, as is pretty typical for web content. Furthermore, the N:N categorization scheme I described back in April turned out to be surprisingly simple to implement. There are currently three different ways to deal with categories:

Owners may add, rename, and delete categories within a collection.
Scribes associate or disassociate subjects with a category. The obvious place to put this was on the subject article edit screen, but a few minutes of scribal use demonstrated that this would lead to lots of uncategorized articles. Since transcription projects that don't care about subject indexing aren't likely to use the indexes anyway, I added a data-cleanup step to the transcription screen. Now, whenever a page contains a new, uncategorized subject reference, I display a separate screen when the transcription is saved. This screen shows all the uncategorized subjects for that page, allowing the scribe to categorize any subjects they've created.
Viewers see a category treeview on the collection landing page as well as on the work reader. Clicking a category lists subjects for that category, and clicking the subject link lists links to navigate to the pages referring to that subject.

The viewer treeview presents the most opportunities, thus the most difficulties from a UI perspective. Should a subject link load the subject article instead of the page list? Should it refer to a reader view of pages including that subject? When viewing a screen with only a few pages from one work, should the category tree only display terms used on that screen, or on the work, or on all works from the collection the work is a part of? I'm really not sure what the answer is. For the moment, I'm trying to achieve consistency at the cost of flexibility: the viewer will always see the same treeview for all pages within a collection, regardless of context.

Future ideas include:

Category filtering for subject graphs -- this would really allow analysis of questions like "what was the weather when people held dances?" without the need to wade through a cluttered graph.
Viewing the text of all pages that contain a certain category on the same page, with highlighting of the term within that category.

Tuesday, September 11, 2007

Feature: Comments (Part 2: What is commentable?)

Now that I've settled on the types of comments to support, where do they appear? What is commentable?

I've given a lot of thought to this lately, and have had more than one really good conversation about it. In a comment to an earlier post, Kathleen Burns recommended I investigate CommentPress, which supports annotation on a paragraph-by-paragraph level with a spiffy UI. At the Lone Star Ruby Con this weekend, Gregory Foster pointed out the limitations of XML for delimiting commentable spans of text if those spans overlap. As one door opens, another one closes.

What kinds of things can comments (as broadly defined below) be applied to? Here's my list of possiblities, with the really exciting stuff at the end:

Users: See comments on user profile pages at LibraryThing.
Articles: See annotations to article pages at Pepys' Diary.
Collections: Since these serve as the main landing page for sites on FromThePage, it makes sense to have a top-level discussion (albeit hidden on a separate tab).
Works: For large works, such as the Julia Brumfield diaries, discussions of the work itself might be appropriate. For smaller works like letters, annotating the "work" might make more sense than annotating individual pages.
Pages: This was the level I originally envisioned for comment support. I still think it's the highest priority, based on the value comments add to Pepys' Diary, but am no longer sure it's the smallest level of granularity worth supporting.
Image "chunks": Flickr has some kind of spiffy JavaScript/DHTML goodness that allows users to select a rectangle within an image and post comments on that. I imagine that would be incredibly useful for my purposes, especially when it comes to arguing about the proper reading of a word.
Paragraphs within a transcription: This is what CommentPress does, and it's a really neat idea. They've got an especially cool UI for it. But why stop at paragraphs?
Lines within a transcription: If it works for paragraphs, why not lines? Well, honestly you get into problems with that. Perhaps the best way to handle this is to have comments reference a line element within the transcription XML. Unfortunately, this means that you have to use LINE tags to be able to locate the line. Once you've done that, other tags (like my subject links) can't span linebreaks.
Spans of text within a transcription: Again, the nature of XML disallows overlapping elements. As a result, in the text "one two three", it would be impossible to add a comment to "one two" and a different comment to "two three".
Points within a transcription: This turns out to be really easy, since you can use an empty XML tag to anchor the comment. This won't break the XML heirarchy, and you might be able to add an "endpoint" or "startpoint" attribute to the tag that (when parsed by the displaying JavaScript) would act like 8 or 9.

Feature: Comments (Part 1: What is a comment?)

Back in June, Gavin Robinson and I had a conversation about comments. The problem with comments is figuring out whom they're directed to. Annotations like those in Pepys' Diary Online are directed towards both the general public and the community of Pepys "regulars". Sites built on user-generated content (like Flickr, or Yahoo Groups) necessitate "flag as inappropriate" functionality, in which the general public alerts an administrator of a problem. And Wikipedia overloads both their categorization function and their talk pages to flag articles as "needing attention", "needing peer-review", or simply "candidates for deletion".

If you expand the definition of "comments" to encompass all of these — and that's an appropriate expansion in this case, since I expect to use the same code behind the scenes — I see the following general types of comments as applicable to FromThePage documents:

Annotations: Pure annotations are comments left by users for the community of scribes and the general public. They don't have "state", and they don't disappear unless they're removed by their author or a work owner. Depending on configuration, they might appear in printed copies.
Review Requests: These are requests from one scribe to another to proofread, double-check transcriptions, review illegible tags, or identify subjects. Unlike annotations, these have state, in that another scribe can mark a request as completed.
Problem Reports: These range from scribes reporting out-of-focus images to readers reporting profane content and vandalism. These also have state, much like review requests. Unlike review requests, they have a more specific target — only the work owner can correct an image, and only an admin can ban a vandal.

Friday, August 31, 2007

Progress Report: Layout, Usability, and Collections

I've gotten a lot done on FromThePage over the last month:

I finished a navigation redesign that unites the different kind of actions you can take on a work, a page, or a subject article around those objects.
I added preview and error handling functionality to the transcription screen
We — well, Sara actually — figured out an appropriate two-column layout and implemented stylesheets based on Ruthsarian Layouts and Plone Clone.
Zoomable page images now appear in the page view screen in addition to the transcription screen.
I created a unififed, blog-style work reader to display transcriptions of multiple pages from the same work. This replaces the table of contents as the work's main screen, a change I think will be especially useful for letters or other short works.
"About this Work" now contains fields suggested by the TEI Manuscript Transcription guidelines tracking document history, physical description, etc.
Scribes can now rename subjects, and all the links embedded within transcription text will be updated. This is an important usability features, since you can now link incomplete terms like "Mr Edmonds" with the confidence that the entry can be corrected and expanded upon later research.
Full subject titles are expanded when the mouse pointer hovers over links in the text: i.e. "Josie" (try it!).
Works are now aggregated into collections. Articles are no longer global, but belong to a particular collection. This prevents work owners with unrelated data from seeing each other's articles if they have the same titles. Collection pages could also serve as a landing page for a FromThePage site, with a short URL and perhaps custom styles.

Wednesday, July 25, 2007

Money: Possible Revenue Sources

Let's return to the subject of money. If I thought the market for collaborative manuscript transcription software were a lucrative one, I might launch a business based on selling FromThePage (If you wonder why I keep referring to different software products, it's because I'm trying out different names to see how they feel. See my naming post for more details.)

I sincerely doubt that the market could sustain even a single salary, but it's still a useful exercise to list revenue opportunities for a hypothetical "Renan Systems, Inc."

Fee-for-hosting (including microsponsorships)
This is the standard model for the companies selling what we used to call Application Service Providers and now call something high-falutin' like "Software as a Service." In this model, a user with deep pockets pays a monthly fee to use the software. Their data is stored by the host (Today's Name, Inc.), and in addition to whatever access they have, the public can see whatever the client wants them to see.

This is the most likely model I see for FromThePage. Work owners would upload their documents and be charged some monthly or yearly rate. Neither scribes nor viewers would pay anything — all checks would come from the work owner.

There's one exception to this single-payer-per-work rule, and it's a pretty neat one. One of the panels at South by Southwest this year discussed microsponsorships (see podcast here, beginning at around 8 minutes). This is an idea that's been used to allow an audience to fund an independent content provider: you like X's podcasts, so you donate $25 and your name appears beside your favorite podcast. The $25 goes to X, presumably to support X's work.

The nature of family history suggests microsponsorships as an excellent way for a work owner to fund a site. The people involved in a family history project have amazingly diverse sets of skills and resources. One person may have loads of interpretive context but poor computer skills and a dial-up connection. Another may have loads of time to invest, but no money. And often there is a family member with great interest but little free time, who's willing to write a check to see a project through. Microsponsorship allows that last person to enable the previous two, advancing the project for them all.

Donations
Hey, it works for Wikipedia!

License to other hosting providers
Perhaps I don't want to get into the hosting business at all. After all, technical support and billing are hassles, and I've never had to deal with accounting before. If there were commercial value to FromThePage, another hosting company might buy the software as an add-on.

Where this really does become a possiblity is for the big genealogy sites to add FromThePage to their existing hosting services. I can see licensing the software to one of them simultaneously with hosting my own.

Affiliate sales
The only candidate for this is publish-on-demand printing, a service I'd like to offer anyway. For finished manuscript transcriptions, in addition to a PDF download, I'd like to offer the ability to print the transcription and have it bound and shipped. Plenty of services exist to do this already given a PDF as an input, so I can't imagine it would be too hard to hook into one of them.

Advertising
Ugh. It's hard to see much commercial value to an advertiser, unless the big genealogy sites have really impressive, context-sensitive APIs. And besides, ugh.

Thursday, July 12, 2007

Feature: Subect Graphs

Inspired by Bill Turkel's bibliography clusters, I spent the last week playing with GraphViz. Here is a relatedness graph generated by the subject page for my great grandfather:

One big problem with this graph is that it shows too many subjects. Once I add subject categorization, a filtered graph becomes a way of answering interesting questions. Viewing the graph for "cold" and only displaying subjects categorized as "activities" could tell you what sort of things took place on the farm in winter weather, for example.

Another issue is that the graph doesn't effectively display information. Nodes vary based on size and distance. Size is merely a function of the label length, so is meaningless. Distance from the subject is the result of the "relatedness" between the node and the subject, measured by page occurrences. This latter metric is what I'm trying to calculate, but it's not presented very well. I'll probably try to reinforce the relatedness by varying color intensity using the Graphviz color schemes, or suppress the label length by forcing nodes into a fixed size or shape.

Of course, common page links are only one way to relate subjects. It's easier, if less interesting, to graph the direct link between one subject article and another. Given more data, I could also show relationships between users based on the pages/articles/works they'd edited in common, or the relationships between works based on their shared editors. A final thing to explore is the Graphviz ImageMap output format, which apparently results in clickable nodes.

I'll put up a technical post on the subject once I split the dot-generation out into Rails' Builder — currently it's just a mess of string concatenation.

Feature: Transcription Versions

Page/Subject Article Versions
Last week I added versioning to articles and pages. The goal was to allow access to previous edits via a version page akin to the MediaWiki history tab.

Gavin Robinson suggested a system of review and approval before transcription changes go live, but I really think that this doesn't fit well with my user model. For one thing, I don't expect the same kinds of vandalism problems you see in Wikipedia to affect FromThePage works much, since the editors are specifically authorized by the work owner. For another, I can't imagine the solo-owner/scribe would tolerate having to submit and approve each one of their edits for long. Finally, since this is designed for a loosely-coupled, non-institutional user community, I simply can't assume that the work owner will check the site each day to review and approve changes. Projects must be able to keep their momentum without intervention by the work owner for months at a time.

His concerns are quite valid, however. Perhaps an alternative approach to transcription quality is to develop a few more owner tools, like a bulk review/revert feature for contributions made since a certain date or by a certain user.

Work Versions
Later, I'll put up a technical post on how I accomplished this with Rails after_save callbacks, but for now I'd like to talk about "versions" of a perpetually-editable work. What exactly does this mean? If a user prints out or downloads a transcription between one change and the next, how do you indicate that?

To address this, I decided to add the concept of a work's "transcription version". This is an additional attribute of the work itself, and every time an edit is made to any one of the work's pages, the work itself has its transcription version incremented. By recording the transcription version of the work in the page version record as well, I should be able to reconstruct the exact state of the digital work from a number added to an offline copy of the work.

I decided on transcription_version as an attribute name because comments and perhaps subject articles may change independently of the work's transcribed text. A printout that includes commentary needs a comment_version as well as a transcription_version. The two attributes seem orthogonal, because two transcription-only prints of the same work shouldn't appear different because a user has made an unprinted annotation.

Sunday, July 1, 2007

UI and Other Fun Stuff

Sara and I spent about an hour yesterday talking about the Renan System's UI on our drive to Houston. For someone who is frightened and confused by user interface design, this turned out to be surprisingly pleasant. She's recommended a two-column design, since the transcription page is necessarily broken into one column for the transcription form and one column for the manuscript page image. In coding things like article links, I find that this layout is generalizable given the structure of most of my data: ordinarily, text (the "important" stuff) lives in the leftmost pane, while images, links, and status panels live in the right.

Forcing myself to think about what to put in the right pane turned into a brainstorming session. When a user views an article, it seems obvious that a list of pages that link to that article should be visible. But thinking "what else goes in the article's 'fun stuff' pane?" made me realize that there were all sorts of ways to slice the data. For example, I could display a list of articles mentioned most frequently in pages that mentioned the viewed article. Rank this by number of occurrences, or rank it by percentage of occurances, and you get very different results: "What does it mean if 'Franklin' only occurs with 'Edna', but never in any other context?" I could also display timelines of article links, or which scribe had contributed the most to the article.

Going through the same process for the work page, the page list page, the scribe or viewer dashboards and such has generated a month's worth of feature ideas. Maybe there's something to this UI design after all!

Thursday, June 28, 2007

Progress Report: Article Links

Well, that was fast!

Four days after announcing that this blog would take a speculative turn (read "stall") while I spent months on article links, linking pages to articles works!

It only took me about ten hours of coding and research to learn to use the XML parser, write the article and link schema, process typed transcriptions into a canonical XML form with appropriate RDBMS updates, and transform that into HTML when displaying a page. In fact, the most difficult part was adding support for WikiMedia-style square-brace links.

After implementing the article links, it only took 15 minutes of transcription to discover that automated linking is a MUST-DO feature, rather than the NICE-TO-HAVE priority I'd assigned it. It's miserable to type [[Sally Joseph Carr Brumfield|Josie]] every time Julia mentions the daughter-in-law who lives with her.

Tuesday, June 26, 2007

Matt Unger on Article Links

Matt Unger has been kind enough to send his ideas on article links within transcription software. I've posted my plans here before, but Matt's ideas are borne of his experience after six months of work on Papa's Diary. Some of his ideas are already included by my link feature, others may inform FromThePage, but I think his specification is worth posting in its entirety. If I don't use his suggestions in my software, perhaps someone else will.

One of my frustrations with Blogger (and I'm sure this would be my frustration with any blog software since my coding skills start and end with HTML 2.0) is that I can't easily create/show the definitions of terms or explain the backgrounds of certain people or organizations without linking readers to other pages.

Here's what my dream software would do from the reader's point of view:

Individual words in each entry would appear in a different color if there was background material associated with it. On rollover, a popup window would give basic background information on the term: if it was a person, there'd be a thumbnail photo, a short bio, and a link to whatever posts involved that person if the reader wanted to see more; if it was the name of an organization, we might see a short background blurb and related links; if it was a slang term from the past, we might see its definition; if it was a reference to a movie, the popup window might play a clip, etc.

Here's how it would work from my perspective:

When I write a post, the software would automatically scan my text and create a link for each term it recognizes from a glossary that I would have defined in advance. This glossary would include all the text, links, and media assets associated with each term. The software would also scan my latest entry and compare it to the existing body of work and pick out any frequently-mentioned terms that don't have a glossary definition and ask me if I wanted to create a glossary entry for them. For example, if I frequently mention The New York Times, it would ask me if I wanted to create a definition for it. I could choose to say "Yes," "Not Now" or "Never." If I chose yes and created the definition, I'd have the option of applying the definition to all previous posts or only to future posts.

The application would also display my post in a preview mode pretty much as a regular reader would see it. If I were to roll over a term that had a glossary term associated with it, I'd see whatever the user would see but I would also have a few admin options in the pop-up window like: Deactivate (in case I didn't want it to be a rollover); Associate A Different Definition (in case I wanted to show another asset than usual for a term). If I didn't do anything to the links, they would simply default to the automatically-created rollovers when I confirm the post submission.

So, that's my dream (though I would settle for the ability to create a pop-up with some HTML in a pinch).

Matt adds:

One thing I forgot to mention, too, is that I would also want to be able to create a link to another page instead of a pop-up once in a while. I suppose this could just be another admin option, and maybe the user would see a different link color or some other visual signal if the text led to a pop-up rather than another page.