Monday, December 21, 2009

Feature: Related Pages

I've been thinking a lot about page-to-subject links lately as I edit and annotate Julia Brumfield's 1921 diary. While I've been able to exploit the links data structure in editing, printing, analyzing and displaying the texts, I really haven't viewed it as a way to navigate from one manuscript page to another. In fact, the linkages I've made between pages have been pretty boring -- next/previous page links and a table of contents are the limit. I'm using the page-to-subject links to connect subjects to each other, so why not pages?

The obvious answer is that the subjects which page A would have most in common with page B are the same ones it would have in common with nearly every other page in the collection. In the corpus I'm working with, the diarist mentions her son and daughter-in-law in 95% of pages, for the simple reason that she lives with them. If I choose two pages at random, I find that March 12, 1921 and August 12, 1919 both contain Ben and Jim doing agricultural work, Josie doing domestic work, and Julia's near-daily visit to Marvin's. The two pages are connected through those four subjects (as well as this similarly-disappointing "dinner"), but not in a way that is at all meaningful. So I decided that a page-to-page relatedness tool couldn't be built from the page-to-subject link data.

All that changed two weeks ago, when I was editing the 1921 diary and came across the mention of a "musick box". In trying to figure out whether or not Julia was referring to a phonograph by the term, I discovered that the string "musick box" occurred only two times: when the phonograph was ordered and the first time Julia heard it played. Each one of these mentions shed so much light on the other that I was forced to re-evaluate how pages are connected through subjects. In particular, I was reminded of the "you and one other" recommendations that LibraryThing offers. This is a feature that finds other users with whom you share an obscure book. In this case, obscurity is defined as the book occurring only twice in the system: once in your library, once in the other user's.

This would be a relatively easy feature to implement in FromThePage. When displaying a page, perform this algorithm:
  • For each subject link in the page, calculate how many times its subject is referenced within the collection,
  • Sort those subjects by reference count,
  • Take the 3 or 4 subjects with the lowest reference counts, and
  • Display the other pages which link to those subjects.
For a really useful experience, I'd want to display keyword-in-context, showing a few words to explain the context in which that other occurrence of "musick box" appears.
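
Here's a minimal sketch of that algorithm in Rails-flavored Ruby. The model and association names (Page, Subject, PageSubjectLink) are assumptions for illustration, not the actual FromThePage schema:

```ruby
# Sketch only: model and association names are assumptions.
class Subject < ActiveRecord::Base
  has_many :page_subject_links
  has_many :pages, :through => :page_subject_links
end

class Page < ActiveRecord::Base
  has_many :page_subject_links
  has_many :subjects, :through => :page_subject_links

  # Pages that share this page's rarest subjects.
  def related_pages(limit = 3)
    # Rank this page's subjects by how often they're referenced in the collection.
    rarest = subjects.sort_by { |subject| subject.pages.count }.first(limit)
    # Collect every other page linked to those obscure subjects.
    rarest.map { |subject| subject.pages }.flatten.uniq - [self]
  end
end
```

The keyword-in-context display would be a separate step, pulling a few words around the matching link text on each related page.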

Friday, June 26, 2009

Connecting With Readers

While editing and annotating Julia Brumfield's 1919 diary, I've tried to do research on the people who appear there. Who was Josie Carr's sister? Sites like FindAGrave.com can help, but the results may still be ambiguous: there are two Alice Woodings buried in the area, and either could be a match.

These questions could be resolved pretty easily through oral interviews -- most of the families mentioned in the diaries are still in the area, and a month spent knocking on doors could probably flesh out the networks of kinship I need for a complete annotation. However, that's really not time I have, and I can't imagine cold-calling strangers to ask nosy questions about their families -- I'm a computer programmer, after all.

It turns out that there might be an easier way. After Sara installed Google Analytics on FromThePage, I've been looking at referral log reports that show how people got to the site. Here's the keyword report for June, showing what keywords people were searching for when they found FromThePage:

Keyword                      Visits   Pages/Visit   Avg. Time on Site
"tup walker"                     21        12.48              992.86
"letcher craddock"                7        12.43              890.14
julia craddock brumfield          3        28.00              624.33
juliacraddockbrumfield            3        74.33             1385.00
"edwin mayhew"                    2         7.00              117.50
"eva mae smith"                   2         4.00               76.50
"josie carr" virginia             2         6.50              117.00
1918 candy                        2         4.00               40.00
clack stone hubbard               2        55.50             1146.00

These website visitors are fellow researchers, trying to track down the same people that I am. I've got them on my website, they're engaged -- sometimes deeply so -- with the texts and the subjects, but I don't know who they are, and they haven't contacted me. Here are a couple of ideas that might help:
  1. Add an introductory HTML block to the collection homepage. This would allow collection editors to explain their project, solicit help, and provide whatever contact information they choose.
  2. Add a 'contact us' footer to be displayed on every page of the collection, whether the user is viewing a subject article, reading a work, or viewing a manuscript page. Since people are finding the site via search engines, they're navigating directly to pages deep within a collection. We need to display 'about this project', 'contact us', or 'please help' messages on those pages.
One idea I think would not work is to build comment boxes or a 'contact us' form. I'm trying to establish a personal connection to these researchers, since in addition to asking "who is Alice Wooding", I'd like to locate other diaries or hunt down other information about local history. This is really best handled through email, where the barriers to participation are low.

Tuesday, June 2, 2009

Interview with Hugh Cayless

One of the neatest things to happen in the world of transcription technology this year was the award of an NEH ODH Digital Humanities Start-Up Grant to "Image to XML", a project exploring image-based transcription at the line and word level. According to a press release from UNC, this will fund development of "a product that will allow librarians to digitally trace handwriting in an original document, encode the tracings in a language known as Scalable Vector Graphics, and then link the tracings at the line or even word level to files containing transcribed texts and annotations." This is based on the work of Hugh Cayless in developing Img2XML, which he has described in a presentation to Balisage, demonstrated in a static demo, and shared in a github repository.

Hugh was kind enough to answer my questions about the Img2XML project and has allowed me to publish his responses here in interview form:


First, let me congratulate you on img2xml's award of a Digital Humanities Start-Up Grant. What was that experience like?

Thanks! I've been involved in writing grant proposals before, and sat on an NEH review panel a couple of years ago. But this was the first time I've been the primary writer of a grant. Start-Up grants (understandably) are less work than the larger programs, but it was still a pretty intensive process. My colleague at UNC, Natasha Smith, and I worked right down to the wire on it. At research institutions like UNC, the hard part is not the writing of the proposal, but working through the submission and budgeting process with the sponsored research office. That's the part I really couldn't have done in time without help.

The writing part was relatively straightforward. I sent a draft to Jason Rhody, who's one of the ODH program officers, and he gave us some very helpful feedback. NEH does tell you this, but it is absolutely vital that you talk to a program officer before submitting. They are a great resource because they know the process from the inside. Jason gave us great feedback, which helped me refine and focus the narrative.

What's the relationship between img2xml and the other e-text projects you've worked on in the past? How did the idea come about?

At Docsouth, they've been publishing page images and transcriptions for years, so mechanisms for doing that had been on my mind. I did some research on generating structural visualizations of documents using SVG a few years ago, and presented a paper on it at the ACH conference in Victoria, so I'd some experience with it. There was also a project I worked on while I was at Lulu where I used Inkscape to produce a vector version of a bitmap image for calendars, so I knew it was possible. When I first had the idea, I went looking for tools that could create an SVG tracing of text on a page, and found potrace (which is embedded in Inkscape, in fact). I found that you can produce really nice tracings of text, especially if you do some pre-processing to make sure the text is distinct.

What kind of pre-processing was necessary? Was it all manual, or do you think the tracing step could be automated?

It varies. The big issue so far has been sorting out how to distinguish text from background (since potrace converts the image to black and white before running its tracing algorithm), particularly with materials like papyrus, which is quite dark. If you can eliminate the background color by subtracting it from the image, then you don't have to worry so much about picking a white/black cutover point--the defaults will work. So far it's been manual. One of the agendas of the grant is to figure out how much of this can be automated, or at least streamlined. For example, if you have a book with pages of similar background color, and you wanted to eliminate that background as part of pre-processing, it should be possible to figure out the color range you want to get rid of once, and do it for every page image.

I've read your Balisage presentation and played around with the viewer demonstration. It looks like img2xml was in proof-of-concept stage back in mid 2008. Where does the software stand now, and how far do you hope to take it?

It hasn't progressed much beyond that stage yet. The whole point of the grant was to open up some bandwidth to develop the tooling further, and implement it on a real-world project. We'll be using it to develop a web presentation of the diary of a 19th century Carolina student, James Dusenbery, some excerpts from which can be found on Documenting the American South at http://docsouth.unc.edu/true/mss04-04/mss04-04.html.

This has all been complicated a bit by the fact that I left UNC for NYU in February, so we have to sort out how I'm going to work on it, but it sounds like we'll be able to work something out.

It seems to me that you can automate generating the SVG's pretty easily. In the Dusenbery project, you're working with a pretty small set of pages and a traditional (i.e. institutionally-backed) structure for managing transcription. How well suited do you think img2xml is to larger, bulk digitization projects like the FamilySearch Indexer efforts to digitize US census records? Would the format require substantial software to manipulate the transcription/image links?

It might. Dusenbery gives us a very constrained playground, in which we're pretty sure we can be successful. So one prong of attack in the project is to do something end-to-end and figure out what that takes. The other part of the project will be much more open-ended and will involve experimenting with a wide range of materials. I'd like to figure out what it would take to work with lots of different types of manuscripts, with different workflows. If the method looks useful, then I hope we'll be able to do follow-on work to address some of these issues.

I'm fascinated by the way you've cross-linked lines of text in a transcription to lines of handwritten text in an SVG image. One of the features I've wanted for my own project was the ability to embed a piece of an image as an attribute for the transcribed text -- perhaps illustrating an unclear tag with the unclear handwriting itself. How would SVG make this kind of linking easier?

This is exactly the kind of functionality I want to enable. If you can get close to the actual written text in a referenceable way then all kinds of manipulations like this become feasible. The NEH grant will give us the chance to experiment with this kind of thing in various ways.

Will you be blogging your explorations? What is the best way for those interested in following its development to stay informed?

Absolutely. I'm trying to work out the best way to do this, but I'd like to have as much of the project happen out in the open as possible. Certainly the code will be regularly pushed to the github repo, and I'll either write about it there, or on my blog (http://philomousos.blogspot.com), or both. I'll probably twitter about it too (@hcayless). I expect to start work this week...


Many thanks to Hugh Cayless for spending the time on this interview. We're all wishing him and img2xml the best of luck!

Sunday, May 17, 2009

Review: USGS North American Bird Phenology Program

Who knew you could track climate change through crowdsourced transcription? The smart folks at the U. S. Geological Survey, that's who!

The USGS North American Bird Phenology program encouraged volunteers to submit bird sightings across North America from the 1880s through the 1970s. These cards are now being transcribed into a database for analysis of migratory pattern changes and what they imply about climate change.

There's a really nice DesertUSA NewsBlog article that covers the background of the project:
The cards record more than a century of information about bird migration, a veritable treasure trove for climate-change researchers because they will help them unravel the effects of climate change on bird behavior, said Jessica Zelt, coordinator of the North American Bird Phenology Program at the USGS Patuxent Wildlife Research Center.

That is — once the cards are transcribed and put into a scientific database.

And that’s where citizens across the country come in - the program needs help from birders and others across the nation to transcribe those cards into usable scientific information.

CNN also interviewed a few of the volunteers:
Bird enthusiast and star volunteer Stella Walsh, a 62-year-old retiree, pecks away at her keyboard for about four hours each day. She has already transcribed more than 2,000 entries from her apartment in Yarmouth, Maine.

"It's a lot more fun fondling feathers, but, the whole point is to learn about the data and be able to do something with it that is going to have an impact," Walsh said.

Let's talk about the software behind this effort.

The NABPP is fortunate to have a limited problem domain. A great deal of standardization was imposed on the manuscript sources themselves by the original organizers, so that, for example, each card describes only a single species and location. In addition, the questions the modern researchers are asking of the corpus also limit the problem domain: nobody's going to be doing analysis of spelling variations between the cards. It's important to point out that this narrow scope exists in spite of wide variation in format between the index cards. Some are handwritten on pre-printed cards, some are type-written on blank cards, and some are entirely freeform. Nevertheless, they all describe species sightings in a regular format.

Because of this limited scope, the developers were (probably) able to build a traditional database and data-entry form, with specialized attributes for species, location, date, or other common fields that could be generalized from the corpus and the needs of the project. That meant custom-building an application specifically for the NABPP, which seems like a lot of work, but it does not require building the kind of Swiss Army Knife that medieval manuscript transcription requires. This presents an interesting parallel with other semi-standardized, hand-written historical documents like military muster rolls or pension applications.

One of the really neat possibilities of subject-specific transcription software is that you can combine training users on the software with training them on difficult handwriting, or variations in the text. NABPP has put together a screencast for this, which walks users through transcribing a few cards from different periods, written in different formats. This screencast explains how to use the software, but it also explains more traditional editorial issues like what the transcription conventions are, or how to process different formats of manuscript material.

This is only one of the innovative ways the NABPP deals with its volunteers. I received a newsletter by email shortly after volunteering, announcing their progress to date (70K cards transcribed) and some changes in the most recent version of the software. This included some potentially-embarrassing details that a less confident organization might not have mentioned, but which really do a lot of good. Users may get used to workarounds for annoying bugs, but in my experience they still remember them and are thrilled when those bugs are finally fixed. So when the newsletter announces that "The Backspace key no longer causes the previous page to be loaded", I know that they're making some of their volunteers very happy.

In addition to the newsletter, the project also posts statistics on the transcription project, broken down both by volunteer and by bird. The top-ten list gives the game-like feedback you'd want in a project like this, although I'd be hesitant to foster competition in a less individually-oriented project. They're also posting preliminary analyses of the data, including the phenology of barn swallows, mapped by location and date of first sighting, and broken down by decade.

Congratulations to the North American Bird Phenology Program for making crowdsourced transcription a reality!

Saturday, May 2, 2009

Open Source vs. Open Access

I've reached a point in my development project at which I'd like to go ahead and release FromThePage as Open Source. There are now only two things holding me back. I'd really like to find a project willing to work together with me to fix any deployment problems, rather than posting my source code on GitHub and leaving users to fend for themselves. The other problem is a more serious issue that highlights what I think is a conflict between Open Access and Open Source Software.

Open Source/Free Software and Rights of Use

Most of the attention paid to Open Source software focuses on the user's right to modify the software to suit their needs and to redistribute that (or derivative) code. However, there is a different, more basic right conferred by Free and Open source licenses: the user's right to use the software for whatever purpose they wish. The Free Software Definition lists "Freedom 0" as:
  • The freedom to run the program, for any purpose.
    Placing restrictions on the use of Free Software, such as time ("30 days trial period", "license expires January 1st, 2004") purpose ("permission granted for research and non-commercial use", "may not be used for benchmarking") or geographic area ("must not be used in country X") makes a program non-free.
Meanwhile, the Open Source Definition's sixth criterion is:
6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
Traditionally this has not been a problem for non-commercial software developers like me. Once you decide not to charge for the editor, game, or compiler you've written, who cares how it's used?

However, if your motivation in writing software is to encourage people to share their data, as mine certainly is, then restrictions on use start to sound pretty attractive. I'd love for someone to run FromThePage as a commercial service, hosting the software and guiding users through posting their manuscripts online. It's a valuable service, and is worth paying for. However, I want the resulting transcriptions to be freely accessible on the web, so that we all get to read the documents that have been sitting in the basements and file folders of family archivists around the world.

Unfortunately, if you investigate the current big commercial repositories of this sort of data, you'll find that their pricing/access model is the opposite of what I describe. Both Footnote.com and Ancestry.com allow free hosting of member data, but both lock browsing of that data behind a registration wall. Even if registration is free, that hurdle may doom the user-created content to be inaccessible, unfindable or irrelevant to the general public.

Open Access

The open access movement has defined this problem with regards to scholarly literature, and I see no reason why their call should not be applied to historical primary sources like the 19th/20th century manuscripts FromThePage is designed to host. Here's the Budapest Open Access Initiative's definition:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.
Both the Budapest and Berlin definitions go on to talk about copyright quite a bit, however since the documents I'm hosting are already out-of-copyright, I don't really think that they're relevant. What I do have control over is my own copyright interest in the FromThePage software, and the ability to specify whatever kind of copyleft license I want.

My quandary is this: none of the existing Free or Open Source licenses allow me to require that FromThePage be used in conformance with Open Access. Obviously, that's because adding such a restriction -- requiring users of FromThePage not to charge for people reading the documents hosted on or produced through the software -- violates the basic principles of Free Software and Open Source. So where do I find such a license?

Have other Open Access developers run into such a problem? Should I hire a lawyer to write me a sui generis license for FromThePage? Or should I just get over the fear that someone, somewhere will be making money off my software by charging people to read the documents I want them to share?

Sunday, April 5, 2009

Feature: Full Text Search/Article Link integration

In the last couple of weeks, I've implemented most of the features in the editorial toolkit. Scribes can identify unannotated pages from the table of contents, readers can peruse all pages in a collection linked to a subject, and users can perform a full text search.

I'd like to describe the full text search in some detail, since there are some really interesting things you can do with the interplay between searching and linking. I also have a few unresolved questions to explore.

Basic Search

There are a lot of technologies for searching, so my first task was research. I decided on the simple MySQL fulltext search over SOLR, Sphinx, and acts_as_ferret because all the data I wanted to search was located within the PAGES table. As a result, this only required a migration script, a text input, and a new controller action to implement. You can see the result on the right hand side of the collection homepage.
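
For the curious, the whole thing boils down to something like the sketch below. The column name (transcription_text) is an assumption, and MySQL's fulltext indexes only work on MyISAM tables in the versions I'm using:

```ruby
# Migration sketch: add a MySQL fulltext index to the pages table.
class AddFulltextIndexToPages < ActiveRecord::Migration
  def self.up
    execute "ALTER TABLE pages ADD FULLTEXT INDEX pages_text_index (transcription_text)"
  end

  def self.down
    execute "ALTER TABLE pages DROP INDEX pages_text_index"
  end
end

# Controller action sketch: run the search and hand the matches to the view.
class SearchController < ApplicationController
  def search
    @pages = Page.find(:all,
      :conditions => ["MATCH(transcription_text) AGAINST(?)", params[:search_string]])
  end
end
```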

Article-based Search

Once basic search was working, I could start integrating the search capability with subject indexes. Since each subject link contains the wording in the original text that was used to link to a subject, that wording can be used to seed a text search. This allows an editor to double-check pages in a collection to see if any references to a subject have been missed.

For example, Evelyn Brumfield is a grandchild who is mentioned fairly often in Julia's diaries. Julia spells her name variously as "Evylin", "Evelyn", and "Evylin Brumfield". So a link from the article page performs a full text search for "Evylin Brumfield" OR Evelyn OR Evylin.

While this is interesting, it doesn't directly address the editor's need to find references they might have missed. Since we're able to see all the fulltext matches for Evelyn Brumfield, and since we can see all pages that link to the Evelyn Brumfield subject, why not subtract the second set from the first? An additional link on the subject page searches for precisely this set: all references to Evelyn Brumfield within the text that are not on pages linked to the Evelyn Brumfield subject.
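
A sketch of how that subtraction might be computed, assuming the link records store the original wording in a display_text column (the names here are illustrative, not the actual schema):

```ruby
# Sketch only: model and column names are assumptions.
class Subject < ActiveRecord::Base
  has_many :page_subject_links
  has_many :pages, :through => :page_subject_links

  # Pages whose text matches any phrase ever used to link to this subject,
  # minus the pages already linked to it.
  def unlinked_mentions
    phrases = page_subject_links.map { |link| link.display_text }.uniq
    matches = phrases.map do |phrase|
      Page.find(:all,
        :conditions => ["MATCH(transcription_text) AGAINST(?)", phrase])
    end.flatten.uniq
    matches - pages
  end
end
```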

At the time of writing this blog post, the results of such a search are pretty interesting. The first two pages in the results matched the first name in "Evylin Edmons", in pages that are already linked to the Evelyn Edmonds subject. Matched pages 4-7 appear to be references to Evelyn Brumfield on pages that have not been annotated at all. But we hit pay dirt with page number 3: it's a page that was transcribed and annotated very early in the transcription project, containing a reference to Evelyn Brumfield that should be linked to that subject but is not.

Questions

I originally intended to add links to search for each individual phrase linked to a subject. However, I'm still not sure this would be useful -- what value would separate, pre-populated searches for "Evelyn", "Evylin", and "Evylin Brumfield" add?

A more serious question is what exactly I should be searching on. I adopted a simple approach of searching the annotated XML text for each page. However, this means that subject name expansions will match a search, even if the words don't appear in the text. A search for "Brumfield" will return pages in which Julia never wrote Brumfield, merely because they link to "John", which is expanded to "John Brumfield". This is not a literal text search, and might astonish users. On the other hand, would a user searching for "Evelyn" expect to see the Evelyns in the text, even though they had been spelled "Evylin"?

Monday, March 23, 2009

Feature: Mechanical Turk Integration

At last week's Austin On Rails SXSW party, my friend and compatriot Steve Odom gave me a really neat feature idea. "Why don't you integrate with Amazon's Mechanical Turk?" he asked. This is an intriguing notion, and while it's not on my own road map, it would be pretty easy to modify FromThePage to support that. Here's what I'd do to use FromThePage on a more traditional transcription project, with an experienced documentary editor at the head and funding for transcription work:

Page Forks: I assume that the editor using Mechanical Turk would want double keyed transcriptions to maintain quality, so the application needs to present the same, untranscribed page to multiple people. In the software world, when a project splits, we call this forking, and I think that the analogy applies here. This feature needs to be able to track an entirely separate edit history for the different forks of a page. This means a new attribute on the master page record describing whether more than one fork exists, and a separate edit history for each fork of a page that's created. There's no reason to limit these transcriptions to only two forks, even if that's the most common use case, so I'd want to provide a URL that will automatically create a new fork for a new transcriber to work in. The Amazon HIT (Human Intelligence Task) would have a link to that URL, so the transcriber need never track which fork they're working in, or even be aware of the double keying.
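
As a sketch, the data model for forks might look something like this (all names hypothetical):

```ruby
# Hypothetical migration: a page may carry multiple transcription forks,
# each with its own transcriber and its own edit history.
class CreatePageForks < ActiveRecord::Migration
  def self.up
    add_column :pages, :forked, :boolean, :default => false

    create_table :page_forks do |t|
      t.integer :page_id
      t.integer :transcriber_id   # the MTurk worker or volunteer for this fork
      t.text    :transcription    # latest text; versions live in the edit history
      t.timestamps
    end
  end

  def self.down
    remove_column :pages, :forked
    drop_table :page_forks
  end
end
```

The fork-creating URL would then just be an action that builds a new page_forks row and drops the transcriber into the normal transcription screen for it.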

Reconciling Page Forks: After a page has been transcribed more than one time, the application needs to allow the editor to reconcile the transcriptions. This would involve a screen displaying the most recent version of two transcriptions alongside the scanned page image. Likely there's a decent Rails plug in already for displaying code diffs, so I could leverage that to highlight differences between the two transcriptions. A fourth pane would allow the editor to paste in the reconciled transcription into the master page object.

Publishing MTurk HITs: Since each page is an independent work unit, it should be possible to automatically convert an untranscribed work into MTurk HITs, with a work item for each page. I don't know enough about how MTurk works, but I assume that the editor would need to enter their Amazon account credentials to have the application create and post the HITs. The app also needs to prevent the same user from re-transcribing the same page in multiple forks.

In all, it doesn't sound like more than a month or two worth of work, even performed part-time. This isn't a need I have for the Julia Brumfield diaries, so I don't anticipate building this any time soon. Nevertheless, it's fun to speculate. Thanks, Steve!

Wednesday, March 18, 2009

Progress Report: Page Thumbnails and Sensitive Tags

As anyone reading this blog through the Blogspot website knows, visual design is not one of my strengths. One of the challenges that users have with FromThePage is navigation. It's not apparent from the single-page screen that clicking on a work title will show you a list of pages. It's even less obvious from the multi-page work reading screen that the page images are accessible at all on the website.

Last week, I implemented a suggestion I'd received from my friend Dave McClinton. The work reading screen now includes a thumbnail image of each manuscript page beside the transcription of that page. The thumbnail is a clickable link to the full screen view of the page and its transcription. This should certainly improve the site's navigability, and I think it also increases FromThePage's visual appeal.

I tried a different approach to processing the images from the one I'd used before. For transcribable page images, I modified the images offline through a batch process, then transferred them to the application, which serves them statically. The only dynamic image processing FromThePage did for end users was for zooming. This time, I added a hook to the image link code, so that if a thumbnail was requested by the browser, the application would generate it on the fly. This turned out to be no harder to code than a batch process, and the deployment was far easier. I haven't seen a single broken thumbnail image yet, so it looks like it's fairly robust, too.
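
The on-the-fly generation is just a few lines. This sketch uses RMagick and made-up paths; the real code differs, but the idea is the same:

```ruby
require 'RMagick'

# Sketch: generate and cache a thumbnail the first time a browser asks for it.
# The path scheme and thumbnail width are assumptions.
def thumbnail_path(page_image_path, width = 100)
  thumb_path = page_image_path.sub(/\.jpg\z/, "_thumb.jpg")
  unless File.exist?(thumb_path)
    image = Magick::Image.read(page_image_path).first
    image.resize_to_fit(width).write(thumb_path)
  end
  thumb_path
end
```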

The other new feature I added last week was support for sensitive tags. The support is still fairly primitive -- enclose text in a sensitive tag and it will only be displayed to users authorized to transcribe the work -- but it gets the job done and solves some issues that had come up with Julia Brumfield's 1919 diary. Happily, this took less than 10 minutes to implement.

Sunday, March 15, 2009

Feature: Editorial Toolkit

I'm pleased to report that my cousin Linda Tucker has finished transcribing the 1919 diary. I've been trying my best to keep up with her speed, but she's able to transcribe two pages in the amount of time it takes me to edit and annotate a single, simple page. If the editing work requires more extensive research, or (worse) reveals the need to re-do several previous pages, there is really no contest. In the course of this intensive editing, I've come up with a few ideas for new features, as well as a few observations on existing features.

Show All Pages Mentioning a Subject

Currently, the article page for each subject shows a list of the pages on which the subject is mentioned. This is pretty useful, but it really doesn't serve the purposes of the reader or editor who wants to read every mention of that subject, in context. In particular, after adding links to 300 diary pages, I realized that "Paul" might be either Paul Bennett, Julia's 20-year-old grandson who is making a crop on the farm, or Paul Smith, Julia's 7-year-old grandson who lives a mile away from her and visits frequently. Determining which Paul was which was pretty easy from the context, but navigating the application to each of those 100-odd pages took several hours.

Based on this experience, I intend to add a new way of filtering the multi-page view, which would display all the transcriptions of all pages that mention a subject. I've already partially developed this as a way to filter the pages within a work, but I really need to 1) see mentions across works, and 2) make this accessible from the subject article page. I am embarrassed to admit that the existing work-filtering feature is so hard to find that I'd forgotten it even existed.
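
A sketch of that cross-work filter, using assumed model names:

```ruby
# Sketch only: model and association names are assumptions.
class Page < ActiveRecord::Base
  belongs_to :work
  has_many :page_subject_links

  # All pages, in any work, that link to the given subject.
  named_scope :mentioning, lambda { |subject|
    { :joins => :page_subject_links,
      :conditions => { :page_subject_links => { :subject_id => subject.id } } }
  }
end

# The subject article page could then render something like
#   Page.mentioning(@subject).group_by(&:work)
# to show every mention, grouped by diary, with full transcriptions inline.
```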

Autolink

The Autolink feature has proven invaluable. I originally developed it to save myself the bother of typing [[Benjamin Franklin Brumfield, Sr.|Ben]] every time Julia mentioned "Ben". However, it's proven especially useful as a way of maintaining editorial consistency. If I decided that "bathing babies" was worth an index entry on one page, I may not remember that decision 100 pages later. However, if Autolink suggests [[bathing babies]] when it sees the string "bathed the baby", I'll be reminded of that. It doesn't catch every instance, but for subjects that tend to cluster (like occurrences of newborns), it really helps out.

Full Text Search

Currently there is no text search feature. Implementing one would be pretty straightforward, but in addition to that I'd like to hook in the Autolink suggester. In particular, I'd like to scan through pages I've already edited to see if I missed mentions of indexed subjects. This would be especially helpful when I decide that a subject is noteworthy halfway through editing a work.

Unannotated Page List

This is more a matter of work flow management, but I really don't have a good way to find out which pages have been transcribed but not edited or linked. It's really hard to figure out where to resume my editing.

[Update: While this blog post was in draft, I added a status indicator to the table of contents screen to flag pages with transcriptions but no subject links.]

Dual Subject Graphs/Searches

Identifying names is especially difficult when the only evidence is the text itself. In some cases I've been able to use subject graphs to search for relationships between unknown and identified people. This might be much easier if I could filter either my subject graphs or the page display to see all occurrences of subjects X and Y on the same page.

Research Credits

Now that the Julia Brumfield Diaries are public, suggestions, corrections, and research are pouring in. My aunt has telephoned old-timers to ask what "rebulking tobacco" refers to. A great-uncle has emailed with definitions of more terms, and I've had other conversations via email and telephone identifying some of the people mentioned in the text. To my horror, I find that I've got no way to attribute any of this information to those sources. At minimum, I need a large, HTML acknowledgments field at the collection level. Ideally, I'd figure out an easy-to-use way to attribute article comments to individual sources.

Monday, February 9, 2009

GoogleFight Resolves Unclear Handwriting

I've spent the last couple of weeks as a FromThePage user working seriously on annotation. This mainly involves identifying the people and events mentioned in Julia Brumfield's 1918 diary and writing short articles to appear as hyperlinked pages within the website, or be printed as footnotes following the first mention of the subject. Although my primary resource is a descendant chart in a book of family history, I've also found Google to be surprisingly helpful for people who are neighbors or acquaintances.

Here's a problem I ran into in the entry for June 30, 1918:

In this case, I was trying to identify the name in the middle of the scanned image: Bo__d Dews. The surname is a bit irregular for Julia's hand, but Dews is a common surname and occurs on the line above. In fact, this name is in the same list as another Mr. Dews, so I felt certain about the surname.

But what to make of the first name? The first two and final letters are clear and consistent: BO and D. The third letter is either an A or a U, and the fourth is either N or R. We can eliminate "Bourd" and "Boand" as unlikely phonetic spellings of any English name, leaving "Bound" and "Board". Neither of these are very likely names... or are they?

I thought I might have some luck by comparing the number and quality of Google search results for each of "Board Dews" and "Bound Dews". This is a pretty common practice used by Wikipedia editors to determine the most common title of a subject, and is sometimes known as a "Google fight". Let's look at the results:

"Bound Dews" yields four search results. The first two are archived titles from FromThePage itself, in which I'd retained a transcription of "Bound(?) Dews" in the text. The next two are randomly-generated strings on a spammer's site. We can't really rule out "Bound Dews" as a name based on this, however.

"Board Dews" yields 104 search results. The first page of results contains one person named Board Dews, who is listed on a genealogist's site as living from 1901 to 1957, and residing in nearby Campbell County. Perhaps more intriguing is the other surnames on the site, all from the area 10 miles east of Julia's home. The second page of results contains three links to a Board Dews, born in 1901 in Pittsylvania County.

At this point, I'm certain that the Bo__d Dews in the diary must be the Board Dews who would have been a seventeen-year-old neighbor. But I'm still astonished that I can resolve a legibility problem in a local diary with a Google search.

Thursday, February 5, 2009

Progress Report: Eight Months after THATCamp

It's been more than half a year since I've updated this blog. During that period, due to some events in my personal life, I was only able to spend a month or so on sustained development, but nevertheless made some real progress.

The big news is that I announced the project to some interested family members and have acquired one serious user. My cousin-in-law, Linda Tucker, has transcribed more than 60 pages of Julia Brumfield's 1919 diary since Christmas. In addition to her amazing productivity transcribing, she's explored a number of features of the software, reading most of the previously-transcribed 1918 diary, making notes and asking questions, and fighting with my zoom feature. Her enthusiasm is contagious, and her feedback -- not to mention her actual contributions -- has been invaluable.

During this period of little development, I spent a lot of time as a user. Fewer than 50 pages remain to transcribe in the 1918 diary, and I've started seriously researching the people mentioned in the diary for elaboration in footnotes. It's much easier to sustain work as a user than as a developer, since I don't need an hour or so of uninterrupted concentration to add a few links to a page.

I've also made some strides on printing. I jettisoned DocBook after too many problems and switched over to using Bruce Williamson's RTeX plugin. After some limited success, I will almost certainly craft my own set of ERb templates that generate LaTeX source for PDF generation. RTeX excels in serving up inline PDF files, which is somewhat antithetical to my versioned approach. Nevertheless, without RTeX, I might have never ventured away from DocBook. Thanks go to THATCamper Adam Solove for his willingness to share some of his hard-won LaTeX expertise in this matter.

Although I'm new to LaTeX, I've got footnotes working better than they were in DocBook. I still have many of the logical issues I addressed in the printing post to deal with, but am pretty confident I've found the right technology for printing.

I'm also working on re-implementing zoom in GSIV, rather than my cobbled-together solution. The ability to pan a zoomed image has been consistently requested by all of my alpha testers and the participants at THATCamp, and its lack is a real pain point for Linda, my first Real User. I really like the static-server approach GSIV takes, and will post when the first mock-up is done.

Saturday, June 21, 2008

Workflow: flags, tags, or ratings?

Over the past couple of months, I've gotten a lot of user feedback relating to workflow. Paraphrased, they include:
  • How do I mark a page "unfinished"? I've typed up half of it and want to return later.
  • How do I see all the pages that need transcription? I don't know where to start!
  • I'm not sure about the names or handwriting in this page. How do I ask someone else to review it?
  • Displaying whether a page has transcription text or not isn't good enough -- how do we know when something is really finished?
  • How do we ask for a proofreader, a tech savvy person to review the links, or someone familiar with the material to double-check names?
In a traditional, institutional setting, this is handled both through formal workflows (transcription assignments, designated reviewers, and researchers) and through informal face-to-face communication. None of these methods are available to volunteer-driven online projects.

The folks at THATCamp recommended I get around this limitation by implementing user-driven ratings, similar to those found at online bookstores. Readers could flag pages as needing review, scribes could flag pages in which they need help, and volunteers could browse pages by quality to look for ways to help out. An additional benefit would be the low barrier to user-engagement, as just about anyone can click a button when they spot an error.

The next question is what this system should look like. Possible options are:
  1. Rating scale: Add a one-to-five scale of "quality" to each page.
    • Pros: Incredibly simple.
    • Cons: "Quality" is ambiguous. There's no way to differentiate a page needing content review (i.e. "what is this placename?") from a page needing technical help (i.e. "I messed up the subject links"). Low quality ratings also have an almost accusatory tone, which can lead to lots of problems in social software.
  2. Flags: Define a set of attributes ("needs review", "unfinished", "inappropriate") for pages and allow users to set or un-set them independently of each other.
    • Pros: Also simple.
    • Cons: Too precise. The flags I can think of wanting may be very different from those a different user wants. If I set up a flag-based data-model, it's going to be limited by my preconceptions.
  3. Tags: Allow users to create their own labels for a page.
    • Pros: Most flexible, easy to implement via acts_as_taggable or similar Rails plugins.
    • Cons: Difficult to use. Tech-savvy users are comfortable with tags, but that may be a small proportion of my user base. An additional problem may be use of non-workflow based tags. If a page mentions a dramatic episode, why not tag it with that? (Admittedly this may actually be a feature.)
I'm currently leaning towards a combination of tags and flags: implement tags under the hood, but promote a predefined subset of tags to be accessible via a simple checkbox UI. Users could tag pages however they like, and if I see patterns emerge that suggest common use-cases, I could promote those tags as well.
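
Here's a sketch of that combination, assuming an acts_as_taggable-style plugin that provides a tag_list accessor; the promoted tag names and the controller action are my own inventions:

```ruby
# Sketch: tags under the hood, with a promoted subset surfaced as checkboxes.
class Page < ActiveRecord::Base
  acts_as_taggable   # assumes acts_as_taggable_on_steroids or similar

  # Workflow tags promoted to simple checkboxes in the UI.
  PROMOTED_TAGS = ["unfinished", "needs review", "needs identification"]
end

class PagesController < ApplicationController
  # Checkboxes post an array of promoted tags; a free-form field adds the rest.
  def update_workflow_tags
    page = Page.find(params[:id])
    tags = (params[:promoted_tags] || []) & Page::PROMOTED_TAGS
    tags += params[:custom_tags].to_s.split(",").map { |t| t.strip }
    page.tag_list = tags.join(", ")
    page.save
    redirect_to :action => "show", :id => page
  end
end
```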

Sunday, June 8, 2008

THATCamp Takeaways

I just got back from THATCamp, and man, what a ride! I've never been to a conference with this level of collaboration before -- neither academic nor technical. Literally nobody was "audience" -- I don't think a single person emerged from the conference without having presented in at least one session, and pitched in their ideas in half a dozen more.

To my astonishment, I ended up demoing FromThePage in two different sessions, and presented a custom how-to on GraphViz in a third. I was really surprised by the technical savvy of the participants -- just about everyone at the sessions I took part in had done development of one form or another. The feedback on FromThePage was far more concrete than I was expecting, and got me past several roadblocks. And since this is a product development blog, here's what they were:
  • Zoom: I've looked at Zoomify a few times in the past, but have never been able to get around the fact that their image-preparation software is incompatible with Unix-based server-side processing. Two different people suggested workarounds for this, which may just solve my zoom problems nicely.

  • WYSIWYG: I'd never heard of the Yahoo WYSIWYG before, but a couple of people recommended it as being especially extensible, and appropriate for linking text to subjects. I've looked over it a bit now, and am really impressed.
  • Analysis: One of the big problems I've had with my graphing algorithm is the noise that comes from members of Julia's household. Because they appear in 99 of 100 entries, they're more related to everything, and (worse) show up on relatedness graphs for other subjects as more related than the subjects I'm actually looking for. Over the course of the weekend, while preparing my DorkShorts presentation, discussing it, and explaining the noise quandary in FromThePage, both the problem and the solution became clear.
    The noise is due to the unit of analysis being a single diary entry. The solution is to reduce the unit of analysis. Many of the THATCampers suggested alternatives: look for related subjects within the same paragraph, or within N words, or even (using natural language toolkits) within the same sentence. (A rough sketch of the N-word approach appears after this list.)
    It might even be possible to do this without requiring markup of each mention of a subject. One possibility is to automate this by searching the entry text for likely mentions of the same subject that has occurred already. This search could be informed by previous matches -- the same data I'm using for the autolink feature. (Inspiration for this comes from Andrea Eastman-Mullins' description of how Alexander Street Press is using well-encoded texts to inform searches of unencoded texts.)
  • Autolink: Travis Brown, whose background is in computational linguistics, suggested a few basic tools for making autolink smarter. Namely, permuting the morphology of a word before the autolink feature looks for matches. This would allow me to clean up the matching algorithm, which currently does some gross things with regular expressions to approach the same goal.

  • Workflow: The participants at the Crowdsourcing Transcription and Annotation session were deeply sympathetic to the idea that volunteer-driven projects can't use the same kind of double-keyed, centrally organized workflows that institutional transcription projects use. They suggested a number of ways to use flagging and ratings to accomplish the same goals. Rather than assigning transcription to A, identification and markup to B, and proofreading to C, they suggested a user-driven rating system. This would allow scribes or viewers to indicate the quality level of a transcribed entry, marking it with ratings like "unfinished", "needs review", "pretty good", or "excellent". I'd add tools to the page list interface to show entries needing review, or ones that were nearly done, to allow volunteers to target the kind of contributions they were making.
    Ratings would also provide a non-threatening way for novice users to contribute.

  • Mapping: Before the map session, I was manually clicking on Google's MyMaps, then embedding a link within subject articles. Now I expect to attach latitude/longitude coordinates to subjects, then generate maps via KML files. I'm still just exploring this functionality, but I feel like I've got a clue now.

  • Presentation: The Crowdsourcing session started brainstorming presentation tools for transcriptions. I'd seen a couple of these before, but never really considered them for FromThePage. Since one of my challenges is making the reader experience more visually appealing, it looks like it might be time to explore some of these.
These are all features I'd considered either out-of-reach, dead-ends, or (in one case) entirely impossible.
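
Since the unit-of-analysis point is the one I'll act on first, here's a rough sketch of counting co-occurrences within a window of N words. The mention data structure is assumed; FromThePage doesn't currently store word offsets for links:

```ruby
# Sketch: treat two subjects as related only when their mentions fall within
# `window` words of each other, rather than anywhere in the same diary entry.
# `mentions` is assumed to be an array of [subject_id, word_offset] pairs.
def co_occurrences(mentions, window = 20)
  counts = Hash.new(0)
  mentions.each do |subject_a, offset_a|
    mentions.each do |subject_b, offset_b|
      next if subject_a == subject_b
      counts[[subject_a, subject_b].sort] += 1 if (offset_a - offset_b).abs <= window
    end
  end
  # Each pair is seen twice (a-b and b-a), so halve the totals.
  counts.each { |pair, count| counts[pair] = count / 2 }
  counts
end
```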

Thanks to my fellow THATCampers for all the suggestions, corrections, and enthusiasm. Thanks also to THATCamp for letting an uncredentialed amateur working out of his garage attend. I only hope I gave half as much as I got.

Saturday, May 17, 2008

Progress Report: De-duping catastrophe and a host change

After a very difficult ten days of coding, I'm almost where I was at the beginning of May. The story:

Early in the month, I got a duplicate identifier feature coded. The UI was based on LibraryThing's, which is the best de-duping interface I've ever seen. Mine still falls short, but it's able to pair "Ren Worsham" up with "Wren Worsham", so it'll probably do for now. With that completed, I built a tool to combine subjects: if you see a possible duplicate of the subject you're viewing, you click combine, and it updates all textual references to the duplicate to point to the main article, then deletes the duplicate. Pretty simple, right?
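
In code, the combine operation is little more than this sketch (model names assumed); wrapping it in a transaction should at least keep an interrupted run from leaving the data half-updated:

```ruby
# Sketch only: model names are assumptions.
class Subject < ActiveRecord::Base
  has_many :page_subject_links

  # Fold a duplicate subject into this one: repoint every textual reference,
  # then delete the now-empty duplicate.
  def combine_with!(duplicate)
    Subject.transaction do
      duplicate.page_subject_links.each do |link|
        link.update_attribute(:subject_id, id)
      end
      duplicate.destroy
    end
  end
end
```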

Enter DreamHost's process killer. Now I love DreamHost, and I want to love them more, but I really don't think their cheap shared hosting plan is appropriate for a computationally intensive web-app. In order to insulate users from other, potentially clueless users, they have a daemon that monitors running processes and kills off any that look "bad". I'm not sure what criteria constitute "bad", but I should have realized the heuristic might be over-aggressive when I wasn't able to run basic database migrations without running afoul of it. Nevertheless, it didn't seem to be causing anything beyond the occasional "Rails Application failed to start" message that could be solved with a browser reload.

However. Killing a de-duping process in the middle of reference updates is altogether different from killing a relatedness graph display. Unfortunately, I wasn't quite aware of the problem before I'd tried to de-dup several records, sometimes multiple times. My app assumes its data will be internally consistent, so my attempts to clean up the carnage resulted in hundreds more duplicates being created.

So I've moved FromThePage from DreamHost to HostingRails, which I completed this morning. There remains a lot of back-end work to clean up the data, but I'm pretty sure I'll get there before THATCamp.

Sunday, April 27, 2008

The Trouble with Names

The trouble with people as subjects is that they have names, and that personal names are hard.
  • Names in the text may be illegible or incomplete, so that Reese _____ and Mr ____ Edmonds require special handling.
  • Names need to be remembered by the scribe during transcription. I discovered this the hard way.

    After doing some research in secondary documents, I was able to "improve" the entries for Julia's children. Thus Kate Harvey became Julia Katherine Brumfield Harvey and Mollie Reynolds became Mary Susan Brumfield Reynolds.

    The problem is that while I'm transcribing the diaries, I can't remember that "Mary Susan" == Mollie. The diaries consistently refer to her as Mollie Reynolds, and the family refers to her as Mollie Reynolds. No other person working on the diaries is likely to have better luck than I've had remembering this. After fighting with the improved names for a while, I gave up and changed all the full names back to the common names, leaving the full names in the articles for each subject.

  • Names are odd ducks, when it comes to strings. "Mr. Zach Abel" should be sorted before "Dr. Anne Zweig", which requires either human intervention to break the string into component parts or some serious parsing effort. At this point my subject list has become unwieldy enough to require sorting, and the index generation code for PDFs is completely dependent on this kind of separation.
I'm afraid I'll have to solve all of these problems at the same time, as they're all interdependent. My initial inclination is to have subject articles for people allow the user to specify a full name in all its component parts. If none is chosen, I'll populate the parts via a series of regular expressions. This will probably also require a hard look at how both TEI and DocBook represent names.
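
As a sketch, the regular-expression fallback might start out like this; the honorific list and the component names are assumptions, and plenty of real names will defeat it:

```ruby
# Sketch: split a display name into rough components when the user
# hasn't supplied them explicitly.
HONORIFICS = /\A(Mr|Mrs|Miss|Dr|Rev)\.?\s+/i

def parse_name(full_name)
  name = full_name.strip
  honorific = nil
  if name =~ HONORIFICS
    honorific = $1
    name = name.sub(HONORIFICS, "")
  end
  parts = name.split(/\s+/)
  {
    :honorific => honorific,
    :given     => parts.first,
    :surname   => parts.last,
    :sort_key  => [parts.last, parts.first].uniq.join(", ")
  }
end

# parse_name("Mr. Zach Abel")[:sort_key]   # => "Abel, Zach"
# parse_name("Dr. Anne Zweig")[:sort_key]  # => "Zweig, Anne"
```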

Friday, April 25, 2008

Feature: Data Maintenance Tools

With only two collections of documents, fewer than a hundred transcriptions, and only a half-dozen users who could be charitably described as "active", FromThePage is starting to strain under the weight of its data.

All of this has to do with subjects. These are the indexable words that provide navigation, analysis, and context to readers. They're working out pretty well, but frequency of use has highlighted some urgent features to be developed and intolerable bugs to be fixed:
  • We need a tool to combine subjects. Early in the transcription process, it was unclear to me whether "the Island" referred to Long Island, Virginia -- a nearby town with a post office and railroad station -- or someplace else. Increasing familiarity with the texts, however, has shown "the Island" to be definitely the same as "Long Island".

    The best interface for doing this sort of deduping is implemented by LibraryThing, which is so convenient that it has inspired the ad-hoc creation of a group of "combining" enthusiasts -- an astonishing development, since this is usually the worst of dull chores. A similar interface would invite the user viewing "the Island" to consider combining that subject with "Long Island". This requires an algorithm to suggest matches for combination, which is itself no trivial task.

  • We need another tool to review and delete orphans. As identification improves, we've been creating new subjects like "Reese Smith" and linking previous references to "Reese" to that improved subject. This leaves the old, incomplete subject without any referents, but also without any way to prune it.

Autolink has become utterly essential to transcriptions, since it allows the scribe to identify the appropriate subject as well as the syntax to use for its link. Unfortunately, it has a few serious problems:

  • Autolink looks for substrings within the transcription without paying attention to word boundaries. This led to some odd autolink suggestions before this week, but since the addition of "ink" and "hen" to the subjects' link text, autolink has started behaving egregiously. The word "then" is unhelpfully expanded to "t[[chickens|hen]]". I'll try to retain matches for display text regardless of inflectional markings, but it's time to move the matching algorithm to a more conservative heuristic (see the sketch after this list).
  • Autolink also suggests matches from subjects that reside within different collections. It's wholly unhelpful for autolink to suggest a tenant farmer in rural Virginia within a context of UK soldiers in a WWI prison camp. This is a classical cross-talk bug, and needs fixing.
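
The word-boundary fix is straightforward; here's a sketch (the link-text lookup is assumed):

```ruby
# Sketch: suggest autolinks only when the link text appears as whole words,
# so "hen" no longer matches inside "then".
def autolink_candidates(page_text, link_texts)
  link_texts.select do |text|
    page_text =~ /\b#{Regexp.escape(text)}\b/i
  end
end

# autolink_candidates("We sat by the fire and then fed the hogs",
#                     ["hen", "fire"])   # => ["fire"]
```

Restricting the candidate link texts to the page's own collection would handle the second problem.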

Monday, April 14, 2008

Collaborative transcription, the hard way

Archivalia has kept us informed of lots of manuscript projects going online in Europe last week, offering commentary along the way. Perhaps my favorite exchange was about the Chronicle of Sweder von Schele on the Internet:
The initial aim of the project is to supplement and improve the transcription. To this end, new transcriptions can be sent by mail to the institutions participating in the project. After editorial review, the pages will be replaced.

My goodness, how antediluvian. Has no one ever heard of a wiki?
That's right -- the online community may send in corrections and additions to the transcription by mail.

Friday, April 11, 2008

Who do I build for?

Over at ClioWeb, Jeremy Boggs is starting a promising series of posts on the development process for digital humanities projects. He's splitting the process up into five steps, which may happen at the same time, but still follow a rough sort of order. Step one, obviously, is "Figure out what exactly you’re building."

But is that really the first step? I'm finding that what I build is dependent on who I'm building for. Having launched an alpha version of the product and shaken out many of the bugs in the base functionality, I'm going to have to make some tough decisions about what I concentrate my time on next. These all revolve around who I'm developing the product for:

My Family: Remember that I developed the transcription software to help accomplish a specific goal: transcribe Julia Brumfield's diaries to share with family members. The features I should concentrate on for this audience are:
  • Finish porting my father's 1992 transcription of the 1918 diary to FromThePage
  • Fix zoom.
  • Improve the collaborative tools for discussing the works and figuring out which pages need review.
  • Build out the printing feature, so that the diaries can be shared with people who have limited computer access.
THATCamp attendees: Probably the best feedback I'll be able to receive on the analytical tools like subject graphs will come from THATCamp. This means that any analytical features I'd like to demo need to be presentable by the middle of May.

Institutional Users: This blog has drawn interest from a couple of people looking for software for their institutions to use. For the past month, I've corresponded extensively with John Rumm, editor of the Digital Buffalo Bill Project, based at McCracken Research Library, at the Buffalo Bill Historical Center. Though our conversations are continuing, it seems like the main features his project would need are:
  • Full support for manuscript transcription tags used to describe normalization, different hands, corrections/revisions to the text, illegible text and other descriptors used in low-level transcription work. (More on this in a separate post)
  • Integration with other systems the project may be using, like Omeka, Pachyderm, MediaWiki, and such.
My Volunteers: A few of the people who've been trying out the software have expressed interest in using it to host their own projects. More than any other audience, their needs would push FromThePage towards my vision of unlocking the filing cabinets in family archivists' basements and making these handwritten sources accessible online. We're in the very early stages of this, so I don't yet know what requirements will arise.

The problem is that there's very little overlap between the features these groups need. I will likely concentrate on family and volunteers, while doing the basics for THATCamp. I realize that's not a very tight focus, but it's much clearer to me now than it was last week.

Monday, April 7, 2008

Progress Report: One Month of Alpha Testing

FromThePage has been on a production server for about a month now, and the results have been fascinating. The first few days' testing revealed some shocking usability problems. In some places (transcription and login most notoriously) the code was swallowing error messages instead of displaying them to the user. Zoom didn't work in Internet Explorer. And there were no guardrails that kept the user from losing a transcription-in-progress.

After fixing these problems, the family and friends who'd volunteered to try out the software started making more progress. The next requests that came in were for transcription conventions. After about three requests for these, I started displaying the conventions on the transcription screen itself. This seems to have been very successful, and is something I'd never have come up with on my own.

The past couple of weeks have been exciting. My old college roommate started transcribing entries from the 1919 diary, and entered about 15 days in January -- all in two days' work. In addition to his technical feedback, two things I'd hoped for happened:
  • We started collaborating. My roommate had transcribed the entries on a day when he'd had lots of time. I reviewed his transcriptions and edited the linking syntax. Then my father edited the resulting pages, correcting names of people, events, and animals based on his knowledge of the diaries' context.
  • My roommate got engaged. His technical suggestions were peppered with comments on the entries themselves and the lives of the people in the diaries. I've seen this phenomenon at Pepys Diary Online, but it's really heartening to get a glimpse of that kind of engagement with a manuscript.
I've also had some disappointments. It looks like I'll have to discard my zoom feature and replace it with something using the Google Maps API. People now expect to pan by dragging the image, and my home-rolled system simply can't support that.

I had really planned on developing printing and analytical tools next, but we're finding that the social software features are becoming essential. The bare-bones "Recent Activity" panel I slapped together one night has become the most popular feature. We need to know who edited what, what pages remain to be transcribed, and which transcriptions need review. I've resuscitated this winter's comment feature, polished the "Recent Activity" panel, and am working on a system for displaying a page's transcription status in the work's table of contents.

All of these developments are the result of several hours of careful, thoughtful review by volunteers like my father and my roommate. There is simply no way I could have invited the public into the site as it stood a month ago, though I did not know that at the time. There's still a lot to be done before FromThePage is "ready for company", but I think it's on track.

If you'd like to try the software out, leave me a comment or send mail to alpha.info@fromthepage.com.

Friday, April 4, 2008

Review: IATH Manuscript Transcription Database

This is the second of two reviews of similar transcription projects I wrote in correspondence with Brian Cafferelli, an undergraduate working on the WPI Manuscript Transcription Assistant. In this correspondence, I reviewed systems by their support for collaboration, automation, and analysis.

The IATH Manuscript Transcription Database was a system for producing transcriptions developed for the Salem Witch Trials court records. It allowed full collaboration within an institutional setting. An administrator assigned transcription work to a transcriber, who then pulled that work from the queue and edited and submitted a transcription. Presumably there was some sort of review performed by the admin, or a proofreading step done by comparison with another user's transcription. Near as I can tell from the manual, no dual-screen transcription editor was provided. Rather, transcription files were edited outside the system, and passed back and forth using MTD.

I'm a bit hazy on all this because after reviewing the docs and downloading the source code, I sent an inquiry to IATH about the legal status of the code. It turns out that while the author intended to release the system to the public, this was never formally done. The copyright status of the code is murky, and I view it as tainted IP for my purpose of producing a similar product. I'm afraid I deleted the files from my own system, unread.

For those interested in reading more, here is the announcement of the MTD on the Humanist list. The full manual, possibly with archived source code, is accessible via the Wayback Machine; it has been pulled from the IATH site, presumably because of the IP problems.

So the MTD was another pure-production system. Automation and collaboration were fully supported, though the collaboration was designed for a purely institutional setting. Systems of assignment and review are inappropriate for a volunteer-driven system like my own.

My correspondent at IATH did pass along a piece of advice from someone who had worked on the project: "Use CVS instead". What I gather they meant by this was that the system for passing files between distributed transcribers, collating those files, and recording edits is a task that source code repositories already perform very well. This does nothing to replace transcription production tools akin to mine or the TEI editors, but the whole system of editorial oversight and coordination provided by the MTD is a subset of what a source code repository can do. A combination of Subversion and Trac would be a fantastic way to organize a distributed transcription effort, absent a pure-hosted tool.

This post contains a lot more speculation and supposition than usual, and I apologize in advance to my readers and the IATH for anything I've gotten wrong. If anyone associated with the MTD project would like to set me straight, please comment or send me email.

Thursday, March 27, 2008

Rails: Logging User Activity for Usability

At the beginning of the month, I started usability testing for FromThePage. Due to my limited resources, I'm not able to perform usability testing in control rooms, or (better yet) hire a disinterested expert with a background in the natural sciences to conduct usability tests for me. I'm pretty much limited to sending people the URL for the app with a pleading e-mail, then waiting with fingers crossed for a reply.

For anyone who finds themselves in the same situation, I recommend adding some logging code to your app. We tried this last year with Sara's project, discovering that only 5% of site visitors were even getting to the features we'd spent most of our time on. It was also invaluable for resolving bug reports: when a user complained about getting logged off the system, we could track their clicks and see exactly what they were doing that killed their session.

Here's how I've done this for FromThePage in Rails:

First, you need a place to store each user action. You'll want to store information about who was performing the action, and what they were doing. I was willing to violate my sense of data model aesthetics for performance reasons, and abandon third normal form by combining these two distinct concepts into the same table.

# who's doing the clicking?
:browser
:session_id
:ip_address
:user_id # null if they're not logged in

Tracking the browser lets you figure out whether your code doesn't work in IE (it doesn't) and whether Google is scraping your site before it's ready (it is). The session ID is the key used to aggregate all these actions -- one session ID corresponds to the several clicks that make up a user session. Finally, the IP address gives you a bit of a clue as to where the user is coming from.

Next, you need to store what's actually being done, and on what objects in your system. Again, this goes within the same table.

# what happened on this click?
:action
:params
:collection_id # null if inapplicable
:work_id # null if inapplicable
:page_id # null if inapplicable

Here, every click will record the action and the associated HTTP parameters. If one of those parameters was collection_id, work_id, or page_id (the three most important objects within FromThePage), we'll store that too. Put all this in a migration script and create a model that refers to it; we'll call that model Activity.
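
A minimal sketch of that migration in Rails 2.0-era syntax follows; the column types here are my guesses rather than a prescription:

class CreateActivities < ActiveRecord::Migration
  def self.up
    create_table :activities do |t|
      # who's doing the clicking?
      t.string  :browser
      t.string  :session_id
      t.string  :ip_address
      t.integer :user_id       # null if they're not logged in
      # what happened on this click?
      t.string  :action
      t.text    :params
      t.integer :collection_id # null if inapplicable
      t.integer :work_id       # null if inapplicable
      t.integer :page_id       # null if inapplicable
      t.timestamps
    end
  end

  def self.down
    drop_table :activities
  end
end

# app/models/activity.rb
class Activity < ActiveRecord::Base
end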

Now we need to actually record the action. This is a good job for a before_filter. Since I've got a before_filter in ApplicationController that sets up important variables like the page, work, or collection, I'll place my before_filter in the same spot and call it after that one.

before_filter :record_activity

But what does it do?

def record_activity
  @activity = Activity.new
  # who is doing the activity?
  @activity.session_id = session.session_id # record the session
  @activity.browser = request.env['HTTP_USER_AGENT']
  @activity.ip_address = request.env['REMOTE_ADDR']
  # what are they doing?
  @activity.action = action_name # grab this from the controller
  @activity.params = params.inspect # wrap this in an unless block if it might contain a password
  if @collection
    @activity.collection_id = @collection.id
  end
  # ditto for work, page, and user IDs
  @activity.save # don't forget to persist the record
end

For extra credit, add a status field set to 'incomplete' in your record_activity method, then update it to 'complete' in an after_filter. This is a great way to catch activity that throws exceptions and leaves users staring at error pages you might not know about otherwise.
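
Here's a sketch of that extra-credit step, assuming record_activity above also sets @activity.status = 'incomplete' before saving:

after_filter :complete_activity

def complete_activity
  return unless @activity
  # if the request raised an exception, this filter never runs, so the
  # row stays 'incomplete' -- those are the requests worth investigating
  @activity.status = 'complete'
  @activity.save
end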

P.S. Let me know if you'd like to try out the software.

Wednesday, March 26, 2008

THATCamp 2008

I'm going to THATCamp at the end of May to talk about From The Page and a few dozen other cool projects that are going on in the digital humanities. If anybody can offer advice on what to expect from an "unconference", I'd sure appreciate it.

This may be the thing that finally drives me to use Twitter.

Monday, March 17, 2008

Rails 2.0 Gotchas

The deprecation tools for Rails 2.0 are grand, but they really don't tell you everything you need to know. The things that have bitten me so far are:

  • The built-in pagination has been removed from the core framework. Unlike tools like acts_as_list and acts_as_tree, however, there's no obvious plugin that makes the old code work. This is because the old pagination code was really awful: it performed poorly and hid your content from search engines. Fortunately, Sara was able to convert my paginate calls to use the will_paginate plugin pretty easily (see the sketch after this list).
  • Rails Engines, or at least the restful_comments plugin built on top of them, don't seem to work at all. So I've had to disable the comments and proofreading request system I spent November through January building.
  • Rails 2.0 adds some spiffy automated code to prevent cross-site-scripting security holes. For some reason this breaks my cross-controller AJAX calls, so I've had to add
    protect_from_forgery :except => [my old actions]
    to those controllers after getting InvalidAuthenticityToken exceptions.
  • The default session store has been changed from a filesystem-based engine to one that shoves session data into the browser cookie. So if you're persisting large-ish objects across requests in the session, this will fail. Sadly, basic tests may pass while serious work breaks: I found my bulk page transformation code worked fine for 20 pages but broke for 180. The solution is to add
    config.action_controller.session_store = :p_store
    to your environment.rb file.
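
For what it's worth, here's roughly what the pagination conversion looks like; the Page model and variable names are illustrative rather than my actual code:

# before: Rails 1.x built-in pagination, in the controller
@page_pages, @pages = paginate :pages, :per_page => 20

# after: the will_paginate plugin, a single call on the model
@pages = Page.paginate :page => params[:page], :per_page => 20
# and in the view: <%= will_paginate @pages %>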

Sunday, March 9, 2008

Collaborative Transcription as Crowdsourcing

Yesterday morning I saw Derek Powazek present on crowdsourcing -- user-generated content and collaborative communities. While he covered a lot of familiar ground (users will do unexpected things, don't exploit people, design for the "selfish user"), there was one anecdote I thought especially relevant for FromThePage.

A publishing house had polled a targeted group of people to figure out whether they'd be interested in contributing magazine articles. The response was overwhelmingly positive. The appropriate studies were conducted, and the site was launched -- a blank page, ready for article contributions.

The response from those previously enthusiastic users was silence. Crickets. Tumbleweeds. The editors quickly changed tack and posted a list of ten subjects who'd agreed to be interviewed by the site's contributors, asking for volunteers to conduct and write up the interviews. This time, people responded with the same enthusiasm they'd shown in the original survey.

The lesson was that successful editors of collaborative content endeavors have less in common with traditional magazine/project editors than they do with community managers. Absent a command-and-control organizational structure, a volunteer community still needs to have its efforts directed. However, this must be done through guidance and persuasion: concrete suggestions, goal-setting, and feedback. In future releases, I need to add features to help work owners communicate suggestions and rewards* to scribes.

(* Powazek suggests attaboys here, not complex replacements for currency.)

Wednesday, March 5, 2008

Meet me at SXSWi 2008

I'll be at South by Southwest Interactive this weekend. If any of my readers are also attending, please drop me a message or leave a comment. I'd love to meet up.

Thursday, February 7, 2008

Google Reads Fraktur

Yesterday, German blogger Archivalia reported that the quality of Fraktur OCR at Google Books has improved. There are still some problems, but they're on the same order as those found in books printed in Antiqua. Compare the text-only and page-image versions of Geschichte der teutschen Landwirthschaft (1800) with the text and image versions of the Antiqua-printed Altnordisches Leben (1856).

This is a big deal, since previous OCR efforts produced results that were not only unreadable, but un-searchable as well. This example from the University of Michigan's MBooks website (digitized in partnership with Google) gives a flavor of the prior quality: "Ueber den Ursprung des Uebels." ("On the Origin of Evil") results in "Us-Wv ben Uvfprun@ - bed Its-beEd."

It's thrilling that these improvements are being made to the big digitization efforts -- my guess is that they've added new blackletter typefaces to the OCR algorithm and reprocessed the previously-scanned images -- but it highlights how dependent OCR technology is on well-known typefaces. Occasionally, when I tell friends about my software and the diaries I'm transcribing, I'm asked, "Why don't you just OCR the diaries?" Unfortunately, until someone comes up with an OCR plugin for Julia Brumfield (age 72) and another for Julia Brumfield (age 88), we'll be stuck transcribing the diaries by hand.

Monday, February 4, 2008

Progress Report: Four (now N) steps to deployment

I've completed one of the four steps I outlined below: my Rails app is now living in a Subversion repository hosted somewhere further than 4 feet from where I'm typing this.

However, I've had to add a few more steps to the deployment process. These included:

  • Attempting to install Trac
  • Installing MySQL on DreamHost
  • Installing Subversion on DreamHost
  • Successfully installing Bugzilla on DreamHost
None of these were included in my original estimate.

Name Update: FromThePage.com

I've finally picked a name. Despite its attractiveness, "Renan" proved unpronounceable. No wonder my ancestors said "REE-nan": it's at least four phonemes away from a native English word, and nobody who was shown the software was able to pronounce its title.

FromThePage is the new name. It's not as lovely as some of the ones that came out of a brainstorming session (like "HandWritten"), but at least there are no existing software products that use it. I went ahead and registered fromthepage.com and fromthepage.org under the assumption that I'd be able to pull off the WordPress model of open-source software that's also hosted for a fee.

Monday, January 21, 2008

Four steps to deployment

Here are the things I need to do to deploy the 0.5 app on my shared hosting provider:
  • Install Capistrano and get it working
  • Upgrade my application stack to Rails 2.0
  • Switch my app from a subdirectory deep within another CVS module to its own Subversion module
  • Move the app to Dreamhost
But what order should I tackle these in? My temptation is to try deploying to DreamHost via Capistrano first, since I'm eager to get the app on a production server. Fortunately for my sanity, however, I read part of Cal Henderson's Building Scalable Websites this weekend. Henderson recommends using a staging site. While he probably had something different in mind, this seems like a perfect way to isolate these variables: get the Capistrano scripts working on a staging location within my development box, and then, once I really understand how deployment automation works, point the scripts at the production server.
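
As a sketch of that staging-first idea, a Capistrano 2 config/deploy.rb might look something like this; the host names and paths are placeholders, not my actual setup:

set :application, "fromthepage"
set :repository,  "svn+ssh://svn.example.com/fromthepage/trunk"
set :deploy_to,   "/home/myuser/staging/#{application}"

# point every role at the staging box first...
role :app, "staging.example.com"
role :web, "staging.example.com"
role :db,  "staging.example.com", :primary => true

# ...then, once "cap deploy" works end to end, swap these values for the
# production server and re-run the same scripts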

As for the rest, I'm not really sure when to do them. But I will try to tackle them one at a time.

Friday, January 18, 2008

What's next?

Now that I'm done with development driven only by my sense of what would be a good feature, it's time to move to step #2 in my year-old feature plan: deploying an alpha site.

I'm no longer certain about the second half of that plan -- other projects have presented themselves as opportunities that might have a more technically sophisticated user base, and thus might present more incremental enhancement requests. But getting the app to a server where I can follow Matt Mullenweg's advice and "become [my] most passionate user" seems more sensible now than ever.

Chris Wehner's SoldierStudies.org

This is the first of two reviews of similar transcription projects I wrote in correspondence with Brian Cafferelli, an undergraduate working on the WPI Manuscript Transcription Assistant. In this correspondence, I reviewed systems by their support for collaboration, automation, and analysis.

SoldierStudies.org is a non-academic/non-commercial effort like my own. It's a combined production-presentation system with simple but effective analysis tools. If you sign up for the site, you can transcribe letters you possess, entering metadata (name and unit of the soldier involved) and the transcription of the letter's text. You may also flag a letter's contents as belonging to N of about 30 subjects using a simple checkbox mechanism. The UI is a bit clunky in my opinion, but it actually has users (unlike my own program), so perhaps I shouldn't cast stones.

Nevertheless, SoldierStudies has some limitations. Most surprisingly, they are doing no image-based transcription whatsoever, even though they allow uploads of scans. Apparently those uploaded photos of letters merely authenticate that the user hasn't just invented the letter, and only a single page of a letter may be uploaded. Other problems seem inherent to the site's broad focus. SoldierStudies hosts some WebQuest modules intended for K-12 pedagogy. It also keeps copies of some letters transcribed in other projects, like letters from James Booker digitized as part of the Booker Letters Project at the University of Virginia. Neither of these seems part of the site's core goal "to rescue Civil War letters before they are lost to future generations".

Unlike pure-production systems such as the IATH MTD or the WPI MTA, SoldierStudies presents its transcriptions dynamically. This allows full-text searching and browsing the database by metadata. Very cool.

So they've got automation mostly down (save the requirement that a scribe be in the same room as a text), analysis is pretty good, and there's a stab at collaboration, although texts cannot be revised by anybody but the original editor. Most importantly, they're online, actively engaged in preserving primary sources and making them accessible to the public via the web.