Thursday, December 30, 2010

Two New Diary Transcription Projects

The last few weeks have seen the announcement of two new transcription projects. I'm particularly excited about them because--like FromThePage--their manuscripts are diaries and they plan to open the transcription tasks to the public.

Dear Diary announces the digitization of University of Iowa's Historic Iowa Children's Diaries:
We have a deep love for historic diaries as well, and we’re currently hard at work developing a site that will allow the public to help enhance our collections through “crowdsourcing” or collaborative transcription of diaries and other manuscript materials. Stay tuned!
A Look Inside J. Paul Getty’s Newly Digitized Diaries describes J. Paul Getty's diaries and their contents, and mentions in passing that
We will soon launch a website that will invite your participation to perform the transcriptions (via crowdsourcing), thus rendering the diaries keyword-searchable and dramatically improving their accessibility.
I will be very interested to follow these projects and see which transcription systems they use or develop.

Wednesday, December 15, 2010

NABPP Transcription User Survey Results

The ever-innovative North American Bird Phenology Program has just released the results of their user satisfaction survey. In addition to offering kudos to the NABPP for pleasing its users--who had transcribed 346,786 species observation cards as of November--I'd like to highlight some of the survey results -- after all, this is the first user survey for a transcription tool that I'm aware of.

This chart shows answers to the question "What inspires you to volunteer?":
Users overwhelmingly cite the importance of the program and their love of nature. The importance response obviously speaks to the "save the world" aspect of the NABPP's mission tracking climate change. But how does a love of nature inspire someone to sit in front of a computer and decipher hundreds of thousands of pieces of old handwriting? Based on my experience with the NABPP (and indeed with Julia Brumfield's diaries), I think the answer is simple: transcription is an immersive activity. We rarely read more deeply than when we transcribe a text, so the transcriber is transported to a different place and time in the same way as a reader lost in a novel or a player absorbed in a good video game.

While the whole survey is worth reading, one other response stood out for me. Answering the open-ended "how can we improve" question, one respondent requested examples of properly transcribed "difficult" cards on the transcription page. I know that, as a tool-maker, I tend to concentrate on helping users operate the software I develop. However, transcription tools also need to give users non-technical guidance on paleographic challenges, unusual formatting, and other issues that may be unique to the material being transcribed. I'm not entirely sure how to accomplish this as a developer, other than by facilitating communication among transcribers, editors, and project coordinators.

Let me conclude by offering the NABPP my congratulations on their success and my thanks for their willingness to share these results with the rest of us.

Friday, December 10, 2010

VeleHanden

VeleHanden ("many hands") is a new transcription project organized by the Stadsarchief Amsterdam (the Amsterdam City Archives) that plans to use crowdsourcing to index militia registers from several Dutch archives. It's quite ambitious, and there are a number of innovative features about the project I'd like to address. However, I haven't seen any English-language coverage of the project, so I'll translate and summarize it as best as my limited Dutch and Google's imperfect algorithms allow before offering my own commentary.

With the project "many hands make light work", Stadsarchief Amsterdam will make all militia records searchable online through crowdsourcing -- not just inventories, but indexes.

To research how archives and online users can work together to improve access to the archives, the Stadsarchief Amsterdam has set up the "Many Hands" project. With this project, we want to create a platform where all Dutch archives can offer their scans to be indexed and where all archival users can contribute in exchange for fun, contacts, information, honor, scanned materials, and whatever else we can think of.

To ask the whole Netherlands to index, we must start with archives that are important to the whole Netherlands. As the first pilot, we have chosen the Militia Registers, but there will soon be more archival files to be indexed so that everyone can choose something within his interest and skill-level.

All Militia Registers Online

Militia registers contain the records of all boys who were entered for conscription into military service during almost the entire 19th and part of the 20th centuries. These records were created in the entire Netherlands and are kept in many national and municipal archives.

The militia records are eminently suitable for large-scale digitization. The records consist of printed sheets. This uniformity makes scanning easy and thus inexpensive. More importantly, this resource is interesting for anyone with Dutch ancestry. Therefore we expect many volunteers to lend a hand to help unlock this wonderful resource, and that the online indexes will eventually attract many visitors.

But the first step is to scan the records. As a start, the scanning of approximately one million pages will soon begin. The more records we digitize, the cheaper the scanning becomes, and the more attractive the indexing project becomes to volunteers. The Stadsarchief therefore calls upon all Dutch archival institutions to join!
  • At our institution, online scans are provided for free. Why should people pay for scans?
    Revenues from sales of scans over the project's two-year duration are part of its financing. The budget is based on the rates used in the City Archives: € 0.50 to € 0.25 per scan, depending on the number of scans purchased. We ask that participating institutions not sell their own scans or make them available for free during the project. After the completion of the project, each institution may follow its own policy for providing the scans.
  • If we participate, who is the owner of the scans and index data?
    After production and payment, the scans will be delivered immediately to the institution which provided the militia records. The index information will also be supplied to the institutions after completion of the project. The institution remains the owner, but during the project period of approximately two years the material may not be used outside of the project.
  • What are the financial risks for participating archives?
    Participants pay only for their scans: the actual costs of preparing and executing the scanning process. The development and deployment of the indexing tool, volunteer recruitment, and two years' maintenance of the project website are funded by grants and by contributions from the Stadsarchief Amsterdam. There are no financial surprises.
  • What does the schedule for the project look like?
    On July 12 and September 13 we are organizing meetings with potential participants to answer your questions. Participants may sign up for the project until October 1, 2010, so that the tender process can begin on that day. The tender process runs about two months, so a supplier can be contracted in 2010. Scanning will begin in January 2011, and volunteers can start indexing in the spring. The sister site--where the indexing will take place--will remain online for at least one year.
  • Will the indexing tool be developed as Open Source software?
    It is currently impossible to say whether the indexing tool will be developed as open source software. Of primary importance is finding the most cost-effective solution, one that performs well and is user-friendly. The only hard requirement is the use of open standards for the import and export of metadata, so that vendor independence is guaranteed.
RFP (Warning: very loose translation!)
Below are some ideas the SAA has formulated regarding the functionality and sustainability of the indexing tool:
  • Facilities for importing and managing scans, and for exporting data in XML format.
  • Scan viewer with advanced features.
  • Functionality to simultaneously run multiple projects for indexing, transcription, and translation of scans.
  • Features for organizing and managing data from volunteer groups and for selectively enabling features for participants and volunteer coordinators.
  • Features for communication between archival staff and volunteers, as well as for volunteers to provide support to each other.
  • Automated features for quality control of the data produced.
  • Rewards system (material and immaterial) for volunteers.
  • Support for many volunteers working in parallel, so that scans are processed quickly and effectively.
  • Facilities to search, view and share scans online.
Other Dutch bloggers have covered the unique approach that Stadsarchief Amsterdam envisions for volunteer motivation and project support: users who want to download scans may either pay for them in cash or in labor, by indexing N scanned pages. Christian van der Ven's blog post Crowdsourcen rond militieregisters and the associated comment thread discuss this intensely and are worth reading in full. Here's a loosely translated excerpt:
The project assumes that it cannot let a volunteer indicate whether he wants to index Zeeland or Groningen. It is--in the words of the project leader--the Orange feeling: to see whether people across the country will volunteer rather than concentrating only on their own locality. Indexing people from their own village? Please, not that!

Well, since the last World Cup I'm feeling Orange again, but experience and research in archives teach that people everywhere are most interested in the history of themselves, their own ancestors, their homes, and the surrounding area. The closer [the data], the more motivation to do something.

And if the purpose of this project is to build an indexing tool, to scan registers, and then to obtain indexes through crowdsourcing as quickly as possible, it seems to me that the public should be given what it wants: local resources if desired. What I suggest is a choice menu: do you want records from your own area? Do you want them only from a certain period? Or do you want them filtered by both time and place? That kind of choice will draw as many people as possible to participate, I think.
My Observations:
  • The pay-or-transcribe approach for acquiring scans is really innovative. Offering people alternatives for supporting the project is a great way of serving the varied constituencies that make up the genealogical research community, giving cash-poor, time-rich users (like retirees) an easy way to access the project.
  • Although I have no experience in the subject, I suspect that this federated approach to digitization--taking structurally-similar material from regional archives and scanning/hosting it centrally--has a lot of possibilities.
  • Christian's criticism is quite valid, and drives right to the paradox of motivation in crowdsourcing: do you strive for breadth using external incentives like scoreboards and free recognition, or do you strive for depth and cultivate passionate users through internal incentives like deep engagement with the source material? Volunteer motivation and the trade-offs involved is a fascinating topic, and I hope to do a whole post on it soon.
  • One potential flaw is that it will be very hard to charge for viewing the scans when transcribers must be able to see them to do their indexing. I gather that the randomization in VeleHanden will address this.
  • The budget described in the RFP is a maximum of €150,000. As a real-life software developer, it's hard for me to see how this would pay for building a transcription tool, an index database, a scan import tool, a scan CMS, a search database, and (since they expect to sell the searched scans) an eCommerce system -- and that includes running the servers, too!
  • This is yet another project that's transcribing structured data from tabular sources, which would benefit from the FamilySearch Indexer, if only it were open-source (or even for sale).

Saturday, December 4, 2010


Errata

Two important errata in previous posts are worth highlighting:

Wikisource for Manuscript Transcription

Commenters on my most recent post, "Wikisource for Manuscript Transcription", have pointed out that the Wikisource rule prohibiting the addition of unpublished works--thereby almost entirely prohibiting manuscript transcription projects--is specific to each language community. The English and French language Wikisource projects enforce the prohibition, but the German and Hebrew language Wikisource projects do not.

Dovi, a Wikipedia editor whom I used to bump into on Ancient Near Eastern language articles, describes his work on the Wikisource edition of Arukh Hashulchan and points to this example of simanim. Sadly, my Hebrew isn't up to the task of commenting on this effort.

I was delighted to see that the post also inspired a bit of commentary on the German-language Wikisource Skriptorium. In particular, WikiAnika seemed to agree with my criticism of flat transcription (my translation): "In a certain way, people still seem to 'cling' to the print form. Or, to quote a third party: 'Wikisource is a dead end -- you may find your way to an interesting text, but you can't get any further.'"

A Short Introduction to before_filter

In my 2007 post on Rails filters, I mentioned using filters to authenticate users:
Filters are called filters because they return a Boolean, and if that return value is false, the action is never called. You can use the logged_in? method of :acts_as_authenticated to prohibit access to non-users — just add before_filter :logged_in? to your controller class and you're set!
Thanks to some changes made in Rails 2.0, this is not just wrong but dangerously wrong. Rails now ignores filter return values, allowing unauthorized access to your actions if you followed my advice.

Because Rails no longer pays any attention to the return values of controller filters, I've had to replace all of my filters that simply return a condition with explicit redirects of the form "redirect_to somewhere unless condition".
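To make the new pattern concrete, here is a self-contained sketch in plain Ruby (a toy model of the dispatch logic, not actual Rails internals): the dispatcher runs the action only if the filter has not already performed a redirect, and the filter's return value plays no role.

```ruby
# Toy model of post-2.0 filter semantics: the framework checks whether
# the filter *performed* a redirect or render, not what it returned.
class TinyController
  def initialize(logged_in)
    @logged_in = logged_in
    @performed = nil
  end

  def logged_in?
    @logged_in
  end

  # Dispatch: run the before-filter, then the action -- unless the
  # filter already performed a response of its own.
  def dispatch
    require_login
    @performed ||= :index_action
  end

  private

  # The safe pattern: redirect explicitly instead of returning false.
  def require_login
    @performed = :redirected_to_login unless logged_in?
  end
end

TinyController.new(true).dispatch   # → :index_action
TinyController.new(false).dispatch  # → :redirected_to_login
```

A filter that merely returned false would leave @performed unset here, and the action would run anyway -- which is exactly the unauthorized-access bug described above.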

Wednesday, July 21, 2010

Wikisource for Manuscript Transcription

Of the crowdsourcing projects that have real users transcribing manuscripts, one of the largest is an offshoot of Wikisource. ProofreadPage is a MediaWiki extension created around 2006 on the French-language Wikisource as a Wikisource/Internet Archive replacement for Project Gutenberg's Distributed Proofreaders: volunteers took DjVu files from the Internet Archive and used them as sources (via OCR and correction) for Wikisource pages. The extension spread to the other Wikisource sites around 2008, radically changing the way Wikisource worked. More recently, the German Wikisource has started using ProofreadPage for letters, pamphlets, and broadsheets.

The best example of ProofreadPage applied to handwriting is Winkler's Remarks on the Russian Campaign 1812-1813. First, the presentation is lovely. They've dealt with a typographically difficult text and are presenting alternate typefaces, illustrations, and even marginalia in a clear way in the transcription. The page numbers link to the images of the pages, and they've come up with transcription conventions which are clearly presented at the top of the text. This is impressive for a volunteer-driven edition!

Technically, the Winkler example illustrates ProofreadPage's solution to a difficult problem: how to organize and display pages, sections, and works in the appropriate context. This is not an issue that I've encountered with FromThePage—the Julia Brumfield Diaries are organized with only one entry per page—but I've worried about it since XML is so poorly suited to represent overlapping markup. When viewing Winkler as a work, paragraphs span multiple manuscript pages but are aggregated seamlessly into the text: search for "sind den Sommer" and you'll find a paragraph with a page-break in the middle of it, indicated by the hyperlink "[23]". Clicking on the page in which that paragraph begins shows the page and page image in isolation, along with footnotes about the page source and page-specific information about the status of the transcription. This is accomplished by programmatically stitching pages together into the work display while excluding page-specific markup via a noinclude tag.
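The stitching technique can be sketched in a few lines of Ruby (a toy model only; the real ProofreadPage extension is written in PHP, and these page strings are invented): page-specific material is wrapped in noinclude tags, which are stripped when the pages are concatenated into the work view.

```ruby
# Concatenate per-page wikitext into a single work display, dropping
# <noinclude> sections (page status, per-page notes) along the way.
def stitch_pages(pages)
  pages.map { |wikitext| wikitext.gsub(%r{<noinclude>.*?</noinclude>}m, "") }
       .join("\n")
end

pages = [
  "<noinclude>status: proofread</noinclude>Erster Absatz, der auf",
  "der nächsten Seite endet.<noinclude>status: validated</noinclude>",
]
stitch_pages(pages)
# → "Erster Absatz, der auf\nder nächsten Seite endet."
```

The same source therefore renders two ways: with the noinclude material when a page is viewed alone, and without it when the page is transcluded into the work.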

But the transcription of Winkler also highlights some weaknesses I see in ProofreadPage. All annotation is done via footnotes which--although they are embedded within the source--are a far cry from the kind of markup we're used to with TEI or indeed HTML. In fact, aside from the footnotes and page numbers, there are no hyperlinks in the displayed pages at all. The inadequacy of this for someone who wants extensive text markup is highlighted by this personal name index page -- it's a hand-compiled index! Had the tool (or its users) relied on in-text markup, such an index could be compiled by mining that markup. Of course, the reason I'm critical here is that FromThePage was inspired by the possibilities of using wiki-links within text to annotate, analyze, edit, and index, and I've been delighted by the results.
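Such mining could be as simple as the following Ruby sketch (the page text and names are invented; the link syntax is standard wiki-links): scan each page's markup for [[...]] links and record the page numbers where each name occurs.

```ruby
# Build a name index mechanically from wiki-links in transcribed pages,
# instead of compiling it by hand.
def build_index(pages)
  index = Hash.new { |h, k| h[k] = [] }
  pages.each_with_index do |wikitext, i|
    # Match both [[Target]] and [[Target|display text]] links.
    wikitext.scan(/\[\[([^\]|]+)(?:\|[^\]]*)?\]\]/) do |(name)|
      index[name] << i + 1
    end
  end
  index
end

pages = [
  "Marched with [[General Winkler]] toward Moskau.",
  "[[General Winkler]] fell ill; [[Dr. Braun]] attended him.",
]
build_index(pages)
# → {"General Winkler"=>[1, 2], "Dr. Braun"=>[2]}
```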

When I originally researched ProofreadPage, one question perplexed me: why aren't more manuscripts being transcribed on Wikisource? A lot has happened since I last participated in the Wikisource community in 2004, especially within the realm of formalized rules. There is now a rule on the English, French, and German Wikisource sites banning unpublished work. Apparently the goal was to discourage self-promoters from using the site for their own novels or crackpot theories, and it's pretty drastic. The English-language version specifies that sources must have been previously published on paper, and the French site has added "Ne publiez que des documents qui ont été déjà publiés ailleurs, sur papier" ("Only publish documents that have already been published elsewhere, on paper") to the edit form itself! It is a rare manuscript indeed that has already been published in a print form that can be OCRed but is still worth transcribing from the handwriting. As a result, I suspect that we're not likely to see much attention paid to transcription proper within the ProofreadPage code, short of a successful non-Wikisource MediaWiki/ProofreadPage project.

Aside from FromThePage (which is accepting new transcription projects!), ProofreadPage/MediaWiki is my favorite transcription tool. Its origins outside the English-language community, together with Wikisource community policy, have obscured its utility for transcribing manuscripts, which is why I think it's been overlooked. It has a lot of momentum behind it, and while it is still centered around OCR, I feel it will work for many needs. Best of all, it's open source, so you can start a transcription project by setting up your own private Wikisource instance.

Thanks to Klaus Graf at Archivalia for much of the information in this article.

Tuesday, June 15, 2010

Facebook versus Twitter for Crowdsourcing Document Transcription

Last week, I posted a new document to FromThePage and inadvertently conducted a little experiment on how to publicize crowdsourcing. Sometime in the early 1970s, my uncle was on a hunting trip when he was driven into an old, abandoned building by a thunderstorm. While waiting for the weather to moderate, he found a few old documents -- two envelopes bearing Confederate stamps, one of which contained a letter. I photographed this letter and uploaded it to FromThePage while on vacation last week, and that's where the experiment begins.

I use both Facebook and Twitter, but post entirely different material to them. My 347 Facebook friends include most of my high school classmates, several of my friends and classmates from college, much of my extended family, and a few of my friends here in Austin. I mostly post status updates about my personal life, only occasionally sharing links to the Julia Brumfield diaries when I read an especially moving passage. My 344 Twitter followers are almost entirely people I've met at or through conferences like THATCamp08 or THATCamp Austin. They consist of academics, librarians, archivists, and programmers -- mostly ones who identify in some way with the "digital humanities" label. I usually tweet about technical and theoretical issues I've encountered in my own DH work. At least a few of my followers even run their own transcription software projects. Given the overlap between interesting content and FromThePage development, I decided to post news of the East Civil War letters to both systems.

My initial tweet/status--posted while I was still cropping the images--got similar responses from both systems. Two people on Twitter and five on Facebook replied, helping me resolve the letter's year to 1862. Here's what I posted on Facebook:

And here's the post on Twitter:

After I got one of the envelopes created in FromThePage, I tested the images out by posting again to Facebook. This update got no response.

The next day, I uploaded the second envelope and letter, then posted to Twitter and Facebook while packing for our return trip.



This time the contrast in responses was striking. I got 3 click-throughs from Twitter in the first three days, and I'm not entirely sure that one of those wasn't me clicking the link by accident. While my statistics aren't as good for Facebook click-throughs, there were at least 6 I could identify. More important, however, was the transcription activity -- which is the point of my crowdsourcing project, after all. Within 3 hours of posting the link on Facebook, one very-occasional user had contributed a transcription, and I'd gotten two personal emails requesting reminders of login credentials from other people who wanted to help with the letter.

What accounts for this difference? One possibility is that the archivists and humanists who make up my Twitter followership are less likely to get excited about a previously unpublished Civil War letter -- after all, many of them have their own stacks of unpublished material to transcribe. Another possibility is that the envelope link I posted on Facebook increased people's anticipation and engagement. However, I suspect that the most important difference is that the Facebook link itself was more compelling because it included an image of the manuscript page. Images are simply more compelling than dry text, and Facebook's thumbnail service draws potential volunteers in.

My conclusion is that it's worth the effort to build an easy Facebook sharing mechanism into any document crowdsourcing tool, especially if that mechanism provides the scanned document as the image thumbnail.
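Facebook builds its link previews from Open Graph meta tags on the shared page, so a transcription tool could emit something like the following sketch (the helper name and URLs are invented for illustration; the og:* property names come from Facebook's Open Graph protocol):

```ruby
# Emit the <meta> tags Facebook reads when building a link preview,
# pointing og:image at the scanned page so the thumbnail shows the
# manuscript itself rather than generic site chrome.
def og_meta_tags(title:, image_url:, page_url:)
  [
    %(<meta property="og:title" content="#{title}"/>),
    %(<meta property="og:image" content="#{image_url}"/>),
    %(<meta property="og:url" content="#{page_url}"/>),
  ].join("\n")
end

puts og_meta_tags(
  title:     "1862 Civil War letter, page 1",
  image_url: "http://example.com/scans/letter-p1-thumb.jpg",
  page_url:  "http://example.com/display/display_page?page_id=1"
)
```

Rendered into the head of each page-display view, these tags would give every shared link the manuscript-image thumbnail that seems to have made the Facebook posts so effective.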

Tuesday, March 2, 2010

Feature plan for 2010

Three years ago, I laid out a plan for getting FromThePage to general availability. Since that time, I completed most of the features I thought necessary, gained some dedicated users, and saw the software used to transcribe and annotate over a thousand pages of Julia Brumfield's diaries. However, most of the second half of 2009 was spent using the product in my editorial capacity and developing features to support that effort, rather than moving FromThePage out of its status as a limited beta.

Here's what my top priorities are for 2010:

Release FromThePage as Free/Open Source Software.
I've written before about my position as a hobbyist-developer and my desire not to see my code become abandonware if I am no longer able to maintain it. Open Source seems to be the obvious solution, and despite my concerns about Open Access use, I am taking some good advice and publishing the source code under the GNU Affero GPL. I've taken some steps to do this, including migrating towards a GitHub repository, but hope to wait until I've made a few test drives and developed some installation instructions before I announce the release.

Complete PDF Generation and Publish-on-Demand Integration.
While I think that releasing FromThePage is the best way forward for the software itself, it doesn't get me any closer to accomplishing my original objective for the project -- sharing Julia Brumfield's diaries with her family. It's hard to balance these goals, but at this point I think that my efforts after the F/OSS release should be directed towards printable and print-on-demand formats for the diary transcriptions. I've built one proof-of-concept and I've settled on an output technology (RTeX), so all that remains is writing the code to do it.
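A minimal sketch of the pipeline I have in mind (illustrative names only, not RTeX's actual API): escape TeX special characters in the transcribed text, render a LaTeX template with ERB, then hand the result to pdflatex (which is the step RTeX wraps) to produce the PDF.

```ruby
require "erb"

# Escape the TeX special characters likely to appear in diary text.
TEX_SPECIALS = {
  "&" => '\\&', "%" => '\\%', "$" => '\\$', "#" => '\\#',
  "_" => '\\_', "{" => '\\{', "}" => '\\}',
}

def escape_tex(text)
  text.gsub(/[&%$#_{}]/) { |ch| TEX_SPECIALS[ch] }
end

# A trivial LaTeX template for one diary entry; a real template would
# also carry the preamble, footnote annotations, and index entries.
ENTRY_TEMPLATE = <<~'TEX'
  \section*{<%= escape_tex(title) %>}
  <%= escape_tex(body) %>
TEX

def render_entry(title, body)
  ERB.new(ENTRY_TEMPLATE).result(binding)
end

tex = render_entry("June 15, 1921", "Cloudy & cool. 80% of tobacco set.")
# `tex` now holds pdflatex-ready LaTeX source; the print-on-demand
# step would shell out to pdflatex to turn it into a PDF.
```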

I'll still work on other features as they occur to me, I'm still editing Julia's 1921 diary, and I'm still looking for other transcription projects to host, but these two goals will be the main focus of my development this year.