Thursday, December 30, 2010

Two New Diary Transcription Projects

The last few weeks have seen the announcement of two new transcription projects. I'm particularly excited about them because--like FromThePage--their manuscripts are diaries and they plan to open the transcription tasks to the public.

Dear Diary announces the digitization of University of Iowa's Historic Iowa Children's Diaries:
We have a deep love for historic diaries as well, and we’re currently hard at work developing a site that will allow the public to help enhance our collections through “crowdsourcing” or collaborative transcription of diaries and other manuscript materials. Stay tuned!
A Look Inside J. Paul Getty’s Newly Digitized Diaries describes J. Paul Getty's diaries and their contents, and mentions in passing that
We will soon launch a website that will invite your participation to perform the transcriptions (via crowdsourcing), thus rendering the diaries keyword-searchable and dramatically improving their accessibility.
I will be very interested to follow these projects and see which transcription systems they use or develop.

Wednesday, December 15, 2010

NABPP Transcription User Survey Results

The ever-innovative North American Bird Phenology Program has just released the results of their user satisfaction survey. In addition to offering kudos to the NABPP for pleasing its users--who had transcribed 346,786 species observation cards as of November--I'd like to highlight some of the survey results -- after all, this is the first user survey for a transcription tool that I'm aware of.

This chart shows answers to the question "What inspires you to volunteer?":
Users overwhelmingly cite the importance of the program and their love of nature. The importance response obviously speaks to the "save the world" aspect of the NABPP's mission tracking climate change. But how does a love of nature inspire someone to sit in front of a computer and decipher seventy thousand pieces of old hand-writing? Based on my experience with the NABPP (and indeed with Julia Brumfield's diaries), I think the answer is simple: transcription is an immersive activity. It is rare that we read more deeply than when we transcribe a text, so the transcriber is transported to a different place and time in the same way as a reader lost in a novel or a player in a good video game.

While the whole survey is worth reading, one other response stood out for me. Answering the open-ended "how can we improve" question, one respondent requested examples of properly transcribed "difficult" cards on the transcription page. I know that I as a tool-maker tend to concentrate on providing users help using the software I develop. However, transcription tools need to provide users with non-technical guidance on paleographic challenges, unusual formatting, and other issues that may be unique to the material being transcribed. I'm not entirely sure how to accomplish this as a developer, other than by facilitating communication among transcribers, editors, and project coordinators.

Let me conclude by offering the NABPP my congratulations on their success and my thanks for their willingness to share these results with the rest of us.

Friday, December 10, 2010

"Many Hands" is a new transcription project organized by the City Archive Amsterdam that plans to use crowdsourcing to index militia registers from several Dutch archives. It's quite ambitious, and there are a number of innovative features about the project I'd like to address. However, I haven't seen any English-language coverage of the project, so I'll try to translate and summarize it as well as my limited Dutch and Google's imperfect algorithms allow before offering my own commentary.

With the project "many hands make light work", Stadsarchief Amsterdam will make all militia records searchable online through crowdsourcing -- not just inventories, but indexes.

To research how archives and online users can work together to improve access to the archives, the Stadsarchief Amsterdam has set up the "Many Hands" project. With this project, we want to create a platform where all Dutch archives can offer their scans to be indexed and where all archival users can contribute in exchange for fun, contacts, information, honor, scanned goods, and whatever else we can think of.

To ask the whole Netherlands to index, we must start with archives that are important to the whole Netherlands. As the first pilot, we have chosen the Militia Registers, but there will soon be more archival files to be indexed so that everyone can choose something within his interest and skill-level.

All Militia Registers Online

Militia registers contain the records of all boys who were entered for conscription into military service during almost the entire 19th and part of the 20th centuries. These records were created in the entire Netherlands and are kept in many national and municipal archives.

The militia records are eminently suitable for large-scale digitization. The records consist of printed sheets. This uniformity makes scanning easy and thus inexpensive. More importantly, this resource is interesting for anyone with Dutch ancestry. Therefore we expect many volunteers to lend a hand to help unlock this wonderful resource, and that the online indexes will eventually attract many visitors.

But the first step is to scan the records. As a start, the scanning of approximately one million pages will begin soon. The more records we have digitized, the cheaper the scanning becomes, and the more attractive the indexing project becomes to volunteers. The Stadsarchief therefore calls upon all Dutch archival institutions to join!
  • At our institution, online scans are provided for free. Why should people pay for scans?
    Revenues from sales of scans over the two-year duration of the project are part of the project's financing. The budget is based on the rates used in the City Archives: € 0.50 to € 0.25 per scan, depending on the number of scans that someone buys. We ask that participating institutions neither sell their own scans nor make them available for free for the duration of the project. After the completion of the project, each institution may follow its own policy for providing the scans.
  • If we participate, who is the owner of the scans and index data?
    After production and payment, the scans will be delivered immediately to the institution which provided the militia records. The index information will also be supplied to the institutions after completion of the project. The institution remains the owner, but during the project period of approximately two years the material may not be used outside of the project.
  • What are the financial risks for participating archives?
    Participants pay only for their scans: the actual costs and preparation of the scanning process. The development and deployment of the index tool, volunteer recruitment, and two years' maintenance of the project website have been funded by grants and contributions from Stadsarchief Amsterdam. There are no financial surprises.
  • What does the schedule for the project look like?
    On July 12 and September 13 we are organizing meetings with potential participants to answer your questions. Until October 1, 2010, participants will sign up to participate in the project, in order for the scanning to start on that day. The tender process runs about 2 months, so a supplier can be contracted in 2010. In January 2011 we will start scanning, and volunteers can begin indexing in the spring. The sister site--where the indexing will take place--will continue online for at least one year.
  • Will the indexing tool be developed as Open Source software?
    It is currently impossible to say whether the indexing tool will be developed via/as open source software. Of primary importance are finding the most cost-effective solution and ensuring that the software performs well and is user-friendly. The only hard requirement is the use of open standards for the import and export of metadata, so that vendor independence is guaranteed.
RFP (Warning: very loose translation!)
Below are some ideas SAA has formulated regarding the functionality and sustainability of
  • Facilities for importing and managing scans, and for exporting data in XML format.
  • Scan viewer with advanced features.
  • Functionality to simultaneously run multiple projects for indexing, transcription, and translation of scans.
  • Features for organizing and managing data from volunteer groups and for selectively enabling features for participants and volunteer coordinators.
  • Features for communication between archival staff and volunteers, as well as for volunteers to provide support to each other.
  • Automated features for control of the data produced.
  • Rewards system (material and immaterial) for volunteers.
  • Many volunteers may work in parallel to process scans quickly and effectively.
  • Facilities to search, view and share scans online.
Other Dutch bloggers have covered the unique approach that Stadsarchief Amsterdam envisions for volunteer motivation and project support: users who want to download scans may pay for them either in cash or in labor, by indexing N scanned pages. Christian van der Ven's blog post Crowdsourcen rond militieregisters and the associated comment thread discuss this intensely and are worth reading in full. Here's a loosely translated excerpt:
The project assumes that it can not allow the volunteer to indicate whether he wants to index Zeeland or Groningen. It is--in the words of the project leader--the Orange feeling, to see if the rural people can volunteer and not just concentrate on their own location. Indexing people from their own village? Please, not that!

Well since the last World Cup I'm feeling Orange again, but overall experience and research in archives teaches that all country people are more interested in the history of themselves, their own ancestors, their homes and the surrounding area. The closer [the data], the more motivation to do something.

And if the purpose of this project is to build an indexing tool, to scan registers, and then to obtain indexes through crowdsourcing as quickly as possible, it seems to me that the public should be given what it wants: local resources if desired. What I suggest is a choice menu: do you want records from your source environment? Do you want them maybe only from a certain period? Or do you want them filtered by time and place? That kind of choice will trigger as many people as possible to participate, I think.
My Observations:
  • The pay-or-transcribe approach for acquiring scans is really innovative. Offering people alternatives for supporting the project is a great way of serving the varied constituencies that compose genealogical researchers, allowing cash-poor, time-rich users (like retirees) an easy way to access the project.
  • Although I have no experience in the subject, I suspect that this federated approach to digitization--taking structurally-similar material from regional archives and scanning/hosting it centrally--has a lot of possibilities.
  • Christian's criticism is quite valid, and drives right to the paradox of motivation in crowdsourcing: do you strive for breadth using external incentives like scoreboards and free recognition, or do you strive for depth and cultivate passionate users through internal incentives like deep engagement with the source material? Volunteer motivation and the trade-offs involved make a fascinating topic, and I hope to do a whole post on it soon.
  • One potential flaw is that it will be very hard to charge to view the scans when transcribers must be able to see the scans to do their indexing. I gather that the randomization in VeleHanden will address this.
  • The budget described in the RFP is a maximum of 150,000 euros. As a real-life software developer, I find it hard to see how this would pay for building a transcription tool, index database, scan import tool, scan CMS, search database and (since they expect to sell the searched scans) eCommerce. And that includes running servers too!
  • This is yet another project that's transcribing structured data from tabular sources, which would benefit from the FamilySearch Indexer, if only it were open-source (or even for sale).

Saturday, December 4, 2010


Two important errata in previous posts are worth highlighting:

Wikisource for Manuscript Transcription

Commenters on my most recent post, "Wikisource for Manuscript Transcription" have pointed out that the Wikisource rule prohibiting the addition of unpublished works--thereby almost entirely prohibiting manuscript transcription projects--is specific to the language domains. The English and French language Wikisource projects enforce the prohibition, but the German and Hebrew language Wikisource projects do not.

Dovi, a Wikipedia editor whom I used to bump into on Ancient Near Eastern language articles, points to his work on the Wikisource edition of Arukh Hashulchan, citing this example of simanim. Sadly, my Hebrew isn't up to the task of commenting on this effort.

I was delighted to see that the post also inspired a bit of commentary on the German-language Wikisource Skriptorium. In particular, WikiAnika seemed to agree with my criticism of flat transcription: Man scheint in gewisser Weise noch an der Printform „zu kleben“. Oder um einen Dritten zu zitieren: „WS ist eine Sackgasse – man findet vielleicht zu einem interessanten Text hin, aber man kommt nicht mehr weiter.“ [In a way, people still seem to be "stuck" on the print form. Or, to quote a third party: "WS is a dead end: you may find your way to an interesting text, but you can't get any further."]

A Short Introduction to before_filter

In my 2007 post on Rails filters, I mentioned using filters to authenticate users:
Filters are called filters because they return a Boolean, and if that return value is false, the action is never called. You can use the logged_in? method of :acts_as_authenticated to prohibit access to non-users -- just add before_filter :logged_in? to your controller class and you're set!
Thanks to some changes made in Rails 2.0, this is not just wrong but dangerously wrong. Rails now ignores filter return values, allowing unauthorized access to your actions if you followed my advice.

Because Rails no longer pays any attention to the return values of controller filters, I've had to replace all my return condition statements with unless condition redirect_to somewhere.
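To make the difference concrete, here is a minimal sketch in plain Ruby. It is not Rails code: SketchController, its process method, and the "/login" target are hypothetical stand-ins, and the only behavior it models is the one described above, namely that the filter's return value is discarded and the action is skipped only when the filter has redirected.

```ruby
# Plain-Ruby stand-in for a controller. Everything here is hypothetical
# and only mimics the behavior described in the post; it is not Rails.
class SketchController
  attr_reader :redirected_to, :action_ran

  def initialize(logged_in)
    @logged_in = logged_in
    @redirected_to = nil
    @action_ran = false
  end

  def logged_in?
    @logged_in
  end

  # Stand-in for redirect_to: it just records the target.
  def redirect_to(target)
    @redirected_to = target
  end

  # Old style: relies on the Boolean return value to block the action.
  def require_login_by_return_value
    logged_in?
  end

  # New style: redirect unless the user is logged in.
  def require_login_by_redirect
    redirect_to("/login") unless logged_in?
  end

  # Runs the named filter, discards its return value, and runs the
  # action only if the filter did not redirect.
  def process(filter)
    send(filter)
    return if @redirected_to
    @action_ran = true
  end
end

old_style = SketchController.new(false)
old_style.process(:require_login_by_return_value)
puts old_style.action_ran   # true: the anonymous user reaches the action

new_style = SketchController.new(false)
new_style.process(:require_login_by_redirect)
puts new_style.action_ran   # false: the redirect stopped the action
```

In a real controller the same idea means the filter itself has to call redirect_to (or render) when the check fails, rather than simply returning the result of logged_in?.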