Friday, December 10, 2010

VeleHanden is a new transcription project organized by the City Archive Amsterdam that plans to use crowdsourcing to index militia registers from several Dutch archives. It's quite ambitious, and there are a number of innovative features about the project I'd like to address. However, I haven't seen any English-language coverage of the project, so I'll try to translate and summarize it as best as my limited Dutch and Google's imperfect algorithms allow before offering my own commentary.

With the project "many hands make light work", Stadsarchief Amsterdam will make all militia records searchable online through crowdsourcing -- not just inventories, but indexes.

To research how archives and online users can work together to improve access to the archives, the Stadsarchief Amsterdam has set up the "Many Hands" project. With this project, we want to create a platform where all Dutch archives can offer their scans to be indexed and where all archival users can contribute in exchange for fun, contacts, information, honor, scanned goods, and whatever else we can think of.

To ask the whole Netherlands to index, we must start with archives that are important to the whole Netherlands. As the first pilot, we have chosen the Militia Registers, but there will soon be more archival files to be indexed so that everyone can choose something within his interest and skill-level.

All Militia Registers Online

Militia registers contain the records of all boys who were entered for conscription into military service during almost the entire 19th and part of the 20th centuries. These records were created in the entire Netherlands and are kept in many national and municipal archives.

The militia records are eminently suitable for large-scale digitization. The records consist of printed sheets. This uniformity makes scanning easy and thus inexpensive. More importantly, this resource is interesting for anyone with Dutch ancestry. Therefore we expect many volunteers to lend a hand to help unlock this wonderful resource, and that the online indexes will eventually attract many visitors.

But the first step is to scan the records. As a start, the scanning of approximately one million pages will begin soon. The more records we have digitized, the cheaper the scanning becomes, and the more attractive the indexing project becomes to volunteers. The Stadsarchief therefore calls upon all Dutch archival institutions to join!
  • At our institution, online scans are provided for free. Why should people pay for scans?
    Revenues from sales of scans over the two-year duration of the project are part of the project's financing. The budget is based on the rates used in the City Archives: € 0.50 to € 0.25 per scan, depending on the number of scans that someone buys. We ask that participating institutions not sell their own scans or make them available for free for the duration of the project. After the completion of the project, each institution may follow its own policy for providing the scans.
  • If we participate, who is the owner of the scans and index data?
    After production and payment, the scans will be delivered immediately to the institution which provided the militia records. The index information will also be supplied to the institutions after completion of the project. The institution remains the owner, but during the project period of approximately two years the material may not be used outside of the project.
  • What are the financial risks for participating archives?
    Participants pay only for their scans: the actual costs and preparation of the scanning process. The development and deployment of the index tool, volunteer recruitment, and two years' maintenance of the project website have been funded by grants and contributions from Stadsarchief Amsterdam. There are no financial surprises.
  • What does the schedule for the project look like?
    On July 12 and September 13 we are organizing meetings with potential participants to answer your questions. Until October 1, 2010, participants will sign up to participate in the project, in order for the scanning to start on that day. The tender process runs about 2 months, so a supplier can be contracted in 2010. In January 2011 we will start scanning, and volunteers can begin indexing in the spring. The sister site--where the indexing will take place--will continue online for at least one year.
  • Will the indexing tool be developed as Open Source software?
    It is currently impossible to say whether the indexing tool will be developed via/as open source software. What matters most is finding the most cost-effective solution and ensuring that the software performs well and is user-friendly. The only hard requirement is the use of open standards for the import and export of metadata, so that vendor independence is guaranteed.
RFP (Warning: very loose translation!)
Below are some ideas SAA has formulated regarding the functionality and sustainability of
  • Facilities for importing and managing scans, and for exporting data in XML format.
  • Scan viewer with advanced features.
  • Functionality to simultaneously run multiple projects for indexing, transcription, and translation of scans.
  • Features for organizing and managing data from volunteer groups and for selectively enabling features for participants and volunteer coordinators.
  • Features for communication between archival staff and volunteers, as well as for volunteers to provide support to each other.
  • Automated features for control of the data produced.
  • Rewards system (material and immaterial) for volunteers.
  • Many volunteers may work in parallel to process scans quickly and effectively.
  • Facilities to search, view and share scans online.
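To make the XML export requirement a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the RFP does not specify a schema, so the element names, the `IndexEntry` class, and the `entries_to_xml` function are hypothetical, and the fields (name, date and place of birth) are simply the ones mentioned in the comments below.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """One indexed person on one scan (hypothetical structure)."""
    scan_id: str      # identifier of the scanned page (assumed format)
    name: str
    birth_date: str   # ISO 8601 date string, e.g. "1850-03-14"
    birth_place: str


def entries_to_xml(entries):
    """Serialize index entries to a plain XML string.

    The <index>/<entry> layout is invented for this sketch; a real
    project would follow whatever open standard the archives agree on.
    """
    root = ET.Element("index")
    for entry in entries:
        node = ET.SubElement(root, "entry", scan=entry.scan_id)
        ET.SubElement(node, "name").text = entry.name
        ET.SubElement(node, "birthDate").text = entry.birth_date
        ET.SubElement(node, "birthPlace").text = entry.birth_place
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample record, not taken from any real register.
    sample = [IndexEntry("scan-0001", "Jan Jansen", "1850-03-14", "Amsterdam")]
    print(entries_to_xml(sample))
```

Because the output is plain XML with no vendor-specific structure, any institution could load it into its own system, which is the kind of vendor independence the FAQ asks for.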
Other Dutch bloggers have covered the unique approach that Stadsarchief Amsterdam envisions for volunteer motivation and project support: users who want to download scans may pay for them either in cash or in labor, by indexing N scanned pages. Christian van der Ven's blog post Crowdsourcen rond militieregisters and the associated comment thread discuss this intensely and are worth reading in full. Here's a loosely translated excerpt:
The project assumes that it can not allow the volunteer to indicate whether he wants to index Zeeland or Groningen. It is--in the words of the project leader--the Orange feeling, to see if the rural people can volunteer and not just concentrate on their own location. Indexing people from their own village? Please, not that!

Well since the last World Cup I'm feeling Orange again, but overall experience and research in archives teaches that all country people are more interested in the history of themselves, their own ancestors, their homes and the surrounding area. The closer [the data], the more motivation to do something.

And if the purpose of this project is to build an indexing tool, to scan registers, and then to obtain indexes through crowdsourcing as quickly as possible, it seems to me that the public should be given what it wants: local resources if desired. What I suggest is a choice menu: do you want records from your source environment? Do you want them maybe only from a certain period? Or do you want them filtered by time and place? That kind of choice will trigger as many people as possible to participate, I think.
My Observations:
  • The pay-or-transcribe approach for acquiring scans is really innovative. Offering people alternatives for supporting the project is a great way of serving the varied constituencies that compose genealogical researchers, allowing cash-poor, time-rich users (like retirees) an easy way to access the project.
  • Although I have no experience in the subject, I suspect that this federated approach to digitization--taking structurally-similar material from regional archives and scanning/hosting it centrally--has a lot of possibilities.
  • Christian's criticism is quite valid, and drives right to the paradox of motivation in crowdsourcing: do you strive for breadth using external incentives like scoreboards and free recognition, or do you strive for depth and cultivate passionate users through internal incentives like deep engagement with the source material? Volunteer motivation and the trade-offs involved are a fascinating topic, and I hope to do a whole post on it soon.
  • One potential flaw is that it will be very hard to charge to view the scans when transcribers must be able to see the scans to do their indexing. I gather that the randomization in VeleHanden will address this.
  • The budget described in the RFP is a maximum of 150,000 euros. As a real-life software developer, it's hard for me to see how this would pay for building a transcription tool, index database, scan import tool, scan CMS, search database and (since they expect to sell the searched scans) eCommerce. And that includes running servers too!
  • This is yet another project that's transcribing structured data from tabular sources, which would benefit from the FamilySearch Indexer, if only it were open-source (or even for sale).


Richard Keijzer said...

Hi Ben,

Good initiative to translate this into English. When transcribing you can get a multitude of different originals, like preprinted forms (even in the 17th century in Amsterdam), or large handwritten pages, or typewritten deeds from 19th century US. It's very difficult to get the appropriate tool for all of them. You might take a look at Transcript, a freeware program written by Jacob Boerema. It can be seen here:
I use it regularly, and the only disadvantage is that it is rather hard to create and maintain columns in a transcription.


Ben W. Brumfield said...

Thanks, Richard. I know that Van Papier Naar Digitaal uses Transcript primarily for their clients, and hope to cover that project soon.

You're correct of course about formats -- I'd like to put together a post on the taxonomy of manuscript formats and the challenges of determining the target data structure. Certainly a tool like Transcript (or my own FromThePage, or MediaWiki's ProofreadPage) is a poor choice for tabular data without significant customization to the software or rigorous adherence to conventions by the users. Unfortunately, so far as I'm aware the best tool for tabular data (FamilySearch Indexing) is not publicly available.

Christian said...

Of course I already read through the draft of your post, but I thought it would be nice to comment on the 'official' post as well -- it's a nice one! Nothing wrong with your skills for translating either, or for interpreting Google's version of it. ;-)

In regard to the first bullet in the FAQ section of your post: later it was agreed on to allow for some archives to have scans participate in the project that they already have available, available for free even on their website. However, it's still not allowed to have those scans available for free in combination with the index. Yet this agreement makes it possible for the project to have even more scans available than would have been possible by scanning alone.

The index will be quite a simple one though: just each person's name, date and place of birth (and hopefully also his or her residence -- I got that in as a suggestion), enough to identify a person, and to decide whether or not paying for the scan is a good idea.

About the costs: A partner for the actual scanning has already been found, and there are several potential partners interested in building the platform/software.

Annemarie said...

Hello Ben,

Wonderful to learn about our project even from Texas! Many thanks for your kind words. And, as Richard already said: you’re doing a great job translating our site.

Indeed, we just contracted a partner for the scanning part of the project, and we are now in the middle of selecting a software developer for building the platform. We received as many as seven proposals in response to our RFP. We have no doubt this will result in a well-built service, including an indexing tool, a searchable database, a CMS, and all other facilities necessary for indexing, searching and buying. And perhaps the most important part: it must be fun!

We'll keep you posted!

Annemarie Lavèn
(project secretary)

Ben W. Brumfield said...

Thanks, Christian and Annemarie! While one might wish for a full transcription of all fields in the registers, one might also wish for infinite volunteers, so I think that the indexing approach is the most cost-effective way to go.

I wish the project the best, and hope you continue to blog about it with the openness you've shown so far!