Friday, May 24, 2013

Choosing Crowdsourced Transcription Platforms at SSA 2013

This is a transcript of my talk at the Society of Southwestern Archivists 2013 Annual Meeting.

[Update 2013-05-28: The audio for the talk may be downloaded as an MP3.]

This talk is about choosing a crowdsourced transcription platform, but "choosing" means a couple of things. "Choosing" can mean which, and "choosing" can mean whether -- should you do this at all? I'd like to address the latter first and give a little background on crowdsourcing and transcription before I go into any discussion of tool selection.

So the first question is, why transcribe? Because, after all, there are a lot of different crowdsourcing projects that are not transcription. You can do georectification. There are a lot of people doing tagging. After I'm done talking, Micah Erwin is going to give a presentation on his pretty amazing work doing crowdsourced identification of items within their collection. So why transcribe?

One reason to transcribe is that many of us face the same problem. If you have scanned documents, you're left with images of pages:

Now what? The fundamental problem with this is that nobody's going to read it. Nobody's going to read this, because nobody's going to find it. Because Google cannot index handwritten materials. These are pixels; these aren't data -- they aren't words to search engines.

So all the serendipity that you get in the Internet age from search engines is not available to you. Once you get these transcribed, you have the opportunity to connect with people who find you by searching for, say, their own name, and discover that you have material that mentions the great-grandfather they were named after.

One of my most active volunteers is transcribing a diary that was written by someone he's not related to. He found out about the project because he is named after the diarist's mailman.
So why crowdsource, rather than doing all this transcription yourself? 

Well, one argument is that it's free labor!  You're getting people to do your work for you!  This is a very powerful argument, and many of you may find it a very useful argument with your management.  It may even be an argument for putting material online that you wouldn't otherwise.
Unfortunately, that's not true.  It takes a lot of effort to run a crowdsourcing project.

Now, I'm an open source developer, and in the open source world we tend to differentiate between "free as in beer" and "free as in speech".
Crowdsourcing projects are really "free as in puppy".  The puppy is free, but you have to take care of it; you have to do a lot of work.  The volunteers who participate in these things don't like being ignored, and they don't like having their work lost.  They're doing something they feel is meaningful, and they're engaging with you, so you need to make sure their work is meaningful and that you engage with them.

So if free labor isn't the reason for crowdsourcing, why do a crowdsourcing project?
One of the most interesting perspectives on this comes from Trevor Owens at the Library of Congress.  He wrote a blog post last spring called "Crowdsourcing Cultural Heritage: The Objectives are Upside Down", in which he looked at the experience of volunteers participating in these crowdsourcing projects.  And he says that fundamentally, this isn't about getting free labor from the public.  This is about offering people a brand new and deeper way to interact with your collections: getting them to produce knowledge.  Getting them to engage with the material you put online more deeply than a consumer experience of scanning through things.
One example I'd like to give is a citizen science project.  The North American Bird Phenology Program is a crowdsourcing project that invites the public to transcribe bird observation cards made by amateur bird watchers over the course of about seventy years, beginning more than a hundred years ago.

Now, I tried this project out because I'm interested in transcription tools.  I'm not interested in birds.  But as I was going through marking up these observation cards from these different observers, I'm not really quite sitting at my computer anymore -- I'm deeply immersed within the documents.
And suddenly, I'm sitting with these guys, who are sitting and writing up their observations. 

So you have this opportunity to engage people very deeply -- to immerse them in your materials by offering them this kind of way of participating.

One of the examples that I like to use was a collaboration between me and Kathryn Stallard at Southwestern University--raise your hand, please Kathryn--in which she put online a diary of the Mexican-American War.  One volunteer--before we had even announced the project--went online and transcribed the entire diary.  But he didn't just transcribe it--he didn't just type what he saw.  He went back and made multiple revisions. He corrected things.  He identified names of materials and locations and battles.  He did research on the life histories of the people who were mentioned there.

This is not a consumer experience -- it's a way of pulling people into your materials.  And yes, Kathryn did get a transcript out of the results.  But I'm not sure that that was more valuable than the experience that Scott Patrick got going through transcribing, researching, and immersing himself within this diary of this Texan soldier.
So why crowdsource?  If you're bringing people into this experience--you're engaging members of the public who may live hundreds of miles away from your institution--what you're doing is moving them from being site visitors into another kind of relationship with your institution and with your materials.

Paul Flemons at the Atlas of Living Australia--the Australian Museum--describes it this way: Fundamentally, by engaging the public in digitizing their collections, they're educating the public and satisfying that part of their mission.  They are providing increased access to their collections that they would not have, again, with just images.  But most importantly, they're building an advocacy network for their collections, for their institution, for their discipline.
So, if crowdsourcing is a way to convert site visitors into volunteers, and to convert volunteers into advocates, what's next?

I'm not sure--this is all very new--but we're exploring this.  I'm working with an archives that holds a popular author's drafts, and they're starting a crowdsourced transcription project.  Part of what we're trying to do is link the transcription effort and the resulting transcripts to a donation campaign designated for digitizing more of their material.

So, we don't know--I'd love to come back next year and tell you how it worked out--but I'm really interested to see if we can create a virtuous cycle among digitization, crowdsourcing, fundraising that funds digitization, and on back.  I would love to see this [succeed].
Okay, so how do you choose a platform?  I'm a software developer, and I usually get up here and say, well, you could use this tool or this tool or this tool.  Unfortunately, there are a lot of tools, so I'm not going to stand up here and walk you through thirty tools; I'm not even going to talk about the two that I've been building.
What I want to talk about instead are things to consider when you're selecting a tool.  I group selection factors into four categories.  One is the kind of source material you're working with.  Another is the purpose of the transcripts -- what you are going to use them for, and maybe what the public are going to use them for.  Then there's the fit within your organization, and finally financial and technical resource considerations.
So you really need to think about what source material you're working with before you choose your platform.  People dealing with medieval manuscripts may want to work with a program which was developed for medieval manuscripts.  That program may be totally unsuitable for nineteenth-century letters.  So you really have to figure out what you're putting online first.  (Fortunately, most of the people in this room are already starting with scans -- with things they've already digitized.)

There are a lot of other factors here, but what I really want to drive home is that there are fits between particular materials and particular tools, because there is no "one size fits all" tool for transcription.
The purpose: how are you going to be using the data?  Are you going to be analyzing it?  The people who are tracking these bird observations really want a searchable database that they can go through and do climate change and habitat change analysis on.  People in this room may be more interested in extracting the subjects--the person names and place names that are mentioned within the documents.  But there are a lot of different uses, so you need to think about that.
So how does this fit within your organization?  There are platforms here which can be used behind closed walls.  So maybe you have students who you want to give the job of transcribing, and you don't even want the public to be involved.  Your goal is to improve undergraduate education by getting history students to interact with primary documents.  Maybe, on the other hand, you really want to cast as wide a net as possible to engage people way outside your institution, and no one within your institution really cares about this particular material you have.

How long is the project going to last?  Traditionally, crowdsourcing projects work well with the sorts of organic institutions that most people here [represent].  They work less well when they are funded by, say, a one-year grant -- where after a year of building up a community and working on the material, suddenly it's pencils down; lights off!
The final set of considerations is financial and technical resources, which unfortunately may overwhelm all the other considerations.

I was talking to an archivist at a library in Belgium last month who had a set of medieval manuscripts and wanted to use T-PEN, a tool which was built specifically for medieval manuscripts.  She knew all about it; she loved it.  But her material was in Omeka, and T-PEN doesn't work with Omeka, so she was going to use Scripto instead -- which is a great tool, but it only supports plain-text transcripts, which aren't really suited to her material.  She knew that, but her material was here [gestures], so that was directing her decision.

I think that's a shame, but it's unfortunately an important factor.  People don't want to have to set up multiple systems.  If you have all your material in ContentDM, you don't really want step one [of a crowdsourcing project] to be getting it all back out again.
All of these tools require some customization and some technical experience to get set up and running, so you need to consider whether you have people on-site who can do that or can pay people off-site who can do that.  And then there are all the digital preservation issues, which everyone here understands very well.


So rather than going through the tools, I want to direct you to a Google document to which about twenty-four people have contributed, adding their own projects and explaining whether their tools support TEI or EAD, whether they support semantic or genetic mark-up, what platforms they run on, what their rates are -- things like that.
So this TranscriptionToolGDoc is something I recommend.  I love having conversations about this, so send me an email and we'll brainstorm about projects.

Tuesday, May 14, 2013

Typologie des méthodes de contrôle de la qualité dans les projets de crowdsourcing

A translation of my 2012-03-05 post "Quality Control for Crowdsourced Transcription" which appeared in "Etat de l’art en matière de Crowdsourcing dans les bibliothèques numériques" by Moirez, Moreaux, and Josse (2013), reproduced for Francophone readers:
  1. "Single-track methods": the document receives only one transcription (produced either by a single contributor or collaboratively, with several people working together on the same document).
    1. "Open-ended community revision" (Wikipedia): users can keep editing the transcribed text with no time limit. A revision history makes it possible to roll back to earlier versions and to guard against vandalism.
    2. "Fixed-term community revision" (Transcribe Bentham): suited to more traditional editorial projects whose goal is the publication of a "final version". Once a transcription reaches an acceptable standard and is validated by experts, it is locked and published.
    3. "Community-controlled revision workflows" (Wikisource): a transcription is treated as a "final version" not because experts have signed off on it, but because it has passed through a collaborative correction/review/validation workflow.
    4. "Transcriptions with 'known-bad' insertions before proofreading": in a first pass, volunteers transcribe the text. Other volunteers then proofread the transcription against the original image; to make sure this second reading is actually done, deliberate errors are inserted into the text. If all of the "fake errors" are corrected, the system concludes that the real errors must have been corrected as well. (A sketch of this check appears after this list.)
    5. "Single-keying with expert review": once a contributor has produced a transcription, it is either approved or rejected by an expert (a professional at the host institution or a selected contributor). If it is rejected, it is either sent back for further correction or corrected by the expert and then approved.
  2. "Multi-track methods": these methods are particularly well suited to structured data and micro-tasks. The same source image is shown to several contributors, each of whom transcribes it from scratch. In general, contributors do not know whether they are the first to key the item or whether other transcriptions have already been submitted. The collected data are then compared automatically. (A reconciliation sketch appears after this list.)
    1. "Triple-keying with voting" (Old Weather, ReCAPTCHA): the image is shown to three contributors, and the majority reading wins. (Old Weather originally showed each image to ten contributors, but found that accuracy was essentially the same with three as with ten.)
    2. "Double-keying with expert reconciliation": the same data point is shown to two contributors, and if they disagree, an expert decides.
    3. "Double-keying with emergent community-expert reconciliation" (FamilySearch Indexing): almost the same as the previous method, except that the expert who arbitrates between two divergent keyings is himself a contributor, promoted to arbitrator through automatic analysis of his contributions (volume, accuracy).
    4. "Double-keying with N-keyed run-off votes": if the two contributors disagree, the item is offered again to a new pair or trio of users.
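
To make the "known-bad insertions" method in 1.4 concrete, here is a minimal Python sketch of the check, under my own assumptions: it is purely illustrative, not code from any production platform, and the helper names (plant_errors, proofreading_accepted) and the trick of reversing a word to fake an error are invented for the example.

```python
# Illustrative sketch of the "known-bad insertions" check: plant deliberate
# errors in the draft shown to a proofreader, then verify that every planted
# error was corrected before trusting the proofreading pass.
import random

def plant_errors(tokens, planted):
    """Copy the token list and substitute deliberate errors at a few random
    positions, recording the original token for each planted error."""
    corrupted = list(tokens)
    for pos in random.sample(range(len(tokens)), k=min(3, len(tokens))):
        planted[pos] = corrupted[pos]          # remember the correct reading
        corrupted[pos] = corrupted[pos][::-1]  # crude fake error: reversed word
    return corrupted

def proofreading_accepted(proofread_tokens, planted):
    """Accept the proofreading pass only if every planted error was corrected."""
    return all(proofread_tokens[pos] == original
               for pos, original in planted.items())

# Usage: present `corrupted` to the proofreader alongside the page image;
# if any planted error survives, re-queue the page for another pass.
draft = "the quick brown fox jumps over the lazy dog".split()
planted = {}
corrupted = plant_errors(draft, planted)
proofread = draft  # stand-in for a proofreader who caught every error
print(proofreading_accepted(proofread, planted))  # True
```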
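
The multi-track methods in section 2 all come down to the same mechanism: compare independent keyings and escalate disagreements. The sketch below assumes a simple majority vote over three keyings, with a "needs_review" flag standing in for expert, promoted-volunteer, or run-off reconciliation; it illustrates the general idea rather than Old Weather's or FamilySearch's actual implementation.

```python
# Illustrative sketch of multi-track reconciliation: each field is keyed
# independently by several volunteers; a value wins on majority, otherwise
# the field is flagged for reconciliation.
from collections import Counter

def reconcile(keyings, required_majority=2):
    """Majority vote over independent keyings of one field.
    Returns (winning_value, None) on agreement, or (None, 'needs_review')
    so an expert or another round of volunteers can arbitrate."""
    value, count = Counter(keyings).most_common(1)[0]
    return (value, None) if count >= required_majority else (None, "needs_review")

# Three independent keyings of a field on a bird observation card:
print(reconcile(["Purple Martin", "Purple Martin", "Purple Martn"]))
# -> ('Purple Martin', None): two of three volunteers agree.

print(reconcile(["5 May 1917", "15 May 1917", "5 Mar 1917"]))
# -> (None, 'needs_review'): no majority; an expert reconciles, or the field
#    is re-offered to a fresh set of volunteers (the run-off variant).
```

In practice a platform would presumably normalize whitespace and case before comparing keyings, and would route flagged fields into whichever reconciliation workflow -- expert, emergent community expert, or run-off voting -- the project has chosen.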