This is a transcript of my talk at the Society of Southwestern Archivists 2013 Annual Meeting.
[Update 2013-05-28: The audio for the talk may be downloaded as an MP3.]
This talk is about choosing a crowdsourced transcription platform, but
"choosing" means a couple of things. "Choosing" means which, and
"choosing" can mean whether -- should you do this at all? I'd
like to address the last and give a little background on crowdsourcing
and transcription before I go into any kind of discussion of tool
So the first question is, why transcribe? Because, after all,
there are a lot of different crowdsourcing projects that are not
transcription. You can do georectification. There are a lot of people
doing tagging. After I'm done talking, Micah Erwin is going to give a
presentation on his pretty amazing work doing crowdsourced
identification of items within their collection. So why transcribe?
One reason to transcribe is that many of us face a problem: if you
have scanned documents, the question becomes:
Now what? The fundamental problem with this is that nobody's
going to read it. Nobody's going to read this, because nobody's going
to find it. Because Google cannot index handwritten materials. These
are pixels; these aren't data -- they aren't words to search engines.
So all the serendipity that you get in the Internet age from search
engines is not useful to you. Once you get these transcribed, you get
the opportunity to connect with people who find you by searching for,
say, their own name, and discover that you have material that contains
their great-grandfather, who they're named after.
One of my most active volunteers is transcribing a diary that was written by someone he's not related to. He found out about the project because he is named after the diarist's mailman.
Well, one argument is that it's free labor! You're getting people to do your work for you! This is a very powerful argument, and many of you may find it a very useful argument with your management. It may even be an argument for putting material online that you wouldn't otherwise.
Now, I'm an open source developer, and in the open source world we tend to differentiate between "free as in beer" and "free as in speech".
A puppy is free, but you have to take care of it; you have to do a lot of work. Because volunteers that are participating in these things don't like being ignored. They don't like having their work lost. They're doing something that they feel is meaningful and engaging with you, so you need to make sure their work is meaningful and engage with them.
So if free labor isn't the reason for crowdsourcing, why do a crowdsourcing project?
"Crowdsourcing Cultural Heritage: The Objectives are Upside Down", in which he looked at the experience of volunteers participating in these crowdsourcing projects. And he says that fundamentally, this isn't about getting free labor from the public. This is about offering people a brand new and deeper way to interact with your collections: getting them to produce knowledge. Getting them to engage with the material you put online more deeply than just a consumer experience of scanning through things.
The North American Bird Phenology Program is a crowdsourcing project that is inviting the public to transcribe bird observation cards that were made by amateur bird watchers a hundred years ago, over the course of about seventy years.
Now, I tried this project out because I'm interested in transcription tools. I'm not interested in birds. But as I was going through marking up these observation cards from these different observers, I'm not really quite sitting at my computer anymore -- I'm deeply immersed within the documents.
So you have this opportunity to engage people very deeply -- to immerse them in your materials by offering them this kind of way of participating.
One of the examples that I like to use was a collaboration between me and Kathryn Stallard at Southwestern University--raise your hand, please, Kathryn--in which she put online a diary of the Mexican-American War. One volunteer--before we had even announced the project--went online and transcribed the entire diary. But he didn't just transcribe it--he didn't just type what he saw. He went back and made multiple revisions. He corrected things. He identified names of materials and locations and battles. He did research on the life histories of the people who were mentioned there.
This is not a consumer experience -- it's a way of pulling people into your materials. And yes, Kathryn did get a transcript out of the results. But I'm not sure that that was more valuable than the experience that Scott Patrick got going through transcribing, researching, and immersing himself within this diary of this Texan soldier.
Paul Flemons at the Atlas of Living Australia--the Australian Museum--describes it this way: Fundamentally, by engaging the public in digitizing their collections, they're educating the public and satisfying that part of their mission. They are providing increased access to their collections that they would not have, again, with just images. But most importantly, they're building an advocacy network for their collections, for their institution, for their discipline.
I'm not sure--this is all very new--but we're exploring this. I'm working with an archives that possesses a popular author's drafts, and they're starting a crowdsourced transcription project. Part of what we're trying to do with that is linking the transcription and the transcripts to a donation campaign that is designated for digitizing more of their material.
So, we don't know--I'd love to come back next year and tell you how it worked out--but I'm really interested to see if we can create a virtuous cycle among digitization, crowdsourcing, fundraising that funds digitization, and on back. I would love to see this [succeed].
the two that I've been building.
There are a lot of other factors here, but what I really want to drive home is that there are fits between particular materials and particular tools, because there is no "one size fits all" tool for transcription.
How long is the project going to last? Traditionally, crowdsourcing projects work well with the sorts of organic institutions that most people here [represent]. They work more poorly where they are, say, funded by a one-year grant -- where after a year of building up a community and working on the material, suddenly it's pencils down; lights off!
I was talking to an archivist at a library in Belgium last month who had a set of medieval manuscripts and wanted to use T-PEN, a tool which was built specifically for medieval manuscripts. She knew all about it; she loved it. But her material was in Omeka; the tool didn't work with Omeka; so she was going to use Scripto. Which is a great tool, but it just supports plain text transcripts, which isn't really suited for the material. She knew that, but her material was here [gestures], so that was directing her decision.
I think that's a shame, but it's unfortunately an important factor. People don't want to have to set up multiple systems. If you have all your material in CONTENTdm, you don't really want step one [of a crowdsourcing project] to be "get it all back out again."
So rather than going through the tools, I want to direct you to a Google document which has been contributed to by about twenty-four people who have added their own projects, explaining whether their tools support TEI or EAD, whether they support mark-up that's semantic or genetic, what their platforms are, what their rates are -- things like that.
TranscriptionToolGDoc is something I recommend. I love having conversations about this, so send me email and we'll brainstorm about projects.
Friday, May 24, 2013
Tuesday, May 14, 2013
A French translation of my 2012-03-05 post "Quality Control for Crowdsourced Transcription" appeared in "Etat de l’art en matière de Crowdsourcing dans les bibliothèques numériques" by Moirez, Moreaux, and Josse (2013). It is rendered in English here:
- "Single-track methods": the document is transcribed only once (by a single contributor, or collaboratively by several working together).
  - "Open-ended community revision" (Wikipedia): users can keep modifying the transcribed text with no time limit. A revision history makes it possible to revert to the previous version and guard against vandalism.
  - "Fixed-term community revision" (Transcribe Bentham): suited to more traditional editorial projects whose goal is to publish a "final version". When a transcription reaches an acceptable level, validated by experts, it is closed and published.
  - "Community-controlled revision workflows" (Wikisource): the transcription is considered a "final version" not because experts say so, but because it has passed through a collaborative workflow of correction, revision, and validation.
  - "Transcriptions with 'known-bad' insertions before proofreading": in a first phase, contributors are invited to transcribe. Other contributors then review the transcription by comparing it with the original text. To make sure the second reading is actually carried out, errors are added to the text: if all the "fake errors" are corrected, the system infers that the "real errors" must have been corrected as well.
  - "Single-keying with expert review": once a contributor has produced a transcription, it is validated or rejected by an expert (either a professional from the institution running the project or a selected contributor). If the transcription is rejected, it is either resubmitted for correction or corrected by the expert and validated.
- "Multi-track methods": these methods are particularly suited to work on structured data or micro-tasks. The same source image is presented to several contributors, each of whom transcribes from scratch. Generally, contributors do not know whether they are the first to transcribe the image or whether other transcriptions have already been submitted. The data collected is then compared automatically.
  - "Triple-keying with voting" (Old Weather, ReCAPTCHA): the image is presented to 3 contributors and the majority wins. (Initially, Old Weather showed each image to 10 contributors, but they found that accuracy was roughly the same with 3 as with 10.)
  - "Double-keying with expert reconciliation": the same item is presented to two contributors, and if they disagree, an expert decides.
  - "Double-keying with emergent community-expert reconciliation" (FamilySearch Indexing): the method is almost the same as the previous one, except that the expert who arbitrates between two diverging transcriptions is also a contributor, promoted to reconciler on the basis of automatic analysis of their contributions (volume, accuracy).
  - "Double-keying with N-keyed run-off votes": if the two contributors disagree, the item is put to a new pair or trio of users.
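
To make the multi-track methods a little more concrete, here is a minimal Python sketch of "Triple-keying with voting" and "Double-keying with N-keyed run-off votes". It is not how Old Weather, FamilySearch Indexing, or any other project actually implements reconciliation. The function names (`normalize`, `majority_vote`, `double_key`) are hypothetical, and two choices are assumptions on my part: keyings are compared after collapsing whitespace and case, and a run-off pools the two original keyings with the new ones before voting.

```python
from collections import Counter
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences are not disagreements.

    Assumption: a real project would choose its own normalization rules.
    """
    return " ".join(text.split()).casefold()


def majority_vote(keyings: Sequence[str]) -> Optional[str]:
    """Triple-keying with voting: return the normalized reading that a strict
    majority of contributors entered, or None if there is no majority."""
    if not keyings:
        return None
    counts = Counter(normalize(k) for k in keyings)
    winner, votes = counts.most_common(1)[0]
    return winner if votes * 2 > len(keyings) else None


def double_key(first: str, second: str, run_off: Sequence[str] = ()) -> Optional[str]:
    """Double-keying with an N-keyed run-off.

    If the two keyings agree, accept them. Otherwise pool them with the
    run-off keyings (assumption) and take a majority vote. None means the
    item is still unresolved and needs more keyings or an expert.
    """
    if normalize(first) == normalize(second):
        return normalize(first)
    if not run_off:
        return None
    return majority_vote([first, second, *run_off])


if __name__ == "__main__":
    # Three contributors key the same field; two agree apart from spacing and case.
    print(majority_vote(["Fair wind", "fair  wind", "Fair wird"]))
    # Two contributors disagree, so a third keying is requested as a run-off.
    print(double_key("12 N", "13 N", run_off=["12 N"]))
```

Returning `None` rather than guessing leaves room for the expert-reconciliation variants: an unresolved item can simply be routed to an expert or a promoted contributor instead of going to another vote.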