Collaborative Manuscript Transcription: May 2015

This is the text of my talk at the best practices panel at the Crowd Consortium for Libraries and Archives meeting Engaging the Public on May 8, 2015.

One caveat: most of my background is in crowdsourced manuscript transcription, though with the development of FromThePage 2 I've become involved in the related fields of collaborative document translation and crowd-sourced OCR correction. I hope that this is useful to non-textual projects as well.

The best practice I'd like to talk about is returning the product of crowd-sourcing to the volunteers that produced it.

What do I mean by product?

I'm not talking about what project managers consider the final product, whether that be item-level finding aids or peer-reviewed papers in the scholarly press. I'm talking about the raw product – the actual work that comes out of a volunteer's direct effort, or the efforts of their fellow volunteers – the transcript of a letter, the corrected text of a newspaper article, the translated photo captions, the carefully researched footnotes and often personal comments left on pages.

Why?

First, it's the right thing to do. Yesterday we talked about reciprocity and social justice. An older text says “Thou shalt not muzzle the oxen that tread out the corn.”

Crowdsourced transcription projects vary a lot on this. For wiki-like systems, displaying volunteer transcripts is built into the system – I know that's the case for FromThePage, Transcribe Bentham and Wikisource, and suspect the same applies to Scripto and DIYHistory. For others, users can't even see their own contributions after they have submitted them. However, the Smithsonian Institution Transcription Center actually added this feature on purpose – the team implementing the center added the ability for users to download PDFs of transcribed documents specifically because they felt it was the Right Thing to Do.

Now that I've quoted the Bible, let's talk about purely instrumental reasons crowdsourcing projects should return volunteers' labor to them.

Incentives

For one thing, exposing the raw data early can better align our projects with the incentives that motivate many volunteers. Most volunteers are not participating because of their affiliation with an institution, nor because they treasure clean library metadata – at least not primarily! What keeps them coming back and contributing is their connection to the material – an intrinsic motivation of experiencing life as a bird-watcher in the 1920s, of marching alongside a Civil War soldier as they transcribe observation cards or diaries.

We should expose the texts volunteers have worked on in ways that are immediately usable to them – PDFs they can print out, texts they can email, URLs they can post on Facebook – to show their friends and families just what they've been up to, and why they're so excited to volunteer.

In some cases this may provide extrinsic rewards project managers can't envision. One of the first projects I worked on, the Zenas Matthews diary of the Mexican-American War, attracted a super-volunteer early on who transcribed the entire diary in two weeks. When I interviewed Scott Patrick, I learned that the biggest reward we could provide – the thing he'd treasure above badges or leader boards – would be the text itself in a printable and publishable format. You see, Mr. Patrick's heritage organization formally recognizes members who have written books, including editions of primary sources. His contribution to the project certainly matched his fellows' for quality, but access to a usable form of the text – the text he'd transcribed himself – was the thing that stood in his way.

Recruitment

Exposing raw transcripts online during the crowdsourcing process can actually enhance recruitment to crowd-sourcing projects. I've seen this in a personal project I worked on, in which one super-volunteer found the project by Googling his own name. You see, a previous volunteer had transcribed a lot of material that mentioned a letter carrier named Nat Wooding. So when Nat Wooding did a vanity search, he found the transcribed diaries, recognized the letter carrier as his great-uncle, and became a major contributor to the project. Had the user-generated transcripts been locked away for expert review, or even published online somewhere outside of the crowdsourcing tool, we would have missed the contributions of a new super-volunteer.

Engagement

For the past three years, I've been involved with a non-profit called Free UK Genealogy. They have volunteers around the world transcribe genealogical records using offline, spreadsheet-like tools so that they can be searched on a freely accessible website.

I spent several months building a new system for crowd-sourced transcription of parish registers, but encountered very little enthusiasm – actually some outright opposition – from the most active volunteers. They were used to their spreadsheets, and saw no value at all in changing what they were doing.

Eventually, we switched from improving the transcription tool-chain to improving the delivery system. We re-wrote the public-facing search engine from scratch, focusing on the product visible to the volunteers and their communities. When we launched the site in April, it received the most positive reviews of any software redesign I've been involved with in two decades in the industry. Best of all – although the time frame is too short to have hard numbers – the volunteer community seems to have been reinvigorated, as the FreeREG2 database passed 32 million records at the beginning of the month.

So that's my best practice: expose volunteer contributions online, within your crowdsourcing system, as they are produced. It will improve the quality and productivity of the project, and it's the right thing to do.

Friday, May 8, 2015

Best Practices at Engaging the Public at CCLA
