Collaborative Manuscript Transcription: March 2012

Saturday, March 17, 2012

Crowdsourcing at IMLS WebWise 2012

The video of the crowdsourcing panel at IMLS WebWise is online, so I thought I'd post my talk. Like anyone who's created a transcript of their own unscripted remarks, I recommend watching the video. (My bit starts at 6:00, though all the speakers were excellent. Full-screening will hide the slides.) Nevertheless, I've added hyperlinks to the transcript and interpolated the slides with my comments below.

Okay. I'd like to talk about some of the lessons that have come out of my collaborations with small crowdsourcing projects. We hear a lot about these large projects like GalaxyZoo, like Transcribe Bentham. What can small institutions and small projects do, and do the rules that seem to apply to large projects also apply to them?

So there are three projects that I'm drawing from here in this experience. The first one I'm going to talk about in a little bit is one that was run by Balboa Park Online Collaborative. It's a Klauber Field Notes Transcription Project--the field notes of Laurence M. Klauber, who was the nation's foremost authority on rattlesnakes. These are field notes that he kept from 1923 through 1967. This is done by the San Diego Natural History Museum and is run by our own Perian Sully who is out there in the room somewhere.

The next project I want to talk about is the Diary of Zenas Matthews. Zenas Matthews was a volunteer from Texas who served in the American forces in the US-Mexican War of 1846 and this diary is kept by Southwestern University Special Collections. This had been digitized for a previous researcher and is small, but Southwestern itself is also quite small.

The third project I want to talk about is actually the origin of the software, which is the Julia Brumfield Diaries. If the name looks familiar, it's because she's my great-great grandmother. This project was the impetus for me to develop this tool for crowdsourced transcription.

So all of these projects, what they have in common is that we're talking about page counts that are in the thousands and volunteer counts that are numbered in the dozens at best. So these are not FamilySearch Indexing, where you can rely on hundreds of thousands of volunteers and large networks.

So who participates in large projects and who participates in small projects? One thing that I think is really interesting about crowdsourcing and these other sorts of participatory online communities is that the ratio of contributions to users follows what's called a power-law distribution.

If you look here, we see--and most famously this is Wikipedia--and you see a chart of the number of users on Wikipedia ranked by their contributions. And what you see is that 90% of the edits made to Wikipedia are done by 10% of the users.

If we look at other crowdsourced projects: this is the North American Bird Phenology Program out of Patuxent ~~Bay~~ Research Center [ed: actually Patuxent Wildlife Research Center], and this is a project in which volunteers are transcribing ornithology records--basically bird-watching records--that were sent in from the ~~1870s through the 1950s~~ [ed: 1880s-1970s], entering them into a database where they can be mined for climate change [data]. What's interesting about this to me at least is that--and this has been a phenomenally successful project: they've got 560,000 cards transcribed all by volunteers, but StellaW@Maine here has transcribed 126,000 of them, which is 22% of them. Now, CharlotteC@Maryland is close behind her (so go, local team!) but again you see the same kind of curve.

If we look at another relatively large project, the Transcribe Bentham project: this isn't a graph, but if you look at the numbers here, you see the same kind of thing. You see Diane with 78,000 points, you see Ben Pokowski with 51,000 points. You see this curve sort of taper down into more of a long tail.

So what about the small projects?

Well, let's look at the Klauber diaries. This is the top ten transcribers of the field notes of Laurence Klauber. And if you look at the numbers here--again in this case it's not quite as pronounced because I think the previous leader has dropped out and other people have overtaken him--but you see the same kind of distribution. This is not a linear progression; this is more of a power-law distribution.

If you look at an even smaller project--now, mind you this is a project that is really only of interest to members of my family and elderly neighbors of the diarist--but look: We've got Linda Tucker who has transcribed 713 of these pages followed by me and a few other people. But again, you have this power law that the majority of the work is being done by a very small group of people.
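One rough way to put numbers on this concentration is sketched below in Python. The helper `top_share` is a hypothetical name, and apart from the 713 pages mentioned above the per-user counts are made up for illustration. Only the 126,000 and 560,000 figures come from the Bird Phenology example.

```python
def top_share(contributions, top_fraction):
    """Return the share of all contributions made by the most active users.

    contributions: per-user contribution counts (illustrative data).
    top_fraction: e.g. 0.1 for the most active 10% of users.
    """
    total = sum(contributions)
    if total == 0:
        return 0.0
    ranked = sorted(contributions, reverse=True)
    # Always count at least one user so tiny projects still give an answer.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / total


# Figures quoted in the talk: one volunteer transcribed 126,000 of 560,000 cards.
single_volunteer_share = 126_000 / 560_000

# Mostly made-up counts for a ten-person project (713 is the only real figure).
example_counts = [713, 250, 90, 40, 20, 10, 5, 3, 2, 1]

print(f"Top volunteer's share of the cards: {single_volunteer_share:.1%}")
print(f"Share done by the top 20% of users: {top_share(example_counts, 0.2):.1%}")
```

A steep drop-off from the first few users to the rest is what distinguishes this kind of curve from a linear progression.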

Okay, what's going on really? What does this mean and why does it matter? The thing that I think this gets to, the reason that I think that this is important, is for a couple of reasons.

One is that this kind of behavior addresses one of the main objections to crowdsourcing. Now there are a lot of valid objections to crowdsourcing; I think that there are also a few invalid objections and one of them is essentially the idea that members of the public cannot participate in scholarly projects because my next door neighbor is neither capable nor interested in participating in scholarly projects. And we see this all over the place. I mean, here's a few example quotes--and I'm not going to read them out. I believe that this objection (which I have heard a number of times; I mean we see some examples right here) is a non sequitur. And I believe that the power-law distribution proves that it's a non sequitur. Really, I saw this most egregiously framed by a scholar who was passionately--just absolutely decrying--the idea that classical music fans would be able to competently translate from German into English because, he said, "After all, 40% of South Carolina voted for Newt Gingrich." Okay.

All right, so what's going on is I think best summed up by Rachel Stone, and what she essentially said is that crowdsourcing isn't getting the sort of random distribution from the crowd. Crowdsourcing is getting a number of "well-informed enthusiasts."

So where do we find well-informed enthusiasts to do this work and to do it well? Big projects have an advantage, right? They have marketing budgets. They have press coverage. They have an existing user base.

If you ask the people at the Transcribe Bentham project how did they get their users, they'll say "Well, you know that New York Times article really helped." That's cool! All right.

The GalaxyZoo people--Citizen Science Alliance--yesterday, 24 hours ago, announced a new project, SETILive. Now what this does is it pulls in live data from the SETI satellites[sic: actually telescope], and in those 24 hours--I took this screenshot; I actually skipped lunch to get this one screenshot because I knew that it would pass 10,000 people participating with 80,000 of these classifications. And it would have been higher, except last night the telescope got covered by cloud cover. So they dropped from getting 30 to 40 contributions per second to having to show sort of archival data and getting only 10 contributions per second. Well, they can do this because they have an existing base of active volunteers that numbers around 600,000.

So how do WE do that? How do we find well-informed enthusiasts? This is something that Kathryn Stallard and Anne Veerkamp-Andersen at Southwestern University Special Collections and I discussed a lot when we were trying to launch the Zenas Matthews Diary. We said, "Well, we don't have any budget at all." Kathryn said, "Well, let's talk about local archival newsletters. Let's post to H-Net lists." I was in favor of looking at online communities of people who might be doing Matthews genealogy or the military history war-gamers who have discussion forums on the Mexican War.

While we're arguing about this, Kathryn gets an email from a patron saying, "Hey, I'm a member of an organization. We see that you have this document. It relates to the Battle of San Jacinto and the Texas Revolution of 1836. Can you send this to us?"

She responds saying, "Hey, 1846, great. Check out this diary we just put online. I think that's what you're talking about."

Well that wasn't actually what he was talking about, but he responds and says, "Yeah, okay, I'll check that out, but can you please give me the document I want." They get it back to him and we returned to our discussion of "Okay, what do we need to do to roll this out? We're going to start working on the information architecture. We're going to work on the UI. We're going to work on help screens." And while we're having this conversation, Mr. Patrick checks it out.

And Scott Patrick starts transcribing.

And he starts transcribing some more.

And he continues transcribing.

And at this point, we're talking about working on the wording of the help screens, the wording of our announcement trying to attract volunteers, and this is page 43 of the 43-page diary!

And while we're discussing this, he goes back and he starts adding footnotes. Look at this: he's identifying the people who are in this, saying, "Hey, this guy who is mentioned is -- here's what his later life was. This other guy--hey, he's my first cousin, by the way, but he also left the governorship of the State of Texas to fight in this war."

He sees--and believe me, in the actual original diary, Piloncillo is not spelled Piloncillo. I mean it is a -- Zenas Matthews does not know Spanish, right? He identifies this! He identifies and looks up works that are mentioned here.

So wow! All right! We got our well-informed enthusiast! In 14 days, he transcribed the diary, and he didn't do just one pass. I mean as he got familiar with the hand, he goes back and revises the earlier transcriptions. He kind of figures out who's involved. He asks other members of his heritage organization what this is. He adds two dozen footnotes.

What just happened? What was that about? Who is this guy? Well, Scott Patrick is a retired petroleum worker who got interested in his family history, and then got interested in local history, and then got interested in heritage organizations. And he is our ideal "well-informed enthusiast".

So how did we find him? The project isn't public yet, right? Our challenge now is rephrasing our public announcement. We're now looking for volunteers to ... something that adequately describes what's left to do. Well, let's go back and take a look at this original letter, right? What we did is, we responded to an inquiry from a patron--and not an in-person patron: this is someone who lives 200 miles away from Georgetown, Texas.

What you have when someone is coming in and asking about material is, if you think about this in terms of target marketing--this is a target-rich environment. Here is someone who is interested. He's online. He's researching this particular subject. He is not an existing patron. He has no prior relationship with Southwestern University Libraries, but "Hey, while we answer your request, you might check this thing out that's in this related field." That seems to have worked in this one case. Hopefully, we'll get some more experience with future projects.

Okay, so how do we motivate volunteers? More importantly, how do we avoid de-motivating them?

Big projects, a lot of times they have a lot of interesting game-like features. Some of them actually are games. You have leader boards, you have badges, you have ways of making the experience more immersive.

OldWeather, which is run by GalaxyZoo, will plot your ship on a Google map as you transcribe the latitude and longitude elements from the log books.

The National Library of Finland has partnered with Microtask to actually create a crowdsourcing game of Whac-A-Mole. So this is crowdsourcing taken to the extreme.

But there's a peril here, and the peril is that all of these things are extrinsic motivators.

And we ran into this with the Klauber diaries. Perian came to me and said, "Hey, let's come up with a stats page, because we want to track where the diaries are at." So we come up with the stats page -- pretty basic, here's where some of these are at.

And hey, while we're at it, let's mine our data. We can come up with a couple of top-10 lists. So we come up with the top-ten list of transcribers and a top-ten list of editors, because that's the data I have.

Well remember, the whole point of this exercise is to index these diaries so that we can find the mentions of these individual species in the original manuscripts. Do you see indexing on here anywhere? Neither did our volunteers, and the minute this went up, the volunteers who previously had been transcribing and indexing every single page stopped indexing completely. They weren't being measured on it. We weren't saying that we rewarded them for it, so they stopped.

Needless to say, our next big-rush change was a top-ten list of indexers.

So this gets to "crowding-out" theory of motivation, and the expert on this is a researcher in the UK named Alexandra Eveleigh. Her point is that if you're going to design any kind of extrinsic motivation, you have to make sure that it promotes the actual contributory behavior, and this is something that applies, I believe, to small projects as well as large projects.

So I have 13 seconds left, so thank you, and I'll just end on that note.

Thursday, March 8, 2012

Jumping In With Both Feet

Although I didn't know it at the time, since I began work on FromThePage in 2005 I've had one toe in the digital humanities community. I've worked on FromThePage and I've blogged about crowdsourced manuscript transcription. I've met some smart, friendly people doing fascinating things and I've even taught some of them the magic of regular expressions. But I've always tried to squeeze this work into my "spare time" -- the interstices in the daily life of an involved father and a professional software engineer working a demanding but rewarding job. As the demands of vocation and avocation increase; as disparate duties begin to compete with each other; as new babies come into my home while new technologies come into my workplace and new requests for FromThePage arrive in my inbox, the basement inventor model becomes increasingly untenable. The numbers don't lie: I've only checked in code on four days during the last six months.

In January I was offered an incredible opportunity. Chris Lintott invited me to the Adler Planetarium to meet the Citizen Science Alliance's dev team. This talented, generous team of astronomer-developers gave me a behind-the-scenes tour of their Scribe tool--early versions of which powered OldWeather.org--and I was blown away. I don't think I've ever been so excited about a technology, and my mind raced with ideas for projects using it. Serendipitously, two days later I received email from Ben Laurie asking if I'd like to implement Scribe for the FreeREG project, a part of the FreeBMD genealogy charity that is transcribing parish registers recording the baptisms, marriages, and burials in England and Wales from 1538-1835. All development would be released open source, and all data would be as open as possible. It's a dream project for someone with my interests; there was no way I could pass this up.

So as of March 18 I'm starting a new career as an independent digital history developer. It is heartbreaking to leave my friends at Convio after nearly a dozen years, but I'm delighted with the possibilities my new autonomy offers. I hope to specialize in projects relating to crowdsourcing and/or manuscript transcription, but to be honest I'm not sure where this path will lead. Of course I plan to devote more time to FromThePage -- this year should finally see the publish-on-demand integration I've always been wishing for, as well as a few other features people have requested. If you've got a project that seems appropriate--whether it involves genealogy or herpetology, agricultural history or textile history--drop me a line.

Monday, March 5, 2012

Quality Control for Crowdsourced Transcription

Whenever I talk about crowdsourced transcription--actually whenever I talk about crowdsourced anything--the first question people ask is about accuracy. Nobody trusts the public to add to an institution's data/meta-data, nor especially to correct it. However, quality control over data entry is a well-explored problem, and while I'm not familiar with the literature from industry regarding commercial approaches, I'd like to offer the systems I've seen implemented in the kinds of volunteer transcription projects I follow. (Note: the terminology is my own, and may be non-standard.)

Single-track methods (mainly employed with large, prosy text that is difficult to compare against independent transcriptions of the same text). In these methods, all changes and corrections are made to a single transcription which originated with a volunteer and is modified thereafter. There is no parallel/alternate transcription to compare against.
1. Open-ended community revision: This is the method that Wikipedia uses, and it's the strategy I've followed in FromThePage. In this method, users may continue to change the text of a transcription forever. Because all changes are logged--with a pointer of some sort to the user who logged them--vandalism or edits which are not in good faith may be reverted to a known-good state easily. This is in keeping with the digital humanities principle of "no final version." In my own projects, I've seen edits made to a transcription years after the initial version, and those changes were indeed correct. (Who knew that "drugget" was a coarse fabric used for covering tobacco plant-beds?) Furthermore, I believe that there is no reason other than the cost of implementation why any of the methods below which operate from the "final version" mind-set should not allow error reports against their "published" form.
2. Fixed-term community revision: Early versions of both Transcribe Bentham and Scripto followed this model, and while I'm not sure if either of them still do, it does seem to appeal to traditional documentary editing projects that are incorporating crowdsourcing as a valuable initial input to a project while wishing to retain ultimate control over the "final version". In this model, wiki-like systems are used to gather the initial data, with periodic review by experts. Once a transcription reaches an acceptable status (deemed so by the experts), it is locked to further community edits and the transcription is "published" to a more traditional medium like a CMS or a print edition.
3. Community-controlled revision work-flows: This model is a cross between the two above-mentioned methods. Like fixed-term revision, it embraces the concept of a "final version", after which the text may not be modified. Unlike fixed-term revision, there are no experts involved here -- rather the tool itself forces a text to go through an edit/review/proofread/reject-approve workflow by the community, after which the version is locked against future edits. As far as I'm aware, this is only implemented by the ProofreadPage plugin to MediaWiki that has been used by Wikisource for the past few years, but it seems quite effective.
4. Transcription with "known-bad" insertions before proofreading: This is a two-phase process, which to my knowledge has only been tried by the Written Rummage project as described in Code4Lib issue 15. In the first phase, an initial transcription is solicited from the crowd (which in their case is a Mechanical Turk workforce willing to transcribe 19th-century diaries for around eight cents per page). In the second phase, the crowd is asked to review the initial transcription against the original image, proof-reading and correcting the first transcription. In order to make sure that a review is effective, however, extra words/characters are added to the data before it is presented to the proof-reader, and the location within the text of these known-bad insertions is recorded. The resulting corrected transcription is then programmatically searched for the bad data which had been inserted, and if it has been removed the system assumes that any other errors have also been removed -- or at least that a good-faith effort has been made to proofread and correct the transcript.
5. Single-keying with expert review: In this methodology, once a single volunteer contribution is made, it is reviewed by an expert and either approved or rejected. The expert is necessarily authorized in some sense -- in the case of the UIowa Civil War Diaries, the review is done by the library staff member processing the mailto form contribution, while in the case of FreeREG the expert is a "syndicate manager" -- a particular kind of volunteer within the FreeBMD charity. (FreeREG may be unique in using a single-track method for small, structured records, however it demands more paleographic and linguistic expertise from its volunteers than any other project I'm aware of.) If a transcription is rejected, it may be either returned to the submitter for correction or corrected by the expert and published in corrected form.
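
The "known-bad" insertion check in method 4 can be sketched roughly as follows. This is a minimal Python illustration, not the Written Rummage project's actual code: the function names `insert_decoys` and `review_was_effective` are hypothetical, and it assumes the decoy words are chosen so that they never occur in the genuine text. That assumption lets the check search for the decoy tokens themselves rather than track their positions.

```python
import random


def insert_decoys(words, decoys, rng):
    """Plant each decoy word at a random position in a copy of `words`.

    Returns the seeded word list and the list of decoys that were planted.
    Assumption: a decoy must not already occur in the genuine text.
    """
    for decoy in decoys:
        if decoy in words:
            raise ValueError(f"decoy {decoy!r} already occurs in the text")
    seeded = list(words)
    for decoy in decoys:
        # randint is inclusive, so a decoy may also land at the very end.
        position = rng.randint(0, len(seeded))
        seeded.insert(position, decoy)
    return seeded, list(decoys)


def review_was_effective(corrected_words, planted):
    """A review passes only if every planted decoy has been removed."""
    remaining = set(corrected_words)
    return all(decoy not in remaining for decoy in planted)


rng = random.Random(0)  # fixed seed so the example is repeatable
original = "went to town and bought drugget".split()
seeded, planted = insert_decoys(original, ["zebra", "quartz"], rng)

careless = list(seeded)                               # proofreader changed nothing
careful = [w for w in seeded if w not in planted]     # proofreader removed the decoys

print(review_was_effective(careless, planted))
print(review_was_effective(careful, planted))
```

Passing this check shows only that the planted errors were caught. Whether the genuine errors were also fixed remains an inference, as noted above.
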
Multi-track methods (mainly employed with easily-isolated, structured records like census entries or ship's log books). In all of these cases, the same image is presented to different users to be transcribed from scratch. The data thus collected is compared programmatically on the assumption that two correct transcriptions will agree with each other and may be assumed to be valid. If the two transcriptions disagree with each other, however, one of them must be in error, so some kind of programmatic or human expert intervention is needed. It should be noted that all of these methodologies are technically "blind" n-way keying, as the volunteers are unaware of each other's contributions and do not know whether they are interpreting the data for the first time or contributing a duplicate entry.
1. Triple-keying with voting: This is the method that the Zooniverse OldWeather team uses. Originally the OldWeather team collected the same information in ten different independent tracks, entered by users who were unaware of each other's contributions: blind, ten-way keying. The assumption was that majority reading would be the correct one, so essentially this is a voting system. After some analysis it was determined that the quality of three-way keying was indistinguishable from that of ten-way keying, so the system was modified to a less-skeptical algorithm, saving volunteer effort. If I understand correctly, the same kind of voting methodology is used by ReCAPTCHA for its OCR correction, which allowed its exploitation by 4chan.
2. Double-keying with expert reconciliation: In this system, the same entry is shown to two different volunteers, and if their submissions do not agree it is passed to an expert for reconciliation. This requires a second level of correction software capable of displaying the original image along with both submitted transcriptions. If I recall my fellow panelist David Klevan's WebWise presentation correctly, this system is used by the Holocaust Museum for one of their crowdsourcing projects.
3. Double-keying with emergent community-expert reconciliation: This method is almost identical to the previous one, with one important exception. The experts who reconcile divergent transcriptions are themselves volunteers -- volunteers who have been promoted from transcribers to reconcilers through an algorithm. If a user has submitted a certain (large) number of transcriptions, and if those transcriptions have either 1) matched their counterpart's submission, or 2) been deemed correct by the reconciler when they are in conflict with their counterpart's transcription, then the user is automatically promoted. After promotion, they are able to choose their volunteer activity from either the queue of images to be transcribed or the queue of conflicting transcriptions to be reconciled. This is the system used by FamilySearch Indexing, and its emergent nature makes it a particularly scalable solution for quality control.
4. Double-keying with N-keyed run-off votes: Nobody actually does this that I'm aware of, but I think it might be cost-effective. If the initial set of two volunteer submissions don't agree, rather than submit the argument to an expert, re-queue the transcription to new volunteers. I'm not sure what the right number is here -- perhaps only a single tie-breaker vote, but perhaps three new volunteers to provide an overwhelming consensus against the original readings. If this is indecisive, why not re-submit the transcription again to an even larger group? Obviously this requires some limits, or else the whole thing could spiral into an infinite loop in which your entire pool of volunteers are arguing with each other about the reading of a single entry that is truly indecipherable. However, I think it has some promise as it may have the same scalability benefits as the previous method without needing either the complex promotion algorithm or the reconciliation UI.
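
The run-off idea in the last item can be sketched in a few lines of Python. This is a thought experiment under stated assumptions, not a description of any existing system: `reconcile`, `request_more`, and `make_queue` are hypothetical names, and the round limit, batch size, and strict-majority threshold are arbitrary choices. Readings are compared after normalizing case and whitespace.

```python
from collections import Counter


def normalize(text):
    """Collapse case and whitespace so trivial differences are not disagreements."""
    return " ".join(text.lower().split())


def reconcile(initial, request_more, max_rounds=3, round_size=3):
    """Double-keying with run-off votes for a single entry.

    initial: the first two independent transcriptions.
    request_more: hypothetical callback returning up to `n` fresh
        transcriptions from volunteers who have not seen this entry.
    Returns the agreed reading, or None once the round limit is reached,
    at which point the entry could go to an expert or be marked illegible.
    """
    if len(initial) < 2:
        raise ValueError("double-keying needs at least two transcriptions")
    readings = [normalize(t) for t in initial]
    for round_number in range(max_rounds + 1):
        best, votes = Counter(readings).most_common(1)[0]
        # Accept only a strict majority of all readings collected so far.
        if votes / len(readings) > 0.5:
            return best
        if round_number == max_rounds:
            break  # the limit that prevents an endless argument
        readings.extend(normalize(t) for t in request_more(round_size))
    return None


def make_queue(items):
    """Stand-in for a volunteer queue: hands out pre-written transcriptions."""
    pending = list(items)

    def request_more(n):
        batch = pending[:n]
        del pending[:n]
        return batch

    return request_more


print(reconcile(["Lat 12 N", "lat 12  n"], make_queue([])))
print(reconcile(["12 E", "12 W"], make_queue(["12 W", "12 W", "12 E"])))
```

Because the loop stops after `max_rounds` requests, a truly indecipherable entry falls out with `None` instead of consuming the whole volunteer pool.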

Caveats: Some things are simply not knowable. It is hard to evaluate the effectiveness of quality control seriously without taking into account the possibility that volunteer contributors may be correct and experts may be wrong, or, more importantly, that some images are simply illegible regardless of the paleographic expertise of the transcriber. The Zooniverse team is now exploring ways for volunteers to correct errors made not by transcribers but rather by the midshipmen of the watch who recorded the original entries a century ago. They realize that a mistaken "E" for "W" in a longitude record may be more amenable to correction than a truly illegible entry. Not all errors are made by the "crowd", after all.

Much of this list is based on observation of working sites and extrapolation, rather than any inside information. I welcome corrections and additions in the comments or at benwbrum@gmail.com.

[Update 2012-03-07: Folks from Transcribe Bentham informed me on Twitter that "In general, at the moment most transcripts are worked on by one volunteer, checked and then locked. Vols seem to prefer working on fresh MSS to part transcribed." and "For the record, @TranscriBentham does still use 'Fixed-term community revision'. There are weekly updates on the blog." Thanks, Tim and Justin!]

Sunday, March 4, 2012

We Get Press!

Crowdsourced transcription projects--and FromThePage in particular--have gotten some really nice press in the last few weeks.

Konrad Lawson posted an excellent review of Scripto and FromThePage on the ProfHacker blog at The Chronicle of Higher Education: Crowdsourcing Transcription: FromThePage and Scripto.

Francine Diep wrote a great article on the phenomenon at Innovation News Daily: Volunteer Transcribers Put Millions of Pages Online.

Ellen Davis's article on Southwestern's transcription of Zenas Matthews's 1846 Mexican War Diary is especially notable because it includes an interview with Scott Patrick, the volunteer who has done such a spectacular job: Collaborative Transcription Project.
