Monday, March 5, 2012

Quality Control for Crowdsourced Transcription

Whenever I talk about crowdsourced transcription--actually, whenever I talk about crowdsourced anything--the first question people ask is about accuracy. Nobody trusts the public to add to an institution's data or metadata, much less to correct it. However, quality control over data entry is a well-explored problem, and while I'm not familiar with the industry literature on commercial approaches, I'd like to offer the systems I've seen implemented in the kinds of volunteer transcription projects I follow. (Note: the terminology is my own, and may be non-standard.)
  1. Single-track methods (mainly employed with long, prosy texts that are difficult to compare against independent transcriptions of the same material). In these methods, all changes and corrections are made to a single transcription which originated with a volunteer and is modified thereafter. There is no parallel or alternate transcription to compare against.
    1. Open-ended community revision: This is the method Wikipedia uses, and it's the strategy I've followed in FromThePage. In this method, users may continue to change the text of a transcription forever. Because all changes are logged--with a pointer of some sort to the user who made them--vandalism or edits made in bad faith can easily be reverted to a known-good state. This is in keeping with the digital humanities principle of "no final version." In my own projects, I've seen edits made to a transcription years after the initial version, and those changes were indeed correct. (Who knew that "drugget" was a coarse fabric used for covering tobacco plant-beds?) Furthermore, I believe that nothing other than the cost of implementation prevents any of the methods below that operate from a "final version" mind-set from also accepting error reports against their "published" form.
    2. Fixed-term community revision: Early versions of both Transcribe Bentham and Scripto followed this model, and while I'm not sure whether either of them still does, it does seem to appeal to traditional documentary editing projects that want crowdsourcing as a valuable initial input while retaining ultimate control over the "final version". In this model, wiki-like systems are used to gather the initial data, with periodic review by experts. Once a transcription reaches an acceptable state (as judged by those experts), it is locked against further community edits and "published" to a more traditional medium like a CMS or a print edition.
    3. Community-controlled revision workflows: This model is a cross between the two methods above. Like fixed-term revision, it embraces the concept of a "final version", after which the text may not be modified. Unlike fixed-term revision, no experts are involved -- rather, the tool itself forces a text through an edit/review/proofread/reject-or-approve workflow carried out by the community, after which the version is locked against future edits. As far as I'm aware, this is only implemented by the ProofreadPage plugin to MediaWiki that Wikisource has used for the past few years, but it seems quite effective.
    4. Transcription with "known-bad" insertions before proofreading: This is a two-phase process which, to my knowledge, has only been tried by the Written Rummage project as described in Code4Lib issue 15. In the first phase, an initial transcription is solicited from the crowd (which in their case is a Mechanical Turk workforce willing to transcribe 19th-century diaries for around eight cents per page). In the second phase, the crowd is asked to review the initial transcription against the original image, proofreading and correcting it. To make sure that a review is effective, however, extra words or characters are added to the data before it is presented to the proofreader, and the location of these known-bad insertions within the text is recorded. The resulting corrected transcription is then programmatically searched for the inserted bad data, and if it has been removed the system assumes that any other errors have also been removed -- or at least that a good-faith effort has been made to proofread and correct the transcript. (A rough sketch of this check appears after this list.)
    5. Single-keying with expert review: In this methodology, once a single volunteer contribution is made, it is reviewed by an expert and either approved or rejected. The expert is necessarily authorized in some sense -- in the case of the UIowa Civil War Diaries, the review is done by the library staff member processing the mailto-form contribution, while in the case of FreeREG the expert is a "syndicate manager" -- a particular kind of volunteer within the FreeBMD charity. (FreeREG may be unique in using a single-track method for small, structured records; however, it demands more paleographic and linguistic expertise from its volunteers than any other project I'm aware of.) If a transcription is rejected, it may either be returned to the submitter for correction or be corrected by the expert and published in corrected form.
  2. Multi-track methods (mainly employed with easily isolated, structured records like census entries or ships' logbooks). In all of these cases, the same image is presented to different users to be transcribed from scratch. The transcriptions thus collected are compared programmatically, on the assumption that two transcriptions which agree with each other are correct and may be treated as valid. If the two transcriptions disagree, however, at least one of them must be in error, so some kind of programmatic or expert human intervention is needed. It should be noted that all of these methodologies are technically "blind" n-way keying, as the volunteers are unaware of each other's contributions and do not know whether they are interpreting the data for the first time or contributing a duplicate entry.
    1. Triple-keying with voting: This is the method the Zooniverse OldWeather team uses. Originally the OldWeather team collected the same information in ten independent tracks, entered by users who were unaware of each other's contributions: blind, ten-way keying. The assumption was that the majority reading would be the correct one, so essentially this is a voting system (see the consensus sketch after this list). After some analysis it was determined that the quality of three-way keying was indistinguishable from that of ten-way keying, so the system was modified to use this less-skeptical algorithm, saving volunteer effort. If I understand correctly, the same kind of voting methodology is used by ReCAPTCHA for its OCR correction, which is what allowed its exploitation by 4chan.
    2. Double-keying with expert reconciliation: In this system, the same entry is shown to two different volunteers, and if their submissions do not agree it is passed to an expert for reconciliation. This requires a second level of correction software capable of displaying the original image along with both submitted transcriptions. If I recall my fellow panelist David Klevan's WebWise presentation correctly, this system is used by the Holocaust Museum for one of their crowdsourcing projects.
    3. Double-keying with emergent community-expert reconciliation: This method is almost identical to the previous one, with one important exception: the experts who reconcile divergent transcriptions are themselves volunteers -- volunteers who have been promoted from transcribers to reconcilers by an algorithm. If a user has submitted a certain (large) number of transcriptions, and if those transcriptions have either 1) matched their counterpart's submission, or 2) been deemed correct by the reconciler when they conflicted with their counterpart's transcription, then the user is automatically promoted. After promotion, they may choose their volunteer activity from either the queue of images to be transcribed or the queue of conflicting transcriptions to be reconciled. This is the system used by FamilySearch Indexing, and its emergent nature makes it a particularly scalable solution for quality control. (A sketch of what such a promotion rule might look like appears after this list.)
    4. Double-keying with N-keyed run-off votes: Nobody actually does this that I'm aware of, but I think it might be cost-effective. If the initial two volunteer submissions don't agree, rather than submit the dispute to an expert, re-queue the transcription to new volunteers. I'm not sure what the right number is here -- perhaps a single tie-breaker vote, or perhaps three new volunteers to provide an overwhelming consensus against the original readings. If this is indecisive, why not re-submit the transcription to an even larger group? Obviously this requires some limits, or else the whole thing could spiral into an infinite loop in which your entire pool of volunteers argues over the reading of a single entry that is truly indecipherable. However, I think it has some promise, as it may offer the same scalability benefits as the previous method without needing either the complex promotion algorithm or the reconciliation UI. (The run-off sketch after this list shows one way this might work.)
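
A few of the mechanisms above are easier to see in code than in prose. First, here is a minimal sketch of the known-bad-insertion check from method 1.4. The decoy tokens, function names, and acceptance rule are my own inventions for illustration; they are not drawn from the Written Rummage project's actual code.

```python
import random

def seed_decoys(words, decoys):
    """Insert known-bad tokens at random positions before the transcript
    is shown to a proofreader, so we can check for them afterwards."""
    seeded = list(words)
    for decoy in decoys:
        seeded.insert(random.randint(0, len(seeded)), decoy)
    return seeded

def proofread_looks_genuine(corrected_words, decoys):
    """If every planted decoy has been removed, assume a good-faith
    proofreading pass was made over the whole transcript."""
    return not any(decoy in corrected_words for decoy in decoys)

# Example: plant two nonsense tokens, then check the corrected copy.
original = "the regiment marched ten miles before noon".split()
decoys = ["zzgrout", "flimsqua"]  # hypothetical decoys unlikely to occur naturally
shown_to_proofreader = seed_decoys(original, decoys)
corrected = [w for w in shown_to_proofreader if w not in decoys]  # an ideal proofreader
assert proofread_looks_genuine(corrected, decoys)
```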
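
Next, methods 2.1 and 2.4 both boil down to comparing independent keyings and escalating on disagreement. The sketch below shows one way the majority vote and my proposed run-off might work; the agreement rule, round sizes, and the fetch_keying callback are assumptions of mine, not a description of OldWeather's (or anyone else's) production system.

```python
from collections import Counter

def consensus(keyings, min_agreement=2):
    """Return a reading that both clears min_agreement and is a strict
    majority of the keyings seen so far; otherwise None (disagreement)."""
    if not keyings:
        return None
    reading, count = Counter(keyings).most_common(1)[0]
    if count >= min_agreement and count > len(keyings) // 2:
        return reading
    return None

def runoff_keying(fetch_keying, max_rounds=3):
    """Double-key an entry; on disagreement, re-queue it to progressively
    larger groups of fresh volunteers instead of sending it to an expert."""
    keyings = [fetch_keying(), fetch_keying()]
    for round_size in range(1, max_rounds + 1):
        agreed = consensus(keyings)
        if agreed is not None:
            return agreed
        keyings.extend(fetch_keying() for _ in range(round_size))
    # Still no majority after the last round: flag as illegible or
    # hand it to a human reconciler rather than looping forever.
    return consensus(keyings)

# Triple-keying as in 2.1: three independent readings, majority wins.
print(consensus(["Lat 51.4", "Lat 51.4", "Lat 57.4"]))  # -> "Lat 51.4"

# Simulated run-off as in 2.4: the first two keyings disagree,
# so a third volunteer breaks the tie.
queue = iter(["Able Seaman", "Abel Seaman", "Able Seaman"])
print(runoff_keying(lambda: next(queue)))  # -> "Able Seaman"
```

Here fetch_keying stands in for whatever mechanism serves the same image to a volunteer who has not yet seen it and returns their transcription.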
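
Finally, the promotion rule behind method 2.3 might look something like the outline below. The thresholds and field names are hypothetical, invented purely to show the shape of such a rule; I have no knowledge of FamilySearch Indexing's actual criteria.

```python
from dataclasses import dataclass

@dataclass
class VolunteerRecord:
    submissions: int = 0  # total entries transcribed
    agreed: int = 0       # matched the counterpart's keying outright
    upheld: int = 0       # disagreed, but judged correct by a reconciler

# Hypothetical thresholds, chosen only to illustrate the shape of the rule.
MIN_SUBMISSIONS = 500
MIN_ACCURACY = 0.95

def eligible_for_promotion(v: VolunteerRecord) -> bool:
    """Promote a transcriber to reconciler once they have a large body of
    work, nearly all of which was matched or upheld on review."""
    if v.submissions < MIN_SUBMISSIONS:
        return False
    return (v.agreed + v.upheld) / v.submissions >= MIN_ACCURACY
```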
Caveats: Some things are simply not knowable. It is hard to evaluate the effectiveness of quality control seriously without taking into account the possibility that volunteer contributors may be correct and experts may be wrong, or, more importantly, that some images are simply illegible regardless of the paleographic expertise of the transcriber. The Zooniverse team is now exploring ways for volunteers to correct errors made not by transcribers but by the midshipmen of the watch who recorded the original entries a century ago. They realize that a mistaken "E" for "W" in a longitude record may be more amenable to correction than a truly illegible entry. Not all errors are made by the "crowd", after all.

Much of this list is based on observation of working sites and extrapolation, rather than any inside information. I welcome corrections and additions in the comments or at benwbrum@gmail.com.

[Update 2012-03-07: Folks from Transcribe Bentham informed me on Twitter that "In general, at the moment most transcripts are worked on by one volunteer, checked and then locked. Vols seem to prefer working on fresh MSS to part transcribed." and "For the record, does still use 'Fixed-term community revision'. There are weekly updates on the blog." Thanks, Tim and Justin!]

2 comments:

Anonymous said...

I may be remembering this incorrectly, but I thought the USGS Bird Phenology Program does or used to do your #9, "Double-keying with N-keyed run-off votes". I'm having conflicting memories about this program. About 8 months ago, I remember hearing they used an algorithm to match 2 submissions, and if no match occurred, it would be transcribed by additional transcribers (not sure how many times). More recently, when I heard someone from that project speak, they seemed to be doing something different.

Ben W. Brumfield said...

I asked Jessica Zelt about the BPP's methodology, and she kindly responded with the details:

"Two transcribers transcribe one card. No transcriber can get the same card twice. Those two transcriptions are then compared to each other. If they match, the data is sent into the database. If they do not match, the card sent to our "rectification system." This system is currently in beta testing, in the meanwhile, the card is separated so it will not be transcribed further. Only a select group of volunteers will have access to our rectification screen where they will do the final transcription and/or rectify some incorrectly transcribed information. The card is then sent into the final database."