Thursday, October 18, 2012

Jens Brokfeld's Thesis on Crowdsourced Transcription

Although the field of transcription tools has become increasingly popular over the last couple of years, most academic publications on the topic focus on a single project and the lessons that project can teach.  While those provide invaluable advice on how to run crowdsourcing projects, they do not lend much help to memory professionals trying to decide which tools to explore when they begin a new project.  Jens Brokfeld's thesis for his MLIS degree at Fachhochschule Potsdam is the most systematic, detailed, and thorough review of crowdsourced manuscript transcription tools to date.

After a general review of crowdsourcing cultural heritage, Brokfeld reviews Rose Holley's checklist for crowdsourcing projects and then expands upon the part of my own TCDL presentation which discussed criteria for selecting transcription tools, synthesizing it with published work on the subject.  He then defines his own test criteria for transcription tools, about which more below.  Then, informed by seventy responses to a bilingual survey of crowdsourced transcription users, Brokfeld evaluates six tools (FromThePage, Refine!, Wikisource, Scripto, T-PEN, and the Bentham Transcription Desk) with forty-two pages (pp. 40-82) devoted to tool-specific descriptions of the capabilities and gaps within each system.  This exploration is followed by an eighteen-page comparison of the tools against each other (pp. 83-100). The whole paper is very much worth your time, and can be downloaded at the "Masterarbeit.pdf" link here: "Evaluation von Editionswerkzeugen zur nutzergenerierten Transkription handschriftlicher Quellen".

It would be asking too much of my limited German to translate the extensive tool descriptions, but I think I should acknowledge that I found no errors in Brokfeld's description of my own tool, FromThePage, so I'm confident in his evaluation of the other five systems.  However, I feel like I ought to attempt to abstract and translate some of his criteria for evaluation, as well as his insightful analysis of each tool's suitability for a particular target group.
Chapter 5:  Prüfkriterien ("Test Criteria")

5.1 Accessibility (by which he means access to transcription data from different personal-computer-based clients)
5.1.1 Browser Support
5.2 Findability
5.2.1 Interfaces (including support for API protocols such as OAI-PMH, as well as functionality to export transcripts in XML or to import facsimiles)
5.2.2 References to Standards (this includes support for normalization of personal and place names in the resulting editions)
5.3 Longevity
5.3.1 License (is the tool released under an open-source license that addresses digital preservation concerns?)
5.3.2 Encoding Format (TEI or something else?)
5.3.3 Hosting
5.4 Intellectual Integrity (primarily concerned with support for annotations and explicit notation of editorial emendations)
5.4.1 Text Markup
5.5 Usability (similar to "accessibility" in American usage)
5.5.1 Transcription Mode (transcriber workflows)
5.5.2 Presentation Mode (transcription display/navigation)
5.5.3 Editorial Statistics (tracking edits made by individual users)
5.5.4 User Management (how does the tool balance ease-of-use with preventing vandalism?)

I don't believe that I've seen many of these criteria used before, and would welcome a more complete translation.  

His comparison based on target group is even more innovative.  Brokfeld recognizes that different transcription projects have different needs, and is the first scholar to define those target groups.  Chapter 7 of his thesis defines those groups as follows:

Science:  The scientific community is characterized by concern over the richness of mark-up as well as a preference for customizability of the tool over simplicity of user interface. [Note: it is entirely possible that I mis-translated Wissenschaft as "science" instead of "scholarship".]
Family History: Usability and a simple transcription interface are paramount for family historians, but privacy concerns over personal data may play an important role in particular projects.
Archives: While archives attend to scholarly standards, their primary concern is for the transcription of extensive inventories of manuscripts -- for which shallow markup may be sufficient.  Archives are particularly concerned with support for standards.
Libraries: Libraries pay particular attention to bibliographical standards. They also may organize their online transcription projects by fonds, folders, and boxes.
Museums: In many cases museums possess handwritten sources which refer to their material collections.  As a result, their transcriptions need to be linked to the corresponding object.

It's very difficult for me to summarize or extract Brokfeld's evaluation of the six different tools for five different target groups, since those comparisons are in tabular form with extensive prose explanations.  I encourage you to read the original, but I can provide a totally inadequate summary for the impatient:
  • FromThePage: Best for family history and libraries; worst for science.
  • Refine!: Best for libraries, followed by archives; worst for family history.
  • Wikisource: Best for libraries, archives and museums; worst for family history.
  • Scripto: Best for museums, followed by archives and libraries; worst for family history and science.
  • T-PEN: Best for science. 
  • Bentham Transcription Desk: Best for libraries, archives and museums.
Note: This is a summary of part of a 140-page German document translated by an amateur.  Consult the original before citing or making decisions based on the information here. Jens Brokfeld welcomes questions and comments (in English or German) through this webform:


JB Piggin said...

Very interesting. You are quite right: in this context, one would indeed translate Wissenschaft as scholarship.

Arno Bosse said...

Thank you very much for this link - I wasn't aware of Jens Brokfeld's work. I think Peter Organisciak's M.A. thesis evaluating crowdsourcing sites may be available as well. He presented some initial work-in-progress findings at DH2011 at Stanford.