Wednesday, October 24, 2012

Interview with Ben Crowder on Unbindery

One of the pleasures of maintaining the crowdsourced transcription tool list is learning about systems I'd never heard about before.  One of these is Unbindery, a tool being built by Ben Crowder for audio and manuscript transcription as well as OCR correction.  Ben was gracious enough to grant me an interview, even though he's concentrating on the final stretch of development work on Unbindery.

First, let me wish you luck as you enter the final push on Unbindery. What would you say is the most essential feature you have left to work on?

Thanks! Probably private projects -- I've been looking forward to using Unbindery to transcribe my journals, but haven't wanted them to be open for just anyone to work on. I'm also very excited about chunking audio into small segments (I used to publish an online magazine where we primarily published interviews, and transcribing two hours of audio can be really daunting).

Tell us more about how Unbindery handles both audio transcription and manuscript transcription. Usually those tools are very different, aren't they?

The audio transcription part started out as Crosswrite, a little proof-of-concept I threw together when I realized one day that JavaScript would let me control the playhead on an audio element, making it really easy to write a software version of a transcription foot pedal. I also wanted to start using Unbindery for family history purposes (transcribing audio interviews with my grandparents, mainly, and divvying up that workload among my siblings).
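
(For the curious: the playhead trick Ben mentions really does come down to a few lines of JavaScript. The sketch below is my own rough illustration, not Crosswrite's actual code -- the element id and key bindings are invented -- but it shows how an HTML5 audio element can stand in for a transcription foot pedal.)

```javascript
// Rough sketch of a keyboard "foot pedal" for an HTML5 <audio> element.
// Assumes markup like <audio id="interview-audio" src="interview.mp3"></audio>;
// the element id and the key bindings are invented for illustration.
var audio = document.getElementById('interview-audio');

document.addEventListener('keydown', function (event) {
  if (!event.ctrlKey) return;  // only act while Ctrl is held, so normal typing is unaffected

  if (event.key === ' ') {                 // Ctrl+Space: toggle play/pause
    event.preventDefault();
    if (audio.paused) { audio.play(); } else { audio.pause(); }
  } else if (event.key === 'ArrowLeft') {  // Ctrl+Left: rewind five seconds
    event.preventDefault();
    audio.currentTime = Math.max(0, audio.currentTime - 5);
  } else if (event.key === 'ArrowRight') { // Ctrl+Right: fast-forward five seconds
    event.preventDefault();
    audio.currentTime = Math.min(audio.duration, audio.currentTime + 5);
  }
});
```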

So, to handle both audio transcription and page image transcription, Unbindery has a modular item type editor system. Each item type has its own set of code (HTML/CSS/JavaScript) that it loads when transcribing an item. For example, page images show an image and a text box, with some JavaScript to place a highlight line when you click on the image, whereas audio items replace the image with Crosswrite's audio element (and the keyboard controls for rewinding and fast forwarding the audio). It would be fairly trivial to add, say, an item type editor that lets the user mark up parts of the transcript with XML tags pulled from a database or web service somewhere. Or an editor for transcribing video. It's pretty flexible.
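
(Again, a rough illustration of my own rather than Unbindery's actual code: a modular editor along these lines might register each item type as a small object with its own rendering and event-binding hooks. The hook names and markup below are invented.)

```javascript
// Hypothetical item type editor module -- the hook names and markup are invented.
var pageImageEditor = {
  render: function (item, container) {
    // Each item type supplies its own HTML: here, a facsimile image,
    // a movable highlight line, and a plain text box for the transcript.
    container.innerHTML =
      '<img class="facsimile" src="' + item.imageUrl + '">' +
      '<div class="highlight-line"></div>' +
      '<textarea class="transcript"></textarea>';
  },
  bindEvents: function (container) {
    // Clicking the image moves the highlight line to the clicked height.
    var image = container.querySelector('.facsimile');
    var line = container.querySelector('.highlight-line');
    image.addEventListener('click', function (event) {
      line.style.top = event.offsetY + 'px';
    });
  }
};

// An audio item type would expose the same two hooks, but render an <audio>
// element instead of an image and bind the rewind/fast-forward keys to it.
```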

How did you come up with the idea for Unbindery?

I had done some Project Gutenberg work back in 2002, and somewhere along the way I came across Distributed Proofreaders, which basically does the same thing. A few years later, I'd recently gotten home from an LDS mission to Thailand and wanted to start a Thai branch of Project Gutenberg with one of my mission friends. He came up with the name Unbindery and I made some mockups, but nothing happened until 2010 when I launched my Mormon Texts Project. Manually sending batches of images and text for volunteers to proof was laborious at best, so I was motivated to finally write Unbindery. I threw together a prototype in a couple weeks and we've been using it for MTP ever since. I'm also nearing the end of a complete rewrite to make Unbindery more extensible and useful to other people. And because the original code was ugly and nasty and seriously embarrassing.

In my experience, the transcription tools that currently exist are very much informed by the texts they were built to work with, with some concentrating on OCR-correction, others on semantic indexing, and others on mark-up of handwritten changes to the text. How do you feel like the Mormon Texts Project has shaped the features and focus of Unbindery?

Mormon Texts Project has been entirely focused on correcting OCR for publication in nice, clean ebook editions, which is why we've gone with a plain old text box and not much more than that. (Especially considering that we were originally posting the books to Project Gutenberg, where our target output format was very plain text.)

What is your grand dream for Unbindery? (Feel free to be sweeping here and assume grateful, enthusiastic users and legions of cobbler's elves to help with the code.)

To get men on Mars. No, really, I don't think my dreams for Unbindery are all that grand -- I'd be more than satisfied if it helps make transcription easier for users, whether working alone or in groups, and whether they're publishing ebooks or magazines or transcribing oral histories or journals or what have you.

In an ideal world it would be wonderful if a small, dedicated group of coders were to adopt it and take care of it going forward. But I don't expect that. I'll get it to a state where I can publicly release it and people can use it, but other than bugfixes, I don't see myself doing much active development on Unbindery beyond that point. I know, I know, abandoning my project before it's even out the door makes me a horrible open source developer. But to be honest with you, I don't really even want to be an open source developer -- I'm far more interested in my other projects (like MTP) and I want to get back to doing those things. Unbindery is just a tool I needed, an itch I scratched because there wasn't anything out there that met my needs. People have expressed interest in using it so I'm putting it up on GitHub for free, but I don't see myself doing much with Unbindery after that. Sorry! This is the sad part of the interview.

What programming languages or technical frameworks do you work in?

Unbindery is PHP with JavaScript for the front end. I love JavaScript, but I'm only using PHP because of its ubiquity -- I'd much, much, much rather use Python. But it's a lot easier for people to get PHP apps running on cheap shared hosts, so there you have it.

It seems like you're putting a lot of effort into ease of deployment. How do you see Unbindery being used? Do you expect to offer hosting, do you hope people install their own instances, or is there another model you hope to follow?

I won't be offering hosting, so yes, I'm expecting people to install their own instances, and that's why I want it to be easy to install. (There may be some people who decide to offer hosting for it as well, and that's fine by me.)

How can people get involved with the project?

Coders: The code isn't quite ready for other people to hack on it yet, but it's getting a lot closer to that point. For now, coders can look at my roadmap page to see what tasks need doing. (Also, it won't be long before I start adding issues to GitHub so people can help squash bugs.)

Other people: Once the core functionality is in place, just having people install it and test it would probably be the most helpful.

Thursday, October 18, 2012

Jens Brokfeld's Thesis on Crowdsourced Transcription

Although crowdsourced transcription tools have become increasingly popular over the last couple of years, most academic publications on the topic focus on a single project and the lessons that project can teach.  While those studies provide invaluable advice on how to run crowdsourcing projects, they offer little help to memory professionals trying to decide which tools to explore when they begin a new project.  Jens Brokfeld's thesis for his MLIS degree at Fachhochschule Potsdam is the most systematic, detailed, and thorough review of crowdsourced manuscript transcription tools to date.

After a general review of crowdsourcing cultural heritage, Brokfeld reviews Rose Holley's checklist for crowdsourcing projects and then expands upon the part of my own TCDL presentation which discussed criteria for selecting transcription tools, synthesizing it with published work on the subject.  He then defines his own test criteria for transcription tools, about which more below.  Then, informed by seventy responses to a bilingual survey of crowdsourced transcription users, Brokfeld evaluates six tools (FromThePage, Refine!, Wikisource, Scripto, T-PEN, and the Bentham Transcription Desk) with forty-two pages (pp. 40-82) devoted to tool-specific descriptions of the capabilities and gaps within each system.  This exploration is followed by an eighteen-page comparison of the tools against each other (pp. 83-100). The whole paper is very much worth your time, and can be downloaded at the "Masterarbeit.pdf" link here: "Evaluation von Editionswerkzeugen zur nutzergenerierten Transkription handschriftlicher Quellen".

It would be asking too much of my limited German to translate the extensive tool descriptions, but I think I should acknowledge that I found no errors in Brokfeld's description of my own tool, FromThePage, so I'm confident in his evaluation of the other five systems.  However, I feel like I ought to attempt to abstract and translate some of his criteria for evaluation, as well as his insightful analysis of each tool's suitability for a particular target group.
Chapter 5:  Prüfkriterien ("Test Criteria")

5.1 Accessibility (by which he means access to transcription data from different personal-computer-based clients)
5.1.1 Browser Support
5.2 Findability
5.2.1 Interfaces (including support for API protocols such as OAI-PMH, as well as functionality to export transcripts in XML or to import facsimiles)
5.2.2 References to Standards (this includes support for normalization of personal and place names in the resulting editions)
5.3 Longevity
5.3.1 License (is the tool released under an open-source license that addresses digital preservation concerns?)
5.3.2 Encoding Format (TEI or something else?)
5.3.3 Hosting
5.4 Intellectual Integrity (primarily concerned with support for annotations and explicit notation of editorial emendations)
5.4.1 Text Markup
5.5 Usability (similar to "accessibility" in American usage)
5.5.1 Transcription Mode (transcriber workflows)
5.5.2 Presentation Mode (transcription display/navigation)
5.5.3 Editorial Statistics (tracking edits made by individual users)
5.5.4 User Management (how does the tool balance ease-of-use with preventing vandalism?)

I don't believe that I've seen many of these criteria used before, and would welcome a more complete translation.  

His comparison by target group is even more innovative. Brokfeld recognizes that different transcription projects have different needs, and he is the first scholar to define those target groups. Chapter 7 of his thesis characterizes them as follows:

Science:  The scientific community is characterized by concern over the richness of mark-up as well as a preference for customizability of the tool over simplicity of user interface. [Note: it is entirely possible that I mis-translated Wissenschaft as "science" instead of "scholarship".]
Family History: Usability and a simple transcription interface are paramount for family historians, but privacy concerns over personal data may play an important role in particular projects.
Archives: While archives attend to scholarly standards, their primary concern is for the transcription of extensive inventories of manuscripts -- for which shallow markup may be sufficient.  Archives are particularly concerned with support for standards.
Libraries: Libraries pay particular attention to bibliographical standards. They may also organize their online transcription projects by fonds, folders, and boxes.
Museums: In many cases museums possess handwritten sources which refer to their material collections.  As a result, their transcriptions need to be linked to the corresponding object.

It's very difficult for me to summarize or extract Brokfeld's evaluation of the six different tools for five different target groups, since those comparisons are in tabular form with extensive prose explanations.  I encourage you to read the original, but I can provide a totally inadequate summary for the impatient:
  • FromThePage: Best for family history and libraries; worst for science.
  • Refine!: Best for libraries, followed by archives; worst for family history.
  • Wikisource: Best for libraries, archives and museums; worst for family history.
  • Scripto: Best for museums, followed by archives and libraries; worst for family history and science.
  • T-PEN: Best for science. 
  • Bentham Transcription Desk: Best for libraries, archives and museums.
Note: This is a summary of part of a 140-page German document translated by an amateur.  Consult the original before citing or making decisions based on the information here. Jens Brokfeld welcomes questions and comments (in English or German) through this webform: http://opus4.kobv.de/opus4-fhpotsdam/frontdoor/mail/toauthor/docId/331.

Wednesday, October 10, 2012

Webwise Reprise on Crowdsourcing

Back in June, the folks at IMLS and Heritage Preservation ran a webinar exploring the issues and tools discussed at the IMLS Webwise Crowdsourcing panel "Sharing Public History Work: Crowdsourcing Data and Sources."

After an introduction by Kevin Cherry and Kristen Laise, Sharon Leon, who chaired the live panel, presented a wonderful overview of crowdsourcing cultural heritage and discussed the kinds of crowdsourcing projects that have been successful -- including, of course, the Papers of the War Department and Scripto, the transcription tool the Roy Rosenzweig Center for History and New Media developed from that project.  They then ran the video of my own presentation, "Lessons from Small Crowdsourcing Projects", followed by a live demo of FromThePage.  Perhaps the best part of the webinar, however, was the Q&A from people all over the country asking for details about how these kinds of projects work.

The recording of the webinar is online, and I encourage you to check it out.  (Here's a direct link, if you have trouble.) I'm very grateful to IMLS and Heritage Preservation for their work in making this knowledge accessible so effectively.

Tuesday, October 9, 2012

Mosman 1914-1918 on FromThePage

The Mosman community in New South Wales is preparing for the centennial of World War One, and as part of this project they've launched mosman1914-1918.net: "Doing our bit, Mosman 1914–1918".  The project describes itself as "an innovative online resource to collect and display information about the wartime experiences of local service people," and includes scan-a-thons, hack days, and build-a-thons.

One of their efforts involves transcription of a serviceman's diary with links to related names on local honor boards.  I'm delighted to report that they're hosting this project on FromThePage.com, and I look forward to working with and learning from the Mosman team.

Read more about Allan Allsop's diary on the Mosman 1914-1918 project blog and lend a hand transcribing!

Wednesday, October 3, 2012

Building a Structured Transcription Tool with FreeUKGen

I'm currently working with FreeUKGen--the charity behind the genealogy database FreeBMD--to build a general-purpose, open-source tool for crowdsourced transcription of structured manuscript data into a searchable database.

We're basing our system on the Scribe tool developed by the Citizen Science Alliance for What's the Score at the Bodleian, which grew out of their experience building OldWeather and other citizen science sites.

We are building the following systems:
  1. A new tool for loading image sets into the Scribe system and attaching them to data-entry templates. 
  2. Modifications to the Scribe system to handle our volunteer organization's workflow, plus some usability enhancements.
  3. A publicly-accessible search-and-display website to mine the database created through data entry. 
  4. A reporting, monitoring, and coordinating system for our volunteer supervisors. 
We also plan to add support for geocoding during transcription and GIS support within the search-and-display system. Initial development is mostly finished on item 1 and is now moving on to items 2 and 3 above.

Although this tool is focused on support for parish registers and census forms, we are intent on creating a general-purpose system for any tabular or structured data.  Scribe's data-entry templates are defined in its database, so different templates can be assigned to different images or sets of images.  As a result, we can use a simple template for a 1750 register of burials or a much more complex template for an 1881 census form.  Since each transcribed record is linked to the section of the page image it represents, we can display the facsimile version of a record alongside its transcript in a list of search results, or get fancy and pre-populate a transcriber's form with frequently-repeated information like months or birthplaces.
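
(To make that concrete, here is roughly what a database-defined data-entry template might look like, written out as a JavaScript object. This is my own illustration rather than Scribe's actual schema; the field names are invented.)

```javascript
// Illustrative only -- not Scribe's actual schema. One template per record type,
// with one entry per field the transcriber fills in.
var burialRegisterTemplate = {
  name: 'Burial register, ca. 1750',
  fields: [
    { key: 'burial_date', label: 'Date of burial',   type: 'date' },
    { key: 'full_name',   label: 'Name of deceased', type: 'text' },
    { key: 'abode',       label: 'Abode',            type: 'text' },
    { key: 'age',         label: 'Age at death',     type: 'number' }
  ]
};

// An 1881 census template would follow the same structure, just with many more
// fields. Each transcribed record would also store the coordinates of the image
// region it was entered from, so a search result can show the facsimile snippet
// alongside the transcript, or a form can be pre-filled from a nearby record.
```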

Under the guidance of Ben Laurie, the trustee directing the project, we are committed to open source and open data.  We're releasing the source code under an Apache license and planning to build API access to the full set of record data.

We feel that the more the merrier in an open-source project, so we're looking for collaborators, whether they contribute code, funding, or advice.  We are especially interested in collaborators from archives, libraries, and the genealogy world.

Tuesday, October 2, 2012

ReportersLab Reviews FromThePage

Tyler Dukes has written a concise introduction to the issues with handwritten material and a lovely review of FromThePage at ReportersLab:
Even when physical documents are converted into digital format, subtle inconsistencies in handwriting prove too much for optical character recognition software. The best computer scientists have been able to do is apply various machine learning techniques, but most of these require a lot of training data — accurate transcriptions deciphered by humans and fed into an algorithm.

“Fundamentally, I don’t think that we’re going to see effective OCR for freeform cursive any time soon,” Brumfield said. “The big successes so far with machine recognition have been in domains in which there’s a really constrained possibilities for what is written down.”

That means entries like numbers. Dates. Zip codes. Get beyond that, and you’re out of luck.
I don't know much about the world of investigative journalism, but it wouldn't surprise me if it holds as many intriguing parallels and new challenges as I've discovered among natural science collections.   Handwriting might still be the most interdisciplinary technology.