Sunday, July 6, 2014

Collaborative Digitization at ALA 2014

This is a transcript of the talk I gave at the Collaborative Digitization SIG meeting at the American Library Association annual meeting on June 28, 2014 in Caesar's Palace casino in Las Vegas.  I was preceded by Frederick Zarndt delivering his excellent talk on Crowdsourcing, Family History, and Long Tails for Libraries, which focused particularly on newspaper digitization and crowdsourced OCR correction.  (See Laura McElfresh's notes [below] for a near-transcript of his talk.)
I'd like to thank Frederick for a number of reasons, one of them being that I don't need to define crowdsourcing, which gives me the opportunity to be a little more technical.
Before we start, I'd just like to make a quick note that all of the slides, the audio files in MP3 format, and a full transcript will be posted at my blog.

I can also direct you to the notes taken by Laura McElfresh [see pp. 19-22] over there who does an amazing job at these [conferences].

Finally, if you tweet about this, there's my handle.

Okay, so we've talked about OCR correction. What's the difference between OCR correction and manuscript transcription? Why would people transcribe manuscripts -- isn't OCR good enough?

I'd like to go into that and talk about the [effectiveness] of OCR on printed material versus handwritten materials.


We're going to go into detail on the results of running Tesseract--which is a popular, open-source OCR tool--on this particular herbarium specimen label.
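[A note for the blog version: if you'd like to try this at home, here's a minimal sketch of running Tesseract from Python via the pytesseract wrapper. The filename is a placeholder, and the Tesseract engine itself has to be installed separately.]

    # Minimal sketch: hand a scanned label image to Tesseract and print
    # whatever text it thinks it sees. "specimen_label.jpg" is a placeholder.
    from PIL import Image       # pip install pillow
    import pytesseract          # pip install pytesseract (plus the tesseract binary)

    image = Image.open("specimen_label.jpg")
    text = pytesseract.image_to_string(image)
    print(text)                 # raw OCR output, warts and all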

I chose this one because it's got a title in print up here at the top, and then we've got a handwritten portion down here at the bottom.

So how does Tesseract do with these pieces?

With the print, it does a pretty good job, right? I mean, even though this is sort of an antique typeface, really every character is correct except that this period over here--for some reason--is OCRed as a back-tick.

So it's getting one character wrong out of--fifty, perhaps?

So how about the handwritten portion? What do you get when you run the same Tesseract program on that?

So here's the handwritten stuff, and the results are -- I'm actually pretty impressed -- I think it got the "2" right.

So in this case it got one character right out of the whole thing. So this is actually total garbage.

And my argument is that the quantitative difference in accuracy of OCR software between script versus print actually results in a qualitative difference between these two processes.

This has implications.

One of them is on methodology, which is that--as we've demonstrated--we can't use software to automatically transcribe (particularly joined-up, cursive) writing. You have to use humans.

There are a couple of other implications too, that I want to dive into a bit deeper.

One of them is the goal of the process. In the case of OCR correction, we're talking about improving accuracy of something that already exists. In the case of manuscript transcription, we're actually talking about generating a (rough) transcript from scratch.

The second one comes down to workflow, and I'll go into that in a minute.

Let's talk about findability.

Right now, if you put this page online--this manuscript image--no-one's going to find it. No-one's going to read it. Because Google cannot crawl it -- these are not words to Google, these are pixels. And without a transcript, without that findability, you miss out on the amazing serendipity that is a feature of the internet age. We don't have the serendipity of spotting books shelved next to each other anymore, but we do have the serendipity of--in this case--a retired statistical analyst named Nat Wooding doing a vanity search on his name, encountering a transcript of this diary--my great-great grandmother's diary--mentioning her mailman, Nat Wooding, and realizing that this is his great uncle.

Having discovered this, he started contributing to the project--not financially, but he went through and transcribed an entire year's worth of diaries. So he's contributing his labor.

Other people who've encountered these have made different kinds of contributions. These diaries were distributed on my great-great grandmother's death among her grandchildren. So they were scattered to the four winds. After putting these online, I received a package in the mail one day containing a diary from someone I'd never met, saying "Looks like you'll do more with this than I will." So this element of user engagement in this case is bringing the collection back together.

Let's talk about the implications on workflow.

This is a--I'm not going to say typical--OCR correction workflow. The thing that I want to draw your attention to is that OCR correction of print can be done at a very fine grain. The National Library of Finland's Digitalkoot project asks users to correct a small block of text: a single word, even a single character. This lends itself to gamification. It lends itself to certain kinds of quality control, in which maybe you show the same image to multiple people and compare their answers to see if they match.
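[A note for the blog version: here's a toy sketch of that sort of redundancy-based quality control -- my own illustration, not code from Digitalkoot or any other project.]

    # Show the same snippet to several volunteers and accept a reading only
    # when enough of them agree on it.
    from collections import Counter

    def accept_correction(responses, threshold=0.6):
        """Return the majority reading if agreement meets the threshold, else None."""
        if not responses:
            return None
        reading, votes = Counter(responses).most_common(1)[0]
        return reading if votes / len(responses) >= threshold else None

    print(accept_correction(["Harvey", "Harvey", "Harney"]))  # -> Harvey
    print(accept_correction(["Harvey", "Harney", "Hamey"]))   # -> None (no consensus)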

That really doesn't work very well with handwritten text, because readers have to get used to a script. Context is really important! And you find this when you put material online: people will go through and transcribe a couple of pages, then say "Oh, that's a 'W'!" And they go back and [correct earlier pages].

I want to tell the story of Page 19. This was a project that was a collaboration between me (and the FromThePage platform) and the Smith Library Special Collections at Southwestern University in Georgetown (Texas). They put a diary of a Texas volunteer in the Mexican-American War online--his name was Zenas Matthews. They found one volunteer who came online and transcribed the whole thing. He added all these footnotes. He did an amazing job.

But let's look at the edit history of one page, and what he did.

We put the material online in September. Two months later, he discovers it, and transcribes it in one session in the morning. Then he comes back in the afternoon and makes a revision to the transcript.

Time passes. Two weeks go by, and he's going back [over the text]. He makes six more revisions in one sitting on December 8, then he makes two more revisions the next morning. Then another eight months go past, and he comes back in August of the next year, because he's thought of something -- he's reviewing his work and he improves the transcription again. He ends up with [an edition] that I'd argue is very good.

Well, this is very different from the one-time pass of OCR correction. This is, in my opinion, a qualitative difference. We have this deep, editorial approach with crowdsourced transcription.

I'm a tool maker; I'm a tool reviewer, and I'm here to try to give you some hands-on advice about choosing tools and platforms for crowdsourced transcription projects.

Now, I used to go through and review [all of the] tools. Well, I have some good news, which is that there are a lot of tools out there nowadays. There are at least thirty-seven that I'm aware of. Many of them are open source. The bad news is that there are thirty-seven to choose from, and many of them are pretty rough.

So instead of talking about the actual tools, I'm going to direct you to a spreadsheet -- a Google Doc that I put together that is itself crowdsourced. About twenty people have contributed their own tools, so it's essentially a registry of different software platforms for [crowdsourced transcription].

Instead, I'm going to discuss selection criteria -- things to consider when you're looking at launching a crowdsourced transcription project.

The first selection criterion is to look at the kind of material you're dealing with. And there are two broad divisions in source material for transcription.

This top image is a diary entry from Viscountess Emily Anne Strangford's travels through the Mediterranean in the 1850s. The bottom image is a census entry.

These are very different kinds of material. A plaintext transcript that could be printed out and read in bed is probably the [most appropriate purpose] for a diary entry. Whereas, for a census record, you don't really want plaintext -- you want something that can go into a structured database.
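[A note for the blog version: a toy illustration of the two shapes of data, with invented content -- a free-text diary page on the one hand, and a census entry bound for a database on the other.]

    # Free-text transcript: something you could print out and read in bed.
    diary_page = (
        "Tuesday. Rode out early to see the ruins, and wrote letters "
        "home until the light failed."
    )

    # Structured record: fields that drop straight into a database table.
    census_row = {
        "surname": "Harvey",
        "given_name": "Irvin",
        "age": 34,
        "occupation": "farmer",
        "dwelling_number": 112,
    }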

And there are a limited number of tools that nevertheless have been used very effectively to transcribe this kind of structured data. FamilySearch Indexing is one that we're all familiar with, as Frederick mentioned it. There are a few others from the Citizen Science world: PyBossa comes from the Open Knowledge Foundation, and Scribe and Notes From Nature both come out of GalaxyZoo. [The Zooniverse/Citizen Science Alliance.] I'm going to leave those, and concentrate on more traditional textual materials.

One of the things you want to ask is, What is the purpose of this transcript? Is mark-up necessary? These kinds of texts, as we're all aware, are not already edited, finished materials.

Most transcription tools which exist ask users for plain-text transcripts, and that's it. So the overwhelming majority of platforms support no mark-up whatsoever.

However, there are two families of mark-up [support] which do exist. One of them is a subset of TEI markup. It's part of the TEI Toolbar, which was developed by Transcribe Bentham for their own platform [the Bentham Transcription Desk], which is a modification of MediaWiki. It was later repurposed by the Letters of 1916 project and used on top of a totally different software stack, the NARA Transcribr Drupal module [actually DIYHistory]. What it does is give users a small series of buttons which can be used to mark up features within a text. So this is really useful if you're dealing with marginalia, with additions and deletions within the text, and you want to track all that. Not everybody wants to track all that, but if that's the kind of purpose you have, you'll want to look at in-page mark-up.
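[A note for the blog version: a rough sketch of the kind of in-page mark-up such a toolbar produces -- a struck-out word wrapped in TEI's <del> element and an interlinear insertion in <add>. The sentence itself is invented for illustration.]

    # Build a tiny TEI-flavored fragment with the standard library only.
    import xml.etree.ElementTree as ET

    line = ET.Element("p")
    line.text = "We marched "
    deleted = ET.SubElement(line, "del")               # text the writer struck out
    deleted.text = "ten"
    deleted.tail = " "
    added = ET.SubElement(line, "add", place="above")  # text added above the line
    added.text = "twelve"
    added.tail = " miles before noon."

    print(ET.tostring(line, encoding="unicode"))
    # -> <p>We marched <del>ten</del> <add place="above">twelve</add> miles before noon.</p>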

The other form of mark-up is one that I've been using in FromThePage, using wiki-links to do subject identification within the text. [2-3 sentences inaudible: see "Wikilinks in FromThePage" for a detailed presentation given at the iDigBio Original Sources Digitization Workshop.]

What this means is that if users encounter "Irvin Harvey" and it's marked up like this:

The tool will automatically generate an index that shows every time Irvin Harvey was mentioned within the texts, and lets readers pull up all the pages mentioning Irvin Harvey. You can actually do network analysis and other digital humanities stuff based on [mining the subject mark-up].
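[A note for the blog version: the wiki-link mark-up is roughly double square brackets around the subject's name -- [[Irvin Harvey]] -- and here is a rough sketch, with invented page text, of how an index like that can be mined out of the transcripts.]

    # Collect every [[subject]] link on each page and build a subject-to-pages index.
    import re
    from collections import defaultdict

    pages = {
        "1918-03-04": "Cloudy. [[Irvin Harvey]] brought the mail before dinner.",
        "1918-03-07": "Went to [[Providence Church]] with [[Irvin Harvey]].",
    }

    index = defaultdict(list)
    for page_id, text in pages.items():
        for subject in re.findall(r"\[\[([^\]|]+)", text):
            index[subject.strip()].append(page_id)

    print(index["Irvin Harvey"])   # -> ['1918-03-04', '1918-03-07']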

So that's a different flavor of mark-up to consider.

Another question to ask is, how open is your project? Right now I know of projects that are using my own FromThePage tool entirely for staff to use internally.

There are others in which they have students working on the transcripts. And in some cases, this is for privacy reasons. For example, Rhodes College Libraries is using FromThePage to transcribe the diaries of Shelby Foote. Well, Shelby Foote only died a few years ago. [His diaries] are private. So this installation is entirely internal. The transcriptions are all done by students. I've never seen it -- I don't have access to it because it's not on the broad Internet.

Then there's the idea of leveraging your own volunteers on-site, with maybe some [ancillary] openness on the Internet. San Diego Natural History Museum is doing this with the people who come in and ordinarily volunteer to clean fossils or prepare specimens for photographs. Well, now they're saying, "Can you transcribe these herpetology field notes?"

So these kinds of platforms are not only wide-open crowdsourcing tools; they can be private, and you should consider this. In some cases, the same platform can support both private projects and crowdsourced projects simultaneously, so you can get all of your data in the same place. [One sentence inaudible.]

Branding! Branding may be very important.

Here are a couple of platforms, with screenshots of each.

The first one is the French-language version of Wikisource. Wikisource is a sister project to Wikipedia, spun off around 2003, that allows people both to transcribe documents and to do OCR correction. This is being used by the Departmental Archives of Alpes-Maritimes to transcribe a set of journals of episcopal visits. The bishop in the sixteenth century would go around and report on all the villages [in his diocese], so there's all this local history, but it's also got some difficult paleography.

So they're using Wikisource, which is a great tool! It has all kinds of version control. It has ways to track proofreading. It does an elegant job of putting together individual pages into larger documents. But, do you see "Departmental Archives of Alpes-Maritimes" on this page? No! You have no idea [who the institution is]. Now, if they're using this internally, that may be fine -- it's a powerful tool.

By contrast, look at the Letters of 1916. [Three sentences inaudible.] This is public engagement in a public-facing site.

Most platforms are somewhere between the two.

Integration: Let's say you've just done a lot of work to scan a lot of material, gather item-level metadata, and you've [ingested it] into CONTENTdm or another CMS. Now you want to launch a crowdsourcing project. Often, the first thing you have to do is get it all back out again and put it into your crowdsourcing platform.

So you need to look at integration. You need to ask the questions, How am I going to get data into the transcription platform? How am I going to get data back out? These may be totally different things: I know of one project that's trying to get data from Fedora into FromThePage, then trying to get it out of FromThePage by publishing to Omeka. There's a different project that wants to get data from Omeka into FromThePage. But these are totally different code paths! They have nothing to do with each other, believe it or not. So you really have to ask detailed questions about this.
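[A note for the blog version: as a hedged sketch of one "getting the data out" path, many repository systems -- CONTENTdm among them -- expose OAI-PMH, so item-level metadata can be harvested like this. The endpoint URL is a placeholder, and pushing the results into a transcription platform is a separate, tool-specific step.]

    # Harvest Dublin Core records over OAI-PMH and print the item titles.
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE_URL = "https://example.org/oai"    # placeholder OAI-PMH endpoint
    DC = "{http://purl.org/dc/elements/1.1/}"

    url = BASE_URL + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    for title in tree.iter(DC + "title"):
        print(title.text)                   # one line per harvested item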

Here are a few of the tools that exist, with what they support. (Or what they plan to support -- last week I was contacted about Fedora support and CONTENTdm support for FromThePage, one on Wednesday and one on Thursday, so if anyone has any advice on integration with those systems, please let me know.)

Hosting: Do you want to install everything on-site? Do you have sysadmins and servers? Is this actually a requirement? Or do you want this all hosted by someone else?

Right now you have pretty limited options for hosting. Notes from Nature and the GalaxyZoo projects host everything themselves. Wikisource and FromThePage can be either local or hosted. Everything else, you've got to download and get running on your servers.

Finally, I'd like to talk a little bit about asking yourself: what are your yardsticks for success?

If you're doing this for volunteer engagement, what does successful engagement look like? I know of one project that launched a trial in which they put some material from 19th century Texas online. One volunteer found this and dove into it. He transcribed a hundred pages in a week, he started adding footnotes -- I mean he just plowed through this. After a couple of weeks, the librarians I was working with cancelled the trial, and I asked them to give me details. One of the things that they said was, "We were really disappointed that only one volunteer showed up. Our goal for public engagement was to do a lot of public education and public outreach, and we wanted to reach out [to] a lot of people."

[For them,] a hundred pages transcribed by one volunteer is a failure compared with one page each transcribed by ten volunteers. So what are your goals?

Similarly, if you're using a platform that is a wiki-like platform--an editorial platform--you'll get obsessive users who will go back and revise page 19 over and over again. That may be fine for you. Maybe you want the highest quality transcripts and you don't mind that there's sort of spotty coverage because users come in and only transcribe the things that really interest them.

Other systems try to go for coverage over quality and depth. ProPublica developed the Transcribable Ruby on Rails plugin for research on campaign contributions. They intentionally designed their tool with no back button -- there's no way for a user to review what they did. And they wrote a great article about this which is very relevant to this conference venue: it's called "Casino-Driven Design: One Exit, No Windows, Free Drinks". So for them, the page 19 situation would be an absolute failure, while for me I'm thrilled with it. So again there's this trade-off of quality versus quantity in product as well as in engagement.
[Audio to follow.]