Sunday, July 6, 2014

Collaborative Digitization at ALA 2014

This is a transcript of the talk I gave at the Collaborative Digitization SIG meeting at the American Library Association annual meeting on June 28, 2014 in Caesar's Palace casino in Las Vegas.  I was preceded by Frederick Zarndt delivering his excellent talk on Crowdsourcing, Family History, and Long Tails for Libraries, which focused particularly on newspaper digitization and crowdsourced OCR correction.  (See Laura McElfresh's notes [below] for a near-transcript of his talk.)
I'd like to thank Frederick for a number of reasons, one of them being that I don't need to define crowdsourcing, which gives me the opportunity to be a little more technical.
Before we start, I'd just like to make a quick note that all of the slides, the audio files in MP3 format, and a full transcript will be posted at my blog.

I can also direct you to the notes taken by Laura McElfresh [see pp. 19-22] over there who does an amazing job at these [conferences].

Finally, if you tweet about this, there's my handle.

Okay, so we've talked about OCR correction. What's the difference between OCR correction and manuscript transcription? Why would people transcribe manuscripts -- isn't OCR good enough?

I'd like to go into that and talk about the [effectiveness] of OCR on printed material versus handwritten materials.


We're going to go into detail on the results of running Tesseract--which is a popular, open-source OCR tool--on this particular herbarium specimen label.

I chose this one because it's got a title in print up here at the top, and then we've got a handwritten portion down here at the bottom.

So how does Tesseract do with these pieces?

With the print, it does a pretty good job, right? I mean, even though this is sort of an antique typeface, really every character is correct except that this period over here--for some reason--is OCRed as a back-tick.

So it's getting one character wrong out of--fifty, perhaps?

So how about the handwritten portion? What do you get when you run the same Tesseract program on that?

So here's the handwritten stuff, and the results are -- I'm actually pretty impressed -- I think it got the "2" right.

So in this case it got one character right out of the whole thing. So this is actually total garbage.
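[For readers who want to reproduce this experiment, here is a minimal sketch using the pytesseract Python wrapper for Tesseract; the filename is a placeholder, not the actual file used in the talk.]

    # Minimal sketch: run Tesseract on a scanned label image via pytesseract.
    # Requires a local Tesseract install; "herbarium_label.jpg" is a placeholder.
    from PIL import Image
    import pytesseract

    label = Image.open("herbarium_label.jpg")
    print(pytesseract.image_to_string(label))
    # Roughly equivalent command line: tesseract herbarium_label.jpg stdout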

And my argument is that the quantitative difference in accuracy of OCR software between script versus print actually results in a qualitative difference between these two processes.

This has implications.

One of them is on methodology, which is that--as we've demonstrated--we can't use software to automatically transcribe (particularly joined-up, cursive) writing. You have to use humans.

There are a couple of other implications too, that I want to dive into a bit deeper.

One of them is the goal of the process. In the case of OCR correction, we're talking about improving accuracy of something that already exists. In the case of manuscript transcription, we're actually talking about generating a (rough) transcript from scratch.

The second one comes down to workflow, and I'll go into that in a minute.

Let's talk about findability.

Right now, if you put this page online--this manuscript image--no-one's going to find it. No-one's going to read it. Because Google cannot crawl it -- these are not words to Google, these are pixels. And without a transcript, without that findability, you miss out on the amazing serendipity that is a feature of the internet age. We don't have the serendipity of spotting books shelved next to each other anymore, but we do have the serendipity of--in this case--a retired statistical analyst named Nat Wooding doing a vanity search on his name, encountering a transcript of this diary--my great-great grandmother's diary--mentioning her mailman, Nat Wooding, and realizing that this is his great uncle.

Having discovered this, he started contributing to the project--not financially, but he went through and transcribed an entire year's worth of diaries. So he's contributing his labor.

Other people who've encountered these have made different kinds of contributions. These diaries were distributed on my great-great grandmother's death among her grandchildren. So they were scattered to the four winds. After putting these online, I received a package in the mail one day containing a diary from someone I'd never met, saying "Looks like you'll do more with this than I will." So this element of user engagement in this case is bringing the collection back together.

Let's talk about the implications on workflow.

This is--I'm not going to say a typical--OCR correction workflow. The thing that I want to draw your attention to is that OCR correction of print can be done at a very fine grain. The National Library of Finland's Digitalkoot project is asking users to correct a small block of text: a single word, even a single character. This lends itself to gamification. It lends itself to certain kinds of quality control, in which maybe you show the same image to multiple people and compare the results to see if they match.
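[As a rough illustration of that kind of redundancy-based quality control -- not Digitalkoot's actual implementation -- here is a small Python sketch; the agreement threshold and normalization are arbitrary assumptions.]

    # Sketch: show the same snippet to several volunteers and accept the answer
    # only when enough of them agree. Threshold and normalization are illustrative.
    from collections import Counter

    def accept(answers, min_agreement=2):
        normalized = [a.strip().lower() for a in answers]
        answer, count = Counter(normalized).most_common(1)[0]
        return answer if count >= min_agreement else None

    print(accept(["Helsinki", "helsinki", "Helsinky"]))  # -> "helsinki"
    print(accept(["Helsinki", "Helsinky"]))              # -> None (no consensus)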

That really doesn't work very well with handwritten text, because readers have to get used to a script. Context is really important! And you find this when you put material online: people will go through and transcribe a couple of pages, then say "Oh, that's a 'W'!" And they go back and [correct earlier pages].

I want to tell the story of Page 19. This was a project that was a collaboration between me (and the FromThePage platform) and the Smith Library Special Collections at Southwestern University in Georgetown (Texas). They put a diary of a Texas volunteer in the Mexican-American War online--his name was Zenas Matthews. They found one volunteer who came online and transcribed the whole thing. He added all these footnotes. He did an amazing job.

But let's look at the edit history of one page, and what he did.

We put the material online in September. Two months later, he discovers it, and transcribes it in one session in the morning. Then he comes back in the afternoon and makes a revision to the transcript.

Time passes. Two weeks go by, and he's going back [over the text]. He makes six more revisions in one sitting on December 8, then he makes two more revisions on the next morning. Then another eight months go past, and he comes back in August in the next year, because he's thought of something -- he's reviewing his work and he improves the transcription again. He ends up with [an edition] that I'd argue is very good.

Well, this is very different from the one-time pass of OCR correction. This is, in my opinion, a qualitative difference. We have this deep, editorial approach with crowdsourced transcription.

I'm a tool maker; I'm a tool reviewer, and I'm here to try to give you some hands-on advice about choosing tools and platforms for crowdsourced transcription projects.

Now, I used to go through and review [all of the] tools. Well, I have some good news, which is that there are a lot of tools out there nowadays. There are at least thirty-seven that I'm aware of. Many of them are open source. The bad news is that there are thirty-seven to choose from, and many of them are pretty rough.

So instead of talking about the actual tools, I'm going to direct you to a spreadsheet -- a Google Doc that I put together that is itself crowdsourced. About twenty people have contributed their own tools, so it's essentially a registry of different software platforms for [crowdsourced transcription].

Instead, I'm going to discuss selection criteria -- things to consider when you're looking at launching a crowdsourced transcription project.

The first selection criterion is to look at the kind of material you're dealing with. And there are two broad divisions in source material for transcription.

This top image is a diary entry from Viscountess Emily Anne Strangford's travels through the Mediterranean in the 1850s. The bottom image is a census entry.

These are very different kinds of material. A plaintext transcript that could be printed out and read in bed is probably the [most appropriate product] for a diary entry. Whereas for a census record, you don't really want plaintext -- you want something that can go into a structured database.

And there are a limited number of tools that nevertheless have been used very effectively to transcribe this kind of structured data. FamilySearch Indexing is one that we're all familiar with, as Frederick mentioned it. There are a few others from the Citizen Science world: PyBossa comes from the Open Knowledge Foundation, and Scribe and Notes From Nature both come out of GalaxyZoo. [The Zooniverse/Citizen Science Alliance.] I'm going to leave those, and concentrate on more traditional textual materials.

One of the things you want to ask is, What is the purpose of this transcript? Is mark-up necessary? These kinds of texts, as we're all aware, are not already edited, finished materials.

Most transcription tools which exist ask users for plain-text transcripts, and that's it. So the overwhelming majority of platforms support no mark-up whatsoever.

However, there are two families of mark-up [support] which do exist. One of them is a subset of TEI markup. It's part of the TEI Toolbar which was developed by Transcribe Bentham for their own platform [the Bentham Transcription Desk], which is a modification of MediaWiki. It was later repurposed by the 1916 Letters project and used on top of a totally different software stack, the NARA Transcribr Drupal module [actually DIYHistory]. And what it does is give users a small series of buttons which can be used to mark up features within a text. So this is really useful if you're dealing with marginalia, with additions and deletions within the text, and you want to track all that. Not everybody wants to track all that, but if that's the kind of purpose that you have, you'll want to look at in-page mark-up.

The other form of mark-up is one that I've been using in FromThePage, using wiki-links to do subject identification within the text. [2-3 sentences inaudible: see "Wikilinks in FromThePage" for a detailed presentation given at the iDigBio Original Sources Digitization Workshop.]

What this means is that if users encounter "Irvin Harvey" and it's marked up like this:
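[The example on the slide wasn't captured in the transcript; using the wikilink syntax described in "Wikilinks in FromThePage" below, the mark-up would look something like [[Irvin Harvey]], or [[Irvin Harvey|Mr. Harvey]] where the page uses a variant of the name.]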

The tool will automatically generate an index that shows every time that Irvin Harvey was mentioned within the texts, or lets you read all the pages mentioning Irvin Harvey. You can actually do network analysis and other digital humanities stuff based on [mining the subject mark-up].

So that's a different flavor of mark-up to consider.

Another question to ask is, how open is your project? Right now I know of projects that are using my own FromThePage tool entirely for staff to use internally.

There are others in which they have students working on the transcripts. And in some cases, this is for privacy reasons. For example, Rhodes College Libraries is using FromThePage to transcribe the diaries of Shelby Foote. Well, Shelby Foote only died a few years ago. [His diaries] are private. So this installation is entirely internal. The transcriptions are all done by students. I've never seen it -- I don't have access to it because it's not on the broad Internet.

Then there's the idea of leveraging your own volunteers on-site, with maybe some [ancillary] openness on the Internet. San Diego Natural History Museum is doing this with the people who come in, and ordinarily will volunteer to clean fossils or prepare specimens for photographs. Well, now they're saying, "Can you transcribe these herpetology field notes?"

So these kinds of platforms are not just wide-open crowdsourcing tools; they can also be private, and you should consider this. In some cases, the same platform can support both private projects and crowdsourced projects simultaneously, so you can get all of your data in the same place. [One sentence inaudible.]

Branding! Branding may be very important.

Here are a couple of platforms, with screenshots of each.

The first one is the French-language version of Wikisource. Wikisource is a sister project to Wikipedia that was spun off around 2003 and allows people both to transcribe documents and to correct OCR. This is being used by the Departmental Archives of Alpes-Maritimes to transcribe a set of journals of episcopal visits. The bishop in the sixteenth century would go around and report on all the villages [in his diocese], so there's all this local history, but it's also got some difficult paleography.

So they're using Wikisource, which is a great tool! It has all kinds of version control. It has ways to track proofreading. It does an elegant job of putting together individual pages into larger documents. But, do you see "Departmental Archives of Alpes-Maritimes" on this page? No! You have no idea [who the institution is]. Now, if they're using this internally, that may be fine -- it's a powerful tool.

By contrast, look at the Letters of 1916. [Three sentences inaudible.] This is public engagement in a public-facing site.

Most platforms are somewhere between the two.

Integration: Let's say you've just done a lot of work to scan a lot of material, gather item-level metadata, and you've [ingested it] into CONTENTdm or another CMS. Now you want to launch a crowdsourcing project. Often, the first thing you have to do is get it all back out again and put it into your crowdsourcing platform.

So you need to look at integration. You need to ask the questions, How am I going to get data into the transcription platform? How am I going to get data back out? These may be totally different things: I know of one project that's trying to get data from Fedora into FromThePage, then trying to get it out of FromThePage by publishing to Omeka. There's a different project that wants to get data from Omeka into FromThePage. But these are totally different code paths! They have nothing to do with each other, believe it or not. So you really have to ask detailed questions about this.

Here are a few of the tools that exist, with what they support. (Or what they plan to support -- last week I was contacted about Fedora support and CONTENTdm support for FromThePage, one on Wednesday and one on Thursday, so if anyone has any advice on integration with those systems, please let me know.)

Hosting: Do you want to install everything on-site? Do you have sysadmins and servers? Is this actually a requirement? Or do you want this all hosted by someone else?

Right now you have pretty limited options for hosting. Notes from Nature and the GalaxyZoo projects host everything themselves. Wikisource and FromThePage can be either local or hosted. Everything else, you've got to download and get running on your servers.

Finally, I'd like to talk a little bit about asking yourself, what are yardsticks for success?

If you're doing this for volunteer engagement, what does successful engagement look like? I know of one project that launched a trial in which they put some material from 19th century Texas online. One volunteer found this and dove into it. He transcribed a hundred pages in a week, he started adding footnotes -- I mean he just plowed through this. After a couple of weeks, the librarians I was working with cancelled the trial, and I asked them to give me details. One of the things that they said was, We were really disappointed that only one volunteer showed up. Our goal for public engagement was to do a lot of public education and public outreach, and we wanted to reach out [to] a lot of people.

[For them,] a hundred pages transcribed by one volunteer is a failure compared with one page each transcribed by ten volunteers. So what are your goals?

Similarly, if you're using a platform that is a wiki-like platform--an editorial platform--you'll get obsessive users who will go back and revise page 19 over and over again. That may be fine for you. Maybe you want the highest quality transcripts and you don't mind that there's sort of spotty coverage because users come in and only transcribe the things that really interest them.

Other systems try to go for coverage over quality and depth. ProPublica developed the Transcribable Ruby on Rails plugin for research on campaign contributions. They intentionally designed their tool with no back button -- there's no way for a user to review what they did. And they wrote a great article about this which is very relevant to this conference venue: it's called "Casino-Driven Design: One Exit, No Windows, Free Drinks". So for them, the page 19 situation would be an absolute failure, while for me I'm thrilled with it. So again there's this trade-off of quality versus quantity in product as well as in engagement.
[Audio to follow.]

Friday, March 14, 2014

Wikilinks in FromThePage

From March 10-12, I got to participate in the iDigBio Original Sources Digitization Workshop, a gathering of natural history collections managers, archivists, and technologists. Although the focus of digitization within natural history has been on specimens or specimen labels, this workshop sought to address the challenges and opportunities involved in digitizing ledgers, field notes, and other non-specimen data. As usual for iDigBio events, the workshop was spectacular.

Carolyn Sheffield chaired a panel (video recording) on crowdsourcing which included Rob Guralnick discussing Notes From Nature, Christina Fidler talking about the Grinnell field notes on FromThePage, my talk, and a long, valuable discussion among all participants. My presentation covered the data model and uses of wiki links as I'm using them in FromThePage.

Video, slides, and transcript are below:

"From The Page" - Ben Brumfield from iDigBio on Vimeo.
I'm Ben Brumfield.  You saw a little bit about FromThePage in Christina Fidler's presentation, so I wanted to talk about the internals -- the design and the data structures behind some of the things that make this a little bit different from Notes From Nature or the NARA Transcribr Drupal module.
This is the transcription screen.  You've seen this with Christina, so I'll probably go over this pretty quickly.  This is a full-text transcription, not individual records like you get with Notes From Nature. 
The reason for that is that FromThePage was built to be a wiki-like tool, purpose-built for creating amateur editions.  So we've got a text and we want to create an edition from the text that can then be re-used, printed, and analyzed.

I say "amateur" editions because we're not dealing with the kinds of things that textual scholars in the humanities are dealing with, where they're trying to compare different variant manuscript versions of Chaucer.  [By contrast, we] have something that's very straightforward, and we're interested in some fairly simple annotations.

It's purpose-built -- free-standing on MySQL and Ruby on Rails, so it's not integrated with MediaWiki or anything like that.
So who's using it?

[FromThePage] was built originally for a set of my great-great grandmother's diaries.

Since then it's been used for military diaries by libraries and history departments.
It's been used for literary diaries--in this case for Shelby Foote's diaries--for literary drafts, and for punk rock fanzines.  (Which is kind of awesome!)
So what does that have to do with the people in this room and the kind of material [we're working with]?

Here's an example:  This is an 1859 journal from an expedition in which someone went out and made a number of observations and collected some things to bring back with them.  There are scholars interested in mining those.

But it's not a naturalist expedition.  This is Viscountess Emily Anne Smyth Strangford, who in this case is touring the Mediterranean and visiting a lot of classical monuments.  The folks at the Duke Computational Classics Collaboratory are interested in finding all the places in which she recorded Latin and Greek inscriptions, coming up with her itinerary, and figuring out how [that data] connects to the objects her father-in-law had collected for the British Museum twenty years earlier.

So there's a lot of correspondence, I tend to think, with field notes.
The San Diego Natural History Museum started using FromThePage for field books in 2010.  They're still working on the project.
  • They've identified ten thousand subjects worth classifying in their system.
  • Individual pages have been edited twenty-four thousand times.  And this goes back to the wiki-like approach -- people transcribe a page, and then they revisit it. They make a number of edits to a page as they get comfortable with the handwriting.
  • And then they've linked individual observations, species mentioned, and people in the field notes to those subjects forty-two thousand times.
Then there are a couple of other projects working with field notes.  [Museum of Vertebrate Zoology] obviously is in trial, and [the Museum of Comparative Zoology] and Missouri Botanical Gardens are just evaluating the software right now.  
So, what is a wiki link?

Any of us who've edited Wikipedia may be used to this.  I followed the same syntax [in FromThePage].

What we have here is a set of double square brackets with the canonical name of the subject--this could be a formatted date, this could be a full name that's spelled out--and then the text that's actually used within the verbatim transcript.

So our example here -- this is when Grinnell meets Klauber.  The field note actually says "L. M. Klauber", so the person transcribing has expanded this out to "Laurence M. Klauber".  So we have the ability to handle variance in references to Klauber, but still identify them as Klauber.
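[To make the syntax concrete, here is a minimal parsing sketch in Python -- not FromThePage's actual Ruby code -- that pulls the canonical/verbatim pairs out of a transcript page.]

    # Sketch: extract [[Canonical Name|verbatim text]] links from a transcript.
    # [[Laurence M. Klauber|L. M. Klauber]] -> ("Laurence M. Klauber", "L. M. Klauber")
    # [[Laurence M. Klauber]]              -> ("Laurence M. Klauber", "Laurence M. Klauber")
    import re

    WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

    def extract_links(text):
        links = []
        for match in WIKILINK.finditer(text):
            canonical = match.group(1).strip()
            verbatim = (match.group(2) or canonical).strip()
            links.append((canonical, verbatim))
        return links

    page = "Collected with [[Laurence M. Klauber|L. M. Klauber]] near San Diego."
    print(extract_links(page))  # [('Laurence M. Klauber', 'L. M. Klauber')]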
Technically speaking, what's behind one of these wiki links?

There are a lot of tables in this database (a simplified sketch follows this list).
  • We know that there's this page that Klauber is mentioned on.  It's S1 Page 3 in the Grinnell field notes that MVZ has online.
  • We've got a subject which is Laurence M. Klauber.
  • The subject is categorized as a person, which can be used for analysis and filtering, like Christina showed you.
  • And then the individual link between the page and the subject, that contains the variation, is also stored.
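[A simplified sketch of those relationships -- the table and column names here are illustrative assumptions, not FromThePage's real MySQL schema.]

    # Simplified sketch of the underlying tables (illustrative names only).
    pages = [
        {"id": 1, "title": "S1 Page 3", "work": "Grinnell field notes (MVZ)"},
    ]
    subjects = [
        {"id": 10, "name": "Laurence M. Klauber", "category": "Person"},
    ]
    # The link row joins a page to a subject and stores the verbatim variant used there.
    page_subject_links = [
        {"page_id": 1, "subject_id": 10, "verbatim_text": "L. M. Klauber"},
    ]

    # "All pages mentioning Klauber":
    klauber = next(s for s in subjects if s["name"] == "Laurence M. Klauber")
    mentions = [l for l in page_subject_links if l["subject_id"] == klauber["id"]]
    print([p["title"] for p in pages for l in mentions if p["id"] == l["page_id"]])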
So there are a lot of things you can do with that.
  • You can show all the pages that mention Laurence M. Klauber, and read the pages in context or just get a listing of them.
  • More helpfully, as you're transcribing we can mine those links to automatically suggest mark-up.  So the next time we encounter "L. M. Klauber", we can push a button and that will automatically expand the mark-up of "L. M. Klauber" to "[[Laurence M. Klauber|L. M. Klauber]]".
  • You can also feed this to full-text searches.  So if you've got a lot of plain-text transcripts which contain Laurence M. Klauber, we can automatically populate the search with those variations, creating an OR query with "Klauber" and "L. M. Klauber" (see the sketch after this list).
  • And then we can mine the mark-up for correspondences [between subjects] as Christina showed.
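[Here is a small sketch of two of those uses -- suggesting mark-up for a previously seen variant, and expanding a search into an OR query over variants -- again as illustrative Python rather than FromThePage's actual code.]

    # Sketch: reuse previously recorded (verbatim -> canonical) pairs to suggest
    # mark-up and to expand full-text searches. Purely illustrative.
    variants = {
        "L. M. Klauber": "Laurence M. Klauber",
        "Klauber": "Laurence M. Klauber",
    }

    def suggest_markup(verbatim):
        """If we've seen this string before, wrap it in a wikilink to its subject."""
        canonical = variants.get(verbatim)
        return f"[[{canonical}|{verbatim}]]" if canonical else verbatim

    def expand_query(canonical):
        """Build an OR query from every variant recorded for a subject."""
        terms = {v for v, c in variants.items() if c == canonical} | {canonical}
        return " OR ".join(f'"{t}"' for t in sorted(terms))

    print(suggest_markup("L. M. Klauber"))   # [[Laurence M. Klauber|L. M. Klauber]]
    print(expand_query("Laurence M. Klauber"))
    # "Klauber" OR "L. M. Klauber" OR "Laurence M. Klauber"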

The last thing you can do with it is export.
Here is a TEI-XML export of the Joseph Grinnell notes.  This is useful for interchange, but the most important thing this does is that it allows amateurs to create well-formatted, TEI P5-compliant XML.  And it will handle one of the things that's very hard about creating TEI in an XML editor, which is associating reference strings with their entries over in the TEI header, which describes who the people are outside the text.
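[The idea is roughly this: each subject gets one entry in the header, and each verbatim mention in the text points back at it. A toy illustration in Python -- not FromThePage's actual TEI output.]

    # Toy sketch: verbatim reference strings in the text point (via ref -> xml:id)
    # at person entries described once in the TEI header. Illustrative only.
    def tei_person(xml_id, canonical):
        return f'<person xml:id="{xml_id}"><persName>{canonical}</persName></person>'

    def tei_mention(xml_id, verbatim):
        return f'<persName ref="#{xml_id}">{verbatim}</persName>'

    print(tei_person("klauber", "Laurence M. Klauber"))  # goes in the header
    print(tei_mention("klauber", "L. M. Klauber"))       # goes in the transcribed text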
This is a CSV export of the Grinnell field notes.  Basically this is every observation and every person who's mentioned, exported as a CSV file with links back to the pages and URLs at which those pages can be found.  This is the kind of thing that perhaps could be ingested into [museum collection management database] Arctos.
Future plans:

We're going to be doing more CMS integrations.  We're working on Omeka.  The Internet Archive is done.  There are a couple of grant applications that involve hooking FromThePage up to Fedora Commons.

We also really want to contextualize links in time and place.  We want the ability for people to define where the person writing the journal is and when they're writing, and then to apply those geotags and chronotags to the references.  So you could map when species were mentioned.  You could extract a visual itinerary.

We need more formatting options.  One of our volunteers has found all kinds of crazy editorial issues for handling strike-outs and things like that.

And the last thing that we're looking for is more projects.