Saturday, November 10, 2012

What does it mean to "support TEI" for manuscript transcription?

This is a transcript of my talk at the 2012 TEI meeting at Texas A&M University, "What does it mean to 'support TEI' for manuscript transcription: a tool-maker's perspective."

You can download an MP3 recording of the talk here.
Let's get started with a couple of definitions.  All the tools and the sites that I'm reviewing are cloud based, which means that I'm ruling out--perhaps arbitrarily--any projects that involve people doing offline edition and then publishing that on the web.  I'm only talking about online-based tools.

So that's a very strict definition of clouds, and I'm going to have a very loose and squishy definition of crowds, in which I'm talking about any sort of tool that allows collaborative editing of manuscript material, and not just ones that are directed at amateurs.  That's important for a couple of reasons: one, because it gave me a sample size that was large enough to find out how people are using TEI, but--for another reason--because "amateurs" aren't really amateurs.  What we see with crowdsourcing projects is that amateurs become experts very quickly.  And given that your average user of any citizen science or historical crowdsourcing project is a woman over 50 who has at least a Master's degree, this isn't sort of the unwashed masses.
Okay, so crowdsourced transcription has been going on for a while, and it's been happening in four different traditions that developed this all independently.  You have genealogists who are doing this, primarily with things like census records.  The 1940 census is the most prominent example: they have volunteers transcribing as many as ten million records a day.  The natural sciences are doing something similar, particularly GalaxyZoo, the OldWeather people are looking at climate change data, where you have to look at old, handwritten records to figure out how the climate has changed, because you need to know how the climate used to be. And then there are also some projects going on in the Open Source/Creative Commons world: the Wikisource people--particularly the German language Wikisource community--and libraries, archives, and museums have jumped into this recently. 
So here are a couple of examples from the citizen science world.  OldWeather has a tool that allows people to record ship log book entries and weather observations.  As you can see, this is all field based -- this isn't quite an attempt to represent a document.  We'll get back to this in a minute.
The North American Bird Phenology Program is transcribing old bird[-watching] observation cards from about a hundred years ago.  They're recording species names and all sorts of other things about this particular Grosbeak in 1938. 
All of these--and this is the majority of the crowdsourced transcription that's happening out there, millions of records--are record-based.  These are not document-based, they aren't page-based.  They're dealing with data that is fundamentally tabular -- those are their inputs.  Their outputs are databases that they want to be able to either search or analyze.  So we're producing nothing that anyone would ever want to print out.

And another interesting thing about this is that these record-based transcription projects--the uses are understood in advance.  If you're building a genealogy index, you know that people are going to want to search for names and be able to see the results.  And that's it -- you're not building something that allows someone to go off and do some other kind of analysis.

Now what kind of mark-up are these record-based transcription projects using?  Well, it's kind of idiosyncratic, at best.
Here's an example from my client FreeREG.  This is a mark-up language that they developed about ten years ago for indicating unclear readings of manuscripts.  It's actually fairly sophisticated--it's based on the regular expression programming sub-language--but it's not anything that's informed by the TEI world.
On the other hand, here is the mark-up that the New York Public Library is using.  Let me read this out to you: "Please type the text of the indicated dish exactly as it appears.  Don't worry about accents."  This is almost an anti-markup.
So what about free-form transcription?  There's a lot of development of people doing free-form transcription.  You have Scripto out of CHNM.  You have a couple of different (perhaps competing) NARA initiatives.  Wikisource.  There's my own FromThePage.  What kind of mark-up are they doing?  Well, for the most part, none! 
Here's Scripto--the Papers of the War Department-- and you type what you see, and that's what you get.
Here is the French-language Wikisource, hosting materials from the Archives départementales du Cantal (who are doing some very cool things here).  But this is just typing things into a wiki and not even internally using wiki links.  This is almost pre-formed text -- it's pretty much plaintext.
My own project, FromThePage.
I'm internally using wiki-links, but really only for creating indexes and annotations, not for indicating...any of the power that you have with TEI.
So if no one is using TEI, why is TEI important?  I think that TEI is important because crowdsourced transcription projects are how the public is interacting with edition.  This is how people are learning what editing is, what the editing process is, and why and whether it's important.  And they're using tools that are developed by people like me.  Now how do people like me learn about edition?
The answer is, by reading the TEI Guidelines.  The TEI Guidelines have an impact that goes far beyond people who are actually implementing TEI.  I started work on FromThePage in complete isolation in 2005.  By 2007, I was reading the TEI Guidelines.  I wasn't implementing TEI, but the questions that were asked--these notions of "here's how you expand abbreviations", "here's how you regularize things"--had a tremendous impact on me.  By contrast, the Guide to Documentary Editing--which is a wonderful book!--I only found out in January of this year.

TEI is online, it's concise, it's available.  And when I talk to people in the genealogy development world, they know about TEI. They've heard of it.  They have opinions.  They're not using it, but -- you people are making an impact on how the world does edition!
Okay, so if all of these people aren't using TEI, who is doing it?

I run a transcription tool directory that is itself crowdsourced.  It's been edited by 23 different people who've entered information about 27 different tools. Of those 27 tools, 7 are marked as "supporting TEI".  There's a little column, "does it support TEI?", seven of them say "Yes".

Actually, that's not true.  Some of them say "yes", but some of those seven say "well, sort of".  So what does that mean?
To find that out, I interviewed five of those seven projects.
  • Transcribe Bentham.  
  • T-PEN (which there's a poster session about tonight), which is a line-based system for medieval manuscripts.  
  • A customization of T-PEN, the Carolingian Canon Law project, out of the University of Kentucky.  
  • Our own Hugh Cayless for the Papyrological Editor, which is dealing with papyri.  
  • And then MOM-CA is one of these "sort of"s.  You have two implementations of it.  
    • One of them is the Virtuelles deutsches Urkundennetzwerk, which is a German charter collection.  It supports "TEI, sort-of" -- actually it supports CEI and EAD.  
    • But it's been customized for extensive TEI support for the Itinera Nova project which is out of the archive of Leuven, Belgium.     
I'm going to talk about what I found out, but I'm going to emphasize Transcribe Bentham.  Not because it's better than the other tools, but because they actually ran their transcription project as an experiment.  They wanted to know, can the public do TEI? Can the public handle it?  And they've published their results: they've conducted user surveys of what was your experience using TEI?  Which makes it particularly useful for those of us who are trying to figure out how it's being used.
Okay, so there's a lot of variation among these projects.  You've got a varied commitment to TEI.  Transcribe Bentham: Yes, we're going to use TEI!  You see Melissa Terras here saying that "it was untenable" that we'd ask for anything else.  These people know how to do it; why would we depart from that?

For T-PEN, James Ginther says: Hey, I'm kind of skeptical.  We'll support any XSD you want to upload, if it happens to be TEI, that's okay.
Abigail Firey, who's using T-PEN, basically says: look, it's probably necessary.  It's very useful.  It lets us develop these valuable intellectual perspectives on our text.  And she considered it important that their text encoding was done within the community of practice represented by the people in this room.
Okay, so more variation between these.  Where's the TEI located within these projects?  Where does it live?  I'm a developer; I'm interested in the application stack.

It turns out that there's no agreement at all.  Transcribe Bentham has people entering TEI in person.  And then it's storing it off in a MediaWiki, using MediaWiki versioning, not actually putting [...] pages in one big TEI document.

On the other hand, Itinera Nova is actually storing everything in an XRX-based XML database.  I mean, it is pure TEI on the back end.  But none of the volunteers using Itinera Nova actually are typing any angle brackets.  So we have a lot of variation here.
However, there was no variation when I asked people about encoding.  There is a perfectly common perception that is: Encoding is hard!

And there are these great responses--that you can see both on the Transcribe Bentham blog and in their DHQuarterly paper that just came out, which I highly recommend--describing it as "too much markup", "unnecessarily complicated", "a hopeless nightmare", and the entire transcription process is "a horror."
But, lots of things are hard.

In my own experience with FromThePage, I have one user who has transcribed one thousand pages, but she does not like using any mark-up at all.  She's contributing!  She's contributing plaintext transcriptions, but I'm going back to add wikilinks.  So it's not about the angle brackets.  (Maybe square brackets have a problem too, I don't know.)

And fundamentally, transcribing--reading old manuscripts--is hard.  "Deciphering Bentham's hand took longer than encoding," for over half of the Bentham respondents.

So there's more commonality: everyone wants to make encoding easier.  How do we do that?  There's a couple of different approaches.  One approach--the most common approach--is using different kinds of buttons and menus to automate the insertion of tags.  Which gets around (primarily) the need for people to memorize tag names and attributes, and--God help us--close tags.

So these are implemented--we've got buttons on T-PEN and CCL.  We've got buttons on the TEI Toolbar.  We've got menus on VdU and the Papyrological Editor.
And you can see them.  Here's a screenshot of Jeremy Bentham.  A couple of interesting things about this: it's very small, but we've got a toolbar at the top.  We've got TEI text: angle-bracket D.E.L.  Angle-bracket, slash, D.E.L.  So we're actually exposing the TEI to users in Transcribe Bentham, though we're providing them with some buttons.
Those buttons represent a subset--I'll get to the selection of those tags later.  Here's a more detailed description of what they do.
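To make the button idea concrete, here is a minimal sketch of what such a button does: it wraps whatever the transcriber has selected in an opening and closing tag, so nobody has to remember the close tag. The function name `wrap_selection` and the character-offset selection model are my own assumptions for illustration, not code from Transcribe Bentham or T-PEN.

```python
def wrap_selection(text: str, start: int, end: int, tag: str) -> str:
    """Wrap text[start:end] in <tag>...</tag>, as a toolbar button might.

    Hypothetical helper: start and end are character offsets of the
    transcriber's selection in the text box.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("selection is outside the text")
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


line = "the greatest good happiness"
start = line.index("good")          # the transcriber selects the struck-out word
tagged = wrap_selection(line, start, start + len("good"), "del")
print(tagged)  # the greatest <del>good</del> happiness
```

The tag still ends up visible in the text box, which matches the Bentham approach of exposing the markup while automating the typing.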
Here's what's going on with VdU.  Only in this case, they're not actually exposing the angle brackets to the user. They're replacing all of these in a pseudo-WYSIWYG that allows people to choose from a menu and select text that then gets tagged. 
Okay -- limitations of the buttons.  There's a good limitation, which is that as users become more comfortable with TEI, they outgrow buttons.  And this is something that the people at Transcribe Bentham reported to me.  They're seeing a fair number of people just skip the buttons altogether and type angle brackets.  Remember: these are members of the public who have never met any of the Transcribe Bentham people.

On the down side, users also ignore the buttons.  Again users ignoring encoding, but in this case we've got something that's a little bit worse.  Georg Vogeler is reporting something very interesting, which is that in a lot of cases, they were seeing users who were using print apparatus for doing this kind of work, and just ignoring the buttons -- going around them.
So, the problem with using print-style notations.  People are dealing with these print editions [notations] -- this can be a problem or it can be an opportunity.  The Papyrological Editor is viewing it that way.  Itinera Nova is using it that way.  In the Papyrological Editor, the front-end interface for most users is Leiden+, which is a standard for marking up papyri.  And, as you can see, users enter text in Leiden+, and that generates TEI.  (EpiDoc TEI, I believe.)
This is the same kind of process that's done in Itinera Nova.  In that case, they're using for notation whatever it is that the Leuven archives uses for their mark-up. And they're doing the same kind of transposition [ed: translation] of replacing their notation with TEI tags before they save it.
And this is actually what users see as they're typing. They don't see the TEI tags -- we're hiding the angle brackets from them.

So this is an alternative to buttons.  And in my opinion, it's not that bad an alternative. 
This hasn't been a problem for the Bentham people, however.  It's a non-problem for them. And they are the most "crowdy", the most amateur-focused, and the most committed to a TEI interface.

Tim Causer went through and reviewed all of this and said, you know, it just doesn't happen.  People are not using any print notation at all.  They're using buttons.  They're using angle-brackets by hand.  They're not even using plaintext.  They're using TEI.  Their users are comfortable with TEI.
So what accounts for the difference between the experience of the VdU and the Transcribe Bentham people?  I don't know.  I've got a couple of theories about what might be going on.

One of them is really the corpus of texts we're working with.  If you're only dealing with papyrus fragments, and you're used to a well-established way of notating them--that's been around since 1935 in the case of Leiden+--well, it's kind of hard to break out of that.  On the other hand, there's not a single convention for print editions.  There's all sorts of ways of indicating additions and deletions for print editions of more modern texts.  So maybe it's a lack of a standard.

Or, maybe it's who the users are.  Maybe scholars are stubborner, and amateurs are more tractable and don't have bad habits to break.  I don't know!  I don't know, but I'd be really interested in any other ideas.
Okay, how do these projects choose the tags that they're dealing with?  We've got a very long quote, but I'm just going to read out a couple of little bits of it.

Really, choosing a subset of tags is important.  Showing 67 buttons was not a good usability thing for T-PEN.  And in particular, what they ended up doing was getting rid of the larger, structural set of markup, and focusing just on sort of phrase-level markup.
This is also, I think, true if we go back a minute and look at Bentham.  Here, again, we're talking phrase-level tags.  We're not talking about anything beyond that.
Justin Tonra said that it was actually really hard to pare down the number of tags for Transcribe Bentham.  He wanted to do more, but, you know, he's pleased with what they got.  They didn't want to "overcomplicate the user's job."
Richard Davis, also with Transcribe Bentham, had a great deal of experience dealing with editors for EAD and other XML.  And he said you're always dealing with this balance between usability and flexibility, and there's just not much way of getting around it.  It's going to be a compromise, no matter what.
So what's the future for these projects that are using TEI for crowds?  Well, if getting people up to speed is hard, and if nobody reads the help--as Valerie Wallace at one time said about their absolutely intimidating help page for Transcribe Bentham (you should look at it -- it's amazing!)--then what are the alternatives for getting people up to speed?

Georg Vogeler says that they are trying to come up with a way of teaching people how to use the tool and how to use the markup in almost a game-like scenario.  We're not talking about the kind of Whac-A-Mole things that we sometimes see, but really just sort of leading people through: "Let's try this. Now let's try this. Now let's try this. Okay, now you know how to deal with this [tool]."  It's something that I think we're actually pretty familiar with from any other kinds of projects dealing with historic handwriting: people have to come up to speed.
Another possibility is a WYSIWYG.  Tim Causer announced the idea of spending their new Mellon grant on building a WYSIWYG for Transcribe Bentham's TEI.  The blog entry is fascinating because he gets about seven user comments, some of which express a whole lot of skepticism that a WYSIWYG is going to be able to handle nested tagging in particular.  Other ones of which make comments about the whole XML system and its usability in vivid prose, which is very worth reading.
And maybe combinations of these.  So we have these intermediate notations -- Itinera Nova, for example, they're using this let's begin a strike-through with an equals sign (which is apparently what they've been using at that archive for a while).  And the minute you type that equals sign in, you actually get a WYSIWYG strike-through that runs all the way through your transcript.

That may be the future.  We'll see.  I think that we have a lot of room for exploring different ways for handling this.
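The intermediate-notation approach can be sketched in a few lines: the volunteer types a print-style convention, and the tool replaces it with TEI before saving. This is only an illustration under stated assumptions. The talk says a strike-through begins with an equals sign; that a second equals sign closes it, and the function name `notation_to_tei`, are my guesses and not Itinera Nova's actual rules.

```python
import re
from xml.sax.saxutils import escape

# Assumed convention: =struck text= marks a deletion (closing "=" is a guess).
_STRIKE = re.compile(r"=([^=\n]+)=")


def notation_to_tei(line: str) -> str:
    """Translate one line of the assumed notation into a TEI fragment."""
    escaped = escape(line)  # turn &, < and > into entities first
    return _STRIKE.sub(r"<del>\1</del>", escaped)


print(notation_to_tei("paid =ten= twelve florins & more"))
# paid <del>ten</del> twelve florins &amp; more
```

Escaping before substituting means that stray angle brackets typed by a volunteer cannot break the stored XML, and a lone equals sign is simply left alone.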
So let me wrap up and thank my interviewees.

Transcribe Bentham: Melissa Terras, Justin Tonra, Tim Causer, Richard Davis.
T-PEN: James Ginther, Abigail Firey
Papyrological Editor: Hugh Cayless, Tom Elliott
MOM-CA: Georg Vogeler and Jochen Graf


[All questions will be paraphrased in the transcript due to sound quality, and are not to be regarded as direct quotations without verification via the audio.]

Syd Bauman: Of the systems which allow users to type tags free-hand, what percentage come out well-formed? 

Me: The only one that presents free-hand [tagging] is Transcribe Bentham. Tim [Causer] gets well-formed XML for most everything he gets. There is no validation being performed by that wiki, but what he's getting is pretty good. He says that the biggest challenge when he's post-processing documents is closing tags and mis-placed nesting.

Syd Bauman: I'd be curious about the exact percentages.

Me: Right. I'd have to go back and look at my interview. He said that it represents a pretty small percentage, like single digits of the submissions they get.
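For anyone wanting to measure this on their own submissions, here is a rough sketch of a well-formedness check using Python's standard XML parser. It assumes each submission is a fragment that needs a wrapper element before it can be parsed; the wrapper name `page` and the function name are hypothetical, and this is not how Transcribe Bentham processes its wiki pages.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(fragment: str) -> Optional[str]:
    """Return the parser's complaint, or None if the fragment is well-formed."""
    try:
        # A transcript is a fragment, so give it a single root element.
        ET.fromstring(f"<page>{fragment}</page>")
    except ET.ParseError as err:
        return str(err)
    return None


submissions = [
    "the <del>good</del> happiness",     # fine
    "the <del>good happiness",           # unclosed tag
    "<add><del>good</add></del>",        # mis-placed nesting
]
bad = [s for s in submissions if well_formedness_error(s) is not None]
print(f"{len(bad)} of {len(submissions)} submissions are not well-formed")
```

The two failing cases are exactly the problems Tim mentioned: unclosed tags and mis-placed nesting.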

John Unsworth: Do any of the systems use keyboard short-cuts?

Me: I know of none that use hot-keys.

John Unsworth: Do you think that would be more or less desirable than the systems you've described?

Me: I really only see hot-keys as being desirable for projects that are using more recent and clearer documents. Speed of data-entry from the keyboard perspective doesn't help much when you're having to stare and zoom and scroll on a document that is as dense and illegible as Bentham or Greek papyri.

Elena Pierazzo [very faint audio]: In some cases it's hard to define which is the error: choosing the tags or reading the text. I've been working with my students on Transcribe Bentham--they're all TEI-aware--and to be honest it was hard. The difficulty was not the mark-up. In a sense we do sometimes forget in these crowdsourcing projects, that the text itself is very hard, so probably adding a level of complexity to the task via the mark-up is very difficult.

I have all respect and sympathy for the people who stick to the ideal of doing TEI, which I commend entirely. But in some cases, it may be that asking amateur people to do [the decipherment] and do the mark up is a pretty strong request, and makes a big assumption about what the people "out there" are capable of without formation.

Me: I'd agree with you. However, there have been some studies on these users' ability to produce quality transcripts outside of the TEI world.... Old Weather did a great deal of research on that, and they found that individual users tended to submit correct transcripts 97% of the time. They're doing blind triple-keying, so they're comparing people's transcripts against others. [They found] that of 1000 different entries, typically on average 13 will be wrong. Of those thirteen, three will be due to user error--so it does happen; I'm not saying people are perfect. Three will be generally[ed: genuinely] illegible. And the remaining seven will be due to the officer of the watch having written the wrong thing down and placing the ship in Afghanistan instead of in the Indian Ocean. So there are errors everywhere. [I mis-remembered the numbers here: actually it's 3 errors due to transcriber error, 10 genuinely illegible, and 3 due to error at time of inscription.]
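Blind triple-keying can be sketched as a simple majority vote: three people key the same entry without seeing each other's work, and a reading is accepted when at least two agree. I don't know the details of Old Weather's reconciliation procedure, so the whitespace trimming, the quorum of two, and the name `reconcile` are all assumptions for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def reconcile(readings: Sequence[str], quorum: int = 2) -> Optional[str]:
    """Return the reading that at least `quorum` transcribers agree on.

    Returns None when there is no agreement, so the entry can be
    flagged for review instead of guessed at.
    """
    if not readings:
        return None
    counts = Counter(r.strip() for r in readings)
    value, votes = counts.most_common(1)[0]
    return value if votes >= quorum else None


print(reconcile(["29.92", "29.92 ", "29.97"]))  # 29.92
print(reconcile(["29.92", "29.97", "28.92"]))   # None
```

Note that agreement only catches transcriber error. If the officer of the watch wrote the wrong thing, all three transcribers will faithfully agree on it.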

Lou Burnard: The concept of error is a nuanced one. I would like to counter-argue Elena's [point]. I think that one of the reasons that Bentham has been successful is precisely because it's difficult material. Why do I think that? Because if you are faced with something difficult, you need something powerful to express your understanding of it. The problem with not using something as rich and semantically expressive as TEI when you're doing your transcription is that it doesn't exist! All you can do is type in the words you think it might have been, and possibly put in some arbitrary code to say, "Well, I'm not sure about that." Once you've mastered the semantics of the TEI markup--which doesn't actually take that long, if you're interested in it--now you can express yourself. Now you can communicate in a [...] satisfactory way. And I think that's why people like it.

Me: I have anecdotal, personal evidence to agree with you. In my own system (that does not use TEI), I have had users who have transcribed several pages, and then they'd get to a table in some biologist's field notes, for example, and they stop. And they say, "well, I don't know what to do here." So they're done.

Lou Burnard: The example you cite of the erroneous data in the source is a very good one, because if you've mastered TEI then you know how to express in markup: 'this is what it actually says but clearly he wasn't in Afghanistan.' And that isn't the case in any other markup system I've ever heard of. 
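As a small illustration of Lou's point, TEI pairs the reading as written with the editor's correction using the `choice`, `sic`, and `corr` elements. The sketch below builds that fragment with Python's standard library; the helper name `sic_corr` is my own, and "Indian Ocean" is simply the correction implied by the Old Weather anecdote.

```python
import xml.etree.ElementTree as ET


def sic_corr(as_written: str, correction: str) -> str:
    """Build a TEI <choice> pairing the source reading with a correction."""
    choice = ET.Element("choice")
    ET.SubElement(choice, "sic").text = as_written    # what the page says
    ET.SubElement(choice, "corr").text = correction   # what the editor believes
    return ET.tostring(choice, encoding="unicode")


print(sic_corr("Afghanistan", "Indian Ocean"))
# <choice><sic>Afghanistan</sic><corr>Indian Ocean</corr></choice>
```

Both readings survive in the transcript, so a later reader can see the original error and the editor's judgment side by side.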

[I welcome corrections to my transcript or the contents of the talk itself in the comments to this post.]

Wednesday, October 24, 2012

Interview with Ben Crowder on Unbindery

One of the pleasures of maintaining the crowdsourced transcription tool list is learning about systems I'd never heard about before.  One of these is Unbindery, a tool being built by Ben Crowder for audio and manuscript transcription as well as OCR correction.  Ben was gracious enough to grant me an interview, even though he's concentrating on the final stretch of development work on Unbindery.

First, let me wish you luck as you enter the final push on Unbindery. What would you say is the most essential feature you have left to work on?

Thanks! Probably private projects -- I've been looking forward to using Unbindery to transcribe my journals, but haven't wanted them to be open for just anyone to work on. I'm also very excited about chunking audio into small segments (I used to publish an online magazine where we primarily published interviews, and transcribing two hours of audio can be really daunting).

Tell us more about how Unbindery handles both audio transcription and manuscript transcription. Usually those tools are very different, aren't they?

The audio transcription part started out as Crosswrite, a little proof-of-concept I threw together when I realized one day that JavaScript would let me control the playhead on an audio element, making it really easy to write a software version of a transcription foot pedal. I also wanted to start using Unbindery for family history purposes (transcribing audio interviews with my grandparents, mainly, and divvying up that workload among my siblings).

So, to handle both audio transcription and page image transcription, Unbindery has a modular item type editor system. Each item type has its own set of code (HTML/CSS/JavaScript) that it loads when transcribing an item. For example, page images show an image and a text box, with some JavaScript to place a highlight line when you click on the image, whereas audio items replace the image with Crosswrite's audio element (and the keyboard controls for rewinding and fast forwarding the audio). It would be fairly trivial to add, say, an item type editor that lets the user mark up parts of the transcript with XML tags pulled from a database or web service somewhere. Or an editor for transcribing video. It's pretty flexible.

How did you come up with the idea for Unbindery?

I had done some Project Gutenberg work back in 2002, and somewhere along the way I came across Distributed Proofreaders, which basically does the same thing. A few years later, I'd recently gotten home from an LDS mission to Thailand and wanted to start a Thai branch of Project Gutenberg with one of my mission friends. He came up with the name Unbindery and I made some mockups, but nothing happened until 2010 when I launched my Mormon Texts Project. Manually sending batches of images and text for volunteers to proof was laborious at best, so I was motivated to finally write Unbindery. I threw together a prototype in a couple weeks and we've been using it for MTP ever since. I'm also nearing the end of a complete rewrite to make Unbindery more extensible and useful to other people. And because the original code was ugly and nasty and seriously embarrassing.

In my experience, the transcription tools that currently exist are very much informed by the texts they were built to work with, with some concentrating on OCR-correction, others on semantic indexing, and others on mark-up of handwritten changes to the text. How do you feel like the Mormon Texts Project has shaped the features and focus of Unbindery?

Mormon Texts Project has been entirely focused on correcting OCR for publication in nice, clean ebook editions, which is why we've gone with a plain old text box and not much more than that. (Especially considering that we were originally posting the books to Project Gutenberg, where our target output format was very plain text.)

What is your grand dream for Unbindery? (Feel free to be sweeping here and assume grateful, enthusiastic users and legions of cobbler's elves to help with the code.)

To get men on Mars. No, really, I don't think my dreams for Unbindery are all that grand -- I'd be more than satisfied if it helps make transcription easier for users, whether working alone or in groups, and whether they're publishing ebooks or magazines or transcribing oral histories or journals or what have you.

In an ideal world it would be wonderful if a small, dedicated group of coders were to adopt it and take care of it going forward. But I don't expect that. I'll get it to a state where I can publicly release it and people can use it, but other than bugfixes, I don't see myself doing much active development on Unbindery beyond that point. I know, I know, abandoning my project before it's even out the door makes me a horrible open source developer. But to be honest with you, I don't really even want to be an open source developer -- I'm far more interested in my other projects (like MTP) and I want to get back to doing those things. Unbindery is just a tool I needed, an itch I scratched because there wasn't anything out there that met my needs. People have expressed interest in using it so I'm putting it up on GitHub for free, but I don't see myself doing much with Unbindery after that. Sorry! This is the sad part of the interview.

What programming languages or technical frameworks do you work in?

Unbindery is PHP with JavaScript for the front end. I love JavaScript, but I'm only using PHP because of its ubiquity -- I'd much, much, much rather use Python. But it's a lot easier for people to get PHP apps running on cheap shared hosts, so there you have it.

It seems like you're putting a lot of effort into ease of deployment. How do you see Unbindery being used? Do you expect to offer hosting, do you hope people install their own instances, or is there another model you hope to follow?

I won't be offering hosting, so yes, I'm expecting people to install their own instances, and that's why I want it to be easy to install. (There may be some people who decide to offer hosting for it as well, and that's fine by me.)

How can people get involved with the project?

Coders: The code isn't quite ready for other people to hack on it yet, but it's getting a lot closer to that point. For now, coders can look at my roadmap page to see what tasks need doing. (Also, it won't be long before I start adding issues to GitHub so people can help squash bugs.)

Other people: Once the core functionality is in place, just having people install it and test it would probably be the most helpful.

Thursday, October 18, 2012

Jens Brokfeld's Thesis on Crowdsourced Transcription

Although the field of transcription tools has become increasingly popular over the last couple of years, most academic publications on the topic focus on a single project and the lessons that project can teach.  While those provide invaluable advice on how to run crowdsourcing projects, they do not lend much help to memory professionals trying to decide which tools to explore when they begin a new project.  Jens Brokfeld's thesis for his MLIS degree at Fachhochschule Potsdam is the most systematic, detailed, and thorough review of crowdsourced manuscript transcription tools to date.

After a general review of crowdsourcing cultural heritage, Brokfeld reviews Rose Holley's checklist for crowdsourcing projects and then expands upon the part of my own TCDL presentation which discussed criteria for selecting transcription tools, synthesizing it with published work on the subject.  He then defines his own test criteria for transcription tools, about which more below.  Then, informed by seventy responses to a bilingual survey of crowdsourced transcription users, Brokfeld evaluates six tools (FromThePage, Refine!, Wikisource, Scripto, T-PEN, and the Bentham Transcription Desk) with forty-two pages (pp. 40-82) devoted to tool-specific descriptions of the capabilities and gaps within each system.  This exploration is followed by an eighteen-page comparison of the tools against each other (pp. 83-100). The whole paper is very much worth your time, and can be downloaded at the "Masterarbeit.pdf" link here: "Evaluation von Editionswerkzeugen zur nutzergenerierten Transkription handschriftlicher Quellen".

It would be asking too much of my limited German to translate the extensive tool descriptions, but I think I should acknowledge that I found no errors in Brokfeld's description of my own tool, FromThePage, so I'm confident in his evaluation of the other five systems.  However, I feel like I ought to attempt to abstract and translate some of his criteria for evaluation, as well as his insightful analysis of each tool's suitability for a particular target group.
Chapter 5:  Prüfkriterien ("Test Criteria")

5.1 Accessibility (by which he means access to transcription data from different personal-computer-based clients)
5.1.1 Browser Support
5.2 Findability
5.2.1 Interfaces (including support for such API protocols as OAI-PMH, but including functionality to export transcripts in XML or to import facsimiles) 
5.2.2 References to Standards (this includes support for normalization of personal and place names in the resulting editions)
5.3 Longevity
5.3.1 License (is the tool released under an open-source license that addresses digital preservation concerns?)
5.3.2 Encoding Format (TEI or something else?)
5.3.3 Hosting
5.4 Intellectual Integrity (primarily concerned with support for annotations and explicit notation of editorial emendations)
5.4.1 Text Markup
5.5 Usability (similar to "accessibility" in American usage)
5.5.1 Transcription Mode (transcriber workflows)
5.5.2 Presentation Mode (transcription display/navigation)
5.5.3 Editorial Statistics (tracking edits made by individual users)
5.5.4 User Management (how does the tool balance ease-of-use with preventing vandalism?)

I don't believe that I've seen many of these criteria used before, and would welcome a more complete translation.  

His comparison based on target group is even more innovative.  Brokfeld recognizes that different transcription projects have different needs, and is the first scholar to define those target groups.  Chapter 7 of his thesis defines those groups as follows:

Science:  The scientific community is characterized by concern over the richness of mark-up as well as a preference for customizability of the tool over simplicity of user interface. [Note: it is entirely possible that I mis-translated Wissenschaft as "science" instead of "scholarship".]
Family History: Usability and a simple transcription interface are paramount for family historians, but privacy concerns over personal data may play an important role in particular projects.
Archives: While archives attend to scholarly standards, their primary concern is for the transcription of extensive inventories of manuscripts -- for which shallow markup may be sufficient.  Archives are particularly concerned with support for standards.
Libraries: Libraries pay particular attention to bibliographical standards. They also may organize their online transcription projects by fonds, folders, and boxes.
Museums: In many cases museums possess handwritten sources which refer to their material collections.  As a result, their transcriptions need to be linked to the corresponding object.

It's very difficult for me to summarize or extract Brokfeld's evaluation of the six different tools for five different target groups, since those comparisons are in tabular form with extensive prose explanations.  I encourage you to read the original, but I can provide a totally inadequate summary for the impatient:
  • FromThePage: Best for family history and libraries; worst for science.
  • Refine!: Best for libraries, followed by archives; worst for family history.
  • Wikisource: Best for libraries, archives and museums; worst for family history.
  • Scripto: Best for museums, followed by archives and libraries; worst for family history and science.
  • T-PEN: Best for science. 
  • Bentham Transcription Desk: Best for libraries, archives and museums.
Note: This is a summary of part of a 140-page German document translated by an amateur.  Consult the original before citing or making decisions based on the information here. Jens Brokfeld welcomes questions and comments (in English or German) through this webform:

Wednesday, October 10, 2012

Webwise Reprise on Crowdsourcing

Back in June, the folks at IMLS and Heritage Preservation ran a webinar exploring the issues and tools discussed at the IMLS Webwise Crowdsourcing panel "Sharing Public History Work: Crowdsourcing Data and Sources."

After an introduction by Kevin Cherry and Kristen Laise, Sharon Leon, who chaired the live panel, presented a wonderful overview of crowdsourcing cultural heritage and discussed the kinds of crowdsourcing projects that have been successful -- including, of course, the Papers of the War Department and Scripto, the transcription tool the Roy Rosenzweig Center for History and New Media developed from that project.  They then ran the video of my own presentation, "Lessons from Small Crowdsourcing Projects", followed by a live demo of FromThePage.  Perhaps the best part of the webinar, however, was the Q&A from people all over the country asking for details about how these kinds of projects work.

The recording of the webinar is online, and I encourage you to check it out.  (Here's a direct link, if you have trouble.) I'm very grateful to IMLS and Heritage Preservation for their work in making this knowledge accessible so effectively.

Tuesday, October 9, 2012

Mosman 1914-1918 on FromThePage

The Mosman community in New South Wales is preparing for the centennial of World War One, and as part of this project they've launched "Doing our bit, Mosman 1914–1918".  The project describes itself as "an innovative online resource to collect and display information about the wartime experiences of local service people," and includes scan-a-thons, hack days, and build-a-thons.

One of their efforts involves transcription of a serviceman's diary with links to related names on local honor boards.  I'm delighted to report that they're hosting this project on FromThePage, and I look forward to working with and learning from the Mosman team.

Read more about Allan Allsop's diary on the Mosman 1914-1918 project blog and lend a hand transcribing!

Wednesday, October 3, 2012

Building a Structured Transcription Tool with FreeUKGen

I'm currently working with FreeUKGen--the charity behind the genealogy database FreeBMD--to build a general-purpose, open-source tool for crowdsourced transcription of structured manuscript data into a searchable database.

We're basing our system on the Scribe tool developed for the Citizen Science Alliance for What's the Score at the Bodleian, which originated out of their experience building OldWeather and other citizen science sites.

We are building the following systems:
  1. A new tool for loading image sets into the Scribe system and attaching them to data-entry templates. 
  2. Modifications to the Scribe system to handle our volunteer organization's workflow, plus some usability enhancements.
  3. A publicly-accessible search-and-display website to mine the database created through data entry. 
  4. A reporting, monitoring, and coordinating system for our volunteer supervisors. 
We also plan to add support for geocoding during transcription and GIS support within the search and display system. Currently, initial development is mostly finished with 1 and moving on to 2 and 3 above.

Although this tool is focused on support for parish registers and census forms, we are intent on creating a general-purpose system for any tabular/structured data.   Scribe's data-entry templates are defined in its database, with the possibility to assign different templates to different images or sets of images.  As a result, we can use a simple template for a 1750 register of burials or a much more complex template for an 1881 census form.  Since each transcribed record is linked to the section of the page image it represents, we have the ability to display the facsimile version of a record alongside its transcript in a list of search results, or to get fancy and pre-populate a transcriber's form with frequently-repeated information like months or birthplaces.
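To make the template idea concrete, here is a minimal Python sketch. It is not Scribe's actual schema: the class names, field names, image file names, and the blank_form helper are all hypothetical, chosen only to illustrate how different images can point at different templates, how a record can stay linked to its region of the page image, and how a form might be pre-populated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Template:
    """A data-entry template: a name plus the fields a transcriber fills in."""
    name: str
    fields: List[str]


@dataclass
class Record:
    """One transcribed record, linked to the part of the image it came from."""
    image_id: str
    region: Tuple[int, int, int, int]  # x, y, width, height on the page image
    values: Dict[str, str] = field(default_factory=dict)


# Hypothetical templates: a simple one and a more complex one.
burials_1750 = Template("burials_1750", ["date", "name"])
census_1881 = Template(
    "census_1881",
    ["address", "name", "relation", "age", "occupation", "birthplace"],
)

# Different images (or sets of images) can be assigned different templates.
template_for_image = {
    "register_p12.jpg": burials_1750,
    "census_p3.jpg": census_1881,
}


def blank_form(image_id: str, defaults: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build an empty form for an image, pre-filling any repeated values."""
    template = template_for_image[image_id]
    defaults = defaults or {}
    return {name: defaults.get(name, "") for name in template.fields}


# Pre-populate a frequently repeated birthplace for the next census row.
form = blank_form("census_p3.jpg", {"birthplace": "Leeds"})
record = Record("census_p3.jpg", (40, 310, 1800, 60), form)
print(record.values["birthplace"])
```

Because each Record carries its own region, a search result could crop the facsimile to that rectangle and show it beside the transcript.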

Under the guidance of Ben Laurie, the trustee directing the project, we are committed to open source and open data.  We're releasing the source code under an Apache license and planning to build API access to the full set of record data.

We feel that the more the merrier in an open-source project, so we're looking for collaborators, whether they contribute code, funding, or advice.  We are especially interested in collaborators from archives, libraries, and the genealogy world.

Tuesday, October 2, 2012

ReportersLab Reviews FromThePage

Tyler Dukes has written a concise introduction to the issues with handwritten material and a lovely review of FromThePage at ReportersLab:
Even when physical documents are converted into digital format, subtle inconsistencies in handwriting prove too much for optical character recognition software. The best computer scientists have been able to do is apply various machine learning techniques, but most of these require a lot of training data — accurate transcriptions deciphered by humans and fed into an algorithm.

“Fundamentally, I don’t think that we’re going to see effective OCR for freeform cursive any time soon,” Brumfield said. “The big successes so far with machine recognition have been in domains in which there’s a really constrained possibilities for what is written down.”

That means entries like numbers. Dates. Zip codes. Get beyond that, and you’re out of luck.
I don't know much about the world of investigative journalism, but it wouldn't surprise me if it holds as many intriguing parallels and new challenges as I've discovered among natural science collections.   Handwriting might still be the most interdisciplinary technology.

Monday, September 24, 2012

Bilateral Digitization at Digital Frontiers 2012

This is a transcript of the talk I gave at Digital Frontiers 2012.

Abstract: One of the ironies of the Internet age is that traditional standards for accessibility have changed radically. Intelligent members of the public refer to undigitized manuscripts held in a research library as "locked away", even though anyone may study the well-cataloged, well-preserved material in the library's reading room. By the standard of 1992, institutionally-held manuscripts are far more accessible to researchers than uncatalogued materials in private collections -- especially when the term "private collections" includes over-stuffed suburban filing cabinets or unopened boxes inherited from the family archivist. In 2012, the democratization of digitization technology may favor informal collections over institutional ones, privileging online access over quality, completeness, preservation and professionalism.

Will the "cult of the amateur" destroy scholarly and archival standards? Will crowdsourcing unlock a vast, previously invisible archive of material scattered among the public for analysis by scholars? How can we influence the headlong rush to digitize through education and software design? This presentation will discuss the possibilities and challenges of mass digitization for amateurs, traditional scholars, libraries and archives, with a focus on handwritten documents.

My presentation is on bilateral digitization: digitization done by institutions and by individuals outside of institutions, and the wall that's sort of in between institutions and individuals.
In 1823, a young man named Jeremiah White Graves moved to Pittsylvania County, Virginia and started working as a clerk in a country store. Also that year he started recording a diary and journal of his experiences. He maintained this diary for the next fifty-five years, so it covers his experience -- his rise to become a relatively prominent landowner, tobacco farmer, and slaveholder. It covers the Civil War, it covers Reconstruction and the aftermath. (This is an entry covering Lee's surrender.)

In addition to the diary, he kept account books that give you details of plantation life that range from -- that you wouldn't otherwise see in the diaries. So for example, this is his daughter Fanny,

And this is a list of every single article of clothing that she took with her when she went off to a boarding school for a semester.

Perhaps more interesting, this is a memorandum of cash payments that he made to certain of his enslaved laborers for work on their customary holidays -- another sort of interesting factor. I got interested in this because I'm interested in the property that he lived in. The house that he built is now in my family, and I was doing some research on this. Since these account books include details of construction of the house, I spent a lot of time looking for these books. I've been looking for them for about the last ten years. I got in contact with some of the descendants of Jeremiah White Graves and found out through them that one of their ancestors had donated the diaries to the Alderman Library at the University of Virginia. I looked into getting them digitized and tried to get some collaboration [going] with some of the descendants, and one of them in particular, Alan Williams, was extremely helpful to me. But this was his reaction:

Okay. So we have diaries that are put in a library -- I believe one of the top research libraries in the country -- and they are behind a wall. They are locked away from him.
So let's talk about walls. From his perspective, the fact that these diaries--these family manuscripts of his--are in the Alderman Library means:
  • They're professionally conserved -- great! 
  • They're publicly accessible, so anyone can walk in and look at them in the Reading Room. 
  • They're cataloged, which would not be the case if they'd still been sitting in his family. 
  • On the down side, they're a thousand miles away: they're in Virginia, he's in Florida, I'm in Texas. We all want to look at these, but it's awfully hard for people to get there if we don't have research budgets. 
  • We have to deal with reading room restrictions if we actually get there. 
  • Once we work on getting things digitized we have these permission-to-publish that we need to deal with, which have some moral challenges for someone from whose family these diaries came. 
  • And we have the scanning fees: the cost of getting them scanned by the excellent digitization department at the Alderman Library is a thousand dollars. Which is not unreasonable, but it's still pretty costly.
So here's a wall--a real, physical wall--between this institution and the public. How do we get through walls? Everyone here is familiar with digitization and collaboration. This is how we share things nowadays. It's how we've been sharing things for the last fifteen years, in fact. But, at least fifteen years ago, when we got started doing digitization, we had shallow digitization.
The prevalent practice in institutions was "scan-and-dump": make some scans, put them in a repository online.

One of the problems with that is that you have very limited metadata. The metadata is usually institutionally-oriented. No transcripts, in particular -- nobody has time for this. And quite often, they're in software platforms that are not crawlable by search engines.
Now meanwhile, amateurs are digitizing things, and they're doing something that's actually even worse! They are producing full transcripts, but they're not attaching them to any facsimiles. They're not including any provenance information or information about where their sources came from. Their editorial decisions about expanding abbreviations or any other sorts of modernizations or things like that -- they're invisible; none of those are documented.

Worst of all, however, is that the way that these things are propagated through the Internet is through cut-and-paste: so quite often from a website to a newsgroup to emails, you can't even find the original person who typed up whatever the source material was.
So how do we get to deep digitization and solve both of these problems?

The challenges to institutions, in my opinion, come down to funding and manpower. As we just mentioned, generally archives don't have a staff of people ready to produce documentary editions and put them online.

Outside institutions, the big challenge is standards; it is expertise. You've got manpower, you've got willingness, but you've got a lot of trouble making things work using the sorts of methodologies that have come out of the scholarly world and have been developed over the last hundred years.

So how do we fix these challenges?
One possible solution for institutions is crowdsourcing.  We've talked about this this morning; we don't need to go into detail about what crowdsourcing is, but I'd like to talk a little bit about who participates in crowdsourcing projects and what kinds of things they can do and what this says about [crowdsourcing projects]. I've got three examples here. OldWeather is a project from GalaxyZoo, the Zooniverse/Citizen Science Alliance. The Zenas Matthews Diary was something that I collaborated with the Southwestern University Smith Library Special Collections on. And the Harry Ransom Center's Manuscript Fragments Project.
Okay, in Old Weather there are Royal Navy logbooks that record temperature measurements every four hours: the midshipman of the watch would come out on deck and record barometric pressure, wind speed, wind direction and temperature. This is of incredible importance for climate scientists because you cannot point a weather satellite at the South Pacific in 1916. The problem is that it's all handwritten and you need humans to transcribe this.

They launched this project three years ago, I believe, and they're done. They've transcribed all the Royal Navy logs from the period essentially around World War I -- all in triplicate. So blind triple keying every record. And the results are pretty impressive.

Each individual volunteer's transcripts tend to be about 97% accurate. For every thousand logbook entries, three entries are going to be wrong because of volunteer error. But this compares pretty favorably with the ten that are actually honestly illegible, or indeed the three that are the result of the midshipman of the watch confusing north and south.
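Blind triple keying can be sketched in a few lines of Python. This is not OldWeather's actual pipeline: the function names, the strict-majority rule, and the assumption that volunteers make errors independently of one another are mine. Under those assumptions, and counting an entry as lost whenever at least two of its three keyings are wrong, a 97% per-volunteer accuracy works out to a little under three bad entries per thousand.

```python
from collections import Counter
from typing import Optional, Sequence


def reconcile(keyings: Sequence[str]) -> Optional[str]:
    """Return the value a strict majority of keyings agree on, else None."""
    value, votes = Counter(keyings).most_common(1)[0]
    return value if votes * 2 > len(keyings) else None


def majority_error_rate(accuracy: float) -> float:
    """Chance that at least two of three independent keyings are wrong."""
    p = 1.0 - accuracy  # per-volunteer error rate
    return 3 * p * p * (1 - p) + p ** 3


print(reconcile(["58", "58", "53"]))  # 58: two of three volunteers agree
print(reconcile(["58", "53", "68"]))  # None: no majority, needs review
print(round(majority_error_rate(0.97) * 1000, 1))  # 2.6 per thousand entries
```

Entries with no majority come back as None, so they can be routed to a further check rather than silently accepted.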
So in terms of participation, OldWeather has had more than 1.6 million weather observations transcribed--again, all triple-keyed--through the efforts of sixteen thousand volunteers who've been transcribing pages from a million pages of logs.

So what this means is that you have a mean contribution of one hundred transcriptions per user. But that statistic is worthless!
Because you don't have individual volunteers transcribing one hundred things apiece. You don't have an even distribution. This is a color map of contributions per user. Each user has a square. The size of the square represents the quantity of records that they transcribed. And what you can see here is that of those 1.6 million records, fully a tenth (in the left-hand column) were transcribed by only ten users.
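The arithmetic behind that skew can be checked with the round numbers from the talk. This short Python sketch uses only those figures (1.6 million records, sixteen thousand volunteers, a tenth of the records from ten users); the variable names are mine, and the per-group averages are back-of-the-envelope values, not published OldWeather statistics.

```python
observations = 1_600_000  # records transcribed, as quoted in the talk
volunteers = 16_000       # volunteers, as quoted in the talk
top_users = 10            # users in the left-hand column of the chart

top_records = observations // 10  # "fully a tenth" of the records

mean_all = observations / volunteers
mean_top = top_records / top_users
mean_rest = (observations - top_records) / (volunteers - top_users)

print(f"mean per volunteer:       {mean_all:,.0f}")   # 100
print(f"mean for the top ten:     {mean_top:,.0f}")   # 16,000
print(f"mean for everyone else:   {mean_rest:,.2f}")  # 90.06
```

The top ten users average 160 times the overall mean, which is why the mean alone says so little about a typical volunteer.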
So we see this in other projects. This is a power-law distribution in which most of the contributions are made by a handful of "well-informed enthusiasts". I've talked elsewhere about how this is true in small projects as well. What I'd like to talk about here is some of the implications.
One of the implications is that very small projects can work: This is the Zenas Matthews Diary, which was transcribed on FromThePage by one single volunteer -- one well-informed enthusiast -- in fourteen days.
Before we had announced the project publicly he found it, transcribed the entire 43-page diary from the Mexican-American War of a Texas volunteer, went back and made two hundred and fifty revisions to those pages, and added two dozen footnotes.
This also has implications for the kinds of tasks you can ask volunteers to do. This is the Harry Ransom Center Manuscript Fragments Project in which the Ransom Center has a number of fragments of medieval manuscripts that were used in the bindings of later works, and they're asking people to identify them so that perhaps they can reassemble them.

So here's a posting on Flickr. They're saying, "Please identify this in the comments thread."
And look: we've got people volunteering transcriptions of exactly what this is: identifying, "Hey, this is the Digest of Justinian, oh, and this is where you can go find this."
This is true even for smaller, more difficult fragments. Here we have one user going through and identifying just the left hand fragment of this chunk of manuscript that was used for binding.
So crowdsourcing and deep digitization form a virtuous cycle, in my opinion. You go through and you try to engage volunteers to come do this kind of work. That generates deep digitization which means that these resources are findable. And because they're findable, you can find more volunteers.
I've had this happen recently with a personal project, transcribing my great-great grandmother's diary. The current top volunteer on this is a man named Nat Wooding. He's a retired data analyst from Halifax County, Virginia. He's transcribed a hundred pages and indexed them in six months. He has no relationship whatsoever to the diarist.

But his great uncle was the postman who's mentioned in the diaries, and once we had a few pages worth of transcripts done, he went online and did a vanity search for "Nat Wooding", found the postman--also named Nat Wooding--discovered that that was his great uncle and has become a volunteer.
Here's the example: this is just a scan/facsimile. Google can't read this.
Google can read this, and find Nat Wooding.
Now I'd like to turn to non-institutional digitization. I said "bilateral" -- this means, what happens when the public initiates digitization efforts? What are the challenges--I mentioned standards--and how can we fix those? And why is this important?
Well, there is this--what I call the Invisible Archive, of privately held materials throughout the country and indeed the world. And most of it is not held by private collectors that are wealthy, like private art collectors. They are someone's great aunt who has things stashed away in filing cabinets in her basement. Or worse, they are the heirs of that great aunt, who aren't interested and have them stuck in boxes in their attic. We have primary sources here of non-notable subjects, that are very hard to study because you can't get at them.

But this is a problem that has been solved, outside of manuscripts. It's been solved with photographs. It's been solved by Flickr. Nowadays, if you want to find photographs of African-American girls from the 1960s on tricycles, you can find them on Flickr. Twenty years ago, this was something that was irretrievable. So Flickr is a good example, and I'd like to use it to describe how we might be able to apply it to other fields.
So, in terms of solving the standards problem, amateur digitization has a bad, bad reputation, as you can see here.
And much of that bad reputation is deserved, and isn't specific to digitization. This has been a problem with print editions in the past, it is a problem online now. Frankly, scholars don't trust the materials because they're not up to standard.
How do we solve this? Collaboration: we'd like to see more people who are scholars, who are trained archivists, who are trained librarians participate in some of these projects.
One of the ones I'm working with [is] digitizing these registers from the Reformation up to the present. We're building this generalizable, open-source, crowdsourced transcription tool and indexing tool for structured data. We'd love to find archivists to tell us what to do, what not to do, and to collaborate with us on this.

Another solution is community. You don't go on Flickr just to share your photos; you go on Flickr to learn to become a better photographer. And I think that creating platforms and creating communities that can come up with these standards and enforce them among themselves can really help.

The same thing is true with software platforms, if they actually prompt users and say: "when you're uploading this image, tell us about the provenance." "Maybe you might want to scan the frontispieces." "Maybe you'd like to tell us the history of ownership."
Those are the things that I think might get us there. I've just hit my time limit, I think, so thanks a lot!

Ben Brumfield is a family historian and independent software engineer. For the last seven years he has been developing FromThePage, an open source manuscript transcription tool in use by libraries, museums, and family historians. He is currently working with FreeUKGen to create an open source system for indexing images and transcribing structured, hand-written material. Contact Ben at