Thursday, May 31, 2012

Survey on Crowdsourced Transcription Tools

Via Twitter and the TEI-L mailing list, I see that Jens Brokfeld, a graduate student in Potsdam, is conducting a survey of projects using transcription tools for his thesis, “Creating Digital Editions with Crowdsourced Manuscript Transcription: A Tool Evaluation”.  I encourage readers of this blog to take the survey and help advance the field.
If your organisation (archives, libraries, scientific institutions) works on projects for digital editions or may do so in the future, I would be very grateful for your answers to my survey questions. Please forward this e-mail to anyone working with digital editions and transcription tools. The survey will take 15-20 minutes.

Please click on the following link to complete the survey: https://www.surveymonkey.com/s/survey_transcription_tools

Friday, May 25, 2012

Transcription Tools at TCDL 2012

Yesterday I presented a guide to choosing software for crowdsourced manuscript transcription at the Texas Conference on Digital Libraries. Here are the slides from that talk:

Wednesday, May 2, 2012

Bending Regular Expressions to Express Uncertainty

This post expands upon an email I sent to James Edward Gray II, my mentor in regular expressions, whom I thank for his generosity.

For the last few months--ever since I started conducting regular expression workshops at THATCamps--I've been thinking about using regular expressions to represent uncertainty. The great power of the regex is its ability to define context, boundaries, and precision over what is essentially unknown: /A[BC]/ means "I'm looking for an A, followed by either a B or a C -- I don't know which, but if you see either of those, that's the text."
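That reading of /A[BC]/ can be checked in any regex engine. Here's a toy illustration in Python (not part of any transcription tool, just the character class in action):

```python
import re

# /A[BC]/: an A followed by either a B or a C -- nothing else qualifies.
pattern = re.compile(r"A[BC]")

print(bool(pattern.search("AB")))   # True
print(bool(pattern.search("AC")))   # True
print(bool(pattern.search("AD")))   # False -- D is neither B nor C
```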

I build tools for transcribing handwritten text -- some of it very old, some of it illegible, some of it damaged by fire or water. Representing uncertainty is a common problem within that domain, and it's important for a couple of reasons:
  1. Usually the text will be presented out of context -- the ASCII (or Unicode) characters representing the underlying text will be separated from the image containing the underlying text.
  2. Even when the images are available, the person doing the transcription is far more skilled at deciphering a particular cursive hand than their readers are likely to be. Take someone trained in 16th century paleography, who has spent weeks working on a particular author's handwriting -- their opinion on whether a scribble is an "f" or an "s" is going to be worth more than that of a casual researcher who encounters that handwriting in a single record of search results. We need to pass that opinion from the expert to the reader.

While otherwise rigorous, the methods professional documentary editors use for recording uncertain readings are pretty shabby -- suited to print editions, they're often concise but imprecise: the Emerson Journals display missing text via || . . . ||, with "three dots representing one to five words; four dots, six to ten words; and five dots, sixteen to thirty words". Another example uses [ . . . ], with each period representing an illegible letter. There is no convention for expressing "I'm sure this is either an 'a' or a 'u', but I can't be certain which one." (Kline and Perdue, A Guide to Documentary Editing, Third Edition)

This is where the notation regular expressions use for their search patterns comes in. It seems like a perfect fit to record Br[au]mfield when the user can't tell whether a name is "Brumfield" or "Bramfield" but is certain that it's not, say, "Bremfield". And happily, FreeUKGen is doing exactly this -- they've created a regex-inspired Uncertain Character Format (UCF) for their volunteers to use when they're not quite able to make out text:
_ (Underscore) A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character.
* (Asterisk) Several adjacent uncertain characters. A single * is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a _ or another *.
Note: If it is clear there is a space, then * * is used to represent 2 words, neither of which can be read.
[abc] A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets.
For example, [79] would mean either a 7 or a 9, whereas [C_] would mean a C or some other character.
{min,max} Repeat count - the preceding character occurs somewhere between min and max times. max may be omitted, meaning there is no upper limit. So _{1,} would be equivalent to *, and _{0,1} means that it is unclear if there is any character.
? Sometimes you will have the situation where all of the characters have been read but you remain uncertain of the word. In this case append a ? at the end of the word, e.g. RACHARD? The most frequent place where a ? is used is with transcriptions that have been donated from other systems and are being converted for entry into FreeREG.
And their volunteers are actually using UCF a lot -- here's a file with a minimal-but-effective example, and here's a list of files with a high incidence of records containing UCF.
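Since UCF stays so close to regex notation, converting it for matching is mostly mechanical. Here is a sketch of such a converter in Python -- my own reading of the rules quoted above, not FreeUKGen's actual code:

```python
import re

def ucf_to_regex(ucf):
    """Convert a UCF string into an anchored regular expression.
    A sketch based on the published UCF rules, not FreeUKGen's code."""
    out = []
    i = 0
    while i < len(ucf):
        ch = ucf[i]
        if ch == '_':                        # one unknown character
            out.append('.')
        elif ch == '*':                      # one or more unknown characters
            out.append('.+')
        elif ch == '[':                      # class, e.g. [au] or [C_]
            j = ucf.index(']', i)
            cls = ucf[i + 1:j]
            # For matching purposes, a class containing _ can be anything.
            out.append('.' if '_' in cls else '[' + cls + ']')
            i = j
        elif ch == '{':                      # repeat count, e.g. {1,2}
            j = ucf.index('}', i)
            out.append(ucf[i:j + 1])
            i = j
        elif ch == '?' and i == len(ucf) - 1:
            pass                             # trailing ?: whole word doubtful
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile('^' + ''.join(out) + '$', re.IGNORECASE)

print(bool(ucf_to_regex('Br[au]mfield').match('Brumfield')))  # True
print(bool(ucf_to_regex('Br[au]mfield').match('Bremfield')))  # False
print(bool(ucf_to_regex('[_J]ane').match('Jane')))            # True
```

Note the deliberate information loss in the `[C_]` case: the transcriber's hunch about the "C" survives in the stored UCF, but a pure matching engine has to treat the class as "any character".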

One of the few big differences between UCF notation and regular expression notation is the use of underscore (_) for "I'm not sure what this character is". In a sense, this is equivalent to the regex . character, but in practice that's not how it's used. /[i.]/ makes no sense in regular expressions: "either 'i' or any character" is the union of the set of all characters with the set { 'i' }, which is just the set of all characters. As a result, in regular expressions the 'i' is redundant in [i.]. However, that's not how UCF uses _. [i_] means "I think this character is an 'i', but I'm not really sure." That statement is not the same thing as "I don't know what this character is" -- not at all!

So, cool! We've got a notation for describing uncertainty inspired by regular expressions. Problem solved, right? Well, not quite. While FreeUKGen's UCF represents uncertain readings successfully, I think, there are still a couple of issues to iron out.

The first one of these is displaying the data -- I won't go too far into this, as UI is not really my forte, but it seems like we might be able to represent notations of the form "[a_]" by using a different font weight. I have no idea what we'll do about a "Br[au]mfield", though.

The second issue is in searching the data. No problem, right? Regular expressions are designed for searching! Well, sort of. In this case, we expect end users (primarily genealogy researchers) to be typing in precise search strings like "Brumfield" which they expect to match against the regular expression /Br[au]mfield/. This wouldn't be a problem if we only had a handful of records -- we'd convert the UCF in each transcription into its equivalent regular expression, then iterate through each record, matching it against the user-entered search string. Unfortunately this approach might take a while on a database containing hundreds of millions of records.

The problem with searching regular expressions is that you can't index them. So far as I'm aware, it's theoretically impossible to shove a working finite state machine into a B+-tree. What you can do, however, is index permutations of UCF -- if a first name is [_J]ane, you can at least index Jane so that a search for "Jane" will find that record. You can also permute /Br[au]mfield/ into both "Brumfield" and "Bramfield" and index each, so that a search on either string will find the /Br[au]mfield/ record. This is an incomplete solution, in that its results will differ from the aforementioned, logically correct approach of applying each regex against the search string. However, it might be just adequate for the most common cases.
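The permutation idea can be sketched briefly. Assuming fields whose only uncertainty is character classes, expanding them into indexable strings looks something like this (a hypothetical helper, not FreeUKGen's implementation):

```python
from itertools import product
import re

def permute_ucf(ucf):
    """Expand every character class in a UCF string into its concrete
    spellings, so each can be indexed. A sketch: classes containing _
    are reduced to their legible alternatives, and _ or * outside a
    class are not handled."""
    parts = []
    for literal, cls in re.findall(r'([^\[\]]+)|\[([^\]]+)\]', ucf):
        if cls:
            parts.append([c for c in cls if c != '_'])
        else:
            parts.append([literal])
    return [''.join(p) for p in product(*parts)]

print(permute_ucf('Br[au]mfield'))  # ['Bramfield', 'Brumfield']
print(permute_ucf('[_J]ane'))       # ['Jane']
```

The combinatorial cost is real -- a field with several classes multiplies out -- but for the one-class cases that seem to dominate the data, it's a couple of extra index entries per record.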

After writing this and reading James's response, I started thinking more about my options.  One of these is a parallelized brute-force approach.  Why can't I match each regex in the database against a search string?  After all, we're talking about fewer than a billion records, and asking "does X match Y?" is the sort of thing that is easily parallelized.  O brave new world, that has such infrastructure! I'm hesitant to go down this path, but I may be missing something -- perhaps some Hadoopy, Erlangy, Map-Reducey algorithm is cheap, easy, and presents the simplest solution to the problem?  Any other option is really an approximation to "correct", so it would be a shame to rule this out because of my own lack of experience.

Another approach might be to categorize each kind of UCF expression. Based on my limited research so far, it appears that the majority of the UCF in the existing transcripts falls into either the "completely unknown" category of "*" for an entire field, or the nuanced "I think this is a J but I'm not sure" category represented by [J_].  We will likely have to handle the former gingerly no matter what we do -- if a surname is totally illegible in the manuscript, the search engine will have to rely on other fields.  The latter expression could be approximated by "J", which would match precise searches well provided the transcriber actually has the greatest possible expertise.
Moving into what I expect are rarer cases, expressions like /Br[au]mfield/ would work well with the permutation treatment I outlined above. If the system already supports begins-with and ends-with searches, we should be able to index "Brumf*" as well as "*field".  In fact, we might even be able to index a single, infixed wildcard like "Br*ld" by doing both begins-with and ends-with searches with a combination of cleverness and hackery.  This leaves some smaller number of true regular expression-equivalent UCF-encoded records like "Ca_{1,2}s*d" to deal with.  It's possible that this represents such a small sample that the system could actually apply each record containing an irreducible regex to every search, whether via big parallelization or a long loop.
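That triage could be expressed as a simple classifier over the UCF strings. The bucket names below are my own, and the pattern tests are deliberately naive:

```python
def classify_ucf(field):
    """Bucket a UCF-encoded field by how it could be indexed.
    A naive sketch; the category names are mine, not FreeUKGen's."""
    if field == '*':
        return 'illegible'          # nothing to index; rely on other fields
    if not any(c in field for c in '_*[]{}?'):
        return 'plain'              # ordinary text, index directly
    if field.endswith('?') and not any(c in field[:-1] for c in '_*[]{}?'):
        return 'doubtful-word'      # fully read, e.g. RACHARD? -- index without the ?
    if '[' in field and not any(c in field for c in '*{}'):
        return 'permutable'         # e.g. Br[au]mfield -- index each spelling
    if field.count('*') == 1 and not any(c in field for c in '_[]{}'):
        return 'infix-wildcard'     # e.g. Br*ld -- begins-with plus ends-with
    return 'irreducible'            # true regex territory, e.g. Ca_{1,2}s*d

for f in ['Brumfield', 'Br[au]mfield', 'Br*ld', 'Ca_{1,2}s*d', '*', 'RACHARD?']:
    print(f, '->', classify_ucf(f))
```

If the "irreducible" bucket really is as small as I suspect, running its members through the brute-force loop on every search becomes plausible.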

Yet another approach is one that I understand to be deployed on the FreeUKGen databases now -- lossy comparison that reduces UCF to searchable data.  For example, the venerable Soundex algorithm begins by stripping non-alphabetic data from a record, converting /[J_]ane/ to "Jane".  The uncertainty recorded by the transcriber fades into the larger fog of the search algorithm.  I'm just as uncomfortable with this methodology as I am with the permutation of /[J_]ane/ to "Jane" I described above.  I suspect that my discomfort is due to simply not knowing what the correct behavior is when a user is searching on /[J_]ane/ -- I know that "Jane" should match, but am not entirely sure whether "Zane" should match the record.
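To make the lossiness concrete, here is a sketch of that reduction: strip the UCF markers, then run Soundex over what's left. The strip_ucf helper is hypothetical, and this Soundex is simplified (it omits the h/w separator rule of the full algorithm):

```python
import re

def strip_ucf(ucf):
    """Lossily reduce a UCF string to its legible letters, e.g. [J_]ane -> Jane."""
    return re.sub(r'[^A-Za-z]', '', ucf.replace('_', ''))

# Standard Soundex letter-to-digit codes.
CODES = {c: d for d, letters in
         {'1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
          '4': 'l', '5': 'mn', '6': 'r'}.items() for c in letters}

def soundex(word):
    """Simplified American Soundex: first letter plus three digits."""
    word = word.lower()
    digits = [CODES.get(c, '') for c in word]   # '' for vowels, h, w, y
    out = [word[0].upper()]
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:                     # skip repeats of the same code
            out.append(d)
        prev = d
    return (''.join(out) + '000')[:4]

print(soundex(strip_ucf('[J_]ane')))  # 'J500'
print(soundex('Zane'))                # 'Z500'
```

Run on /[J_]ane/ this yields J500, while "Zane" yields Z500 -- so under Soundex the two would not match, which answers my question above in one particular way without ever consulting the transcriber's recorded uncertainty.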

Perhaps the right approach is a hybrid -- use tricks with database indexing for the majority of cases (which don't involve any UCF at all), provide Soundex and Metaphone in transparent ways, and shove the irreducible regular expressions into a spot where they can be processed cheaply.  But I really don't know, and I don't anticipate knowing for months yet.  Of course, if you happen to have done this before, I'd love to know how.  I'm heading to bed, expecting dreams which revolve around gem install index-fsm.

Wednesday, April 11, 2012

Crowdsourced Transcription Tool List

When I first started this blog, I spent a lot of time writing detailed reviews of different transcription projects.  This has become difficult as my available time shrinks and the number of crowdsourcing projects grows.  So when Kate Bowers posted to the Society of American Archivists mailing list asking for a directory of transcription tools, I figured it was time to take a different approach.

http://tinyurl.com/TranscriptionToolGDoc
 
The link above is a Google Documents spreadsheet listing different tools and the features I thought were relevant. It's been updated several times over the last few weeks, and I'm pleased to see that it's expanded to include a score of technologies. I hope it's useful.

Wednesday, April 4, 2012

French Departmental Archive on Wikisource

While the transcription world buzzes with news of the release of the 1940 US census and the crowdsourced transcription projects that surround it, I'd like to draw your attention to a blog post published last week on La Tribune des Archives: "Edition collaborative de manuscrits sur Wikisource : 1er retour d'expérience".  The post covers the efforts of the archives of the department of Alpes-Maritimes to transcribe 17th- and 18th-century records of episcopal visits to the communes in the diocese.  These records are rich sources on local history, but "readers struggle over the chicken-scratch, and the collection is too large to be edited by a single person."  The archive has used Wikisource.fr to transcribe these manuscripts with great success, so I'd like to quote and translate extensive portions of their post.

Why Wikisource?

It's already there! (No software to create, maintain, administer, no specs -- just a strong will and a core of 2-6 people).
It offers features designed for manuscript editions requiring more than one editor.
Particularly useful functions (aside from the collaborative aspect):
  • Side-by-side display of facsimile and transcription
  • Workflow indicating whether a page is transcribed, corrected, or validated by two administrators.
  • The visualization is very practical for motivating the community of transcribers.
  • Version history control and the ability to comment or discuss difficult issues.
  • Wikisource's high Google page rank.

The article goes on to describe the factors they weighed when choosing material for the project (accessibility of the script and local interest, among others) and how they got started (the standard GLAMWiki approach), then turns to the community management aspects I find so fascinating:

How do you motivate your paleographers?

 In our experience, transcribers are essentially former university students and internally-trained archivists who want to extend their education (either by making further progress or by avoiding becoming rusty).
Work times and rest times clearly defined in advance.
A regular, fixed-date schedule defined in advance (for example, one month: upload on the 15th and correction on the last day of the month) helps the group to make progress and to break up its efforts with relaxation periods (for the eyes, the editors, and the correctors) and lets everyone have rapid feedback (new pages are in fact corrected practically every night).

Findings on the behavior of "students" on Wikisource

The first exercises attracted the kind support of Wikisource regulars and administrators (Adrienne Alix, SereinWMfr, Pyb, Hsarrazin), a few newly registered paleographers (Cavalié, LINCK, Braxmeyer, Gustave), and some anonymous IPs.  One or two correctors can easily suffice to keep track of the work of 5-10 "students".  Contrary to homework done in class, the "students" apply themselves regularly to the task, and the size and number of contributions does not increase on the night before the deadline.
Writings dating from before 1660 receive fewer volunteers but could very well serve as university exercises graded online (at the rate of one page per student).
For more on the archive's efforts (including their similar outreach on Flickr), take a look at the departmental archive news page.

Saturday, March 17, 2012

Crowdsourcing at IMLS WebWise 2012

The video of the crowdsourcing panel at IMLS WebWise is online, so I thought I'd post my talk.  Like anyone who's created a transcript of their own unscripted remarks, I recommend watching the video. (My bit starts at 6:00, though all the speakers were excellent. Full-screening will hide the slides.)  Nevertheless, I've added hyperlinks to the transcript and interpolated the slides with my comments below.
Okay. I'd like to talk about some of the lessons that have come out of my collaborations with small crowdsourcing projects. We hear a lot about these large projects like GalaxyZoo, like Transcribe Bentham. What can small institutions and small projects do, and do the rules that seem to apply to large projects also apply to them?
So there are three projects that I'm drawing from here in this experience. The first one I'm going to talk about in a little bit is one that was run by Balboa Park Online Collaborative. It's a Klauber Field Notes Transcription Project--the field notes of Laurence M. Klauber, who was the nation's foremost authority on rattlesnakes. These are field notes that he kept from 1923 through 1967. This is done by the San Diego Natural History Museum and is run by our own Perian Sully who is out there in the room somewhere.

The next project I want to talk about is the Diary of Zenas Matthews. Zenas Matthews was a volunteer from Texas who served in the American forces in the US-Mexican War of 1846, and his diary is kept by Southwestern University. It had been digitized for a previous researcher. The project is small, but Southwestern itself is also quite small.
The third project I want to talk about is actually the origin of the software, which is the Julia Brumfield Diaries. If the name looks familiar, it's because she's my great-great grandmother. This project was the impetus for me to develop this tool for crowdsourced transcription.

So all of these projects, what they have in common is that we're talking about page counts that are in the thousands and volunteer counts that are numbered in the dozens at best. So these are not FamilySearch Indexing, where you can rely on hundreds of thousands of volunteers and large networks.

So who participates in large projects and who participates in small projects? One thing that I think is really interesting about crowdsourcing and these other sorts of participatory online communities is that the ratio of contributions to users follows what's called a power-law distribution.
If you look here, we see--and most famously this is Wikipedia--and you see a chart of the number of users on Wikipedia ranked by their contributions. And what you see is that 90% of the edits made to Wikipedia are done by 10% of the users.
If we look at other crowdsourced projects: this is the North American Bird Phenology Program out of Patuxent Bay Research Center [ed: actually Patuxent Wildlife Research Center], and this is a project in which volunteers are transcribing ornithology records--basically bird-watching records--that were sent in from the 1870s through the 1950s [ed: 1880s-1970s], entering them into a database where they can be mined for climate change [data]. What's interesting about this to me at least is that--and this has been a phenomenally successful project: they've got 560,000 cards transcribed all by volunteers, but StellaW@Maine here has transcribed 126,000 of them, which is 22% of them. Now, CharlotteC@Maryland is close behind her (so go, local team!) but again you see the same kind of curve.
 If we look at another relatively large project, the Transcribe Bentham project: this isn't a graph, but if you look at the numbers here, you see the same kind of thing. You see Diane with 78,000 points, you see Ben Pokowski with 51,000 points. You see this curve sort of taper down into more of a long tail.
 So what about the small projects?
Well, let's look at the Klauber diaries. This is the top ten transcribers of the field notes of Laurence Klauber. And if you look at the numbers here--again in this case it's not quite as pronounced because I think the previous leader has dropped out and other people have overtaken him--but you see the same kind of distribution. This is not a linear progression; this is more of a power-law distribution.
If you look at an even smaller project--now, mind you this is a project that is really only of interest to members of my family and elderly neighbors of the diarist--but look: We've got Linda Tucker who has transcribed 713 of these pages followed by me and a few other people. But again, you have this power law that the majority of the work is being done by a very small group of people.
Okay, what's going on really? What does this mean and why does it matter? The thing that I think this gets to, the reason that I think that this is important, is for a couple of reasons.
One is that this kind of behavior addresses one of the main objections to crowdsourcing. Now there are a lot of valid objections to crowdsourcing; I think that there are also a few invalid objections and one of them is essentially the idea that members of the public cannot participate in scholarly projects because my next door neighbor is neither capable nor interested in participating in scholarly projects. And we see this all over the place. I mean, here's a few example quotes--and I'm not going to read them out. I believe that this objection (which I have heard a number of times; I mean we see some examples right here) is a non sequitur. And I believe that the power-law distribution proves that it's a non sequitur. Really, I saw this most egregiously framed by a scholar who was passionately--just absolutely decrying--the idea that classical music fans would be able to competently translate from German into English because, he said, "After all, 40% of South Carolina voted for Newt Gingrich." Okay.
All right, so what's going on is I think best summed up by Rachel Stone, and what she essentially said is that crowdsourcing isn't getting the sort of random distribution from the crowd. Crowdsourcing is getting a number of "well-informed enthusiasts."
So where do we find well-informed enthusiasts to do this work and to do it well? Big projects have an advantage, right? They have marketing budgets. They have press coverage. They have an existing user base.
If you ask the people at the Transcribe Bentham project how they got their users, they'll say "Well, you know, that New York Times article really helped."  That's cool! All right.
The GalaxyZoo people--Citizen Science Alliance--yesterday, 24 hours ago, announced a new project, SETILive. Now what this does is it pulls in live data from the SETI satellites[sic: actually telescope], and in those 24 hours--I took this screenshot; I actually skipped lunch to get this one screenshot because I knew that it would pass 10,000 people participating with 80,000 of these classifications. And it would have been higher, except last night the telescope got covered by cloud cover. So they dropped from getting 30 to 40 contributions per second to having to show sort of archival data and getting only 10 contributions per second. Well, they can do this because they have an existing base of active volunteers that numbers around 600,000.
So how do WE do that? How do we find well-informed enthusiasts? This is something that Kathryn Stallard and Anne Veerkamp-Andersen at Southwestern University Special Collections and I discussed a lot when we were trying to launch the Zenas Matthews Diary. We said, "Well, we don't have any budget at all." Kathryn said, "Well, let's talk about local archival newsletters. Let's post to H-Net lists." I was in favor of looking at online communities of people who might be doing Matthews genealogy or the military history war-gamers who have discussion forums on the Mexican War.
While we're arguing about this, Kathryn gets an email from a patron saying, "Hey, I'm a member of an organization. We see that you have this document. It relates to the Battle of San Jacinto and the Texas Revolution of 1836. Can you send this to us?"
She responds saying, "Hey, 1846, great. Check out this diary we just put online. I think that's what you're talking about."

Well that wasn't actually what he was talking about, but he responds and says, "Yeah, okay, I'll check that out, but can you please give me the document I want." They get it back to him and we returned to our discussion of "Okay, what do we need to do to roll this out? We're going to start working on the information architecture. We're going to work on the UI. We're going to work on help screens."  And while we're having this conversation, Mr. Patrick checks it out.
And Scott Patrick starts transcribing.
And he starts transcribing some more.
  And he continues transcribing.
 And at this point, we're talking about working on the wording of the help screens, the wording of our announcement trying to attract volunteers, and this is page 43 of the 43-page diary!
And while we're discussing this, he goes back and he starts adding footnotes. Look at this: he's identifying the people who are in this, saying, "Hey, this guy who is mentioned is -- here's what his later life was. This other guy--hey, he's my first cousin, by the way, but he also left the governorship of the State of Texas to fight in this war."
He sees--and believe me, in the actual original diary, Piloncillo is not spelled Piloncillo. I mean it is a -- Zenas Matthews does not know Spanish, right? He identifies this! He identifies and looks up works that are mentioned here.

So wow! All right! We got our well-informed enthusiast! In 14 days, he transcribed the diary, and he didn't do just one pass. I mean as he got familiar with the hand, he goes back and revises the earlier transcriptions. He kind of figures out who's involved. He asks other members of his heritage organization what this is. He adds two dozen footnotes.

What just happened? What was that about? Who is this guy? Well, Scott Patrick is a retired petroleum worker who got interested in his family history, and then got interested in local history, and then got interested in heritage organizations. And he is our ideal "well-informed enthusiast".
So how did we find him? The project isn't public yet, right? Our challenge now is rephrasing our public announcement. We're now looking for volunteers to ... something that adequately describes what's left to do. Well, let's go back and take a look at this original letter, right? What we did is, we responded to an inquiry from a patron--and not an in-person patron: this is someone who lives 200 miles away from Georgetown, Texas.

What you have when someone is coming in and asking about material is, if you think about this in terms of target marketing--this is a target-rich environment. Here is someone who is interested. He's online. He's researching this particular subject. He is not an existing patron; he has no prior relationship with Southwestern University Libraries, but "Hey, while we answer your request, you might check this thing out that's in this related field." That seems to have worked in this one case. Hopefully, we'll get some more experience with future projects.

Okay, so how do we motivate volunteers? More importantly, how do we avoid de-motivating them?
Big projects, a lot of times they have a lot of interesting game-like features. Some of them actually are games. You have leader boards, you have badges, you have ways of making the experience more immersive.
OldWeather, which is run by GalaxyZoo, will plot your ship on a Google map as you transcribe the latitude and longitude elements from the log books.
The National Library of Finland has partnered with Microtask to actually create a crowdsourcing game of Whac-A-Mole. So this is crowdsourcing taken to the extreme.
But there's a peril here, and the peril is that all of these things are extrinsic motivators.
And we ran into this with the Klauber diaries. Perian came to me and said, "Hey, let's come up with a stats page, because we want to track where the diaries are at." So we come up with the stats page -- pretty basic, here's where some of these are at.

And hey, while we're at it, let's mine our data. We can come up with a couple of top-10 lists. So we come up with the top-ten list of transcribers and a top-ten list of editors, because that's the data I have.

Well remember, the whole point of this exercise is to index these diaries so that we can find the mentions of these individual species in the original manuscripts. Do you see indexing on here anywhere? Neither did our volunteers, and the minute this went up, the volunteers who previously had been transcribing and indexing every single page stopped indexing completely. They weren't being measured on it. We weren't saying that we rewarded them for it, so they stopped.
Needless to say, our next big-rush change was a top-ten indexers list.
So this gets to the "crowding-out" theory of motivation, and the expert on this is a researcher in the UK named Alexandra Eveleigh. Her point is that if you're going to design any kind of extrinsic motivation, you have to make sure that it promotes the actual contributory behavior, and this is something that applies, I believe, to small projects as well as large projects.
So I have 13 seconds left, so thank you, and I'll just end on that note.

Thursday, March 8, 2012

Jumping In With Both Feet


Although I didn't know it at the time, since I began work on FromThePage in 2005 I've had one toe in the digital humanities community.  I've worked on FromThePage and I've blogged about crowdsourced manuscript transcription.  I've met some smart, friendly people doing fascinating things and I've even taught some of them the magic of regular expressions.  But I've always tried to squeeze this work into my "spare time" -- the interstices in the daily life of an involved father and a professional software engineer working a demanding but rewarding job.  As the demands of vocation and avocation increase; as disparate duties begin to compete with each other; as new babies come into my home while new technologies come into my workplace and new requests for FromThePage arrive in my inbox, the basement inventor model becomes increasingly untenable.  The numbers don't lie: I've only checked in code on four days during the last six months.

In January I was offered an incredible opportunity.  Chris Lintott invited me to the Adler Planetarium to meet the Citizen Science Alliance's dev team.  This talented, generous team of astronomer-developers gave me a behind-the-scenes tour of their Scribe tool--early versions of which powered OldWeather.org--and I was blown away.  I don't think I've ever been so excited about a technology, and my mind raced with ideas for projects using it.  Serendipitously, two days later I received an email from Ben Laurie asking if I'd like to implement Scribe for the FreeREG project, a part of the FreeBMD genealogy charity that is transcribing the parish registers recording baptisms, marriages, and burials in England and Wales from 1538 to 1835.  All development would be released open source, and all data would be as open as possible.  It's a dream project for someone with my interests; there was no way I could pass this up.

So as of March 18 I'm starting a new career as an independent digital history developer.  It is heartbreaking to leave my friends at Convio after nearly a dozen years, but I'm delighted with the possibilities my new autonomy offers. I hope to specialize in projects relating to crowdsourcing and/or manuscript transcription, but to be honest I'm not sure where this path will lead.   Of course I plan to devote more time to FromThePage -- this year should finally see the publish-on-demand integration I've always been wishing for, as well as a few other features people have requested.  If you've got a project that seems appropriate--whether it involves genealogy or herpetology, agricultural history or textile history--drop me a line.


Monday, March 5, 2012

Quality Control for Crowdsourced Transcription

Whenever I talk about crowdsourced transcription--actually whenever I talk about crowdsourced anything--the first question people ask is about accuracy. Nobody trusts the public to add to an institution's data/metadata, much less to correct it. However, quality control over data entry is a well-explored problem, and while I'm not familiar with the literature from industry regarding commercial approaches, I'd like to describe the systems I've seen implemented in the kinds of volunteer transcription projects I follow. (Note: the terminology is my own, and may be non-standard.)
  1. Single-track methods (mainly employed with long, prosy texts that are difficult to compare against independent transcriptions of the same text). In these methods, all changes and corrections are made to a single transcription which originated with a volunteer and is modified thereafter. There is no parallel or alternate transcription to compare against.
    1. Open-ended community revision: This is the method Wikipedia uses, and it's the strategy I've followed in FromThePage. In this method, users may continue to change the text of a transcription forever. Because all changes are logged--with a pointer of some sort to the user who made them--vandalism or edits which are not in good faith may easily be reverted to a known-good state. This is in keeping with the digital humanities principle of "no final version." In my own projects, I've seen edits made to a transcription years after the initial version, and those changes were indeed correct. (Who knew that "drugget" was a coarse fabric used for covering tobacco plant-beds?) Furthermore, I believe there is no reason other than the cost of implementation why any of the methods below which operate from the "final version" mindset should not allow error reports against their "published" form.
    2. Fixed-term community revision: Early versions of both Transcribe Bentham and Scripto followed this model, and while I'm not sure whether either still does, it does seem to appeal to traditional documentary editing projects that want to incorporate crowdsourcing as a valuable initial input while retaining ultimate control over the "final version". In this model, wiki-like systems are used to gather the initial data, with periodic review by experts. Once a transcription reaches an acceptable state (deemed so by the experts), it is locked against further community edits and "published" to a more traditional medium like a CMS or a print edition.
    3. Community-controlled revision workflows: This model is a cross between the two methods above. Like fixed-term revision, it embraces the concept of a "final version", after which the text may not be modified. Unlike fixed-term revision, no experts are involved -- rather, the tool itself forces a text through an edit/review/proofread/reject-or-approve workflow carried out by the community, after which the version is locked against future edits. As far as I'm aware, this is implemented only by the ProofreadPage plugin to MediaWiki, which Wikisource has used for the past few years, but it seems quite effective.
    4. Transcription with "known-bad" insertions before proofreading: This is a two-phase process, which to my knowledge has only been tried by the Written Rummage project as described in Code4Lib issue 15. In the first phase, an initial transcription is solicited from the crowd (in their case, a Mechanical Turk workforce willing to transcribe 19th-century diaries for around eight cents per page). In the second phase, the crowd is asked to review the initial transcription against the original image, proofreading and correcting it. To verify that a review is effective, however, extra words or characters are added to the data before it is presented to the proofreader, and the locations of these known-bad insertions are recorded. The resulting corrected transcription is then programmatically searched for the inserted bad data; if it has been removed, the system assumes that any other errors have also been removed -- or at least that a good-faith effort has been made to proofread and correct the transcript.
    5. Single-keying with expert review: In this method, once a single volunteer contribution is made, it is reviewed by an expert and either approved or rejected. The expert is necessarily authorized in some sense -- in the case of the UIowa Civil War Diaries, the review is done by the library staff member processing the mailto form contribution, while in the case of FreeREG the expert is a "syndicate manager" -- a particular kind of volunteer within the FreeBMD charity. (FreeREG may be unique in using a single-track method for small, structured records; however, it demands more paleographic and linguistic expertise from its volunteers than any other project I'm aware of.) If a transcription is rejected, it may be either returned to the submitter for correction or corrected by the expert and published in corrected form.
  2. Multi-track methods (mainly employed with easily-isolated, structured records like census entries or ships' logbooks). In all of these cases, the same image is presented to different users to be transcribed from scratch. The data thus collected is compared programmatically, on the assumption that two correct transcriptions will agree with each other, so matching transcriptions may be assumed valid. If the two transcriptions disagree, however, one of them must be in error, and some kind of programmatic or human expert intervention is needed. It should be noted that all of these methodologies are technically "blind" n-way keying, as the volunteers are unaware of each other's contributions and do not know whether they are interpreting the data for the first time or contributing a duplicate entry.
    1. Triple-keying with voting: This is the method the Zooniverse OldWeather team uses. Originally the OldWeather team collected the same information in ten independent tracks, entered by users who were unaware of each other's contributions: blind, ten-way keying. The assumption was that the majority reading would be the correct one, so essentially this is a voting system. After some analysis, the team determined that the quality of three-way keying was indistinguishable from that of ten-way keying, so the system was modified to a less-skeptical algorithm, saving volunteer effort. If I understand correctly, the same kind of voting methodology is used by reCAPTCHA for its OCR correction, which allowed its exploitation by 4chan.
    2. Double-keying with expert reconciliation: In this system, the same entry is shown to two different volunteers, and if their submissions do not agree it is passed to an expert for reconciliation. This requires a second level of correction software capable of displaying the original image along with both submitted transcriptions. If I recall my fellow panelist David Klevan's WebWise presentation correctly, this system is used by the Holocaust Museum for one of their crowdsourcing projects.
    3. Double-keying with emergent community-expert reconciliation: This method is almost identical to the previous one, with one important exception: the experts who reconcile divergent transcriptions are themselves volunteers -- volunteers who have been promoted from transcribers to reconcilers by an algorithm. If a user has submitted a certain (large) number of transcriptions, and if those transcriptions have either 1) matched their counterpart's submission, or 2) been deemed correct by the reconciler when they conflicted with their counterpart's transcription, then the user is automatically promoted. After promotion, they may choose their volunteer activity from either the queue of images to be transcribed or the queue of conflicting transcriptions to be reconciled. This is the system used by FamilySearch Indexing, and its emergent nature makes it a particularly scalable solution for quality control.
    4. Double-keying with N-keyed run-off votes: Nobody actually does this that I'm aware of, but I think it might be cost-effective. If the initial two volunteer submissions don't agree, rather than submit the dispute to an expert, re-queue the transcription to new volunteers. I'm not sure what the right number is here -- perhaps a single tie-breaker vote, or perhaps three new volunteers to provide an overwhelming consensus against the original readings. If this round is indecisive, why not re-submit the transcription to an even larger group? Obviously this requires some limits, or the whole thing could spiral into an infinite loop in which your entire pool of volunteers argues over the reading of a single entry that is truly indecipherable. However, I think it has some promise, as it may offer the same scalability benefits as the previous method without needing a complex promotion algorithm or a reconciliation UI.
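The known-bad insertion technique (method 4 in the single-track list) is easy to sketch in code. The following is a minimal illustration only, not Written Rummage's actual implementation: the sentinel word list and the random-insertion strategy are invented for the example.

```python
import random

# Hypothetical sentinel words -- unlikely to occur in a real transcription.
KNOWN_BAD = ["xylograph", "quagmire", "zephyrine"]

def salt_transcription(text, n=2, rng=random):
    """Insert n known-bad words at random word boundaries, returning
    the salted text and the list of words that were planted."""
    words = text.split()
    inserted = []
    for _ in range(n):
        bad = rng.choice(KNOWN_BAD)
        words.insert(rng.randrange(len(words) + 1), bad)
        inserted.append(bad)
    return " ".join(words), inserted

def proofread_was_effective(corrected_text, inserted):
    """Assume the proofreader made a good-faith effort only if every
    planted word has been removed from the corrected transcription."""
    remaining = set(corrected_text.split())
    return all(bad not in remaining for bad in inserted)
```

A reviewer who returns the salted text untouched fails the check; one who has removed all the planted words is presumed to have corrected the genuine errors as well.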
Caveats: Some things are simply not knowable. It is hard to evaluate the effectiveness of quality control seriously without taking into account the possibility that volunteer contributors may be correct and experts may be wrong -- or, more importantly, that some images are simply illegible regardless of the paleographic expertise of the transcriber. The Zooniverse team is now exploring ways for volunteers to correct errors made not by transcribers but by the midshipmen of the watch who recorded the original entries a century ago. They realize that a mistaken "E" for "W" in a longitude record may be more amenable to correction than a truly illegible entry. Not all errors are made by the "crowd", after all.

Much of this list is based on observation of working sites and extrapolation, rather than any inside information. I welcome corrections and additions in the comments or at benwbrum@gmail.com.

[Update 2012-03-07: Folks from the Transcribe Bentham project informed me on Twitter that "In general, at the moment most transcripts are worked on by one volunteer, checked and then locked. Vols seem to prefer working on fresh MSS to part transcribed." and "For the record, does still use 'Fixed-term community revision'. There are weekly updates on the blog."  Thanks, Tim and Justin!]

Sunday, March 4, 2012

We Get Press!

Crowdsourced transcription projects--and FromThePage in particular--have gotten some really nice press in the last few weeks.

Konrad Lawson posted an excellent review of Scripto and FromThePage on the ProfHacker blog at The Chronicle of Higher Education: Crowdsourcing Transcription: FromThePage and Scripto.

Francine Diep wrote a great article on the phenomenon at Innovation News Daily: Volunteer Transcribers Put Millions of Pages Online.

Ellen Davis's article on Southwestern's transcription of Zenas Matthews's 1846 Mexican War Diary is especially notable because it includes an interview with Scott Patrick, the volunteer who has done such a spectacular job: Collaborative Transcription Project.
 

Wednesday, January 18, 2012

A Developer Goes to AHA2012

Last Sunday I returned from the 2012 meeting of the American Historical Association.  Although I have attended my share of conferences and unconferences--from Lone Star Ruby Con to Dreamforce and Texas State Historical Association to Museum Computer Network--I'd never attended one of the big mid-year academic conferences before.  The experience was strange but fruitful, and I hope I'll be able to attend again.

Let me start with my superficial impressions. First, historians dress much better than developers do, though they really don't hold a candle to the art gallery folks. They are also a more reactive audience, although there is very little back-channel conversation on Twitter -- in fact, I was informed that typing on laptops would be considered rude! Finally, they are pretty introverted -- more likely to strike up a conversation with a stranger than your average Rubyist, but not by much.

The conference itself is a bit warped by the fact that many of the attendees are there for the sole purpose of conducting job interviews.  This apparently involves a days-long series of thirty-to-ninety-minute interviews designed to figure out which candidates to invite to campus for an on-site interview -- a grueling process for the interviewers and an expensive one for the interviewees.  (The analogous activity in the software world is the phone screen, in which a hiring manager discusses experience and skill-set with a candidate.  Over the phone.)  If most attendees are interviewing, they aren't actually participating in the conference -- I was told that around 12,000 people register, but only 5,000 attend.  This gives AHA a kind of Potemkin village flavor, and it's not unusual to see a panel lecturing to a nearly empty room.  In fact, the last session I attended had five speakers on the podium and only three people in the audience.

Nevertheless, AHA2012 and the associated THATCamp were tremendously productive for me.  There were several opportunities for collaboration, so while I didn't find my dream partner for FromThePage--that institution with a staff of front-end experts and a burning need for transcription software--I did have some really good conversations.  I've been trying to add better support for letters to FromThePage, and Jean Bauer gave me a detailed walk-through of the Project Quincy data model for correspondence.  A lot of people were interested in starting their own crowdsourcing projects, and we've been swapping emails since.  Most importantly, while I was in town I met with the development team behind Scribe and Talk, the open-source tools that power Citizen Science Alliance projects like OldWeather and AncientLives. I'll be posting about that separately.

One of the things that impressed me most about the history world was the potential there is for a programmer to make a big impact. The graduate student I roomed with was an expert with regular expressions--his texts were in Arabic, so the RTL/LTR mix required him to close his eyes as he composed his patterns--but he had no experience with elementary scripting.  In one two-hour hack session, we were able to split a three-hundred-thousand-line medieval biographical dictionary into twenty thousand small files representing individual entries.  With a couple more hours' work, we'd have been able to extract dates, places, names, and other data from these files.  It is a delight for a software engineer to work in a domain where such minimal effort can make such a difference: most of our work deals with obscure edge cases of hard/boring problems, so removing months of tedious manual labor with an hour's worth of programming is incredibly rewarding.
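That kind of entry-splitting script fits in a few lines. The sketch below assumes, hypothetically, that each entry begins with a numbered header line like "123. <name>"; the real dictionary's format was certainly different, but the shape of the solution is the same.

```python
import re
from pathlib import Path

# Hypothetical format: each biographical entry starts with "123. " on
# its own line. Adjust the pattern to the actual source text.
ENTRY_HEADER = re.compile(r"^\d+\.\s")

def split_entries(source, out_dir):
    """Split one large text file into one small file per entry,
    returning the number of entries written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entry_lines, count = [], 0

    def flush():
        nonlocal count
        if entry_lines:
            count += 1
            (out_dir / f"entry_{count:05d}.txt").write_text("\n".join(entry_lines))
            entry_lines.clear()

    for line in Path(source).read_text().splitlines():
        if ENTRY_HEADER.match(line):
            flush()          # close out the previous entry
        entry_lines.append(line)
    flush()                  # don't forget the final entry
    return count
```

With the entries in separate files, extracting dates, names, and places becomes a matter of running further patterns over each small file.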

Crowdsourcing History: Collaborative Transcription and Archives, the panel I presented at, seemed to go well.  Moderator Shane Landrum invited the audience to give 3-minute presentations on their own crowdsourcing projects after the presenters finished their 8-minute talks, then he opened the floor for questions.  Although I was skeptical about this format, it worked very well indeed.  In particular, the Q/A period was blessedly free of the self-promoters who plague events like South by Southwest.  Perhaps this can be attributed to the novel format or perhaps it was due to the inherent civility of academic historians -- all I know is that it succeeded.  I felt very fortunate to be among the panelists, who were a Who's Who of manuscript transcription tools, although a couple prominent projects were not represented because they were too recent to be included in the proposal.  Because the context was already set by my fellow panelists and because the time was so constrained, I decided to concentrate my own talk on one feature of FromThePage: subject indexing through wiki-links.  An abbreviated recap of the presentation is embedded below:


On the whole, I think I'd like to go back to the AHA meeting. The conversations and collaborations made the trip worth the expense, and it was gratifying to finally meet the people behind the big transcription projects face-to-face.  I even managed to learn some fascinating stuff about American history.

Wednesday, December 7, 2011

Developments in Wikisource/ProofreadPage for Transcription

Last year I reviewed Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the en.wikisource.org policy prohibiting any text not already published on paper ruled out its use for manuscripts.

I'm pleased to report that this policy has been softened. About a month ago, NARA began partnering with the Wikimedia Foundation to host material—including manuscripts—on Wikisource.  While I was at MCN, I discussed this with Katie Filbert, the president of Wikimedia DC, who set me straight.  Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.

Once you host a project on Wikisource, what do you do with it?  Andie, Rob and Gaurav over at the blog So You Think You Can Digitize?—and it's worth your time to read at least the last six posts—have been writing on exactly that subject.  Their most recent post describes their experience with Junius Henderson's Field Notes, and although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the wikisource software, I'd like to concentrate on a detail:
What we currently want is a no-cost, minimal effort system that will make scans AND transcriptions AND annotations available, and that can facilitate text mining of the transcriptions.  Do we have that in WikiSource?  We will see.  More on annotations to follow in our next post but some father to a sister of some thoughts are already percolating and we have even implemented some rudimentary examples.
This is really exciting stuff.  They're experimenting with wiki mark-up of the transcriptions with the goal of annotation and text-mining.  I tried to do this back in 2005, but abandoned the effort because I could never figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that present manuscript pages and their transcribed text.  The lack of wiki-linking was also the criticism of mine most taken to heart by the German Wikisource community last October.

So how is the mark-up working out?  Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.  So the text "English sparrows" in the transcription is annotated [[:w:Passer domesticus|English sparrows]], which is wiki-speak for Link the text "English sparrows" to the Wikipedia article "Passer domesticus". Wikipedia's redirects then send the browser off to the article "House Sparrow".

So far so good.  The only complaint I can make is that—so far as I can tell—cross-wiki links don't appear in the "What links here" tool on Wikipedia, neither for Passer domesticus nor for House Sparrow.  This means that the annotation can't provide an indexing function: users can't see all the pages that reference possums, nor read a selection of those pages.  I'm not sure that the cross-wiki link data isn't tracked, however — just that I can't see it in the UI.  Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the "Global file usage" section of the sparrow image, for example.  Perhaps there is an API somewhere that the Henderson Field Notes project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.
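In the meantime, a project could build its own index by mining its transcriptions for these annotations with a regular expression. A minimal sketch, assuming the links all take the [[:w:Target|display text]] form shown above (real wikitext admits more variations than this pattern handles):

```python
import re

# Matches cross-wiki annotation links like
# [[:w:Passer domesticus|English sparrows]]
CROSS_WIKI_LINK = re.compile(r"\[\[:w:([^|\]]+)\|([^\]]+)\]\]")

def extract_annotations(wikitext):
    """Return (target article, displayed text) pairs for every
    cross-wiki annotation found in a transcription."""
    return CROSS_WIKI_LINK.findall(wikitext)
```

Run over every transcribed page, this yields exactly the subject-to-page mapping that "What links here" fails to expose.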

Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.  I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!

Friday, November 18, 2011

Crowdsourcing Transcription at MCN 2011

These are links to the papers, websites, and systems mentioned in my presentation at the Museum Computer Network 2011 conference.

Friday, August 5, 2011

Programmers: Wikisource Needs You!

Wikisource is powered by a MediaWiki extension which allows page images to be displayed beside the wiki editing form. This extension also handles editorial workflow by allowing pages, chapters, and books to be marked as unedited, partially edited, in need of review, or finished. It's a fine system, and while the policy of the English language Wikisource community prevents it from being used for manuscript transcription, there are active manuscript projects using the software in other communities.

Yesterday, Mark Hershberger wrote this in a comment: "For what it's worth, the extension used by WikiSource, ProofreadPage, now needs a maintainer. I posted about this here: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/54831"

While I'm sorry to hear it, this is an excellent opportunity for someone with Mediawiki skills to do some real good.

Tuesday, July 26, 2011

Can a Closed Crowdsourcing Project Succeed?

Last night, the Zooniverse folks announced their latest venture: Ancient Lives, which invites the public to help analyze the Oxyrhynchus Papyri. The transcription tool meets the high standards we now expect from the team who designed Old Weather, but the project immediately stirred some controversy because of its terms of use:


Sean is referring to this section of the copyright statement (technically, not a terms of use), which is re-displayed from the tutorial:
Images may not be copied or offloaded, and the images and their texts may not be published. All digital images of the Oxyrhynchus Papyri are © Imaging Papyri Project, University of Oxford. The papyri themselves are owned by the Egypt Exploration Society, London. All rights reserved.
Future use of the transcriptions may be hinted at a bit on the About page:
The papyri belong to the Egypt Exploration Society and their texts will eventually be published and numbered in Society's Greco-Roman Memoirs series in the volumes entitled The Oxyrhynchus Papyri.
It should be noted that the closed nature of the project is likely a side-effect of UK copyright law, not a policy decision by the Zooniverse team. In the US, a scan or transcription of a public domain work is also public domain and not subject to copyright. In the UK, however, scanning an image creates a copyright in the scan, so upstream providers are automatically able to restrict downstream use of public domain materials. In the case of federated digitization projects this can create a situation like that of the Old Bailey Online, where different pieces of a seemingly-seamless digital database are owned by entirely different institutions.

I will be very interested to see how the Ancient Lives project fares compared to GalaxyZoo's other successes. If the transcriptions are posted and accessible on their own site, users may not care about the legal ownership of the results of their labor. They've already had 100,000 characters transcribed, so perhaps these concerns are irrelevant for most volunteers.

Wednesday, July 20, 2011

Crowdsourcing and Variant Digital Editions

Writing at the JISC Digitization Blog, Alastair Dunning warns of "problems with crowdsourcing having the ability to create multiple editions."

For example, the much-lauded Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) are now beginning to appear on many different digital platforms.

ProQuest currently hold a licence that allows users to search over the entire EEBO corpus, while Gale-Cengage own the rights to ECCO.

Meanwhile, JISC Collections are planning to release a platform entitled JISC Historic Books, which makes licenced versions of EEBO and ECCO available to UK Higher Education users.

And finally, the Universities of Michigan and Oxford are heading the Text Creation Partnership (TCP), which is methodically working its way through releasing full-text versions of EEBO, ECCO and other resources. These versions are available online, and are also being harvested out to sites like 18th Century Connect.

So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.

What’s more, there have been some initial discussions about introducing crowdsourcing techniques to some of these licensed versions; allowing permitted users to transcribe and interpret the original historical documents. But of course this crowdsourcing would happen on different platforms with different communities, who may interpret and transcribe the documents in different ways. This could lead to the tricky problem of different digital versions of the corpus. Rather than there being one EEBO, several EEBOs exist.

Variant editions are indeed a worrisome prospect, but I don't think that it's unique to projects created through crowdsourcing. In fact, I think that the mechanism of producing crowdsourced editions actually reduces the possibility for variants to emerge. Dunning and I corresponded briefly over Twitter, then I wrote this comment to the JISC Digitization blog. Since that blog seems to be choking on the mark-up, I'll post my reply here:
benwbrum Reading @alastairdunning's post connecting crowdsourcing to variant editions: bit.ly/raVuzo Feel like Wikipedia solved this years ago.

benwbrum If you don't publish (i.e. copy) a "final" edition of a crowdsourced transcription, you won't have variant "final" versions.

benwbrum The wiki model allows linking to a particular version of an article. I expanded this to the whole work: link

alastairdunning But does that work with multiple providers offering restricted access to the same corpus sitting on different platforms?

alastairdunning ie, Wikipedia can trace variants cause it's all on the same platform; but there are multiple copies of EEBO in different places

benwbrum I'd argue the problem is the multiple platforms, not the crowdsourcing.

alastairdunning Yes, you're right. Tho crowdsourcing considerably amplifies the problem as the versions are likely to diverge more quickly

benwbrum You're assuming multiple platforms for both reading and editing the text? That could happen, akin to a code fork.

benwbrum Also, why would a crowd sourced edition be restricted? I don't think that model would work.

I'd like to explore this a bit more. I think that variant editions are less likely in a crowdsourced project than in a traditional edition, but efforts to treat crowdsourced editions in a traditional manner can indeed result in the situation you warn against.

When we're talking about crowdsourced editions, we're usually talking about user-generated content that is produced in collaboration with an editor or community manager. Without exception, this requires some significant technical infrastructure -- a wiki platform for transcribing free-form text, or an even more specialized tool for transcribing structured data like census records or menus. For most projects, the resulting edition is hosted on that same platform -- the Bentham wiki which displays the transcriptions for scholars to read and analyze is the same tool that volunteers use to create the transcriptions. This kind of monolithic platform does not lend itself to the kind of divergence you describe: copies of the edition become outdated as soon as they are separated from the production platform, and making a full copy of the production platform requires a major rift among the editors and volunteer community. These kinds of rifts can happen--in my world of software development, the equivalent phenomenon is a code fork--but they're very rare.

But what about projects which don't run on a monolithic platform? There are a few transcription projects in which editing is done via a wiki (Scripto) or webform (UIowa) but the transcriptions are posted to a content management system. There is indeed potential for the "published" version on the CMS to drift from the "working" version on the editing platform, but in my opinion the problem lies not in crowdsourcing, but in the attempt to impose a traditional publishing model onto a participatory project by inserting editorial review in the wrong place:

Imagine a correspondence transcription project in which volunteers make their edits on a wiki but the transcriptions are hosted on a CMS. One model I've seen often involves editors taking the transcriptions from the wiki system, reviewing and editing them, then publishing the final versions on the CMS. This is a tempting workflow -- it makes sense to most of us both because the writer/editor/reader roles are clearly defined and because the act of copying the transcription to the CMS seems analogous to publishing a text. Unfortunately, this model fosters divergence between the "published" edition and the working copy as volunteers continue to make changes to the transcriptions on the wiki, sometimes ignoring changes made by the reviewer, sometimes correcting text regardless of whether a letter has been pushed to the CMS. The alternative model has reviewers make their edits within the wiki system itself, with content pushed to the CMS automatically. In this model, the wiki is the system-of-record; the working copy is the official version. Since the CMS simply reflects the production platform, it does not diverge from it. The difficulty lies in abandoning the idea of a final version.
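The system-of-record model can be reduced to a toy sketch in which publishing is simply mirroring. The Wiki and CMS classes below are invented stand-ins, not any real tool's API; the point is that the CMS copy is always overwritten from the latest wiki revision and never edited in place, so it cannot diverge.

```python
class Wiki:
    """System of record: every edit becomes a new revision."""
    def __init__(self):
        self.revisions = {}          # page -> list of versions

    def save(self, page, text):
        self.revisions.setdefault(page, []).append(text)

    def latest(self, page):
        return self.revisions[page][-1]

class CMS:
    """Read-only mirror: holds only what was last pushed to it."""
    def __init__(self):
        self.published = {}          # page -> displayed text

def sync(wiki, cms, page):
    """Publish by mirroring: the CMS simply reflects the wiki."""
    cms.published[page] = wiki.latest(page)
```

Because sync() is the only path into the CMS, a reviewer's correction made on the wiki reaches readers on the next push, and there is no "final version" for the working copy to drift away from.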

It's not at all clear to me how EEBO or ECCO are examples of crowdsourcing rather than traditional restricted-access databases created and distributed through traditional means, so I'm not sure they're good examples.