Wednesday, April 11, 2012

Crowdsourced Transcription Tool List

When I first started this blog, I spent a lot of time writing detailed reviews of different transcription projects.  This has become difficult as my available time shrinks and the number of crowdsourcing projects grows.  So when Kate Bowers posted to the Society of American Archivists mailing list asking for a directory of transcription tools, I figured it was time to take a different approach.

http://tinyurl.com/TranscriptionToolGDoc
 
The link above is a Google Documents spreadsheet listing different tools and the features I thought were relevant. It's been updated several times over the last few weeks, and I'm pleased to see that it's expanded to include a score of technologies. I hope it's useful.

Wednesday, April 4, 2012

French Departmental Archive on Wikisource

While the transcription world buzzes with news of the release of the 1940 US census and the crowdsourced transcription projects that surround it, I'd like to draw your attention to a blog post published last week on La Tribune des Archives: "Edition collaborative de manuscrits sur Wikisource : 1er retour d'expérience".  The post covers the efforts of the archives of the department of Alpes-Maritimes to transcribe 17th- and 18th-century records of episcopal visits to the communes in the diocese.  These records are rich sources on local history, but "readers struggle over the chicken-scratch, and the collection is too large to be edited by a single person."  The archive has used Wikisource.fr to transcribe these manuscripts with great success, so I'd like to quote and translate extensive portions of their post.

Why Wikisource?

It's already there! (No software to create, maintain, administer, no specs -- just a strong will and a core of 2-6 people).
It offers features designed for manuscript editions requiring more than one editor.
Particularly useful functions (aside from the collaborative aspect):
  • Side-by-side display of facsimile and transcription
  • Workflow indicating whether a page is transcribed, corrected, or validated by two administrators.
  • The visualization is very practical for motivating the community of transcribers.
  • Version history control and the ability to comment or discuss difficult issues.
  • Wikisource's high Google page rank.

The article continues to describe the factors they weighed when choosing material for the project (accessibility of the script and local interest, among others), how they got started (the standard GLAMWiki approach), then continues to the community management aspects I find so fascinating:

How do you motivate your paleographers?

 In our experience, transcribers are essentially former university students and internally-trained archivists who want to extend their education (either by making further progress or by avoiding becoming rusty).
Work times and rest times clearly defined in advance.
A regular, fixed-date schedule defined in advance (for example, one month: upload on the 15th and correction every last day of the month) helps the group to make progress and to break up its efforts with relaxation periods (for the eyes, the editors, and the correctors) and lets everyone have rapid feedback (new pages are in fact corrected practically every night).

Findings on the behavior of "students" on Wikisource

The first exercises attracted the kind help and support of Wikisource regulars and administrators (Adrienne Alix, SereinWMfr, Pyb, Hsarrazin), a few new registered paleographers (Cavalié, LINCK, Braxmeyer, Gustave) and some anonymous IPs.  One or two correctors can easily suffice to keep track of the work of 5-10 "students".  Contrary to homework done in class, the "students" apply themselves regularly to the task, and the size and number of contributions does not increase on the night before the deadline.
Writings dating from before 1660 receive fewer volunteers but could very well serve as university exercises graded online (at the rate of one page per student).
For more on the archive's efforts (including their similar outreach on Flickr), take a look at the departmental archive news page.

Saturday, March 17, 2012

Crowdsourcing at IMLS WebWise 2012

The video of the crowdsourcing panel at IMLS WebWise is online, so I thought I'd post my talk.  Like anyone who's created a transcript of their own unscripted remarks, I recommend watching the video. (My bit starts at 6:00, though all the speakers were excellent. Full-screening will hide the slides.)  Nevertheless, I've added hyperlinks to the transcript and interpolated the slides with my comments below.
Okay. I'd like to talk about some of the lessons that have come out of my collaborations with small crowdsourcing projects. We hear a lot about these large projects like GalaxyZoo, like Transcribe Bentham. What can small institutions and small projects do, and do the rules that seem to apply to large projects also apply to them?
So there are three projects that I'm drawing from here in this experience. The first one I'm going to talk about in a little bit is one that was run by Balboa Park Online Collaborative. It's a Klauber Field Notes Transcription Project--the field notes of Laurence M. Klauber, who was the nation's foremost authority on rattlesnakes. These are field notes that he kept from 1923 through 1967. This is done by the San Diego Natural History Museum and is run by our own Perian Sully who is out there in the room somewhere.

The next project I want to talk about is the Diary of Zenas Matthews. Zenas Matthews was a volunteer from Texas who served in the American forces in the US-Mexican War of 1846, and his diary is held by Southwestern University (see Southwestern's article on the Zenas Matthews diary project). This had been digitized for a previous researcher and is small, but Southwestern itself is also quite small.
The third project I want to talk about is actually the origin of the software, which is the Julia Brumfield Diaries. If the name looks familiar, it's because she's my great-great grandmother. This project was the impetus for me to develop this tool for crowdsourced transcription.

So all of these projects, what they have in common is that we're talking about page counts that are in the thousands and volunteer counts that are numbered in the dozens at best. So these are not FamilySearch Indexing, where you can rely on hundreds of thousands of volunteers and large networks.

So who participates in large projects and who participates in small projects? One thing that I think is really interesting about crowdsourcing and these other sorts of participatory online communities is that the ratio of contributions to users follows what's called a power-law distribution.
If you look here, we see--and most famously this is Wikipedia--and you see a chart of the number of users on Wikipedia ranked by their contributions. And what you see is that 90% of the edits made to Wikipedia are done by 10% of the users.
If we look at other crowdsourced projects: this is the North American Bird Phenology Program out of Patuxent Bay Research Center [ed: actually Patuxent Wildlife Research Center], and this is a project in which volunteers are transcribing ornithology records--basically bird-watching records--that were sent in from the 1870s through the 1950s [ed: 1880s-1970s], entering them into a database where they can be mined for climate change [data]. What's interesting about this to me at least is that--and this has been a phenomenally successful project: they've got 560,000 cards transcribed all by volunteers, but StellaW@Maine here has transcribed 126,000 of them, which is 22% of them. Now, CharlotteC@Maryland is close behind her (so go, local team!) but again you see the same kind of curve.
 If we look at another relatively large project, the Transcribe Bentham project: this isn't a graph, but if you look at the numbers here, you see the same kind of thing. You see Diane with 78,000 points, you see Ben Pokowski with 51,000 points. You see this curve sort of taper down into more of a long tail.
 So what about the small projects?
Well, let's look at the Klauber diaries. This is the top ten transcribers of the field notes of Laurence Klauber. And if you look at the numbers here--again in this case it's not quite as pronounced because I think the previous leader has dropped out and other people have overtaken him--but you see the same kind of distribution. This is not a linear progression; this is more of a power-law distribution.
If you look at an even smaller project--now, mind you this is a project that is really only of interest to members of my family and elderly neighbors of the diarist--but look: We've got Linda Tucker who has transcribed 713 of these pages followed by me and a few other people. But again, you have this power law that the majority of the work is being done by a very small group of people.
Okay, what's going on really? What does this mean and why does it matter? The thing that I think this gets to, the reason that I think that this is important, is for a couple of reasons.
One is that this kind of behavior addresses one of the main objections to crowdsourcing. Now there are a lot of valid objections to crowdsourcing; I think that there are also a few invalid objections and one of them is essentially the idea that members of the public cannot participate in scholarly projects because my next door neighbor is neither capable nor interested in participating in scholarly projects. And we see this all over the place. I mean, here's a few example quotes--and I'm not going to read them out. I believe that this objection (which I have heard a number of times; I mean we see some examples right here) is a non sequitur. And I believe that the power-law distribution proves that it's a non sequitur. Really, I saw this most egregiously framed by a scholar who was passionately--just absolutely decrying--the idea that classical music fans would be able to competently translate from German into English because, he said, "After all, 40% of South Carolina voted for Newt Gingrich." Okay.
All right, so what's going on is I think best summed up by Rachel Stone, and what she essentially said is that crowdsourcing isn't getting the sort of random distribution from the crowd. Crowdsourcing is getting a number of "well-informed enthusiasts."
So where do we find well-informed enthusiasts to do this work and to do it well? Big projects have an advantage, right? They have marketing budgets. They have press coverage. They have an existing user base.
If you ask the people at the Transcribe Bentham project how they got their users, they'll say "Well, you know that New York Times article really helped." That's cool! All right.
The GalaxyZoo people--Citizen Science Alliance--yesterday, 24 hours ago, announced a new project, SETILive. Now what this does is it pulls in live data from the SETI satellites[sic: actually telescope], and in those 24 hours--I took this screenshot; I actually skipped lunch to get this one screenshot because I knew that it would pass 10,000 people participating with 80,000 of these classifications. And it would have been higher, except last night the telescope got covered by cloud cover. So they dropped from getting 30 to 40 contributions per second to having to show sort of archival data and getting only 10 contributions per second. Well, they can do this because they have an existing base of active volunteers that numbers around 600,000.
So how do WE do that? How do we find well-informed enthusiasts? This is something that Kathryn Stallard and Anne Veerkamp-Andersen at Southwestern University Special Collections and I discussed a lot when we were trying to launch the Zenas Matthews Diary. We said, "Well, we don't have any budget at all." Kathryn said, "Well, let's talk about local archival newsletters. Let's post to H-Net lists." I was in favor of looking at online communities of people who might be doing Matthews genealogy or the military history war-gamers who have discussion forums on the Mexican War.
While we're arguing about this, Kathryn gets an email from a patron saying, "Hey, I'm a member of an organization. We see that you have this document. It relates to the Battle of San Jacinto and the Texas Revolution of 1836. Can you send this to us?"
She responds saying, "Hey, 1846, great. Check out this diary we just put online. I think that's what you're talking about."

Well that wasn't actually what he was talking about, but he responds and says, "Yeah, okay, I'll check that out, but can you please give me the document I want." They get it back to him and we returned to our discussion of "Okay, what do we need to do to roll this out? We're going to start working on the information architecture. We're going to work on the UI. We're going to work on help screens."  And while we're having this conversation, Mr. Patrick checks it out.
And Scott Patrick starts transcribing.
And he starts transcribing some more.
  And he continues transcribing.
 And at this point, we're talking about working on the wording of the help screens, the wording of our announcement trying to attract volunteers, and this is page 43 of the 43-page diary!
And while we're discussing this, he goes back and he starts adding footnotes. Look at this: he's identifying the people who are in this, saying, "Hey, this guy who is mentioned is -- here's what his later life was. This other guy--hey, he's my first cousin, by the way, but he also left the governorship of the State of Texas to fight in this war."
He sees--and believe me, in the actual original diary, Piloncillo is not spelled Piloncillo. I mean it is a -- Zenas Matthews does not know Spanish, right? He identifies this! He identifies and looks up works that are mentioned here.

So wow! All right! We got our well-informed enthusiast! In 14 days, he transcribed the diary, and he didn't do just one pass. I mean as he got familiar with the hand, he goes back and revises the earlier transcriptions. He kind of figures out who's involved. He asks other members of his heritage organization what this is. He adds two dozen footnotes.

What just happened? What was that about? Who is this guy? Well, Scott Patrick is a retired petroleum worker who got interested in his family history, and then got interested in local history, and then got interested in heritage organizations. And he is our ideal "well-informed enthusiast".
So how did we find him? The project isn't public yet, right? Our challenge now is rephrasing our public announcement. We're now looking for volunteers to ... something that adequately describes what's left to do. Well, let's go back and take a look at this original letter, right? What we did is, we responded to an inquiry from a patron--and not an in-person patron: this is someone who lives 200 miles away from Georgetown, Texas.

What you have when someone is coming in and asking about material is, if you think about this in terms of target marketing--this is a target-rich environment. Here is someone who is interested. He's online. He's researching this particular subject. He is not an existing patron; he has no prior relationship with Southwestern University Libraries, but "Hey, while we answer your request, you might check this thing out that's in this related field." That seems to have worked in this one case. Hopefully, we'll get some more experience with future projects.

Okay, so how do we motivate volunteers? More importantly, how do we avoid de-motivating them?
Big projects, a lot of times they have a lot of interesting game-like features. Some of them actually are games. You have leader boards, you have badges, you have ways of making the experience more immersive.
OldWeather, which is run by GalaxyZoo, will plot your ship on a Google map as you transcribe the latitude and longitude elements from the log books.
The National Library of Finland has partnered with Microtask to actually create a crowdsourcing game of Whac-A-Mole. So this is crowdsourcing taken to the extreme.
But there's a peril here, and the peril is that all of these things are extrinsic motivators.
And we ran into this with the Klauber diaries. Perian came to me and said, "Hey, let's come up with a stats page, because we want to track where the diaries are at." So we come up with the stats page -- pretty basic, here's where some of these are at.

And hey, while we're at it, let's mine our data. We can come up with a couple of top-10 lists. So we come up with the top-ten list of transcribers and a top-ten list of editors, because that's the data I have.

Well remember, the whole point of this exercise is to index these diaries so that we can find the mentions of these individual species in the original manuscripts. Do you see indexing on here anywhere? Neither did our volunteers, and the minute this went up, the volunteers who previously had been transcribing and indexing every single page stopped indexing completely. They weren't being measured on it. We weren't saying that we rewarded them for it, so they stopped.
Needless to say, our next big rush change was a top-ten indexers list.
So this gets to the "crowding-out" theory of motivation, and the expert on this is a researcher in the UK named Alexandra Eveleigh. Her point is that if you're going to design any kind of extrinsic motivation, you have to make sure that it promotes the actual contributory behavior, and this is something that applies, I believe, to small projects as well as large projects.
So I have 13 seconds left, so thank you, and I'll just end on that note.

Thursday, March 8, 2012

Jumping In With Both Feet


Although I didn't know it at the time, since I began work on FromThePage in 2005 I've had one toe in the digital humanities community.  I've worked on FromThePage and I've blogged about crowdsourced manuscript transcription.  I've met some smart, friendly people doing fascinating things and I've even taught some of them the magic of regular expressions.  But I've always tried to squeeze this work into my "spare time" -- the interstices in the daily life of an involved father and a professional software engineer working a demanding but rewarding job.  As the demands of vocation and avocation increase; as disparate duties begin to compete with each other; as new babies come into my home while new technologies come into my workplace and new requests for FromThePage arrive in my inbox, the basement inventor model becomes increasingly untenable.  The numbers don't lie: I've only checked in code on four days during the last six months.

In January I was offered an incredible opportunity.  Chris Lintott invited me to the Adler Planetarium to meet the Citizen Science Alliance's dev team.  This talented, generous team of astronomer-developers gave me a behind-the-scenes tour of their Scribe tool--early versions of which powered OldWeather.org--and I was blown away.  I don't think I've ever been so excited about a technology, and my mind raced with ideas for projects using it.  Serendipitously, two days later I received email from Ben Laurie asking if I'd like to implement Scribe for the FreeREG project, a part of the FreeBMD genealogy charity that is transcribing parish registers recording the baptisms, marriages, and burials in England and Wales from 1538-1835.  All development would be released open source, and all data would be as open as possible.  It's a dream project for someone with my interests; there was no way I could pass this up.

So as of March 18 I'm starting a new career as an independent digital history developer.  It is heartbreaking to leave my friends at Convio after nearly a dozen years, but I'm delighted with the possibilities my new autonomy offers. I hope to specialize in projects relating to crowdsourcing and/or manuscript transcription, but to be honest I'm not sure where this path will lead.   Of course I plan to devote more time to FromThePage -- this year should finally see the publish-on-demand integration I've always been wishing for, as well as a few other features people have requested.  If you've got a project that seems appropriate--whether it involves genealogy or herpetology, agricultural history or textile history--drop me a line.


Monday, March 5, 2012

Quality Control for Crowdsourced Transcription

Whenever I talk about crowdsourced transcription--actually whenever I talk about crowdsourced anything--the first question people ask is about accuracy. Nobody trusts the public to add to an institution's data/meta-data, nor especially to correct it. However, quality control over data entry is a well-explored problem, and while I'm not familiar with the literature from industry regarding commercial approaches, I'd like to offer the systems I've seen implemented in the kinds of volunteer transcription projects I follow. (Note: the terminology is my own, and may be non-standard.)
  1. Single-track methods (mainly employed with large, prosy text that is difficult to compare against independent transcriptions of the same text). In these methods, all changes and corrections are made to a single transcription which originated with a volunteer and is modified thereafter.  There is no parallel/alternate transcription to compare against.
    1. Open-ended community revision: This is the method that Wikipedia uses, and it's the strategy I've followed in FromThePage. In this method, users may continue to change the text of a transcription forever. Because all changes are logged--with a pointer of some sort to the user who logged them--vandalism or edits which are not in good faith may be reverted to a known-good state easily. This is in keeping with the digital humanities principle of "no final version." In my own projects, I've seen edits made to a transcription years after the initial version, and those changes were indeed correct. (Who knew that "drugget" was a coarse fabric used for covering tobacco plant-beds?) Furthermore, I believe that there is no reason other than the cost of implementation why any of the methods below which operate from the "final version" mind-set should not allow error reports against their "published" form.
    2. Fixed-term community revision: Early versions of both Transcribe Bentham and Scripto followed this model, and while I'm not sure if either of them still do, it does seem to appeal to traditional documentary editing projects that are incorporating crowdsourcing as a valuable initial input to a project while wishing to retain ultimate control over the "final version". In this model, wiki-like systems are used to gather the initial data, with periodic review by experts. Once a transcription reaches an acceptable status (deemed so by the experts), it is locked to further community edits and the transcription is "published" to a more traditional medium like a CMS or a print edition.
    3. Community-controlled revision work-flows: This model is a cross between the two above-mentioned methods. Like fixed-term revision, it embraces the concept of a "final version", after which the text may not be modified. Unlike fixed-term revision, there are no experts involved here -- rather the tool itself forces a text to go through an edit/review/proofread/reject-approve workflow by the community, after which the version is locked for future edits. As far as I'm aware, this is only implemented by the ProofreadPage plugin to MediaWiki that has been used by Wikisource for the past few years, but it seems quite effective.
    4. Transcription with "known-bad" insertions before proofreading: This is a two-phase process, which to my knowledge has only been tried by the Written Rummage project as described in Code4Lib issue 15. In the first phase, an initial transcription is solicited from the crowd (which in their case is a Mechanical Turk workforce willing to transcribe 19th-century diaries for around eight cents per page). In the second phase, the crowd is asked to review the initial transcription against the original image, proofreading and correcting the first transcription. In order to make sure that a review is effective, however, extra words/characters are added to the data before it is presented to the proofreader, and the location within the text of these known-bad insertions is recorded. The resulting corrected transcription is then programmatically searched for the bad data which had been inserted, and if it has been removed the system assumes that any other errors have also been removed -- or at least that a good-faith effort has been made to proofread and correct the transcript. (A minimal sketch of this check appears after this list.)
    5. Single-keying with expert review: In this methodology, once a single volunteer contribution is made, it is reviewed by an expert and either approved or rejected. The expert is necessarily authorized in some sense -- in the case of the UIowa Civil War Diaries, the review is done by the library staff member processing the mailto form contribution, while in the case of FreeREG the expert is a "syndicate manager" -- a particular kind of volunteer within the FreeBMD charity. (FreeREG may be unique in using a single-track method for small, structured records, however it demands more paleographic and linguistic expertise from its volunteers than any other project I'm aware of.) If a transcription is rejected, it may be either returned to the submitter for correction or corrected by the expert and published in corrected form.
  2. Multi-track methods (mainly employed with easily-isolated, structured records like census entries or ship's log books). In all of these cases, the same image is presented to different users to be transcribed from scratch. The data thus collected is compared programmatically on the assumption that two correct transcriptions will agree with each other and may be assumed to be valid. If the two transcriptions disagree with each other, however, one of them must be in error, so some kind of programmatic or human expert intervention is needed. It should be noted that all of these methodologies are technically "blind" n-way keying, as the volunteers are unaware of each other's contributions and do not know whether they are interpreting the data for the first time or contributing a duplicate entry. (A sketch of the basic voting logic appears after this list.)
    1. Triple-keying with voting: This is the method that the Zooniverse OldWeather team uses. Originally the OldWeather team collected the same information in ten different independent tracks, entered by users who were unaware of each other's contributions: blind, ten-way keying. The assumption was that majority reading would be the correct one, so essentially this is a voting system. After some analysis it was determined that the quality of three-way keying was indistinguishable from that of ten-way keying, so the system was modified to a less-skeptical algorithm, saving volunteer effort. If I understand correctly, the same kind of voting methodology is used by ReCAPTCHA for its OCR correction, which allowed its exploitation by 4chan.
    2. Double-keying with expert reconciliation: In this system, the same entry is shown to two different volunteers, and if their submissions do not agree it is passed to an expert for reconciliation. This requires a second level of correction software capable of displaying the original image along with both submitted transcriptions. If I recall my fellow panelist David Klevan's WebWise presentation correctly, this system is used by the Holocaust Museum for one of their crowdsourcing projects.
    3. Double-keying with emergent community-expert reconciliation: This method is almost identical to the previous one, with one important exception. The experts who reconcile divergent transcriptions are themselves volunteers -- volunteers who have been promoted from transcribers to reconcilers through an algorithm. If a user has submitted a certain (large) number of transcriptions, and if those transcriptions have either 1) matched their counterpart's submission, or 2) been deemed correct by the reconciler when they are in conflict with their counterpart's transcription, then the user is automatically promoted. After promotion, they are able to choose their volunteer activity from either the queue of images to be transcribed or the queue of conflicting transcriptions to be reconciled. This is the system used by FamilySearch Indexing, and its emergent nature makes it a particularly scalable solution for quality control. (The promotion rule is sketched after this list as well.)
    4. Double-keying with N-keyed run-off votes: Nobody actually does this that I'm aware of, but I think it might be cost-effective. If the initial two volunteer submissions don't agree, rather than submit the argument to an expert, re-queue the transcription to new volunteers. I'm not sure what the right number is here -- perhaps only a single tie-breaker vote, but perhaps three new volunteers to provide an overwhelming consensus against the original readings. If this is indecisive, why not re-submit the transcription again to an even larger group? Obviously this requires some limits, or else the whole thing could spiral into an infinite loop in which your entire pool of volunteers are arguing with each other about the reading of a single entry that is truly indecipherable. However, I think it has some promise as it may have the same scalability benefits as the previous method without needing the complex promotion algorithm nor the reconciliation UI.
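To make the "known-bad insertion" check from the single-track list above concrete, here is a minimal sketch in Python. The seeding strategy, the nonsense tokens, and the accept/reject rule are my own assumptions for illustration, not details taken from the Written Rummage paper.

```python
import random

def seed_known_bad(transcript, bad_tokens):
    """Insert known-bad tokens at random positions and remember what was inserted."""
    words = transcript.split()
    for token in bad_tokens:
        words.insert(random.randint(0, len(words)), token)
    return " ".join(words), list(bad_tokens)

def proofread_accepted(corrected_transcript, inserted_tokens):
    """Accept the proofreading pass only if every seeded token was removed."""
    return all(token not in corrected_transcript for token in inserted_tokens)

# Seed two nonsense tokens, show the volunteer the seeded text, then check their correction.
seeded_text, tokens = seed_known_bad("went to town for tobacco twine", ["zqx1", "vbn7"])
print(seeded_text)                                                   # e.g. "went zqx1 to town ..."
print(proofread_accepted("went to town for tobacco twine", tokens))  # True: both tokens removed
```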
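Here is a similar sketch of the multi-track voting logic: collect n blind transcriptions of the same field, accept a reading that reaches a majority, and otherwise either re-queue the field for more keying (the run-off idea) or escalate it to a reconciler. The thresholds and function names are mine, not anything published by OldWeather or FamilySearch.

```python
from collections import Counter

def reconcile(transcriptions, min_keys=2, majority=2):
    """Blind n-way keying: accept the majority reading, else signal what to do next.

    transcriptions: independent readings of the same field.
    Returns (status, value), where status is 'accepted', 'needs_more_keys',
    or 'needs_expert'.
    """
    if len(transcriptions) < min_keys:
        return ("needs_more_keys", None)
    reading, votes = Counter(transcriptions).most_common(1)[0]
    if votes >= majority:
        return ("accepted", reading)
    # No majority yet: re-queue to more volunteers, or give up and call in a reconciler.
    if len(transcriptions) < 5:          # arbitrary cap before abandoning the run-off
        return ("needs_more_keys", None)
    return ("needs_expert", None)

print(reconcile(["Smith", "Smith"]))           # ('accepted', 'Smith')
print(reconcile(["Smith", "Smyth"]))           # ('needs_more_keys', None) -- run-off vote
print(reconcile(["Smith", "Smyth", "Smith"]))  # ('accepted', 'Smith')
```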
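And the emergent-promotion rule can be sketched in a few lines as well. FamilySearch hasn't published its actual algorithm, so the thresholds below are invented placeholders.

```python
def eligible_for_promotion(submitted, agreed_or_upheld,
                           min_submissions=500, min_agreement=0.95):
    """Promote a transcriber to reconciler once they have submitted enough work
    with a high enough rate of readings that matched their counterpart or were
    upheld by a reconciler. Both thresholds are invented placeholders."""
    if submitted < min_submissions:
        return False
    return (agreed_or_upheld / submitted) >= min_agreement

print(eligible_for_promotion(submitted=800, agreed_or_upheld=780))  # True
print(eligible_for_promotion(submitted=200, agreed_or_upheld=200))  # False: not enough work yet
```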
Caveats: Some things are simply not knowable. It is hard to evaluate the effectiveness of quality control seriously without taking into account the possibility that volunteer contributors may be correct and experts may be wrong, or, more importantly, that some images are simply illegible regardless of the paleographic expertise of the transcriber. The Zooniverse team is now exploring ways for volunteers to correct errors made not by transcribers but rather by the midshipmen of the watch who recorded the original entries a century ago. They realize that a mistaken "E" for "W" in a longitude record may be more amenable to correction than a truly illegible entry. Not all errors are made by the "crowd", after all.

Much of this list is based on observation of working sites and extrapolation, rather than any inside information. I welcome corrections and additions in the comments or at benwbrum@gmail.com.

[Update 2012-03-07: Folks from the Transcribe Bentham informed me on Twitter that "In general, at the moment most transcripts are worked on by one volunteer, checked and then locked. Vols seem to prefer working on fresh MSS to part transcribed." and "For the record, does still use 'Fixed-term community revision'. There are weekly updates on the blog."  Thanks, Tim and Justin!]

Sunday, March 4, 2012

We Get Press!

Crowdsourced transcription projects--and FromThePage in particular--have gotten some really nice press in the last few weeks.

Konrad Lawson posted an excellent review of Scripto and FromThePage on the ProfHacker blog at The Chronicle of Higher Education: Crowdsourcing Transcription: FromThePage and Scripto.

Francine Diep wrote a great article on the phenomenon at Innovation News Daily: Volunteer Transcribers Put Millions of Pages Online.

Ellen Davis's article on Southwestern's transcription of Zenas Matthews's 1846 Mexican War Diary is especially notable because it includes an interview with Scott Patrick, the volunteer who has done such a spectacular job: Collaborative Transcription Project.
 

Wednesday, January 18, 2012

A Developer Goes to AHA2012

Last Sunday I returned from the 2012 meeting of the American Historical Association.  Although I have attended my share of conferences and unconferences--from Lone Star Ruby Con to Dreamforce and Texas State Historical Association to Museum Computer Network--I'd never attended one of the big mid-year academic conferences before.  The experience was strange but fruitful, and I hope I'll be able to attend again.

Let me start with my superficial impressions.  First, historians dress much better than developers do, though they really don't hold a candle to the art gallery folks.  They also are a more reactive audience, although there is very little back-channel conversation on Twitter -- in fact I was informed that typing on laptops would be considered rude! Finally, they are pretty introverted -- more likely to strike up a conversation with a stranger than your average Rubyist, but not by much.

The conference itself is a bit warped by the fact that many of the attendees are there for the sole purpose of conducting job interviews.  This apparently involves a days-long series of thirty-to-ninety minute interviews designed to figure out which candidates to invite to campus for an on-site interview -- a grueling process for the interviewers and an expensive one for the interviewees.  (The analogous activity in the software world is the phone screen, in which a hiring manager discusses experience and skill-set with a candidate.  Over the phone.)  If most attendees are interviewing,  they aren't actually participating in the conference -- I was told that around 12,000 people register, but only 5000 attend.  This gives AHA a kind of Potemkin village flavor, and it's not unusual to see a panel lecturing to a nearly empty room.  In fact, the last session I attended had five speakers on the podium and only three people in the audience.

Nevertheless, AHA2012 and the associated THATCamp were tremendously productive for me.  There were several opportunities for collaboration, so while I didn't find my dream partner for FromThePage--that institution with a staff of front-end experts and a burning need for transcription software--I did have some really good conversations.  I've been trying to add better support for letters to FromThePage, and Jean Bauer gave me a detailed walk-through of the Project Quincy data model for correspondence.  A lot of people were interested in starting their own crowdsourcing projects, and we've been swapping emails since.  Most importantly, while I was in town I met with the development team behind Scribe and Talk, the open-source tools that power Citizen Science Alliance projects like OldWeather and AncientLives. I'll be posting about that separately.

One of the things that impressed me most about the history world was the potential there is for a programmer to make a big impact. The graduate student I roomed with was an expert with regular expressions--his texts were in Arabic, so the RTL/LTR mix required him to close his eyes as he composed his patterns--but he had no experience with elementary scripting.  In one two-hour hack session, we were able to split a three-hundred-thousand-line medieval biographical dictionary into twenty thousand small files representing individual entries.  With a couple more hours' work, we'd have been able to extract dates, places, names, and other data from these files.  It is a delight for a software engineer to work in a domain where such minimal effort can make such a difference: most of our work deals with obscure edge cases of hard/boring problems, so removing months of tedious manual labor with an hour's worth of programming is incredibly rewarding.
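For the curious, the script in question was nothing fancier than something like the sketch below, which assumes each entry in the source file begins with a recognizable header line; the pattern and filenames here are made up for illustration.

```python
import re

# Split one large text file into one small file per entry.
# Assumes every entry starts with a line matching ENTRY_START; adjust to the real source.
ENTRY_START = re.compile(r"^ENTRY\b")   # hypothetical entry-header pattern

entry_num = 0
out = None
with open("dictionary.txt", encoding="utf-8") as source:
    for line in source:
        if ENTRY_START.match(line):
            if out:
                out.close()
            entry_num += 1
            out = open(f"entry_{entry_num:05d}.txt", "w", encoding="utf-8")
        if out:
            out.write(line)
if out:
    out.close()
```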

Crowdsourcing History: Collaborative Transcription and Archives, the panel I presented at, seemed to go well.  Moderator Shane Landrum invited the audience to give 3-minute presentations on their own crowdsourcing projects after the presenters finished their 8-minute talks, then he opened the floor for questions.  Although I was skeptical about this format, it worked very well indeed.  In particular, the Q/A period was blessedly free of the self-promoters who plague events like South by Southwest.  Perhaps this can be attributed to the novel format or perhaps it was due to the inherent civility of academic historians -- all I know is that it succeeded.  I felt very fortunate to be among the panelists, who were a Who's Who of manuscript transcription tools, although a couple prominent projects were not represented because they were too recent to be included in the proposal.  Because the context was already set by my fellow panelists and because the time was so constrained, I decided to concentrate my own talk on one feature of FromThePage: subject indexing through wiki-links.  An abbreviated recap of the presentation is embedded below:


On the whole, I think I'd like to go back to the AHA meeting. The conversations and collaborations made the trip worth the expense, and it was gratifying to finally meet the people behind the big transcription projects face-to-face.  I even managed to learn some fascinating stuff about American history.

Wednesday, December 7, 2011

Developments in Wikisource/ProofreadPage for Transcription

Last year I reviewed Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the en.wikisource.org policy prohibiting any text not already published on paper ruled out its use for manuscripts.

I'm pleased to report that this policy has been softened. About a month ago, NARA started to partner with the Wikimedia Foundation to host material—including manuscripts—on Wikisource.  While I was at MCN, I discussed this with Katie Filbert, the president of Wikimedia DC, who set me straight.  Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.

Once you host a project on Wikisource, what do you do with it?  Andie, Rob and Gaurav over at the blog So You Think You Can Digitize?—and it's worth your time to read at least the last six posts—have been writing on exactly that subject.  Their most recent post describes their experience with Junius Henderson's Field Notes, and although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the Wikisource software, I'd like to concentrate on a detail:
What we currently want is a no-cost, minimal effort system that will make scans AND transcriptions AND annotations available, and that can facilitate text mining of the transcriptions.  Do we have that in WikiSource?  We will see.  More on annotations to follow in our next post but some father to a sister of some thoughts are already percolating and we have even implemented some rudimentary examples.
This is really exciting stuff.  They're experimenting with wiki mark-up of the transcriptions with the goal of annotation and text-mining.  I tried to do this back in 2005, but abandoned the effort because I never could figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that presented manuscript pages and their transcribed text.  The lack of wiki-linking was also the criticism of mine most taken to heart by the German Wikisource community last October.

So how is the mark-up working out?  Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.  So the text "English sparrows" in the transcription is annotated [[:w:Passer domesticus|English sparrows]], which is wiki-speak for Link the text "English sparrows" to the Wikipedia article "Passer domesticus". Wikipedia's redirects then send the browser off to the article "House Sparrow".

So far so good.  The only complaint I can make is that—so far as I can tell—cross-wiki links don't appear in the "What links here" tool on Wikipedia, neither for Passer domesticus, nor for House Sparrow.  This means that the annotation can't provide an indexing function, so users can't see all the pages that reference possums, nor read a selection of those pages.  I'm not sure that the cross-wiki link data isn't tracked, however — just that I can't see it in the UI.  Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the "Global file usage" section of the sparrow image, for example.  Perhaps there is an API somewhere that the Henderson Field Note project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.
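If such an API exists, I'd expect it to look like MediaWiki's standard query interface. Here's a sketch that asks a Wikisource page for its interwiki links via the prop=iwlinks module -- the page title is hypothetical, and I haven't verified that this module captures every cross-wiki annotation, so treat it as an educated guess rather than a recipe.

```python
import requests

# Ask a Wikisource page which interwiki links (e.g. [[:w:Passer domesticus|...]]) it contains.
# Endpoint and parameters follow the standard MediaWiki API; coverage is unverified.
API = "https://en.wikisource.org/w/api.php"
params = {
    "action": "query",
    "titles": "Page:Field Notes of Junius Henderson, Book 1.djvu/3",  # hypothetical title
    "prop": "iwlinks",
    "iwlimit": "max",
    "format": "json",
}
data = requests.get(API, params=params).json()
for page in data["query"]["pages"].values():
    for link in page.get("iwlinks", []):
        print(link["prefix"], link.get("*"))   # e.g. "w Passer domesticus"
```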

Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.  I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!

Friday, November 18, 2011

Crowdsourcing Transcription at MCN 2011

These are links to the papers, websites, and systems mentioned in my presentation at the Museum Computer Network 2011 conference.

Friday, August 5, 2011

Programmers: Wikisource Needs You!

Wikisource is powered by a MediaWiki extension which allows page images to be displayed beside the wiki editing form. This extension also handles editorial workflow by allowing pages, chapters, and books to be marked as unedited, partially edited, in need of review, or finished. It's a fine system, and while the policy of the English language Wikisource community prevents it from being used for manuscript transcription, there are active manuscript projects using the software in other communities.

Yesterday, Mark Hershberger wrote this in a comment: For what its worth the extension used by WikiSource, ProofreadPage, now needs a maintainer. I posted about this here: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/54831

While I'm sorry to hear it, this is an excellent opportunity for someone with Mediawiki skills to do some real good.

Tuesday, July 26, 2011

Can a Closed Crowdsourcing Project Succeed?

Last night, the Zooniverse folks announced their latest venture: Ancient Lives, which invites the public to help analyze the Oxyrhynchus Papyri. The transcription tool meets the high standards we now expect from the team who designed Old Weather, but the project immediately stirred some controversy because of its terms of use:


Sean is referring to this section of the copyright statement (technically, not a terms of use), which is re-displayed from the tutorial:
Images may not be copied or offloaded, and the images and their texts may not be published. All digital images of the Oxyrhynchus Papyri are © Imaging Papyri Project, University of Oxford. The papyri themselves are owned by the Egypt Exploration Society, London. All rights reserved.
Future use of the transcriptions may be hinted at a bit on the About page:
The papyri belong to the Egypt Exploration Society and their texts will eventually be published and numbered in Society's Greco-Roman Memoirs series in the volumes entitled The Oxyrhynchus Papyri.
It should be noted that the closed nature of the project is likely a side-effect of UK copyright law, not a policy decision by the Zooniverse team. In the US, a scan or transcription of a public domain work is also public domain and not subject to copyright. In the UK, however, scanning an image creates a copyright in the scan, so the up-stream providers automatically are able to restrict down-stream use of public domain materials. In the case of federated digitization projects this can create a situation like that of the Old Bailey Online, where different pieces of a seemingly-seamless digital database are owned by entirely different institutions.

I will be very interested to see how the Ancient Lives project fares compared to GalaxyZoo's other successes. If the transcriptions are posted and accessible on their own site, users may not care about the legal ownership of the results of their labor. They've already had 100,000 characters transcribed, so perhaps these concerns are irrelevant for most volunteers.

Wednesday, July 20, 2011

Crowdsourcing and Variant Digital Editions

Writing at the JISC Digitization Blog, Alastair Dunning warns of "problems with crowdsourcing having the ability to create multiple editions."

For example, the much-lauded Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) are now beginning to appear on many different digital platforms.

ProQuest currently hold a licence that allows users to search over the entire EEBO corpus, while Gale-Cengage own the rights to ECCO.

Meanwhile, JISC Collections are planning to release a platform entitled JISC Historic Books, which makes licenced versions of EEBO and ECCO available to UK Higher Education users.

And finally, the Universities of Michigan and Oxford are heading the Text Creation Partnership (TCP), which is methodically working its way through releasing full-text versions of EEBO, ECCO and other resources. These versions are available online, and are also being harvested out to sites like 18th Century Connect.

So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.

What’s more, there have been some initial discussions about introducing crowdsourcing techniques to some of these licensed versions; allowing permitted users to transcribe and interpret the original historical documents. But of course this crowdsourcing would happen on different platforms with different communities, who may interpret and transcribe the documents in different way. This could lead to the tricky problem of different digital versions of the corpus. Rather than there being one EEBO, several EEBOs exist.

Variant editions are indeed a worrisome prospect, but I don't think that it's unique to projects created through crowdsourcing. In fact, I think that the mechanism of producing crowdsourced editions actually reduces the possibility for variants to emerge. Dunning and I corresponded briefly over Twitter, then I wrote this comment to the JISC Digitization blog. Since that blog seems to be choking on the mark-up, I'll post my reply here:
benwbrum Reading @alastairdunning's post connecting crowdsourcing to variant editions: bit.ly/raVuzo Feel like Wikipedia solved this years ago.

benwbrum If you don't publish (i.e. copy) a "final" edition of a crowdsourced transcription, you won't have variant "final" versions.

benwbrum The wiki model allows linking to a particular version of an article. I expanded this to the whole work: link

alastairdunning But does that work with multiple providers offering restricted access to the same corpus sitting on different platforms?

alastairdunning ie, Wikipedia can trace variants cause it's all on the same platform; but there are multiple copies of EEBO in different places

benwbrum I'd argue the problem is the multiple platforms, not the crowdsourcing.

alastairdunning Yes, you're right. Tho crowdsourcing considerably amplifies the problem as the versions are likely to diverge more quickly

benwbrum You're assuming multiple platforms for both reading and editing the text? That could happen, akin to a code fork.

benwbrum Also, why would a crowd sourced edition be restricted? I don't think that model would work.
I'd like to explore this a bit more. I think that variant editions are less likely in a crowdsourced project than in a traditional edition, but efforts to treat crowdsourced editions in a traditional manner can indeed result in the situation you warn against.

When we're talking about crowdsourced editions, we're usually talking about user-generated content that is produced in collaboration with an editor or community manager. Without exception, this requires some significant technical infrastructure -- a wiki platform for transcribing free-form text or an even more specialized tool for transcribing structured data like census records or menus. For most projects, the resulting edition is hosted on that same platform -- the Bentham wiki which displays the transcriptions for scholars to read and analyze is the same tool that volunteers use to create the transcriptions. This kind of monolithic platform does not lend itself to the kind of divergence you describe: copies of the edition are always dated as soon as they are separated from the production platform, and making a full copy of the production platform requires a major rift among the editors and volunteer community. These kinds of rifts can happen--in my world of software development, the equivalent phenomenon is a code fork--but they're very rare.

But what about projects which don't run on a monolithic platform? There are a few transcription projects in which editing is done via a wiki (Scripto) or webform (UIowa) but the transcriptions are posted to a content management system. There is indeed potential for the "published" version on the CMS to drift from the "working" version on the editing platform, but in my opinion the problem lies not in crowdsourcing, but in the attempt to impose a traditional publishing model onto a participatory project by inserting editorial review in the wrong place:

Imagine a correspondence transcription project in which volunteers make their edits on a wiki but the transcriptions are hosted on a CMS. One model I've seen often involves editors taking the transcriptions from the wiki system, reviewing and editing them, then publishing the final versions on the CMS. This is a tempting work-flow -- it makes sense to most of us both because the writer/editor/reader roles are clearly defined and because the act of copying the transcription to the CMS seems analogous to publishing a text. Unfortunately, this model fosters divergence between the "published" edition and the working copy as volunteers continue to make changes to the transcriptions on the wiki, sometimes ignoring changes made by the reviewer, sometimes correcting text regardless of whether a letter has been pushed to the CMS. The alternative model has reviewers make their edits within the wiki system itself, with content pushed to the CMS automatically. In this model, the wiki is the system-of-record; the working copy is the official version. Since the CMS simply reflects the production platform, it does not diverge from it. The difficulty lies in abandoning the idea of a final version.
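To make that concrete, here's a sketch of what "the CMS simply reflects the production platform" might look like: a scheduled job that pulls the current revision of each transcription from the wiki and overwrites the corresponding CMS record. The endpoints, field names, and identifiers are invented for illustration; the point is only that the copy flows one way, from the system-of-record to its mirror.

```python
import requests

WIKI_API = "https://transcription.example.org/w/api.php"   # hypothetical wiki install
CMS_API = "https://collections.example.org/api/items"      # hypothetical CMS endpoint

def latest_wikitext(title):
    """Fetch the current revision of a wiki page (standard MediaWiki query)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    pages = requests.get(WIKI_API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]

def mirror_to_cms(title, item_id):
    """Overwrite the CMS copy with the wiki's working copy; the wiki stays canonical."""
    text = latest_wikitext(title)
    requests.put(f"{CMS_API}/{item_id}", json={"transcription": text})

# Run on a schedule (cron, etc.) for every letter in the project.
mirror_to_cms("Letter from Zenas Matthews, 1846-05-01", item_id=42)  # hypothetical IDs
```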

It's not at all clear to me how EEBO or ECCO are examples of crowdsourcing, rather than traditional restricted-access databases created and distributed through traditional means, so I'm not sure that they're good examples.

Tuesday, February 15, 2011

My Goals for FromThePage

A couple of recent interactions have made me realize that I've never publicly articulated my goals in developing FromThePage. Like anyone else managing a multi-year project, my objectives have shifted over time. However, there are three main themes of my work developing web-based software for transcribing handwritten documents:
  1. Transcribing and publishing family diaries. FromThePage was developed to allow me and my immediate family to collaboratively transcribe the diaries of Julia Craddock Brumfield (fl. 1915-1936), my great-great grandmother. This objective has drifted over time to include the diaries of Jeremiah White Graves (fl. 1823-1878) as well, but despite that addition the effort is on track to achieve this original goal. Since the website was announced to my extended family in early 2009, Linda Tucker—a cousin whom I'd never met—has transcribed every single page I've put online, then located and scanned three more diaries that were presumed lost. The only software development work remaining to complete this goal is the integration of the tool with a publish-on-demand service so that we may distribute the diaries to family members without Internet access.
  2. Creating generally useful transcription software. As I developed FromThePage, I quickly realized that the tool would be useful to anyone transcribing, indexing and annotating handwritten material. It seemed a waste to pour effort into a tool that was only accessible to me, so a new goal arose of converting FromThePage into a viable multi-user software project. This has been a more difficult endeavor, but in 2010 I released FromThePage under the AGPL, and it's been adopted with great enthusiasm by the Balboa Park Online Collaborative for transcription projects at the San Diego Natural History Museum.
  3. Providing access to privately-held manuscripts. The vision behind FromThePage is to generalize my own efforts digitizing family diaries across the broader public. There is what I call an invisible archive--an enormous collection of primary documents that is distributed among the filing cabinets, cedar chests, and attics of the nation's nostalgic great aunts, genealogists, and family historians. This archive is inaccessible to all but the most closely connected family and neighbors of the documents' owners — indeed it's most often not merely inaccessible but entirely unknown. When effort is put into researching this archival material, it's done by amateurs like myself, and more often than not the results are naïve works of historical interpretation: rather than editing and annotating a Civil War diary, the researcher draws from it to create yet another Lost Cause narrative. I would love for FromThePage to transform this situation, channeling amateur efforts into digitizing and sharing irreplaceable primary material with researchers and family members alike. This has proven a far greater challenge than my proximate or intermediate goals: technically speaking the processing and hosting of page scans has been costly and difficult, while my efforts to recruit from the family history community have met with little success. Nevertheless I remain hopeful that events like this month's RootsTech conference will build the same online network among family researchers that THATCamp has among professional digital humanists.

Wednesday, February 2, 2011

2010: The Year of Crowdsourcing Transcription

2010 was the year that collaborative manuscript transcription finally caught on.

Back when I started work on FromThePage in 2005, I got the same response from most people I told about the project: "Why don't you just use OCR software?" To say that the challenges of digitizing handwritten material were poorly understood might be inaccurate—after all the TEI standard included an entire chapter on manuscripts—but there were no tools in use designed for the purpose. Five years later, half a dozen web-based transcription projects are in progress and new projects may choose from several existing tools to host their own. Crowdsourced transcription was even written up in the New York Times!

I'm going to review the field as I see it, then make some predictions for 2011.

Ongoing Structured Transcription Projects
By far the most successful transcription project is FamilySearch Indexing. In 2010, 185,900,667 records were transcribed from manuscript census forms, parish registers, tithe lists, and other sources world-wide. This brings the total up to 437,795,000 records double-keyed and reconciled by more than four hundred thousand volunteers — itself an awe-inspiring number with an equally impressive support structure.

October saw the launch of OldWeather, a project in which GalaxyZoo applied its crowdsourcing technology to transcription of Royal Navy ship's logs from WWI. As I write, volunteers have transcribed an astonishing 308,169 pages of logs — many of which include multiple records. I hope to do a more detailed review of the software soon, but for now let me note how elegantly the software uses the data itself to engage volunteers, so that transcribers can see the motion of "their ship" on a map as they enter dates, latitudes and longitudes. This leverages the immersive nature of transcription as an incentive, projecting users deep within history.

The North American Bird Phenology Program transcribed nearly 160,000 species sighting cards between December 2009 and 2010 and maintained their reputation as a model for crowdsourcing projects by publishing the first user satisfaction survey for a transcription tool. Interestingly the program seems to have followed a growth pattern a bit similar to Wikipedia's, as the cards transcribed rose from 203,967 to 362,996 while the number of volunteers only increased from 1,666 to 2,204 (a 78% increase in cards versus a 32% increase in volunteers) — indicating that a core of passionate volunteers remain the most active contributors.

I've only recently discovered Demogen, a project operated by the Belgian Rijksarchief to enlist the public to index handwritten death notices. Although most of the documentation is in Flemish, the Windows-based transcription software will also operate in French. I've had trouble finding statistics on how many of the record sets have been completed (a set comprising a score of pages with half a dozen personal records per page). By my crude estimate, the 4000ish sets are approximately 63% indexed — at roughly 120 records per set, that works out to a total of about 300,000 records to date. I'd like to write a more detailed review of Demogen/Visu and would welcome any pointers to project status and community support pages.

Ancestry.com's World Archives Project has been operating since 2008, but I've been unable to find any statistics on the total number of records transcribed. The project allows volunteers to index personal information from a fairly heterogeneous assortment of records scanned from microfilm. Each set of records has its own project page with help and statistics. The keying software is a Windows-based application free for download by any Ancestry.com registered user, while support is provided through discussion boards and a wiki.

Ongoing Free-form Transcription Projects
While I've written about Wikisource and its ProofreadPage plug-in before, it remains very much worth following. Of its 1.3 million scanned pages, more than two hundred thousand have been proofread, had problems reconciled, and been reviewed. Only a tiny percentage of those are handwritten, but that still amounts to a few thousand pages, making ProofreadPage the most popular automated free-form transcription tool.

This blog was started to track my own work developing FromThePage to transcribe Julia Brumfield's diaries. As I type, beta.fromthepage.com hosts 1503 transcribed pages—of which 988 are indexed and annotated—and volunteers are now waiting on me to prepare and upload more page images. Major developments in 2010 included the release of FromThePage on GitHub under a Free software license and installation of the software by the Balboa Park Online Collaborative for transcription projects by their member institutions.

Probably the biggest news this year was TranscribeBentham, a project at University College London to crowdsource the transcription of Jeremy Bentham's papers. This involved the development of Transcription Desk, a MediaWiki-based tool which is slated to be released under an open-source license. The team of volunteers had transcribed 737 pages of very difficult handwriting when I last consulted the Benthamometer. The Bentham team has done more than any other transcription project to publicize the field -- explaining their work on their blog, reaching out through the media (including articles in the Chronicle of Higher Education and the New York Times), and even highlighting other transcription projects on Melissa Terras's blog.

Halted Transcription Projects
The Historic Journals project is a fascinating tool for indexing -- and optionally transcribing -- privately-held diaries and journals. It's run by Doug Kennard at Brigham Young University, and you can read about his vision in this FHT09 paper. Technically, I found a couple of aspects of the project to be particularly innovative. First, the software integrates with ContentDM to display manuscript page images from that system within its own context. Second, the tool is tightly integrated with FamilySearch, the LDS Church's database of genealogical material. It uses the FamilySearch API to perform searches for personal or place names, and can then use the FamilySearch IDs to uniquely identify subjects mentioned within the texts. Unfortunately, because the FamilySearch API is currently limited to LDS members, development on Historic Journals has been temporarily halted.
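To illustrate the general pattern, here's a minimal sketch -- my own illustration, not Historic Journals' code; the lookup_person helper below is a hypothetical stand-in for whatever name-search call the FamilySearch API actually provides:

```python
# A minimal sketch of the indexing idea, not Historic Journals' actual code.
# lookup_person() is a hypothetical stand-in for a genealogical API's name search.
from typing import Dict, List, Optional

def lookup_person(name: str) -> Optional[str]:
    """Pretend to resolve a personal name to a stable external ID."""
    fake_directory = {"Julia Brumfield": "PERSON-12345"}  # invented data
    return fake_directory.get(name)

def index_subjects(page_text: str, names: List[str]) -> Dict[str, List[str]]:
    """Map each external ID to the names under which the subject appears on this page."""
    index: Dict[str, List[str]] = {}
    for name in names:
        person_id = lookup_person(name)
        if person_id and name in page_text:
            index.setdefault(person_id, []).append(name)
    return index

print(index_subjects("Julia Brumfield walked to church.", ["Julia Brumfield"]))
# {'PERSON-12345': ['Julia Brumfield']} -- the ID, not the spelling, identifies the subject
```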

Begun as a desktop application in 1998, the uScript Transcription Assistant is the longest-running program in the field. Recently ported to modern web-based technologies, the system is similar to Img2XML and T-PEN in that it links individual transcribed words to the corresponding regions of the scanned page image. Although the system is not in use and the source code is not accessible outside WPI, you can read papers describing it by WPI students in 2003 or by Fabio Carrera (the faculty member leading the project) in 2005. Unfortunately, according to Carrera's blog, work on the project has stopped for lack of funding.
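For readers unfamiliar with word-level linking, here's a rough sketch of the kind of data structure involved -- my own illustration, not the actual format used by uScript, Img2XML, or T-PEN:

```python
# A rough sketch of word-to-image linking: each transcribed word records the
# bounding box of its region on the scanned page, enabling click-to-highlight.
from dataclasses import dataclass
from typing import List

@dataclass
class LinkedWord:
    text: str       # the transcribed word
    x: int          # left edge of the word's region, in image pixels
    y: int          # top edge
    width: int
    height: int

@dataclass
class PageTranscription:
    image_url: str
    words: List[LinkedWord]

    def word_at(self, px: int, py: int) -> List[LinkedWord]:
        """Return the transcribed word(s) under a pixel on the page image."""
        return [w for w in self.words
                if w.x <= px < w.x + w.width and w.y <= py < w.y + w.height]

page = PageTranscription("page-001.jpg", [LinkedWord("Dear", 120, 80, 60, 24),
                                          LinkedWord("Sir", 190, 80, 40, 24)])
print([w.text for w in page.word_at(200, 90)])  # ['Sir']
```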

According to the New York Times article, there was an attempt to crowdsource the Papers of Abraham Lincoln. The article quotes project director Daniel Stowell explaining that nonacademic transcribers "produced so many errors and gaps in the papers that 'we were spending more time and money correcting them as creating them from scratch.'" The prototype transcription tool (created by NCSA at UIUC) has been abandoned.

Upcoming Transcription Projects
The Center for History and New Media at George Mason University is developing a transcription tool called Scripto based on MediaWiki and architected around integration with an external CMS for hosting page images. The initial transcription project will be their Papers of the War Department site, but connector scripts for other content management systems are under development. Scripto is being developed in a particularly open manner, with the source code available for immediate inspection and download on GitHub and a project blog covering the tool's progress.
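To make the connector idea concrete, here's a minimal sketch of the pattern -- my own illustration, not Scripto's actual API:

```python
# A minimal sketch of the connector pattern: the transcription tool only ever
# talks to this interface, and each CMS gets its own adapter behind it.
from abc import ABC, abstractmethod
from typing import List

class CmsConnector(ABC):
    @abstractmethod
    def page_image_urls(self, document_id: str) -> List[str]:
        """List the scanned page images the CMS holds for a document."""

    @abstractmethod
    def save_transcription(self, document_id: str, page_number: int, text: str) -> None:
        """Push a finished page transcription back into the CMS."""

class InMemoryConnector(CmsConnector):
    """A toy connector standing in for a real adapter (ContentDM, etc.)."""
    def __init__(self) -> None:
        self.store = {}

    def page_image_urls(self, document_id: str) -> List[str]:
        # Invented URL scheme for illustration only.
        return [f"https://cms.example.org/{document_id}/page-{n}.jpg" for n in range(1, 4)]

    def save_transcription(self, document_id: str, page_number: int, text: str) -> None:
        self.store[(document_id, page_number)] = text

connector = InMemoryConnector()
print(connector.page_image_urls("diary-1918")[0])  # https://cms.example.org/diary-1918/page-1.jpg
connector.save_transcription("diary-1918", 1, "Clear and cold this morning...")
```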

T-PEN is a tool under development by Saint Louis University to enable line-by-line transcription and paleographic annotation. It's focused on medieval manuscripts, and automatically identifies the lines of text within a scanned page -- even if that page is divided into columns. The team integrated crowdsourcing into their development process by challenging the public to test and give feedback on their line identification algorithm, gathering perhaps a thousand ratings in a two-week period. There's no word on whether T-PEN will be released under a free license. I should also mention that they've got the best logo of any transcription tool.
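As a rough illustration of one common approach to line-finding (my own toy example, not T-PEN's algorithm), a horizontal projection profile looks for bands of ink; a page divided into columns would first be split with a similar vertical profile:

```python
# A toy horizontal projection profile, not T-PEN's actual algorithm. The page is
# assumed to be a binarized image: a 2-D list where 1 marks ink and 0 background.
from typing import List, Tuple

def find_text_lines(page: List[List[int]], threshold: int = 1) -> List[Tuple[int, int]]:
    """Return (top_row, bottom_row) spans where enough ink appears to count as a line."""
    row_ink = [sum(row) for row in page]          # ink pixels per image row
    lines, start = [], None
    for y, ink in enumerate(row_ink):
        if ink >= threshold and start is None:
            start = y                              # entering a band of text
        elif ink < threshold and start is not None:
            lines.append((start, y - 1))           # leaving the band
            start = None
    if start is not None:
        lines.append((start, len(row_ink) - 1))
    return lines

# Tiny toy "page": two inked bands separated by blank rows.
page = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```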

I covered Militieregisters.nl at length below, but the most recent news is that a vendor has been picked to develop the VeleHanden transcription tool. I would not be at all surprised if 2011 saw the deployment of that system.

The Balboa Park Online Collaborative is going into collaborative transcription in a big way with the Field Notes of Laurence Klauber for the San Diego Natural History Museum. They've picked my own FromThePage to host their transcriptions, and have been driving a lot of the development on that system since October through their enthusiastic feature requests, bug reports, and funding. Future transcription projects are in the early planning stages, but we're trying to complete features suggested by the Klauber material first.

The University of Iowa Libraries plan to crowdsource transcription of their Historic Iowa Children's Diaries. There is no word on the technology they plan to use.

The Getty Research Institute plans to crowdsource transcription of J. Paul Getty's diaries. This project also appears to be in the very early stages of planning, with no technology chosen.

Invisible Australians is a digitization project by Kate Bagnall and Tim Sherratt to explore the lives of Australians subjected to the White Australia policy through their extensive records. While it's still in the planning stages (with only a set of project blogs and a Zotero library publicly visible), the heterogeneity of the source material makes it one of the most ambitious documentary transcription projects I've seen. Some of the data is traditionally structured (like government forms), some is free-form (like letters), and there are photographs and even hand-prints to present alongside the transcription! Invisible Australians will be a fascinating project to follow in 2011.

Obscure Transcription Projects
Because the field is so fragmented, there are a number of projects I follow that are not entirely automated, not entirely public, not entirely collaborative, moribund or awaiting development. In fact, some projects have so little written about them online that they're almost mysterious.
  • Commenters on a blog post at Rogue Classicism are discussing this APA job posting for a Classicist to help develop a new GalaxyZoo project transcribing the Oxyrhynchus Papyri.
  • Some cryptic comments on blog posts covering TranscribeBentham point to FadedPage, which appears to be a tool similar to Project Gutenberg's Distributed Proofreaders. Further investigation has yielded no instances of it being used for handwritten material.
  • A blog called On the Written, the Digital, and the Transcription tracks development of WrittenRummage, which was apparently a crowdsourced transcription tool that sought to leverage Amazon's Mechanical Turk.
  • Van Papier Naar Digitaal is a project by Hans den Braber and Herman de Wit in which volunteers photograph or scan handwritten material and then send the images to Hans. Hans reviews them and puts them on the website as PDFs, which Herman publicizes to transcription volunteers. Those volunteers download a PDF and use Jacob Boerema's desktop-based Transcript software to transcribe the records, which are then linked from Digitale Bronbewerkinge Nederland en België. With my limited Dutch it is hard for me to evaluate how much has been completed, but in the years the program has been running, its results seem to have been pretty impressive.
  • BYU's Immigrant Ancestors Project was begun in 1996 as a survey of German archival holdings, then expanded into a crowdsourced indexing project. A 2009 article by Mark Witmer predicts the imminent roll-out of a new version of the indexing software, but the project website looks quite stale and says that it's no longer accepting volunteers.
  • In November, a Google Groups post highlighted the use of Islandora for side-by-side presentation of a page image and a TEI editor for transcription. However, I haven't found any examples of its use for manuscript material.
  • Wiktenauer is a MediaWiki installation for fans of western martial arts. It hosts several projects transcribing and translating medieval manuals of fighting and swordsmanship, although I haven't yet figured out whether they're automating the transcription.
  • Melissa Terras's manuscript transcription blog post mentioned a Drupal-based tool called OpenScribe, built by the New Zealand Electronic Text Centre. However, the Google Code site doesn't show any updates since mid-2009, so I'm not sure how active the project is. This project is particularly difficult to research because "OpenScribe" is also the name of an audio transcription tool hosted on SourceForge as well as a commercial scanning station.
I welcome any corrections or updates on these projects.

Predictions for 2011

Emerging Community
Nearly all of the transcription projects I've discussed were begun in isolation, unaware of previous work on transcription tools. While I expect this fragmented situation to continue--in fact I've seen isolated proposals as recently as Shawn Moore's October 12 HASTAC post--it should lessen a bit as toolmakers and project managers enter into dialogue with each other on comment threads, conference panels, and GitHub. Tentative steps were made towards overcoming linguistic divisions in 2010, with Dutch archivists covering TranscribeBentham and a scattered bit of bloggy conversation among Dutch, German, English, and American participants. The publicity given to projects like OldWeather, Scripto, and TranscribeBentham can only help this community form.

No Single Tool
We will not see the development of a single tool that supports transcription of both structured and free-form manuscripts, nor both paleographic and semantic annotation in 2011. The field is too young and fragmented -- most toolmakers have enough work providing the basic functionality required by their own manuscripts.

New Client-side Editors
Although I don't foresee convergence of server tools, there is already some exciting work being done on Javascript-based editors for TEI, the mark-up language that informs most manuscript annotation. TEILiteEditor is an open-source WYSIWYG editor for TEI, while RaiseXML is an open-source editor for manipulating TEI tags directly. Both projects have seen a lot of activity over the past few weeks, and it's easy to imagine a future in which many different transcription tools support the same user-facing editor.

External Integration
2010 already saw strides being made towards integration with external CMSs, with BYU's Historic Journals serving page images from ContentDM and FromThePage serving page images from the Internet Archive. Scripto is apparently designed entirely around CMS integration, as it does not host images itself and is architected to support connectors for many different content management systems. I feel that this will be a big theme of transcription tool development in 2011, with new support for feeding transcriptions and text annotations back to external CMSs.

Outreach/Volunteer Motivation
We're learning that a key to success in crowdsourcing projects is recruiting volunteers. I think that 2011 will see a lot of attention paid to identifying and enlisting existing communities interested in the subject matter for a transcription project. In addition to finding volunteers, projects will better understand volunteer motivation and the trade-offs between game-like systems that encourage participation through score cards and points on the one hand, and immersive systems that enhance the volunteers' engagement with the text on the other.

Taxonomy
As the number of transcription projects multiplies, I think we will be able to start generalizing from the unique needs of each collection of manuscript material to form a sort of taxonomy of transcription projects. In the list above, I've separated the projects indexing structured data like militia rolls from those dealing with free-form text like diaries or letters. I think that in 2011 we'll be able to classify projects by their paleographic requirements, the kinds of analysis that will be performed on the transcribed texts, the quantity of non-textual images that must be incorporated into the transcription presentation, and other dimensions. It's possible that existing tools will specialize in a few of these areas, providing support for needs similar to those of their original projects, so that a sort of decision tree could guide new projects toward the appropriate tool for their manuscript material.

2011 is going to be a great year!