Collaborative Manuscript Transcription: Crowdsourcing at IMLS WebWise 2012

The video of the crowdsourcing panel at IMLS WebWise is online, so I thought I'd post my talk. Like anyone who's created a transcript of their own unscripted remarks, I recommend watching the video. (My bit starts at 6:00, though all the speakers were excellent. Full-screening will hide the slides.) Nevertheless, I've added hyperlinks to the transcript and interpolated the slides with my comments below.

Okay. I'd like to talk about some of the lessons that have come out of my collaborations with small crowdsourcing projects. We hear a lot about these large projects like GalaxyZoo, like Transcribe Bentham. What can small institutions and small projects do, and do the rules that seem to apply to large projects also apply to them?

So there are three projects that I'm drawing from here in this experience. The first one I'm going to talk about in a little bit is one that was run by Balboa Park Online Collaborative. It's the Klauber Field Notes Transcription Project--the field notes of Laurence M. Klauber, who was the nation's foremost authority on rattlesnakes. These are field notes that he kept from 1923 through 1967. This is done by the San Diego Natural History Museum and is run by our own Perian Sully who is out there in the room somewhere.

The next project I want to talk about is the Diary of Zenas Matthews. Zenas Matthews was a volunteer from Texas who served in the American forces in the US-Mexican War of 1846, and this diary is kept by Southwestern University Special Collections. This had been digitized for a previous researcher and is small, but Southwestern itself is also quite small.

The third project I want to talk about is actually the origin of the software, which is the Julia Brumfield Diaries. If the name looks familiar, it's because she's my great-great grandmother. This project was the impetus for me to develop this tool for crowdsourced transcription.

So all of these projects, what they have in common is that we're talking about page counts that are in the thousands and volunteer counts that are numbered in the dozens at best. So these are not FamilySearch Indexing, where you can rely on hundreds of thousands of volunteers and large networks.

So who participates in large projects and who participates in small projects? One thing that I think is really interesting about crowdsourcing and these other sorts of participatory online communities is that the ratio of contributions to users follows what's called a power-law distribution.

If you look here, we see--and most famously this is Wikipedia--and you see a chart of the number of users on Wikipedia ranked by their contributions. And what you see is that 90% of the edits made to Wikipedia are done by 10% of the users.

If we look at other crowdsourced projects: this is the North American Bird Phenology Program out of Patuxent ~~Bay~~ Research Center [ed: actually Patuxent Wildlife Research Center], and this is a project in which volunteers are transcribing ornithology records--basically bird-watching records--that were sent in from the ~~1870s through the 1950s~~ [ed: 1880s-1970s], entering them into a database where they can be mined for climate change [data]. What's interesting about this to me at least is that--and this has been a phenomenally successful project: they've got 560,000 cards transcribed all by volunteers, but StellaW@Maine here has transcribed 126,000 of them, which is 22% of them. Now, CharlotteC@Maryland is close behind her (so go, local team!) but again you see the same kind of curve.

If we look at another relatively large project, the Transcribe Bentham project: this isn't a graph, but if you look at the numbers here, you see the same kind of thing. You see Diane with 78,000 points, you see Ben Pokowski with 51,000 points. You see this curve sort of taper down into more of a long tail.

So what about the small projects?

Well, let's look at the Klauber diaries. This is the top ten transcribers of the field notes of Laurence Klauber. And if you look at the numbers here--again in this case it's not quite as pronounced because I think the previous leader has dropped out and other people have overtaken him--but you see the same kind of distribution. This is not a linear progression; this is more of a power-law distribution.

If you look at an even smaller project--now, mind you this is a project that is really only of interest to members of my family and elderly neighbors of the diarist--but look: We've got Linda Tucker who has transcribed 713 of these pages followed by me and a few other people. But again, you have this power law that the majority of the work is being done by a very small group of people.
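As a rough illustration of the arithmetic behind these curves, here is a minimal Python sketch. The card totals are the Bird Phenology figures quoted above; the `top_share` helper and the `hypothetical_counts` list are assumptions made up for this example, not data from any of the projects.

```python
def top_share(counts, fraction):
    """Return the share of all contributions made by the top `fraction` of contributors."""
    ranked = sorted(counts, reverse=True)
    # Always count at least one contributor, even in a tiny project.
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Figures quoted in the talk: 126,000 of 560,000 cards by a single volunteer.
stella_share = 126_000 / 560_000
print(f"Top volunteer's share of cards: {stella_share:.1%}")

# Hypothetical per-volunteer page counts for a ten-person project (invented).
hypothetical_counts = [500, 200, 100, 50, 40, 30, 30, 20, 20, 10]
print(f"Top 10% of volunteers: {top_share(hypothetical_counts, 0.10):.0%} of pages")
print(f"Top 20% of volunteers: {top_share(hypothetical_counts, 0.20):.0%} of pages")
```

Running the same helper over a real project's per-user page counts would show how steeply the work concentrates in a handful of volunteers.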

Okay, what's going on really? What does this mean and why does it matter? The thing that I think this gets to, the reason that I think that this is important, is for a couple of reasons.

One is that this kind of behavior addresses one of the main objections to crowdsourcing. Now there are a lot of valid objections to crowdsourcing; I think that there are also a few invalid objections and one of them is essentially the idea that members of the public cannot participate in scholarly projects because my next door neighbor is neither capable nor interested in participating in scholarly projects. And we see this all over the place. I mean, here's a few example quotes--and I'm not going to read them out. I believe that this objection (which I have heard a number of times; I mean we see some examples right here) is a non sequitur. And I believe that the power-law distribution proves that it's a non sequitur. Really, I saw this most egregiously framed by a scholar who was passionately--just absolutely decrying--the idea that classical music fans would be able to competently translate from German into English because, he said, "After all, 40% of South Carolina voted for Newt Gingrich." Okay.

All right, so what's going on is I think best summed up by Rachel Stone, and what she essentially said is that crowdsourcing isn't getting the sort of random distribution from the crowd. Crowdsourcing is getting a number of "well-informed enthusiasts."

So where do we find well-informed enthusiasts to do this work and to do it well? Big projects have an advantage, right? They have marketing budgets. They have press coverage. They have an existing user base.

If you ask the people at the Transcribe Bentham project how did they get their users, they'll say "Well, you know that New York Times article really helped." That's cool! All right.

The GalaxyZoo people--Citizen Science Alliance--yesterday, 24 hours ago, announced a new project, SETILive. Now what this does is it pulls in live data from the SETI satellites[sic: actually telescope], and in those 24 hours--I took this screenshot; I actually skipped lunch to get this one screenshot because I knew that it would pass 10,000 people participating with 80,000 of these classifications. And it would have been higher, except last night the telescope got covered by cloud cover. So they dropped from getting 30 to 40 contributions per second to having to show sort of archival data and getting only 10 contributions per second. Well, they can do this because they have an existing base of active volunteers that numbers around 600,000.

So how do WE do that? How do we find well-informed enthusiasts? This is something that Kathryn Stallard and Anne Veerkamp-Andersen at Southwestern University Special Collections and I discussed a lot when we were trying to launch the Zenas Matthews Diary. We said, "Well, we don't have any budget at all." Kathryn said, "Well, let's talk about local archival newsletters. Let's post to H-Net lists." I was in favor of looking at online communities of people who might be doing Matthews genealogy or the military history war-gamers who have discussion forums on the Mexican War.

While we're arguing about this, Kathryn gets an email from a patron saying, "Hey, I'm a member of an organization. We see that you have this document. It relates to the Battle of San Jacinto and the Texas Revolution of 1836. Can you send this to us?"

She responds saying, "Hey, 1846, great. Check out this diary we just put online. I think that's what you're talking about."

Well that wasn't actually what he was talking about, but he responds and says, "Yeah, okay, I'll check that out, but can you please give me the document I want." They get it back to him and we returned to our discussion of "Okay, what do we need to do to roll this out? We're going to start working on the information architecture. We're going to work on the UI. We're going to work on help screens." And while we're having this conversation, Mr. Patrick checks it out.

And Scott Patrick starts transcribing.

And he starts transcribing some more.

And he continues transcribing.

And at this point, we're talking about working on the wording of the help screens, the wording of our announcement trying to attract volunteers, and this is page 43 of the 43-page diary!

And while we're discussing this, he goes back and he starts adding footnotes. Look at this: he's identifying the people who are in this, saying, "Hey, this guy who is mentioned is -- here's what his later life was. This other guy--hey, he's my first cousin, by the way, but he also left the governorship of the State of Texas to fight in this war."

He sees--and believe me, in the actual original diary, Piloncillo is not spelled Piloncillo. I mean it is a -- Zenas Matthews does not know Spanish, right? He identifies this! He identifies and looks up works that are mentioned here.

So wow! All right! We got our well-informed enthusiast! In 14 days, he transcribed the diary, and he didn't do just one pass. I mean as he got familiar with the hand, he goes back and revises the earlier transcriptions. He kind of figures out who's involved. He asks other members of his heritage organization what this is. He adds two dozen footnotes.

What just happened? What was that about? Who is this guy? Well, Scott Patrick is a retired petroleum worker who got interested in his family history, and then got interested in local history, and then got interested in heritage organizations. And he is our ideal "well-informed enthusiast".

So how did we find him? The project isn't public yet, right? Our challenge now is rephrasing our public announcement. We're now looking for volunteers to ... something that adequately describes what's left to do. Well, let's go back and take a look at this original letter, right? What we did is, we responded to an inquiry from a patron--and not an in-person patron: this is someone who lives 200 miles away from Georgetown, Texas.

What you have when someone is coming in and asking about material is, if you think about this in terms of target marketing--this is a target-rich environment. Here is someone who is interested. He's online. He's researching this particular subject. He is not an existing patron. He has no prior relationship with Southwestern University Libraries, but "Hey, while we answer your request, you might check this thing out that's in this related field." That seems to have worked in this one case. Hopefully, we'll get some more experience with future projects.

Okay, so how do we motivate volunteers? More importantly, how do we avoid de-motivating them?

Big projects, a lot of times they have a lot of interesting game-like features. Some of them actually are games. You have leader boards, you have badges, you have ways of making the experience more immersive.

OldWeather, which is run by GalaxyZoo, will plot your ship on a Google map as you transcribe the latitude and longitude elements from the log books.

The National Library of Finland has partnered with Microtask to actually create a crowdsourcing game of Whac-A-Mole. So this is crowdsourcing taken to the extreme.

But there's a peril here, and the peril is that all of these things are extrinsic motivators.

And we ran into this with the Klauber diaries. Perian came to me and said, "Hey, let's come up with a stats page, because we want to track where the diaries are at." So we come up with the stats page -- pretty basic, here's where some of these are at.

And hey, while we're at it, let's mine our data. We can come up with a couple of top-10 lists. So we come up with the top-ten list of transcribers and a top-ten list of editors, because that's the data I have.

Well remember, the whole point of this exercise is to index these diaries so that we can find the mentions of these individual species in the original manuscripts. Do you see indexing on here anywhere? Neither did our volunteers, and the minute this went up, the volunteers who previously had been transcribing and indexing every single page stopped indexing completely. They weren't being measured on it. We weren't saying that we rewarded them for it, so they stopped.

Needless to say, our next big-rush change was a top-ten indexers list.
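One way to avoid leaving a behavior unmeasured is to build the leaderboards from every kind of contribution at once, rather than from whichever data happens to be handy. The following is a minimal Python sketch under that assumption; the `(user, action)` event format, the `leaderboards` function, and the sample names are all hypothetical and are not how the actual stats page was built.

```python
from collections import Counter, defaultdict


def leaderboards(events, n=10):
    """Build a top-n list of users for each kind of contribution.

    `events` is an iterable of (user, action) pairs, a hypothetical
    log format chosen for this example.
    """
    per_action = defaultdict(Counter)
    for user, action in events:
        per_action[action][user] += 1
    # One board per action, so no contribution type is silently left out.
    return {action: counts.most_common(n) for action, counts in per_action.items()}


# Invented sample log: two volunteers, three kinds of work.
sample_events = [
    ("volunteer_a", "transcribe"),
    ("volunteer_a", "transcribe"),
    ("volunteer_b", "transcribe"),
    ("volunteer_b", "edit"),
    ("volunteer_b", "index"),
]

for action, board in leaderboards(sample_events).items():
    print(action, board)
```

Because the boards are keyed by action, adding a new kind of contribution to the log produces a new list automatically instead of requiring a rush change later.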

So this gets to the "crowding-out" theory of motivation, and the expert on this is a researcher in the UK named Alexandra Eveleigh. Her point is that if you're going to design any kind of extrinsic motivation, you have to make sure that it promotes the actual contributory behavior, and this is something that applies, I believe, to small projects as well as large projects.

So I have 13 seconds left, so thank you, and I'll just end on that note.

2 comments:

Rose Holley - Digital Library Specialist said...: It's always interesting to hear what goes on 'behind the scenes' with crowdsourcing projects. Thanks for sharing Ben.; March 25, 2012 at 5:05 PM
Anonymous said...: What a great discussion of finding contributors to online academic and quasi-academic projects-- thanks!; April 15, 2012 at 12:30 PM

Saturday, March 17, 2012
