The video of the crowdsourcing panel
at IMLS WebWise is online, so I thought I'd post my talk. Like anyone who's created a transcript of their own unscripted remarks, I recommend watching the video. (My bit starts at 6:00, though all the speakers were excellent. Full-screening will hide the slides.) Nevertheless, I've added hyperlinks to the transcript and interpolated the slides with my comments below.
Okay. I'd like to talk about some of the lessons that have come out of
my collaborations with small crowdsourcing projects. We hear a lot about
these large projects like GalaxyZoo and Transcribe Bentham. What can
small institutions and small projects do, and do the rules that seem to apply to large projects also apply to them?
So there are three projects that I'm drawing on for this experience.
The first one, which I'm going to talk about in a little bit, is one
that was run by the Balboa Park Online Collaborative: the Klauber Field
Notes Transcription Project, transcribing the field notes of Laurence M. Klauber,
who was the nation's foremost authority on rattlesnakes. These are
field notes that he kept from 1923 through 1967. This is done by the
San Diego Natural History Museum and is run by our own Perian Sully,
who is out there in the room somewhere.
The next project I want to talk about is the Diary of Zenas Matthews.
Zenas Matthews was a volunteer from Texas who served in the American
forces in the US-Mexican War of 1846, and this diary is kept by
Southwestern University (see Southwestern's article on the Zenas
Matthews diary project). This had been digitized for a previous
researcher and is small, but Southwestern itself is also quite small.
The third project I want to talk about is actually the origin of the
software, which is the Julia Brumfield Diaries
. If the name looks
familiar, it's because she's my great-great grandmother. This project
was the impetus for me to develop this tool.
So what all of these projects have in common is that we're talking
about page counts that are in the thousands and volunteer counts that
are numbered in the dozens at best. So these are not FamilySearch Indexing,
where you can rely on hundreds of thousands of volunteers.
So who participates in large projects and who participates in small
projects? One thing that I think is really interesting about
crowdsourcing and these other sorts of participatory online communities
is that the ratio of contributions to users follows what's called a
power-law distribution.
If you look here--and most famously this is Wikipedia--you see a chart
of the number of users on Wikipedia ranked by their contributions. And
what you see is that 90% of the edits made to Wikipedia are done by 10%
of the users.
If we look at other crowdsourced projects: this
is the North American
Bird Phenology Program out of Patuxent
Research Center [ed: actually Patuxent Wildlife Research Center
], and this is
a project in which volunteers are transcribing ornithology
records--basically bird-watching records--that were sent in from the
1870s through the 1950s
[ed: 1880s-1970s], entering them into a database where they can be
mined for climate change [data]. What's interesting about this to me at
least is that--and this has been a phenomenally successful project:
they've got 560,000 cards transcribed, all by volunteers--but
StellaW@Maine here has transcribed 126,000 of them, which is 22% of
the total. Now, CharlotteC@Maryland is close behind her (so go, local team!)
but again you see the same kind of curve.
If we look at another relatively large project, the Transcribe Bentham
project: this isn't a graph, but if you look at the numbers here, you
see the same kind of thing. You see Diane with 78,000 points, you see
Ben Pokowski with 51,000 points. You see this curve sort of taper down
into more of a long tail.
So what about the small projects?
Well, let's look at the Klauber
diaries. These are the top ten transcribers of the field notes of
Laurence Klauber. And if you look at the numbers here--in this case
it's not quite as pronounced, because I think the previous leader has
dropped out and other people have overtaken him--but you see the same
kind of distribution. This is not a linear progression; this is more of a power-law distribution.
If you look at an even smaller project--now, mind you this is a project
that is really only of interest to members of my family and elderly
neighbors of the diarist--but look: We've got Linda Tucker who has
transcribed 713 of these pages followed by me and a few other people.
But again, you have this power law in which the majority of the work is being done by a very small group of people.
Okay, what's going on really? What does this mean and why does it
matter? I think this is important for a couple of reasons.
One is that this kind of behavior addresses one of the main objections
to crowdsourcing. Now there are a lot of valid objections to
crowdsourcing; I think that there are also a few invalid objections and
one of them is essentially the idea that members of the public cannot
participate in scholarly projects because my next door neighbor is
neither capable nor interested in participating in scholarly projects.
And we see this all over the place. I mean, here's a few example
quotes--and I'm not going to read them out. I believe that this
objection (which I have heard a number of times; I mean we see some
examples right here) is a non sequitur. And I believe that the power-law
distribution proves that it's a non sequitur. Really, I saw this most
egregiously framed by a scholar who was passionately--just absolutely
decrying--the idea that classical music fans would be able to
competently translate from German into English because, he said, "After
all, 40% of South Carolina voted for Newt Gingrich." Okay.
All right, so what's going on is I think best summed up by Rachel Stone
and what she essentially said is that crowdsourcing isn't getting the
sort of random distribution from the crowd. Crowdsourcing is getting a
number of "well-informed enthusiasts."
So where do we find well-informed enthusiasts to do this work and to do
it well? Big projects have an advantage, right? They have marketing
budgets. They have press coverage. They have an existing user base.
If you ask the people at the Transcribe Bentham project how they got
their users, they'll say, "Well, you know, that New York Times article
really helped." That's cool! All right.
The GalaxyZoo people--the Citizen Science
Alliance--yesterday, 24 hours ago, announced a new project, SETILive.
Now what this does is it pulls in live data from the SETI satellites [sic: actually telescope],
and in those 24 hours--I took this screenshot; I actually skipped lunch
to get this one screenshot because I knew that it would pass 10,000
people participating, with 80,000 of these classifications. And it would
have been higher, except last night the telescope got covered by cloud
cover. So they dropped from getting 30 to 40 contributions per second
to having to show sort of archival data and getting only 10
contributions per second. Well, they can do this because they have an
existing base of active volunteers that numbers around 600,000.
So how do WE do that? How do we find well-informed enthusiasts? This
is something that Kathryn Stallard and Anne Veerkamp-Andersen of
Southwestern University Special Collections and I discussed a lot when
we were trying to launch the Zenas Matthews Diary project. We said, "Well, we
don't have any budget at all." Kathryn said, "Well, let's talk about
local archival newsletters. Let's post to H-Net lists." I was in favor
of looking at online communities of people who might be doing Matthews
genealogy or the military history war-gamers who have discussion forums
on the Mexican War.
While we're arguing about this, Kathryn gets an email from a patron
saying, "Hey, I'm a member of an organization. We see that you have
this document. It relates to the Battle of San Jacinto and the Texas
Revolution of 1836. Can you send this to us?"
She responds saying,
"Hey, 1846, great. Check out this diary we just put online. I think
that's what you're talking about."
Well, that wasn't actually what he
was talking about, but he responds and says, "Yeah, okay, I'll check
that out, but can you please give me the document I want?" They get it
back to him, and we return to our discussion of "Okay, what do we need
to do to roll this out? We're going to start working on the information
architecture. We're going to work on the UI. We're going to work on
help screens." And while we're having this conversation, Mr. Patrick checks it out.
And Scott Patrick starts transcribing.
And he starts transcribing some more.
And he continues transcribing.
And at this point, we're talking
about working on the wording of the help screens and the wording of our
announcement trying to attract volunteers, and this is page 43 of the diary.
And while we're discussing this, he goes back and he
starts adding footnotes. Look at this: he's identifying the people who
are in this, saying, "Hey, this guy who is mentioned--here's what
his later life was. This other guy--hey, he's my first cousin, by the
way, but he also left the governorship of the State of Texas to fight in
this war." He sees--and believe me, in the actual original diary,
Piloncillo is not spelled "Piloncillo." I mean, Zenas Matthews didn't
know Spanish, right? He identifies this! He identifies and
looks up works that are mentioned here.
So wow! All right! We got our well-informed enthusiast! In 14 days,
he transcribed the diary, and he didn't do just one pass. I mean as he
got familiar with the hand, he goes back and revises the earlier
transcriptions. He kind of figures out who's involved. He asks other
members of his heritage organization what this is. He adds two dozen footnotes.
What just happened? What was that about? Who is this guy? Well, Scott
Patrick is a retired petroleum worker who got interested in his family
history, and then got interested in local history, and then got
interested in heritage organizations. And he is our ideal "well-informed enthusiast."
So how did we find him? The project isn't public yet, right? Our
challenge now is rephrasing our public announcement: we're now looking
for volunteers to ... something that adequately describes what's left to
do. Well, let's go back and take a look at this original letter,
right? What we did is, we responded to an inquiry from a patron--and
not an in-person patron: this is someone who lives 200 miles away.
What you have when someone comes in and asks about material--if
you think about this in terms of target marketing--is a target-rich
environment. Here is someone who is interested. He's online. He's
researching this particular subject. He is not an existing patron; he
has no prior relationship with Southwestern University Libraries, but
we can say, "Hey, while we answer your request, you might check out this
thing that's in this related field." That seems to have worked in this one
case. Hopefully, we'll get some more experience with future projects.
Okay, so how do we motivate volunteers? More importantly, how do we
avoid de-motivating them?
Big projects, a lot of times they have a lot
of interesting game-like features. Some of them actually are games.
You have leader boards, you have badges, you have ways of making the
experience more immersive.
Old Weather, which is run by the GalaxyZoo people, will
plot your ship on a Google map as you transcribe the latitude and
longitude elements from the log books.
The National Library of Finland
has partnered with Microtask to actually create a crowdsourcing game of Whac-A-Mole
. So this is crowdsourcing taken to the extreme.
But there's a peril here, and the peril is that all of these things are
extrinsic motivators. And we ran into this with the Klauber diaries.
Perian came to me and said, "Hey, let's come up with a stats page,
because we want to track where the diaries are at." So we come up with
the stats page--pretty basic; here's where some of these are at. And
hey, while we're at it, let's mine our data. We can come up with a
couple of top-ten lists. So we come up with a top-ten list of
transcribers and a top-ten list of editors, because that's the data I have.
Well remember, the whole point of this exercise is to index these
diaries so that we can find the mentions of these individual species in
the original manuscripts. Do you see indexing on here anywhere?
Neither did our volunteers, and the minute this went up, the volunteers
who previously had been transcribing and indexing every single page stopped indexing completely.
They weren't being measured on it; we weren't saying that we rewarded them for it, so they stopped.
Needless to say, our next big rush change was a top-ten indexers list.
So this gets to the "crowding-out" theory of motivation, and the expert on
this is a researcher in the UK named Alexandra Eveleigh. Her point is
that if you're going to design any kind of extrinsic motivation, you
have to make sure that it promotes the actual contributory behavior, and
this is something that applies, I believe, to small projects as well as large ones.
So I have 13 seconds left, so thank you, and I'll just end on that note.
It's always interesting to hear what goes on 'behind the scenes' with crowdsourcing projects. Thanks for sharing Ben.
What a great discussion of finding contributors to online academic and quasi-academic projects-- thanks!