Wednesday, July 20, 2011

Crowdsourcing and Variant Digital Editions

Writing at the JISC Digitization Blog, Alastair Dunning warns of "problems with crowdsourcing having the ability to create multiple editions."

For example, the much-lauded Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) are now beginning to appear on many different digital platforms.

ProQuest currently hold a licence that allows users to search over the entire EEBO corpus, while Gale-Cengage own the rights to ECCO.

Meanwhile, JISC Collections are planning to release a platform entitled JISC Historic Books, which makes licenced versions of EEBO and ECCO available to UK Higher Education users.

And finally, the Universities of Michigan and Oxford are heading the Text Creation Partnership (TCP), which is methodically working its way through releasing full-text versions of EEBO, ECCO and other resources. These versions are available online, and are also being harvested out to sites like 18th Century Connect.

So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.

What’s more, there have been some initial discussions about introducing crowdsourcing techniques to some of these licensed versions; allowing permitted users to transcribe and interpret the original historical documents. But of course this crowdsourcing would happen on different platforms with different communities, who may interpret and transcribe the documents in different ways. This could lead to the tricky problem of different digital versions of the corpus. Rather than there being one EEBO, several EEBOs exist.

Variant editions are indeed a worrisome prospect, but I don't think that it's unique to projects created through crowdsourcing. In fact, I think that the mechanism of producing crowdsourced editions actually reduces the possibility for variants to emerge. Dunning and I corresponded briefly over Twitter, then I wrote this comment to the JISC Digitization blog. Since that blog seems to be choking on the mark-up, I'll post my reply here:
benwbrum Reading @alastairdunning's post connecting crowdsourcing to variant editions: Feel like Wikipedia solved this years ago.

benwbrum If you don't publish (i.e. copy) a "final" edition of a crowdsourced transcription, you won't have variant "final" versions.

benwbrum The wiki model allows linking to a particular version of an article. I expanded this to the whole work: link

alastairdunning But does that work with multiple providers offering restricted access to the same corpus sitting on different platforms?

alastairdunning ie, Wikipedia can trace variants cause it's all on the same platform; but there are multiple copies of EEBO in different places

benwbrum I'd argue the problem is the multiple platforms, not the crowdsourcing.

alastairdunning Yes, you're right. Tho crowdsourcing considerably amplifies the problem as the versions are likely to diverge more quickly

benwbrum You're assuming multiple platforms for both reading and editing the text? That could happen, akin to a code fork.

benwbrum Also, why would a crowd sourced edition be restricted? I don't think that model would work.

I'd like to explore this a bit more. I think that variant editions are less likely in a crowdsourced project than in a traditional edition, but efforts to treat crowdsourced editions in a traditional manner can indeed result in the situation you warn against.

When we're talking about crowdsourced editions, we're usually talking about user-generated content that is produced in collaboration with an editor or community manager. Without exception, this requires some significant technical infrastructure -- a wiki platform for transcribing free-form text or an even more specialized tool for transcribing structured data like census records or menus. For most projects, the resulting edition is hosted on that same platform -- the Bentham wiki, which displays the transcriptions for scholars to read and analyze, is the same tool that volunteers use to create the transcriptions. This kind of monolithic platform does not lend itself to the kind of divergence you describe: copies of the edition are always dated as soon as they are separated from the production platform, and making a full copy of the production platform requires a major rift among the editors and volunteer community. These kinds of rifts can happen--in my world of software development, the equivalent phenomenon is a code fork--but they're very rare.

But what about projects which don't run on a monolithic platform? There are a few transcription projects in which editing is done via a wiki (Scripto) or webform (UIowa) but the transcriptions are posted to a content management system. There is indeed potential for the "published" version on the CMS to drift from the "working" version on the editing platform, but in my opinion the problem lies not in crowdsourcing, but in the attempt to impose a traditional publishing model onto a participatory project by inserting editorial review in the wrong place:

Imagine a correspondence transcription project in which volunteers make their edits on a wiki but the transcriptions are hosted on a CMS. One model I've seen often involves editors taking the transcriptions from the wiki system, reviewing and editing them, then publishing the final versions on the CMS. This is a tempting workflow -- it makes sense to most of us both because the writer/editor/reader roles are clearly defined and because the act of copying the transcription to the CMS seems analogous to publishing a text. Unfortunately, this model fosters divergence between the "published" edition and the working copy as volunteers continue to make changes to the transcriptions on the wiki, sometimes ignoring changes made by the reviewer, sometimes correcting text regardless of whether a letter has been pushed to the CMS. The alternative model has reviewers make their edits within the wiki system itself, with content pushed to the CMS automatically. In this model, the wiki is the system-of-record; the working copy is the official version. Since the CMS simply reflects the production platform, it does not diverge from it. The difficulty lies in abandoning the idea of a final version.
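A minimal sketch of that alternative model, with the wiki store and the CMS mocked as local directories (every path and file name here is invented; a real deployment would drive the wiki and CMS APIs instead of copying files):

```shell
#!/bin/sh
# Sketch of the "wiki as system-of-record" model. wiki_pages/ stands in
# for the wiki's current revisions; cms_public/ stands in for the CMS.
set -e
mkdir -p wiki_pages cms_public

# A volunteer saves a transcription on the wiki...
echo "Dear Mother, the weather here is fine." > wiki_pages/letter-001.txt
# ...and a reviewer later corrects it -- still on the wiki.
echo "Dear Mother, the weather here is foul." > wiki_pages/letter-001.txt

# The sync step: the CMS is overwritten from the wiki on every change
# and never edited directly, so the two copies cannot diverge.
cp -f wiki_pages/*.txt cms_public/
```

The essential property is the direction of the arrow: edits only ever flow wiki-to-CMS, so there is no "final" copy to drift away from the working one.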

It's not at all clear to me how EEBO or ECCO are examples of crowdsourcing, rather than traditional restricted-access databases created and distributed through traditional means, so I'm not sure that they're good examples.


Ryan Baumann said...

The way we've handled this on the Integrating Digital Papyrology project is to have the editing environment use Git (intimately) as its data backend. Git's "directed graph of objects" abstraction allows for a lot of things to be done easily that would be quite difficult with older version control systems or storing full revisions in e.g. an RDBMS (the typical wiki model). For example, "forking" and then sharing revisions or merging between forks is much easier (we actually treat every submission of an emendation as a fork, so forking/merging happens hundreds of times a day for us). We do exert editorial control over what changes we publish in our public repository, but the nice thing with Git is that people can copy this repository, make their own changes, and still pull in new updates from our repository or submit their changes back without the typical overhead these tasks would require. We actually take advantage of this ourselves, as not all the changes are made through the online editor, e.g. batch XML changes can be made offline and merged in. The "frontend" search and display environment is also completely separate from the editor - similar to your wiki/CMS example. Though it can lag behind, we've been working towards automating its updates from the Git repository so that when changes are pushed to it they're reflected live there much more quickly.
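The clone, emend, and merge-back cycle Ryan describes can be sketched with stock Git commands; the repository names, identities, and file contents below are invented for illustration and are not IDP's actual layout:

```shell
#!/bin/sh
# Sketch: a public repository that contributors copy, change, and
# submit back to, with the editors merging and republishing.
set -e
git init -q --bare public.git

# The editors seed the public repository with one text.
git clone -q public.git editor 2>/dev/null
git -C editor config user.email editor@example.org
git -C editor config user.name Editor
echo "line 1 of papyrus" > editor/p001.xml
git -C editor add p001.xml
git -C editor commit -qm "initial transcription"
git -C editor push -q origin HEAD

# A contributor "forks" by cloning, and emends the text...
git clone -q public.git contributor
git -C contributor config user.email vol@example.org
git -C contributor config user.name Volunteer
echo "line 1 of papyrus, emended" > contributor/p001.xml
git -C contributor commit -qam "emendation"

# ...and the editors pull the contribution back and republish it.
git -C editor fetch -q ../contributor HEAD
git -C editor merge -q --ff-only FETCH_HEAD
git -C editor push -q origin HEAD
```

Because every copy carries the full history graph, "pulling in new updates" and "submitting changes back" are both just merges between clones, which is what makes the hundreds-of-forks-a-day workflow tractable.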

I think the Git model can extend to collaborations between partner projects quite well, provided they can agree to use Git as the core "monolithic" platform for data storage. Of course, this doesn't really address some of the social/scholarly trust issues raised in the JISC post, but I would argue that those issues have always been present, and the new models just force you to realize it. The idea that before the new digital resources you could simply trust a single edition of something is probably inaccurate for most works - and where it increases in accuracy is due to the production of editions which incorporate multiple versions in themselves. I think digital publication is actually analogous to this - would you trust a printed "critical edition" of a text which omitted significant known variants? Perhaps for some purposes, but for others, the publication of the revisions available is important in itself. Digital publication allows us to be more explicit about the provisional nature of our work, but it can also allow us to record and publish these revisions in stable, accurate, machine-actionable ways.

Ben W. Brumfield said...

That's fascinating, Ryan! As a software developer, I'm tempted to analogize these editing projects to the models I'm familiar with, equating the EEBO situation with shipped/installed software which must be serviced in its multiple versions and the concerns about variant editions as similar to concerns about code forks. I never quite know how well these analogies apply, however, as I've never worked on a traditional editing project.

It's intriguing that you've embraced Git--with its elegant solutions for managing code forks--as a solution for editorial collaboration. Are users able to see the forking behind the scenes and use GitHub's social features to collaborate, or is that all invisible to them? (That's one crazy fork network graph, by the way -- apparently there's a summer program in Vienna making a bunch of changes at once.)

Although you say it's not automated, it's not quite clear to me from your comment whether the front-end search tool is a true copy of the Git repo, or whether some editorial massaging is exerted on data when it's moved into the public tool. It's this latter practice I find problematic.

Is there any sense in which IDP has "final" versions, or is the repository network the authoritative version?

Unknown said...

The "published" version in the interface comes straight from the "mothership" git repository. The mothership is just our nickname for the repo we use to merge changes incoming from the editor and from github and to push those merged changes back. There is no additional data massaging at all, unless the merge happens to result in a conflict, in which case that's fixed. Various processes are run to populate our RDF triple store and search index, and to generate the HTML for the site, but those are all automatic.

The goal is for all of this to be completely automated in the near future, so that changes made in the editor become visible and searchable in short order.
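One common shape for the automation Hugh describes is a post-receive hook on the central repository; the repository name, paths, and rebuild commands below are placeholders for illustration, not IDP's actual pipeline (which also repopulates an RDF triple store and a search index):

```shell
#!/bin/sh
# Sketch: a bare central repo whose post-receive hook republishes the
# site after every push, so "published" always mirrors the repository.
set -e
git init -q --bare mothership.git

cat > mothership.git/hooks/post-receive <<'EOF'
#!/bin/sh
# Runs once after each push, with the refs already updated.
GIT_WORK_TREE=/srv/idp-checkout git checkout -f master
# generate-site /srv/idp-checkout /srv/www    # placeholder rebuild step
# reindex-search /srv/idp-checkout            # placeholder
EOF
chmod +x mothership.git/hooks/post-receive
```

With the rebuild hanging off the push itself, the public view can lag by only the length of one rebuild, rather than waiting on a manual export.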

Ryan Baumann said...

Right, as Hugh highlights, there can be genuine merge conflicts which need to be resolved, which is the only real additional editorial massaging that happens for the public tool (and these conflict resolutions also get pushed back to the editing environment and GitHub). Typically for us these are more technical merge conflicts, rather than textual conflicts - i.e. we change the way something is semantically encoded in XML across the corpus, but in the meantime someone has made a change to a text elsewhere on the same line as an occurrence of the changed encoding.

We've tried to structure the editing environment workflow to be as transparent as possible about what sort of editorial interventions get made. Actually each user gets their own copy of the repository, and work occurs on branches which are automatically created for them - then when they submit something, this branch gets copied to an editorial board's repository for voting. Once it's been accepted, it gets copied to the "finalizer" (a random member of the editorial board), who incorporates any fixes or changes suggested during voting, and oversees the final merge into the "master" branch of the core repository. Here is what a typical submission of a new text from a user looks like after the process (we treat each "save" in the editor as separate commits in the user's branch/repository, but during finalization these are "flattened" into a single commit for clarity). You can see here that the committer (finalizer) is different from the author, and subsequent commits show the interventions he made during the finalization process (many of these, such as renames and header updates, are automated through the editing environment).

Most of the Git backend for the online editing environment is transparent to the users, but we could also expose some of this in the future for more technically advanced users. For now, if there are technically advanced users who wish to work directly with the XML files en masse, I think we would be open to them making submissions using GitHub's social features (i.e. fork/pull request) against the public GitHub repository.

Anonymous said...

Sometimes forking will be necessary because different users need different transcription conventions or markup. For example, the way genealogists transcribe documents is often very different from the way academic historians do it - not necessarily objectively worse, but just suited to different needs. The requirement for XML tags to be properly nested means that markup for radically different purposes will most likely have to fork.

I suspect that a lot of concerns raised about supposed problems of digitization are straw men that have more to do with power, exclusion and conservatism. For example a notoriously reactionary British historian recently claimed that digitization is wrong because the true meaning of a manuscript can only be known if you're looking at the original in an archive. That's suspiciously similar to the medieval Catholic Church's objections to a vernacular bible.

(I'm looking forward to finishing my 'proper' monograph so I can get back into more digital things. I've just acquired a whole box of WW1 letters that would go nicely on FromThePage.)