Wednesday, July 21, 2010

Wikisource for Manuscript Transcription

Of the crowdsourcing projects that have real users doing manuscript transcription, one of the largest is an offshoot of Wikisource. ProofreadPage is a MediaWiki extension created around 2006 on the French-language Wikisource as a Wikisource/Internet Archive replacement for Project Gutenberg's Distributed Proofreaders. Contributors there were taking DjVu files from the Internet Archive and using them as sources (via OCR and correction) for Wikisource pages. This spread to the other Wikisource sites around 2008, radically changing the way Wikisource worked. More recently the German Wikisource has started using ProofreadPage for letters, pamphlets, and broadsheets.

The best example of ProofreadPage for handwriting is Winkler's Remarks on the Russian Campaign 1812-1813. First, the presentation is lovely. They've dealt with a typographically difficult text and are presenting alternate typefaces, illustrations, and even marginalia in a clear way in the transcription. The page numbers link to the images of the pages, and they've come up with transcription conventions which are clearly presented at the top of the text. This is impressive for a volunteer-driven edition!

Technically, the Winkler example illustrates ProofreadPage's solution to a difficult problem: how to organize and display pages, sections, and works in the appropriate context. This is not an issue that I've encountered with FromThePage—the Julia Brumfield Diaries are organized with only one entry per page—but I've worried about it since XML is so poorly suited to represent overlapping markup. When viewing Winkler as a work, paragraphs span multiple manuscript pages but are aggregated seamlessly into the text: search for "sind den Sommer" and you'll find a paragraph with a page-break in the middle of it, indicated by the hyperlink "[23]". Clicking on the page in which that paragraph begins shows the page and page image in isolation, along with footnotes about the page source and page-specific information about the status of the transcription. This is accomplished by programmatically stitching pages together into the work display while excluding page-specific markup via a noinclude tag.
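The stitching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ProofreadPage's actual code: the function name `stitch_pages`, the `(page_number, wikitext)` input format, and the sample text are all assumptions made for the example. It strips anything wrapped in noinclude tags from each page and joins the remaining text, marking each page break with a bracketed page number like the "[23]" in the Winkler text.

```python
import re

# Matches a <noinclude>...</noinclude> section, including any line breaks inside it.
NOINCLUDE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL)


def stitch_pages(pages):
    """Join page transcriptions into one running text.

    `pages` is a list of (page_number, wikitext) tuples in reading order.
    This input format is hypothetical; it is not how ProofreadPage stores pages.
    """
    parts = []
    for number, text in pages:
        # Drop page-specific material (headers, status notes, footers).
        body = NOINCLUDE.sub("", text).strip()
        # Mark the page break with a bracketed page number.
        parts.append(f"[{number}] {body}")
    return " ".join(parts)


# Invented sample pages: one paragraph split across two manuscript pages.
sample = [
    (22, "<noinclude>header for page 22</noinclude>first half of a paragraph"),
    (23, "second half of the same paragraph<noinclude>footer</noinclude>"),
]
print(stitch_pages(sample))
```

The page-specific headers and footers stay in the stored page text, so the single-page view can still show them, while the work view sees one continuous paragraph.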

But the transcription of Winkler also highlights some weaknesses I see in ProofreadPage. All annotation is done via footnotes which—although they are embedded within the source—are a far cry from the kind of markup we're used to with TEI or indeed HTML. In fact, aside from the footnotes and page numbers, there are no hyperlinks in the displayed pages at all. The inadequacies of this for someone who wants extensive text markup are highlighted by this personal name index page — it's a hand-compiled index! Had the tool (or its users) relied on in-text markup, such an index could be compiled by mining the markup. Of course, the reason I'm critical here is that FromThePage was inspired by the possibilities offered by using wiki-links within text to annotate, analyze, edit and index, and I've been delighted by the results.
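To make the "mining the markup" point concrete, here is a minimal sketch of compiling an index from wiki-links. It assumes transcriptions that use the ordinary `[[Target]]` and `[[Target|displayed text]]` link syntax. The function name `build_index`, the dictionary input, and the sample names are invented for illustration, and this is not FromThePage's implementation.

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|displayed text]]; group 1 is the link target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def build_index(pages):
    """Map each link target to the sorted list of pages that mention it.

    `pages` is a dict of {page_number: wikitext}; this format is hypothetical.
    """
    index = defaultdict(set)
    for number, text in pages.items():
        for match in WIKILINK.finditer(text):
            index[match.group(1).strip()].add(number)
    # Sort subjects alphabetically and page numbers numerically.
    return {subject: sorted(index[subject]) for subject in sorted(index)}


# Invented sample transcription, not taken from any real manuscript.
sample = {
    1: "Visited [[Anna Example|Anna]] and [[Carl Example]] today.",
    2: "[[Anna Example]] came by in the evening.",
}
for subject, page_numbers in build_index(sample).items():
    print(subject, page_numbers)
```

Because the displayed text after the pipe is ignored, "Anna" and "Anna Example" land under the same index entry, which is the kind of normalization a hand-compiled index has to do manually.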

When I originally researched ProofreadPage, one question perplexed me: why aren't more manuscripts being transcribed on Wikisource? A lot has happened since I last participated in the Wikisource community in 2004, especially within the realm of formalized rules. There is now a rule on the English, French, and German Wikisource sites banning unpublished work. Apparently the goal was to discourage self-promoters from using the site for their own novels or crackpot theories, and it's pretty drastic. The English language version specifies that sources must have been previously published on paper, and the French site has added "Ne publiez que des documents qui ont été déjà publiés ailleurs, sur papier" ("Only publish documents that have already been published elsewhere, on paper") to the edit form itself! It is a rare manuscript indeed that has already been published in a print form which may be OCRed but which is worth transcribing from handwriting anyway. As a result, I suspect that we're not likely to see much attention paid to transcription proper within the ProofreadPage code, short of a successful non-Wikisource MediaWiki/ProofreadPage project.

Aside from FromThePage (which is accepting new transcription projects!), ProofreadPage/MediaWiki is my favorite transcription tool. Its origins outside the English-language community and Wikisource community policy have obscured its utility for transcribing manuscripts, which is why I think it's been overlooked. It's got a lot of momentum behind it, and while it is still centered around OCR, I feel like it will work for many needs. Best of all, it's open-source, so you can start a transcription project by setting up your own private Wikisource instance.

Thanks to Klaus Graf at Archivalia for much of the information in this article.


Gavin Robinson said...

I looked into Wikisource a while ago but I've never tried doing anything with it. I don't really know much about DjVu files and it seemed to be a choice between that or manually uploading and linking each page image (but I guess that could probably be automated by a bot script using the MediaWiki API).

It's really weird that they don't use wiki links for annotating the text. I would've thought that would be a really obvious thing for a Wikimedia project to do. At Your Archives it's kind of the opposite: we use wikilinks a lot for linking to people, places, word definitions, etc., but there's no easy way of transcribing documents within the wiki. Document transcripts are usually done outside it then pasted in when they're finished.

Now that the UK National Archives allow uploading images to the web for non-commercial use there's a lot more scope for online transcription projects. The terms are a bit vague but they definitely allow and encourage uploads to Flickr. They still wouldn't be allowed on Wikimedia Commons because the terms are too restrictive.

Ben W. Brumfield said...

I guess the thing that really struck me about this is that so many of the limitations on MediaWiki's use are merely side-effects of Wikisource policy, rather than technical limitations. I also don't know anything about DjVu, but suspect that it would be quite easy to set up an alternative Wikisource system for unpublished manuscripts. It could probably even have the MediaWiki publish-on-demand feature enabled.

I suspect that one reason they don't use wikilinks for annotation is the same technical issue I ran into playing with MediaWiki in 2005: difficulty separating annotation pages from source text pages. However, even for the notable persons mentioned in the Winkler text, the annotations displayed by pop-ups are in-line in the source, rather than links to the relevant Wikipedia articles. I just don't get that.

Perian said...

Hi there! I just tried emailing you about an interesting transcription project I have that I wanted to talk to you about. Unfortunately, the email bounced back to me. Is there another email address I could try contacting you at?

Ben W. Brumfield said...

Certainly -- try

Finanzer said...

On German Wikisource there is no rule that unconditionally bans unpublished works. Otherwise you would not have found the example in your article. And we have some other unpublished works too. But you are right, there is only a very small number of these works in Wikisource.

Dovi said...

Hi, it's true that the English Wikisource is highly preoccupied (in my opinion to an unhealthy degree) with transcribing published material from the 19th and early 20th centuries.

I agree with you that Wikisource could be and should be an extraordinary environment for collaborative work transcribing manuscripts and publishing corrected editions of texts based on manuscript evidence. The extraordinary software lets you exhibit the ms along with the typed and edited text, and simple templates can allow discussions of editorial details including citation of the evidence in the mss to be hidden within the edit page.

If you are interested, exactly this kind of work is currently being done at the Hebrew Wikisource. See here for an example of a previously lost part of a book newly published from a manuscript:

At he.wikisource we also compare various old published editions and publish corrected editions of the classics based upon them. Everything is documented as in a scholarly critical edition. In fact, this helps open up the entire process of manuscript scholarship to the public in a way that was never possible before.

Mark Hershberger said...

For what it's worth, the extension used by Wikisource, ProofreadPage, now needs a maintainer. I posted about this here: