Wednesday, December 7, 2011

Developments in Wikisource/ProofreadPage for Transcription

Last year I reviewed Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the policy prohibiting any text not already published on paper ruled out its use for manuscripts.

I'm pleased to report that this policy has been softened. About a month ago, NARA started to partner with the Wikimedia Foundation to host material (including manuscripts) on Wikisource.  While I was at MCN, I discussed this with Katie Filbert, the president of Wikimedia DC, who set me straight.  Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.

Once you host a project on Wikisource, what do you do with it?  Andie, Rob and Gaurav over at the blog So You Think You Can Digitize? have been writing on exactly that subject, and it's worth your time to read at least their last six posts.  Their most recent post describes their experience with Junius Henderson's Field Notes. Although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the Wikisource software, I'd like to concentrate on a detail:
What we currently want is a no-cost, minimal effort system that will make scans AND transcriptions AND annotations available, and that can facilitate text mining of the transcriptions.  Do we have that in WikiSource?  We will see.  More on annotations to follow in our next post but some thoughts are already percolating and we have even implemented some rudimentary examples.
This is really exciting stuff.  They're experimenting with wiki mark-up of the transcriptions with the goal of annotation and text-mining.  I tried to do this back in 2005, but abandoned the effort because I never could figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that presented manuscript pages and their transcribed text.  The lack of wiki-linking was also the criticism of mine most taken to heart by the German Wikisource community last October.

So how is the mark-up working out?  Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.  So the text "English sparrows" in the transcription is annotated [[:w:Passer domesticus|English sparrows]], which is wiki-speak for: link the text "English sparrows" to the Wikipedia article "Passer domesticus". Wikipedia's redirects then send the browser on to the article "House Sparrow".
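For anyone who wants to text-mine transcriptions marked up this way, here is a minimal Python sketch of how the cross-wiki annotations might be pulled out of a page's wikitext. The function name and the regular expression are my own illustration, not anything the Henderson project or MediaWiki provides, and the pattern only handles the simple [[:w:Target|label]] form shown above.

```python
import re

# Matches [[:prefix:Target|label]] and [[:prefix:Target]].
# Illustrative pattern only: it ignores nested templates and other edge cases.
INTERWIKI_LINK = re.compile(
    r"\[\[:(?P<prefix>[a-z]+):(?P<target>[^\]|]+)(?:\|(?P<label>[^\]]*))?\]\]"
)

def extract_interwiki_links(wikitext, prefixes=("w",)):
    """Return (prefix, target, label) tuples for cross-wiki links in wikitext.

    `prefixes` restricts the result to the given interwiki prefixes
    ("w" is the prefix used for Wikipedia in the example above).
    """
    links = []
    for match in INTERWIKI_LINK.finditer(wikitext):
        if match.group("prefix") not in prefixes:
            continue
        target = match.group("target").strip()
        # A link without a pipe displays its target as the label.
        label = match.group("label") or target
        links.append((match.group("prefix"), target, label))
    return links

sample = "Saw several [[:w:Passer domesticus|English sparrows]] near camp."
print(extract_interwiki_links(sample))
```

Each tuple keeps both the wording in the manuscript (the label) and the subject it was annotated with (the target), which is the pairing a text-mining pass would need.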

So far so good.  The only complaint I can make is that, so far as I can tell, cross-wiki links don't appear in the "What links here" tool on Wikipedia, either for Passer domesticus or for House Sparrow.  This means that the annotation can't provide an indexing function: users can't see all the pages that reference possums, nor read a selection of those pages.  I'm not sure that the cross-wiki link data isn't tracked, however; I only know that I can't see it in the UI.  Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the "Global file usage" section of the sparrow image, for example.  Perhaps there is an API somewhere that the Henderson Field Notes project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.
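In the meantime, the missing index could be approximated outside the wiki. The sketch below assumes you already have the wikitext of each transcribed page in a dictionary keyed by page title; the page titles and text here are invented for illustration, and the approach is my own suggestion rather than anything the project has described.

```python
import re
from collections import defaultdict

# Captures the target of links written as [[:w:Target|label]] or [[:w:Target]].
WIKIPEDIA_LINK = re.compile(r"\[\[:w:([^\]|]+)(?:\|[^\]]*)?\]\]")

def build_subject_index(pages):
    """Map each linked Wikipedia title to the transcription pages that cite it.

    `pages` is a dict of {page_title: wikitext}. Page titles are listed
    in the order they are first encountered, without duplicates.
    """
    index = defaultdict(list)
    for title, wikitext in pages.items():
        for target in WIKIPEDIA_LINK.findall(wikitext):
            target = target.strip()
            if title not in index[target]:
                index[target].append(title)
    return dict(index)

# Hypothetical page titles and text, for illustration only.
pages = {
    "Notebook 1/Page 3": "Heard [[:w:Passer domesticus|English sparrows]] and saw [[:w:Opossum|possums]].",
    "Notebook 1/Page 9": "More [[:w:Opossum|possums]] by the creek.",
}
for subject, cited_on in build_subject_index(pages).items():
    print(subject, "->", cited_on)
```

This gives the "all the pages that reference possums" view, at the cost of having to rebuild the index whenever transcriptions change.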

Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.  I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!