Wednesday, December 7, 2011

Developments in Wikisource/ProofreadPage for Transcription

Last year I reviewed Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the policy prohibiting any text not already published on paper ruled out its use for manuscripts.

I'm pleased to report that this policy has been softened. About a month ago, NARA began partnering with the Wikimedia Foundation to host material, including manuscripts, on Wikisource.  While I was at MCN, I discussed this with Katie Filbert, the president of Wikimedia DC, who set me straight.  Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.

Once you host a project on Wikisource, what do you do with it?  Andie, Rob, and Gaurav over at the blog So You Think You Can Digitize? (it's worth your time to read at least the last six posts) have been writing on exactly that subject.  Their most recent post describes their experience with Junius Henderson's Field Notes, and although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the Wikisource software, I'd like to concentrate on a detail:
What we currently want is a no-cost, minimal effort system that will make scans AND transcriptions AND annotations available, and that can facilitate text mining of the transcriptions.  Do we have that in WikiSource?  We will see.  More on annotations to follow in our next post but some father to a sister of some thoughts are already percolating and we have even implemented some rudimentary examples.
This is really exciting stuff.  They're experimenting with wiki mark-up of the transcriptions with the goal of annotation and text-mining.  I tried to do this back in 2005, but abandoned the effort because I could never figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that presented manuscript pages and their transcribed text.  The lack of wiki-linking was also one of my criticisms most taken to heart by the German Wikisource community last October.

So how is the mark-up working out?  Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.  So the text "English sparrows" in the transcription is annotated as [[:w:Passer domesticus|English sparrows]], which is wiki-speak for "link the text 'English sparrows' to the Wikipedia article 'Passer domesticus'". Wikipedia's redirects then send the browser off to the article "House Sparrow".
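Since text-mining the transcriptions is one of their stated goals, links in this form are easy to harvest mechanically. Here's a minimal sketch in Python; the regex and function names are my own illustration, not anything from the project's actual tooling:

```python
import re

# Matches the [[:w:Target|display text]] cross-wiki link syntax described above.
INTERWIKI_LINK = re.compile(r"\[\[:w:([^|\]]+)\|([^\]]+)\]\]")

def extract_annotations(wikitext):
    """Return (wikipedia_article, display_text) pairs found in transcribed wikitext."""
    return INTERWIKI_LINK.findall(wikitext)

sample = "Saw several [[:w:Passer domesticus|English sparrows]] near the creek."
print(extract_annotations(sample))
# [('Passer domesticus', 'English sparrows')]
```

A pass like this over a whole notebook would yield a crude concordance of annotated subjects, which is exactly the raw material text-mining needs.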

So far so good.  The only complaint I can make is that, so far as I can tell, cross-wiki links don't appear in the "What links here" tool on Wikipedia, either for Passer domesticus or for House Sparrow.  This means that the annotation can't provide an indexing function: users can't see all the pages that reference possums, nor read a selection of those pages.  I'm not sure that the cross-wiki link data isn't tracked, however; I just can't see it in the UI.  Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the "Global file usage" section of the sparrow image, for example.  Perhaps there is an API somewhere that the Henderson Field Notes project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.
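If such an API does exist, querying it would be straightforward. The sketch below assumes the GlobalUsage extension's prop=globalusage module on the Commons API endpoint; both the endpoint and the parameter names should be checked against the live API documentation before relying on them:

```python
from urllib.parse import urlencode

# Build (but don't send) a request URL for the "Global file usage" data,
# assuming the GlobalUsage extension's API module is available on Commons.
def global_usage_query(file_title):
    """Return the API URL that would list every wiki page using a given file."""
    params = {
        "action": "query",
        "prop": "globalusage",
        "titles": file_title,
        "format": "json",
    }
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)

print(global_usage_query("File:Sparrow.jpg"))
```

Fetching that URL with any HTTP client would return JSON listing the pages across Wikimedia projects that embed the file, which is precisely the indexing function missing for cross-wiki text links.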

Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.  I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!


Dominic McDevitt-Parks said...

Hi Ben,

I'm not sure if you have heard of the National Archives' Citizen Archivist Dashboard yet (see, for example, this article), so you might be interested to hear about it. As part of the dashboard's project to crowdsource transcription of documents, we'll be using both a new in-house transcription pilot project as well as Wikisource. The dashboard page for Wikisource will send members of the public who are interested in transcribing directly to pages on Wikisource for them to work on, with introductory material at the top to get them started (example). It's set to be rolled out later this month.

Dominic McDevitt-Parks

Gaurav said...

Hi Ben,

That's an excellent summary of where we're headed next with the Henderson Field Notes project! The problem of interwiki links not being in the database is definitely an issue for us, but I think we can "hack" out a solution within MediaWiki. One possibility would be to use Categories (e.g. {{species name|Panthera tigris}} resolves as an interwiki link to Wikipedia while also adding the current page to [[Category:Panthera tigris]], a lovely counterpart to Category:Panthera tigris on the Commons!). Note that this should add the category to both the individually transcribed page as well as "display" pages, such as our Notebook 1 page, which would be excellent.
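To make the idea concrete, here is a toy illustration of the wikitext such a template might expand to. This is a sketch of the concept only; the real template would be written in wiki markup, and its name and behavior are still to be worked out:

```python
# Simulate the proposed {{species name|...}} template expansion: one template
# call yields both a cross-wiki link to Wikipedia and a local category tag.
def species_name(binomial):
    """Return the wikitext a {{species name|binomial}} call might generate."""
    link = f"[[:w:{binomial}|{binomial}]]"
    category = f"[[Category:{binomial}]]"
    return link + category

print(species_name("Panthera tigris"))
# [[:w:Panthera tigris|Panthera tigris]][[Category:Panthera tigris]]
```

Because the category half of the expansion is a local link, it would show up in MediaWiki's category listings, restoring the indexing function that bare interwiki links lack.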

I've set up a Portal on the English Wikisource to develop and experiment with the best ways of doing this. If you have time, I hope you'll be able to join in!

Gaurav said...

Dominic: That looks awesome! I'd love to see more Wikipedians -- and not just members of the public -- migrating to WikiSource to transcribe, annotate and illustrate texts! Do you know if there is a single category on the Commons for "documents which need transcription"?

Dominic, Ben: will you be attending GLAMcamp DC 2012?

Ben W. Brumfield said...


Thanks for your comment. I became aware of the NARA project just a day or two before I discussed Wikisource partnerships with Katie, so I wasn't able to process much about what's going on. Correct me if I'm wrong, but I gather that your vision is centered around exposing NARA documents to the public and engaging with volunteers via transcription, rather than a more general Flickr-for-documents model where you'll be hosting any manuscript uploaded by the public, right?

I'm interested to hear that you're looking at both an in-house tool and Wikisource. I'd love to cover your tool, your reasons for building one, and your plans for its future on this blog. Is there any publicly available information--ideally technical--on the transcription pilot?

Ben W. Brumfield said...


I'm glad I didn't offend anybody with my concerns. You guys are doing some amazing work!

I really hope you can hack something out on MediaWiki, not least because it seems to be the most vibrant platform out there (aside from FromThePage, of course!), with cool features like Publish-On-Demand integration baked in. Based on my own experience with FromThePage and the Julia Brumfield Diaries, I'd encourage you to ask these questions:

1) If links are used for indexing, what should the scope of the indexed material be? In other words, if Henderson mentions "Shiloh, TN", should browsing that entry pull in all references within the Henderson Field Notes [obviously yes], all of those plus other naturalists' field notes [probably yes], all naturalists' field notes plus mentions of Shiloh, TN within Civil War diaries of the battle of Shiloh [maybe], or all mentions of Shiloh in any Wikimedia project context anywhere? In FromThePage, I've chosen to limit the scope of subject indexes to a "collection", which is an arbitrary aggregation of works. Nevertheless, I'm not at all sure that this was the right decision, so am eagerly watching other projects' design decisions.

2) Is there a chance you could use/create a new namespace-like-thingy within Wikisource without having to jump directly to Wikipedia? I didn't understand namespaces at all when I was working with MediaWiki in 2005, and am still on unsure footing when I speak of them. As a result, I hope that other MediaWiki projects will grapple with them, since the Page: namespace was the solution to my transcription vs. annotation differentiation problem.

3) How much control do you need over the record of page-to-subject links? This level of indirection forms the core of FromThePage, and most of the features that differentiate it from other transcription tools spring from exploiting the very simple page_id, subject_id, link_text data model.
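In schematic form, that model is nothing more than a link table. A toy version, with illustrative names rather than FromThePage's actual schema:

```python
# Each row records that a page links to a subject via a given stretch of text.
links = [
    {"page_id": 1, "subject_id": 10, "link_text": "English sparrows"},
    {"page_id": 2, "subject_id": 10, "link_text": "sparrows"},
    {"page_id": 2, "subject_id": 11, "link_text": "Shiloh, TN"},
]

def pages_for_subject(subject_id):
    """The indexing function: every page that references a subject."""
    return sorted({row["page_id"] for row in links if row["subject_id"] == subject_id})

print(pages_for_subject(10))
# [1, 2]
```

Because the link text is stored separately from the subject, two different phrasings on two different pages can resolve to the same subject, and the index query above still finds both.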

One last thing -- you didn't include a link to your portal. I'm very interested in following this project, and would love to try it out.

Ben W. Brumfield said...


So far, my conference attendance in 2012 is limited to the American Historical Association meeting in Chicago in January and IMLS's WebWise in Baltimore in February/March. I'd hoped to attend RootsTech in Salt Lake City in early February, but that's lost out to WebWise. I will certainly be "around" during SXSWi next year, participating in affiliated events but not attending the conference proper.

Gaurav said...

Hi Ben,

Aw, shucks - I don't think I'll get to go to any of those conferences! The WikiSource Portal I've created is at . I *love* FromThePage; we only went with the WikiMediaverse because it already has a large, active community, an easy upload interface (the Commons), a brand name which will get people excited, and because you don't have to maintain your own servers!

If you don't mind, I've moved your comment to the Portal's talk page: please feel free to edit, delete, or respond to it as you like! I'll try to reply to your questions by tomorrow at the latest.

Eliud said...

I have a musical manuscript that a small group will collaborate on transcribing. I've never done this before. What would you suggest as a good starting point for software and workflow? Thanks so much.


Ben W. Brumfield said...

Eliud, why don't you send me a note at explaining more about your project. I'd be happy to provide any advice that I can.

Denise Levenick said...

Ben, I'm still looking for a way to crowdsource my family history transcription project and am following the progress of open source tools. What should I be watching this year?