Friday, October 25, 2013

Feature: TEI-XML Export

How do you get the data out?

This is a question I hear pretty often, particularly from professional archivists.  If an institution and its users have put the effort into creating digital editions on FromThePage, how can they pull the transcripts out of FromThePage to back them up, repurpose them, or import them into other systems?

This spring, I created an XHTML exporter that will generate a single-page XHTML file containing transcripts of a work's pages, their version history, all articles written about subjects within the work, and internally-linked indices between subjects and pages.  Inspired by conversations at the TEI and SDSE conferences and informed by my TEI work for a client project, I decided to explore a more detailed export in TEI.
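To make the idea of a TEI export concrete, here is a minimal sketch in Python of how page transcripts and subject links could be serialized as TEI-XML.  It is not FromThePage's actual exporter: the function name `build_tei`, the input shapes, and the choice of `div`, `pb`, `rs`, and a `list` in `back` are all assumptions made for illustration, the sample data is invented, and the output has not been validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag):
    """Return the namespace-qualified name for a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, pages, subjects):
    """Build a TEI document string (hypothetical structure, for illustration).

    pages:    list of (page_title, segments); each segment is either a plain
              string or a (text, subject_id) tuple marking a linked subject.
    subjects: dict mapping subject_id -> canonical subject name.
    """
    root = ET.Element(tei("TEI"))

    # Minimal header: title, publication statement, source description.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Illustrative export."
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = "Transcribed from page images."

    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))

    # One div per page, opened by a page-break marker.
    for number, (page_title, segments) in enumerate(pages, start=1):
        div = ET.SubElement(body, tei("div"), {"type": "page"})
        ET.SubElement(div, tei("pb"), {"n": str(number)})
        ET.SubElement(div, tei("head")).text = page_title
        para = ET.SubElement(div, tei("p"))
        last = None  # most recent child element, for mixed-content tails
        for segment in segments:
            if isinstance(segment, str):
                if last is None:
                    para.text = (para.text or "") + segment
                else:
                    last.tail = (last.tail or "") + segment
            else:
                linked_text, subject_id = segment
                last = ET.SubElement(para, tei("rs"), {"ref": f"#{subject_id}"})
                last.text = linked_text

    # Subject index in the back matter; xml:id values are the link targets.
    back = ET.SubElement(text, tei("back"))
    index_div = ET.SubElement(back, tei("div"), {"type": "index"})
    subject_list = ET.SubElement(index_div, tei("list"))
    for subject_id, name in subjects.items():
        item = ET.SubElement(subject_list, tei("item"), {f"{{{XML_NS}}}id": subject_id})
        item.text = name

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Invented sample data, not taken from any real edition.
    sample_pages = [("Page 1", ["Visited ", ("Aunt Mary", "S1"), " today."])]
    print(build_tei("Sample Diary", sample_pages, {"S1": "Mary Smith"}))
```

Keeping each subject reference as an `rs` element that points at an `xml:id` is one way to preserve the page-to-subject links that the XHTML export expresses as internal hyperlinks; a real export might prefer `persName` or `placeName` where the subject type is known.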

These are the results, posted on GitHub for discussion:
Zenas Matthews' Mexican War Diary was scanned and posted by Southwestern University's Smith Library Special Collections.  It was transcribed, indexed, and annotated by Scott Patrick, a retired petroleum worker from Houston.
Julia Brumfield's 1919 Diary was scanned and posted by me, transcribed largely by volunteer Linda Tucker, and indexed and annotated by me.

I requested comment on the TEI mailing list (see the thread "Draft TEI Export from FromThePage"), and got a lot of really helpful, generous feedback both on- and off-list.  It's obvious that I've got more work to do for certain kinds of texts--which will probably involve creating a section header notation in my wiki mark-up--but I'm pretty pleased with the results.

One of the most exciting possibilities of TEI export is interoperability with other systems.  I'd been interested in pushing FromThePage editions to TAPAS, but after I posted the TEI-L announcement, Peter Robinson pulled some of the exports into Textual Communities.  We're exploring a way to connect the two systems, which might give editors the opportunity to do the sophisticated TEI editing and textual scholarship supported by Textual Communities starting from the simple UI and powerful indexing of FromThePage.  I can imagine an ecosystem of tools good at OCR correction, genetic mark-up, display and analysis of correspondence, amateur-accessible UIs, or preservation -- all focusing on their strengths and communicating via TEI-XML.

I'm interested in more suggestions for ways to improve the exports, new things to do with TEI, or other systems worth exploring for integration before I deploy the export feature to production.

Sunday, October 20, 2013

A Gresham's Law for Crowdsourcing and Scholarship?

This is a comment I wanted to make at Neil Fraistat's "Participatory DH" session (proposal, notes) at THATCamp Leadership, but ended up making on Twitter instead.

Much of the discussion in the first half of the session focused on the qualitative difference between the activities we ask amateurs to do and the activities performed by scholars.  One concern voiced was that we're not asking "citizen scholars" to do real scholarly work, and then labeling their activity scholarship -- a concern I share with regard to editing.  If most crowdsourcing projects ask amateurs to do little more than wash test tubes, where are the projects that solicit scholarly interpretation?

The Harry Ransom Center's Manuscript Fragments Project is just such a crowdsourcing project, and I think the results may be disquieting.  In this project, fragments of medieval manuscripts reused as binding for printed books are photographed and posted on Flickr.  Volunteers use the comments to identify the fragments, discussing the scribal hand and researching the source texts. I'd argue that while this does not duplicate the full range of an academic medievalist's scholarly activities, it's certainly not just "bottle-washing" either.

The project has been very successful.  (See organizer Micah Erwin's talks for details.)  Most of the contributions to the project have been made on Flickr in the comments by a few "super volunteers" -- retired rare book dealers and graduate students among them.  However, around 20% of the identifications were made by professional medievalists who learned about the project, visited the Flickr site, and then called or emailed the project organizer.  None of their contributions were made on the public Flickr forum at all.

So why did professional scholars avoid contributing in public?  I related this on Twitter, and got some interesting suggestions.
Many of these suggest a sort of Gresham's Law of crowdsourcing, in which inviting the public to participate in an activity lowers that activity's status, driving out professionals concerned with their reputation. 

There's a more reassuring explanation as well -- many people with domain expertise still aren't very comfortable with technology.  Asking them to use a public forum puts additional pressure on them, as any mistakes typing, encoding, and using the forum will be public and likely permanent.  This challenge is not confined to professionals, either -- I receive commentary on the Julia Brumfield Diaries via email from people without high school degrees, who have no professional reputation to protect.