Friday, October 25, 2013

Feature: TEI-XML Export

How do you get the data out?

This is a question I hear pretty often, particularly from professional archivists.  If an institution and its users have put the effort into creating digital editions on FromThePage, how can they pull the transcripts out of FromThePage to back them up, repurpose them, or import them into other systems?

This spring, I created an XHTML exporter that will generate a single-page XHTML file containing transcripts of a work's pages, their version history, all articles written about subjects within the work, and internally-linked indices between subjects and pages.  Inspired by conversations at the TEI and SDSE conferences and informed by my TEI work for a client project, I decided to explore a more detailed export in TEI.

These are the results, posted on GitHub for discussion:
https://gist.github.com/benwbrum/6933615
Zenas Matthews' Mexican War Diary was scanned and posted by Southwestern University's Smith Library Special Collections.  It was transcribed, indexed, and annotated by Scott Patrick, a retired petroleum worker from Houston.

https://gist.github.com/benwbrum/6933603
Julia Brumfield's 1919 Diary was scanned and posted by me, transcribed largely by volunteer Linda Tucker, and indexed and annotated by me.

I requested comment on the TEI mailing list (see the thread "Draft TEI Export from FromThePage"), and got a lot of really helpful, generous feedback both on- and off-list.  It's obvious that I've got more work to do for certain kinds of texts--which will probably involve creating a section header notation in my wiki mark-up--but I'm pretty pleased with the results.


One of the most exciting possibilities of TEI export is interoperability with other systems.  I'd been interested in pushing FromThePage editions to TAPAS, but after I posted the TEI-L announcement, Peter Robinson pulled some of the exports into Textual Communities.  We're exploring a way to connect the two systems, which might give editors the opportunity to do the sophisticated TEI editing and textual scholarship supported by Textual Communities starting from the simple UI and powerful indexing of FromThePage.   I can imagine an ecosystem of tools good at OCR correction, genetic mark-up, display and analysis of correspondence, amateur-accessible UIs, or preservation -- all focusing on their strengths and communicating via TEI-XML.
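
As a rough illustration of the kind of thing a downstream tool could do with a TEI export, here is a minimal Python sketch that rebuilds a subject-to-page index from a TEI document. It is only a sketch under stated assumptions: the sample document is invented, and the choice of `pb` elements for page breaks and `rs` elements with a `ref` attribute for subject references is an assumption for illustration, not a description of the actual export (see the gists above for that). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature document; the real export's structure may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Diary</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <pb n="1"/>
      <p>Went to <rs ref="#S1">town</rs> with <rs ref="#S2">Ben</rs>.</p>
      <pb n="2"/>
      <p>Rain all day. <rs ref="#S2">Ben</rs> stayed home.</p>
    </body>
  </text>
</TEI>"""


def work_title(tei_xml):
    """Return the title recorded in the TEI header, or None if absent."""
    root = ET.fromstring(tei_xml)
    return root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)


def subject_index(tei_xml):
    """Map each rs/@ref target to the page numbers where it is mentioned.

    Assumes pages are delimited by <pb n="..."/> milestones (an assumption).
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    index = {}
    page = None
    # iter() walks the body in document order, so the most recent <pb>
    # seen tells us which page the current reference falls on.
    for el in body.iter():
        tag = el.tag.split("}")[-1]  # strip the "{namespace}" prefix
        if tag == "pb":
            page = el.get("n")
        elif tag == "rs" and el.get("ref"):
            pages = index.setdefault(el.get("ref"), [])
            if page not in pages:
                pages.append(page)
    return index


if __name__ == "__main__":
    print(work_title(SAMPLE))
    print(subject_index(SAMPLE))
```

Nothing here depends on FromThePage itself, which is the point: any tool that reads TEI could pick up the transcripts and the subject references without knowing where they came from.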


Before I deploy the export feature on production, I'd welcome more suggestions: ways to improve the exports, new things to do with TEI, or other systems worth exploring integration with.
