Collaborative Manuscript Transcription: Code and Conversations in 2013

It's often hard to explain what it is that I do, so perhaps a list of what I did will help. Inspired by Tim Sherratt's "talking" and "making" posts at the end of 2012, here's my 2013.

Code

I work on a number of software projects, whether as contract developer, pro bono "code fairy", or product owner.

FromThePage

It's been a big year for FromThePage, my open-source tool for manuscript transcription and annotation. We started work upgrading the tool to Rails 3, and built a TEI Export (see discussion on the TEI-L) and an exploratory Omeka integration. Several institutions (including University of Delaware and the Museum of Vertebrate Zoology) launched trials on FromThePage.com for material ranging from naturalist field notes to Civil War diaries. Pennsylvania State University joined the ranks of on-site FromThePage installations with their "Zebrapedia", transcribing Philip K. Dick's Exegesis -- initially as a class project and now as an ongoing work of participatory scholarship.

One of the most interesting developments of 2013 was that customizations and enhancements to FromThePage were written into three grant applications. These enhancements--if funded--would add significant features to the tool, including Fedora integration, authority file import, redaction of transcripts and facsimiles, and support for externally-hosted images. All these features would be integrated into the FromThePage source, benefiting everybody.

Two other collaborations this year promise interesting developments in 2014. The Duke Collaboratory for Classics Computing (DC3) will be pushing the tool to support 19th-century women's travel diaries and Byzantine liturgical texts, both of which require more sophisticated encoding than the tool currently supports. (Expect Unicode support by Valentine's Day.) The Austin Fanzine Project will be using a new EAC-CPF export which I'll deliver by mid-January.

OpenSourceIndexing / FreeREG 2

Most of my work this year has been focused on improving the new search engine for the twenty-six million church register entries the FreeREG organization has assembled in CSV files over the last decade and a half. In the spring, I integrated the parsed CSV records into the search engine and converted our ORM to Mongoid. I also launched the Open Source Indexing GitHub page to rally developers around the project and began collecting case studies from historical and genealogical organizations.

In May, I added a parser for historical dates to the search engine I'm building for FreeREG. It handles split dates like "4 Jan 1688/9" and illegible date portions in UCF (Uncertain Character Format) like "4 Jan 165_", preserving the verbatim transcription while still searching and sorting the records correctly. Eventually I'll incorporate this into an antique_date gem for general use.
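As a rough illustration of the idea (not the FreeREG code itself), a parser along these lines might keep the verbatim text next to a range of possible years. Everything named here (ParsedDate, DATE_PATTERN, parse_antique_date) is hypothetical, and the sketch assumes dates of the form "day month year", with only the underscore wildcard, in the year only.

```ruby
require "date"

# Hypothetical value object: the verbatim text plus a searchable year range.
ParsedDate = Struct.new(:verbatim, :day, :month, :earliest_year, :latest_year) do
  # Sort on the earliest date the transcription could represent.
  def sort_key
    [earliest_year, month, day]
  end

  # True when the given year falls inside the range the transcription allows.
  def matches_year?(year)
    (earliest_year..latest_year).cover?(year)
  end
end

# Day, three-letter month, four-character year (digits or "_"), optional "/9" suffix.
DATE_PATTERN = /\A(\d{1,2})\s+([A-Za-z]{3})\s+(\d{2}[\d_]{2})(?:\/(\d{1,2}))?\z/

def parse_antique_date(text)
  match = DATE_PATTERN.match(text.strip)
  return nil unless match

  day = match[1].to_i
  month = Date::ABBR_MONTHNAMES.index(match[2].capitalize)
  year_text = match[3]
  suffix = match[4]
  return nil if month.nil?
  # This sketch does not handle a split year that is also partly illegible.
  return nil if suffix && year_text.include?("_")

  # An underscore stands for any digit, so the year becomes a range.
  earliest = year_text.tr("_", "0").to_i
  latest = year_text.tr("_", "9").to_i

  if suffix
    # "1688/9" is 1688 Old Style and 1689 New Style; use the New Style year.
    keep = year_text.length - suffix.length
    new_style = (year_text[0, keep] + suffix).to_i
    new_style += 10**suffix.length if new_style < earliest # e.g. "1699/00"
    earliest = latest = new_style
  end

  ParsedDate.new(text, day, month, earliest, latest)
end
```

With this sketch, "4 Jan 165_" matches any search year from 1650 through 1659 and sorts ahead of "4 Jan 1688/9", which is treated as January 1689, while both records still display exactly as transcribed.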

Most of the fall was spent adding GIS search capabilities to the search engine. In fact, my last commit of the year added the ability to search for records within a radius of a place. The new year will bring more developments on GIS features, since an effective and easy interface to a geocoded database is just as big a challenge as the geocoding logic itself.
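To make the radius search concrete, here is a minimal in-memory sketch using the haversine great-circle formula. It is an assumption-laden illustration rather than the search engine's implementation: the Place struct, the method names, and the approximate coordinates are all hypothetical, and a production system would more likely lean on a geospatial index in the database.

```ruby
EARTH_RADIUS_KM = 6371.0 # mean Earth radius

# Hypothetical record type: a place name with decimal-degree coordinates.
Place = Struct.new(:name, :lat, :lon)

# Great-circle distance in kilometres between two latitude/longitude points.
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Returns the places whose distance from the center is within radius_km.
def places_within(places, center, radius_km)
  places.select do |place|
    haversine_km(center.lat, center.lon, place.lat, place.lon) <= radius_km
  end
end

# Approximate coordinates, for illustration only.
london = Place.new("London", 51.5074, -0.1278)
oxford = Place.new("Oxford", 51.7520, -1.2577)
york   = Place.new("York", 53.9600, -1.0873)

nearby = places_within([london, oxford, york], london, 100)
puts nearby.map(&:name).join(", ")
```

A linear scan like this only suits small place lists, which is part of why the interface and the index matter as much as the distance arithmetic.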

Other Projects

In January I added a command-line wrapper to Autosplit, my library for automatically detecting the spine in a two-page flatbed scan and splitting the image into recto and verso halves. Besides making the tool more usable, this work added support for notebook-bound books, which must be split top-to-bottom rather than left-to-right.

For the iDigBio Augmenting OCR Hackathon in February, I worked on two exploratory software projects. HandwritingDetection (code, write-up) analyzes OCR text to look for patterns characteristically produced when OCR tools encounter handwriting. LabelExtraction (code, write-up) parses OCR-generated bounding boxes and text to identify labels on specimen images. To my delight, in October part of this second tool was generalized by Matt Christy at the IDHMC to illustrate OCR bounding boxes for the eMOP project's work tuning OCR algorithms for Early Modern English books.

In June and July, I started working on the Digital Austin Papers, contract development work for Andrew Torget at the University of North Texas. This was what freelancers call a "rescue" project, as the digital edition software had been mostly written but was still in an exploratory state when the previous programmer left. My job was to triage features, then turn off anything half-done and non-essential, complete anything half-done and essential, and QA and polish core pieces that worked well. I think we're all pretty happy with the results, and hope to push the site to production in early 2014. I'm particularly excited about exposing the TEI XML through the delivery system as well as via GitHub for bulk re-use.

Also in June, I worked on a pro bono project with the Civil War-era census and service records from Pittsylvania County, Virginia, which Jeff McClurken collected in his research. My goal is to make the PittsylvaniaCivilWarVets database freely available for both public and scholarly use. Most of the work remaining here is HTML/CSS formatting, and I'd welcome volunteers to help with that.

In November, I contributed some modifications to Lincoln Mullen's Omeka client for Ruby. The client should now support read-only interactions with the Omeka API for files, as well as being a bit more robust.

December offered the opportunity to spend a couple of days building a tool for reconciling multi-keyed transcripts produced from the NotesFromNature citizen science UI. One of the things this effort taught me was how difficult it is to find the corresponding transcripts to reconcile -- a very different problem from reconciliation itself. The project is over, but ReconciliationUI is still deployed on the development site.
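For readers unfamiliar with multi-keyed reconciliation, the sketch below shows one simple approach: a field-by-field majority vote that flags disagreements for a human arbiter. It is not how ReconciliationUI works internally. The reconcile method and the field names are hypothetical, and it assumes the hard part, grouping the transcripts of the same image, has already been done.

```ruby
# Reconciles several transcripts (hashes of field name => keyed value) of one image.
# A field is accepted when a strict majority of transcribers agree; otherwise it is
# flagged for review along with every candidate value that was keyed.
def reconcile(transcripts)
  fields = transcripts.flat_map(&:keys).uniq
  fields.each_with_object({}) do |field, result|
    # Normalize only surrounding whitespace; missing fields count as blank.
    values = transcripts.map { |transcript| transcript[field].to_s.strip }
    tally = values.tally # requires Ruby 2.7 or later
    value, count = tally.max_by { |_, n| n }
    agreed = count > values.size / 2.0
    result[field] = {
      value: agreed ? value : nil,
      needs_review: !agreed,
      candidates: tally.keys
    }
  end
end

# Three hypothetical keyings of the same specimen label.
keyings = [
  { "collector" => "J. Smith",  "date" => "4 May 1901" },
  { "collector" => "J. Smith ", "date" => "4 May 1907" },
  { "collector" => "J. Smith",  "date" => "14 May 1901" }
]

reconcile(keyings).each do |field, outcome|
  status = outcome[:needs_review] ? "needs review" : "agreed"
  puts "#{field}: #{status}"
end
```

Here the collector field is settled automatically, while the three conflicting dates are left for a person to arbitrate.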

Conversations

February 13-15 -- iDigBio Augmenting OCR Hackathon at the Botanical Research Institute of Texas. "Improving OCR Inputs from OCR Outputs?" (See below.)

February 26 -- Interview with Ngoni Munyaradzi of the University of Cape Town. See our discussion of his work with Bushman languages of southern Africa.

March 20-24 -- RootsTech in Salt Lake City. "Introduction to Regular Expressions"

April 24-28 -- International Colloquium Itinera Nova in Leuven, Belgium. "Itinera Nova in the World(s) of Crowdsourcing and TEI".

May 7-8 -- Texas Conference on Digital Libraries in Austin, Texas. I was so impressed with TCDL when Katheryn Stallard and I presented in 2012 that I attended again this year. While I was disappointed to miss Jennifer Hecker's presentation on the Austin Fanzine Project, I was so impressed with Nicholas Woodward's talk in the same time slot that I talked him into writing it up as a guest post.

May 22-24 -- Society of Southwestern Archivists Meeting in Austin, Texas. On a fun panel with Jennifer Hecker and Micah Erwin, I presented "Choosing Crowdsourced Transcription Platforms"

July 11-14 -- Social Digital Scholarly Editing at the University of Saskatchewan. A truly amazing conference. My talk: "The Collaborative Future of Amateur Editions".

July 16-20 -- Digital Humanities at the University of Nebraska, Lincoln. Panel "Text Theory, Digital Document, and the Practice of Digital Editions". My brief talk discussed the importance of blending theoretical rigor with good usability in editorial tools.

July 23 -- Interview with Sarah Allen, Presidential Innovation Fellow at the Smithsonian Institution. Sarah's notes are at her blog Ultrasaurus under the posts "Why Crowdsourced Transcription?" and "Crowdsourced Transcription Landscape".

September 12 -- University of Southern Mississippi. "Crowdsourcing and Transcription". An introduction to crowdsourced transcription for a general audience.

September 20 -- Interview with Nathan Raab for Forbes.com. Nathan and I had a great conversation, although his article "Crowdsourcing Technology Offers Organizations New Ways to Engage Public in History" was mostly finished by that point, so my contributions were minor. His focus on the engagement and outreach aspects of crowdsourcing and its implications for fundraising is one to watch in 2014.

September 25 -- Wisconsin Historical Society. "The Crowdsourced Transcription Landscape". Same presentation as USM, with minor changes based on their questions. Contents: 1. Methodological and community origins. 2. Volunteer demographics and motivations. 3. Accuracy. 4. Case study: Harry Ransom Center Manuscript Fragments. 5. Case study: Itinera Nova at Stadsarchief Leuven.

September 26-27 -- Midwest Archives Conference Fall Symposium in Green Bay, Wisconsin. "Crowdsourcing Transcription with Open Source Software". 1. Overview: why archives are crowdsourcing transcription. 2. Selection criteria for choosing a transcription platform. 3. On-site tools: Scripto, Bentham Transcription Desk, NARA Transcribr Drupal Module, Zooniverse Scribe. 4. Hosted tools deep-dive: Virtual Transcription Laboratory, Wikisource, FromThePage.

October 9-10 -- THATCamp Leadership at George Mason University. In "Show Me Your Data", Jeff McClurken and I talked about the issues that have come up in our collaboration to put online the database he developed for his book, Take Care of the Living. See my summary or the expanded notes.

November 1-2 -- Texas State Genealogy Society Conference in Round Rock, Texas. To explore the public's interest in transcribing their own family documents, I set up as an exhibitor, striking up conversations with attendees and demoing FromThePage. The minority of attendees who possessed family papers were receptive, and in some cases enthusiastic, about producing amateur editions. Many of them had already scanned their family documents and were wondering what to do next. That said, privacy and access control were very big concerns -- especially with more recent material which mentioned living people.

November 7 -- THATCamp Digital Humanities & Libraries in Austin, Texas. Great conversations about CMS APIs and GIS visualization tools.

November 19-20 -- Duke University. I worked with my hosts at the Duke Collaboratory for Classics Computing to transcribe a 19th-century travel diary using FromThePage, then spoke on "The Landscape of Crowdsourcing and Transcription", an expansion of my talks at USM and WHS. (See a longer write-up and video.)

December 17-20 -- iDigBio Citizen Science Hackathon. Due to schedule conflicts, I wasn't able to attend this in person, but followed the conversations on the wiki and the collaborative Google docs. For the hackathon, I built ReconciliationUI, a Ruby on Rails app for reconciling different NotesFromNature-produced transcripts of the same image on the model of FamilySearch Indexing's arbitration tool.

2014

All these projects promise to keep me busy in the new year, though I anticipate taking on more development work in the summer and fall. If you're interested in collaborating with me in 2014--whether to give a talk, work on a software project, or just chat about crowdsourcing and transcription--please get in touch.

Tuesday, December 31, 2013
