Friday, March 14, 2014

Wikilinks in FromThePage

From March 10-12, I got to participate in the iDigBio Original Sources Digitization Workshop, a gathering of natural history collections managers, archivists, and technologists. Although the focus of digitization within natural history has been on specimens or specimen labels, this workshop sought to address the challenges and opportunities involved in digitizing ledgers, field notes, and other non-specimen data. As usual for iDigBio events, the workshop was spectacular.

Carolyn Sheffield chaired a panel (video recording) on crowdsourcing which included Rob Guralnik discussing Notes From Nature, Christina Fidler talking about the Grinnell field notes on FromThePage, my talk, and a long, valuable discussion among all participants. My presentation covered the data model and uses of wiki links as I'm using them in FromThePage.

Video, slides, and transcript are below:

"From The Page" - Ben Brumfield from iDigBio on Vimeo.
I'm Ben Brumfield.  You saw a little bit about FromThePage in Christina Fidler's presentation, so I wanted to talk about the internals -- the design and the data structures behind some of the things that make this a little bit different from Notes From Nature or the NARA Transcribr Drupal module.
This is the transcription screen.  You've seen this with Christina, so I'll probably go over this pretty quickly.  This is a full-text transcription, not individual records like you get with Notes From Nature. 
The reason for that is that FromThePage was built to be a wiki-like tool, purpose-built for creating amateur editions.  So we've got a text and we want to create an edition from the text that can then be re-used, printed, and analyzed.

I say "amateur" editions because we're not dealing with the kinds of things that textual scholars in the humanities are dealing with, where they're trying to compare different variant manuscript versions of Chaucer.  [By contrast, we] have something that's very straightforward, and we're interested in some fairly simple annotations.

It's purpose-built -- free-standing on MySQL and Ruby on Rails, so it's not integrated with MediaWiki or anything like that.
So who's using it?

[FromThePage] was built originally for a set of my great-great grandmother's diaries.

Since then it's been used for military diaries by libraries and history departments.
It's been used for literary diaries--in this case for Shelby Foote's diaries--for literary drafts, and for punk rock fanzines.  (Which is kind of awesome!)
So what does that have to do with the people in this room and the kind of material [we're working with]?

Here's an example:  This is an 1859 journal from an expedition in which someone went out and made a number of observations and collected some things to bring back with them.  There are scholars interested in mining those.

But it's not a naturalist expedition.  This is Viscountess Emily Anne Smyth Strangford, who in this case is touring the Mediterranean and visiting a lot of classical monuments.  The folks at the Duke Computational Classics Collaboratory are interested in finding all the places in which she recorded Latin and Greek inscriptions, coming up with her itinerary, and figuring out how [that data] connects to the objects her father-in-law had collected for the British Museum twenty years earlier.

So there's a lot of correspondence, I tend to think, with field notes.
The San Diego Natural History Museum started using FromThePage for field books in 2010.  They're still working on the project.
  • They've identified ten thousand subjects worth classifying in their system.
  • Individual pages have been edited twenty-four thousand times.  And this goes back to the wiki-like approach -- people transcribe a page, and then they revisit it. They make a number of edits to a page as they get comfortable with the handwriting.
  • And then they've linked individual observations, species mentioned, and people in the field notes to those subjects forty-two thousand times.
Then there are a couple of other projects working with field notes.  [Museum of Vertebrate Zoology] obviously is in trial, and [the Museum of Comparative Zoology] and the Missouri Botanical Garden are just evaluating the software right now.
So, what is a wiki link?

Any of us who've edited Wikipedia may be used to this.  I followed the same syntax [in FromThePage].

What we have here is a set of double square brackets with the canonical name of the subject--this could be a formatted date, this could be a full name that's spelled out--and then the text that's actually used within the verbatim transcript.

So our example here -- this is when Grinnell meets Klauber.  The field note actually says "L. M. Klauber", so the person transcribing has expanded this out to "Laurence M. Klauber".  So we have the ability to handle variance in references to Klauber, but still identify them as Klauber.
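
As a rough illustration of the syntax just described (not FromThePage's own code, which runs on Ruby on Rails), a link of the form `[[canonical name|verbatim text]]` could be picked apart with a regular expression. The function names, the sample sentence, and the handling of the pipe-less short form `[[Name]]` are assumptions borrowed from general wiki conventions.

```python
import re

# Matches [[Canonical Name|verbatim text]] and, as an assumption, the
# pipe-less short form [[Name]] familiar from Wikipedia.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def parse_wikilinks(text):
    """Return (canonical_name, verbatim_text) pairs found in a transcript."""
    links = []
    for match in WIKILINK.finditer(text):
        canonical = match.group(1).strip()
        # With no pipe, the verbatim text is taken to be the canonical name.
        verbatim = (match.group(2) or canonical).strip()
        links.append((canonical, verbatim))
    return links


def verbatim_text(text):
    """Strip the markup, leaving only what the manuscript page says."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), text)


# Invented sample sentence; only the link itself comes from the talk.
line = "Met [[Laurence M. Klauber|L. M. Klauber]] at the museum."
print(parse_wikilinks(line))
print(verbatim_text(line))
```

Keeping both halves of the link is what lets the transcript stay verbatim while every variant still resolves to one subject.
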
Technically speaking, what's behind one of these wiki links?

There are a lot of tables in this database.
  • We know that there's this page that Klauber is mentioned on.  It's S1 Page 3 in the Grinnell field notes that MVZ has online.
  • We've got a subject which is Laurence M. Klauber.
  • The subject is categorized as a person, which can be used for analysis and filtering, like Christina showed you.
  • And then the individual link between the page and the subject, that contains the variation, is also stored.
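
The four pieces listed above can be sketched as plain Python data classes. This is only a schematic of the relationships, assuming one category per subject; the class and field names are hypothetical and do not reflect the actual MySQL schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    name: str              # e.g. "person"


@dataclass(frozen=True)
class Subject:
    canonical_name: str    # e.g. "Laurence M. Klauber"
    category: Category


@dataclass(frozen=True)
class Page:
    work: str              # the field book or diary the page belongs to
    title: str             # e.g. "S1 Page 3"


@dataclass(frozen=True)
class PageSubjectLink:
    page: Page
    subject: Subject
    display_text: str      # the variation as written on the page


def pages_mentioning(subject, links):
    """List the pages on which a subject is linked, in link order."""
    return [link.page for link in links if link.subject == subject]


person = Category("person")
klauber = Subject("Laurence M. Klauber", person)
page = Page(work="Grinnell field notes", title="S1 Page 3")
link = PageSubjectLink(page, klauber, "L. M. Klauber")

print(pages_mentioning(klauber, [link]))
```

Storing the variation on the link rather than on the subject is what allows one subject to collect many different spellings.
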
So there are a lot of things you can do with that.
  • You can show all the pages that mention Laurence M. Klauber, and read the pages in context or just get a listing of them.
  • More helpfully, as you're transcribing we can mine those links to automatically suggest mark-up.  So the next time we encounter "L. M. Klauber", we can push a button and that will automatically expand the mark-up of "L. M. Klauber" to "[[Laurence M. Klauber|L. M. Klauber]]".
  • You can also feed this to full-text searches.  So if you've got a lot of plain-text transcripts which contain Laurence M. Klauber, we can automatically populate the search with those variations, creating an OR query with "Klauber", "L. M. Klauber".
  • And then we can mine the mark-up for correspondences [between subjects] as Christina showed.
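
Two of the uses above, suggesting mark-up and building an OR query, can be sketched under the assumption that previously saved links are available as a simple mapping from verbatim text to canonical name. The `known_links` dictionary and both function names are hypothetical, and the real feature may match text differently.

```python
import re

# Hypothetical store mined from earlier links: verbatim text -> canonical name.
known_links = {
    "L. M. Klauber": "Laurence M. Klauber",
    "Klauber": "Laurence M. Klauber",
}


def suggest_markup(text, known):
    """Wrap known verbatim strings in wiki links, skipping existing links."""
    if not known:
        return text
    # Longest variants first so "L. M. Klauber" is preferred over "Klauber".
    variants = sorted(known, key=len, reverse=True)
    pattern = re.compile(
        r"\[\[.*?\]\]|\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    )

    def replace(match):
        found = match.group(0)
        if found.startswith("[["):
            return found  # already marked up: leave untouched
        return f"[[{known[found]}|{found}]]"

    return pattern.sub(replace, text)


def or_query(canonical, known):
    """Build an OR query from the canonical name and its known variations."""
    variants = {v for v, c in known.items() if c == canonical} | {canonical}
    return " OR ".join(f'"{v}"' for v in sorted(variants))


print(suggest_markup("Saw L. M. Klauber today.", known_links))
print(or_query("Laurence M. Klauber", known_links))
```

Matching existing links in the same pattern keeps the suggestion step safe to run more than once on the same page.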

The last thing you can do with it is export.
Here is a TEI-XML export of the Joseph Grinnell notes.  This is useful for interchange, but the most important thing this does is that it allows amateurs to create well-formatted, TEI P5-compliant XML.  And it will handle one of the things that's very hard about creating TEI in an XML editor, which is associating reference strings with their entries over in the TEI header, which describes who the people are outside the text.
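
To make the reference-string idea concrete, here is a minimal sketch of how a wiki link might become a TEI `<rs>` element pointing at a header entry. It is not the actual export code: the id scheme, the choice of `<listPerson>`/`<person>` for the header entries, and the omission of the TEI namespace are all simplifying assumptions.

```python
import re
import xml.etree.ElementTree as ET

WIKILINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]]+)\]\]")
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id


def slug(name):
    """Hypothetical id scheme: turn a canonical name into an XML-safe id."""
    return "S_" + re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


def to_tei_paragraph(text):
    """Build a <p> in which each wiki link becomes <rs ref="#id">."""
    p = ET.Element("p")
    p.text = ""
    subjects = {}   # id -> canonical name, for the header entries
    last = None     # most recent <rs>; following text becomes its tail
    pos = 0
    for m in WIKILINK.finditer(text):
        chunk = text[pos:m.start()]
        if last is None:
            p.text += chunk
        else:
            last.tail = chunk
        canonical, verbatim = m.group(1), m.group(2)
        subjects[slug(canonical)] = canonical
        last = ET.SubElement(p, "rs", ref="#" + slug(canonical))
        last.text = verbatim
        pos = m.end()
    if last is None:
        p.text += text[pos:]
    else:
        last.tail = text[pos:]
    return p, subjects


def header_entries(subjects):
    """Build the entries that the <rs ref> values point to."""
    list_person = ET.Element("listPerson")
    for xml_id, name in sorted(subjects.items()):
        person = ET.SubElement(list_person, "person", {XML_ID: xml_id})
        ET.SubElement(person, "persName").text = name
    return list_person


p, subjects = to_tei_paragraph("Met [[Laurence M. Klauber|L. M. Klauber]] today.")
print(ET.tostring(p, encoding="unicode"))
print(ET.tostring(header_entries(subjects), encoding="unicode"))
```

Generating both the `ref` and the matching `xml:id` from the same subject record is what spares the transcriber from keeping them in sync by hand.
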
This is a CSV export of the Grinnell field notes.  Basically this is every observation and every person who's mentioned, exported as a CSV file with links back to the pages and URLs at which those pages can be found.  This is the kind of thing that perhaps could be ingested into [museum collection management database] Arctos.
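
A flat export like this could look roughly as follows. The column names and the example.org URL are placeholders chosen for illustration; the real export's columns may differ.

```python
import csv
import io

# Hypothetical column layout for one row per link between a page and a subject.
FIELDS = ["subject", "category", "verbatim_text", "page_title", "page_url"]


def export_links_csv(rows):
    """Serialize link records (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "subject": "Laurence M. Klauber",
        "category": "person",
        "verbatim_text": "L. M. Klauber",
        "page_title": "S1 Page 3",
        "page_url": "https://example.org/pages/placeholder",  # placeholder URL
    }
]

print(export_links_csv(rows))
```

Carrying the page URL on every row means a downstream database can always link a record back to the page image and transcript it came from.
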
Future plans:

We're going to be doing more CMS integrations.  We're working on Omeka.  The Internet Archive is done.  There are a couple of grant applications that involve hooking FromThePage up to Fedora Commons.

We also really want to contextualize links in time and place.  We want the ability for people to define where the person writing the journal is when they're writing, and then to apply those geotags and chronotags to the references.  So you could map when species were mentioned.  You could extract a visual itinerary.

We need more formatting options.  One of our volunteers has found all kinds of crazy editorial issues for handling strike-outs and things like that.

And the last thing that we're looking for is more projects.

1 comment:

Greg G said...

Hi Ben

I can't access the FromThePage demo at present -

http://beta.fromthepage.com/demo