Saturday, December 10, 2016

Tools and Techniques for Enhanced Encoding of Account Books from US Plantations (MEDEA2)

In April of 2016, Anna Agbe-Davies and I attended MEDEA2 (Modeling Semantically Enhanced Digital Editions of Accounts) at Wheaton College in Norton, Massachusetts to meet with other scholars and technologists working on digital editions of financial records.  This is the talk we gave, compiled from Anna's prepared remarks and a transcript of my oral presentation.


Agbe-Davies: Manuscript accounts from plantations come in a variety of forms: accounts of plantation residents (enslaved and free) as they frequented local stores; records of the daily expenditures and income realized by slave owners as a direct result of their human property; and accounts tracking economic exchanges between plantation owners and the laborers on whom they depended for their livelihood.  The data recorded in these sources present an unparalleled opportunity for scholarly analysis of the economic and social structures that characterized the plantation for people throughout its hierarchy.

The properties of these manuscripts are simultaneously the source of their richness and the font of many challenges.  The average American--for all we think we know about our recent plantation past--has little idea of the economic underpinnings of that regime and likewise little sense of how individual men and women may have navigated it.  The idea that enslaved people engaged in commercial transactions, were consumers, at the same time that they were treated as chattel property, runs counter to our understanding of what slavery meant and how it was experienced.

Therefore, primary documents challenging these deeply-held beliefs are an important resource, not only for researchers, but for the general public as well.  We have set out to develop a mechanism that delivers these resources to a wider public and enables their participation in transcription, which in turn makes these sources readable by machines and by people not well-versed in 18th- and 19th-century handwriting.

Just to review, for those of you who were not present for our paper in Regensburg: neither of us is an historian.  I am an archaeologist and, as is usual in the US, also an anthropologist.  I came to texts such as the “Slave Ledger” discussed throughout this presentation with a straightforward question: what were enslaved people buying in 19th-century North Carolina?  In this sense, the store records complement the archaeological record, which is my primary interest.  Clearly, however, these texts have additional meanings and potential for addressing much more than material culture and consumption.  This is exciting for the anthropologist in me.  Ben is editing the account books of Jeremiah White Graves, a ledger and miscellany from a Virginia tobacco plantation.  We are collaborating to extend the capabilities of Ben’s online transcription tool FromThePage, to unleash the full analytical possibilities embodied in financial records.  This paper follows up on our previous contribution by showing how the new version of FromThePage meets the challenges that we outlined in October.


Stagville was the founding farm for a vast plantation complex assembled by several generations of Bennehans and Camerons.  A local historian estimates that at their most powerful, the family owned around 900 men, women, and children (Anderson 1985:95).  Some of the people at Stagville stayed on after Emancipation, allowing for a fascinating glimpse of the transition from slavery to tenancy and wage labor.
Daybooks and ledgers from plantation stores owned by the Bennehan-Cameron family cover the years 1773 to 1895. Many of the men and women whose purchases are recorded therein were the family’s chattel property and, in later years, their tenants or employees. There are forty-five daybooks and twenty ledgers in the family papers, which are collected in the University of North Carolina’s Southern Historical Collection. Eleven volumes are flagged by the finding aid as including purchases by “slaves” or “farm laborers,” though many volumes have no summary and may contain as-yet unidentified African American consumers.
My plan is to digitize and analyze a selection of these daybooks and ledgers.  This project augments the Southern Historical Collection’s effort to make important manuscripts available via the Internet.  My project not only increases the number of volumes online and in a format that enables analysis by users with varying levels of expertise, but makes the contents of these documents available as data, not merely images.

Aims and Problems

We wanted the tool to be user-friendly.  We didn't want users to have to learn complicated or esoteric conventions for the transcription or marking of the texts.  We also wanted support for encoding, display, and export of tabular data appearing within documents.  This feature allows Markdown encoding of tables appearing within documents, and enhances the semantic division of texts into sections, as Ben will discuss below.

In pilot studies I had done before adopting FromThePage, participants cited the need to have the manuscript being transcribed visible at the same time as the transcription window.

One of the problems with transcription is how to treat variations in terminology and orthography.  This issue was discussed at the last MEDEA meeting as a difference between historical and linguistic content.
An analyst has good reason to want to treat all variants of whiskey as one category.  But by the same token, differences in how the word is rendered are important information for other forms of analysis: who wrote the entry; do differences signify levels of literacy or specialized knowledge?
Just like commodities, people’s names come in several forms.  Unlike commodities, the same name may refer to more than one individual.  When it is possible to merge multiple names into a single individual or distinguish two similar names as two different people, it’s important to be able to record that insight without doing violence to the original structure of the manuscript.


Brumfield: We don't usually think about wiki encoding as encoding--we don't usually think about plain-text transcription as encoding--but fundamentally, it is.

The goal of wiki encoding is quick and easy data entry.  What that means is that where possible, users are typing in plain text.  Now this is a compromise. It is a compromise between presentational mark-up and semantic mark-up.  But fundamentally all editions are compromises [inaudible]

If the user encounters a line break in the original text, they hit carriage return.  That encodes a line-break.  For a paragraph break, they encode a blank line.  So you end up with something very similar to the old-fashioned typographic facsimile in your transcript.

That's not enough, however -- you have to have some explicit mark-up.  That's where we've added lightweight encoding.  Most wiki systems support that, but the form we use most prominently is the wikilink, backed by a relational database system that records all of the encoding.
What's a wiki-link?  Here's an example: there are two square brackets, and on one side of a pipe sign is the canonical name of the subject [while on the other side you have the verbatim text.]  In this account, Anna has encoded a reference to gunflint, and the verbatim text is "flint", and it appears in the context of "at sixpence".  That's not that complex, but it does give the opportunity to encode variation in terminology; to resolve all these terms into one canonical subject.
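As a rough illustration of the convention just described, a wikilink of the form [[canonical name|verbatim text]] could be picked apart along the following lines.  This is a minimal sketch, not FromThePage's own code: the regular expression, the function names, and the sample line are all assumptions made for the example.

```python
import re

# Assumed pattern for [[canonical name|verbatim text]].  A link without a
# pipe is treated as having identical canonical and verbatim text.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_links(transcript):
    """Return (canonical, verbatim) pairs found in a transcript string."""
    links = []
    for match in WIKILINK.finditer(transcript):
        canonical = match.group(1).strip()
        verbatim = (match.group(2) or canonical).strip()
        links.append((canonical, verbatim))
    return links


def plain_text(transcript):
    """Replace each wikilink with its verbatim text, as a reader would see it."""
    return WIKILINK.sub(lambda m: (m.group(2) or m.group(1)).strip(), transcript)


# Hypothetical transcript line built from the "flint ... at sixpence" example.
sample = "1 [[gunflint|flint]] at sixpence"
print(extract_links(sample))  # [('gunflint', 'flint')]
print(plain_text(sample))     # 1 flint at sixpence
```

The verbatim text stays in the reading text, while the canonical name is what gets stored for indexing and analysis.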
Because all of the tags are saved into the database, you can do some really interesting things, like dynamically generating an index whenever a page is saved.  If you go to the entry on "gunflint", you can see all of the entries that mention gunflint.  You can link directly to [the pages], and you can see the variations.  So on page 3 we see "Gunflints", on page 4 we see "Gunflints", and page 5 we see "flints", which is the entry we just saw.
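To make the index idea concrete, here is a small sketch that stores links in an in-memory SQLite table and looks up every mention of one subject.  The table layout, column names, and function name are hypothetical and are not taken from FromThePage's schema; the rows simply echo the page 3, 4, and 5 mentions described above, plus one invented "powder" row.

```python
import sqlite3

# Hypothetical schema: one row per wikilink saved from a page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (page INTEGER, canonical TEXT, verbatim TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [
        (3, "gunflint", "Gunflints"),
        (4, "gunflint", "Gunflints"),
        (5, "gunflint", "flints"),
        (5, "powder", "powder"),  # invented row, for contrast only
    ],
)


def index_entry(connection, canonical):
    """Return (page, verbatim) pairs for every mention of a subject, by page."""
    cursor = connection.execute(
        "SELECT page, verbatim FROM links WHERE canonical = ? ORDER BY page",
        (canonical,),
    )
    return cursor.fetchall()


print(index_entry(conn, "gunflint"))
```

Because the lookup is a query rather than a stored list, the index is always as current as the last saved page.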
When a transcriber records something like "flint", they are asked to categorize it and add it to an ontology that's specific to a project.  Anna has developed this ontology of categories to put these subjects in.  A subject can belong to more than one category.  The subjects that are of particular importance to this project are the persons and their status: You can have a person who shows up as an account holder, you can have a person who shows up as enslaved, you can have a person who shows up as a creditor, and the same person can be all of these different things.

In this case, "gunflint" shows up in the category of "arms".

The other thing that you can do is mine this tagging to do some elementary network analysis by looking at colocation.  When a subject like 'gunflint' appears most commonly in the same physical space on the page with another subject, you can tell that it's related to that subject more closely than it is to things that appear farther away.  Unsurprisingly, 'gunflint' appears most commonly with 'shot' and with 'powder'.  So people are going hunting and buying all their supplies at once.  (This is why I'll never get a PhD: discovering that people buy powder and shot at the same time is not groundbreaking research!  But it could be a useful tool for someone like Anna.)
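A simple version of this kind of colocation count can be sketched as follows.  One loud assumption: "the same physical space on the page" is approximated here by "tagged within the same entry", which may not match how the tool measures proximity.  The entries are invented for illustration and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entries):
    """Count how often each pair of subjects is tagged in the same entry."""
    counts = Counter()
    for subjects in entries:
        # sorted(set(...)) gives each unordered pair a single, stable key.
        for pair in combinations(sorted(set(subjects)), 2):
            counts[pair] += 1
    return counts


def related_subjects(counts, subject):
    """Rank the subjects that share entries with the given subject."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == subject:
            related[second] += n
        elif second == subject:
            related[first] += n
    return related.most_common()


# Invented entries: each list holds the canonical subjects tagged in one entry.
entries = [
    ["gunflint", "shot", "powder"],
    ["gunflint", "powder"],
    ["whiskey", "gunflint", "shot"],
    ["whiskey"],
]
counts = cooccurrence_counts(entries)
print(related_subjects(counts, "gunflint"))
```

With these made-up entries, "shot" and "powder" each share two entries with "gunflint" and "whiskey" shares one, which is the shape of result described above.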
Another thing we can do:  Even though we're not encoding in TEI--we're not using TEI for data entry--we can generate TEI-XML from the system.  These are two sections of an export from the Jeremiah White Graves Diary, in which there's a single line with "rather saturday.  Joseph A. Whitehead's little" and there's a line-break.  That was encoded by someone typing in "Jos A. Whitehead" in square brackets, then "little", and hitting carriage return.

The TEI exporter generates that text with a reference string and a line break.  It also generates the personography entry for Joseph A Whitehead from the wiki-link connecting "Jos. A Whitehead" to the canonical name.
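The following sketch shows one way a wiki-encoded line could be turned into TEI-style markup of this kind.  It is an illustration under assumptions, not the actual FromThePage exporter: `rs`, `lb`, `person`, and `persName` are real TEI element names, but the ID scheme, the placement of the line break, the function names, and the sample line are all hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Assumed pattern for [[canonical name|verbatim text]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def subject_id(canonical):
    """Hypothetical ID scheme: 'S_' plus the name with punctuation collapsed."""
    return "S_" + re.sub(r"[^A-Za-z0-9]+", "_", canonical).strip("_")


def line_to_tei(line):
    """Convert one transcript line to TEI-like text; also return its subjects."""
    parts, subjects, position = [], [], 0
    for match in WIKILINK.finditer(line):
        parts.append(escape(line[position:match.start()]))
        canonical = match.group(1).strip()
        verbatim = (match.group(2) or canonical).strip()
        parts.append(f'<rs ref="#{subject_id(canonical)}">{escape(verbatim)}</rs>')
        subjects.append(canonical)
        position = match.end()
    parts.append(escape(line[position:]))
    # The carriage return typed by the transcriber becomes a line-break element.
    return "".join(parts) + "<lb/>", subjects


def person_entry(canonical):
    """Build a minimal personography entry for a canonical name."""
    return (f'<person xml:id="{subject_id(canonical)}">'
            f'<persName>{escape(canonical)}</persName></person>')


tei_line, subjects = line_to_tei("[[Joseph A. Whitehead|Jos A. Whitehead]] little")
print(tei_line)  # <rs ref="#S_Joseph_A_Whitehead">Jos A. Whitehead</rs> little<lb/>
print(person_entry(subjects[0]))
```

The reference string in the text and the personography entry share one identifier, which is what ties the verbatim spelling back to the canonical name.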
What does any of that have to do with accounts?  That's really great for prose, but it doesn't help us deal with the kinds of tabular data that show up in financial records.

Here's an example from the Stagville Account books that Anna has encoded, which shows off the mark-up which we developed for this project.  And I have to say that this is "hot code" -- this was developed in February, so we are nowhere near done with it yet.

We needed to come up with some way to encode these tabular records in a semantically meaningful way and to render them usefully.  We chose the Markdown sub-flavor of wiki-markup to come up with this format which looks vaguely tabular.
Whenever you encounter a table, you create a section header which identifies the table, then a table heading for all the columns, and then you type in what you see as the verbatim text, separated by pipes.  You can add lots of whitespace if you want all of your columns to line up in your transcription and make your transcript look really pretty, or not -- it doesn't really matter.
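To illustrate how forgiving this format is about whitespace, here is a sketch of a parser for a transcript shaped like the one described.  The exact syntax is an assumption: the section header is taken to be a line starting with "#", the first pipe-delimited line after it is taken to be the column heading, and a Markdown-style row of dashes is skipped if present.  The sample account is invented, and `parse_tables` is a hypothetical name.

```python
def split_cells(line):
    """Split a pipe-delimited line into cells, ignoring padding whitespace."""
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_tables(transcript):
    """Return a list of tables, each with a section name, headers, and rows."""
    tables, current = [], None
    for raw in transcript.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Assumed section-header syntax identifying a new table.
            current = {"section": line.lstrip("#").strip(), "headers": None, "rows": []}
            tables.append(current)
        elif "|" in line and current is not None:
            cells = split_cells(line)
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue  # Markdown separator row such as | --- | --- |
            if current["headers"] is None:
                current["headers"] = cells
            else:
                current["rows"].append(dict(zip(current["headers"], cells)))
    return tables


# Invented sample; column padding is deliberately uneven.
sample = """
## Frederick's account
| Item      | Amount |
| flint | 6d |
| powder    |   1s   |
"""
tables = parse_tables(sample)
print(tables[0]["section"], tables[0]["rows"])
```

Padded and unpadded rows parse to the same cells, so transcribers can line columns up or not, as they prefer.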
One of the hiccups we ran into pretty quickly was that not all of the tables in our documents were nicely formatted with headers.  What we needed to do was encode the status of these columns.  All these column headers are important -- they define the role of the data cells below them.  So we came up with this idea of "bang notation", which essentially strips a header from the display but keeps it for the semantic encoding.
That serves our purposes for display and for the textual side of the edition.  What about the analytical side?  Once all this is encoded, we're able to export--not just in TEI-XML--all of the tabular data that appears in one account book and create a single spreadsheet from it.  The tables that appear in an account book can be really heterogeneous--some of us are dealing with sources in which you have a list of bacon shipments, and then you have an actual account to a creditor, and then you have an IOU--but you want to be able to track the same amounts from one table to another, when those values are actually relevant.

We do something really pretty simple: we look at the heading, and if something appears under the same heading in more than one table, we'll put it in the same column in the spreadsheet we generate.  That means that some rows are going to have blank cells because they've had different headers.  Filtering should allow you to put that together.  Here you see Mrs. Henry's Abram's account, you have Frederick's account, so you can filter those and say you just want to study Frederick's account.  Or you want to study those two accounts together.
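The heading-matching rule can be sketched in a few lines.  This is an illustration under assumptions rather than the tool's own export code: the input structure, the extra "Section" column used for filtering, the function name, and the sample rows are all hypothetical.

```python
import csv
import io


def merge_tables(tables):
    """Write heterogeneous tables to one CSV, aligning cells by column heading."""
    # Collect the union of headings in first-seen order, after a Section column.
    columns = ["Section"]
    for table in tables:
        for header in table["headers"]:
            if header not in columns:
                columns.append(header)

    buffer = io.StringIO()
    # restval="" leaves a blank cell wherever a table lacks a heading.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for table in tables:
        for row in table["rows"]:
            record = dict(zip(table["headers"], row))
            record["Section"] = table["section"]
            writer.writerow(record)
    return buffer.getvalue()


# Invented rows; only the two account names come from the talk.
tables = [
    {"section": "Mrs. Henry's Abram's account",
     "headers": ["Date", "Item", "Amount"],
     "rows": [["Jan 1", "powder", "1s"]]},
    {"section": "Frederick's account",
     "headers": ["Item", "Amount", "Paid by"],
     "rows": [["flint", "6d", "cash"]]},
]
merged = merge_tables(tables)
print(merged)
```

Cells under "Item" and "Amount" land in shared columns, while "Date" and "Paid by" are blank for the table that lacks them.  Filtering on the "Section" column then isolates one account or a chosen pair.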

Furthermore, we also have the ability to link back to the original page, so that you can get back from the spreadsheet to the textual edition.
We've got a lot more to do.  We are probably less than fifty percent done with the software development side of this, and way less than that for the whole project.
  • We need to work on encoding dates that are useful for analysis.
  • We need to figure out how to integrate subjects and tables in ways that can be used analytically.
  • We need to add this table support to our TEI exports.
And then we need to do a lot more testing.  We're still working on this.

With that, I turn it back to Anna.


Agbe-Davies: We have conceptualized usability in terms of both process and product.  Because FromThePage is designed to facilitate crowdsourced transcription of manuscript accounts, the functionality of the input process is as important as the form the output will take.  The resulting transcription will be far more readable for nonspecialist users, while at the same time allowing researchers to perform quantitative analyses on data that would otherwise be inaccessible.

Each of these audiences can contribute to the development of these datasets and use them in creative ways.  In my association with Stagville State Historic Site, I have the opportunity to share research findings with the general public and they are eager to explore these data for themselves and turn them to their own purposes.  Teachers can use this material in their classes.  Site interpreters and curators can enrich their museums’ content with it.  History enthusiasts can get a sense of the primary data that underlies historical scholarship.  Researchers can manipulate and examine transcriptions in ways that are both quantitative and qualitative.  In a recent paper on crowdsourcing as it applies to archaeological research, a colleague and I wrote, “[there is an] urgent need for access to comparative data and information technology infrastructure so that we may produce increasingly synthetic research.  We encourage increased attention not only to the ways that technology permits new kinds of analyses, but also to the ways that we can use it to improve access to scattered datasets and bring more hands to take up these challenges.”  A similar argument can be made for the modeling of historic accounts.