Tuesday, August 15, 2017

Beyond Cooccurrence: Network Visualization in the Civil War Governors of Kentucky Digital Documentary Edition

On August 10, 2017, my partner Sara Carlstead Brumfield and I delivered this presentation at Digital Humanities 2017 in Montreal.  The presentation was coauthored by Patrick Lewis, Whitney Smith, Tony Curtis, and Jeff Dycus, our collaborators at Kentucky Historical Society.

This is a transcript of our talk, which has been very lightly edited.  See also the Google Slides presentation and m4a and ogg audio files from the talk.


[Ben] We regret that our colleagues at the Kentucky Historical Society are not able to be with us; as a result, this presentation will probably skew towards the technical. Whenever you see an unattributed quotation, that will be by our colleagues at the Kentucky Historical Society.


The Civil War Governors of Kentucky Digital Documentary Edition was conceived to address a problem in the historical record of Civil War-era Kentucky that originates from the conflict between the slave-holding, unionist elite and the federal government. Over the course of the war, the two had fallen out completely. As a result, at the end of the war the people who wrote the histories of the war—even though they had been Unionists—ended up wishing they had seceded, so they wrote these pro-Confederate histories that biased the historical record. What this means is that the secondary sources are these sort-of Lost Cause narratives that don't reflect the lived experience of the people of Kentucky during the Civil War. So in order to find out about that experience, we have to go back to the primary sources.

The project was proposed about seven years ago; editorial work began in 2012 – gathering the documents, imaging them, and transcribing them in TEI-XML. In 2016, the Early Access edition published ten thousand documents on an Omeka site, discovery.civilwargovernors.org. Sara and I became involved around that time for Phase 2.

The goal of Phase 2 was to publish 1,500 heavily annotated versions of documents that had already been published on the Omeka site, and to identify the people within them.


The corpus follows the official correspondence of the Office of the Governor. As Kentucky was a divided state, there were three Union governors during the Civil War, and there were also two provisional Confederate governors.


Fundamentally, the documentary edition is not about the governors. We want to look at the individual people and their experience of war-time Kentucky through their correspondence with the Office of the Governor. This correspondence includes details of everyday life, from raids to property damage to all kinds of stuff: when people had problems and didn't know where else to go, they wrote to the governor.


If we're trying to highlight the people within these documents, how do you do that within a documentary edition? In a traditional digital edition, you use TEI. Each individual entity that's recognized—whether it's a person, place, organization, or geographic feature—will have an entry about it created in the TEI header or some external authority file, and the times that they are mentioned within the text will be marked up.

Now when done well, this approach is unparalleled in its quality. When you get names that have correct references within the ref attribute of their placeName tags, you really can't beat it. The problem with this approach is that it's very labor-intensive, and because it's done before publication, it adds an extra step before the readership can have access to the documents.


The alternative approach, which we see in the digital humanities, is the text mining approach, in which existing documents have Named Entity Recognition and other machine learning algorithms applied to them to attempt to find people who are mentioned within the documents.

Here's an example, looking for places, people, or concepts within a text.


The problem with this approach—while it's not labor-intensive at all—is that it doesn't produce very good quality. The first of the Union governors of Kentucky, Beriah Magoffin, appears with one hundred and seven variants within the texts that have been annotated so far. These can be spelling variants, they can be abbreviations, and then there are all these periphrastic expressions like “your Excellency”, “Dear Sir”, “your predecessor” (in a letter to his successor). So there's all kinds of variation in the way this person appears [in the text].


Furthermore, even when you have consistency in the reference, the referent itself may be different. So, “his wife” appears in these documents as a reference to eight different people. What are you going to do with that? No clustering algorithm is going to figure out that “his wife” is one of these eight people.

Our goal was to try to reproduce the quality of the hand-encoded TEI-XML model in a less labor-intensive way.


[Sara] So how do we do that? We built a system called Mashbill for a cadre of eight GRAs, each assigned 150 documents from the corpus. They used a Chrome plug-in called hypothes.is to highlight every entity in the published version of the documents. So the documents are transcribed and published, and the GRAs highlight every instance of an entity.

If we look at the second [highlight], “Geo W. Johnson, Esq”: the GRA highlights it, and then moves into Mashbill, where we use the hypothes.is API to pull in all of their verbatim annotations.
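
To make that pull concrete: the hypothes.is search API returns each annotation as JSON, and the highlighted string lives in a TextQuoteSelector. Here is a minimal sketch in Ruby; the API token and document URL are placeholders, and Mashbill's actual ingest code certainly differs in detail.

    require 'net/http'
    require 'json'
    require 'uri'

    API_TOKEN = ENV['HYPOTHESIS_TOKEN']   # hypothes.is developer token (placeholder)
    DOC_URL   = 'http://discovery.civilwargovernors.org/document/KYR-0001-004-0278'  # hypothetical URL

    uri = URI('https://api.hypothes.is/api/search')
    uri.query = URI.encode_www_form(uri: DOC_URL, limit: 200)

    request = Net::HTTP::Get.new(uri)
    request['Authorization'] = "Bearer #{API_TOKEN}"

    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
    rows = JSON.parse(response.body)['rows']

    # Each annotation's TextQuoteSelector carries the verbatim highlighted
    # string, e.g. "Geo W. Johnson, Esq".
    rows.each do |row|
      selectors = (row['target'] || []).flat_map { |t| t['selector'] || [] }
      quote = selectors.find { |s| s['type'] == 'TextQuoteSelector' }
      puts quote['exact'] if quote
    end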


Each GRA sees the annotations they have created, and [next to each] is an “identify” button. This pulls the verbatim text into a database search using Postgres's trigram library to look for closest matches within our database of known entities.
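
As a sketch of what that lookup might look like—assuming Postgres's pg_trgm extension and an entities table with a name column; the real Mashbill query and threshold may differ:

    class Entity < ActiveRecord::Base
      # Rank known entities by trigram similarity to the verbatim annotation,
      # e.g. Entity.candidates_for("Geo W. Johnson, Esq")
      def self.candidates_for(verbatim, limit = 20)
        quoted = connection.quote(verbatim)
        where('similarity(name, ?) > 0.2', verbatim)
          .order(Arel.sql("similarity(name, #{quoted}) DESC"))
          .limit(limit)
      end
    end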

“Geo W. Johnson, Esq” has the potential to match a lot of people—mostly based on surname. It looks like it might be the second one, George Johnston (a judge), but probably it's George Washington Johnson—halfway down the page—who was one of these provisional Confederate governors. The GRA would choose that to associate the string with the entity in the database, but if they couldn't find an entity—remember that the goal is to find all the people in the corpus who are not already known to historians—they have the ability to create an entity record.


When you create a new entity or when you're working with an entity, we flesh out a lot of really rich information about that entity within the tool. The GRAs would fill in attributes from their research into a set of approved references for Kentucky in this period, including dates, race, gender, and geographic location (latitude/longitude). We also get short biographies, which will be incorporated into the edition, and a list of documents [mentioning the entity].
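
As a rough sketch of what that entity record might look like as a Rails migration—the column names here are illustrative, not Mashbill's actual schema:

    class CreateEntities < ActiveRecord::Migration[5.0]
      def change
        create_table :entities do |t|
          t.string  :name,        null: false  # authoritative display name
          t.string  :entity_type               # person, place, organization, geographic feature
          t.string  :race                      # drawn from the approved reference works
          t.string  :gender
          t.date    :birth_date
          t.date    :death_date
          t.decimal :latitude,  precision: 10, scale: 6
          t.decimal :longitude, precision: 10, scale: 6
          t.text    :biography                 # short biography incorporated into the edition
          t.timestamps
        end
        add_index :entities, :name
      end
    end

The list of documents mentioning an entity falls out of the identifications themselves, so it doesn't need a column of its own.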
 

Once you have the information, you can do a lot with the entities really quickly. We can do rich entity visualizations: the big dots are people, places, organizations, and geographic features; you can look at the gender of entities within the corpus; we can look at entities that appear more often than others and who they are. You can do a lot of high-value work with the data.


We can also look at documents and the places that they mention – large dots are places that are mentioned more often in the documents.

 
[Ben] The last stage of this is that—once the entity research is finished and once the annotations for the document have all been identified—the Mashbill system will produce a TEI-XML file for every entity. It will also update the existing TEI documents that were created during the transcription process with the appropriate persName, placeName, and orgName tags with references to [the entity files]. And it will automatically check those files into Github so that the Github browser interfaces will display the differences between [the versions].
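
A very rough sketch of that last step, assuming Nokogiri for the XML work and the ruby "git" gem for the checkin; the file names, xml:ids, namespace handling, and escaping are simplified and not Mashbill's actual code:

    require 'nokogiri'
    require 'git'

    TEI_NS = { 'tei' => 'http://www.tei-c.org/ns/1.0' }

    # Wrap one verbatim mention in a reference tag, e.g.
    # <persName ref="#george_w_johnson">Geo W. Johnson, Esq</persName>
    # (namespace handling and escaping are glossed over for brevity).
    def tag_mention!(path, verbatim, tag, ref)
      xml = Nokogiri::XML(File.read(path))
      xml.xpath('//tei:text//text()', TEI_NS).each do |node|
        next unless node.content.include?(verbatim)
        tagged = node.content.sub(verbatim, %(<#{tag} ref="#{ref}">#{verbatim}</#{tag}>))
        node.replace(Nokogiri::XML.fragment(tagged))
      end
      File.write(path, xml.to_xml)
    end

    repo = Git.open('tei-repo')                      # hypothetical local clone
    tag_mention!('tei-repo/KYR-0001-004-0278.xml',   # hypothetical document file
                 'Geo W. Johnson, Esq', 'persName', '#george_w_johnson')
    repo.add(all: true)
    repo.commit('Add entity references from Mashbill')
    repo.push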

So we end up with an output that is equivalent to a hand-coded digital edition (P5-compliant TEI), but which we hope takes a little bit less labor.
 
If we're trying to look at relationships between people in this corpus, we need to define those relationships. One traditional method—which we saw earlier in [François Dominic Laramée's presentation, "La Production de l’Espace dans l’Imprimé Français d’Ancien Régime : Le Cas de la Gazette"]—is cooccurrence: trying to identify entities that are mentioned within the same block of text. Maybe [that block] is a page, maybe it's a paragraph, maybe it's a sentence or a word window.
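
For concreteness, here is the kind of cooccurrence counting being described, sketched in Ruby over a hypothetical list of entity ids per paragraph:

    # paragraphs: one array of entity ids per paragraph, e.g.
    # [["reuben_jones", "county_court"], ["county_court", "wm_smith"]]
    def cooccurrence_counts(paragraphs)
      counts = Hash.new(0)
      paragraphs.each do |entities|
        entities.uniq.combination(2).each do |pair|
          counts[pair.sort] += 1   # every pair sharing a paragraph counts as "related"
        end
      end
      counts
    end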

But cooccurrence has a lot of challenges. For example, [pointing] if we look right here: “our Sheriff” (who is identified as Reuben Jones, I think) is mentioned within the same paragraph as these other names. But the reason he's mentioned is – it's just an aside: we sent a letter via our sheriff, now we're going to talk about these county officers. There's no relationship between the sheriff, Reuben Jones, and the officers of the county court. The only relationship that we know of is between Reuben Jones and the letter writer – and that's it. Cooccurrence would be completely misleading here.


[Sara] So what do we do instead?

Once you've identified the entities within the text, the next step in the Mashbill pipeline is to define the relationship that you're seeing. Those might be relationships that are attested to by the document itself, or they might be relationships that the GRAs found over the course of their research for the biographies.

Mashbill displays a list of all the entities that appear in a document, and the GRAs choose relationships for those entities based on their research. We have six different types of relationships—including social, legal, political, slavery, and military—and we also prompt the GRAs, showing them what we already know about the relationships of the [entities mentioned within a document].
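
A sketch of how those attested relationships might be stored; the model and column names here are illustrative rather than Mashbill's actual schema:

    class Relationship < ActiveRecord::Base
      belongs_to :source_entity, class_name: 'Entity'
      belongs_to :target_entity, class_name: 'Entity'
      belongs_to :document, optional: true   # document attesting the relationship, if any

      # e.g. "social", "legal", "political", "slavery", "military", ...
      validates :relationship_type, presence: true
    end

    # Prompting the GRAs with what we already know about a document's entities
    # is then a single query (doc.entities is assumed here):
    # Relationship.where(source_entity: doc.entities)
    #             .or(Relationship.where(target_entity: doc.entities))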


So we have richer relationship data than a lot of traditional computational approaches, which means that you can do visualizations which have more data encoded within them, and can be more interesting.

This is Caroline Dennett, who was an enslaved woman who was brought [to Kentucky] as contraband with the Union Army, was “employed” by a family in Louisville, and was accused of poisoning their eighteen-month-old daughter. There are a lot of documents about her, because there are people writing to the governor about pardoning her, or attesting to her character (or lack of ability to do anything that horrible).

What we show in our network is not just Caroline and all the people and organizations she was related to, but also the different types of relationships. We have legal relationships, political relationships; we have social relationships. A preacher in her town was one of the people who wrote to the governor on her behalf, so we show a social relationship with that person. We have about three different types [of relationships] displayed in different colors on this graph.
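
Building on the Relationship sketch above, here is one way such a typed, colored ego network could be exported for a force-directed layout; the color palette and method names are invented for illustration:

    require 'json'

    EDGE_COLORS = { 'legal' => '#d62728', 'political' => '#1f77b4', 'social' => '#2ca02c' }

    # Build the node/link structure a D3-style force layout consumes,
    # centered on one entity (e.g. Caroline Dennett).
    def ego_network_json(entity)
      rels  = Relationship.where(source_entity: entity)
                          .or(Relationship.where(target_entity: entity))
      nodes = ([entity] + rels.map(&:source_entity) + rels.map(&:target_entity)).uniq
      {
        nodes: nodes.map { |n| { id: n.id, label: n.name, type: n.entity_type } },
        links: rels.map do |r|
          { source: r.source_entity_id, target: r.target_entity_id,
            color: EDGE_COLORS.fetch(r.relationship_type, '#999999') }
        end
      }.to_json
    end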


What are our results?

As of a week ago, this project had annotated 1,228 documents with 15,931 annotations. Of those annotations, 14,470 have been identified as 8,086 particular entities. On our right [pointing], we have the distribution of annotations on documents: some of them, like petitions, have as many as 238 names, but our median is around eight entities named per document.


You can find the project at civilwargovernors.org. That's the Early Access version which is just the transcriptions; by October those will be republished with all the biography data and the links between the documents and the entity biographies.

The software is on Github. I'm Sara Brumfield, this is Ben Brumfield; we're with Brumfield Labs. Patrick Lewis is the PI on this project, Whitney Smith, Tony Curtis, and Jeff Dycus are editors and technologists at the Kentucky Historical Society. We also want to thank the graduate research assistants.

[applause]

Many questions were very faint in the audio recording; as a result, the following question texts should be regarded as paraphrases rather than transcripts.

Question: You mentioned the project's goals of trying to get beyond a pro-slavery, pro-Confederate historical record. Do you have an idea of how that's going?

Answer: [Ben] What we find is that the documents skew male; they skew white. So it's not like we can create documents that don't exist. But what we can do now is identify documents and people, so you can say “Show me all the women of color who are mentioned within the documents; I want to read about them.” So at least you can find them.

Question: Despite the workflow and process, it seems like there are still a lot of hours of labor involved in this. Can you give us an idea of the amount of labor involved in this project, outside of building the software?

Answer: [Sara] The budget for the labor was $40,000, which hired eight GRAs for the summer. [Ben] They're not done yet, but we think they will achieve the goal of 15,000 entities. It's hard to tell the difference between this and a TEI tagging project, in part because—in addition to identifying entities—every single entity had to be researched, and a biography had to be written for them if possible. [Sara] That's obviously labor-intensive. From a software perspective, we tried to think really hard about how to make this work go faster. So using hypothes.is for annotation: hypothes.is is really slick, and we also didn't have to build an annotator, so that keeps your costs of software development down. So that went really fast. Presenting likely entity matches to choose from: we tried to do a lot of that sort of work to make the GRAs as effective as possible. [Ben] But they still have to do the research; they still have to read the documents.

Question: All of your TEI examples focus on places – were you able to handle other kinds of entities?

Answer: [Ben] We concentrated on people, places, and organizations, but one interesting thing about this approach is that—if you look up here at entities mentioned more than ten times, and I'm sorry there's no label—the largest red blob and the largest blue blob are both Kentucky. One of them is the Government of Kentucky; the other is Kentucky as a place. Again, humans can differentiate that in a way that computers can't. [Sara] We did organizations, people, places, and geographic features.

Question: This is a fantastic resource for not just Kentucky Historical Society, but also in terms of thinking through history in the US. I was wondering what your data plan was, and how available and malleable is the data that you produce.

Answer: [Sara] The data itself is flushed to Github as TEI documents, so every entity will have a file there, as will every document. The database itself is not published anywhere. [Ben] Our goal with this was that, by the time we got to the “pencils down” phase of the project, everything was interoperable and in Github, so that people could reconstruct the project from that and no information was lost – but that's the extent of it.

Question: A technical question – I missed the part with Github. How does that work?

Answer: [Ben] So the editors were looking for a way of exposing the TEI for reuse by other people. Doing all this work on TEI, then locking it away behind HTML is no fun. They loved the idea of Github as a repository (we had used it before for the Stephen F. Austin papers as a raw publication venue), but they were really not comfortable with their graduate research assistants having to figure out how git works, how you resolve merge conflicts, and such. As a result, every time there's a change to a document or an entity, Mashbill—the Ruby on Rails application that we built—does a checkout and merge, finds the TEI, adds the tags—essentially merges all that data in—and then checks that back into Github. So [the GRAs] are able to use the Github web interface to see the diffs and publish the data, but they didn't have to actually touch git. [Sara] Right, but the editors might, if they need to.

About us: Brumfield Labs, LLC is a software consultancy specializing in digital editions and adjacent methodologies like crowdsourced transcription, image processing/IIIF, and text mining.  If you have a project you'd like to discuss, or just want to pick our brains, we'd love to talk to you. Just send a note to benwbrum@gmail.com or saracarl@gmail.com and we'll chat.