On August 10, 2017, my partner Sara Carlstead Brumfield and I delivered this presentation at Digital Humanities 2017 in Montreal. The presentation was coauthored by Patrick Lewis, Whitney Smith, Tony Curtis, and Jeff Dycus, our collaborators at Kentucky Historical Society.
This is a transcript of our talk, which has been very lightly edited. See also the Google Slides presentation and m4a and ogg audio files from the talk.
[Ben] We regret that our colleagues at
the Kentucky Historical Society are not able to be with us; as a
result, this presentation will probably skew towards the technical.
Whenever you see an unattributed quotation, that will be by our
colleagues at the Kentucky Historical Society.
The Civil War Governors of Kentucky
Digital Documentary Edition was conceived to address a problem in the
historical record of Civil War-era Kentucky that originates from the conflict between the slave-holding, Unionist elite and the federal government. Over the course of the war, the two had fallen out
completely. As a result, at the end of the war the people who wrote
the histories of the war—even though they had been Unionists—ended
up wishing they had seceded, so they wrote these pro-Confederate
histories that biased the historical record. What this means is that
the secondary sources are these sort-of Lost Cause narratives that
don't reflect the lived experience of the people of Kentucky during
the Civil War. So in order to find out about that experience, we have to
go back to the primary sources.
The project was proposed about seven
years ago; editorial work began in 2012 – gathering the documents,
imaging them, and transcribing them in TEI-XML. In 2016, the Early
Access edition published ten thousand documents on an Omeka site,
discovery.civilwargovernors.org. Sara and I became involved around
that time for Phase 2.
The goal of Phase 2 was to publish 1500
heavily annotated documents that had already been published on the
Omeka site, and to identify people within them.
The corpus follows the official
correspondence of the Office of the Governor. As Kentucky was a
divided state, there were three Union governors during the Civil War,
and there were also two provisional Confederate governors.
Fundamentally, the documentary edition
is not about the governors. We want to look at the individual people
and their experience of war-time Kentucky through their
correspondence with the Office of the Governor. This correspondence
includes details of everyday life from raids to property damage, to –
all kinds of stuff: when people had problems, they wrote to the
governor when they didn't know where else to go.
If we're trying to highlight the people
within these documents, how do you do that within a documentary
edition? In a traditional digital edition, you use TEI. Each
individual entity that's recognized—whether it's a person, place,
organization, or geographic feature—will have an entry about it
created in the TEI header or some external authority file, and the
times that they are mentioned within the text will be marked up.
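For readers who don't work with TEI, here is a minimal, invented illustration of that markup, assembled in Python so it can be run as-is: each mention in the text carries a ref attribute pointing at an entry in an authority file. The element names are standard TEI; the identifiers are made up.

```python
# A minimal sketch (not project code) of the traditional TEI approach: each
# mention is wrapped in a persName or placeName element whose ref attribute
# points at an entry in an authority file. The entity IDs are invented.
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)

p = ET.Element(f"{{{TEI}}}p")
pers = ET.SubElement(p, f"{{{TEI}}}persName", ref="#GWJ001")
pers.text = "Geo W. Johnson, Esq"
pers.tail = " of "
place = ET.SubElement(p, f"{{{TEI}}}placeName", ref="#frankfort_ky")
place.text = "Frankfort"
place.tail = " wrote to the governor."

print(ET.tostring(p, encoding="unicode"))
# <p xmlns="..."><persName ref="#GWJ001">Geo W. Johnson, Esq</persName> of
# <placeName ref="#frankfort_ky">Frankfort</placeName> wrote to the governor.</p>
```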
Now when done well, this approach is
unparalleled in its quality. When you get names that have correct references within the ref attribute of their placeName tags, you really can't beat it. The problem with this approach is that it's
very labor-intensive, and because it's done before publication, it
adds an extra step before the readership can have access to the
documents.
The alternative approach which we've seen in the digital humanities is the text mining approach, in which
existing documents have Named Entity Recognition and other machine
learning algorithms applied to them to attempt to find people who are
mentioned within the documents.
Here's an example, looking for places,
people, or concepts within a text.
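As a sketch of what that kind of pipeline looks like, here is a minimal example using spaCy's stock English NER model; this stands in for the general technique rather than any particular project's tooling, and the sentence is invented.

```python
# Generic text-mining sketch: run an off-the-shelf NER model over a sentence
# and list the entities it finds. Illustrative only, not the project's pipeline.
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Gov. Beriah Magoffin received a petition from the citizens of "
          "Frankfort, Kentucky, concerning the officers of the county court.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Beriah Magoffin PERSON", "Frankfort GPE"
```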
The problem with this approach—while
it's not labor-intensive at all—is that it doesn't produce very
good quality. The first Union governor of Kentucky, Beriah Magoffin,
appears with one hundred and seven variants within the texts that have been annotated so far. These can be spelling variants, they can be abbreviations, and there are all these periphrastic expressions
like “your Excellency”, “Dear Sir”, “your predecessor”
(in a letter to his successor). So there is all kind of variation in
the way this person appears [in the text].
Furthermore, even when you have
consistency in the reference, the referent itself may be different.
So, “his wife” appears in these documents as a reference to eight
different people. What are you going to do with that? No clustering
algorithm is going to figure out that “his wife” is one of these
eight people.
Our goal was to try to reproduce the
quality of the hand-encoded TEI-XML model in a less labor-intensive
way.
[Sara] So how do we do that? We built
a system, called Mashbill, for a cadre of eight GRAs (graduate research assistants), each assigned
150 documents from the corpus, who used a Chrome plug-in called
hypothes.is to highlight every entity in the published version of the
documents. So the documents are transcribed and published, and the
GRAs highlight every instance of an entity.
If we look at the second [highlight], “Geo W. Johnson, Esq”: they highlight it, and then they use Mashbill, which uses the hypothes.is API to pull in all of their verbatim annotations.
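As a rough sketch of that step, the snippet below queries the public hypothes.is search API for a document's annotations; the API token and document URL are placeholders, and the field names follow the Hypothesis annotation format as we understand it rather than Mashbill's actual code (Mashbill itself is written in Ruby).

```python
# Sketch: fetch a document's annotations from the hypothes.is search API and
# print the highlighted text of each one. Token and document URL are placeholders.
import requests

API_URL = "https://api.hypothes.is/api/search"
API_TOKEN = "YOUR_HYPOTHESIS_API_TOKEN"                                   # placeholder
DOCUMENT_URI = "http://discovery.civilwargovernors.org/example-document"  # placeholder

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"uri": DOCUMENT_URI, "limit": 200},
)
response.raise_for_status()

for annotation in response.json()["rows"]:
    # The verbatim highlighted text lives in a TextQuoteSelector on the target.
    for selector in annotation["target"][0].get("selector", []):
        if selector.get("type") == "TextQuoteSelector":
            print(selector["exact"])          # e.g. "Geo W. Johnson, Esq"
```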
Each GRA sees the annotations they have
created, and [next to each] is an “identify” button. This pulls
the verbatim text into a database search using Postgres's trigram
library to look for closest matches within our database of known
entities.
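That lookup is the kind of query Postgres's pg_trgm extension supports directly. Here is a sketch, with a hypothetical entities table whose name and columns are our assumption rather than Mashbill's actual schema.

```python
# Sketch of a trigram-similarity lookup against a hypothetical "entities" table.
# Assumes CREATE EXTENSION pg_trgm; has been run on the database.
import psycopg2

conn = psycopg2.connect("dbname=mashbill")   # placeholder connection string

SQL = """
    SELECT name, similarity(name, %(q)s) AS score
    FROM entities
    WHERE name %% %(q)s              -- pg_trgm's "is similar to" operator
    ORDER BY score DESC
    LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(SQL, {"q": "Geo W. Johnson, Esq"})
    for name, score in cur.fetchall():
        print(f"{score:.2f}  {name}")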
“Geo W. Johnson, Esq” has the
potential to match a lot of people—mostly based on surname. It
looks like it might be the second one, George Johnston (a judge), but
probably it's George Washington Johnson—halfway down the page—who
was one of these provisional Confederate governors. The GRA would
choose that to associate the string with the entity in the database,
but if they couldn't find an entity—remember that the goal is to
find all the people in the corpus who are not already known to
historians—they have the ability to create an entity record.
When you create a new entity or when
you're working with an entity, we flesh out a lot of really rich
information about that entity within the tool. The GRAs would fill
in attributes from their research into a set of approved references
for Kentucky in this period, including dates, race, gender,
geographic location (latitude/longitude). We also get short biographies, which will be incorporated into the edition, and a list of documents [mentioning the entity].
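Roughly speaking, each entity record carries the fields just listed. Here is a sketch of the shape of such a record; the field names are our own shorthand, not Mashbill's actual schema.

```python
# A rough sketch of the entity record described above; field names are our own
# shorthand rather than Mashbill's actual database schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entity:
    name: str
    entity_type: str                       # person, place, organization, or geographic feature
    birth_date: Optional[str] = None
    death_date: Optional[str] = None
    race: Optional[str] = None
    gender: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    biography: str = ""                    # short biography, incorporated into the edition
    document_ids: List[str] = field(default_factory=list)  # documents mentioning the entity
```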
Once you have the information, you can
do a lot with the entities really quickly. We can do rich entity
visualizations: the big dots are people, places, organizations, and
geographic features; you can look at gender of entities within the
corpus; we can look at entities that appear more often than others
and who they are. You can do a lot of high-value work with the data.
We can also look at documents and the
places that they mention – large dots are places that are mentioned
more often in the documents.
[Ben] The last stage of this is—once
the entity research is finished and once the annotations for the
document have all been identified—the Mashbill system will produce
a TEI-XML file for every entity. It will also update the existing
TEI documents that were created during the transcription process with
the appropriate persName, placeName, orgName tags with references to
[the entity files]. It will also automatically check those files
into Github so that the Github browser interfaces will display the
differences between [the versions].
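To illustrate that last step: Mashbill itself is a Ruby on Rails application, but the sketch below, with invented file paths and entity identifiers, shows the general idea of tagging a mention and committing the change.

```python
# Illustrative only: wrap an identified mention in a persName tag that points at
# the entity's TEI file, then commit the change so the diff is visible on Github.
# Paths, IDs, and repo layout are invented; real code would edit the XML tree
# rather than doing a string replacement.
import subprocess

doc_path = "documents/EXAMPLE-DOCUMENT.xml"     # placeholder TEI transcription
mention = "Geo W. Johnson, Esq"
entity_file = "entities/EXAMPLE-ENTITY.xml"     # placeholder per-entity TEI file

with open(doc_path, encoding="utf-8") as f:
    tei = f.read()

tagged = tei.replace(mention, f'<persName ref="{entity_file}">{mention}</persName>', 1)

with open(doc_path, "w", encoding="utf-8") as f:
    f.write(tagged)

subprocess.run(["git", "add", doc_path], check=True)
subprocess.run(["git", "commit", "-m", f"Tag '{mention}' in {doc_path}"], check=True)
subprocess.run(["git", "push"], check=True)
```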
So we end up with an output that is
equivalent to a hand-coded digital edition, which is P5-compliant TEI,
but which we hope takes a little bit less labor.
If we're trying to look at
relationships between people in this corpus, we need to define those
relationships. One traditional method—which we saw earlier in [François Dominic Laramée's presentation, "La Production de l’Espace dans l’Imprimé Français d’Ancien Régime : Le Cas de la Gazette"]—is co-occurrence: trying to identify entities that are
mentioned within a block of text. Maybe [that block] is a page,
maybe it's a paragraph, maybe it's a sentence or a word window.
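A bare-bones version of that heuristic just counts pairs of entities that share a block of text; the data below is invented.

```python
# Minimal co-occurrence sketch: any two entities appearing in the same paragraph
# are counted as a related pair. The example data is invented.
from collections import Counter
from itertools import combinations

paragraphs = [
    ["Reuben Jones", "officers of the county court", "the letter writer"],
    ["the letter writer", "Beriah Magoffin"],
]

pair_counts = Counter()
for entities in paragraphs:
    for a, b in combinations(sorted(set(entities)), 2):
        pair_counts[(a, b)] += 1

for pair, count in pair_counts.most_common():
    print(count, pair)
```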
But co-occurrence has a lot of challenges. For example, [pointing] if we look right here, “our Sheriff” (who is identified as Reuben Jones, I think) is mentioned within the same paragraph as these other names. But the reason he's mentioned is – it's just an aside: we sent a letter via our sheriff, now we're going to talk about these county officers. There's no relationship between the sheriff, Reuben Jones, and the officers of the county court. The only relationship that we know of is between Reuben Jones and the letter writer – and that's it. Co-occurrence would be completely misleading here.
[Sara] So what do we do instead?
Once you've identified the entities
within the text, the next step in the Mashbill pipeline is to define
the relationship that you're seeing. Those might be relationships
that are attested to by the document itself, or they might be
relationships that the GRAs found over the course of their research
for the biographies.
Mashbill displays a list of all the
entities that appear in a document, and the GRAs choose relationships
for those entities based on their research. We have six different
types of relationships—social, legal, political, slavery,
military—and we also prompt the GRAs, showing them what we already
know about the relationships of the [entities mentioned within a
document].
So we have richer relationship data
than a lot of traditional computational approaches, which means that
you can do visualizations which have more data encoded within them,
and can be more interesting.
This is Caroline Dennett, who was an enslaved woman who was brought [to Kentucky]
as contraband with the Union Army, was “employed” by a family in
Louisville, and was accused of poisoning their eighteen-month-old daughter. There are a lot of documents about her, because there are
people writing to the governor about pardoning her, or attesting to
her character (or lack of ability to do anything that horrible).
What we show in our network is not just Caroline and all the people and organizations she was related to, but rather the different types of relationships: we have legal relationships, political relationships, and social relationships. For example, a preacher in her town was one of the people who wrote to the governor on her behalf, so we show a social relationship with that person. We have about three different types [of relationships] displayed in different colors on this graph.
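Here is a sketch of how such a typed-relationship graph can be drawn with networkx and matplotlib; the nodes and edge types are simplified stand-ins for the relationships just described, not the edition's actual data.

```python
# Sketch of a relationship graph with edge colors keyed to relationship type.
# The nodes and edges are simplified stand-ins, not the edition's actual data.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_edge("Caroline Dennett", "Louisville family", kind="legal")
G.add_edge("Caroline Dennett", "local preacher", kind="social")
G.add_edge("local preacher", "Office of the Governor", kind="political")

colors = {"legal": "tab:red", "social": "tab:green", "political": "tab:blue"}
edge_colors = [colors[G[u][v]["kind"]] for u, v in G.edges()]

nx.draw_networkx(G, edge_color=edge_colors, node_color="lightgray", font_size=8)
plt.axis("off")
plt.show()
```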
What are our results?
As of a week ago, this project had annotated 1,228 documents with 15,931 annotations. Of those annotations, 14,470 have been identified as 8,086 particular entities. On our right [pointing], we have the distribution of annotations on documents: some of them, like petitions, have as many as 238 names, but our median is around eight entities named per document.
You can find the project at
civilwargovernors.org. That's the Early Access version which is just
the transcriptions; by October those will be republished with all the
biography data and the links between the documents and the entity
biographies.
The software is on Github. I'm Sara
Brumfield, this is Ben Brumfield; we're with Brumfield Labs. Patrick
Lewis is the PI on this project, Whitney Smith, Tony Curtis, and Jeff
Dycus are editors and technologists at the Kentucky Historical
Society. We also want to thank the graduate research assistants.
[applause]
Many questions were very faint in
the audio recording; as a result, the following question texts should
be regarded as paraphrase rather than transcripts.
Question: You mentioned the project's
goals of trying to get beyond a pro-slavery, pro-Confederate historical
record. Do you have an idea of how that's going?
Answer: [Ben] What we find is that the
documents skew male; they skew white. So it's not like we can create
documents that don't exist. But what we can do now is identify
documents and people, so you can say “Show me all the women of
color who are mentioned within the documents; I want to read about
them.” So at least you can find them.
Question: Despite the workflow and
process, it seems like there are still a lot of hours of labor
involved in this. Can you give us an idea of the amount of labor
involved in this project, outside of building the software?
Answer: [Sara] The budget for the labor
was $40,000, which hired eight GRAs for the summer. [Ben] They're
not done yet, but we think they will achieve the goal of 15,000
entities. It's hard to compare the labor with a traditional TEI tagging project, in part because—in addition to identifying entities—every single entity had to be researched, and a biography had to be written for each one if possible. [Sara] That's obviously
labor-intensive. From a software perspective, we tried to think
really hard about how to make this work go faster. So using
hypothes.is for annotation: hypothes.is is really slick, and we also
didn't have to build an annotator, so that keeps your costs of
software development down. So that went really fast. The same with matching entity candidates for them to choose from; we tried to do a lot of that sort of work to make the GRAs as effective as possible. [Ben] But they still have
to do the research; they still have to read the documents.
Question: All of your TEI examples
focus on places – were you able to handle other kinds of entities?
Answer: [Ben] We concentrated on
people, places, and organizations, but one interesting thing about
this approach is that—if you look up here at entities mentioned
more than ten times, and I'm sorry there's no label—the largest red
blob and the largest blue blob are both Kentucky. One of them is the
Government of Kentucky; the other is Kentucky as a place. Again,
humans can differentiate that in a way that computers can't. [Sara]
We did organizations, people, places, and geographic features.
Question: This is a fantastic resource not just for the Kentucky Historical Society, but also for thinking through history in the US. I was wondering what your data plan was, and how available and malleable the data that you produce is.
Answer: [Sara] The data itself is
flushed to Github as TEI documents, so every entity will have a file there, as well as every document. The database itself is
not published anywhere. [Ben] Our goal with this was that, by the time we got to the “pencils down” phase of the project, everything would be interoperable and in Github, so that people could reconstruct the project from it, and no information would be lost – but that's the extent of it.
Question: A technical question – I
missed the part with Github. How does that work?
Answer: [Ben] So the editors were looking for
a way of exposing the TEI for reuse by other people. Doing all this
work on TEI, then locking it away behind HTML is no fun. They loved the idea of Github as a repository; we had used it before for the Stephen F. Austin papers as a raw publication venue. That said, they were really not comfortable with their graduate research assistants having to figure out how git works, how to resolve merge conflicts, and such. As a result, every time there's a change to a document or an entity, Mashbill, the Ruby on Rails application that we built, does a checkout and merge, finds the TEI, adds the tags (essentially merging all that data in), and then checks that back into Github. That way [the GRAs] are able to use the Github web interface to see the diffs and publish the data, but they don't have to actually touch git. [Sara]
Right, but the editors might, if they need to.
About us: Brumfield Labs, LLC is a software consultancy specializing in digital editions and adjacent methodologies like crowdsourced transcription, image processing/IIIF, and text mining. If you have a project you'd like to discuss, or just want to pick our brains, we'd love to talk to you. Just send a note to benwbrum@gmail.com or saracarl@gmail.com and we'll chat.