Sunday, October 25, 2015

Encoding Account Books Relating to Slavery in the U.S. South at MEDEA Regensburg

On October 22-24 of 2015, I was fortunate to attend the NEH/DFG-sponsored MEDEA workshop in Regensburg, Germany. The workshop gathered American and European scholars, editors, and technicians working with digital editions of financial records, an often-overlooked type of textual source.  I presented along with Anna Agbe-Davies, a faculty member at the University of North Carolina-Chapel Hill, with whom I am collaborating to extend FromThePage to support tabular data within texts.  You can read background on the project in our abstract on the MEDEA website.

This document is a composite of the prepared text delivered by Anna Agbe-Davies and a transcript of the ex tempore talk by Ben Brumfield.  Each section will be preceded by the name of the speaker in boldface, with editorial interventions in [brackets].

Agbe-Davies: Neither of us is an historian.  I am an archaeologist and, as is usual in the US, also an anthropologist.  I came to texts such as the “Slave Ledger” discussed below with a straightforward question: what were enslaved people buying in 19th-century North Carolina?  In this sense, the store records complement the archaeological record, which is my primary interest.  Clearly, however, these texts have additional meanings and potential for addressing much more than material culture and consumption.  This is exciting for the anthropologist in me.  I have experience with the methods of historical analysis, but the technological advances of the last few years mean that I have much to learn about the best techniques for harnessing the potential of such documents.
Ben and I are collaborating to extend the capabilities of his online transcription tool FromThePage, to unleash the full analytical possibilities embodied in such texts, including the archive he will now describe.
Brumfield: I'd like to introduce the papers of Jeremiah White Graves.  These are three volumes, bound posthumously from approximately thirty notebooks, comprising roughly 1600 pages' worth of diaries, formal accounts, and informal accounts.  They are held at the Alderman Library at the University of Virginia and may be accessed online in facsimile edition.
Jeremiah White Graves moved from Louisa County, Virginia to Pittsylvania County, Virginia when he was fifteen years old.  In 1823, at the age of 22, using the skills that he learned as a store clerk, he began keeping accounts on his own.  These accounts cover his activities trading with his neighbors, but primarily [cover his activities as] a plantation owner.  He acquired the plantation of Aspen Grove, as well as inheriting other plantations.  Aspen Grove is 120 kilometers north of Stagville, which is the plantation that Anna will discuss, and--like Stagville Plantation--it primarily produced tobacco crops for cash through the work of enslaved laborers.
Some of his accounts are formal.  These may look very familiar to many of you.  This is how he started his accounts in 1822, but he soon found that a formal accounting system did not serve his needs very well.
He started keeping informal accounts to track other activities, such as (in this case) visits by a doctor to treat members of his household, both slave and free.  These informal accounts also cover shipments of logs, corn, or cotton to mills.  They cover days his children attended school.  They also cover articles of clothing his children took with them to boarding schools.

One of the most interesting things about these accounts is the light they shed on the relationship between Graves and his enslaved laborers, and the relationships among them and the rest of the community.  One of the challenges of the accounts is that they have a very complex topology.  Because the accounts are informal, accounts will be written in separate, unrelated [inaudible].  

In this case, we have a two-entry account between Graves and "my Henry"--one of his primary enslaved laborers--to whom he loans money.  Henry then pays him back.  So we have two entries in this account.

This account is stuck between shipments of cotton and logs to mills in a previous year, sticks of tobacco [stripped], a later account of tobacco cut in fields, and then a much earlier account of tobacco [stripped] in prize barns.

You see a similar challenge over here [points to second page], where--above the intriguing entries on meat sent to laborers at Aspen Grove Plantation from a different plantation--you find this fascinating account with entries between "my Frederic" (another of Graves's laborers) and Graves.  One of the fascinating things about this account is that Frederic dies, and--in one of the only instances in which Graves records women in his informal accounts--Graves settles the account with Malissa, Frederic's enslaved widow.
Another challenge of the accounts is that they have a complex order.  Graves began his notebooks with diary entries from front to back.  He would write his accounts from back to front. Then when they met in the middle, he would start a new book, [though] sometimes returning to the older books.

As you see here, we have a four-year-long account that starts on the second-to-last page of the book, continues on page 18, then on page 17, and finally finishes up on page 5 of the volume.

While these accounts are complex, they are not unique, so I will hand this over to Anna.
Stagville was the founding farm for a vast plantation complex assembled by several generations of Bennehans and Camerons.[1]  A local historian estimates that at their most powerful, the family owned around 900 men, women, and children (Anderson 1985:95). Some of the people at Stagville stayed on after Emancipation, allowing for a fascinating glimpse of the transition from slavery to tenancy and wage labor.

1. The Bennehan/Cameron holdings included nearly 20,000 acres in Durham, Wake, and Granville Counties in 1890 (McDuffie 1890).  Anderson estimated a peak of 30,000 acres along the Flat, Eno, and Neuse Rivers, not to mention thousands more in western NC, plantations in Alabama and Mississippi, as well as residences in the county seat and the state capital.

Daybooks and ledgers from plantation stores owned by the Bennehan-Cameron family cover the years 1773 to 1895. Many of the men and women whose purchases are recorded therein were the family’s chattel property and, in later years, their tenants or employees. There are forty-five daybooks and twenty ledgers in the family papers, which are collected in the University of North Carolina’s Southern Historical Collection.[2]  Eleven volumes are flagged by the finding aid as including purchases by “slaves” or “farm laborers,” though many volumes have no summary and may contain as-yet unidentified African American consumers.

2. In addition to the daybooks and ledgers, there are also cash books, books of ready money sales, and personal/household account books, numbering 142 “financial volumes.” 
My aim is to digitize and analyze a selection of these daybooks and ledgers.  This project augments the Southern Historical Collection’s effort to make important manuscripts available via the Internet.  My project not only increases the number of volumes online, in a format that enables analysis by users with varying levels of expertise, but also makes the contents of these documents available as data, not merely images.

One of the questions guiding my research is this: What did it mean to shop in a store if you yourself could be bought and sold? I am interested in both the financial and social aspects of accounting in the plantation context.  Daybooks and ledgers offer an important complement to the archaeological record at Historic Stagville, in Durham, North Carolina.

[omitted from the oral presentation:] Archaeologists can speculate about, but seldom demonstrate, the paths by which goods reached the quarter. Artifacts may reflect the actions of the owner who issued clothing or tools and passed along hand-me-downs. Conversely, finds may speak to the agency of the owned, as when they hunted or grew food for their own consumption or purchased items of personal adornment with cash earned on the side. However, neither interpretation is evident in the artifacts themselves. Archaeologists need additional sources of information because these distinctions have implications for how we view material aspects of the relationship between owner and owned—how power was wielded, how demands were negotiated. The daybooks and ledgers are one way in which to capture how African American consumers at Stagville—pre-Emancipation and during the years of Jim Crow—fashioned lives with the things that they bought.
Brumfield: What we plan to do is to use the open-source digital edition tool FromThePage--which I run, though I welcome contributions from anyone else--to digitize these documents -- to transcribe them.
FromThePage already handles transcription and presentation online.  The core functionality of FromThePage is the wiki-link.  FromThePage handles mark-up using a wiki syntax backed by a relational database, which can suggest mark-up.  So if a user sees the phrase "Renan" and they transcribe it, it is then expanded to the canonical name "Renan, Virginia".
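[In FromThePage's wiki syntax, that mark-up looks something like the following, with the canonical name before the pipe and the verbatim text after it:]

    [[Renan, Virginia|Renan]]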
This is then used for presentation: users who see Renan can see the explanation.  If they explore the subject, they can see an automatically-generated index.
What we plan to do--now we're moving to the draft design--is to add new wiki mark-up to handle sections, which will define different blocks within the text, and then to use Markdown-style wiki mark-up to describe tables.  This addresses data entry.  (We're not big fans of hand-coded XML as a user interface; hand-coded wiki?  We'll see how that works.)
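[As a rough sketch of what that data entry might look like--the headers, dates, and amounts here are invented for illustration--a Markdown-style table for an account like Henry's could read:]

    | Date       | Transaction            | Amount |
    |------------|------------------------|--------|
    | 1845 May 1 | cash lent [[Henry]]    | $1.50  |
    | 1845 Jun 8 | received of [[Henry]]  | $1.50  |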
But what's important and relevant here is that this [mark-up] is interpreted by the software and then displayed: in HTML, we display simple HTML tables.  For TEI, we'll expand to TEI tables, with the wiki-links expanded using A tags for HTML, or reference strings (rs elements) pointing to elements within the TEI header.
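[A hedged sketch of what one exported row might look like in TEI--the identifier and values are invented, and the real export may differ:]

    <table>
      <row>
        <cell>1845 May 1</cell>
        <cell>cash lent <rs ref="#henry">Henry</rs></cell>
        <cell>$1.50</cell>
      </row>
    </table>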
We have further ideas for exports -- I'm very interested to see other presentations for ideas for those.
However, to serve Anna's analytical needs, we need to export these tables in CSV format.  So what we have designed is the ability to export all records from the collection in a single spreadsheet.  The spreadsheet will be sparse: entries from different tables that contained the same column header when they were encoded will appear in the same column of the spreadsheet.  If one table contains an extra column that other tables did not, that column will appear in the final spreadsheet, but rows from tables that did not contain it will [have blank cells] in the spreadsheet.  We also plan to expand the data columns to handle the wiki text, so that both canonical subjects and verbatim text will be included.
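[A minimal sketch of that sparse merge in Python--the table data is invented, and FromThePage's actual export code may differ:]

    import csv

    # Each transcribed table becomes a list of rows keyed by its column headers.
    tables = [
        [{"Date": "1845 May 1", "Debit": "cash lent Henry", "Amount": "$1.50"}],
        [{"Date": "1846 Feb 2", "Credit": "by 10 sticks tobacco"}],
    ]

    # The union of all headers, in first-seen order, gives the sparse column set.
    columns = []
    for table in tables:
        for row in table:
            for header in row:
                if header not in columns:
                    columns.append(header)

    # restval="" leaves blank cells where a table lacked a given column.
    with open("collection_export.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        for table in tables:
            writer.writerows(table)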
Agbe-Davies: I have transcribed one document called the “Slave Ledger,” but have found the result to be inadequate for the analyses I would like to perform. The combination of qualitative and quantitative research goals means that neither a transcription nor a spreadsheet can handle the range of analyses necessary.

The many goods listed in the document (spelled variously) need to be categorized in several ways.  Sometimes they are purchases; other times, sources of credit.  I would like to be able to find not only instances of “shoes” but also other instances of “footwear,” “clothing,” and “goods made by other members of the plantation community,” not to mention being able, in various circumstances, either to merge or to separate “shoes” from “repair of shoes.”
Another form of analysis enabled by tags is pulling out purchases by a single canonical individual, even when different names are used.  Using my transcription of the Slave Ledger, I still had to pick out individuals for this chart by hand because no text search would pull out all and only references to Frank Kinnon, when there are multiple “Frank”s and his second name appears with several different spellings and grammatical constructions.[3]

As this slide also shows, the ability to pull together records by categories—with those categories being multiscalar—is important for the quantitative analyses that I perform.  In order to examine both trends and change over time, I will be performing analyses within, across, and among manuscripts.  Thus, these tags should live somewhere outside any single document.

I will be examining how people spent precious cash or credit to determine whether gaps were left by the provisioning system during slavery times. If the Bennehans’ and Camerons’ human property regularly purchased basic staples, it would offer an interesting contrast to the paternalistic, “enlightened” slaveowner of their own imaginations (Anderson 1985:96). In addition, I want to know whether people on the Bennehan-Cameron farms were making purchases similar to those of folks elsewhere in the plantation South (Heath 2004; Martin 1993).  And what (dis)continuities exist between the pre- and post-Emancipation eras, as households assumed greater responsibility for their own sustenance?

3. For example, Frank Kinnon, Kennon Frank, and Frank Kennon, not to be confused with Old Frank/Old Frank Eno.
Because I am not an expert on account books, I don’t know how unusual this is, but I am finding in the Stagville accounts many instances of debtors trading credits among themselves, using them as a kind of currency unconnected to store purchases; also instances of someone buying an item for another debtor, and even instances of cooperative purchases or credits.  Again, these don’t fit neatly into a standardized recording structure, hence the need for something that is more flexible than a database or spreadsheet, but which nevertheless retains some of the qualities of those kinds of documents.  I am as interested in Solomon’s relations with Britain, Mark, Sam, and Ben as I am in his relationship to R. Bennehan & Son.
At the moment, I have to choose between capturing the qualities of this text as a physical document and capturing the information that the text contains.  It is doubtless significant that Ned’s and Miller George’s entries are offset here.  I don’t want to lose this information in an effort to fit these transactions into a one-size-fits-all structure, such as a database.  Likewise, some accounts (like Davy’s, here) are reconciled frequently; others run for long periods of time without a full accounting of what is owed or credited.  It will be important to be able to record interim calculations as well as individual debits and credits.
Once digitized, the resulting product will allow users easily to identify seasonal patterns in purchasing, follow individual shoppers, or discover the popularity of store-bought clothing over time, for example.  Such resources can reach audiences with different levels of expertise or interest and provide them with rich, attractive materials for their own use, or let them explore the end result as a virtual museum to complement the physical museum experience. Users could easily search on characteristics of the transactions, such as individual account holder, item, or date, to independently answer their own questions about plantation life and modern consumerism.  This exploration may even take place on-site.  Historic Stagville has had great success with their genealogical database, and the staff and board are eager to work together to develop more resources to share with their visitors and other stakeholders, such as the Stagville Descendants Council, an African American heritage group.

My aim is to open transcription up to include friends of, and visitors to, Stagville State Historic Site. My time in the museum world largely predates the blossoming of the digital humanities, but I do know how compelling interactive experiences can be, and that audiences understand and appreciate knowledge so much more when they have a hand in its creation (Smith 2014).

There is no conclusion.  This project is an ongoing effort, and we feel fortunate to engage with a community of like-minded researchers before we finalize the protocols for transcription and before Ben does additional programming for FromThePage.  We have come to this meeting to learn from the successes, mistakes, and experience of others and look forward to many fruitful exchanges with you all.

Anderson, Jean Bradley
           1985    Piedmont Plantation: the Bennehan-Cameron family and lands in North Carolina. Durham, North Carolina: Historic Preservation Society of Durham.

Heath, Barbara J.
           2004    Engendering Choice: Slavery and Consumerism in Central Virginia. In Engendering African American Archaeology: A Southern Perspective. J.E. Galle and A.L. Young, eds. Pp. 19-38. Knoxville: The University of Tennessee Press.

Martin, Ann Smart
           1993    Buying into the world of goods: Eighteenth-century consumerism and the retail trade from London to the Virginia frontier. Ph.D. dissertation, History, The College of William and Mary.

McDuffie, D. G.
           1890    Map of Honorable Paul C. Cameron's Land on Flat, Eno, and Neuse Rivers in Durham, Wake, and Granville Counties, March 1890. Manuscript map in the Southern Historical Collection, University of North Carolina at Chapel Hill.

Smith, Monica L.
           2014    Citizen Science in Archaeology. American Antiquity 79(4):749-762.

Tuesday, May 19, 2015

Day of DH 2015

For the fourth year, I'm participating in the Day of DH.

You can follow my day at the Day of DH blog.

Friday, May 8, 2015

Best Practices at Engaging the Public at CCLA

This is the text of my talk at the best practices panel at the Crowd Consortium for Libraries and Archives meeting Engaging the Public on May 8, 2015.

One caveat: most of my background is in crowdsourced manuscript transcription, though with the development of FromThePage 2 I've become involved in the related fields of collaborative document translation and crowd-sourced OCR correction. I hope that this is useful to non-textual projects as well.

The best practice I'd like to talk about is returning the product of crowd-sourcing to the volunteers who produced it.

What do I mean by product?
I'm not talking about what project managers consider the final product, whether that be item-level finding aids or peer-reviewed papers in the scholarly press. I'm talking about the raw product – the actual work that comes out of a volunteer's direct effort, or the efforts of their fellow volunteers – the transcript of a letter, the corrected text of a newspaper article, the translated photo captions, the carefully researched footnotes and often personal comments left on pages.


Why return this raw product to the volunteers? First, it's the right thing to do. Yesterday we talked about reciprocity and social justice. An older text says “Thou shalt not muzzle the oxen that tread out the corn.”

Crowdsourced transcription projects vary a lot on this. For wiki-like systems, displaying volunteer transcripts is built into the system – I know that's the case for FromThePage, Transcribe Bentham, and Wikisource, and suspect the same applies to Scripto and DIYHistory. For others, users can't even see their own contributions after they have submitted them. However, the Smithsonian Institution Transcription Center actually added this feature on purpose – the team implementing the center added the ability for users to download PDFs of transcribed documents specifically because they felt it was the Right Thing to Do.

Now that I've quoted the Bible, let's talk about purely instrumental reasons crowdsourcing projects should return volunteers' labor to them.


For one thing, exposing the raw data early can better align our projects with the incentives that motivate many volunteers. Most volunteers are not participating because of their affiliation with an institution, nor because they treasure clean library metadata – at least not primarily! What keeps them coming back and contributing is their connection to the material – an intrinsic motivation of experiencing life as a bird-watcher in the 1920s, of marching alongside a Civil War soldier as they transcribe observation cards or diaries.

We should expose the texts volunteers have worked on in ways that are immediately usable to them – PDFs they can print out, texts they can email, URLs they can post on Facebook—to show their friends and families just what they've been up to, and why they're so excited to volunteer.

In some cases this may provide extrinsic rewards project managers can't envision. One of the first projects I worked on, the Zenas Matthews diary of the Mexican-American War, attracted a super-volunteer early on who transcribed the entire diary in two weeks. When I interviewed Scott Patrick, I learned that the biggest reward we could provide – the thing he'd treasure above badges or leader boards – would be the text itself in a printable and publishable format. You see, Mr. Patrick's heritage organization formally recognizes members who have written books, including editions of primary sources. His contribution to the project certainly matched his fellows' for quality, but access to a usable form of the text—the text he'd transcribed himself—was the thing that stood in his way.


Exposing raw transcripts online during the crowdsourcing process can actually enhance recruitment to crowd-sourcing projects. I've seen this in a personal project I worked on, in which one super-volunteer found the project by Googling his own name. You see, a previous volunteer had transcribed a lot of material that mentioned a letter carrier named Nat Wooding. So when Nat Wooding did a vanity search, he found the transcribed diaries, recognized the letter carrier as his great-uncle, and became a major contributor to the project. Had the user-generated transcripts been locked away for expert review, or even published online somewhere outside of the crowdsourcing tool, we would have missed the contributions of a new super-volunteer.


For the past three years, I've been involved with a non-profit called Free UK Genealogy. They have volunteers around the world transcribe genealogical records using offline, spreadsheet-like tools so that the records can be searched on a freely accessible website.

I spent several months building a new system for crowd-sourced transcription of parish registers, but encountered very little enthusiasm—actually some outright opposition—from the most active volunteers. They were used to their spreadsheets, and saw no value at all in changing what they were doing.

Eventually, we switched from improving the transcription tool-chain to improving the delivery system. We re-wrote the public-facing search engine from scratch, focusing on the product visible to the volunteers and their communities. When we launched the site in April, it received the most positive reviews of any software redesign I've been involved with in two decades in the industry. Best of all—although the time frame is too short to have hard numbers—the volunteer community seems to have been reinvigorated, as the FreeREG2 database passed 32 million records at the beginning of the month.

So that's my best practice: expose volunteer contributions online, within your crowdsourcing system, as they are produced. It will improve the quality and productivity of the project, and it's the right thing to do.

Sunday, July 6, 2014

Collaborative Digitization at ALA 2014

This is a transcript of the talk I gave at the Collaborative Digitization SIG meeting at the American Library Association annual meeting on June 28, 2014 in Caesar's Palace casino in Las Vegas.  I was preceded by Frederick Zarndt delivering his excellent talk on Crowdsourcing, Family History, and Long Tails for Libraries, which focused particularly on newspaper digitization and crowdsourced OCR correction.  (See Laura McElfresh's notes [below] for a near-transcript of his talk.)
I'd like to thank Frederick for a number of reasons, one of them being that I don't need to define crowdsourcing, which gives me the opportunity to be a little more technical.
Before we start, I'd just like to make a quick note that all of the slides, the audio files in MP3 format, and a full transcript will be posted at my blog.

I can also direct you to the notes taken by Laura McElfresh [see pp. 19-22] over there who does an amazing job at these [conferences].

Finally, if you tweet about this, there's my handle.

Okay, so we've talked about OCR correction. What's the difference between OCR correction and manuscript transcription? Why would people transcribe manuscripts -- isn't OCR good enough?

I'd like to go into that and talk about the [effectiveness] of OCR on printed material versus handwritten materials.

We're going to go into detail on the results of running Tesseract--which is a popular, open-source OCR tool--on this particular herbarium specimen label.

I chose this one because it's got a title in print up here at the top, and then we've got a handwritten portion down here at the bottom.

So how does Tesseract do with these pieces?

With the print, it does a pretty good job, right? I mean, even though this is sort of an antique typeface, really every character is correct except that this period over here--for some reason--is OCRed as a back-tick.

So it's getting one character wrong out of--fifty, perhaps?

So how about the handwritten portion? What do you get when you run the same Tesseract program on that?

So here's the handwritten stuff, and the results are -- I'm actually pretty impressed -- I think it got the "2" right.

So in this case it got one character right out of the whole thing. So this is actually total garbage.
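[If you want to reproduce this comparison, here is a minimal sketch using the pytesseract wrapper; the file names are placeholders, and you'll need the tesseract binary plus the pytesseract and Pillow packages installed.]

    from PIL import Image
    import pytesseract

    # Run the same OCR engine over the printed header and the handwritten
    # portion of the specimen label, and compare what comes back.
    for name in ["label_printed_header.png", "label_handwritten_note.png"]:
        text = pytesseract.image_to_string(Image.open(name))
        print(name, "->", repr(text))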

And my argument is that the quantitative difference in accuracy of OCR software between script versus print actually results in a qualitative difference between these two processes.

This has implications.

One of them is on methodology, which is that--as we've demonstrated--we can't use software to automatically transcribe (particularly joined-up, cursive) writing. You have to use humans.

There are a couple of other implications too, that I want to dive into a bit deeper.

One of them is the goal of the process. In the case of OCR correction, we're talking about improving accuracy of something that already exists. In the case of manuscript transcription, we're actually talking about generating a (rough) transcript from scratch.

The second one comes down to workflow, and I'll go into that in a minute.

Let's talk about findability.

Right now, if you put this page online--this manuscript image--no-one's going to find it. No-one's going to read it. Because Google cannot crawl it -- these are not words to Google, these are pixels. And without a transcript, without that findability, you miss out on the amazing serendipity that is a feature of the internet age. We don't have the serendipity of spotting books shelved next to each other anymore, but we do have the serendipity of--in this case--a retired statistical analyst named Nat Wooding doing a vanity search on his name. And encountering a transcript of this diary--my great-great grandmother's diary--mentioning her mailman, Nat Wooding--and realizing that this is his great-uncle.

Having discovered this, he started contributing to the project--not financially, but he went through and transcribed an entire year's worth of diaries. So he's contributing his labor.

Other people who've encountered these have made different kinds of contributions. These diaries were distributed on my great-great grandmother's death among her grandchildren. So they were scattered to the four winds. After putting these online, I received a package in the mail one day containing a diary from someone I'd never met, saying "looks like you'll do more with this than I will." So this element of user engagement in this case is bringing the collection back together.

Let's talk about the implications on workflow.

This is--I'm not going to say a typical--OCR correction workflow. The thing that I want to draw your attention to is that OCR correction of print can be done at a very fine grain. The National Library of Finland's Digitalkoot project is asking users to correct a small block of text: a single word, even a single character. This lends itself to gamification. It lends itself to certain kinds of quality control, in which maybe you show the same image to multiple people and compare the results to see if they match.
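[A minimal sketch of that kind of redundant-entry check, in Python--the threshold and sample strings are invented for illustration:]

    from collections import Counter

    def consensus(answers, threshold=2):
        # Accept the crowd's reading only if enough volunteers agree;
        # otherwise return None to flag the item for review.
        best, count = Counter(answers).most_common(1)[0]
        return best if count >= threshold else None

    # Three volunteers keyed the same word from the same image:
    print(consensus(["Quercus alba", "Quercus alba", "Quercus alta"]))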

That really doesn't work very well with handwritten text, because readers have to get used to a script. Context is really important! And you find this when you put material online: people will go through and transcribe a couple of pages, then say "Oh, that's a 'W'!" And they go back and [correct earlier pages].

I want to tell the story of Page 19. This was a project that was a collaboration between me (and the FromThePage platform) and the Smith Library Special Collections at Southwestern University in Georgetown (Texas). They put a diary of a Texas volunteer in the Mexican-American War online--his name was Zenas Matthews. They found one volunteer who came online and transcribed the whole thing. He added all these footnotes. He did an amazing job.

But let's look at the edit history of one page, and what he did.

We put the material online in September. Two months later, he discovers it, and transcribes it in one session in the morning. Then he comes back in the afternoon and makes a revision to the transcript.

Time passes. Two weeks go by, and he's going back [over the text]. He makes six more revisions in one sitting on December 8, then he makes two more revisions on the next morning. Then another eight months go past, and he comes back in August in the next year, because he's thought of something -- he's reviewing his work and he improves the transcription again. He ends up with [an edition] that I'd argue is very good.

Well, this is very different from the one-time pass of OCR correction. This is, in my opinion, a qualitative difference. We have this deep, editorial approach with crowdsourced transcription.

I'm a tool maker; I'm a tool reviewer, and I'm here to try to give you some hands-on advice about choosing tools and platforms for crowdsourced transcription projects.

Now, I used to go through and review [all of the] tools. Well, I have some good news, which is that there are a lot of tools out there nowadays. There are at least thirty-seven that I'm aware of. Many of them are open source. The bad news is that there are thirty-seven to choose from, and many of them are pretty rough.

So instead of talking about the actual tools, I'm going to direct you to a spreadsheet -- a Google Doc that I put together that is itself crowdsourced. About twenty people have contributed their own tools, so it's essentially a registry of different software platforms for [crowdsourced transcription].

Instead, I'm going to discuss selection criteria -- things to consider when you're looking at launching a crowdsourced transcription project.

The first selection criterion is to look at the kind of material you're dealing with. And there are two broad divisions in source material for transcription.

This top image is a diary entry from Viscountess Emily Anne Strangford's travels through the Mediterranean in the 1850s. The bottom image is a census entry.

These are very different kinds of material. A plaintext transcript that could be printed out and read in bed is probably the [most appropriate purpose] for a diary entry. Whereas, for a census record, you don't really want plaintext -- you want something that can go into a structured database.

And there are a limited number of tools that nevertheless have been used very effectively to transcribe this kind of structured data. FamilySearch Indexing is one that we're all familiar with, as Frederick mentioned it. There are a few others from the Citizen Science world: PyBossa comes from the Open Knowledge Foundation, and Scribe and Notes From Nature both come out of GalaxyZoo. [The Zooniverse/Citizen Science Alliance.] I'm going to leave those, and concentrate on more traditional textual materials.

One of the things you want to ask is, What is the purpose of this transcript? Is mark-up necessary? These kinds of texts, as we're all aware, are not already edited, finished materials.

Most transcription tools which exist ask users for plain-text transcripts, and that's it. So the overwhelming majority of platforms support no mark-up whatsoever.

However, there are two families of mark-up [support] which do exist. One of them is a subset of TEI markup. It's part of the TEI Toolbar which was developed by Transcribe Bentham for their own platform [the Bentham Transcription Desk], which is a modification of MediaWiki. It was later repurposed by the 1916 Letters project and used on top of a totally different software stack, the NARA Transcribr Drupal module [actually DIYHistory]. And what it does is give users a small series of buttons which can be used to mark up features within a text. So this is really useful if you're dealing with marginalia, with additions and deletions within the text, and you want to track all that. Not everybody wants to track all that, but if that's the kind of purpose that you have, you'll want to look at in-page mark-up.

The other form of mark-up is one that I've been using in FromThePage, using wiki-links to do subject identification within the text. [2-3 sentences inaudible: see "Wikilinks in FromThePage" for a detailed presentation given at the iDigBio Original Sources Digitization Workshop.]

What this means is that if users encounter "Irvin Harvey" and it's marked up like this:
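[The slide is not reproduced here; in FromThePage's double-bracket wiki syntax, the mark-up would look something like this:]

    [[Harvey, Irvin|Irvin Harvey]]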

The tool will automatically generate an index that shows every time Irvin Harvey was mentioned within the texts, and lets readers pull up all the pages mentioning him. You can actually do network analysis and other digital humanities stuff based on [mining the subject mark-up].

So that's a different flavor of mark-up to consider.

Another question to ask is, how open is your project? Right now I know of projects that are using my own FromThePage tool entirely for staff to use internally.

There are others in which they have students working on the transcripts. And in some cases, this is for privacy reasons. For example, Rhodes College Libraries is using FromThePage to transcribe the diaries of Shelby Foote. Well, Shelby Foote only died a few years ago. [His diaries] are private. So this installation is entirely internal. The transcriptions are all done by students. I've never seen it -- I don't have access to it because it's not on the broad Internet.

Then there's the idea of leveraging your own volunteers on-site, with maybe some [ancillary] openness on the Internet. The San Diego Natural History Museum is doing this with the people who come in and ordinarily volunteer to clean fossils or prepare specimens for photographs. Well, now they're saying, "Can you transcribe these herpetology field notes?"

So these kinds of platforms are not only wide-open crowdsourcing tools; they can be private, and you should consider this. In some cases, the same platform can support both private projects and crowdsourced projects simultaneously, so you can get all of your data in the same place. [One sentence inaudible.]

Branding! Branding may be very important.

Here are a couple of platforms, with screenshots of each.

The first one is the French-language version of Wikisource. Wikisource is a sister project to Wikipedia, spun off around 2003, that allows people both to transcribe documents and to do OCR correction. This is being used by the Departmental Archives of Alpes-Maritimes to transcribe a set of journals of episcopal visits. The bishop in the sixteenth century would go around and report on all the villages [in his diocese], so there's all this local history, but it's also got some difficult paleography.

So they're using Wikisource, which is a great tool! It has all kinds of version control. It has ways to track proofreading. It does an elegant job of putting together individual pages into larger documents. But, do you see "Departmental Archives of Alpes-Maritimes" on this page? No! You have no idea [who the institution is]. Now, if they're using this internally, that may be fine -- it's a powerful tool.

By contrast, look at the Letters of 1916. [Three sentences inaudible.] This is public engagement in a public-facing site.

Most platforms are somewhere between the two.

Integration: Let's say you've just done a lot of work to scan a lot of material, gather item-level metadata, and you've [ingested it] into CONTENTdm or another CMS. Now you want to launch a crowdsourcing project. Often, the first thing you have to do is get it all back out again and put it into your crowdsourcing platform.

So you need to look at integration. You need to ask the questions, How am I going to get data into the transcription platform? How am I going to get data back out? These may be totally different things: I know of one project that's trying to get data from Fedora into FromThePage, then trying to get it out of FromThePage by publishing to Omeka. There's a different project that wants to get data from Omeka into FromThePage. But these are totally different code paths! They have nothing to do with each other, believe it or not. So you really have to ask detailed questions about this.

Here are a few of the tools that exist, with what they support. (Or what they plan to support -- last week I was contacted about Fedora support and CONTENTdm support for FromThePage, one on Wednesday and one on Thursday, so if anyone has any advice on integration with those systems, please let me know.)

Hosting: Do you want to install everything on-site? Do you have sysadmins and servers? Is this actually a requirement? Or do you want this all hosted by someone else?

Right now you have pretty limited options for hosting. Notes from Nature and the GalaxyZoo projects host everything themselves. Wikisource and FromThePage can be either local or hosted. Everything else, you've got to download and get running on your servers.

Finally, I'd like to talk a little bit about asking yourself: what are your yardsticks for success?

If you're doing this for volunteer engagement, what does successful engagement look like? I know of one project that launched a trial in which they put some material from 19th century Texas online. One volunteer found this and dove into it. He transcribed a hundred pages in a week, he started adding footnotes -- I mean he just plowed through this. After a couple of weeks, the librarians I was working with cancelled the trial, and I asked them to give me details. One of the things that they said was, We were really disappointed that only one volunteer showed up. Our goal for public engagement was to do a lot of public education and public outreach, and we wanted to reach out [to] a lot of people.

[For them,] a hundred pages transcribed by one volunteer is a failure compared with one page each transcribed by ten volunteers. So what are your goals?

Similarly, if you're using a platform that is a wiki-like platform--an editorial platform--you'll get obsessive users who will go back and revise page 19 over and over again. That may be fine for you. Maybe you want the highest quality transcripts and you don't mind that there's sort of spotty coverage because users come in and only transcribe the things that really interest them.

Other systems try to go for coverage over quality and depth. ProPublica developed the transcribable Ruby on Rails plugin for research on campaign contributions. They intentionally designed their tool with no back button -- there's no way for a user to review what they did. And they wrote a great article about this which is very relevant to this conference venue: it's called "Casino-Driven Design: One Exit, No Windows, Free Drinks". So for them, the page 19 situation would be an absolute failure, while for me I'm thrilled with it. So again there's this trade off of quality versus quantity in product as well as in engagement.
[Audio to follow.]