Abstract: One of the ironies of the Internet age is that traditional standards for accessibility have changed radically. Intelligent members of the public refer to undigitized manuscripts held in a research library as "locked away", even though anyone may study the well-cataloged, well-preserved material in the library's reading room. By the standard of 1992, institutionally-held manuscripts are far more accessible to researchers than uncatalogued materials in private collections -- especially when the term "private collections" includes over-stuffed suburban filing cabinets or unopened boxes inherited from the family archivist. In 2012, the democratization of digitization technology may favor informal collections over institutional ones, privileging online access over quality, completeness, preservation and professionalism.
Will the "cult of the amateur" destroy scholarly and archival standards? Will crowdsourcing unlock a vast, previously invisible archive of material scattered among the public for analysis by scholars? How can we influence the headlong rush to digitize through education and software design? This presentation will discuss the possibilities and challenges of mass digitization for amateurs, traditional scholars, libraries and archives, with a focus on handwritten documents.
In addition to the diary, he kept account books that give you details of plantation life that range from -- that you wouldn't otherwise see in the diaries. So for example, this is his daughter Fanny,
And this is a list of every single article of clothing that she took with her when she went off to a boarding school for a semester.
Perhaps more interesting, this is a memorandum of cash payments that he made to certain of his enslaved laborers for work on their customary holidays -- another sort of interesting factor. I got interested in this because I'm interested in the property that he lived in. The house that he built is now in my family, and I was doing some research on this. Since these account books include details of construction of the house, I spent a lot of time looking for these books. I've been looking for them for about the last ten years. I got in contact with some of the descendants of Jeremiah White Graves and found out through them that one of their ancestors had donated the diaries to the Alderman Library at the University of Virginia. I looked into getting them digitized and tried to get some collaboration [going] with some of the descendants, and one of them in particular, Alan Williams, was extremely helpful to me. But this was his reaction:
Okay. So we have diaries that are put in a library -- I believe one of the top research libraries in the country -- and they are behind a wall. They are locked away from him.
- They're professionally conserved -- great!
- They're publicly accessible, so anyone can walk in and look at them in the Reading Room.
- They're cataloged, which would not be the case if they'd still been sitting in his family.
- On the downside, they're a thousand miles away: they're in Virginia, he's in Florida, I'm in Texas. We all want to look at these, but it's awfully hard for people to get there if we don't have research budgets.
- We have to deal with reading room restrictions if we actually get there.
- Once we work on getting things digitized, we have permission-to-publish requirements that we need to deal with, which pose some moral challenges for someone whose family these diaries came from.
- And we have the scanning fees: the cost of getting them scanned by the excellent digitization department at the Alderman Library is a thousand dollars. Which is not unreasonable, but it's still pretty costly.
One of the problems with that is that you have very limited metadata. The metadata is usually institutionally-oriented. No transcripts, in particular -- nobody has time for this. And quite often, they're in software platforms that are not crawlable by search engines.
Worst of all, however, is that the way that these things are propagated through the Internet is through cut-and-paste: so quite often from a website to a newsgroup to emails, you can't even find the original person who typed up whatever the source material was.
The challenges to institutions, in my opinion, come down to funding and manpower. As we just mentioned, generally archives don't have a staff of people ready to produce documentary editions and put them online.
Outside institutions, the big challenge is standards; it is expertise. You've got manpower, you've got willingness, but you've got a lot of trouble making things work using the sorts of methodologies that have come out of the scholarly world and have been developed over the last hundred years.
So how do we fix these challenges?
OldWeather.org is a project from GalaxyZoo, the Zooniverse/Citizen Science Alliance. The Zenas Matthews Diary was something that I collaborated with the Southwestern University Smith Library Special Collections on. And the Harry Ransom Center's Manuscript Fragments Project.
They launched this project three years ago, I believe, and they're done. They've transcribed all the Royal Navy logs from the period essentially around World War I -- all in triplicate. So blind triple keying every record. And the results are pretty impressive.
Each individual volunteer's transcripts tend to be about 97% accurate. For every thousand logbook entries, three entries are going to be wrong because of volunteer error. But this compares pretty favorably with the ten that are actually honestly illegible, or indeed the three that are the result of the midshipman of the watch confusing north and south.
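The 97% figure and the three-in-a-thousand figure can look inconsistent at first glance. One reading that reconciles them, offered here as an assumption rather than as Old Weather's published method, is that the second figure describes entries after the three blind transcriptions are compared, so an entry is lost only when at least two of the three volunteers get it wrong. The sketch below works that out, assuming each volunteer errs independently; the function name is hypothetical.

```python
from math import comb


def majority_failure_rate(per_key_accuracy: float, keys: int = 3) -> float:
    """Probability that a majority of independent transcriptions are wrong."""
    error = 1.0 - per_key_accuracy
    needed_wrong = keys // 2 + 1  # two wrong out of three is enough to lose the entry
    return sum(
        comb(keys, k) * error**k * per_key_accuracy ** (keys - k)
        for k in range(needed_wrong, keys + 1)
    )


# Assumption: 97% accuracy per volunteer, errors independent of one another.
rate = majority_failure_rate(0.97)
print(f"{rate * 1000:.1f} entries per thousand")
```

Under those assumptions the result is a little under three entries per thousand, which is in line with the figure quoted above. Real transcription errors are unlikely to be fully independent (a smudged page is hard for everyone), so this is a rough check, not a model of the project.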
more than 1.6 million weather observations--again, all triple-keyed--through the efforts of sixteen thousand volunteers who've been transcribing pages from a million pages of logs.
So what this means is that you have a mean contribution of one hundred transcriptions per user. But that statistic is worthless!
[This is a] color map of contributions per user. Each user has a square. The size of the square represents the quantity of records that they transcribed. And what you can see here is that of those 1.6 million records, fully a tenth (in the left-hand column) were transcribed by only ten users.
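The arithmetic behind the mean, and the reason it misleads, can be checked with the round figures quoted in this talk. This is a minimal sketch: the variable names are illustrative, and the inputs are the talk's approximate numbers, not the project's raw data.

```python
# Round figures quoted in the talk (approximations, not project data).
total_records = 1_600_000
total_volunteers = 16_000

# The headline statistic: mean transcriptions per volunteer.
mean_per_volunteer = total_records / total_volunteers

# "Fully a tenth ... were transcribed by only ten users."
top_users = 10
top_records = total_records // 10
mean_per_top_user = top_records / top_users

# How far the heaviest contributors sit from the overall mean.
ratio = mean_per_top_user / mean_per_volunteer

# Mean for everyone outside the top ten.
mean_rest = (total_records - top_records) / (total_volunteers - top_users)

print(f"overall mean:     {mean_per_volunteer:.0f}")
print(f"top-ten mean:     {mean_per_top_user:.0f}")
print(f"top-ten vs. mean: {ratio:.0f}x")
print(f"everyone else:    {mean_rest:.1f}")
```

On these figures, each of the top ten users averages 16,000 records, 160 times the overall mean of 100. A single average hides that spread, which is why the color map is more informative than the statistic.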
[I've] talked elsewhere about how this is true in small projects as well. What I'd like to talk about here is some of the implications.
So here's a posting on Flickr. They're saying, "Please identify this in the comments thread."
Here we have one user going through and identifying just the left hand fragment of this chunk of manuscript that was used for binding.
transcribing my great-great grandmother's diary. The current top volunteer on this is a man named Nat Wooding. He's a retired data analyst from Halifax County, Virginia. He's transcribed a hundred pages and indexed them in six months. He has no relationship whatsoever to the diarist.
But his great uncle was the postman who's mentioned in the diaries, and once we had a few pages' worth of transcripts done, he went online and did a vanity search for "Nat Wooding", found the postman--also named Nat Wooding--discovered that that was his great uncle and has become a volunteer.
But this is a problem that has been solved, outside of manuscripts. It's been solved with photographs. It's been solved by Flickr. Nowadays, if you want to find photographs of African-American girls from the 1960s on tricycles, you can find them on Flickr. Twenty years ago, this was something that was irretrievable. So Flickr is a good example, and I'd like to use it to describe how we might be able to apply it to other fields.
has been a problem with print editions in the past, it is a problem online now. Frankly, scholars don't trust the materials because they're not up to standard.
Another solution is community. You don't go on Flickr just to share your photos; you go on Flickr to learn to become a better photographer. And I think that creating platforms and creating communities that can come up with these standards and enforce them among themselves can really help.
The same thing is true with software platforms, if they actually prompt users and say: "when you're uploading this image, tell us about the provenance." "Maybe you might want to scan the frontispieces." "Maybe you'd like to tell us the history of ownership."
Ben Brumfield is a family historian and independent software engineer. For the last seven years he has been developing FromThePage, an open source manuscript transcription tool in use by libraries, museums, and family historians. He is currently working with FreeUKGen to create an open source system for indexing images and transcribing structured, hand-written material. Contact Ben at firstname.lastname@example.org.