Abstract: Crowdsourcing for cultural heritage material has become increasingly popular over the last decade, but manuscript transcription has become the most actively studied and widely discussed crowdsourcing activity over the last four years. However, of the thirty collaborative transcription tools which have been developed since 2005, only a handful attempt to support the Text Encoding Initiative (TEI) standard first published in 1990. What accounts for the reluctance to adopt editorial best practices, and what is the way forward for crowdsourced transcription and community edition? This talk will draw on interviews with the organizers behind Transcribe Bentham, MoM-CA, the Papyrological Editor, and T-PEN as well as the speaker's own experience working with transcription projects to situate Itinera Nova within the world of crowdsourced transcription and suggest that Itinera Nova's approach to mark-up may represent a pragmatic future for public editions.
Itinera Nova within the world of crowdsourced transcription tools, which means that I need to talk a little bit about crowdsourced transcription tools themselves, and their history, and the new things that Itinera Nova brings.
- A Dutch initiative: Van Papier naar Digitaal which is transcribing primarily genealogy records.
- FreeBMD, FreeREG, and FreeCEN in the UK, transcribing church registers and census records.
- Demogen in Belgium -- I don't know a lot about this -- it appears to be dead right now, but if anyone can tell me more about this, I'd like to talk after this.
- Archivalier Online--also transcribing census records--in Denmark.
- And a series of projects by the Western Michigan Genealogy Society to transcribe local census records and also to create indexes of obituaries.
Familysearch Indexing is, again, a genealogy system primarily concerned with records of genealogical interest which are tabular. It is put up by the Mormon Church.
Then things start to change a little bit. In 2008, I publish FromThePage, which is not designed for genealogy records per se -- rather it's designed for 19th and 20th century diaries and letters. (So here we have more complex textual documents.) Also in 2008, Wikisource--which had been a development of Wikipedia to put primary sources online--starts using a transcription tool. But initially, they're not using it for manuscripts because of policy in the English, French, and Spanish language Wikisources. The only people using it for manuscripts are the German Wikisource community, which has always been slightly separate. So they start transcribing free-form textual material like war journals [ed: memoirs] and letters. But again, we have a departure from the genealogy world.
In 2009, the North American Bird Phenology Program starts transcribing bird observations. So in the 1880s you had amateur bird-watchers who would go into the field and they would record their sightings of certain ducks, or geese, or things like that, and they would record the location and the birds they had observed. So we have this huge database of the presences of species throughout North America that is all on index cards. And as the climate changes and habitats change, those species are no longer there. So scientists who want to study bird migration and climate change need access to these. But they're hand-written on 250,000 index cards, so they need to be transformed. So that requires transcription, also by volunteers. [ed: The correct number of cards is over 6 million, according to Jessica Zelt's "Phenology Program (BPP): Reviving a Historic Program in the Digital Era"]
Old Weather project, which comes out of the Citizen Science Alliance and the Zooniverse team that got started with GalaxyZoo. The problem with studying climate change isn't knowing what the climate is like now. It is very easy to point a weather satellite at the South Pacific right now. The problem is that you can't point a weather satellite at the South Pacific in 1911. Fortunately, in many of the world's navies, the officer of the watch would, every four hours, record the barometric pressure, the temperature, the wind speed and direction, the latitude and the longitude in the ship's log. So all we have to do is type up every weather observation for all the navies' ships, and suddenly we know what the climate was like. Well, they've actually succeeded at this point -- in 2012 they finished transcribing all the weather observations in the British Royal Navy's ships' logs from World War I. So this has been very successful -- it's a monumental effort: they have over six hundred thousand registered accounts--not all of those are active, but they have a very large number of volunteers.
Transcribe Bentham goes live. (We'll talk a lot more about this -- it's a very well documented project.) This is a project to transcribe the notes and papers of the utilitarian philosopher Jeremy Bentham. It's very interesting technically, but it was also very successful drawing attention to the world of crowdsourced transcription.
Papers of the United States War Department, and builds a tool called Scripto that plugs into it. Now this is primarily of interest to military and social historians, but again we're getting away from the world of genealogy, we're getting away from the world of individual tabular records, and we're getting into dealing with documents.
There's another tension that I want to get into here, since today is the technical track, and that's the difference between easy tools and powerful tools, and [the question of] making powerful tools easy to use. This is common to all technology--not just software, and certainly not just crowdsourced transcription--but it's new because this is the first time we're asking people to do these sorts of transcription projects.
Historically these professional [projects] have been done using mark-up to indicate deletions or abbreviations or things like that.
Well, what is going to happen? Well, one solution--and it's a solution that I'm distressed to say is becoming more and more popular in the United States--is to get rid of the mark-up, and to say, well, let's just ask them to type plain text.
"What's on the Menu?" project. They have an enormous collection of menus from around the world, and they want to track the culinary history of the world as dishes originate in one spot and move to other locations, the change in dishes--when did anchovies become popular? Why are they no longer popular?--things like that. So they're asking users to transcribe all of these menu items. They developed a very elegant and simple UI. This UI did not involve mark-up; this is plain-text. In fact--I'm going to get over here and read this--if you look at this instruction, this is almost stripped text: "Please type the text of the indicated dish exactly as it appears. Don't worry about accents."
Rühreier is scrambled eggs. And what they type is converted to "Ruhreier", which are... eggs from the Ruhrgebiet? I don't know? This is not a dish. I'm not familiar with German cuisine, but I don't think that the Ruhr valley is famous for its eggs.
So we have this frustration. We have this potential to lose users when we abandon mark-up; when we don't give them the tools to do the job that we're asking them to do.
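To make the accent problem concrete, here is a minimal Python sketch of the kind of normalization that turns "Rühreier" into "Ruhreier". How the menu project actually processes what users type is not described in this talk, so the `strip_accents` function below is a hypothetical illustration, not their implementation.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical normalizer: decompose characters, then drop combining marks."""
    # NFD splits "ü" into "u" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so they are filtered out.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    typed = "Rühreier"
    stored = strip_accents(typed)
    print(typed, "->", stored)
    # The umlaut cannot be recovered from the stored form.
    print("lossless:", typed == stored)
```

The point of the sketch is that the conversion only goes one way: once the combining mark is dropped, nothing in the stored text says whether the volunteer typed a "u" or a "ü".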
Manfred [Thaller] mentioned it some time earlier. It's been a standard since 1990, and it's ubiquitous in the world of scholarly editing.
Remember, up until recently, all scholarly editing was done by professionals. These professionals were using offline tools to edit this XML which Manfred described as a "labyrinth of angle brackets." It was never really designed to be hand-edited, but that's what we're doing.
And because it's ubiquitous and because it's old, there's a perception among at least some scholars, some editors, that this is just a 'boring old standard'. I have a colleague who did a set of interviews with scholars about evaluating digital scholarship, and not all but some of the responses she got when she brought up TEI were "TEI? Oh, that's just for data entry."
It has great tools for presentation and analysis. Notice I didn't say transcription.
And it has a very active community, and that community is doing some really exciting things.
I want to use just one example, something that has only been developed within the last four years. It's a module that was created for TEI called the Genetic Edition module. A "genetic edition" is the idea of studying a text as it changes -- studying the changes that an author has made as they cross through sections and create new sections, or over-write pieces.
So it's very sophisticated, and I want to show you the sorts of things you can do [with it] by demonstrating an example of one of these presentation tools by Elena Pierazzo and Julie Andre. Elena's at King's College London, and they developed this last year.
Proust Prototype.] And as you slide, you see transcripts appear on the page in the order that they're created,
And in the order that they're deleted even.
So this is the kind of thing that you can do with this powerful data model.
an extension to that thousand-page book. It's only about fifty pages long, printed, and it contains individual sets of guidelines. In this case, this is how Henrik Ibsen clarified a letter. In order to encode this, you use this rewrite tag with a cause... And this is that forest of angle brackets; this is very hard. And this is only one item from this document of instructions, which was small enough that I could cut it out and fit it on a slide.
So this is incredibly complex. So if TEI is powerful; and if, as it gets more complex, it becomes harder to hand-encode; and as we start inviting members of the public and amateurs to participate in this work, how are we going to resolve this?
And it is very rarely attempted. I maintain a directory of crowdsourced transcription tools, with multiple projects per tool. And of the 29 projects in this directory, only 7 claim to support TEI.
One of them is Itinera Nova. I found out about this when I was preparing a presentation for the TEI conference last year, in which I interviewed people running projects doing this crowdsourcing, and found out about their experience of users trying to encode in TEI, and asked, "Do you know anyone else?"
And that's how I found out about Itinera Nova, which is unfortunately not very well known outside of Belgium. This is something that I hope to be part of correcting, because you have a hidden gem here -- you really do. It is amazing.
T-PEN (created by the Center for Digital Theology out of Saint Louis University), and a project associated with them, the Carolingian Canon Law Project. It's also the approach taken by Transcribe Bentham with their TEI toolbar. Menus are an alternative, but essentially they do the same thing -- they're a way of keeping users from typing angle brackets. So the Virtuelles deutsches Urkundennetzwerk is one of those, as well as the Papyrological Editor which is used by scholars studying Greek papyri.
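As a rough illustration of what a toolbar button or menu item does on the user's behalf, here is a minimal Python sketch that wraps a selected span of a transcript in a tag. The function name `wrap_selection` and its parameters are hypothetical; none of the tools named above is claimed to work this way internally. The `del` element used in the example is the TEI element for deleted text.

```python
def wrap_selection(text: str, start: int, end: int, tag: str) -> str:
    """Wrap text[start:end] in <tag>...</tag>, as a toolbar button might.

    Hypothetical helper: the user selects text and clicks a button,
    so they never type an angle bracket themselves.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("selection is outside the text")
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


if __name__ == "__main__":
    line = "the greatest happiness of the greatest number"
    # Pretend the user highlighted characters 4 to 12 and clicked "deletion".
    print(wrap_selection(line, 4, 12, "del"))
```

The button saves typing, but the volunteer still has to know that a deletion is something to be marked, and the tags still end up in the text they are reading and editing.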
Monasterium. And the results are still very complicated. The presentation here is hard. It's hard to read; it's hard to work with.
That does not mean that amateurs cannot do it at all! Certainly the experience of Transcribe Bentham proves that amateurs can perform to the same level as any professional transcriber, using these tools and coding these manuscripts, even without the background.
Another problem is more interesting to me, which is when users ignore buttons. Here we have one editor who's dealing with German charters, who uses these double-pipes instead of the line break tag, because this is what he was used to from print. This speaks to something very interesting, which is that we have users who are used to their own formats, they're used to their own languages for mark-up, they're used to their own notations from print editions that they have either read or created themselves. And by asking them to switch over to this style of tagging, we're asking them not just to learn something new, but also to abandon what they may already know.
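One way to picture accommodating that habit, rather than fighting it, is to accept the print-style notation and convert it to TEI behind the scenes. The Python sketch below is only an illustration under stated assumptions: the function `pipes_to_tei` is hypothetical, it handles only the double-pipe convention, and it is not taken from any of the projects discussed. The `lb` element is the TEI line-beginning tag.

```python
from xml.sax.saxutils import escape


def pipes_to_tei(text: str) -> str:
    """Convert print-style '||' line-break markers into TEI <lb/> elements.

    Hypothetical converter: each segment is XML-escaped first, so a stray
    '&' or '<' typed by the editor cannot break the resulting mark-up.
    """
    segments = [escape(segment.strip()) for segment in text.split("||")]
    return "<lb/>".join(segments)


if __name__ == "__main__":
    typed = "In nomine domini || amen & cetera"
    print(pipes_to_tei(typed))
```

With a converter like this, the editor keeps typing the notation he already knows from print, and the angle brackets only appear in the stored data.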
This is something that I think is really the way forward for crowdsourced transcription. It is being done right now by the Papyrological Editor, and it has been done by Itinera Nova for a long time. And there are now some incipient projects to move forward with this. One of these is a new project at the University of Maryland, Maryland Institute for Technology in the Humanities, the Skylark project, in which they are taking those same transcription tools that were used for Old Weather to allow people to mark up and transcribe portions of an image of a literary text that has been heavily annotated--like that Proust--to create data using the data model that can be viewed with tools like the Proust viewer.
So this is, I think, the technical contribution that Itinera Nova is making. Obviously there are a lot more contributions--I mean I'm absolutely stunned by the interaction with the volunteer community that's happening here--but I'm staying on the technical track, so I'm not going to get into that.
Are there any questions? No? Keep up the great work -- you folks are amazing.