
Monday, April 29, 2013

Itinera Nova in the World(s) of Crowdsourcing and TEI

On April 25, 2013, I presented this talk at the International Colloquium Itinera Nova in Leuven, Belgium. It was a fantastic experience, which I plan to post (and speak) more about, but I wanted to get my slides and transcript online as soon as possible.

Abstract: Crowdsourcing for cultural heritage material has become increasingly popular over the last decade, but manuscript transcription has become the most actively studied and widely discussed crowdsourcing activity over the last four years. However, of the thirty collaborative transcription tools which have been developed since 2005, only a handful attempt to support the Text Encoding Initiative (TEI) standard first published in 1990. What accounts for the reluctance to adopt editorial best practices, and what is the way forward for crowdsourced transcription and community edition? This talk will draw on interviews with the organizers behind Transcribe Bentham, MoM-CA, the Papyrological Editor, and T-PEN as well as the speaker's own experience working with transcription projects to situate Itinera Nova within the world of crowdsourced transcription and suggest that Itinera Nova's approach to mark-up may represent a pragmatic future for public editions.
I'd like to talk about Itinera Nova within the world of crowdsourced transcription tools, which means that I need to talk a little bit about crowdsourced transcription tools themselves, and their history, and the new things that Itinera Nova brings.
Crowdsourced transcription has actually been around for a long time. Starting in the 1990s we see a number of what are called "offline" projects. This is before the term crowdsourcing was invented.
  • A Dutch initiative: Van Papier naar Digitaal which is transcribing primarily genealogy records. 
  • FreeBMD, FreeREG, and FreeCEN in the UK, transcribing church registers and census records. 
  • Demogen in Belgium -- I don't know a lot about this -- it appears to be dead right now, but if anyone can tell me more about this, I'd like to talk after this. 
  • Archivalier Online--also transcribing census records--in Denmark, 
  • And a series of projects by the Western Michigan Genealogy Society to transcribe local census records and also to create indexes of obituaries.
One thing these have in common, you'll notice, is that these are all genealogists. They are primarily interested in person names and dates. And they emerge out of an (at least) hundred-year-old tradition of creating print indexes to manuscript sources, which were then published. Once the web came online, the idea of publishing these on the web [instead] became obvious. But the tools that were used to create these were spreadsheets that people would use on their home computers. Then they would put CD-ROMs or floppy disks in the post and send them off to be published online.
Really the modern era of crowdsourced transcription begins about eight years ago.  There are a number of projects that begin development in 2005.  They are released (even though they've been in development for a while) starting around 2006.  Familysearch Indexing is, again, a genealogy system primarily concerned with records of genealogical interest which are tabular.  It is put up by the Mormon Church. 

Then things start to change a little bit.  In 2008, I publish FromThePage, which is not designed for genealogy records per se -- rather it's designed for 19th and 20th century diaries and letters.  (So here we have more complex textual documents.)  Also in 2008, Wikisource--which had been a development of Wikipedia to put primary sources online--starts using a transcription tool.  But initially, they're not using it for manuscripts because of policy in the English, French, and Spanish language Wikisources.  The only people using it for manuscripts are the German Wikisource community, which has always been slightly separate.  So they start transcribing free-form textual material like war journals [ed: memoirs] and letters.  But again, we have a departure from the genealogy world.

In 2009, the North American Bird Phenology Program starts transcribing bird observations.  So in the 1880s you had amateur bird-watchers who would go into the field and they would record their sightings of certain ducks, or geese, or things like that, and they would record the location and the birds they had observed.  So we have this huge database of the presences of species throughout North America that is all on index cards.  And as the climate changes and habitats change, those species are no longer there.  So scientists who want to study bird migration and climate change need access to these.  But they're hand-written on 250,000 index cards, so they need to be transformed.  So that requires transcription, also by volunteers. [ed: The correct number of cards is over 6 million, according to Jessica Zelt's "Phenology Program (BPP): Reviving a Historic Program in the Digital Era"]
2010 is the year that crowdsourced transcription really gets big.  The first big development is the Old Weather project, which comes out of the Citizen Science Alliance and the Zooniverse team that got started with GalaxyZoo.  The problem with studying climate change isn't knowing what the climate is like now.  It is very easy to point a weather satellite at the South Pacific right now.  The problem is that you can't point a weather satellite at the South Pacific in 1911.  Fortunately, in many of the world's navies, the officer of the watch would, every four hours, record the barometric pressure, the temperature, the wind speed and direction, the latitude and the longitude in the ship's log.  So all we have to do is type up every weather observation for all the navies' ships, and suddenly we know what the climate was like.  Well, they've actually succeeded at this point -- in 2012 they finished transcribing all the weather observations from the British Royal Navy's ships' logs during World War I.  So this has been very successful -- it's a monumental effort: they have over six hundred thousand registered accounts--not all of those are active, but they have a very large number of volunteers. 
Also in 2010 in the UK, Transcribe Bentham goes live.  (We'll talk a lot more about this -- it's a very well documented project.)  This is a project to transcribe the notes and papers of the utilitarian philosopher Jeremy Bentham.  It's very interesting technically, but it was also very successful drawing attention to the world of crowdsourced transcription.
In 2011, the Center for History and New Media at George Mason University in northern Virginia publishes the Papers of the United States War Department and builds a tool called Scripto that plugs into it.  Now this is primarily of interest to military and social historians, but again we're getting away from the world of genealogy, we're getting away from the world of individual tabular records, and we're getting into dealing with documents.
Once we get there, we have a tension.  And this is a pretty common tension.  There's an institutional tension, in that editing of documents has historically been done by professionals, and amateur editions have very bad reputations.  Well now we're asking volunteers to transcribe.  And there's a big tension between, well how do volunteers deal with this [process], do we trust volunteers?  Wouldn't it be better just to give us more money to hire more professionals?  So there's a tension there.

There's another tension that I want to get into here, since today is the technical track, and that's the difference between easy tools and powerful tools, and [the question of] making powerful tools easy to use.  This is common to all technology--not just software, and certainly not just crowdsourced transcription--but it's new because this is the first time we're asking people to do these sorts of transcription projects. 

Historically these professional [projects] have been done using mark-up to indicate deletions or abbreviations or things like that. 
So there's this fear: what happens when you take amateurs and add mark-up?

Well, what is going to happen?  Well, one solution--and it's a solution that I'm distressed to say is becoming more and more popular in the United States--is to get rid of the mark-up, and to say, well, let's just ask them to type plain text.
There's a problem with this.  Which is that giving users power to represent what they see--to do the tasks that we're asking them to do--enables them.  Lack of power frustrates them.  And when you're asking people to transcribe documents that are even remotely complex, mark-up is power.
So I'm going to tell a little story about scrambled eggs.  These are not the scrambled eggs that I ate this morning--which were delicious by the way--but they're very similar. 
I'm going to pick on my friends at the New York Public Library, who in 2011 launched the "What's on the Menu?" project.  They have an enormous collection of menus from around the world, and they want to track the culinary history of the world as dishes originate in one spot and move to other locations, and as dishes change--when did anchovies become popular?  Why are they no longer popular?--things like that.  So they're asking users to transcribe all of these menu items.  They developed a very elegant and simple UI.  This UI did not involve mark-up; this is plain-text.  In fact--I'm going to get over here and read this--if you look at this instruction, this is almost stripped text: "Please type the text of the indicated dish exactly as it appears.  Don't worry about accents." 
Well, this may not be a problem for Americans, but it turns out that some of their menus are in languages that contain things that American developers might consider accents.  This is a menu that was published on their site in 2011.  They sent out an appeal asking, "can anyone read Sütterlin or old German Kurrentschrift"?  I saw this and I went over to a chat channel for people who are discussing German and the German language, because I knew that there were some people familiar with German paleography there, and I wanted to try it out.
So the transcribers are going through and they're transcribing things, and they get to this entry: Rühreier.  All right, let's transcribe that without accents.  So they type in what they see.  Rühreier is scrambled eggs.  And what they type is converted to "Ruhreier", which are... eggs from the Ruhrgebiet?  I don't know?  This is not a dish.  I'm not familiar with German cuisine, but I don't think that the Ruhr valley is famous for its eggs.
And this is incredibly frustrating!  We see in the chat room logs: "Man, I can't get rid of 'Ruhreier' and this (all-capital) 'OMELETTE'!  What's going on?  Is someone adding these back?  Can you try to change "Ruhreier" to "Rühreier"?  It keeps going back!"
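[To make the failure concrete: the behaviour described above is what you get from naive accent stripping, where text is decomposed and the combining marks are simply thrown away. This is a hedged sketch of the general technique, not the What's on the Menu? codebase.]

```python
# A sketch of naive accent stripping, the kind of normalization that turns
# "Rühreier" (scrambled eggs) into "Ruhreier" (eggs from the Ruhr?).
# Illustrative only -- not the New York Public Library's actual code.
import unicodedata

def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)  # split ü into u + combining umlaut
    return "".join(c for c in decomposed if not unicodedata.combining(c))  # drop the marks

print(strip_accents("Rühreier"))  # -> "Ruhreier": the diacritic, and the dish, are gone
```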

So we have this frustration.  We have this potential to lose users when we abandon mark-up; when we don't give them the tools to do the job that we're asking them to do.
Okay.  Let's shift gears and talk about a different world.  This is the world of TEI, the Text Encoding Initiative.  It's regarded as the ultimate in mark-up -- Manfred [Thaller] mentioned it some time earlier.  It's been a standard since 1990, and it's ubiquitous in the world of scholarly editing. 

Remember, up until recently, all scholarly editing was done by professionals.  These professionals were using offline tools to edit this XML which Manfred described as a "labyrinth of angle brackets."  It was never really designed to be hand-edited, but that's what we're doing. 

And because it's ubiquitous and because it's old, there's a perception among at least some scholars, some editors, that this is just a 'boring old standard'.  I have a colleague who did a set of interviews with scholars about evaluating digital scholarship, and not all but some of the responses she got when she brought up TEI were "TEI?  Oh, that's just for data entry."
Well, not quite.  TEI has some strengths.  It is an incredibly powerful data model.  The people who are doing this--these professionals who have been working with manuscripts for decades--they've developed very sophisticated ways of modeling additions to texts, deletions to texts, personal names, foreign terms -- all sorts of ways of marking this up. 
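[As an illustration of that data model, here is a small hand-made TEI fragment showing phrase-level additions, deletions, personal names, and foreign terms. The element names are standard TEI; the sentence itself is invented, and the Python wrapper is only there to check well-formedness.]

```python
# A hand-made illustration (not from any of the projects discussed) of the kind of
# phrase-level phenomena TEI can model: deletions, additions, personal names, foreign terms.
import xml.etree.ElementTree as ET

fragment = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  Our correspondent <persName>Smith</persName> wrote that the weather
  <del rend="strikethrough">was</del>
  <add place="above">seemed</add>
  <foreign xml:lang="fr">très agréable</foreign> that spring.
</p>
"""

# Parsing proves the fragment is well-formed XML; a real project would also
# validate against a TEI schema, which is not attempted here.
ET.fromstring(fragment)
print(fragment.strip())
```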

It has great tools for presentation and analysis.  Notice I didn't say transcription.

And it has a very active community, and that community is doing some really exciting things.

I want to use just one example of something that has only been developed in the last four years.  It's a module that was created for TEI called the Genetic Edition module.  A "genetic edition" is the idea of studying a text as it changes -- studying the changes that an author has made as they cross out sections, create new sections, or over-write pieces. 

So it's very sophisticated, and I want to show you the sorts of things you can do [with it] by demonstrating an example of one of these presentation tools by Elena Pierazzo and Julie André.  Elena's at King's College London, and they developed this last year. 
This is a draft of--I believe it's Proust's Recherches du Temps Perdu--unfortunately I can't see up there.  But as you can see, this is a very complicated document.  The author has struck through sections and over-written them.  He's indicated parts moved.  He's even -- if you look over here -- he's pasted on an extra page to the bottom of this document.  So if you can transcribe this to indicate those changes, then you can visualize them.
[Demo screenshots from the Proust Prototype.] And as you slide, you see transcripts appear on the page in the order that they're created,

And in the order that they're deleted even.
There's even rotation and stuff --

It's just a brilliant visualization!

So this is the kind of thing that you can do with this powerful data model.  
But how was that encoded? How did you get there?
Well, in this case, this is an extension to that thousand-page book.  It's only about fifty pages long, printed, and it contains individual sets of guidelines.  The example here shows how Henrik Ibsen clarified a letter.  In order to encode this, you use this rewrite tag with a cause attribute...  And this is that forest of angle brackets; this is very hard.  And this is only one item from this document of instructions, which was small enough that I could cut it out and fit it on a slide. 
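[I can't reproduce the Ibsen example from the slide, but here is a hand-made sketch of the underlying genetic-edition mechanism: revision campaigns declared in a listChange, with individual additions and deletions pointing at them, which is what makes ordered replay like the Proust slider possible. The text and staging below are invented for illustration.]

```python
# A hand-made sketch (not the Proust or Ibsen encodings) of the genetic-edition idea:
# revision campaigns are declared once in a listChange, and each addition or deletion
# points at the campaign it belongs to, so a viewer can replay the text stage by stage.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = ET.fromstring("""
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><creation>
    <listChange>
      <change xml:id="stage1">first campaign of writing</change>
      <change xml:id="stage2">later revision in darker ink</change>
    </listChange>
  </creation></profileDesc></teiHeader>
  <text><body><p>
    The morning was <del change="#stage2">fine</del>
    <add change="#stage2" place="above">dreadful</add>.
  </p></body></text>
</TEI>
""")

# Replay the revisions campaign by campaign -- roughly what the slider visualization does.
for change in doc.iter(TEI + "change"):
    stage = change.get(XML_ID)
    mods = [(el.tag.replace(TEI, ""), (el.text or "").strip())
            for el in doc.iter() if el.get("change") == "#" + stage]
    print(stage, change.text, mods)
```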

So this is incredibly complex.  So if TEI is powerful; and if, as it gets more complex, it becomes harder to hand-encode; and as we start inviting members of the public and amateurs to participate in this work, how are we going to resolve this? 
If there's a fear about combining amateurs and mark-up, what do we do when we combine amateurs with TEI?  This is panic! 

And it is very rarely attempted.  I maintain a directory of crowdsourced transcription tools, with multiple projects per tool.  And of the 29 tools in this directory, only 7 claim to support TEI. 

One of them is Itinera Nova.  I found out about this when I was preparing a presentation for the TEI conference last year.  For that talk I interviewed people running crowdsourced transcription projects about their experience with users trying to encode in TEI, and I asked each of them, "Do you know anyone else doing this?"

And that's how I found out about Itinera Nova, which is unfortunately not very well known outside of Belgium.  This is something that I hope to be part of correcting, because you have a hidden gem here -- you really do.  It is amazing.
So how do you support TEI?  Well, one approach--the most common approach--is to say we'll have our users enter TEI, but we'll give them help.  We'll create buttons that add tags, or menus that add tags.  This has been the approach taken by T-PEN (created by the Center for Digital Theology at Saint Louis University), and a project associated with them, the Carolingian Canon Law Project.  It's also the approach taken by Transcribe Bentham with their TEI toolbar.  Menus are an alternative, but essentially they do the same thing -- they're a way of keeping users from typing angle brackets.  So the Virtuelles deutsches Urkundennetzwerk is one of those, as well as the Papyrological Editor, which is used by scholars studying Greek papyri.
So how well does that work?  You provide users with buttons that add tags to their text.  Here's an example from Transcribe Bentham. 
Here's an example from Monasterium.  And the results are still very complicated.  The presentation here is hard.  It's hard to read; it's hard to work with.

That does not mean that amateurs cannot do it at all!  Certainly the experience of Transcribe Bentham proves that amateurs can perform at the same level as any professional transcriber, using these tools and encoding these manuscripts, even without the background. 
But there are limitations.  One limitation is that users outgrow buttons.  In Transcribe Bentham, [the most active] users eventually just started typing the angle brackets themselves -- they returned to that labyrinth of angle brackets of TEI tags. 

Another problem is more interesting to me, which is when users ignore buttons.  Here we have one editor who's dealing with German charters, who uses these double-pipes instead of the line break tag, because this is what he was used to from print.  This speaks to something very interesting, which is that we have users who are used to their own formats, they're used to their own languages for mark-up, they're used to their own notations from print editions that they have either read or created themselves.  And by asking them to switch over to this style of tagging, we're asking them not just to learn something new, but also to abandon what they may already know.
And, frankly, it's really hard to figure out which buttons [to support].  Abigail Firey of the Carolingian Canon Law Project talks about how when they were designing their interface, they had 67 buttons.  This is very hard to navigate, and the users would just give up and start typing angle brackets instead, because buttons aren't a magic solution.
This is where Itinera Nova comes in.  The "intermediate notation" that Professor Thaller was talking about is quite clear-cut, and it maps well to the print notations that volunteers are already used to. 
And what's interesting about this is that what many people may not realize is that Itinera Nova--despite having a very clear, non-TEI interface--has full TEI under the hood.
Everything is persisted in this TEI database, so the kinds of complex analysis that we talked about earlier--not necessarily the Proust genetic editions, but this kind of thing--is possible with the data that's being created.  It's not idiosyncratic.
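[A rough sketch of that idea: a lightweight intermediate notation rewritten into TEI before it is persisted. The notation here is hypothetical--double pipes for a line break, equals signs around struck-through text--and this is not Itinera Nova's actual code, just the general pattern.]

```python
# Hypothetical intermediate notation -> TEI; a sketch of the general approach only.
#   "||"      becomes  <lb/>                               (line break)
#   "=text="  becomes  <del rend="strikethrough">text</del> (struck-through text)
# Transcribers type the lightweight notation; the TEI lives in the database.
import re

def notation_to_tei(transcript: str) -> str:
    tei = transcript.replace("||", "<lb/>")
    tei = re.sub(r"=([^=]+)=", r'<del rend="strikethrough">\1</del>', tei)
    return tei

print(notation_to_tei("Item de =duobus= tribus solidis||pro domo sua"))
# -> Item de <del rend="strikethrough">duobus</del> tribus solidis<lb/>pro domo sua
```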
So as a result, I really think that in this, Itinera Nova points the way to the future.  Which is to abandon this idea that TEI is just for data entry, or that amateurs cannot do mark-up.  Both of those ideas are bogus!  Instead, let's say: use TEI for the data model and for the presentation, so we have these beautiful sliders and whatever else gets created out of the annotation tool, out of the transcription tool.  But let's consider hooking up these--I don't want to say "easier"--but these more straightforward, these more traditional user interfaces [for transcription].

This is something that I think is really the way forward for crowdsourced transcription.  It is being done right now by the Papyrological Editor, and it has been done by Itinera Nova for a long time.  And there are now some incipient projects to move forward with this.  One of these is a new project at the Maryland Institute for Technology in the Humanities at the University of Maryland, the Skylark project, in which they are taking the same transcription tools that were used for Old Weather and allowing people to mark up and transcribe portions of an image of a heavily annotated literary text--like that Proust--to create data using the data model, which can then be viewed with tools like the Proust viewer.

So this is, I think, the technical contribution that Itinera Nova is making.  Obviously there are a lot more contributions--I mean I'm absolutely stunned by the interaction with the volunteer community that's happening here--but I'm staying on the technical track, so I'm not going to get into that. 


Are there any questions?  No?  Keep up the great work -- you folks are amazing.

Tuesday, February 26, 2013

Ngoni Munyaradzi on Transcribe Bleek and Lloyd

Ngoni Munyaradzi is a Master's student in Computer Science at the University of Cape Town, South Africa, working on a research project on the transcription of the Digital Bleek and Lloyd collection. He kindly agreed to an interview over email, which I present below:

Your website does an excellent job explaining the background and motivation of Transcribe Bleek and Lloyd.  Can you tell us more about the field notebooks you are transcribing?

The Digital Bleek and Lloyd Collection is composed of dictionaries, artwork and notebooks documenting stories about the earliest inhabitants of Southern Africa, the Bushman people. The notebooks were written by Wilhelm Bleek, his sister-in-law, Lucy Lloyd and Dorothea Bleek (Wilhelm's daughter) in the 19th century, with the help of a number of Bushmen people who were prisoners in the Western Cape region of South Africa at the time. The notebooks were recorded in the |Xam and !Kun languages and English translations of these languages are available in the notebooks.

Link to the collection: http://lloydbleekcollection.cs.uct.ac.za/

Correct me if I'm wrong, but it seems like at least in the case of |Xam, you are working with one of the only representatives of an extinct language. Are there any standard data models for these kinds of vocabularies/bilingual texts which you're using?

There are no complete models - the best known models are still only partial.

I suspect that I'm not alone in wondering why these Bushman people were prisoners during the writing of these texts. Can you tell us a bit more about the Bleek/Lloyd informants, or point us to resources on the subject?

The Bushman people were prisoners because of petty crimes and a grossly unfair colonial government.  On the Bleek and Lloyd website there is a story on each contributor.  There is information in various books on the subject as well, but I am not sure there is more that is known than what is on the website.  See:
http://lloydbleekcollection.cs.uct.ac.za/xam.html
http://lloydbleekcollection.cs.uct.ac.za/kun.html

This is the first transcription project I'm aware of using the Bossa Crowd Create platform. What are the factors that led you to choose that platform and what's been your experience setting it up?

In 2011, when our project began, Bossa was the most mature open-source crowdsourcing framework available that was tailored for volunteer projects, so it suited the project's requirements well. The alternative crowdsourcing frameworks available at the time used payment methods.

Setting up the Bossa framework was a relatively straightforward task. The documentation online is very thorough, with examples of how to set up test applications. I also got assistance from David Anderson, the developer of Bossa.

The Bushman writing system seems extremely complex with its special characters and multiple diacritics. I see that you are using LaTeX macros to encode these complexities. Why did you decide on LaTeX and what has been the user response to using that notation?

So the project is part of ongoing research related to the Bleek and Lloyd Collection within our Digital Libraries Laboratory at the University of Cape Town. Credit for developing the encoding tool goes to Kyle Williams. The reason he chose LaTeX is that using custom LaTeX macros allowed both the encoding and the visual rendering of the text to be solved in a single step. Developing a unique font for the Bushman script is something we might look at in the future!

Here's a link to a paper published on the encoding tool developed by Kyle Williams: http://link.springer.com/chapter/10.1007%2F978-3-642-24826-9_28
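[To give a flavour of that approach, here is a minimal LaTeX sketch. The macro names are invented for illustration and are not the project's actual palette; the point is simply that a single macro can both record which special character was intended and render an approximation of it.]

```latex
% A minimal, hypothetical sketch of the idea (these macro names are invented;
% the project's real palette is richer): one macro both records which special
% character was meant and renders an approximation of it in the output.
\documentclass{article}
\newcommand{\underdot}[1]{\d{#1}}   % e.g. a dot-below diacritic
\newcommand{\nasal}[1]{\~{#1}}      % e.g. a nasalised vowel
\begin{document}
A transcriber types \verb|\underdot{k}\nasal{a}| and sees \underdot{k}\nasal{a}.
\end{document}
```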

Overall the user feedback has been good, as most users are able to complete transcriptions using the LaTeX macros. We have gotten suggestions from users to use glyphs to encode the complexities. Currently the scope of my Master's research project does not include that. There are talks in our research group to develop a unique font to represent the |Xam and !Kun languages, as this is not supported by Unicode.

User 1 Comment: "I think the palette handles the complexity of the character set very well. This material is inherently difficult to transcribe. The tool has, on the whole, been well thought out to meet this challenge. I think it needs to be improved in some ways, but considering the difficulties it is remarkably well done."

User 2 Comment: "VERY intuitive, after a few practice transcriptions. I actually enjoyed using the tool after a page was done."

This is incredibly useful. So far as I'm aware, yours is only the third crowdsourced transcription project that's surveyed users seriously (after the North American Bird Phenology Program and Transcribe Bentham). Do you have any advice on collecting user feedback at such an early stage?

Collecting user feedback in the early stages will tremendously help project administrators determine whether the setup of the project is easy to follow for participants. One can easily pick up any hindrances to user participation and address these early. From our project, I've found that participants can actually suggest very helpful ideas that will make the data collection process better.

Crowdsourced citizen science and cultural heritage projects have mostly been based in the USA, Northern Europe and Australia until recently -- in fact, yours is the first that I'm aware of originating in sub-Saharan Africa. I'd really like to know which projects inspired your work with Transcribe Bushman, and what your hopes are for crowdsourced transcription projects focusing on Africa?

Our work was mostly inspired by the success of GalaxyZoo at recruiting volunteers, and also by the Transcribe Bentham project that explored the feasibility of volunteers performing transcription. I hope that more crowdsourced transcription projects will start up within Africa in the near future. What would be interesting is to see a transcription project for the Timbuktu manuscripts of Mali. Beyond transcription, I would like to see other researchers adopting crowdsourcing in fields of specialty within Africa.

Thanks so much for this interview. If people want to help out on the project, what's the best way for them to contribute?

Interested participants can simply:
  1. Create an account on the project website.
  2. Watch a 5 minute video tutorial on how to transcribe the Bushman languages.
  3. With that, you are ready to start transcribing pages.

Saturday, November 10, 2012

What does it mean to "support TEI" for manuscript transcription?

This is a transcript of my talk at the 2012 TEI meeting at Texas A&M University, "What does it mean to 'support TEI' for manuscript transcription: a tool-maker's perspective."

You can download an MP3 recording of the talk here.
Let's get started with a couple of definitions.  All the tools and the sites that I'm reviewing are cloud based, which means that I'm ruling out--perhaps arbitrarily--any projects that involve people doing offline edition and then publishing that on the web.  I'm only talking about online-based tools.

So that's a very strict definition of clouds, and I'm going to have a very loose and squishy definition of crowds, in which I'm talking about any sort of tool that allows collaborative editing of manuscript material, and not just ones that are directed at amateurs.  That's important for a couple of reasons: one, because it gave me a sample size that was large enough to find out how people are using TEI, but--for another reason--because "amateurs" aren't really amateurs.  What we see with crowdsourcing projects is that amateurs become experts very quickly.  And given that your average user of any citizen science or historical crowdsourcing project is a woman over 50 who has at least a Master's degree, this isn't sort of the unwashed masses.
Okay, so crowdsourced transcription has been going on for a while, and it's been happening in four different traditions that all developed this independently.  You have genealogists who are doing this, primarily with things like census records.  The 1940 census is the most prominent example: they have volunteers transcribing as many as ten million records a day.  The natural sciences are doing something similar--particularly the GalaxyZoo and OldWeather people, who are looking at climate change data, where you have to look at old, handwritten records to figure out how the climate has changed, because you need to know how the climate used to be.  Then there are also some projects going on in the Open Source/Creative Commons world: the Wikisource people, particularly the German language Wikisource community.  And libraries, archives, and museums have jumped into this recently. 
So here are a couple of examples from the citizen science world.  OldWeather has a tool that allows people to record ship log book entries and weather observations.  As you can see, this is all field based -- this isn't quite an attempt to represent a document.  We'll get back to this in a minute.
The North American Bird Phenology Program is transcribing old bird[-watching] observation cards from about a hundred years ago.  They're recording species names and all sorts of other things about this particular Grosbeak in 1938. 
All of these--and this is the majority of the crowdsourced transcription that's happening out there; there are millions and millions of records--are record-based.  They are not document-based; they aren't page-based.  They're dealing with data that is fundamentally tabular -- those are their inputs.  Their outputs are databases that they want to be able to either search or analyze.  So we're producing nothing that anyone would ever want to print out.

And another interesting thing about this is that these record-based transcription projects--the uses are understood in advance.  If you're building a genealogy index, you know that people are going to want to search for names and be able to see the results.  And that's it -- you're not building something that allows someone to go off and do some other kind of analysis.

Now what kind of mark-up are these record-based transcription projects using?  Well, it's kind of idiosyncratic, at best.
Here's an example from my client FreeREG.  This is a mark-up language that they developed about ten years ago for indicating unclear readings of manuscripts.  It's actually fairly sophisticated--it's based on the regular expression programming sub-language--but it's not anything that's informed by the TEI world.
On the other hand, here is the mark-up that the New York Public Library is using.  Let me read this out to you: "Please type the text of the indicated dish exactly as it appears.  Don't worry about accents."  This is almost an anti-markup.
So what about free-form transcription?  There's a lot of development of people doing free-form transcription.  You have Scripto out of CHNM.  You have a couple of different (perhaps competing) NARA initiatives.  Wikisource.  There's my own FromThePage.  What kind of mark-up are they doing?  Well, for the most part, none! 
Here's Scripto--the Papers of the War Department-- and you type what you see, and that's what you get.
Here is the French-language Wikisource, hosting materials from the Archives départementales du Cantal (who are doing some very cool things here).  But this is just typing things into a wiki and not even internally using wiki links.  This is almost pre-formed text -- it's pretty much plaintext.  
My own project, FromThePage.
I'm internally using wiki-links, but really only for creating indexes and annotations, not for indicating...any of the power that you have with TEI.
So if no one is using TEI, why is TEI important?  I think that TEI is important because crowdsourced transcription projects are how the public is interacting with edition.  This is how people are learning what editing is, what the editing process is, and why and whether it's important.  And they're using tools that are developed by people like me.  Now how do people like me learn about edition?
The answer is, by reading the TEI Guidelines.  The TEI Guidelines have an impact that goes far beyond people who are actually implementing TEI.  I started work on FromThePage in complete isolation in 2005.  By 2007, I was reading the TEI Guidelines.  I wasn't implementing TEI, but the questions that were asked--these notions of "here's how you expand abbreviations", "here's how you regularize things"--had a tremendous impact on me.  By contrast, the Guide to Documentary Editing--which is a wonderful book!--I only found out about in January of this year.

TEI is online, it's concise, it's available.  And when I talk to people in the genealogy development world, they know about TEI. They've heard of it.  They have opinions.  They're not using it, but -- you people are making an impact on how the world does edition!
Okay, so if all of these people aren't using TEI, who is doing it?

I run a transcription tool directory that is itself crowdsourced.  It's been edited by 23 different people who've entered information about 27 different tools. Of those 27 tools, 7 are marked as "supporting TEI".  There's a little column, "does it support TEI?", seven of them say "Yes".

Actually, that's not true.  Some of them say "yes", but some of those seven say "well, sort of".  So what does that mean?
To find that out, I interviewed five of those seven projects.
  • Transcribe Bentham.  
  • T-PEN (which there's a poster session about tonight), which is a line-based system for medieval manuscripts.  
  • A customization of T-PEN, the Carolingian Canon Law project, out of the University of Kentucky.  
  • Our own Hugh Cayless for the Papyrological Editor, which is dealing with papyri.  
  • And then MOM-CA is one of these "sort of"s.  You have two implementations of it.  
    • One of them is the Virtuelles deutsches Urkundennetzwerk, which is a German charter collection.  It supports "TEI, sort-of" -- actually it supports CEI and EAD.  
    • But it's been customized for extensive TEI support for the Itinera Nova project which is out of the archive of Leuven, Belgium.     
I'm going to talk about what I found out, but I'm going to emphasize Transcribe Bentham.  Not because it's better than the other tools, but because they actually ran their transcription project as an experiment.  They wanted to know: can the public do TEI?  Can the public handle it?  And they've published their results: they've conducted user surveys asking, what was your experience using TEI?  Which makes it particularly useful for those of us who are trying to figure out how it's being used.
Okay, so there's a lot of variation among these projects.  You've got a varied commitment to TEI.  Transcribe Bentham: Yes, we're going to use TEI!  You see Melissa Terras here saying that "it was untenable" that we'd ask for anything else.  These people know how to do it; why would we depart from that?

For T-PEN, James Ginther says: Hey, I'm kind of skeptical.  We'll support any XSD you want to upload; if it happens to be TEI, that's okay.
Abigail Firey, who's using T-PEN, basically says: look, it's probably necessary.  It's very useful.  It lets us develop these valuable intellectual perspectives on our text.  And she considered it important that their text encoding was done within the community of practice represented by the people in this room.
Okay, so more variation between these.  Where's the TEI located within these projects?  Where does it live?  I'm a developer; I'm interested in the application stack.

It turns out that there's no agreement at all.  Transcribe Bentham has people entering TEI by hand.  And then it's storing it off in a MediaWiki, using MediaWiki versioning, not actually putting [...] pages in one big TEI document.

On the other hand, Itinera Nova is actually storing everything in an XRX-based XML database.  I mean, it is pure TEI on the back end.  But none of the volunteers using Itinera Nova actually are typing any angle brackets.  So we have a lot of variation here.
However, there was no variation when I asked people about encoding.  There is a perfectly common perception that is: Encoding is hard!

And there are these great responses--which you can see both on the Transcribe Bentham blog and in their Digital Humanities Quarterly paper that just came out, which I highly recommend--describing it as "too much markup", "unnecessarily complicated", "a hopeless nightmare", and the entire transcription process as "a horror."
But, lots of things are hard.

In my own experience with FromThePage, I have one user who has transcribed one thousand pages, but she does not like using any mark-up at all.  She's contributing!  She's contributing plaintext transcriptions, but I'm going back to add wikilinks.  So it's not about the angle brackets.  (Maybe square brackets have a problem too, I don't know.)

And fundamentally, transcribing--reading old manuscripts--is hard.  "Deciphering Bentham's hand took longer than encoding," for over half of the Bentham respondents.

So there's more commonality: everyone wants to make encoding easier.  How do we do that?  There's a couple of different approaches.  One approach--the most common approach--is using different kinds of buttons and menus to automate the insertion of tags.  Which gets around (primarily) the need for people to memorize tag names and attributes, and--God help us--close tags.

So these are implemented--we've got buttons on T-PEN and CCL.  We've got buttons on the TEI Toolbar.  We've got menus on VdU and the Papyrological Editor.
And you can see them.  Here's a screenshot of Jeremy Bentham.  A couple of interesting things about this: it's very small, but we've got a toolbar at the top.  We've got TEI text: angle-bracket D.E.L.  Angle-bracket, slash, D.E.L.  So we're actually exposing the TEI to users in Transcribe Bentham, though we're providing them with some buttons.
Those buttons represent a subset--I'll get to the selection of those tags later.  Here's a more detailed description of what they do.
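[The mechanics behind such a button are simple enough to sketch: wrap whatever text the transcriber has selected in an opening and closing tag. This is an illustration of the general idea only, not Transcribe Bentham's actual toolbar code.]

```python
# A sketch of what a TEI "toolbar button" typically does: wrap the current text
# selection in a chosen element. The tag names are real TEI; the sample text and
# the UI wiring are invented or omitted.
def wrap_selection(text: str, start: int, end: int, tag: str, attrs: str = "") -> str:
    open_tag = f"<{tag} {attrs}>" if attrs else f"<{tag}>"
    return text[:start] + open_tag + text[start:end] + f"</{tag}>" + text[end:]

line = "Punishment ought never to be inflicted"
print(wrap_selection(line, 11, 16, "del"))
# -> Punishment <del>ought</del> never to be inflicted
print(wrap_selection(line, 11, 16, "add", 'place="above"'))
# -> Punishment <add place="above">ought</add> never to be inflicted
```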
Here's what's going on with VdU.  Only in this case, they're not actually exposing the angle brackets to the user. They're replacing all of these in a pseudo-WYSIWYG that allows people to choose from a menu and select text that then gets tagged. 
Okay -- limitations of the buttons.  There's a good limitation, which is that as users become more comfortable with TEI, they outgrow buttons.  And this is something that the people at Transcribe Bentham reported to me.  They're seeing a fair number of people just skip the buttons altogether and type angle brackets.  Remember: these are members of the public who have never met any of the Transcribe Bentham people.

On the down side, users also ignore the buttons.  Again users ignoring encoding, but in this case we've got something that's a little bit worse.  Georg Vogeler is reporting something very interesting, which is that in a lot of cases, they were seeing users who were using print apparatus for doing this kind of work, and just ignoring the buttons -- going around them.
So there's the problem of using print-style notations.  People are dealing with these print editions [notations] -- this can be a problem, or it can be an opportunity.  Papyri.info is treating it as an opportunity.  Itinera Nova is using it that way. 
Papyri.info, their front-end interface for most users is Leiden+, which is a standard for marking up papyri.  And, as you can see, users enter text in Leiden+, and that generates TEI.  (EpiDoc TEI, I believe.)
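[A much-simplified illustration of that kind of mapping: Leiden-style conventions use square brackets for text lost from the papyrus and restored by the editor, which EpiDoc TEI expresses with the supplied element. The sketch below covers only that single convention and is not the Papyrological Editor's real converter.]

```python
# A much-simplified sketch of the Leiden-to-TEI idea (one convention out of many,
# and not the Papyrological Editor's real grammar): square brackets mark text lost
# from the papyrus and restored by the editor, expressed in EpiDoc TEI as <supplied>.
import re

def leiden_supplied_to_tei(text: str) -> str:
    return re.sub(r"\[([^\]]+)\]", r'<supplied reason="lost">\1</supplied>', text)

print(leiden_supplied_to_tei("στρατηγὸς [τοῦ νομοῦ]"))
# -> στρατηγὸς <supplied reason="lost">τοῦ νομοῦ</supplied>
```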
This is the same kind of process that's done in Itinera Nova.  In that case, they're using for notation whatever it is that the Leuven archives uses for their mark-up. And they're doing the same kind of transposition [ed: translation] of replacing their notation with TEI tags before they save it.
And this is actually what users see as they're typing. They don't see the TEI tags -- we're hiding the angle brackets from them.

So this is an alternative to buttons.  And in my opinion, it's not that bad an alternative. 
This hasn't been a problem for the Bentham people, however.  It's a non-problem for them. And they are the most "crowdy", the most amateur-focused, and the most committed to a TEI interface.

Tim Causer went through and reviewed all of this and said, you know, it just doesn't happen.  People are not using any print notation at all.  They're using buttons.  They're using angle-brackets by hand.  They're not even using plaintext.  They're using TEI.  Their users are comfortable with TEI.
So what accounts for the difference between the experience of the VdU and the Transcribe Bentham people?  I don't know.  I've got a couple of theories about what might be going on.

One of them is really the corpus of texts we're working with.  If you're only dealing with papyrus fragments, and you're used to a well-established way of notating them--that's been around since 1935 in the case of Leiden+--well, it's kind of hard to break out of that.  On the other hand, there's not a single convention for print editions.  There's all sorts of ways of indicating additions and deletions for print editions of more modern texts.  So maybe it's a lack of a standard.

Or, maybe it's who the users are.  Maybe scholars are stubborner, and amateurs are more tractable and don't have bad habits to break.  I don't know!  I don't know, but I'd be really interested in any other ideas.
Okay, how do these projects choose the tags that they're dealing with?  We've got a very long quote, but I'm just going to read out a couple of little bits of it.

Really, choosing a subset of tags is important.  Showing 67 buttons was not a good usability thing for T-PEN.  And in particular, what they ended up doing was getting rid of the larger, structural set of markup, and focusing just on sort of phrase-level markup.
This is also, I think, true if we go back a minute and look at Bentham.  Here, again, we're talking phrase-level tags.  We're not talking about anything beyond that.
Justin Tonra said that it was actually really hard to pare down the number of tags for Transcribe Bentham.  He wanted to do more, but, you know, he's pleased with what they got.  They didn't want to "overcomplicate the user's job."
Richard Davis, also with Transcribe Bentham, had a great deal of experience dealing with editors for EAD and other XML.  And he said you're always dealing with this balance between usability and flexibility, and there's just not much way of getting around it.  It's going to be a compromise, no matter what.
So what's the future for these projects that are using TEI for crowds?  Well, if getting people up to speed is hard, and if nobody reads the help--as Valerie Wallace at one time said about their absolutely intimidating help page for Transcribe Bentham (you should look at it -- it's amazing!)--then what are the alternatives for getting people up to speed?

Georg Vogeler says that they are trying to come up with a way of teaching people how to use the tool and how to use the markup in almost a game-like scenario.  We're not talking about the kind of Whack-a-Mole things that we sometimes see, but really just sort of leading people through: Let's try this.  Now let's try this.  Now let's try this.  Okay, now you know how to deal with this [tool].  It's something that I think we're actually pretty familiar with from any other kind of project dealing with historic handwriting: people have to come up to speed.
Another possibility is a WYSIWYG.  Tim Causer announced the idea of spending their new Mellon grant on building a WYSIWYG for Transcribe Bentham's TEI.  The blog entry is fascinating because he gets about seven user comments, some of which express a whole lot of skepticism that a WYSIWYG is going to be able to handle nested tagging in particular.  Others make comments about the whole XML system and its usability in vivid prose, which are very much worth reading.
And maybe combinations of these.  So we have these intermediate notations -- Itinera Nova, for example, begins a strike-through with an equals sign (which is apparently what they've been using at that archive for a while).  And the minute you type that equals sign in, you actually get a WYSIWYG strike-through that runs all the way through your transcript.

That may be the future.  We'll see.  I think that we have a lot of room for exploring different ways for handling this.
So let me wrap up and thank my interviewees.

Transcribe Bentham: Melissa Terras, Justin Tonra, Tim Causer, Richard Davis.
T-PEN: James Ginther, Abigail Firey
Papyri.info: Hugh Cayless, Tom Elliot
MOM-CA: Georg Vogeler and Jochen Graf

Questions

[All questions will be paraphrased in the transcript due to sound quality, and are not to be regarded as direct quotations without verification via the audio.]

Syd Bauman: Of the systems which allow users to type tags free-hand, what percentage come out well-formed? 

Me: The only one that presents free-hand [tagging] is Transcribe Bentham. Tim [Causer] gets well-formed XML for most everything he gets. There is no validation being performed by that wiki, but what he's getting is pretty good. He says that the biggest challenge when he's post-processing documents is closing tags and mis-placed nesting.

Syd Bauman: I'd be curious about the exact percentages.

Me: Right. I'd have to go back and look at my interview. He said that it represents a pretty small percentage, like single digits of the submissions they get.

John Unsworth: Do any of the systems use keyboard short-cuts?

Me: I know of none that use hot-keys.

John Unsworth: Do you think that would be more or less desirable than the systems you've described?

Me: I really only see hot-keys as being desirable for projects that are using more recent and clearer documents. Speed of data-entry from the keyboard perspective doesn't help much when you're having to stare and zoom and scroll on a document that is as dense and illegible as Bentham or Greek papyri.

Elena Pierazzo [very faint audio]: In some cases it's hard to define which is the error: choosing the tags or reading the text. I've been working with my students on Transcribe Bentham--they're all TEI-aware--and to be honest it was hard. The difficulty was not the mark-up. In a sense we do sometimes forget in these crowdsourcing projects, that the text itself is very hard, so probably adding a level of complexity to the task via the mark-up is very difficult.

I have all respect and sympathy for the people who stick to the ideal of doing TEI, which I commend entirely. But in some cases, it may be that asking amateur people to do [the decipherment] and do the mark up is a pretty strong request, and makes a big assumption about what the people "out there" are capable of without formation.

Me: I'd agree with you. However, there have been some studies on these users' ability to produce quality transcripts outside of the TEI world.... Old Weather did a great deal of research on that, and they found that individual users tended to submit correct transcripts 97% of the time. They're doing blind triple-keying, so they're comparing people's transcripts against others. [They found] that of 1000 different entries, typically on average 13 will be wrong. Of those thirteen, three will be due to user error--so it does happen; I'm not saying people are perfect. Three will be generally [ed: genuinely] illegible. And the remaining seven will be due to the officer of the watch having written the wrong thing down and placing the ship in Afghanistan instead of in the Indian Ocean. So there are errors everywhere. [I mis-remembered the numbers here: actually it's 3 errors due to transcriber error, 10 genuinely illegible, and 3 due to error at time of inscription.]

Lou Burnard: The concept of error is a nuanced one. I would like to counter-argue Elena's [point]. I think that one of the reasons that Bentham has been successful is precisely because it's difficult material. Why do I think that? Because if you are faced with something difficult, you need something powerful to express your understanding of it. The problem with not using something as rich and semantically expressive as TEI when you're doing your transcription is that it doesn't exist! All you can do is type in the words you think it might have been, and possibly put in some arbitrary code to say, "Well, I'm not sure about that." Once you've mastered the semantics of the TEI markup--which doesn't actually take that long, if you're interested in it--now you can express yourself. Now you can communicate in a [...] satisfactory way. And I think that's why people like it.

Me: I have anecdotal, personal evidence to agree with you. In my own system (that does not use TEI), I have had users who have transcribed several pages, and then they'd get to a table in some biologist's field notes, for example, and they stop. And they say, "well, I don't know what to do here." So they're done.

Lou Burnard: The example you cite of the erroneous data in the source is a very good one, because if you've mastered TEI then you know how to express in markup: 'this is what it actually says but clearly he wasn't in Afghanistan.' And that isn't the case in any other markup system I've ever heard of. 

[I welcome corrections to my transcript or the contents of the talk itself at benwbrum@gmail.com or in the comments to this post.]