You can download an MP3 recording of the talk here.
So that's a very strict definition of clouds, and I'm going to have a very loose and squishy definition of crowds, in which I'm talking about any sort of tool that allows collaborative editing of manuscript material, and not just ones that are directed at amateurs. That's important for a couple of reasons: one, because it gave me a sample size that was large enough to find out how people are using TEI, but--for another reason--because "amateurs" aren't really amateurs. What we see with crowdsourcing projects is that amateurs become experts very quickly. And given that your average user of any citizen science or historical crowdsourcing project is a woman over 50 who has at least a Master's degree, this isn't sort of the unwashed masses.
ten million records a day. The natural sciences are doing something similar, particularly GalaxyZoo. The OldWeather people are looking at climate change data, where you have to look at old, handwritten records to figure out how the climate has changed, because you need to know how the climate used to be. And then there are also some projects going on in the Open Source/Creative Commons world: the Wikisource people--particularly the German language Wikisource community--and libraries, archives, and museums have jumped into this recently.
OldWeather has a tool that allows people to record ship log book entries and weather observations. As you can see, this is all field based -- this isn't quite an attempt to represent a document. We'll get back to this in a minute.
North American Bird Phenology Program is transcribing old bird[-watching] observation cards from about a hundred years ago. They're recording species names and all sorts of other things about this particular Grosbeak in 1938.
And another interesting thing about this is that these record-based transcription projects--the uses are understood in advance. If you're building a genealogy index, you know that people are going to want to search for names and be able to see the results. And that's it -- you're not building something that allows someone to go off and do some other kind of analysis.
Now what kind of mark-up are these record-based transcription projects using? Well, it's kind of idiosyncratic, at best.
a mark-up language that they developed about ten years ago for indicating unclear readings of manuscripts. It's actually fairly sophisticated--it's based on the regular expression programming sub-language--but it's not anything that's informed by the TEI world.
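To make the idea of a regular-expression-flavoured notation for unclear readings concrete, here is a minimal Python sketch. The notation is an assumption made up for illustration (`?` for exactly one unreadable character, `*` for a run of unreadable characters); it is not the project's documented syntax, and `reading_to_regex` and `candidates` are hypothetical helper names.

```python
import re


def reading_to_regex(reading: str) -> re.Pattern:
    """Compile an uncertain reading into a regular expression.

    Assumed notation (hypothetical, for illustration only):
    '?' = exactly one unreadable character,
    '*' = one or more unreadable characters.
    """
    parts = []
    for ch in reading:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".+")
        else:
            # Everything else is a literal character the transcriber could read.
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


def candidates(reading: str, names: list[str]) -> list[str]:
    """Return the known names that are compatible with an uncertain reading."""
    pattern = reading_to_regex(reading)
    return [name for name in names if pattern.match(name)]


known = ["Johnson", "Johnston", "Jensen", "Jameson"]
print(candidates("Joh?son", known))  # ['Johnson']
print(candidates("J*son", known))    # ['Johnson', 'Jameson']
```

The point of a notation like this is that an index can still be searched even when the transcriber could not read every letter.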
New York Public Library is using. Let me read this out to you: "Please type the text of the indicated dish exactly as it appears. Don't worry about accents." This is almost an anti-markup.
Papers of the War Department-- and you type what you see, and that's what you get.
French-language Wikisource, hosting materials from the Archives départementales du Cantal (who are doing some very cool things here). But this is just typing things into a wiki and not even internally using wiki links. This is almost pre-formed text -- it's pretty much plaintext.
TEI is online, it's concise, it's available. And when I talk to people in the genealogy development world, they know about TEI. They've heard of it. They have opinions. They're not using it, but -- you people are making an impact on how the world does edition!
I run a transcription tool directory that is itself crowdsourced. It's been edited by 23 different people who've entered information about 27 different tools. Of those 27 tools, 7 are marked as "supporting TEI". There's a little column, "does it support TEI?", seven of them say "Yes".
Actually, that's not true. Some of them say "yes", but some of those seven say "well, sort of". So what does that mean?
- Transcribe Bentham.
- T-PEN (which there's a poster session about tonight), which is a line-based system for medieval manuscripts.
- A customization of T-PEN, the Carolingian Canon Law project, out of the University of Kentucky.
- Our own Hugh Cayless for the Papyrological Editor, which is dealing with papyri.
- And then MOM-CA is one of these "sort of"s. You have two implementations of it.
- One of them is the Virtuelles deutsches Urkundennetzwerk, which is a German charter collection. It supports "TEI, sort-of" -- actually it supports CEI and EAD.
- But it's been customized for extensive TEI support for the Itinera Nova project which is out of the archive of Leuven, Belgium.
For T-PEN, James Ginther says: Hey, I'm kind of skeptical. We'll support any XSD you want to upload; if it happens to be TEI, that's okay.
It turns out that there's no agreement at all. Transcribe Bentham has people entering TEI in person. And then it's storing it off in a MediaWiki, using MediaWiki versioning, not actually putting [...] pages in one big TEI document.
On the other hand, Itinera Nova is actually storing everything in an XRX-based XML database. I mean, it is pure TEI on the back end. But none of the volunteers using Itinera Nova actually are typing any angle brackets. So we have a lot of variation here.
And there are these great responses--that you can see both on the Transcribe Bentham blog and in their DHQuarterly paper that just came out, which I highly recommend--describing it as "too much markup", "unnecessarily complicated", "a hopeless nightmare", and the entire transcription process is "a horror."
In my own experience with FromThePage, I have one user who has transcribed one thousand pages, but she does not like using any mark-up at all. She's contributing! She's contributing plaintext transcriptions, but I'm going back to add wikilinks. So it's not about the angle brackets. (Maybe square brackets have a problem too, I don't know.)
And fundamentally, transcribing--reading old manuscripts--is hard. "Deciphering Bentham's hand took longer than encoding," for over half of the Bentham respondents.
So these are implemented--we've got buttons on T-PEN and CCL. We've got buttons on the TEI Toolbar. We've got menus on VdU and the Papyrological Editor.
On the down side, users also ignore the buttons. Again users ignoring encoding, but in this case we've got something that's a little bit worse. Georg Vogeler is reporting something very interesting, which is that in a lot of cases, they were seeing users who were using print apparatus for doing this kind of work, and just ignoring the buttons -- going around them.
So this is an alternative to buttons. And in my opinion, it's not that bad an alternative.
Tim Causer went through and reviewed all of this and said, you know, it just doesn't happen. People are not using any print notation at all. They're using buttons. They're using angle-brackets by hand. They're not even using plaintext. They're using TEI. Their users are comfortable with TEI.
One of them is really the corpus of texts we're working with. If you're only dealing with papyrus fragments, and you're used to a well-established way of notating them--that's been around since 1935 in the case of Leiden+--well, it's kind of hard to break out of that. On the other hand, there's not a single convention for print editions. There's all sorts of ways of indicating additions and deletions for print editions of more modern texts. So maybe it's a lack of a standard.
Or, maybe it's who the users are. Maybe scholars are stubborner, and amateurs are more tractable and don't have bad habits to break. I don't know! I don't know, but I'd be really interested in any other ideas.
Really, choosing a subset of tags is important. Showing 67 buttons was not a good usability thing for T-PEN. And in particular, what they ended up doing was getting rid of the larger, structural set of markup, and focusing just on sort of phrase-level markup.
Georg Vogeler says that they are trying to come up with a way of teaching people how to use the tool and how to use the markup in almost a game-like scenario. We're not talking about the kind of Whac-A-Mole things that we sometimes see, but really just sort of leading people through: "Let's try this. Now let's try this. Now let's try this. Okay, now you know how to deal with this [tool]." It's something that I think we're actually pretty familiar with from any other kinds of projects dealing with historic handwriting: people have to come up to speed.
blog entry is fascinating because he gets about seven user comments, some of which express a whole lot of skepticism that a WYSIWYG is going to be able to handle nested tagging in particular. Other ones of which make comments about the whole XML system and its usability in vivid prose, which is very worth reading.
That may be the future. We'll see. I think that we have a lot of room for exploring different ways for handling this.
Transcribe Bentham: Melissa Terras, Justin Tonra, Tim Causer, Richard Davis.
Papyri.info: Hugh Cayless, Tom Elliott
MOM-CA: Georg Vogeler and Jochen Graf
Questions
[All questions will be paraphrased in the transcript due to sound quality, and are not to be regarded as direct quotations without verification via the audio.]
Syd Bauman: Of the systems which allow users to type tags free-hand, what percentage come out well-formed?
Me: The only one that presents free-hand [tagging] is Transcribe Bentham. Tim [Causer] gets well-formed XML for most everything he gets. There is no validation being performed by that wiki, but what he's getting is pretty good. He says that the biggest challenge when he's post-processing documents is closing tags and mis-placed nesting.
Syd Bauman: I'd be curious about the exact percentages.
Me: Right. I'd have to go back and look at my interview. He said that it represents a pretty small percentage, like single digits of the submissions they get.
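[For readers wondering how the share of well-formed submissions could be measured: a minimal Python sketch, assuming each submission is a fragment of text with inline tags. The sample submissions and the helper name `check_fragment` are invented for illustration; this is not the Transcribe Bentham post-processing code, and it checks only well-formedness, not validity against a TEI schema.]

```python
import xml.etree.ElementTree as ET


def check_fragment(fragment: str) -> tuple[bool, str]:
    """Return (is_well_formed, message) for one transcription fragment."""
    try:
        # Wrap the fragment in a dummy root element so that bare text
        # and several sibling elements still parse as one document.
        ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError as err:
        return False, str(err)
    return True, "ok"


# Invented examples of the two problems mentioned above.
submissions = [
    "I have <del>not</del> <add>never</add> seen it",  # well-formed
    "I have <del>not <add>never</del></add> seen it",  # mis-placed nesting
    "I have <unclear>recd the letter",                 # missing closing tag
]

results = [check_fragment(s)[0] for s in submissions]
print(results)  # [True, False, False]

share_broken = results.count(False) / len(results)
print(f"{share_broken:.0%} of these sample fragments are not well-formed")
```

[Run over a real batch of submissions, the same loop would give the percentage Syd asks about.]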
John Unsworth: Do any of the systems use keyboard short-cuts?
Me: I know of none that use hot-keys.
John Unsworth: Do you think that would be more or less desirable than the systems you've described?
Me: I really only see hot-keys as being desirable for projects that are using more recent and clearer documents. Speed of data-entry from the keyboard perspective doesn't help much when you're having to stare and zoom and scroll on a document that is as dense and illegible as Bentham or Greek papyri.
Elena Pierazzo [very faint audio]: In some cases it's hard to define which is the error: choosing the tags or reading the text. I've been working with my students on Transcribe Bentham--they're all TEI-aware--and to be honest it was hard. The difficulty was not the mark-up. In a sense we do sometimes forget in these crowdsourcing projects, that the text itself is very hard, so probably adding a level of complexity to the task via the mark-up is very difficult.
I have all respect and sympathy for the people who stick to the ideal of doing TEI, which I commend entirely. But in some cases, it may be that asking amateur people to do [the decipherment] and do the mark up is a pretty strong request, and makes a big assumption about what the people "out there" are capable of without formation.
Me: I'd agree with you. However, there have been some studies on these users' ability to produce quality transcripts outside of the TEI world.... Old Weather did a great deal of research on that, and they found that individual users tended to submit correct transcripts 97% of the time. They're doing blind triple-keying, so they're comparing people's transcripts against others. [They found] that of 1000 different entries, typically on average 13 will be wrong. Of those thirteen, three will be due to user error--so it does happen; I'm not saying people are perfect. Three will be generally[ed: genuinely] illegible. And the remaining seven will be due to the officer of the watch having written the wrong thing down and placing the ship in Afghanistan instead of in the Indian Ocean. So there are errors everywhere. [I mis-remembered the numbers here: actually it's 3 errors due to transcriber error, 10 genuinely illegible, and 3 due to error at time of inscription.]
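[A minimal Python sketch of what comparing blind triple-keyed entries can look like, assuming a simple majority vote. Old Weather's actual reconciliation procedure is not described in this talk, so the rule, the sample values, and the function name `reconcile` are all assumptions.]

```python
from collections import Counter
from typing import Optional


def reconcile(keyings: list[str]) -> Optional[str]:
    """Return the reading that at least two independent keyings agree on.

    Returns None when no two transcribers agree, which flags the entry
    for review by a person.
    """
    if not keyings:
        return None
    # Collapse runs of whitespace so that spacing differences do not count
    # as disagreement.
    normalized = [" ".join(k.split()) for k in keyings]
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= 2 else None


print(reconcile(["29.85", "29.85", "29.35"]))  # 29.85
print(reconcile(["NNE", "NE", "ENE"]))         # None
```

[A vote like this can catch transcriber error, but not an error made by the officer of the watch: all three volunteers will faithfully agree on the wrong position.]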
Lou Burnard: The concept of error is a nuanced one. I would like to counter-argue Elena's [point]. I think that one of the reasons that Bentham has been successful is precisely because it's difficult material. Why do I think that? Because if you are faced with something difficult, you need something powerful to express your understanding of it. The problem with not using something as rich and semantically expressive as TEI when you're doing your transcription is that it doesn't exist! All you can do is type in the words you think it might have been, and possibly put in some arbitrary code to say, "Well, I'm not sure about that." Once you've mastered the semantics of the TEI markup--which doesn't actually take that long, if you're interested in it--now you can express yourself. Now you can communicate in a [...] satisfactory way. And I think that's why people like it.
Me: I have anecdotal, personal evidence to agree with you. In my own system (that does not use TEI), I have had users who have transcribed several pages, and then they'd get to a table in some biologist's field notes, for example, and they stop. And they say, "well, I don't know what to do here." So they're done.
Lou Burnard: The example you cite of the erroneous data in the source is a very good one, because if you've mastered TEI then you know how to express in markup: 'this is what it actually says but clearly he wasn't in Afghanistan.' And that isn't the case in any other markup system I've ever heard of.
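[A minimal sketch of the kind of markup Lou is describing, using the TEI `choice`, `sic`, and `corr` elements, with Python used only to show that both readings can be recovered from one encoding. The log entry and its coordinates are invented for illustration, the TEI namespace is omitted for brevity, and `render` is a hypothetical helper.]

```python
import xml.etree.ElementTree as ET

# Invented log entry: the source gives a northern latitude (which would put
# the ship in Afghanistan); the editor supplies the southern one.
entry = (
    "<p>Position at noon: "
    "<choice><sic>34 N 66 E</sic><corr>34 S 66 E</corr></choice>"
    "</p>"
)


def render(xml: str, prefer: str) -> str:
    """Return the text of the entry, keeping either the 'sic' or 'corr' reading."""
    root = ET.fromstring(xml)
    drop = "corr" if prefer == "sic" else "sic"
    # Materialize the list first so the tree is not modified while iterating.
    for choice in list(root.iter("choice")):
        for child in list(choice):
            if child.tag == drop:
                choice.remove(child)
    return "".join(root.itertext())


print(render(entry, "sic"))   # Position at noon: 34 N 66 E
print(render(entry, "corr"))  # Position at noon: 34 S 66 E
```

[The transcript keeps what the document actually says and what the editor believes it should have said, side by side.]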
[I welcome corrections to my transcript or the contents of the talk itself at email@example.com or in the comments to this post.]