Collaborative Manuscript Transcription: May 2009

Sunday, May 17, 2009

Review: USGS North American Bird Phenology Program

Who knew you could track climate change through crowdsourced transcription? The smart folks at the U.S. Geological Survey, that's who!

The USGS North American Bird Phenology Program encouraged volunteers to submit bird sightings from across North America from the 1880s through the 1970s. The index cards recording those sightings are now being transcribed into a database for analysis of migratory pattern changes and what they imply about climate change.

There's a really nice DesertUSA NewsBlog article that covers the background of the project:

The cards record more than a century of information about bird migration, a veritable treasure trove for climate-change researchers because they will help them unravel the effects of climate change on bird behavior, said Jessica Zelt, coordinator of the North American Bird Phenology Program at the USGS Patuxent Wildlife Research Center.
That is — once the cards are transcribed and put into a scientific database.

And that’s where citizens across the country come in - the program needs help from birders and others across the nation to transcribe those cards into usable scientific information.

CNN also interviewed a few of the volunteers:

Bird enthusiast and star volunteer Stella Walsh, a 62-year-old retiree, pecks away at her keyboard for about four hours each day. She has already transcribed more than 2,000 entries from her apartment in Yarmouth, Maine.
"It's a lot more fun fondling feathers, but, the whole point is to learn about the data and be able to do something with it that is going to have an impact," Walsh said.

Let's talk about the software behind this effort.

The NABPP is fortunate to have a limited problem domain. A great deal of standardization was imposed on the manuscript sources themselves by the original organizers, so that, for example, each card describes only a single species and location. In addition, the questions the modern researchers are asking of the corpus also limit the problem domain: nobody's going to be doing analysis of spelling variations between the cards. It's important to point out that this narrow scope exists in spite of wide variation in format between the index cards. Some are handwritten on pre-printed cards, some are typewritten on blank cards, and some are entirely freeform. Nevertheless, they all describe species sightings in a regular format.

Because of this limited scope, the developers were (probably) able to build a traditional database and data-entry form, with specialized attributes for species, location, date, or other common fields that could be generalized from the corpus and the needs of the project. That meant custom-building an application specifically for the NABPP, which seems like a lot of work, but it does not require building the kind of Swiss Army Knife that medieval manuscript transcription requires. This presents an interesting parallel with other semi-standardized, hand-written historical documents like military muster rolls or pension applications.
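
To make the "specialized attributes" idea concrete, here is a minimal Python sketch of what a record for one card might look like. This is purely my own illustration: the field names, the `SightingCard` class, and the `validate` helper are all hypothetical, and I have no knowledge of how the NABPP database is actually structured.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SightingCard:
    """One transcribed card (hypothetical schema, not the NABPP's own)."""
    card_id: str
    species: str                          # each card describes a single species
    location: str                         # ...and a single location
    sighting_date: Optional[date] = None  # may be illegible or missing
    notes: str = ""                       # anything that fits no other field


def validate(card: SightingCard) -> List[str]:
    """Return a list of problems a data-entry form could show the volunteer."""
    problems = []
    if not card.species.strip():
        problems.append("species is required")
    if not card.location.strip():
        problems.append("location is required")
    # The cards were collected from the 1880s through the 1970s.
    if card.sighting_date is not None and not (
        1880 <= card.sighting_date.year <= 1979
    ):
        problems.append("sighting date is outside the collection period")
    return problems
```

The point is that a narrow domain lets the form check entries as they are typed, which a general-purpose transcription tool for freeform text cannot do.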

One of the really neat possibilities of subject-specific transcription software is that you can combine training users on the software with training them on difficult handwriting, or variations in the text. NABPP has put together a screencast for this, which walks users through transcribing a few cards from different periods, written in different formats. This screencast explains how to use the software, but it also explains more traditional editorial issues like what the transcription conventions are, or how to process different formats of manuscript material.

This is only one of the innovative ways the NABPP deals with its volunteers. I received a newsletter by email shortly after volunteering, announcing their progress to date (70K cards transcribed) and some changes in the most recent version of the software. This included some potentially embarrassing details that a less confident organization might not have mentioned, but which really do a lot for volunteer morale. Users may get used to workarounds to annoying bugs, but in my experience they still remember them and are thrilled when those bugs are finally fixed. So when the newsletter announces that "The Backspace key no longer causes the previous page to be loaded", I know that they're making some of their volunteers very happy.

In addition to the newsletter, the project also posts statistics on the transcription project, broken down both by volunteer and by bird. The top-ten list gives the game-like feedback you'd want in a project like this, although I'd be hesitant to foster competition in a less individually-oriented project. They're also posting preliminary analyses of the data, including the phenology of barn swallows, mapped by location and date of first sighting, and broken down by decade.
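
Statistics like these are cheap to produce once the transcriptions live in a database. As a rough sketch, assuming each transcription is available as a (volunteer, species) pair, the per-volunteer and per-bird counts and a top-ten list might be computed like this. The function names and the input shape are my own assumptions, not anything taken from the NABPP site.

```python
from collections import Counter


def transcription_stats(records):
    """Count transcriptions per volunteer and per species.

    `records` is an iterable of (volunteer, species) pairs; this input
    shape is assumed for illustration only.
    """
    by_volunteer = Counter()
    by_species = Counter()
    for volunteer, species in records:
        by_volunteer[volunteer] += 1
        by_species[species] += 1
    return by_volunteer, by_species


def top_volunteers(by_volunteer, n=10):
    """Return the n most prolific volunteers as (name, count) pairs."""
    return by_volunteer.most_common(n)
```

Calling `top_volunteers(by_volunteer)` yields the leaderboard, while `by_species` gives the breakdown by bird.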

Congratulations to the North American Bird Phenology Program for making crowdsourced transcription a reality!

Saturday, May 2, 2009

Open Source vs. Open Access

I've reached a point in my development project at which I'd like to go ahead and release FromThePage as Open Source. There are now only two things holding me back. I'd really like to find a project willing to work together with me to fix any deployment problems, rather than posting my source code on GitHub and leaving users to fend for themselves. The other problem is a more serious issue that highlights what I think is a conflict between Open Access and Open Source Software.

Open Source/Free Software and Rights of Use

Most of the attention paid to Open Source software focuses on the user's right to modify the software to suit their needs and to redistribute that (or derivative) code. However, there is a different, more basic right conferred by Free and Open Source licenses: the user's right to use the software for whatever purpose they wish. The Free Software Definition lists "Freedom 0" as:

The freedom to run the program, for any purpose.
Placing restrictions on the use of Free Software, such as time ("30 days trial period", "license expires January 1st, 2004") purpose ("permission granted for research and non-commercial use", "may not be used for benchmarking") or geographic area ("must not be used in country X") makes a program non-free.

Meanwhile, the Open Source Definition's sixth criterion is:

6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

Traditionally this has not been a problem for non-commercial software developers like me. Once you decide not to charge for the editor, game, or compiler you've written, who cares how it's used?

However, if your motivation in writing software is to encourage people to share their data, as mine certainly is, then restrictions on use start to sound pretty attractive. I'd love for someone to run FromThePage as a commercial service, hosting the software and guiding users through posting their manuscripts online. It's a valuable service, and is worth paying for. However, I want the resulting transcriptions to be freely accessible on the web, so that we all get to read the documents that have been sitting in the basements and file folders of family archivists around the world.

Unfortunately, if you investigate the current big commercial repositories of this sort of data, you'll find that their pricing/access model is the opposite of what I describe. Both Footnote.com and Ancestry.com allow free hosting of member data, but both lock browsing of that data behind a registration wall. Even if registration is free, that hurdle may doom the user-created content to be inaccessible, unfindable or irrelevant to the general public.

Open Access

The open access movement has defined this problem with regard to scholarly literature, and I see no reason why their call should not be applied to historical primary sources like the 19th/20th century manuscripts FromThePage is designed to host. Here's the Budapest Open Access Initiative's definition:

By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Both the Budapest and Berlin definitions go on to talk about copyright quite a bit. However, since the documents I'm hosting are already out of copyright, I don't really think those provisions are relevant. What I do have control over is my own copyright interest in the FromThePage software, and the ability to specify whatever kind of copyleft license I want.

My quandary is this: none of the existing Free or Open Source licenses allow me to require that FromThePage be used in conformance with Open Access. Obviously, that's because adding such a restriction -- requiring users of FromThePage not to charge for people reading the documents hosted on or produced through the software -- violates the basic principles of Free Software and Open Source. So where do I find such a license?

Have other Open Access developers run into such a problem? Should I hire a lawyer to write me a sui generis license for FromThePage? Or should I just get over the fear that someone, somewhere will be making money off my software by charging people to read the documents I want them to share?
