
Wednesday, December 7, 2011

Developments in Wikisource/ProofreadPage for Transcription

Last year I reviewed Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the en.wikisource.org policy prohibiting any text not already published on paper ruled out its use for manuscripts.

I'm pleased to report that this policy has been softened. About a month ago, NARA began partnering with the Wikimedia Foundation to host material—including manuscripts—on Wikisource.  While I was at MCN, I discussed this with Katie Filbert, the president of Wikimedia DC, who set me straight.  Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.

Once you host a project on Wikisource, what do you do with it?  Andie, Rob, and Gaurav over at the blog So You Think You Can Digitize?—and it's worth your time to read at least the last six posts—have been writing on exactly that subject.  Their most recent post describes their experience with Junius Henderson's Field Notes, and although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the Wikisource software, I'd like to concentrate on a detail:
What we currently want is a no-cost, minimal effort system that will make scans AND transcriptions AND annotations available, and that can facilitate text mining of the transcriptions.  Do we have that in WikiSource?  We will see.  More on annotations to follow in our next post but some father to a sister of some thoughts are already percolating and we have even implemented some rudimentary examples.
This is really exciting stuff.  They're experimenting with wiki mark-up of the transcriptions with the goal of annotation and text-mining.  I tried to do this back in 2005, but abandoned the effort because I never could figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that presented manuscript pages and their transcribed text.  The lack of wiki-linking was also the criticism of mine most taken to heart by the German Wikisource community last October.

So how is the mark-up working out?  Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.  The text "English sparrows" in the transcription is annotated [[:w:Passer domesticus|English sparrows]], which is wiki-speak for "link the text 'English sparrows' to the Wikipedia article Passer domesticus". Wikipedia's redirects then send the browser off to the article "House Sparrow".

So far so good.  The only complaint I can make is that—so far as I can tell—cross-wiki links don't appear in the "What links here" tool on Wikipedia, neither for Passer domesticus nor for House Sparrow.  This means that the annotation can't provide an indexing function: users can't see all the pages that reference possums, nor read a selection of those pages.  I'm not sure that the cross-wiki link data isn't tracked, however — just that I can't see it in the UI.  Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the "Global file usage" section of the sparrow image, for example.  Perhaps there is an API somewhere that the Henderson Field Note project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.
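
If I understand the API correctly, that "Global file usage" data is exposed through the standard MediaWiki web API via a globalusage property, so mining it needn't wait on the UI. Here's a minimal sketch in Python of what such a query might look like; the requests library and the placeholder file title are my own assumptions, not anything the Henderson project actually uses:

    import requests

    COMMONS_API = "https://commons.wikimedia.org/w/api.php"

    def global_usage(file_title):
        """Ask Commons which pages, on which wikis, embed a given file."""
        params = {
            "action": "query",
            "prop": "globalusage",
            "titles": file_title,
            "gulimit": "max",
            "format": "json",
        }
        data = requests.get(COMMONS_API, params=params).json()
        for page in data["query"]["pages"].values():
            for usage in page.get("globalusage", []):
                print(usage["wiki"], usage["title"])

    # Hypothetical file name -- substitute the actual sparrow image title.
    global_usage("File:Passer domesticus.jpg")

Run over a field book's page scans, something like this would at least recover the embedding side of the indexing function, even if the plain cross-wiki text links remain invisible.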

Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.  I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!

Friday, August 5, 2011

Programmers: Wikisource Needs You!

Wikisource is powered by a MediaWiki extension which allows page images to be displayed beside the wiki editing form. This extension also handles editorial workflow by allowing pages, chapters, and books to be marked as unedited, partially edited, in need of review, or finished. It's a fine system, and while the policy of the English language Wikisource community prevents it from being used for manuscript transcription, there are active manuscript projects using the software in other communities.

Yesterday, Mark Hershberger wrote this in a comment: For what its worth the extension used by WikiSource, ProofreadPage, now needs a maintainer. I posted about this here: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/54831

While I'm sorry to hear it, this is an excellent opportunity for someone with MediaWiki skills to do some real good.

Wednesday, February 2, 2011

2010: The Year of Crowdsourcing Transcription

2010 was the year that collaborative manuscript transcription finally caught on.

Back when I started work on FromThePage in 2005, I got the same response from most people I told about the project: "Why don't you just use OCR software?" To say that the challenges of digitizing handwritten material were poorly understood might be inaccurate—after all, the TEI standard included an entire chapter on manuscripts—but there were no tools in use designed for the purpose. Five years later, half a dozen web-based transcription projects are in progress, and new projects may choose from several existing tools to host their own. Crowdsourced transcription was even written up in the New York Times!

I'm going to review the field as I see it, then make some predictions for 2011.

Ongoing Structured Transcription Projects
By far the most successful transcription project is FamilySearch Indexing. In 2010, 185,900,667 records were transcribed from manuscript census forms, parish registers, tithe lists, and other sources world-wide. This brings the total up to 437,795,000 records double-keyed and reconciled by more than four hundred thousand volunteers — itself an awe-inspiring number with an equally impressive support structure.

October saw the launch of OldWeather, a project in which GalaxyZoo applied its crowdsourcing technology to the transcription of Royal Navy ships' logs from WWI. As I write, volunteers have transcribed an astonishing 308,169 pages of logs — many of which include multiple records. I hope to do a more detailed review of the software soon, but for now let me note how elegantly the software uses the data itself to engage volunteers: transcribers can see the motion of "their ship" on a map as they enter dates, latitudes, and longitudes. This leverages the immersive nature of transcription as an incentive, projecting users deep into history.

The North American Bird Phenology Program transcribed nearly 160,000 species sighting cards between December 2009 and 2010 and maintained its reputation as a model for crowdsourcing projects by publishing the first user satisfaction survey for a transcription tool. Interestingly, the program seems to have followed a growth pattern a bit similar to Wikipedia's: the cards transcribed rose from 203,967 to 362,996 (a 78% increase) while the number of volunteers only grew from 1,666 to 2,204 (32%) — indicating that a core of passionate volunteers continues to do most of the work.

I've only recently discovered Demogen, a project operated by the Belgian Rijksarchief to enlist the public to index handwritten death notices. Although most of the documentation is in Flemish, the Windows-based transcription software will also operate in French. I've had trouble finding statistics on how many of the record sets have been completed (a set comprising a score of pages with half a dozen personal records per page). By my crude estimate, the 4,000-odd sets (call it 480,000 records in all) are approximately 63% indexed, or something like 300,000 records to date. I'd like to write a more detailed review of Demogen/Visu and would welcome any pointers to project status and community support pages.

Ancestry.com's World Archives Project has been operating since 2008, but I've been unable to find any statistics on the total number of records transcribed. The project allows volunteers to index personal information from a fairly heterogeneous assortment of records scanned from microfilm. Each set of records has its own project page with help and statistics. The keying software is a Windows-based application free for download by any Ancestry.com registered user, while support is provided through discussion boards and a wiki.

Ongoing Free-form Transcription Projects
While I've written about Wikisource and its ProofreadPage plug-in before, it remains very much worth following. Of its 1.3 million scanned pages, more than two hundred thousand have been proofread, had problems reconciled, and been reviewed. Only a tiny percentage of those are handwritten, but that's still a few thousand pages, making it the most popular automated free-form transcription tool.

This blog was started to track my own work developing FromThePage to transcribe Julia Brumfield's diaries. As I type, beta.fromthepage.com hosts 1503 transcribed pages—of which 988 are indexed and annotated—and volunteers are now waiting on me to prepare and upload more page images. Major developments in 2010 included the release of FromThePage on GitHub under a Free software license and installation of the software by the Balboa Park Online Collaborative for transcription projects by their member institutions.

Probably the biggest news this year was TranscribeBentham, a project at University College London to crowdsource the transcription of Jeremy Bentham's papers. This involved the development of Transcription Desk, a MediaWiki-based tool which is slated to be released under an open-source license. The team of volunteers had transcribed 737 pages of very difficult handwriting when I last consulted the Benthamometer. The Bentham team has done more than any other transcription project to publicize the field -- explaining their work on their blog, reaching out through the media (including articles in the Chronicle of Higher Education and the New York Times), and even highlighting other transcription projects on Melissa Terras's blog.

Halted Transcription Projects
The Historic Journals project is a fascinating tool for indexing—and optionally transcribing—privately-held diaries and journals. It's run by Doug Kennard at Brigham Young University, and you can read about his vision in this FHT09 paper. Technically, I found a couple of aspects of the project to be particularly innovative. First, the software integrates with ContentDM to display manuscript page images from that system within its own context. Second, the tool is tightly integrated with FamilySearch, the LDS Church's database of genealogical material. It uses the FamilySearch API to perform searches for personal or place names, and can then use the FamilySearch IDs to uniquely identify subjects mentioned within the texts. Unfortunately, because the FamilySearch API is currently limited to LDS members, development on Historic Journals has been temporarily halted.

Begun as a desktop application in 1998, the uScript Transcription Assistant is the longest-running program in the field. Recently ported to modern web-based technologies, the system is similar to Img2XML and T-PEN in that it links individual transcribed words to their corresponding images within the scanned page. Although the system is not in use and the source code is not accessible outside WPI, you can read papers describing it, written by WPI students in 2003 and by Fabio Carrera (the faculty member leading the project) in 2005. Unfortunately, according to Carrera's blog, work on the project has stopped for lack of funding.

According to the New York Times article, there was an attempt to crowdsource the Papers of Abraham Lincoln. The article quotes project director Daniel Stowell explaining that nonacademic transcribers "produced so many errors and gaps in the papers that 'we were spending more time and money correcting them as creating them from scratch.'" The prototype transcription tool (created by NCSA at UIUC) has been abandoned.

Upcoming Transcription Projects
The Center for History and New Media at George Mason University is developing a transcription tool called Scripto based on MediaWiki and architected around integration with an external CMS for hosting page images. The initial transcription project will be their Papers of the War Department site, but connector scripts for other content management systems are under development. Scripto is being developed in a particularly open manner, with the source code available for immediate inspection and download on GitHub and a project blog covering the tool's progress.

T-PEN is a tool under development by Saint Louis University to enable line-by-line transcription and paleographic annotation. It's focused on medieval manuscripts, and automatically identifies the lines of text within a scanned page — even if that page is divided into columns. The team integrated crowdsourcing into their development process by challenging the public to test and give feedback on their line identification algorithm, gathering perhaps a thousand ratings in a two-week period. There's no word on whether T-PEN will be released under a free license. I should also mention that they've got the best logo of any transcription tool.

I covered Militieregisters.nl at length below, but the most recent news is that a vendor has been picked to develop the VeleHanden transcription tool. I would not be at all surprised if 2011 saw the deployment of that system.

The Balboa Park Online Collaborative is going into collaborative transcription in a big way with the Field Notes of Laurence Klauber for the San Diego Natural History Museum. They've picked my own FromThePage to host their transcriptions, and have been driving a lot of the development on that system since October through their enthusiastic feature requests, bug reports, and funding. Future transcription projects are in the early planning stages, but we're trying to complete features suggested by the Klauber material first.

The University of Iowa Libraries plan to crowdsource transcription of their Historic Iowa Children's Diaries. There is no word on the technology they plan to use.

The Getty Research Institute plans to crowdsource transcription of J. Paul Getty's diaries. This project also appears to be in the very early stages of planning, with no technology chosen.

Invisible Australians is a digitization project by Kate Bagnall and Tim Sherratt to explore the lives of Australians subjected to the White Australia policy through their extensive records. While it's still in the planning stages (with only a set of project blogs and a Zotero library publicly visible), the heterogeneity of the source material makes it one of the most ambitious documentary transcription projects I've seen. Some of the data is traditionally structured (like government forms), some free-form (like letters), and there are photographs and even hand-prints to present alongside the transcription! Invisible Australians will be a fascinating project to follow in 2011.

Obscure Transcription Projects
Because the field is so fragmented, there are a number of projects I follow that are not entirely automated, not entirely public, or not entirely collaborative, or that are moribund or awaiting development. In fact, some projects have so little written about them online that they're almost mysterious.
  • Commenters to a blog post at Rogue Classicism are discussing this APA job posting for a Classicist to help develop a new GalaxyZoo project transcribing the Oxyrhynchus Papyri.
  • Some cryptic comments on blog posts covering TranscribeBentham point to FadedPage, which appears to be a tool similar to Project Gutenberg's Distributed Proofreaders. Further investigation has yielded no instances of it being used for handwritten material.
  • A blog called On the Written, the Digital, and the Transcription tracks development of WrittenRummage, which was apparently a crowdsourced transcription tool that sought to leverage Amazon's Mechanical Turk.
  • Van Papier Naar Digitaal is a project by Hans den Braber and Herman de Wit in which volunteers photograph or scan handwritten material then send the images to Hans. Hans reviews them and puts them on the website as a PDF, where Herman publicizes them to transcription volunteers. Those volunteers download the PDF and use Jacob Boerema's desktop-based Transcript software to transcribe the records, which are then linked from Digitale Bronbewerkinge Nederland en België. With my limited Dutch it is hard for me to evaluate how much has been completed, but in the years that the program has been running its results seem to have been pretty impressive.
  • BYU's Immigrant Ancestors Project was begun in 1996 as a survey of German archival holdings, then was expanded into a crowdsourced indexing project. A 2009 article by Mark Witmer predicts the imminent roll-out of a new version of the indexing software, but the project website looks quite stale and says that it's no longer accepting volunteers.
  • In November, a Google Groups post highlighted the use of Islandora for side-by-side presentation of a page image and a TEI editor for transcription. However I haven't found any examples of its use for manuscript material.
  • Wiktenauer is a MediaWiki installation for fans of western martial arts. It hosts several projects transcribing and translating medieval manuals of fighting and swordsmanship, although I haven't yet figured out whether they're automating the transcription.
  • Melissa Terras' manuscript transcription blog post mentioned a Drupal-based tool called OpenScribe, built by the New Zealand Electronic Text Centre. However, the Google Code site doesn't show any updates since mid-2009, so I'm not sure how active the project is. This project is particularly difficult to research because "OpenScribe" is also the name chosen for an audio transcription tool hosted on SourceForge as well as a commercial scanning station.
I welcome any corrections or updates on these projects.

Predictions for 2011

Emerging Community
Nearly all of the transcription projects I've discussed were begun in isolation, unaware of previous work towards transcription tools. While I expect this fragmented situation to continue--in fact I've seen isolated proposals as recently as Shawn Moore's October 12 HASTAC post--it should lessen a bit as toolmakers and project managers enter into dialogue with each other on comment threads, conference panels or GitHub. Tentative steps were made towards overcoming linguistic division in 2010, with Dutch archivists covering TranscribeBentham and a scattered bit of bloggy conversation between Dutch, German, English and American participants. The publicity given to projects like OldWeather, Scripto, and TranscribeBentham can only help this community form.

No Single Tool
We will not see the development of a single tool that supports transcription of both structured and free-form manuscripts, nor both paleographic and semantic annotation in 2011. The field is too young and fragmented -- most toolmakers have enough work providing the basic functionality required by their own manuscripts.

New Client-side Editors
Although I don't foresee convergence of server-side tools, there is already some exciting work being done on Javascript-based editors for TEI, the mark-up language that informs most manuscript annotation. TEILiteEditor is an open-source WYSIWYG editor for TEI, while RaiseXML is an open-source editor for manipulating TEI tags directly. Both projects have seen a lot of activity over the past few weeks, and it's easy to imagine a future in which many different transcription tools support the same user-facing editor.

External Integration
2010 already saw strides being made towards integration with external CMSs, with BYU's Historic Journals serving page images from ContentDM and FromThePage serving page images from the Internet Archive. Scripto is apparently designed entirely around CMS integration, as it does not host images itself and is architected to support connectors for many different content management systems. I feel that this will be a big theme of transcription tool development in 2011, with new support for feeding transcriptions and text annotations back to external CMSs.

Outreach/Volunteer Motivation
We're learning that a key to success in crowdsourcing projects is recruiting volunteers. I think that 2011 will see a lot of attention paid to identifying and enlisting existing communities interested in the subject matter for a transcription project. In addition to finding volunteers, projects will better understand volunteer motivation and the trade-offs between game-like systems that encourage participation through score cards and points on the one hand, and immersive systems that enhance the volunteers' engagement with the text on the other.

Taxonomy
As the number of transcription projects multiplies, I think that we will be able to start generalizing from the unique needs of each collection of manuscript material to form a sort of taxonomy of transcription projects. In the list above, I've separated the projects indexing structured data like militia rolls from those dealing with free-form text like diaries or letters. I think that in 2011 we'll be able to classify projects by their paleographic requirements, the kinds of analysis that will be performed on the transcribed texts, the quantity of non-textual images that must be incorporated into the transcription presentation, and other dimensions. It's possible that the existing tools will specialize in a few of these areas, providing support for needs similar to those of their original projects, so that a sort of decision tree could guide new projects toward the appropriate tool for their manuscript material.

2011 is going to be a great year!

Thursday, December 30, 2010

Two New Diary Transcription Projects

The last few weeks have seen the announcement of two new transcription projects. I'm particularly excited about them because--like FromThePage--their manuscripts are diaries and they plan to open the transcription tasks to the public.

Dear Diary announces the digitization of University of Iowa's Historic Iowa Children's Diaries:
We have a deep love for historic diaries as well, and we’re currently hard at work developing a site that will allow the public to help enhance our collections through “crowdsourcing” or collaborative transcription of diaries and other manuscript materials. Stay tuned!
A Look Inside J. Paul Getty’s Newly Digitized Diaries describes J. Paul Getty's diaries and their contents, and mentions in passing that
We will soon launch a website that will invite your participation to perform the transcriptions (via crowdsourcing), thus rendering the diaries keyword-searchable and dramatically improving their accessibility.
I will be very interested to follow these projects and see which transcription systems they use or develop.

Friday, December 10, 2010

Militieregisters.nl and Velehanden.nl

Militieregisters.nl is a new transcription project organized by the City Archive Amsterdam that plans to use crowdsourcing to index militia registers from several Dutch archives. It's quite ambitious, and there are a number of innovative features about the project I'd like to address. However, I haven't seen any English-language coverage of the project so I'll try to translate and summarize it as best as my limited Dutch and Google's imperfect algorithms allow before offering my own commentary.

With the project "many hands make light work", Stadsarchief Amsterdam will make all militia records searchable online through crowdsourcing -- not just inventories, but indexes.

About
To research how archives and online users can work together to improve access to the archives, the Stadsarchief Amsterdam has set up the "Many Hands" project. With this project, we want to create a platform where all Dutch archives can offer their scans to be indexed and where all archival users can contribute in exchange for fun, contacts, information, honor, scanned goods, and whatever else we can think of.

To ask the whole Netherlands to index, we must start with archives that are important to the whole Netherlands. As the first pilot, we have chosen the Militia Registers, but there will soon be more archival files to be indexed so that everyone can choose something within his interest and skill-level.

All Militia Registers Online

Militia registers contain the records of all boys who were entered for conscription into military service during almost the entire 19th and part of the 20th centuries. These records were created in the entire Netherlands and are kept in many national and municipal archives.

The militia records are eminently suitable for large-scale digitization. The records consist of printed sheets. This uniformity makes scanning easy and thus inexpensive. More importantly, this resource is interesting for anyone with Dutch ancestry. Therefore we expect many volunteers to lend a hand to help unlock this wonderful resource, and that the online indexes will eventually attract many visitors.

But the first step is to scan the records. Soon the scanning of approximately one million pages will begin as a start. The more records we have digitized, the cheaper the scanning becomes, and the more attractive the indexing project becomes to volunteers. The Stadsarchive therefore calls upon all Dutch archival institutions to join!
FAQ
  • At our institution, online scans are provided for free. Why should people pay for scans?
    Revenues from sales of scans over the two-year duration of the project are part of the financing of the project. The budget is based on the rates used in the City Archives: €0.50 to €0.25 per scan, depending on the number of scans that someone buys. We ask that participating institutions not sell their own scans or make them available for free for the duration of the project. After the completion of the project, each institution may follow its own policy for providing the scans.
  • If we participate, who is the owner of the scans and index data?
    After production and payment, the scans will be delivered immediately to the institution which provided the militia records. The index information will also be supplied to the institutions after completion of the project. The institution remains the owner, but during the project period of approximately two years the material may not be used outside of the project.
  • What are the financial risks for participating archives?
    Participants pay only for their scans: the actual costs of preparing and carrying out the scanning process. The development and deployment of the indexing tool, volunteer recruitment, and two years' maintenance of the project website are funded by grants and by contributions from the Stadsarchief Amsterdam. There are no financial surprises.
  • What does the schedule for the project look like?
    On July 12 and September 13 we are organizing meetings with potential participants to answer your questions. Participants may sign up for the project until October 1, 2010, so that scanning can begin on that date. The tender process runs about two months, so a supplier can be contracted in 2010. In January 2011 we will start scanning, and volunteers can begin indexing in the spring. The sister site www.velehanden.nl--where the indexing will take place--will remain online for at least one year.
  • Will the indexing tool be developed as Open Source software?
    It is currently impossible to say whether the indexing tool will be developed as open-source software. The primary considerations are finding the most cost-effective solution and ensuring that the software performs well and is user-friendly. The only hard requirement is the use of open standards for the import and export of metadata, so that vendor independence is guaranteed.
RFP (Warning: very loose translation!)
Below are some ideas SAA has formulated regarding the functionality and sustainability of VeleHanden.nl:
  • Facilities for importing and managing scans, and for exporting data in XML format.
  • Scan viewer with advanced features.
  • Functionality to simultaneously run multiple projects for indexing, transcription, and translation of scans.
  • Features for organizing and managing data from volunteer groups and for selectively enabling features for participants and volunteer coordinators.
  • Features for communication between archival staff and volunteers, as well as for volunteers to provide support to each other.
  • Automated features for checking the data produced.
  • Rewards system (material and immaterial) for volunteers.
  • Many volunteers may work in parallel to process scans quickly and effectively.
  • Facilities to search, view and share scans online.
Other Dutch bloggers have covered the unique approach that Stadsarchief Amsterdam envisions for volunteer motivation and project support: Militieregisters.nl users who want to download scans may either pay for them in cash or in labor, by indexing N scanned pages. Christian van der Ven's blog post Crowdsourcen rond militieregisters and the associated comment thread discuss this intensely and are worth reading in full. Here's a loosely-translated excerpt:
The project assumes that it cannot allow the volunteer to indicate whether he wants to index Zeeland or Groningen. It is--in the words of the project leader--about the Orange feeling: seeing whether people across the country will volunteer rather than concentrating only on their own localities. Indexing people from their own village? Please, not that!

Well, since the last World Cup I'm feeling Orange again, but overall experience and research in archives teach that people everywhere are most interested in their own history: their own ancestors, their own homes, and the surrounding area. The closer [the data], the more motivation to do something.

And if the purpose of this project is to build an indexing tool, to scan registers, and then to obtain indexes through crowdsourcing as quickly as possible, it seems to me that the public should be given what it wants: local sources if desired. What I suggest is a choice menu: do you want records from your own area? Do you want them only from a certain period? Or do you want them filtered by both time and place? That kind of choice will trigger as many people as possible to participate, I think.
My Observations:
  • The pay-or-transcribe model for acquiring scans is really innovative. Offering people alternatives for supporting the project is a great way of serving the varied constituencies that make up genealogical researchers, giving cash-poor, time-rich users (like retirees) an easy way to access the project.
  • Although I have no experience in the subject, I suspect that this federated approach to digitization--taking structurally-similar material from regional archives and scanning/hosting it centrally--has a lot of possibilities.
  • Christian's criticism is quite valid, and drives right to the paradox of motivation in crowdsourcing: do you strive for breadth using external incentives like scoreboards and free recognition, or do you strive for depth and cultivate passionate users through internal incentives like deep engagement with the source material? Volunteer motivation and the trade-offs involved is a fascinating topic, and I hope to do a whole post on it soon.
  • One potential flaw is that it will be very hard to charge to view the scans when transcribers must be able to see the scans to do their indexing. I gather that the randomization in VeleHanden will address this.
  • The budget described in the RFP is a maximum of €150,000. As a real-life software developer, it's hard for me to see how this would pay for building a transcription tool, index database, scan import tool, scan CMS, search database, and (since they expect to sell the searched scans) an e-commerce system. And that includes running servers too!
  • This is yet another project that's transcribing structured data from tabular sources, which would benefit from the FamilySearch Indexer, if only it were open-source (or even for sale).

Monday, April 14, 2008

Collaborative transcription, the hard way

Archivalia has kept us informed of lots of manuscript projects going online in Europe last week, offering commentary along the way. Perhaps my favorite exchange was about the Chronicle of Sweder von Schele on the Internet:
The initial aim of the project is to supplement and improve the transcription. To that end, new transcriptions can be sent by e-mail to the institutions participating in the project. After editorial review, the pages will be replaced.

My goodness, how antediluvian. Has no one ever heard of a wiki?
That's right -- the online community may send in corrections and additions to the transcription by e-mail.

Thursday, February 7, 2008

Google Reads Fraktur

Yesterday, German blogger Archivalia reported that the quality of Fraktur OCR at Google Books has improved. There are still some problems, but they're on the same order as those found in books printed in Antiqua. Compare the text-only and page-image versions of Geschichte der teutschen Landwirthschaft (1800) with the text and image versions of Altnordisches Leben (1856), which is printed in Antiqua.

This is a big deal, since previous OCR efforts produced results that were not only unreadable, but un-searchable as well. This example from the University of Michigan's MBooks website (digitized in partnership with Google) gives a flavor of the prior quality: "Ueber den Ursprung des Uebels." ("On the Origin of Evil") results in "Us-Wv ben Uvfprun@ - bed Its-beEd."

It's thrilling that these improvements are being made to the big digitization efforts — my guess is that they've added new blackletter typefaces to the OCR algorithm and reprocessed the previously-scanned images — but this highlights the dependency OCR technology has on well-known typefaces. Occasionally, when I tell friends about my software and the diaries I'm transcribing, I'm asked, "Why don't you just OCR the diaries?" Unfortunately, until someone comes up with an OCR plugin for Julia Brumfield (age 72) and another for Julia Brumfield (age 88), we'll be stuck transcribing the diaries by hand.

Monday, June 25, 2007

Matt Unger's Papa's Diary Project

Of all the online transcriptions I've seen so far, Papa's Diary Project does the most with the least. Matt Unger is transcribing and annotating his grandfather's 1924 diary, then posting one entry per day. So far as I'm able to tell, he's not using any technology more complicated than a scanner, some basic image processing software, and Blogger.

Matt's annotations are really what make the site. His grandfather Harry Scheurman was writing from New York City, so information about the places and organizations mentioned in the diary is much more accessible than it is for Julia Brumfield's corner of rural Virginia. Matt makes the most of this by fleshing out the spare diary with great detail. When Scheurman sees a film, we learn that the theater was a bit shabby via an anecdote about the Vanderbilts. This exposition puts the May 9th single-word entry "Home" into the context of the day's news.

More than providing a historical backdrop, Matt's commentary offers a reflective narrative on his grandfather's experience. This narration puts enigmatic interactions between Scheurman and his sister Nettie into the context of a loving brother trying to help his sister recover from childbirth by keeping her in the dark about their father's death. Matt's skill as a writer and emotional connection to his grandfather really show here. I've found that this is what keeps me coming back.

This highlights a problem with collaborative annotation — no single editorial voice. The commenters at PepysDiary.com accomplish something similar, but their voices are disorganized: some pose queries about the text, others add links or historical commentary, while others speculate about the 'plot'. There's more than enough material there for an editor to pull together something akin to Papa's Diary, but it would take a great deal of work by an author of Matt Unger's considerable writing skill.

People with more literary gifts than I possess have reviewed Papa's Diary already: see Jewcy, Forward.com, and Booknik (in Russian). Turning to the technical aspects of the project, there are a number of interesting effects Matt's accomplished with Blogger.

Embedded Images
Papa's Diary uses images in three distinct ways.
1. Each entry includes a legible image of the scanned page in its upper right corner. (The transcription itself is in the upper left corner, while the commentary is below.)
2. The commentary uses higher-resolution cropped snippets of the diary whenever Scheurman includes abbreviations or phrases in Hebrew (see May 4, May 14, and May 13). In the May 11 entry, a cropped version of an unclear English passage is offered for correction by readers.
3. Images of people, documents, and events mentioned in the diary provide more context for the reader and make the site more attractive.

Comments
Comments are enabled for most posts, but don't seem to get too much traffic.

Navigation
Navigation is fairly primitive. There are links from one entry to others that mention the same topic, but no way to show all entries with a particular location or organization. It would be nice to see how many times Scheurman attended a JNF meeting, for example. Maybe I've missed a category list; the posts seem to be categorized, but there's no way to browse those categories.

Lessons for FromThePage
1. Matt's use of cropped text images — especially when he's double-checking his transcription — is very similar to the illegible tag feature of FromThePage. It seems important to be able to specify a reading, however, akin to the TEI unclear element.
2. Images embedded into subject commentary really do make the site more engaging. I hadn't planned to allow images within subject articles, but perhaps that's short-sighted.

Friday, June 1, 2007

Conversations about Transcription

Gavin Robinson has been exchanging emails with the UK National Archives this week. He's trying to convince the archivists to revise their usage restrictions to allow quotation and reuse of user-contributed content.

Gavin recognizes that the NA is doing a difficult dance with their user community:
[S]ome people who have valuable knowledge would be put off from contributing if they had to give it away under GDL, and might prefer a non-exclusive licence which allows them to retain more rights. For example, the average Great War Forum member doesn’t tend to think in a Web 2.0 kind of way. But then they might be put off by the very idea of a wiki. Including as many people as possible has probably involved some difficult decisions for the NA.

This puts me in mind of a discussion over at Dan Lawyer's blog last year. Amateurs who are willing to collaborate on research feel a strong sense of ownership over the results of their labor. They don't want other people taking credit for it, they don't like other people making a profit from it, and they don't like seeing it misused*. Wikipedia proves that people will contribute to a public-domain project, but I suspect that family history, being so much more personal, is a bit different. Several of Dan's Mormon commenters feel uncomfortable entrusting anybody other than the LDS church with their genealogical conclusions. Of course, many non-Mormons feel uncomfortable providing genealogical data specifically to the LDS church. Getting these two sets of the public to collaborate must be challenging.

This same week, Rantings of a Civil War Historian published a fascinating article on "Copyright and Unpublished Manuscript Materials". The collectors, amateurs, and volunteers who buy letters and diaries on EBay feel a similar sense of ownership over their documents. The comments range over the legal issues involved in posting transcripts of old documents, as well as the problems that occur when people with no physical access to the sources propagate incorrect readings. Many of the commenters have done work with libraries that require them to agree to conditions before they use their materials, and others work in IP Law, so the discussion is very high quality.

* I think the most prominent fear of misuse is that shades of nuance will be lost, hard-fought conclusions will be over-simplified, and most especially that errors will propagate through the net. In my own dabbling with family history, I've seen a chunk of juvenile historical fiction widely quoted as fact. (No, James Brumfield did not eat the first raw oysters at Jamestown!) Less egregious but perhaps more relevant to manuscript transcription are the misspellings and misreadings committed by scribes without the context to determine whether "Edmunds" should be "Edmund". Dan discusses a technical solution to this in his fourth point: Allow the user to say “I think” or “Maybe” about their conclusions. That's something I should flag as a feature for FromThePage.

Thursday, April 5, 2007

Gavin Robinson's Project Wenham

On his indispensable Investigations of a Dog, Gavin Robinson describes digitizing his grandfather's letters from a German POW camp in WWI:
The text will be transcribed and marked up with TEI compliant XML, and published on the web, along with background information written by me. There will be an index of people, and possibly places. Another optional extra will be selections of relevant documents from other sources, such as battalion war diaries.
Robinson's approach to his project is almost identical to mine:
There is no possibility of using OCR for handwritten text, so the letters will have to be transcribed. Although we have the originals to work from, there might be some need to work from digital images. Apart from saving wear and tear on the documents, digital images are more flexible. Difficult text can be enlarged on screen, and contrast can be adjusted to bring out faded text. However, there is the added problem of viewing an image and typing at the same time, which might require specialised software. I’ve found that Zotero notes can be very useful for transcribing text from images or PDFs and might be adequate to start with. I could also use HTML/PHP/MySQL to cobble together something like the Distributed Proofreaders interface for my own use. The front end is a simple web based interface, and although I don’t know how their back end works, a local version just for my own use could be much simpler.
The differences between Robinson's needs and mine are small but not insignificant:
  1. I'm dealing with books whose many pages need consistent titles. In the course of my digitization, I've found the effort involved in reviewing and labelling images to be immense.

    Say I've got just under 200 JPG files that are supposed to represent every even-numbered page of my great grandmother's 1925 diary. These images were taken in bulk, and the haste involved means that some pages may be missing and need to be filled, some are duplicates, and a few may even be out of focus and require re-shoots. Even after a set of images is labelled and cleaned up, it needs to be interleaved with a similarly processed set of odd numbered pages.

    I had to halt development on the transcription feature several months ago in order to concentrate on automating this process. I doubt that a project in which each work could be represented by only a handful of images would need similar tools — in fact, automation might be slower than manually reviewing and ordering the images. (A rough sketch of the interleaving step appears after this list.)
  2. The shortness of letters may make their images easy to title, but they're more likely to need sophisticated categorization. The initial versions of FromThePage listed works (then called "books") in a single page, ordered alphabetically by title. I'm pretty sure that this would be completely inadequate for Robinson's or Susan Kitchens' needs.
  3. Project Wenham includes Robinson's grandfather's photographs and (from what I gather) the front side of postcards. I've not given any thought to including images in the transcribed works.
  4. Robinson apparently does not have my requirement for offline access to the transcribed works.
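
As promised in item 1, here is a rough sketch of the interleaving step. This is not the actual FromThePage tooling; it's just an illustration in Python, assuming two already-cleaned directories of page images (one per batch) in a hypothetical diary_1925 folder:

    from pathlib import Path

    # Hypothetical layout: odd- and even-numbered pages were shot as separate batches.
    odd_pages = sorted(Path("diary_1925/odd").glob("*.jpg"))
    even_pages = sorted(Path("diary_1925/even").glob("*.jpg"))

    # Interleave the two batches and assign page numbers.
    interleaved = []
    for i, (odd_img, even_img) in enumerate(zip(odd_pages, even_pages)):
        interleaved.append((2 * i + 1, odd_img))
        interleaved.append((2 * i + 2, even_img))

    # Print a manifest to review before titling and uploading.
    for page_number, image in interleaved:
        print(f"page {page_number:03d}: {image.name}")

Of course, the hard part is everything this sketch assumes away: spotting the missing pages, duplicates, and out-of-focus shots before the two batches line up at all.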

Wednesday, April 4, 2007

Susan Kitchens' Letter Project

Susan Kitchens at Family Oral History Using Digital Tools [and I thought "Collaborative Manuscript Transcription" was a mouthful!] has a need that's very similar to my own. She's got a bunch of old letters and she wants to "scan them all and somehow make sense of them digitally". Her post on the subject outlines a plan to embed metadata into the scanned images themselves. This would allow her to use image viewing software -- she's looking at MemoryMiner -- to navigate the letter images.
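
I don't know which metadata container Susan will settle on, but to make the idea concrete, here is a minimal sketch of embedding a description directly into a scanned letter. It assumes the third-party piexif library and a hypothetical file name and description; MemoryMiner's own expectations may well differ:

    import piexif

    letter = "letters/1943-03-02_to_mother.jpg"  # hypothetical scan

    # Read whatever EXIF data the scanner already wrote.
    exif_dict = piexif.load(letter)

    # Store a human-readable description and a date alongside the image itself.
    exif_dict["0th"][piexif.ImageIFD.ImageDescription] = (
        b"Letter, 2 March 1943, to his mother; mentions rationing."
    )
    exif_dict["0th"][piexif.ImageIFD.DateTime] = b"1943:03:02 00:00:00"

    # Write the modified EXIF block back into the JPEG in place.
    piexif.insert(piexif.dump(exif_dict), letter)

The appeal of this approach is that the description travels with the scan itself, so any viewer that reads standard image metadata can use it for navigation.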

Her project differs from mine in that
  1. She's not trying to distribute compact hardcopy versions, a core end-product of FromThePage.
  2. She needs more structured, analytical metadata than a freeform wiki-style "what links here" index can provide.
  3. She doesn't have the collaborative proofreading/correction/annotation needs I've seen in transcribing my great-great grandmother's diaries.
Kitchens' needs suggest enhancements to my design in structuring articles. I'd thought about differentiating general articles on subjects like "cutting match" or "tobacco" from those on people, but maybe further categorization would be worth investigating during early testing.