Hi. My name is Nicholas Woodward, and I am a Software Developer for the University of Texas Libraries. Ben Brumfield has been so kind as to offer me an opportunity to write a guest post on his blog about my approach to transcribing large scanned document collections, which combines crowdsourcing and computer vision. I presented my application at the Texas Conference on Digital Libraries on May 7th, 2013, and the slides from the presentation are available on TCDL’s website. The purpose of this post is to introduce my approach along with a test collection and preliminary results. I’ll conclude with a discussion of potential avenues for future work.
Before we delve into algorithms for computer vision and what-not, I’d first like to say a word about the collection used in this project and why I think it’s important to look for new ways to complement crowdsourcing transcription. The Guatemalan National Police Historical Archive (or AHPN, in Spanish) contains the records of the Guatemalan National Police from 1882-2005. It is estimated that AHPN contains more than 80 million pages of documents (8,000 linear meters) such as handwritten journals and ledgers, birth certificate and marriage license forms, identification cards and typewritten letters. To date, the AHPN staff have processed and digitized approximately 14 million pages of the collection, and they are publicly available in a digital repository that was developed by UT Libraries.
While unique for its size, AHPN is representative of an increasingly common problem in the humanities and social sciences. The nature of the original documents precludes any economical OCR solution on the scanned images (See below), and the immense size of the collection makes page-by-page transcription highly impractical, even when using a crowdsourcing approach. Additionally, the collection does not contain sufficient metadata to support browsing via commonly used traits, such as titles or authors of documents.
These characteristics of AHPN informed my idea to develop a different method for transcribing a large scanned document collection that draws on the work of popular crowdsourcing transcription tools such as FromThePage, Scripto and Transcribe Bentham. These tools allow users to transcribe individual records of a collection, maintaining quality control largely through redundancy and error checking.
The drawback of this approach is that the results of crowdsourcing apply only to the document being transcribed. In contrast, my approach breaks documents up into individual words, on the theory that though no two documents are exactly alike, they are likely to contain similar words. And across an entire corpus, particularly a very large one such as AHPN, the same words are likely to appear many times. Consequently, if users transcribe the words of one document, then I can use image matching algorithms to find other images of the same words and apply the crowdsourced transcription to the new images. The process looks like this:
- Segment scanned documents into words
- Crowdsource the transcription of a fraction of those words
- Use image matching to pair images of transcribed words with unknown images
- In the case of a match, associate the crowdsourced text with the document containing the unknown image
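In code, the propagation step amounts to copying a crowdsourced label onto every unknown image that matches a transcribed one. Here is a toy sketch in Python, with strings standing in for word images and a hypothetical `images_match` predicate in place of the real segmentation and matching machinery described later in this post:

```python
def propagate_transcriptions(transcribed, unknown, images_match):
    """transcribed: list of (word_image, text) pairs from crowdsourcing.
    unknown: list of word images with no transcription yet.
    Returns a dict mapping each matched unknown image to crowdsourced text."""
    labels = {}
    for img in unknown:
        for known_img, text in transcribed:
            if images_match(img, known_img):   # step 3: image matching
                labels[img] = text             # step 4: apply the text
                break
    return labels

# Tiny demonstration with strings standing in for images.
transcribed = [("img_placas_1", "placas"), ("img_policia_1", "policia")]
unknown = ["img_placas_2", "img_otro_1"]
# Pretend two "images" match when they share the word part of their name:
match = lambda a, b: a.rsplit("_", 1)[0] == b.rsplit("_", 1)[0]
print(propagate_transcriptions(transcribed, unknown, match))
# {'img_placas_2': 'placas'}
```

The names and matching rule here are purely illustrative; the real work is in deciding when two word images match, which the rest of this post describes.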
The crowdsourcing output from the testing phase of the project looked like this. The image matching process itself consists of six steps:
- Compare the width and height of each image
- Compare the histograms of each image
- Extract ‘interest points’ from each image using the SURF algorithm
- Match interest points between them
- Calculate average distance between interest points
- If the average is below a certain threshold, consider the images a match
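Taken together, the six steps form a cheap-to-expensive cascade: the dimension and histogram checks quickly reject most non-matches before the costly SURF comparison runs. Here is a minimal sketch in Python, where `hist_fn` and `surf_fn` are hypothetical stand-ins for the histogram and interest-point computations described below, and the tolerances are illustrative rather than the project's actual values:

```python
def same_word(img_a, img_b, hist_fn, surf_fn,
              size_tol=0.10, hist_tol=0.10, dist_thresh=0.20):
    """Cheap-to-expensive cascade over the six matching steps.
    Images are 2-D grayscale arrays given as lists of rows."""
    # Step 1: images of the same word should have similar dimensions
    h_a, w_a = len(img_a), len(img_a[0])
    h_b, w_b = len(img_b), len(img_b[0])
    if abs(h_a - h_b) > size_tol * max(h_a, h_b):
        return False
    if abs(w_a - w_b) > size_tol * max(w_a, w_b):
        return False
    # Step 2: images of the same word should have similar histograms
    if abs(hist_fn(img_a) - hist_fn(img_b)) > hist_tol:
        return False
    # Steps 3-5: extract interest points, match them, and compute the
    # average distance between matched points
    avg_distance = surf_fn(img_a, img_b)
    # Step 6: call it a match if the average is below the threshold
    return avg_distance < dist_thresh

# Tiny binarized "images" (lists of rows of 0/255 pixel values):
a = [[0, 255], [255, 0]]
b = [[0, 255], [255, 0]]
black_ratio = lambda img: sum(px == 0 for row in img for px in row) / (len(img) * len(img[0]))
print(same_word(a, b, black_ratio, lambda x, y: 0.05))  # True
```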
The first step is self-explanatory: two images of the same word should have roughly the same dimensions, so pairs that differ greatly in width or height can be rejected immediately. The second step is also relatively straightforward. Image histograms are just counts of the pixels of different hues in an image. In a typical color image, a histogram would be the number of pixels of each color. In this case the scanned documents are grayscale TIFF images, so calculating histograms is simply a matter of adding up the black and white pixels separately (see below). Two images of different words may contain the same number of letters, but the letters have different shapes, so their ratios of black to white pixels will also differ. Note: this does not solve the problem of words with the same letters in a different order, e.g. “la” and “al”. That case is handled in the next step.
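To make the histogram check concrete, here is a minimal sketch, assuming the word images have already been binarized so pixels are either dark (ink) or light (background); the tolerance value is illustrative:

```python
def ink_ratio(img, threshold=128):
    """Fraction of 'ink' (dark) pixels in a grayscale image,
    given as a list of rows of 0-255 pixel values."""
    total = dark = 0
    for row in img:
        for px in row:
            total += 1
            if px < threshold:
                dark += 1
    return dark / total

def histograms_similar(img_a, img_b, tol=0.05):
    # Two images of the same word should have a similar ink-to-background ratio
    return abs(ink_ratio(img_a) - ink_ratio(img_b)) <= tol

# "la" and "al" would have nearly identical ink ratios, so this check
# cannot tell them apart -- that is left to the interest-point matching.
word_a = [[0, 255, 0, 255]]
word_b = [[255, 0, 255, 0]]
print(histograms_similar(word_a, word_b))  # True
```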
Steps 3 through 6 represent the most computationally intensive part of the entire matching process, and they are based on the Speeded Up Robust Features (SURF) implementation in OpenCV. The basic idea is to use the SURF algorithm to find the “interest points” of an image and then measure the distance between points in two separate images. Interest points are curves, blobs and points where two lines meet. Going this route is both a strength and a weakness of my approach. I’ll explain. Even if we don’t speak Spanish, it’s relatively easy to infer that the images below contain the same word, just in a different format. Whether it’s in all caps or all lowercase or has only the first letter capitalized, the word is the same to a human.
But not to a computer.
And we see this when we use SURF to calculate the interest points of each image.
The final step in matching images involves computing the distance between interest points and finding the matches below a certain distance. In this case, I used the ratio test to eliminate poor interest point matches and then considered two images the same if their average distance was below 0.20. The example below attempts to match the word “placas” (plates) against several other words, and finds a match with another image of the word placas, despite the poor quality of the second image.
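The ratio test and average-distance threshold can be sketched with plain NumPy over any two sets of descriptor vectors. In the real pipeline the descriptors would come from OpenCV's SURF implementation (available in opencv-contrib builds); here, synthetic 64-dimensional descriptors (SURF's default length) stand in for real interest points, and the 0.75 ratio is Lowe's commonly used value rather than necessarily the one used in this project:

```python
import numpy as np

def average_match_distance(desc_a, desc_b, ratio=0.75):
    """Brute-force match descriptors from image A to image B, keep the
    matches that pass the ratio test, and return their average L2
    distance (or None if no match survives)."""
    kept = []
    for d in desc_a:
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        # Lowe's ratio test: keep only clearly unambiguous matches
        if best < ratio * second:
            kept.append(best)
    return float(np.mean(kept)) if kept else None

def images_match(desc_a, desc_b, threshold=0.20):
    avg = average_match_distance(desc_a, desc_b)
    return avg is not None and avg < threshold

# Synthetic descriptors: B is a slightly noisy copy of A, so every
# interest point should find its counterpart at a tiny distance.
rng = np.random.default_rng(0)
a = rng.random((10, 64))
b = a + rng.normal(0, 0.001, a.shape)
print(images_match(a, b))  # True
```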
But here is how the reliance on interest points is also a strength of the approach, in at least two ways. First, the images above are Spanish words, but to a computer they’re just objects. They could be in any language and the same basic workflow would apply. The only adjustments would be recruiting crowdsourcers who understand the language (helpful for determining a word when it’s hard to make out and context helps) and tweaking the algorithm parameters in steps 3-6. Second, while the format of a word is not necessarily important for keyword searches, it is relevant to many computational research methods such as document clustering, classification and finding named entities.
The output of this approach on the test collection from AHPN is as follows:
There are several implications of these results. First, very short images, i.e. words of one or two letters, basically bring us back to the same issues as OCR. The quality of the scanned documents in AHPN varies, and in many instances there are simply not enough interest points to differentiate between ‘a’, ‘o’ and ‘e’. So my approach may only be suitable for longer words. This may not be too serious an issue, since in most cases short words are also stopwords, which are generally not needed for either keyword searches or computational research. Second, without a doubt, I will need to do the crowdsourcing on a much larger scale. 2.33 images per word is not enough training data to create an image classifier capable of finding other images of that word. And there is also a greater need for transcription output, because this approach requires images of each word in all of its forms, along with the transcribed text.
The future steps of the project, then, must include an avenue for acquiring more crowdsourced transcription. I think the best approach will be to refactor the CakePHP code into modules for Omeka, Drupal and other popular open-source CMSs. Like the crowdsourcing tools mentioned above (Scripto, FromThePage, etc.), I’ll need to enlist the users of particular collections, who may have the most vested interest in seeing them become more accessible and functional. Another important component will be continued development of the algorithms for image segmentation and matching. I am interested in looking at how to segment handwritten text, especially cursive text, where it is particularly difficult to determine the spaces between words. Additionally, the image matching detailed above may be useful going forward because it is not particularly memory-intensive, and the work can be divided into separate tasks for a distributed computing environment, maybe something along the lines of SETI@home. But with more output from crowdsourcing it would be good to incorporate either a Bag of Words or an SVM classifier approach to process more words at a time. So there’s definitely plenty to do. Stay tuned!