Collaborative Manuscript Transcription: Progress Report: GitHub, Archive.org Integration, and General Availability

Tuesday, January 4, 2011

2010 saw big changes in FromThePage.

The Balboa Park Online Collaborative started using FromThePage to transcribe the field notes of herpetologist Laurence Klauber. Perian Sully, Rich Cherry, and all the other folks there have been fantastic to work with: full of enthusiasm and new ideas for the system while patient with the bugs that we've discovered. This is the first institution to install FromThePage, and their needs have driven a lot of development since October, including:
Internet Archive integration: As you can see on the Klauber site, FromThePage now integrates directly with books hosted on the Internet Archive. This means that FromThePage gets to use the BookReader (in modified form) with its spiffy zoom and pan capabilities while delegating the expensive work of image hosting to Archive.org. It also reduces duplication of data and may enhance findability of the transcriptions. Best of all, the tedious process of uploading, assembling, and titling page images can be skipped, as FromThePage now imports the book structure and even the OCRed page titles from Archive.org derivative files.
As you can see from that last link, I've transferred FromThePage over to GitHub, released it under the Affero GPL, and created some extensive documentation on the wiki. So FromThePage is officially Free software, available for immediate use.

If you're interested in hosting a transcription project on FromThePage, drop me a line at benwbrum@gmail.com and I'll help you get started.

Jason said...: Greetings Ben, I am hoping that you can help me. I am sitting on pages upon pages of scans of obituaries. The obits are very clean (neat, orderly and easy to read) and currently combined into a PDF file. Is there some sort of software that I can use that will allow me to transcribe these obits?; February 5, 2011 at 6:03 PM
Ben W. Brumfield said...: Jason,

If your obituaries are printed--and I assume they are--your best starting point is OCR software. This will convert the text in the images into plain text, so you'll only have to proofread and correct the automatic transcriptions. Many scanning tools do OCR, including some commercially available Adobe products. Harder to use (but free!) is the Internet Archive community texts project, in which you'd upload your PDF and let their server software do the OCR for you.

I'm no expert on this--after all, you can't OCR handwritten material--but that's the advice I'd pass along.

Best of luck!; February 5, 2011 at 7:40 PM
