Friday, September 21, 2007

Progress Report: Printing

I just spent about two weeks doing what's known in software engineering as a "spike investigation." This is somewhat unusual, so it's worth explaining:

Say you have a feature you'd like to implement. It's not absolutely essential, but it would rank high in the priority queue if its cost were reasonably low. A spike investigation is the commitment of a little effort to clarify requirements, explore technology choices, create proof-of-concept implementations, and (most importantly) estimate costs for implementing that feature. From a developer's perspective you say "Pretend we're really doing X. Let's spend Y days to figure out how to do it, and how long that would take." Unlike other software projects, the goal is not a product, but 1) a plan and 2) a cost.

The Feature: PDF-download
According to the professionals, digitization projects are either oriented towards preservation (in which case the real-life copy may in theory be discarded after the project is done, but a website is merely a pleasant side-effect) or towards access (in which distribution takes primacy, and preservation concerns are an afterthought). FromThePage should enable digitization for access — after all, the point is to share all those primary sources locked away in somebody's file cabinet, like Julia Brumfield's diaries. Printed copies are a part of that access: when much of the audience is elderly, rural, or both, printouts really are the best vehicle for distribution.

The Plan: Generate a DocBook file, then convert it to PDF
While considering some PDF libraries for Ruby, I was fortunate enough to hear James Edward Gray II speak on "Ruby as a Glue Language". In one section called "shelling out", he talked about a requirement to produce a PDF when he was already rendering HTML. He investigated PDF libraries, but ended up piping his HTML through `html2ps | ps2pdf` and spending a day on the feature instead of several weeks. This got me looking outside the world of PDF-modifying Ruby gems and Rails plugins, at other documenting scripting languages. It makes a lot of sense — after all, I'm not hooking directly to the Graphviz engine for subject graphs, but generating .dot files and running neato on them.

I started by looking at typesetting languages like LaTeX, then stumbled upon DocBook. It's an SGML/XML-based formatting language which only specifies a logical structure. You divide your .docbook file into chapters, sections, paragraphs, and footnotes, then DocBook performs layout, applies typesetting styles, and generates a PDF file. Using the Rails templating system for this is a snap.

The Result:
See for yourself: This is a PDF generated from my development data. (Please ignore the scribbling.)

The Gaps:
  • Logical Gaps:
    • The author name is hard-wired into the template. DocBook expects names of authors and editors to be marked up with elements like surname, firstname, othername, and heritage. I assume that this is for bibliographic support, but it means I'll have to write some name-parsing routine that converts "Julia Ann Craddock Brumfield" into "<firstname> Julia </firstname> <othername type="middle"> Ann </othername> <othername type="maiden"> Craddock </othername> <surname> Brumfield </surname>".
    • There is a single chapter called "Entries" for an entire diary. It would be really nice to split those pages out into chapters based on the month name in the page title.
    • Page ranges in the index aren't marked appropriately. You see "6,7,8" instead of "6-9".
    • Names aren't subdivided (into surname, first name, suffix, etc.), and so are alphabetized incorrectly in the index. I suppose that I could apply the name-separating function created for the first gaps to all the subjects within a "Person" category to solve this.
  • Physical Layout: The footnote tags are rendering as end notes. Everyone hates end notes.
  • Typesetting: The font and typesetting betrays DocBook's origins in the world of technical writing. I'm not sure quite what's appropriate here, but "Section 1.3.4" looks more like a computer manual than a critical edition of someone's letters.
The Cost:
Fixing the problems with personal names requires a lot of ugly work with regular expressions to parse names, amounting to 20-40 hours to cover most cases for authors, editors, and indices. The work for chapter divisions is similar in size. I have little idea how easy it will be to fix the footnote problem, as it involves learning "a Scheme-like language" used for parsing .docbook files. Presumably I'm not the first person to want footnotes to render as footnotes, so perhaps I can find a .dsssl file that does this already. Finally, the typesetting should be a fairly straightforward task, but requires me to learn a lot more about CSS than the little I currently know. In all, producing truly readable copy is about a month's solid work, which works out to 4-6 months of calendar time at my current pace.

The Side-Effects:
Generating a .docbook file is very similar to generating any other XML file. Extending the printing code for exporting works in TEI or a FromThePage-specific format will only take 20-40 hours of work. Also, DocBook can be used to generate a set of paginated, static HTML files which users might want for some reason.

The Conclusions:
It's more important that I start transcribing in earnest to shake out problems with my core product, rather than delaying it to convert end notes to footnotes. As a result, printing is not slated for the alpha release.

2 comments:

Greg said...

Ben - Could your software be used for collaborative translation? I'm not sure if such a thing exists. Here's my use case:

My qigong teacher has written a bunch of stuff in Chinese - but doesn't know English very well. In the past he's had stuff translated by people who have good language abilities, but don't "get it" and the translations seem dry. He's also used people who have a better understanding of the topic, but not the languages.

Likewise, the translators are scattered around the globe.

I'm curious if your ideas and software could harness all this.


Greg

Ben W. Brumfield said...

Thanks for the comment, Greg!

I've thought a bit about this in the case of a coworker who has a Civil War diary written in Welsh, but your case may be more applicable.

The thing that my software is good at is distributed editing of text associated with an image, with wiki-style links and all the indexing that come along with that. How well does that match your use case?

Distributed? Check. You obviously need a way to make several passes at these translations, with at least one pass each by a language expert and by a subject matter expert.

Editing? You need that, but may not need the things my software is good at (indexing, image-based tagging). Quite possibly any other Wiki tool would provide a richer set of editing tools like spell-checkers or WYSIWYGs for formatting. In addition, while I capture all change information, displaying it in a readable format is much rougher than MediaWiki's history page diff tool.

The image association is the odd bit -- do the texts originate in handwritten Chinese which needs to be consulted throughout the editing process? If so, the software is a pretty good fit. On the other hand, if you're starting with the translation by a language expert and having subject-matter experts review it, you might be better off with a custom solution linking the original translations to the corrected ones.

I also wonder a bit about the links. My tool is great for internal cross-linking and doing analysis from that. It seems like you're not doing too much of that kind of analysis, but might be looking for external links to other qigong sites. I may have misread your use, though -- if you're after printed books with automatic indexing, my software can do that pretty well.