I just spent about two weeks doing what's known in software engineering as a "spike investigation." This is somewhat unusual, so it's worth explaining:
Say you have a feature you'd like to implement. It's not absolutely essential, but it would rank high in the priority queue if its cost were reasonably low. A spike investigation is the commitment of a little effort to clarify requirements, explore technology choices, create proof-of-concept implementations, and (most importantly) estimate costs for implementing that feature. From a developer's perspective you say "Pretend we're really doing
X. Let's spend
Y days to figure out how to do it, and how long
that would take." Unlike other software projects, the goal is not a product, but 1)
a plan and 2)
a cost.
The Feature: PDF-downloadAccording to the professionals, digitization projects are either oriented towards preservation (in which case the real-life copy may in theory be discarded after the project is done, but a website is merely a pleasant side-effect) or towards access (in which distribution takes primacy, and preservation concerns are an afterthought). FromThePage should enable digitization for access — after all, the point is to share all those primary sources locked away in somebody's file cabinet, like Julia Brumfield's diaries. Printed copies are a part of that access: when much of the audience is elderly, rural, or both, printouts really are the best vehicle for distribution.
The Plan: Generate a DocBook file, then convert it to PDFWhile considering some PDF libraries for Ruby, I was fortunate enough to hear
James Edward Gray II speak on "Ruby as a Glue Language". In one section called "shelling out", he talked about a requirement to produce a PDF when he was already rendering HTML. He investigated PDF libraries, but ended up piping his HTML through
`html2ps | ps2pdf` and spending a day on the feature instead of several weeks. This got me looking outside the world of PDF-modifying Ruby gems and Rails plugins, at other documenting scripting languages. It makes a lot of sense — after all, I'm not hooking directly to the Graphviz engine for
subject graphs, but generating
.dot files and running
neato on them.
I started by looking at typesetting languages like
LaTeX, then stumbled upon
DocBook. It's an SGML/XML-based formatting language which only specifies a logical structure. You divide your .docbook file into chapters, sections, paragraphs, and footnotes, then DocBook performs layout, applies typesetting styles, and generates a PDF file. Using the Rails templating system for this is a snap.
The Result:See for yourself:
This is a PDF generated from my development data. (Please ignore the scribbling.)
The Gaps:- Logical Gaps:
- The author name is hard-wired into the template. DocBook expects names of authors and editors to be marked up with elements like surname, firstname, othername, and heritage. I assume that this is for bibliographic support, but it means I'll have to write some name-parsing routine that converts "Julia Ann Craddock Brumfield" into "
<firstname> Julia </firstname> <othername type="middle"> Ann </othername> <othername type="maiden"> Craddock </othername> <surname> Brumfield </surname>".
- There is a single chapter called "Entries" for an entire diary. It would be really nice to split those pages out into chapters based on the month name in the page title.
- Page ranges in the index aren't marked appropriately. You see "6,7,8" instead of "6-9".
- Names aren't subdivided (into surname, first name, suffix, etc.), and so are alphabetized incorrectly in the index. I suppose that I could apply the name-separating function created for the first gaps to all the subjects within a "Person" category to solve this.
- Physical Layout: The footnote tags are rendering as end notes. Everyone hates end notes.
- Typesetting: The font and typesetting betrays DocBook's origins in the world of technical writing. I'm not sure quite what's appropriate here, but "Section 1.3.4" looks more like a computer manual than a critical edition of someone's letters.
The Cost:Fixing the problems with personal names requires a lot of ugly work with regular expressions to parse names, amounting to 20-40 hours to cover most cases for authors, editors, and indices. The work for chapter divisions is similar in size. I have little idea how easy it will be to fix the footnote problem, as it involves learning "a Scheme-like language" used for parsing .docbook files. Presumably I'm not the first person to want footnotes to render as footnotes, so perhaps I can find a
.dsssl file that does this already. Finally, the typesetting should be a fairly straightforward task, but requires me to learn a lot more about CSS than the little I currently know. In all, producing truly readable copy is about a month's solid work, which works out to 4-6 months of calendar time at
my current pace.
The Side-Effects:Generating a
.docbook file is very similar to generating any other XML file. Extending the printing code for exporting works in TEI or a FromThePage-specific format will only take 20-40 hours of work. Also, DocBook can be used to generate a set of paginated, static HTML files which users might want for some reason.
The Conclusions:It's more important that I start transcribing in earnest to shake out problems with my core product, rather than delaying it to convert end notes to footnotes. As a result, printing is not slated for the alpha release.