Friday, September 21, 2007

Progress Report: Printing

I just spent about two weeks doing what's known in software engineering as a "spike investigation." This is somewhat unusual, so it's worth explaining:

Say you have a feature you'd like to implement. It's not absolutely essential, but it would rank high in the priority queue if its cost were reasonably low. A spike investigation is the commitment of a little effort to clarify requirements, explore technology choices, create proof-of-concept implementations, and (most importantly) estimate the cost of implementing that feature. From a developer's perspective, you say: "Pretend we're really doing X. Let's spend Y days figuring out how we'd do it, and how long it would take." Unlike most software projects, the goal is not a product, but 1) a plan and 2) a cost.

The Feature: PDF-download
According to the professionals, digitization projects are oriented either towards preservation (in which case the real-life copy may in theory be discarded once the project is done, and a website is merely a pleasant side-effect) or towards access (in which case distribution takes primacy and preservation concerns are an afterthought). FromThePage should enable digitization for access — after all, the point is to share all those primary sources locked away in somebody's file cabinet, like Julia Brumfield's diaries. Printed copies are part of that access: when much of the audience is elderly, rural, or both, printouts really are the best vehicle for distribution.

The Plan: Generate a DocBook file, then convert it to PDF
While considering some PDF libraries for Ruby, I was fortunate enough to hear James Edward Gray II speak on "Ruby as a Glue Language". In a section called "shelling out", he described needing to produce a PDF from a page he was already rendering as HTML. He investigated PDF libraries, but ended up piping his HTML through `html2ps | ps2pdf`, spending a day on the feature instead of several weeks. This got me looking outside the world of PDF-producing Ruby gems and Rails plugins, at other document-formatting tools. It makes a lot of sense — after all, I'm not hooking directly into the Graphviz engine for subject graphs, but generating .dot files and running neato on them.
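
For the curious, "shelling out" really is as simple as it sounds. Here's a minimal sketch of the approach in Ruby, assuming html2ps and ps2pdf are installed, and using a made-up template name:

```ruby
# Render the HTML we already know how to produce, then pipe it through
# html2ps and ps2pdf rather than linking a PDF library.
html = render_to_string(:template => 'works/printable')  # hypothetical template
IO.popen('html2ps | ps2pdf - work.pdf', 'w') { |pipe| pipe.write(html) }
```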

I started by looking at typesetting languages like LaTeX, then stumbled upon DocBook. It's an SGML/XML-based formatting language that specifies only logical structure: you divide your .docbook file into chapters, sections, paragraphs, and footnotes, and the DocBook toolchain performs layout, applies typesetting styles, and generates a PDF. Generating the .docbook file with the Rails templating system is a snap.
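
To give a flavor of what that looks like, here is a trimmed-down ERB sketch; the instance variables and the choice of DocBook elements are illustrative rather than final:

```erb
<%# Hypothetical DocBook template: logical structure only; layout and typesetting are left to the toolchain %>
<book>
  <title><%= @work.title %></title>
  <chapter>
    <title>Entries</title>
    <% @work.pages.each do |page| %>
      <section>
        <title><%= page.title %></title>
        <para><%= page.transcription_text %></para>
      </section>
    <% end %>
  </chapter>
</book>
```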

The Result:
See for yourself: This is a PDF generated from my development data. (Please ignore the scribbling.)

The Gaps:
  • Logical Gaps:
    • The author name is hard-wired into the template. DocBook expects the names of authors and editors to be marked up with elements like surname, firstname, othername, and lineage. I assume this is for bibliographic support, but it means I'll have to write a name-parsing routine that converts "Julia Ann Craddock Brumfield" into "<firstname> Julia </firstname> <othername type="middle"> Ann </othername> <othername type="maiden"> Craddock </othername> <surname> Brumfield </surname>". (A rough sketch of such a routine follows this list.)
    • There is a single chapter called "Entries" for an entire diary. It would be really nice to split those pages out into chapters based on the month name in the page title.
    • Page ranges in the index aren't collapsed appropriately. You see "6,7,8" instead of "6-8".
    • Names aren't subdivided (into surname, first name, suffix, etc.), and so are alphabetized incorrectly in the index. I suppose I could apply the name-parsing routine from the first gap to all the subjects within a "Person" category to solve this.
  • Physical Layout: The footnote tags are rendering as endnotes. Everyone hates endnotes.
  • Typesetting: The font and typesetting betray DocBook's origins in the world of technical writing. I'm not sure quite what's appropriate here, but "Section 1.3.4" looks more like a computer manual than a critical edition of someone's letters.
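
The name-parsing routine mentioned above might begin as something like this sketch. It handles only the trivial "First Middle ... Last" case, and no amount of regular-expression cleverness will distinguish a middle name from a maiden name without extra data:

```ruby
# Sketch: split a personal name into DocBook name elements.  Honorifics,
# suffixes, and multi-word surnames are left for the real implementation.
def docbook_name(full_name)
  parts = full_name.split
  first = parts.shift
  last  = parts.pop
  middles = parts.map { |name| "<othername>#{name}</othername>" }
  (["<firstname>#{first}</firstname>"] + middles +
   ["<surname>#{last}</surname>"]).join(' ')
end

docbook_name("Julia Ann Craddock Brumfield")
# => "<firstname>Julia</firstname> <othername>Ann</othername>
#     <othername>Craddock</othername> <surname>Brumfield</surname>"
```
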
The Cost:
Fixing the problems with personal names requires a lot of ugly work with regular expressions, amounting to 20-40 hours to cover most cases for authors, editors, and indices. The work for chapter divisions is similar in size. I have little idea how easy the footnote problem will be to fix, since it involves learning DSSSL, "a Scheme-like language" used to process .docbook files. Presumably I'm not the first person to want footnotes to render as footnotes, so perhaps I can find a .dsssl file that does this already. Finally, the typesetting should be a fairly straightforward task, but it requires me to learn a lot more about CSS than the little I currently know. In all, producing truly readable copy is about a month's solid work, which works out to 4-6 months of calendar time at my current pace.

The Side-Effects:
Generating a .docbook file is very similar to generating any other XML file, so extending the printing code to export works in TEI or a FromThePage-specific format should only take another 20-40 hours. Also, DocBook can be used to generate a set of paginated, static HTML files, which users might want for some reason.
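
For instance, a TEI export could reuse the whole pipeline with nothing but a new template. A skeletal sketch (a real TEI header requires far more metadata than this):

```erb
<%# Hypothetical TEI template, parallel to the DocBook one above %>
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title><%= @work.title %></title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <% @work.pages.each do |page| %>
        <div type="page">
          <head><%= page.title %></head>
          <p><%= page.transcription_text %></p>
        </div>
      <% end %>
    </body>
  </text>
</TEI>
```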

The Conclusions:
It's more important that I start transcribing in earnest, shaking out problems with my core product, than that I delay that work to convert endnotes to footnotes. As a result, printing is not slated for the alpha release.

Thursday, September 20, 2007

Money: Projected Costs

What are the costs involved in making FromThePage a going concern? I see these five classes of costs:
  • DNS
  • Initial hosting bills
  • Marginal hosting fees associated with disk usage, CPU usage, or bandwidth served
  • Labor by people with skills that neither Sara nor I possess
  • Labor by people with skills that Sara or I do possess, but lack the time or energy to apply
I can predict the first two with some degree of accuracy. I've already paid for a domain name, and the hosting provider I'm inclined towards costs around $20/month. When it comes to the cost of hosting other people's works for transcription, however, I have no idea at all what to expect.

I have started reading about start-up costs, and this week I listened to the SXSW panel "Barenaked App: The Figures Behind the Top Web Apps" (podcast, slides). What I find distressing about this panel is that the figures involved are so large: $20,000-$200,000 to build an application that costs $2,000-$8,000 per month for hardware and hosting! It's hard to figure out how comparable my own situation is to these companies', since I don't even have a paid host yet.

This big unknown is yet another argument for a slow rollout — not only will alpha testers supply feedback about bugs and usability, but the usage patterns of their collections will give me data to figure out how much an n-page collection with m volunteers is likely to increase my costs. That should provide about half the data I need to decide between a non-commercial open-source model and a purely-hosted model.

Wednesday, September 19, 2007

Progress Report: Subject Categories

It's been several days since I updated this blog, but that doesn't mean I've been sitting idle.

I finished a basic implementation of subject categories a couple of weeks ago. I decided to go with hierarchical categories, as is pretty typical for web content. Furthermore, the N:N categorization scheme I described back in April turned out to be surprisingly simple to implement. There are currently three different ways to deal with categories:

  1. Owners may add, rename, and delete categories within a collection.
  2. Scribes associate or disassociate subjects with a category. The obvious place to put this was on the subject article edit screen, but a few minutes of scribal use demonstrated that this would lead to lots of uncategorized articles. Since projects that don't care about subject indexing will never use the indexes anyway, it seemed safe to add a data-cleanup step to the transcription screen: now, whenever a page contains a new, uncategorized subject reference, saving the transcription brings up a separate screen that shows all the uncategorized subjects for that page, allowing the scribe to categorize any subjects they've created.
  3. Viewers see a category treeview on the collection landing page as well as in the work reader. Clicking a category lists the subjects in that category, and clicking a subject lists links to the pages that refer to it.
The viewer treeview presents the most opportunities, and thus the most difficulties, from a UI perspective. Should a subject link load the subject article instead of the page list? Should it lead to a reader view of the pages that mention the subject? When viewing a screen with only a few pages from one work, should the category tree display only the terms used on that screen, or those used in the work, or those used in all works in the collection? I'm really not sure what the answer is, so for the moment I'm trying to achieve consistency at the cost of flexibility: the viewer always sees the same treeview for all pages within a collection, regardless of context.
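
For the record, the N:N scheme amounts to a join table between subjects and categories, plus a parent_id on categories for the hierarchy. In Rails terms it looks something like this (model and column names are illustrative):

```ruby
# Hierarchical categories with N:N categorization of subject articles.
class Category < ActiveRecord::Base
  belongs_to :collection
  belongs_to :parent,   :class_name => 'Category', :foreign_key => 'parent_id'
  has_many   :children, :class_name => 'Category', :foreign_key => 'parent_id'
  has_and_belongs_to_many :articles   # join table: articles_categories
end

class Article < ActiveRecord::Base
  has_and_belongs_to_many :categories # any subject may sit in many categories
end
```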

Future ideas include:
  • Category filtering for subject graphs: this would really allow analysis of questions like "what was the weather when people held dances?" without the need to wade through a cluttered graph.
  • Viewing the text of all pages that mention a subject in a certain category, gathered on a single screen with those subjects highlighted.

Tuesday, September 11, 2007

Feature: Comments (Part 2: What is commentable?)

Now that I've settled on the types of comments to support, where do they appear? What is commentable?

I've given a lot of thought to this lately, and have had more than one really good conversation about it. In a comment to an earlier post, Kathleen Burns recommended I investigate CommentPress, which supports annotation on a paragraph-by-paragraph level with a spiffy UI. At the Lone Star Ruby Con this weekend, Gregory Foster pointed out the limitations of XML for delimiting commentable spans of text if those spans overlap. As one door opens, another one closes.

What kinds of things can comments (as broadly defined below) be applied to? Here's my list of possibilities, with the really exciting stuff at the end:
  1. Users: See comments on user profile pages at LibraryThing.
  2. Articles: See annotations to article pages at Pepys' Diary.
  3. Collections: Since these serve as the main landing page for sites on FromThePage, it makes sense to have a top-level discussion (albeit hidden on a separate tab).
  4. Works: For large works, such as the Julia Brumfield diaries, discussions of the work itself might be appropriate. For smaller works like letters, annotating the "work" might make more sense than annotating individual pages.
  5. Pages: This was the level I originally envisioned for comment support. I still think it's the highest priority, based on the value comments add to Pepys' Diary, but am no longer sure it's the smallest level of granularity worth supporting.
  6. Image "chunks": Flickr has some kind of spiffy JavaScript/DHTML goodness that allows users to select a rectangle within an image and post comments on that. I imagine that would be incredibly useful for my purposes, especially when it comes to arguing about the proper reading of a word.
  7. Paragraphs within a transcription: This is what CommentPress does, and it's a really neat idea. They've got an especially cool UI for it. But why stop at paragraphs?
  8. Lines within a transcription: If it works for paragraphs, why not lines? Honestly, you run into problems here. Perhaps the best way to handle this is to have comments reference a line element within the transcription XML. Unfortunately, that means wrapping each line in LINE tags just to be able to locate it, and once you've done that, other tags (like my subject links) can't span linebreaks.
  9. Spans of text within a transcription: Again, the nature of XML disallows overlapping elements. As a result, in the text "one two three", it would be impossible to add a comment to "one two" and a different comment to "two three".
  10. Points within a transcription: This turns out to be really easy, since you can use an empty XML tag to anchor the comment. This won't break the XML hierarchy, and you might be able to add an "endpoint" or "startpoint" attribute to the tag that (when parsed by the displaying JavaScript) would act like 8 or 9.
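
To make that concrete, here is what overlapping comments on "one two three" might look like with empty anchor tags; the element and attribute names are invented:

```xml
<!-- Comment 1 covers "one two"; comment 2 covers "two three".  Because
     the anchor elements are empty, the spans overlap without ever
     violating XML's nesting rules. -->
<p>
  <anchor comment="1" point="start"/>one
  <anchor comment="2" point="start"/>two
  <anchor comment="1" point="end"/>three<anchor comment="2" point="end"/>
</p>
```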

Feature: Comments (Part 1: What is a comment?)

Back in June, Gavin Robinson and I had a conversation about comments. The problem with comments is figuring out whom they're directed to. Annotations like those in Pepys' Diary Online are directed towards both the general public and the community of Pepys "regulars". Sites built on user-generated content (like Flickr or Yahoo Groups) necessitate "flag as inappropriate" functionality, in which the general public alerts an administrator to a problem. And Wikipedia overloads both its categorization function and its talk pages to flag articles as "needing attention", "needing peer review", or simply "candidates for deletion".

If you expand the definition of "comments" to encompass all of these — and that's an appropriate expansion in this case, since I expect to use the same code behind the scenes — I see the following general types of comments as applicable to FromThePage documents:
  1. Annotations: Pure annotations are comments left by users for the community of scribes and the general public. They don't have "state", and they don't disappear unless they're removed by their author or a work owner. Depending on configuration, they might appear in printed copies.
  2. Review Requests: These are requests from one scribe to another to proofread, double-check transcriptions, review illegible tags, or identify subjects. Unlike annotations, these have state, in that another scribe can mark a request as completed.
  3. Problem Reports: These range from scribes reporting out-of-focus images to readers reporting profane content and vandalism. These also have state, much like review requests. Unlike review requests, they have a more specific target — only the work owner can correct an image, and only an admin can ban a vandal.
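
Since all three types will share the same code behind the scenes, they can probably live in a single model. A sketch, with placeholder type and status names:

```ruby
# One comments table serving annotations, review requests, and problem
# reports; comment_type distinguishes them, and only the last two types
# carry a meaningful status.
class Comment < ActiveRecord::Base
  belongs_to :author, :class_name => 'User', :foreign_key => 'user_id'
  belongs_to :commentable, :polymorphic => true   # page, work, article...

  TYPES    = %w(annotation review_request problem_report)
  STATUSES = %w(open completed)   # annotations never have a status

  validates_inclusion_of :comment_type, :in => TYPES
end
```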