Thursday, June 28, 2007

Progress Report: Article Links

Well, that was fast!

Four days after announcing that this blog would take a speculative turn (read "stall") while I spent months on article links, linking pages to articles works!

It only took me about ten hours of coding and research to learn to use the XML parser, write the article and link schema, process typed transcriptions into a canonical XML form with appropriate RDBMS updates, and transform that into HTML when displaying a page. In fact, the most difficult part was adding support for WikiMedia-style square-brace links.

After implementing the article links, it only took 15 minutes of transcription to discover that automated linking is a MUST-DO feature, rather than the NICE-TO-HAVE priority I'd assigned it. It's miserable to type [[Sally Joseph Carr Brumfield|Josie]] every time Julia mentions the daughter-in-law who lives with her.

Tuesday, June 26, 2007

Matt Unger on Article Links

Matt Unger has been kind enough to send his ideas on article links within transcription software. I've posted my plans here before, but Matt's ideas are borne of his experience after six months of work on Papa's Diary. Some of his ideas are already included by my link feature, others may inform FromThePage, but I think his specification is worth posting in its entirety. If I don't use his suggestions in my software, perhaps someone else will.
One of my frustrations with Blogger (and I'm sure this would be my frustration with any blog software since my coding skills start and end with HTML 2.0) is that I can't easily create/show the definitions of terms or explain the backgrounds of certain people or organizations without linking readers to other pages.

Here's what my dream software would do from the reader's point of view:

Individual words in each entry would appear in a different color if there was background material associated with it. On rollover, a popup window would give basic background information on the term: if it was a person, there'd be a thumbnail photo, a short bio, and a link to whatever posts involved that person if the reader wanted to see more; if it was the name of an organization, we might see a short background blurb and related links; if it was a slang term from the past, we might see its definition; if it was a reference to a movie, the popup window might play a clip, etc.

Here's how it would work from my perspective:

When I write a post, the software would automatically scan my text and create a link for each term it recognizes from a glossary that I would have defined in advance. This glossary would include all the text, links, and media assets associated with each term. The software would also scan my latest entry and compare it to the existing body of work and pick out any frequently-mentioned terms that don't have a glossary definition and ask me if I wanted to create a glossary entry for them. For example, if I frequently mention The New York Times, it would ask me if I wanted to create a definition for it. I could choose to say "Yes," "Not Now" or "Never." If I chose yes and created the definition, I'd have the option of applying the definition to all previous posts or only to future posts.

The application would also display my post in a preview mode pretty much as a regular reader would see it. If I were to roll over a term that had a glossary term associated with it, I'd see whatever the user would see but I would also have a few admin options in the pop-up window like: Deactivate (in case I didn't want it to be a rollover); Associate A Different Definition (in case I wanted to show another asset than usual for a term). If I didn't do anything to the links, they would simply default to the automatically-created rollovers when I confirm the post submission.

So, that's my dream (though I would settle for the ability to create a pop-up with some HTML in a pinch).
Matt adds:
One thing I forgot to mention, too, is that I would also want to be able to create a link to another page instead of a pop-up once in a while. I suppose this could just be another admin option, and maybe the user would see a different link color or some other visual signal if the text led to a pop-up rather than another page.

Feature: Illegible Tags

Illegible tags allow a scribe to mark a piece of text as illegible with a tag, linking that tag to a cropped image from the text. The application does the cropping itself, allowing the user to draw a rectangle around the illegible chunk on their web-browser and then presenting that as a link for insertion into their transcription as mark-up.

The open questions I have regard the markup itself and its display. The TEI manuscript transcription guidelines refer to an unclear tag that allows the transcriber to mark a passage as unclear, but attach variant readings, each with a confidence level. There's no provision for attaching an image of the unclear text. In a scenario where the reader is provided access to the unclear text, do you think there's anything to be added from displaying the probable reading? How about many probable readings?

I've been persuaded by Papa's Diary that probable readings are tremendously useful, so I should probably rephrase illegible as unclear. I want to encourage scribes to be a free as possible with the tags, increasing the transparency of their transcriptions. In fact, the Papa's Diary usage — in which Hebrew is displayed as an image, but transcribed into Latin characters — makes me think that unclear is not sufficiently generalized. I may need to either come up with other tags for image links, or generalize unclear into something with a different name.

Implementing the image manipulation code would not be difficult, except that a lot of work needs to be done in the UI, which is not my strength.

UI
  1. Scribe hits an 'unclear' icon on the page transcription screen.
  2. This changes their mouse pointer to something that looks appropriate for cropping.
  3. They click on the image and start dragging the mouse.
  4. This displays a dashed-rectangle on the image that disappears on mouseUp.
  5. The mouse pointer changes to a waiting state until a popup appears displaying the cropped image and requesting text to insert.
  6. A paragraph of explanatory text would explain that the text entry field should contain a possible reading for the unclear text.
  7. The text entry field defaults to "illegible", so users with low-legibility texts will be able to just draw rectangles and hit enter when the popup appears.
  8. Hitting OK inserts a tag into the transcription. Hitting cancel returns the user to their transcription.
Implementation
  1. Clicking the 'unclear' icon triggers an 'I am cropping now' flag on the browser and turns off zoom/unzoom.
  2. onMouseDown sets a begin coordinate variable
  3. onMouseUp sets an end coordinate variable, launches an AJAX request to the server, and sets the pointer state to waiting.
  4. The server receives a requst of the form 'create an illegible tag for page X with this start XY and this end XY.
  5. It loads up the largest resolution image of that manuscript page, transposes the coordinates if the display image was zoomed, then crops that section of the image.
  6. A new record is inserted into the cropped_image table for the page and the cropped image itself is saved as a file.
  7. RJS tells the browser to display the tag generation dialog.
  8. The tag generation dialog inserts the markup <illeg image="image_id.jpg">illegible</illeg> at the end of the transcription.
  9. When the transcription is saved, we parse the transcription looking for illegible tags, deleting any unused tags from the database and updating the db entries with the user-entered text.
Data Model
I'll need a new cropped_image table with a foreign key to page, a string field for the display text, a filename for the image, and perhaps a sort order. My existing models will change as follows to support the new relationship:
cropped_image belongs_to :page
page has_many :cropped_images


Viewer
At display time, the illegible tag is transformed into marked-up HTML. The contents of the HTML should be whatever the user entered, but a formatting difference should indicate that the transcription is uncertain -- probably italicizing the text would work. The cropped images need to be accessible to viewers -- editorial transparency is after all the point of image-based transcription. I'm tempted to just display any cropped images at the bottom of the transcription, above the user-annotations. I could then put in-page anchor tags around the unclear text elements, directing the user to the linked image.

Print
The idea here is to display the cropped images along with the transcription as footnotes. The print version of a transcription is unlikely to include entire page images, so these footnotes would expose the scribe's decision to the reader's evaluation. In this case I'd want to render the unclear text with actual footnote numbers corresponding to the cropped image. Perhaps the printer should also have the option of printing an appendix with all cropped images, their reading, and the page in which they appear.

Monday, June 25, 2007

Matt Unger's Papa's Diary Project

Of all the online transcriptions I've seen so far, Papa's Diary Project does the most with the least. Matt Unger is transcribing and annotating his grandfather's 1924 diary, then posting one entry per day. So far as I'm able to tell, he's not using any technology more complicated than a scanner, some basic image processing software, and Blogger.

Matt's annotations are really what make the site. His grandfather Harry Scheurman was writing from New York City, so information about the places and organizations mentioned in the diary is much more accessible than Julia Brumfield's corner of rural Virginia. Matt makes the most of this by fleshing out the spare diary with great detail. When Scheurman sees a film, we learn that the theater was a bit shabby via an anecdote about the Vanderbilts. This exposition puts the May 9th single-word entry "Home" into the context of the day's news.

More than providing a historical backdrop, Matt's commentary provides a reflective narrative on his grandfather's experience. This narration puts enigmatic interactions between Scheurmann and his sister Nettie into the context of a loving brother trying to help his sister recover from childbirth by keeping her in the dark about their father's death. Matt's skill as a writer and emotional connection to his grandfather really show here. I've found that this is what keeps me coming back.

This highlights a problem with collaborative annotation — no single editorial voice. The commenters at PepysDiary.com accomplish something similar, but their voices are disorganized: some pose queries about the text, others add links or historical commentary, while others speculate about the 'plot'. There's more than enough material there for an editor to pull together something akin to Papa's Diary, but it would take a great deal of work by an author of Matt Unger's considerable writing skill.

People with more literary gifts than I possess have reviewed Papa's Diary already: see Jewcy, Forward.com, and Booknik (in Russian). Turning to the technical aspects of the project, there are a number of interesting effects Matt's accomplished with Blogger.

Embedded Images
Papas diary uses images in three distinct ways.
1. Each entry includes a legible image of the scanned page in its upper right corner. (The transcription itself is in the upper left corner, while the commentary is below.)
2. The commentary uses higher-resolution cropped snippets of the diary whenever Scheurmann includes abbreviations or phrases in Hebrew (see May 4, May 14, and May 13). In the May 11 entry, a cropped version of an unclear English passage is offered for correction by readers.
3. Images of people, documents, and events mentioned in the diary provide more context for the reader and make the site more attractive.

Comments
Comments are enabled for most posts, but don't seem to get too much traffic.

Navigation
Navigation is fairly primitive. There are links from one entry to others that mention the same topic, but no way to show all entries with a particular location or organization. It would be nice to see how many times Scheurman attended a JNF meeting, for example. Maybe I've missed a category list, but it seems like the posts are categorized, but there's no way to browse those categories.

Lessons for FromThePage
1. Matt's use of cropped text images — especially when he's double-checking his transcription — is very similar to the illegible tag feature of FromThePage. It seems important to be able to specify a reading, however, akin to the TEI unclear element.
2. Images embedded into subject commentary really do make the site more engaging. I hadn't planned to allow images within subject articles, but perhaps that's short-sighted.

Saturday, June 23, 2007

Progress Report: Transcription Access Controls

I've just checked in transcription authorization, completing the security tasks a work owner may perform. I decided that owners should be able to mark work as unrestricted -- any logged-in user may transcribe that work. Otherwise, owners specify which users may transcribe or modify transcriptions.

These administrative features have been some of the quickest to implement, even if they're some of the least exciting. My next task will involve a lot of research into XML parsers for Ruby and how to use them. That — coupled with a restricted amount of time to devote to coding — means I probably won't have much progress to report for the next couple of months. I'll keep blogging, but my posts may take a more speculative turn.

Monday, June 18, 2007

Rails: A Short Introduction to before_filter

I just tried out some simple filters in Rails, and am blown away. Up until now, I'd only embraced those features of Rails (like ActiveRecord) that allowed me to omit steps in the application without actually adding any code on my part. Filters are a bit different — you have to identify patterns in your data flow and abstract them out into steps that will be called before (or after) each action. These calls are defined within class definitions, rather than called explicitly by the action methods themselves.

Using Filters to Authenticate
Filters are called filters because they return a Boolean, and if that return value is false, the action is never called. You can use the the logged_in? method of :acts_as_authenticated to prohibit access to non-users — just add before_filter :logged_in? to your controller class and you're set! (Updated 2010: see Errata!)

Using Filters to Load Data
Today I refactored all my controllers to depend on a before_filter to load their objects. Beforehand, each action followed a predictable pattern:
  1. Read an ID from the parameters
  2. Load the object for that ID from the database
  3. If that object was the child of an object I also needed, load the parent object as well
In some cases, loading the relevant objects was the only work a controller action did.

To eliminate this, I added a very simple method to my ApplicationController class:
def load_objects_from_params
if params[:work_id]
@work = Work.find(params[:work_id])
end
if params[:page_id]
@page = Page.find(params[:page_id])
@work = @page.work
end
end
I then added before_filter :load_objects_from_params to the class definition and removed all the places in my subclasses where I was calling find on either params[:work_id] or params[:page_id].

The result was an overall 7% reduction in lines of code -- redundant, error-prone lines of code, at that!
Line counts before the refactoring:
7 app/controllers/application.rb
19 app/controllers/dashboard_controller.rb
10 app/controllers/display_controller.rb
138 app/controllers/page_controller.rb
87 app/controllers/transcribe_controller.rb
47 app/controllers/work_controller.rb
[...]
1006 total
And after:
28 app/controllers/application.rb
19 app/controllers/dashboard_controller.rb
2 app/controllers/display_controller.rb
108 app/controllers/page_controller.rb
69 app/controllers/transcribe_controller.rb
34 app/controllers/work_controller.rb
[...]
937 total
In the case of my (rather bare-bones) DisplayController, the entire contents of the class has been eliminated!

Perhaps best of all is the effect this has on my authentication code. Since I track object-level permissions, I have to read the object the user is acting upon, check whether or not they're the owner, then decide whether to reject the attempt. When the objects may be of different types in the same controller, this can get a bit hairy:
if ['new', 'create'].include? action_name
# test if the work is owned by the current user
work = Work.find(params[:work_id])
return work.owner == current_user
else
# is the page owned by the current user
page = Page.find(params[:page_id])
return page.work.owner == current_user
After refactoring my ApplicationController to call my load_objects_from_params method, this becomes:
return @work.owner == current_user

Friday, June 8, 2007

Questions on Access to Sensitive Text

Alice Armintor Walker responded offline to my post on sensitive tags, asking:
Do you think you might want to allow sensitive text access to some viewers, but not unregistered users? An archivist might feel better about limiting access to a specific group of people.
That's an interesting point, and thanks for the question!

My thought on that had been something like this: There are people the owner trusts with sensitive info, and people they don't. That first group are the scribes, who by definition can see everything. The second group is everybody else.

But is that too simple? Are there people the owners trust to view sensitive info, but not to transcribe the text? The answer may be yes. I just don't know enough about archival practice. In the real-world model, FromThePage-style transcription doesn't exist. So it's quite reasonable that transcribing would be a separate task, requiring more trust than allowing access to view sensitive material.

It'd be easy enough to allow the owner to grant permission to some viewers — just an extra tab in the work config screen, with a UI almost identical to the scribe-access list.

Money: Principles

I haven't spent much time on this blog talking about vision or theory. Perhaps that's because blogger theorizing reminds me too much of corporate vision statements, or maybe because I've found it less than helpful in other people's writing. However, once you start talking about money and control, you need to figure out what you're not willing to do.

There a few principles which I am unlikely to compromise -- these comprise the constraints around any funding or pricing decisions.
  1. Free and open access to manuscript transcriptions.
    The entire point of the project is to make historical documents more accessible. Neither I nor anyone else running the software should charge people to view the transcriptions.
  2. Encourage altruistic uses.
    If a project like Soldier Studies wanted to host a copy of FromThePage, I can't imagine making them pay for a license. Charging for support, enhancements, or hosting might be a different matter, since those affect my own pocketbook.
    The same would apply to institutional users.
  3. No profit off my work without my consent.
    This may be an entirely self-serving principle, but it's better to go ahead and articulate it, since it will inform my decision-making process whether I like it or not. One of the things I worry about is that I'll release the FromThePage software as open-source, then find that someone — a big genealogy company or a clever 15-year-old — is selling it, competing with whatever hosting service I might run.

Thursday, June 7, 2007

Feature: Sensitive Tags

Sensitive tags allow passsages within a transcription to be removed from public view — visible to scribes and work owners, but suppressed in printouts or display to viewers. Why on earth is this desirable?

At some level, collaborative software is about persuasion. If the mid-term goal of this project is to get the people with old letters and diaries stashed in their filing cabinets to make those documents accessible, I have to overcome their objections.

Informal archivists have the same concerns institutional archivists do. In many cases their records are recent enough to have an impact on living people. Julia Brumfield may have died seventy years ago, but her diaries record the childhood and teen-aged years of people still living today. Would you want the comings and goings of your fifteen-year-old self published? I thought not.

The approach many family archivists take to this responsibility is to guard access to their data. My father, for example, is notably unenthusiastic about making Julia Brumfield's diaries visible to the public. If you force a family archivist to expose works they upload to everyone in their entirity, they simply won't share their works.

This is where sensitive tags come in. At any point, a scribe may surround a passage of transcription with <sensitive>. When the display code renders a page of transcription, it replaces the text within the sensitive tags with a symbol or note indicating that material has been elided. (This symbol should probably be set when the work is configured, and default to some editorial convention.)

condition
The sensitive tag has one plaintext attribute: condition. This represents a condition to be satisfied for the tag's contents to be made visible to the public. Thus
<sensitive condition="Uncle Jed has given permission for this to be printed"> I don't like that girl Jed's seeing.</sensitive>
would be rendered in display and print as
[elided]
and would add a new option to the owner's work configuration page:
Has 'Uncle Jed has given permission for this to be printed' occurred yet?
Checking this box would either remove the markup around the sensitive text or cause the text to be rendered normally when viewers see or print the transcription.

until
An alternative to the condition attribute is a date attribute named something like until. This wouldn't require additional intervention by the work owner to lift the suppression of sensitive text: upon rendering, compare the current date to the until date and decide whether to render the text.

It strikes me that archivists have probably developed guidlines for this problem, but I've had a lot of problems finding the kind of resources on archival practices that exist online for digitization and transcription. Any pointers would be welcome.

Wednesday, June 6, 2007

Money: The Current Situation

For now, FromThePage has followed the classic funding model of the basement inventor: The lone developer — me — holds a day job and works on the project in his spare time. This presents some challenges:
  1. At times my job occupies all my spare time and energy, so I accomplish nothing whatsoever on the project for months on end. This is no tragedy, as the job is really very rewarding. However, it certainly doesn't get FromThePage out the door.
  2. At other times, commitments to family and other projects occupy my spare time and energy, to the same effect.
  3. In the best of cases, development is throttled by my spare time. For a father working full-time, this means I can devote a sustainable maximum of 10 hours per week to my project. A spike in development to meet a deadline might raise that to two months' worth of 30 hours-a-week, which would exhaust the resources of my wife, daughter, in-laws, and myself. For developers not blessed with a spouse who is as capable and willing to develop web-apps as mine, this number would be lower. The recent, blazing rate of progress has been due largely to a configuration of family and work commitments optimal to project development — attending a couple of out-of-town weddings in a row would kill it.
  4. This limitation may not only slow the pace of development, it may prevent some necessary tasks. If I do a launch with a large number of active users, I'll probably need to take a week or two off work to deal with the demands that presents. Avoiding an abrupt vacation request will force me into a more gradual launch schedule.
All these constraints present some real opportunities as well. I've found myself quite a bit more effective per hour of coding time than I would be if this were a full time job. I suppose that's attributable to my awareness that each hour of FromThePage development is precious and shouldn't be wasted, combined with the ability to spend hours planning my work while I'm driving, cooking, or riding the elevator. The slow release schedule may force me to do effective usability studies, slowing down the cycle between user feedback and its implementation.

Next: The Alternatives

Feature: User Roles

I've moved into what is probably the least glamorous phase of development: security, permissions, and user management.

There are four (or five) different roles in FromThePage, with some areas of ambiguity regarding what those users are allowed to do.
  • Admins are the rulers of a software installation. There are only a few of them per site, and in a hosted environment, they hold the keys to the site. Admins may manage anything in the system, period.
  • Owners are the people who upload the manuscript images and pay the bills. They have entered into some sort of contractual relationship with the hosting provider, and have the power to create new works, modify manuscript page images, and authorize users to help transcribe works. In theory, they'll be responsible for supporting the scribes working on their works.
  • Scribes may modify transcriptions of works they're authorized to transcribe. They may create articles and any other content for those works. They are the core users of FromThePage, and will spend the most time using the software. If the scribes aren't happy, ain't nobody happy.
  • Viewers are registered users of the site. They can see any transcription, navigate any index, and print any work.
  • Non-users are people viewing the site who are not signed in to an account. They probably have the same permissions as viewers, but they will under no circumstances be allowed to create any content. I've had enough experience dealing with blog comments to know that the minute you allow non-CAPTCHA-authorized user-created content, you become prey to comment spammers who will festoon your website with ads for snake oil, pornography, and fraudulent mortgage refinance offers. [June 8 Update: Within thirty-six hours of publication, this very post was hit by a comment spammer peddling shady loans, who apparently was able to get through Blogger's CAPTCHA system.]
There are two open questions regarding the permissions granted to these different classes of user:
  1. Should viewers see manuscript images? Serving images will probably consume more bandwidth than all other uses combined. For manuscripts containing sensitive information, image service is an obvious security breach. The only people who really need images (aside from those who find unclear tags with links to cropped images insufficient) are scribes.
  2. Should viewers add comments? For the reasons outlined above, I think the answer is yes, at least until it's abused enough for me to turn off the capability.
For those who have never programmed enterprise software before, the reason security gets such short shrift is that fundamentally it's about turning off functionality. Before you get to the security phase of development, you have to have already developed the functionality you're disabling. By definition, it's an afterthought.

Sunday, June 3, 2007

Progress Report: Work and Page Administration

As I said earlier, I've been making really good progress writing the kinds of administrative features that Rails excels at. Despite my digression to implement zoom, in the last two weeks I've added code for:
  • Converting a set of titled images into a work composed of pages
  • Creating a new, blank work
  • Editing title and description information about a work
  • Deleting a work
  • Viewing the list of pages within a work and their status
  • Re-ording those pages within the list
  • Adding a new page
  • Editing or deleting a page
  • Rotating, and re-scaling the image on a page
  • Uploading a replacement image for a page
None of these are really very sexy, but they're all powerful. The fact that I was able to implement page re-ordering, new work creation, new page creation, and page deletion from start to finish during my daughter's nap today — in addition to refactoring my tab UI and completing image uploads — reminds me of why I love Ruby on Rails.

Saturday, June 2, 2007

Rails: acts_as_list Incantations

In my experience, really great software is so consistent that when you're trying to do something you've never done before, you can guess at how it's done and at least half of the time, it just works. For the most part, Ruby on Rails is like that. There are best practices baked into the framework for processes I never even realized you could have best practices for, like the unpleasant but necessary task of writing upgrade scripts. Through Rails, I've learned CS concepts like optimistic locking that were entirely new to me, and used them effectively after an hour's study and a dozen lines of code.

So it's disappointing and almost a little surprising when I encounter a feature of the Rails framework that doesn't just work automatically. acts_as_list is such a feature, requiring days of reading source files and experimenting with logfile to figure out the series of magical incantations needed to make the darn thing work.

The theory behind acts_as_list is pretty simple. You add an integer position column to a database table, and add acts_as_list to the model class it maps to. At that point, anytime you use the object in a collection, the position column gets set with the sort order of that object relative to everything else in the collection. List order is even scoped: a child table can be ordered within the context of its parent table by adding :scope => :parent_model_name to the model declaration.

In practice, however, there are some problems I ran into which I wasn't expecting. Some of them are well documented in Agile Web Development with Rails, but some of them required a great deal of research to solve.

List items appear in order of record creation, not in order of :position

If an ImageSet has_many TitledImages, and TitledImage acts_as_list over the scope of an ImageSet, you'd expect ImageSet.titled_images to return a collection in the order that's set in the position column, right? This won't happen unless you modify the parent class definition (ImageSet, in this case) to specify an order column on your has_many declaration:
has_many :titled_images, :order => :position

Pagination loses list order


Having fixed this problem, if you page through a list of items, you'll discover that the items once again appear in order of record creation, ignoring the value set in position. Fixing this requires you to manually specify the order for paged items using the :order option to paginate:
    @image_pages, @titled_images = paginate(:titled_image,
{:per_page => 10,
:conditions => conditions,
:order => 'position' })
Adjusting list order doesn't update objects in memory

Okay, this one took me the most time to figure out. acts_as_list has nothing to do with the order of your collection. Using array operators to move elements around in the collection returned by ImageSet.titled_image does absolutely nothing to the position column.

Worse yet, using the acts_as_list position modifiers like insert_at will not affect the objects in memory. So if you've been working with a collection, then call an acts_as_list method that affects its position, saving the elements that of collection will overwrite the position with old values. The position manipulation operators are designed to minimize SQL statements executed: among other side-effects, they circumvent optimistic locking. You must reload your collections after doing any list manipulation.

Moving list items from one parent object to another doesn't reorder their positions

Because acts_as_list pays little attention to order within a collection, removing an item from one parent and adding it to another requires explicit updates to the position attribute. You should probably use remove_from_list to zero out the position before you transfer items from one parent to another, but since this reshuffles position columns for all other items in the list, I'd be cautious about doing this within a loop. Since I was collating items from two different lists into a third, I just manually set the position:
    0.upto(max_size-1) do |i|
# append the left element here
if i < left_size
new_set.titled_images << left_set.titled_images[i]
end
# append the right element
if i < right_size
new_set.titled_images << right_set.titled_images[i]
end
end
# this has no effect on acts as list unless I do it manually
1.upto(new_set.titled_images.size) do |i|
new_set.titled_images[i-1].position=i
end
In my opinion, acts_as_list is still worth using — in fact, I'm about to work with its reordering functionality a lot. But I won't be surprised if I find myself experimenting with logfiles again.

Friday, June 1, 2007

Conversations about Transcription

Gavin Robinson has been exchanging emails with the UK National Archives this week. He's trying to convince the archivists to revise their usage restrictions to allow quotation and reuse of user-contributed content.

Gavin recognizes that the NA is doing a difficult dance with their user community:
[S]ome people who have valuable knowledge would be put off from contributing if they had to give it away under GDL, and might prefer a non-exclusive licence which allows them to retain more rights. For example, the average Great War Forum member doesn’t tend to think in a Web 2.0 kind of way. But then they might be put off by the very idea of a wiki. Including as many people as possible has probably involved some difficult decisions for the NA.

This puts me in mind of a discussion over at Dan Lawyer's blog last year. Amateurs who are willing to collaborate on research feel a strong sense of ownership over the results of their labor. They don't want other people taking credit for it, they don't like other people making a profit from it, and they don't like seeing it misused*. Wikipedia proves that people will contribute to a public-domain project, but I suspect that family history, being so much more personal, is a bit different. Several of Dan's Mormon commenters feel uncomfortable entrusting anybody other than the LDS church with their genealogical conclusions. Of course, many non-Mormons feel uncomfortable providing genealogical data specifically to the LDS church. Getting these two sets of the public to collaborate must be challenging.

This same week, Rantings of a Civil War Historian published a fascinating article on "Copyright and Unpublished Manuscript Materials". The collectors, amateurs, and volunteers who buy letters and diaries on EBay feel a similar sense of ownership over their documents. The comments range over the legal issues involved in posting transcripts of old documents, as well as the problems that occur when people with no physical access to the sources propagate incorrect readings. Many of the commenters have done work with libraries that require them to agree to conditions before they use their materials, and others work in IP Law, so the discussion is very high quality.

* I think the most prominent fear of misuse is that shades of nuance will be lost, hard-fought conclusions will be over-simplified, and most especially that errors will propagate through the net. In my own dabbling with family history, I've seen a chunk of juvenile historical fiction widely quoted as fact. (No, James Brumfield did not eat the first raw oysters at Jamestown!) Less egregious but perhaps more relevant to manuscript transcription are the misspellings and misreadings committed by scribes without the context to determine whether "Edmunds" should be "Edmund". Dan discusses a technical solution to this in his fourth point: Allow the user to say “I think” or “Maybe” about their conclusions. That's something I should flag as a feature for FromThePage.