Wednesday, April 25, 2007

Feature: Article Links

Both Wikipedia and Pepys Diary Online allow hyperlinks to be embedded into articles. These links are tracked bi-directionally, so that by viewing an article, you can see all the pages that link to it. This is an easy enough thing to do, once you allow embedded links:


The text markup to insert a link needs to be simple and quick, yet flexible if the scribe has the need and the skills to do complex operations. The wiki link syntax used in Wikipedia is becoming common, so it seems like as good a starting place as any: double square brackets placed around text make that text a link, e.g. [[tobacco]] links to an article Tobacco. In situations where the link target is different from the displayed text, an optional display text value is supplied after a pipe, so that [[Benjamin Franklin Brumfield|Ben]] links "Ben" to the article on Benjamin Franklin Brumfield. This user-friendly markup can be converted to my intermediate XML markup for processing on save and display.
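A conversion pass like this can be sketched with a single regular expression. The <link> element and its target attribute below are my own invention, standing in for whatever the intermediate XML markup actually looks like:

```python
import re

# Matches [[target]] and [[target|display]] wiki-style links.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def wiki_to_xml(text):
    """Replace wiki links with hypothetical <link> elements."""
    def repl(match):
        target = match.group(1).strip()
        # Fall back to the target itself when no display text is given.
        display = (match.group(2) or target).strip()
        return '<link target="{}">{}</link>'.format(target, display)
    return LINK_RE.sub(repl, text)

print(wiki_to_xml("Saw [[Benjamin Franklin Brumfield|Ben]] about [[tobacco]]."))
```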

Another mechanism for inserting links is a sort of right-click dropdown: select the linked text, right click it (or click some icon), and you see a pop-up single-select of the articles in the system, with additional actions of creating new articles with the selected text as a title, or creating a new article with a different title. Choosing/creating an article would swap out the selected text with the appropriate markup. It is debatable whether such a feature would be useful without significant automation (below).


When text is saved, we walk through the markup looking for link tags. When one is encountered, it's inserted into the RDBMS, recording the page the link is from, the article the link is to, and the actual display text. Any link target articles that do not already exist will be created with a title, but blank contents. Reading this RDBMS record makes indexing almost trivial — more on this later.
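A minimal sketch of this save-time pass, using an in-memory SQLite database. The two-table schema and the <link target="...">display</link> element form are both invented for illustration, not the actual data model:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT UNIQUE, body TEXT);
CREATE TABLE page_links (page_id INTEGER, article_id INTEGER, display_text TEXT);
""")

def save_links(page_id, xml_text):
    """Walk the markup for link tags, recording source page, target article,
    and display text; create stub articles for new targets."""
    for target, display in re.findall(
            r'<link target="([^"]+)">([^<]+)</link>', xml_text):
        # Stub article with a title but blank contents, if it's new.
        conn.execute(
            "INSERT OR IGNORE INTO articles (title, body) VALUES (?, '')",
            (target,))
        (article_id,) = conn.execute(
            "SELECT id FROM articles WHERE title = ?", (target,)).fetchone()
        conn.execute("INSERT INTO page_links VALUES (?, ?, ?)",
                     (page_id, article_id, display))
    conn.commit()

save_links(1, 'Saw <link target="Benjamin Franklin Brumfield">Ben</link> today.')
```

Reading page_links back then gives both directions of the link for free, which is what makes the back-link display and the indexing almost trivial.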

Automated linking

The reason for storing the display text in the link is to automate link creation. Diaries in particular include a frequently-repeated core of people or places that will be referred to with the same text. In my great-grandmother's diaries, "Ben" always refers to "Benjamin Franklin Brumfield Sr.", while "Franklin" refers to "Benjamin Franklin 'Pat' Brumfield Jr.". Offering to replace those words with the linked text could increase the ease of linking and possibly increase the quality of indexes.

I see three possible UIs for this. The first one is described above, in which the user selects the transcribed text "Ben", brings up a select list with options, and chooses one. The record of previous times "Ben" was linked to something can winnow that list down to something useful, a bit like auto-completion. Another option is to use a JavaScript observer to suggest linking "Ben" to the appropriate article whenever the scribe types "Ben". This is an AJAX-intensive implementation that would be ungainly under dial-up situations. Worse, it could be incredibly annoying, and since I'm hoping to minimize user enragement, I don't think it's a good idea unless handled with finesse. Maybe append suggested links to a sidebar while highlighting the marked text, rather than messing with popups? I'll have to mull that one over. The third scenario is to do nothing but server-side processing. Whenever a page is submitted, direct the user to a workflow in which they approve/deny suggested links. This could be made optional rather than integrated into the transcription flow, as Wikipedia's "wikify" feature does.
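Whichever UI wins, the suggestion logic underneath is the same: rank the articles a given display text has previously been linked to. A toy sketch, with the link history invented for illustration:

```python
from collections import Counter, defaultdict

# Prior (display_text, article) pairs, as recorded at save time.
# This sample data is invented for the example.
history = [
    ("Ben", "Benjamin Franklin Brumfield Sr."),
    ("Ben", "Benjamin Franklin Brumfield Sr."),
    ("Ben", "Benjamin Franklin 'Pat' Brumfield Jr."),
    ("Franklin", "Benjamin Franklin 'Pat' Brumfield Jr."),
]

counts = defaultdict(Counter)
for display, article in history:
    counts[display][article] += 1

def suggest(display, n=3):
    """Most frequent past targets for this display text, best first."""
    return [article for article, _ in counts[display].most_common(n)]

print(suggest("Ben"))  # Sr. ranks first: linked twice vs. once
```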


In my data model and UI, I'm differentiating between manuscript pages and article pages -- the former actually representing a page from the manuscript with its associated image, transcription, markup and annotation, while the latter represents a scribe-created note about a person, place, or event mentioned in one or more manuscript pages. The main use case of linking is to hotlink page text to articles. I can imagine linking articles to other articles as well, as when a person has their family listed, but it seems much less useful for manuscript pages to link to each other. Does the word "yesterday" really need to link March 14 to the page on March 13?


Anonymous said...

With the regimental history I'm working on I'm taking a slightly different approach (and I'll probably carry it on with the letters if it works out). I've been breaking the tagging down into stages - so far I've just got the structure marked up, and the next step is to start tagging names of people and places. On the first pass I'm just going to insert name tags to identify that these bits of text represent names, without making any claims about their identity. On the second pass I'll go through the names and insert attributes for database keys and/or standardised forms which can be used to link to article pages giving biographical details, or just to index pages which list all the pages where that person is mentioned (I don't have to decide this yet as I'm keeping it independent of the XML markup, and it should be easy to change it in future).

Once I've got the names marked as names in the first pass, I'm thinking I could do the second pass by extracting all the name elements into a database, which would allow me to sort them and apply attributes in batches then (I hope) write the data back into the XML file (I have to make sure each name element has a unique id before pulling them out).
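That round trip can be sketched with the standard library's ElementTree: pull every uniquely-identified name element into a worklist, decide on identities in batches elsewhere, then write the decisions back into the XML by id. The key attribute and the flat element names here are simplifications, not exact TEI:

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = ET.fromstring(
    '<p><name xml:id="n1">Ben</name> went to see '
    '<name xml:id="n2">Franklin</name>.</p>')

# Pass one's output: extract the name elements, keyed by unique id.
worklist = {el.get(XML_ID): el.text for el in doc.iter("name")}

# Decisions made in the database, in batches: id -> standardised key.
# These keys are invented for the example.
decisions = {"n1": "brumfield_ben_sr", "n2": "brumfield_franklin_jr"}

# Pass two: write the attributes back into the XML by id.
for el in doc.iter("name"):
    if el.get(XML_ID) in decisions:
        el.set("key", decisions[el.get(XML_ID)])

print(ET.tostring(doc, encoding="unicode"))
```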

The idea is to make it scalable: the process can be broken down into small steps, which can be done by lots of different people with different skills. People transcribing manuscript or proofreading OCR text won't need to identify names and link them to other names or articles.

Ben W. Brumfield said...

It sounds like the link/annotation process we're both dealing with can be genericized into these steps:

Identification: Find parts of the text that are candidates for index/annotation markup. Thus "Ben went to the store" has indexable elements Ben and store, so the text markup encodes that as: *Ben* went to the *store*. (I'm using asterisks here as shorthand for "marked up in some way".)

Normalization/De-duping: Match each candidate with the corresponding unique article/indexable thing. In my example, Ben->Benjamin Franklin Brumfield Sr., store->Renan General Store.

Typing: Classifying markup or the markup's referent as places, names, events, etc. In my example, Ben refers to a person, store refers to a place.
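The three steps above can be sketched as data transformations, with the candidate list, authority mapping, and type assignments all invented for the example:

```python
text = "Ben went to the store"

# 1. Identification: candidate spans found in the text.
candidates = ["Ben", "store"]

# 2. Normalization/de-duping: map each candidate to a unique referent.
authority = {"Ben": "Benjamin Franklin Brumfield Sr.",
             "store": "Renan General Store"}

# 3. Typing: classify each referent as a person, place, event, etc.
types = {"Benjamin Franklin Brumfield Sr.": "person",
         "Renan General Store": "place"}

for candidate in candidates:
    referent = authority[candidate]
    print("{} -> {} ({})".format(candidate, referent, types[referent]))
```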

Your approach decouples identification from normalization completely, because the extraction step is entirely separate from the identification. In my case, I'm recording the identified candidates in the RDBMS whenever they're identified, which forces me to do some sort of normalization as I go, even if it's just creating stub articles with the transcribed display text as a title.

I think so long as I don't require normalization as users type, the net result is the same -- no matter what, I'll have to implement a deduping/cleanup step that does what you describe in your second paragraph.

One thing I'm interested in is your approach to typing -- it seems that relying on TEI's <name> tags forces typing to occur at the candidate step, and forces a 1:N classification scheme. I'm not comfortable with this, and suspect that I'd rather use the link tags in TEI instead. I'll have another post up on classification soon, talking about this.

Anonymous said...

That's a useful way of classifying what we're doing. I think I'm effectively doing the identification and typing at the same time, then doing the de-duping afterwards. Part of my thinking is that there doesn't have to be a fixed outcome. De-duping is an optional extra which can be left out if time and budget make it unviable. Even if the de-duping is left out, the typing has some value in itself, e.g. if users want to search for a string but only within name elements.

I can see that my approach would be inadequate for what you're doing because so far I'm only interested in identifying people, places, and dates. I haven't even thought about creating a subject index.

Ben W. Brumfield said...

It seems to me that you can still do subject indexing, even if your viewer doesn't run off an RDBMS. TEI has some link tags that would allow back-linking from an index entry in the back matter to the unique ID of the name/place tag. You could generate these from your RDBMS at the same time as you do your de-duping -- in both cases you're traversing the TEI XML and modifying the attributes of your name/place elements, so why not generate index elements at the same time?
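The idea of generating index entries during the same traversal can be sketched like so; the element names and key values are simplified stand-ins, not valid TEI:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = ET.fromstring(
    '<text><p><name xml:id="n1" key="brumfield_ben_sr">Ben</name> and '
    '<name xml:id="n2" key="brumfield_ben_sr">Ben</name></p></text>')

# During de-duping, collect every keyed name element by its unique id.
entries = defaultdict(list)
for el in doc.iter("name"):
    entries[el.get("key")].append(el.get(XML_ID))

# Emit back-matter index entries pointing back at those ids.
back = ET.SubElement(doc, "back")
for key, ids in sorted(entries.items()):
    item = ET.SubElement(back, "item")
    item.set("corresp", " ".join("#" + i for i in ids))
    item.text = key

print(ET.tostring(doc, encoding="unicode"))
```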

This brings up the question of what you're planning to use as your viewer. Once text is transcribed, marked up, and annotated to your satisfaction, how is it accessible? I've got the clear requirement for a printed, book-style output, which I'm working toward. I'm also working towards a less well-defined online viewer as a side-effect of building the transcriber.

In a sense, a TEI document is a reasonable output format -- enough so that I feel like "output as tei" is a valid third-stage feature. But I wouldn't want that to be the only way of accessing the transcriptions.

Anonymous said...

To be honest I haven't put much thought into presentation yet. I'm assuming (perhaps naively!) that once I have TEI XML it should be trivial to transform it into HTML or PDF or put an AJAX front end on it.

One thing I haven't thought about enough is page images. I think it's important to provide a link to images wherever possible, as no transcription is going to be perfect. For letters and photos I might just use Flickr and put the URLs in an attribute somewhere, but that's going to be impractical for a 200 page book.