Sunday, April 5, 2009

Feature: Full Text Search/Article Link integration

In the last couple of weeks, I've implemented most of the features in the editorial toolkit. Scribes can identify unannotated pages from the table of contents, readers can peruse all pages in a collection linked to a subject, and users can perform a full text search.

I'd like to describe the full text search in some detail, since there are some really interesting things you can do with the interplay between searching and linking. I also have a few unresolved questions to explore.

Basic Search

There are a lot of technologies for searching, so my first task was research. I decided on the simple MySQL fulltext search over Solr, Sphinx, and acts_as_ferret because all the data I wanted to search was located within the PAGES table. As a result, implementing it required only a migration script, a text input, and a new controller action. You can see the result on the right-hand side of the collection homepage.

Article-based Search

Once basic search was working, I could start integrating the search capability with subject indexes. Since each subject link contains the wording in the original text that was used to link to a subject, that wording can be used to seed a text search. This allows an editor to double-check pages in a collection to see if any references to a subject have been missed.

For example, Evelyn Brumfield is a grandchild who is mentioned fairly often in Julia's diaries. Julia spells her name variously as "Evylin", "Evelyn", and "Evylin Brumfield". So a link from the article page performs a full text search for "Evylin Brumfield" OR Evelyn OR Evylin.
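As a rough illustration of how the wordings from a subject's links could be combined into one query, here is a minimal Python sketch. The function name build_search_query and the ordering rule are assumptions made for this example, not the application's actual code. It drops duplicate wordings, quotes multi-word wordings so they match as phrases, and joins everything with OR.

```python
def build_search_query(link_texts):
    """Combine the distinct wordings used to link to a subject into one query.

    Hypothetical helper: the name and the ordering rule are illustrative only.
    """
    seen = set()
    terms = []
    for text in link_texts:
        phrase = " ".join(text.split())  # collapse stray whitespace
        key = phrase.lower()
        if not phrase or key in seen:
            continue  # skip empty strings and repeated wordings
        seen.add(key)
        # Quote multi-word wordings so they are searched as a phrase.
        terms.append(f'"{phrase}"' if " " in phrase else phrase)
    # Put phrases before single words, then sort alphabetically.
    terms.sort(key=lambda t: (-len(t.split()), t.lower()))
    return " OR ".join(terms)


# Wordings Julia used for this subject, one entry per link.
wordings = ["Evylin", "Evelyn", "Evylin Brumfield", "Evelyn"]
print(build_search_query(wordings))
# prints: "Evylin Brumfield" OR Evelyn OR Evylin
```

The OR keyword here mirrors the description above; how the terms are finally handed to the search engine depends on the engine's own query syntax.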

While this is interesting, it doesn't directly address the editor's need to find references they might have missed. Since we're able to see all the fulltext matches for Evelyn Brumfield, and since we can see all pages that link to the Evelyn Brumfield subject, why not subtract the second set from the first? An additional link on the subject page searches for precisely this set: all references to Evelyn Brumfield within the text that are not on pages linked to the Evelyn Brumfield subject.
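The subtraction can be sketched with plain Python sets. This is only an in-memory illustration: the page texts and ids are invented, the function names are hypothetical, and a naive substring match stands in for the real full text search.

```python
def find_matching_pages(pages, terms):
    """Naive stand-in for full text search: ids of pages containing any term."""
    lowered = [term.lower() for term in terms]
    return {
        page_id
        for page_id, text in pages.items()
        if any(term in text.lower() for term in lowered)
    }


def unlinked_mentions(pages, terms, linked_page_ids):
    """Pages that match the search but are not linked to the subject."""
    return find_matching_pages(pages, terms) - set(linked_page_ids)


# Invented sample data: page id -> transcribed text.
pages = {
    1: "Evylin came by today",
    2: "Went to town with John",
    3: "Evelyn Brumfield visited",
    4: "Evylin Edmons wrote",
}
linked_to_subject = {1}  # pages already linked to the Evelyn Brumfield subject

missed = unlinked_mentions(
    pages, ["Evylin Brumfield", "Evelyn", "Evylin"], linked_to_subject
)
print(sorted(missed))  # prints: [3, 4]
```

Page 4 in this sample is a false positive of the kind described next: the first name matches, but the page refers to a different person.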

At the time of writing, the results of such a search are pretty interesting. The first two pages in the results matched the first name in "Evylin Edmons", in pages that are already linked to the Evelyn Edmonds subject. Matched pages 4-7 appear to be references to Evelyn Brumfield in pages that have not been annotated at all. But we hit pay dirt with page number 3: it's a page that was transcribed and annotated very early during the transcription project, containing a reference to Evelyn Brumfield that should be linked to that subject but is not.

Questions

I originally intended to add links to search for each individual phrase linked to a subject. However, I'm still not sure this would be useful -- what value would separate, pre-populated searches for "Evelyn", "Evylin", and "Evylin Brumfield" add?

A more serious question is what exactly I should be searching on. I adopted a simple approach of searching the annotated XML text for each page. However, this means that subject name expansions will match a search, even if the words don't appear in the text. A search for "Brumfield" will return pages in which Julia never wrote "Brumfield", merely because they link to "John", which is expanded to "John Brumfield". This is not a literal text search, and might astonish users. On the other hand, would a user searching for "Evelyn" expect to see the Evelyns in the text, even though they had been spelled "Evylin"?
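To make the trade-off concrete, here is a small Python sketch of the two candidate search texts for one page. The XML shape is an assumption invented for this example (a link element with a target_title attribute); the real annotation format may differ, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed annotation format, for illustration only.
PAGE_XML = (
    '<page>Went to see <link target_title="John Brumfield">John</link>'
    " today.</page>"
)


def literal_text(page_xml):
    """Only the words as written on the page, with markup stripped."""
    return "".join(ET.fromstring(page_xml).itertext())


def expanded_text(page_xml):
    """The page text with each linked wording replaced by its subject name."""
    parts = []

    def walk(element):
        title = element.get("target_title")
        if element.tag == "link" and title:
            parts.append(title)  # use the subject name, not the wording
        else:
            parts.append(element.text or "")
            for child in element:
                walk(child)
                parts.append(child.tail or "")  # text following the child

    walk(ET.fromstring(page_xml))
    return "".join(parts)


print(literal_text(PAGE_XML))   # prints: Went to see John today.
print(expanded_text(PAGE_XML))  # prints: Went to see John Brumfield today.
```

Searching the raw XML behaves like searching both texts at once, which is why "Brumfield" matches this page. Indexing the two texts separately would let a search be literal or expanded by choice, though whether that is worth the extra complexity is part of the open question.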