Wednesday, May 2, 2012

Bending Regular Expressions to Express Uncertainty

This post expands upon an email I sent to James Edward Gray II, my mentor in regular expressions, whom I thank for his generosity.

For the last few months--ever since I started conducting regular expression workshops at THATCamps--I've been thinking about using regular expressions to represent uncertainty. The great power of the regex is its ability to define context, boundaries, and precision over what is essentially unknown: /A[BC]/ means "I'm looking for an A, followed by either a B or a C -- I don't know which, but if you see either of those, that's the text."

I build tools for transcribing handwritten text -- some of it very old, some of it illegible, some of it damaged by fire or water. Representing uncertainty is a common problem within that domain, and it's important for a couple of reasons:
  1. Usually the text will be presented out of context -- the ASCII (or Unicode) characters representing the underlying text will be separated from the image containing the underlying text.
  2. Even when the images are available, the person doing the transcription is far more skilled at deciphering a particular cursive hand than their readers are likely to be. Take someone trained in 16th century paleography, who has spent weeks working on a particular author's handwriting -- their opinion on whether a scribble is an "f" or an "s" is going to be worth more than that of a casual researcher who encounters that handwriting in a single record of search results. We need to pass that opinion from the expert to the reader.

While otherwise rigorous, the methods professional documentary editors use for recording uncertain readings are pretty shabby -- suited to print editions, they're often concise but imprecise: the Emerson Journals display missing text via || . . . ||, with "three dots representing one to five words; four dots, six to ten words; and five dots, sixteen to thirty words". Another example uses [ . . . ], with each period representing an illegible letter. There is no convention for expressing "I'm sure this is either an 'a' or a 'u', but I can't be certain which one." (Kline and Perdue, A Guide to Documentary Editing, Third Edition)

This is where the notation regular expressions use for their search patterns comes in. It seems like a perfect fit to record Br[au]mfield when the user can't tell whether a name is "Brumfield" or "Bramfield" but is certain that it's not, say "Bremfield". And happily, FreeUKGen is doing exactly this -- they've created a regex-inspired Uncertain Character Format (UCF) for their volunteers to use when they're not quite able to make out text:
_ (Underscore) A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character.
* (Asterisk) Several adjacent uncertain characters. A single * is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a _ or another *.
Note: If it is clear there is a space, then * * is used to represent 2 words, neither of which can be read.
[abc] A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets.
For example, [79] would mean either a 7 or a 9, whereas [C_] would mean a C or some other character.
{min,max} Repeat count - the preceding character occurs somewhere between min and max times. max may be omitted, meaning there is no upper limit. So _{1,} would be equivalent to *, and _{0,1} means that it is unclear if there is any character.
? Sometimes you will have the situation where all of the characters have been read but you remain uncertain of the word. In this case append a ? at the end of the word e.g. RACHARD? The most frequent place where a ? is used is with transcription that have been donated from other systems and are being converted for entry into FreeREG.
And their volunteers are actually using UCF a lot -- here's a file with a minimal-but-effective example, and here's a list of files with a high incidence of records containing UCF.

One of the only big differences between UCF notation and regular expression notation is the use of underscore (_) for "I'm not sure what this character is". In a sense, this is equivalent to the regex . character, but in practice that's not how it's used. /[i.]/ makes no sense in regular expressions: "either 'i' or any character" is turned into the logical set of all characters plus the set of { 'i' }, the union of which is the same as the set of all characters.  As a result, in regular expressions the 'i' is redundant in [i.].  However, that's not how they use _ in UCF.  [i_] means "I think this character is an 'i', but I'm not really sure."  That statement is not the same thing as "I don't know what this character is" -- not at all! 

So, cool! We've got a notation for describing uncertainty inspired by regular expressions. Problem solved, right? Well, not quite. While FreeUKGen's UCF represents uncertain readings successfully, I think, there are still a couple of issues to iron out.

The first one of these is displaying the data -- I won't go too far into this, as UI is not really my forte, but it seems like we might be able to represent notations of the form "[a_]" by using a different font weight. I have no idea what we'll do about a "Br[au]mfield", though.

The second issue is in searching the data. No problem, right? Regular expressions are designed for searching! Well, sort of. In this case, we expect end users (primarily genealogy researchers) to be typing in precise search strings like "Brumfield" which they expect to match against the regular expression /Br[au]mfield/. This wouldn't be a problem if we only had a handful of records -- we'd convert the UCF in each transcription into its equivalent regular expression, then iterate through each record, matching it against the user-entered search string. Unfortunately this approach might take a while on a database containing hundreds of millions of records.

The problem with searching regular expressions is that you can't index them. So far as I'm aware, it's theoretically impossible to shove a working finite state machine into a B+-tree. What you can do, however, is index permutations of UCF -- if a first name is [_J]ane, you can at least index Jane so that a search for "Jane" will find that record. You can also permute /Br[au]mfield/ into both "Brumfield" and "Bramfield" and index them each, so that a search on either string will find the /Br[au]mfield/ record. This an incomplete solution, in that its results will differ from the aforementioned, logically correct approach of applying each regex against the search string. However, it might be just adequate for the most common cases.

After writing this and reading James's response, I started thinking more about my options.  One of these is a parallelized brute-force approach.  Why can't I match each regex in the database against a search string?  After all, we're talking about fewer than a billion records, and asking does X match Y is the sort of thing that is easily parallelized.  O brave new world, that has such infrastructure! I'm hesitant to go down this path, but I may be missing something -- perhaps some Hadoopy, Erlangy, Map-Reducey algorithm is cheap, easy, and presents the simplest solution to the problem?  Any other option is really an approximation to "correct", so it would be a shame to rule this out because of my own lack of experience.

Another approach might be to categorize each kind of UCF expression. Based on my limited research so far, it appears that the majority of the UCF in the existing transcripts falls into either the "completely unknown" category of "*" for an entire field, or the nuanced "I think this is a J but I'm not sure" category represented by [J_].  We will likely have to handle the former gingerly no matter what we do -- if a surname is totally illegible in the manuscript, the search engine will have to rely on other fields.  The former expression could be approximated by "J", which would match precise searches well provided the transcriber actually has the greatest possible expertise.
Moving into what I expect are rarer cases, expressions like /Br[au]mfield/ would work well with the permutation treatment I outlined above. If the system already supports begins-with and ends-with searches, we should be able to index "Brumf*" as well as "*field".  In fact, we might even be able to index a single, infixed wildcard like "Br*ld" by doing both begins-with and ends-with searches with a combination of cleverness and hackery.  This leaves some smaller number of true regular expression-equivalent UCF-encoded records like "Ca_{1,2}s*d" to deal with.  It's possible that this represents such a small sample that the system could actually apply each record containing an irreducible regex to every search, whether via big parallelization or a long loop.

Yet another approach is one that I understand to be deployed on the FreeUKGen databases now -- lossy comparison that reduces UCF to searchable data.  For example, the venerable Soundex algorithm begins by stripping non-alphabetic data from a record, converting /[J_]ane/ to "Jane".  The uncertainty recorded by the transcriber fades into the larger fog of the search algorithm.  I'm just as uncomfortable with this methodology as I am with the permutation of /[J_]ane/ to "Jane" I described above.  I suspect that my discomfort is due to simply not knowing what the correct behavior is when a user is searching on /[J_]ane/ -- I know that "Jane" should match, but am not entirely sure whether "Zane" should match the record.

Perhaps the right approach is a hybrid -- use tricks with database indexing for the majority of cases--which don't involve any UCF at all--provide Soundex and Metaphone in transparent ways, and shove the irreducable regular expressions into a spot where they can be processed cheaply.  But I really don't know, and I don't anticipate knowing for months yet.  Of course, if you happen to have done this before, I'd love to know how.  I'm heading to bed, expecting dreams which revolve around gem install index-fsm.


Anonymous said...

You comment...
Take someone trained in 16th century paleography, who has spent weeks working on a particular author's handwriting -- their opinion on whether a scribble is an "f" or an "s"

But this could also mean "ss" as and "f" was uused for that. Some transcribers have entered this as "f" and some as "ss".


Tony Proctor said...

Hi Ben. By coincidence, I've started a thread on rootsdev called 'Escape Sequences' where I was looking for a mechanism to clearly mark UCF sequences without reserving a whole block of special characters.

It has currently brought FHISO into the subject as a host org for maintaining a UCF standard. How does that sound to you?

Tony Proctor said...

How does FreeUKGEN represent text that was struck-out in its original form Ben?

This seems to be separate from UCF but is still relevant when transcribing original data.

Another case would be 'side notes', say in the margin or between the lines.