The obvious answer is that the subjects which page A would have most in common with page B are the same ones it would have in common with nearly every other page in the collection. In the corpus I'm working with, the diarist mentions her son and daughter-in-law in 95% of pages, for the simple reason that she lives with them. If I choose two pages at random, I find that March 12, 1921 and August 12, 1919 both contain Ben and Jim doing agricultural work, Josie doing domestic work, and Julia's near-daily visit to Marvin's. The two pages are connected through those four subjects (as well as this similarly-disappointing "dinner"), but not in a way that is at all meaningful. So I decided that a page-to-page relatedness tool couldn't be built from the page-to-subject link data.
All that changed two weeks ago, when I was editing the 1921 diary and came across the mention of a "musick box". In trying to figure out whether or not Julia was referring to a phonograph by the term, I discovered that the string "musick box" occurred only two times: when the phonograph was ordered and the first time Julia heard it played. Each one of these mentions shed so much light on the other that I was forced to re-evaluate how pages are connected through subjects. In particular, I was reminded of the "you and one other" recommendations that LibraryThing offers. This is a feature that find other users with whom you share an obscure book. In this case, obscurity is defined as the book occurring only twice in the system: once in your library, once in the other user's.
This would be a relatively easy feature to implement in FromThePage. When displaying a page, perform this algorithm:
- For each subject link in the page, calculate how many times it is referenced within the collection, then
- Sort those subjects by reference count, and
- Take the 3 or 4 subject links with the lowest reference count and,
- Display the pages which link to those subjects.
No comments:
Post a Comment