A brief foray into vectorial semantics

by James Somers, November 8, 2010

I designed my notes archive to be as easy to write to as possible, because I knew that if there were any friction in the process, I wouldn’t stick with it.

So I purposely decided not to tag, label, file, or otherwise classify each note. The only bits of metadata I do have are timestamps and permalinks, both of which are generated automatically by Tumblr.

The unfortunate side effect is that this “archive” of mine, more than two years in the making, is little more than a 280,000-word soup of fragments. And it’s become increasingly difficult to find things, either because I can’t recall the right keyword to use, or—and the chance of this keeps growing—the keyword I can recall appears in too many other notes.

Over the weekend, then, I decided to throw a little code at the problem—to see if I could perhaps automatically add structure to my archive. What follows is an account of that short (but fruitful) expedition.

Step 1: Database-ify

I should have fetched my notes and stored them in a local database a long time ago. Databases are great: they’re structured, queryable, portable, and they interface with just about every programming language.

So that’s what I did first, using the Tumblr API and this simple script. All it does is fetch my notes (really, Tumblr posts) and INSERTs them into a small SQLite3 database, notes.db.
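
Roughly, the fetch-and-store step boils down to something like the sketch below. (This is a simplified stand-in, not the script itself; the hard-coded posts array takes the place of whatever comes back from the Tumblr API, and the column names are just illustrative.)

require 'sqlite3'

# Stand-in data: in the real script these hashes would be built from the
# Tumblr API response. Hard-coding one post keeps the sketch self-contained.
posts = [
  { id: 1321759949, timestamp: 1289188800, body: 'Limitations, honestly faced, are the greatest assets...' }
]

db = SQLite3::Database.new('notes.db')

# One small table: a row per note, keyed by the Tumblr post ID.
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS notes (
    id        INTEGER PRIMARY KEY,
    timestamp INTEGER,
    body      TEXT
  );
SQL

posts.each do |post|
  db.execute('INSERT OR REPLACE INTO notes (id, timestamp, body) VALUES (?, ?, ?)',
             [post[:id], post[:timestamp], post[:body]])
end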

Step 2: Introducing tf-idf weights

In college I took a course in computational linguistics, and at one point had the pleasure of working with these guys on a project to classify Supreme Court cases. So in the back of my mind were a few ideas for “semanticizing” documents, or using computers to extract some sort of meaning from globs of raw text.

I remembered that one of the best (and easiest) ways to start making sense of a document is to highlight its “important” words, or the words that appear within that document more often than chance would predict. That’s the idea behind Amazon.com’s “Statistically Improbable Phrases”:

Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

It seems to work pretty well. The first result for The Catcher in the Rye, for instance, is “red hunting hat.”

How do they do it? Probably by calculating what’s called the “term frequency–inverse document frequency weight”—or tf-idf weight—for each word. It’s actually really simple.

Take my notes archive as an example. Each of the 3,641 notes is considered a “document,” and what we’re trying to do is to calculate the tf-idf weight for a single term in a single document. So, for instance, we might try to figure out how important the word “limitations” is in the following note, which came from a post on the excellent Letters of Note blog:

Limitations, honestly faced, are the greatest assets in producing a work of art. I am always impressed by ones ability to push his limitations to unknown, unexplored, realms rather than settling for the unexplicable endowment of talent. Anyone with their five senses operating normally is talented.

First we calculate the term frequency, which is just the number of times the term appears in that document, divided by the total number of words in the document. Here that’s 2 divided by 46, or 0.043478.

Next we calculate the inverse document frequency, which basically measures the proportion of documents containing a term. I say “basically” because it’s actually the logarithm of the inverse of that proportion (that is, log(total documents / documents containing the term)), but that’s ultimately the information you’re extracting.

Calculating this is going to take a bit more work, because we need to go through all of our documents just to see whether they contain the word “limitations.” (By the way, we treat Limitations the same as limitations, and if we wanted, we could also count limitation without the “s.” We could even go as far as using a stemming algorithm to ignore all inflections, so that, e.g., buy and bought would be counted as the same word.)

Finally, to get the tf-idf weight, we just multiply the term frequency by the inverse document frequency. In this case, it’s 0.25641. Remember, that’s supposed to be a measure of how important the word “limitations” is in the single note above.
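
In code, that arithmetic is only a few lines. Here's a sketch of it (an illustration, not an excerpt from my scripts), where docs maps every note ID to the array of words that note contains:

# tf-idf for a single term in a single document.
def tf_idf(term, doc_words, docs)
  tf  = doc_words.count(term).to_f / doc_words.size        # e.g. 2 / 46 for "limitations"
  n   = docs.count { |_id, words| words.include?(term) }   # how many documents contain the term
  idf = Math.log(docs.size.to_f / n)                       # log of the inverse of that proportion
  tf * idf
end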

Step 3: Processing the whole archive

But 0.25641 doesn’t mean much on its own. What we really ought to do is find the tf-idf weights of all the words in a given note, or better yet, of all the words in the entire notes archive.

That’s what this file does. Actually, it does more than that:

  1. First, we build an index mapping notes to the words they contain. The note above might appear in the index as 1321759949 => ["limitations", "honestly", "faced", "are", "the", ...], where the number 1321759949 is just the note’s ID from its Tumblr permalink.

  2. We build a reverse index, which maps words to the documents in which that word appears. So words like the are going to map to huge lists, whereas the word limitations will only map to nine documents, including the one we looked at above. This reverse index is really useful when we want to calculate the idf part of tf-idf weights, because it gives us a very fast way of counting the number of documents in which a given word appears: we just take the length of the list it’s mapped to.

  3. Once we have our indexes, we’re ready to calculate lots of tf-idf weights. To what end? To build an index I called semantics, which maps every word in the entire corpus of notes—there are 24,895 unique words total—to a list of documents in which it appears, weighted by the tf-idf value for that [word, doc] pair.

An example might clarify this third step, which is really the key to the whole operation.

The word “suspicion,” it turns out, appears in four of my notes. But where in one note it might be one of just a few words (and presumably very important), in another it might be one of thousands (and presumably not important at all). The tf-idf weight quantifies the difference.

So our semantics map contains an entry for the word “suspicion.” It looks like this:

"suspicion" => [
    [0.253483682143897, 1502308795],
    [0.0633709205359744, 777863615], 
    [0.00986613134093014, 230381070], 
    [0.0368188588588901, 228512335]
]

That list shows the four documents “suspicion” appears in, and for each, it gives a number (the tf-idf weight) that shows how important the word “suspicion” is to that specific document.

The semantics map contains 24,894 entries just like that one.
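
Put together, the whole indexing pass looks something like the sketch below; it follows the description above rather than the actual file, and assumes the notes have already been loaded into notes.db.

require 'sqlite3'

# Load id => body pairs from the database built in step 1.
notes = Hash[SQLite3::Database.new('notes.db').execute('SELECT id, body FROM notes')]

index = {}   # note ID => the words that note contains
notes.each { |id, body| index[id] = body.to_s.downcase.scan(/[a-z']+/) }

reverse_index = {}   # word => IDs of the notes containing it
index.each do |id, words|
  words.uniq.each { |w| (reverse_index[w] ||= []) << id }
end

semantics = {}   # word => [[tf-idf weight, note ID], ...]
index.each do |id, words|
  next if words.empty?
  words.uniq.each do |w|
    tf  = words.count(w).to_f / words.size                   # term frequency
    idf = Math.log(index.size.to_f / reverse_index[w].size)  # inverse document frequency
    (semantics[w] ||= []) << [tf * idf, id]
  end
end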

Step 3.5: Stashing and Un-stashing

It takes a few minutes to build the semantics index using my program. That’s really annoying, because it means that every time you want to change the code that uses semantics, you have to start the program from scratch to re-build it.

It would be great if we could just build the index—a Ruby hash—once, store it, and retrieve it quickly elsewhere. That way we could create another file to experiment with our results, without all this tedious rebuilding.

Thanks to Marcus Westin, we can, using a wonderful little class he calls ObjectStash. If you’ve used Python, it’s just like pickle. All you have to do is say ObjectStash.store semantics, './semantics.stash' to store it in a file, and later, ObjectStash.load './semantics.stash' to load it again. In this case the file it creates is 2.5MB, so it takes about a second to load, but that’s better than the minutes it takes to recompute the damned thing.
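
If you'd rather not pull in the class, Ruby's built-in Marshal does the same job; a rough equivalent of those two calls is:

# Serialize the semantics hash to disk, then read it back in a later run.
File.open('./semantics.stash', 'wb') { |f| Marshal.dump(semantics, f) }
semantics = File.open('./semantics.stash', 'rb') { |f| Marshal.load(f) }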

Step 4: Intelligent searches

So what can we do with our semantics map?

One application that immediately jumped to mind is search. Remember how I said that keyword searches were often ineffectual because I’d have to trawl through tons of irrelevant results to find the note I’m looking for? Well, with the information we now have, that problem can be mitigated significantly.

The notes-search.rb script is simple: you give it a search term, say, “female,” and it builds an HTML page for you that contains only notes including that term, ranked by their score in semantics.
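
The ranking itself is little more than a lookup and a sort. A sketch of that step (which prints note IDs and weights instead of building the HTML page, and assumes semantics has been loaded from the stash as above):

# Look the term up in semantics and list the matching notes, best first.
term   = ARGV.first.to_s.downcase
ranked = (semantics[term] || []).sort_by { |weight, _id| -weight }
ranked.each { |weight, id| puts format('%.4f  %d', weight, id) }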

The results so far have been very promising: when I search for, e.g., “cognition,” the first note I get is the cognition note—the one I’d definitely want to find. When I search for “female,” the first five or so notes are about female-ness, in one way or another. So too when I search for “math”—the notes it scores highest are (mostly) quite mathy, or about math, whereas the ones it scores poorly just mention it.

I also like that it typically gives higher scores to short notes, just because these are usually the most fun to read. That’s partly because it’s always more fun to read short stuff (sorry about that…) and partly because the short notes are often the ones I’ve written myself (instead of being excerpts of other people’s work).

Step 5: The good stuff, a.k.a., what the hell does this have to do with vectors, anyway?

Recall that semantics map for a second. Imagine if for each word-entry, instead of including just those documents in which the word appears, I included every document—with a score of 0 whenever the word doesn’t show up—and if for every word, I put the list of its [tf-idf, doc] pairs in the same order. Do you know what I’d have?

Vectors!

More concretely, each word-entry would be a vector in a 3,641-dimensional document-space. That is, each word would be a vector that is in some sense “tilted” or “pointed” toward the set of documents in which it appears, and “lengthened” (its magnitude increased) in the direction of wherever its tf-idf weights were higher.

The representation is quite powerful.

Example: I could take a note, calculate the document-vectors for each of its words (weighted by the tf-idf score of that word in the current note), and sum them, in essence creating a document-vector for that note. If I did it for another note, I could compare the two. I could say that if they’re close (by, say, cosine similarity), then the two notes are semantically related. Or I could calculate the vectors for all of my notes and use something like the k-means algorithm to find semantically-related clusters of notes. [1]

Or I could do something really cool. Say that I’ve written the first paragraph of a blog post, and I want to find some relevant notes from my archive. I could “vectorize” the paragraph that I’ve written—using the same process as above—and compare that vector to the vectors of each of my notes, spitting out the five closest. If this calculation could be done quickly enough, I could even redo it every time I typed a new word, dynamically suggesting notes—ideas, excerpts, phrases, fragments—as I write.
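
None of that is built yet, but the core of it, vectorizing and comparing, is small. Here's a sketch of how it might go, reusing the index and semantics hashes from the sketches above and storing the mostly-zero vectors as sparse hashes:

# Sum each word's [tf-idf, doc] entries from semantics, weighted by that
# word's tf-idf within the text being vectorized. The result is a vector in
# document space, stored sparsely as a hash of note ID => value.
def vectorize(words, index, semantics)
  vec = Hash.new(0.0)
  words.uniq.each do |w|
    next unless semantics[w]
    tf  = words.count(w).to_f / words.size
    idf = Math.log(index.size.to_f / semantics[w].size)
    semantics[w].each { |weight, doc| vec[doc] += tf * idf * weight }
  end
  vec
end

# Cosine similarity between two sparse vectors.
def cosine(a, b)
  dot = a.sum { |k, v| v * b.fetch(k, 0.0) }
  mag = ->(v) { Math.sqrt(v.sum { |_k, x| x * x }) }
  dot.zero? ? 0.0 : dot / (mag[a] * mag[b])
end

# Vectorize a draft paragraph and suggest the five closest notes.
draft   = 'limitations honestly faced are the greatest assets'   # stand-in text
query   = vectorize(draft.downcase.scan(/[a-z']+/), index, semantics)
sims    = index.map { |id, words| [id, cosine(query, vectorize(words, index, semantics))] }
nearest = sims.sort_by { |_id, sim| -sim }.first(5)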

Conclusion

I think this basic approach is broadly applicable—really, it should work for any document set that’s not too big—so I hope to see others try it out. (And if you’ve done or seen something similar already, I’d love to hear about it.)

Incidentally, my code for this project is on GitHub.

Notes

[1] A lot of the ideas here were inspired by this excellent paper on computing the semantic relatedness of texts using Wikipedia. I recommend you read it. Actually, in this folder you’ll find a handful of cool papers related to the computational analysis of Wikipedia—I believe there have even been a few conferences on the subject.