Six lines

When I was younger, I would occasionally sit down to my computer thinking I was about to compose a masterpiece, something that would all but earn my place in the canon. I would have a peculiar feeling brewing, that I was on the cusp of a radical insight, that I was about to see the soul of man. It was an epiphanic mood: everything I was seeing, I was seeing as crucial and human and connected to some secret the substance of which sat wonderfully unarticulated in the canny shadows of my mind. I had but to write it down and claim my prize.

I wouldn’t need much. Ezra Pound said, “If a man writes six good lines he is immortal.” A single stanza would do. Maybe a sonnet.

The mistake I was making was simple: looking at a blank page, thinking of myself as an above-average writer and a sensitive thinker, I felt as if nothing material stood between me and my six lines. It wasn’t like I was trying to run a four-minute mile or prove the Collatz conjecture, where it would take years just to get the tools, the mechanics, the muscles, to level with the problem. All I had to do was type the right words.

But the page isn’t blank, and my brain isn’t the Library of Babel. There are only so many things I can think.

Insights—the kind that Shakespeare had; the kind that Joyce had—aren’t just “in the air.” They’re the fruit of a lifelong process: seeds looked for and lucked upon, grown with steady care, seized when they ripen. I may never have one.

A brief foray into vectorial semantics

I designed my notes archive to be as easy to write to as possible, because I knew that if there were any friction in the process, I wouldn’t stick with it.

So I purposely decided not to tag, label, file, or otherwise classify each note. The only bits of metadata I do have are timestamps and permalinks, both of which are generated automatically by Tumblr.

The unfortunate side effect is that this “archive” of mine, more than two years in the making, is little more than a 280,000-word soup of fragments. And it’s become increasingly difficult to find things, either because I can’t recall the right keyword to use, or—and the chance of this keeps growing—the keyword I can recall appears in too many other notes.

Over the weekend, then, I decided to throw a little code at the problem—to see if I could perhaps automatically add structure to my archive. What follows is an account of that short (but fruitful) expedition.

Step 1: Database-ify

I should have fetched my notes and stored them in a local database a long time ago. Databases are great: they’re structured, queryable, portable, and they interface with just about every programming language.

So that’s what I did first, using the Tumblr API and this simple script. All it does is fetch my notes (really, Tumblr posts) and INSERTs them into a small SQLite3 database, notes.db.
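
For the curious, the fetch-and-insert can be sketched in a page of Ruby. This isn’t the script itself, just a plausible reconstruction: the API key and blog hostname are placeholders, and the table schema is my own guess.

require 'net/http'
require 'json'
require 'sqlite3'

API_KEY = 'YOUR_API_KEY'        # placeholder; register an app with Tumblr to get one
BLOG    = 'example.tumblr.com'  # placeholder for your blog's hostname

db = SQLite3::Database.new('notes.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS notes (
    id        INTEGER PRIMARY KEY,  -- the post ID from the Tumblr permalink
    timestamp INTEGER,              -- when the note was posted (Unix time)
    body      TEXT                  -- the note itself
  )
SQL

offset = 0
loop do
  uri   = URI("https://api.tumblr.com/v2/blog/#{BLOG}/posts/text" \
              "?api_key=#{API_KEY}&limit=20&offset=#{offset}")
  posts = JSON.parse(Net::HTTP.get(uri))['response']['posts']
  break if posts.empty?
  posts.each do |post|
    db.execute('INSERT OR REPLACE INTO notes VALUES (?, ?, ?)',
               [post['id'], post['timestamp'], post['body']])
  end
  offset += 20
end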

Step 2: Introducing tf-idf weights

In college I took a course in computational linguistics, and at one point had the pleasure of working with these guys on a project to classify Supreme Court cases. So in the back of my mind were a few ideas for “semanticizing” documents, or using computers to extract some sort of meaning from globs of raw text.

I remembered that one of the best (and easiest) ways to start making sense of a document is to highlight its “important” words, or the words that appear within that document more often than chance would predict. That’s the idea behind Amazon.com’s “Statistically Improbable Phrases”:

Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

It seems to work pretty well. The first result for Catcher in the Rye, for instance, is “red hunting hat.”

How do they do it? Probably by calculating what’s called the “term frequency–inverse document frequency weight”—or tf-idf weight—for each word. It’s actually really simple.

Take my notes archive as an example. Each of the 3,641 notes is considered a “document,” and what we’re trying to do is to calculate the tf-idf weight for a single term in a single document. So, for instance, we might try to figure out how important the word “limitations” is in the following note, which came from a post on the excellent Letters of Note blog:

Limitations, honestly faced, are the greatest assets in producing a work of art. I am always impressed by ones ability to push his limitations to unknown, unexplored, realms rather than settling for the unexplicable endowment of talent. Anyone with their five senses operating normally is talented.

First we calculate the term frequency, which is just the number of times the term appears in that document, divided by the total number of words in the document. Here that’s 2 divided by 46, or 0.043478.

Next we calculate the inverse document frequency, which basically measures the proportion of documents containing a term. I say “basically” because it’s actually the logarithm of the inverse of that proportion, but the proportion is ultimately the information you’re extracting.

Calculating this is going to take a bit more work, because we need to go through all of our documents just to see whether they contain the word “limitations.” (By the way, we treat Limitations the same as limitations, and if we wanted, we could also count limitation without the “s.” We could even go as far as using a stemming algorithm to ignore all inflections, so that, e.g., buy and bought would be counted as the same word.)

Finally, to get the tf-idf weight, we just multiply the term frequency by the inverse document frequency. In this case, it’s 0.25641. Remember, that’s supposed to be a measure of how important the word “limitations” is in the single note above.
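
In Ruby, the whole calculation fits in a few lines (I’m assuming the natural logarithm here, which is what reproduces the numbers above):

tf  = 2.0 / 46               # "limitations" appears twice among 46 words
idf = Math.log(3641.0 / 10)  # 3,641 notes in all; ten contain "limitations"

puts tf        # => 0.043478...
puts tf * idf  # => 0.25641..., the tf-idf weight of "limitations" here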

Step 3: Processing the whole archive

But really, 0.25641 doesn’t mean much on its own. What we really ought to do is find the tf-idf weights of all the words in a given note, or better yet, find the tf-idf weights for all the words in the entire notes archive.

That’s what this file does. Actually, it does more than that:

  1. First, we build an index mapping notes to the words they contain. The note above might appear in the index as 1321759949 => ["limitations", "honestly", "faced", "are", "the", ...], where the number 1321759949 is just the note’s ID from its Tumblr permalink.

  2. We build a reverse index, which maps each word to the documents in which it appears. So words like the are going to map to huge lists, whereas the word limitations will only map to ten documents, including the one we looked at above. This reverse index is really useful when we want to calculate the idf part of tf-idf weights, because it gives us a very fast way of counting the number of documents in which a given word appears: we just take the length of the list it’s mapped to.

  3. Once we have our indexes, we’re ready to calculate lots of tf-idf weights. To what end? To build an index I called semantics, which maps every word in the entire corpus of notes—there are 24,895 unique words total—to a list of documents in which it appears, weighted by the tf-idf value for that [word, doc] pair.

An example might clarify this third step, which is really the key to the whole operation.

The word “suspicion,” it turns out, appears in four of my notes. But where in one note it might be one of just a few words (and presumably very important), in another it might be one of thousands (and presumably not important at all). The tf-idf weight quantifies the difference.

So our semantics map contains an entry for the word “suspicion.” It looks like this:

"suspicion" => [
    [0.253483682143897, 1502308795],
    [0.0633709205359744, 777863615], 
    [0.00986613134093014, 230381070], 
    [0.0368188588588901, 228512335]
]

That list shows the four documents “suspicion” appears in, and for each, it gives a number (the tf-idf weight) that shows how important the word “suspicion” is to that specific document.

The semantics map contains 24,895 entries just like that one.
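
In code, the three indexes can be sketched like so; this assumes present-day Ruby and a notes hash mapping note IDs to their text, pulled out of notes.db:

# 1. index: note ID => the words it contains
index = {}
notes.each { |id, text| index[id] = text.downcase.scan(/[a-z']+/) }

# 2. reverse index: word => the note IDs containing it
reverse = {}
index.each do |id, words|
  words.uniq.each { |w| (reverse[w] ||= []) << id }
end

# 3. semantics: word => [[tf-idf weight, note ID], ...]
n = notes.size.to_f
semantics = {}
index.each do |id, words|
  words.tally.each do |word, count|    # tally maps word => occurrences
    tf  = count / words.size.to_f
    idf = Math.log(n / reverse[word].size)
    (semantics[word] ||= []) << [tf * idf, id]
  end
end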

Step 3.5: Stashing and Un-stashing

It takes a few minutes to build the semantics index using my program. That’s really annoying, because it means that every time you want to change the code that uses semantics, you have to start the program from scratch to re-build it.

It would be great if we could just build the index—a Ruby hash—once, store it, and retrieve it quickly elsewhere. That way we could create another file to experiment with our results, without all this tedious rebuilding.

Thanks to Marcus Westin, we can, using a wonderful little class he calls ObjectStash. If you’ve used Python, it’s just like pickle. All you have to do is say ObjectStash.store semantics, './semantics.stash' to store it in a file, and later, ObjectStash.load './semantics.stash' to load it again. In this case the file it creates is 2.5MB, so it takes about a second to load, but that’s better than the minutes it takes to recompute the damned thing.
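
Incidentally, Ruby’s built-in Marshal can do the same job in a couple of lines; I’d guess it’s more or less what ObjectStash wraps:

# Stash the hash to disk...
File.open('./semantics.stash', 'wb') { |f| Marshal.dump(semantics, f) }

# ...and load it back later, in another script.
semantics = File.open('./semantics.stash', 'rb') { |f| Marshal.load(f) }

(One caveat: Marshal refuses to dump a hash that has a default proc, so the semantics hash has to be a plain one.)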

Step 4: Intelligent searches

So what can we do with our semantics map?

One application that immediately jumped to mind is search. Remember how I said that keyword searches were often ineffectual because I’d have to trawl through tons of irrelevant results to find the note I’m looking for? Well, with the information we now have, that problem can be mitigated significantly.

The notes-search.rb script is simple: you give it a search term, say, “female,” and it builds an HTML page for you that contains only notes including that term, ranked by their score in semantics.
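
The heart of it is just a lookup and a sort; a sketch, not the script itself:

# Rank the notes containing a term by its tf-idf weight, best first.
term   = ARGV[0].downcase
ranked = (semantics[term] || []).sort_by { |weight, _id| -weight }
ranked.each { |weight, id| puts format('%.5f  note %d', weight, id) }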

The results so far have been very promising: when I search for, e.g., “cognition,” the first note I get is the cognition note—the one I’d definitely want to find. When I search for “female,” the first five or so notes are about female-ness, in one way or another. So too when I search for “math”—the notes it scores highest are (mostly) quite mathy, or about math, whereas the ones it scores poorly just mention it.

I also like that it typically gives higher scores to short notes, just because these are usually the most fun to read. That’s partly because it’s always more fun to read short stuff (sorry about that…) and partly because the short notes are often the ones I’ve written myself (instead of being excerpts of other people’s work).

Step 5: The good stuff, a.k.a., what the hell does this have to do with vectors, anyway?

Recall that semantics map for a second. Imagine if for each word-entry, instead of including just those documents in which the word appears, I included every document—with a score of 0 whenever the word doesn’t show up—and if for every word, I put the list of its [tf-idf, doc] pairs in the same order. Do you know what I’d have?

Vectors!

More concretely, each word-entry would be a vector in a 3,641-dimensional document-space. That is, each word would be a vector that is in some sense “tilted” or “pointed” toward the set of documents in which it appears, and “lengthened” (its magnitude increased) in the direction of wherever its tf-idf weights were higher.

The representation is quite powerful.

Example: I could take a note, calculate the document-vectors for each of its words (weighted by the tf-idf score of that word in the current note), and sum them, in essence creating a document-vector for that note. If I did it for another note, I could compare the two. I could say that if they’re close, say, cosine-similar, then the two notes are semantically related. Or I could calculate the vectors for all of my notes and use something like the k-means algorithm to find semantically-related clusters of notes. [1]
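
Cosine similarity, for instance, takes only a few lines; here’s a sketch, with each note represented sparsely as a hash from word to tf-idf weight:

# Cosine similarity of two sparse vectors: the dot product over the
# product of the magnitudes. 1.0 means the notes point the same way.
def cosine(a, b)
  dot = a.sum { |word, weight| weight * b.fetch(word, 0.0) }
  mag = lambda { |v| Math.sqrt(v.values.sum { |x| x * x }) }
  dot / (mag.call(a) * mag.call(b))
end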

Or I could do something really cool. Say that I’ve written the first paragraph of a blog post, and I want to find some relevant notes from my archive. I could “vectorize” the paragraph that I’ve written—using the same process as above—and compare that vector to the vectors of each of my notes, spitting out the five closest. If this calculation could be done quickly enough, I could even redo it every time I typed a new word, dynamically suggesting notes—ideas, excerpts, phrases, fragments—as I write.

Conclusion

I think this basic approach is broadly applicable—really, it should work for any document set that’s not too big—so I hope to see others try it out. (And if you’ve done or seen something similar already, I’d love to hear about it.)

Incidentally, my code for this project is on GitHub.

Notes

[1] A lot of the ideas here were inspired by this excellent paper on computing the semantic relatedness of texts using Wikipedia. I recommend you read it. Actually, in this folder you’ll find a handful of cool papers related to the computational analysis of Wikipedia—I believe there have even been a few conferences on the subject.

deskotron

I could be a lot more productive. I could waste less time between tasks; focus more during tasks; keep better track of time; set more concrete goals and subgoals; have higher standards; give up less easily; and work even when I really don’t want to. It comes down to being more disciplined, basically, and more conscientious.

One way to get there is to “self-improve”: to attend to my flaws, the unhealthy habits of my work and mind, and try to change them.

Another way is to invent and build a magic desk robot. This might require more effort up front, but after I describe it to you, you’ll see why it’s probably worth it.

The device I have in mind would be an extremely sophisticated hybrid of a to-do list, filing cabinet, coach, and workspace. I’ll call him deskotron.

The first thing that deskotron would do is catalog everything I want or ought to do. By this I mean he would keep track of every book, blog post, paper, essay, and article I should read; every errand I have to run; every work to-do; every side project I’ve got going or have said I’d start; every puzzle or problem I would like to solve; every game I enjoy playing; all of my appointments; every essay I should write; all of my e-mails and calls; every movie, TED talk, and stupid video I plan to watch; etc. In short, deskotron would create a massive database of activity-units-I-could-fill-my-day-with, or tasklets.

Already I’m asking for something that even the world’s best secretary couldn’t provide, because I’m asking deskotron to find and suggest tasklets for me, to dredge up my forgotten project ideas, to record every time someone recommends a book, and so forth. I’m asking him to pay attention to, and write down, everything actionable in my life.

And that’s just the first step. Once deskotron has built this impressively complete tasklet database, I would want him to do whatever he could to put each item “at the ready,” that is, to lower its activation energy.

Example: if my next tasklet is to read twenty pages of Montaigne’s essay, “An apology for Raymond Sebond,” he would hand me the book opened to the right page. If it’s to answer an e-mail, he would fetch it for me and put it on my screen. And so on.

Next, deskotron would classify these tasklets. He’d rank them by cognitive exertion: reading some books is less taxing than reading others, and deskotron would know this. He’d know which tasklets will take longer (estimates based on prior experience, which he would track), which tasklets I’ll prefer, and which tasklets are related semantically, knowing, for instance, that a neurology paper and a TED talk on strokes are related.

He would know how each tasklet relates to larger goals of mine, and how much each contributes to those goals. For instance, he would know that I want to become a versatile writer, so he’d feed me a variety of writing tasklets, tracking my progress in different areas, say, by measuring the number of fiction-words I’ve written against the number of nonfiction-words. If I wanted to “gain a better understanding of recursion in Lisp,” he would know that reading The Little Schemer would help. Etc.

He would have something like Gmail’s “priority inbox” built in, knowing which e-mails warrant quick responses and which can wait.

More generally, he would be able to prioritize tasklets based on their importance. Tasklets related to “must-have” goals would have a higher priority than tasklets related to “nice-to-haves.”

He would understand strategic relationships between tasklets. By that I mean that he would understand which tasklets feed well into one another. For instance, I might write better after perusing my Google Reader queue, or I might write worse; deskotron would know which. He would know how many programming tasklets I can do before getting exhausted, and how to stagger hard and easy tasklets to squeeze the most effort out of me. He would know that I don’t like to read too many serious magazine articles in a row, and that I have to be primed in a certain way before I want to solve a Project Euler problem.

deskotron would be like a good personal trainer, demanding nearly too much of me, holding me to my commitments, pushing me when I falter, and knowing when to give me a break. He would monitor my mood and gauge my engagement. He would be like the logical extension of that Mercedes feature that wakes you up when you doze off.

With all this, deskotron would be able to dynamically pack my days. He would turn me into one of those high-powered guys who’s scheduled down to the minute, except that I wouldn’t feel constrained by him. Instead, I’d feel like he had the perfect answer every time I asked, “What’s next?”

Social Annealing

The first two weeks of college are exceptional for a lot of reasons. But there’s one phenomenon that stands out. It’s as peculiar and as powerful and as rare as magic. [1]

I’ll call it “social annealing” because it so resembles the process by which metals are heated—their particles freed into a loose homogeneous jumble—and then recrystallized.

It’s the phenomenon whereby college freshmen, having just been ejected from the world they know and loosed upon a campus they don’t, mutually decide, in the heat of their eager anxiety, to willingly engage every stranger as a friend.

It happens in the dorms, the cafeteria, coffee shops, classrooms, libraries, and just around town: students with no prior connection approach one another and strike up conversations. Everyone tries everyone out.

Groups do form, but they’re not the kind of groups we’re used to—they congeal and dissolve with remarkable ease. So a lone student can approach five others without feeling like he’s intruding; two sets of roommates can combine; a pack can split with no friction.

Think of how bizarre that is. Think of what it would take, for instance, to introduce yourself to a group of four friends, none of whom you’d ever met. It’s practically preposterous.

That’s because in normal circumstances, confronting strangers without an overt excuse—the elevator breaks down, say, or a plane is canceled—is an act of aggression. If not outright threatening, it’s intrusive, or at the very least distracting.

Even in settings that seem to encourage mixing, like parties or bars, it’s not kosher for a person to engage just anybody—one must abide by all kinds of cues and conventions and rules; contact must be made with a measure of finesse.

Which is to say that nothing you can find elsewhere in the workaday world even resembles the two-week college free-for-all, the strange fever in which everyone is basically pleased as hell to meet everyone else.

It almost sounds like a fantasy. But I assure you it happened. I’m not a spectacularly outgoing guy, but for the first two weeks of my freshman year at the University of Michigan, I introduced myself to just about everyone I saw. When I’d go down to the cafeteria, I could sit anywhere. At parties, on the way to class, in the dorms, etc., I—like everyone else—would flit from group to group in a crazy kind of convivial Brownian motion. Our social graph was effectively amorphous—fully connected. We were open to each other in a way that I imagine swingers must be open to sex, or hippies to psychedelics.

* * *

It’s probably worth asking how this happens, or why. I don’t think it’s all that complicated:

  1. Bizarre things are bound to happen when you throw a large number of eighteen-year-olds into close quarters, especially if you don’t give them any work to do.

  2. For the most part, nobody knows anybody when they first arrive at college. And even if you did know some people, say, a few other kids from your high school, it’s good sense to avoid them for a little while, if only to participate in the social madness I’ve been describing.

    Which means everyone is effectively looking to make a fresh start, to find replacements for their now-disbanded troupe of close dependable friends.

    The trick is that everyone knows this. They know that everyone is in the same boat. And that pervasive common knowledge—where for any two people, A knows that B knows that A knows… that they’re both looking to make friends—is enough to fuel cold approaches.

  3. Freshmen are called “freshmen” for a reason: they’re fresh; they don’t yet have a reputation. Everyone understands that, and they understand that first impressions can stick. So it makes sense that they’re unusually warm and friendly. It’s the best way to keep their social options open.

  4. There is what I’ll call the “New Year’s Resolution Effect,” where kids just entering college decide, in light of their radical dislocation and the discontinuity from their life at home, to change themselves in fairly significant ways. In particular, they tend to commit to being more extroverted than they were in high school. It’s a common enough ambition to accelerate the pandemonium.

* * *

I mostly wanted to articulate this phenomenon because of something that happened last weekend. After seventeen months away I was back in Ann Arbor, that great college town and the site of my alma mater, ostensibly for a big football game, but really to reunite with lots of old friends, many of whom I hadn’t seen since the day we walked the Big House in our robes.

It was quite nostalgic. I really loved that place—still love it—and a lot of what I did that weekend was to reminisce, to reconnect with a mass of pleasant memories and in some cases to relive them.

But I also thought of all the things I didn’t do, of all the people I never met. I thought of how little I took advantage of Ann Arbor’s unbelievable density of young, curious kids with lots of free time and energy, all part of that same proud collective: students of the University of Michigan.

Walking around campus and the town, then, I had this remarkable urge, much like the one I had as an incoming freshman—but here I was older and more confident and more capable—to engage with everyone I saw.

But not much came of it. I wasn’t quite as bold as I could have been, for sure, but nor was the place as ripe as I’d imagined. I didn’t understand why until a friend explained it: the kids I’d seen that day had done the same thing we’d done, what we would later come to vaguely regret—they had annealed, and settled, and made themselves a home among a certain set of friendly faces. In the bargain they’d retired from the frenzy of their freshman year, the thrill of radical openness. And I had become a stranger once again.

Notes

[1] Why only two weeks? There are a few reasons: things in general have a way of lasting two weeks; school starts to get serious after about two weeks; and two weeks is just enough time for solid social bonds to form, for kids to get comfortable in their surroundings, and for everyone to pretty much sample everyone else in their little pocket of the campus. After two weeks, the magic’s over and the metal cools.

Perfunctory Offers

What do you do when you want the last piece of bread? Do you just take it? Or do you offer it to the table?

You offer it. Everybody knows the score: they know that you want the last piece, and that you’re only offering it to everyone else to be polite. So they’re going to decline, and you’re going to end up with the bread.

Why not be more direct? Because then you’d be skipping a ritual that gives everyone a chance to demonstrate how cooperative they are. Rituals like that are important.

Of course, making the offer opens you up to someone accepting it. That’s the price you pay for coming off as polite and cooperative; what you’ve done, effectively, is to wager the bread to earn a bit of social credit.

Nine times out of ten you’ll win—you’ll look good and keep your baguette—but occasionally you’ll lose the bread. You should be prepared for that.

The odds of pulling off a successful perfunctory offer are usually worse than in this bread situation. The reason is that the bread situation happens so often, and is so well-understood, that people rarely deviate from the script: offer, refuse, eat. Whereas in other cases—like when you offer a friend a ride but don’t want him to accept—the game isn’t so clear; your friend might suspect that your offer is just for show, but he can’t be sure. And free rides are attractive. So he’s more likely to accept.

The good news is that you get paid for taking on this extra risk. Offering your friend a ride is a bigger deal than offering your last piece of bread to the table, which means you stand to earn more social credit.

The stakes are higher for the other guy, too. He also earns points by turning down your empty offers; we saw that in the bread situation. It’s because turning down an empty offer (a) lets you off the hook, which is a nice and cooperative move for him to make, and (b) demonstrates his ability to detect empty offers in the first place, which feeds his reputation as a skilled coordinator.

That, then, is the arithmetic of perfunctory offers: you balance the cost of giving up some X against the points you’d earn for seeming generous, while he weighs the value of receiving X against the points he’d earn for skillfully detecting—and then abiding by—your true intentions.