by James Somers
Abstract: A simple vector-space model shows a surprising talent for cluing in the Codenames board game.
Codenames is a Czech board game by Vlaada Chvátil where the goal is to say a one-word clue to your teammates in order to get them to choose correctly from the words laid out on the table. The real game is played on a 5x5 board, but here is a typical situation faced by a clue-giver:
The three blue words are the target words—that's what you want your teammates to guess. The black word is the bomb; if your teammates say that one, they instantly lose the game. The tan words are neutral or perhaps belong to your opponent.
Your task is to come up with a single word that connects HAM, BEIJING, and IRON, while avoiding the others. (There are rules about which kinds of clues are allowable: usually it has to be a single word; proper nouns are optionally allowed.)
The game is interesting because it requires you to connect far-flung concepts precisely enough that other people can re-create your associations. It can be delightful, and frustrating, to see your friends' minds leap from idea to idea—often going places you never intended.
Can you think of a clue for the board above?
"Word vectors" attempt to quantify meaning by plotting words in a high-dimensional space; words that are semantically related end up close to each other in the space.
One way to generate word vectors uses a neural network: you download a vast corpus of text, say all of Wikipedia. Then, you read the text through a small moving window, considering maybe ten words at a time—nine "context" words and one target word. Your goal is to predict the target from the context: you rejigger the weights of the network such that, based on the nine context words, it assigns a high probability to the tenth. At the heart of this neural network is a big matrix which has a column vector for each word; in the training process, you're essentially nudging these vectors around. After training across the entire corpus, the vectors come to embody the semantics latent in the patterns of word usage.
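To make this concrete, here is a minimal sketch of that kind of training using the gensim library. (This is an illustration, not the GloVe procedure: gensim's Word2Vec implements CBOW and skip-gram, close cousins of the context-window scheme described above, and the parameter names assume gensim 4.x.)
# Illustrative only: train small word vectors with gensim's Word2Vec (gensim 4.x)
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]  # in practice: millions of tokenized sentences, e.g. all of Wikipedia

# CBOW: predict each word from the words in a +/-5-word context window
model = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, sg=0)
model.wv["cat"][:10]  # the first 10 of the 300 coordinates for "cat"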
Training good vectors this way is a computationally intense procedure. Luckily, Stanford has published a data set of pre-trained vectors, the Global Vectors for Word Representation, or GloVe for short. (It uses a fancier method than the one described above.) The data is just a list of words, each followed by 300 numbers: the coordinates of that word's vector in a 300-dimensional space. The GloVe vectors we'll be using were trained on 42 billion words' worth of text drawn from the Common Crawl.
Codenames seems like a good Turing test: to come up with a clue, you need to not only understand the many shades of meaning each word can take on—"PAN," for instance, can be a piece of kitchenware, a way of criticizing, or a prefix meaning "all"—you also seem to need a model of the world. You connect "NARWHAL" to "NET" because you know that narwhals might be caught in nets. You connect "GRENADE" to "PALM" because you know that grenades are held in your hand; when you think of the two words together, you might even mentally simulate a throw. All this seems difficult for a computer to do.
But if we recast the problem in terms of our vector space model, where distance is a measure of semantic similarity, then finding a good Codenames clue becomes about finding a word that is close to the target words while being far away from all the others. That sounds a little simpler.
Download the vectors:
$ wget http://nlp.stanford.edu/data/glove.42B.300d.zip
$ unzip glove.42B.300d.zip
The words are sorted by the number of times they appear in the original corpus. Each word has a list of 300 coordinates associated with it. Here are the word vectors for was, or, and your:
with open("./glove.42B.300d.txt", 'r') as glove:
lines = [next(glove) for x in xrange(100)]
[(l.split(" ")[0], ", ".join(l.split(" ")[1:20]) + "...") for l in lines[30:33]]
There are more than a million words in this file, which makes processing slow. So we'll write the top 50,000 words to a separate file:
$ head -n 50000 glove.42B.300d.txt > top_50000.txt
Now we're ready to do some analysis.
We'll import some common libraries for numerical analysis:
import numpy as np
import pandas as pd
from scipy import spatial
Then, we'll create a map from words to their "embeddings", i.e., their 300-dimensional vector representations:
embeddings = {}
with open("./top_50000.txt", 'r') as f:
for line in f:
values = line.split()
word = values[0]
vector = np.asarray(values[1:], "float32")
embeddings[word] = vector
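A quick sanity check on what we loaded:
len(embeddings)           # 50,000 words
embeddings["cat"].shape   # (300,): each word is a point in 300-dimensional space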
We can see which words are close to others by taking the cosine distance between their vectors—a measure based on the angle between the two vectors in the high-dimensional space:

$$distance(x, y) = 1 - \cos(x, y) = 1 - \frac{\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}$$

With a quick look at some neighboring words, we can see that the distance metric works pretty well:
def distance(word, reference):
    return spatial.distance.cosine(embeddings[word], embeddings[reference])

def closest_words(reference):
    return sorted(embeddings.keys(), key=lambda w: distance(w, reference))
[(w, ", ".join(closest_words(w)[1:10]) + "...") for w in ["magic", "sport", "scuba", "sock"]]
We can express the Codenames problem as taking a set of "target" words and a set of "bad" words, then trying to find candidate words that are close to the targets and far from the bad words.
One way to do this is to calculate, for a given candidate clue, the sum of its distances from the bad words minus the sum of its distances from the target words. (Smaller distances to the targets make a candidate better, hence the subtraction.) That is, for each word $w$ in our dictionary we want to compute:

$$goodness(w) = \sum_{b \in bad}{distance(w, b)} - c \cdot \sum_{t \in targets}{distance(w, t)}$$

Then we pick the words with the highest values—say, the top 250 of them. (The constant $c > 0$ expresses the fact that closeness to the target words is more important than farness from the bad words.)
The trouble is that a candidate that is close to one or two of the targets but far from the third can still score well—despite being a bad clue for that very reason. So, we sort our subset of 250 good candidates by the following:
$$minimax(w) = \min_{b \in bad}{distance(w, b)} - \max_{t \in targets}{distance(w, t)}$$

That is, we're looking to minimize the maximum distance from the targets, and maximize the minimum distance from the bad words. This is all pretty easy to express in code:
def goodness(word, answers, bad):
    if word in answers + bad: return -999
    return sum([distance(word, b) for b in bad]) - 4.0 * sum([distance(word, a) for a in answers])

def minimax(word, answers, bad):
    if word in answers + bad: return -999
    return min([distance(word, b) for b in bad]) - max([distance(word, a) for a in answers])

def candidates(answers, bad, size=100):
    # Rank the whole vocabulary by goodness, keep the top 250, re-sort those by
    # minimax, and format the top `size` entries as "rank. word (minimax score)"
    best = sorted(embeddings.keys(), key=lambda w: -1 * goodness(w, answers, bad))
    res = [(str(i + 1), "{0:.2f}".format(minimax(w, answers, bad)), w)
           for i, w in enumerate(sorted(best[:250], key=lambda w: -1 * minimax(w, answers, bad))[:size])]
    return [(". ".join([c[0], c[2]]) + " (" + c[1] + ")") for c in res]
I've been playing lots of Codenames with my friends and have gathered some data along the way. In the "experiments," sixteen players participate: four cluers are randomly assigned to the same 3x3 board, like the one above, and each independently gives a clue to three receivers. (The receivers don't see the colors on the board, obviously.)

The scoring works as follows: each receiver's three guesses are scored in order (1 point for the first correct guess, 2 more for the second, and 3 more for the third), so a perfect board is worth 6 points. Scoring stops at the first incorrect guess, and picking the bomb zeroes the score for that board.
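Here's a small sketch of that rule in code (assuming, as the result tables below suggest, that FALL was the bomb on this board):
def score(guesses, targets, bomb):
    # Guesses are scored in order; a neutral word ends the turn, the bomb zeroes it
    total = 0
    for i, guess in enumerate(guesses):
        if guess == bomb:
            return 0          # picking the bomb wipes out the score
        if guess not in targets:
            break             # a neutral/opponent word just ends the turn
        total += i + 1        # 1 point, then 2 more, then 3 more
    return total

score(["iron", "ham", "bear"], ["ham", "beijing", "iron"], "fall")      # 3
score(["beijing", "iron", "fall"], ["ham", "beijing", "iron"], "fall")  # 0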
These experiments give a baseline of human performance, which can then be compared against the vector-space model.
For instance, with the board above, we had the following clues and results:
Receiver | First guess | Second guess | Third guess | Points
Receiver 4 | HAM | BEAR | CAT | 1
Receiver 5 | HAM | BEIJING | IRON | 6
Receiver 6 | HAM | IRON | BEIJING | 6
Receiver 7 | BEIJING | BEAR | CAT | 1
Receiver 8 | BEIJING | IRON | FALL | 0
Receiver 9 | WITCH | AMBULANCE | NOTE | 0
Receiver 10 | IRON | HAM | BEAR | 3
Receiver 11 | IRON | NOTE | HAM | 1
Receiver 12 | IRON | HAM | FALL | 0
Receiver 1 | BEIJING | IRON | HAM | 6
Receiver 2 | IRON | BEIJING | HAM | 6
Receiver 3 | BEIJING | HAM | IRON | 6
Clearly "WOK" was the best clue. "CAIDAO" might have been a good clue except that none of the receivers understood what it meant. "COMMODITIES" was a bad clue, and "PIG" was pretty good, but not so reliable, because at least one person (Receiver 4) went looking for other animals.
Let's see what the computer comes up with. We'll print the first 100 candidates using the function above. The number in parens is the minimax score that we're sorting by:
from itertools import zip_longest
from IPython.display import HTML

def grouper(n, iterable, fillvalue=None):
    # Collect the iterable into fixed-length chunks of n items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def tabulate(data):
    # Lay the candidate list out as a 10-column HTML table
    data = list(grouper(10, data))
    return HTML(pd.DataFrame(data).to_html(index=False, header=False))
answers = ["iron", "ham", "beijing"]
bad = ["fall", "witch", "note", "cat", "bear", "ambulance"]
tabulate(candidates(answers, bad))
I find these results pretty striking. The model here is simple geometry; it relies entirely on the meaning baked into the GloVe vectors. But wok appears! wok is basically a perfect clue—everyone was impressed with the friend who came up with it and upset they hadn't thought of it themselves—and here it is in the #2 spot, out of 50,000 candidates. tong (#1) might work well, though I don't quite understand the connection to "Beijing," and jian (#8), a word I hadn't heard before, fits decently well: it is a kind of Chinese sword. stir-fry (#19) and sichuan (#28) seem to evoke Chinese cooking. Notably, all of these clues are vastly better than "COMMODITIES," which is the one I came up with.
Of course, there's plenty of garbage (molybdenum (#37) (??), qatar (#87) (!?)), and many of the candidates are over-indexed to one or two of the targets at the expense of others. hock (#49), for instance, doesn't have anything to do with "Iron" or "Beijing," and omelette (#45), although connected to "Ham" and "Iron," is unrelated to "Beijing."
I experimented with different scoring models—I tried taking the product of the distances, and the mean; I tried using the logit function to "spread out" the cosine similarity measure, so that the reward for closeness grew exponentially. And I played with the constant $c$. But so far, the model above gives the best overall performance across the largest number of scenarios.
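For the curious, here is a sketch of what such a logit-based variant could look like. It's illustrative rather than the exact code I tried; rescaling the cosine distance into (0, 1) before applying the logit is one arbitrary choice among several:
from scipy.special import logit

def goodness_spread(word, answers, bad, c=4.0, eps=1e-6):
    # Map cosine distance (0..2) to a similarity in (0, 1), then "spread" it with
    # the logit function so that very close words are rewarded disproportionately
    def spread(a, b):
        sim = 1.0 - distance(a, b) / 2.0
        return logit(np.clip(sim, eps, 1 - eps))
    if word in answers + bad:
        return -999
    return c * sum(spread(word, a) for a in answers) - sum(spread(word, b) for b in bad)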
(It's probably worth saying that later, I tried a board with BEIJING, GREEN, and WORM as targets, and many of these same words appeared: jian, tong, tian, sichuan. Same if GREEN were changed to LAPTOP, but not when changed to DEER. So perhaps "Beijing" alone had conjured them up, and to some extent, the model got lucky.)
This was a doozy:
Receiver | First guess | Second guess | Third guess | Points
Receiver 1 | ATLANTIS | FAIR | HOSPITAL | 0
Receiver 2 | CHURCH | ATLANTIS | CAT | 6
Receiver 3 | ATLANTIS | CAT | BUCK | 3
Receiver 4 | ATLANTIS | AZTEC | CHURCH | 1
Receiver 5 | ATLANTIS | EYE | AZTEC | 1
Receiver 6 | AZTEC | EYE | CHURCH | 0
Receiver 7 | ATLANTIS | CHURCH | AZTEC | 3
Receiver 8 | CAT | CHURCH | FAIR | 0
Receiver 9 | CAT | CHURCH | FAIR | 0
Receiver 10 | ATLANTIS | AZTEC | EYE | 1
Receiver 11 | ATLANTIS | AZTEC | CHURCH | 1
Receiver 12 | CHURCH | AZTEC | ATLANTIS | 1
Only a single player managed to guess all three correctly, via the clue "MALTA." Much to my surprise, that clue appeared 12th on the model's list:
answers = ["church", "cat", "atlantis"]
bad = ["fair", "eye", "aztec", "buck", "pin", "hospital"]
tabulate(candidates(answers, bad))
Perhaps more surprising is the model's top pick, ark. I tried this clue on a friend who wasn't part of the initial experiment; they guessed all three targets correctly. Indeed ark might be a strictly better clue than "MALTA." (I like how it connects both to "Church" and to "Cat," and actually also to "Atlantis"—boat, island...—though it has a little interference with "Buck," which is also an animal that might end up on Noah's Ark.)
Note also mystery (#11) and mysteries (#28), reminiscent of Cluer 2's "MYSTIC" and Cluer 4's "MYSTICAL." aslan didn't have a chance of appearing since it didn't make the original cutoff for inclusion in the dictionary (it's about the 57,000th word).
As before, much of the list seems kind of useless. Clearly the program is noisy. But it's capable of generating clues that are sometimes as good as, if not better than, what a person could come up with.
For instance, I remember that early on, someone came up with a brilliant clue for SOCK, LUCK, and ATLANTIS, a board which had stumped everyone else. The clue was "Lost." Sure enough, the model discovers that clue, at #24. Good program!
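(A small helper makes this kind of check easy. The bad words below are placeholders borrowed from other boards in this post, rather than that board's actual bad words, just to make the example runnable.)
def clue_rank(clue, answers, bad):
    # 1-indexed position of a clue in the model's candidate list, or None if absent
    ranked = [entry.split(" ")[1] for entry in candidates(answers, bad, size=250)]
    return ranked.index(clue) + 1 if clue in ranked else None

# Placeholder bad words, just to make the call runnable
clue_rank("lost", ["sock", "luck", "atlantis"], ["plate", "pupil", "novel", "pole", "iron", "ground"])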
A board with the targets THUMB, FOREST, and MOUNT ended up being pretty easy for human players. The best clue—chosen independently by three people—was "GREEN," and six players got perfect scores from it. But the computer can't seem to see it:
answers = ["thumb", "mount", "forest"]
bad = ["pin", "tag", "hood", "princess", "police", "ball"]
tabulate(candidates(answers, bad))
ridges, the top clue, might work (the connection to "THUMB" is via the ridges on your fingerprint, I think) but when I tested it on someone, they replied with "mount, hood, forest."
There was a similar misfire with a BOND, PIRATE, BUGLE board. The winning clue was "GOLD," but the computer didn't come up with it. Its clues seem pretty weak—over-indexed to one or two targets—with the exception maybe of "corps" (#41) and "cadets" (#75):
answers = ["bond", "pirate", "bugle"]
bad = ["alien", "poison", "seal", "dice", "link", "spell"]
tabulate(candidates(answers, bad))
It's hard to know what's happening here. Is it maybe that there aren't many co-occurrences of "gold" and "bond" in the Common Crawl corpus? Look at the distance between those two vectors:
distance("gold", "bond")
For reference, let's consider a word that's close to "gold":
distance("gold", "plated")
...and one that bears really no relation (that I can see):
distance("gold", "mouse")
So "bond" is almost as far away from "gold" as "mouse" is. More surprisingly, "bugle"—an instrument that is often gold-colored—is even farther away, suggesting that the two words don't appear around each other, or even in similar contexts:
distance("gold", "bugle")
We humans can use our imaginations to connect words—and in many cases this turns out to be far more powerful than a measure of conceptual distance based on co-occurrence in a large corpus.
Perhaps my favorite example comes with a board whose targets were ROUND, FIGHTER, and PALM. The model's best effort is ufc (#23); it seems preoccupied with MMA and boxing-related words:
answers = ["fighter", "round", "palm"]
bad = ["ivory", "knife", "point", "helicopter", "novel", "diamond"]
tabulate(candidates(answers, bad))
One of the human cluers, though, came up with "GRENADE." In vector terms, this word ends up being pretty far from all of the targets:
distance("grenade", "fighter")
distance("grenade", "round")
distance("grenade", "palm")
The last two of these are especially interesting. We humans know that a grenade is round (more or less)—but of course our computer model doesn't. It doesn't know anything. It only considers the raw token grenade, and only "understands" it in relation to other tokens. And apparently that token doesn't appear in contexts involving words like round all that often...

Same, too, with palm. When we think of grenades, one of the things that immediately springs to mind is the fact that it's hand-held—particularly if that idea is primed by the presence of the word "PALM." This speaks to the richness of our mental models: it's not just words in there. By contrast, the only chance our dumb model has of seeing this association is if lots of texts happened to talk about palms, or hands, or fingers, in the same breath as grenades. Apparently that doesn't happen too often either.
It's worth showing an example where the computer falls flat on its face. Consider this board:
Here's what the humans came up with:
Receiver | First guess | Second guess | Third guess | Points
Receiver 1 | ROBIN | SERVER | PLATE | 3
Receiver 2 | SERVER | ROBIN | PLATE | 3
Receiver 3 | ROBIN | SERVER | PLATE | 3
Receiver 4 | SERVER | SCREEN | POLE | 3
Receiver 5 | SERVER | POLE | SCREEN | 1
Receiver 6 | SCREEN | SERVER | POLE | 3
Receiver 7 | SCREEN | ROBIN | SERVER | 6
Receiver 8 | SCREEN | SERVER | ROBIN | 6
Receiver 9 | ROBIN | SCREEN | SERVER | 6
Receiver 10 | SCREEN | ROBIN | SERVER | 6
Receiver 11 | SCREEN | ROBIN | SERVER | 6
Receiver 12 | ROBIN | SCREEN | SERVER | 6
There was much debate about whether "BATCOMPUTER" was even legitimate, but we were indeed allowing proper nouns, and Wikipedia spells Batcomputer as one word. Clearly, though, "TWITTER" is the best clue, associating as it does with computer stuff ("screen," "server") and with birds ("robin").
How does the model compare?
answers = ["robin", "screen", "server"]
bad = ["pole", "plate", "ground", "pupil", "iron", "novel"]
tabulate(candidates(answers, bad))
It's terrible! The over-indexing problem has basically spoiled the results. It's as if "screen" and "server" combined have so much mass that we get trapped in a gravity well far away from "robin."
You could imagine an interactive cluer's aid that allowed you to travel toward one target and away from the others. Indeed, a version of the model that arbitrarily weights "robin" as two or three times more important than "screen" and "server" ends up with slightly more interesting clues like "webmaster" (perhaps a person named Robin?), but still doesn't deliver "twitter." (Changing the constant $c$ above from 4.0 to 3.5 brings "twitter" into the 7th position—perhaps by increasing the universe of possible clues?—though at the expense of worse overall performance with other boards.)
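One way to implement that kind of weighting, as a sketch (the weights dictionary is a hypothetical knob, not something from the code above):
def weighted_minimax(word, answers, bad, weights):
    # Like minimax, but each target's distance is scaled by a per-word weight,
    # so being far from a heavily weighted target (e.g. "robin") hurts more
    if word in answers + bad:
        return -999
    return (min(distance(word, b) for b in bad)
            - max(weights.get(a, 1.0) * distance(word, a) for a in answers))

# Treat "robin" as three times as important as the other targets
weighted_minimax("webmaster", ["robin", "screen", "server"],
                 ["pole", "plate", "ground", "pupil", "iron", "novel"], {"robin": 3.0})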
A simple vector space model using cosine similarities can dig up human-level clues at least some of the time.
There's an over-indexing problem: words that happen to be very close to one or two of the targets will rank highly even when they're far away from the third. Minimizing the maximum distance from any target helps mitigate but doesn't entirely solve this problem.
There are some triplets that humans can cleverly connect with words that are rarely used in similar contexts, but which make sense when you think about them. Since the computer doesn't think, it doesn't generate those clues.
In general, the model's rankings are a little noisy—the 11th result is often no better than its 91st—but at a coarser level, it sorts its candidates remarkably well. If you're willing to do a little sifting, the top 100 or so results can include surprisingly good clues.
I wasn't expecting that. I thought the vector space model was a neat way of describing the Codenames problem, but I had little faith that I'd be able to write an actually useful program with it. It's strange, almost magical, that so much meaning can be baked into a list of coordinates.
Github gist: https://gist.github.com/jsomers/1bb5e197dec221714df250e72265a301
Thanks to Todd, Rob, and Wilson for ideas that vastly improved the model, and for feedback on the post. Any remaining dumbness is mine.
A helpful post that got me started: https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db
Mobile-friendly Jupyter CSS taken from nbviewer.