the jsomers.net blog.

Wandering the web stacks to improve your information diet

Alarm bells should go off any time you open your web browser, poke around for a few minutes, and close it feeling roughly like you had turned on the television and found that "there's nothing on." Because if that happens, it means either that the Internet has somehow become less dynamic and informationally rich than is commonly thought or, more likely, that you're now using it like the library patron who, having fallen in love with the periodicals section, has forgotten all about the books.

It's surprisingly easy to let this happen. All you have to do is curate a collection of channels -- blogs, magazines, newspapers, social news sites, twitter users, etc. -- sufficiently abundant and interesting that you can afford to stop looking. At that point, information-satisfaction is merely a matter of consuming your feed. The wider web might as well not exist.

Think again of the library. Imagine if all you did there was peruse the latest editions of your favorite periodicals. Certainly you could keep yourself busy this way, but there would be times, no doubt, when there's just nothing new. And if it doesn't occur to you to wander the stacks, you'll probably just leave.

So it goes with the web. Which leads us to the obvious question: what's the Internet equivalent of a book?

If I had my way the answer might well be "books," but until those become openly available we'll have to settle for something else. Here are a few candidates:

  • Professors' home pages. Find the X department at a good university, where X is some subject you're interested in, and odds are there will be at least one professor there with a gigantic, ugly web page full of links to top-notch papers, blogs, essays, books, and other, even bigger pages of links. Examples: Cosma Shalizi, John Lawler, John Baez, Timothy Gowers, Tyler Cowen, etc.
  • Along the same lines, I have recently discovered that grad students at many universities, especially if they're in small-ish departments, will collectively run a group blog. See for example Go Grue, run by philosophy grad students at the University of Michigan.
  • If you chase around the blogrolls of good blogs, you'll not only get a feel for the online territory staked out by some discipline, but you'll also start to notice that a few really strong blogs show up in nearly everyone's list.
  • Surf Wikipedia, of course, but follow the footnotes. Most Wikipedia articles draw heavily on one or two much more comprehensive sources. For example, many philosophy articles either point to or are in some sense abridged versions of pages at the Stanford Encyclopedia of Philosophy. I'm not sure but I would imagine other fields would have a few of their own go-to resources.
  • Classic short fiction and nonfiction is very easy to find on the web. Google "[famous author name] essays" and you'll likely find a page like this (for Montaigne) or this (for Orwell).
  • Pay attention to people who are themselves voracious readers, and ask them what they read.
  • Many law reviews post the full text of each issue online. See Harvard's, for instance. The "articles" and "essays" are mostly by professors, but the "notes" are written by students and are generally more accessible.
  • Find "classic" academic papers, especially in fields with which you're unfamiliar. Most classics are so called because they are especially readable, on top of being major contributions to the state of the art. One way to find them is through profs' pages, as above, or by googling "classic papers in X". In some fields, like cognitive science, people have made lists for just this purpose. Or you can look at the syllabi of introductory grad school courses, citation counts at Google Scholar, etc.

So if ever you catch yourself closing your browser after an unsatisfying short session -- and you still have time to consume, rather than create -- realize that you're probably turning your back on the stacks. The wonderful, wanderable stacks.

Rule 110, or, how a little orange dot becomes a universal computer

See jsomers.net/110. Won't take all of five minutes but I think it's neat.

By inspection

Mathematicians will sometimes say that a result follows "by inspection," meaning roughly that its proof is so straightforward that to work it out would almost be patronizing. [1]

It's a useful phrase to have in mind, and it's led me to a somewhat strange idea—that science, and I mean this quite generally, has developed as though one of its goals was in fact to make more and more things provable that way, "by inspection."

Consider, for example, how thousands of years ago one needed all kinds of clever arguments to convince someone that the Earth was round, whereas today you can just send them to space and have them look. Or how a student on his laptop can now brute-force a problem that once required tricky mathematics. Or how IBM can take 3D pictures of atomic bonds whose structure was always deduced, never seen.

New technologies, then, seem to serve twin purposes: one, to generate more inputs to an inferential model, like telescopes do for cosmologists and accelerators do for physicists, and two, to replace these models with simple pictures of the truth.

So we can imagine a future in which doctors see diseases, rather than teasing them out based on patients' symptoms or history; where chain-termination DNA sequencing is replaced by high-resolution radiography; where we explore the deep oceans instead of just guessing what's there.

Think of how J. Craig Venter's team discovered tens of thousands of new organisms in a barrel of seawater that marine microbiologists said would be empty. [2] The difference was that he had tools—for rapid DNA sequencing, mostly—that they didn't. He could see more. [3]

* * *

Footnotes

[1] Example: suppose we're playing a game of chess, and we inform our opponent—we claim—that he has been checkmated. We could be highly rigorous and say, "in particular, you cannot move here because of X, nor here because of Y, nor here...", but we would be insulting his intelligence. So instead we just leave it to him to inspect each of the possible moves until he sees, quite plainly, that there is no way out.

[2] See p. 40 of this awesome book of conversations / talks about life (think biology, not philosophy) [PDF, via edge.org].

[3] In the same way, individual minds have a way of moving from slow, heavily (and cleverly) deductive inferential procedures to a kind of rapid inspection. That is, when you learn a great deal about something, you are in some sense just building highly defined manipulable mental objects, ones that you can look at, bend, and combine at will. It's like how expert chess players, instead of meticulously playing out lines, "see the whole board" and fluidly explore it, or how expert coders feel like they can "hold the program in their head."

This ability is hard-earned. Which is why it's probably not a good idea to, say, forget about mathematics because we can now compute so quickly, or rely on imaging machines at the expense of clinical expertise. If we look for shortcuts we'll likely get lost.

But it's worth recognizing the power of richer mental pictures—that they are worth more than just the work put in to build them.

Offline Wikipedia

This very cool and prolific hacker, Thanassis Tsiodras, wrote an article a few years back explaining how to build a fast offline reader for Wikipedia. I have it running on my machine right now, and it's awesome.

The trouble is, his instructions don't quite get the job done; they require modification in some key places before everything will actually work, at least on my Macbook Pro (Intel) running OS X Leopard. So I thought I'd do a slightly friendlier step-by-step with all the missing bits filled in.

Requirements

  • About 15GB of free space. This is mostly for the Wikipedia articles themselves.
  • 5-6 hours. The part that takes longest is partially decompressing and then indexing the bzipped Wikipedia files.
  • Mac OS X Developer tools, with the optional Unix command line stuff. For example, you need to have gcc to get this running. These can be found on one of the Mac OS X installation DVDs.

Laying the groundwork

  1. Get the Mac OS X Developer tools, including the Unix command line tools. If you don't have these installed, nothing else will work. Luckily, these are easily installable off of the Mac OS X system DVDs that came with your computer. Head here if you've lost the DVDs to download the files; you'll need to set up an ADC (Apple Developer Connection) account (free) to actually start the download.
  2. Get the latest copy of Wikipedia! Look for the "enwiki-latest-pages-articles.xml.bz2" link on this page. It's a big file, so be prepared to wait. Do not expand this, since the whole point of Thanassis's tool is that you can use the compressed version.
  3. Get the actual "offline wikipedia" package (download it here). This has all of custom code that Thanassis wrote to glue the whole thing together.
  4. Get Xapian, which is used to do the indexing (download it here).
  5. Set up Xapian. After you've expanded the tar.gz file, cd into the newly created xapian-core-1.0.xx directory. Like every other *nixy package, type sudo ./configure, sudo make, and sudo make install to get it cooking.
  6. Get Django, which is used as the local web server. You can follow their quick & easy instructions for getting that set up here.
  7. Get the "mediawiki_sa_" parser/renderer here. To expand that particular file you'll need the 7z unzipper, which you can download here.
  8. Get LaTeX for Mac here.

Building it

Once you have that ridiculously large set of software tools all set up on your computer, you should be ready to configure and build the Wikipedia reader.

The first thing you'll need to do is to move the still-compressed enwiki-latest-pages-articles.xml.bz2 file into your offline.wikipedia/wiki-splits directory.

But you have to make sure to tell the offline.wikipedia Makefile what you've done, so open up offline.wikipedia/Makefile in your favorite text editor and change the XMLBZ2 = ... top line so that it reads "XMLBZ2 = enwiki-latest-pages-articles.xml.bz2".

Next, take that parser/renderer you downloaded and expanded in step 7 above, and move it into the offline.wikipedia directory.

Again, you have to tell the Makefile what you've done--so open it up again (offline.wikipedia/Makefile) and delete the line (#21) that starts

@svn co http://fslab.de/svn/wpofflineclient/trunk/mediawiki\_sa/ mediawiki_sa...

You don't need that anymore (and it wouldn't have worked anyway).

With that little bit of tweaking, you should be able to successfully build the reader. Type sudo make in the offline.wikipedia directory. You should see some output indicating that you're "Splitting" and "Indexing." The indexing will take a few hours, so at this point you ought to get a cup of coffee or some such.

Finishing up and fixing the math renderer

Even though the program will tell you that your offline version of Wikipedia is ready to run, it probably isn't. There are a couple of little settings you need to change before it will work. (Although I'd give it a try first!)

For one, you may need to change line 64 in offline.wikipedia/mywiki/show.pl: just replace php5 with php. Once you do that, you should be able to load Wikipedia pages without a hitch--which is to say, you're basically done.

(If it doesn't work at this point, first carefully read the error message that you're seeing in the server console, and failing that, add a comment below and I'll try to help you out).

Trouble is, the mathematics rendering will probably be broken. That might not matter for your particular purposes, but if you plan to read any math at all, it's definitely something you'll need.

What you have to do is recompile the texvc binary that's sitting in offline.wikipedia/mediawiki_sa/math. But first, you'll need a few more programs:

  1. Get an OCaml binary here.
  2. You need ImageMagick. This is an extremely large pain in the ass to install, unless you have MacPorts. So:

* Get MacPorts here. * Once that's installed, all you need to do is type sudo port install ImageMagick. Boom!

When that's all ready, head to the offline.wikipedia/mediawiki_sa/math directory. Then, delete all the files ending in .cmi or .cmx. Those are the by-products of the first compilation, and they can't be there when you run it again.

All you have to do now is type sudo make. If all goes well it should finish without error and you should have a working TeX renderer. Just to make sure, type ./texvc. If you don't see any errors or output, you're in good shape.

Finally, I've styled my version of the reader up a bit (out of the box it's a little ugly). If you'd like to do the same, open up offline.wikipedia/mywiki/show.pl and add the following lines underneath the </form> tag on line 98:

   <style type="text/css">
        body {
            font-family: Verdana;
            font-size: 12.23px;
            line-height: 1.5em;
        }
        a {
            color: #1166bb;
            text-decoration: none;
        }
        a:hover {
            border-bottom: 1px solid #1166bb;
        }
        .reference {
            font-size: 7.12px;
        }
        .references {
            font-size: 10.12px;
        }
    </style>

Nothing too fancy--just some prettier type.

You're done!

Now you should have a working version of the offline wikipedia. It's incredible: in a bus, plane, train, car, basement, Starbucks (where they make you pay for Internet), classroom, cabin, or mountaintop, you can still have access to the world's best encyclopedia.

Anatomy of a Project Euler problem, and, some remarks on the OEIS

[This post turned into a kind of narrative about solving a particular problem, which I'm happy with, but if you're interested in my general remarks about the online encyclopedia of integer sequences (the original intention of this post) jump here.]

Over the weekend I had a chance to work on a few Project Euler problems. I think of them as little "mental workouts," in that they provide a somewhat frictionless (controlled) environment in which to practice math and hacking. And like physical workouts, PE problems are satisfying in direct proportion to their difficulty.

Problem 106, reproduced here, has been one of my all-time favorites (of the 111 I've solved):

Let S(A) represent the sum of elements in set A of size n. We shall call it a special sum set if for any two non-empty disjoint subsets, B and C, the following properties are true:

  1. S(B) ≠ S(C); that is, sums of subsets cannot be equal.
  2. If B contains more elements than C then S(B) > S(C).

For this problem we shall assume that a given set contains n strictly increasing elements and it already satisfies the second rule.

Surprisingly, out of the 25 possible subset pairs that can be obtained from a set for which n = 4, only 1 of these pairs need to be tested for equality (first rule). Similarly, when n = 7, only 70 out of the 966 subset pairs need to be tested.

For n = 12, how many of the 261625 subset pairs that can be obtained need to be tested for equality?

(If you're interested in the rest of this post, I encourage you to read the description again, at least until you have some idea of what it's about; it's a bit of a mouthful the first time through. And if you're interested in solving the problem, by all means do it -- but be sure to come back when you've finished.)

The first thing that struck me about this question is that it doesn't feel well-defined. What does it mean for us to "need" to test a pair? Isn't that a judgment call?

My first step, then, was to print out the 25 subset pairs where n = 4; the idea would be to go through them and cross out the ones that "obviously" couldn't be equal, and see if there was a pattern to it.

I immediately discovered one rule: pairs of singletons can't possibly have equal "sums." Since the sets are disjoint by assumption, any member of one can't be part of the other -- which, when your sets only have one member each, implies a strict inequality. Simple.

The next rule, too, follows directly from a fact mentioned in the problem description, namely, that property #2 is satisfied. So we can throw out any pair containing sets of different sizes, because these can't be equal (by assumption).

Now it turns out that when you account for those two rules, you're only left with three pairs -- out of which only one "needs" to be tested, according to the problem description. They are:

[1, 2] [3, 4]
[2, 3] [1, 4]
[1, 3] [2, 4]

Can you see the rule?

Well, in the first and third cases, set A is strictly dominated by set B: if you move through both sets simultaneously, proceeding from left to right, the number you choose from set B is in every case greater than its partner from set A. So set B unequivocally has the larger sum.

With that in mind, it looks like we have all of the rules for telling us exactly how many pairs we need to check. But before we jump all the way up to n = 12, it makes sense to code up a little script on n = 7 to make sure our rules eliminate all but 70 of the 966 possible subset pairs (as we're told in the problem description). Here's a little code that does this (using a fast powerset() function pasted to the Python mailing list):

s = [1, 2, 3, 4, 5, 6, 7]
subsets = [list(A) for A in list(powerset(s))]

ct = 0
# Enumerate all possible pairs, then filter:
for i, s in enumerate(subsets[1:]):
    for j, t in enumerate(subsets[i:]):
        if len((s + t)) == len(set(s + t)): # Disjoint
            if len(s) == len(t) and len(s) > 1: # Rules 1 and 2
                diffs = [a[1] - a[0] for a in zip(s, t)]
                if len(filter(lambda x: x <= 0, diffs)) > 0: # Rule 3
                    ct += 1
print ct

It prints 70, which verifies that our rules are sound.

Unfortunately, though, we can't just add five elements to s and be done with it. When n = 12, the power set of s contains 212 elements, which means that if we "enumerated all possible pairs," as in the script above, we'd run through our loop (212)2 times, which would take something like 9 hours to complete on a computer like mine.

So the problem comes down to our being able to find (a) the number of pairs Px of disjoint, equally-sized subsets for sizes x = 2, ..., 6 (or 12/2), and (b) subtract from each of these Px the number of pairs that are "strictly dominated," in the sense above.

It took me a long time to figure out how to even approach (a). What I finally did was to make a table, for n = 4 and x = 2, of all the possible subset pairs; I'd then cross out the ones that weren't disjoint and try to find a pattern. So I started to draw something like this:

         (1, 2) (1, 3) ... (6, 7)
        +------+------+---+------
(1, 2)  |
(1, 3)  |
 ...    |
(6, 7)  |

which was going to be filled with X's in every spot where two sets contained the same element. But I stopped way early, because in the process of filling out my first row, I realized that the pair (1, 2) had essentially "blotted out" exactly two possibilities (1 and 2) from my list, leaving 3, 4, 5, 6, and 7. Which meant that if I was going to ensure "disjointness," all I had to do was figure out the number of ways to select 2 elements out of the 5 remaining -- which is just "5 choose 2", or, 5!/2!(5-2)!.

To figure out my total P2, then, I just had to multiply by the number of rows, which turns out to be "7 choose 2", or as a Python function, nCr(7, 2).

I checked my work using the lists I had printed out earlier, and it turned out that I was always off by a factor of two: I was double-counting. In terms of the table, I had forgotten that you only had to look above (or below) the diagonal.

With that done, all that was left was to subtract out the number of strictly dominated pairs.

But this was a lot more difficult than it sounds.

To restate the problem concretely, let's say N = 7 and x = 2. What we're after is the number of ways of selecting two pairs of elements from our 7-element list such that each of the guys from pair A has a partner in pair B that is strictly greater than it. Visually, where our pair A is represented by stars below the numbers and pair B by lines above, what we want is something like this:

      _   _
1 2 3 4 5 6 7
  *     *

as opposed to this:

    _   _
1 2 3 4 5 6 7
      *   *

because in this second case, 6 (from pair A) isn't dominated by anyone from pair B.

My first stab at a systematic rule here was to try fixing two stars, like so:

1 2 3 4 5 6 7
  *   *

and to see how many ways there were to arrange two lines so that they strictly dominated the stars. Working with a pencil, I started drawing, erasing, and tallying.

This was a brutal mess. There were simply too many possibilities, and I wasn't at all clear about what would happen if I increased n or x; I couldn't find a system, and was even having trouble moving the lines systematically without losing track. Arranging the stars differently just muddled things up even more.

I went to dinner and tried to think a bit while eating. One idea I had was to write, above each number, the different star-numbers that would be dominated if I chose that number (i.e., if I drew a line above it) -- sort of like how people solve Sudoku puzzles by listing all the possibilities and propagating constraints.

I don't think that was a good idea, but it may have got me thinking, because when I got back to work to try it out I came up with something much better. It was the clearest "insight" of the day.

What I realized was that n doesn't matter, at least in one important sense. Say x = 2 and n = 1,000,000. For any arrangement of stars and lines, there are all kinds of ways to shift things around; probably millions of them. But for the most part, none of these moves will change the essential dominance relationship -- the ordering of lines and stars relative to one another. There are really only a fixed number of arrangements that matter, and increasing n just adds empty space.

In fact, we can systematically investigate every arrangement-that-matters by letting n = 4, which is small enough to work out by hand. Then, and here's the trick, anytime we increase n, all we need to think about is the number of ways of choosing 4 numbers from that set.

Let's say, just to be explicit about it, that there are 2 ways of arranging two stars and two lines around the list 1 2 3 4 such that the lines dominate the stars. (There are.) Then, if we let n = 7, or 12, or 2,000, all we need to do is calculate the number of ways of embedding our 4-member list in there -- given by nCr(n, 4) -- and multiply that number by 2. Done.

We do have a bit more work to do, though, because all we know precisely at this point is what happens when x = 2, and we need to consider the cases where x = 3, 4, 5, and 6, because those are all the possibilities when n = 12. (Singletons aren't allowed, and anything bigger than 6 runs out of space.)

I was able to calculate these values explicitly for x = 2, 3, 4, and roughly for x = 4 (I wasn't sure about the last one, because it required lots of drawing, erasing, and tallying). So I had a series that looked like 2:2, 3:5, 4:14, 5:~43. I could have tried to figure it out for x = 6, but I saw that the series was growing quickly and I wasn't too keen on making hundreds of tallies. But I also wasn't sure that I could figure out a formula; I assumed it would be some tricky recursive combinatorial expression, and frankly, I had a much simpler idea:

* * *

I went to the online encyclopedia of integer sequences, otherwise called "OEIS" or "Sloane's."

What it is is a repository of integer sequences (obviously) to which anyone can contribute. For each entry, there is

  • A list of the first twenty or so terms, along with links to more. There's even a link to an audio file where you can hear the terms spoken out loud.
  • Comments -- usually one or many "interpretations" of the list, assuming it has some number-theoretic, algebraic, or geometric significance.
  • Heavy-duty (i.e. peer-reviewed) references of papers that use the sequence in some way; occasionally this includes the paper that "discovered" it.
  • Links to online explanations or discussions.
  • Formulas and computer code (usually Maple and Mathematica) for producing it.
  • Cross references to related sequences.

The way I think about it is less as an encyclopedia of sequences per se than it is a kind of projection of mathematics onto sequence-space. The point being that an integer sequence, because it often has so many mathematical interpretations, acts in some sense like the intersection of those interpretations. There are of course many ways in which two or more mathematical concepts are linked, but a shared integer sequence is among the most useful, precisely because it can be browsed and searched like an encyclopedia.

It also helps that this encyclopedia is online -- and therefore hyperlinked -- and that its contributors and editors are mostly professional mathematicians or computer scientists. It makes for an incredible resource, one more modest than Wikipedia or IMDb but in some ways more interesting.

The chief difference, I think, is that one can more readily imagine someone using the OEIS to do original research. It's as much a rich set of raw data as it is a compendium of what's already known.

Of course it's possible to be misled by a particular entry, which is why one still has to do the math before taking a connection too seriously -- for instance, I bet that nearly any random sequence of five or six strictly increasing numbers less than 100 will show up on Sloane's; so it might not be prudent to, say, walk into an elevator, see which numbers are highlighted, look them up online, and assume you've found some connection between your building and the Stolarsky array read by antidiagonals. But it's a rich resource for those who know what to do with it.

* * *

So that's where I went to find the next member of my series (x = 6). Since I wasn't sure of my value for 5 (43), I just omitted it and searched for "2, 5, 14", which brought up the "Catalan numbers" as the first entry.

As you may gather from the length of that page, the Catalan sequence is quite important and admits to a whole slew of interpretations. One of these in particular jumps out to me as straightforwardly relevant to the problem at hand:

Ways of joining 2n points on a circle to form n nonintersecting chords.

All you have to do is think of "chords crossing" as "a violation of strict dominance," and the correspondence should be clear. In my mind it's a beautiful visualization of the idea.

In any case, it turns out that the 5th member of our series is going to be 132. Which means we're ready to solve the problem, because we now have a formula for determining the number of strictly dominated pairs given an array length n and subset pairs each of size x. It's simply the number of ways of embedding 2x elements in n, i.e., nCr(n, 2x), multiplied by what turns out to be the xth Catalan number for that embedded array of 2x points.

Setting n = 12 and ranging over the possible values for x, 2 through 6, gives our answer. Here's the code:

def fact(n):
    "factorial"
    return reduce(lambda x, y: x * y, range(1, n + 1))

def nCr(n, r):
    "n choose r"
    if n == r:
        return 1
    else:
        return fact(n) / (fact(r) * fact(n - r))

def x_with_x(n, x):
    "number of *disjoint* x-subset pairs one can form with an array of length n"
    if n - x >= x:
        return (nCr(n, x) * nCr(n - x, x)) / 2
    else:
        return 0

catalans = {2:2, 3:5, 4:14, 5:42, 6:132}

def dominated(n, x):
    "number of 'strictly dominated' x_with_x guys -- these need not be checked"
    return nCr(n, 2 * x) * catalans[x]

def need_to_check(n):
    "number of disjoint subset pairs we need to check for array length n"
    tot = 0
    for x in range(2, n // 2 + 1):
        tot += x_with_x(n, x) - dominated(n, x)
    return tot

print need_to_check(12)

It takes .27 seconds to run on my machine.