Offline Wikipedia

by James Somers, August 23, 2009

This very cool and prolific hacker, Thanassis Tsiodras, wrote an article a few years back explaining how to build a fast offline reader for Wikipedia. I have it running on my machine right now, and it’s awesome.

The trouble is, his instructions don’t quite get the job done; they require modification in some key places before everything will actually work, at least on my MacBook Pro (Intel) running OS X Leopard. So I thought I’d do a slightly friendlier step-by-step with all the missing bits filled in.

Requirements

  • About 15GB of free space. This is mostly for the Wikipedia articles themselves.
  • 5-6 hours. The part that takes longest is partially decompressing and then indexing the bzipped Wikipedia files.
  • Mac OS X Developer tools, with the optional Unix command line stuff. For example, you need to have gcc to get this running. These can be found on one of the Mac OS X installation DVDs.

Laying the groundwork

  1. Get the Mac OS X Developer tools, including the Unix command line tools. If you don’t have these installed, nothing else will work. Luckily, these are easily installable off of the Mac OS X system DVDs that came with your computer. Head here if you’ve lost the DVDs to download the files; you’ll need to set up an ADC (Apple Developer Connection) account (free) to actually start the download.
  2. Get the latest copy of Wikipedia! Look for the “enwiki-latest-pages-articles.xml.bz2” link on this page. It’s a big file, so be prepared to wait. Do not expand this, since the whole point of Thanassis’s tool is that you can use the compressed version.
  3. Get the actual “offline wikipedia” package (download it here). This has all of the custom code that Thanassis wrote to glue the whole thing together.
  4. Get Xapian, which is used to do the indexing (download it here).
  5. Set up Xapian. After you’ve expanded the tar.gz file, cd into the newly created xapian-core-1.0.xx directory. As with most other *nixy packages, type sudo ./configure, sudo make, and sudo make install to get it cooking.
  6. Get Django, which is used as the local web server. You can follow their quick & easy instructions for getting that set up here.
  7. Get the “mediawiki_sa_” parser/renderer here. To expand that particular file you’ll need the 7z unzipper, which you can download here.
  8. Get LaTeX for Mac here.

Building it

Once you have that ridiculously large set of software tools all set up on your computer, you should be ready to configure and build the Wikipedia reader.

The first thing you’ll need to do is to move the still-compressed enwiki-latest-pages-articles.xml.bz2 file into your offline.wikipedia/wiki-splits directory.

But you have to make sure to tell the offline.wikipedia Makefile what you’ve done, so open up offline.wikipedia/Makefile in your favorite text editor and change the XMLBZ2 = ... top line so that it reads “XMLBZ2 = enwiki-latest-pages-articles.xml.bz2”.
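
If you’d rather script that edit than do it by hand, here’s a minimal sketch as a shell function. The name set_xmlbz2 is made up for this example, and it assumes the Makefile has a line beginning with XMLBZ2 followed by a plain = sign. It keeps a copy of the original as Makefile.bak in case something goes wrong.

```shell
# Hypothetical helper: point the Makefile's XMLBZ2 variable at a dump file.
# Usage: set_xmlbz2 path/to/Makefile dump-file-name
set_xmlbz2() {
    makefile="$1"
    dump="$2"
    # Keep a backup, then rewrite from the backup into the original path.
    # This sidesteps the differences between BSD and GNU "sed -i".
    cp "$makefile" "$makefile.bak" &&
        sed "s|^XMLBZ2 *=.*|XMLBZ2 = $dump|" "$makefile.bak" > "$makefile"
}
```

From the directory that contains offline.wikipedia, you’d call it as set_xmlbz2 offline.wikipedia/Makefile enwiki-latest-pages-articles.xml.bz2.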

Next, take that parser/renderer you downloaded and expanded in step 7 above, and move it into the offline.wikipedia directory.

Again, you have to tell the Makefile what you’ve done–so open it up again (offline.wikipedia/Makefile) and delete the line (#21) that starts

@svn co http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa/ mediawiki_sa...

You don’t need that anymore (and it wouldn’t have worked anyway).
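
The same deletion can be sketched in shell, if you prefer. This version matches the line by its content rather than by line number, on the assumption that your copy of the Makefile might not have it on exactly line 21. The function name remove_svn_checkout is my own invention, and the original file is saved as Makefile.bak.

```shell
# Hypothetical helper: drop the "svn co ... mediawiki_sa" line from a Makefile.
# Usage: remove_svn_checkout path/to/Makefile
remove_svn_checkout() {
    makefile="$1"
    # Back up first, then copy every line except the svn checkout back
    # into the original path.
    cp "$makefile" "$makefile.bak" &&
        grep -v 'svn co .*mediawiki_sa' "$makefile.bak" > "$makefile"
}
```

Run it as remove_svn_checkout offline.wikipedia/Makefile, then open the file to double-check that only that one line is gone.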

With that little bit of tweaking, you should be able to successfully build the reader. Type sudo make in the offline.wikipedia directory. You should see some output indicating that you’re “Splitting” and “Indexing.” The indexing will take a few hours, so at this point you ought to get a cup of coffee or some such.

Finishing up and fixing the math renderer

Even though the program will tell you that your offline version of Wikipedia is ready to run, it probably isn’t. There are a couple of little settings you need to change before it will work. (Although I’d give it a try first!)

For one, you may need to change line 64 in offline.wikipedia/mywiki/show.pl: just replace php5 with php. Once you do that, you should be able to load Wikipedia pages without a hitch–which is to say, you’re basically done.
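
Here’s a rough sketch of that one-line change as a shell function, for anyone who’d rather not open an editor. It only touches line 64, so it assumes your copy of show.pl has php5 on that line, as mine did. The name use_plain_php is made up, and the untouched file is kept as show.pl.bak.

```shell
# Hypothetical helper: replace "php5" with "php" on line 64 of a script.
# Usage: use_plain_php path/to/show.pl
use_plain_php() {
    script="$1"
    # The "64" address restricts the substitution to that single line;
    # any other mention of php5 in the file is left alone.
    cp "$script" "$script.bak" &&
        sed '64s/php5/php/' "$script.bak" > "$script"
}
```

You’d run it as use_plain_php offline.wikipedia/mywiki/show.pl. If nothing changes, check which line actually mentions php5 in your copy.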

(If it doesn’t work at this point, first carefully read the error message that you’re seeing in the server console, and failing that, add a comment below and I’ll try to help you out.)

Trouble is, the mathematics rendering will probably be broken. That might not matter for your particular purposes, but if you plan to read any math at all, it’s definitely something you’ll need.

What you have to do is recompile the texvc binary that’s sitting in offline.wikipedia/mediawiki_sa/math. But first, you’ll need a few more programs:

  1. Get an OCaml binary here.
  2. You need ImageMagick. This is an extremely large pain in the ass to install, unless you have MacPorts. So:
    • Get MacPorts here.
    • Once that’s installed, all you need to do is type sudo port install ImageMagick. Boom!

When that’s all ready, head to the offline.wikipedia/mediawiki_sa/math directory. Then, delete all the files ending in .cmi or .cmx. Those are the by-products of the first compilation, and they can’t be there when you run it again.
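
The cleanup step is easy to get slightly wrong by hand, so here’s a minimal sketch of it. The function name clean_texvc_objects is hypothetical, and it only removes .cmi and .cmx files sitting directly in the directory you give it, leaving the sources and the Makefile alone.

```shell
# Hypothetical helper: remove leftover OCaml build products (.cmi, .cmx)
# from the texvc source directory before recompiling.
# Usage: clean_texvc_objects path/to/mediawiki_sa/math
clean_texvc_objects() {
    mathdir="$1"
    # -f keeps rm quiet if one of the patterns matches nothing.
    rm -f "$mathdir"/*.cmi "$mathdir"/*.cmx
}
```

For this setup the call would be clean_texvc_objects offline.wikipedia/mediawiki_sa/math.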

All you have to do now is type sudo make. If all goes well it should finish without error and you should have a working TeX renderer. Just to make sure, type ./texvc. If you don’t see any errors or output, you’re in good shape.

Finally, I’ve styled my version of the reader up a bit (out of the box it’s a little ugly). If you’d like to do the same, open up offline.wikipedia/mywiki/show.pl and add the following lines underneath the </form> tag on line 98:

    <style type="text/css">
        body {
            font-family: Verdana;
            font-size: 12.23px;
            line-height: 1.5em;
        }
        a {
            color: #1166bb;
            text-decoration: none;
        }
        a:hover {
            border-bottom: 1px solid #1166bb;
        }
        .reference {
            font-size: 7.12px;
        }
        .references {
            font-size: 10.12px;
        }
    </style>

Nothing too fancy–just some prettier type.

You’re done!

Now you should have a working version of the offline Wikipedia. It’s incredible: in a bus, plane, train, car, basement, Starbucks (where they make you pay for Internet), classroom, cabin, or mountaintop, you can still have access to the world’s best encyclopedia.