{"id":109,"date":"2009-08-23T21:10:24","date_gmt":"2009-08-24T03:10:24","guid":{"rendered":"http:\/\/jsomers.net\/blog\/offline-wikipedia"},"modified":"2009-08-24T23:31:03","modified_gmt":"2009-08-25T05:31:03","slug":"offline-wikipedia","status":"publish","type":"post","link":"https:\/\/jsomers.net\/blog\/offline-wikipedia","title":{"rendered":"Offline Wikipedia"},"content":{"rendered":"<p>This very cool and prolific hacker, <a href=\"http:\/\/users.softlab.ntua.gr\/~ttsiod\/\">Thanassis Tsiodras<\/a>, wrote an article a few years back explaining <a href=\"http:\/\/users.softlab.ece.ntua.gr\/~ttsiod\/buildWikipediaOffline.html\">how to build a fast offline reader for Wikipedia<\/a>. I have it running on my machine right now, and it's awesome.<\/p>\n\n<p>The trouble is, his instructions don't <em>quite<\/em> get the job done; they require modification in some key places before everything will actually work, at least on my MacBook Pro (Intel) running OS X Leopard. So I thought I'd do a slightly friendlier step-by-step with all the missing bits filled in.<\/p>\n\n<h3>Requirements<\/h3>\n\n<ul>\n<li>About <strong>15GB<\/strong> of free space. This is mostly for the Wikipedia articles themselves.<\/li>\n<li><strong>5-6 hours<\/strong>. The part that takes longest is partially decompressing and then indexing the bzipped Wikipedia files.<\/li>\n<li>Mac OS X Developer tools, <em>with<\/em> the optional Unix command line stuff. For example, you need to have <code>gcc<\/code> to get this running. These can be found on one of the Mac OS X installation DVDs.<\/li>\n<\/ul>\n\n<h3>Laying the groundwork<\/h3>\n\n<ol>\n<li>Get the Mac OS X Developer tools, including the Unix command line tools. If you don't have these installed, nothing else will work. Luckily, these are easy to install from the Mac OS X system DVDs that came with your computer. 
If you've lost the DVDs, head <a href=\"http:\/\/developer.apple.com\/technology\/xcode.html\">here<\/a> to download the files; you'll need to set up a (free) ADC (Apple Developer Connection) account to actually start the download.<\/li>\n<li>Get the latest copy of Wikipedia! Look for the \"enwiki-latest-pages-articles.xml.bz2\" link on <a href=\"http:\/\/download.wikimedia.org\/enwiki\/latest\/\">this page<\/a>. It's a big file, so be prepared to wait. <strong>Do not expand this<\/strong>, since the whole point of Thanassis's tool is that you can use the compressed version.<\/li>\n<li>Get the actual \"offline wikipedia\" package <a href=\"http:\/\/users.softlab.ece.ntua.gr\/~ttsiod\/offline.wikipedia.tar.bz2\">(download it here)<\/a>. This has all of the custom code that Thanassis wrote to glue the whole thing together.<\/li>\n<li>Get Xapian, which is used to do the indexing <a href=\"http:\/\/oligarchy.co.uk\/xapian\/1.0.14\/xapian-core-1.0.14.tar.gz\">(download it here)<\/a>.<\/li>\n<li>Set up Xapian. After you've expanded the <code>tar.gz<\/code> file, <code>cd<\/code> into the newly created <code>xapian-core-1.0.xx<\/code> directory. As with most *nix packages, type <code>sudo .\/configure<\/code>, <code>sudo make<\/code>, and <code>sudo make install<\/code> to get it cooking.<\/li>\n<li>Get Django, which is used as the local web server. You can follow their quick &amp; easy <a href=\"http:\/\/www.djangoproject.com\/download\/\">instructions for getting that set up here<\/a>.<\/li>\n<li>Get the \"mediawiki_sa\" parser\/renderer <a href=\"http:\/\/users.softlab.ece.ntua.gr\/~ttsiod\/mediawiki_sa.tar.7z\">here<\/a>. 
To expand that particular file you'll need the 7z unzipper, which you can <a href=\"http:\/\/www.macupdate.com\/download.php\/19139\/eZ7zv0.845.zip\">download here<\/a>.<\/li>\n<li>Get LaTeX for Mac <a href=\"http:\/\/mirror.ctan.org\/systems\/mac\/mactex\/MacTeX.mpkg.zip\">here<\/a>.<\/li>\n<\/ol>\n\n<h3>Building it<\/h3>\n\n<p>Once you have that ridiculously large set of software tools all set up on your computer, you should be ready to configure and build the Wikipedia reader.<\/p>\n\n<p>The first thing you'll need to do is to move the still-compressed <code>enwiki-latest-pages-articles.xml.bz2<\/code> file into your <code>offline.wikipedia\/wiki-splits<\/code> directory.<\/p>\n\n<p>But you have to make sure to tell the offline.wikipedia Makefile what you've done, so open up <code>offline.wikipedia\/Makefile<\/code> in your favorite text editor and change the <code>XMLBZ2 = ...<\/code> line at the top so that it reads \"<code>XMLBZ2 = enwiki-latest-pages-articles.xml.bz2<\/code>\".<\/p>\n\n<p>Next, take that parser\/renderer you downloaded and expanded in step 7 above, and move it into the <code>offline.wikipedia<\/code> directory.<\/p>\n\n<p>Again, you have to tell the Makefile what you've done--so open it up again (<code>offline.wikipedia\/Makefile<\/code>) and <em>delete<\/em> the line (#21) that starts<\/p>\n\n<pre>@svn co http:\/\/fslab.de\/svn\/wpofflineclient\/trunk\/mediawiki_sa\/ mediawiki_sa...<\/pre>\n\n<p>You don't need that anymore (and it wouldn't have worked anyway).<\/p>\n\n<p>With that little bit of tweaking, you should be able to successfully build the reader. Type <code>sudo make<\/code> in the <code>offline.wikipedia<\/code> directory. 
You should see some output indicating that you're \"Splitting\" and \"Indexing.\" The indexing will take a few hours, so at this point you ought to get a cup of coffee or some such.<\/p>\n\n<h3>Finishing up and fixing the math renderer<\/h3>\n\n<p>Even though the program will tell you that your offline version of Wikipedia is ready to run, it probably isn't. There are a couple of little settings you need to change before it will work. (Although I'd give it a try first!)<\/p>\n\n<p>For one, you may need to change line 64 in <code>offline.wikipedia\/mywiki\/show.pl<\/code>: just replace <code>php5<\/code> with <code>php<\/code>. Once you do that, you should be able to load Wikipedia pages without a hitch--which is to say, you're basically done.<\/p>\n\n<p>(If it doesn't work at this point, first carefully read the error message that you're seeing in the server console, and failing that, add a comment below and I'll try to help you out.)<\/p>\n\n<p>Trouble is, the mathematics rendering will probably be broken. That might not matter for your particular purposes, but if you plan to read <em>any<\/em> math at all, it's definitely something you'll need.<\/p>\n\n<p>What you have to do is <em>recompile<\/em> the <code>texvc<\/code> binary that's sitting in <code>offline.wikipedia\/mediawiki_sa\/math<\/code>. But first, you'll need a few more programs:<\/p>\n\n<ol>\n<li>Get an OCaml binary <a href=\"http:\/\/caml.inria.fr\/pub\/distrib\/ocaml-3.11\/ocaml-3.11.1-intel.dmg\">here<\/a>.<\/li>\n<li>You need ImageMagick. This is an extremely large pain in the ass to install, <em>unless<\/em> you have MacPorts. So:\n<ul>\n<li>Get MacPorts <a href=\"http:\/\/svn.macports.org\/repository\/macports\/downloads\/MacPorts-1.7.1\/MacPorts-1.7.1-10.5-Leopard.dmg\">here<\/a>.<\/li>\n<li>Once that's installed, all you need to do is type <code>sudo port install ImageMagick<\/code>. 
Boom!<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n<p>When that's all ready, head to the <code>offline.wikipedia\/mediawiki_sa\/math<\/code> directory. Then, delete all the files ending in <code>.cmi<\/code> or <code>.cmx<\/code>. Those are the by-products of the first compilation, and they can't be there when you run it again.<\/p>\n\n<p>All you have to do now is type <code>sudo make<\/code>. If all goes well it should finish without error and you should have a working TeX renderer. Just to make sure, type <code>.\/texvc<\/code>. If you don't see any errors or output, you're in good shape.<\/p>\n\n<p>Finally, I've styled my version of the reader up a bit (out of the box it's a little ugly). If you'd like to do the same, open up <code>offline.wikipedia\/mywiki\/show.pl<\/code> and add the following lines underneath the <code>&lt;\/form&gt;<\/code> tag on line 98:<\/p>\n\n<pre>   &lt;style type=\"text\/css\"&gt;\n        body {\n            font-family: Verdana;\n            font-size: 12.23px;\n            line-height: 1.5em;\n        }\n        a {\n            color: #1166bb;\n            text-decoration: none;\n        }\n        a:hover {\n            border-bottom: 1px solid #1166bb;\n        }\n        .reference {\n            font-size: 7.12px;\n        }\n        .references {\n            font-size: 10.12px;\n        }\n    &lt;\/style&gt;<\/pre>\n\n<p>Nothing too fancy--just some prettier type.<\/p>\n\n<h3>You're done!<\/h3>\n\n<p><em>Now<\/em> you should have a working version of the offline Wikipedia. It's incredible: in a bus, plane, train, car, basement, Starbucks (where they make you <em>pay<\/em> for Internet), classroom, cabin, or mountaintop, you can <em>still<\/em> have access to the world's best encyclopedia.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This very cool and prolific hacker, Thanassis Tsiodras, wrote an article a few years back explaining how to build a fast offline reader for Wikipedia. I have it running on my machine right now, and it's awesome. 
The trouble is, his instructions don't quite get the job done; they require modification in some key places [&hellip;]<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-109","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/posts\/109","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/comments?post=109"}],"version-history":[{"count":7,"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/posts\/109\/revisions"}],"predecessor-version":[{"id":115,"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/posts\/109\/revisions\/115"}],"wp:attachment":[{"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/media?parent=109"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/categories?post=109"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jsomers.net\/blog\/wp-json\/wp\/v2\/tags?post=109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}