Offline Wikipedia
by James Somers, August 23, 2009
This very cool and prolific hacker, Thanassis Tsiodras, wrote an article a few years back explaining how to build a fast offline reader for Wikipedia. I have it running on my machine right now, and it's awesome.
The trouble is, his instructions don't quite get the job done; they require modification in some key places before everything will actually work, at least on my MacBook Pro (Intel) running OS X Leopard. So I thought I'd do a slightly friendlier step-by-step with all the missing bits filled in.
Requirements
- About 15GB of free space. This is mostly for the Wikipedia articles themselves.
- 5-6 hours. The part that takes longest is partially decompressing and then indexing the bzipped Wikipedia files.
- Mac OS X Developer tools, with the optional Unix command line stuff. For example, you need to have gcc to get this running. These can be found on one of the Mac OS X installation DVDs.
Laying the groundwork
- Get the Mac OS X Developer tools, including the Unix command line tools. If you don't have these installed, nothing else will work. Luckily, these are easily installable off of the Mac OS X system DVDs that came with your computer. If you've lost the DVDs, head here to download the files; you'll need to set up a free ADC (Apple Developer Connection) account to actually start the download.
- Get the latest copy of Wikipedia! Look for the "enwiki-latest-pages-articles.xml.bz2" link on this page. It's a big file, so be prepared to wait. Do not expand this, since the whole point of Thanassis's tool is that you can use the compressed version.
- Get the actual "offline wikipedia" package (download it here). This has all of the custom code that Thanassis wrote to glue the whole thing together.
- Get Xapian, which is used to do the indexing (download it here).
- Set up Xapian. After you've expanded the tar.gz file, cd into the newly created xapian-core-1.0.xx directory. Like every other *nixy package, type sudo ./configure, sudo make, and sudo make install to get it cooking (the full sequence is spelled out just after this list).
- Get Django, which is used as the local web server. You can follow their quick & easy instructions for getting that set up here.
- Get the "mediawiki_sa_" parser/renderer here. To expand that particular file you'll need the 7z unzipper, which you can download here.
- Get LaTeX for Mac here.
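For the record, the Xapian setup in step 5 boils down to the usual sequence below (just a sketch; "1.0.xx" stands in for whatever version you actually downloaded):

tar xzf xapian-core-1.0.xx.tar.gz
cd xapian-core-1.0.xx
sudo ./configure
sudo make
sudo make install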
Building it
Once you have that ridiculously large set of software tools all set up on your computer, you should be ready to configure and build the Wikipedia reader.
The first thing you'll need to do is to move the still-compressed enwiki-latest-pages-articles.xml.bz2 file into your offline.wikipedia/wiki-splits directory.
But you have to make sure to tell the offline.wikipedia Makefile what you've done, so open up offline.wikipedia/Makefile in your favorite text editor and change the XMLBZ2 = ... line at the top so that it reads "XMLBZ2 = enwiki-latest-pages-articles.xml.bz2".
Next, take that parser/renderer you downloaded and expanded in step 7 above, and move it into the offline.wikipedia directory.
Again, you have to tell the Makefile what you've done--so open it up again (offline.wikipedia/Makefile) and delete the line (#21) that starts @svn co http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa/ mediawiki_sa... You don't need that anymore (and it wouldn't have worked anyway).
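If you'd rather not hunt for the line number in an editor, a pattern-based delete does the same job (a sketch; it assumes the BSD sed that ships with OS X, and that no other line in the Makefile mentions wpofflineclient):

cd offline.wikipedia
sed -i '' '/svn co .*wpofflineclient/d' Makefile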
With that little bit of tweaking, you should be able to successfully build the reader. Type sudo make in the offline.wikipedia directory. You should see some output indicating that you're "Splitting" and "Indexing." The indexing will take a few hours, so at this point you ought to get a cup of coffee or some such.
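In other words, the whole build boils down to:

cd offline.wikipedia
sudo make    # splits the .bz2 into chunks, then indexes them; expect this to run for hours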
Finishing up and fixing the math renderer
Even though the program will tell you that your offline version of Wikipedia is ready to run, it probably isn't. There are a couple of little settings you need to change before it will work. (Although I'd give it a try first!)
For one, you may need to change line 64 in offline.wikipedia/mywiki/show.pl: just replace php5 with php. Once you do that, you should be able to load Wikipedia pages without a hitch--which is to say, you're basically done.
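If you prefer to make that edit from the command line, a one-liner like this should do it (a sketch; it swaps every standalone php5 in the script for php and leaves a .bak backup behind):

perl -pi.bak -e 's/\bphp5\b/php/g' offline.wikipedia/mywiki/show.pl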
(If it doesn't work at this point, first carefully read the error message that you're seeing in the server console, and failing that, add a comment below and I'll try to help you out).
Trouble is, the mathematics rendering will probably be broken. That might not matter for your particular purposes, but if you plan to read any math at all, it's definitely something you'll need.
What you have to do is recompile the texvc binary that's sitting in offline.wikipedia/mediawiki_sa/math. But first, you'll need a few more programs:
- Get an OCaml binary here.
- You need ImageMagick. This is an extremely large pain in the ass to install, unless you have MacPorts. So:
* Get MacPorts here.
* Once that's installed, all you need to do is type sudo port install ImageMagick. Boom!
When that's all ready, head to the offline.wikipedia/mediawiki_sa/math directory. Then, delete all the files ending in .cmi or .cmx. Those are the by-products of the first compilation, and they can't be there when you run it again.
All you have to do now is type sudo make. If all goes well it should finish without error and you should have a working TeX renderer. Just to make sure, type ./texvc. If you don't see any errors or output, you're in good shape.
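Pulled together, the texvc rebuild looks like this:

cd offline.wikipedia/mediawiki_sa/math
rm -f *.cmi *.cmx    # clear the leftovers from the first compilation
sudo make
./texvc              # silence (no output, no errors) means the renderer is good to go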
Finally, I've styled my version of the reader up a bit (out of the box it's a little ugly). If you'd like to do the same, open up offline.wikipedia/mywiki/show.pl and add the following lines underneath the </form> tag on line 98:
<style type="text/css">
  body { font-family: Verdana; font-size: 12.23px; line-height: 1.5em; }
  a { color: #1166bb; text-decoration: none; }
  a:hover { border-bottom: 1px solid #1166bb; }
  .reference { font-size: 7.12px; }
  .references { font-size: 10.12px; }
</style>
Nothing too fancy--just some prettier type.
You're done!
Now you should have a working version of the offline wikipedia. It's incredible: in a bus, plane, train, car, basement, Starbucks (where they make you pay for Internet), classroom, cabin, or mountaintop, you can still have access to the world's best encyclopedia.
Care to release a .app that makes this process less painstaking?
What happens when you want to update… since Wikipedia is constantly being added to, one would assume you’d want to keep your offline version up to date. Do you have to re-do everything every time you want to update? Or has he worked out a hack that will just update the new files?
I’ve actually only ever done this twice, each time going through all the steps (it was on two different computers). It’s certainly easier to only update the content, rather than re-downloading all of the applications etc., but it’ll still take a while.
In general, though, I don’t think you need to update all that often. Wikipedia as it is now has most of the classic, big, important articles—so for reference when you’re in a jam it does just fine.
Nice how-to. One little problem however: when trying to “make wikipedia”:
“error: xapian.h: No such file or directory”
I have Xapian ‘sudo make install’-ed in /usr/local/ (“xapian.h” is at “/usr/local/include/xapian-1.1/xapian.h”).
How do I make it visible/accessible to the installer?
Thanks
I have been trying to get this to work, but I seem to have run into some kind of problem. For starters, I have done everything up to editing the show.pl file where I replaced php5 by php. The make went just fine, the files are nicely split and I can access the server. Great. But each and every page I visit, apart from the search results, tells me that the woc.fslab.de parser failed.
I took a look at the site from Thanassis and I tried to render the bonsai.html page. Tough luck, since this did not work. The error I get is “Parse error: syntax error, unexpected T_NAMESPACE, expecting T_STRING in /Users/kimbauters/Documents/offline.wikipedia/mediawiki_sa/includes/Namespace.php on line 46”, which coincides with the beginning of the class in the Namespace.php file. This has left me dead in my tracks since I have no idea how to resolve this. Since I am using Snow Leopard, my PHP version is the latest and greatest, PHP 5.3.0. Clearly, this should be able to cope just fine with classes in PHP.
Any ideas on how to resolve this issue or am I overlooking something?
This problem seems to have been fixed and was caused by “Namespace” being a reserved keyword in PHP 5.3.0. The solution is rather messy in that it involves checking all files in the includes folder for references to Namespace:: and renaming that (as well as the class) to MyNamespace. I would advise using Smultron for this job as it can search through all the files that are open by using Advanced Find.
And then there were more problems. Splitting went fine, but indexing did not. The cause is a mysterious “Exception: Empty termnames aren’t allowed”. Ah, this tells me what I need to know. Not. Nevertheless, about half of the Wikipedia dump has been indexed and I can get started. Except … I can’t get texvc to compile, which would be kind of necessary, since most of my research on Wikipedia involves math-related subjects. And this time the error is “32-bit absolute addressing is not supported for x86-64”. So basically, I cannot get this thing to compile no matter what I try since it does not support 64-bit, yet my operating system and processor are 64-bit. What a mess.
Finally, a matter of debate. Why are there no images? Is this because the dump failed to parse completely, or are these just omitted? In the latter case this would involve an additional download from Wikimedia to make this entire idea of an offline version usable. To be honest, I find the entire process to be far too hard for an encyclopedia that promotes openness.
Glad to hear you fixed the first problem. I must have been using a different version of PHP.
There may well be a 64-bit version of the software necessary to compile texvc. Do reply again if you’re able to fix that.
The dump itself does not include images, largely because they would increase the file size by orders of magnitude. If you can spare the space then you’ll have to (a) download all the images separately (which I think you can do) and (b) fix all the wiki’s image links so that they point to your local copies.
Note that you don’t need the images to get math to display properly, since those PNGs are rendered by texvc.
Personally I get a lot of mileage out of the offline wikipedia without images, but I suppose it could be helpful to have diagrams and the like. Let us know if you’re able to integrate images successfully.
The problem with “Exception: Empty termnames aren’t allowed” is that the code for the indexer doesn’t read all lines properly; it always receives an empty line at the end of the file. To fix it, change
if (cin.eof()) break; getline(cin, title);
to
if (!getline(cin, title)) break;
Hello, I wrote a little script to change this “Namespace” in many files automatically with vi. http://yaroslavn.tumblr.com/post/33975603446/wikipedia-or-wiktionary-offline

[nikitenko@localhost owi_offline_wiki]$ more fix_namespace.sh
PAT0='Namespace::'
PAT1='MyNamespace::'
DIR='owi-transformer'
for fil in `grep -RI $PAT0 $DIR -l`; do
  vi -c ":%s/\([^y]\)$PAT0/\1$PAT1/g" -c wq $fil
done
vi -c ":%s/class Namespace/class MyNamespace/g" -c wq $DIR/includes/Namespace.php
Worked great for me with the German version of Wikipedia. The only issue is with searching using umlauts – Python doesn’t seem to like Unicode here. Quite a major one, really, for German. I intend having a look some time at how to get both de and en versions running at the same time.
I can’t seem to build the reader. I’ve downloaded everything, built Xapian and installed Django through MacPorts (and I already had the dev tools). “sudo make” tells me to use “make wikipedia,” which gives me this error:

usage: cp [-R [-H | -L | -P]] [-fi | -n] [-pvX] source_file target_file
       cp [-R [-H | -L | -P]] [-fi | -n] [-pvX] source_file … target_directory
make: *** [mediawiki] Error 64
I’ve looked through everything again, but didn’t forget anything. I tried searching Google for “make Error 64” but didn’t find anything, and I’ve tried Django with Python 2.5 and 2.6. Does anyone have any idea what my problem might be? Is there any other information that I should provide?
Thanks, Mason
I had the same problem. To fix it, I created a directory called “mediawiki_sa”, and inside that made a directory called “includes”.
Sorry, the above was a silly error… the actual problem is this:

./show.pl "../wiki-splits/rec09696enwiki-20090929-pages-articles.xml.bz2" "Wikipedia"
sh: php: not found
mediawiki_sa parser failed! report to woc.fslab.de
I have installed PHP 5.3.0 from the net.
I had the same problem running Fedora KDE 12. It turns out that my PHP engine did not like the command php5, so I edited show.pl and changed this line:
system("cd ../mediawiki_sa/ && php5 testparser.php /var/tmp/result > /var/tmp/result.tmp");
to this line:
system("cd ../mediawiki_sa/ && php testparser.php /var/tmp/result > /var/tmp/result.tmp");
It seemed to parse just fine after that, albeit with a bunch of php notices in the shell. I hope that helps you though :D!
Glad to see an ongoing discussion on this very excellent offline reader. Perhaps a simple stream edit on the bz2 splits would allow, if online, the retrieval of images. The current enwiki-latest-pages-articles.xml.bz2 is 5.4G, and with the generated index (another 3G!), the damn thing just won’t fit on my current 8G uSD for use on a FreeRunner. I don’t know how much larger the dump would be including images, but I’d guess, maybe using this great method, it wouldn’t even fit on a 32G uSD card.
In any case, here’s a problem:
This reader has been working beautifully, excepting one thing. Pages that have the country infobox do not render the data. For instance, on the Afghanistan page the infobox information exists within the file, but the rendered page gives a default template value, for example:
Capital city: {{{countrycapital}}}
I’m not really sure where to start looking as I’m not very familiar with the wikimedia parser.
Does anyone else have a clue?
— a simple stream edit will certainly not work, as it turns out. The image links are php-generated, I assume. Still, it shouldn’t be too difficult to adjust some code.
Thanks for the instructions, I used a very similar method a few years back but never really got it to work.
For those wanting something simpler (maybe a slightly different use case) I strongly recommend aarddict (http://aarddict.org/).
It’s a free, open-source (GPL) and extremely fast program which also provides ready-built databases for all Wikipedia text content. You don’t get pictures (far too large) but you do get all the text and LaTeX (math formulae) nicely rendered, with working links, tables, formatting, etc. It is available for Ubuntu, Windows, OS X and Nokia tablets (and source, obviously).
Thanks for the instructions. I managed to build the whole thing, and it basically works. However, two observations: I tried the gamma function article of the English Wikipedia, and most of the math renders fine, but some formulas don’t. Any idea why, and what I should do? Second observation: TeX takes a lot of time rendering all the formulas one by one (I think?), and it does so each time I go back to the gamma function page. Any way to have it cache its results for later fast access?
Thanks anyway!
Frank
I have created a patch that caches the rendered TeX images. I sent you an e-mail that includes this patch.
Hi,
thanks a lot for these detailed instructions. After some tuning to get Django and Python to interact correctly on my configuration (Mac OS X 10.6.8), everything works! Could you also send me the patch that caches the TeX images?
Thanks, G
I suppose you get a “could not parse” error. This is because of the texvc binaries supplied with the mediawiki_sa.tar.7z file you must have downloaded. Please recompile the texvc binaries, and it should work fine. (It did for me.)
I have one tip: if somebody (like me) has the texvc binaries installed on their computer, then there is no need to recompile the texvc source. One can go to the mediawiki_sa/math directory and soft-link to the texvc file in the /usr/bin directory instead of using the binary supplied. It saves the trouble of having to install OCaml. I don’t know if this works on a Mac, though.
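For what it's worth, that soft link would look something like this (a sketch, assuming a system texvc really does live at /usr/bin/texvc; as the commenter says, untested on a Mac):

cd mediawiki_sa/math
mv texvc texvc.orig    # keep the bundled binary around, just in case
ln -s /usr/bin/texvc texvc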