How to get your own Wikipedia DB copy: a brief tutorial
It is quite simple to get your own copy of the Wikipedia database. (Please note that this gets you only the DB, not the frontend. To get a static version of Wikipedia you may use this link.)
- Select the project you want to dump from here. There are DB dumps for all the Wikimedia projects. I will choose itwiki; you will probably want enwiki (the English Wikipedia).
- Download the bz2/7z compressed dump file (XML) you need. You will probably want just pages-articles.xml, which contains only articles, templates and primary meta-pages. I will use this file.
- Extract the file. You will end up with a single XML file (about 2 GB in my case).
- Download the xml2sql tool and run xml2sql-fe.exe (MS Windows GUI)
- Convert the XML file into MySQL INSERT files using that tool. I will save the results in c:\tmp
- Install MySQL server if you haven’t done so yet. You may want to use a very simple MySQL installation like EasyPHP, which also bundles Apache and PHP. I will use that distribution.
- Open the MySQL shell by running
mysql.exe -u root
from the \mysql\bin directory. (Here root is the username. To specify a password, add -p.)
- Create the database into which we’ll import the data:
CREATE DATABASE enwiki;
and exit using quit.
- Download the latest table structure (tables.sql) from this SVN (open and click on download). I will save it in c:\tmp
- Raise the max_allowed_packet variable in your my.ini under the \mysql folder.
It is probably set to 1 MB at the moment; I set it to 500 MB. This is needed because we are going to send MySQL a huge amount of data (when importing text.sql). Restart the MySQL server afterwards so that the new value is picked up.
- Recreate the table structure by running this command from the \mysql\bin directory:
mysql -u root enwiki < c:\tmp\tables.sql
- Import the table data using these commands from the \mysql\bin directory:
mysql -u root enwiki < c:\tmp\page.sql
mysql -u root enwiki < c:\tmp\text.sql
This step will take noticeably longer than the previous one.
- Now you may use MySQL Query Browser to explore the Wikipedia database. (If you installed EasyPHP you can use: Server Host: localhost; Username: root; Password: [empty]).
Note that only the page and the text tables are filled now.
Note also that the text table would normally contain every version of a page. However, the XML I downloaded contains just the last revision. The page table is linked to the text table through the revision table, so you will need that table in order to navigate from text to page and vice versa. The DB structure below will be useful if you need to access more than just page data.
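To make the page / revision / text link more concrete, here is a minimal sketch of the join. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the toy tables are cut down to a few columns. The column names (page_latest, rev_id, rev_text_id, old_id, old_text) are assumptions based on the MediaWiki schema of that period, so check them against the tables.sql you downloaded. The helper names and the sample row are purely illustrative.

```python
import sqlite3

# The join itself. The same SELECT should also work in the mysql shell
# (with a literal title instead of the ? placeholder), assuming the
# revision table has been imported and the column names match your tables.sql.
LATEST_TEXT_QUERY = """
SELECT p.page_title, t.old_text
FROM page AS p
JOIN revision AS r ON r.rev_id = p.page_latest
JOIN text AS t ON t.old_id = r.rev_text_id
WHERE p.page_title = ?
"""


def build_toy_db():
    """Create an in-memory database with one page, one revision, one text row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_text_id INTEGER);
        CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);
        INSERT INTO page VALUES (1, 'Roma', 10);
        INSERT INTO revision VALUES (10, 1, 100);
        INSERT INTO text VALUES (100, 'Roma is the capital of Italy.');
        """
    )
    return conn


def latest_text(conn, title):
    """Return the text of the latest revision of a page, or None if not found."""
    row = conn.execute(LATEST_TEXT_QUERY, (title,)).fetchone()
    return row[1] if row else None


if __name__ == "__main__":
    print(latest_text(build_toy_db(), "Roma"))
```

Going through page_latest keeps the query correct even for dumps that contain more than one revision per page.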
As an alternative, you may import the data into MySQL with the LOAD DATA INFILE command, after exporting the XML to the mysqlimport format (using xml2sql).
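The command-line imports from the steps above (tables.sql, then page.sql, then text.sql) can also be run from a single script. The following is only a rough sketch: it assumes that mysql is on your PATH, that root has no password, and that all three files sit in the same directory. The function names and the dry_run switch are hypothetical helpers, not part of any tool mentioned here.

```python
import subprocess
from pathlib import Path

# Order matters: the table structure must exist before the data is loaded.
IMPORT_ORDER = ["tables.sql", "page.sql", "text.sql"]


def build_import_jobs(sql_dir, database="enwiki", user="root"):
    """Return (command, sql_file) pairs, one per file in IMPORT_ORDER."""
    command = ["mysql", "-u", user, database]
    return [(command, Path(sql_dir) / name) for name in IMPORT_ORDER]


def run_imports(jobs, dry_run=True):
    """Feed each SQL file to mysql on stdin, like `mysql ... < file.sql`."""
    for command, sql_file in jobs:
        print(" ".join(command), "<", sql_file)
        if dry_run:
            continue  # only show what would be executed
        with open(sql_file, "rb") as handle:
            # check=True stops the script at the first failing import.
            subprocess.run(command, stdin=handle, check=True)


if __name__ == "__main__":
    run_imports(build_import_jobs(r"c:\tmp"), dry_run=True)
```

With dry_run=True the script only prints the commands, which is a cheap way to check paths before starting an import that may run for a long time.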



