Data dumps/Tools for importing

From Meta, a Wikimedia project coordination wiki

This list of tools for importing the XML dumps is not comprehensive. If you spot one in the wild that's not described here, please add it!

Converting to SQL first

If you are using MySQL and you are working with the dumps from a relatively large project, you will want to convert them to SQL first, and then import those into your database. This will require that you either load in the other SQL tables we provide, or that you rebuild that data using the rebuildall.php maintenance script provided with MediaWiki.

If you are working with only a small subset of pages, this solution is not ideal for you, as you'll end up with a lot of extra page- and revision-related information (such as links) that won't be valid.

It's probably better to first load all rows into a table without any keys at all, and then create the needed keys afterwards. Doing the necessary sorting once is much faster than updating the keys for each inserted row. Running "alter table page disable keys" is not enough, because unique keys will still be updated and checked for uniqueness on each inserted row.

  • One way would be to edit the first part of the SQL file, which creates the table, before using it.
  • Another way, which gives better control, would be to write your own program to parse the SQL file and insert the data into the database.
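
The first approach can be sketched in Python. This is only an illustration, assuming the table definition has the usual mysqldump layout (one column or key definition per line); the function name split_secondary_keys is made up for this example. It keeps the PRIMARY KEY in place, which is an assumption of the sketch, and moves every other key into an ALTER TABLE statement to run once the rows are loaded.

```python
import re


def split_secondary_keys(create_sql):
    """Split a mysqldump-style CREATE TABLE statement into two statements:
    the table without secondary keys, and an ALTER TABLE that adds them back.

    Assumes one definition per line. The PRIMARY KEY is left in place.
    """
    table = re.search(r"CREATE TABLE\s+`?(\w+)`?", create_sql).group(1)
    kept, keys = [], []
    for line in create_sql.splitlines():
        definition = line.strip().rstrip(",")
        if re.match(r"(UNIQUE |FULLTEXT |SPATIAL )?KEY ", definition):
            keys.append(definition)   # build this key after loading instead
        else:
            kept.append(line)
    # The last remaining definition must not end with a comma.
    for i in range(len(kept) - 1, 0, -1):
        if kept[i].lstrip().startswith(")"):
            kept[i - 1] = kept[i - 1].rstrip().rstrip(",")
            break
    create = "\n".join(kept)
    alter = ""
    if keys:
        additions = ",\n  ".join("ADD " + key for key in keys)
        alter = "ALTER TABLE `%s`\n  %s;" % (table, additions)
    return create, alter
```

Run the first returned statement instead of the original CREATE TABLE, import the rows, and then run the second statement so that each index is built in a single pass.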

Converting XML files to SQL

These tools produce SQL files that can then be imported into your database by e.g. mysqlimport.

  • mwdumper - Java tool (outdated, might not be functional anymore)
  • mwxml2sql - C program for *nix platforms (last updated 2018)
  • mwimport - Perl script, needs editing by hand for non-English-language projects (last updated 2007)

Converting SQL files to tab-delimited files

The output of these tools is intended for use with LOAD DATA INFILE for MySQL databases.

  • sql2txt - C program for *nix platforms, source here (last updated 2017; not functional as of March 2018).
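
To illustrate what such a conversion involves, here is a minimal Python sketch (not the sql2txt implementation). It assumes the INSERT statements look like mysqldump output, with strings in single quotes escaped by backslashes, and that LOAD DATA INFILE is used with its default settings: tab-separated fields, backslash as the escape character, and \N for NULL. The function name insert_to_tsv is made up for this example.

```python
def insert_to_tsv(statement):
    """Yield one tab-delimited line per row of an INSERT ... VALUES statement."""
    i = statement.index(" VALUES ") + len(" VALUES ")
    fields, field = [], []
    in_tuple = in_string = quoted = False
    while i < len(statement):
        ch = statement[i]
        if in_string:
            if ch == "\\":
                # Keep backslash escapes as written; LOAD DATA decodes them.
                field.append(statement[i:i + 2])
                i += 2
                continue
            if ch == "'":
                in_string = False
            elif ch == "\t":
                field.append("\\t")   # a raw tab would split the field
            else:
                field.append(ch)
        elif not in_tuple:
            if ch == "(":             # start of the next row
                in_tuple, fields, field, quoted = True, [], [], False
        elif ch == "'":
            in_string = quoted = True
        elif ch in ",)":              # end of a field, or of the whole row
            value = "".join(field)
            if value == "NULL" and not quoted:
                value = "\\N"         # LOAD DATA's notation for NULL
            fields.append(value)
            field, quoted = [], False
            if ch == ")":
                in_tuple = False
                yield "\t".join(fields)
        else:
            field.append(ch)
        i += 1
```

Each yielded line can be written to a text file followed by a newline. The sketch does not handle quotes escaped by doubling ('') or multi-line statements.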

Importing directly into your database

If time is not an issue or you are dealing with a very small project or a subset of pages for import, you can try importing directly into the database. This method generally means that rows in related tables will be populated as information for each revision is imported, but it is much slower than using the SQL files we provide for download.

Tools for importing directly from the XML files to your database:

  • importDump.php - maintenance script that comes with MediaWiki, always current. Also see the MediaWiki manual.
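
Any tool that works directly from the XML files has to read the dump page by page rather than loading it whole. The following Python sketch shows that reading step only; it is not how importDump.php is implemented. It assumes the usual export layout of page elements containing title and revision/text elements, and the function names are made up for this example.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, which differs between export schema versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each page in an XML dump, one page at a time.

    If a page has several revisions, the text of the last one listed is used.
    """
    title = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = None
            elem.clear()   # drop the finished page's subtree to save memory
```

A compressed dump can be read without unpacking it first, for example with bz2.open("dump.xml.bz2", "rb") as the stream (the file name here is a placeholder).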

Importing Into Elasticsearch

The Wikiparse tool (last updated in 2015) can import the bz2-compressed dump directly into Elasticsearch, with a number of convenient analyzers set up for text searching.

Making the imported wiki functional

If you want not only to import data but also to use the resulting wiki (e.g. for a backup restore or a wiki migration), you have to take several additional steps. See mw:Manual:Importing XML dumps.

Tools from the past

People have been at this for years now. Here are some of the tools that folks have written, for the historical record:

  • xml2sql - cross-platform tool in C for converting XML files to SQL, but now several years out of date
  • Perl importing script - Perl script for importing XML files directly into the database, years out of date
  • mwdum.py - Python tool with a low memory footprint and mediocre speed so far. Includes "parentid" (which mwdumper seems not to do) and has had no Unicode problems so far.