
Data dumps/xml2sql

From Meta, a Wikimedia project coordination wiki

NOTE: This is not the recommended method of importing XML dumps.
See mw:Manual:Importing XML dumps for an overview.


xml2sql converts the MediaWiki XML dumps available from http://download.wikimedia.org/ into SQL dumps that can be imported with mysql, mysqlimport, or psql.

The tool is written in ANSI C and requires expat and zlib to compile. It was developed on Linux and also works on FreeBSD, NetBSD, Mac OS X, and Windows. Feel free to use it. :)

Download

  • xml2sql-0.5.tar.gz (source code) 2006-02-08
    MD5SUM: 8a1d905636900e3ea07055dd645276f8
    SHA1SUM: ad4ccb37ccbef1a682a86e4b929b43ac0f578744
  • xml2sql-0.5-win32.zip (win32 executable) 2006-02-08
    MD5SUM: 9665424dc6d6f5abf6241298e727a5a3
    SHA1SUM: 403bc96a1f679259bcd904f7c9c9bae92252a266
  • GitHub: mediawiki-xml2sql

Patch for recent versions of MediaWiki (>= 1.10)


Since MediaWiki 1.10 the revision table has had two new columns (rev_len and rev_parent_id), and the XML changed slightly. Apply this patch to make xml2sql work again:

--- xml2sql-0.5/xml2sql.c	2008-01-16 15:32:28.000000000 +0100
+++ xml2sql-0.5 (2)/xml2sql.c	2008-02-17 15:06:34.000000000 +0100
@@ -741,6 +741,10 @@
	putcolumnf(&rev_tbl, "%d", revision.minor);
	/* rev_deleted */
	putcolumn(&rev_tbl, "0", 0);
+	
+	putcolumn(&rev_tbl, "NULL", 0);
+	putcolumn(&rev_tbl, "NULL", 0);
+
	finrecord(&rev_tbl);
	
	if(page.lastts == 0 || strcmp(page.lastts, revision.timestamp) < 0) {
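
One way to apply the patch, sketched under a few assumptions: GNU patch is installed, the diff above was saved intact (tabs included) as rev_len.patch, which is a hypothetical file name, and the source was unpacked into xml2sql-0.5. Because the header names two different directories, the sketch passes the target file explicitly so that patch does not have to guess it. apply_patch is a hypothetical helper:

```shell
# apply_patch SRCDIR PATCHFILE
# Hypothetical helper: tries the patch against SRCDIR/xml2sql.c first and
# modifies the file only if the trial run succeeds.
apply_patch() {
    srcdir=$1 patchfile=$2
    patch --dry-run "$srcdir/xml2sql.c" "$patchfile" >/dev/null &&
    patch "$srcdir/xml2sql.c" "$patchfile"
}

# Usage, from the directory that contains xml2sql-0.5/:
#   apply_patch xml2sql-0.5 rev_len.patch
```

After patching, rebuild the tool so that the two extra columns are written to the revision output.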

Install


*nix, MacOS


The source package contains a standard configure script. Unpack the package, run configure, and then run make. (On *BSD, you may need to pass the --with-expat=/usr/local option to configure.)

On Debian etch, the packages gcc, libc6-dev, expat, and libexpat1-dev are needed.

$ ./configure
$ make
# make install

Windows


A Win32 executable is available. Download it and unzip it.

Easy to use

$ wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-meta-current.xml.bz2
$ bunzip2 -c enwiki-20080103-pages-meta-current.xml.bz2 | xml2sql
$ mysqlimport -u root -p --local dbname `pwd`/{page,revision,text}.txt

Note: The last command fails unless the database already contains the MediaWiki tables. The usual way to create them is to install the MediaWiki software before running the import.
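
Whether the tables exist can be checked before running mysqlimport by listing them and looking for page, revision, and text. A minimal sketch, assuming the mysql command-line client, the same credentials as above, and a MediaWiki installation without a table prefix. has_mw_tables is a hypothetical helper:

```shell
# has_mw_tables: reads one table name per line on standard input and
# succeeds only if page, revision, and text are all present.
has_mw_tables() {
    tables=$(cat)
    for t in page revision text; do
        printf '%s\n' "$tables" | grep -qx "$t" || {
            echo "missing table: $t" >&2
            return 1
        }
    done
}

# Usage (-N suppresses the column header of SHOW TABLES):
#   mysql -u root -p -N -e 'SHOW TABLES' dbname | has_mw_tables
```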

Windows


The GUI frontend can decompress gzip, bzip2, and 7-zip archives. Run xml2sql-fe.exe, choose the XML file, choose the options, optionally choose an output directory, and then press the "START!!" button.

Reference

usage: xml2sql [options]... [XMLFILE]

Reads a MediaWiki XML dump from XMLFILE (or from standard input) and writes an SQL dump for MediaWiki 1.5 or later.
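
Since the dump can arrive either as a file argument or on standard input, a small wrapper can pick the decompressor from the file extension. This is only a sketch: dump_to_sql is a hypothetical helper, and it assumes bunzip2, gzip, and xml2sql are on the PATH.

```shell
# dump_to_sql DUMPFILE [xml2sql options...]
# Hypothetical helper: decompresses .bz2 and .gz dumps into a pipe and
# passes any other file to xml2sql as the XMLFILE argument.
dump_to_sql() {
    dump=$1
    shift
    case $dump in
        *.bz2) bunzip2 -c "$dump" | xml2sql "$@" ;;
        *.gz)  gzip -dc "$dump" | xml2sql "$@" ;;
        *)     xml2sql "$@" "$dump" ;;
    esac
}

# Usage:
#   dump_to_sql enwiki-20080103-pages-meta-current.xml.bz2 --verbose
```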

Options

-i, --import
    mysqlimport format (default). Output files are page.txt, revision.txt, and text.txt. Use the mysqlimport program to import this format.
-m, --mysql
    MySQL's INSERT format. Output files are page.sql, revision.sql, and text.sql. Use the mysql program to import this format.
-p, --postgresql[=version]
    PostgreSQL's COPY format. Output files are page.sql, revision.sql, and text.sql. If the version is omitted, 8.0 or earlier is assumed. Use the psql program to import this format.
-c, --compress[={old,full}]
    Compress the text table with deflate (default: old). This option is ignored when the output format is postgresql, because PostgreSQL compresses table data itself.
-r, --renumber
    Renumber page IDs and revision IDs.
-N, --namespace=ns,ns,...
    Output only the specified namespaces. A namespace can be given by number or by name.
-t, --no-text
    Exclude the text table.
-o, --output-dir=OUTDIR
    Output directory (default: current directory).
-t, --tmpdir=TMPDIR
    Temporary directory (default: OUTDIR). A temporary file is used only with --compress=old.
-v, --verbose
    Show progress.
-h, --help
    Display help and exit.
--version
    Display version information and exit.
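
As an illustration of how the options combine, the following sketch converts only the main namespace (number 0 in MediaWiki) to MySQL INSERT statements in a separate directory. convert_main_ns is a hypothetical helper. Whether xml2sql creates a missing output directory itself is not stated here, so the sketch creates it first.

```shell
# convert_main_ns DUMP.bz2 OUTDIR
# Hypothetical helper: writes page.sql, revision.sql, and text.sql for the
# main namespace into OUTDIR, showing progress while it runs.
convert_main_ns() {
    dump=$1 outdir=$2
    mkdir -p "$outdir" &&
    bunzip2 -c "$dump" |
        xml2sql --mysql --namespace=0 --output-dir="$outdir" --verbose
}

# Usage:
#   convert_main_ns enwiki-20080103-pages-meta-current.xml.bz2 out
```

The resulting files can then be fed to the mysql program one at a time, for example mysql -u root -p dbname < out/page.sql.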

xml2sql, MediaWiki XML to SQL converter.
Copyright © Tietew.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

See also
