Data dumps/Import examples

Examples of the import process

Import into an empty wiki of el wikivoyage on Linux with MySQL

MediaWiki version: 1.29wmf19. MariaDB version for the local install: 10.1.

This wiki was chosen because it's got a non-latin1 character set in a language I read, and it's small enough that I can import all revisions without it taking forever.

Before the import: MW downloads

  1. I downloaded the dumps for a given day. I actually downloaded everything even though I knew some files would not be needed.
  2. I made sure I had all prerequisites for MediaWiki installed: MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick.
  3. I got an up-to-date clone of MediaWiki from the git repo and checked out the 1.29wmf19 branch (a sketch of the clone-and-checkout commands appears after this list).
  4. I also got clones of the following extensions and checked out the same branch:
    1. ActiveAbstract -- needed to run dumps, which was one of the reasons for the install
    2. Babel -- used on some user pages
    3. CategoryTree -- enabled on the wiki and used
    4. Cite -- enabled and used on the wiki
    5. CiteThisPage -- enabled on the wiki, probably could have been skipped, ah well
    6. Disambiguator -- enabled and used on the wiki
    7. Gadgets -- enabled and used on the wiki
    8. ImageMap -- enabled and used on the wiki
    9. Interwiki -- enabled but do we need it? Not sure
    10. Kartographer -- enabled and used on the wiki
    11. LabeledSectionTransclusion -- enabled and used on the wiki
    12. MwEmbedSupport -- enabled and used on the wiki
    13. PagedTiffHandler -- enabled but it's unclear to me if it's really needed
    14. ParserFunctions -- a must have for all wikis
    15. PdfHandler -- do we need it? I erred on the side of caution
    16. Scribunto -- must have for modern wikis
    17. SiteMatrix -- I wrote the dang dumps api code that gets site-related info and I still don't know if we needed this extension. Better safe than sorry.
    18. TemplateData -- enabled and used on the wiki
    19. TimedMediaHandler -- enabled and used on the wiki, though I did not test it
    20. TocTree -- enabled and used on the wiki
    21. Wikidata -- enabled and used on the wiki, BUT SEE NOTES
    22. WikiEditor -- enabled and used on the wiki
    23. WikimediaMaintenance -- I wanted to have these scripts available, just in case
  5. I grabbed the repo for the default skin (Vector) and checked out the right branch for that as well
  6. I copied the MediaWiki repo into directory elwv under my html docroot, copied the various extension repos into the elwv/extensions directory, and began the install process.
  7. I copied the skin to elwv/skins/
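
For reference, here is a minimal sketch of the clone-and-checkout steps. The Gerrit URLs and the exact branch name (wmf/1.29.0-wmf.19) are my assumptions, so check git branch -r for the real names; only a few of the extensions from the list above are shown, and I clone straight into the docroot here rather than copying repos around.

# clone core into the docroot directory and check out the wmf branch
git clone https://gerrit.wikimedia.org/r/mediawiki/core.git /var/www/html/elwv
cd /var/www/html/elwv && git checkout wmf/1.29.0-wmf.19

# clone the extensions (only a few shown) and the Vector skin at the same branch
for ext in ActiveAbstract Babel CategoryTree Cite ParserFunctions Scribunto; do
    git clone "https://gerrit.wikimedia.org/r/mediawiki/extensions/${ext}.git" "extensions/${ext}"
    ( cd "extensions/${ext}" && git checkout wmf/1.29.0-wmf.19 )
done
git clone https://gerrit.wikimedia.org/r/mediawiki/skins/Vector.git skins/Vector
( cd skins/Vector && git checkout wmf/1.29.0-wmf.19 )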

Before the import: MW install

  1. I installed with settings:
    • el for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • elwikivoyage for database name
    • no database table prefix
    • root db username and password for the database username/password for install
    • InnoDB table format
    • Binary character set
    • Enable media uploads (no good reason, just forgot)
    • use InstantCommons
    • no email, no cache (can be fixed later), some cc license
    • I added a wgOverrideHostname just to be on the safe side (not a fqdn, since this is a laptop)
  2. I selected all the above extensions to be installed
  3. I added some settings to LocalSettings.php:
    • $wgContentHandlerUseDB = true;
    • $wgUseTidy = true;
    • $wgTidyBin = '/usr/bin/tidy';
    • $wgIncludeLegacyJavaScript = true; (needed for some stuff in MediaWiki:Common.js as imported)
    • $wgDefaultUserOptions['toc-floated'] = true; (needed for nicely styled toc on some pages to display correctly)
  4. For debugging purposes I also added the following settings:
    • $wgShowExceptionDetails = true;
    • $wgShowDebug = true;
    • $wgDebugLogFile = '/var/www/html/elwv/logs/debuglog.txt';

I will describe what had to be done with the WikiData extension below (BUT SEE BIG WARNINGS IN BOLDFACE).

The import

  1. I created a nice working directory elsewhere
  2. I put all the downloaded files into a subdirectory "imported"
  3. I wrote some import scripts and put them in the working directory. See operations/dumps/import-tools; you want extract_tablecreate.py and import_tables.sh from the xmlfileutils directory, and you must edit the latter for the wiki name, dump date, database username and password, and so on.
  4. I got a git clone of the operations/dumps/import-tools master branch (see the link above for info), entered the xmlfileutils directory, and did 'make'... well, after updating the code for MW 1.29 :-)
  5. I copied mwxml2sql, sql2txt and sqlfilter from there to my working directory in the first step
  6. I made sure wget is installed, just in case.
  7. I made sure mariadb was running (required for the MediaWiki install, in any case).
  8. I ran the import_tables.sh script.

Some notes about what this script does (a condensed sketch of the flow follows this list):

  • downloads the wiki dump files for the specified date and wiki name, if they are not already present in the directory where the script looks for files to import
  • generates page, revision and text sql files, plus sql for the creation of those tables, if it does not find generated files in its output directory already
  • converts the above to tab-delimited files
  • converts the downloaded sql files to tab-delimited files
  • grabs the CREATE TABLE statements from the downloaded sql files and puts those into files to be used for the import (see NOTE below)
  • drops tables for import except page/text/revision
  • truncates tables page/text/revision
  • recreates dropped tables with the above saved create statements
  • imports page/text/revision data to local mariadb instance using LOAD DATA INFILE
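
For orientation, here is a condensed sketch of that flow. This is not the real script: the wiki name, dump date, credentials, table list and file naming are all placeholders, and it assumes the .tabs files have already been unpacked.

#!/bin/bash
# rough sketch of the core of import_tables.sh
wiki=elwikivoyage
dumpdate=20170420                               # placeholder date
db() { mysql -u root -pREPLACEME "$wiki"; }     # placeholder credentials

# recreate the downloaded tables from the CREATE TABLE statements extracted
# from the dump .sql files (NOT from the MediaWiki installer's table defs)
for table in categorylinks pagelinks page_restrictions; do    # ...and the rest
    echo "DROP TABLE IF EXISTS $table;" | db
    db < "${wiki}-${dumpdate}-${table}.create.sql"
done

# page/revision/text keep the installer's schema; just empty them
echo "TRUNCATE TABLE page; TRUNCATE TABLE revision; TRUNCATE TABLE text;" | db

# bulk-load every tab-delimited file
for tabs in ${wiki}-${dumpdate}-*.tabs; do
    table=$(basename "$tabs" .tabs | sed "s/^${wiki}-${dumpdate}-//")
    echo "LOAD DATA INFILE '$(pwd)/$tabs' INTO TABLE $table FIELDS OPTIONALLY ENCLOSED BY '\\'';" | db
done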

NOTE THAT we cannot rely on the tables as created by the MediaWiki install, because the order of fields may be different from one wiki to the next!

Example: elwikivoyage as dumped from db has:
CREATE TABLE `page_restrictions` (
  `pr_page` int(11) NOT NULL,
  `pr_type` varbinary(60) NOT NULL,
  `pr_level` varbinary(60) NOT NULL,
  `pr_cascade` tinyint(4) NOT NULL,
  `pr_user` int(11) DEFAULT NULL,
  `pr_expiry` varbinary(14) DEFAULT NULL,
  `pr_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`pr_id`),
  UNIQUE KEY `pr_pagetype` (`pr_page`,`pr_type`),
  KEY `pr_typelevel` (`pr_type`,`pr_level`),
  KEY `pr_level` (`pr_level`),
  KEY `pr_cascade` (`pr_cascade`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=binary;
The same table create statement in MediaWiki has:
CREATE TABLE /*_*/page_restrictions (
  pr_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  pr_page int NOT NULL,
  pr_type varbinary(60) NOT NULL,
  pr_level varbinary(60) NOT NULL,
  pr_cascade tinyint NOT NULL,
  pr_user int NULL,
  pr_expiry varbinary(14) NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/pr_pagetype ON /*_*/page_restrictions (pr_page,pr_type);
CREATE INDEX /*i*/pr_typelevel ON /*_*/page_restrictions (pr_type,pr_level);
CREATE INDEX /*i*/pr_level ON /*_*/page_restrictions (pr_level);
CREATE INDEX /*i*/pr_cascade ON /*_*/page_restrictions (pr_cascade);
Since we import things as tab-delimited for speed (using LOAD DATA INFILE), the order of the dumped fields must correspond to the order of the created table.

NOTE ALSO that we cannot rely on the create table statements in e.g. the page.sql file as downloaded, because those files may contain fields that the running version of MediaWiki does not have!

Example: https://phabricator.wikimedia.org/T86338 (page_counter still present)
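
This is why the create statements are taken from the dumped .sql files themselves. A rough shell equivalent of what extract_tablecreate.py is there for (filenames are placeholders):

# pull just the CREATE TABLE ... ; block out of a downloaded dump file
zcat elwikivoyage-20170420-page_restrictions.sql.gz \
    | sed -n '/^CREATE TABLE /,/^) ENGINE=/p' \
    > elwikivoyage-20170420-page_restrictions.create.sql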

After the import

  1. I checked that there are no extra namespaces I needed to add.
  2. TBH a few of the extensions I claimed to add in advance, I added afterwards when seeing that some pages didn't render, but I did a re-import by running the script again and all was well.
  3. The gorilla in the room: WikiData. Some pages used an image banner that relied on a WikiData property lookup. So I sighed and tried to configure all that crap. Here are the settings, stolen straight out of the WMF configs:
$wgEnableWikibaseRepo = false;
$wmgUseWikibaseRepo = false;
$wgEnableWikibaseClient = true;
$wmgUseWikibaseClient = true;

require_once( "$IP/extensions/Wikidata/Wikidata.php" );

$wgWBClientSettings['repoSiteName'] = 'wikibase-repo-name';
$baseNs = 120;
// Define the namespace indexes for repo (and client wikis also need to be aware of these,
// thus entityNamespaces need to be a shared setting).
//
// NOTE: do *not* define WB_NS_ITEM and WB_NS_ITEM_TALK when using a core namespace for items!
define( 'WB_NS_PROPERTY', $baseNs );
define( 'WB_NS_PROPERTY_TALK', $baseNs + 1 );
define( 'WB_NS_QUERY', $baseNs + 2 );
define( 'WB_NS_QUERY_TALK', $baseNs + 3 );


$wgWBSharedSettings['entityNamespaces'] = [
	'item' => NS_MAIN,
	'property' => WB_NS_PROPERTY
];

$wgWBSharedSettings['specialSiteLinkGroups'][] = 'wikidata';

$wgWBClientSettings = $wgWBSharedSettings + $wgWBClientSettings;

// to be safe, keeping this here although $wgDBname is default setting
$wgWBClientSettings['siteGlobalID'] = $wgDBname;

$wgWBClientSettings['changesDatabase'] = 'wikidatawiki';
$wgWBClientSettings['repoDatabase'] = 'wikidatawiki';

$wgWBClientSettings['repoNamespaces'] = [
	'item' => '',
	'property' => 'Property'
];

$wgWBClientSettings['languageLinkSiteGroup'] = 'wikivoyage';

$wgWBClientSettings['siteGroup'] = 'wikivoyage';
$wgWBClientSettings['otherProjectsLinksByDefault'] = true;

$wgWBClientSettings['excludeNamespaces'] = function() {
	// @fixme 102 is LiquidThread comments on wikinews and elsewhere?
	// but is the Extension: namespace on mediawiki.org, so we need
	// to allow wiki-specific settings here.
	return array_merge(
		MWNamespace::getTalkNamespaces(),
		// 90 => LiquidThread threads
		// 92 => LiquidThread summary
		// 118 => Draft
		// 1198 => NS_TRANSLATE
		// 2600 => Flow topic
		[ NS_USER, NS_FILE, NS_MEDIAWIKI, 90, 92, 118, 1198, 2600 ]
	);
};

$wmgWikibaseClientSettings = [];

foreach( $wmgWikibaseClientSettings as $setting => $value ) {
	$wgWBClientSettings[$setting] = $value;
}

$wgWBClientSettings['allowDataTransclusion'] = true;
$wgWBClientSettings['allowDataAccessInUserLanguage'] = false;
$wgWBClientSettings['entityAccessLimit'] = 400;

// fixme find out what the hell this is
$wgWBClientSettings['sharedCacheKeyPrefix'] = 'wadafuq';
$wgWBClientSettings['sharedCacheDuration'] = 60 * 60 * 24;

$wgWBClientSettings['repoUrl'] = 'https://www.wikidata.org';
$wgWBClientSettings['repoConceptBaseUri'] = 'http://www.wikidata.org/entity/';
$wgArticlePlaceholderImageProperty = 'P18';

$wgWBClientSettings['badgeClassNames'] = [
	'Q17437796' => 'badge-featuredarticle',
	'Q17437798' => 'badge-goodarticle',
	'Q17559452' => 'badge-recommendedarticle', // T72268
	'Q17506997' => 'badge-featuredlist', // T72332
	'Q17580674' => 'badge-featuredportal', // T75193
	'Q20748091' => 'badge-notproofread', // T97014 - Wikisource badges
	'Q20748094' => 'badge-problematic',
	'Q20748092' => 'badge-proofread',
	'Q20748093' => 'badge-validated',
	'Q28064618' => 'badge-digitaldocument', // T153186
];

// Overwrite or add commons links in the "other projects sidebar" with the "commons category" (P373), per T126960
$wgWikimediaBadgesCommonsCategoryProperty = $wgDBname === 'commonswiki' ? null : 'P373';

$wgArticlePlaceholderSearchEngineIndexed = false;
$wgWBClientSettings['propertyOrderUrl'] = 'https://www.wikidata.org/w/index.php?title=MediaWiki:Wikibase-SortedProperties&action=raw&sp_ver=1';

This allowed the wiki to at least parse property tags and return nothing in templates that use them; those templates have a fallback, so pages rendered OK.

HOWEVER. A big huge fat HOWEVER. Wikidata DOES NOT have an InstantCommons-like mode where you can use the WMF's Wikidata repo from your local wiki. See here: [1] (remind me to replace this link with a better one later). The open ticket for adding such functionality is inactive [2]. THIS MEANS that in order to use properties and entities, we have to figure out a way to import them into a local Wikibase repo, which is not trivial. Wikidata currently has over 25 million items (entities, properties, etc.); clearly we don't want to import all of those. Compare that to the English-language Wikipedia, which has fewer than 6 million articles. I need to investigate all of this or see if others have done some work in this area.

After all of this, most things looked like they rendered OK, but there were a couple of tags I saw and didn't fix up. I was able to run dumps successfully on this install, and that's what I really wanted it for.

Import into an empty wiki of el wiktionary on Linux with MySQL

MediaWiki version: 1.20

This wiki was chosen because it uses a non-latin1 character set, has a reasonable number of articles but isn't huge, and relies on only a small number of extensions.

I chose to import only the current pages, without User or Talk pages, because most folks who set up local mirrors want the article content and not the revision history or the discussion pages.

Before the import

  1. I downloaded the dumps for a given day. I got all the sql.gz files, the stub-articles.xml.gz file, and the pages-articles.xml.bz2 file from http://download.wikimedia.org/elwiktionary/ even though I knew there would be a few of those sql files I wouldn't need.
  2. I installed the prerequisites for MediaWiki, including MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick and rsvg (see the manual).
  3. I downloaded MediaWiki 1.20 and unpacked it into /var/www/html/elwikt (your location may vary).
  4. I installed MediaWiki 1.20 on my laptop, with the following settings:
    • el for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • elwikt for database name
    • no database table prefix
    • root db username and password for the database username and password for install
    • a different user name and password for the database account for web access, with 'create if it does not exist' checked
    • InnoDB table format
    • Binary character set
    • Disable media uploads
    • use InstantCommons
  5. I selected the extensions I wanted installed via the installer, some of them not being necessary but I thought they would be useful to have if I did decide to locally edit:
    • ConfirmEdit
    • Gadgets
    • Nuke
    • ParserFunctions
    • RenameUser
    • Vector
    • WikiEditor
  6. I generated page, revision and text sql files from the stub and page content XML files, using mwxml2sql via the command mwxml2sql -s elwiktionary-blahblah-stub-articles.xml.gz -t elwiktionary-blahblah-pages-articles.xml.bz2 -f elwikt-pages-current-sql.gz -m 1.20
  7. I converted all the sql files to tab delimited files using sql2txt (same repo as the previous step) via the command zcat elwiktionary-blahdate-blahtable.sql.gz | sql2txt | gzip > elwiktionary-blahdate-blahtable.tabs.gz. Actually that's a lie: I wrote a tiny bash script to do them all for me (sketched after this list). I skipped the following downloaded files:
    • site_stats - I didn't want or need these, the numbers would be wrong anyways
    • user_groups - Not needed for displaying page content
    • old_image and image - using InstantCommons
    • page - generated from XML files instead
  8. I converted the page, revision and text table files that were generated from the XML files, to tab delimited, using a command similar to the above step
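
The tiny script mentioned in step 7 was something along these lines (a sketch; the dump date is a placeholder and the skip list matches the tables listed above):

#!/bin/bash
# convert every downloaded .sql.gz dump to a tab-delimited .tabs.gz, skipping
# the tables we don't want (page is generated from the XML files instead)
dumpdate=20121201                   # placeholder date
skip="site_stats|user_groups|old_image|image|page"

for f in elwiktionary-${dumpdate}-*.sql.gz; do
    table=${f#elwiktionary-${dumpdate}-}
    table=${table%.sql.gz}
    echo "$table" | grep -qE "^(${skip})$" && continue
    zcat "$f" | ./sql2txt | gzip > "elwiktionary-${dumpdate}-${table}.tabs.gz"
done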

The actual import

Note: maybe using charset 'binary' here would be better!

  1. I imported all of the above files into MySQL, doing the following (the whole session is sketched as a script after this list):
    • mysql -u root -p
    • mysql>use elwikt
    • mysql>SET autocommit=0;
    • mysql>SET foreign_key_checks=0;
    • mysql>SET unique_checks=0;
    • mysql>SET character_set_client = utf8;
    • unpacked the tab delimited file
    • mysql>TRUNCATE TABLE tablenamehere;
    • mysql>LOAD DATA INFILE 'path-to-tab-delim-file-for-table-here' INTO TABLE tablenamehere FIELDS OPTIONALLY ENCLOSED BY '\'';
    • repeated this for all tab delim files
    • mysql>exit;
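
Strung together, one table's worth of that session looks roughly like this (a sketch: the path is a placeholder, the file must be readable by the MySQL server process for plain LOAD DATA INFILE, and note the explicit COMMIT since autocommit is off):

mysql -u root -p elwikt <<'EOF'
SET autocommit=0;
SET foreign_key_checks=0;
SET unique_checks=0;
SET character_set_client = utf8;
TRUNCATE TABLE pagelinks;
LOAD DATA INFILE '/home/me/elwikt/elwiktionary-20121201-pagelinks.tabs'
    INTO TABLE pagelinks FIELDS OPTIONALLY ENCLOSED BY '\'';
COMMIT;
EOF

Repeat the TRUNCATE/LOAD pair for each table's tab-delimited file.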

After the import

  1. Since this is a wiktionary, I updated the LocalSettings.php file so that page titles need not start with a capital letter, adding $wgCapitalLinks = false; to the file
  2. Since this wiki has extra namespaces beyond the standard ones defined by MediaWiki, I added those to LocalSettings.php. You can find such namespaces by looking at the first few lines of the stubs XML file (see the sketch after this list). Lines added: $wgExtraNamespaces[100] = 'Παράρτημα'; and $wgExtraNamespaces[101] = 'Συζήτηση_παραρτήματος';.
  3. The namespace for the project and for project discussion are typically special localized names. I added those to LocalSettings.php, finding the names in the stub XML file at the beginning: $wgMetaNamespace = 'Βικιλεξικό'; and $wgMetaNamespaceTalk = 'Συζήτηση_βικιλεξικού';
  4. I installed tidy and added the following lines to LocalSettings.php to reflect that: $wgUseTidy = true; and $wgTidyBin = '/usr/bin/tidy';. No configuration file was necessary; one is provided as part of MediaWiki and used by default.
  5. I set up the interwiki cache cdb file, by using fixup-interwikis.py via the command python fixup-interwikis.py --localsettings /var/www/html/elwikt/LocalSettings.php --sitetype wiktionary and then added $wgInterwikiCache = "$IP/cache/interwiki.cdb" to the LocalSettings.php file. (See mw:Interwiki_cache/Setup_for_your_own_wiki for info.)
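
For steps 2 and 3, the namespace names sit in the <namespaces> block near the top of the stubs file; a quick way to see them (the filename is a placeholder):

zcat elwiktionary-20121201-stub-articles.xml.gz \
    | sed -n '/<namespaces>/,/<\/namespaces>/p'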

That was it. This was enough to let me view (most) pages without errors.

Caveats

I didn't deal with the math settings, I forgot to do anything about svg images, I didn't set up memcache, and I didn't deal with lucene search, all things that would be useful for a real mirror. I also skipped over the Cite extension, which would be a must for Wikipedia articles. But this was still enough to let me view (most) pages without errors.

There might be more things one could turn off in MySQL before starting to shovel data in.

I didn't try to fix up Special:Statistics. I would never run the updater script for this, as it would be too slow. The site_stats table should instead be populated by hand, using the count of titles in the main namespace, the number of pages in the main namespace with links (I think this can be gotten by munging the pagelinks table), and so on.
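
If you did want to fill it in by hand, something along these lines should do for the page counts. This is a sketch: it assumes the installer already created the single site_stats row, and it counts a main-namespace, non-redirect page as an "article" if it has at least one outgoing link.

mysql -u root -p elwikt <<'EOF'
UPDATE site_stats SET
    ss_total_pages   = (SELECT COUNT(*) FROM page),
    ss_good_articles = (SELECT COUNT(DISTINCT pl_from)
                        FROM pagelinks JOIN page ON page_id = pl_from
                        WHERE page_namespace = 0 AND page_is_redirect = 0);
EOF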

Because this is not a huge wiki, I didn't need to break up the tab-delimited files into chunks. For very large files, the recommended procedure is to have a script that writes pieces of them to a fifo from which MySQL reads.
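
A sketch of that fifo trick, assuming the MySQL server runs on the same host and is allowed to read the pipe (table and file names are placeholders):

# stream a huge compressed tab-delimited file into MySQL through a named pipe,
# so the uncompressed data never has to sit on disk all at once
mkfifo /tmp/revision.tabs
zcat elwiki-20170420-revision.tabs.gz > /tmp/revision.tabs &       # writer

# reader: mysqld opens the pipe when it executes LOAD DATA
# (secure_file_priv / apparmor may need adjusting so it can read files under /tmp)
mysql -u root -p elwiki -e \
    "LOAD DATA INFILE '/tmp/revision.tabs' INTO TABLE revision FIELDS OPTIONALLY ENCLOSED BY '\\'';"

rm /tmp/revision.tabs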

My mysql configuration settings were laughably on the low side:

  • key_buffer_size=256M
  • max_allowed_packet=20M
  • innodb_buffer_pool_size=512M
  • innodb_additional_mem_pool_size=40M
  • innodb_log_file_size=128M
  • innodb_log_buffer_size=8M

For a large wiki these would not be at all viable.

Wikidata is going to change this procedure. I don't know what's needed for wikis that already have this enabled.

Flagged Revisions enabled on your wiki? You'll want those tables and you'll want to convert and import them. I'm sure there are some extension-specific configuration settings you'll need, and I'm equally sure I have no idea what those are.

Liquid Threads? No frickin' clue. Good luck on that one.