Jump to content

Data dumps/Perl importing script

From Meta, a Wikimedia project coordination wiki

This is a script Tbsmith made to import only pages in certain categories. It works for Mediawiki 1.5.

It is not written up as a tutorial but it will give you a good idea of how to find the pages you want and import them. It does not bring images, although I'm sure it would be easy to do.

Warning

This will put pages on your site, but it doesn't worry about the links tables or anything else so it's not perfect. There are scripts in /maintenance/ to rebuild the links tables and search index.

/maintenance/tables.sql is where I got most of my information to build this.

#!/usr/bin/perl
use Parse::MediaWikiDump;
use DBI;
use DBD::mysql;

$server         = "localhost";
$name           = "dbname";
$user           = "admin";
$password       = "password";

$dsn = "DBI:mysql:database=$name;host=$server;";
$dbh = DBI->connect($dsn, $user, $password);
$pages = '/home/pages_articles.xml';
$pages = Parse::MediaWikiDump::Pages->new($pages);
print "Done parsing.\n";
#print qq{
#$pages->sitename
#$pages->base
#$pages->generator
#$pages->case
#$pages->namespaces
#information about a page record
#$page->redirect
#$page->categories
#$page->title
#$page->id
#$page->revision_id
#$page->timestamp
#$page->username
#$page->userid
#$page->minor
#$page->text 

#  good_categories is a file with a list of categories
#  from which you'd like to import pages, one per line

open I, "good_categories" or die $!;
while (<I>) {
 chomp;
 $cat{$_}= 1;
}
while(defined($page = $pages->page)) {
 next if $page->redirect;
 $title = $page->title;
 $timestamp = $page->timestamp;
 $text = $page->text;
 @cats = @{$page->categories};
 $cats = join ', ', @cats;
 # This part sees if it matches your category
 $matchcat = 0;
 for (@cats) {
  if (exists $cat{$_}) {
   $matchcat = 1;
   last;
  }
 }
 next unless $matchcat;
 $text = $$text;
 $text =~ s/\\/\\\\/g;
 $text =~ s/'/\\'/g;
 $size = length($text);
 $title =~ s/ /_/g;
 $title =~ s/\W//g; 

 # won't insert a page with the same title as an existing page

 $sth = $dbh->prepare("select count(*) from page where page_title='$title'");
 $sth->execute;
 $row = $sth->fetchrow_hashref;
 next if $row->{'count(*)'}>0;
 $dbh->do("insert into text (old_text) values ('$text')");
 $text_old_id = $dbh->{'mysql_insertid'};
 # 0 is the main namespace.
 $namespace = 0;
 $dbh->do("insert into page (page_namespace, page_title, page_is_new, page_random, page_latest,  page_len) values ($namespace, '$title', 1, rand(), 0, $size)"); 
 $page_id = $dbh->{'mysql_insertid'};
 # This will be the author or the pages
 $auth = 'Admin';
 $authid = 1;
 $dbh->do("insert into revision (rev_page, rev_text_id, rev_user, rev_user_text) values ($page_id,  $text_old_id, $authid, '$auth')");
 $rev_id = $dbh->{'mysql_insertid'};
 $dbh->do("update page set page_latest = $rev_id where page_id = $page_id");
 print <<STOP;
Page_id($page_id) -- Old_id($text_old_id) -- Rev_id($rev_id)
Importing Title: ($title)
-----------------------------------------
STOP
}