Rxr for WikiXRay
Copy and paste the following code into a text file and save it as rxr.py. Don't forget to give the file execute permission.
The program reads the standard XML file of a MediaWiki dump (by default from standard input), looking for the page whose id is given in the string variable firstPageId. Everything before the first <page> element (the dump header) is copied through unchanged, and the pages that precede the target page are skipped. Once the target page is found, the program writes it and everything after it to the output (by default standard output), producing valid XML that the WikiXRay parser can parse.
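The behaviour described above can be illustrated on a tiny in-memory dump. This is only a sketch, written for Python 3 (the script itself targets Python 2); the names resume_from and SAMPLE are hypothetical and are not part of rxr.py. Like the script, it assumes that the <title> and <id> lines directly follow each <page> line.

```python
import io
import re
import xml.dom.minidom

PAGE_RE = re.compile(r'\s*<page>\s*')
ID_RE = re.compile(r'\s*<id>(.+)</id>\s*')


def resume_from(lines, first_page_id):
    """Yield the dump header, then every line from the target page onwards."""
    lines = iter(lines)
    first_page_found = False  # True once any <page> line has been seen
    writing = False           # True once the target page has been reached
    for line in lines:
        if not writing and PAGE_RE.match(line):
            first_page_found = True
            # Assumption: <title> and <id> directly follow <page>.
            title_line = next(lines, '')
            id_line = next(lines, '')
            match = ID_RE.match(id_line)
            if match and match.group(1) == str(first_page_id):
                writing = True
                yield line
                yield title_line
                yield id_line
            continue
        # Header lines (before any page) and lines after the target pass through.
        if not first_page_found or writing:
            yield line


# Hypothetical, heavily simplified dump used only for illustration.
SAMPLE = """<mediawiki>
  <siteinfo>
    <sitename>Example</sitename>
  </siteinfo>
  <page>
    <title>First</title>
    <id>1</id>
    <revision><id>10</id></revision>
  </page>
  <page>
    <title>Second</title>
    <id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>
"""

result = ''.join(resume_from(io.StringIO(SAMPLE), '2'))
# Raises an exception if the filtered output is not well-formed XML.
xml.dom.minidom.parseString(result)
print(result)
```

In this sample the page with id 1 is dropped, while the header, the page with id 2 and the closing </mediawiki> tag are kept, so the output is still a well-formed document.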
This program helps to resume the analysis of a wiki when it has been interrupted before completion.
It is typically used as follows:
7za e -so enwiki-20100130-pages-meta-history.xml.7z | python RikyXRay/rxr.py | python dump_sax.py
rxr slots transparently into the middle of the pipe, between the decompression step and the WikiXRay parser.
#############################################
# rxr: a preprocessor for WikiXRay
#############################################
# This program is free software. You can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 or later of the GPL.
#############################################
# Author: Riccardo Tasso
import sys,codecs,re

def main(input):
    # Id of the page from which the analysis must be resumed
    firstPageId = '12345'
    # Progress messages go to a log file, since stdout carries the XML
    error = open('error.log', 'w')
    pageP = re.compile(r'\s*<page>\s*')
    lastPageTag = None
    pageCount = 0
    lastTitleTag = None
    idP = re.compile(r'\s*<id>(.+)</id>\s*')
    lastIdTag = None
    firstPageFound = False
    startWriting = False
    line = input.readline()
    while line != '':
        if not startWriting and pageP.match(line):
            firstPageFound = True
            lastPageTag = line
            pageCount += 1
            if pageCount % 10000 == 0:
                error.write(str(pageCount) + ' pages found\n')
                error.flush()
            # <title> and <id> are expected right after <page>
            lastTitleTag = input.readline()
            lastIdTag = input.readline()
            pageId = idP.match(lastIdTag).group(1)
            if str(firstPageId) == str(pageId):
                startWriting = True
                error.write('work reprise for page id: ' + str(pageId) + '\n')
                print lastPageTag.strip('\n')
                print lastTitleTag.strip('\n')
                print lastIdTag.strip('\n')
            line = input.readline()
            if line == '':
                break
        # Copy the dump header, and everything from the target page onwards
        if not firstPageFound or startWriting:
            print line.strip('\n')
        line = input.readline()
    error.close()
    return 0

if __name__ == '__main__':
    # Adapt stdout to Unicode UTF-8
    sys.stdout=codecs.EncodedFile(sys.stdout,'utf-8')
    input = sys.stdin
    main(input)
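
In the script above the page id is hard-coded in firstPageId, so the file has to be edited before every run. One possible variation, shown here only as a sketch, is to read the id from the command line. The function parse_args, the positional argument first_page_id and the --log option are hypothetical and do not exist in rxr.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the page id to resume from (hypothetical option layout)."""
    parser = argparse.ArgumentParser(
        description='Resume a MediaWiki dump stream from a given page id.')
    parser.add_argument('first_page_id',
                        help='id of the first page to pass through')
    parser.add_argument('--log', default='error.log',
                        help='file that receives the progress messages')
    return parser.parse_args(argv)


# Example: equivalent to running the script with the single argument 12345.
args = parse_args(['12345'])
first_page_id = args.first_page_id  # kept as a string, like firstPageId
log_path = args.log
```

With such a change, main could take first_page_id and log_path as parameters instead of defining firstPageId and the log file name itself.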