User:Micke/WikiFind

From Meta, a Wikimedia project coordination wiki

WikiFind is a simple program written in C++ for analysing database dumps from a MediaWiki site such as Wikipedia. The program looks for a user-specified keyword (regular expressions can be used) and writes a text file with wiki-formatted links to each page containing that keyword. It can thus be used to look for templates, bits of code, misspelled words and the like, for example in order to build a list to use with a bot.

The source code is available below under the conditions of the GNU General Public License (GPL), version 3 or later: http://www.gnu.org/licenses/gpl.html

Notice

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

If you decide to test the program, I would much appreciate feedback and comments. Leave a message on my talk page or send me an e-mail (and if you know C++, don't be afraid to start fixing things on the TODO list :-) ).

How to

I will not supply executables at this time, which only means that you'll have to compile the program yourself.

Start by pasting the source code below into an ordinary text file named "wikifind.cpp", then:

  • For Unix/Linux users I recommend the GNU Compiler Collection (open source) in which case you can use these commands (more than likely g++ will be pre-installed on your system):

cd /to/directory/where/file/is
g++ wikifind.cpp -o wikifind -lboost_regex
./wikifind

Next time you want to run the program, just run:

cd /to/directory/where/file/is
./wikifind

  • For Windows users I recommend Dev C++ (open source):

Open Dev C++ and open the file you saved, then press the "Compile and run" button. This will compile and start the program. Next time you want to run the program, simply double-click the file "wikifind.exe", which you'll now find in the same folder as "wikifind.cpp".

As you can see below, there are two versions of the program. The second version is intended for use with a bzcat pipe, like this:

bzcat swwiki-20080607-pages-articles.xml.bz2 | ./wikifind "output.txt" "[Kk]eyword"

This means that you don't have to decompress the database dump before searching it.

Regex capability

The regex function requires the Boost regex library (Boost.Regex); without it the program won't compile. If you don't want to (or can't) install the Boost regex library, you can use the old version found here [1].

Linux

First download the library from here (the file you want is called boost_1_34_1.tar.bz2), then unpack it:

tar --bzip2 -xf /path/to/boost_1_34_1.tar.bz2

And then follow the instructions here:

Ubuntu (or other Debian?)

Install the boost library with sudo apt-get install libboost-regex-dev

Compile it with g++ wikifind.cpp -o wikifind -l boost_regex

Windows

First download the library from here (the file you want is called boost_1_33_1.exe):

  • Boost on sourceforge
  • Then download Bjam (the file is called boost-jam-3.1.16-1-ntx86.zip)
  • Double-click boost_1_33_1.exe and install to C:\Boost
  • Move everything from C:\Boost\boost_1_33_1 to C:\Boost and delete C:\Boost\boost_1_33_1
  • Extract bjam.exe from boost-jam-3.1.16-1-ntx86.zip to C:\Boost
  • Open a command line and execute bjam.exe "-sMINGW_ROOT_DIRECTORY=C:\Dev-Cpp" "-sTOOLS=mingw" install [1]. This assumes that you have Dev-C++ installed in C:\Dev-Cpp.
  • Open Dev-C++ and add C:\Boost\include\boost-1_33_1 under "Tools", "Compiler Options", "Directories", "C++ includes" ("Verktyg", "Kompilatoralternativ", "Kataloger", "C++inkluderingsfiler" if you have a Swedish installation)
  • Set "Tools", "Compiler Options", "Directories", "Libraries" ("Bibliotek") to C:\Boost\lib and press OK.
  • Everything should work! Note: I didn't get Boost 1.34.1 to work, so be careful to download Boost 1.33.1 instead.

See official guide for more info:

Reference

  1. Thanks to Jozef Wagner

A few pointers

Since the XML dumps use < and > for tagging, all occurrences of < and > in the wiki code are replaced as follows:

  •  <  becomes  &lt;
    
  •  >  becomes  &gt;
    

This means that you have to use the escaped form in order to search for wiki tags such as <nowiki>:

  •  &lt;nowiki&gt;
    

TODO (and list of things that don't work yet):

  • Translate into English
  • Add regex capabilities
  1. Add capabilities to look for a string divided on two or more lines
  2. Add an option to search for more than one string at a time
    • Add a counter of articles found
  3. Automatically sort page hits alphabetically
  4. Add option to disregard redirects
  5. Make it so that the program only searches within <text> and </text>-tags
  6. Enable specific namespace searches

Source version 1

Copy/paste below or download from here.

////////////////////////////////////////////////////////////
//	WikiFind is a program used for reading database dumps 
//	from MediaWiki, written by Mikael Nordin,  licensed under 
//	the GNU General Public License (GPL) version 3,  
//	or any later version.				                                                               
//	Copyright Mikael Nordin 2008.                                                                                   


#include <iostream> //for cin and cout
#include <string>	//for strings
#include <fstream>  // for ifstream
#include <boost/regex.hpp> //for regex

using namespace std;

string keyword, line, filenamein, filenameout, title, problem, found, looking; //Global variables
string title2 = "qqqqqqxxxxwpppppzzzzzwwwwqqqq"; //title not likely to exist
int nooftitles = 0;

void Lang(); //Sub-routines
void Search();

int main() //main function
{
	Lang(); //select language
	
	Search();	//searching file

    return 0;	
	
}

void Lang() //Localization and input/query/output function
{
	string lang, rest, q1, q2, q3; //variables
	int lang2 = 1;
	
	while (lang2 != 0) //selecting language
	{
		cout << "Välj språk / Please choose language:\n";
		cout << "1. Svenska (sv)\n";
		cout << "2. English (en)\n";
		
		if (!(cin >> lang)) //input closed: fall back to English instead of looping forever
		{
			lang = "en";
		}
	
		if (lang == "sv" || lang == "1") //Swedish localization
		{
			q1 = "Vilken fil vill du genomsöka: ";
			q2 = "Var vill du spara resultatet: ";
			q3 = "Vilket sökord vill du hitta: ";
			problem = "Filen kunde inte öppnas\n";
			found = " träffar gjordes\n";
			looking = "Letar efter: ";
			lang2 = 0;
		}
		else if (lang == "en" || lang == "2")  //English localization
		{
			q1 = "Which file do you want to search: ";
			q2 = "Where do you want to store results: ";
			q3 = "Which string do you want to search for: ";
			problem = "Could not open file\n";
			found = " titles found\n";
			looking = "Looking for: ";
			lang2 = 0;
		}
		else  //incorrect lang choice
		{
			cout << "Fel val / Wrong choice\n";
		}
	}
	
	getline(cin, rest); //discard the rest of the language line
	
	cout << q1;
	getline(cin, filenamein);
	
	cout << q2;
	getline(cin, filenameout);
	
	cout << q3;
	getline(cin, keyword);
}

void Search()  //searching database dump
{
	ifstream FileIn(filenamein.c_str()); //Open dump

	if (!FileIn) //if something goes wrong with file opening
	{
		cout << problem;
		return; //nothing to search
	}
	
	ofstream FileOut(filenameout.c_str(), ios::app); //Open output file (results are appended)
	
	FileOut << "== " << keyword << " ==\n";  //headline to file
	cout << looking << keyword << endl; //what we are doing
	
	boost::regex rexp(keyword); //compile the regex once, not once per line
	
	while (getline(FileIn, line)) //reading file line by line
	{
		//checking to see if it's a pagename (safe for lines shorter than 11 characters)
		if (line.compare(0, 11, "    <title>") == 0)
		{
			title = line; //saving any pagenames
		}
		
		//if keyword is found on a page that is not already stored
		if (title != title2 && title.length() >= 19 && boost::regex_search(line, rexp))
		{
			//removing "    <title>" (11 characters) and "</title>" (8 characters)
			string pagename = title.substr(11, title.length() - 19);
			
			FileOut << "* [[" << pagename << "]]\n"; //wikiformatting to file
			cout << "* [[" << pagename << "]]\n"; //wikiformatting to screen
			
			nooftitles = nooftitles + 1; //counting articles
			title2 = title; //saving new title
		}
	}
	
	FileOut << endl << nooftitles << found << endl;  //printing number of articles to file
	cout << endl << nooftitles << found;  //printing number of articles to screen
}


Source version 2

Copy/paste below or download from here.

////////////////////////////////////////////////////////////
//	WikiFind is a program used for reading database dumps
//	from MediaWiki, written by Mikael Nordin, licensed under
//	the GNU General Public License (GPL) version 3,
//	or any later version.
//	Copyright Mikael Nordin 2008.


#include <iostream> //for cin and cout
#include <string>	//for strings
#include <fstream>  // for ifstream
#include <boost/regex.hpp> //for regex

using namespace std;

string line, title; //Global variables
string title2 = "qqqqqqxxxxwpppppzzzzzwwwwqqqq"; //title not likely to exist
int nooftitles = 0;


void Search(string filenameout, string keyword);

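int main(int argc, char *argv[]) //main function
{
	if (argc < 3) //both arguments are required
	{
		cerr << "Usage: bzcat dump.xml.bz2 | ./wikifind \"output.txt\" \"[Kk]eyword\"" << endl;
		return 1;
	}

	string filenameout = argv[1];
	string keyword = argv[2];
	Search(filenameout, keyword);	//searching standard input

	return 0;
}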

void Search(string filenameout, string keyword)  //searching database dump
{
	ofstream FileOut(filenameout.c_str(), ios::app); //Open output file (results are appended)
	
	FileOut << "== " << keyword << " ==\n";  //headline to file
	cout << "Looking for: " << keyword << endl; //what we are doing
	
	boost::regex rexp(keyword); //compile the regex once, not once per line
	
	while (getline(cin, line)) //reading standard input line by line
	{
		//checking to see if it's a pagename (safe for lines shorter than 11 characters)
		if (line.compare(0, 11, "    <title>") == 0)
		{
			title = line; //saving any pagenames
		}
		
		//if keyword is found on a page that is not already stored
		if (title != title2 && title.length() >= 19 && boost::regex_search(line, rexp))
		{
			//removing "    <title>" (11 characters) and "</title>" (8 characters)
			string pagename = title.substr(11, title.length() - 19);
			
			FileOut << "* [[" << pagename << "]]\n"; //wikiformatting to file
			cout << "* [[" << pagename << "]]\n"; //wikiformatting to screen
			
			nooftitles = nooftitles + 1; //counting articles
			title2 = title; //saving new title
		}
	}
	
	FileOut << endl << nooftitles << " pages found" << endl;  //printing number of articles to file
	cout << endl << nooftitles << " pages found" << endl;  //printing number of articles to screen
}