Research:MDM - The Magical Difference Machine
Topic
There are a number of research questions that in some form or other depend on knowing exactly what was changed, added, or removed during a revision. Unfortunately, this data is not easily accessible via the Wikipedia dumps or databases, which only contain the full text of each revised article. The goal of this sprint is to create a system for quickly producing and querying a dataset containing the diffs of all revisions in the English Wikipedia. The idea is to define a broad data structure that can then be used to answer the research questions and generate datasets based upon the diffs.
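As a rough illustration of such a data structure, one diff record stored in MongoDB might look like the sketch below. Every field name here is a hypothetical placeholder, not a settled schema.

```python
# A hypothetical shape for one diff record stored in MongoDB.
# All field names are placeholders, not a settled schema.
example_diff = {
    "rev_id": 12345678,        # revision that enacted the change
    "page_id": 1234,
    "namespace": 0,
    "editor": "Example editor",
    "timestamp": "2011-06-01T12:00:00Z",
    "added": ["{{welcome}}"],  # chunks of text inserted by this revision
    "removed": [],             # chunks of text deleted by this revision
    "bytes_added": 11,
    "bytes_removed": 0,
}
```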
Below is a list of requirements and expectations for the capabilities of the revision diff (the change enacted by an edit) database that the quants are building. Think of this document as a wish list of things you would like to be able to search for that relate to the page text changed by edits.
Feature | Priority (Justification)
---|---
Find which revisions inserted a given template (or set of templates) | High (We need to count the frequency and location of the application of templates such as welcomes, warnings, etc.; see the query sketch below the table.)
Count of total content added/removed by an editor (by namespace) | High
Simple to associate with revert information, e.g. was reverted, is reverting, for vandalism, etc. | High (Figuring out what is and isn't a revert and how often it happens would be a huge boon. We also have to note when an addition of content is actually a revert of a previous blanking of content.)
References/citations, i.e. ref tags and citation templates | Moderate (This might suggest how successful editors are with our complicated sourcing syntax and, more importantly, is a clear measure of the quality of edits. It will also allow us to identify editors who are great at adding references as a taxonomy activity. Knowing the sourcing gurus could be useful.)
Structural changes, such as the addition or removal of sections | Low (Use or removal of proper sections is one useful measure of quality: vandals often blank sections, and proper section syntax is often a sign of a quality addition.)
External and internal links added or removed | Low (Interesting, but link use in general is unlikely to have a causal relationship with new editor retention, though heavy external link use is usually a sign of low quality.)
Were cleanup templates or citation-needed templates added or removed | Moderate (This is generally interesting as a look at how often editors apply these tags to each other's work, and it may be one of the factors that has led to lower retention rates.)
What is the ratio of markup to content in the diff (i.e. complexity) | Moderate (Growth in complexity over time is interesting, as is how good newbies really are at using complex markup, but it's not vital.)
Links to policy and guidelines (WP:___) | Low (We already know people frequently cite these.)
Images added or removed | Low (This is probably a separate question from links, though.)
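To make the highest-priority feature concrete, a search for revisions that inserted a given template might look like the following sketch. It assumes the hypothetical record shape shown earlier and a collection named "diffs"; both names are placeholders, not a settled interface.

```python
# Sketch: find revisions that inserted a given template, assuming the
# hypothetical record shape above and a placeholder collection "diffs".
import re
from pymongo import MongoClient

client = MongoClient()  # defaults to localhost:27017
diffs = client["wikilytics"]["diffs"]

# Match any added chunk that opens the {{welcome}} template.
template = re.compile(r"\{\{\s*welcome", re.IGNORECASE)
for record in diffs.find({"added": template}):
    print(record["rev_id"], record["editor"])
```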
Process
The system will use MapReduce to process the dumps via Hadoop and store the data in MongoDB (i.e. Wikilytics).
A rough description of the system and process is as follows:
- Parse the dumps and feed them to the map function via the Hadoop streaming interface
- Send each page to the map function
- The map function will yield diffs to the reduce function and store them in MongoDB (a mapper sketch follows this list)
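A minimal sketch of such a streaming mapper is shown below. It assumes the parsing step has already flattened each pair of consecutive revisions into one tab-separated line, a simplification standing in for the real XML dump parsing; the diff itself uses Python's difflib.

```python
#!/usr/bin/env python
"""Hadoop streaming mapper sketch. Assumes each input line holds a pair of
consecutive revisions as: page_id <TAB> rev_id <TAB> old_text <TAB> new_text
(a simplified stand-in for real XML dump parsing)."""
import sys
import json
import difflib

for line in sys.stdin:
    try:
        page_id, rev_id, old_text, new_text = line.rstrip("\n").split("\t", 3)
    except ValueError:
        continue  # skip malformed input records
    old_tokens, new_tokens = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
        if tag in ("delete", "replace"):
            removed.extend(old_tokens[i1:i2])
    # Hadoop streaming convention: key <TAB> value on stdout.
    print("%s\t%s" % (rev_id, json.dumps(
        {"page": page_id, "added": added, "removed": removed})))
```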
The sprint will consist of two major milestones:
- Generate the Diff Dataset
In the first week, we will build a system to quickly produce and store the diff dataset.
- Searching Interface
In the second week, we will create the interface to perform searches and generate datasets from the diff dataset produced in the first week.
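As one example of what that interface could produce, the sketch below generates a per-editor total of content added (the "count of content added/removed" requirement above) using a MongoDB aggregation. Collection and field names are the same hypothetical placeholders used earlier.

```python
# Sketch: per-editor totals of content added, over the placeholder
# "diffs" collection and hypothetical field names from earlier.
from pymongo import MongoClient

diffs = MongoClient()["wikilytics"]["diffs"]
pipeline = [
    {"$match": {"namespace": 0}},  # restrict to the article namespace
    {"$group": {"_id": "$editor", "total_added": {"$sum": "$bytes_added"}}},
    {"$sort": {"total_added": -1}},
]
for row in diffs.aggregate(pipeline):
    print(row["_id"], row["total_added"])
```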