Research:Data introduction
This page introduces the basic concepts researchers and data scientists should know to start working with Wikimedia data. For an overview focused on data sources and ways to access them, see Research:Data.
The goals of this document are to:
- Start building your understanding of the Wikimedia data landscape.
- Set your expectations about what data is available and the challenges involved in working with it.
- Alert you to unique qualities of Wikimedia data and infrastructure, and help you avoid common analysis pitfalls and misconceptions.
- Introduce you to major data domains, and direct you to more specific information about data sources, access methods, and analysis techniques within each domain.
Essential concepts
Wikimedia projects are some of the largest, most complex, multilingual knowledge bases ever created. That makes Wikimedia data interesting and useful, but it can also make the data confusing and difficult to work with. To avoid making false assumptions or analysis mistakes, and to save time identifying data sources for your work, you should understand these fundamental characteristics of Wikimedia projects.
Wiki projects are more than Wikipedia
Wikipedia may be the best-known wiki project, but there are 18 official Wikimedia projects. The term "wiki projects" usually refers to all the open knowledge wikis supported by the Wikimedia Foundation, which include the Wikipedias, Commons, Wikidata, Wikisource, Wiktionary, and more.
In addition to those "core content projects", Wikimedia wikis include many other projects. Those projects may also include content, or they may focus on other types of activities and contributions that support the movement.
Why does this matter for your analysis?
Being aware of the full range of wiki projects can help you avoid omitting relevant data, or making incorrect assumptions. For example, if you're analyzing wiki content, you may want to consider how Wikipedia articles use images from Commons, and that both images and articles can have associated Wikidata statements. You may also want to investigate if communities are creating content related to your area of inquiry in sister wiki projects.
- For example, see the grouping of sister projects around biological classification: English Wikipedia's WikiProject Tree of Life, Wikispecies, WikiProject Taxonomy on Wikidata, and WikiProject Tree of Life on Commons.
Some wikis have language editions, others are multilingual
MediaWiki is the software that powers the Wikimedia projects. Some wiki projects, like Wikipedia, have a unique MediaWiki instance for each language. Those different MediaWiki instances are called "language editions". Other projects, like Wikidata, are multilingual: they have only one MediaWiki instance, and content is translated into multiple languages within that instance. Each MediaWiki instance has its own database.
The technologies and tools used by Wikimedians can vary substantially between wikis. Communities manage and customize many aspects of MediaWiki, which results in different user interfaces and functionality for different language editions. Different communities may have different governance practices, administrative rules, and collaboration models. Also, local and offline context can influence the makeup of the editor community, the availability of citation sources, and more.
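Because the set of wikis and language editions changes over time, it can help to enumerate them programmatically rather than hard-coding project names. Below is a minimal Python sketch that queries the SiteMatrix module of the Action API on Meta-Wiki; the contact address in the User-Agent header is a placeholder, and the exact response fields should be verified against the live API.

```python
import requests

# Query the SiteMatrix API on Meta-Wiki for the list of wikis and language editions.
# The response field names used below reflect the typical SiteMatrix output;
# verify them against a live response before relying on them.
HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact

resp = requests.get(
    "https://meta.wikimedia.org/w/api.php",
    params={"action": "sitematrix", "format": "json"},
    headers=HEADERS,
)
matrix = resp.json()["sitematrix"]

for key, group in matrix.items():
    # Skip the total count and the "specials" group (multilingual and special wikis).
    if key in ("count", "specials"):
        continue
    sites = group.get("site", [])
    print(group.get("code"), group.get("name"), f"{len(sites)} wikis")
```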
Why does this matter for your analysis?
Always remember that user behavior and content patterns on large wikis, like English Wikipedia, are not universally applicable to all wiki projects. Making comparisons across language editions requires care and an understanding of the many variables involved, and in some cases it may not be feasible.
Understanding cross-lingual differences on Wikimedia projects is essential to making accurate comparisons. Especially if you're working on NLP or building multilingual language models, the quality of your multilingual dataset depends on your understanding of how and why content varies between language editions. Learn more about how to approach research using multilingual Wikimedia content.
User interfaces and software functionality vary
The functionality and appearance of MediaWiki don't just vary by language edition; they can also vary along one or more of the following dimensions:
- User interface customization: Users and wiki admins can choose from multiple MediaWiki skins, and those can be further customized by user scripts and gadgets. For details, visit mw:Manual:Interface.
- Device type: Wikis can be viewed and edited through mobile apps, desktop or mobile web browsers, or specialized devices.
- Platforms: Wiki content can be consumed through search engines, voice assistants, specialized apps, and in many other contexts, without a user ever directly accessing a Wikimedia site.
- Editing interfaces: Wiki content can be edited using VisualEditor or source editing, or via tools or bots that use MediaWiki APIs. See mw:Editor for a full list of editing interfaces.
Why does this matter for your analysis?
In some cases, it might not be possible to compare user behavior or content interactions across different device types or platforms, because the user interface and functionality vary so significantly. For example, some types of Wikidata content reused on Wikipedia, like categories and geographic coordinates, only show up on desktop, not on the mobile web or in apps[1]. If you're researching content consumption or contribution, you should consider and account for the many variables and environments in which those activities occur.
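As an illustration, the public Pageviews REST API lets you break down traffic for a page by access method (desktop, mobile web, mobile app). A minimal Python sketch, using a sample article and date range chosen for illustration and a placeholder contact address in the User-Agent header:

```python
import requests

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact
ARTICLE = "Cat"  # sample article; replace spaces with underscores in real titles

# Compare pageviews for one article across access methods for January 2024,
# counting only traffic classified as human ("user" agent type).
for access in ("desktop", "mobile-web", "mobile-app"):
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"en.wikipedia/{access}/user/{ARTICLE}/monthly/20240101/20240201"
    )
    items = requests.get(url, headers=HEADERS).json().get("items", [])
    print(access, sum(item["views"] for item in items))
```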
Wikitext doesn't fully represent page content: use HTML instead
Wikitext is the markup language the MediaWiki software uses to format pages. If your research involves extracting text from wiki pages, use data sources that provide the page HTML instead of the raw wikitext. Because of pervasive use of templates, modules, extensions, and other tools that alter page appearance and article content, extracting text from the fully-rendered HTML provides a more accurate and complete representation of the page content.[2][3]
Technical details about how the MediaWiki software renders HTML pages:
Core MediaWiki functionality handles most aspects of article parsing, but wiki-specific extensions can alter article content in ways that are more difficult to track. Also, research has shown that some types of Wikidata transcluded content may show up differently (or not at all) on different platforms.[4]
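As a concrete example, you can fetch the fully rendered (Parsoid) HTML of a page from the Wikimedia REST API and extract its text, instead of parsing raw wikitext yourself. The sketch below uses BeautifulSoup purely for illustration; a purpose-built parser such as mwparserfromhtml is better suited for research use, and the contact address in the User-Agent header is a placeholder.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact
TITLE = "Cat"  # sample article

# Fetch the rendered HTML of the page, with templates, modules, and extensions
# already expanded by the MediaWiki parser.
html = requests.get(
    f"https://en.wikipedia.org/api/rest_v1/page/html/{TITLE}", headers=HEADERS
).text

# Pull paragraph text out of the rendered HTML (a rough first pass, not a
# substitute for a dedicated HTML-to-text pipeline).
soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
print("\n\n".join(paragraphs[:3]))
```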
Not all wiki users and editors are humans
[edit]"Our communities are powered by generous humans and efficient robots" — Research:Wikistats_metrics
Wiki content is both consumed and edited by humans and by automated agents or scripts called 'bots'. Bots range from major search engine crawlers that index wiki content to simple scripts written by individual users to scrape a few wiki pages. Bots can influence traffic data, like pageview metrics; vandalism can even take the form of automated bots that seek to influence which pages appear on most-read lists.
Bots perform large numbers of wiki edits and maintenance tasks. They're an essential part of the wiki ecosystem, and they enable human volunteers to focus their time and energy on tasks that require human attention. The availability of bots or other automated filters to do simple tasks on a wiki varies greatly between language editions (see above).
Why does this matter for your analysis?
For most datasets, you must filter out (or consciously decide to include) bot traffic or automated contributions. Most Wikimedia data sources have separate files for automated traffic, or a field that flags individual users as automated agents or bots. For more details about how to identify and work with data related to bots, see the Traffic data and Contributing data overviews, or the documentation for specific datasets.
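For example, the Pageviews REST API exposes an agent dimension (user, spider, automated) for separating human traffic from automated traffic, and the MediaWiki Action API can exclude flagged bot accounts from edit data. A minimal Python sketch of the latter, with a placeholder contact address:

```python
import requests

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact

# List recent changes on English Wikipedia, excluding edits by accounts with the
# "bot" flag. Unflagged automated accounts are not caught by this filter and may
# need additional heuristics or dataset-specific fields.
params = {
    "action": "query",
    "list": "recentchanges",
    "rcshow": "!bot",
    "rcprop": "title|user|timestamp",
    "rclimit": 10,
    "format": "json",
    "formatversion": 2,
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, headers=HEADERS)
for change in resp.json()["query"]["recentchanges"]:
    print(change["timestamp"], change["user"], change["title"])
```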
Wiki content is more than articles, and contributions are more than edits
Not all wiki projects focus on articles or written text, and many projects include multiple types of content within pages. For example, Wikipedia articles often contain page elements like templates, images, and structured data. All of these elements include meaningful information that enriches the article text, but the work of creating or maintaining this non-text content may happen in another wiki, or in a different namespace.
Building and maintaining wiki projects requires many types of contributions beyond just editing articles. Wiki editing is supported and enabled by a wide range of contributions from administrators, patrollers, translators, developers, and many other community members.
Why does this matter for your analysis?
Depending on the wiki project and community, you could miss many types of interactions and contributions if you ignore the full range of content types, namespaces, and ways of contributing. When framing and scoping your research, consider all the different types of wiki content and community activities, and how they interact. For example, conversations on article Talk pages are an important part of the editing and maintenance process; should your analysis include text or data from those pages?
Namespaces and their content may vary
A content namespace is a namespace that contains the content of a Wikimedia project. But what is content? In Wikipedia, "content" has traditionally been limited to articles. In technical terms, that means pages in namespace zero, also called the main namespace or ns0 (after the numeric identifier the software assigns to it).
However, wikis may designate different namespaces as content namespaces. Not all pages in the Main namespace contain content, and not all content that may be relevant for your analysis is in article pages. For example:
- In English Wikipedia's data model, the "Portal" and "Category" namespaces are considered content namespaces, but the pages they contain are not "articles".
- Wikisource has an Author content namespace (example)
- Spanish Wikipedia has an Anexo (Appendix) content namespace (example)
For more details, and tips for identifying the content namespaces in a given wiki, see Research:Content_namespace.
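You can also ask a wiki directly which namespaces it designates as content namespaces, via the Action API's siteinfo module. A minimal Python sketch, using Spanish Wikipedia as the example and a placeholder contact address; the "content" flag in the response marks content namespaces, but verify the field names against the live output.

```python
import requests

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact

# Ask Spanish Wikipedia which of its namespaces are designated as content namespaces.
params = {
    "action": "query",
    "meta": "siteinfo",
    "siprop": "namespaces",
    "format": "json",
    "formatversion": 2,
}
resp = requests.get("https://es.wikipedia.org/w/api.php", params=params, headers=HEADERS)
namespaces = resp.json()["query"]["namespaces"]

for ns in namespaces.values():
    if ns.get("content"):  # True for content namespaces, e.g. the main namespace and Anexo
        print(ns["id"], ns.get("name") or "(main)")
```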
Why does this matter for your analysis?
Understanding namespaces is important because different wikis use namespaces differently, so you should pay attention to, and explicitly decide, which namespaces in each wiki contain content relevant for your analysis.
- For example, if you're analyzing contributions in a given topic area, will you only include text contributions to article pages? What about media uploads, Talk page discussions, or edits to Wikidata items in that topic area?
Privacy is essential, even in public Wikimedia datasets
Protecting anonymity is not only an important value of the Wikimedia movement, it's essential to how the Wikimedia Foundation publishes and shares data. The following are examples of data that is private or unavailable:
- Data about some countries and territories may not be published. See the Country and Territory Protection List.
- IP addresses and User Agent strings
- Search or browsing history of a particular user
- The geographic location of readers on a per-page basis. You can see the total pageviews by country across all Wikimedia sites at stats.wikimedia.org.
Wikimedia public data dumps and datasets contain public information and aggregated or non-personal information from Wikimedia projects. The types of information that are considered "public" are more fully explained in the Wikimedia Foundation Privacy Policy. Since 2021, WMF has used differential privacy on some data releases; learn more about those datasets at Differential_privacy.
Why does this matter for your analysis?
You should understand which information isn't available. For example, there is no way to accurately identify individual readers, and that is by design. Individual editor accounts are easy to identify, but their real-life identities are not.
You should understand how Wikimedia privacy practices influence the assumptions you can make about the data. For example, you can't assume that one user account represents one person, or that an individual user always logs in when making contributions.
Researchers who have a formal collaboration agreement with the Wikimedia Foundation may gain access to additional data sources, and must follow Data Access Guidelines, Data Publication Guidelines, and the Open Access Policy.
Wiki projects have knowledge gaps
As community-curated knowledge bases, the content of Wikimedia projects reflects the knowledge, interests, and priorities of their contributors. There are many wiki communities and projects focused on understanding and reducing inequities in how wiki content is generated and maintained, how wiki communities function, and how larger issues like the digital divide and access to knowledge impact the Wikimedia movement.
Why does this matter for your analysis?
If you're researching wiki content, always keep in mind the possibility of knowledge gaps in areas like gender representation, geographical and linguistic coverage of topics, and more.
If you're analyzing traffic, contributions, or contributors, remember that complex and intersecting socio-technical variables impact individuals and wiki communities. Consider both the offline and online contexts in which people consume wiki content and contribute to the wiki projects. For example, Wikimedia Apps may provide offline access to content, or readers may access articles offline through Kiwix deployments.
Data domains
This section describes the major types of publicly-available, open-licensed data about Wikimedia projects. You should understand the different types of data so you can narrow down your analysis to specific datasets and relevant tools.
Traffic and readership data
[edit]"Traffic data" usually refers to data that is purely about content consumption. Other terms for this type of data are analytics, pageviews, readership, or site usage. This data represents actions or user behaviors that don't modify the status of a wiki. In contrast, actions that change the state of the wiki, and are recorded by the MediaWiki software in its various database tables, are better represented by datasets in other domains, like Contributing or Content.
Example: events that are traffic | Example: events that are not traffic |
---|---|
A user views a file on Commons | A user creates an account |
A web scraper accesses an article on Wikipedia | A bot adds a category to some pages |
Traffic and readership data sources
Traffic and readership data sources include APIs, dumps, specialized datasets, MediaWiki database tables, and dashboards. See Research:Data for the specific sources in each category.
Content
In the context of Wikimedia data, "content" refers to the text, media, or data stored in the wiki project's MediaWiki database. Depending on the project and data source, "content" can mean raw wikitext, parsed HTML of wiki pages, Wikibase JSON, RDF triples, images, or other types of content.
The MediaWiki database structure has separate tables for text, revisions, and pages. When a user edits a page, the act of editing creates a revision. A revision record contains metadata about the page change, but not the changed content itself.
In technical terms, an edit creates a row in the revision table of that wiki's MediaWiki database. The text of that revision is the "content" of the page, but it is stored in the text table, separate from both the revision table and the page table. This data structure means you can analyze contributions and user activities without dealing with large amounts of stored raw content, but if you do need the raw content, you have to combine data from multiple tables (or use a data source that has already combined them).
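To make the separation of metadata and content concrete, the sketch below asks the Action API for revision metadata only, without requesting the revision text (note that "content" is absent from rvprop). The article title and User-Agent contact are placeholders.

```python
import requests

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact

# Fetch metadata for the five most recent revisions of a page, without the
# revision content itself.
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Cat",  # sample article
    "rvprop": "ids|timestamp|user|comment|size",
    "rvlimit": 5,
    "format": "json",
    "formatversion": 2,
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, headers=HEADERS)
page = resp.json()["query"]["pages"][0]
for rev in page["revisions"]:
    print(rev["revid"], rev["timestamp"], rev["user"], rev["size"])
```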
Content data sources
For large-scale content extraction, your primary options for obtaining wiki page HTML (the recommended method) include:
- Use the MediaWiki APIs or scrape wikis directly (computationally expensive and discouraged for large projects)
- Use Wikimedia Enterprise HTML dumps and mwparserfromhtml, a library that makes it easy to parse the HTML content of Wikipedia articles (see the sketch after this list).
- Use a published, curated dataset, like WikiHist.html (limited to English Wikipedia history from 2001 to 2019)[5]
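As a rough sketch of the dump-based route, the snippet below iterates over a Wikimedia Enterprise HTML dump, which is distributed as a tar.gz archive of NDJSON files with one JSON object per article. The filename and the JSON field names ("name", "article_body"/"html") are assumptions to verify against the dump you actually download; for serious parsing of the article HTML, use mwparserfromhtml.

```python
import json
import tarfile

# Iterate over a Wikimedia Enterprise HTML dump and report the size of each
# article's rendered HTML. The path and field names below are assumptions;
# check them against the schema of the dump you download.
DUMP_PATH = "enwiki-NS0-ENTERPRISE-HTML.json.tar.gz"  # placeholder local filename

with tarfile.open(DUMP_PATH, "r:gz") as archive:
    for member in archive:
        fh = archive.extractfile(member)
        if fh is None:
            continue
        for line in fh:
            article = json.loads(line)
            html = article.get("article_body", {}).get("html", "")
            print(article.get("name"), len(html))
            break  # stop after the first article, for illustration
        break  # stop after the first file, for illustration
```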
The table below provides an overview of content data sources beyond just those that provide HTML.
Content data sources include APIs, dumps, specialized datasets, MediaWiki database tables, and dashboards. See Research:Data for the specific sources in each category.
Structured data
For a full overview of available data sources and access methods for Wikidata items and lexemes, see Wikidata:Data_access.
Structured data sources include APIs, dumps, and specialized datasets. See Wikidata:Data_access for the specific sources in each category.
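For example, Wikidata's structured data can be queried with SPARQL through the Wikidata Query Service. A minimal Python sketch using the well-known "instances of house cat" example query and a placeholder contact address:

```python
import requests

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact

# Ask the Wikidata Query Service for a few items that are instances of
# "house cat" (Q146), along with their English labels.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers=HEADERS,
)
for binding in resp.json()["results"]["bindings"]:
    print(binding["item"]["value"], binding["itemLabel"]["value"])
```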
Contributions and contributors
Contributions to the wiki projects take many forms, but editing the content of a wiki is one of the most frequently analyzed types of contribution. Other types of contributor activities include patrolling, account management, organizing events, and technical work like coding gadgets or extensions.
Analyzing edits or editor activities usually involves working only with metadata, as opposed to the full content of wiki pages. Metadata includes information about pages, users, and revisions, but does not include the content of the revision itself.
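As an illustration of working with edit metadata, the Action API's usercontribs list returns per-account contribution metadata without the page content. The username and contact address below are placeholders.

```python
import requests

HEADERS = {"User-Agent": "research-example/0.1 (your-contact@example.org)"}  # placeholder contact

# List recent edits (metadata only) made by one account on English Wikipedia.
params = {
    "action": "query",
    "list": "usercontribs",
    "ucuser": "ExampleUser",  # placeholder username
    "ucprop": "ids|title|timestamp|comment|size",
    "uclimit": 10,
    "format": "json",
    "formatversion": 2,
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, headers=HEADERS)
for contrib in resp.json()["query"]["usercontribs"]:
    print(contrib["timestamp"], contrib["title"], contrib["size"])
```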
Contributions and contributors data sources
Data sources covering contributions/edits and contributors/editors include APIs, dumps, specialized datasets, MediaWiki database tables, and dashboards. See Research:Data for the specific sources in each category.
Tools and data access methods
You should choose a data access method based on your analysis goals, your research domain, the datasets you need, and various other factors. Most research and analysis use cases require combining several data sources and parsing, formatting, and filtering the data before you can analyze it.
In general, Wikimedia datasets are very large, and different sources use different file and compression formats. Researchers and developers have created scripts, libraries, and tools to address many of these challenges.
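For example, working with the XML dumps typically means streaming through very large files with your own code; libraries such as mwxml help with the parsing. A hedged Python sketch follows; the dump filename is a placeholder, and attribute names should be checked against the mwxml documentation for the version you install.

```python
import mwxml  # pip install mwxml

# Stream through a (decompressed) MediaWiki XML dump and print basic revision
# metadata. The filename is a placeholder; decompress the .bz2/.gz dump first
# or wrap it in an appropriate reader.
DUMP_PATH = "enwiki-latest-pages-articles.xml"  # placeholder local filename

dump = mwxml.Dump.from_file(open(DUMP_PATH, "rb"))
for page in dump:
    for revision in page:
        print(page.namespace, page.title, revision.id, revision.timestamp)
        break  # one revision per page, for illustration
    break  # stop after the first page, for illustration
```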
Overview of data access methods
The table below provides an overview of the general benefits and constraints of available data access methods. For more information about the different access methods, see Research:Data.
 | Benefits | Constraints |
---|---|---|
APIs | Relatively easy to use. Responses in standard formats that are easy to parse. Direct access to the data contained in MediaWiki databases through HTTP requests. | Not optimized for bulk access (one page per response in some APIs). Often limited to current or recent data. Restricted to a small number of requests at a time (rate limited). |
Dumps | Query for patterns across a large corpus of data. Includes full content in addition to metadata. Historical data is available. You can use as much computing power as you can provide. | Large file sizes make parsing and extraction cumbersome. Different dumps use different types of aggregation, which requires extra work to compare across datasets. Requires writing your own code (although libraries are available). You must provide the necessary computing power, which may be substantial. Most use less-common formats (XML or SQL statements). |
MediaWiki database replicas | Connect to shared server resources and query a copy of the MediaWiki content databases. Use browser-based query tools like PAWS or Quarry, or command-line tools through Toolforge. | Data is raw, not aggregated. Querying multiple wiki projects requires multiple queries. Queries may be too computationally expensive. Limited to current or recent data. May require a Wikimedia developer account. |
Dashboards | Easy to use. Access via web browser. No need to learn the MediaWiki database structure. | Limited to predefined datasets and filters; may not cover all wiki projects or languages. Often limited to current or recent data. Provide statistics but not raw data. |
Dataset discovery tools
Dataset search engines can help you find current and historical datasets related to wiki projects. Researchers often process Wikimedia data in various ways and release it as derived datasets hosted in many places online. Consider searching for Wikimedia-related datasets at one or more of the following:
- Zenodo
- Figshare
- Dimensions.ai
- Google Dataset Search
- Academic Torrents
- Hugging Face (see also the curated "Wikimedia Datasets" list on Hugging Face).
Learn more
- Research:Data main page
- Learn more about the research process
- Connect with the Wikimedia Foundation's Research team
References
- ↑ Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion/Examples
- ↑ "From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps – [[WM:TECHBLOG]]" (in en-US). 2023-02-24. Retrieved 2024-06-13.
- ↑ "wikimedia/wikipedia · Missing important parts of text (at least in 20231101.fr)". huggingface.co. 2023-12-04. Retrieved 2024-06-13.
- ↑ Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion
- ↑ Mitrevski, Blagoj; Piccardi, Tiziano; West, Robert (2020). "WikiHist.html: English Wikipedia's Full Revision History in HTML Format". doi:10.48550/ARXIV.2001.10256.
- ↑ Piccardi, Tiziano (2024-06-02). "Wikimedia Data Tutorial - Lecture 2: Content (ICWSM 2024)". doi:10.6084/m9.figshare.25954489.v1.
- ↑ Kaffee, Lucie-Aimée (2024-05-30). "Wikimedia Data Tutorial - Lecture 3: Wikimedia Community". doi:10.6084/m9.figshare.25938262.v1.