Wikimedia Data Tutorial ICWSM 2024

Wikimedia Data Tutorial:
Using public data from Wikipedia and its sister projects
for academic research

held at ICWSM 2024

This is the website for the tutorial Wikimedia data How-to: Using public data from Wikipedia and its sister projects for academic research at the International AAAI Conference on Web and Social Media (ICWSM) 2024.

Motivation

Wikimedia's data is one of the largest available resources of multi-modal data, including articles across 326 language versions of Wikipedia, millions of images on Wikimedia Commons, and structured data in the Wikidata knowledge graph. Behind this well of information are large, active, and global communities working together. As part of its mission to disseminate open knowledge, the Wikimedia Foundation makes this data available under open licenses. In contrast, access to data about online communities on many other platforms is becoming more constrained. Given how widely Wikipedia and its sister projects are used around the world, Wikimedia data is one of the most useful resources for investigating our online habits and understanding the dynamics of community governance and collaboration in online spaces (see, e.g., Hill & Shaw: The most important laboratory for social scientific and computing research in history).

However, working with this information can be challenging in practice. These challenges include navigating the Wikimedia ecosystem (How to identify relevant data for specific research questions?), technical barriers (How to access available datasets via dumps?), and best practices (How to pre-process/filter the datasets?).

In this tutorial, we discuss the different data formats available through the Wikimedia projects, how to access them, and the best practices for working with Wikimedia information and the community creating this information. We will combine lecture-style moments with hands-on exercises to make everyone ready to start the next project using these resources.

Description

Format

The tutorial will be a mix of lecture-style presentations and hands-on exercises. In the first part, we plan to provide an overview of the Wikimedia ecosystem, its public data, and the opportunities it offers for research. In the second part, we will see how to access and use this data with step-by-step exercises.

We plan to reserve time for Q&A and free-form discussion between participants to address challenges they have faced or are currently facing accessing Wikimedia resources.

Lecture 1: Introduction. An overview of Wikipedia and its sister projects, the Wikimedia research community, and entry points for working with data.

Lecture 2: Modeling content. A deep dive into working with the content of the different projects.

Lecture 3: Modeling behavior. A deep dive into understanding the behavior of users by analyzing editors, readers, and community processes.

Exercises: My first project. Breakout sessions with hands-on work running code to answer specific research questions.

Target audience

The tutorial is aimed at researchers who would like to learn how to use Wikimedia data in their research. While most of the data is public, researchers might not know that many of these datasets exist, or might not know where to start in order to access and use them effectively. Given the richness of Wikimedia data, this tutorial is of interest to researchers across different disciplines, e.g., Computational Social Science, analysis of Online Social Networks, Natural Language Processing, or Multimodal models, as well as to social scientists without a dedicated CS background.

Outcomes

The tutorial aims to give the participants the tools to enrich their research agenda using data from Wikimedia projects. Specifically, participants will (1) get an overview of the richness of Wikimedia data (e.g., Wikipedia, Wikidata, Wikimedia Commons) and its value for research; (2) get familiar with the different ways of accessing the data on the Wikimedia projects (dumps, APIs) and readily available tools for processing the data; (3) explore which problems they currently have accessing these resources and how to overcome those barriers; (4) test some of the APIs and other access points to the data themselves.

Participants attending the tutorial will learn the different access points to Wikimedia resources, among others the Wikipedia API, data dumps, Wikidata SPARQL queries, the Wikimedia Commons API, and similar data access points for other sister projects. Participants will become aware of which types of data they can access besides Wikipedia articles and Wikidata facts, such as community information in article discussions, edit history, and others. We will discuss, and ideally solve, the challenges participants currently face or have faced in the past when accessing Wikimedia data. Participants will also learn where they can reach out to the Wikimedia research community for help with future questions and challenges.
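As a rough illustration of one of these access points, the sketch below asks the MediaWiki Action API for the latest entries in a page's edit history. It is not taken from the tutorial notebooks: the function names, the choice of English Wikipedia as the host, and the User-Agent string are assumptions made for this example, and only the Python standard library is used.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; other wikis expose the same API on their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(title, limit=5):
    """Build an Action API URL asking for the latest revisions of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,  # returns "pages" as a list instead of a dict keyed by page id
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_revisions(response):
    """Flatten the JSON response into (title, timestamp, user) tuples."""
    rows = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            # "user" can be absent when the username has been hidden.
            rows.append((page["title"], rev["timestamp"], rev.get("user")))
    return rows


def fetch_revisions(title, limit=5):
    """Send the request (needs network access) and return the parsed rows."""
    request = urllib.request.Request(
        build_revisions_url(title, limit),
        # Placeholder value; replace with a descriptive User-Agent of your own.
        headers={"User-Agent": "icwsm-tutorial-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_revisions(json.load(reply))
```

Calling fetch_revisions("Wikimedia Foundation", limit=3) in a PAWS notebook or on a laptop with internet access should return up to three tuples, one per recent edit. The URL building and the parsing are kept in separate functions so that each can be inspected without sending a request.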

Prerequisites

Basic knowledge of Python is desirable to fully take advantage of the hands-on exercises using Jupyter Notebooks. Other useful but not mandatory skills are a basic understanding of data processing and APIs.

Participants in the exercise session only need a laptop with internet access and a browser. To simplify setup, we will use PAWS, which is an environment to run Jupyter Notebooks hosted by the Wikimedia Foundation. For preparation, setting up a free Wikimedia account is desirable (link to create an account).

Past precedent

A tutorial giving an overview of data from Wikimedia projects for research was presented at WWW’2020: Saez-Trumper, D., & Redi, M. (2020). Wikimedia Public (Research) Resources. Companion Proceedings of the Web Conference 2020, 311–312. That tutorial provided an introduction to the different publicly available resources. Our tutorial differs substantially from it in two ways.

First, several years have passed since the most recent tutorial. With some of the changes in the online ecosystem towards a more closed infrastructure (e.g., restrictions on access to the Twitter API), Wikimedia’s open data resources gain new relevance for researchers. In addition, many new datasets have been made available over the past years that are likely to be very useful for researchers.

Second, our tutorial will focus on topics relevant specifically to the ICWSM community. Besides access to the data of different Wikimedia projects, we will discuss possibilities for research using data about the Wikimedia community, such as article discussions or the edit history. We believe that focusing on the Wikipedia community, and on how to access the data most relevant in this context, is helpful for both sides: ICWSM researchers gain access to rich community data, and the Wikimedia community can benefit from the ICWSM community’s knowledge, including automated tools and insights that help facilitate community processes.

Schedule

Logistics

When: June 3, 2-6pm (UTC-4/EDT); check the full conference program
Where: Room 1225B, Jacobs School of Medicine and Biomedical Sciences, 955 Main St, Buffalo, NY 14203

Materials

Lecture 1: Overview. Slidedeck on figshare
Lecture 2: Content. Slidedeck on figshare
Lecture 3: Wikimedia Community. Slidedeck on figshare
Hands-on work
- Getting started with PAWS: notebook
- Working with Wikimedia content: notebook
- Working with community data through the Mediawiki API: notebook
- Working with HTML dumps: notebook

Organizers

Martin Gerlach (Wikimedia Foundation)
Lucie-Aimée Kaffee (Hugging Face)
Tiziano Piccardi (Stanford University)

If you have questions about the tutorial, start a thread on the talk page (click the Discussion button above) or send us an email.