Research:Wikipedia clickstream

From Meta, a Wikimedia project coordination wiki
This page documents a completed research project.


Monthly public data releases, which report how often two Wikipedia article pages are viewed consecutively, are available at Analytics ClickStream Dataset.

About

The Wikipedia Clickstream dataset contains counts of (referrer, resource) pairs extracted from the request logs of Wikipedia. The referrer comes from the HTTP Referer header field, which identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the "London" article on English Wikipedia during January 2015. We look at desktop, mobile web, and mobile app requests.

Where to get the Data

The canonical citation and most up-to-date version of this dataset can be found at: https://dumps.wikimedia.org/other/clickstream/ (and the readme at https://dumps.wikimedia.org/other/clickstream/readme.html).

Ellery Wulczyn, Dario Taraborelli (2015). Wikipedia Clickstream. figshare. doi:10.6084/m9.figshare.1305770

Data Preparation

For each release, and for several Wikipedia language versions, we take one month's worth of requests for articles in the main namespace. Referrers are mapped to a fixed set of values, based on this scheme:

  • an article in the main namespace -> the article title
  • a page from any other Wikimedia project -> other-internal
  • an external search engine -> other-search
  • any other external site -> other-external
  • an empty referrer -> other-empty
  • anything else -> other-other
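
The mapping scheme above can be sketched as a single function. This is only an illustration, not the pipeline's actual code: the search-engine markers, the Wikimedia domain suffixes, and the namespace prefixes below are hypothetical and deliberately incomplete, and treating same-wiki pages outside the main namespace as other-other is an assumption.

```python
from urllib.parse import urlparse, unquote

# Hypothetical, incomplete lists used only for illustration.
SEARCH_HOST_MARKERS = ("google.", "bing.", "yahoo.", "duckduckgo.")
WIKIMEDIA_SUFFIXES = (".wikipedia.org", ".wiktionary.org",
                      ".wikimedia.org", ".wikidata.org")
NON_MAIN_PREFIXES = ("Talk:", "User:", "Special:", "Wikipedia:",
                     "File:", "Category:")


def map_referrer(referrer, lang="en"):
    """Map a raw referrer URL to an article title or an other-* value."""
    if not referrer:
        return "other-empty"
    host = (urlparse(referrer).hostname or "").lower()
    if not host:
        # Not parseable as a URL with a host.
        return "other-other"
    if host in (f"{lang}.wikipedia.org", f"{lang}.m.wikipedia.org"):
        path = urlparse(referrer).path
        if path.startswith("/wiki/"):
            title = unquote(path[len("/wiki/"):])
            if title and not title.startswith(NON_MAIN_PREFIXES):
                return title  # article in the main namespace
        # Assumption: other same-wiki pages fall into the catch-all bucket.
        return "other-other"
    if host.endswith(WIKIMEDIA_SUFFIXES):
        return "other-internal"
    if any(marker in host for marker in SEARCH_HOST_MARKERS):
        return "other-search"
    return "other-external"
```

For instance, `map_referrer("https://en.wikipedia.org/wiki/London")` yields the title `London`, while a referrer on `de.wikipedia.org` is treated as another Wikimedia project.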

Requests for pages that get redirected were mapped to the page they redirect to. We attempt to exclude spider traffic by classifying user agents with the ua-parser library and a few additional Wikipedia-specific filters. Finally, any `(referrer, resource)` pair with 10 or fewer observations was removed from the dataset. To give you a sense of the scale of the data, the March 2016 release for English Wikipedia contained 25 million `(referrer, resource)` pairs from a total of 6.8 billion requests.
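
The final thresholding step can be illustrated with a minimal sketch. It assumes the requests have already been reduced to `(referrer, resource)` tuples after redirect resolution and spider filtering; the function name `count_pairs` is hypothetical.

```python
from collections import Counter

MIN_COUNT = 10  # pairs observed this many times or fewer are dropped


def count_pairs(requests):
    """Count (referrer, resource) pairs and keep those seen more than MIN_COUNT times."""
    counts = Counter(requests)
    return {pair: n for pair, n in counts.items() if n > MIN_COUNT}
```

A pair seen exactly 10 times is removed, and a pair seen 11 times is kept.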

A note on empty referrers. There's a discussion on Phabricator that broadly suspects unidentified bots and browser bugs to be the main culprits, with deeper dives that look at VPNs, Wikipedia being set as the home page, and switching from mobile apps to mobile browsers when clicking on Wikipedia links. It is worth a read. For further reading, see Groupon's experiment, which finds that a high percentage of Direct traffic is actually Organic search traffic.

Format

The current data includes the following 4 fields:

  • prev: the result of mapping the referrer URL to the fixed set of values described above
  • curr: the title of the article the client requested
  • type: describes (prev, curr)
    • link: if the referrer and request are both articles and the referrer links to the request
    • external: if the referrer host is not en(.m)?.wikipedia.org
    • other: if the referrer and request are both articles but the referrer does not link to the request. This can happen when clients search or spoof their referer.
  • n: the number of occurrences of the (referrer, resource) pair
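
A minimal sketch of reading these four fields follows. It assumes the release file is tab-separated with one pair per line and the columns in the order listed above; check the readme for the layout of the release you use. The `Edge` type and `read_clickstream` function are hypothetical names, and the `type` field is stored as `link_type` to avoid shadowing the Python builtin.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class Edge(NamedTuple):
    prev: str       # mapped referrer (article title or other-* value)
    curr: str       # title of the requested article
    link_type: str  # the dataset's `type` field: link, external, or other
    n: int          # number of occurrences of the pair


def read_clickstream(handle: TextIO) -> Iterator[Edge]:
    """Yield one Edge per well-formed row of an open tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows
        prev, curr, link_type, n = row
        if not n.isdigit():
            continue  # skip a header row or a non-numeric count
        yield Edge(prev, curr, link_type, int(n))
```

Because the function is a generator, a full monthly file can be streamed row by row, for example with `open(path, encoding="utf-8", newline="")`, without loading it into memory.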

Releases

As the project has evolved, the exact details of how the data was generated have changed. Below is a list of releases, with notes where the data preparation or format differs from what is described above. Data is based on requests from desktop, mobile web, and mobile apps.

From June 2019

From November 2017

January 2017

  • released a dataset for English (2017_01_en)

September 2016

  • released a dataset for English (2016_09_en)

August 2016

  • released a dataset for English (2016_08_en)
  • released a dataset for English (2016_08_en_unresolved) where redirects were not resolved in the usual manner. Instead, the requested "current article" is captured in the curr_unresolved column. This means that page titles in this column can be redirects. In this case, the curr column captures what page the user was redirected to.

April 2016

  • released datasets for the English, Arabic, and Farsi Wikipedias

March 2016

  • external referrers were mapped to a more granular set of fixed values

February 2016

  • external referrers were mapped to a more granular set of fixed values

February 2015

  • data also included page ids for prev and curr
  • only requests to the desktop version were used (after this, we look at mobile web and mobile app requests)
  • requests from clients who made too many requests were removed (for details, see here and here)
  • redlinks were included as a type
  • external referrers were mapped to a more granular set of fixed values

January 2015

  • data also included page ids for prev and curr
  • only requests to the desktop version were used
  • redirects were not resolved
  • external referrers were mapped to a more granular set of fixed values

Applications

This data can be used for various purposes:

  • determining the most frequent links people click on for a given article
  • determining the most common links people followed to an article
  • determining how much of the total traffic to an article clicked on a link in that article
  • generating a Markov chain over Wikipedia
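
The four uses listed above can be sketched on a handful of rows. The rows in `EDGES` are invented toy values, not real counts, and the function names are hypothetical. Two caveats apply: the click-through figure approximates an article's total traffic by the sum of its incoming pairs, which misses pairs removed by the 10-observation threshold, and the transition probabilities are normalized over recorded outgoing pairs only, so the chain has no state for readers who leave the site.

```python
from collections import defaultdict

# Invented toy rows (prev, curr, type, n) for illustration only.
EDGES = [
    ("other-search", "London", "external", 600),
    ("other-empty", "London", "external", 300),
    ("England", "London", "link", 100),
    ("London", "River_Thames", "link", 150),
    ("London", "England", "link", 50),
    ("London", "Paris", "other", 20),
]


def top_outgoing(edges, article, k=10):
    """Most frequent destinations reached from `article`."""
    rows = [(curr, n) for prev, curr, _kind, n in edges if prev == article]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]


def top_incoming(edges, article, k=10):
    """Most common referrers leading to `article`."""
    rows = [(prev, n) for prev, curr, _kind, n in edges if curr == article]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]


def click_through_rate(edges, article):
    """Link clicks out of `article` divided by recorded traffic into it."""
    views = sum(n for _prev, curr, _kind, n in edges if curr == article)
    clicks = sum(n for prev, _curr, kind, n in edges
                 if prev == article and kind == "link")
    return clicks / views if views else 0.0


def transition_probabilities(edges):
    """Row-normalize the counts into first-order Markov transition probabilities."""
    totals = defaultdict(int)
    for prev, _curr, _kind, n in edges:
        totals[prev] += n
    probs = defaultdict(dict)
    for prev, curr, _kind, n in edges:
        probs[prev][curr] = probs[prev].get(curr, 0.0) + n / totals[prev]
    return dict(probs)
```

On the toy rows, `click_through_rate(EDGES, "London")` divides 200 link clicks by 1000 recorded incoming requests, and each row of `transition_probabilities(EDGES)` sums to 1.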

Some examples:
