Research:Language-Agnostic Topic Classification

From Meta, a Wikimedia project coordination wiki
Created: 21:06, 14 May 2020 (UTC)
Duration: September 2019 – May 2023
This page documents a completed research project.


This project comprised a number of complementary efforts to build language-agnostic topic classification models for Wikipedia articles, i.e., models that label a Wikipedia article with one or more high-level topics using a single model that can provide predictions for any language edition of Wikipedia. The project built directly on ORES's language-specific models, which make topic predictions based on the text of the article. The work resulted in the link-based topic classification model, which has been put into production and is used by various tools, such as for guiding newcomers toward relevant articles to edit.

Methods


The core approach shared by these models is to apply language-modeling techniques while relying on Wikidata as the feature space, so that predictions are based on a shared, language-independent vocabulary that can be applied to any Wikipedia article. All of the models use the same groundtruth set of topic labels for English Wikipedia, which is derived from the WikiProject directory. This page describes the method used by the production model, though alternative approaches have been considered as well.


For this approach, we use the actual Wikipedia articles to make predictions, but represent them as a bag of outlinks to Wikidata items. The high-level view of this process is as follows:

  • For a given article, collect all of the Wikipedia articles that it links to (same wiki; namespace 0). I currently use the pagelinks table, but links could be extracted directly from the article's wikitext or parsed HTML for greater control.
  • Resolve any redirects.
  • Map these outlinks (in the form of page IDs) to their associated Wikidata items (QIDs).
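The steps above can be sketched in a few lines. This is a minimal illustration using hypothetical in-memory lookup tables for the redirect and page-to-QID mappings; in practice these mappings come from the MediaWiki pagelinks, redirect, and page_props tables (or equivalent API calls):

```python
def article_to_qid_bag(outlink_ids, redirect_map, page_to_qid):
    """Represent an article as a bag of Wikidata QIDs.

    outlink_ids: page IDs the article links to (same wiki, namespace 0)
    redirect_map: page ID of a redirect -> page ID of its target
    page_to_qid: page ID -> Wikidata item ID (e.g., "Q123")
    """
    bag = []
    for page_id in outlink_ids:
        # Resolve redirects (single hop shown here for simplicity).
        target = redirect_map.get(page_id, page_id)
        # Map the target page to its Wikidata item; skip pages without one.
        qid = page_to_qid.get(target)
        if qid is not None:
            bag.append(qid)
    return bag
```

For example, if page 2 redirects to page 4, the bag for an article linking to pages 1, 2, and 3 would contain the QIDs of pages 1 and 4 (page 3 dropped if it has no Wikidata item).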

Using en:Langdon House as an example, this article (as of May 2020) would be represented as (example API call):

Note that most of these links actually come from this navigation template, which illustrates the impact of the link source: the pagelinks table (all links retrieved) vs. wikitext (no transcluded links retrieved) vs. HTML (a choice can be made between standard and transcluded links).

A machine learning model can then be trained that takes as input a bag of words (where the words are QIDs) and outputs topic predictions. Because this model uses QIDs as its feature space, it can make predictions for any article from any Wikipedia language edition, so long as that article's outlinks are first mapped to Wikidata items.
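As a toy stand-in for the trained production classifier, the sketch below scores each topic by how often the article's QIDs co-occurred with that topic in (hypothetical) training data. This is only meant to make the bag-of-QIDs input/output shape concrete, not to reflect the actual model architecture:

```python
from collections import Counter, defaultdict


def train_topic_scores(labeled_articles):
    """labeled_articles: list of (bag_of_qids, set_of_topics) pairs.

    Returns topic -> Counter of QID counts: a crude linear scoring model
    over the QID vocabulary.
    """
    scores = defaultdict(Counter)
    for bag, topics in labeled_articles:
        for topic in topics:
            scores[topic].update(bag)
    return scores


def predict_topics(bag, scores, threshold=1):
    """Score each topic by summed QID counts; keep topics at/above threshold."""
    preds = {}
    for topic, counter in scores.items():
        score = sum(counter[qid] for qid in bag)
        if score >= threshold:
            preds[topic] = score
    return preds
```

Because the vocabulary consists of QIDs rather than natural-language tokens, the same trained scores can be applied to an article from any language edition once its outlinks have been mapped to Wikidata items.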

Wikidata properties and categories


While the original model was built purely on link-based predictions, we have slowly been incorporating complementary approaches to expand coverage and performance and to build on the work Wikimedians do to annotate content beyond just adding links. The country model, which expands the original set of predicted geographic regions (e.g., Western Europe) to individual countries (e.g., France), builds on the fact that Wikimedians identify relevant countries quite frequently in annotations, so countries often do not need to be inferred from context but can be extracted directly. For example, the model uses categories like Flora of France, whose corresponding Wikidata item identifies France as its main topic. It can also use Wikidata properties like endemic to (P183) or country of citizenship (P27).
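The direct-extraction idea can be sketched as follows, assuming a simplified representation of an item's Wikidata claims as a property-to-values dict (real claims, e.g. from the wbgetclaims API, are more deeply nested):

```python
# Properties whose values can directly identify a country
# (an illustrative subset, not the model's full property list).
COUNTRY_PROPERTIES = ["P183", "P27"]  # endemic to, country of citizenship


def extract_countries(claims, country_qids):
    """claims: property ID -> list of value QIDs (simplified claim format).
    country_qids: set of QIDs known to be countries (e.g., Q142 = France).

    Returns the set of country QIDs stated directly on the item, with no
    inference from context needed.
    """
    found = set()
    for prop in COUNTRY_PROPERTIES:
        for value in claims.get(prop, []):
            if value in country_qids:
                found.add(value)
    return found
```

For an item with country of citizenship (P27) set to Q142, this would return France directly, whereas unrelated claims (e.g., instance of) are ignored.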

Subpages
