Research:Language-Agnostic Topic Classification

From Meta, a Wikimedia project coordination wiki
Created: 21:06, 14 May 2020 (UTC)
Duration: September 2019 – May 2023
This page documents a completed research project.


This project comprised a number of complementary efforts to build language-agnostic topic classification models for Wikipedia articles, i.e., models that label a Wikipedia article with one or more high-level topics using a single model that can provide predictions for any language edition of Wikipedia. The project built directly on ORES's language-specific models, which make topic predictions based on the text of the article. The work resulted in the link-based topic classification model, which has been put into production and is used by various tools, such as for guiding newcomers toward relevant articles to edit.

Methods


The core approach shared by these models is to apply language-modeling techniques while relying on Wikidata as the feature space, so that predictions are based on a shared, language-independent vocabulary that can be applied to any Wikipedia article. All of the models use the same groundtruth set of topic labels for English Wikipedia, which is derived from the WikiProject directory. This page describes the method used by the production model, though alternative approaches have been considered as well.


For this approach, we use the actual Wikipedia articles to make predictions, but represent them as a bag of outlinks to Wikidata items. The high-level view of this process is as follows:

  • For a given article, collect all of the Wikipedia articles that it links to (same wiki; namespace 0). I currently use the pagelinks table, but links could be extracted directly from the article's wikitext or parsed HTML for greater control.
  • Resolve any redirects.
  • Map these outlinks (in the form of page IDs) to their associated Wikidata items (QIDs).
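The steps above can be sketched in a few lines. This is a minimal illustration using hypothetical in-memory lookup tables for the redirect and page-to-QID mappings; in practice these mappings come from the MediaWiki pagelinks, redirect, and page_props tables (or equivalent API calls):

```python
def article_to_qid_bag(outlink_ids, redirect_map, page_to_qid):
    """Represent an article as a bag of Wikidata QIDs.

    outlink_ids: page IDs the article links to (same wiki, namespace 0)
    redirect_map: page ID of a redirect -> page ID of its target
    page_to_qid: page ID -> Wikidata item ID (e.g., "Q123")
    """
    bag = []
    for page_id in outlink_ids:
        # Resolve redirects (single hop shown here for simplicity).
        target = redirect_map.get(page_id, page_id)
        # Map the target page to its Wikidata item; skip pages without one.
        qid = page_to_qid.get(target)
        if qid is not None:
            bag.append(qid)
    return bag
```

For example, if page 2 redirects to page 4, the bag for an article linking to pages 1, 2, and 3 would contain the QIDs of pages 1 and 4 (page 3 dropped if it has no Wikidata item).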

Using en:Langdon House as an example, this article (as of May 2020) would be represented as (example API call):

Note that most of these links actually come from this navigation template, which illustrates the impact of the link source: the pagelinks table (all links retrieved) vs. wikitext (no transcluded links retrieved) vs. HTML (a choice can be made between standard and transcluded links).

A machine learning model can then be trained that takes as input a bag of words (where the words are QIDs) and outputs topic predictions. Because this model uses QIDs as its feature space, it can make predictions for any article from any Wikipedia language edition, so long as that article's outlinks are first mapped to Wikidata items.
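As a toy stand-in for the trained production classifier, the sketch below scores each topic by how often the article's QIDs co-occurred with that topic in (hypothetical) training data. This is only meant to make the bag-of-QIDs input/output shape concrete, not to reflect the actual model architecture:

```python
from collections import Counter, defaultdict


def train_topic_scores(labeled_articles):
    """labeled_articles: list of (bag_of_qids, set_of_topics) pairs.

    Returns topic -> Counter of QID counts: a crude linear scoring model
    over the QID vocabulary.
    """
    scores = defaultdict(Counter)
    for bag, topics in labeled_articles:
        for topic in topics:
            scores[topic].update(bag)
    return scores


def predict_topics(bag, scores, threshold=1):
    """Score each topic by summed QID counts; keep topics at/above threshold."""
    preds = {}
    for topic, counter in scores.items():
        score = sum(counter[qid] for qid in bag)
        if score >= threshold:
            preds[topic] = score
    return preds
```

Because the vocabulary consists of QIDs rather than natural-language tokens, the same trained scores can be applied to an article from any language edition once its outlinks have been mapped to Wikidata items.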

Wikidata properties and categories


While the original model was built purely on link-based predictions, we have slowly been incorporating complementary approaches to expand coverage and performance and to build on the work Wikimedians do to annotate content beyond just adding links. The country model, which expands the original set of predicted geographic regions (e.g., Western Europe) to individual countries (e.g., France), builds on the fact that Wikimedians identify relevant countries quite frequently in annotations, so countries often do not need to be inferred from context but can be extracted directly. For example, the model uses categories like Flora of France, whose corresponding Wikidata item identifies France as its main topic. It can also use Wikidata properties like endemic to (P183) or country of citizenship (P27).
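The direct-extraction idea can be sketched as follows, assuming a simplified representation of an item's Wikidata claims as a property-to-values dict (real claims, e.g. from the wbgetclaims API, are more deeply nested):

```python
# Properties whose values can directly identify a country
# (an illustrative subset, not the model's full property list).
COUNTRY_PROPERTIES = ["P183", "P27"]  # endemic to, country of citizenship


def extract_countries(claims, country_qids):
    """claims: property ID -> list of value QIDs (simplified claim format).
    country_qids: set of QIDs known to be countries (e.g., Q142 = France).

    Returns the set of country QIDs stated directly on the item, with no
    inference from context needed.
    """
    found = set()
    for prop in COUNTRY_PROPERTIES:
        for value in claims.get(prop, []):
            if value in country_qids:
                found.add(value)
    return found
```

For an item with country of citizenship (P27) set to Q142, this would return France directly, whereas unrelated claims (e.g., instance of) are ignored.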

Subpages
