
Research:Building a Wikidata Content Gap Index

From Meta, a Wikimedia project coordination wiki
Created: 09:32, 29 July 2024 (UTC)
Contact
Collaborators: Albert Merono Penuela, Elena Simperl
Duration: 2024-February – 2025-July
Wikidata, Content gaps, Quality

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


In 2017, the Wikimedia Foundation announced a strategic goal to support “knowledge and communities that have been left out by structures of power and privilege”. To date, much of the work in this area has focused on Wikipedia. In this project, we aim to develop a formal framework of content gaps in Wikidata. We will identify existing metrics from current research and develop novel metrics and methods to further the state of the art. We will then evaluate these metrics to determine which measure gaps most accurately and which best support the development of a content gap index.

Context and Motivation


Previous research has shown that Wikidata is generally of high and improving quality, but that its effectiveness is hampered by missing data, among other factors. Such gaps have also been recognised by the Wikimedia Foundation in the knowledge gap white paper[1], which outlines a taxonomy of knowledge gaps and calls for research that identifies and bridges them.

To recognise and address these gaps, the Wikimedia Foundation has developed an index of content metrics for Wikipedia[2], which maps types of content gaps to quantitative measurements for formal assessment. However, for many content gap classes the framework lacks metrics for formally measuring gaps. Additionally, the metrics it defines largely relate to Wikipedia articles and do not necessarily generalise or apply to Wikidata. For example, when it comes to completeness of the graph, the complexity of the Wikidata ontology model makes it difficult to identify and measure gaps; it is easier to look at specific case studies based on entity type, e.g., people or places. Similarly, for individual entities, measuring content as a number of characters or bytes does not work well with structured data. On the other hand, simply counting statements is too simplistic, as some statements are more descriptive than others.
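The point about statement counts can be illustrated with a minimal Python sketch. It is not a metric proposed by this project: the weights, the two example entities and the function names are hypothetical, and only the property IDs (P31, P21, P569, P106, P27, and the external identifiers P214, P213, P244, P227) correspond to real Wikidata properties.

```python
# Hypothetical weights for how descriptive each property is for a person.
# The property IDs are real Wikidata properties; the weights are illustrative only.
PROPERTY_WEIGHTS = {
    "P31": 1.0,   # instance of
    "P21": 1.0,   # sex or gender
    "P569": 1.0,  # date of birth
    "P106": 2.0,  # occupation
    "P27": 1.0,   # country of citizenship
}


def statement_count(statements):
    """Naive metric: the number of (property, value) statements on an entity."""
    return len(statements)


def weighted_coverage(statements, weights=PROPERTY_WEIGHTS):
    """Share of the total weight covered by the properties present on the entity."""
    present = {prop for prop, _value in statements}
    covered = sum(w for prop, w in weights.items() if prop in present)
    return covered / sum(weights.values())


# Toy entity A: mostly external identifiers (values are placeholders).
entity_a = [
    ("P31", "Q5"),
    ("P214", "placeholder"),  # VIAF ID
    ("P213", "placeholder"),  # ISNI
    ("P244", "placeholder"),  # Library of Congress authority ID
    ("P227", "placeholder"),  # GND ID
]

# Toy entity B: fewer statements, but more descriptive ones.
entity_b = [
    ("P31", "Q5"),
    ("P21", "placeholder"),
    ("P569", "placeholder"),
    ("P106", "placeholder"),
]

print(statement_count(entity_a), round(weighted_coverage(entity_a), 3))
print(statement_count(entity_b), round(weighted_coverage(entity_b), 3))
```

In this toy case entity A has more statements than entity B but covers far less of the weighted property set, which is the kind of divergence a plain statement count would hide. How such weights should be chosen is left open here.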

We thus want to develop a content gap index for Wikidata that aligns with the knowledge gap taxonomy and is operationalised similarly to the Wikipedia one, but takes into account the specificities of Wikidata as a structured knowledge base. We are conscious that prior research has already addressed some of these challenges. Relevant research on Wikidata has separately looked at its gender diversity representations[3] in terms of gender identity distributions; compared Wikidata to other knowledge graphs on a range of graph quality metrics, including completeness and accuracy[4]; analysed race and citizenship bias in the knowledge base[5]; and studied Wikidata content gaps as the imbalance between Wikidata contributions and Wikipedia pageviews[6], the latter representing information needs. Our goal, then, is to collect the existing metrics that have been developed for classifying Wikidata content gaps, and also to conceptualise and develop novel metrics and methods for identifying these gaps. We will further compare the accuracy and effectiveness of these metrics to identify which best complement the development of a content gap index. We then aim to build on this work by exploring whether content gaps in Wikidata (and Wikipedia) propagate to, or are reflected within, large language models that draw on Wikipedia, Wikidata and similar knowledge commons resources.
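As a rough illustration of the "imbalance between contributions and pageviews" idea mentioned above, the following Python sketch compares each topic's share of edits with its share of pageviews. It is an assumed, simplified formulation and not necessarily the one used in the cited study[6]; the function names, topic labels and numbers are made up for the example.

```python
def shares(counts):
    """Convert raw counts per topic into fractions of the total."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def attention_gap(edits, pageviews):
    """Pageview share minus edit share per topic.

    A positive value means readers pay more attention to the topic than
    contributors do (a possible content gap); a negative value means the
    opposite. Topics with no edits are treated as having an edit share of 0.
    """
    edit_share = shares(edits)
    view_share = shares(pageviews)
    return {t: view_share[t] - edit_share.get(t, 0.0) for t in view_share}


# Toy counts, purely illustrative (not real Wikidata or Wikipedia figures).
toy_edits = {"topic_a": 60, "topic_b": 30, "topic_c": 10}
toy_pageviews = {"topic_a": 200, "topic_b": 300, "topic_c": 500}

gaps = attention_gap(toy_edits, toy_pageviews)
for topic, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{topic}: {gap:+.2f}")
```

Using shares rather than raw counts keeps the two quantities comparable even though edits and pageviews are on very different scales; other normalisations would be equally plausible.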

Building on Wikipedia Content Gaps


There is a wide body of work on content gaps within Wikipedia, and a full review of all existing metrics is outside the scope of this research. However, we aim to identify some of the most common metrics and explore, as a starting point, whether they would apply to Wikidata. We believe further analysis would be an important area for future work.

We recognise that there has been prior work within the Wikimedia Foundation (and beyond) to identify and classify types of content gap within Wikipedia, for example Jonathan Morgan’s work[7] and the knowledge gap index[8]. We will use this existing framework and the literature it identifies as the basis for our own analysis, which we will update to take into account subsequently published literature.

Methods

  1. Analyse existing metrics used to identify content gaps in Wikipedia and adapt them to Wikidata.
  2. Analyse metrics used to identify content gaps in Wikidata through a systematic literature review.
  3. Create a unified framework of content gaps.

Research Questions

  • What are the existing approaches to identify and measure content gaps in Wikipedia?
  • To what extent do these Wikipedia content gap metrics apply to Wikidata?
  • What are the existing metrics used to measure content gaps in Wikidata?
  • How do different metrics compare against each other, and are certain metrics more reliable and suitable for developing an index than others?
  • How can we measure such gaps within LLMs?
  • Are the content gaps in Wikidata (and Wikipedia) reflected in LLMs?
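
One possible way to approach the question of how metrics compare, offered only as a sketch and not as the project's chosen method, is to check whether two metrics rank the same entities in a similar order. The Python example below computes Spearman's rank correlation using the simple closed-form formula, which assumes there are no tied scores; the metric names and scores are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = largest value). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation for two equal-length, tie-free score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores from three metrics over the same four entities.
metric_a = [0.9, 0.5, 0.7, 0.1]
metric_b = [10, 4, 7, 1]          # same ordering as metric_a
metric_c = [0.1, 0.5, 0.7, 0.9]   # largely reversed ordering

print(spearman(metric_a, metric_b))
print(spearman(metric_a, metric_c))
```

A value near 1 suggests the two metrics largely agree on which entities are best covered, while a value near -1 suggests they disagree. Agreement alone does not show that either metric is accurate, so this would complement rather than replace evaluation with the community.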

Goals and Outputs


Our ultimate goal is to develop a knowledge index of metrics for identifying and measuring content gaps within Wikidata. This knowledge index will be shared with the wider Wikidata and Wikimedia communities, and will also be presented in a scientific paper that collects, compares and evaluates different metrics and methods for quantifying content gaps. In addition, we will produce a paper consolidating work conducted in this space to date by gathering and categorising metrics used to evaluate Wikipedia content gaps, as well as existing gaps identified within Wikidata. We aim to conduct surveys and interviews with the Wikidata community to aid in this evaluation process.

Ultimately, after identifying and measuring these content and knowledge gaps within Wikipedia, we hope to explore whether these gaps propagate to, or otherwise occur within, large language models. As discussed in the 2024 Bellagio Research Agenda, knowledge commons such as Wikipedia are crucial resources for the development and training of LLMs. While knowledge commons communities may currently have limited direct influence on LLM development, we aim to explore whether the gaps and biases present in the content they produce have an influence beyond the context for which that content was intended.

Timeline


Timeline to be confirmed.

Policy, Ethics and Human Subjects Research


The initial phase of our work focuses on the literature review. Once this is complete, we plan to reach out to the Wikidata community to gather their views on the metrics. Before doing so, we will seek institutional ethics approval, and we will update this page with further details of the process to be used and the approvals received when available.

Results


To follow when available.

References

  1. Leila Zia, Isaac Johnson, Bahodir Mansurov, Jonathan Morgan, Miriam Redi, Diego Saez-Trumper, and Dario Taraborelli. 2019. Knowledge Gaps – Wikimedia Research 2030. https://doi.org/10.6084/m9.figshare.7698245
  2. https://meta.wikimedia.org/wiki/Research:Knowledge_Gaps_Index/Measurement/Content
  3. https://wigedi.com/
  4. M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. 2018. Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web 9(1), 77-129.
  5. Zaina Shaik, Filip Ilievski, and Fred Morstatter. 2021. Analyzing race and citizenship bias in Wikidata. In 2021 IEEE 18th International Conference on Mobile Ad Hoc and Smart Systems (MASS). IEEE.
  6. David Abián, Albert Meroño-Peñuela, and Elena Simperl. 2022. An analysis of content gaps versus user needs in the Wikidata knowledge graph. In International Semantic Web Conference. Springer International Publishing, Cham.
  7. https://meta.wikimedia.org/wiki/Research:Content_gaps_on_Wikipedia
  8. https://meta.wikimedia.org/wiki/Research:Knowledge_Gaps_Index/Measurement