
Research:WQT (Wikidata Quality Toolkit): Assuring the world’s data commons

Created: 16:51, 22 March 2024 (UTC)
Contact: Elena Simperl, Albert Meroño Peñuela, Odinaldo Rodrigues
Duration: January 2024 – December 2024
Keywords: Quality, Recommendations, References
Status: Active


Wikidata is one of the world’s most precious data assets: launched in 2012 by the Wikimedia Foundation (the non-profit that runs Wikipedia), it contains machine-readable factual information about more than 100 million topics. It is used extensively in applications ranging from web search engines, virtual assistants (e.g. Siri, Alexa), and fact-checkers to the 800+ projects in the Wikimedia ecosystem, including the Wikipedias in multiple languages. As a curated source of structured, machine-readable information, it is also a valuable training resource for numerous AI applications, including large language models (LLMs) and other foundational AI models.

Incomplete, erroneous, biased, or otherwise inappropriate data has real consequences. Wikidata’s data feeds many Wikipedia articles, which are visited around 24 billion times a month. Poor data is especially damaging when used to train AI systems, which tend to reinforce existing biases and stereotypes. This situation is not going away anytime soon: Wikidata grows faster than its community; at the same time, LLMs like ChatGPT are expected to make things worse, as they could unleash a huge tide of automatically generated content that requires additional human scrutiny. Furthermore, existing tools that help editors with these tasks are limited in scope or require specialist skills. The scale of the challenge is substantial: Wikidata receives around 21 million edits a month, made by ~24k active editors supported by ~330 bots.

We will build WQT (the Wikidata Quality Toolkit), which will support a diverse set of editors in curating and validating Wikidata records at scale. The toolkit will leverage research findings and conceptual prototypes drawing on AI, data management, and social computing, which have been designed and evaluated with the Wikidata community in response to their data-assurance needs. The topic is timely, not least because of the risks of misinformation and disinformation posed by LLMs. The focus will be on:

  • revisiting existing assumptions and requirements for data assurance in the age of LLMs;
  • refactoring, improving, and integrating existing code, which originates from a series of research grants and PhD projects;
  • evaluating the toolkit extensively with the Wikidata community; and
  • developing a robust research software sustainability strategy.

The toolkit will be open source, and all data, software, and guidance will be available to the community, as well as to researchers and AI developers. Beyond the direct impact on a community of 24k editors, there are substantial economic and societal implications from the downstream AI applications that use Wikidata (e.g. search engines): according to UK Government sources, the AI sector employs ~50k people in the UK and added £3.7 billion to the UK economy in 2022.
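
For illustration only, the sketch below shows the kind of low-level data-assurance check that the toolkit aims to make routine for editors: counting how many statements on a Wikidata item lack references, via the public Wikidata Query Service SPARQL endpoint. The endpoint and the wd:/p:/prov: namespaces are Wikidata’s standard ones; the script itself and the function name are hypothetical examples written for this page, not components of WQT.

# Hypothetical sketch (not part of WQT): flag unreferenced statements on a
# Wikidata item using the public Wikidata Query Service.
import requests

WDQS = "https://query.wikidata.org/sparql"

def unreferenced_statements(qid):
    """Return (unreferenced, total) statement counts for one item."""
    query = f"""
    SELECT ?statement (COUNT(?ref) AS ?refs) WHERE {{
      wd:{qid} ?p ?statement .
      # keep only statement nodes (the p: namespace), not direct wdt: triples
      FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/P"))
      OPTIONAL {{ ?statement prov:wasDerivedFrom ?ref . }}
    }}
    GROUP BY ?statement
    """
    resp = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "WQT-example-sketch/0.1 (illustrative only)"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    total = len(rows)
    missing = sum(1 for r in rows if int(r["refs"]["value"]) == 0)
    return missing, total

if __name__ == "__main__":
    missing, total = unreferenced_statements("Q42")  # Q42 = Douglas Adams
    print(f"{missing} of {total} statements have no reference")

In the toolkit itself, checks of this kind would sit behind the Gadget and dashboard interfaces described under Methods rather than being run as standalone scripts.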

Methods

Timeline

  • In Phase 1, we will summarise the key outcomes of our research and revisit some of the basic assumptions and requirements in light of new developments, particularly LLMs. We will work with a designer to develop graphical representations (UI Gadget components, a dashboard), and we will interview a diverse set of Wikidata editors to gather their requirements (at least 10 people, covering a diversity of tenure in Wikidata, language and cultural background, frequency of contributions, and level of technology and AI literacy). This will result in the first version of the toolkit, hosted on Wikimedia’s Toolforge (WF).
  • In Phase 2, we will evaluate the capacity of the first version to (a) replicate the results of the original research the toolkit is based on and (b) fulfil the Wikidata editor requirements collected in Phase 1. This will include technical evaluations using the same metrics as in the published papers, as well as user studies with at least 50 participants.
  • In Phase 3, we will present the toolkit at events linked to our impact partners listed above. From there, we will work through all plan objectives, focusing on elements that can become part of Wikidata editor tools (i.e. Gadgets) and on building a community of volunteer developers who can maintain peripheral libraries and components. This will be complemented by several meetups and events (see the Progress page on our website: https://wikiqt.github.io/progress/).
  • In Phase 4, the two newly hired staff will select the parts of the toolkit to be officially deployed as Gadgets and will assess the compliance of both the toolkit and the community built around it with research software engineering (RSE) standards (SRSE). Simultaneously, we will address long-term sustainability through 1-2 activities run by the Software Sustainability Institute. In parallel, we will seek new partnerships and further collaborations with WF, SRSE, and beyond to extend WQT in Toolforge and its interfaces (e.g. the Dashboard) with further assessment and repair tasks. We will also seek additional funding from the Wikimedia Research Fund and submit grants to public and philanthropic funding schemes, e.g. the Allen Institute for AI.

Policy, Ethics and Human Subjects Research

We are committed to ensuring the safety, welfare, and dignity of all human participants in our research, treating them with equality and fairness. Additionally, we adhere to the policies and ethical guidelines of the Wikimedia Foundation.

Results

The study is currently in progress.

Resources

The study is in progress. Please check our website: https://wikiqt.github.io/