NLP for Wikipedia (ACL 2025)/Track 2 Guidance
For constructing datasets related to policies on Wikipedia, there are a few existing resources that might be of help:
- Identifying policy violations:
  - Via content reliability templates: editors flag issues in articles via a range of templates (depending on the particular policy being violated). Wong et al.[1] provide a standard approach to generating balanced datasets from these templates, but there are likely other ways to enrich or expand such datasets.
  - Via edit summaries: some editors link to policies in their edit summaries when reverting problematic edits. These edit summaries can be easily extracted from the MediaWiki history dumps (see this example code); a minimal filtering sketch is also included below, after this list.
- Working with Wikimedia data:
  - mwedittypes is a Python library for generating structured diffs of what has changed between edits on Wikipedia. It is designed for wikitext, the mark-up language that Wikipedia articles are written in (a short usage sketch is included below, after this list).
  - mwparserfromhtml is a Python library for working with the rendered HTML of Wikipedia articles. This HTML can be a richer source of content than the wikitext (more details).
- Examples of generally valuable contributions:
  - Work that extends existing datasets into more languages.
  - Work that explores how to build interpretable models that can go beyond detecting violations and help editors to understand and correct issues.
  - Work that explores the nuances of these policies – e.g., does not just treat violation detection as a binary task but explores the variety of ways in which a policy might be violated. The diversity of templates used by editors gives a good idea of the possibilities here (examples).
  - Work that explores the nuances of evaluating LLMs on detecting and correcting policy violations – e.g., data contamination concerns[2] and curating high-quality datasets[3].
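As a minimal sketch of the edit-summary approach above: the MediaWiki history dumps can be read as tab-separated files that include each revision's edit summary. The snippet below assumes a plain TSV export with a header row; the file name, the summary column name (`event_comment`), and the policy shortcuts being matched are illustrative assumptions and should be adjusted to the actual dump schema and target wiki.

```python
import csv
import re

# Hypothetical input: one row per revision from the MediaWiki history dump,
# already decompressed to TSV with a header row. The column name below is an
# assumption -- check the dump's documented schema before running.
DUMP_FILE = "enwiki.revisions.tsv"
SUMMARY_COLUMN = "event_comment"

# Illustrative English Wikipedia policy shortcuts; extend per policy/language.
POLICY_PATTERN = re.compile(r"WP:(NPOV|V|VERIFY|NOR|OR)\b", re.IGNORECASE)

def policy_reverts(path):
    """Yield (policy shortcut, edit summary) pairs for summaries citing a policy."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            summary = row.get(SUMMARY_COLUMN) or ""
            match = POLICY_PATTERN.search(summary)
            if match:
                yield match.group(1).upper(), summary

if __name__ == "__main__":
    for shortcut, summary in policy_reverts(DUMP_FILE):
        print(shortcut, summary[:80])
```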
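For the structured-diff tooling, a sketch of how mwedittypes might be used to characterize a single edit is below. The class and method names (SimpleEditTypes / StructuredEditTypes with a get_diff() method) follow our reading of the library's README and should be checked against the installed version; the before/after wikitext strings are made up for illustration.

```python
# pip install mwedittypes
# NOTE: class/method names below are assumed from the library's README and
# may differ across versions -- verify against your installed release.
from mwedittypes import SimpleEditTypes, StructuredEditTypes

# Toy before/after wikitext for a single revision (illustrative only).
prev_wikitext = "The bridge is a [[suspension bridge]] in the city.{{citation needed}}"
curr_wikitext = "The bridge is a [[suspension bridge]] in the city.<ref>Smith 2020.</ref>"

# High-level summary of the node types touched by the edit
# (e.g. references added, templates removed).
summarizer = SimpleEditTypes(prev_wikitext, curr_wikitext, lang="en")
print(summarizer.get_diff())

# Full structured diff with the specific nodes inserted/removed/changed.
differ = StructuredEditTypes(prev_wikitext, curr_wikitext, lang="en")
print(differ.get_diff())
```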
If you have questions about datasets, tooling, or any other aspects, feel free to email the organizers (nlp4wikipedia[@]googlegroups.com) or book an office hour slot with co-organizer Isaac Johnson.
Core Content Policies
Below we provide a summary and resources related to Wikipedia's three core content policies. These are high-priority areas for developing datasets and benchmarks.[4]
Neutral Point-of-View
Neutral Point-of-View (NPOV) can be viewed as two closely related aspects:
- Biased language: this is a question of how a subject is presented – i.e. word choice. This aspect is more straightforward and can generally be assessed in isolation – i.e. by looking at individual sentences within an article (a toy sentence-level check is sketched after this list). That said, "unbiased" language does not necessarily mean neutral sentiment but rather means that the sentiment is representative of reliable sources.
- Biased coverage: this is a question of whether the appropriate balance is being struck between the different viewpoints present in reliable sources. This is a much trickier and less-explored space. Assessing this question of due weight likely also requires comparing the existing article to the content of the sources that it is based on.
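As a toy illustration of sentence-level biased-language detection, the sketch below flags sentences containing a small, hand-picked set of "words to watch" (peacock and weasel terms). The word list is an assumption for illustration only; real systems, such as the dataset-driven approach of Pryzant et al. below, learn these signals from NPOV-related edits rather than keyword lists.

```python
import re

# Tiny illustrative lexicon inspired by Wikipedia's "words to watch" guidance;
# a real lexicon (or a trained model) would be far more comprehensive.
WORDS_TO_WATCH = [
    "legendary", "world-class", "arguably", "clearly", "obviously",
    "some people say", "critics claim", "it is widely believed",
]
PATTERN = re.compile("|".join(re.escape(w) for w in WORDS_TO_WATCH), re.IGNORECASE)

def flag_biased_language(sentence: str) -> list[str]:
    """Return the watch-list terms that appear in a sentence, if any."""
    return [m.group(0).lower() for m in PATTERN.finditer(sentence)]

sentences = [
    "The band released its third album in 2004.",
    "The band is arguably the most legendary act of its generation.",
]
for s in sentences:
    hits = flag_biased_language(s)
    if hits:
        print(f"Possible NPOV issue ({', '.join(hits)}): {s}")
```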
NPOV Resources
- Evaluation of LLMs on detecting and correcting NPOV issues: Ashkinaze, Joshua, et al. "Seeing like an AI: How LLMs apply (and misapply) Wikipedia neutrality norms."
- Dataset of NPOV-related edits on English Wikipedia: Pryzant, Reid, et al. "Automatically neutralizing subjective bias in text." AAAI 2020.
- Dataset that includes some POV-related templates on English Wikipedia: Wong, KayYen, et al. "Wiki-reliability: A large scale dataset for content reliability on Wikipedia." SIGIR 2021.
- Neutrality-related templates on English Wikipedia
Verifiability
Verifiability means providing an appropriate citation for content on Wikipedia that is not common knowledge.
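As a minimal sketch of how Verifiability data can be mined from wikitext, the snippet below splits article text into rough sentences and labels each according to whether it contains a `<ref>` tag or a `{{citation needed}}`-style template. The sentence splitting and template matching are deliberately crude assumptions; Redi et al. (listed below) describe a much more careful dataset construction.

```python
import re

# Crude sentence splitter: treat end-of-sentence punctuation, closing <ref>
# tags, and closing templates followed by whitespace as boundaries. A real
# pipeline should use a proper wikitext parser and sentence tokenizer.
SENT_SPLIT = re.compile(r"(?<=[.!?])\s+|(?<=</ref>)\s+|(?<=\}\})\s+")
REF_TAG = re.compile(r"<ref[ >]", re.IGNORECASE)
CN_TEMPLATE = re.compile(r"\{\{\s*(citation needed|cn|fact)\s*[|}]", re.IGNORECASE)

def label_sentences(wikitext: str):
    """Yield (sentence, label) pairs: 'cited', 'citation-needed', or 'uncited'."""
    for sentence in SENT_SPLIT.split(wikitext):
        if not sentence.strip():
            continue
        if REF_TAG.search(sentence):
            label = "cited"
        elif CN_TEMPLATE.search(sentence):
            label = "citation-needed"
        else:
            label = "uncited"
        yield sentence.strip(), label

example = (
    "The bridge opened in 1932.<ref>Smith 2020.</ref> "
    "It is the longest bridge in the region.{{citation needed}} "
    "It carries four lanes of traffic."
)
for sentence, label in label_sentences(example):
    print(label, "->", sentence)
```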
Verifiability Resources
- Example task for generating an aligned cross-lingual dataset of references from Wikipedia: task T374554
- Verifiability-related templates on English Wikipedia. See also inline templates category and maintenance templates category.
- Taxonomy, dataset, and baselines for detecting what sentences need citations: Redi, Miriam, et al. "Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability." WWW '19.
- Metrics for evaluating the appropriateness of citations in generated text: Gao, Tianyu, et al. "Enabling Large Language Models to Generate Text with Citations." EMNLP '23.
  - See Shao et al. for an application of this to Wikipedia.
- Recommending citations for statements on Wikipedia: Petroni, Fabio, et al. "Improving wikipedia verifiability with AI." Nature Machine Intelligence 2023.
No Original Research
No Original Research (NOR) means that all content must be verifiable by a reliable source (even if no explicit citation is required per the Verifiability policy). Put simply, this can be thought of as "no hallucinations".
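One common way to operationalize this as fact verification (cf. FEVER, listed below) is natural language inference: given an evidence passage from a reliable source and a claim from an article, predict whether the evidence entails, contradicts, or is neutral toward the claim. The sketch below uses an off-the-shelf MNLI model as a stand-in; the model choice and the toy evidence/claim strings are assumptions for illustration, not part of any of the cited benchmarks.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative model choice; any NLI model trained on MNLI/FEVER-style data works.
MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_verify(evidence: str, claim: str) -> str:
    """Label a claim as entailed/neutral/contradicted by an evidence passage."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label_id = int(logits.argmax(dim=-1))
    return model.config.id2label[label_id]  # e.g. CONTRADICTION / NEUTRAL / ENTAILMENT

# Toy example (made-up evidence and claim).
print(nli_verify(
    evidence="The article's cited source states that the bridge opened in 1932.",
    claim="The bridge opened in 1956.",
))
```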
NOR Resources
- Benchmark for evaluating long-form factuality about real-world entities: Zhao, Wenting, et al. "WildHallucinations: Evaluating long-form factuality in LLMs with real-world entity queries." 2024.
- Fact verification based on Wikipedia data: Thorne, James, et al. "FEVER: a Large-scale Dataset for Fact Extraction and VERification." NAACL '18.
- Original Research templates can also be found under the Verifiability Resources.
Community Wishlist
Wikimedians also have the opportunity to surface feature requests with the Wikimedia Foundation through what's known as the Community Wishlist. This can also be a good source of inspiration for datasets that could help support the needs of the Wikimedia community. The list is quite extensive, so we have identified a few items that might be a good starting point:
- WikiQuiz
- Quickly Add Infobox
- Search Wikipedia with image or sketch search
- Build a Wikipedia search that addresses common queries asked by new volunteers
- Translation quality
- Unwieldy discussions
- Make tiny edits (typos etc) easier
- Fixing the category graph
References
1. Wong, KayYen; Redi, Miriam; Saez-Trumper, Diego (2021-07-11). "Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21 (New York, NY, USA: Association for Computing Machinery): 2437–2442. ISBN 978-1-4503-8037-9. doi:10.1145/3404835.3463253.
2. Kaffee, Lucie-Aimée; Johnson, Isaac (2024-12-15). "Evaluations Using Wikipedia without Data Contamination: From Trusting Articles to Trusting Edit Processes" (PDF). EvalEval Workshop (NeurIPS '24).
3. Kuo, Tzu-Sheng; Halfaker, Aaron Lee; Cheng, Zirui; Kim, Jiwoo; Wu, Meng-Hsin; Wu, Tongshuang; Holstein, Kenneth; Zhu, Haiyi (2024-05-11). "Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia". Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI '24 (New York, NY, USA: Association for Computing Machinery): 1–24. ISBN 979-8-4007-0330-0. doi:10.1145/3613904.3642278.
4. Johnson, Isaac; Kaffee, Lucie-Aimée; Redi, Miriam (2024-11-16). "Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing". In Lucie-Aimée, Lucie; Fan, Angela; Gwadabe, Tajuddeen; Johnson, Isaac; Petroni, Fabio; van Strien, Daniel (eds.), Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia (Miami, Florida, USA: Association for Computational Linguistics): 91–101. doi:10.18653/v1/2024.wikinlp-1.14.