NLP for Wikipedia (ACL 2025)/Track 2 Guidance
For constructing datasets related to policies on Wikipedia, there are a few existing resources that might be of help:
- Identifying policy violations:
  - Via content reliability templates: editors flag issues in articles via a range of templates (depending on the particular policy being violated). Wong et al.[1] provide a standard approach to generating balanced datasets from these templates, but there are likely other ways to enrich or expand such datasets.
  - Via edit summaries: some editors link to policies in their edit summaries when reverting problematic edits. These edit summaries can be easily extracted from the MediaWiki history dumps (see this example code); a minimal filtering sketch is also included below, after this list.
- Working with Wikimedia data:
  - mwedittypes is a Python library for generating structured diffs of what has changed between edits on Wikipedia. It is designed for wikitext, the mark-up language that Wikipedia articles are written in (a short usage sketch is included below, after this list).
  - mwparserfromhtml is a Python library for working with the rendered HTML of Wikipedia articles. This HTML can be a richer source of content than the wikitext (more details).
- Examples of generally valuable contributions:
  - Work that extends existing datasets into more languages.
  - Work that explores how to build interpretable models that can go beyond detecting violations and help editors to understand and correct issues.
  - Work that explores the nuances of these policies – e.g., does not just treat violation detection as a binary task but explores the variety of ways in which a policy might be violated. The diversity of templates used by editors gives a good idea of the possibilities here (examples).
  - Work that explores the nuances of evaluating LLMs on detecting and correcting policy violations – e.g., data contamination concerns[2] and curating high-quality datasets[3].
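As a minimal sketch of the edit-summary approach above: the MediaWiki history dumps can be read as tab-separated files that include each revision's edit summary. The snippet below assumes a plain TSV export with a header row; the file name, the summary column name (`event_comment`), and the policy shortcuts being matched are illustrative assumptions and should be adjusted to the actual dump schema and target wiki.

```python
import csv
import re

# Hypothetical input: one row per revision from the MediaWiki history dump,
# already decompressed to TSV with a header row. The column name below is an
# assumption -- check the dump's documented schema before running.
DUMP_FILE = "enwiki.revisions.tsv"
SUMMARY_COLUMN = "event_comment"

# Illustrative English Wikipedia policy shortcuts; extend per policy/language.
POLICY_PATTERN = re.compile(r"WP:(NPOV|V|VERIFY|NOR|OR)\b", re.IGNORECASE)

def policy_reverts(path):
    """Yield (policy shortcut, edit summary) pairs for summaries citing a policy."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            summary = row.get(SUMMARY_COLUMN) or ""
            match = POLICY_PATTERN.search(summary)
            if match:
                yield match.group(1).upper(), summary

if __name__ == "__main__":
    for shortcut, summary in policy_reverts(DUMP_FILE):
        print(shortcut, summary[:80])
```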
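For the structured-diff tooling, a sketch of how mwedittypes might be used to characterize a single edit is below. The class and method names (SimpleEditTypes / StructuredEditTypes with a get_diff() method) follow our reading of the library's README and should be checked against the installed version; the before/after wikitext strings are made up for illustration.

```python
# pip install mwedittypes
# NOTE: class/method names below are assumed from the library's README and
# may differ across versions -- verify against your installed release.
from mwedittypes import SimpleEditTypes, StructuredEditTypes

# Toy before/after wikitext for a single revision (illustrative only).
prev_wikitext = "The bridge is a [[suspension bridge]] in the city.{{citation needed}}"
curr_wikitext = "The bridge is a [[suspension bridge]] in the city.<ref>Smith 2020.</ref>"

# High-level summary of the node types touched by the edit
# (e.g. references added, templates removed).
summarizer = SimpleEditTypes(prev_wikitext, curr_wikitext, lang="en")
print(summarizer.get_diff())

# Full structured diff with the specific nodes inserted/removed/changed.
differ = StructuredEditTypes(prev_wikitext, curr_wikitext, lang="en")
print(differ.get_diff())
```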
If you have questions about datasets, tooling, or any other aspects, feel free to email the organizers (nlp4wikipedia[@]googlegroups.com) or book an office hour slot with co-organizer Isaac Johnson.
Core Content Policies
Below we provide a summary and resources related to Wikipedia's three core content policies. These are high-priority areas for developing datasets and benchmarks.[4]
Neutral Point-of-View
Neutral Point-of-View (NPOV) can be viewed as two closely related aspects:
- Biased language: this is a question of how a subject is presented – i.e. word choice. This aspect is more straightforward and can generally be assessed in isolation – i.e. by looking at individual sentences within an article (a toy sentence-level check is sketched after this list). That said, "unbiased" language does not necessarily mean neutral sentiment but rather means that the sentiment is representative of reliable sources.
- Biased coverage: this is a question of whether the appropriate balance is being struck between the different viewpoints present in reliable sources. This is a much trickier and less-explored space. Assessing this question of due weight likely also requires comparing the existing article to the content of the sources that it is based on.
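As a toy illustration of sentence-level biased-language detection, the sketch below flags sentences containing a small, hand-picked set of "words to watch" (peacock and weasel terms). The word list is an assumption for illustration only; real systems, such as the dataset-driven approach of Pryzant et al. below, learn these signals from NPOV-related edits rather than keyword lists.

```python
import re

# Tiny illustrative lexicon inspired by Wikipedia's "words to watch" guidance;
# a real lexicon (or a trained model) would be far more comprehensive.
WORDS_TO_WATCH = [
    "legendary", "world-class", "arguably", "clearly", "obviously",
    "some people say", "critics claim", "it is widely believed",
]
PATTERN = re.compile("|".join(re.escape(w) for w in WORDS_TO_WATCH), re.IGNORECASE)

def flag_biased_language(sentence: str) -> list[str]:
    """Return the watch-list terms that appear in a sentence, if any."""
    return [m.group(0).lower() for m in PATTERN.finditer(sentence)]

sentences = [
    "The band released its third album in 2004.",
    "The band is arguably the most legendary act of its generation.",
]
for s in sentences:
    hits = flag_biased_language(s)
    if hits:
        print(f"Possible NPOV issue ({', '.join(hits)}): {s}")
```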
NPOV Resources
- Evaluation of LLMs on detecting and correcting NPOV issues: Ashkinaze, Joshua, et al. "Seeing like an AI: How LLMs apply (and misapply) Wikipedia neutrality norms."
- Dataset of NPOV-related edits on English Wikipedia: Pryzant, Reid, et al. "Automatically neutralizing subjective bias in text." AAAI 2020.
- Dataset that includes some POV-related templates on English Wikipedia: Wong, KayYen, et al. "Wiki-reliability: A large scale dataset for content reliability on Wikipedia." SIGIR 2021.
- Neutrality-related templates on English Wikipedia
Verifiability
Verifiability means providing an appropriate citation for content on Wikipedia that is not common knowledge.
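As a minimal sketch of how Verifiability data can be mined from wikitext, the snippet below splits article text into rough sentences and labels each according to whether it contains a `<ref>` tag or a `{{citation needed}}`-style template. The sentence splitting and template matching are deliberately crude assumptions; Redi et al. (listed below) describe a much more careful dataset construction.

```python
import re

# Crude sentence splitter: treat end-of-sentence punctuation, closing <ref>
# tags, and closing templates followed by whitespace as boundaries. A real
# pipeline should use a proper wikitext parser and sentence tokenizer.
SENT_SPLIT = re.compile(r"(?<=[.!?])\s+|(?<=</ref>)\s+|(?<=\}\})\s+")
REF_TAG = re.compile(r"<ref[ >]", re.IGNORECASE)
CN_TEMPLATE = re.compile(r"\{\{\s*(citation needed|cn|fact)\s*[|}]", re.IGNORECASE)

def label_sentences(wikitext: str):
    """Yield (sentence, label) pairs: 'cited', 'citation-needed', or 'uncited'."""
    for sentence in SENT_SPLIT.split(wikitext):
        if not sentence.strip():
            continue
        if REF_TAG.search(sentence):
            label = "cited"
        elif CN_TEMPLATE.search(sentence):
            label = "citation-needed"
        else:
            label = "uncited"
        yield sentence.strip(), label

example = (
    "The bridge opened in 1932.<ref>Smith 2020.</ref> "
    "It is the longest bridge in the region.{{citation needed}} "
    "It carries four lanes of traffic."
)
for sentence, label in label_sentences(example):
    print(label, "->", sentence)
```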
Verifiability Resources
- Example task for generating an aligned cross-lingual dataset of references from Wikipedia: task T374554
- Verifiability-related templates on English Wikipedia. See also inline templates category and maintenance templates category.
- Taxonomy, dataset, and baselines for detecting what sentences need citations: Redi, Miriam, et al. "Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability." WWW '19.
- Metrics for evaluating the appropriateness of citations in generated text: Gao, Tianyu, et al. "Enabling Large Language Models to Generate Text with Citations." EMNLP '23.
  - See Shao et al. for an application of this to Wikipedia.
- Recommending citations for statements on Wikipedia: Petroni, Fabio, et al. "Improving wikipedia verifiability with AI." Nature Machine Intelligence 2023.
No Original Research
No Original Research (NOR) means that all content must be verifiable by a reliable source (even if no explicit citation is required per the Verifiability policy). Put simply, this can be thought of as "no hallucinations".
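One common way to operationalize this as fact verification (cf. FEVER, listed below) is natural language inference: given an evidence passage from a reliable source and a claim from an article, predict whether the evidence entails, contradicts, or is neutral toward the claim. The sketch below uses an off-the-shelf MNLI model as a stand-in; the model choice and the toy evidence/claim strings are assumptions for illustration, not part of any of the cited benchmarks.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative model choice; any NLI model trained on MNLI/FEVER-style data works.
MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_verify(evidence: str, claim: str) -> str:
    """Label a claim as entailed/neutral/contradicted by an evidence passage."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label_id = int(logits.argmax(dim=-1))
    return model.config.id2label[label_id]  # e.g. CONTRADICTION / NEUTRAL / ENTAILMENT

# Toy example (made-up evidence and claim).
print(nli_verify(
    evidence="The article's cited source states that the bridge opened in 1932.",
    claim="The bridge opened in 1956.",
))
```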
NOR Resources
- Benchmark for evaluating long-form factuality about real-world entities: Zhao, Wenting, et al. "WildHallucinations: Evaluating long-form factuality in LLMs with real-world entity queries." 2024.
- Fact verification based on Wikipedia data: Thorne, James, et al. "FEVER: a Large-scale Dataset for Fact Extraction and VERification." NAACL '18.
- Original Research templates can also be found under the Verifiability Resources.
Community Wishlist
Wikimedians also have the opportunity to surface feature requests with the Wikimedia Foundation through what's known as the Community Wishlist. This can also be a good source of inspiration for datasets that could help support the needs of the Wikimedia community. The list is quite extensive, so we have identified a few items that might be a good starting point:
- WikiQuiz
- Quickly Add Infobox
- Search Wikipedia with image or sketch search
- Build a Wikipedia search that addresses common queries asked by new volunteers
- Translation quality
- Unwieldy discussions
- Make tiny edits (typos etc) easier
- Fixing the category graph
References
1. Wong, KayYen; Redi, Miriam; Saez-Trumper, Diego (2021-07-11). "Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21 (New York, NY, USA: Association for Computing Machinery): 2437–2442. ISBN 978-1-4503-8037-9. doi:10.1145/3404835.3463253.
2. Kaffee, Lucie-Aimée; Johnson, Isaac (2024-12-15). "Evaluations Using Wikipedia without Data Contamination: From Trusting Articles to Trusting Edit Processes" (PDF). EvalEval Workshop (NeurIPS '24).
3. Kuo, Tzu-Sheng; Halfaker, Aaron Lee; Cheng, Zirui; Kim, Jiwoo; Wu, Meng-Hsin; Wu, Tongshuang; Holstein, Kenneth; Zhu, Haiyi (2024-05-11). "Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia". Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI '24 (New York, NY, USA: Association for Computing Machinery): 1–24. ISBN 979-8-4007-0330-0. doi:10.1145/3613904.3642278.
4. Johnson, Isaac; Kaffee, Lucie-Aimée; Redi, Miriam (2024-11-16). "Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing". In Lucie-Aimée, Lucie; Fan, Angela; Gwadabe, Tajuddeen; Johnson, Isaac; Petroni, Fabio; van Strien, Daniel (eds.), Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia (Miami, Florida, USA: Association for Computational Linguistics): 91–101. doi:10.18653/v1/2024.wikinlp-1.14.