Talk:Wikimedia Enterprise
Is there a discussion group for support?
Hi, I'm new to the enterprise APIs, and just did a project download for enwiki_namespace_0, but many of the articles it contains are not the latest versions. One that's easy to see, since it's near the top, is Athi,_Kenya. When I look at https://en.wikipedia.org/w/index.php?title=Athi,_Kenya&action=history I see there's a newer version which adds the redirect, whereas the project download has this version: "date_modified": "2023-01-30T04:42:08Z",
This is the project download API I'm calling:
NAMESPACE=enwiki_namespace_0
curl -L -H "Authorization: Bearer $WIKIPEDIA_ACCESS_TOKEN" \
  https://api.enterprise.wikimedia.com/v2/snapshots/${NAMESPACE}/download \
  --output ${NAMESPACE}
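Once the download finishes and the tar.gz is extracted, one quick way to check which revision the dump holds for each article is to scan the extracted file line by line. This is a minimal sketch, assuming the extracted snapshot is NDJSON with one article object per line carrying name and date_modified fields (date_modified is taken from the output quoted above; the exact schema may differ):

```python
import json

# Hypothetical sample line; in the real extracted snapshot, each line of the
# NDJSON file is one article object like this.
sample_line = '{"name": "Athi, Kenya", "date_modified": "2023-01-30T04:42:08Z"}'

def latest_versions(lines):
    """Map each article name to the date_modified recorded in the snapshot."""
    versions = {}
    for line in lines:
        article = json.loads(line)
        versions[article["name"]] = article["date_modified"]
    return versions

print(latest_versions([sample_line]))
```

Comparing those timestamps against the on-wiki page history makes it easy to see which articles the snapshot is lagging on.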
If there's a better place to ask, let me know. Thanks! Rcleveng (talk) 15:39, 7 February 2025 (UTC)
- Hello @Rcleveng - I hope you're finding the "snapshots" dataset useful for your needs.
- You can find the public helpcenter for technical enquiries about the Enterprise API on its dedicated website: helpcenter.enterprise.wikimedia.com. In the "What do you receive in the Snapshot API?" answer it specifies that:
- Snapshot API will return a tar.gz snapshot file of a project as it was at midnight UTC the day before the request and, for free accounts, refreshes twice-monthly on the 2nd and 21st of every month. It contains all of the current articles in each supported project at the time of file creation.
- The various formats and refresh rates of data that are available at no cost are described on our meta page under "Access".
- Finally, for direct technical support, you can log in at https://dashboard.enterprise.wikimedia.com/dashboard and create a new support ticket.
- -- LWyatt (WMF) (talk) 16:01, 7 February 2025 (UTC)
- Thank you @LWyatt (WMF)! I'll raise a ticket there.
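As a side note, the free-tier refresh schedule quoted above (the 2nd and 21st of each month) is easy to compute against. A minimal sketch, assuming only those two calendar days and nothing else about the schedule:

```python
from datetime import date

REFRESH_DAYS = (2, 21)  # free-tier snapshot refresh days, per the helpcenter text above

def next_refresh(today):
    """Return the next free-tier snapshot refresh date on or after `today`."""
    for day in REFRESH_DAYS:
        if today.day <= day:
            return today.replace(day=day)
    # Past the 21st: roll over to the 2nd of the next month.
    year, month = (today.year + 1, 1) if today.month == 12 else (today.year, today.month + 1)
    return date(year, month, 2)

print(next_refresh(date(2025, 2, 7)))  # 2025-02-21
```

This can help decide whether a stale article will be picked up by the next scheduled snapshot or is worth a support ticket.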
Latest Release: Parsed Wikipedia References with Quality Scoring Models
The new Parsed References feature in Structured Contents provides parsed inline citations and references from Wikipedia articles in a consistent JSON format. The parser's output maintains a strong connection between each citation and the content it references by linking them at the paragraph level, ensuring context is preserved.
Additionally, references are structured where possible while preserving the text as it appears on the page, offering flexibility for reusers to adapt the data to their specific needs.
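In code, the paragraph-level linking described above might be consumed along these lines. This is only an illustrative sketch: paragraph_index and text are hypothetical field names invented for the example, not the real Parsed References schema:

```python
from collections import defaultdict

# Invented records illustrating references linked to paragraphs by index;
# the actual Parsed References JSON structure may differ.
references = [
    {"paragraph_index": 0, "text": "Smith 2020, p. 4"},
    {"paragraph_index": 0, "text": "Jones 2019"},
    {"paragraph_index": 2, "text": "Census report 2021"},
]

def by_paragraph(refs):
    """Group reference texts by the paragraph they support."""
    grouped = defaultdict(list)
    for ref in refs:
        grouped[ref["paragraph_index"]].append(ref["text"])
    return dict(grouped)

print(by_paragraph(references))
```

Grouping like this keeps each citation attached to the passage it verifies, which is the context-preservation property the feature describes.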
The new Reference Models feature delivers two Machine Learning scores for Wikipedia References: Reference Risk and Reference Need. When an article is updated, the ML models calculate a score to help editors and reusers understand more context about the article changes and how they affect the article’s overall verifiability and reliability.
Learn more about this release on our blog https://enterprise.wikimedia.com/blog/parsed-references-with-scoring-models/. Wikimedians can also access this beta release via their accounts on Wikimedia Cloud Services. SDelbecque-WMF (talk) 21:03, 27 March 2025 (UTC)
Quarterly product update
Hello everyone! If you're interested in exploring Enterprise's latest launches, I've just published the Quarterly update for Jan-March, 2025. We invite you all to check it out here --JArguello-WMF (talk) 14:28, 1 April 2025 (UTC)
Missing dumps for 2025-04-01
I noticed there are no recent enterprise dumps available: https://dumps.wikimedia.org/other/enterprise_html/runs/20250401/ (folder is empty).
Will they be available any time soon? (ping @LWyatt (WMF)) Prof.DataScience (talk) 11:09, 8 April 2025 (UTC)
- Hi @Prof.DataScience - there's an updated information text on that dump's information page which notes that:
- "...as of 03/24/2025, are no longer replicated here. If you are in need of recent runs, dumps of article change updates, or the ability to query individual articles from the dumps, visit Wikimedia Enterprise to sign up for a free account. Alternatively use your developer account to access APIs within Wikimedia Cloud Services."
- The "20250401" run folder shouldn't exist; I'll get that blank page removed so as not to cause future confusion. LWyatt (WMF) (talk) 11:40, 8 April 2025 (UTC)
Dataset published on Kaggle
[TL;DR – the beta dataset is being shared in a new place but is neither new nor a reaction to AI scrapers. We still want people to give feedback on it though!]
Last week, we released our “Structured Contents” dataset on Kaggle (our blog post announcement; Kaggle’s announcement). This is part of an early beta release that we’re proactively sharing with test partners and across open platforms to engage a broad range of commercial, academic, and volunteer users. Our goal is to gather feedback while refining the dataset for future production release.
This dataset was first openly published in September 2024 on Hugging Face (blog post announcement; talkpage notice), alongside the announcement of expanded free accounts. That update increased access to include 5,000 monthly On-demand API requests, replacing the old trial version that offered only a limited number of free requests. The Structured Contents articles endpoint is included as part of this free access.
Since last week, the Kaggle data release has garnered some media attention. The media stories led to additional awareness of the Enterprise API services – we saw our biggest ever traffic day as a result! However, many of the media articles wrongly conflated this release with a different blog post from two weeks ago – which discussed the heavy toll that bots scraping data to train LLMs take on WMF infrastructure. Unfortunately, no journalist actually confirmed this connection with us before publishing. As we continue to submit correction requests (with varying degrees of success), we have had to clarify several misconceptions that arose in the media narrative:
- Although the dataset is indeed useful for training AI models, re-publishing this beta dataset on Kaggle (which was already available on HuggingFace for 6 months) is not a reaction to the impact of scraping on our infrastructure, nor is it an attempt to “fend off” AI scrapers or “get [them] off our back”. Equally, this re-publication is neither “because” of that scraping activity nor “a solution” to it. It is trying to help developer communities through cleaner, more efficient data procurement of an early beta format.
- Kaggle is not “paying for the data” as one article previously stated. The dataset allows developers to access Wikimedia data in a new machine-readable format. However, the content continues to be both freely-licensed and freely-accessible (gratis and libre). Since last year, the Wikimedia Enterprise team has been testing this product via external releases, to help us get wider feedback and identify use cases to aid in development decisions to improve the product.
We hope to use this increased interest in the beta release to gather more useful feedback on the data structure itself as we turn the beta into a production release, increase the number of Wikipedia language editions it covers, and ensure its utility for more of our service’s users (at the free and paid tier alike). LWyatt (WMF) (talk) 09:08, 23 April 2025 (UTC)