
CERSEI

From Meta, a Wikimedia project coordination wiki
CERSEI screenshot

CERSEI is a tool that can import or scrape third-party data sources. It uses source-specific Python code for each source, and can even use a "headless browser" to scrape complicated websites that rely on e.g. JavaScript for navigation. It can therefore access data sources that cannot be accessed via e.g. Mix'n'match. The data from a source can be updated regularly, either in full or for changed entries only (if the source has a "recent changes" equivalent).
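
The two update modes described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Entry` and `Source` classes, their method names, and the dictionary used as a store are hypothetical and are not taken from CERSEI's code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Entry:
    source_id: str      # identifier of the entry in the third-party source
    data: dict          # scraped payload
    modified: datetime  # when the source last changed this entry


@dataclass
class Source:
    """Hypothetical stand-in for one source-specific scraper."""
    entries: list = field(default_factory=list)
    has_recent_changes: bool = True  # does the source expose "recent changes"?

    def scrape_all(self) -> Iterable[Entry]:
        return list(self.entries)

    def scrape_changed(self, since: datetime) -> Iterable[Entry]:
        return [e for e in self.entries if e.modified > since]


def update(source: Source, store: dict, last_run: Optional[datetime]) -> int:
    """Refresh `store` from `source`; return how many entries were fetched."""
    if last_run is not None and source.has_recent_changes:
        batch = source.scrape_changed(last_run)  # incremental update
    else:
        batch = source.scrape_all()              # full update
    count = 0
    for entry in batch:
        store[entry.source_id] = entry.data
        count += 1
    return count
```

A source without a "recent changes" equivalent simply falls back to the full update on every run.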

CERSEI stores the scraped results in an "extended" Wikibase-compatible JSON format that can be filtered into Wikidata-compatible items, for easier comparison and import. There is an API endpoint with a MediaWiki-compatible path and output format, so that existing MediaWiki clients can process the data.
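
As a rough idea of what filtering an "extended" entry into a Wikidata-compatible item could look like, here is a minimal sketch. It assumes that the extensions are extra top-level keys next to the standard Wikibase item keys; the actual extended format is not described on this page, and the `x_source_url` field is invented for illustration.

```python
import copy

# Top-level keys of a standard Wikibase item JSON. Anything else is treated
# as an "extension" here (an assumption, not CERSEI's documented behaviour).
STANDARD_KEYS = {"id", "type", "labels", "descriptions", "aliases", "claims", "sitelinks"}


def to_wikidata_item(extended: dict) -> dict:
    """Return a copy of `extended` containing only standard Wikibase keys."""
    return {k: copy.deepcopy(v) for k, v in extended.items() if k in STANDARD_KEYS}


extended_entry = {
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Example entry"}},
    "claims": {},
    "x_source_url": "https://example.org/entry/1",  # hypothetical extension field
}

item = to_wikidata_item(extended_entry)
print(sorted(item))  # ['claims', 'labels', 'type']
```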

Properties can be queried and filtered via a simple syntax to retrieve entries with specific values.

If either the data source or the import/scraper code is updated and generates more details for an entry, the old revision is kept. This allows reconciling just the new or changed entries with Wikidata, and analysing the changes between revisions.
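
Comparing two revisions of an entry might look like the sketch below. It assumes, for simplicity, that each revision has been reduced to a mapping from property ID to a set of plain string values; real Wikibase claims are more deeply nested, and the function name is hypothetical.

```python
def diff_revisions(old: dict, new: dict) -> dict:
    """Compare two property -> set-of-values mappings."""
    changes = {"added": {}, "removed": {}}
    for prop in old.keys() | new.keys():
        before = old.get(prop, set())
        after = new.get(prop, set())
        if after - before:
            changes["added"][prop] = after - before
        if before - after:
            changes["removed"][prop] = before - after
    return changes


# Simplified example: the newer revision gained a place of birth (P19).
old_rev = {"P569": {"1850-01-01"}}
new_rev = {"P569": {"1850-01-01"}, "P19": {"Vienna"}}
print(diff_revisions(old_rev, new_rev))
# {'added': {'P19': {'Vienna'}}, 'removed': {}}
```

Only the properties listed under "added" or "removed" would then need to be reconciled with Wikidata.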

At the moment, CERSEI is not intended to allow matching of entries to Wikidata items from within the tool; it is rather a repository of automatically curated data to be used by other tools. Wikidata matches are either imported from Wikidata, or from the respective data source. A Mix'n'match "bridge" is in place.

Please feel free to suggest more sources to import, or even write a new scraper (example) yourself.