Grants:IEG/StrepHit: Wikidata Statements Validation via References/Renewal/Timeline
This project is funded by an Individual Engagement Grant
This Individual Engagement Grant is renewed
Timeline for StrepHit: Wikidata Statements Validation via References
Timeline | Date |
Back-end redesign: phab:T166497 | August 2017 |
Back-end version 2 | February 2018 |
Front-end redesign: phab:T166495 | February 2018 |
Tool documentation | May 2018 |
Data release tutorial | April 2018 |
StrepHit lexical database version 2 | May 2018 |
StrepHit standard datasets version 2 | May 2018 |
StrepHit direct inclusion dataset | May 2018 |
StrepHit unresolved entities dataset | May 2018 |
Overview
- Project start date: May 22, 2017
- Codebase: https://github.com/Wikidata/primarysources
Monthly updates
Each update covers a one-month span, starting on the 22nd day of the previous month. For instance, June 2017 means May 22 to June 22, 2017.
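The reporting convention above can be expressed as a small date computation. This is only an illustrative sketch; the function name `update_span` is hypothetical and not part of the project codebase.

```python
from datetime import date


def update_span(year: int, month: int) -> tuple[date, date]:
    """Return the (start, end) dates covered by the update labelled year/month."""
    # The span starts on the 22nd of the previous month,
    # rolling back to December of the previous year for January updates.
    if month == 1:
        start = date(year - 1, 12, 22)
    else:
        start = date(year, month - 1, 22)
    # It ends on the 22nd of the labelled month.
    end = date(year, month, 22)
    return start, end


# The "June 2017" update covers May 22 to June 22, 2017.
print(update_span(2017, 6))
```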
June 2017
Community outreach: WikiCite 2017
- The project leader was accepted to and attended WikiCite_2017 (see WikiCite_2017#Attendees);
- kick-off talk given at the main conference track: WikiCite_2017/Program#May_23.2C_2017:_Conference;
- conference third (hack) day: face-to-face meeting with the Wikidata development team to understand the next implementation steps for the tool uplift;
- connected in person with Tpt, core developer of the primary sources tool version 1;
- synchronized with T_Arrow, member of the WikiFactMine team, a strategic partner. When the WikiFactMine project ends, we expect our dataset upload service to be ready to accept the WikiFactMine dataset.
Mock-ups
The team has focused on task U1 of the project work package and has published a set of mock-ups that integrate known requirements.
The set is visible to anyone in Phabricator, and a high-priority subset was shown to the WikiCite audience as part of the presentation.
Tool uplift proposal
The team has come up with an official uplift proposal, which replaces the old tool page: d:Wikidata:Primary_sources_tool. It is based on:
- feedback collected from the WikiCite audience;
- outcomes of the meeting with the Wikidata development team;
- investigation of technical solutions for both the back end and the front end. @Afnecors and Kiailandi: great work so far in diving into the MediaWiki world!
Technical
- First implementation steps towards the solution proposed for the back end: phab:T167810;
- submitted the first patch to an active Wikimedia project! Currently under review: gerrit:360376;
- started the front-end refactoring: phab:T168243, phab:T168239.
July 2017
Community outreach
- Notification of the uplift proposal to the Wikidata community: https://lists.wikimedia.org/pipermail/wikidata/2017-July/010902.html;
- the uplift proposal is mentioned in the Wikidata weekly summary #268: https://lists.wikimedia.org/pipermail/wikidata/2017-July/010940.html;
- repeatedly asked the Wikidata development team for direct feedback, which would be especially helpful for the front end: unfortunately, we did not get any answer;
- regular fruitful interactions on the back-end implementation with Smalyshev_(WMF), who is a member of the WMF Discovery team (a strategic partner) and product owner of the Wikidata Query Service (WDQS) back end. At the time of writing this report, we are discussing where best to put the primary sources tool code in WDQS, i.e., as a separate module or in an existing one;
- ongoing collaboration with the reviewers of the submitted patches. @Smalyshev (WMF) and GLederrey (WMF): thanks a lot for your valuable contributions.
Front end
Focused on a major refactoring of the existing code to fit the new architecture, i.e., a MediaWiki extension. More specifically, we worked on:
- refactoring HTML templates, see phab:T168247;
- writing unit tests for the HTML templates, see phab:T168248;
- refactoring functions that interact with the user, see phab:T168251;
- writing unit tests for the functions that interact with the user, see phab:T168254.
Back end
The first development iteration of the Ingestion API (responsible for the interaction with the data providers) is over. More specifically, we worked on:
- Wikidata RDF data model validation, see phab:T167810. Part of a patch under review at gerrit:365253;
- upload service, see phab:T169045. Part of a patch under review at gerrit:365253;
- update service, see phab:T170682. Pending patch submission;
- the drop service is already available, see phab:T170684;
- enabling the use of named graphs to ensure provenance of the datasets, see phab:T170685. Patch under review at gerrit:366227;
- storing data provider metadata in a specific named graph, see phab:T170819. Patch under review at gerrit:366231.
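To illustrate how named graphs can keep track of dataset provenance, here is a minimal sketch that builds SPARQL 1.1 Update and query strings scoped to one graph per dataset. The helper names and the graph URI are hypothetical; the actual back end may organize its graphs differently.

```python
def insert_into_graph(graph_uri: str, triples: list[str]) -> str:
    """Build a SPARQL Update request that loads triples into one named graph."""
    body = "\n".join(f"    {triple}" for triple in triples)
    return f"INSERT DATA {{\n  GRAPH <{graph_uri}> {{\n{body}\n  }}\n}}"


def select_from_graph(graph_uri: str) -> str:
    """Build a query that only sees the statements of one dataset."""
    return f"SELECT ?s ?p ?o WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"


# Hypothetical graph URI identifying the uploaded dataset.
dataset_graph = "http://example.org/datasets/my-dataset"
statement = (
    "<http://www.wikidata.org/entity/Q42> "
    "<http://www.wikidata.org/prop/direct/P31> "
    "<http://www.wikidata.org/entity/Q5> ."
)
print(insert_into_graph(dataset_graph, [statement]))
print(select_from_graph(dataset_graph))
```

Because every triple lives in the graph of the dataset it came from, dropping or updating a dataset only touches that graph.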
August 2017
Community outreach: Wikimania 2017
- The project leader attended Wikimania 2017;
- connected in person with User:Smalyshev_(WMF), core developer of the mw:Wikidata_query_service, which will serve as the back end for the primary sources tool;
- meeting with User:Smalyshev_(WMF) to discuss technical implementation details, with a focus on the back end;
- update given to User:Tpt on the tool uplift proposal. Positive feedback received;
- update given to User:Lydia_Pintscher_(WMDE) on the tool uplift proposal. The development roadmap is agreed;
- synchronized with User:T_Arrow and User:Charles_Matthews on the WikiFactMine dataset release;
- follow-up on the discussion with User:Smalyshev_(WMF) sent to him and to User:GLederrey_(WMF).
Front end
Efforts still focused on the major refactoring needed for the new architecture. The development of a MediaWiki extension built upon Wikibase is non-trivial and particularly time-consuming, due to:
- a lack of exhaustive documentation, which entails the following workflow:
- diving into the (vast) codebase;
- identifying relevant pieces of code;
- understanding their behavior;
- interacting with the authors and integrating any feedback they provide;
- the absence of a Wikidata user interface toolkit, preventing direct access to relevant objects;
- the profusion of non-standard development practices.
Back end discussion outcomes
We digested the discussion with User:Smalyshev_(WMF) at Wikimania 2017. The main action points have become Phabricator tickets:
- phab:T173747;
- phab:T173749;
- phab:T173750;
- phab:T173753;
- phab:T173755;
- phab:T173759;
- phab:T173760;
- phab:T173761;
- phab:T173762.
Other technical details:
- having the primary sources tool code in a separate repo implies less overhead in terms of Gerrit patches, e.g., web.xml for new Web services;
- the actual code review is independent from patches;
- the interaction between the new Web services and Blazegraph is now implemented via HTTP, but can probably be done through Servlet filters;
- data normalization is probably not needed;
- ranking can be safely ignored, i.e., always set to best.
Things external to the code:
- RDF is probably a complex format for data providers;
- we need a tool to generate Wikidata-compliant RDF out of tabular data;
- the most challenging step to generate a Wikidata-compliant dataset is probably entity reconciliation, i.e., mapping between internal IDs and Wikidata QIDs;
- we may propose to use OpenRefine, where a plug-in for Wikidata reconciliation is being developed by d:User:Pintoch.
September 2017
Front end
- The new architecture is now ready: the time to develop known features (phab:M218) has come;
- worked with Thiemo Kreuz (WMDE) to finalize the discussion on the HTML templates;
- the migration/refactoring of HTML templates is complete: phab:T168239;
- started the implementation of the reference preview: phab:M218/693, phab:T160332;
- started the implementation of the new filter: phab:M218/691, phab:T176638
Back end
- The ingestion API is complete: phab:T169044;
- worked on the curation API: phab:T169985;
- the suggest service is complete: phab:T172502;
- came up with a way to mark curated data: phab:T169986;
- worked on the curate service: phab:T172505;
- started handling complex data types: phab:T173750
October 2017
- Discussion on the QuickStatements to RDF converter: phab:T173749;
- started the implementation of the converter, a standalone tool written in Python: https://github.com/marfox/qs2rdf
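As a rough idea of the first step such a converter needs, the sketch below parses one tab-separated QuickStatements line into its subject, property, value, qualifiers, and references. It assumes the QuickStatements convention that reference properties are written with an `S` prefix instead of `P`. The `Statement` class and `parse_quickstatement` function are hypothetical and do not reflect the actual qs2rdf code.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    subject: str
    prop: str
    value: str
    qualifiers: list[tuple[str, str]] = field(default_factory=list)
    references: list[tuple[str, str]] = field(default_factory=list)


def parse_quickstatement(line: str) -> Statement:
    """Split a tab-separated QuickStatements line into its components."""
    parts = line.rstrip("\n").split("\t")
    # A line holds subject, property, value, then zero or more key-value pairs.
    if len(parts) < 3 or (len(parts) - 3) % 2 != 0:
        raise ValueError(f"Malformed QuickStatements line: {line!r}")
    statement = Statement(*parts[:3])
    for key, value in zip(parts[3::2], parts[4::2]):
        if key.startswith("S"):
            # Reference pair: S854 refers to property P854.
            statement.references.append(("P" + key[1:], value))
        else:
            statement.qualifiers.append((key, value))
    return statement


parsed = parse_quickstatement('Q42\tP31\tQ5\tS854\t"http://example.org"')
print(parsed)
```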
Front end
- The following known features are deployed to the gadget (version 1), for users to test:
- de-duplication;
- suggested properties browser menu;
- started migrating the browser menu to the new architecture: phab:T175164;
- the development environment for the filter is ready: phab:T176641;
- started the implementation of the preview facility: phab:T160332;
- implemented a best-effort renderer for generic Web sources (current target: Freebase datasets);
- generic rendering is too error-prone: implemented direct usage of source corpora when datasets make them available (current target: StrepHit);
- requested a ToolForge account to deploy the rendering services;
- implemented highlighting of relevant source content.
Back end
- Implemented the curate service: phab:T172505;
- the curation API is complete: phab:T169985;
- made the back end a separate module of Wikidata Query Service: phab:T173747;
- integrated latest code review: phab:T178476.
November 2017
Community outreach: itWikiCon 2017
- Afnecors co-organized itWikiCon, the first Italian local chapter Wiki conference;
- Hjfocs attended and disseminated StrepHit.
Side project: qs2rdf
The first version of the QuickStatements-to-RDF converter is complete: phab:T173749.
Front end
- Adapted the front end to the back end version 2: phab:T168244, phab:T168255;
- migrated the filter to the new architecture: phab:T176642;
- added reference column to the filter: phab:T178299, phab:T148165;
- implemented the Web service for Web sources corpora;
- indexed the StrepHit corpus with Elasticsearch for effective retrieval performance;
- integrated approve/reject buttons into the preview facility.
Back end
- First test deployment of the full back end, version 2: phab:T178585;
- translated current StrepHit datasets into RDF: phab:T178795;
- the front end heavily relies on the QuickStatements format:
- added support for QuickStatements output to the curation API;
- adapted integration tests;
- started working on the search service: phab:T180486.
VPS machine
- Requested a VPS machine for production deployment: phab:T180347. @AndrewBogott (WMF), BDavis (WMF), and CPettet (WMF): thanks for bringing it up!
December 2017
Front end
- The reference preview feature is ready.
- Tackles this known task: phab:T160332;
- pull request for the gadget version (v1): https://github.com/Wikidata/primarysources/pull/140;
- worked with Tpt on the pull request;
- module added to the MediaWiki extension version (v2): phab:T181481;
- started integration with the filter: phab:T181480;
- created query template for the domain of interest feature in the filter: phab:T178304;
- started work on the interface for data providers. The dataset upload form is ready: phab:T170822.
Back end
- Requested support for specific time zones in WDQS: phab:T179068. Addressed by Smalyshev_(WMF);
- the search service is ready: phab:T180486;
- the random service is ready: phab:T180483.
- implemented a caching mechanism with separate threads for efficient response;
- dataset-specific cache gets updated when a new dataset is uploaded;
- global cache gets updated through a fixed schedule;
- resolved the non-randomness issue of the random button: phab:T148180;
- the datasets service is ready: phab:T182192.
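The caching mechanism described above for the random service (per-dataset refresh on upload, global refresh on a fixed schedule, both off the request path) can be sketched as follows. This is a simplified illustration in Python, not the actual back-end code; the `DatasetCache` class and the `loader` callable are hypothetical.

```python
import threading
from typing import Callable


class DatasetCache:
    """Per-dataset cache refreshed on upload and periodically by a worker thread."""

    def __init__(self, loader: Callable[[str], list[str]], interval_seconds: float):
        self._loader = loader  # hypothetical: dataset name -> items to cache
        self._interval = interval_seconds
        self._entries: dict[str, list[str]] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._worker.start()

    def stop(self) -> None:
        self._stop.set()
        if self._worker.is_alive():
            self._worker.join()

    def refresh_dataset(self, dataset: str) -> None:
        """Refresh one dataset, e.g., right after it has been uploaded."""
        items = self._loader(dataset)  # slow work happens outside the lock
        with self._lock:
            self._entries[dataset] = items

    def refresh_all(self) -> None:
        """Refresh every known dataset (the scheduled, global update)."""
        with self._lock:
            datasets = list(self._entries)
        for dataset in datasets:
            self.refresh_dataset(dataset)

    def get(self, dataset: str) -> list[str]:
        """Serve a request from the cache without touching the storage engine."""
        with self._lock:
            return list(self._entries.get(dataset, []))

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self.refresh_all()


cache = DatasetCache(lambda name: [f"{name}-statement-1"], interval_seconds=3600)
cache.start()
cache.refresh_dataset("freebase")
print(cache.get("freebase"))
cache.stop()
```

Requests only read from the in-memory dictionary, so response time does not depend on how long the underlying queries take.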
VPS machine
- Worked on the setup: phab:T182789;
- the back-end production deployment is ready;
- requested and obtained access to shared storage (Wikimedia data dumps): phab:T183229;
- the front-end v2 testing deployment is currently blocked by phab:T183274.
January 2018
Happy new year! Back to work after the Christmas break.
Front end
- The software developers were not available due to university exam sessions;
- the project leader started the integration of the back-end version 2 production services (i.e., those deployed in the VPS machine) into the front-end gadget code.
Back end
- Implemented the statistics service: phab:T183364;
- dataset statistics are ready and use a scheduled caching mechanism with a separate thread for efficient response: phab:T183367;
- user statistics are ready: phab:T183370;
- they are computed on the fly through queries to a specific named graph in the back-end storage engine: phab:T170820;
- user name sanity check: phab:T182185;
- the curate service updates the user activity count live, i.e., whenever a user curates a Wikidata statement.
VPS machine
- Worked with BDavis_(WMF) to resolve phab:T183274, which was blocking the import of Wikidata XML dumps into a MediaWiki Vagrant instance;
- the import process has started: phab:T182989.
February 2018
The main outcome of this month is the alpha release of version 2: phab:T185571. The back end services are now switched to version 2, while the front end is the gadget version with major features implemented.
The team has also intensively worked on the beta release of version 2: phab:T185572.
Front end
- Handled duplicate suggested statements with date values: phab:T177226;
- implemented version 2 of the configuration dialog, i.e., the dataset selection window, as per phab:M218: phab:T187043;
- fixed an issue in the reference preview affecting the genealogics.org source: phab:T186698;
- added the text box to input generic SPARQL queries: phab:T178306;
- started working on the entity of interest input box: phab:T178303.
Back end
- Improved the search service performance: phab:T185576;
- added an optional dataset description parameter in the upload service: phab:T187221;
- output the dataset description in the statistics service: phab:T187220.
VPS machine
- Finished loading a full copy of Wikidata XML dumps into a MediaWiki Vagrant instance: phab:T182989;
- worked with BDavis_(WMF) to resolve phab:T185637: instead of working around a bug in the Apache HTTP client through a specific NGINX configuration, the dependency is upgraded in WDQS.
StrepHit
- Confident dataset version 2:
- structured references with reference URL (P854), stated in (P248), retrieved (P813) and source-specific properties for IDs when available, e.g., Union List of Artist Names ID (P245);
- improved the mapping between extracted statements and Wikidata properties.
qs2rdf
- Handled multiple reference-value pairs in QuickStatements as a single RDF reference node.
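The idea of folding several reference-value pairs into one reference node can be sketched as below. The `wdref:` and `pr:` prefixes follow the Wikidata RDF data model; deriving the node identifier from a SHA-1 hash of the sorted pairs is an assumption made for illustration, and `reference_node` is a hypothetical helper, not the qs2rdf implementation.

```python
import hashlib

# Prefixes assumed to be declared in the output document:
#   wdref: <http://www.wikidata.org/reference/>
#   pr:    <http://www.wikidata.org/prop/reference/>


def reference_node(pairs: list[tuple[str, str]]) -> tuple[str, list[str]]:
    """Turn (property ID, serialized RDF value) pairs into one reference node."""
    # Sort so that the same set of pairs always yields the same node.
    ordered = sorted(pairs)
    # Assumption: the node ID is a hash of the pairs it groups.
    digest = hashlib.sha1(repr(ordered).encode("utf-8")).hexdigest()
    node = f"wdref:{digest}"
    triples = [f"{node} pr:{prop} {value} ." for prop, value in ordered]
    return node, triples


node, triples = reference_node([
    ("P854", "<http://example.org/biography/1>"),       # reference URL
    ("P813", '"2018-02-01T00:00:00Z"^^xsd:dateTime'),   # retrieved
])
for triple in triples:
    print(triple)
```

All pairs end up as properties of the same node, so the statement points to a single reference rather than to one reference per pair.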
March 2018
The team focused its development efforts on the beta release of version 2. The highest priority was given to the improvement of the filter component.
Front end
- Filter module:
- added property autocompletion: phab:T178305;
- added entity value autocompletion: phab:T178307, phab:T178301;
- started work on the baked filters (i.e., all properties, all item values, frequent StrepHit properties and item values): phab:T189123;
- user workflow: disable mutually exclusive filters depending on user input;
- first working version of statements search with SPARQL;
- worked on properly displaying search results in a table;
- worked on showing action buttons (i.e., preview, approve, reject) with search SPARQL query results;
- connected the reference preview facility to search results;
- finalized the Web interface for data providers, in the form of a special page: phab:T170821;
- finalized the HTML templates module;
- improved the statement suggestions browser:
- ported the code to generate it into the MediaWiki extension: phab:T175164;
- now it also shows properties when only references are new (not full claims): phab:T177231;
- fixed a bug that prevented it from reappearing after a page reload on Firefox and Safari: phab:T186604;
Back end
- Implemented the service for property autocompletion: phab:T188002;
- implemented the service for entity value autocompletion: phab:T188004;
VPS machine
- Extended the maximum size of a dataset to be uploaded to 256 MB: phab:T186731
April 2018
Front end
- Fixed curation button labels in the reference preview: phab:T186611;
- resolved a conflict with the ESC key that was closing both the filter and the reference preview windows: phab:T189579;
- picked different methods to retrieve labels via the Wikidata API depending on the number of results;
- optimized the labels cache;
- finalized the dataset selection dialog;
- implemented a tool-specific portlet that stands out on the left sidebar;
- optimized the automatic blacklist of reference URLs;
- filter module:
- completed the implementation of the baked filters dropdown menu: phab:T189123;
- enabled curation actions on baked filters statements;
- optimized the autocompletion filters;
- forced a limit of 100 results in arbitrary SPARQL to avoid heavy queries;
- rewrote log handling with fine-grained message levels;
- completed work on the upload/update special page to comply with the MediaWiki style.
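The 100-result cap on arbitrary SPARQL queries mentioned above could be enforced along these lines. This is a naive sketch, written in Python for brevity even though the front end is JavaScript. It only looks at a trailing `LIMIT` clause (optionally followed by `OFFSET`) and does not parse the query, so limits inside subqueries are left untouched. `cap_limit` is a hypothetical name.

```python
import re

MAX_RESULTS = 100

# Matches "LIMIT n" at the end of the query, optionally followed by "OFFSET m".
_TRAILING_LIMIT = re.compile(
    r"\bLIMIT\s+(\d+)(?=(\s+OFFSET\s+\d+)?\s*$)", re.IGNORECASE
)


def cap_limit(query: str, max_results: int = MAX_RESULTS) -> str:
    """Make sure a SPARQL query returns at most max_results rows."""
    query = query.strip()
    match = _TRAILING_LIMIT.search(query)
    if match is None:
        # No trailing limit: append one.
        return f"{query}\nLIMIT {max_results}"
    if int(match.group(1)) <= max_results:
        # The user already asked for few enough results.
        return query
    # Replace only the number, keeping any OFFSET clause.
    return query[: match.start(1)] + str(max_results) + query[match.end(1):]


print(cap_limit("SELECT * WHERE { ?s ?p ?o } LIMIT 5000"))
```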
Back end
- Major refactoring of the whole codebase:
- moved shared pre/post-processing logic to static methods in a utility class;
- moved SPARQL queries, API parameters, settings read from environment variables into standalone classes;
- improved logging with fine-grained messages;
- started work on documentation;
- started work on compliance with MediaWiki coding conventions;
- Technical details:
- ensured that parameters are not shared among requests;
- upload service: respond with a bad request if the dataset files are not RDF;
- ingestion API: full check of required request parameters;
- validate user name wherever it is passed as a request parameter;
- implemented a filter to add CORS headers in each service;
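The request checks listed above (required parameters, RDF-only uploads) could look roughly like the following. The parameter names and the extension-based RDF test are assumptions made for illustration; the actual services are written in Java and may validate file content rather than file names.

```python
# Hypothetical parameter names and accepted file extensions.
REQUIRED_UPLOAD_PARAMS = ("name", "user")
RDF_EXTENSIONS = (".ttl", ".nt", ".rdf", ".n3")


def check_upload_request(params: dict[str, str], filenames: list[str]) -> tuple[int, str]:
    """Return an HTTP status code and message for an upload request."""
    # Full check of required parameters: report all missing ones at once.
    missing = [p for p in REQUIRED_UPLOAD_PARAMS if not params.get(p)]
    if missing:
        return 400, "Missing required parameters: " + ", ".join(missing)
    # Reject the whole request if any file does not look like RDF.
    not_rdf = [f for f in filenames if not f.lower().endswith(RDF_EXTENSIONS)]
    if not_rdf:
        return 400, "Not RDF: " + ", ".join(not_rdf)
    return 200, "OK"


print(check_upload_request({"name": "my-dataset"}, ["data.ttl"]))
```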
VPS machine
- Loaded the URL blacklist and whitelist for more comprehensive testing.
May 2018
This is the final month of the project. We closed as many issues as possible in Phabricator: phab:project/board/2788/. Some are still open due to third-party assignees, some are out of scope. See the final report for more details: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal/Final.
Front end
Complete refactoring and clean-up of the codebase.
- The JavaScript and CSS parts comply with MediaWiki conventions (mw:Manual:Coding_conventions/JavaScript and mw:Manual:Coding_conventions/CSS):
- ESLint config as per https://github.com/wikimedia/eslint-config-wikimedia;
- Stylelint config as per https://github.com/wikimedia/stylelint-config-wikimedia;
- the PHP part complies with MediaWiki conventions (mw:Manual:Coding_conventions/PHP):
- code sniffer config as per https://github.com/wikimedia/mediawiki-tools-codesniffer;
- linter from https://github.com/JakubOnderka/PHP-Parallel-Lint;
- minus-x checker from https://github.com/wikimedia/mediawiki-tools-minus-x;
- more styling à la MediaWiki: buttons and icons in dialog windows and table result rows;
- fixed filter results table grid rendering in Firefox;
- finalized the special page for datasets upload and update;
- added widget with instructions on how to use the filter;
- fixed empty result table handling;
- fixed broken headers in service-specific result table with multiple datasets;
- handled the dataset parameter in properties and values back-end service calls;
- packaged the repository as per the guidelines: mw:Manual:Developing_extensions;
- requested and obtained a Gerrit repository: mw:Gerrit/New_repositories/Requests. Thanks MarcoAurelio for your work;
- created the page in the Extension namespace on mediawiki.org: mw:Extension:PrimarySources.
Back end
- Completed Web API documentation;
- completed Java API documentation;
- fixed all code compilation warnings (except a redundant one);
- the codebase complies with MediaWiki conventions (mw:Coding_conventions/Java/checkstyle.xml);
- the codebase complies with forbidden APIs checks (https://github.com/policeman-tools/forbidden-apis);
- handled empty dataset description case.
VPS machine
- Deployed the latest version of the back end;
- installed the latest version of the MediaWiki extension front end.
StrepHit
- Finalized the confident and supervised datasets version 2.