Grants:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine/Midpoint

This project is funded by a Project Grant

Report under review

This Project Grant midpoint report has been submitted by the grantee, and is currently being reviewed by WMF staff. You may add comments, responses, or questions to this report's discussion page.

To read the approved grant submission for this project, please visit Grants:Project/CS&S/Structured Data on Wikimedia Commons functionalities in OpenRefine.

Review the reporting requirements to better understand the reporting process.
Review all Project Grant reports under review.
Please email projectgrants@wikimedia.org if you have additional questions.

Welcome to this project's midpoint report! This report shares progress and learning from the first half of the grant period.

Summary

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

We set up this project:
- We hired and onboarded a fresh team;
- We established lines of communication with the user community of OpenRefine’s future SDC features.
We deployed a basic Wikimedia Commons reconciliation service, which makes it possible to load lists of Commons file names in OpenRefine and retrieve their Wikitext and structured data for further processing. This activity is nearly done: we are still ironing out some bugs and adding some features to make the service more flexible and user-friendly.
We built the technical foundations for batch SDC editing via OpenRefine by extending its Wikibase extension. This work is now almost halfway done.
We commissioned and started doing design research: after a set of interviews with typical end users, we are outlining how the SDC functionalities in OpenRefine should work in practice, what the user workflow will look like, and what features will make the workflow more user friendly. This work is now halfway done.
Based on emerging insights from this user research, we made several key decisions for further software development and prioritization.

Methods and activities

How have you set up your project, and what work has been completed so far?

Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.

Project organisation and setup:

We hired and onboarded two junior developers (Eugene Egbe for Wikimedia-specific code and Joey Salazar for OpenRefine-specific code) and engaged a design researcher (Lozana Rossenova).
We set up basic project infrastructure: workboards and task lists on Phabricator (for Wikimedia code: reconciliation / OpenRefine in general) and GitHub (for OpenRefine code). We also created an info page on Wikimedia Commons for OpenRefine in general.

Technical development:

See completed tasks on Phabricator and on GitHub; see our monthly updates for more detail.

We set up a Gerrit repository for the Wikimedia Commons reconciliation service and created a minimal operational version of it, which has since been improved and extended, among other things with data extension functionality.
We made many updates to OpenRefine's Wikibase extension, which now makes it possible to edit entities other than items, including Wikimedia Commons' MediaInfo entities.

Design research and wireframes:

After a series of stakeholder interviews, Lozana Rossenova created workflow diagrams and high-level wireframes for file upload through OpenRefine.
This design research, and the resulting workflow sketches and lo-fi wireframes, helped us to make some major architectural decisions, as described under 'Learning' in this report. The features table included in this grant's Timeline page was also created after processing Lozana's design research.

Community and stakeholder outreach:

Presentation at WikidataCon 2021 (October 30) about our ongoing work on SDC support in OpenRefine.
We set up a contact list through which we can send MassMessages about our project to interested Wikimedia community members.
We have a monthly recurring meeting with advisors in the Wikimedia Foundation's GLAM team and at Wikimedia Sverige.
As part of the design research by Lozana Rossenova, we conducted in-depth interviews with diverse potential end users of the SDC functionalities in OpenRefine.
At WikidataCon 2021, we also gave a general OpenRefine tutorial.

Midpoint outcomes

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

We have deployed the following code and software (quite a bit of this is still being 'refined' further):

The Wikimedia Commons Reconciliation Service, which makes it possible to retrieve the identifiers (M-ids) from lists of Wikimedia Commons files in OpenRefine and to retrieve their Wikitext and structured data, so that the user can then perform further operations on this data inside OpenRefine. It is also available for testing at the Reconciliation service test bench.
Screenshot: OpenRefine with a list of reconciled file names, their corresponding M-ids, extracted Wikitext and various structured data statements, retrieved with the Wikimedia Commons Reconciliation Service.
Wikimedia Commons manifest for OpenRefine (can be considered a type of 'settings' file to help OpenRefine work with a specific Wikibase - in this case Wikimedia Commons).
The OpenRefine team was delighted to see the Wikimedia Commons logo appear for the very first time inside OpenRefine on November 24, 2021 (as a result of the successful integration of the Wikimedia Commons manifest).
The EditGroups tool (which is heavily used on Wikidata) has been ported to Wikimedia Commons as well: https://editgroups-commons.toolforge.org/. This makes it possible for Wikimedia Commons contributors to undo certain batch edits on Wikimedia Commons, including future 'faulty' edits by OpenRefine.
There is now a dedicated tag on Wikimedia Commons which will be used to track and count Wikimedia Commons edits done with OpenRefine.

Joey did the very first OpenRefine-powered structured data edit on Wikimedia Commons on December 20, 2021.
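
To illustrate what reconciling file names to M-ids involves, here is a minimal Python sketch. It assumes the service accepts batches in the general shape used by the reconciliation protocol that OpenRefine speaks (a `queries` object keyed `q0`, `q1`, ..., answered by a list of candidates per key); the exact fields returned by the Wikimedia Commons Reconciliation Service are not confirmed here. The function names and the sample ids are hypothetical, and no network request is made.

```python
def build_queries(file_names):
    """Build a reconciliation batch: one query object per file name."""
    return {f"q{i}": {"query": name} for i, name in enumerate(file_names)}


def best_matches(file_names, response):
    """Map each file name to the id of its top candidate, or None.

    A candidate is only accepted when the service flagged it as a match;
    anything else is left for the user to review by hand.
    """
    matches = {}
    for i, name in enumerate(file_names):
        candidates = response.get(f"q{i}", {}).get("result", [])
        top = candidates[0] if candidates else None
        matches[name] = top["id"] if top and top.get("match") else None
    return matches


def mid_from_page_id(page_id):
    """An M-id is the letter M followed by the file page's numeric page ID."""
    return f"M{page_id}"


# Hypothetical example data (invented ids, not real Commons files).
names = ["File:Example A.jpg", "File:Example B.jpg"]
sample_response = {
    "q0": {"result": [{"id": "M12", "name": "File:Example A.jpg",
                       "score": 100, "match": True}]},
    "q1": {"result": []},
}
print(build_queries(names))
print(best_matches(names, sample_response))
```

Files without a confident match come back as `None`, which mirrors the unreconciled cells that a user would resolve manually in OpenRefine.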

Finances

Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.

Then, answer the following question here: Have you spent your funds according to plan so far? Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.

Our spending up till this point has been added as an additional (last) column in our grant application's budget table.

We have spent our funds as planned so far, with a few small deviations:

We are dedicating part of the 'Additional developer honoraria' budget line ($3,000.00) towards design research work by Lozana Rossenova (this work has not been invoiced yet).
We are also dedicating a small part of the 'Additional developer honoraria' towards better internet connectivity for Eugene Egbe (based in Cameroon) and Joey Salazar (based in the Basque Country).

We don't anticipate any major changes to expenditures for our current grant in the second half of this funded project, but as mentioned below under 'Learning', we are applying for a project extension and supplement in order to be able to (1) perform continued design research and (2) develop an additional feature that will improve the experience of Wikimedia Commons file uploads in OpenRefine's Wikimedia Commons extension.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

The goal of our project is simple: we will make it possible for Wikimedia Commons users to batch edit and upload files with structured data with the help of OpenRefine.

However, as we started working on (mainly back-end) software enhancements and as we had many conversations with stakeholders and potential end users, we became a bit overwhelmed with the complexity of the workflow that we want to accommodate. Also, we were confronted with quite a few questions that we hadn’t yet deeply thought through while drafting our project plan. Examples include:

How to deal with Wikitext in OpenRefine? Wikitext descriptions (infoboxes, license templates…) in Wikimedia Commons are notoriously complex. Do we want to help end users to process (split/parse) this Wikitext in manageable chunks, and if so, how will we accommodate this? (Phabricator ticket)
How can we allow Wikimedia Commons users to start their batch editing process from Wikimedia Commons categories (rather than a list of files generated by an external tool)? Such a feature is not natively supported by OpenRefine but will significantly improve the end users’ workflows; without this feature inside OpenRefine, they would need to turn to other, rather complex external tools (like PetScan or the Wikidata or Commons SPARQL endpoints) to retrieve a list of file names for batch editing.
The Wikimedia Commons batch upload tool Pattypan provides the end user with strong guidance: when starting an upload process, the user can choose a Wikimedia Commons template (for instance Artwork, Photograph, Book) that informs them about the metadata that is needed for a Commons upload. Do we want to enable a similar functionality in OpenRefine, and if so, how can we integrate this in the software (which is actually designed for more freeform, less guided data manipulation)?
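
To make the Wikitext question more concrete, the sketch below shows one naive way to split a template call into named parameters, so that each parameter could end up in its own OpenRefine column. It is an illustration under simplifying assumptions, not a design decision: the function names are hypothetical, the sample Wikitext is invented, and real Commons pages contain cases (comments, `<nowiki>`, template name variants) that this approach does not handle.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested in {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_template(wikitext, name):
    """Return the named parameters of the first {{name ...}} call, or None."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    # Walk forward to the closing braces that balance the opening ones.
    depth, i = 0, start
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                break
        else:
            i += 1
    else:
        return None  # unbalanced braces: give up rather than guess
    body = wikitext[start + 2:i - 2]
    params = {}
    for part in split_top_level(body, "|")[1:]:  # first part is the template name
        pieces = split_top_level(part, "=")
        if len(pieces) > 1:  # positional parameters are skipped in this sketch
            params[pieces[0].strip()] = "=".join(pieces[1:]).strip()
    return params


# Invented sample Wikitext for illustration.
sample = """=={{int:filedesc}}==
{{Information
|description={{en|1=A [[cat|house cat]] on a wall}}
|date=2021-12-20
|source={{own}}
|author=[[User:Example|Example]]
}}"""
print(parse_template(sample, "Information"))
```

Nested templates and links are kept intact as values, so a language-wrapped description stays in one cell and can be processed further in a later step.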

Design research has informed several key decisions

In order to help us with well-informed decision making around these new challenges, we have engaged a design researcher starting from November 2021, who helps us to understand user workflows and helps us decide how we can accommodate these in OpenRefine. These insights have supported us in several architectural decisions, and they help us to prioritize our work in a much more informed way. As a result, we have so far made the following major architectural decisions:

We will deploy various Wikimedia Commons-specific features for OpenRefine (such as loading a dataset from Wikimedia Commons categories, parsing infobox Wikitext, and showing Commons-specific previews of files before edit or upload) in a separate new piece of software: a Wikimedia Commons / media file extension for OpenRefine. This decision will make it possible for us to create dedicated user interfaces for such workflow steps. We want to build this extension in such a way that, in a future iteration, it will also be able to support batch file editing and batch file upload for arbitrary Wikibases. The table below gives an overview of the features that will be enabled by OpenRefine itself and/or by the Wikimedia Commons extension:

Workflow step	If the user has ... installed, they are able to ...	OpenRefine without Commons extension (via this grant)	OpenRefine with Commons extension - beta version (via this grant; mid 2022)	OpenRefine with Commons extension - v1.0 release (2023-ish?)	Remarks
01 🌟 Project creation	Let OpenRefine import a data sheet (any OR-supported format) with file paths from Wikimedia Commons	Yes	Yes	Yes
02 🚿 Data cleaning	Clean and improve a spreadsheet or dataset with file metadata	Yes	Yes	Yes
03 🔃 Reconciliation / recon	Reconcile data columns with Wikidata	Yes	Yes	Yes	Using the Wikimedia Commons Reconciliation Service.
03 🔃 Reconciliation / recon	Reconcile Commons file names with M-ids	Yes	Yes	Yes	Using the Wikimedia Commons Reconciliation Service.
03 🔃 Reconciliation / recon	Reconcile M-ids with Commons file names	Yes	Yes	Yes	Using the Wikimedia Commons Reconciliation Service.
04 ➡️ Reconciliation / data extension	Retrieve Wikitext (as one big blob) from existing Commons files which are reconciled in OpenRefine	Yes	Yes	Yes
04 ➡️ Reconciliation / data extension	Retrieve structured data (including captions) from existing Commons files which are reconciled in OpenRefine	Yes	Yes	Yes
06 ✍️ Editing schema preparation	Create an editing schema in OpenRefine that structures edits to files on Wikimedia Commons	Yes	Yes	Yes
06 ✍️ Editing schema preparation	Create an editing schema in OpenRefine that structures edits to Wikidata items	Yes	Yes	Yes	Is currently already supported in OpenRefine; user will be able to switch between (batch) editing Wikidata and Commons.
07 👀 Check/preview/test upload	See an example preview of the structured data that will be (batch) edited to existing Commons files	Yes	Yes	Yes
07 👀 Check/preview/test upload	See an example preview of the Wikitext (generated) infobox of files edited in batch	Yes	Yes	Yes
08 💿 Batch SDC edit	Add structured data to existing files on Wikimedia Commons	Yes	Yes	Yes
11 🤦‍♂️ Fix errors	Undo batch SDC edits to existing files	Yes	Yes	Yes	Using the EditGroups tool.
11 🤦‍♂️ Fix errors	Delete batch uploads	Yes	Yes	Yes	Using the EditGroups tool.
12 ✅ Report after upload	Download dataset with modified or uploaded file links from Commons, with their file names, M-ids, and metadata	Yes	Yes	Yes
01 🌟 Project creation	Provide OpenRefine with one or multiple Wikimedia Commons categories; OpenRefine then loads the (reconciled) file paths of all the files in these categories	No	Yes	Yes
10 🔺 Upload files	Upload new files to Wikimedia Commons	No	Yes	Yes	We may also decide to implement file upload in OpenRefine itself. It will certainly be possible as part of the Commons extension.
04 ➡️ Reconciliation / data extension	Retrieve Wikitext, split and cleaned per parameter and template, from existing Commons files which are reconciled in OpenRefine	No	Possibly?	Yes
07 👀 Check/preview/test upload	See an example preview of the structured data that will be (batch) added to newly uploaded Commons files	No	Possibly?	Yes
07 👀 Check/preview/test upload	See an example preview of the Wikitext (generated) infobox of files uploaded in batch	No	Possibly?	Yes
02 🚿 Data cleaning	Directly see thumbnails of media files while cleaning and editing their metadata	No	No	Yes	OpenRefine is very data-centric and does not natively support (direct) preview of thumbnails of files in its data operation screen. In a next version of the Commons extension we can, for instance, introduce a media-centric editing and viewing screen that allows end users to batch edit metadata of media files in a more visually oriented way.
05 📌 Metadata mapping	Map the user's dataset with a preset template / checklist of fields from Wikimedia Commons (e.g. Artwork, Book, Map...)	No	No	Yes
01 🌟 Project creation	IIIF support (retrieve and process images/files hosted on a IIIF service)	No	No	Possibly?	We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research.
09 📚 Batch Wikitext edit	Edit Wikitext of existing Commons files	No	No	Possibly?	Wikimedia Commons editing and upload functionalities in OpenRefine focus on Structured Data first and foremost.
10 🔺 Upload files	Upload new files to an arbitrary Wikibase	No	No	Possibly?	We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research. Members of the Wikibase stakeholder group have already indicated interest.

In our grant application, we mentioned (as an option) the possible development of an external, generic batch upload tool for Wikimedia Commons files, which would provide an alternative to QuickStatements. As a major benefit, such a tool would make it possible for end users to perform very large batch edits and uploads ‘in the background / in the cloud’ rather than via OpenRefine directly (for which they would need to keep their computer and OpenRefine running for very long periods of time). However, handling very large batches of files to be uploaded (potentially several gigabytes at once, with varying file sizes and formats) poses a new challenge for such a tool. Therefore, we have decided to currently dedicate all our available development resources to deploying batch upload properly in OpenRefine itself. We will decide on the creation of a generic, separate batch upload tool after OpenRefine’s own file upload functionalities have been properly tested, building on what we learn in that process.

As we go forward, we would like to continue engaging a designer throughout the entire process. As mentioned above, insights from user interviews and design research have been invaluable in helping us to make informed decisions about ongoing development; we want to continue engaging design research to support our team for a few hours per week. In addition, we have noticed that a designer’s input will be very welcome to help us create updated or new user-facing interface elements (dialog windows/screens).

Planning adjustments

We have noticed that technical development is taking a bit more time than we originally anticipated. This is partly due to the complexity of the code that our developers need to become acquainted with.

Basic features of the Wikimedia Commons Reconciliation Service are ready, but more fine-tuning is needed to make the service truly user-friendly. We have decided to dedicate Wikimedia-specific developer time in December 2021 - February 2022 fully towards:
- Fine-tuning, bug fixes, and adding more flexibility to the Wikimedia Commons reconciliation service
- Design enhancements and Wikimedia Commons-specific UX improvements to various dialog windows in the core OpenRefine workflow and the Wikimedia Commons extension.

General OpenRefine development is slightly behind schedule. A first experimental edit of structured data on Wikimedia Commons was made on December 20, 2021; we will probably deploy a new OpenRefine version with full SDC editing support somewhat later than the originally planned end of January 2022. As we are also planning to deploy a new OpenRefine extension, it will be helpful if we can extend OpenRefine development work by several months.

Sustainability conversations

On a more meta level, we are also thinking about the longer-term sustainability of the tools we are developing through this grant. Recently, the GLAM-Wiki community has been strongly affected by unreliability and failures of two tools which have become very important for content partnerships: Pattypan and the ISA Tool. We expect that OpenRefine’s batch editing and upload functionalities will become a key technical environment upon which many Wikimedia volunteers and affiliates will rely for content partnerships. How can we make sure that the software we’re developing now will stay maintained for a longer period of time, and that we have the resources we need to fix bugs and (yet unpredictable) incompatibilities?

As an example, the ISA Tool has recently been ‘adopted’ by Wikimedia Sverige, which is now dedicating developer time to make the tool more secure and to add widely requested features (such as internationalization). Similarly, we have started conversations with (mainly Wikimedia) organizations, exploring if we can build a partnership with an organization that will ‘adopt’ further management of OpenRefine’s Wikimedia Commons extension and - in the longer term - the external batch upload tool, if we decide to develop it.

In addition, we are exploring future funding opportunities, both inside and outside the core Wikimedia ecosystem. Our work on media file support also receives a lot of interest from the Wikibase community at large: quite a few managers of external Wikibases are interested in uploading files to their own repositories and using OpenRefine for this purpose. As a member of the Wikibase Stakeholder Group, OpenRefine was recommended to the NFDI4Culture consortium in Germany for additional funding to help with Wikibase-specific development. Results of the funding round will be announced later in February 2022.

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

We received many applications from diverse, talented and skilled candidates for both of our junior developer positions! We have added some of our tips to the Finding programmers for a WMF grant learning pattern and have updated it with the various channels we used.
We have onboarded our junior developers in a short amount of time and - mainly due to their own deep interest and enthusiasm! - this has gone quite well and unexpectedly quickly. It is challenging to help a developer who is not a Wikimedian get familiar with the Wikimedia ecosystem, and we have contributed some tips to the learning pattern Working with developers who are not Wikimedians.

Next steps and opportunities

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points.

We will continue deploying all necessary back-end functionalities in OpenRefine’s core software and its Wikibase extension, in order to enable SDC editing and uploading. A new release of OpenRefine (probably version 3.6.0) will include these features.
As mentioned above, we will start the development of a dedicated Wikimedia Commons / media upload extension for OpenRefine. During the current grant period and with the current budget, we are able to develop the basic technical architecture for this extension. By the end of this grant period, we will release a beta version of this extension, which will include basic functionalities for one important and often-requested feature: allowing Wikimedia Commons users to load a dataset of Commons files into OpenRefine by entering one or several Wikimedia Commons categories.
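
As a rough illustration of the category-loading feature, the following Python sketch prepares a MediaWiki Action API request for the files in a Commons category and turns a response into rows with a file name and M-id. It is a sketch under stated assumptions, not the extension's planned implementation: the helper names are hypothetical, the sample response is invented, and the code only builds the request URL and parses a response rather than contacting the API.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def category_files_params(category, continue_token=None, limit=500):
    """Parameters for listing the files (not subcategories) in one category."""
    if not category.startswith("Category:"):
        category = f"Category:{category}"
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",
        "cmlimit": str(limit),
        "format": "json",
    }
    if continue_token:
        # Large categories are paged; pass the token from the previous response.
        params["cmcontinue"] = continue_token
    return params


def request_url(params):
    """Full GET URL for the given parameters."""
    return f"{API_URL}?{urlencode(params)}"


def rows_from_response(response):
    """Return (rows, next_token); each row has a file title and its M-id."""
    members = response.get("query", {}).get("categorymembers", [])
    rows = [{"file": m["title"], "mid": f"M{m['pageid']}"} for m in members]
    next_token = response.get("continue", {}).get("cmcontinue")
    return rows, next_token


# Invented sample response for illustration.
sample_response = {
    "continue": {"cmcontinue": "file|ABC|123", "continue": "-||"},
    "query": {"categorymembers": [
        {"pageid": 101, "ns": 6, "title": "File:Example one.jpg"},
        {"pageid": 102, "ns": 6, "title": "File:Example two.jpg"},
    ]},
}
print(request_url(category_files_params("Example category")))
print(rows_from_response(sample_response))
```

Because the M-id can be derived from the page ID returned by the API, rows loaded this way would already carry the identifier needed for structured data edits. A caller would repeat the request with the returned token until no token comes back.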

Grantee reflection

We’d love to hear any thoughts you have on how the experience of being a grantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

Active development on this project has been running for a bit more than four months now. In general, working on this project is extremely enjoyable, so it's not easy to pinpoint a single highlight. So here's a short list:

We started this project with just two team members (Sandra and Antonin) and our team has quickly grown. We have been lucky to find and recruit three excellent, smart and very nice colleagues (Eugene, Joey and Lozana). We really enjoy working together and learn a lot from each other!
Our project proposal has received many enthusiastic endorsements and we continue to have positive and inspiring interactions with the Wikimedia community. We do notice people are really eager to see our software 'live'. It is very useful for us to have occasional check-ins with people who act like external 'mentors', especially staff from the Wikimedia Foundation's GLAM team and Wikimedia Sverige, and quite a few individual Wikimedians and external supporters who are always happy to lend us some advice. Thank you!