This Project Grant midpoint report has been submitted by the grantee, and is currently being reviewed by WMF staff. You may add comments, responses, or questions to this report's discussion page.
We established lines of communication with the user community of OpenRefine’s future SDC features.
We deployed a basic Wikimedia Commons reconciliation service, which makes it possible to load lists of Commons file names in OpenRefine and retrieve their Wikitext and structured data for further processing. This activity is nearly done: we are still ironing out some bugs and adding some features to make the service more flexible and user-friendly.
We built the technical foundations for batch SDC editing via OpenRefine by extending its Wikibase extension. This work is now almost halfway done.
We commissioned and started design research: after a set of interviews with typical end users, we are outlining how the SDC functionalities in OpenRefine should work in practice, what the user workflow will look like, and what features will make the workflow more user-friendly. This work is now halfway done.
Based on emerging insights from this user research, we made several key decisions for further software development and prioritization.
How have you set up your project, and what work has been completed so far?
Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.
Project organisation and setup:
We hired and onboarded two junior developers (Eugene Egbe for Wikimedia-specific code and Joey Salazar for OpenRefine-specific code) and engaged a design researcher (Lozana Rossenova).
We made many updates to OpenRefine's Wikibase extension, which now makes it possible to edit entities other than items, including Wikimedia Commons' MediaInfo entities.
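To give an idea of what editing a MediaInfo entity involves at the data level, here is a minimal Python sketch of a 'depicts' statement in the Wikibase JSON format, wrapped in the parameters of a `wbeditentity` API call. It is an illustration only and not the code of OpenRefine's Wikibase extension. P180 (depicts) and Q146 (house cat) are real Wikidata identifiers; the M-id `M12345` and the helper names are hypothetical. The request is only built, never sent; a real edit would also need authentication and a CSRF token.

```python
import json


def depicts_statement(item_id, property_id="P180"):
    """Build one Wikibase statement whose value is an item such as 'Q146'."""
    numeric_id = int(item_id[1:])  # 'Q146' -> 146
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {
                    "entity-type": "item",
                    "numeric-id": numeric_id,
                    "id": item_id,
                },
            },
        },
        "type": "statement",
        "rank": "normal",
    }


def edit_request_params(mid, item_ids):
    """Assemble (but do not send) wbeditentity parameters for one file."""
    data = {"claims": [depicts_statement(item_id) for item_id in item_ids]}
    return {
        "action": "wbeditentity",
        "id": mid,  # hypothetical M-id of the file to edit
        "data": json.dumps(data),
        "format": "json",
    }


params = edit_request_params("M12345", ["Q146"])
print(params["data"])
```

A batch edit repeats this for every row of a dataset, which is why a tool that can preview and schedule such edits is useful.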
Design research and wireframes:
After a series of stakeholder interviews, Lozana Rossenova created workflow diagrams and high-level wireframes for file upload through OpenRefine.
Presentation at WikidataCon 2021 (October 30) about our ongoing work on SDC support in OpenRefine.
We set up a contact list through which we can send MassMessages about our project to interested Wikimedia community members.
We have a monthly recurring meeting with advisors in the Wikimedia Foundation's GLAM team and at Wikimedia Sverige.
As part of the design research by Lozana Rossenova, we conducted in-depth interviews with diverse potential end users of the SDC functionalities in OpenRefine.
What are the results of your project or any experiments you’ve worked on so far?
Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.
We have deployed the following code and software (quite a bit of this is still being 'refined' further):
The Wikimedia Commons Reconciliation Service, which makes it possible to retrieve the identifiers (M-ids) from lists of Wikimedia Commons files in OpenRefine and to retrieve their Wikitext and structured data, so that the user can then perform further operations on this data inside OpenRefine. It is also available for testing at the Reconciliation service test bench.
OpenRefine with a list of reconciled file names, their corresponding M-ids, extracted Wikitext and various structured data statements, retrieved with the Wikimedia Commons Reconciliation Service.
Wikimedia Commons manifest for OpenRefine (can be considered a type of 'settings' file to help OpenRefine work with a specific Wikibase - in this case Wikimedia Commons).
The OpenRefine team was delighted to see the Wikimedia Commons logo appear for the very first time inside OpenRefine on November 24, 2021 (as a result of the successful integration of the Wikimedia Commons manifest).
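To illustrate what the reconciliation step produces, here is a minimal Python sketch of how file names relate to M-ids. It relies on the convention that a MediaInfo identifier is the letter M followed by the numeric page ID of the file page, and it parses a response shaped like the output of the MediaWiki `action=query` API with `formatversion=2`. The helper names and the sample data are hypothetical, and this is not the code of the reconciliation service itself.

```python
import json
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_query_url(file_names):
    """Build a MediaWiki API URL that looks up page IDs for a list of file names."""
    titles = "|".join(
        name if name.startswith("File:") else f"File:{name}" for name in file_names
    )
    params = {
        "action": "query",
        "titles": titles,
        "format": "json",
        "formatversion": "2",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def mids_from_response(response):
    """Map each existing file title to its M-id; missing pages are skipped."""
    pages = response.get("query", {}).get("pages", [])
    return {page["title"]: f"M{page['pageid']}" for page in pages if "pageid" in page}


# Hypothetical sample response: one existing file and one missing file.
SAMPLE_RESPONSE = json.loads("""
{
  "query": {
    "pages": [
      {"pageid": 12345, "ns": 6, "title": "File:Example one.jpg"},
      {"ns": 6, "title": "File:Does not exist.jpg", "missing": true}
    ]
  }
}
""")

print(mids_from_response(SAMPLE_RESPONSE))
```

In OpenRefine the user does not write such code; the reconciliation service performs the matching and returns the M-ids as reconciled cells.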
Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.
Then, answer the following question here: Have you spent your funds according to plan so far? Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.
We have spent our funds as planned so far, with a few small deviations:
We are dedicating $3,000.00 of the 'Additional developer honoraria' budget line to design research work by Lozana Rossenova (this work has not been invoiced yet).
We are also dedicating a small part of the 'Additional developer honoraria' budget line to better internet connectivity for Eugene Egbe (based in Cameroon) and Joey Salazar (based in the Basque Country).
We don't anticipate any major changes to expenditures for our current grant in the second half of this funded project, but as mentioned below under 'Learning', we are applying for a project extension and supplement in order to be able to (1) perform continued design research and (2) develop an additional feature that will improve the experience of Wikimedia Commons file uploads in OpenRefine's Wikimedia Commons extension.
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.
What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.
The goal of our project is simple: we will make it possible for Wikimedia Commons users to batch edit and upload files with structured data with the help of OpenRefine.
However, as we started working on (mainly back-end) software enhancements and as we had many conversations with stakeholders and potential end users, we became a bit overwhelmed with the complexity of the workflow that we want to accommodate. Also, we were confronted with quite a few questions that we hadn’t yet deeply thought through while drafting our project plan. Examples include:
How to deal with Wikitext in OpenRefine? Wikitext descriptions (infoboxes, license templates…) in Wikimedia Commons are notoriously complex. Do we want to help end users process (split/parse) this Wikitext into manageable chunks, and if so, how will we accommodate this? (Phabricator ticket)
How can we allow Wikimedia Commons users to start their batch editing process from Wikimedia Commons categories (rather than a list of files generated by an external tool)? Such a feature is not natively supported by OpenRefine but will significantly improve the end users’ workflows; without this feature inside OpenRefine, they would need to turn to other, rather complex external tools (like PetScan or the Wikidata or Commons SPARQL endpoints) to retrieve a list of file names for batch editing.
The Wikimedia Commons batch upload tool Pattypan provides the end user with strong guidance: when starting an upload process, the user can choose a Wikimedia Commons template (for instance Artwork, Photograph, Book) that informs them about the metadata that is needed for a Commons upload. Do we want to enable a similar functionality in OpenRefine, and if so, how can we integrate this in the software (which is actually designed for more freeform, less guided data manipulation)?
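The Wikitext question in the list above can be made concrete with a small Python sketch. It splits one template call into its top-level parameters while leaving nested templates and links intact. This only illustrates the problem and is not the approach chosen for OpenRefine. The function name and the sample Wikitext are hypothetical, and real Commons Wikitext has many cases (comments, nowiki sections, positional parameters that contain '=') that this sketch does not handle.

```python
def split_template(wikitext):
    """Split a single {{...}} template call into (name, parameters)."""
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("expected a single {{...}} template call")
    inner = text[2:-2]

    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        pair = inner[i:i + 2]
        if pair in ("{{", "[["):
            # Entering a nested template or link: pipes inside are not separators.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif inner[i] == "|" and depth == 0:
            # A top-level pipe ends the current parameter.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))

    name = parts[0].strip()
    params, positional = {}, 0
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip()] = value.strip()
        else:
            # Unnamed parameters get the keys "1", "2", ... as in MediaWiki.
            positional += 1
            params[str(positional)] = part.strip()
    return name, params


SAMPLE = """{{Information
|description={{en|1=A cat}}
|date=2021-12-20
|author=[[User:Example|Example]]
}}"""

name, params = split_template(SAMPLE)
print(name, params)
```

Each parameter could then become its own column in OpenRefine, which is the kind of 'manageable chunk' the question refers to.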
Design research has informed several key decisions
In order to help us with well-informed decision making around these new challenges, we have engaged a design researcher starting from November 2021, who helps us to understand user workflows and helps us decide how we can accommodate these in OpenRefine. These insights have supported us in several architectural decisions, and they help us to prioritize our work in a much more informed way. As a result, we have so far made the following major architectural decisions:
We will deploy various Wikimedia Commons-specific features for OpenRefine (such as loading a dataset from Wikimedia Commons categories, parsing infobox Wikitext, and showing Commons-specific previews of files before edit or upload) in a separate new piece of software: a Wikimedia Commons / media file extension for OpenRefine. This decision will make it possible for us to create dedicated user interfaces for such workflow steps. We want to build this extension in such a way that, in a future iteration, it will also be able to support batch file editing and batch file upload for arbitrary Wikibases. The table below gives an overview of the features that will be enabled by OpenRefine itself and/or by the Wikimedia Commons extension:
| Workflow step | If the user has ... installed, they are able to ... | OpenRefine without Commons extension (via this grant) | OpenRefine with Commons extension - beta version (via this grant; mid 2022) | OpenRefine with Commons extension - v1.0 release (2023-ish?) | Remarks |
|---|---|---|---|---|---|
| 01 🌟 Project creation | Let OpenRefine import a data sheet (any OR-supported format) with file paths from Wikimedia Commons | Yes | Yes | Yes | |
| 02 🚿 Data cleaning | Clean and improve a spreadsheet or dataset with file metadata; download dataset with modified or uploaded file links from Commons, with their file names, M-ids, and metadata | Yes | Yes | Yes | |
| 01 🌟 Project creation | Provide OpenRefine with one or multiple Wikimedia Commons categories; OpenRefine then loads the (reconciled) file paths of all the files in these categories | No | Yes | Yes | |
| 10 🔺 Upload files | Upload new files to Wikimedia Commons | No | Yes | Yes | We may also decide to implement file upload in OpenRefine itself. It will certainly be possible as part of the Commons extension. |
| 04 ➡️ Reconciliation / data extension | Retrieve Wikitext, split and cleaned per parameter and template, from existing Commons files which are reconciled in OpenRefine | No | Possibly? | Yes | |
| 07 👀 Check/preview/test upload | See an example preview of the structured data that will be (batch) added to newly uploaded Commons files | No | Possibly? | Yes | |
| 07 👀 Check/preview/test upload | See an example preview of the Wikitext (generated) infobox of files uploaded in batch | No | Possibly? | Yes | |
| 02 🚿 Data cleaning | Directly see thumbnails of media files while cleaning and editing their metadata | No | No | Yes | OpenRefine is very data-centric and does not natively support (direct) preview of thumbnails of files in its data operation screen. In a next version of the Commons extension we can, for instance, introduce a media-centric editing and viewing screen that allows end users to batch edit metadata of media files in a more visually oriented way. |
| 05 📌 Metadata mapping | Map the user's dataset with a preset template / checklist of fields from Wikimedia Commons (e.g. Artwork, Book, Map...) | No | No | Yes | |
| 01 🌟 Project creation | IIIF support (retrieve and process images/files hosted on a IIIF service) | No | No | Possibly? | We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research. |
| 09 📚 Batch Wikitext edit | Edit Wikitext of existing Commons files | No | No | Possibly? | Wikimedia Commons editing and upload functionalities in OpenRefine focus on Structured Data first and foremost. |
| 10 🔺 Upload files | Upload new files to an arbitrary Wikibase | No | No | Possibly? | We're aware of this potential use case; it can be developed after deployment of core functionalities and after more research. Members of the Wikibase stakeholder group have already indicated interest. |
In our grant application, we mentioned (as an option) the possible development of an external, generic batch upload tool for Wikimedia Commons files, which would provide an alternative to QuickStatements. A major benefit of such a tool is that end users could perform very large batch edits and uploads ‘in the background / in the cloud’ rather than via OpenRefine directly (for which they would need to keep their computer and OpenRefine running for very long periods of time). However, handling very large batches of files to be uploaded (potentially several gigabytes at once, with varying file sizes and formats) poses a new challenge for such a tool. Therefore, we have decided to dedicate all our currently available development resources to deploying batch upload properly in OpenRefine itself. We will decide on the creation of a generic, separate batch upload tool after OpenRefine’s own file upload functionalities have been properly tested, based on what we learn from this process.
As we go forward, we would like to continue engaging a designer throughout the entire process. As mentioned above, insights from user interviews and design research have been invaluable in helping us to make informed decisions about ongoing development; we want to continue engaging design research to support our team for a few hours per week. In addition, we have noticed that a designer’s input will be very welcome to help us create updated or new user-facing interface elements (dialog windows/screens).
We have noticed that technical development is taking a bit more time than we originally anticipated. This is partly due to the complexity of the code that our developers need to become acquainted with.
Basic features of the Wikimedia Commons Reconciliation Service are ready, but more fine-tuning is needed to make the service truly user-friendly. We have decided to dedicate Wikimedia-specific developer time in December 2021 - February 2022 fully towards:
Fine-tuning, bug fixes, and adding more flexibility to the Wikimedia Commons reconciliation service
Design enhancements and Wikimedia Commons-specific UX improvements to various dialog windows in the core OpenRefine workflow and the Wikimedia Commons extension.
General OpenRefine development is slightly behind schedule. A first experimental edit of structured data on Wikimedia Commons was made on December 20, 2021; we will probably deploy a new OpenRefine version with full SDC editing support somewhat later than originally planned (end of January 2022). As we are also planning to deploy a new OpenRefine extension, it will be helpful if we can extend OpenRefine development work by several months.
On a more meta level, we are also thinking about the longer-term sustainability of the tools we are developing through this grant. Recently, the GLAM-Wiki community has been strongly affected by unreliability and failures of two tools which have become very important for content partnerships: Pattypan and the ISA Tool. We expect that OpenRefine’s batch editing and upload functionalities will become a key technical environment upon which many Wikimedia volunteers and affiliates will rely for content partnerships. How can we make sure that the software we’re developing now will stay maintained for a longer period of time, and that we have the resources we need to fix bugs and (yet unpredictable) incompatibilities?
As an example, the ISA Tool has recently been ‘adopted’ by Wikimedia Sverige, which is now dedicating developer time to make the tool more secure and to add widely requested features (such as internationalization). Similarly, we have started conversations with (mainly Wikimedia) organizations, exploring if we can build a partnership with an organization that will ‘adopt’ further management of OpenRefine’s Wikimedia Commons extension and - in the longer term - the external batch upload tool, if we decide to develop it.
In addition, we are also exploring future funding opportunities, both inside and outside the core Wikimedia ecosystem. Our work on media file support also receives a lot of interest from the Wikibase community at large: quite a few managers of external Wikibases are interested in uploading files to their own repositories and using OpenRefine for this purpose. As a project member of the Wikibase Stakeholder Group, OpenRefine was recommended to the NFDI4Culture consortium in Germany for additional funding to help with Wikibase-specific development. Results of the funding round will be announced later in February 2022.
What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
We received many applications from very diverse, talented and skilled candidates for both our junior developer positions! We have added some of our tips to the Finding programmers for a WMF grant learning pattern and have updated it with various channels we have used.
We have onboarded our junior developers in a short amount of time and - mainly due to their own deep interest and enthusiasm! - this has gone quite well and unexpectedly quickly. It is challenging to help a developer who is not a Wikimedian get familiar with the Wikimedia ecosystem, and we have contributed some tips to the learning pattern Working with developers who are not Wikimedians.
What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points.
We will continue deploying all necessary back-end functionalities in OpenRefine’s core software and its Wikibase extension, in order to enable SDC editing and uploading. A new release of OpenRefine (probably version 3.6.0) will include these features.
As mentioned above, we will start the development of a dedicated Wikimedia Commons / media upload extension for OpenRefine. During the current grant period and with the current budget, we are able to develop the basic technical architecture for this extension. By the end of this grant period, we will release a beta version of this extension, which will include basic functionalities for one important and often-requested feature: allowing Wikimedia Commons users to load a dataset of Commons files into OpenRefine by entering one or several Wikimedia Commons categories.
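As an illustration of the category-based loading described above, here is a minimal Python sketch that lists the files in one Commons category with the MediaWiki `list=categorymembers` query module and follows the API's continuation tokens. It is not the extension's implementation. The `fetch` argument is a hypothetical stand-in for any HTTP client that sends the parameters to the Commons API and returns the parsed JSON, and the function names are assumptions made for this example.

```python
def category_files_params(category, continue_token=None, limit=500):
    """Build query parameters that list the files in one Commons category."""
    title = category if category.startswith("Category:") else f"Category:{category}"
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": title,
        "cmtype": "file",  # only files, no subcategories or other pages
        "cmlimit": str(limit),
        "format": "json",
        "formatversion": "2",
    }
    if continue_token:
        params["cmcontinue"] = continue_token
    return params


def list_category_files(category, fetch):
    """Collect file titles and M-ids, requesting further pages until done.

    `fetch` is a caller-supplied function: parameters in, parsed JSON out.
    """
    files, token = [], None
    while True:
        response = fetch(category_files_params(category, token))
        for member in response["query"]["categorymembers"]:
            files.append({"title": member["title"], "mid": f"M{member['pageid']}"})
        token = response.get("continue", {}).get("cmcontinue")
        if not token:
            return files
```

Passing the HTTP call in as a function keeps the sketch testable without network access; the resulting list of titles and M-ids is what would be loaded into an OpenRefine project as rows.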
We’d love to hear any thoughts you have on how the experience of being a grantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?
Active development on this project has been running for a bit more than four months now. In general, working on this project is extremely enjoyable, so it's not easy to pinpoint a single highlight. So here's a short list:
We started this project with just two team members (Sandra and Antonin) and our team has quickly grown. We have been lucky to find and recruit three excellent, smart and very nice colleagues (Eugene, Joey and Lozana). We really enjoy working together and learn a lot from each other!
Our project proposal has received many enthusiastic endorsements and we continue to have positive and inspiring interactions with the Wikimedia community. We do notice people are really eager to see our software 'live'. It is very useful for us to have occasional check-ins with people who act like external 'mentors', especially staff from the Wikimedia Foundation's GLAM team and Wikimedia Sverige, and quite a few individual Wikimedians and external supporters who are always happy to lend us some advice. Thank you!