Metabase for chapter data

White paper on Metabase

After the initial setup was done, with Metabase up and running, we were ready to start filling it with content. We elected to start with something we are familiar with – the internal knowledge base of Wikimedia Sverige. In this study, we will describe how we went about it, with a focus on the process, so that other affiliates considering using Metabase for this purpose know what to expect.

Background

Starting with our own data was a given, as we had a good understanding of its structure. Our chapter wiki, where it all lives, is our central information repository. We use it to document everything related to our operations – our projects and their constituent parts, like batch uploads and activities, together with the metrics that we report to WMF and partners; grant applications and project reports; annual plans and reports; board meeting minutes; policies and employee guidelines. The information collected over the years makes it possible for others to learn about our work (we are all about transparency in the Wikimedia world, after all, especially considering that we are a non-profit association and rely heavily on the support from our members and the general public), but at the same time it's also a reference for ourselves. We use categories and templates to structure the information, but it's not exactly a secret that it's not always easy, even for us, to quickly find what we're looking for, especially if we want to aggregate data or search across different projects. That's why turning some of the information into linked data was an appealing experiment.

Our staff members contribute to and read the wiki every day, using it in different ways depending on their roles. With that in mind, we started out by compiling a set of personas, based on real staff members, to guide us in our work developing the knowledge base. We quickly realized that we – the few individuals working on the Metabase project – do not have the same skills and needs as our coworkers, and that we should make it as easy as possible for them to understand what Metabase is and isn't, and how they could use it in their work. Throughout the project, we have kept them updated on our work, its purpose and major milestones. We also had a couple of sessions where everyone could contribute data relevant to their expertise – create an item or two for a report they authored or an edit-a-thon they organized. In this way, we ensured there's a good understanding of why we, as an organization, are investing resources in building Metabase; at the same time, by watching our colleagues and listening to their feedback, we got a (slightly!) better idea of the needs of different sorts of users.

Method

As stated in Setting up a Wikibase from scratch, the initial work of setting up Metabase concluded with building a basic ontology. A number of general-use properties were created, such as sameAs on Wikidata. We also discussed, and settled on, the basic guidelines for how issues such as data duplication and compatibility with Wikidata and personal information should be tackled. These apply to all the data in Metabase, regardless of its source or subject matter. As we entered the next stage and started actually thinking about the content, we decided that the best way of dealing with the terror of an empty database was to let the modeling grow together with the content. That is, we started thinking about what we wanted to input and then created the properties and modeling guidelines that made the most sense.

In practical terms, we had one staff member dedicated to data input, and two others – who had set up the basic structure of Metabase – who provided support and served as a sounding board. The three of us met at regular intervals to go through the recent developments and discuss any challenges. In particular, in the early stages of our work, we decided that no new property should be created until the three of us had reached consensus that it was needed and on how it should be used. We wanted to keep the modeling as simple as possible, and avoid creating possibly unnecessary properties with too specific purposes. As the content grew, we documented our decisions in a shared internal document; these notes were later cleaned up and became the foundation for the on-wiki modeling guidelines in Metabase.

The iterative process showed there's a lot of difference between planning and doing the actual work. We did not know in advance – indeed we could hardly imagine – how much time it would take and what problems we would encounter. Lots of questions appeared during the work that had to be discussed, especially in areas where there were no Wikidata examples of structures to copy.

The WMSE data

Every affiliate has their own way of managing and administering their work. In Wikimedia Sverige, we use the following structure:

Thematic areas. For the last few years they have been the following four: Enabling, Access, Community, Use.
Projects. Every project belongs to one of the four thematic areas. Every project has a time frame, a manager and a financing source.
- Projects funded by an operational grant (WCF) from the Wikimedia Foundation. The entire funding for the project is received at the beginning of the year. The projects run on a calendar year basis, and therefore the year is included in the project name.
- Externally funded projects. A broad category of projects, tied to a specific application submitted to a funder in terms of form, budget, and time. The projects often span calendar years and therefore the year is not included in the project name.
- Time-based projects. These are commissioned by another party, usually by committing to deliver a certain number of work hours. The assignment can either be directly for the commissioning party or as a subcontractor in a project application.
Activities. Every activity, such as a batch upload, an edit-a-thon or a presentation given at a conference, belongs to a specific project that finances it, and the activity contributes to fulfilling the project goals.

Results

Projects

Projects are at the center of our work – everything we do is part of a project. That's why it made sense to start with this concept; our Q1 was the Content partnerships support project. After creating it, we started thinking about what we wanted to be able to say about it, and created the necessary properties.

Some of them were obvious, and had equivalents in Wikidata: instance of, start time, end time etc. But we also wanted to reflect how our work is actually structured. The following properties and structures were needed to achieve this:

WMSE properties

sorted under programmatic area – to show which of the four main themes of our work the project belongs to.
WMSE project ID – a unique administrative ID that we assign to every project.

WMSE work structure

Some of our projects are aimed at editing or adding content to the Wikimedia platforms, such as by organizing edit-a-thons or uploading files to Wikimedia Commons. This can be indicated by using the property Wikimedia platform(s) affected on the project item.

Every project has a dedicated page in the Projekt namespace in the WMSE wiki, e.g. Projekt:Stöd för innehållspartnerskap. We use the property described at Wikimedia-page to link to it.

Since every activity that we do is part of a project, we use the properties has part(s) / part of to link them together.

Financing – an open problem

One aspect of our chapter's work that we would like to be able to model is our budget and how our work is financed. This includes information such as:

Grant applications – funder, grant, requested amounts, co-financing, applicants; including both successful and unsuccessful applications.
Funded projects and events – their financial aspects, including final costs and sources of funding.
Our overall yearly finances – budget size, final costs/income/result, donations.

We have had a number of internal discussions about how to best approach this topic. Ideally, we would like to re-use any existing Wikidata structures and best practices. Wikidata has some modeling around research grants, but the focus is primarily on the grants and the funded projects rather than the applications themselves (which are central to WMSE's way of working). We have identified a couple of Wikidata properties that could be of interest:

After doing all this research, we ended up not modeling anything related to economy or financing in Metabase. The reason for this is that we do not want to accidentally build structures that are perfect for our own needs but not useful for other affiliates. As with everything else, we want the ontology to be flexible and easy to use for other affiliates. In order to achieve this, it is crucial that we get input and feedback from them – how they work today and what their needs are. We brought up this topic at the Wikidata Modeling Days in the fall of 2023 (YouTube recording, slides) and we are going to do it again during Wikimania in 2024, providing an opportunity for other affiliates to chip in and discuss the necessary structures together.

Content uploads

Content uploads – of media files to Wikimedia Commons or data to Wikidata – are among the most important parts of our, and many other affiliates', content partnership work. At the same time, there has been no need to model them in Wikidata, and thus there are no best practices to learn from; we had to model them from scratch. Because of this, they deserve some extra attention here. We will use the item 2021 images in 2021 ‒ upload from the Röhsska Museum as an example.

We decided we wanted to express the following things about a content upload:

When it was done.
Which of WMSE's projects it belonged to – to make it possible to search for project-specific information, or compile metrics for each of our projects.
Which Wikimedia platform it was done on.
What types of resources were uploaded. This is especially pertinent to media uploads, as Wikimedia Commons can house images, audio files, videos, PDF files etc. In Wikidata, we can work with items or lexemes, so we should also be able to keep those apart.
How many entities were uploaded. That's items or lexemes in Wikidata, or files in Wikimedia Commons.
Where the uploaded resources came from. We want to be able to say how many files a particular museum contributed.
Commons category. So that we can easily locate the uploaded files!
Link to relevant page in WMSE's wiki. Our wiki is where we report metrics for all of our uploads.

After some discussions and trial and error we settled on the following properties and structures:

number of affected content pages – how many content pages were created or improved. The details are described with qualifiers:
- namespace – which namespace was affected.
- Wikimedia platform(s) affected – in which Wikimedia project.
- resource type – with what type of resources. Resource types are, in the case of Wikimedia Commons, media types like audio, video or image file.

The reason why we use both namespace and resource type is that the File namespace in Wikimedia Commons can host different types of files: audio recordings, photos, videos, etc. In Wikidata, items and lexemes belong to different namespaces, even though they are both linked data. Sometimes it’s useful to know the breakdown – not only how many entities were uploaded, but also what kinds.

Note that this modeling can accommodate uploads that affect multiple Wikimedia platforms. For example, we have done projects where we upload a number of digitized paintings to Wikimedia Commons and at the same time create Wikidata items for them. Since we do it simultaneously, or at least within the same conceptual space (we don't think of them as separate projects – it's obvious that paintings should have Wikidata items), it doesn't make sense to separate them. See WiR upload, Swedish Performing Arts Agency as an example of a content upload which encompassed several Wikimedia projects and resource types.
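To make the structure concrete, here is a minimal Python sketch of how statements of this kind could be represented and summed up per platform or per resource type. It is not how Metabase stores the data internally: the class name, the function name and all the figures are invented for illustration, and only the property labels (number of affected content pages, namespace, Wikimedia platform(s) affected, resource type) come from the modeling described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectedPages:
    """One 'number of affected content pages' statement with its qualifiers."""
    count: int          # main value: pages created or improved
    platform: str       # qualifier: Wikimedia platform(s) affected
    namespace: str      # qualifier: namespace
    resource_type: str  # qualifier: resource type


def totals_by(statements, key):
    """Sum the page counts, grouped by one qualifier (e.g. 'platform')."""
    totals = defaultdict(int)
    for statement in statements:
        totals[getattr(statement, key)] += statement.count
    return dict(totals)


# A hypothetical upload that touches two platforms; the numbers are invented.
upload = [
    AffectedPages(120, "Wikimedia Commons", "File", "image file"),
    AffectedPages(8, "Wikimedia Commons", "File", "audio"),
    AffectedPages(120, "Wikidata", "Main", "item"),
]

print(totals_by(upload, "platform"))
print(totals_by(upload, "resource_type"))
```

Keeping one statement per combination of platform, namespace and resource type is what allows both breakdowns to be computed from the same data.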

Documents

Wikimedia Sverige, like any organization, produces a lot of documents, for both internal and external use. As we mentioned previously, we want to keep Metabase as much in sync with Wikidata as possible, and that includes the document types. We thought it would be simple, but it turned out that it was not easy to find Wikidata equivalents for all of them. For example, we often write or contribute to remissvar – a document that we send to the parliament where we give feedback on some new law they want to make. Wikidata has an item for consultation, which is the process of the parliament asking other actors for feedback on proposed laws, but not for the document – a consultation reply? – itself.

We will not be surprised if other affiliates encounter similar challenges when mapping the document types they work with to Wikidata. There will be local differences in how document types are defined. An obvious solution is correcting or developing the data in Wikidata, which would benefit everyone, not only Metabase users. We decided not to do that as part of our experimental work, as it would require both time and domain knowledge (and as all Wikidata editors know, it's easy to fall into a rabbit hole of editing just one more thing, until you lose track of what you were doing in the first place); besides, by not editing data in Wikidata to better serve Metabase, we make it clear that such problems can arise. We imagine that Metabase editors who are not constrained by time might feel motivated to improve the data in Wikidata while they're at it, depending on their expertise, or at least bring up the problems in relevant editor communities.

Conclusions

Metabase – a supplement or replacement of the current structures?

One big question that remains unanswered is to what extent Metabase can replace or complement our current structures. While we can only answer it for ourselves, other affiliates will be making similar considerations. Every affiliate's needs are different.

In the case of WMSE, our wiki is a central part of our workflow. We use it to store not only information that could easily become linked open data, like the metrics about our activities that we report to our funders, but also a large number of documents, like policies and guidelines, that need to be easy to access and browse. This information goes back many years, with the wiki serving the function of an archive as employees come and go. Moving it to a different platform, like the non-data namespace in Metabase, would require a lot of work, so it's not something we want to do for the sake of it – the benefits need to outweigh the trouble. With such a long history behind it, getting rid of our wiki and replacing it with a different platform is not realistic. Sticking to the workflow we have been successfully using for years also has its benefits from a historical perspective, making it easier for future users to orient themselves in our knowledge repository.

We imagine that some sort of split would be possible, where the data that could be stored as linked data lives in Metabase and is displayed in the WMSE wiki (displayed is doing a lot of heavy lifting here, we'll get to it in a moment). Right now, as we are only experimenting, there is no obligation for the staff to report their activities in Metabase – the WMSE wiki is still where they belong. What should we do in the future?

It's impossible to think about the future applications of Metabase for our internal data without looking into the question of whether Metabase should become the only place for the data. Should it be still reported in the WMSE wiki as well, doubling the workload and increasing the risk for mismatches? Taking into account the fact that many pages in our wiki are highly structured – we report the metrics from our work using templates and predefined page layouts, making sure they always look the same regardless of who inputs them – it makes sense to think about possible ways of automating the flow of data from the wiki to Metabase. A bot could transfer the data, keep it updated when corrections are made in the wiki, and notify us of any mismatches caused by manual edits in Metabase. Of course, this is only possible thanks to the rigid structure of our metric reporting infrastructure – every affiliate will have their own challenges and considerations.
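The mismatch check that such a bot would need can be sketched in a few lines of Python. This is only an illustration of the idea, not an existing tool: the function name, the way metrics are keyed (activity name plus metric name) and the sample figures are all assumptions, and a real bot would first have to fetch the values from the wiki templates and from Metabase.

```python
def find_mismatches(wiki_metrics, metabase_metrics):
    """Return {key: (wiki_value, metabase_value)} for every differing key.

    A value of None means the key is missing on that side.
    """
    mismatches = {}
    for key in sorted(set(wiki_metrics) | set(metabase_metrics)):
        wiki_value = wiki_metrics.get(key)
        metabase_value = metabase_metrics.get(key)
        if wiki_value != metabase_value:
            mismatches[key] = (wiki_value, metabase_value)
    return mismatches


# Hypothetical metrics, keyed by (activity, metric name); the numbers are invented.
wiki = {
    ("Example upload", "files"): 120,
    ("Example edit-a-thon", "participants"): 14,
}
metabase = {
    ("Example upload", "files"): 118,  # e.g. changed by a manual edit
}

for key, (wiki_value, metabase_value) in find_mismatches(wiki, metabase).items():
    print(key, "wiki:", wiki_value, "Metabase:", metabase_value)
```

Treating the wiki as the source of truth, the bot could then either overwrite the Metabase value or, as suggested above, simply notify a human about the difference.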

Right now, however, this is merely a thought experiment. There is no simple way of displaying the content of a Wikibase Cloud wiki in a Wikimedia wiki. There are suitable solutions for Wikidata, where data can be transcluded in other Wikimedia wikis using either templates, like Template:Infobox person/Wikidata and its localized equivalents, or Listeria. Listeria in particular is a very interesting tool from the point of view of Wikibase Cloud users and administrators. It's used widely across the Wikimedia platforms, so if it were adapted to ingest data from Wikibase Cloud, the threshold for users to start using it would be low. It could become a key to facilitating the flow of data from Wikibase Cloud instances to Wikimedia wikis, making the platform more useful and relevant to new users.

We are aware that we cannot ask the staff to use Metabase if they haven't been properly trained. While most have at least basic Wikidata skills, they are far from experienced Wikidatans, and it is Wikidata experience that significantly lowers the threshold for understanding and using Metabase. While we did have a basic introduction to the project for the entire staff – they are aware why the platform is being developed and have been encouraged to try editing it – we know that it would require time and effort for them to feel comfortable with using it as part of their daily routine, as is the case with any new digital tool.

Getting started as an organization

Based on our work with Wikimedia Sverige's data, we can make some generalizations about the challenges an affiliate might face when embarking on a similar journey.

First of all, one should start with examining the data available. What sort of information has been written down in the first place, and how far does it go? This will determine what is possible to achieve in Metabase. It will be helpful to look at how similar items, like presentations or events, that are already in Metabase, are modeled, and reflect on how much of these structures can be reused. It is possible that there will be a need for affiliate-specific properties and structures, just like WMSE needed the WMSE project ID and the thematic areas. The strength of Metabase, as compared to Wikidata, is that it enables us to experiment and easily create the properties we need even if they're not of interest to outside users.

Then, a workflow for the actual data input stage will need to be developed. Who should do it, and what resources do they have at their disposal? If they are already familiar with Wikidata, it will greatly lower the difficulty of their work. Doubly so if they are able to use QuickStatements – it works in Metabase! – and know how to parse the source data (e.g. for extracting information from the wikitext in the affiliate's wiki). Keep in mind that even if there's a dedicated data input person, they might need support from other staff members in order to tap into the organizational knowledge that's not explicitly stated in the source material.
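As a rough illustration of what parsing the source data and preparing it for QuickStatements can look like, here is a small Python sketch. Everything specific in it is an assumption: the template name and its parameters are invented, the property ID P999 is a placeholder that has to be replaced with the real ID from the target Wikibase, and the tab-separated command layout should be checked against the QuickStatements documentation before anything is run.

```python
import re

# Hypothetical wikitext, as it might appear on a page in an affiliate's wiki.
WIKITEXT = """{{Upload report
| title = Example museum upload
| files = 120
}}"""

# Placeholder property ID; look up the real one in the target Wikibase.
P_AFFECTED_PAGES = "P999"


def template_params(wikitext):
    """Extract simple '| name = value' parameters from one flat template."""
    pairs = re.findall(r"^\|\s*(\w+)\s*=\s*(.*?)\s*$", wikitext, flags=re.MULTILINE)
    return dict(pairs)


def to_quickstatements(params):
    """Build tab-separated QuickStatements commands that create one item."""
    return "\n".join([
        "CREATE",
        'LAST\tLen\t"{}"'.format(params["title"]),  # English label
        "LAST\t{}\t{}".format(P_AFFECTED_PAGES, int(params["files"])),
    ])


params = template_params(WIKITEXT)
print(to_quickstatements(params))
```

The regular expression only handles flat templates with one parameter per line, and labels containing double quotes would need escaping; nested templates or free-text notes call for a proper wikitext parser and, most likely, a human reading the page.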

In fact, one challenge we encountered when copying data from the chapter's wiki was that the information was not always clear. Notes about an event or an activity could be chaotic and hard to understand for people who were not involved in the event themselves, and crucial pieces of information – for example, the topic of a presentation – could be difficult to locate. As we all know, when you document an event you organized yourself, it can happen that you omit things that are obvious to you – but will not be obvious to your colleagues (or you…) going through your notes five years later! In some cases, we were able to ask the staff member who had organized the event for clarification, but we cannot assume this will always be the case. In this way, the Metabase experiment provided an extra lesson: we should be more mindful when writing things down in our wiki, so that they can be understood later.

Documentation is crucial. In our case, documenting our modeling choices had two purposes. First of all, of course, we wanted to make it possible for future Metabase contributors to understand the modeling and use (or discuss and change!) it. But we quickly became aware that the documentation was just as important for ourselves. Human memory is unreliable, and it doesn't take much time to accidentally deviate from the modeling guidelines we devised ourselves.

The place of Metabase in an organization's workflow

With that in mind, an affiliate interested in including Metabase in their daily work should carefully consider the prerequisites and consequences. Who should contribute data to the platform, how often, and what role should Metabase have on an organizational level?

While it would be beneficial for everyone in the organization to understand what Metabase is and how it can be used, it might not be necessary for everyone to be required to contribute to it. After all, everyone in an organization has their role and responsibility – like running edit-a-thons or developing financial plans – and they should be able to focus on their core tasks with maximum productivity without having to learn yet another reporting tool. People are different; some feel most comfortable working with linked data, others find contributing to a wiki or writing text documents easier. Furthermore, when people who are not comfortable with linked data contribute to the data collection, the quality of the data can vary significantly. This means more time and resources must be devoted to cleaning and verifying the data, which is inefficient and costly.

Having a staff member dedicated to data input could provide a solution to this problem. A large chapter that generates a lot of data and knowledge material which needs to be cataloged in Metabase might be able to hire someone for this task, but it might not be financially realistic in a high cost of living area. Alternatively, a person – a staff member or a volunteer – could provide their Metabase skills as a service to several affiliates, on behalf of a thematic or regional hub. Thanks to working in Metabase on a regular basis, they would have a good overview of the structures and modeling, making sure that the data quality is consistent between the sources – something that is difficult for someone who contributes every now and then, and only with data relevant to their particular expertise. With such a centralized structure and accountability, the risk that valuable data (such as information about events run by volunteers or capacity building material not produced by affiliate staff) is overlooked is smaller. Furthermore, this person could provide guidance and support to Metabase users, such as developing SPARQL queries to extract and analyze specific data, as well as develop instructional materials and spread the knowledge about the platform at conferences and other events.

There's also the question of sustainability. Inputting historical data can be a one-off project, but once that's done, one has to decide how often it should be updated. At WMSE, we aim at adding metrics to the wiki as soon as possible, both to keep the members of the association updated and to make sure nothing is forgotten when it's time to report our results to WMF, our partners and other sponsors. But what benefits would updating Metabase at the same time have? Of course it depends on the role we want the platform to have in our work. If it's just a searchable knowledge repository, rather than a crucial part of our daily operations, it could make sense to only update it periodically. That data input could be done by a dedicated staff member, familiar with the platform, rather than by everyone – which would reduce the mental workload of their colleagues, not forced to learn yet another reporting platform.

In conclusion, while an affiliate that wants to start actively using and contributing to Metabase might be apprehensive about the necessary investment of time and resources, there are ways to reduce the requirements on a single organization. While all staff members can be trained on what Metabase is, and encouraged to contribute to it if they develop an interest, there are alternative solutions to data input that do not require large investments. If we want Metabase to become a hub for the Movement's knowledge, it would make sense to establish a centralized service for data input and maintenance that affiliates and volunteers can rely on, without requiring investments from affiliates in under-represented communities who have fewer resources.