Content Partnerships Hub

Improving the Wikimedia movement’s work with content partners

Setting up a Wikibase from scratch

White paper on Metabase

In this case study, we describe the first steps we took with Metabase, after agreeing on its vision and goals – from launching the Wikibase Cloud instance to establishing a basic ontology and the challenges that came with it.

Initial setup

At the time we started our initiative (October 2022), Wikibase Cloud was in closed beta, requiring us to apply for early access. This step is no longer necessary, as the platform has opened up for everyone.

The initial set-up was straightforward. We made some observations that might be valuable for others embarking on a similar project:

The name of the wiki and the username of the first user have to be given in the first step, and cannot be easily changed later.
In the first step, we can choose whether the wiki should be open for everyone to create a user account in. We decided to keep it open, to make it possible for interested community members to join us early on.
Support for Lexemes can be enabled.
We can choose whether we want to map any properties and items to Wikidata. We chose not to do that, since at that stage we were still unsure to what extent we would need it. This step can also be done later, which we ended up testing with a couple of properties and items, but it's unclear what sort of benefits it provides.

Legal considerations

There are some legal considerations stemming from the fact that Wikibase Cloud is hosted on a server in Germany. Like all websites hosted in Germany, a Wikibase Cloud instance is legally required to have an Imprint, stating who owns and manages it, so we created one in Metabase.

What and how

After the Wikibase was up and running, it was time to start filling it with data. The knowledgebase being completely empty gave us full flexibility in how we wanted to start. On the other hand, it also meant we had to devise the entire ontology from scratch – we had to decide what the root items and the first properties would be, something that we don't normally think about when using Wikidata.

Our goal was to create a data model that is flexible and easy for everyone to understand, but also as compatible with Wikidata as possible. In the background section, we elaborate upon the relationship between Metabase and Wikidata – while, obviously, our platform has some specific needs, we do not want to make it completely different from Wikidata, keeping in mind both user-friendliness and data compatibility (federated SPARQL queries, possible export/import of data).

Since the goal of Metabase is to contain only a tiny fraction of the world's knowledge, our ontology did not have to be as nuanced as Wikidata's. Thanks to the magic of federation, when writing SPARQL, we can always fall back onto Wikidata to use all the knowledge collected there. Our guiding principles were as follows:

We differentiate between content items and support items.
- Content items are the actual entities Metabase focuses on, like documents, slide decks, meet-ups etc.
- Support items are everything else that's needed to model them and that exists in Wikidata, like geographical locations, topics, types of events and documents etc.
Content items, whether or not they exist in Wikidata, are first-class citizens of Metabase.
Support items should include as little information as possible, ideally only a link to Wikidata. We neither need, nor want, the geographical coordinates of Stockholm or links to all the encyclopedias that describe the concept of conference.
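The fallback onto Wikidata described above can be sketched as a federated query. This is a minimal illustration, not a query taken from Metabase: `P_SAMEAS` is a placeholder for the actual ID of the sameAs on Wikidata property, and the sketch assumes that the property stores a plain Q ID string (if it stores a full entity URI instead, the `BIND` line is not needed). `wdt:P625` is Wikidata's coordinate location property.

```sparql
PREFIX wb: <https://metabase.wikibase.cloud/entity/>
PREFIX wbt: <https://metabase.wikibase.cloud/prop/direct/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# P_SAMEAS is a placeholder: replace it with the real ID of "sameAs on Wikidata".
SELECT ?supportItem ?wikidataItem ?coordinates WHERE {
  # Local support item that only carries a pointer to Wikidata
  ?supportItem wbt:P_SAMEAS ?qid .
  # Assumption: the value is a Q ID string such as "Q123"; turn it into an entity URI
  BIND(IRI(CONCAT("http://www.wikidata.org/entity/", STR(?qid))) AS ?wikidataItem)
  # Fetch the coordinates from Wikidata instead of storing them locally
  SERVICE <https://query.wikidata.org/sparql> {
    ?wikidataItem wdt:P625 ?coordinates .
  }
}
LIMIT 20
```

The support item itself stays minimal; everything beyond the link is looked up on demand.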

Properties

The first properties created were:

These basic properties fulfill a structural need and create a basic framework for content creation.

A property that has a Wikidata equivalent has a sameAs on Wikidata statement.

A property without a Wikidata equivalent is instance of Metabase internal property.
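If every property is expected to carry one of these two markers, a consistency check along the following lines could list the properties that have neither. This is only a sketch: `P_SAMEAS`, `P_INSTANCE_OF` and `Q_INTERNAL_PROPERTY` are placeholders for the real Metabase IDs, which are not given here.

```sparql
PREFIX wb: <https://metabase.wikibase.cloud/entity/>
PREFIX wbt: <https://metabase.wikibase.cloud/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# All IDs containing an underscore are placeholders, not actual Metabase IDs.
SELECT ?property ?label WHERE {
  ?property a wikibase:Property .
  OPTIONAL { ?property rdfs:label ?label . FILTER(LANG(?label) = "en") }
  # No link to a Wikidata equivalent ...
  FILTER NOT EXISTS { ?property wbt:P_SAMEAS ?qid . }
  # ... and not marked as a Metabase internal property either
  FILTER NOT EXISTS { ?property wbt:P_INSTANCE_OF wb:Q_INTERNAL_PROPERTY . }
}
```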

Metabase internal properties

The following properties (Metabase internal properties) are specific to Metabase in that they do not have (one to one) Wikidata equivalents. They were created after careful consideration, as we assessed that they were necessary to fulfill our needs:

Described at Wikimedia-page – provides a way to link to pages on the Wikimedia platforms, as Wikibase Cloud does not support sitelinks. For example, :es:Wikipedia:Encuentros/IV Jornadas de Wikimedia España will link to a page in Wikipedia in Spanish. Notably this allows us to link to pages on Wikimedia wikis beyond what is supported through sitelinks on Wikidata, such as affiliate wikis – :wmse:Projekt:Wikipedia_för_hela_Sverige goes to the WMSE wiki.

Wikimedia platform(s) affected – makes it possible to say that e.g. an edit-a-thon focused on editing the Spanish Wikipedia.
WMSE project ID – unique internal identifier of projects done by Wikimedia Sverige.
Sorted under programmatic area – used internally by Wikimedia Sverige; every project we do belongs to one of four programmatic/thematic areas. See also Metabase for chapter data.

Items

At the top of the Metabase ontology is the MetaBase root item, which has the immediate sub-item: MetaBase root item requiring Wikidata equivalent. The following SPARQL query shows the root items we created. We limited them to the item types we considered crucial for modeling the data we envisioned:

Everything that the Movement does and that we want to include in Metabase is either an activity of some sort (edit-a-thon, conference, presentation, project etc.) or a published document (tutorial, report, blog post, video, slide deck etc.). Everything else is there to enable us to describe those entities – who organized or authored them, where they took place, or what they were about.

Some specific modeling considerations

Bottom-up, iterative development

Before we started working on the ontology, we had to decide on a suitable workflow. Should we:

Create all the properties we thought were needed, and then start creating the content, or
Start creating the content and at the same time create the properties and data model?

We settled for the second alternative: a content-driven modeling workflow. We did have some theoretical discussions in the beginning, about very simple cases: we will probably need a model for people, organizations, topics etc. But we realized very quickly that such discussions, while pleasing to our brains, were purely academic. We would not know exactly what we needed until we actually needed it. That's why we decided to focus on the questions we wanted Metabase to help us answer and work our way back from there to figure out how the data should be structured. Filling Metabase with content provoked discussions about what data structures were needed, not the other way around. In the end, Metabase will not fulfill its purpose until it's actually being used by other members of the Movement – and the best way for them to learn is to look at the examples we have created.

Humans

The focus of Metabase is not on individuals, but rather on what they produce. For this reason, we decided to limit the information about people to what's needed to link them to their work and its products. In order to do that, we do not need to know their gender, date of birth or educational background. Data privacy has also been a consideration; far from all active members of the Wikimedia movement have Wikidata items, or other public information about their life and work. Being respectful of their personal information is a matter of not only human decency, but also the General Data Protection Regulation, which Wikimedia Sverige, as an organization based in an EU country, is legally obliged to follow.

We only create items about humans when these are needed to describe authorship/speaker/event or project responsibility. And we only do that when this information is already publicly available; any affiliation would also need to have been publicly communicated. Since the main affiliation might not always apply, it can be superseded by a qualifier on an item-by-item basis.

Some Wikimedians participate in the Movement's activities and are widely known under their usernames only. We have made it clear in the documentation that this should be respected and that contributors should not publish their real name if they happen to know it. We have also noted that some Wikimedians maintain separate user accounts for their participation as affiliate staff and volunteers. It can for example happen that one person has two presentations at a conference, one as affiliate staff and another in their volunteer capacity. If the source material makes the distinction clear – e.g. the conference program says one presentation is given by Sven Svensson (WMSE) and the other by Sven123456, we do argue that two person items should be created even if some people know whom the volunteer account belongs to.

We hope that by limiting the information about humans in this way, we can respect the subjects' privacy while still being able to store the data about authorship or event leadership that we want.

Index terms

We want the users of Metabase to be able to find things that are about other things, e.g. presentations about gender gap or working with art museums. In order to do that, Metabase needs to contain the items gender gap and art museum, which obviously exist on Wikidata. Duplicating all the content in art museum, however, is totally unnecessary for Metabase.

That's why we came up with the simplified concept of the index term.

An index term is a keyword that we use to describe e.g. the topic of a document or the focus area of a project (using the property main subject).

Within the scope of Metabase, the label is enough; we do not concern ourselves with e.g. expressing the relationships between index terms. Those live in Wikidata, and if necessary, can be queried using federated queries.

That's why the only data we provide for index terms is:

instance of → index term
sameAs on Wikidata → Wikidata Q ID

Sometimes an item that is also something else needs to be used as an index term. In this case, the item can have multiple instance of statements. For example:

The city of Gothenburg is both a location (where some conference took place) and an index term (because it was the topic of an edit-a-thon).
Swedish Wikipedia is both a Wikimedia content project (so that we can describe an edit-a-thon where it was edited, using Wikimedia platform(s) affected) and an index term (because it was the topic of a presentation or tutorial).
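To illustrate how index terms support finding things that are about other things, the following sketch looks up everything whose main subject is the index term labeled "art museum". The IDs `P_INSTANCE_OF`, `Q_INDEX_TERM` and `P_MAIN_SUBJECT` are placeholders for the real Metabase IDs, and the sketch assumes the term has exactly that English label.

```sparql
PREFIX wb: <https://metabase.wikibase.cloud/entity/>
PREFIX wbt: <https://metabase.wikibase.cloud/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# All IDs containing an underscore are placeholders, not actual Metabase IDs.
SELECT ?item ?itemLabel WHERE {
  # Find the index term by its label; no further modeling of the term is needed
  ?term wbt:P_INSTANCE_OF wb:Q_INDEX_TERM ;
        rdfs:label "art museum"@en .
  # Anything (document, presentation, project ...) that is about that term
  ?item wbt:P_MAIN_SUBJECT ?term .
  OPTIONAL { ?item rdfs:label ?itemLabel . FILTER(LANG(?itemLabel) = "en") }
}
```

Because an item can have several instance of statements, the same pattern also matches terms such as Gothenburg that are both a location and an index term.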

The Wikimedia platforms

One of our goals is to use Metabase to store information about activities on the Wikimedia platforms, such as edit-a-thons and uploads. The platforms have a hierarchical structure: the Wikipedias in different languages are sibling projects, and offspring of a conceptual "Wikipedia" (even though it does not exist). From the point of view of users, we want to replicate this hierarchical relationship; sometimes you might be interested in finding all edit-a-thons with Wikipedia focus, sometimes only limited to a particular language version. Apart from Wikipedia, several other projects are relevant to our data, such as Wikidata and Wikimedia Commons.

The structure of the Wikimedia platforms in Metabase is as follows:

family of Wikimedia projects is a root item.
Wikimedia content project is a root item.
Wikipedia is a family of Wikimedia projects, as it encompasses many language editions.
Swedish, English, German etc. Wikipedias are Wikimedia content projects as well as parts of Wikipedia.
Wikimedia Commons and other sibling projects of Wikipedia are Wikimedia content projects.
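The hierarchy above is what makes both kinds of search possible. As a rough sketch, a property path can collect activities that affected Wikipedia as a whole or any of its language editions. `P_PLATFORM_AFFECTED`, `P_PART_OF` and `Q_WIKIPEDIA` are placeholders for the real IDs, and the sketch assumes the language editions are linked to Wikipedia through a part of property.

```sparql
PREFIX wb: <https://metabase.wikibase.cloud/entity/>
PREFIX wbt: <https://metabase.wikibase.cloud/prop/direct/>

# All IDs containing an underscore are placeholders, not actual Metabase IDs.
SELECT ?activity ?platform WHERE {
  ?activity wbt:P_PLATFORM_AFFECTED ?platform .
  # Zero or more "part of" steps: matches Wikipedia itself and every language edition
  ?platform wbt:P_PART_OF* wb:Q_WIKIPEDIA .
}
```

To limit the results to one language version, the second triple pattern can be dropped and `?platform` replaced with the item for that edition.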

Experiences coming from Wikidata

As experienced Wikidata editors, we expected contributing to a Wikibase Cloud instance to be extremely easy. Or so we thought. Along the way, we discovered several technical differences between a Wikibase Cloud instance and Wikidata, ranging from major to mildly annoying. While they did not stop us from progressing, they did slow us down in the beginning.

Lack of support for gadgets and user scripts. Many experienced Wikidata editors take great pride in their customized environment, possibly not even aware of what a nude Wikibase interface looks like. The writers were particularly longing for Merge, LabelLister, DuplicateReferences and MoveClaim. It is not possible to deploy your own JavaScript or CSS files to customize the interface either.
Limited number of extensions. As compared to Wikidata, Wikibase Cloud does not come with several of the extensions familiar to Wikimedians. In the Item namespace, the lack of Property Suggester is most noticeable – we hadn't even realized how much time and mental effort it saves in Wikidata! Outside of the Item namespace, Translate and VisualEditor are particularly needed. Not being able to use them affected our efficiency in editing the documentation pages, such as the modeling guidelines. We're all thoroughly spoiled when it comes to creating tables in VisualEditor, after all! And without the Translate extension, our future users will not be able to easily create versions in their preferred languages, which will have a huge negative impact on our ability to introduce and advocate for Metabase in the global community.
Images from Wikimedia Commons are not displayed. This provided a minor but painful hurdle when describing our modeling – sometimes a picture does say more than a thousand words (and it's the actual reason why you're not reading this paper in Metabase).

The Cradle tool does not work correctly, searching for items in Wikidata instead. The issue has been noted on Phabricator, as there are other Wikibase Cloud users who would like to be able to use Cradle.
Lack of property constraints, i.e. rules on properties that specify how properties should be used. As Wikidata editors, we find it very convenient to be immediately notified when setting a possibly erroneous value on a property, whether because we are not familiar with the conventions or simply by accident. In Metabase, we have attempted to circumvent this by writing a set of SPARQL queries to detect common errors and mismatches, but it's not as effective or user-friendly as automatic constraint warnings.
Lack of interwiki support for the Wikimedia projects. The creation of the property described at Wikimedia-page was motivated by that.
When writing SPARQL queries, prefix declarations are required, e.g.:

PREFIX wb: <https://metabase.wikibase.cloud/entity/>
PREFIX wbt: <https://metabase.wikibase.cloud/prop/direct/>
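An error-detecting query of the kind mentioned under property constraints could look like the sketch below, which uses the same two prefix declarations. It is an illustration rather than one of the queries actually used in Metabase: it lists index terms that lack a sameAs on Wikidata statement, and `P_INSTANCE_OF`, `Q_INDEX_TERM` and `P_SAMEAS` are placeholders for the real IDs.

```sparql
PREFIX wb: <https://metabase.wikibase.cloud/entity/>
PREFIX wbt: <https://metabase.wikibase.cloud/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# All IDs containing an underscore are placeholders, not actual Metabase IDs.
SELECT ?term ?label WHERE {
  ?term wbt:P_INSTANCE_OF wb:Q_INDEX_TERM .
  # An index term is expected to link to its Wikidata item; report those that do not
  FILTER NOT EXISTS { ?term wbt:P_SAMEAS ?qid . }
  OPTIONAL { ?term rdfs:label ?label . FILTER(LANG(?label) = "en") }
}
```

Unlike a constraint warning, a query like this only reports problems after the fact, so it has to be run regularly.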

Conclusions

In summary, setting up a Wikibase Cloud instance is technically simple. It should be noted that we had not attempted working with self-hosted Wikibase instances, so we cannot make a reasonable comparison there. The real difficulties start on the conceptual level, when devising and implementing our basic ontology. Of course, these difficulties will be the same regardless of whether one's working in the cloud or on one's own server – anyone can configure software, but designing an ontology requires a high level of abstract thinking, which we are not experts in.

Are you considering creating your own Wikibase Cloud instance? Here are some things we learned when setting up Metabase and preparing it for data input:

Decide what you want to achieve, rather than how. What queries do you want to be able to run? Then work backwards from there to design the ontology.
Prepare for the fact that designing an ontology from scratch is harder than it seems. Being used to Wikidata can help, but it can also be a hindrance. The human brain likes what it's familiar with, but we explicitly did not want or need to be as detailed as Wikidata.
Decide early what level of compatibility with Wikidata will be desirable.
Think about your target group – data users and contributors alike. Will they all be intimately familiar with the subject matter, or are you planning to actively engage external users? This will influence how you think about the ontology and its complexity.