WikiFAIR
This page is under construction. Please help review and edit this page.
WikiFAIR is a set of ideas, instructions and helpful examples for achieving the FAIR principles in research projects using Wikimedia systems and technologies. More specifically, it is about integrating Wikimedia platforms into research data management to promote free knowledge while lowering the hurdles of building publishing infrastructure.
FAIR principles
The FAIR data principles are guidelines designed to improve the Findability, Accessibility, Interoperability, and Reusability of digital assets. These principles emphasize the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.
Full implementation of every aspect of the FAIR principles is difficult for most research projects, especially smaller ones. Whether a new digital presentation platform needs to be built or research data needs to be published, the hurdles and potential costs can be significant. Yet most researchers want to, and often need to, follow these guidelines to achieve the best scientific outcomes.
Integrating research projects with Wikimedia's ecosystem lets research teams benefit from adhering to FAIR data standards and makes the research data consumable in many different formats. The specific results depend on the chosen integration model, which can range from simply linking the data sets to fully using multiple Wikimedia projects to host structured data, media files and other formats about the research.
Wikimedia and FAIR Principles
The Wikimedia Movement's goal is to become the essential infrastructure of the ecosystem of free knowledge, and implementing FAIR principles across all projects is accordingly a core goal. Because WikiFAIR focuses mostly on the semantic knowledge base Wikidata and the free media repository Wikimedia Commons, this page explores those two in particular. Most of the details about Wikidata also apply to Wikibase, the open source software powering it, which can also be run independently in a federated system.
Putting these projects in the FAIR definitions of GO FAIR:
Findable
All Wikimedia projects use unique and persistent identifiers (QIDs for Wikidata, file names for Commons) as required by F1, with descriptive metadata for F2. Every entry includes the identifier (F3) and is indexed by multiple search engines (F4).
Accessible
Wikimedia sites offer free content under terms ranging from CC BY-SA (often found on Commons) to public domain (all of Wikidata), and entries are retrievable by their identifier using standardized, open protocols (A1, A1.1). There is also a Single User Login system that spans all projects (A1.2).
Metadata about deleted/removed objects can be retrieved via deletion logs as demanded by A2.
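As a minimal sketch of retrieval by identifier, the following Python helpers turn a Wikidata QID or a Commons file name into a URL that can be resolved over plain HTTP(S). The helper names are made up for this illustration, and the file name in the usage note is a placeholder.

```python
from urllib.parse import quote


def wikidata_concept_uri(qid: str) -> str:
    """Return the concept URI for a Wikidata item ID such as 'Q42'."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a valid QID: {qid!r}")
    return f"http://www.wikidata.org/entity/{qid}"


def commons_file_url(filename: str) -> str:
    """Return the file description page URL for a Commons file name."""
    # MediaWiki titles use underscores in place of spaces.
    title = filename.strip().replace(" ", "_")
    return "https://commons.wikimedia.org/wiki/File:" + quote(title)
```

For example, `wikidata_concept_uri("Q42")` and `commons_file_url("Example image.jpg")` each return a single stable address that can be cited in a publication or stored in a project database.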
Interoperable
Structured data on Wikidata is presented through a fully multilingual user interface and is available in the most common FAIR (I2) export formats such as RDF, JSON, TTL and more (I1). The dataset is interlinked with many other authority databases (I3).
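To illustrate the export formats, the sketch below builds per-format URLs through Wikidata's Special:EntityData page, which selects the serialization by file extension. The function names and the User-Agent string are assumptions for this example, and `fetch_entity_json` needs network access to run.

```python
import json
from urllib.request import Request, urlopen

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData"
FORMATS = ("json", "rdf", "ttl", "nt")


def export_urls(qid: str) -> dict:
    """Map each export format to the URL serving that serialization."""
    return {fmt: f"{ENTITY_DATA}/{qid}.{fmt}" for fmt in FORMATS}


def fetch_entity_json(qid: str) -> dict:
    """Download and parse the JSON export of one item (requires network)."""
    request = Request(
        export_urls(qid)["json"],
        # Placeholder User-Agent; replace with your project's contact details.
        headers={"User-Agent": "WikiFAIR-example/0.1 (hypothetical project)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping URL construction separate from the download makes it easy to hand the same addresses to other RDF or JSON tooling.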
Reusable
Wikidata as a whole features well over 100 million items and contains 1.5 billion triples, created from a pool of over 10,000 properties (R1). All data is released under clear and open licensing (see #Accessible, R1.1), and a referencing system is applied across all Wikimedia projects, requiring citations and provenance information in most circumstances (R1.2).
The data model implements or approximates different community standards by linking properties to their relevant equivalents in other standards (R1.3).
Additional Features of Integration
Aside from helping with FAIR requirements, integrating data with the broader Wikimedia community can help with different aspects of maintaining, visualizing or simply hosting your data set.
Community
The community can help with long-term quality management by crowdsourcing improvements to the data. Thousands of volunteers and bots comb through the data each day, patrolling changes, adding or removing statements and performing other helpful tasks. This can mean that a newly created Wikidata item from your project receives a lifetime of maintenance as part of the wider dataset.
The community can also help with questions of data modeling, writing SPARQL queries or improving related properties. On Commons, volunteers can categorize uploaded media and add new depicts statements, which also enriches the data.
Allowing the community access can also help fulfill the expanded set of criteria in the CARE Principles for Indigenous Data Governance, which require allowing access to a diverse set of related groups, some of which are already represented in the Wikimedia Movement.
5-Star Open Data
A different concept from FAIR, with similar intentions, is the 5-Star Open Data system. It categorizes data by qualities of accessibility and machine readability and was proposed by Tim Berners-Lee. With the addition of Structured Data on Commons, both Commons and Wikidata offer the highest level in this scheme, and by linking or integrating data with them research projects can reach the same level with little effort.
Software
From a software point of view, both MediaWiki and Wikibase are well-tested open source tools that offer full versioning, multi-user editing and caching for scalability. They can accommodate teams of editors of most sizes and bring a well-developed extension system for further customization.
External Tools
Data on Wikimedia projects can be processed with a plethora of diverse external tools, allowing for more possibilities than many other platforms. For an overview see Wikidata:Tools and Commons:Tools or visit Toolhub.
Possible Caveats
Before continuing further, please be advised that there could be downsides to some parts of this process.
(Partial) loss of control
Wikimedia projects are generally open to everyone to edit, which can be very appealing. But it's also possible to get your edits reverted, or to have your data vandalized by malicious editors.
If something like this happens, adhere to community guidelines or ask for help if anything remains unclear. Depending on your research goals and methods, you may want to monitor your data after upload. See WikiFAIR#Monitoring Uploaded Data for more information.
Alternatively, it could be helpful to run your own Wikibase instance to better control editing access to the data set; see #Creating a Knowledge Graph with Wikibase.
It is of utmost importance to keep a second copy of the research data, preferably also published independently of any Wikimedia project. There is no guarantee that the community will not, for example, request deletion or change the contents; such decisions can always be appealed, but they could lead to disruptions. For publishing bulk research data, there are multiple popular platforms, e.g. Zenodo. (A publication on such a bulk hosting platform also makes a great reference; see the next section.)
Sourced statements requirement
Citing sources is required for your data to be accepted into any Wikimedia project; more information on that here. This can be a problem for original research or hard-to-source facts. If the source is a paper your team is going to write, it can also be cited, but the problem remains for some special cases. Asking in the Project Chat can help with questions regarding sources.
Integrating Structured Data in Wikidata
As stated on the Main Page of Wikidata:
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
This already suggests some ways to proceed, either by adding the data by hand or by using tools to modify Wikidata on a bigger scale.
Inclusion Criteria
Most kinds of structured data fit in Wikidata, which allows for great freedom in how you model your contribution. There are also (deliberately broad) rules about what falls within the scope of the project; for researchers, the second criterion is the most important:
[The item] refers to an instance of a clearly identifiable conceptual or material entity that can be described using serious and publicly available references.
Having publicly available sources for the added data is key to the future of Wikidata as a quality source, and quite easy to implement, see #Technical Process.
Please also apply a measure of diligence to how you contribute to the dataset. If your data only adds some statements to various items, there shouldn't be any problem, but be careful when creating new items if you are not fully sure whether they actually contribute relevant knowledge.
Technical Process
To learn how to add items and edit existing ones, follow a tour on Wikidata:Tours. There are many different routes for uploading data with external tools; some popular options are OpenRefine, bots and QuickStatements.
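As a rough sketch of a tool-based upload, the helpers below generate lines in the tab-separated QuickStatements V1 format, where reference properties are written with an `S` prefix instead of `P` and strings are wrapped in double quotes. The helper names are invented for this example, Q4115189 is the Wikidata Sandbox item, and the example.org URL is a placeholder.

```python
def qs_string(text: str) -> str:
    """Wrap a plain string or URL in double quotes for QuickStatements."""
    return f'"{text}"'


def qs_line(item: str, prop: str, value: str, sources=()) -> str:
    """Build one tab-separated QuickStatements V1 command.

    `sources` is an iterable of (property, value) pairs; each property ID
    such as 'P854' is rewritten to its source form 'S854'.
    """
    fields = [item, prop, value]
    for source_prop, source_value in sources:
        fields.extend(["S" + source_prop[1:], source_value])
    return "\t".join(fields)


# One statement on the sandbox item, sourced with a reference URL (P854).
example_batch = [
    qs_line(
        "Q4115189", "P31", "Q5",
        sources=[("P854", qs_string("https://example.org/dataset/1"))],
    ),
]
```

The resulting lines can be pasted into QuickStatements as a batch; trying them on the sandbox item first is a cheap way to check the data model before a larger import.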
Attribution
To collect and show the contributions made by a project, it can be useful to create a project user account. Editing with the project account lets you identify contributions as belonging to the project. Linking and referencing Wikidata statements back to the project database or publications can also be helpful, especially with the properties:
- described by source (P1343)
- stated in (P248) {project item}
- reference url (P854)
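A reference that combines two of these properties could be assembled as follows. This is a sketch that assumes the snak layout of the Wikibase JSON format as accepted by the `snaks` parameter of the `wbsetreference` API module; the function names, the project item ID and the URL are placeholders.

```python
import json


def item_snak(prop: str, qid: str) -> dict:
    """Snak whose value is another item, e.g. stated in (P248)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "numeric-id": int(qid[1:]), "id": qid},
        },
    }


def url_snak(prop: str, url: str) -> dict:
    """Snak whose value is a URL, e.g. reference URL (P854)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"type": "string", "value": url},
    }


def project_reference(project_qid: str, url: str) -> dict:
    """Reference snaks pointing back to the project item and a record URL."""
    return {
        "P248": [item_snak("P248", project_qid)],
        "P854": [url_snak("P854", url)],
    }


# Serialized form, ready to be sent as a request parameter.
payload = json.dumps(project_reference("Q4115189", "https://example.org/record/1"))
```

Using the same reference on every uploaded statement also makes the project's contributions easy to find later with a query.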
Monitoring Uploaded Data
You can monitor uploaded data via Watchlists or Maintenance Queries. See, for example, the WikiProject Chess Maintenance Queries.
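One possible maintenance query, sketched here in Python, lists all statements whose reference cites a given project item via stated in (P248). It assumes the statements were uploaded with such a reference and relies on the prefixes predefined by the Wikidata Query Service; the function names and the example item ID are placeholders.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def statements_citing(project_qid: str, limit: int = 100) -> str:
    """SPARQL query for statements referenced with 'stated in' the project item."""
    return f"""SELECT ?item ?statement WHERE {{
  ?item ?claim ?statement .
  ?statement prov:wasDerivedFrom ?reference .
  ?reference pr:P248 wd:{project_qid} .
}}
LIMIT {limit}"""


def query_url(query: str) -> str:
    """URL that asks the query service for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"
```

Running such a query on a schedule and comparing the result with the project's own records is one way to notice removed or changed statements.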
Creating a Knowledge Graph with Wikibase
If you prefer to host the data partially or completely outside of Wikidata but still want a similar feature set, you might want to use Wikibase. It is the same open source software that Wikidata uses, but completely under your control. On your own Wikibase instance, you can model the data however you want and have your own community edit it (for example, within a research project). Wikibase instances can be federated with each other; that is, a single SPARQL query can access several instances at the same time (for example, your own Wikibase and Wikidata), in line with the Wikibase Ecosystem Joint Vision.
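A federated query of the kind described above might look like the following sketch, which uses the standard SPARQL 1.1 `SERVICE` clause. It assumes a hypothetical local property (here `P1`) that stores the Wikidata concept URI of each local item, and that the local instance exposes direct-claim predicates under `/prop/direct/` like Wikidata does; the base URL is a placeholder, and the local query service may need to be configured to allow the remote endpoint.

```python
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def federated_label_query(local_base: str, mapping_prop: str,
                          lang: str = "en", limit: int = 50) -> str:
    """Query to run on your own Wikibase that pulls labels from Wikidata."""
    # Assumed predicate layout; check your instance's RDF prefixes.
    predicate = f"{local_base.rstrip('/')}/prop/direct/{mapping_prop}"
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?localItem ?wikidataItem ?label WHERE {{
  ?localItem <{predicate}> ?wikidataItem .
  SERVICE <{WIKIDATA_SPARQL}> {{
    ?wikidataItem rdfs:label ?label .
    FILTER(LANG(?label) = "{lang}")
  }}
}}
LIMIT {limit}"""


# Placeholder instance URL and property ID.
example_query = federated_label_query("https://example.wikibase.cloud/", "P1")
```

The local triple pattern runs first on your own instance, and only the matching Wikidata URIs are looked up remotely, which keeps the federated part small.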
Technical Process
Wikibase can be set up in two major ways: as a hosted service on Wikibase.cloud or as an on-premises install using Docker with Wikibase Suite. Here is a guide to help you choose.
Free Media on Wikimedia Commons
Wikimedia Commons, sometimes referred to simply as "Commons", is a media repository of free-to-use images, sounds, videos and other media.
Inclusion Criteria
Many types of files can be hosted on Wikimedia Commons if they carry an applicable license (CC BY-SA or less restrictive); see Commons:Licensing.
Not every kind of data is allowed on Commons, even if the licensing is correct. The uploaded data should be "educational", which is normally not a problem for research data, and fit into the project scope of Wikimedia Commons.
If you are unsure whether your media is suitable for upload to Commons, you can ask here. There are also other projects that can host freely licensed media, e.g. Archive.org.
Technical Process
For uploading images to Wikimedia Commons, follow this tutorial. There is also this list of supported file types; nearly every relevant open format is supported.
Attribution
There are multiple ways to attribute the research project or organization, depending on the platform. Creating a User Account can help with attribution too.
Media files on Commons can be labeled (via Structured Commons and traditionally) with a creator property. Alternatively, if the files were not created by the researchers, templates indicating the origin and context of the upload can be added to the file page. The user name in the file history or the source statement is also always a way to refer to the project account.
Monitoring Uploaded Data
You can monitor uploaded data via Watchlists or Maintenance Queries. See, for example, the WikiProject Chess Maintenance Queries.
Other Wikimedia Projects
In some cases, integration with other Wikimedia projects could also be beneficial. Please note that, unlike Commons or Wikidata, these projects are generally not multilingual, so you should use the appropriate language version.
Citing your own research in Wikipedia can be allowed in some cases, but overdoing it should be avoided. Following this guide for academics and researchers is generally a good start.
Adding new source texts or improving the translation of an existing one are possible ways to use Wikisource. Using Wikisource as a platform to publish historical or non-copyrighted texts gives access to features like:
- Different export formats (EPUB and MOBI for eReaders, PDF, RTF and more)
- Configurable reader
- The Wikisource Community for proofreading of translations and transliterations
- A Wikidata item corresponding to the text
Wikispecies is a Wikimedia project dedicated to tracking and collecting species, which makes it a sensible place to put data about them.
For lexicographers, adding new dictionaries as sources or adding new senses could be a worthwhile endeavor. Additionally, Wikidata has supported Lexemes since 2018 and is an easy platform to (re)use, since its licensing is public domain.
Integration Models
Depending on your requirements, integration into the Wikimedia ecosystem can vary drastically, from simple links or statements to hosting most of the research data there. Deeper levels of integration also require more effort to achieve.
Examples
Here is a list of projects that used some of the approaches listed above to integrate Wikimedia into their work. The documentation is in German at the moment but will be translated in the future.
The documentation should contain information about the setup, different integration steps taken and some results, as well as future plans.