Notes and links

Goal

Wikidata will serve as a centralized, highly structured repository capable of representing the highly networked nature of the scholarly sources that support the knowledge archived across all Wikimedia projects. This signals an unprecedented opportunity, not only for scientists and scholars but also for society at large, to explore the complex landscape of human knowledge. Yet it is not clear what such an exploration would look like. What kinds of questions can be asked of such a system? With the generous support of CrossRef and the Sloan and Moore Foundations, the WikiCite 2016 Workshop established a working group not only to envision concrete use cases for scholarly source-related questions in Wikidata but also to determine whether the technical foundations required to express those questions as intelligent, efficient, and systematic queries are in place. Where these technical foundations are lacking but needed, the working group tasked itself with developing proposals for overcoming such limitations.

This group focused on discussing and prioritizing use cases for Wikidata queries involving source metadata. The working assumption is that all the required data is already available. We also worked on obtaining a small, openly licensed bibliographic and citation graph dataset to build a proof of concept of the querying and visualization potential of having this data stored in Wikidata and exposed via SPARQL.

Notes

See Proposal: Retrieving Wikidata statements by source
Aim
- discuss and prioritize the most important types of source-related queries that WDQS should support
- determine if these queries can be effectively expressed in SPARQL and executed via WDQS or if they require a different indexing / data modeling strategy

Key properties

Properties expressing a citation relation

Stated in (d:property:P248)
Main subject (d:property:P921)
Published in (d:property:P1433)
Imported from (d:property:P143)
Cites (d:property:P2860).

The 'Cites' property was suggested, supported and created during the WikiCite 2016 meeting. The quick and bold creation promptly became a topic of discussion on Wikidata and a Wikidata user even suggested it for deletion. Nevertheless, meeting participants rapidly utilized the property to mark up a few scientific papers, so small citation networks could be visualized. This was particularly the case for scientific papers about the Zika virus and fever.

Other relevant properties

PubMed ID (d:property:P698)
subclass of (d:property:P279)
author (d:property:P50)
short author name (d:property:P2093)

Examples

list all Wikidata statements citing a New York Times article
- e.g. d:Q191020
list the most popular scholarly journals used as citations of statements for any item that is a subclass of economics
retrieve all statements citing the works of Joseph Stiglitz (d:Q18430)
retrieve all statements citing journal articles
- by physicists from Oxford University
- that have a PubMed Central ID
list all statements citing a specific journal article that was retracted
list all statements citing (WD) a source that cites (non-WD) a specific journal article (or one that was retracted)
- this is outside the current scope of any Wikidata-related project, as it requires storing scholarly citations between papers
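The first examples above (statements citing a given source) can be sketched as a small query builder. This is only a sketch under assumptions: it treats "citing" as "has a reference whose 'stated in' (P248) value is the source item", relies on the reference modelling of the Wikidata RDF export (`prov:wasDerivedFrom`, `pr:`), and the function name `statements_citing_query` is made up for illustration.

```python
import re


def statements_citing_query(source_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query listing statements whose reference is 'stated in' a source.

    Assumption: a statement "cites" a source when one of its reference nodes
    carries pr:P248 (stated in) pointing at the source item.
    """
    if not re.fullmatch(r"Q[1-9][0-9]*", source_qid):
        raise ValueError(f"not a Wikidata item ID: {source_qid!r}")
    return f"""\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?statement ?reference WHERE {{
  ?statement prov:wasDerivedFrom ?reference .
  ?reference pr:P248 wd:{source_qid} .
}}
LIMIT {int(limit)}
"""
```

The returned string can be pasted into WDQS. Restricting the source to a class of items (journal articles, retracted articles) would mean replacing the fixed `wd:` value with a variable and adding triple patterns on it.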

all Zika-related journal articles (WD) that were published in the last n weeks (see Wikidata WikiProject Source Metadata: Items about Zika virus or fever)

coauthors of X
- requires storing bibliographic metadata for all publications by X
coauthors of X in Wikipedia
- is there an interest for coauthors limited to sources cited in Wikipedia?
- other examples of queries on citations restricted to Wikipedia would be more useful
X's H-Index
- requires storing bibliographic metadata for all publications by X and all their citations
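Once per-publication citation counts for X are available (which, as noted, depends on storing X's full publication record and its incoming citations), the H-Index itself is simple to compute. A minimal sketch, with the function name `h_index` chosen here for illustration:

```python
def h_index(citation_counts):
    """Return the largest h such that at least h publications have >= h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h
```

The input would be one citation count per publication by X, for instance the number of incoming 'cites' (P2860) links per item. An incomplete publication record or citation graph yields an underestimate.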

Can citation links be typed?
Citations restricted by their target
Note: you cannot add qualifiers to sourcing statements, e.g. stated in (with a specific citation intention)
How do we think about veracity on WD?

Use cases

Reuse source MD

- Example: look up a particular publication via a combination of free-form keywords, e.g. author, journal name, words in the title ('choosing experiments evans sociology')
- Is this something WDQS would be able to return? Would a vanilla search API be more appropriate?
Publication lists
- Example: all publications by Finn Årup Nielsen sorted by publication date
- requires storing biblio metadata for the entire publication record of a given author
- could potentially be implemented via a script periodically syncing up an author entry on Wikidata with the corresponding ORCID record
- could extend to bibliographies/ reading lists of all types
- knowledge wells (return the developed scholarship from an arbitrary 'community' [e.g. individual, lab, department, division, university, company]).
custom curriculums
- Example: all publications by members of a given lab
- ORCID supports affiliations as free-form text, Wikidata has the benefit of supporting affiliations via linked data
- Example: all publications supported by grants from a specific funder
- Overlaps potentially with Crossref data (funderID)
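The publication-list use case above can be sketched as a query builder. Assumptions not stated in these notes: publication date is taken from P577, only works linked through 'author' (P50) are returned (works that carry the author only as a name string are missed), and `publications_by_author_query` is a hypothetical name.

```python
import re


def publications_by_author_query(author_qid: str) -> str:
    """Build a SPARQL query for works by an author item, oldest first.

    Assumption: works link to the author via wdt:P50 and carry their
    publication date in wdt:P577.
    """
    if not re.fullmatch(r"Q[1-9][0-9]*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    return f"""\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?work ?date WHERE {{
  ?work wdt:P50 wd:{author_qid} .
  OPTIONAL {{ ?work wdt:P577 ?date . }}
}}
ORDER BY ?date
"""
```

The lab and funder variants would follow the same shape, with an extra hop from the author to an affiliation item or from the work to a funder item, provided that data is stored as linked statements.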

Sanity Checks

This is mostly targeted at data producers / source owners

Example 1:
- bot scraping data about proteins and storing sources on Wikidata
- used to reference text, but created errors referencing synonyms, e.g. Ebola River (Q934455) instead of Ebolavirus (Q5331908)
Example 2:
- graph representations surfacing type/class errors, e.g. a query for US states sharing borders used to return items that are not an instance of a state
Example 3:
- https://bitbucket.org/sulab/wikidatasparqlexamples#markdown-header-uniprot

Federated Wikibase Queries

run queries across multiple data providers
analyze data quality by comparing results from separate providers
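Comparing results from separate providers can be as simple as diffing two result sets keyed by a shared identifier. A sketch, assuming each provider's query result has already been reduced to a mapping from identifier (e.g. a PMID) to a value such as the title; the function name and the sample data in the usage note are invented for illustration.

```python
def compare_providers(a: dict, b: dict) -> dict:
    """Compare two {identifier: value} mappings from different data providers."""
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # identifiers known to both providers whose values disagree
        "conflicting": sorted(k for k in shared if a[k] != b[k]),
    }
```

For example, with made-up records `{"1": "Title A", "2": "Title B"}` and `{"2": "Title B (rev.)", "3": "Title C"}`, the report lists "1" as only in the first provider, "3" as only in the second, and "2" as conflicting.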

Generating a test case

We decided to identify a corpus of references to explore the feasibility of importing them and using them as sources for existing Wikidata items. Requirements for this dataset are the following:

size: the corpus should have a fairly small number of nodes (articles)
relevance: the corpus should fill some obvious gaps, such as serving to directly source statements in Wikidata
PID-ready: the corpus should have clean metadata derivable from persistent identifiers (DOIs or PMIDs)

Obtaining a dataset

Bibliographic records
- Zika dataset
  - Number of nodes:
  - PubMed search for "Zika" returns 883 articles: https://www.ncbi.nlm.nih.gov/pubmed?term=zika
  - List of PubMed IDs resulting from https://www.ncbi.nlm.nih.gov/pubmed/?term=zika+virus[Mesh+terms]+OR+zika+fever[Mesh+terms]
  - https://gist.github.com/konrad/341d1b8af1fd602f0f881bcc53c540ab
- Ebola Dataset
  - (all PubMed records with "Ebola" in the title or abstract), available here: https://www.dropbox.com/sh/gh5ckftwhvt7pao/AAD6-tXO_Kz-QbphFhUmUCG0a?dl=0
  - Number of nodes: 1,600 records.
  - ebolaDocs.csv is a list of the records with this delimited schema: id pmid issn year vol issue journal journalAbbrev journalCountry journalNlmID articleTitle (on a quick look through, this file seems to contain some noisy records; that should not matter because we only need about 50 records).
  - ebolaAuthors is a list of the authors with this delimited schema: id pmid rank LastName FirstName initials
  - ebolaWD is a list of the entities currently in Wikidata (some of which are missing PMIDs), generated with this query: SPARQL
  - Wikidata scientific articles that contain ebola in title
- APS dataset
Citation graph
- APS (DOI prefix: 10.1103, i.e. papers like http://dx.doi.org/10.1103/PhysRevFluids.1.013903 )
- PubMed: full pubmed - Waiting
- PubMed: Ebola Dataset - Waiting
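The ebolaDocs.csv schema listed above can be read along the following lines. This is a sketch under assumptions: the notes do not say which delimiter the file uses, so a tab is assumed and can be overridden; rows without a numeric pmid are treated as the "noisy" records and skipped; the function name `read_ebola_docs` is invented here.

```python
import csv

# Column order as given in the notes for ebolaDocs.csv.
DOC_FIELDS = [
    "id", "pmid", "issn", "year", "vol", "issue", "journal",
    "journalAbbrev", "journalCountry", "journalNlmID", "articleTitle",
]


def read_ebola_docs(lines, delimiter="\t"):
    """Parse delimited record lines into dicts, keeping only rows with a numeric pmid.

    `lines` is any iterable of text lines, e.g. an open file.
    The tab delimiter is an assumption, not something the notes specify.
    """
    reader = csv.DictReader(lines, fieldnames=DOC_FIELDS, delimiter=delimiter)
    records = []
    for row in reader:
        pmid = (row.get("pmid") or "").strip()
        if pmid.isdigit():  # also skips a header row, if the file has one
            records.append(row)
    return records
```

A typical call would be `read_ebola_docs(open("ebolaDocs.csv", newline="", encoding="utf-8"))`. The resulting PMIDs could then be fed to an import tool one by one.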

Mapping records to existing WD items

Zika dataset
- Wikidata WikiProject Source Metadata: Items about Zika virus or fever
- identify items that
  - are an instance of (d:property:P31) scientific article (d:Q13442814)
  - have Zika virus (d:Q202864) or Zika fever (d:Q8071861) as Main subject (P921)
    - PubMed search: https://www.ncbi.nlm.nih.gov/pubmed?term=zika%20virus[Mesh%20terms]%20OR%20zika%20fever[Mesh%20terms]%20&sourceid=mozilla-search
  - are orphaned, i.e. are not currently used as a source in any Wikidata statement (query: SPARQL)
  - alternatively, identify all items that are orphaned (query: SPARQL)
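The three criteria above can be combined into a single query. The sketch below is not the query linked from these notes; it assumes that "used as a source" means "appears as the 'stated in' (P248) value of some reference", so sources referenced through other properties would still count as orphaned. The function name is hypothetical.

```python
import re

SCIENTIFIC_ARTICLE = "Q13442814"


def orphaned_articles_query(subject_qids):
    """Build a SPARQL query for scientific articles on the given main subjects
    that no reference points to via 'stated in' (P248)."""
    subject_qids = list(subject_qids)
    if not subject_qids:
        raise ValueError("at least one subject item is required")
    for qid in subject_qids:
        if not re.fullmatch(r"Q[1-9][0-9]*", qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in subject_qids)
    return f"""\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>

SELECT DISTINCT ?item WHERE {{
  VALUES ?subject {{ {values} }}
  ?item wdt:P31 wd:{SCIENTIFIC_ARTICLE} ;
        wdt:P921 ?subject .
  FILTER NOT EXISTS {{ ?reference pr:P248 ?item . }}
}}
"""
```

Calling `orphaned_articles_query(["Q202864", "Q8071861"])` covers Zika virus and Zika fever as listed above. Dropping the `wdt:P921` pattern gives the "all orphaned items" variant, which is likely to be much more expensive to run.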

Tools for importing data and curating it

source MD: https://tools.wmflabs.org/sourcemd/?
- takes a DOI or PMID or PMCID as input and generates an item, using the data model specified by source MD
- documentation: https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData/Source,_M.D./Tests#Methodology
quick statements tools: https://tools.wmflabs.org/wikidata-todo/quick_statements.php
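Once articles exist as items, citation links can be batch-added through the quick statements tool. The sketch below emits one tab-separated "item, property, value" line per citation, which is the basic command form that tool accepts; the helper name `cites_commands` and the idea of feeding it (citing, cited) pairs are assumptions made for illustration.

```python
import re

CITES = "P2860"  # the 'cites' property created during the meeting

_QID = re.compile(r"Q[1-9][0-9]*")


def cites_commands(pairs):
    """Turn (citing_qid, cited_qid) pairs into tab-separated command lines."""
    lines = []
    for citing, cited in pairs:
        for qid in (citing, cited):
            if not _QID.fullmatch(qid):
                raise ValueError(f"not a Wikidata item ID: {qid!r}")
        lines.append(f"{citing}\t{CITES}\t{cited}")
    return "\n".join(lines)
```

The output is meant to be pasted into the tool's input box. Each pair should be checked against the actual reference list of the citing paper before submitting.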

Curation

Ask the GeneWiki community to help crossreference this corpus with existing Wikidata items

Running sample queries and visualizations

Zika dataset
existing visualizations
timeline
listeria-generated list of references
graph visualizations
queries

Reference material

Example queries
- Examples
- Cats
What is a "statement"?

Proposal

Import an entire corpus of bibliographic metadata and citation graph for a given field
- Show all kinds of queries / visualizations that can be obtained via WDQS
- Source: PubMed? Mendeley? American Physical Society (APS)?
Mendeley contacted 25 May
- Twitter
American Physical Society (APS) contacted 25 May

Example queries

See also

Wikidata items that are instances of (P31) scientific article (Q13442814) and have a PMID (P698) or PMCID (P932): SPARQL
Wikidata items that are instances of scientific article (Q13442814) but do not have a PMID (P698) or PMCID (P932): SPARQL
Wikidata statements that have scientific papers as references, specifically Wikidata items with statements involving a PMID (P698) or PMCID (P932): SPARQL
Wikidata statements that have scientific papers as references, specifically Wikidata items that are instances of scientific article (Q13442814) but do not have a PMID (P698) or PMCID (P932): SPARQL
Most common Zika author strings: SPARQL
- Didier Musso is the most frequent amongst them https://www.wikidata.org/wiki/Q24244119
Wikidata scientific articles that contain "zika" in the title: SPARQL
Example citation network for Zika research papers: https://angryloki.github.io/wikidata-graph-builder/?property=P2860&item=Q23906890&iterations=5&mode=undirected
Another example citation network for Zika research papers: https://angryloki.github.io/wikidata-graph-builder/?property=P2860&item=Q23308149&mode=both
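The two graph links above start from one item and follow 'cites' (P2860) links outward. How the graph builder does this internally is not covered in these notes; the sketch below only illustrates the general idea of a depth-limited traversal over citation pairs, with `iterations` and `undirected` named after the URL parameters by analogy and all other names invented.

```python
from collections import defaultdict, deque


def citation_neighbourhood(edges, root, iterations, undirected=True):
    """Return the set of items reachable from `root` within `iterations` hops.

    `edges` is an iterable of (citing, cited) pairs. With undirected=True,
    links are followed in both directions (papers cited by and citing a node).
    """
    neighbours = defaultdict(set)
    for citing, cited in edges:
        neighbours[citing].add(cited)
        if undirected:
            neighbours[cited].add(citing)

    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == iterations:
            continue  # hop limit reached; do not expand this node further
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```

The edge list could come from a WDQS query returning every `?citing wdt:P2860 ?cited` pair among the Zika items, which at the size of this test corpus is small enough to hold in memory.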

Results

An example of how the set of articles can be used in Wikidata: d:Q202864. This is the entity for Zika virus; we added sources for several of the statements that previously had none.
A property 'cites' (d:property:P2860) was created to model citation events between documents.
Before and after the meeting, Finn Årup Nielsen created Wikidata items for all papers associated with data in the OpenfMRI neuroimaging database (d:Q23891141).

For Andra

Zika virus @ BioProject from the National Center for Biotechnology Information (NCBI)