Research:Knowledge Gaps Index/Measurement/Content Mapping

This page lists the information extraction methods used to map Wikipedia articles to the categories relevant to each knowledge gap.

Gender Gap

For the gender gap, we need to map articles of people (biographies) to their corresponding gender. To do this, we take available values from Wikidata.

We map each article to the corresponding Wikidata item
We filter out items that are not about humans, by checking that the P31 (instance of) property takes value Q5 (human)
We record the value for the property P21 (sex or gender) as the gender class. The class can take 40 different values, as listed in this file.

The code for this mapping is available here.
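As a rough illustration of the filtering and recording steps above, the sketch below works on a simplified, hypothetical representation of a Wikidata item: a dict that maps property IDs to lists of value IDs. The function name `gender_class` and the toy items are assumptions made for this example and are not taken from the production code.

```python
# Minimal sketch, not the production pipeline.
# Assumption: each item is a dict {property_id: [value_id, ...]}.

HUMAN = "Q5"  # Wikidata item for "human"


def gender_class(claims):
    """Return the P21 values of a human item, or None for non-human items."""
    # Keep only items whose P31 (instance of) includes Q5 (human).
    if HUMAN not in claims.get("P31", []):
        return None
    # Record P21 (sex or gender); an empty list means no value is set.
    return list(claims.get("P21", []))


# Toy items for illustration only.
biography = {"P31": ["Q5"], "P21": ["Q6581072"]}
non_human = {"P31": ["Q515"]}
human_without_gender = {"P31": ["Q5"]}

print(gender_class(biography))             # ['Q6581072']
print(gender_class(non_human))             # None
print(gender_class(human_without_gender))  # []
```

Returning `None` for non-human items and an empty list for humans without a P21 value keeps "not a biography" distinct from "biography with no recorded gender".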

Sexual Orientation Gap

For the sexual orientation gap, we need to map articles of people (biographies) to their corresponding sexual orientation. To do this, rather than resorting to detection frameworks or external sources that could introduce bias, we take available values from Wikidata.

We map each article to the corresponding Wikidata item
We filter out items that are not about humans, by checking that the P31 (instance of) property takes value Q5 (human)
We record the value for the property P91 (sexual orientation) as the sexual orientation class. The class can take 40 different values, as listed in this file.

Code for this mapping is in here.

Geographic Gap

For the geographic gap, we need to map articles to the specific geographies they refer to. Two different approaches have been implemented:

Geospatial model

The geospatial model uses the lat/lon coordinates defined by the property P625 (coordinate location) and reverse geocodes them to a geographical entity. The content gap metrics are published at the country and the wmf_region level. The code for this mapping is available here.
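To make the idea of reverse geocoding concrete, the sketch below assigns a lat/lon pair to a region with a plain ray-casting point-in-polygon test. The two square regions, their names and the function names are invented for this example. The source does not say which boundaries or geocoding library the actual pipeline uses, so this should be read as one possible approach under those assumptions.

```python
# Minimal sketch of reverse geocoding by point-in-polygon lookup.
# Assumption: regions are simple polygons given as (lon, lat) vertices.

def point_in_polygon(lon, lat, polygon):
    """Ray casting: count polygon edges crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical toy regions; real boundaries are far more complex.
REGIONS = {
    "Region A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Region B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}


def reverse_geocode(lat, lon, regions=REGIONS):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


print(reverse_geocode(5, 5))    # Region A
print(reverse_geocode(5, 15))   # Region B
print(reverse_geocode(50, 50))  # None
```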

Cultural model

The cultural model is based on a set of geography-related properties taken from each article's corresponding Wikidata item.

We map each article to the corresponding Wikidata item
We record the value for the following properties (properties were selected as part of previous research):

P19	place of birth
P17	country
P27	country of citizenship
P495	country of origin
P131	located in the administrative territorial entity
P1532	country for sport
P3842	located in present-day administrative territorial entity
P361	part of
P1269	facet of
P183	endemic to

The mapped geographic entities are defined here. The cultural geography gap is currently published at the Wikimedia region level.

The features of the cultural geography model are useful for computing content gap metrics for intersections such as gender and geography, since few articles about people are associated with lat/lon coordinates. The code for this mapping is available here.
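The property collection step of the cultural model could look roughly like the sketch below. It assumes the same simplified item representation (a dict from property IDs to lists of value IDs) and a hypothetical lookup table, `ENTITY_TO_REGION`, from Wikidata entities to region labels. The region labels are placeholders, not actual wmf_region names, and the sketch does not resolve values such as a birthplace city up to a country.

```python
# Minimal sketch, assuming items are dicts {property_id: [value_id, ...]}.

# The geo-related properties listed above.
GEO_PROPERTIES = (
    "P19", "P17", "P27", "P495", "P131",
    "P1532", "P3842", "P361", "P1269", "P183",
)

# Hypothetical lookup; the labels are placeholders for illustration.
ENTITY_TO_REGION = {
    "Q142": "example-region-1",
    "Q183": "example-region-1",
    "Q17": "example-region-2",
}


def cultural_regions(claims, entity_to_region=ENTITY_TO_REGION):
    """Collect the regions reachable through any geo-related property."""
    regions = set()
    for prop in GEO_PROPERTIES:
        for value in claims.get(prop, []):
            region = entity_to_region.get(value)
            if region is not None:  # values missing from the lookup are skipped
                regions.add(region)
    return regions


# Toy item: citizenship and country for sport map to different regions.
person = {"P31": ["Q5"], "P27": ["Q142"], "P1532": ["Q17"]}
print(sorted(cultural_regions(person)))
```

Using a set means an item can belong to several regions at once, which fits properties such as country of citizenship that may hold multiple values.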

Time Gap

For the time gap, we need to map articles to the specific points in time they refer to. We do this by collecting a set of time-related properties from their corresponding Wikidata items.

We map each article to the corresponding Wikidata item
We record the value for the following properties (properties were selected as part of previous research):

P569	date of birth
P570	date of death
P571	inception
P575	time of discovery or invention
P576	dissolved, abolished or demolished
P577	publication date
P580	start time
P582	end time
P813	retrieved
P1191	date of first performance
P1249	time of earliest written record
P1619	date of official opening

We convert the results into numerical values corresponding to the years that the Wikidata item covers. The values are not restricted to any range and can refer to any point in time.

Code for this mapping is in here.
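The year conversion can be sketched as follows, assuming time values arrive as Wikidata-style timestamp strings with a leading sign, such as `+1867-11-07T00:00:00Z`. The function names and the simplified item representation (a dict from property IDs to lists of raw time strings) are assumptions made for this example.

```python
import re

# The time-related properties listed above.
TIME_PROPERTIES = (
    "P569", "P570", "P571", "P575", "P576", "P577",
    "P580", "P582", "P813", "P1191", "P1249", "P1619",
)

# Leading sign, then the year digits up to the first hyphen.
_TIME_RE = re.compile(r"^([+-])(\d+)-")


def wikidata_time_to_year(value):
    """Extract a signed integer year from a timestamp string, else None."""
    match = _TIME_RE.match(value)
    if match is None:
        return None
    year = int(match.group(2))
    return -year if match.group(1) == "-" else year


def item_years(claims):
    """Return the sorted, de-duplicated years found in the time properties."""
    years = set()
    for prop in TIME_PROPERTIES:
        for raw in claims.get(prop, []):
            year = wikidata_time_to_year(raw)
            if year is not None:
                years.add(year)
    return sorted(years)


# Toy item with a date of birth and a date of death.
person = {
    "P569": ["+1867-11-07T00:00:00Z"],
    "P570": ["+1934-07-04T00:00:00Z"],
}
print(item_years(person))                              # [1867, 1934]
print(wikidata_time_to_year("-0044-03-15T00:00:00Z"))  # -44
```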

Multimedia Gap

Here we map each article to a binary category [illustrated | unillustrated] based on whether the article has images or not. We do this using the current snapshot of the imagelinks table as stored in the wmf_raw database in our Data Lake.

For each article, we use the imagelinks table to check whether it has one or more images associated with it
We check that the article's images are not icons, using a heuristic based on the number of times the image appears in the corresponding wiki (more in this task)
If at least one image is not an icon, the article is marked as "illustrated"; otherwise it is marked as "unillustrated".

The code for computing this mapping can be found here.
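The classification steps above can be sketched as below. The sketch assumes that an image counts as an icon when it appears on at least a threshold number of pages in the wiki. Both the direction of that rule and the value of `ICON_THRESHOLD` are assumptions made for this example, since the actual heuristic is only described in the linked task.

```python
# Minimal sketch of the illustrated / unillustrated classification.

# Hypothetical cutoff: images used on this many pages or more are icons.
ICON_THRESHOLD = 100


def illustration_class(article_images, image_usage_counts,
                       icon_threshold=ICON_THRESHOLD):
    """Classify an article from its image list and wiki-wide usage counts."""
    for image in article_images:
        # An image with no usage entry is treated as rarely used.
        if image_usage_counts.get(image, 0) < icon_threshold:
            return "illustrated"  # at least one non-icon image
    return "unillustrated"


# Toy usage counts, e.g. derived from an imagelinks-style table.
usage = {"Flag_icon.svg": 5000, "Portrait.jpg": 3}

print(illustration_class(["Flag_icon.svg", "Portrait.jpg"], usage))  # illustrated
print(illustration_class(["Flag_icon.svg"], usage))                  # unillustrated
print(illustration_class([], usage))                                 # unillustrated
```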