Research:Knowledge Gaps Index/Measurement/Content Mapping
This page contains the list of information extraction methods to map Wikipedia articles to categories relevant to knowledge gaps.
Gender Gap
For the gender gap, we need to map articles of people (biographies) to their corresponding gender. To do this, we take available values from Wikidata.
- We map each article to the corresponding Wikidata item
- We filter out items that are not about humans, by checking that the P31 (instance of) property takes the value Q5 (human)
- We record the value of the property P21 (sex or gender) as the gender class. The class can take 40 different values, as listed in this file.
Code for this mapping is in here.
Sexual Orientation Gap
For the sexual orientation gap, we need to map articles of people (biographies) to their corresponding sexual orientation. To do this, rather than resorting to detection frameworks or external sources that could introduce bias, we take available values from Wikidata.
- We map each article to the corresponding Wikidata item
- We filter out items that are not about humans, by checking that the P31 (instance of) property takes the value Q5 (human)
- We record the value of the property P91 (sexual orientation) as the sexual orientation class. The class can take 40 different values, as listed in this file.
Code for this mapping is in here.
Geographic Gap
For the geographic gap, we need to map articles to the specific geographies they refer to. Two different approaches have been implemented:
Geospatial model
The geospatial model uses lat/lon coordinates defined by the property P625 (coordinate location) to reverse geocode to a geographical entity. The content gap metrics are published on the country and the wmf_region level. The code for this mapping is in here.
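To make the reverse-geocoding step concrete, here is a minimal sketch that assigns a lat/lon pair to a region with a point-in-polygon (ray casting) test. It is an illustration under assumptions, not the method used by the linked code: the region polygons are toy rectangles, and `point_in_polygon`, `reverse_geocode` and `TOY_REGIONS` are hypothetical names. A real pipeline would use actual country boundaries and a geospatial library.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(lon: float, lat: float, polygon: List[Point]) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def reverse_geocode(lat: float, lon: float,
                    regions: Dict[str, List[Point]]) -> Optional[str]:
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Toy regions (assumption): two adjacent rectangles, not real boundaries.
TOY_REGIONS = {
    "Country A": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)],
    "Country B": [(10.0, 0.0), (20.0, 0.0), (20.0, 10.0), (10.0, 10.0)],
}

print(reverse_geocode(5.0, 5.0, TOY_REGIONS))
print(reverse_geocode(5.0, 15.0, TOY_REGIONS))
print(reverse_geocode(50.0, 50.0, TOY_REGIONS))
```

Points that fall in no polygon (for example, in open sea) return `None`, so such articles can be counted separately rather than forced into a region.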
Cultural model
The cultural model is based on a set of geo-related properties from their corresponding Wikidata items.
- We map each article to the corresponding Wikidata item
- We record the value for the following properties (properties were selected as part of previous research):
  - P19 (place of birth)
  - P17 (country)
  - P27 (country of citizenship)
  - P495 (country of origin)
  - P131 (located in the administrative territorial entity)
  - P1532 (country for sport)
  - P3842 (located in present-day administrative territorial entity)
  - P361 (part of)
  - P1269 (facet of)
  - P183 (endemic to)
- The mapped geographic entities are defined here. The cultural geography gap is currently published at the Wikimedia region level.
The features for the cultural geography model are useful to compute content gap metrics for intersections between e.g. gender and geography, as there are few articles about people that are associated with lat/lon coordinates. Code for this mapping is in here.
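A minimal sketch of the cultural model's lookup, under stated assumptions: the claims of each item have already been reduced to a `{property: [QID, ...]}` dictionary, and `QID_TO_REGION` is a hypothetical stand-in for the mapped geographic entities referenced above, with placeholder region labels. The function name `cultural_regions` is also illustrative.

```python
from typing import Dict, List, Set

# Geo-related properties recorded by the cultural model.
GEO_PROPERTIES = [
    "P19", "P17", "P27", "P495", "P131",
    "P1532", "P3842", "P361", "P1269", "P183",
]

# Hypothetical lookup from Wikidata items to regions (placeholder labels).
QID_TO_REGION = {
    "Q142": "Region A",  # Q142 is France
    "Q183": "Region A",  # Q183 is Germany
}


def cultural_regions(claims: Dict[str, List[str]],
                     qid_to_region: Dict[str, str]) -> Set[str]:
    """Collect the regions reachable through any geo-related property."""
    regions = set()
    for prop in GEO_PROPERTIES:
        for qid in claims.get(prop, []):
            region = qid_to_region.get(qid)
            # Values missing from the lookup are ignored in this sketch.
            if region is not None:
                regions.add(region)
    return regions


# Toy biography: citizenship France, born in Paris (Q90, not in the toy
# lookup), plus a non-geographic property that is never consulted.
claims = {"P27": ["Q142"], "P19": ["Q90"], "P106": ["Q36180"]}
print(cultural_regions(claims, QID_TO_REGION))
```

Returning a set reflects that one article can be associated with several geographies at once, which is what makes intersections such as gender by geography possible for biographies without coordinates.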
Time Gap
For the time gap, we need to map articles to the specific points in time they refer to. We do this by collecting a set of time-related properties from their corresponding Wikidata items.
- We map each article to the corresponding Wikidata item
- We record the value for the following properties (properties were selected as part of previous research):
  - P569 (date of birth)
  - P570 (date of death)
  - P571 (inception)
  - P575 (time of discovery or invention)
  - P576 (dissolved, abolished or demolished)
  - P577 (publication date)
  - P580 (start time)
  - P582 (end time)
  - P813 (retrieved)
  - P1191 (date of first performance)
  - P1249 (time of earliest written record)
  - P1619 (date of official opening)
- We convert the results into numerical values corresponding to the years the Wikidata item is covering. Values can take any point in time.
Code for this mapping is in here.
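The conversion to years can be sketched as below. The sketch assumes time values arrive as Wikidata-style signed timestamp strings such as `+1952-03-11T00:00:00Z`, already grouped into a `{property: [timestamp, ...]}` dictionary; the names `wikidata_time_to_year` and `item_years` are hypothetical and not taken from the linked code.

```python
import re
from typing import Dict, List, Optional

# Time-related properties recorded for the time gap.
TIME_PROPERTIES = [
    "P569", "P570", "P571", "P575", "P576", "P577",
    "P580", "P582", "P813", "P1191", "P1249", "P1619",
]

# A leading sign, then the year digits, up to the first date separator.
_YEAR_RE = re.compile(r"^([+-])(\d+)-")


def wikidata_time_to_year(time_string: str) -> Optional[int]:
    """Extract the signed year from a Wikidata-style timestamp string."""
    match = _YEAR_RE.match(time_string)
    if match is None:
        return None
    sign, digits = match.groups()
    year = int(digits)
    return -year if sign == "-" else year


def item_years(claims: Dict[str, List[str]]) -> List[int]:
    """Return the sorted, de-duplicated years found across all time properties."""
    years = set()
    for prop in TIME_PROPERTIES:
        for time_string in claims.get(prop, []):
            year = wikidata_time_to_year(time_string)
            if year is not None:
                years.add(year)
    return sorted(years)


claims = {
    "P569": ["+1952-03-11T00:00:00Z"],
    "P570": ["+2001-05-11T00:00:00Z"],
    "P571": ["-0500-00-00T00:00:00Z"],
}
print(item_years(claims))
```

Years with a leading minus sign are kept as negative integers, so values can fall at any point in time, as described above.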
Multimedia Gap
Here we map each article to a binary category [illustrated | unillustrated] based on whether the article has images or not. We do this based on the current snapshot of the imagelinks table as stored in the wmf_raw database in our Data Lake.
- For each article, we check whether it has one or more images associated with it, using the imagelinks table
- We check that the article images are not icons, using a heuristic based on the number of times the image appears in the corresponding wiki (more in this task)
- If there is at least one image that is not an icon, then the article is marked as "illustrated", or "unillustrated" otherwise.
The code for computing this mapping can be found here.
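The labelling logic can be sketched on plain Python data. This is an illustration under assumptions: image links are given as `(page_id, image_name)` pairs loosely modelled on the imagelinks table, and an image counts as an icon when it is used on more than `icon_threshold` pages. The threshold value and the function name `label_pages` are hypothetical; the source only states that the heuristic depends on how often an image appears in the wiki.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def label_pages(links: List[Tuple[int, str]],
                all_pages: Iterable[int],
                icon_threshold: int) -> Dict[int, str]:
    """Label each page "illustrated" or "unillustrated"."""
    # Number of distinct pages on which each image appears in this wiki.
    usage = Counter(image for _, image in set(links))
    labels = {page: "unillustrated" for page in all_pages}
    for page, image in links:
        # Images used on many pages are treated as icons and ignored.
        if page in labels and usage[image] <= icon_threshold:
            labels[page] = "illustrated"
    return labels


# Toy data: one portrait used once, one flag icon used on three pages.
links = [
    (1, "Portrait.jpg"),
    (1, "Flag_icon.svg"),
    (2, "Flag_icon.svg"),
    (3, "Flag_icon.svg"),
]
labels = label_pages(links, all_pages=[1, 2, 3, 4], icon_threshold=2)
print(labels)
```

Page 4 has no image links at all and pages 2 and 3 only carry the widely reused icon, so only page 1 ends up "illustrated" in this toy example.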