Research:Knowledge Gaps Index/Visualization/Use Cases

To understand how this tool would satisfy user needs, we enumerated a set of use cases. We identified five basic ones, which we structured according to the general and specific questions the user is trying to answer, the general data characteristics, and what the user might want to do next.

The use cases are “the current situation” and four comparisons: compare-past, compare-languages, compare-past-languages, and compare-subgroups. We briefly review each of the knowledge gaps index use cases for the gender gap.

who is using the tool: Thematic User Group organizers (e.g., Gender or LGBT+). For example, Camelia Boban (Italian Language Edition, Wikidonne), Ester Bonet (Viquidones), Kira Wisniewski (Art+Feminism), etc. Other users: Wikimedia Affiliates, WMF communication team, education team, researchers, etc.
general context and goal: interested in designing activities and reaching specific milestones in bridging the gender gap in one specific language edition or in language-based Wikimedia projects.

use case 1: current

general question: what is the current situation?

specific questions by priority:

what is the share of each gender category?
what is the most represented category?
what is the least represented category?
how dominant is the most represented category with respect to the others?
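
The four questions above reduce to simple arithmetic on the article counts per gender category. A minimal Python sketch, assuming the counts are already available as a dictionary; the function name, the category labels, and the numbers are invented for illustration and are not real data:

```python
def summarize_gap(counts):
    """Return shares (in %), the most and least represented categories,
    and how dominant the top category is relative to the runner-up."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no articles to summarize")
    shares = {cat: 100 * n / total for cat, n in counts.items()}
    ranked = sorted(counts, key=counts.get, reverse=True)
    most, least = ranked[0], ranked[-1]
    # Dominance: ratio between the largest and the second-largest category.
    runner_up = counts[ranked[1]] if len(ranked) > 1 else 0
    dominance = counts[most] / runner_up if runner_up else float("inf")
    return {"shares": shares, "most": most, "least": least, "dominance": dominance}


# Hypothetical counts for one language edition.
example = summarize_gap({"male": 800, "female": 190, "other": 10})
```

Here the dominance is expressed as a ratio between the two largest categories; a difference in percentage points would be an equally reasonable choice.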

steps to answer them: as simple as accessing the knowledge gaps index “gender gap” and “content” page.

gender-gap data characteristics:

* number of categories: 5-7 (mainly 2)

* color: gender categories

* type of graph: stacked-bars or treemap. for gender and looking at one language, stacked bars work fine. for geography (countries), treemap is a better option. → I would give the choice to the user here with stacked bar as default.

real data example: the data relative to:

* the “share of women in biographies in Italian Wikipedia” in the current time: a) number of articles, b) percentage.

e.g., https://humaniki.wmcloud.org/gender-by-language

e.g., https://wdo.wmcloud.org/gender_gap/

for example (Arabic):

main graph: stacked bars, treemap, barchart.

Gender Gap Stacked bar

table:

Gender Gap Colored Tables (Humaniki)

integrated: bar + table

Gender Gap (Denelezh)

We may want to allow the user to select stacked bars (ideal for gender), barchart (ideal for a distribution), or a treemap (easy to understand and visually attractive).

insights and interpretation

the interpretation is direct. you read the number of articles and the 16.149% share of women biographies, and you conclude that it is far from parity.

answering the rest of questions is straightforward given the limited number of categories.

additional paths / questions / use cases (by order of priority) → this is what you could think of at this point.

a) get a better understanding by comparing your X language edition to the past.

b) get a better understanding by comparing your X language edition to others in terms of “share of women” (or any other gender).

c) get a better understanding by comparing your X language edition to the past and to other language editions / or by comparing the X language edition to X sister projects.

d) get a better understanding by zooming into subgroups of biographies based on other gaps (e.g., geography and time).

e) get a better understanding by comparing it to a previously set “target” in number of articles / share.

For each of these additional paths/questions, we create a different use case.

use case 1a: compare-past

general question: are we making any progress?

specific questions by priority:

what is the “share” now compared to time_period (week, month, quarter, half a year, year, two years, five years,…)?
how many new articles have been created in X time_period?
what is the percentage increase/decrease in the share/in the number of new articles in each time_period?
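
These three questions can be answered from two snapshots of the article counts. The following is a rough sketch only: the function name, the dictionary layout, and the figures are assumptions made for the example.

```python
def compare_snapshots(past, now, category="female"):
    """Compare one gender category between two snapshots of article counts.
    Assumes the category already had at least one article in the past snapshot."""
    def share(counts):
        return 100 * counts[category] / sum(counts.values())

    new_articles = now[category] - past[category]
    return {
        "new_articles": new_articles,
        # Difference between the two shares, in percentage points.
        "share_change_points": share(now) - share(past),
        # Relative growth of the number of articles, in percent.
        "articles_change_pct": 100 * new_articles / past[category],
    }


# Hypothetical snapshots, e.g. one year apart.
past = {"male": 900, "female": 100}
now = {"male": 960, "female": 240}
change = compare_snapshots(past, now)
```

Note that a change in share is reported in percentage points, while the growth in the number of articles is a relative percentage; mixing the two in a summary is easy to misread.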

steps to answer them: if it is not visible in the same page, you should be able to access a page at one click.

you should be able to select time_period and the gender categories you want to see/hide (through a menu or in the same graph legend).

since x-axis is time, y-axis can be the number of articles (each gender as a color) or the share.

gender-gap data characteristics:

* number of categories: 5-7 (mainly 2)

* color: gender categories

* type of graph: stacked-bars -> choosing the graph type is not possible (or maybe only possible when comparing two periods, in which case the alternative would be the treemap).

input controls (input interface): (the choices and data the user introduces to visualize the comparison)

1. language & wikimedia project
2. gap (gender, etc.)
3. share / number of articles (to display in y-axis)
  1. default: share, as displayed in the graph.
4. time (x-axis).
  1. period of time aggregation (month, quarter, year):
    1. default: month
  2. frame: time-controls (you could limit to 6M, 1Y, 10Y, ALL). possibly in the graph, like in the next example.
    1. default: 5 years?
5. accumulated / created articles. usually, we get the accumulated at a specific point in time (use case 1 = present time). however, it is also valuable to see the number of created articles / share in the created articles for each category of the gap.
  1. default: accumulated.
6. select a specific gap-category in the barchart legend and exclude the rest.
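
The input controls above could be captured in a small configuration object with the listed defaults. This is only a sketch: the class name, field names, and string values are hypothetical, and the 5-year default frame is still an open question in the list above.

```python
from dataclasses import dataclass, field

# Allowed values per control (assumed spellings, taken from the list above).
ALLOWED = {
    "metric": {"share", "articles"},
    "aggregation": {"month", "quarter", "year"},
    "time_frame": {"6M", "1Y", "5Y", "10Y", "ALL"},
    "mode": {"accumulated", "created"},
}


@dataclass
class ComparePastControls:
    language: str                 # e.g. "it"
    project: str = "wikipedia"
    gap: str = "gender"
    metric: str = "share"         # shown on the y-axis
    aggregation: str = "month"
    time_frame: str = "5Y"        # tentative default ("5 years?")
    mode: str = "accumulated"
    hidden_categories: set = field(default_factory=set)

    def __post_init__(self):
        # Reject values that no control in the interface could produce.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} not in {sorted(allowed)}")


controls = ComparePastControls(language="it")
```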

real data example: the data relative to:

https://wdo.wmcloud.org/diversity_over_time/

* the “share of women in biographies” in each time_period.

if selected “share” (control 3) and “accumulated” (control 5), you get this:

* the number of women biographies in each time_period (for a time_window): possibly in the same graph or hover.

if selected number of articles, you get this:

This graph is not as good as the other one, because you can’t compare well the share of each category while you see growth.

these two graphs show the values “accumulated” at each point in time, but not the number of content “created” in each period of time.

one option could be to add another control for “accumulated” or “created”. another option could be to simply show the graph below this one.
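
The “created” values can be derived from the “accumulated” ones by differencing consecutive periods, so both views could be served from the same data. A minimal sketch with invented monthly figures (the function names are hypothetical):

```python
def created_per_period(accumulated):
    """Turn accumulated counts per period into counts created in each period.
    The first period has no predecessor, so it is omitted from the result."""
    return [later - earlier for earlier, later in zip(accumulated, accumulated[1:])]


def share_in_created(created_category, created_total):
    """Share (in %) of one category among the articles created in each period.
    Periods in which nothing was created yield None."""
    return [100 * c / t if t else None for c, t in zip(created_category, created_total)]


# Hypothetical accumulated monthly counts.
women_acc = [100, 110, 125]
all_acc = [1000, 1040, 1090]
women_new = created_per_period(women_acc)
all_new = created_per_period(all_acc)
new_share = share_in_created(women_new, all_new)
```

In this made-up example the share among newly created articles (25% and 30%) is well above the accumulated share (about 10-11%), which is the kind of signal the accumulated graph alone hides.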

You could see this in this last “Monthly Created Articles On Gender”:

* the increment in number of new articles in each time_period (for a time_window) for each category.

insights and interpretation:

Seeing the improvement in a time_period should be possible with the graph and a summary with values. when clicking a time_window of 5y, it should provide the variation (share difference) as text in a “summary” of points. You could also infer it from the graph, but it is much harder.

The summary would provide bulleted “headlines” ready to take away.

The summary answers the specific questions proposed before (what is the “share” of a specific category now compared to time_period (week, month, quarter, half a year, year, two years, five years,…)? how many new articles for this specific category compared to time_period? what is the percentage increase/decrease in the share/in the number of new articles in each time_period?).

when clicking or hovering on a bullet, it would highlight that particular element in the graph.

potential other users: WMF communication team, Affiliate ED, etc.

KEY: these other users might not play with the visualization but just read the bullets in the summary.

use case 1b: compare-languages

specific context:

some editors at the WikiIndaba conference want to compare how the African languages are doing in terms of gender gap. (other usual groups: CEE, Nordic Languages, Celtic Knot languages, etc.).

some gender gap affiliates like Wiki Loves Women have campaigns aimed at working on the gender gap in Wikiquote (SheSaid): “The Wiki Loves Women initiative was celebrating women leaders throughout late 2020 with the SheSaid drive. The drive was aimed at improving the visibility of women in creating new or improving already existing Wikiquote entries related to them.”

general question: are we as X Wikipedia doing better than Swahili Wikipedia in terms of covering women (i.e., share and number of articles)?

specific questions by priority:

what is the “share” of X Wikipedia compared to Spanish Wikipedia?
how many articles (women biographies) compared to Spanish Wikipedia?
which language covers more articles? which language covers less?
which language has the largest share of women among biographies?
which language has the smallest share?
how do X Wikipedia results compare to the X Sister projects?*

* The gender gap also makes sense in Wikiquote, Wikisource, Wikibooks, and Wikinews.
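
Most of the comparison questions above amount to sorting the selected languages by one value and reading the first and last rows. A short sketch, assuming each language comes with its number of women biographies and its total number of biographies; the language codes and figures are placeholders, not real data:

```python
def rank_languages(stats, key="share"):
    """stats maps a language code to (women_biographies, all_biographies).
    Returns one row per language, sorted by `key` in descending order."""
    rows = [
        {"lang": lang, "articles": women, "share": 100 * women / bios}
        for lang, (women, bios) in stats.items()
    ]
    return sorted(rows, key=lambda row: row[key], reverse=True)


# Placeholder language codes and invented figures.
stats = {"xx": (60_000, 300_000), "yy": (900, 3_000), "zz": (33_000, 170_000)}
by_share = rank_languages(stats)
by_articles = rank_languages(stats, key="articles")
```

The two rankings differ in this example: the smallest edition has the largest share, which is why the interface needs both sort options.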

steps to answer them: from the main page, you should be able to select one or more languages. having predefined groups could make it easier. languages usually compare with those from the same geography or with a similar number of articles.

gender-gap data characteristics:

* number of categories: 5-7 (mainly 2)

* color: gender categories

* type of graph: stacked bars, showing at most a fixed number of categories (10?).

input controls (input interface): (the choices and data the user introduces to visualize the comparison)

language/project-based choice:
1. choose a language, and then a group of languages, and finally, a project (e.g., Wikipedia).
2. choose a language, and then the projects (e.g., Wikipedia, Wikisource, Wikiquote, Wikibooks…).

when choosing a “group of languages”, we could give predefined groups (e.g., CEE, African, top 10 in number of articles, closer to the first choice, etc.)

gap (gender, etc.). at this point we may have already chosen it, but we may want to change it.
share / number of articles (to display in y or x-axis, depending on whether they are vertical or horizontal)
1. default: share, as displayed in the graph.
sort by any of the values (lang. alphabetical, number of articles, % of...etc.)
graph type:
1. when comparing two languages within the same project, or two wikimedia projects of the same language, we could offer the treemap, as it may be a good option to see the categories. by default: stacked bars.

real data example:

* the “share of women in biographies” in each language edition (table or stacked bars with all languages):

https://denelezh.wmcloud.org/gender-gap/?sort=label#project

insights and interpretation: seeing the stacked-bars with the different gender categories (not proportional to the number of articles but occupying the entire bar) or the numbers in the data columns should be easy.

functions like sorting languages should be helpful to see the first or the last.

the summary would provide the “headlines” that anyone might want to take away. e.g.,

* Spanish Wikipedia (the selected lang.) has X art. Y% more than the following one.

* Among the Top 10 largest languages (art.) Swedish has the largest share of gender biographies.

* …

other: we discarded the use case “coverage”, which wanted to answer questions such as:

how well does X Wikipedia cover all the women in Wikidata?

how well does X Wikipedia cover all the women in Wikidata with at least one article in one language edition?

how well does X Wikipedia cover all the women that exist in English Wikipedia?

...

the user might enter the gender content gap and see the percentage of share and number of women in X Wikipedia. to understand what it means, she might compare with the past/other languages, but also with the existing women biographies in total.

solution: in this use case 1b compare-languages, we may want to facilitate adding these two entities “wikidata” and “all wikidata items with a wikipedia article”. hence, it would be easy to compare the number of women in (for example) Spanish Wikipedia to the total number of women biographies in Wikidata. we forget about a dedicated use-case, but we can have this baseline at reach.

use case 1c: compare-past-languages

general question: how has the coverage of this gap-category (e.g., women) evolved over time in these X,Y,Z languages?

specific questions by priority:

which language has decreased the gender gap more effectively?

which language has created more women biographies during this period of time?

which language has been more effective in raising the share of women in biographies this year?

how many women biographies have been created in the past time_period (week, month, etc.).
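
These questions compare the change between the first and the last snapshot of a period across languages. A sketch under the assumption that each language has a time-ordered list of (women biographies, all biographies) snapshots; the codes and numbers are invented:

```python
def progress_by_language(series):
    """series maps a language to a time-ordered list of
    (women_biographies, all_biographies) snapshots."""
    result = {}
    for lang, snapshots in series.items():
        (w0, b0), (w1, b1) = snapshots[0], snapshots[-1]
        result[lang] = {
            "created": w1 - w0,
            # Change of the share of women, in percentage points.
            "share_gain_points": 100 * w1 / b1 - 100 * w0 / b0,
        }
    return result


# Placeholder language codes and invented snapshots.
series = {
    "xx": [(100, 1000), (180, 1200)],
    "yy": [(400, 4000), (500, 4500)],
}
progress = progress_by_language(series)
most_created = max(progress, key=lambda lang: progress[lang]["created"])
largest_gain = max(progress, key=lambda lang: progress[lang]["share_gain_points"])
```

The language that created the most women biographies is not necessarily the one that raised its share the most, so the two questions need separate answers in the summary.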

steps to answer them: the user might need to select the languages and the specific category within gender (possibly women). then select whether she or he wants to see the entire history or the latest time_period.

gender-gap data characteristics:

number of categories: 1 gender category, multiple languages or 1 language, multiple gender categories.

type of graph: line-chart

x-axis: time (monthly)

y-axis: percentage of women in all articles

color: languages or gap-categories

input controls (input interface): (the choices and data the user introduces to visualize the comparison)

choosing 1 gender-category and multiple languages or 1 language and multiple gender categories.
1. choice 1: select a group of languages or select each language manually.
2. choice 2: select all gender-categories or select each gender category manually.
gap (gender, etc.)
share / number of articles (to display in y-axis)
1. default: share, as displayed in the graph.
time (x-axis).
1. period of time aggregation (month, quarter, year):
  1. default:
2. frame: time-controls (you could limit to 6M, 1Y, 10Y, ALL). possibly in the graph, like in the next example.
  1. default: 5 years?
accumulated / created articles. usually, we get the accumulated at a specific point in time (use case 1 = present time). however, it is also valuable to see number of created articles / share in the created articles for each category of the gap.
1. default: accumulated.
select a specific gap-category in the barchart legend and exclude the rest.

real data example: the data relative to:

* the “share of women in all Wikipedia articles” (it should be in bios, but I couldn’t find it) in three language editions over time

https://wdo.wmcloud.org/diversity_over_time/

Here we can see the “accumulated” graph and, below it, the “created” one; however, for some reason the second one does not show up, as there might be a problem with the data.

insights and interpretation:

comparing languages and a single or multiple points in time requires selecting one single specific category in the gender gap (normally women).

using the line-chart you can compare up to 10 languages and see the evolution over time. with the scatterplot you can compare more languages, but you need to assign a color to each language and choose one specific point in time.

likewise, the summary would provide the “headlines” that anyone might want to take away. e.g.,

this month, Spanish Wikipedia is the first among the selected languages in terms of women biographies (30%).
in the past five years, Catalan Wikipedia has bridged 5% of the gender gap while Italian has only improved by 2%.
...

use case 1d: compare-subgroups-gaps

context and goal:

Wiki Loves Women works on the gender inequality in Africa. They use Wikidata to trace the number of biographies in each country (project Mind the Gap).

Asian Month is a contest organized to create articles about Asia. They invite many language edition communities to participate. There is a subcontest organized by Wikidonne to create articles about Asian women.

These two initiatives are aimed at reducing the gender gap within a specific geography.

Other Gender-gap oriented affiliates and Wikiprojects have projects aimed at reducing the gender gap in specific topics or periods of time (centuries).

general question: what is the gender gap (i.e. women coverage) in articles related to this other gap?

specific questions by priority:

what is the share of the gender gap in this country in X Wikipedia?
what is the share of the gender gap in this century in X Wikipedia?
what is the share of the gender gap in this language’s local content in X Wikipedia?
what is the share of the gender gap in X gap category compared to the general gender gap?

steps to answer them:

the user should easily find a button to access an interface where she can select the secondary gap. the secondary gap that is “compatible” with gender (geography, time or local content) has different categories.

at this point, the user must select one single Wikimedia project (X Wikipedia language edition) or one single secondary gap category (e.g., geography-“cameroon”, time-”1910-1920’s”, localcontent-”Catalan”).

In the first case, she will see the gender gap in X Wikipedia language for all the secondary gap categories, while in the second case, she will see the gender gap for all Wikipedia language editions and for a single secondary gap category.

gender-gap data characteristics:

* number of categories: gender gap has 5-7 (mainly 2)

* type of graph: horizontal stacked-bars

* x-axis: percentage of articles

* y-axis: project snapshot

* color: gender categories

here you have to decide:

option 1: one language, gap1 (geography, all categories as rows), gap2 (gender, all categories in the graph).
option 2: multiple languages (as rows), gap1 (one category: geography: nigeria), gap2 (gender, all categories in the graph).
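
Both options can be seen as the same grouping with a different field fixed: option 1 fixes the language and uses the secondary gap categories as rows, while option 2 fixes one secondary gap category and uses the languages as rows. A sketch with a hypothetical record layout and invented numbers:

```python
from collections import defaultdict


def gender_shares_by_row(records, row_field, **fixed):
    """records: dicts with keys 'lang', 'country', 'gender', 'articles'.
    Keep the records matching `fixed`, group them by `row_field`, and
    return the gender shares (in %) within each row."""
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        if all(rec[k] == v for k, v in fixed.items()):
            counts[rec[row_field]][rec["gender"]] += rec["articles"]
    return {
        row: {g: 100 * n / sum(genders.values()) for g, n in genders.items()}
        for row, genders in counts.items()
    }


# Placeholder languages and countries, invented counts.
records = [
    {"lang": "xx", "country": "A", "gender": "female", "articles": 20},
    {"lang": "xx", "country": "A", "gender": "male", "articles": 80},
    {"lang": "xx", "country": "B", "gender": "female", "articles": 30},
    {"lang": "xx", "country": "B", "gender": "male", "articles": 70},
    {"lang": "yy", "country": "A", "gender": "female", "articles": 10},
    {"lang": "yy", "country": "A", "gender": "male", "articles": 30},
]
option_1 = gender_shares_by_row(records, "country", lang="xx")
option_2 = gender_shares_by_row(records, "lang", country="A")
```

Each row of the result corresponds to one horizontal stacked bar, with the gender categories as its colored segments.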

input controls (input interface): (the choices and data the user introduces to visualize the comparison)

language or languages
gap 1 (all categories or one single category, in case you chose multiple languages)
gap 2
share / number of articles (to display in y-axis)

default: share, as displayed in the graph.

real data example: the data relative to:

one single project (Wikidata) and all the categories within the secondary gap. these are two examples (geography and time):

1 project (wikidata), gender (all categories), geography (one category: country):

1 project (wikidata), gender (all categories), time (one category: decade of birth):

insights and interpretation:

this comparison is quite straightforward. the summary can highlight and bullet different aspects:

the category of the secondary gap in which the category of the first gap (e.g. women) has the highest share. e.g., “The gender gap for Japanese biographies is 28.184%”. In case the user has chosen one single category of the secondary gap, then it would be something like “Portuguese Wikipedia is the language edition which shows the greatest share of women biographies in the 1990s-2000 (30%)”.
...

Extra functionality: distance-target

context and goal:

Catalan Wikipedia editors have created the Wikiproject 30,000 women. They expected to reach this goal by October 15th 2020, and they did. In this Wikiproject page you can see that right now there are 33,564 women (19.29% of biographies). This is done with the help of a service querying Wikidata and a bot updating this Catalan Wikipedia page.

Catalan editors would like to set new targets and check them in the website.

specific questions by priority:

How many articles (women biographies) do we need until we get Y in X Wikipedia?
When will we get to that number of articles / share at the current pace?
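
Both questions have a closed-form answer if we make simplifying assumptions. For a target share $Y$, with $w$ women biographies out of $b$ biographies, and assuming that only women biographies are added, solving $(w + n)/(b + n) = Y$ gives $n = (Y b - w)/(1 - Y)$. The sketch below uses that formula and a constant monthly pace; the function names and the numbers are illustrative assumptions:

```python
import math


def articles_needed_for_count(women, target_count):
    """New women biographies needed to reach an absolute target."""
    return max(0, target_count - women)


def articles_needed_for_share(women, biographies, target_share):
    """New women biographies needed to reach target_share (a fraction in (0, 1)),
    assuming no other biographies are created in the meantime."""
    n = (target_share * biographies - women) / (1 - target_share)
    return max(0, math.ceil(n))


def months_to_target(articles_needed, created_per_month):
    """Months until the target at a constant pace, or None if never reached."""
    if articles_needed == 0:
        return 0
    if created_per_month <= 0:
        return None
    return math.ceil(articles_needed / created_per_month)


# Hypothetical edition: 200 women out of 1000 biographies, target share 25%.
needed = articles_needed_for_share(200, 1000, 0.25)
eta_months = months_to_target(needed, created_per_month=10)
```

In practice other biographies keep being created as well, so the real number needed for a share target would be higher; the estimate is a lower bound.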

steps to answer them:

Users should be able to set targets and compare the different language editions to them. The targets could be represented visually within the same graphs as horizontal lines. They could be signalled in the URL as a parameter. They could also be set as an e-mail alert, so that the user receives an e-mail when the target is reached or close to being reached.
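
Signalling a target in the URL could look roughly like this. The parameter name `target_share` and the example address are hypothetical; nothing here reflects an existing interface.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def add_target(url, target_share):
    """Append a hypothetical `target_share` query parameter to a page URL.
    Assumes the URL has no fragment (#...) part."""
    separator = "&" if urlparse(url).query else "?"
    return url + separator + urlencode({"target_share": target_share})


def read_target(url):
    """Return the target share from the URL as a float, or None if absent."""
    values = parse_qs(urlparse(url).query).get("target_share")
    return float(values[0]) if values else None


# Placeholder address; the real tool URL is not defined yet.
link = add_target("https://example.org/gender_gap/?lang=ca", 25)
target = read_target(link)
```

Keeping the target in the URL means a link shared on social media would show the same horizontal line to everyone who opens it.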

real data example:

There is none.

Gender User Groups celebrate increasing the % of women biographies in social media (let’s make it easy to share).