Research:Language-Agnostic Topic Classification/V2 Focus Groups
The language-agnostic Liftwing topic models create a foundation for many different interfaces and evaluations. They allow us to evaluate which topics are covered across all the language Wikipedias, and they have both direct and indirect effects on how the Wikimedia movement covers different topics, especially those relevant to Wikimedia communities' work on knowledge gaps.
Wikimedia organizers have consistently noticed gaps in the way in which the models cover topics. As part of a process to improve the language agnostic topical model we completed a series of four conversations about how to improve the models. We included community experts in subject matter domains that correspond heavily with topics that we know are important to organizing and Topics for Impact in the Wikimedia community: Sustainability + Biodiversity, Art + Culture, Gender, and Marginalized knowledge.
The consultation process was designed to bring subject matter expertise to the changes that Isaac Johnson is implementing to the models. Proposed changes ranged from incremental changes in how we grouped WikiProjects for our training data to larger changes in the set of labels we used and longer-term feedback about what topic-related supports are valuable in these spaces.
Methodology
We completed 4 workshops, led by User:Astinson (WMF) and User:Isaac (WMF) and supported by colleagues User:GFontenelle (WMF), User:MMulaudzi-WMF and User:BKurgat-WMF.
Each workshop included ~2 hours of conversation, focused on a diverge and converge methodology -- allowing folks to highlight challenges and opportunities based on a comparison of the current Liftwing model and a version 1 prototype of an update that moved some WikiProjects and added some non-WikiProject data to the predictions.
After the consultation, Isaac is working on a V2 which will be documented below, and the updated prototype web app now reflects some of the changes.
We think that this process was very constructive for building greater understanding of the strategic choices we should be making with the machine learning model. We recommend that future teams considering revisions to machine learning tools adopt a consultation process that factors in affected communities.
Common and actionable findings across workshops
These are observations from across all of the conversations:
- The changes to the model that better reflected occupation, biography, country and gender were universally appreciated.
- Chronology/Time is needed for multiple domains for filtering of content, and measurement of representation of different kinds of cultural content (see below discussions and solutions for more detail).
- For several of the domains, country filtering is useful, but we still need the ability to see “regional” labels at some level. We may or may not be able to implement this in all interfaces.
- A non-binary approach to Gender Studies content needs to be reflected in the model.
- There was widespread appreciation for the model being more accurate, along with an acknowledged need to know how to fix bad predictions by improving content categories, Wikidata or links.
- Multiple domains suggested a need to do intersections of topics with AND statements to get the most from the model -- especially specific areas like Medicinal plants or food from a specific country. Existing interfaces only have the OR function.
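The difference between the OR filtering that existing interfaces offer and the AND intersections that participants asked for can be illustrated with a small sketch. The article names, topic labels, scores and the 0.5 threshold below are all hypothetical; they are not output from the Liftwing model, and `filter_articles` is an illustrative helper rather than part of any existing tool.

```python
from typing import Dict, Iterable, List

# Hypothetical per-article topic scores (illustrative only, not real model output).
PREDICTIONS: Dict[str, Dict[str, float]] = {
    "Article A": {"Biology": 0.91, "Medicine & Health": 0.84},
    "Article B": {"Biology": 0.88, "Food and drink": 0.35},
    "Article C": {"Medicine & Health": 0.77},
}


def filter_articles(
    predictions: Dict[str, Dict[str, float]],
    topics: Iterable[str],
    mode: str = "or",
    threshold: float = 0.5,
) -> List[str]:
    """Return the articles matching the requested topics.

    mode="or"  keeps an article if ANY requested topic meets the threshold.
    mode="and" keeps an article only if ALL requested topics meet it.
    """
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    wanted = list(topics)
    if not wanted:
        return []
    combine = all if mode == "and" else any
    return [
        title
        for title, scores in predictions.items()
        # A topic missing from an article's scores counts as 0.0.
        if combine(scores.get(topic, 0.0) >= threshold for topic in wanted)
    ]


requested = ["Biology", "Medicine & Health"]
print(filter_articles(PREDICTIONS, requested, mode="or"))
# ['Article A', 'Article B', 'Article C']
print(filter_articles(PREDICTIONS, requested, mode="and"))
# ['Article A']
```

With OR, every article touching either topic is returned; with AND, only the article that sits in both topics survives, which is the behaviour needed for narrow intersections such as medicinal plants.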
Top level research questions
Following the conversations, User:Isaac (WMF) and User:Astinson (WMF) identified the following further research questions:
- For multiple domains, there is a concern that the cultural context of content is missing (especially when that cultural context does not reflect national identity), but the best data for this often lives in categories and we don’t have a clear modeling of categories that would make this accessible. We think there might be some further research to be done to extend previous work on cultural context. We also are optimistic that some of this work can be done with Community Defined Topic lists.
- How do we refine the time units so that they are useful across different domains of knowledge? The complexity often differs depending on the kind of article (e.g. history vs. modern biographies vs. art vs. pre-history & anthropology).
- We think the link, category and Wikidata information about Gender Studies is not robust enough to be of great use in the model. It is likely worth organizers in the Gender space experimenting with tactics to enrich this data.
Model changes based on the Workshops
Updated March 21, 2025
- Incremental:
- WikiProject Comedy moved from Performing Arts -> Film+Television per recommendation from the Visual Arts group.
- WikiProject Craft moved from Entertainment -> Visual Arts per recommendation from Visual Arts group.
- WikiProject Comedy moved from Performing Arts -> Film as feedback is that it's overwhelmingly about TV/Films.
- Removed WikiProject Skepticism. A participant pointed out that it may be more related to Medicine than to Society, but a deeper look showed that it spans too many topics to be a useful source of training data for the model.
- We removed some of the country-specific WikiProjects from larger topics (e.g., Persian History from the generic History label). In practice, we've found that using these region- or identity-specific WikiProjects within a more generic category can lead to the model overfitting to that region or identity as opposed to the broader topic.
- We use various occupation values on Wikidata as further signals of topics for people. For instance, a biography of a person who is a Journalist (or any subclasses like Photojournalist) on Wikidata would now trigger the Journalism topic.
- Taxonomy changes:
- We are testing out a new Gender Studies topic -- in the current prototype we tried combining Feminism, Women’s Studies and the LGBTQ studies WikiProjects as training data for the model. Performance is currently lower than other topics but we are digging deeper to understand why and whether this is acceptable for now.
- Biographies – we now explicitly use Wikidata as the source for the Biography topic and Gender breakdown (man, woman, non-binary). The prototype now also allows for multiple values for gender. For example, transgender individuals can be found in their appropriate gender identity AND non-binary category. We also skip deprecated claims from Wikidata on gender to make sure that we are not misidentifying individuals who have transitioned to another identity.
- We began experimenting with labeling the time with a starting date and an end date for a topic using several Wikidata properties (explicit Wikidata properties which have time concepts). The idea would be to make a User Experience recommendation to design either a time slider or other time grouping (i.e. century or decade) that could be intersected with other properties. We are currently exploring the feasibility of this approach as it's different from our other, more-discrete topic labels.
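As a rough illustration of the start-date and end-date idea, the sketch below shows how a time span could be grouped into centuries and tested against a slider window. The years are hypothetical, the function names are ours, and nothing here reflects how the prototype actually derives or stores dates from Wikidata. It only handles positive (CE) years, so prehistoric spans would need a different unit.

```python
from typing import List, Tuple


def century_of(year: int) -> int:
    """Return the 1-based century of a CE year (1901-2000 -> 20)."""
    if year < 1:
        raise ValueError("only positive (CE) years are handled in this sketch")
    return (year - 1) // 100 + 1


def centuries_covered(start_year: int, end_year: int) -> List[int]:
    """List every century touched by the span [start_year, end_year]."""
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    return list(range(century_of(start_year), century_of(end_year) + 1))


def overlaps(span: Tuple[int, int], window: Tuple[int, int]) -> bool:
    """True if two inclusive (start, end) year ranges share at least one year."""
    return span[0] <= window[1] and window[0] <= span[1]


# Hypothetical topic span running from 1880 to 1945.
span = (1880, 1945)
print(centuries_covered(*span))      # [19, 20]
print(overlaps(span, (1940, 1960)))  # True
print(overlaps(span, (1950, 1960)))  # False
```

A century or decade grouping gives discrete buckets that could be intersected with other labels, while the overlap test corresponds to a slider selecting an arbitrary range of years.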
Key takeaways by workshop
Sustainability and Biodiversity
The first workshop focused on Sustainability and Biodiversity. We invited ~30 community members, and ~12 participated in part or all of the workshop. We chose sustainability and biodiversity because of the community of practice forming around topics like the Biodiversity Heritage Library and the communities of practice emerging from WikiForHumanRights.
Conclusions
- There was general positive feedback on the presence of country and taxon in the model.
- There was a repeat acknowledgment of the importance of AND and OR type intersections of labels.
- There was some general feedback that it would still be useful to give people using tools the ability to group countries by region at some level.
- Most of the feedback focused on the Species taxonomies:
- We are missing microbiology and microscopic species. There was not a strong need to have this available in the model.
- There were strong suggestions that we should use taxa to provide deeper exploration of the taxon hierarchy.
- There was a strong desire to capture something related to “extinct, invasive and historical”. Concepts related to time came up in several parts of the conversation.
- Some of these needs would be better served by a taxon-specific browsing tool.
- There were a few other specific properties recommended for taxon range: endemic to, invasive to, and country-specific identifiers.
- Generally, there is a need to cover something like the “cultural context” of species and nature (e.g. foods). Specific examples were the cultural uses of species and traditional environmental knowledge (TEK), but also the connections between species and human-led problems in their environment (e.g. invasive species).
- We didn’t get as much specific feedback on the Sustainability labels at the depth that we hoped.
- The overall format of the workshop was welcomed -- folks seem to have been able to engage in a productive way, while also learning something from others.
Things we would like to explore:
- Explore if we can layer down the taxonomy for Animal and Plant kingdoms (relevant unit was mammal, insect, bird and reptile) -- we are not sure how good the data is on Wikidata for plants, and we have questions about how granular to implement such a change.
- Investigate whether the microbiology/microscopic species information should be its own label.
- Can we create time units relevant to prehistoric species?
Further research questions
- How can we work on improving the connection of cultural context and traditional environmental knowledge within the model from Wikidata and categories? (this is similar to the finding from the other
Ethnic Identity and Indigenous Knowledge
Broadly, one of the most widely understood parts of working on Knowledge Equity is the representation of indigenous knowledge and of specific knowledge related to ethnic communities. Also, when thinking about the impacts of machine learning models, there tends to be a negative impact on groups whose histories and cultures have been marginalized. We wanted to make sure that we understood the potential impacts of our decisions and mitigated potential biases. We invited ~15 participants, with 8 attending.
Main findings
- The national borders label may be erasing some of the diversity that is inherent in specific topics where something like “North America” may be appropriate, especially for indigenous peoples or pre-colonial structures.
- There is some need to learn (in the interface for the use of the model) which sources of data were driving the labeling - especially if we know that parts of the taxonomy are colonial, or inappropriate data modeling that could be corrected with the community.
- There was a concern that marginalized cultures and identities could be missed or not included in the labels that get surfaced in front of readers -- there was a hope that we would rely more on category data which provides more robust taxonomies on marginalized topics.
- There were consistent concerns about how some articles contain multiple topics or domains and sometimes the Wikidata data overreaches in defining the scope of concepts that it covers.
- Concerns that the interface of Wikipedia continues to deprioritize categories and other metadata that is important for community curation of marginalized knowledge.
- Important cultural topics such as food, arts and medicine are often intersectional and more local than our current labeling.
Things we would like to explore:
- There is consensus not to implement an "indigenous knowledge or marginalized groups" topic.
- Can we improve the interface guidance where the model is used so that each prediction is attributed to the piece of data that drove it, making it easier to fix mislabeling?
- There is some value in region labeling (e.g. North America for indigenous communities that extend beyond national borders).
Research Questions
- Going forward, we suggest that researchers looking at our data spend more time evaluating what we could learn about cultural knowledge that is not neatly “labeled” in Wikidata but lives in other sources of data, like categories.
- For the community, we could use more documented best practices for data modeling improvements.
Gender
Since the Wikimedia Foundation implemented the ORES model, one of the major critiques has been the over-reliance of the “Gender” topic on women’s biographies. This has a number of unintended consequences when measuring topics like the gender gap. Creating a clear direction for the Gender topic area was a priority in the consultation. We invited 20 participants, with 9 attending.
Main findings
- Folks did not express any concern over the use of Wikidata as a label for gender of biographies -- and are glad that we are no longer guessing based on inference.
- There was general positivity about how professions were being more equitably represented in the model -- ensuring that the work of women is attached to the larger domains of knowledge that they were connected to.
- Art+Feminism and others expressed interest in continuing the conversations especially on labeling a broader Gender studies label, that includes more perspectives on gender.
- Being able to implement intersectional “AND” filters was important for several of the topics that folks brought to the area, such as health/medicine and Gender Studies, or Education and Gender Studies.
- Categories contain a lot of information about Gender, which could be used to signal a clearer relationship to the Gender Studies topic. However, unlike the country labels, the categories are not as well described on Wikidata.
Things we would like to explore:
- Gender modeling of biographies seems reasonable as long as each of the labels (Men, Women and Non-Binary) is inclusive of trans and other similar labels.
- There is a preference for having both a Gender Studies and a Human Rights label. We are going to test how best to model this with WikiProjects and Categories, possibly filtering out biographies from the Gender Studies bucket to improve the data.
Further research questions
- There is an opportunity for community members to label better categories on Wikidata related to gender and gender adjacent topics (i.e. feminism, women’s schools, etc) to improve the quality of coverage of the topics.
- In general, the coverage of Gender-adjacent WikiProjects beyond the biography on English Wikipedia is poor -- further research would help to understand the overall gaps in coverage.
Arts and culture
GLAM, arts and culture topics are some of the most widely organized and deeply data-connected parts of the organizing community. We wanted to make sure that the topic model connects with the data needs of the GLAM-Wiki community. We invited 20 people, and ~12 participated in the conversation.
Main findings
- Representations of Gender and most of the arts topics in the new model were generally appreciated.
- Cultural representation within arts and culture material is important -- sometimes a country is too restrictive or doesn’t cover the overall diversity that is important (i.e. ethnicity) or its origin (i.e. topics from a colonial history from one geography).
- The majority of the conversation within the workshop focused on representing eras or time periods in intersection with other topics, because arts and culture are usually organized around periods of time (e.g. art from the mid-20th century).
- Craft WikiProject should be moved to Visual arts instead of Entertainment.
- WikiProject Comedy is overwhelmingly about Film so should be grouped there.
Things we would like to explore:
- Can we find a right-sized time unit for arts and culture and for history that works for biology as well?
- Intersections of identity, art topic or cultural topic with time are important. Most of the content, and the areas of focus, reflect distinct historical eras.
Further research questions
- There is an opportunity to explore the use of various identifiers from external authorities as part of the topic model. We don’t have the capacity to do it at this time, but future research could look into this.