Wikimedia CH/Grant apply/Extending Cite Unseen

From Meta, a Wikimedia project coordination wiki

Infodata

  • Name of the project: Extending Cite Unseen
  • Amount requested: 20,000 CHF
  • Type of grantee: Individual
  • Name of the contact: Kevin Payravi (User:SuperHamster)
  • Contact: kevinpayravi(_AT_)gmail.com

Biography

Kevin Payravi (User:SuperHamster) is a software engineer from Dallas, Texas, USA, and has been editing Wikimedia projects since 2007. In addition to day-to-day editing, Kevin has developed a number of tools to help wiki readers and editors, including Cite Unseen, View it! (2024 Coolest Tool Award winner!), and Indie Wiki Buddy. He also serves on the Board of Wikimedia DC and as an organizer for the Ohio Wikimedians User Group. Each year, he helps organize WikiConference North America and Wiki Loves Monuments in the United States, and photographs a number of events for the WikiPortraits initiative.

Payravi has worked as a professional software engineer since 2017. He recently left his full-time corporate job to focus on high-impact projects to help communities, including Writing Atlas (a catalog of short stories) and wiki-related tooling.

More information and a list of previous grants are available on Payravi's CV.

The problem and the context

What is the problem you're trying to solve?

Cite Unseen in action on en:Citizens United v. FEC sources

Wikipedia is built on citations, and a lengthy article can have hundreds of them. A standard Wikipedia citation includes basic metadata such as author, title, publication, and date. However, there is often little else to indicate the nature or reliability of the source. The English Wikipedia, for example, maintains an extensive list of perennially discussed sources and their reliability, but this data is not natively surfaced in the UI when viewing a list of citations. Similar lists exist on over a dozen other language versions of Wikipedia, including Farsi, French, and Swedish. There are other interesting and important indicators to consider for each source, such as whether a cited New York Times piece is a news article or an opinion piece, or whether the cited article is sponsored by an external party. These sorts of classifiers provide important context and can even help indicate the reliability of a source, but these details are not readily surfaced in our existing citation systems.

What is your solution to this problem (please explain the context and the solution)?

A solution to the above problem is Cite Unseen, an existing open-source user script on the English Wikipedia that adds categorical icons to Wikipedia citations, giving readers and editors a quick initial evaluation of citations at a glance. This helps guide users on the nature and reliability of sources, and helps identify sources that may be problematic or should be used with caution.

Cite Unseen was initially developed at the CredCon hackathon in November 2018, jointly created by Kevin Payravi (SuperHamster) and Josh Lim (Sky Harbor), with support from the Credibility Coalition and the Knowledge Graph Working Group. The project won the CredCon hackathon, received funding to continue development at the Wikimedia Hackathon 2019, and is supported by WikiCred. At the 2023 Media Party Buenos Aires hackathon, Cite Unseen was extended to integrate Arabic-language sources and some potential Spanish-language ones.

Cite Unseen's categorization dataset currently holds over 3,400 domains in 20 categories. These categories include:

  • Perennial sources list statuses from the English Wikipedia (generally reliable; marginally reliable; generally unreliable; deprecated; blacklisted; and varied consensus)
  • Advocacy groups; books; blogs; user-generated news; editable sites; state media; news; opinion pieces; press releases; satire; social media sites; sponsored articles; tabloids; and TV and radio programs
  • Predatory journals listed on the English Wikipedia's predatory source list

Out of the 1000+ tracked user scripts on the English Wikipedia, Cite Unseen is a top-60 most-installed user script with 391 users (156 of whom are currently active editors). Cite Unseen's domain categorizations are also being utilized by other tools, such as Citation Watchlist.

Cite Unseen has been recognized and added to the Wikimedia Foundation's Anti-Disinformation Repository. There are additional user scripts available that annotate sources, such as Headbomb's unreliable and Novem Linguae's CiteHighlighter that highlight citations based on their reliability. Cite Unseen differs in that, while it does mark when sources are generally considered reliable or unreliable, it focuses more on the nature of sources (news, state media, advocacy groups, etc.). These annotations provide additional context on sources that can help evaluate sources beyond what has been identified by Wikipedia's perennial sources.

As a volunteer-driven project, Cite Unseen has mostly seen incremental development, such as adding user-submitted categorizations. But there have been persistent inquiries about this tool and a need to expand it with new features. Through talking to users online and at Wikimedia events, I've identified the following needs:

  • A need to port the script to other language versions of Wikipedia.
  • A need to expand the sources list, particularly to cover non-English sources.
  • A need to make it easier for users to submit categories for domains.
  • A need to keep semi-automated lists of domains up to date.
  • A need to make it easier for users to configure their settings for the script.
  • A need to make it easier for other tools to utilize Cite Unseen's domain categorizations.
  • A need to let users explore Cite Unseen's categorizations and the reasoning behind them.
  • A need to display article-level statistics about an article's citations.
  • A need to promote the script and do outreach to gather quality feedback.

This grant request is to fund software development efforts to address these needs.

Project goals

To address the needs above, these are the planned new features as part of this grant:

  • Bring Cite Unseen to more language versions of Wikipedia.
    • Translate Cite Unseen's text into various languages.
      • An open-source tool such as TranslateWiki or Weblate will be used to facilitate translating into many languages.
    • Collaborate with editors to expand sources lists into other languages.
      • Cite Unseen's sources lists are maintained primarily by English speakers and include the English Wikipedia's perennial sources list. When we port Cite Unseen to other language versions of Wikipedia, there will be a large gap in the covered sources. As part of this grant, we will collaborate with speakers and contributors of non-English languages to expand Cite Unseen's sources lists.
    • For this grant, the first priority will be the official languages of Switzerland (German, French, and Italian). We will also expand to other languages, focusing on the largest and most active Wikipedias, and the Wikipedias with local editors and affiliates who can help with translations and expanding our sources lists.
  • Semi-automated flows to keep source lists up-to-date.
    • Some of the existing categories of tracked sources, such as sources from advocacy organizations, were originally created by crawling through Wikipedia's categories for advocacy groups and extracting official links. Since these initial scrapes, these lists have largely been static other than for a few one-off additions. The goal is to create an automated flow to periodically re-generate these lists from various language Wikipedias and merge them in after manual review, to ensure that these lists are kept up-to-date.
  • Explore potential new categorizations for domains.
    • As we expand and develop our sources lists, we may find opportunities for additional categorizations to indicate the nature of sources. We can also explore additional meta categories, such as adding icons for open-access sources.
  • Allow annotations on categorized domains.
    • Currently, when a domain is added to a category, there is no additional metadata attached. The goal is to enhance our database to hold notes for each entry. As an example, it would be useful for each domain to have an explanation as to why the domain was added to a category, or whether the categorization is coming from another source.
  • Allow users to easily recommend adding a domain to a category right from an article's references list.
    • Currently, if a user wants to recommend that a domain be added to a category, they either leave a message on the user script talk page or submit a code change through the script's GitHub repository. This is a relatively slow and tedious process that most users don't want to go through. The goal is to update the script to add an intuitive way for users to submit domain categorization recommendations right from an article's references list. The design has yet to be finalized, but it would look something like a "+" button next to each citation; once clicked, a modal appears that lets the user select which categories the citation should be part of and submit that for review without ever having to leave the article.
  • Add a UI for changing a user's custom settings.
    • Cite Unseen is quite configurable, allowing users to add or ignore domains per category, and to hide or display certain categorical icons. This configuration, however, requires maintaining a custom JSON file, which is not an ideal experience. The goal will be an intuitive UI that allows the user to manage and change their configurations, instead of having to manually adjust a JSON file.
  • Add a statistics summary at the top of references sections.
    Proof-of-concept of reference statistics
    • To give users an overview of an article's references, we can provide a count of each type of reference at the top of the references section. We can also add a filtering function, to display only references of a certain type. A proof-of-concept has been made (see thumbnail), but it needs to be refined before releasing to production.
  • Create an API to enable other tools to easily export and utilize Cite Unseen's data.
    • An endpoint that, given one or more categories, returns the domains that fall under those categories.
    • An endpoint that, given a domain, returns the categories that domain falls under.
  • Create a portal to explore domain categorizations.
    • Once the API is available, we can build a web UI to allow users to explore domain categories.
    • Once we add support for annotations on domain categorizations, the UI can also be used by users to easily explore and see why a domain was added to a category.
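
As a rough illustration of the two lookups described for the API, the sketch below models them over an in-memory mapping. It is only a sketch under stated assumptions: the sample data, the category keys, the example domains, the annotation text, and the function names (`domains_for_categories`, `categories_for_domain`, `normalize_host`) are all hypothetical, and the actual API design has not been decided.

```python
from urllib.parse import urlparse

# Hypothetical sample data: category -> {domain: annotation}.
# Real categories, domains, and notes would come from Cite Unseen's dataset.
CATEGORIES = {
    "news": {"news.example": "Suggested by a user"},
    "stateMedia": {"state.example": "Imported from another list"},
    "blogs": {"blog.example": "", "news.example": "Hosts a blog section"},
}


def normalize_host(value):
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    target = value if "://" in value else "//" + value
    host = urlparse(target).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domains_for_categories(categories):
    """Given one or more categories, return the domains listed under each."""
    return {name: sorted(CATEGORIES.get(name, {})) for name in categories}


def categories_for_domain(value):
    """Given a domain or URL, return {category: annotation} for each match."""
    host = normalize_host(value)
    matches = {}
    for category, domains in CATEGORIES.items():
        for domain, note in domains.items():
            # Match the listed domain itself or any subdomain of it.
            if host == domain or host.endswith("." + domain):
                matches[category] = note
    return matches
```

A caller could then use `categories_for_domain("https://www.news.example/story")` to retrieve every category for a cited URL, along with the note explaining each categorization, which is also the information the exploration portal would display.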

In addition to the above tasks, this grant project will also include outreach to invite more users to try Cite Unseen. This can include outreach on various WikiProjects and social groups (including Wikimedia groups on Discord and Telegram), as well as at conferences and other Wikimedia events.

Project impact

How will you know if you have met your goals?

  • The number of editors on each language version of Wikipedia who install (and keep installed) Cite Unseen is a good indicator of the demand for Cite Unseen.
    • On the English Wikipedia, Cite Unseen has 391 users, a top-60 most-installed script. Increasing this to ~600 installs would be a good and achievable goal.
    • On other language versions of Wikipedia, becoming one of the most-installed user scripts would be a good goal.
  • As we bring Cite Unseen to new communities, we will solicit feedback from users.
  • Once we introduce the feature to allow users to easily submit new domain categorizations, we will gauge success and demand for the tool by the number of users regularly submitting new domains for categorization.

Do you have any goals or metrics around participation or content?

  • For the language versions of Wikipedia we port Cite Unseen to, we would aim to become a top-10% most-used user script.
  • For each newly supported language (German, French, and Italian, and other large language Wikipedias), we hope to add 500+ new domain categorizations.

Project plan

The project is split into several components: software development, list development, and community outreach.

These activities would take place over the next 8 months. Most software development will take place in the first half, with the second half focusing on expanding domain categorization, user adoption, and integrating feedback.

Activities

Software development

The software development component of this grant funds a software developer to implement the following features described above:

  • Semi-automated flows to keep source lists up-to-date.
  • Allow annotations on categorized domains.
  • Allow users to easily recommend adding a domain to a category right from an article's references list.
  • Add a UI for changing a user's custom settings.
  • Add a statistics summary at the top of references sections.
  • Support localization of Cite Unseen's text.
  • Port the script to more language versions of Wikipedia.
  • Create the API as a service on Toolforge.
  • Create a simple web portal to explore domain categorizations.
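
To make the statistics summary feature more concrete, its core logic could be a per-category count over the citation URLs on a page, plus a filter that keeps only citations of one type. The sketch below is illustrative only: it is written in Python rather than as part of the user script, and the mapping `DOMAIN_CATEGORIES`, the category keys, the "uncategorized" label, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical domain -> categories mapping, for illustration only.
DOMAIN_CATEGORIES = {
    "news.example": ["news"],
    "blog.example": ["blogs"],
    "state.example": ["news", "stateMedia"],
}


def categories_of(url):
    """Return the categories for a citation URL, or ['uncategorized']."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return DOMAIN_CATEGORIES.get(host, ["uncategorized"])


def summarize_references(urls):
    """Count how many citations fall into each category."""
    counts = Counter()
    for url in urls:
        counts.update(categories_of(url))
    return counts


def filter_references(urls, category):
    """Keep only the citations that belong to the given category."""
    return [url for url in urls if category in categories_of(url)]
```

Because a domain can belong to several categories, the per-category counts in this sketch can add up to more than the number of citations.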

List development

As mentioned above, Cite Unseen's existing lists of sources need to be updated. In addition, we would like to expand the lists to cover non-English sources. The priority will be on sources from Switzerland, as well as sources in Italian, French, and German. We also plan to cover some of the other most active Wikipedias, and the Wikipedias with local editors and affiliates who can help with translations and expanding our sources lists.

In addition to the semi-automated flows to keep source lists up-to-date (as covered by software development above), we will be incorporating more perennial sources lists from non-English Wikipedias, and adding sources from other countries and languages across all our categories (news, advocacy, state media, etc.). We will be doing outreach to editors and affiliates on other language wikis to help populate these lists. Once we add the feature that allows users to easily recommend adding a domain to a category from an article's reference list, we should be able to build these lists over time.
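
One possible shape for the semi-automated update flow is sketched below: compare a freshly generated list against the current one, queue the differences for manual review, and merge only what a reviewer approves. This is a sketch under assumptions, not the planned implementation; the function names and example domains are hypothetical, and proposing removals (in addition to additions) is an assumption that goes beyond what is described above.

```python
def propose_changes(current, scraped):
    """Diff the current domain list against a freshly scraped one."""
    return {
        "add": sorted(scraped - current),     # newly found domains
        "remove": sorted(current - scraped),  # assumption: also flag stale ones
    }


def apply_reviewed(current, proposal, approved):
    """Merge only the proposed changes that a human reviewer approved."""
    updated = set(current)
    for domain in proposal["add"]:
        if domain in approved:
            updated.add(domain)
    for domain in proposal["remove"]:
        if domain in approved:
            updated.discard(domain)
    return updated


# Example run with hypothetical domains.
current = {"advocacy-a.example", "advocacy-b.example"}
scraped = {"advocacy-b.example", "advocacy-c.example", "advocacy-d.example"}
proposal = propose_changes(current, scraped)
merged = apply_reviewed(current, proposal, approved={"advocacy-c.example"})
```

In this run, only `advocacy-c.example` is merged in; the unapproved addition and the unapproved removal leave the current list otherwise unchanged.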

Community outreach

Throughout the software and list development, we will be doing outreach to communities to share Cite Unseen with new editors and solicit feedback.

  • As Cite Unseen is brought to new language Wikipedias, we will invite editors to try it out via relevant forums (noticeboards, WikiProjects, mailing lists, affiliates, etc.).
  • Users will be invited to provide feedback on new features as they come out. Using the Global Search tool, we can gather a list of users and leave updates on user talk pages.
  • As we've done in the past, we will present on Cite Unseen at relevant wiki conferences (likely WikiConference North America and Wikimania in 2025).

Budget

This budget can be scaled as needed. Most flexibility is in project management and list development.

  • Software development: 10,000 CHF
    • For software development of features described above
    • Primarily to Payravi
  • Project management / community outreach: 8,000 CHF
    • Role may be performed by other Wikimedians.
    • Outreach includes sharing updates, planning demos and conference presentations, gathering and organizing feedback from users, and connecting with editors and communities to help build source lists.
    • Includes stipends for Wikipedians and affiliates who dedicate time to research and develop source lists in multiple languages. More funding here can result in more comprehensive lists across multiple languages, but can be scaled down (or up!) as needed.
  • Fiscal sponsorship fee and administration: 2,000 CHF (Hacks/Hackers has supported Cite Unseen and can be the fiscal sponsor for this grant)

Community engagement

Most engagement and feedback occurs on the script's talk page.

Other notes

I previously submitted a rapid grant request to the Wikimedia Foundation for Cite Unseen, but it was not funded: the Wikimedia Foundation has stated a blanket policy that it is currently not funding tech projects, and the Technology Fund has been shut down.