Toolhub/Data model/Research and design

Design principles

Minimum viable set of attributes:

The taxonomy should be "just enough" to cover the tools that exist. It should not cover concepts in great detail if the corpus of tools doesn't merit it.
The taxonomy should not cover attributes that don't require human curation of the attribute values. Information that can be automatically extracted, or which cannot be controlled/curated, is not appropriate for inclusion in the taxonomy, though in many cases it should still be available for cataloging tools or filtering them in the UI.
The taxonomy should not cover concepts already included in the toolinfo.json schema, and for which data has already been populated. "Tool type" is the prime example here.

Useful attributes only: The taxonomy should not include attributes simply because we *could* model them. The attributes should align with the personas and goals documented in Toolhub use cases. For attributes that we intend to include in the Toolhub browsing UI, there's a limited amount of screen real estate to use. Too many attributes reduce the utility of the browsing UI, and may not display well on small screens. Similarly, attribute values should not exist for concepts that only apply to a small set of tools. For example, attribute values like "photo contests" or "new page patrolling" are too specific to be useful: no more than 5 tools would fit each of those classifications. (Those examples would be valuable as free-text annotations or tags, not as controlled terms.)
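
The "small set of tools" test can be checked mechanically once candidate values are attached to tools. The following is a minimal sketch, not part of the Toolhub codebase: the function name `sparse_values`, the sample tool names, and the data shape are assumptions made for illustration, and the cutoff of 5 is taken from the example in the paragraph above.

```python
from collections import Counter

def sparse_values(tool_values, max_tools=5):
    """Return candidate attribute values that apply to at most `max_tools` tools.

    `tool_values` is a hypothetical mapping of tool name -> list of candidate values.
    """
    # Count each value once per tool, even if a tool lists it twice.
    counts = Counter(v for values in tool_values.values() for v in set(values))
    return sorted(v for v, n in counts.items() if n <= max_tools)

# Invented sample data: seven tools share a broad value, two carry narrow ones.
tool_values = {f"tool-{i}": ["Editing or updating"] for i in range(7)}
tool_values["tool-0"].append("photo contests")
tool_values["tool-1"].append("new page patrolling")

too_specific = sparse_values(tool_values)
```

Values returned by such a check are candidates for free-text tags rather than controlled terms; the final decision still needs human judgment.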

Modular, atomic concepts: Attributes and their values must not combine multiple facets of a tool into one concept. This is the main area in which v2 of the taxonomy differs from v1. The initially-proposed "use case" attribute combines concepts like "audience" and "task" into one attribute. The principle of "one attribute per concept" ensures that attributes are mutually exclusive, and that attribute values are of the same type (this is sometimes called "level homogeneity"). Meanwhile, this approach still supports complex concepts through combination of multiple attributes (compositionality).
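
As a rough illustration of compositionality, a combined concept such as "Media upload" can be expressed as one value in each of two atomic attributes and recovered at query time. This is a sketch under stated assumptions: the `Tool` class, the `matches` helper, and the sample tool names are hypothetical and do not reflect the actual Toolhub data model or the toolinfo.json schema.

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    """Hypothetical tool record with one field per atomic concept."""
    name: str
    audience: set = field(default_factory=set)
    tool_type: set = field(default_factory=set)
    content_type: set = field(default_factory=set)
    task: set = field(default_factory=set)

def matches(tool, **facets):
    """True if the tool carries the requested value for every given attribute."""
    return all(value in getattr(tool, attr) for attr, value in facets.items())

# Invented sample catalog.
catalog = [
    Tool("uploader", content_type={"Media"}, task={"Upload"}),
    Tool("diff-ranker", content_type={"Diffs"}, task={"Rank"}),
    Tool("media-stats", audience={"Developers"}, content_type={"Media"}, task={"Analyze"}),
]

# "Media upload" is a combination of two attributes, not a value of its own.
media_upload = [t.name for t in catalog if matches(t, content_type="Media", task="Upload")]
```

Because each attribute holds values of a single type, the same two attributes also answer queries that a combined "Media upload" value could not, such as all tools that act on media regardless of task.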

No

"Media upload" combines the content type ("media") and the action a tool might apply to that object ("upload").

Yes

"Content Type": values like "Media", "Data", "Text", "Links", "Diffs"

"Task": values like "Upload", "Patrol", "Delete", "Rank", "Analyze"

No

Developers

APIs
Coding environments
Data services
...

Consumers

Reading
Data and metrics
Visualization and remixing
Large-scale content analysis
...

Yes

"Audience": values like "Developers", "Consumers"

"Tool type": values like "APIs", "Coding environments"

"Task": values like "Reading", "Visualization", "Content analysis"

Concrete concepts and definitions: The taxonomy should not cover concepts that will or may evolve over time, like the stability of a tool or its number of active maintainers. While these design principles attempt to avoid ambiguous concepts, there is always some ambiguity in any such system. Consequently, attributes should be documented with clear definitions to prevent semantic drift over time. For example, concepts like "use case" or "tool purpose" can be interpreted many ways. If included, these attributes must be defined with clear semantics that constrain their potential attribute values. This also enables both future data model maintainers and data contributors to easily identify values that don't belong.
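
One way to make definitions enforceable is to store each attribute together with its definition and its allowed values, so that proposed values that don't belong can be flagged. The sketch below is illustrative only: the `Attribute` class and its `invalid` method are hypothetical, and the definition string is placeholder wording rather than the taxonomy's actual definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    """Hypothetical controlled attribute: a name, a definition, and allowed values."""
    name: str
    definition: str
    values: frozenset

    def invalid(self, proposed):
        """Return the proposed values that are not part of this attribute's vocabulary."""
        return sorted(set(proposed) - self.values)

audience = Attribute(
    name="Audience",
    # Placeholder wording, not the taxonomy's real definition.
    definition="The group of people a tool is primarily meant to help.",
    values=frozenset({"Developers", "Consumers"}),
)

# "Upload" is a task and "Stable" is a state that changes over time; neither is an audience.
rejected = audience.invalid(["Developers", "Upload", "Stable"])
```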

Research process

This and the following sections describe how User:TBurmeister_(WMF) drafted the taxonomy and controlled vocabulary.

Revisited the categories proposed from earlier design phases.
Reviewed Research Phase 1 resources, including multiple pre-Toolhub solutions that had any sort of audience/category/topic segmentation or controlled vocabulary.
Reviewed Toolhub use cases.
Reviewed Toolhub feature requests / Phabricator tasks around discoverability and metadata.
Analyzed and added vocabulary control to the unstructured keywords that existed in Toolhub as of February 2022, many (if not most) of which came from tool data in Hay's Directory. I limited this analysis to keywords with 3 or more occurrences because the tail was long.
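
The frequency cutoff in the last step can be sketched as follows. This is a minimal illustration, not the script actually used for the analysis: the function name `frequent_keywords` and the sample keyword lists are invented, and folding case and whitespace before counting is an assumption about how occurrences were tallied.

```python
from collections import Counter

def frequent_keywords(keyword_lists, min_count=3):
    """Count keywords across tools and keep those with at least `min_count` occurrences."""
    counts = Counter(
        kw.strip().lower()          # assumption: case and surrounding spaces are ignored
        for keywords in keyword_lists
        for kw in keywords
        if kw.strip()               # skip empty keywords
    )
    return {kw: n for kw, n in counts.items() if n >= min_count}

# Invented sample: one keyword list per tool.
sample = [["upload", "commons"], ["Upload", "ocr"], ["upload "], ["commons"]]
kept = frequent_keywords(sample)
```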

Concept extraction process

From all of the above sources, I extracted:

Concepts people use to talk about tool discovery or navigating a catalog of tools
Concepts people use to talk about where, when, why, and how they use tools
Categories and attributes people used when they create lists or catalogs of tools

The result of this concept extraction is an uncontrolled list of concepts (attributes) and attribute values.

Analysis, concept mapping, and terminology standardization

Using the "megalist" of uncontrolled concepts, I then analyzed the semantics of the various attributes and their values. As part of this process, I:

Mapped most low-level concepts to a small set of high-level conceptual themes.
Reviewed the uncontrolled values and attribute types that these themes clustered together.
Looked specifically for recurring concepts, overlapping meanings, differing levels of granularity, and different tactics for approaching the same underlying concept.
Documented common areas of ambiguity that the final taxonomy must address.
Started building the controlled vocabulary as I encountered clear cases of synonymy.
Developed a standardized set of attributes. The "Task" attribute required the most standardization, because it combined many values that appeared as different attributes in the various legacy categorization schemes. To make this attribute useful and manageable, I mapped all the uncontrolled task terms to higher-level categories that seek to capture the most common and important themes in the data.

Task attribute term mapping

For a shorter list of just the proposed values for the "Task" attribute, or to provide feedback on this item, see the main Data Model page.

Task attribute values for controlled vocab
Uncontrolled value	Controlled value
administrative work	Project management and reporting
analysis	Analysis
Annotating	Annotating and linking
Anti-vandalism and user warning	Identifying vandalism; Warning users
Archive content	Archiving and cleanup
Assessment	Analysis
attribution	Citing and referencing
Automated editing	Editing or updating
Blocking	User management
bulk / quick editing	Editing or updating
Categorizing	Categorizing and tagging
Change	exclude due to vagueness
citation	Citing and referencing
Clean up sandbox	Archiving and cleanup
Collection curation (curating datasets, curating image sets)	Categorizing and tagging
Conduct (interacting with user task)	Communication and supporting users
Connect Wikipedia with other sites	Annotating and linking
Connect Wikipedia with other wikis	Annotating and linking
consuming content	Downloading or reusing content
content migration	Migrating content
Contest organizing	Event and contest planning
contributing content	Creating new content; Uploading or importing content
conversion	Converting and formatting content
convert	Converting and formatting content
copy	Editing or updating
Copy editing	Editing or updating
Copyediting	Editing or updating
Copyright management	Identifying policy violations
Counseling and social support	Communication and supporting users
Create	Creating new content
curation / organization	Categorizing and tagging
data curation	Categorizing and tagging
data upload	Uploading or importing content
Deleting	Deleting and reverting
Deliver article alert	Warning users
deployment	Hosting and maintaining tools
Destroy	Deleting and reverting
developing content	Creating new content
disambiguation	Disambiguation
dispute resolution	Communication and supporting users
Document user data	User management
Drafting	Editing or updating
edit	Editing or updating
Editing	Editing or updating
enhance - categorization	Categorizing and tagging
Event planning	Event and contest planning
Expanding	Editing or updating
Fix content	Editing or updating
Fix files	Archiving and cleanup
Fix links	Annotating and linking
Fix parameters in template/category/infobox	Annotating and linking
Format conversion (e.g. OCR, video conversion)	Converting and formatting content
Formatting	Converting and formatting content
gamification	exclude due to over-specificity
generate attribution	Citing and referencing
Generate pages based on other sources	Creating new content
Generate redirect pages	Annotating and linking
get source media / metadata	Annotating and linking
Greeting the newcomers	Communication and supporting users
Identify policy violations	Identifying policy violations
Identify spam	Identifying spam
Identify vandals	Identifying vandalism
Illustrating	Creating new content
importing	Uploading or importing content
In-place editing	Editing or updating
Large-scale content analysis	Analysis
Maintenance tagging	Categorizing and tagging
matching with Wikidata	Annotating and linking
measure	Analysis
media upload	Uploading or importing content
Merging	Merging content
Moving and merging	Merging content
New page patrolling	Patrolling recent changes
Online project planning (WikiProjects, etc.)	Event and contest planning
organizing projects	Project management and reporting
Page creation	Creating new content
patrol	Patrolling recent changes
Prepare	exclude - too vague
Previewing	exclude - too vague
Project communication	Project management and reporting
Provide suggestions for users	Recommending content
Provide suggestions for Wikiprojects	Recommending content
Purging	Archiving and cleanup
ranking	Listing and ranking
Reading	Reading
Recent changes patrolling	Patrolling recent changes
reconciliation	Disambiguation
Renaming	Editing or updating
reporting	Project management and reporting
reuse	Downloading or reusing content
reuse / visualization	Downloading or reusing content
Reverting	Deleting and reverting
Rollback/reverting	Deleting and reverting
search	too broad
Send user notifications	Communication and supporting users
Socializing users	Communication and supporting users
source data cleaning	Converting and formatting content
source text transcription	Converting and formatting content
Splitting	Converting and formatting content
Suppressing	exclude - too vague
Tag article assessment	Categorizing and tagging
Tag article status	Categorizing and tagging
Tag multimedia status	Categorizing and tagging
Tag Wikiprojects	Categorizing and tagging
Tagging and flagging	Categorizing and tagging
Talk page discussion	Communication and supporting users
Template editing	Editing or updating
Template insertion	Editing or updating
thanks	Communication and supporting users
track	Project management and reporting
tracking	Project management and reporting
transfer	Migrating content
translation	Translating and localizing
Update maintenance pages	Project management and reporting
Update statistics	Project management and reporting
upload	Uploading or importing content
Uploading	Uploading or importing content
User activity analysis	Analysis
user analysis	Analysis
User rights (admin, rollback, etc.)	User management
vandalism patrol	Identifying vandalism
Warning	User management
Welcoming	Communication and supporting users
Worklist development	Listing and ranking
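
A mapping table like the one above can be applied to raw keywords with a simple lookup. The sketch below copies a handful of rows from the table; the names `TASK_MAP` and `controlled_tasks`, the case-insensitive matching, and the use of `None` to mark excluded terms are assumptions made for illustration, not a description of how Toolhub stores or applies the vocabulary.

```python
# A few rows copied from the table above. None marks a term excluded from the vocabulary.
TASK_MAP = {
    "anti-vandalism and user warning": ["Identifying vandalism", "Warning users"],
    "copy editing": ["Editing or updating"],
    "copyediting": ["Editing or updating"],
    "upload": ["Uploading or importing content"],
    "change": None,  # excluded due to vagueness
}

def controlled_tasks(raw_terms):
    """Map raw terms to controlled values; return (controlled, unmapped)."""
    controlled, unmapped = [], []
    for term in raw_terms:
        key = term.strip().lower()  # assumption: matching ignores case
        if key not in TASK_MAP:
            unmapped.append(term)   # needs human review
            continue
        for value in TASK_MAP[key] or []:
            if value not in controlled:  # collapse synonyms into one value
                controlled.append(value)
    return controlled, unmapped

tasks, review = controlled_tasks(
    ["Copy editing", "Copyediting", "Change", "Anti-vandalism and user warning", "gardening"]
)
```

Terms that are absent from the table are returned separately rather than guessed at, since assigning a controlled value is a curation decision.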

Content type attribute term mapping

Content type attribute values for controlled vocab
Uncontrolled value	Controlled attribute value
label	Categories and labels
articles	Articles
articles for creation	Articles
audio	Audio
automated contributions	exclude
batch	exclude
Body	exclude
books	Books
Categories	Categories and labels
category	Categories and labels
Code	Software or code
Commons and files	too vague; can't map
Content pages (encyclopedia articles, original texts)	Articles
Contributions	Diffs and revision data
coordinates	Geographic data
csv	file format, not a content type
Data (Wikidata items, structured file data)	Structured data
Diffs	Diffs and revision data
Discussions	Discussions
Documentation	too broad
Drafts	Drafts
edit count	Diffs and revision data
Edit filters	Diffs and revision data
Edit form	Diffs and revision data
Edit summary	Diffs and revision data
edits	Diffs and revision data
email	Email
Feeds	too specific
file	too broad; indicate more specific content type
Files	too broad; indicate more specific content type
Flagged revisions	Diffs and revision data
image	Images
images	Images
infobox	too specific
isbn	Bibliographic data
lexeme	Linguistic data
links	Links
list	Lists
Listings	Lists
lists	Lists
Logs	Logs
map	Maps
maps	Maps
media	too broad
Media (images, videos, sound recordings)	too broad
missing	exclude
Modules, scripts and stylesheets	Software or code
ogg	Audio
open license text	exclude
Page information	Page metadata
page views	Page metadata
pages	Page metadata
pageviews	Event data
pdf	file format, not a content type
photos	Images
projects	exclude
properties	Structured data
Queries	Event data
query	Event data
random	exclude
Recent changes	Diffs and revision data
redirect	Links
Redirects	Links
redlinks	Links
reference	References
References	References
report	too broad
reports	too broad
rss	exclude
Search form	too broad
Shortcuts	exclude
statistics	too broad
statistics (Commons)	too broad
statistics (Wikidata)	too broad
Stats	too broad
svg	Images
table	Structured data
template	Templates
Templates	Templates
timeline	Event data
user edits	Diffs and revision data
User information	User data
vandalism	Diffs and revision data
video	Videos
videos	Videos
views	Event data
Watchlist	Watchlist
web resource	Webpages
What links here	Links
wikitext	Wikitext
written content	Articles
youtube	Videos

Platform attribute term mapping

Platform attribute values for controlled vocab
Uncontrolled value	Controlled attribute value
command line tool	Command-line
Command-line tools	Command-line
desktop app	Desktop
Image software extensions	Extension for existing software (non-MediaWiki)
Integrated tools	MediaWiki
mediawiki	MediaWiki
mobile	Mobile / smartphone
On external website	Web app
on mobile devices	Mobile / smartphone
Smartphone apps	Mobile / smartphone
standalone desktop applications	Desktop
Standalone software	exclude
web app	Web app
web tools	Web app

Common areas of semantic ambiguity

To improve the existing data model, it's important to understand areas where previous attempts to model this space have generated ambiguity or shown inconsistency in how they handled similar concepts. These areas of ambiguity are the most important areas for the new taxonomy to standardize and clarify.

Audience vs. target area / domain vs. wiki project: Previous categorization schemes and data models have often mixed together attribute values related to the people, fields, wikis, wiki projects, and locales for which (or for whom) a tool may be especially useful. The final taxonomy must have a clear definition of what any "Audience", "Domain", or "Project" attribute captures, and those attributes should be clearly mutually exclusive. Examples:

GLAM, Education → can be modeled as an audience (the humans who are in the GLAM field), but also as a use case or application domain.
Chinese Wikipedia, English Wikipedia → these examples follow the anti-pattern of combining two concepts (language/locale and a specific wiki project, Wikipedia) into one attribute value.

Subjects, verbs, objects: Previous data models frequently differ in how they model the type of task a tool helps with vs. who it helps vs. what kind of content it helps them work on. Intersections of audience, content format, and task were sometimes grouped into a "use cases" attribute to avoid this complexity. Examples:

"categories" as a content format that a tool acts upon, vs. "Categorizing" as a contributor task. In the data I reviewed, there were 3 instances of noun usage and only one of gerund, so the final taxonomy prefers modeling "categories" as a noun object in an attribute like "Content type".
"contributing content" vs. "Contributions" vs. "Contributors" (verb vs. noun object vs. noun subject). This set of ambiguous concepts probably suffers from the overall concept being too vague. It is better to model more specifically the type of content being contributed, along with a more specific set of contributor types or audiences, like "editors" or "developers".
"vandalism" as a content format vs. "Patrolling" as a task. In most existing data models, patrolling has high prominence as a task or activity, and "vandalism" can come in many different content formats, so this ambiguity is best resolved by keeping "Patrolling"-related activities as verbs in an attribute focused on user tasks. However, we should also ensure that all tools that involve patrolling have vandalism as a keyword in their description or tags.
A similar, but less clear example: is "statistics/metrics" a content format or a task (i.e. "analyzing")? For cases like this, we must be guided by user research, feedback, and data like search queries to determine the most likely way that users would expect to navigate to the relevant tools in this topic area. The current taxonomy uses the verb form and puts "Analysis" under the Task attribute, but we should test and iterate on this.

Audience vs. characteristics of audience vs. characteristics of the UI: Some previous data models highlighted "non-English speakers" as an audience, and many recommended a "language" attribute. The current Toolhub data model includes "available_ui_languages". The taxonomy must clearly differentiate and define the scope of any language- or locale-related attributes, so that it's clear whether they're describing attributes of the tool itself (UI language) or attributes of the audience it's meant to help.

Attributes of the tool itself vs. attributes of its use or context: In previous data models, "tool type" attributes have often conflated specific tool attributes with more generic attributes describing the type of general usage or application the tool has. These should be separate attributes. Examples:

APIs, bots, coding environments, displays
Productivity tools, editing tools

Levels of granularity: Past data models and existing, project-specific schemes often vary in how specific they are in capturing the following types of concepts:

Projects and sub-projects: Wikidata, Structured Data, Commons, Structured Commons → There's limited utility in subdividing projects into sub-projects. Tools that serve these sub-domains of work on structured data can be easily found by combining a Project attribute like "Wikidata" with a Content Type attribute like "images" or "media".
How specific to get in the wiki page structure: because there are scripts that operate on and help with specific subsections or pages, some data models have specified these elements. Is the granularity of "page" vs. "discussion" too much or too little to provide effective access to these tools? The taxonomy must definitely subdivide the tool space by the content types of "articles" vs. "media" vs. "data", so we need to be careful about whether we add too much granularity by modeling the subsections of pages as a content type.
How specific to get in modeling file types and file formats? Data models or categorization schemes in specific domains have historically been more specific about the content formats most relevant to them. For example, Commons-focused tool descriptions and data models are likely to specify not just image vs. video, but also whether the file type is svg, jpg, png, tiff, ogg, or midi. The question for the Toolhub taxonomy is whether it can support discovery in those specific domains without modeling that level of granularity, and whether there are enough tools in the catalog to make detailed file formats a useful level of granularity.

From a UX perspective, the more shallow the taxonomy is, the better it is likely to work in the current Toolhub UI – there's not a ton of horizontal space for expanding subclasses and showing long lists of possible attributes for filtering.

Nominalizations vs. verb forms: A common inconsistency in past data models appears in how they formalize concepts that can be described with either a noun or a verb form. For example: "content recommenders" is a nominalization of the action "recommending content". The nominalization makes it easy to think of that attribute as a "tool type", and such nominalizations are common when describing bots (like in this article). Maybe we do this because we like to anthropomorphize bots, but fundamentally these types of attributes are describing actions that a user is trying to do with the help of a tool. Our taxonomy should choose one formulation and use it consistently. The proposed taxonomy uses verb forms to model the tasks that a tool helps its users do, rather than using ambiguous nominalizations of such actions.