Toolhub/Data model/Research and design
Design principles
Minimum viable set of attributes:
- The taxonomy should be "just enough" to cover the tools that exist. It should not cover concepts in great detail if the corpus of tools doesn't merit it.
- The taxonomy should not cover attributes that don't require human curation of the attribute values. Information that can be automatically extracted, or which cannot be controlled/curated, is not appropriate for inclusion in the taxonomy, though in many cases it should still be available for cataloging tools or filtering them in the UI.
- The taxonomy should not cover concepts already included in the toolinfo.json schema, and for which data has already been populated. "Tool type" is the prime example here.
Useful attributes only: The taxonomy should not include attributes simply because we *could* model them. The attributes should align with the personas and goals documented in Toolhub use cases. For attributes that we intend to include in the Toolhub browsing UI, there's a limited amount of screen real estate to use. Too many attributes reduce the utility of the browsing UI, and may not display well on small screens. Similarly, attribute values should not exist for concepts that only apply to a small set of tools. For example, attribute values like "photo contests" or "new page patrolling" are too specific to be useful – no more than 5 tools would fit each of those classifications. (Those examples would be valuable as free-text annotations or tags, not as controlled terms.)
Modular, atomic concepts: Attributes and their values must not combine multiple facets of a tool into one concept. This is the main area in which v2 of the taxonomy differs from v1. The initially proposed "use case" attribute combines concepts like "audience" and "task" into one attribute. The principle of "one attribute per concept" ensures that attributes are mutually exclusive, and that attribute values are of the same type (this is sometimes called "level homogeneity"). Meanwhile, this approach still supports complex concepts through combination of multiple attributes (compositionality).
Examples of atomic attributes:
- "Content Type": values like "Media", "Data", "Text", "Links", "Diffs"
- "Task": values like "Upload", "Patrol", "Delete", "Rank", "Analyze"
By contrast, a hierarchy like the following combines audience, tool type, and task in one structure:
- Developers
  - APIs
  - Coding environments
  - Data services
  - ...
- Consumers
  - Reading
  - Data and metrics
  - Visualization and remixing
  - Large-scale content analysis
  - ...
Modeled as separate attributes, the same concepts become:
- "Audience": values like "Developers", "Consumers"
- "Tool type": values like "APIs", "Coding environments"
- "Task": values like "Reading", "Visualization", "Content analysis"
Concrete concepts and definitions: The taxonomy should not cover concepts that will or may evolve over time, like the stability of a tool or its number of active maintainers. While these design principles attempt to avoid ambiguous concepts, there is always some ambiguity in any such system. Consequently, attributes should be documented with clear definitions to prevent semantic drift over time. For example, concepts like "use case" or "tool purpose" can be interpreted many ways. If included, these attributes must be defined with clear semantics that constrain their potential attribute values. This also enables both future data model maintainers and data contributors to easily identify values that don't belong.
Research process
This and the following sections describe how User:TBurmeister_(WMF) drafted the taxonomy and controlled vocabulary.
- Revisited the categories proposed from earlier design phases.
- Reviewed Research Phase 1 resources, including multiple pre-Toolhub solutions that had any sort of audience/category/topic segmentation or controlled vocabulary:
- Reviewed Toolhub use cases.
- Reviewed Toolhub feature requests / Phabricator tasks around discoverability and metadata.
- Analyzed and applied vocabulary control to the unstructured keywords that existed in Toolhub as of February 2022, most of which came from tool data in Hay's Directory. I limited this analysis to keywords with 3 or more occurrences because the distribution had a long tail of rarely used keywords.
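As a rough illustration of the frequency cutoff described in the last step, the following Python sketch counts keyword occurrences across tools and keeps those that occur at least 3 times. The function name, the sample data, and the choices to lowercase keywords and to count each keyword once per tool are assumptions made for illustration; the original analysis may have counted differently.

```python
from collections import Counter


def frequent_keywords(tools_keywords, min_occurrences=3):
    """Count keywords across tools and keep those at or above the cutoff.

    Assumptions (not from the source analysis): keywords are compared
    case-insensitively, and each keyword counts once per tool.
    """
    counts = Counter()
    for keywords in tools_keywords:
        # Normalize first, then deduplicate within a single tool.
        counts.update({kw.strip().lower() for kw in keywords})
    return {kw: n for kw, n in counts.items() if n >= min_occurrences}


# Hypothetical keyword lists, one list per tool.
sample = [
    ["upload", "commons"],
    ["Upload", "images"],
    ["upload"],
    ["images", "maps"],
]

kept = frequent_keywords(sample)
print(kept)
```

Keywords that fall below the cutoff are not lost by a step like this; they simply stay out of the controlled-vocabulary analysis and can remain as free-text tags.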
Concept extraction process
From all of the above sources, I extracted:
- Concepts people use to talk about tool discovery or navigating a catalog of tools
- Concepts people use to talk about where, when, why, and how they use tools
- Categories and attributes people used when they create lists or catalogs of tools
The result of this concept extraction is an uncontrolled list of concepts (attributes) and attribute values.
Analysis, concept mapping, and terminology standardization
Using the "megalist" of uncontrolled concepts, I then analyzed the semantics of the various attributes and their values. As part of this process, I:
- Mapped most low-level concepts to a small set of high-level conceptual themes.
- Reviewed the uncontrolled values and attribute types that these themes clustered together.
- Analyzed the semantics of the various attributes and their values, looking specifically for recurring concepts, overlapping meanings, differing levels of granularity, and different tactics for approaching the same underlying concept.
- Documented common areas of ambiguity that the final taxonomy must address.
- Started building the controlled vocabulary as I encountered clear cases of synonymy.
- Developed a standardized set of attributes. The "Task" attribute required the most standardization, because it combined many values that appeared as different attributes in the various legacy categorization schemes. To make this attribute useful and manageable, I mapped all the uncontrolled task terms to higher-level categories that seek to capture the most common and important themes in the data.
Task attribute term mapping
For a shorter list of just the proposed values for the "Task" attribute, or to provide feedback on this item, see the main Data Model page.
Uncontrolled value | Controlled value |
---|---|
administrative work | Project management and reporting |
analysis | Analysis |
Annotating | Annotating and linking |
Anti-vandalism and user warning | Identifying vandalism; Warning users |
Archive content | Archiving and cleanup |
Assessment | Analysis |
attribution | Citing and referencing |
Automated editing | Editing or updating |
Blocking | User management |
bulk / quick editing | Editing or updating |
Categorizing | Categorizing and tagging |
Change | exclude due to vagueness |
citation | Citing and referencing |
Clean up sandbox | Archiving and cleanup |
Collection curation (curating datasets, curating image sets) | Categorizing and tagging |
Conduct (interacting with user task) | Communication and supporting users |
Connect Wikipedia with other sites | Annotating and linking |
Connect Wikipedia with other wikis | Annotating and linking |
consuming content | Downloading or reusing content |
content migration | Migrating content |
Contest organizing | Event and contest planning |
contributing content | Creating new content; Uploading or importing content |
conversion | Converting and formatting content |
convert | Converting and formatting content |
copy | Editing or updating |
Copy editing | Editing or updating |
Copyediting | Editing or updating |
Copyright management | Identifying policy violations |
Counseling and social support | Communication and supporting users |
Create | Creating new content |
curation / organization | Categorizing and tagging |
data curation | Categorizing and tagging |
data upload | Uploading or importing content |
Deleting | Deleting and reverting |
Deliver article alert | Warning users |
deployment | Hosting and maintaining tools |
Destroy | Deleting and reverting |
developing content | Creating new content |
disambiguation | Disambiguation |
dispute resolution | Communication and supporting users |
Document user data | User management |
Drafting | Editing or updating |
edit | Editing or updating |
Editing | Editing or updating |
enhance - categorization | Categorizing and tagging |
Event planning | Event and contest planning |
Expanding | Editing or updating |
Fix content | Editing or updating |
Fix files | Archiving and cleanup |
Fix links | Annotating and linking |
Fix parameters in template/category/infobox | Annotating and linking |
Format conversion (e.g. OCR, video conversion) | Converting and formatting content |
Formatting | Converting and formatting content |
gamification | exclude due to over-specificity |
generate attribution | Citing and referencing |
Generate pages based on other sources | Creating new content |
Generate redirect pages | Annotating and linking |
get source media / metadata | Annotating and linking |
Greeting the newcomers | Communication and supporting users |
Identify policy violations | Identifying policy violations |
Identify spam | Identifying spam |
Identify vandals | Identifying vandalism |
Illustrating | Creating new content |
importing | Uploading or importing content |
In-place editing | Editing or updating |
Large-scale content analysis | Analysis |
Maintenance tagging | Categorizing and tagging |
matching with Wikidata | Annotating and linking |
measure | Analysis |
media upload | Uploading or importing content |
Merging | Merging content |
Moving and merging | Merging content |
New page patrolling | Patrolling recent changes |
Online project planning (WikiProjects, etc.) | Event and contest planning |
organizing projects | Project management and reporting |
Page creation | Creating new content |
patrol | Patrolling recent changes |
Prepare | exclude - too vague |
Previewing | exclude - too vague |
Project communication | Project management and reporting |
Provide suggestions for users | Recommending content |
Provide suggestions for Wikiprojects | Recommending content |
Purging | Archiving and cleanup |
ranking | Listing and ranking |
Reading | Reading |
Recent changes patrolling | Patrolling recent changes |
reconciliation | Disambiguation |
Renaming | Editing or updating |
reporting | Project management and reporting |
reuse | Downloading or reusing content |
reuse / visualization | Downloading or reusing content |
Reverting | Deleting and reverting |
Rollback/reverting | Deleting and reverting |
search | too broad |
Send user notifications | Communication and supporting users |
Socializing users | Communication and supporting users |
source data cleaning | Converting and formatting content |
source text transcription | Converting and formatting content |
Splitting | Converting and formatting content |
Suppressing | exclude - too vague |
Tag article assessment | Categorizing and tagging |
Tag article status | Categorizing and tagging |
Tag multimedia status | Categorizing and tagging |
Tag Wikiprojects | Categorizing and tagging |
Tagging and flagging | Categorizing and tagging |
Talk page discussion | Communication and supporting users |
Template editing | Editing or updating |
Template insertion | Editing or updating |
thanks | Communication and supporting users |
track | Project management and reporting |
tracking | Project management and reporting |
transfer | Migrating content |
translation | Translating and localizing |
Update maintenance pages | Project management and reporting |
Update statistics | Project management and reporting |
upload | Uploading or importing content |
Uploading | Uploading or importing content |
User activity analysis | Analysis |
user analysis | Analysis |
User rights (admin, rollback, etc.) | User management |
vandalism patrol | Identifying vandalism |
Warning | User management |
Welcoming | Communication and supporting users |
Worklist development | Listing and ranking |
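One way to apply a mapping like the table above is a simple lookup, sketched below in Python. Only a handful of rows from the table are reproduced; the dictionary layout, the function name, and the case- and whitespace-insensitive matching are assumptions for illustration and are not part of Toolhub.

```python
# A few rows copied from the table above. Each uncontrolled term maps to a
# tuple of controlled values; an empty tuple marks a term the table excludes.
TASK_MAPPING = {
    "analysis": ("Analysis",),
    "assessment": ("Analysis",),
    "anti-vandalism and user warning": ("Identifying vandalism", "Warning users"),
    "copy editing": ("Editing or updating",),
    "copyediting": ("Editing or updating",),
    "upload": ("Uploading or importing content",),
    "change": (),  # excluded in the table due to vagueness
}


def controlled_tasks(raw_term):
    """Return the controlled Task values for an uncontrolled term.

    Matching ignores case and extra whitespace (an assumption of this sketch).
    Raises KeyError for terms that have not been mapped yet, so that unmapped
    terms are surfaced for review rather than silently dropped.
    """
    key = " ".join(raw_term.split()).lower()
    return list(TASK_MAPPING[key])


print(controlled_tasks("Anti-vandalism and user warning"))
print(controlled_tasks("  Copy   editing "))
print(controlled_tasks("Change"))
```

Storing the controlled values as a tuple keeps rows that map to two values, such as the anti-vandalism row, on the same footing as rows that map to one.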
Content type attribute term mapping
Uncontrolled value | Controlled attribute value |
---|---|
label | Categories and labels |
articles | Articles |
articles for creation | Articles |
audio | Audio |
automated contributions | exclude |
batch | exclude |
Body | exclude |
books | Books |
Categories | Categories and labels |
category | Categories and labels |
Code | Software or code |
Commons and files | too vague; can't map |
Content pages (encyclopedia articles, original texts) | Articles |
Contributions | Diffs and revision data |
coordinates | Geographic data |
csv | file format, not a content type |
Data (Wikidata items, structured file data) | Structured data |
Diffs | Diffs and revision data |
Discussions | Discussions |
Documentation | too broad |
Drafts | Drafts |
edit count | Diffs and revision data |
Edit filters | Diffs and revision data |
Edit form | Diffs and revision data |
Edit summary | Diffs and revision data |
edits | Diffs and revision data |
Feeds | too specific |
file | too broad; indicate more specific content type |
Files | too broad; indicate more specific content type |
Flagged revisions | Diffs and revision data |
image | Images |
images | Images |
infobox | too specific |
isbn | Bibliographic data |
lexeme | Linguistic data |
links | Links |
list | Lists |
Listings | Lists |
lists | Lists |
Logs | Logs |
map | Maps |
maps | Maps |
media | too broad |
Media (images, videos, sound recordings) | too broad |
missing | exclude |
Modules, scripts and stylesheets | Software or code |
ogg | Audio |
open license text | exclude |
Page information | Page metadata |
page views | Page metadata |
pages | Page metadata |
pageviews | Event data |
file format, not a content type | |
photos | Images |
projects | exclude |
properties | Structured data |
Queries | Event data |
query | Event data |
random | exclude |
Recent changes | Diffs and revision data |
redirect | Links |
Redirects | Links |
redlinks | Links |
reference | References |
References | References |
report | too broad |
reports | too broad |
rss | exclude |
Search form | too broad |
Shortcuts | exclude |
statistics | too broad |
statistics (Commons) | too broad |
statistics (Wikidata) | too broad |
Stats | too broad |
svg | Images |
table | Structured data |
template | Templates |
Templates | Templates |
timeline | Event data |
user edits | Diffs and revision data |
User information | User data |
vandalism | Diffs and revision data |
video | Videos |
videos | Videos |
views | Event data |
Watchlist | Watchlist |
web resource | Webpages |
What links here | Links |
wikitext | Wikitext |
written content | Articles |
youtube | Videos |
Platform attribute term mapping
Uncontrolled value | Controlled attribute value |
---|---|
command line tool | Command-line |
Command-line tools | Command-line |
desktop app | Desktop |
Image software extensions | Extension for existing software (non-MediaWiki) |
Integrated tools | MediaWiki |
mediawiki | MediaWiki |
mobile | Mobile / smartphone |
On external website | Web app |
on mobile devices | Mobile / smartphone |
Smartphone apps | Mobile / smartphone |
standalone desktop applications | Desktop |
Standalone software | exclude |
web app | Web app |
web tools | Web app |
Common areas of semantic ambiguity
To improve the existing data model, it's important to understand areas where previous attempts to model this space have generated ambiguity or shown inconsistency in how they handled similar concepts. These areas of ambiguity are the most important areas for the new taxonomy to standardize and clarify.
Audience vs. target area / domain vs. wiki project: Previous categorization schemes and data models have often mixed together attribute values related to the people, fields, wikis, wiki projects, and locales for which (or for whom) a tool may be especially useful. The final taxonomy must have a clear definition of what any "Audience", "Domain", or "Project" attribute captures, and those attributes should be clearly mutually exclusive. Examples:
- GLAM, Education → can be modeled as an audience (the humans who are in the GLAM field), but also as a use case or application domain.
- Chinese Wikipedia, English Wikipedia → these examples follow the anti-pattern of combining two concepts (language/locale and a specific wiki project, Wikipedia) into one attribute value.
Subjects, verbs, objects: Previous data models frequently differ in how they model the type of task the tool helps with vs. who it helps vs. what content it helps them work with. Intersections of audience, content format, and task were sometimes grouped into a "use cases" attribute to avoid this complexity. Examples:
- "categories" as a content format that a tool acts upon, vs. "Categorizing" as a contributor task. In the data I reviewed, there were 3 instances of noun usage and only one of gerund usage, so the final taxonomy prefers modeling "categories" as a noun object in an attribute like "Content type".
- "contributing content" vs. "Contributions" vs. "Contributors" (verb vs. noun object vs. noun subject). This set of ambiguous concepts probably suffers from the overall concept being too vague. It is better to model the type of content being contributed more specifically, along with a more specific set of contributor types or audiences, like "editors" or "developers".
- "vandalism" as a content format vs. "Patrolling" as a task. In most existing data models, patrolling has high prominence as a task or activity, and "vandalism" can come in many different content formats, so this ambiguity is best resolved by keeping "Patrolling"-related activities as verbs in an attribute focused on user tasks. However, we should also ensure that all tools that involve patrolling have vandalism as a keyword in their description or tags.
- A similar, but less clear example: is "statistics/metrics" a content format or a task (i.e. "analyzing")? For cases like this, we must be guided by user research, feedback, and data like search queries to determine the most likely way that users would expect to navigate to the relevant tools in this topic area. The current taxonomy uses the verb form and puts "Analysis" under the Task attribute, but we should test and iterate on this.
Audience vs. characteristics of audience vs. characteristics of the UI: Some previous data models highlighted "non-English speakers" as an audience, and many recommended a "language" attribute. The current Toolhub data model includes "available_ui_languages". The taxonomy must clearly differentiate and define the scope of any language- or locale-related attributes, so that it's clear whether they're describing attributes of the tool itself (UI language) or attributes of the audience it's meant to help.
Attributes of the tool itself vs. attributes of its use or context: In previous data models, "tool type" attributes have often conflated specific tool attributes with more generic attributes describing the type of general usage or application the tool has. These should be separate attributes. Examples:
- APIs, bots, coding environments, displays
- Productivity tools, editing tools
Levels of granularity: Past data models and existing, project-specific schemes often vary in how specific they are in capturing the following types of concepts:
- Projects and sub-projects: Wikidata, Structured Data, Commons, Structured Commons → There's limited utility in subdividing projects into sub-projects. Tools that serve these sub-domains of work on structured data can be easily found by combining a Project attribute like "Wikidata" with a Content Type attribute like "images" or "media".
- How specific to get in the wiki page structure: because there are scripts that operate on and help with specific subsections or pages, some data models have specified these elements. Is the granularity of "page" vs. "discussion" too much or too little to provide effective access to these tools? The taxonomy must definitely subdivide the tool space by the content types of "articles" vs. "media" vs. "data", so we need to be careful about whether we add too much granularity by modeling the subsections of pages as a content type.
- How specific to get in modeling file types and file formats? Data models or categorization schemes in specific domains have historically been more specific about the content formats most relevant to them. For example, Commons-focused tool descriptions and data models are likely not just to specify image vs. video, but also to care whether the file type is svg, jpg, png, tiff, ogg, or midi. The question for the Toolhub taxonomy is whether it can support discovery in those specific domains without modeling that level of granularity, and whether there are enough tools in the catalog to make detailed file formats a useful level of granularity.
From a UX perspective, the more shallow the taxonomy is, the better it is likely to work in the current Toolhub UI – there's not a ton of horizontal space for expanding subclasses and showing long lists of possible attributes for filtering.
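The first bullet above suggests that tools for a sub-project can be found by combining a Project attribute with a Content Type attribute instead of modeling the sub-project itself. A minimal Python sketch of that kind of combined filter follows. The record layout, the function name, and the tool names are all hypothetical and do not reflect the actual Toolhub data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRecord:
    """Hypothetical tool record with two flat, independent attributes."""
    name: str
    project: frozenset
    content_type: frozenset


def find_tools(tools, **required):
    """Return names of tools that carry every requested value for each attribute."""
    return [
        tool.name
        for tool in tools
        if all(set(values) <= getattr(tool, attr) for attr, values in required.items())
    ]


# Made-up catalog entries, for illustration only.
TOOLS = [
    ToolRecord("hypothetical-item-editor", frozenset({"Wikidata"}), frozenset({"Structured data"})),
    ToolRecord("hypothetical-image-matcher", frozenset({"Wikidata", "Commons"}), frozenset({"Images"})),
    ToolRecord("hypothetical-video-converter", frozenset({"Commons"}), frozenset({"Videos"})),
]

# Combining two shallow attributes stands in for a dedicated sub-project value.
wikidata_image_tools = find_tools(TOOLS, project={"Wikidata"}, content_type={"Images"})
print(wikidata_image_tools)
```

Because each attribute stays flat, a filter like this needs no knowledge of a sub-project hierarchy, which fits the preference for a shallow taxonomy in the browsing UI.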
Nominalizations vs. verb forms: A common inconsistency in past data models appears in how they formalize concepts that can be described with either a noun or a verb form. For example: "content recommenders" is a nominalization of the action "recommending content". The nominalization makes it easy to think of that attribute as a "tool type", and such nominalizations are common when describing bots (like in this article). Maybe we do this because we like to anthropomorphize bots, but fundamentally these types of attributes are describing actions that a user is trying to do with the help of a tool. Our taxonomy should choose one formulation and use it consistently. The proposed taxonomy uses verb forms to model the tasks that a tool helps its users do, rather than using ambiguous nominalizations of such actions.