DataNamespace

The problem

2010 Populations with multiracial identifiers
Group | 2010 Population | Percentage of Total Population
White | 22,953,374 | 61.6%
White, not Hispanic or Latino | 15,763,625 | 42.3%
Hispanic or Latino (of any race) | 14,013,719 | 37.6%
Mexican | 11,423,146 | 30.6%
Salvadoran | 573,956 | 1.5%
Guatemalan | 332,737 | 0.8%
Puerto Rican | 189,945 | 0.5%
Spaniard | 142,194 | 0.3%
Nicaraguan | 100,790 | 0.2%

Source: Demographics of California

Traffic by calendar year
Year | Passengers | Aircraft Movements | Freight (tons) | Mail (tons)
1994 | 51,050,275 | 689,888 | 1,516,567 | 186,878
1995 | 53,909,223 | 732,639 | 1,567,248 | 193,747
1996 | 57,974,559 | 763,866 | 1,696,663 | 194,091
1997 | 60,142,588 | 781,492 | 1,852,487 | 212,410
1998 | 61,215,712 | 773,569 | 1,787,400 | 264,473
1999 | 64,279,571 | 779,150 | 1,884,526 | 253,695
2000 | 67,303,182 | 783,433 | 2,002,614 | 246,538
2001 | 61,606,204 | 738,433 | 1,779,065 | 162,629
2002 | 56,223,843 | 645,424 | 1,869,932 | 92,422
2003 | 54,982,838 | 622,378 | 1,924,883 | 97,193
2004 | 60,704,568 | 655,097 | 2,022,911 | 92,402
2005 | 61,489,398 | 650,629 | 2,048,817 | 88,371
2006 | 61,041,066 | 656,842 | 2,022,687 | 80,395
2007 | 62,438,583 | 680,954 | 2,010,820 | 66,707
2008 | 59,815,646 | 622,506 | 1,723,038 | 73,505
2009 | 56,520,843 | 544,833 | 1,599,782 | 64,073
2010 | 59,069,409 | 575,835 | 1,852,791 | 74,034
2011 | 61,862,052 | 603,912 | 1,789,204 | 80,442
2012 | 63,688,121 | 605,480 | 1,866,432 | 96,779

Source: Los_Angeles_International_Airport#Traffic_and_statistics

There are thousands of data tables buried inside the body of Wikipedia articles. These tables are generally:

  • hard to reference: how do I cite or refer to a table? If I am lucky there's a fragment/id I can link to in an article, but in general data tables in Wikipedia are not objects that can be referenced in the same way as an image is.
  • hard to discover: for the same reason, it's impossible to obtain a human-readable list of tabular datasets that are included in Wikipedia articles.
  • hard to maintain: we do not provide table-specific versioning, meaning that a change to a dataset (a new row, an existing value modified) is just a regular article revision.
  • impossible to reuse across articles or projects: the same dataset in two articles of the same project or in the same article in two different projects would need to be copied and maintained twice.
  • visualization-unfriendly: we use static images or SVGs for timelines and plots that could be easily generated from a tabular data source.
  • hard to style consistently: templates and various hacks are needed to make tables render and behave consistently in the context of an article.
  • a huge source of pain for VE/Parsoid: parsing HTML tables in general (not just data tables), and the templates used to render them, is one of the biggest challenges for VisualEditor.

Conversely, we have tons of simple charts (such as those available on the Wikimedia reportcard) that cannot be easily reused or embedded in Wikipedia articles.

A proposal

One of many static barcharts used across Wikipedia

A dedicated namespace for tabular data (represented as delimiter-separated values or JSON) will offer several benefits:

  • revision control: individual datasets will become fully revision controlled and much easier to maintain.
  • citability: each dataset will have a canonical URI (project_id:namespace:page_id) that will make it uniquely identifiable both internally (in Wikimedia projects) and externally.
  • reusable: data tables, instead of living inside the body of an article, will be transcluded/embedded via Lua and become reusable across all Wikimedia projects (see the sketch after this list).
  • visualization-ready: tabular data that can be easily embedded into an article will allow us to develop MediaWiki extensions or gadgets to easily toggle between a tabular view and a chart view, replacing the need for static images or vector graphics.
  • consistently styled: editors can focus on curating the data and selecting a subset of meaningful options for rendering it as a table, instead of bothering with presentation issues. VisualEditor will have one less problem to worry about.
  • metadata: any page associated with a dataset can be used to store metadata, or (even better) the metadata can be stored on Wikidata if the data table exists as an entity in Wikidata.
  • machine readable: a uniquely identifiable object in a dedicated namespace can be exposed and accessed programmatically via the MediaWiki API.
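
As an illustration of the transclusion and rendering ideas above, here is a minimal Scribunto (Lua) sketch. The "Data:" page name, the module name, and the JSON layout with "headers" and "rows" fields are assumptions made for this example only; no such namespace or schema exists yet.

local p = {}

-- Render a hypothetical "Data:" page (stored as JSON) as a sortable wikitable.
-- Assumed JSON layout: { "headers": [...], "rows": [[...], [...]] }.
function p.render(frame)
    local pageName = frame.args[1]                      -- e.g. 'Data:LAX traffic.json' (hypothetical)
    local title = pageName and mw.title.new(pageName)
    local content = title and title:getContent()
    if not content then
        return 'Dataset not found: ' .. (pageName or '(no page given)')
    end
    local data = mw.text.jsonDecode(content)

    -- Build a standard sortable wikitable from the decoded dataset.
    local tbl = mw.html.create('table'):addClass('wikitable sortable')
    local headerRow = tbl:tag('tr')
    for _, h in ipairs(data.headers) do
        headerRow:tag('th'):wikitext(h)
    end
    for _, row in ipairs(data.rows) do
        local tr = tbl:tag('tr')
        for _, cell in ipairs(row) do
            tr:tag('td'):wikitext(tostring(cell))
        end
    end
    return tostring(tbl)
end

return p

An article could then invoke something like {{#invoke:TabularData|render|Data:LAX traffic.json}} (both names hypothetical), and the same call would be reusable from any page or project that can read the data page.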

Scope

The (initial) scope of this proposal is limited to:

  • tabular data already existing in Wikipedia articles, not original datasets imported from external sources
  • datasets of a sufficiently small size to be editable and rendered on-wiki (see discussion 1,2)

What about Wikidata?

Most of these motivations are the same as those used in the rationale for Wikidata, but Wikidata is focused on structured/semantic data, i.e. data that is typically used to express statements like: "entity Q has property P with qualifier R according to source S". With the exception of tables that can be generated as queries against structured data, support for tabular data (i.e. data that can be represented as a barchart or a timeseries) is not within the scope of Wikidata. (discussion)

State of the art

  • We already have JSON namespaces on Meta, with dedicated ContentHandler settings, serving various purposes from hosting data models (e.g. Schema:Edit) to Wikipedia Zero settings (e.g. Zero:250-99).
  • The WMF Multimedia team and Commons community are advocating the use of Wikidata to store media metadata. The same approach could be used to store metadata of tabular datasets.
  • In the Brede Wiki, Finn Årup Nielsen uses ordinary namespace pages to store comma-separated values (with a one-row header) for scientific data, see, e.g., Example on CSV file. This data can then be transcluded on other pages of the wiki, see, e.g., example. The transclusion uses the 'tab' tag from Johan the Ghost's 'SimpleTable' extension, wrapped in a template, which produces a static table rendering (apart from the standard sortable style); a rough sketch of this kind of CSV parsing is given after this list. The data from the CSV pages is also read by an external script that performs meta-analysis on the data, see, e.g., meta-analysis example; this script also allows the CSV data to be exported in JSON format. The 'semantic' annotation of the column headers takes place in standard MediaWiki templates that are aware of the format of the external script's API, see, e.g., the metaanalysis csv template referenced from the BiND metaanalysis section. This simple approach, which requires no modification of a standard MediaWiki installation beyond enabling the 'SimpleTable' extension, has been described in more detail in a few articles:
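
For comparison with the JSON sketch above, a rough Lua sketch of the CSV-on-a-page approach: it assumes a page holding plain comma-separated values with a one-row header and no quoted fields. The function and page names are illustrative only; this is not how the 'SimpleTable' extension or the Brede Wiki script actually works, just a sketch in the same spirit.

local p = {}

-- Parse a wiki page that holds plain CSV (one header row, comma-separated,
-- no quoting) into a list of records keyed by column name.
function p.parseCsvPage(pageName)
    local title = pageName and mw.title.new(pageName)
    local content = title and title:getContent()
    if not content then
        return nil
    end
    local lines = mw.text.split(content, '\n', true)
    local headers = mw.text.split(lines[1] or '', ',', true)
    local records = {}
    for i = 2, #lines do
        if mw.text.trim(lines[i]) ~= '' then
            local fields = mw.text.split(lines[i], ',', true)
            local record = {}
            for j, name in ipairs(headers) do
                record[mw.text.trim(name)] = mw.text.trim(fields[j] or '')
            end
            table.insert(records, record)
        end
    end
    return records
end

return p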