DataNamespace

This page in a nutshell: This draft proposes a dedicated namespace to host open (tabular) data and to make these datasets persistently identifiable, version-controlled, and easily embeddable in other wikis.

The problem

2010 Populations with multiracial identifiers
Group	2010 Population	Percentage of Total Population
White	22,953,374	61.6%
White, not Hispanic or Latino	15,763,625	42.3%
Hispanic or Latino (of any race)	14,013,719	37.6%
Mexican	11,423,146	30.6%
Salvadoran	573,956	1.5%
Guatemalan	332,737	0.8%
Puerto Rican	189,945	0.5%
Spaniard	142,194	0.3%
Nicaraguan	100,790	0.2%

Source: Demographics of California

Traffic by calendar year
Year	Passengers	Aircraft Movements	Freight (tons)	Mail (tons)
1994	51,050,275	689,888	1,516,567	186,878
1995	53,909,223	732,639	1,567,248	193,747
1996	57,974,559	763,866	1,696,663	194,091
1997	60,142,588	781,492	1,852,487	212,410
1998	61,215,712	773,569	1,787,400	264,473
1999	64,279,571	779,150	1,884,526	253,695
2000	67,303,182	783,433	2,002,614	246,538
2001	61,606,204	738,433	1,779,065	162,629
2002	56,223,843	645,424	1,869,932	92,422
2003	54,982,838	622,378	1,924,883	97,193
2004	60,704,568	655,097	2,022,911	92,402
2005	61,489,398	650,629	2,048,817	88,371
2006	61,041,066	656,842	2,022,687	80,395
2007	62,438,583	680,954	2,010,820	66,707
2008	59,815,646	622,506	1,723,038	73,505
2009	56,520,843	544,833	1,599,782	64,073
2010	59,069,409	575,835	1,852,791	74,034
2011	61,862,052	603,912	1,789,204	80,442
2012	63,688,121	605,480	1,866,432	96,779

Source: Los_Angeles_International_Airport#Traffic_and_statistics

There are thousands of data tables buried inside the body of Wikipedia articles. These tables are generally:

hard to reference: how do I cite or refer to a table? If I am lucky there's a fragment/id I can link to in an article, but in general data tables in Wikipedia are not objects that can be referenced in the same way as an image is.
hard to discover: for the same reason, it's impossible to obtain a human-readable list of tabular datasets that are included in Wikipedia articles.
hard to maintain: we do not provide table-specific versioning, meaning that a change to a dataset (a new row, an existing value modified) is just a regular article revision.
impossible to reuse across articles or projects: the same dataset in two articles of the same project or in the same article in two different projects would need to be copied and maintained twice.
visualization-unfriendly: we use static images or SVGs for timelines and plots that could be easily generated from a tabular data source.
hard to style consistently: templates and various hacks are used to make tables behave in the context of an article.
a huge source of pain for VisualEditor/Parsoid: parsing HTML tables in general (not just data tables), along with the templates used to render them, is one of the biggest challenges for VisualEditor.
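To make the maintenance point concrete, here is a minimal Python sketch of what a table-specific diff between two revisions of a dataset could look like. The function names (`parse_table`, `diff_tables`) and the two revision strings are illustrative assumptions, not part of MediaWiki. The second revision contains a made-up value change purely for demonstration.

```python
import csv
import io


def parse_table(text, key):
    """Parse tab-separated text with a header row into {key value: row dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[key]: row for row in reader}


def diff_tables(old, new):
    """Return (added keys, removed keys, {key: {column: (old, new)}})."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for k in sorted(set(old) & set(new)):
        cells = {
            col: (old[k][col], new[k].get(col))
            for col in old[k]
            if old[k][col] != new[k].get(col)
        }
        if cells:
            changed[k] = cells
    return added, removed, changed


# Revision 1: two rows taken from the airport traffic table above.
REV_1 = "Year\tPassengers\n2010\t59,069,409\n2011\t61,862,052\n"
# Revision 2: a hypothetical edit that alters the 2011 value (made up for
# illustration) and appends the 2012 row.
REV_2 = (
    "Year\tPassengers\n"
    "2010\t59,069,409\n"
    "2011\t61,862,053\n"
    "2012\t63,688,121\n"
)

added, removed, changed = diff_tables(
    parse_table(REV_1, "Year"), parse_table(REV_2, "Year")
)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

A diff of this kind reports a new row and a modified cell as such, whereas a regular article revision only shows changed lines of wikitext.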

Conversely, we have tons of simple charts (such as those available on the Wikimedia reportcard) that cannot be easily reused or embedded in Wikipedia articles.

A proposal

One of many static bar charts used across Wikipedia

A dedicated namespace for tabular data (represented as delimiter-separated values or JSON) will offer several benefits:

revision control: individual datasets will become fully revision controlled and much easier to maintain.
citability: each dataset will have a canonical URI (project_id:namespace:page_id) that makes it uniquely identifiable both internally (in Wikimedia projects) and externally.
reusable: data tables, instead of living inside the body of an article, will be transcluded/embedded via Lua and become reusable across all Wikimedia projects.
visualization-ready: tabular data that can be easily embedded into an article will allow us to develop MediaWiki extensions or gadgets that toggle between a tabular view and a chart view, removing the need for static images or vector graphics.
consistently styled: editors can focus on curating the data and selecting a subset of meaningful options for rendering it as a table, instead of bothering with presentation issues. VisualEditor will have one less problem to worry about.
metadata: any page associated with a dataset can be used to store metadata, or (even better) the metadata can be stored on Wikidata if the data table exists as an entity in Wikidata.
machine readable: a uniquely identifiable object in a dedicated namespace can be exposed and accessed programmatically via the MediaWiki API.
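As a rough illustration of the points above, the following Python sketch models a dataset that is parsed from delimiter-separated text, carries an identifier in the project_id:namespace:page_id form, and can be serialized to JSON. The `Dataset` class, the `from_dsv` helper, the JSON layout, and the values "enwiki", "Data", and "12345" are all assumptions made for this example; the proposal does not specify a storage format beyond delimiter-separated values or JSON.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A tabular dataset addressed by project, namespace, and page."""
    project_id: str
    namespace: str
    page_id: str
    columns: list
    rows: list = field(default_factory=list)

    @property
    def identifier(self):
        # project_id:namespace:page_id, the scheme named in the proposal.
        return f"{self.project_id}:{self.namespace}:{self.page_id}"

    def to_json(self):
        """Serialize the identifier, header, and rows as a JSON string."""
        return json.dumps(
            {"id": self.identifier, "columns": self.columns, "rows": self.rows}
        )


def from_dsv(text, delimiter, project_id, namespace, page_id):
    """Build a Dataset from delimiter-separated text with a one-row header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    columns = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return Dataset(project_id, namespace, page_id, columns, rows)


# First rows of the airport traffic table above, tab-separated.
TSV = (
    "Year\tPassengers\tAircraft Movements\n"
    "1994\t51,050,275\t689,888\n"
    "1995\t53,909,223\t732,639\n"
)

# "enwiki", "Data", and "12345" are hypothetical placeholder values.
dataset = from_dsv(TSV, "\t", "enwiki", "Data", "12345")
print(dataset.identifier)
print(dataset.to_json())
```

Keeping the header separate from the rows lets the same object feed either a table renderer or a chart renderer without re-parsing article wikitext.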

Scope

The (initial) scope of this proposal is limited to:

tabular data already existing in Wikipedia articles, not original datasets imported from external sources
datasets small enough to be edited and rendered on-wiki (see discussion 1,2)

What about Wikidata?

Most of these motivations also appear in the rationale for Wikidata, but Wikidata is focused on structured/semantic data, i.e. data that is typically used to express statements like: "entity Q has property P with qualifier R according to source S". With the exception of tables that can be generated as queries against structured data, support for tabular data (i.e. data that can be represented as a bar chart or a time series) is not within the scope of Wikidata. (discussion)
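The difference between the two shapes of data can be sketched in Python. The example below expands a small table into one statement per data cell, using a deliberately simplified `Statement` record that mirrors the "entity Q, property P, qualifier R, source S" wording above. It is not Wikidata's actual data model, and "Q_airport" and "S_source" are hypothetical placeholders rather than real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    entity: str     # "Q": the item the statement is about
    prop: str       # "P": the property
    value: str      # the value of the property
    qualifier: str  # "R": here, the row key (e.g. the year)
    source: str     # "S": where the value comes from


def table_to_statements(entity, header, rows, source):
    """Expand a table into one statement per data cell.

    The first column is treated as the row key and becomes the qualifier.
    """
    key_name, *measures = header
    statements = []
    for row in rows:
        key, *cells = row
        for prop, value in zip(measures, cells):
            statements.append(
                Statement(entity, prop, value, f"{key_name}={key}", source)
            )
    return statements


# Three rows of the airport traffic table shown earlier on this page.
HEADER = ["Year", "Passengers", "Aircraft Movements"]
ROWS = [
    ["2010", "59,069,409", "575,835"],
    ["2011", "61,862,052", "603,912"],
    ["2012", "63,688,121", "605,480"],
]

# "Q_airport" and "S_source" are hypothetical placeholders.
statements = table_to_statements("Q_airport", HEADER, ROWS, "S_source")
print(len(statements))
print(statements[0])
```

Under these assumptions, a table with three rows and two measures becomes six separate statements, which suggests why a time series is more naturally stored and edited as a table.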

State of the art

We already have JSON namespaces on Meta, with dedicated ContentHandler settings, that serve various purposes, from hosting data models (e.g. Schema:Edit) to Wikipedia Zero settings (e.g. Zero:250-99).
The WMF Multimedia team and Commons community are advocating the use of Wikidata to store media metadata. The same approach could be used to store metadata of tabular datasets.
In the Brede Wiki, Finn Årup Nielsen uses ordinary namespace pages to store scientific data as comma-separated values with a one-row header; see, e.g., Example on CSV file. This data can then be transcluded on other pages of the wiki; see, e.g., example. The transclusion uses the 'tab' tag from Johan the Ghost's 'SimpleTable' extension, wrapped in a template, which renders a static table (apart from the standard sortable style). An external script reads the data from the CSV pages and performs meta-analysis on it; see, e.g., meta-analysis example. This script can also export the CSV data in JSON format. The 'semantic' annotation of the column headers takes place in standard MediaWiki templates that are aware of the format of the external script's API; see, e.g., metaanalysis csv template referenced from BiND metaanalysis section. This simple approach, which requires no modification of a standard MediaWiki installation beyond enabling the 'SimpleTable' extension, is described in more detail in a few articles:
- Online open neuroimaging mass meta-analysis with a wiki
- Online open neuroimaging mass meta-analysis (shorter paper)
- Brede tools and federating online neuroinformatics databases (includes some mention of the system)
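The CSV convention described for the Brede Wiki (a one-row header followed by data rows, exportable as JSON) can be sketched in Python as follows. This is not the Brede Wiki's actual script: the helper names `to_number` and `csv_page_to_json` are assumptions, and the sample page reuses two rows of the airport traffic table from this page with the thousands separators removed.

```python
import csv
import io
import json


def to_number(cell):
    """Convert a cell to int or float when possible, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(cell)
        except (ValueError, TypeError):
            pass
    return cell


def csv_page_to_json(text):
    """Turn CSV text with a one-row header into a JSON array of records."""
    reader = csv.DictReader(io.StringIO(text))
    records = [{k: to_number(v) for k, v in row.items()} for row in reader]
    return json.dumps(records)


# Hypothetical content of a CSV page: a header row, then data rows.
CSV_PAGE = "Year,Passengers\n1994,51050275\n1995,53909223\n"

exported = csv_page_to_json(CSV_PAGE)
print(exported)
```

Because the header row supplies the field names, each exported record is self-describing, which is what allows an external script to consume the page without knowing its layout in advance.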