DataNamespace
This page is kept for historical interest. Any policies mentioned may be obsolete. The Data: namespace has been implemented: Tabular data.
This page in a nutshell: This draft is a proposal to create a dedicated namespace to host open (tabular) data and make these datasets persistently identifiable, version-controlled and easily embeddable into other wikis.
The problem
Group | 2010 Population | Percentage of Total Population |
---|---|---|
White | 22,953,374 | 61.6% |
White, not Hispanic or Latino | 15,763,625 | 42.3% |
Hispanic or Latino (of any race) | 14,013,719 | 37.6% |
Mexican | 11,423,146 | 30.6% |
Salvadoran | 573,956 | 1.5% |
Guatemalan | 332,737 | 0.8% |
Puerto Rican | 189,945 | 0.5% |
Spaniard | 142,194 | 0.3% |
Nicaraguan | 100,790 | 0.2% |
Source: Demographics of California
Year | Passengers | Aircraft Movements | Freight (tons) | Mail (tons) |
---|---|---|---|---|
1994 | 51,050,275 | 689,888 | 1,516,567 | 186,878 |
1995 | 53,909,223 | 732,639 | 1,567,248 | 193,747 |
1996 | 57,974,559 | 763,866 | 1,696,663 | 194,091 |
1997 | 60,142,588 | 781,492 | 1,852,487 | 212,410 |
1998 | 61,215,712 | 773,569 | 1,787,400 | 264,473 |
1999 | 64,279,571 | 779,150 | 1,884,526 | 253,695 |
2000 | 67,303,182 | 783,433 | 2,002,614 | 246,538 |
2001 | 61,606,204 | 738,433 | 1,779,065 | 162,629 |
2002 | 56,223,843 | 645,424 | 1,869,932 | 92,422 |
2003 | 54,982,838 | 622,378 | 1,924,883 | 97,193 |
2004 | 60,704,568 | 655,097 | 2,022,911 | 92,402 |
2005 | 61,489,398 | 650,629 | 2,048,817 | 88,371 |
2006 | 61,041,066 | 656,842 | 2,022,687 | 80,395 |
2007 | 62,438,583 | 680,954 | 2,010,820 | 66,707 |
2008 | 59,815,646 | 622,506 | 1,723,038 | 73,505 |
2009 | 56,520,843 | 544,833 | 1,599,782 | 64,073 |
2010 | 59,069,409 | 575,835 | 1,852,791 | 74,034 |
2011 | 61,862,052 | 603,912 | 1,789,204 | 80,442 |
2012 | 63,688,121 | 605,480 | 1,866,432 | 96,779 |
Source: Los_Angeles_International_Airport#Traffic_and_statistics
There are thousands of data tables buried inside the body of Wikipedia articles. These tables are generally:
- hard to reference: how do I cite or refer to a table? If I am lucky there's a fragment/id I can link to in an article, but in general data tables in Wikipedia are not objects that can be referenced in the same way as an image is.
- hard to discover: for the same reason, it's impossible to obtain a human-readable list of tabular datasets that are included in Wikipedia articles.
- hard to maintain: we do not provide table-specific versioning, meaning that a change to a dataset (a new row, an existing value modified) is just a regular article revision.
- impossible to reuse across articles or projects: the same dataset in two articles of the same project or in the same article in two different projects would need to be copied and maintained twice.
- visualization-unfriendly: we use static images or SVGs for timelines and plots that could be easily generated from a tabular data source.
- hard to style consistently: templates and various hacks are needed to make tables behave properly in the context of an article.
- a huge source of pain for VisualEditor/Parsoid: parsing HTML tables in general (not just data tables), together with the templates used to render them, is one of the biggest challenges for VisualEditor.
Conversely, we have tons of simple charts (such as those available on the Wikimedia reportcard) that cannot be easily reused or embedded in Wikipedia articles.
A proposal
A dedicated namespace for tabular data (represented as delimiter-separated values or JSON) will offer several benefits:
- revision control: individual datasets will become fully revision controlled and much easier to maintain.
- citability: each dataset will have a canonical URI (project_id:namespace:page_id) that would make it uniquely identifiable internally (in Wikimedia projects) and externally.
- reusable: data tables, instead of living inside the body of an article, will be transcluded/embedded via Lua and become reusable across all Wikimedia projects.
- visualization-ready: tabular data that can be easily embedded into an article will allow us to develop MediaWiki extensions or gadgets that toggle between a tabular view and a chart view, removing the need for static images or vector graphics.
- consistently styled: editors can focus on curating the data and selecting a subset of meaningful options for rendering it as a table, instead of bothering with presentation issues. VisualEditor will have one less problem to worry about.
- metadata: any page associated with a dataset can be used to store metadata, or (even better) the metadata can be stored on Wikidata if the data table exists as an entity in Wikidata.
- machine readable: a uniquely identifiable object in a dedicated namespace can be exposed and accessed programmatically via the MediaWiki API.
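To illustrate how several of these benefits fit together, here is a minimal Python sketch, not the design of any actual extension. It parses a delimiter-separated dataset, attaches an identifier in the `project_id:namespace:page_id` form mentioned above, and extracts a column as a chart-ready series. The `Dataset` class, the `parse_dataset` helper, and the values `enwiki`, `Data` and `12345` are all hypothetical; the figures are the first three rows of the airport table above, restricted to integer columns for simplicity.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical in-memory form of a page in a tabular data namespace."""
    project_id: str  # assumed value, e.g. "enwiki"
    namespace: str   # assumed value, e.g. "Data"
    page_id: int     # assumed page identifier
    header: list
    rows: list

    @property
    def uri(self) -> str:
        # Identifier of the form project_id:namespace:page_id
        return f"{self.project_id}:{self.namespace}:{self.page_id}"

    def column(self, name):
        # Return one column as a list, ready to feed a chart or a table view
        index = self.header.index(name)
        return [row[index] for row in self.rows]


def parse_dataset(text, project_id, namespace, page_id, delimiter=","):
    """Parse delimiter-separated text whose first row is the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # This sketch assumes every data cell is an integer
    rows = [[int(cell) for cell in row] for row in reader]
    return Dataset(project_id, namespace, page_id, header, rows)


raw = "Year,Passengers\n1994,51050275\n1995,53909223\n1996,57974559\n"
lax = parse_dataset(raw, "enwiki", "Data", 12345)

print(lax.uri)
print(list(zip(lax.column("Year"), lax.column("Passengers"))))
```

Because the same object yields both the rows (for a table view) and individual columns (for a chart view), a rendering layer could switch between the two without the data being duplicated in the article.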
Scope
The (initial) scope of this proposal is limited to:
- tabular data already existing in Wikipedia articles, not original datasets imported from external sources
- datasets of a sufficiently small size to be editable and rendered on-wiki (see discussion 1,2)
What about Wikidata?
Most of these motivations are the same as those used in the rationale for Wikidata, but Wikidata is focused on structured/semantic data, i.e. data that is typically used to express statements like: "entity Q has property P with qualifier R according to source S". With the exception of tables that can be generated as queries against structured data, support for tabular data (i.e. data that can be represented as a bar chart or a time series) is not within the scope of Wikidata. (discussion)
State of the art
- We already have JSON namespaces on Meta, with dedicated ContentHandler settings, that are serving various purposes, from hosting data models (e.g. Schema:Edit) to Wikipedia Zero settings (e.g. Zero:250-99).
- The WMF Multimedia team and Commons community are advocating the use of Wikidata to store media metadata. The same approach could be used to store metadata of tabular datasets.
- In the Brede Wiki, Finn Årup Nielsen uses ordinary namespace pages to store scientific data as comma-separated values with a one-row header; see, e.g., Example on CSV file. This data can then be transcluded on other pages of the wiki; see, e.g., example. The transclusion uses the 'tab' tag from Johan the Ghost's 'SimpleTable' extension, wrapped in a template, which renders a static table (apart from the standard sortable style). An external script reads the data from the CSV pages and performs meta-analysis on it; see, e.g., meta-analysis example. The same script can also export the CSV data in JSON format. The 'semantic' annotation of the column headers takes place in standard MediaWiki templates that are aware of the format of the external script's API; see, e.g., metaanalysis csv template referenced from BiND metaanalysis section. This simple approach, which requires no modification of a standard MediaWiki installation beyond enabling the 'SimpleTable' extension, has been described in more detail in a few articles:
- Online open neuroimaging mass meta-analysis with a wiki
- Online open neuroimaging mass meta-analysis (shorter paper)
- Brede tools and federating online neuroinformatics databases (includes some mention of the system)
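The CSV-to-JSON export step described for the Brede Wiki can be sketched roughly as follows. This is not the actual Brede script, only a minimal Python illustration that assumes the page text is plain CSV with a one-row header. The function name `csv_page_to_json`, the column names and the sample values are invented for the example.

```python
import csv
import io
import json


def csv_page_to_json(page_text):
    """Convert CSV page text with a one-row header into a JSON array of objects."""
    # DictReader uses the first row as field names for every following row
    reader = csv.DictReader(io.StringIO(page_text))
    # Values stay as strings; typing the columns is left to the consumer
    return json.dumps(list(reader), indent=2)


# Invented sample standing in for the wikitext of a CSV page
page_text = "study,n,mean\nA,12,3.4\nB,20,2.9\n"
exported = csv_page_to_json(page_text)
print(exported)
```

Keeping every value as a string mirrors the fact that a CSV page carries no type information; in the setup described above, that kind of annotation is handled separately in templates.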