Connected Open Heritage/Wikidata migration/Documentation

From Meta, a Wikimedia project coordination wiki

DATA EXPLORATION

  • Set up a milestone on Phabricator under Connected-Open-Heritage-Wikidata-migration, named after the database table (for example, se-arbetsl).
  • Set up a page under d:Wikidata:WikiProject_WLM/Mapping_tables
    • Fill it out with sample data.
    • Note: As of now, these are all created and filled out thanks to this script. It only needs to be rerun if a new table is added to the WLM db.
  • Look at the unique identifier of each item. Does it correspond to an identifier in an external source?
    • If yes, find or request an appropriate property.
    • If no (i.e. the ID is just for internal WLM use), this might mean the dataset is not suitable for import. Without a real-world reference, we can't tell much about the completeness or selection criteria of the data.
  • Identify heritage status. Do all the items represent the same type of heritage protection (e.g. national monument in <country>)?
    • If not, how can the heritage status of each item be inferred?
    • Create or edit any necessary items, for example cultural monument of the Czech Republic (Q385405). Each such item should at least have a country and a subclass of statement pointing to cultural property or national heritage site.
  • Identify P31 (instance of).
    • Choose a default P31 for all the items: something basic like building or ancient monument.
    • Sometimes a separate column, such as type, holds this information. Where its value can be mapped, use it in place of the default.
  • Create necessary lookup tables.
    • Some fields have a limited range of distinct values, for example se-fornmin_(sv)/types.
    • In SQL, you can list the distinct values using SELECT DISTINCT columnname FROM tablename;
    • The script for this is here.
    • Focus on mapping the most common values first.
  • Identify and download any necessary offline data.
    • This avoids live queries while the program is running, which are slow.
    • Typically this is data such as place names and administrative units.
    • It should be data that does not change often.
  • Identify areas that can benefit from community input.
    • Areas that are problematic because of language.
    • Areas that are problematic because of a lack of factual knowledge.
  • Labels and descriptions
    • Can the name column be used as-is as the label?
    • Descriptions can be built from the default P31 or heritage status together with the country or administrative location.
    • Should descriptions be added in languages other than that of the dataset?
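
To decide which values of a limited-range column to map first, it helps to count how often each distinct value occurs. The following is a minimal sketch, not the project's actual script: it uses an in-memory SQLite database so that it is self-contained, whereas the WLM database is a separate system. The helper names, the table layout and the sample rows are all assumptions made up for illustration.

```python
import sqlite3


def quote_ident(name):
    """Quote a table or column name so that names containing hyphens
    (such as se-fornmin) can be used safely in the query text."""
    return '"' + name.replace('"', '""') + '"'


def value_frequencies(conn, table, column):
    """Return (value, count) pairs for one column, most common first.

    Ties are broken alphabetically so the result is deterministic.
    """
    col = quote_ident(column)
    sql = (
        f"SELECT {col}, COUNT(*) AS n "
        f"FROM {quote_ident(table)} "
        f"GROUP BY {col} "
        f"ORDER BY n DESC, {col} ASC"
    )
    return conn.execute(sql).fetchall()


def demo():
    """Build a tiny stand-in table (invented data) and count its types."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "se-fornmin" (id TEXT, type TEXT)')
    conn.executemany(
        'INSERT INTO "se-fornmin" VALUES (?, ?)',
        [
            ("1", "stone setting"),
            ("2", "cairn"),
            ("3", "stone setting"),
            ("4", "mound"),
            ("5", "stone setting"),
            ("6", "cairn"),
        ],
    )
    return value_frequencies(conn, "se-fornmin", "type")


if __name__ == "__main__":
    for value, count in demo():
        print(count, value)
```

The values at the top of the output are the ones worth mapping first in the lookup table; rare values can be left for later or for community input.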

CODING

  • Create a basic mapping file like this one.
    • It contains data that applies to all the items.
    • If possible, specify a unique property (for the ID number). It is used together with monument_article to check whether an item might already exist.
  • Create statements for all relevant columns.
  • All statements have a source -- see phab:T155241.
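
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual mapping format or import code: the MAPPING layout, the function names, and the placeholder identifiers P_EXTERNAL_ID and Q_DATASET are hypothetical. P31 (instance of), P17 (country), P1435 (heritage designation), P248 (stated in) and P813 (retrieved) are existing Wikidata properties, and the Czech example values are only there to make the sketch concrete.

```python
# Hypothetical mapping for one dataset. "defaults" holds the data that
# applies to every item in the table.
MAPPING = {
    "defaults": {
        "P31": "Q41176",     # instance of: building (basic default)
        "P17": "Q213",       # country: Czech Republic
        "P1435": "Q385405",  # heritage designation: cultural monument of the Czech Republic
    },
    "id_column": "id",               # column holding the external identifier
    "id_property": "P_EXTERNAL_ID",  # placeholder, not a real property ID
    "source_item": "Q_DATASET",      # placeholder item describing the source dataset
}


def make_reference(mapping, retrieved):
    """Build the source attached to every statement: stated in + retrieved."""
    return {"P248": mapping["source_item"], "P813": retrieved}


def build_statements(row, mapping, retrieved):
    """Turn one database row into a list of sourced statements."""
    reference = make_reference(mapping, retrieved)
    statements = []
    for prop, value in mapping["defaults"].items():
        statements.append(
            {"property": prop, "value": value, "reference": reference}
        )
    external_id = row.get(mapping["id_column"])
    if external_id:
        statements.append(
            {
                "property": mapping["id_property"],
                "value": external_id,
                "reference": reference,
            }
        )
    return statements


if __name__ == "__main__":
    for statement in build_statements({"id": "12345"}, MAPPING, "2017-03-01"):
        print(statement)
```

Routing every statement through one function that attaches the reference makes it hard to create an unsourced statement by accident when further columns are added.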

UPLOADING

  • Create a page with a preview of the processed data.
  • Request permission.
    • Link to the preview.
    • Describe how the data is processed.
    • Describe how already existing items are detected.
  • Do a test upload of about 10 items.
  • Upload the full dataset.
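
How existing items are detected, and how a small test batch is selected, might look like the sketch below. It is an assumption-laden illustration rather than the project's code: the function names, the two lookup dictionaries, the matching order (external ID first, then monument_article) and the item IDs in the example are all hypothetical, and nothing is actually written to Wikidata.

```python
from itertools import islice


def find_existing_item(row, items_by_external_id, items_by_article):
    """Return the ID of an already existing item for this row, or None.

    Assumed order: match on the unique external ID first, then fall back
    to the article linked in the monument_article column.
    """
    external_id = row.get("id")
    if external_id and external_id in items_by_external_id:
        return items_by_external_id[external_id]
    article = row.get("monument_article")
    if article and article in items_by_article:
        return items_by_article[article]
    return None


def plan_upload(rows, items_by_external_id, items_by_article, limit=None):
    """Decide per row whether to update an existing item or create one.

    With limit=10 only the first ten rows are planned, which corresponds
    to a test upload; limit=None covers the whole dataset.
    """
    plan = []
    for row in islice(rows, limit):
        qid = find_existing_item(row, items_by_external_id, items_by_article)
        action = "update" if qid else "create"
        plan.append((action, qid, row))
    return plan


if __name__ == "__main__":
    rows = [{"id": str(i), "monument_article": ""} for i in range(12)]
    rows[5]["monument_article"] = "Example article"
    test_plan = plan_upload(
        rows,
        {"3": "Q-EXAMPLE-A"},                # invented lookup by external ID
        {"Example article": "Q-EXAMPLE-B"},  # invented lookup by article
        limit=10,
    )
    for action, qid, row in test_plan:
        print(action, qid, row["id"])
```

Printing the plan before uploading gives material for the preview page and for the description of how existing items are detected.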