User:A ka es/OpenRefine/wikimania2019 lightningtalk quality
Lightning Talk at #wikimania2019
| Description | Slides |
|---|---|
| These are the slides as a .pdf file; screencasts cannot be played in this format. | |
explore
| Description | Screencast |
|---|---|
| * 65 files: 1 .html, 1 .txt, 63 .csv * .csv file names: 1 single file (EP info), 3 groups (committees, groups, countries) | |
| * reading the .txt and the .html to get the meta information * the data in the files seem congruent; the .html file contains links to other sources * I will create 4 projects to compare | |
compare and analyze
match with wikidata
recheck with wikidata
| Description | Screencast |
|---|---|
| * via MEP ID * via Twitter username (both are Wikidata identifiers) | |
| * 357 items are true: I am sure these matches are okay => mark them as "okay" * 390 items are false; next check: all of them are empty, no false entries => more checks are needed | |
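The recheck logic splits the 390 "false" items into empty values (no data on Wikidata yet) and real conflicts. A minimal sketch of that three-way split, with hypothetical field names (`mep_id` from the .csv, `wd_mep_id` pulled from Wikidata):

```python
def check_matches(rows):
    """Split reconciled rows into true matches, empty Wikidata values,
    and real conflicts. Field names are assumptions for illustration."""
    true_items, empty_items, conflicts = [], [], []
    for row in rows:
        if row["wd_mep_id"] == "":
            empty_items.append(row)    # no value on Wikidata yet -> needs more checks
        elif row["mep_id"] == row["wd_mep_id"]:
            true_items.append(row)     # safe to mark as "okay"
        else:
            conflicts.append(row)      # a genuinely false entry
    return true_items, empty_items, conflicts
```

In the talk's data, all non-matching items fell into the "empty" bucket, which is why more checks (rather than corrections) were the next step.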
find the gaps via SPARQL and OpenRefine
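A sketch of the kind of query this step could run against the Wikidata Query Service: find MEP items that are missing the identifier used for matching. The property and item IDs (P39 "position held", Q27169 "member of the European Parliament", P1186 "MEP directory ID") are my assumptions, not taken from the talk.

```python
from urllib.parse import urlencode

# Assumed Wikidata identifiers -- verify before use:
# P39 = position held, Q27169 = member of the European Parliament,
# P1186 = MEP directory ID.
GAP_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item p:P39/ps:P39 wd:Q27169 .                    # position held: MEP
  FILTER NOT EXISTS { ?item wdt:P1186 ?mepId . }    # no MEP ID yet -> a gap
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def endpoint_url(query, base="https://query.wikidata.org/sparql"):
    """Build a GET URL for the query service, asking for JSON results."""
    return base + "?" + urlencode({"query": query, "format": "json"})
```

The JSON result list can then be loaded into OpenRefine as a new project and compared against the eu.zip data.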
connect the projects and enrich data
| Description | Screencast |
|---|---|
| Combining the data from the eu.zip with the Wikidata values creates a very powerful worksheet: * you can find the patterns * you can clean the data and upload the changes in separate steps * you can recheck the results after every step * you can combine new data * you can interrupt and finish later | |
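The "combine" step is essentially a left join on a shared key. A minimal stdlib sketch, assuming hypothetical field names (`mep_id` as the join key):

```python
def enrich(eu_rows, wd_rows, key="mep_id"):
    """Left-join rows from the eu.zip export with Wikidata rows on a
    shared key. Rows without a Wikidata counterpart pass through
    unchanged, so the gaps stay visible in the worksheet."""
    wd_by_key = {row[key]: row for row in wd_rows}
    merged = []
    for row in eu_rows:
        combined = dict(row)                       # keep the source columns
        combined.update(wd_by_key.get(row[key], {}))  # add Wikidata columns if present
        merged.append(combined)
    return merged
```

In OpenRefine itself this corresponds to pulling values across projects with `cell.cross()`; the sketch just shows the shape of the operation.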
data processing is a circle
| Description | Screenshot |
|---|---|
| * finding patterns and failures, reconciling the next columns, preparing the Wikidata schema, uploading the data, rechecking the uploaded data: many steps for a workshop * data processing is a circle: with every repaired data cluster you will find new failures and have to find the new pattern * here: the start date /o\ | |
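That circle can be sketched as a loop: find failures, repair them, recheck, and repeat until a pass finds nothing new. The function names and the round cap are illustrative, not the talk's code.

```python
def processing_cycle(rows, find_failures, repair, max_rounds=10):
    """Repeat find -> repair until a round finds no failures (the 'circle').

    find_failures(rows) returns the rows that still need work;
    repair(rows, failures) returns a cleaned copy of all rows.
    Returns the cleaned rows and how many repair rounds were needed.
    """
    for round_no in range(1, max_rounds + 1):
        failures = find_failures(rows)
        if not failures:
            return rows, round_no - 1   # clean pass: the circle closes
        rows = repair(rows, failures)   # every repair may expose new failures
    return rows, max_rounds             # safety cap against endless circling
```

For the start-date example in the talk, `find_failures` would flag rows with missing or malformed start dates, and each repaired cluster would be rechecked on the next pass.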