User:A ka es/OpenRefine/wikimania2019 lightningtalk quality
Lightning Talk at #wikimania2019
Description | Slides |
---|---|
These are the slides as a .pdf file; screencasts cannot be played in this format. |
explore
Description | Screencast |
---|---|
* 65 files: 1 .html, 1 .txt, 63 .csv * the .csv file names fall into 1 single file (EP info) and 3 groups (committees, groups, countries) |
|
* reading the .txt and .html files to get the meta information
* the data in the files seem to be congruent; the .html file contains links to other sources * I will create 4 projects to compare |
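
The first exploration step (counting the files per type) can also be sketched outside OpenRefine. This is only a minimal sketch: the file names below are hypothetical, because the real names in the archive are not listed here.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(file_names):
    """Count files per lower-cased extension, e.g. '.csv'."""
    return Counter(PurePosixPath(name).suffix.lower() for name in file_names)


# Hypothetical file names, used for illustration only.
example_names = [
    "readme.txt",
    "index.html",
    "ep_info.csv",
    "committees_a.csv",
    "countries_de.csv",
]
extension_counts = count_extensions(example_names)
```

For a real archive, the list of names could come from `zipfile.ZipFile(path).namelist()` in the Python standard library.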
compare and analyze
Description | Screencast |
---|---|
* compare the data * analyze the data * why are there so many more data rows in the file "committees"? Does one file have more data for an MEP than the others? * the committees file contains 745 MEPs - some of them are in more than one committee |
|
* the columns are equal => I will work with the EP info file and keep the others to fill gaps if needed |
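
The question about the extra rows in the committees file comes down to comparing the number of rows with the number of distinct MEPs. A minimal sketch of that check, assuming a hypothetical `mep_id` column and made-up sample rows (the real column names are not given here):

```python
import csv
import io
from collections import Counter


def rows_per_mep(csv_text, id_column):
    """Return how many rows each MEP identifier occupies in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[id_column] for row in reader)


# Made-up sample: one MEP sits in two committees, so there is one extra row.
sample_csv = """mep_id,name,committee
1,Alice Example,AFET
1,Alice Example,BUDG
2,Bob Example,ENVI
"""

row_counts = rows_per_mep(sample_csv, "mep_id")
total_rows = sum(row_counts.values())
distinct_meps = len(row_counts)
in_several_committees = sorted(k for k, n in row_counts.items() if n > 1)
```

If `total_rows` is larger than `distinct_meps`, the difference is explained by the identifiers listed in `in_several_committees`.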
match with wikidata
Description | Screencast |
---|---|
* try the options: mark 3 data rows and test * reconciling against Q5 works; I split the data rows into groups to keep control over the reconciliation process => 561 items are matched, 186 not * looking up the unmatched names in different Wikipedia language versions to find the Wikidata item; matching manually |
|
* 743 matched * 4 without a Wikipedia page and without a Wikidata item |
recheck with wikidata
Description | Screencast |
---|---|
* via MEP ID * via Twitter username (both are Wikidata identifiers) |
|
* 357 items are true - I am sure the matches are okay => mark them as “okay” * 390 items are false: next check: all of them are empty, no false entries => more checks are needed |
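
The recheck distinguishes three cases: the identifier in the source data equals the one on the Wikidata item, the Wikidata value is missing, or the two values differ. A minimal sketch of that classification, assuming simple string comparison after trimming and case folding; the function names and sample values are hypothetical:

```python
from collections import Counter


def normalize(value):
    """Trim and case-fold a value; treat None as empty."""
    return "" if value is None else str(value).strip().casefold()


def classify(source_value, wikidata_value):
    """Classify one identifier comparison as okay, empty, unchecked or conflict."""
    src, wd = normalize(source_value), normalize(wikidata_value)
    if not wd:
        return "empty"      # Wikidata has no value yet: a gap, not an error
    if not src:
        return "unchecked"  # nothing in the source data to compare against
    return "okay" if src == wd else "conflict"


# Made-up pairs of (source value, Wikidata value).
sample_pairs = [("1001", "1001"), ("1002", ""), ("ExampleMEP", "examplemep"), ("1003", None)]
tally = Counter(classify(src, wd) for src, wd in sample_pairs)
```

Rows classified as "empty" correspond to the false results described above: nothing is wrong, but the value still has to be added and checked another way.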
find the gaps via SPARQL and OpenRefine
Description | Screencast |
---|---|
* thanks to Lucas Werkmeister for his uncomplicated support * result: all items with the value “Member of the European Parliament” * 7121 items * downloaded the data and made a new OpenRefine project |
|
* 7121 Members of the European Parliament * 744 should have the “parliamentary term” “Ninth European Parliament” * actual: 110 |
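
The query used in the talk is not reproduced here. As a rough sketch of what such a query could look like, the snippet below builds a SPARQL query for all items with a "position held" statement and an optional "parliamentary term" qualifier. The identifiers P39, P2937 and Q27169 are assumptions on my side and should be checked on Wikidata before use; the function names are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"
POSITION_HELD = "P39"          # assumed property: position held
PARLIAMENTARY_TERM = "P2937"   # assumed qualifier: parliamentary term
MEP_POSITION = "Q27169"        # assumed item: member of the European Parliament


def build_mep_query(position_qid=MEP_POSITION):
    """Build a query listing position statements with their optional term qualifier."""
    return f"""
SELECT ?item ?itemLabel ?term ?termLabel WHERE {{
  ?item p:{POSITION_HELD} ?statement .
  ?statement ps:{POSITION_HELD} wd:{position_qid} .
  OPTIONAL {{ ?statement pq:{PARLIAMENTARY_TERM} ?term . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def build_request_url(query):
    """Return a GET URL for the query service that asks for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


mep_query = build_mep_query()
request_url = build_request_url(mep_query)
```

Because the term qualifier is optional, statements without a parliamentary term come back with an unbound `?term`, which is the kind of gap counted above.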
connect the projects and enrich data
Description | Screencast |
---|---|
Combining the data from the eu.zip with the Wikidata values creates a very powerful worksheet: * you can find patterns * you can clean the data and upload the changes in different steps * you can recheck the results for every step * you can combine new data * you can interrupt and finish later |
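
Connecting the two projects is essentially a join on a shared key. The sketch below shows the idea in plain Python rather than in OpenRefine; the key name `mep_id`, the `wd_` prefix and the sample rows are hypothetical.

```python
def enrich(ep_rows, wikidata_rows, key="mep_id"):
    """Left-join the EP rows with the Wikidata rows on a shared key."""
    by_key = {row[key]: row for row in wikidata_rows if row.get(key)}
    merged = []
    for row in ep_rows:
        match = by_key.get(row[key], {})
        combined = dict(row)
        for column, value in match.items():
            if column != key:
                combined["wd_" + column] = value  # prefix marks Wikidata columns
        combined["has_wikidata_match"] = bool(match)
        merged.append(combined)
    return merged


# Made-up rows for illustration only.
ep_rows = [{"mep_id": "1", "name": "Alice Example"}, {"mep_id": "2", "name": "Bob Example"}]
wikidata_rows = [{"mep_id": "1", "item": "Q0000001", "term": ""}]
worksheet = enrich(ep_rows, wikidata_rows)
```

Every EP row is kept, so rows without a match stay visible as gaps and can be worked on in a later step.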
data processing is a circle
Description | Screenshot |
---|---|
* finding patterns and errors, reconciling the next columns, preparing the Wikidata schema, uploading the data, rechecking the uploaded data - many steps for a workshop * data processing is a circle: with every repaired data cluster you will find new errors and have to find the new pattern * here: the start date /o\ |
result
Description | Screenshot |
---|---|
* 751 members of the Ninth European Parliament: * 749 have the start date 2 July 2019 * 4 have an end date (have retired) * 2 have a start date later than 2 July 2019 * result: 747 active members of the Ninth European Parliament |
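
The final count follows from the start and end dates: 751 members minus the 4 with an end date gives 747 active members. A small sketch of that tally on synthetic records that only reproduce the counts above; the later start date, the end date and which members carry an end date are placeholders, not real data.

```python
from datetime import date

TERM_START = date(2019, 7, 2)


def summarize(members, term_start=TERM_START):
    """Tally members by start date and end date; active means no end date."""
    ended = sum(1 for m in members if m["end"] is not None)
    return {
        "total": len(members),
        "started_on_term_start": sum(1 for m in members if m["start"] == term_start),
        "started_later": sum(1 for m in members if m["start"] > term_start),
        "ended": ended,
        "active": len(members) - ended,
    }


# Synthetic records; the dates other than TERM_START are placeholders.
members = (
    [{"start": TERM_START, "end": None} for _ in range(745)]
    + [{"start": TERM_START, "end": date(2019, 12, 31)} for _ in range(4)]
    + [{"start": date(2019, 9, 1), "end": None} for _ in range(2)]
)
summary = summarize(members)
```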