
User:A ka es/OpenRefine/wikimania2019 lightningtalk quality

From Meta, a Wikimedia project coordination wiki

Lightning Talk at #wikimania2019

Description Slides
These are the slides as a .pdf file; screencasts cannot be played in this format.
Slides

explore

Description Screencast
* 65 files, 1 .html, 1 .txt, 63 .csv
* .csv file names: 1 single file (EP info) and 3 groups of files (committees, groups, countries)
explore
* reading the .txt and .html files to get the meta information

* the data in the files seem to be consistent; the .html file contains links to other sources
* I will create 4 projects to compare them
explore
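A first inventory like the one above can also be scripted. The following is a minimal sketch that counts file names per extension; the file names are hypothetical placeholders, not the names in the actual archive.

```python
from collections import Counter
from pathlib import PurePath


def count_by_extension(names):
    """Count file names per lowercase extension (e.g. '.csv')."""
    return Counter(PurePath(name).suffix.lower() for name in names)


# Hypothetical file names; the real archive holds 65 files.
sample = [
    "readme.txt",
    "index.html",
    "ep_info.csv",
    "committees_a.csv",
    "countries_de.csv",
]
counts = count_by_extension(sample)
print(counts)
```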

compare and analyze

Description Screencast
* compare the data
* analyze the data
* why are there so many more data rows in the "committees" file?
*- does one file have more data for an MEP than the others?
* the committees file contains 745 MEPs; some of them are in more than one committee
compare and analyze
* the columns are identical => I will work with the EP info file and keep the others to fill gaps if needed
compare and analyze
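The question of why one file has more rows than MEPs can be checked by counting rows per person. This is a sketch only, assuming a hypothetical column layout (`mep_id`, `name`, `committee`) and made-up rows; the real column names may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a committees file: one row per membership.
COMMITTEES_CSV = """mep_id,name,committee
1,Ada Example,AFET
1,Ada Example,BUDG
2,Ben Sample,ENVI
"""


def rows_per_key(csv_text, key):
    """Return how many rows each value of the `key` column has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[key] for row in reader)


per_mep = rows_per_key(COMMITTEES_CSV, "mep_id")
# IDs that occur on more than one row, i.e. MEPs in several committees.
multiple = sorted(k for k, n in per_mep.items() if n > 1)
print(len(per_mep), "distinct MEPs in", sum(per_mep.values()), "rows")
print("in more than one committee:", multiple)
```

If the number of distinct IDs is lower than the number of rows, the extra rows come from repeated memberships rather than from additional people.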

match with wikidata

Description Screencast
* try the options: mark 3 data rows and test
* reconciling against Q5 (human) works; I split the data rows into groups to keep control over the reconciliation process => 561 items are matched, 186 are not
* looking up the unmatched names in different Wikipedia language versions to find the Wikidata item; matching manually
match with wikidata
* 743 matched
* 4 without a Wikipedia article and without a Wikidata item
match with wikidata
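Splitting the rows into groups before reconciling can be pictured as follows. OpenRefine does this through its user interface; the sketch below only builds batches of queries in the shape used by the reconciliation service API (`q0`, `q1`, ... with `query` and `type`), without sending anything. The names, the batch size and the `limit` value are assumptions for illustration.

```python
import json


def build_batches(names, type_id="Q5", batch_size=2):
    """Split names into batches of reconciliation queries keyed q0, q1, ..."""
    batches = []
    for start in range(0, len(names), batch_size):
        chunk = names[start:start + batch_size]
        batches.append({
            f"q{i}": {"query": name, "type": type_id, "limit": 3}
            for i, name in enumerate(chunk)
        })
    return batches


names = ["Ada Example", "Ben Sample", "Cleo Placeholder"]  # hypothetical names
batches = build_batches(names)
print(json.dumps(batches[0], indent=2))
```

Small batches make it easier to review each group of matches before moving on to the next one.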

recheck with wikidata

Description Screencast
* via MEP ID
* via Twitter username (both are Wikidata identifiers)
* 357 items are true - I am sure the matches are okay => mark them as “okay”
* 390 items are false; next check: all of them are empty, there are no wrong entries => more checks are needed
recheck
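The recheck distinguishes three cases: the identifier on Wikidata equals the one in the source file, the identifier on Wikidata is empty, or the two differ. A minimal sketch of that comparison, with hypothetical IDs:

```python
def check_match(source_id, wikidata_id):
    """Compare an identifier from the source file with the one on the matched item."""
    if not wikidata_id:
        return "empty"      # nothing stored on Wikidata: no evidence either way
    if source_id == wikidata_id:
        return "okay"       # identifiers agree: the match is confirmed
    return "conflict"       # identifiers differ: the match needs a manual look


# Hypothetical (source MEP ID, MEP ID found on Wikidata) pairs.
rows = [("1001", "1001"), ("1002", ""), ("1003", "9999")]
results = [check_match(source, wikidata) for source, wikidata in rows]
print(results)
```

In the talk's data the "false" group contained only empty values and no conflicts, which is why more checks were needed rather than corrections.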

find the gaps via SPARQL and OpenRefine

Description Screencast
* thanks to Lucas Werkmeister for his uncomplicated support
* result: all items with the value “Member of the European Parliament”
* 7121 items
* downloaded the data and made a new OpenRefine project
find the gaps
* 7121 Members of the European Parliament
* 744 should have the “parliamentary term” “Ninth European Parliament”
* actual: 110
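The query used in the talk is not reproduced here. The following sketch shows what such a query and the gap check could look like. The identifiers are assumptions that should be verified on Wikidata before use: P39 (position held), Q27169 (member of the European Parliament), P2937 (parliamentary term) and Q64038205 (Ninth European Parliament). The result rows are made up.

```python
# Assumed identifiers, see the note above; verify them before running the query.
QUERY = """
SELECT ?item ?itemLabel ?term WHERE {
  ?item p:P39 ?statement .
  ?statement ps:P39 wd:Q27169 .
  OPTIONAL { ?statement pq:P2937 ?term . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

NINTH_TERM = "Q64038205"  # assumed item ID of the Ninth European Parliament


def missing_term(rows, expected_items, term=NINTH_TERM):
    """Return expected items that have no statement with the given term."""
    have_term = {item for item, row_term in rows if row_term == term}
    return sorted(set(expected_items) - have_term)


# Hypothetical downloaded rows as (item, parliamentary term or None).
rows = [("Q1", NINTH_TERM), ("Q2", None), ("Q3", "Q999")]
gaps = missing_term(rows, expected_items=["Q1", "Q2", "Q3"])
print(gaps)
```

The list of gaps corresponds to the difference between the items that should carry the term and those that actually do.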

connect the projects and enrich data

Description Screencast
combining the data from the eu.zip with the Wikidata values creates a very powerful worksheet:
* you can find patterns
* you can clean the data and upload the changes in different steps
* you can recheck the results for every step
* you can combine new data
* you can interrupt and finish later
connect
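Connecting two projects amounts to a join on a shared key. In OpenRefine this is done inside the tool; the sketch below shows the same idea in plain Python, assuming a hypothetical shared column `mep_id` and made-up rows. Values from the source file are kept, and only missing columns are added from the Wikidata side.

```python
def enrich(source_rows, wikidata_rows, key="mep_id"):
    """Add columns from wikidata_rows to source_rows, joined on `key`."""
    by_key = {row[key]: row for row in wikidata_rows}
    merged = []
    for row in source_rows:
        extra = by_key.get(row[key], {})
        # Later entries win, so source values override Wikidata values.
        merged.append({**extra, **row})
    return merged


# Hypothetical rows from the two projects.
source = [{"mep_id": "1", "name": "Ada Example"}, {"mep_id": "2", "name": "Ben Sample"}]
wikidata = [{"mep_id": "1", "item": "Q1", "name": "A. Example"}]
merged = enrich(source, wikidata)
print(merged)
```

Rows without a counterpart stay unchanged, which makes the remaining gaps easy to spot.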

data processing is a circle

Description Screenshot
* finding patterns and errors, reconciling the next columns, preparing the Wikidata schema, uploading the data, rechecking the uploaded data - many steps for a workshop

data processing is a circle: with every repaired data cluster you will find new errors and have to look for the next pattern
* here: the start date /o\
File:09 circle screenshot slides.png
data cleaning

result

Description Screenshot
* 751 members of the Ninth European Parliament:
* 749 have the start date 2 July 2019
* 4 have an end date (are retired)
* 2 have a start date later than 2 July 2019
* result: 747 active members of the Ninth European Parliament
result
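The counts above can be derived from start and end dates per member. A minimal sketch with hypothetical records (not the real data), followed by a check that the reported totals are consistent with each other:

```python
from datetime import date

TERM_START = date(2019, 7, 2)


def summarize(members):
    """Count members by start date and end date; `end` is None while in office."""
    return {
        "total": len(members),
        "start_on_term_start": sum(m["start"] == TERM_START for m in members),
        "start_later": sum(m["start"] > TERM_START for m in members),
        "ended": sum(m["end"] is not None for m in members),
        "active": sum(m["end"] is None for m in members),
    }


# Hypothetical records, not the real data.
sample = [
    {"start": date(2019, 7, 2), "end": None},
    {"start": date(2019, 7, 2), "end": date(2019, 7, 3)},
    {"start": date(2019, 7, 4), "end": None},
]
summary = summarize(sample)
print(summary)

# The counts reported in the result list add up.
assert 749 + 2 == 751
assert 751 - 4 == 747
```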