User:A ka es/OpenRefine/wikimania2019 lightningtalk quality

Lightning Talk at #wikimania2019

Description	Slides
These are the slides as a .pdf file; the screencasts cannot be played in this format.	Slides

Description	Screencast
* 65 files: 1 .html, 1 .txt, 63 .csv * .csv file names: 1 single file (EP info) and 3 groups (committees, groups, countries)	explore
* reading the .txt and .html files to get the meta information * the data in the files seem to be congruent; the .html file contains links to other sources * I will create 4 projects to compare them	explore
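
The file inventory described above can also be taken outside OpenRefine. The following is a minimal sketch, assuming the data arrives as a zip archive; the function names and the sample file names are hypothetical and only show the shape of the result.

```python
from collections import Counter
from pathlib import PurePosixPath
import zipfile


def count_by_extension(names):
    """Count file names by lowercase extension, skipping directory entries."""
    return Counter(
        PurePosixPath(name).suffix.lower()
        for name in names
        if not name.endswith("/")
    )


def names_in_zip(path):
    """Return the member names of a zip archive (path is supplied by the caller)."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Hypothetical file names, not the real contents of the archive.
sample = ["readme.txt", "index.html", "ep_info.csv", "committees/a.csv", "committees/"]
counts = count_by_extension(sample)
print(dict(counts))
```

With a real archive, `count_by_extension(names_in_zip("eu.zip"))` would give the per-extension totals to compare against the numbers on the slide.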

Description	Screencast
* compare the data * analyze the data * why are there so many more data rows in the "committees" file? Does one file have more data per MEP than the others? The committees file contains 745 MEPs; some of them are in more than one committee	compare and analyze
* the columns are the same => I will work with the EP info file and keep the others to fill gaps if needed	compare and analyze
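
The row-count question above comes down to comparing the number of rows with the number of distinct MEPs. A small sketch of that check, assuming each row carries an identifier column; the column name `mep_id` and the sample rows are hypothetical.

```python
import csv
import io


def rows_and_distinct(csv_text, key):
    """Return (number of data rows, number of distinct values in column `key`)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    distinct = {row[key] for row in rows}
    return len(rows), len(distinct)


# Hypothetical data: MEP 1 sits on two committees, so it appears twice.
sample = "mep_id,committee\n1,AFET\n1,BUDG\n2,ENVI\n"
row_count, mep_count = rows_and_distinct(sample, "mep_id")
print(row_count, mep_count)
```

If the row count is larger than the distinct count, the extra rows are repeated MEPs rather than additional people.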

Description	Screencast
* try the options: mark 3 data rows and test * reconciling against Q5 (human) works; I split the data rows into groups to keep control of the reconciliation process => 561 items are matched, 186 are not * looking up the unmatched names in different Wikipedia language versions to find the Wikidata item; matching manually	match with wikidata
* 743 matched * 4 without a Wikipedia article and without a Wikidata item	match with wikidata

Description	Screencast
* via MEP ID * via Twitter username (both are Wikidata identifiers)	recheck
* 357 items are true: I am sure the matches are okay => mark them as “okay” * 390 items are false; next check: all of them are empty, there are no wrong entries => more checks are needed	recheck
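
The recheck separates three cases: the identifier in the source file equals the one on Wikidata, the Wikidata value is empty, or the two values differ. A minimal sketch of that classification, assuming both values are available as strings; the function name, the case-insensitive comparison, and the sample pairs are assumptions for illustration.

```python
from collections import Counter


def classify(source_value, wikidata_value):
    """Return 'empty', 'match', or 'conflict' for one identifier pair."""
    if wikidata_value is None or not str(wikidata_value).strip():
        return "empty"  # nothing on Wikidata yet: a gap, not an error
    same = (
        str(source_value).strip().casefold()
        == str(wikidata_value).strip().casefold()
    )
    return "match" if same else "conflict"


# Hypothetical identifier pairs: (value in source file, value on Wikidata).
pairs = [
    ("12345", "12345"),
    ("SomeMEP", "somemep"),
    ("67890", ""),
    ("11111", "22222"),
]
tally = Counter(classify(source, wikidata) for source, wikidata in pairs)
print(dict(tally))
```

Rows classified as "empty" confirm nothing either way, which is why they call for further checks rather than being treated as wrong matches.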

Description	Screencast
* thanks to Lucas Werkmeister for his straightforward support * result: all items with the value “Member of the European Parliament” * 7121 items * downloaded the data and made a new OpenRefine project	find the gaps
* 7121 Members of the European Parliament * 744 should have the “parliamentary term” “Ninth European Parliament” * actual: 110
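
A list like the one above can be requested from the Wikidata Query Service. The sketch below only builds the query text and the request URL; it is not the query used for the talk. It assumes that P39 is "position held", P2937 is "parliamentary term", and Q27169 is "member of the European Parliament"; these identifiers should be verified on Wikidata before use, and the function names are hypothetical.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_mep_query(position_qid="Q27169"):
    """Build a SPARQL query for items holding the given position (assumed QID)."""
    if not re.fullmatch(r"Q[1-9]\d*", position_qid):
        raise ValueError(f"not a Wikidata item ID: {position_qid!r}")
    # One result row per position statement; the term qualifier is optional,
    # so statements without a parliamentary term still appear.
    return f"""
SELECT ?item ?itemLabel ?term WHERE {{
  ?item p:P39 ?statement .
  ?statement ps:P39 wd:{position_qid} .
  OPTIONAL {{ ?statement pq:P2937 ?term . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_mep_query()
url = build_request_url(query)
print(query)
```

Because the query returns one row per statement, a person with several terms appears several times; rows with an empty `?term` are the gaps to fill.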

Description	Screencast
combining the data from eu.zip with the Wikidata values creates a very powerful worksheet: * you can find patterns * you can clean the data and upload the changes in separate steps * you can recheck the results after every step * you can add new data * you can interrupt the work and finish later	connect

Description	Screenshot
* finding patterns and errors, reconciling the next columns, preparing the Wikidata schema, uploading the data, rechecking the uploaded data: many steps for a workshop * data processing is a cycle: with every repaired data cluster you will find new errors and have to find the new pattern * here: the start date /o\	File:09 circle screenshot slides.png data cleaning

Description	Screenshot
* 751 members of the Ninth European Parliament: * 749 have a start date of 2 July 2019 * 4 have an end date (they have retired) * 2 have a start date later than 2 July 2019 * result: 747 active members of the Ninth European Parliament	result
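
The final numbers can be checked with a few lines of arithmetic. This sketch assumes that the members with an end date are counted within the 751 and that "active" means "no end date"; the function name is hypothetical.

```python
def active_members(started_on_first_day, started_later, ended):
    """Return (total members, members without an end date)."""
    total = started_on_first_day + started_later
    return total, total - ended


total, active = active_members(started_on_first_day=749, started_later=2, ended=4)
print(total, active)  # prints "751 747"
```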