Grants:Project/Wikipedia Cultural Diversity Observatory (WCDO)/Timeline
This project is funded by a Project Grant
Timeline for Wikipedia Cultural Diversity Observatory (WCDO)
Timeline | Date
Selection of Cultural Context Content (CCC) | 28 February 2018
Publish the Midpoint Report | 24 May 2018
Create the site for the "Wikipedia Cultural Diversity Observatory" | 15 June 2018
Disseminate the observatory in academia and community engagement | 30 June 2018
Publish the Final Report | 30 September 2018
Monthly updates
January 2018
- We created a Wikipedia_language_territories_mapping_quality.csv file with the language-territories mapping (territories where the language is spoken today because it is official or indigenous), including ISO 3166 and ISO 3166-2 codes, territory names in their corresponding languages and in English, and demonyms, among other fields. Task finished.
- We presented the project at the pre-hackathon and received valuable advice from some community members and data experts. Task finished.
- We revised the code used to obtain geolocated articles and those with specific keywords on title. Task finished.
- We studied different new ideas to complement the strategies for creating the CCC datasets, so that the selected articles actually relate to the territories where each language is spoken. Task in progress.
- We are working on using Wikidata as a source for article gathering. Currently, we are debating whether it is better to use the SPARQL query language or to process the dumps (this is taking longer than expected); see the sketch after this list. Task in progress.
- We created a GitHub account (wcdo) and a tool account on Toolforge (wcdo), and the project complies with the requirements of the 'right to fork' policy. Task finished.
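To illustrate the SPARQL side of the debate above, here is a minimal sketch that queries the Wikidata Query Service with the requests library. The property and item used (P17 = country, Q228 = Andorra) and the Catalan Wikipedia sitelink filter are illustrative assumptions, not the project's final query.

import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: items whose country (wdt:P17) is Andorra (wd:Q228),
# together with their articles on the Catalan Wikipedia.
QUERY = """
SELECT ?item ?article WHERE {
  ?item wdt:P17 wd:Q228 .
  ?article schema:about ?item ;
           schema:isPartOf <https://ca.wikipedia.org/> .
}
LIMIT 100
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "WCDO-sketch/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["article"]["value"])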
February 2018
- We evaluated different web frameworks for creating the website and decided on 'nikola'.
- We evaluated different visualization frameworks for visualizing some parts of the CCC datasets and decided on 'bokeh'.
- We gave a dataset of articles geolocated in the Catalan-speaking regions to the Catalan Wikipedia editors to help them find articles for the new edition of the Catalan Challenge, in which they translate articles into different languages.
- We revised the language-territories equivalences file generated in the previous month and fixed some errors.
- We coded the first three strategies to obtain CCC articles (category crawling, keywords on title, geolocated) and continued evaluating the possibilities of using Wikidata properties; a sketch of the keyword strategy follows this list.
- We started writing the CCC datasets methodology for a future academic paper.
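As a hypothetical illustration of the "keywords on title" strategy, the sketch below uses the MediaWiki search API with the intitle: search operator. The wiki and the keyword list are example values, not the project's actual configuration.

import requests

API = "https://ca.wikipedia.org/w/api.php"   # example wiki
KEYWORDS = ["Catalunya", "català"]           # illustrative keywords

titles = set()
for keyword in KEYWORDS:
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'intitle:"{keyword}"',
        "srlimit": 50,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    # Collect candidate CCC articles whose title matches the keyword.
    titles.update(hit["title"] for hit in data["query"]["search"])

print(len(titles), "candidate CCC articles by title keyword")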
March 2018
- We obtained all the necessary data and developed the filtering algorithms to create the CCC datasets. The results were not satisfactory, as some interference appeared (articles that should not have been selected ended up in the dataset).
- We implemented the first visualizations using 'bokeh' (see the sketch after this list).
- We presented at WikiIndaba. The talk (pdf) is about African languages (their statistics and current situation) and the potential of WCDO to help them spread their content across languages.
- We gathered feedback at WikiIndaba from community leaders and WMF staff members about how communities function, in order to learn how to disseminate the project better.
- We had some problems with the technology we used, and moved to a VPS after exhausting all the other technological alternatives.
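For reference, a minimal Bokeh sketch in the spirit of these first visualizations: a bar chart of CCC sizes per language edition, written to a static HTML file. All numbers are invented placeholders.

from bokeh.plotting import figure, output_file, save

languages = ["ca", "ro", "sw", "yo"]        # example language editions
ccc_articles = [70000, 35000, 4000, 1500]   # placeholder CCC sizes

p = figure(x_range=languages, title="CCC articles per language edition")
p.vbar(x=languages, top=ccc_articles, width=0.8)
p.yaxis.axis_label = "Articles in CCC"

output_file("ccc_extent.html")  # a static HTML file, viewable offline
save(p)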
April 2018
- We optimized the code, considering that the task needs to be automated for all Wikipedias (e.g. we solved some key blocking issues derived from the worst-case scenario, the English Wikipedia).
- We ran a series of manual assessments of the CCC (Cultural Context Content) article selection with two people. The results were not as satisfying as expected (false positives).
- We implemented a machine learning classifier (random forest) to select the final CCC datasets, discarding the previous idea of using a threshold based on inlinks (the approach I used for my PhD thesis); see the sketch after this list.
- We coded the algorithm to select lists of top-priority articles to be translated across languages.
- We revised the language-territories equivalences file generated in the previous month and fixed some errors.
- We continued writing the CCC datasets methodology, state of the art and first results for an academic paper that will be sent to a journal on Digital Humanities.
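The sketch below shows the shape of this classification step with scikit-learn's RandomForestClassifier. The three features and the tiny training set are invented placeholders for the real signals each candidate article carries (geolocation, title keywords, share of inlinks from CCC, and so on).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per candidate article; the columns are made-up features such
# as "geolocated in the territory", "keyword on title" and "fraction of
# inlinks coming from already-selected CCC articles".
X_train = np.array([[1, 1, 0.8], [0, 1, 0.4], [0, 0, 0.1], [1, 0, 0.6]])
y_train = np.array([1, 1, 0, 1])  # 1 = belongs to CCC, 0 = interference

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify new candidates instead of applying a fixed inlink threshold.
X_candidates = np.array([[1, 0, 0.7], [0, 0, 0.2]])
print(clf.predict(X_candidates))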
May 2018
- We adjusted the machine learning classifier (Random Forest, from scikit-learn), using negative sampling to train it.
- We discarded creating an external website and instead implemented Pywikibot in the main script to publish the results on Meta (see the sketch after this list).
- We attended the 2018 Wikimedia Hackathon to present the project to Wikimedians and to assess the quality of the content selection in several languages.
- We created some recommendation lists based on different relevance features (number of editors, number of pageviews, etc.).
- We provided some results for the ESEAP Conference (2018) so Liang Shangkuan could disseminate the project.
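A hedged sketch of the Pywikibot route: rendering a results table as wikitext and saving it to a page on Meta. The page title and table contents are hypothetical examples, not the project's actual pages.

import pywikibot

site = pywikibot.Site("meta", "meta")
# Hypothetical page title, used only for this example.
page = pywikibot.Page(site, "User:WCDO/sandbox/CCC extent")

rows = [("ca", 70000), ("ro", 35000)]  # placeholder (language, CCC size)
wikitext = '{| class="wikitable"\n! Language !! CCC articles\n'
for lang, count in rows:
    wikitext += f"|-\n| {lang} || {count}\n"
wikitext += "|}"

page.text = wikitext
page.save(summary="Update CCC extent table (bot)")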
June 2018
- We generated a whole set of CCC Vital Articles lists based on different features and content characteristics (number of editors, number of edits, women, etc.) and their corresponding availability in other language editions, and made them available as static HTML files (e.g. Romanian CCC lists); see the sketch at the end of this month's updates.
- We generated the main tables (CCC Extent, Culture Gap (Coverage and Spread), among others) both as HTML files and as wikitext.
- We verified the automation and timings of the CCC dataset generation process triggered by cron.
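To give an idea of how such a list can be produced, here is a sketch, with invented data and column names, that ranks a CCC dataset by one feature and writes it out as a static HTML file; a cron entry can then regenerate the files periodically.

import pandas as pd

# Invented sample of a CCC dataset with a few ranking features.
ccc = pd.DataFrame({
    "title": ["Barcelona", "Sagrada Família", "Pa amb tomàquet"],
    "editors": [1200, 450, 60],
    "pageviews": [90000, 40000, 3000],
    "available_in": [120, 60, 8],   # number of language editions
})

top_by_editors = ccc.sort_values("editors", ascending=False).head(100)
top_by_editors.to_html("ccc_vital_by_editors.html", index=False)

# An example crontab entry to regenerate the lists nightly:
# 0 4 * * * python3 generate_ccc_lists.py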
July 2018
- We presented a lightning talk and a poster at Wikimania to explain how the WCDO is structured and what kind of data editors can find, and, at the same time, to reach future users (e.g. WMCEE Spring).
- We collected editor feedback on the CCC Vital Articles lists, which mainly included requests for generating lists based on countries too, among other ideas for new lists.
- We corrected the language-territories database with editor input from the communities (now uploaded to Meta in this gigantic table).
August 2018
- We adjusted the dataset-generation machine learning model with new features in order to tackle some interfering articles in CCC. These negative features relate to CCC from other language editions.
- We redesigned the code architecture for the statistics towards a better abstraction based on "intersections" between groups of articles, in order to make it more robust and to allow more, and simpler, calculations (see the sketch after this list).
- We created most of the pages on Meta for the WCDO project with the main statistics and datasets, and automated the bot that updates the tables on Meta.
- We considered creating a simple website using Flask, after discarding the previous solution, which implied uploading all the article lists to Meta and keeping static HTML files on the server, for scalability reasons.
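A minimal sketch of this "intersections" abstraction, with invented data: every statistic becomes the size (or share) of an intersection between two sets of article identifiers, which keeps all the calculations uniform.

# Each group of articles is a set of page IDs (values invented).
ccc_ca = {1, 2, 3, 4, 5}        # CCC of one language edition
geolocated = {2, 3, 9}          # articles geolocated in its territories
women_biographies = {3, 4, 7}   # articles about women

def intersection_share(group_a: set, group_b: set) -> float:
    """Share of group_a that also belongs to group_b."""
    return len(group_a & group_b) / len(group_a) if group_a else 0.0

print(intersection_share(ccc_ca, geolocated))         # coverage-style figure
print(intersection_share(ccc_ca, women_biographies))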
Is your final report due but you need more time?
Extension request
New end date
30.09.2018
Rationale
As I have pointed out in the midpoint report, I have dedicated a lot of effort to perfecting the method for selecting the Cultural Context Content (CCC) in every language edition. Some tasks were not foreseen at the beginning: a) the use of machine learning, b) the preparation of a language-territories mapping database, and c) the need to manage certain resource bottlenecks. Other tasks, such as creating lists of women/men based on the cultural context content, were not planned either, but I saw they could be useful for gender gap projects. Likewise, I had to correct my calendar to include new events and actions: I had the chance to disseminate the project both at Wikimedia community events and in academia (I published a paper in an indexed Open Access journal).
However, I consider that in order to automate the data visualizations on Meta, adjust them with user feedback and carry out the extensive dissemination I wanted to do with the website running (I am thinking about Wikimedia CEE Spring, WikiArabia, etc.), I would need two extra months. At the same time, these months would be useful for documentation. I am very committed to explaining the project methodology in an accessible way in PowerPoint slides, in the code documentation and on the project pages on Meta, so that other researchers and editors can contribute to it. Please do not hesitate to ask me for further detail on how I would spend these two extra months or about any other aspect of the project. I assume that, if this extension is accepted, the final report would be required by the end of October.