Research talk:Expanding Wikipedia articles across languages/Work log/2018-01-03
Wednesday, January 3, 2018
We are interested in applying what we have learned so far to languages other than English. We aim to track this work here.
Cleaning up the category network
Cleaning the category network in other languages with the methodology we developed is possible, but it is costly. Recently, Amit et al. have developed a new way of cleaning up Wikipedia's category network in 280 languages. Instead of spending time cleaning up the category network in a new language, we decided to use their cleaned category network (almost out of the box) and assess the quality of the resulting recommendations. We decided to focus on French Wikipedia for this phase, given the ongoing conversations with the Ma Commune folks and our wish to help them expand the article types for which their tool can recommend sections.
Results
Preliminary Dataset
https://drive.google.com/file/d/1dYKNBXk-l_FfVdca9Kk7uzNJNrdQVVH6/view?usp=sharing
JSON format:
One category per line: {"category": <category_name>, "recs": [{"title": <section_title>, "relevance": <relevance_score>},...]}
Example:
{"category":"Catégorie:Ville_de_Souss-Massa-Drâa","recs":[{"relevance":0.3333333333333333,"title":"Notes et références"},{"relevance":0.3333333333333333,"title":"Voir aussi"},{"relevance":0.2222222222222222,"title":"Démographie"},{"relevance":0.2222222222222222,"title":"Économie"},{"relevance":0.1111111111111111,"title":"Infrastructures"},{"relevance":0.1111111111111111,"title":"Culture"},{"relevance":0.1111111111111111,"title":"Population"},{"relevance":0.1111111111111111,"title":"Manifestations"},{"relevance":0.1111111111111111,"title":"Vue d'ensemble"},{"relevance":0.1111111111111111,"title":"Climat"}]}
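A minimal sketch of how a file in this format could be read, assuming it is plain UTF-8 text with one JSON object per line as described above. The function name load_recommendations and the shortened sample line are illustrative and not part of the dataset or its tooling.

```python
import json


def load_recommendations(lines):
    """Map each category name to its list of {"title", "relevance"} dicts.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    recs_by_category = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        recs_by_category[record["category"]] = record["recs"]
    return recs_by_category


# Shortened version of the example record above, used only for illustration.
sample_lines = [
    '{"category":"Catégorie:Ville_de_Souss-Massa-Drâa","recs":['
    '{"relevance":0.3333333333333333,"title":"Notes et références"},'
    '{"relevance":0.2222222222222222,"title":"Démographie"}]}'
]
recs = load_recommendations(sample_lines)
```

With a downloaded copy of the dataset, the same function could be called on a file opened with encoding="utf-8" (the local file name is up to the reader).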
Basic Usage
The simplest approach is:
- take all the categories of the target article
- merge the recommendations by summing the relevance scores of the shared titles
- sort by score (desc) and show the top K
- optional: filter out the very common sections (Notes et références, Voir aussi, ...).
The method works better when the article belongs to more than one category, because summing the scores promotes the sections shared by several of its categories. In the future, we will apply learning-to-rank (Learning2Rank) techniques to give different weights to different categories.
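The basic usage steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce the results: the function name recommend_sections, the set COMMON_SECTIONS, and the two toy categories with their scores are all hypothetical. The input mapping has the same shape as the dataset records (category name to a list of title/relevance pairs).

```python
from collections import defaultdict

# Very common sections that a caller may want to drop (assumed list).
COMMON_SECTIONS = frozenset({"Notes et références", "Voir aussi"})


def recommend_sections(article_categories, recs_by_category, k=5, exclude=frozenset()):
    """Sum relevance scores of shared titles across categories, return the top k."""
    scores = defaultdict(float)
    for category in article_categories:
        # Categories missing from the dataset contribute nothing.
        for rec in recs_by_category.get(category, []):
            scores[rec["title"]] += rec["relevance"]
    # Sort by score (descending); ties are broken alphabetically by title.
    ranked = sorted(
        ((title, score) for title, score in scores.items() if title not in exclude),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:k]


# Hypothetical toy data, not taken from the dataset.
toy_recs = {
    "Catégorie:A": [
        {"title": "Voir aussi", "relevance": 0.5},
        {"title": "Démographie", "relevance": 0.25},
        {"title": "Climat", "relevance": 0.125},
    ],
    "Catégorie:B": [
        {"title": "Voir aussi", "relevance": 0.5},
        {"title": "Démographie", "relevance": 0.25},
        {"title": "Économie", "relevance": 0.25},
    ],
}
top = recommend_sections(
    ["Catégorie:A", "Catégorie:B"], toy_recs, k=3, exclude=COMMON_SECTIONS
)
for title, score in top:
    print(f"{score:.3f}  {title}")
```

In this toy example "Démographie" appears in both categories, so its summed score (0.5) ranks it above sections that appear in only one category, which is the effect described above.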