Research talk:Automated classification of article importance/Work log/2017-05-07

From Meta, a Wikimedia project coordination wiki

Sunday, May 7, 2017

Today I aim to complete training and evaluation of models for the four other candidate WikiProjects: Judaism, Politics, NFL, and Africa.

WikiProject Judaism

The distribution of articles per importance rating for WikiProject Judaism is as follows:

| Rating | N articles |
|---|---|
| Top | 234 |
| High | 498 |
| Mid | 1,604 |
| Low | 4,010 |

The size of our training set is limited by the number of High-importance articles, unless we decide to generate synthetic samples for those articles as well. So far I have chosen not to, as I see that as potentially introducing more noise into an already noisy dataset. In the initial stage I withheld 50 randomly sampled articles from each class for a test set, and then combined 180 Top-importance articles with another 180 synthetic samples, giving 360 articles per class and a training set of 1,440 articles.
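The arithmetic behind these set sizes can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the integer article IDs, the seed, and the helper names `hold_out` and `training_set_size` are assumptions made for the example.

```python
import random

# Article counts per importance rating for WikiProject Judaism (from the table above).
CLASS_COUNTS = {"Top": 234, "High": 498, "Mid": 1604, "Low": 4010}

def hold_out(article_ids, n_test, rng):
    """Randomly split one class into (test, remaining) lists."""
    shuffled = list(article_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:]

def training_set_size(n_top_real, n_classes=4):
    """Each real Top article is matched by one synthetic sample, and every
    other class contributes as many articles as the padded Top class."""
    per_class = 2 * n_top_real
    return per_class, per_class * n_classes

rng = random.Random(42)  # arbitrary fixed seed, only for reproducibility
# Hypothetical IDs: the integers 0..N-1 stand in for real page IDs.
splits = {label: hold_out(range(n), 50, rng) for label, n in CLASS_COUNTS.items()}

print(training_set_size(180))  # initial setup: (360, 1440)
print(training_set_size(230))  # larger setup: (460, 1840)
print(len(splits["High"][1]))  # High articles left after the hold-out: 448
```

With 50 articles withheld, 448 High-importance articles remain, which is why 360 per class fits in the initial setup without synthetic High-importance samples.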

Performance on this initial training/test setup was on par with the other projects, so I switched to a larger training set with 230 of the 234 Top-importance articles and another 230 synthetic samples. This gives us a new training set with 1,840 articles, 460 from each class. Using 10-fold cross-validation, a minimum node size of 8 had the lowest error, and we used 1,007 trees for predictions. On the complete dataset, prediction performance was as follows:
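This entry does not describe how the synthetic samples are generated. One common approach is SMOTE-style interpolation between a minority-class sample and one of its nearest neighbours, and the sketch below assumes that approach purely for illustration. The two-feature data, the function name `synthesize`, and the choice of `k=5` are all hypothetical.

```python
import random

def synthesize(samples, n_new, k=5, rng=None):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen sample and one of its k nearest neighbours (Euclidean distance)."""
    if rng is None:
        rng = random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank all other samples by squared distance to the base point.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(s, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

rng = random.Random(0)
# Hypothetical features standing in for, e.g., log views and log inlinks.
top_articles = [(rng.gauss(8.0, 1.0), rng.gauss(5.0, 1.0)) for _ in range(230)]
extra = synthesize(top_articles, 230, rng=rng)
print(len(top_articles) + len(extra))  # 460 samples for the Top class
```

Because every synthetic point lies on a segment between two real articles, the method cannot create feature values outside the observed range, but it can blur class boundaries in a noisy dataset.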

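| Actual \ Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 149 | 53 | 27 | 5 | 63.68% |
| High | 142 | 213 | 94 | 49 | 42.77% |
| Mid | 150 | 373 | 658 | 423 | 41.02% |
| Low | 108 | 389 | 797 | 2,716 | 67.73% |
| Overall | | | | | 58.87% |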
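The accuracy figures can be reproduced from the counts in the confusion matrix. The sketch below assumes that rows hold the ratings assigned by the WikiProject and columns hold the predicted ratings, which is consistent with the row sums matching the distribution table above. The function names are chosen for this example.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Rows: rating assigned by the WikiProject. Columns: predicted rating.
JUDAISM = [
    [149, 53, 27, 5],
    [142, 213, 94, 49],
    [150, 373, 658, 423],
    [108, 389, 797, 2716],
]

def per_class_accuracy(matrix):
    """Share of each row that falls on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Correct predictions divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

accuracies = per_class_accuracy(JUDAISM)
for label, acc in zip(LABELS, accuracies):
    print(f"{label}: {acc:.2%}")
print(f"Overall: {overall_accuracy(JUDAISM):.2%}")
print(f"Unweighted mean of classes: {sum(accuracies) / len(accuracies):.2%}")
```

The 58.87% in the last row of the table is the overall accuracy (3,736 correct out of 6,346 articles). The unweighted mean of the four per-class accuracies is lower, about 53.80%, because the large Low-importance class no longer dominates.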

Performance across WikiProject Judaism is slightly higher than what we saw for WikiProject China. As we've seen before, there is considerable confusion between Top- and High-importance articles, and between Mid- and Low-importance articles as well. However, Low-importance articles are more predictable than in WikiProject China: in WikiProject Judaism we get more than two out of three of them correct. At the same time, our ability to predict Top-importance articles is not as good, partly because the view rates and inlink counts of these articles are more varied than what we have seen previously.

WikiProject National Football League

WikiProject NFL has the following distribution of articles per importance rating:

| Rating | N articles |
|---|---|
| Top | 366 |
| High | 516 |
| Mid | 3,004 |
| Low | 4,426 |

As with WikiProject Judaism, the number of High-importance articles limits the size of our training sets. After building a training and test dataset and finding the performance of the classifier to be promising, I decided to use 250 Top-importance articles for training, with 250 additional synthetic samples, for a total of 2,000 articles across all four classes. Using cross-validation as before, I found a minimum node size of 8 to perform best, and used 4,203 trees for the predictions. With this setup, we get the following confusion matrix:

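| Actual \ Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 335 | 25 | 2 | 4 | 91.53% |
| High | 51 | 372 | 64 | 29 | 72.09% |
| Mid | 89 | 301 | 2,262 | 352 | 75.30% |
| Low | 37 | 244 | 536 | 3,609 | 81.54% |
| Overall | | | | | 79.14% |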

The performance on this dataset is much higher than we have seen previously: accuracy never falls below 70% for any class. There also appears to be less confusion between pairs of classes; instead, the errors for High- and Mid-importance are split fairly evenly between their two neighboring classes. I am curious whether that will affect reception of the model results in any way.
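One way to quantify how the errors are split between neighboring classes is to compare, for High and Mid, the misclassifications into the class above with those into the class below. The sketch assumes rows are the WikiProject's ratings and columns the predictions; `upward_share` is a name made up for this example.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Rows: rating assigned by the WikiProject. Columns: predicted rating.
NFL = [
    [335, 25, 2, 4],
    [51, 372, 64, 29],
    [89, 301, 2262, 352],
    [37, 244, 536, 3609],
]
JUDAISM = [
    [149, 53, 27, 5],
    [142, 213, 94, 49],
    [150, 373, 658, 423],
    [108, 389, 797, 2716],
]

def upward_share(matrix, i):
    """Of the errors in row i that land in an adjacent class, return the
    share that goes to the more important neighbour (column i - 1)."""
    up, down = matrix[i][i - 1], matrix[i][i + 1]
    return up / (up + down)

for name, matrix in (("NFL", NFL), ("Judaism", JUDAISM)):
    for i in (1, 2):  # High and Mid are the only classes with two neighbours
        print(f"{name} {LABELS[i]}: {upward_share(matrix, i):.1%} of "
              f"neighbour errors go to {LABELS[i - 1]}")
```

A share near 50% means the errors are balanced. For WikiProject NFL both High and Mid come out in the mid-40s, whereas for WikiProject Judaism about 60% of the High-importance neighbor errors go to Top.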