Research talk:Automated classification of article importance/Work log/2017-05-07
Sunday, May 7, 2017
Today I aim to complete training and evaluation of models for the four other candidate WikiProjects: Judaism, Politics, NFL, and Africa.
WikiProject Judaism
The distribution of articles per importance rating for WikiProject Judaism is as follows:
Rating | N articles |
---|---|
Top | 234 |
High | 498 |
Mid | 1,604 |
Low | 4,010 |
The limit of our training set size depends on the number of High-importance articles, unless we decide to also generate synthetic samples for those articles. So far I have chosen not to, as I see that as potentially introducing more noise into an already noisy dataset. In the initial stage I withheld 50 randomly sampled articles from each class for a test set, and then combined 180 Top-importance articles with another 180 synthetic samples to get a training set of 1,440 articles.
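The log does not name the method used to generate the synthetic samples; a minimal sketch of one common approach (SMOTE-style interpolation between same-class articles) is below. The feature matrix here is hypothetical stand-in data, not the project's actual article features.

```python
import numpy as np

def synthetic_samples(X, n_new, rng):
    """Generate n_new synthetic rows by interpolating between a randomly
    chosen article and a random same-class partner (SMOTE-style)."""
    base = rng.integers(0, len(X), size=n_new)      # base samples
    partner = rng.integers(0, len(X), size=n_new)   # interpolation partners
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X[base] + gap * (X[partner] - X[base])

rng = np.random.default_rng(42)
# Hypothetical features for the 180 sampled Top-importance articles
# (e.g. normalized view rate and inlink count per article).
top_X = rng.random((180, 2))
augmented = np.vstack([top_X, synthetic_samples(top_X, 180, rng)])
print(augmented.shape)  # (360, 2): 180 real + 180 synthetic Top-importance rows
```

Doing the same for each of the four classes would yield the 1,440-article balanced training set described above.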
Performance on this initial training/test setup was on par with the other projects, so I switched to a larger training set with 230 of the 234 Top-importance articles and another 230 synthetic samples. This gives us a new training set of 1,840 articles, 460 from each class. Using 10-fold cross-validation we found a minimum node size of 8 to have the lowest error, and we used 1,007 trees for predictions. On the complete dataset, prediction performance was as follows:
| Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 149 | 53 | 27 | 5 | 63.68% |
High | 142 | 213 | 94 | 49 | 42.77% |
Mid | 150 | 373 | 658 | 423 | 41.02% |
Low | 108 | 389 | 797 | 2,716 | 67.73% |
Average | | | | | 58.87% |
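The per-class accuracies in the table are recall values (diagonal cell over row sum), and the average is the overall accuracy across all 6,346 articles. This can be verified directly from the confusion matrix:

```python
import numpy as np

# Confusion matrix for WikiProject Judaism (rows: true class, cols: predicted).
labels = ["Top", "High", "Mid", "Low"]
cm = np.array([
    [149,  53,  27,    5],
    [142, 213,  94,   49],
    [150, 373, 658,  423],
    [108, 389, 797, 2716],
])

per_class = np.diag(cm) / cm.sum(axis=1)  # recall per true class
overall = np.diag(cm).sum() / cm.sum()    # micro-averaged accuracy

for name, acc in zip(labels, per_class):
    print(f"{name}: {acc:.2%}")   # Top: 63.68%, High: 42.77%, Mid: 41.02%, Low: 67.73%
print(f"Average: {overall:.2%}")  # Average: 58.87%
```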
Performance on WikiProject Judaism is slightly higher than what we saw for WikiProject China. As we've seen before, there is considerable confusion between Top- and High-importance articles, and between Mid- and Low-importance articles. However, predictability of Low-importance articles is higher than for WikiProject China: in WikiProject Judaism we get more than two out of three of them correct. At the same time, our ability to predict Top-importance articles is not as good, partly because the view rates and inlink counts of these articles are more varied than what we have seen previously.
WikiProject National Football League
WikiProject NFL has the following distribution of articles per importance rating:
Rating | N articles |
---|---|
Top | 366 |
High | 516 |
Mid | 3,004 |
Low | 4,426 |
As for WikiProject Judaism, the number of High-importance articles limits the size of our training sets. After building a training and test dataset and finding the classifier's performance promising, I decided to use 250 Top-importance articles for training, with 250 additional synthetic samples, for a total of 2,000 articles across all four classes. Using cross-validation as before, I found a minimum node size of 8 to perform best, and used 4,203 trees for the predictions. With this setup, we get the following confusion matrix:
| Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 335 | 25 | 2 | 4 | 91.53% |
High | 51 | 372 | 64 | 29 | 72.09% |
Mid | 89 | 301 | 2,262 | 352 | 75.30% |
Low | 37 | 244 | 536 | 3,609 | 81.54% |
Average | | | | | 79.14% |
The performance on this dataset is much higher than we have seen previously; we are never below 70% accuracy for any of the classes. There also appears to be less confusion between pairs of classes; instead, the errors for High- and Mid-importance articles balance out fairly evenly between their two neighboring classes. I am curious whether that will affect reception of the model results in any way.
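The tuning step used throughout (selecting the minimum node size by 10-fold cross-validation) can be sketched as follows. The log does not state the implementation; scikit-learn's min_samples_leaf is used here as a stand-in for the minimum node size knob (R's randomForest calls the analogous parameter nodesize), and the feature matrix is hypothetical stand-in data for the 2,000-article balanced training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 2,000-article training set: two features
# (e.g. log view rate, log inlink count) and four balanced classes whose
# feature means are shifted so the classes are partly separable.
y = np.repeat(np.arange(4), 500)
X = rng.normal(size=(2000, 2)) + y[:, None]

# 10-fold cross-validation over candidate minimum node sizes, as in the log.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 4, 8, 16, 32]},
    cv=10,
)
search.fit(X, y)
print(search.best_params_)  # the node size with the lowest CV error
```

The final model would then be refit on the full training set with the winning node size before predicting across the whole project.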