Research talk:Automated classification of article importance/Work log/2017-05-07
Sunday, May 7, 2017
Today I aim to complete training and evaluation of models for the four other candidate WikiProjects: Judaism, Politics, NFL, and Africa.
WikiProject Judaism
The distribution of articles per importance rating for WikiProject Judaism is as follows:
Rating | N articles |
---|---|
Top | 234 |
High | 498 |
Mid | 1,604 |
Low | 4,010 |
The limit of our training set size depends on the number of High-importance articles, unless we decide to also generate synthetic samples for those articles. So far I have chosen not to, as I see that as potentially introducing more noise into an already noisy dataset. In the initial stage I withheld 50 randomly sampled articles from each class for a test set, and then combined 180 Top-importance articles with another 180 synthetic samples to get a training set of 1,440 articles.
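The log does not name the method used to generate the synthetic samples; a minimal sketch of one common approach (SMOTE-style interpolation between same-class articles) is below. The feature matrix here is hypothetical stand-in data, not the project's actual article features.

```python
import numpy as np

def synthetic_samples(X, n_new, rng):
    """Generate n_new synthetic rows by interpolating between a randomly
    chosen article and a random same-class partner (SMOTE-style)."""
    base = rng.integers(0, len(X), size=n_new)      # base samples
    partner = rng.integers(0, len(X), size=n_new)   # interpolation partners
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X[base] + gap * (X[partner] - X[base])

rng = np.random.default_rng(42)
# Hypothetical features for the 180 sampled Top-importance articles
# (e.g. normalized view rate and inlink count per article).
top_X = rng.random((180, 2))
augmented = np.vstack([top_X, synthetic_samples(top_X, 180, rng)])
print(augmented.shape)  # (360, 2): 180 real + 180 synthetic Top-importance rows
```

Doing the same for each of the four classes would yield the 1,440-article balanced training set described above.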
Performance on this initial training/test setup was on par with the other projects, so I switched to a larger training set with 230 of the 234 Top-importance articles and another 230 synthetic samples. This gives us a new training set of 1,840 articles, 460 from each class. Using 10-fold cross-validation we found a minimum node size of 8 to have the lowest error, and we used 1,007 trees for predictions. On the complete dataset, prediction performance was as follows:
| Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 149 | 53 | 27 | 5 | 63.68% |
High | 142 | 213 | 94 | 49 | 42.77% |
Mid | 150 | 373 | 658 | 423 | 41.02% |
Low | 108 | 389 | 797 | 2,716 | 67.73% |
Average | | | | | 58.87% |
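The per-class accuracies in the table are recall values (diagonal cell over row sum), and the average is the overall accuracy across all 6,346 articles. This can be verified directly from the confusion matrix:

```python
import numpy as np

# Confusion matrix for WikiProject Judaism (rows: true class, cols: predicted).
labels = ["Top", "High", "Mid", "Low"]
cm = np.array([
    [149,  53,  27,    5],
    [142, 213,  94,   49],
    [150, 373, 658,  423],
    [108, 389, 797, 2716],
])

per_class = np.diag(cm) / cm.sum(axis=1)  # recall per true class
overall = np.diag(cm).sum() / cm.sum()    # micro-averaged accuracy

for name, acc in zip(labels, per_class):
    print(f"{name}: {acc:.2%}")   # Top: 63.68%, High: 42.77%, Mid: 41.02%, Low: 67.73%
print(f"Average: {overall:.2%}")  # Average: 58.87%
```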
Performance on WikiProject Judaism is slightly higher than what we saw for WikiProject China. As we've seen before, there is considerable confusion between Top- and High-importance articles, and between Mid- and Low-importance articles. However, predictability of Low-importance articles is higher than for WikiProject China: in WikiProject Judaism we get more than two out of three of them correct. At the same time, our ability to predict Top-importance articles is not as good, partly because the view rates and inlink counts of these articles are more varied than what we have seen previously.
WikiProject National Football League
WikiProject NFL has the following distribution of articles per importance rating:
Rating | N articles |
---|---|
Top | 366 |
High | 516 |
Mid | 3,004 |
Low | 4,426 |
As for WikiProject Judaism, the number of High-importance articles limits the size of our training sets. After building a training and test dataset and finding the classifier's performance promising, I decided to use 250 Top-importance articles for training, with 250 additional synthetic samples, for a total of 2,000 articles across all four classes. Using cross-validation as before, I found a minimum node size of 8 to perform best, and used 4,203 trees for the predictions. With this setup, we get the following confusion matrix:
| Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 335 | 25 | 2 | 4 | 91.53% |
High | 51 | 372 | 64 | 29 | 72.09% |
Mid | 89 | 301 | 2,262 | 352 | 75.30% |
Low | 37 | 244 | 536 | 3,609 | 81.54% |
Average | | | | | 79.14% |
The performance on this dataset is much higher than we have seen previously; we are never below 70% accuracy for any of the classes. There also appears to be less confusion between pairs of classes; instead, the errors for High- and Mid-importance articles balance out fairly evenly between their two neighboring classes. I am curious whether that will affect reception of the model results in any way.
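The tuning step used throughout (selecting the minimum node size by 10-fold cross-validation) can be sketched as follows. The log does not state the implementation; scikit-learn's min_samples_leaf is used here as a stand-in for the minimum node size knob (R's randomForest calls the analogous parameter nodesize), and the feature matrix is hypothetical stand-in data for the 2,000-article balanced training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 2,000-article training set: two features
# (e.g. log view rate, log inlink count) and four balanced classes whose
# feature means are shifted so the classes are partly separable.
y = np.repeat(np.arange(4), 500)
X = rng.normal(size=(2000, 2)) + y[:, None]

# 10-fold cross-validation over candidate minimum node sizes, as in the log.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 4, 8, 16, 32]},
    cv=10,
)
search.fit(X, y)
print(search.best_params_)  # the node size with the lowest CV error
```

The final model would then be refit on the full training set with the winning node size before predicting across the whole project.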