Research talk:Automated classification of article importance/Work log/2017-05-05

From Meta, a Wikimedia project coordination wiki

Friday, May 5, 2017

My goal for today is to finish writing code for the WikiProject models, and generate work lists for all projects. Once the lists are ready, I'll write some introductions to them as well.

WikiProject China model

I've streamlined the process of combining our various datasets and training a GBM model. For WikiProject China, this required fewer synthetic samples than for WikiProject Medicine, since the former has a larger number of Top-importance articles. WikiProject China's articles are distributed among the four importance categories as follows:

Rating   N articles
Top             413
High          1,684
Mid           9,632
Low          13,557

I decided to do model building in two stages. First, I run a traditional training and test phase to understand the general performance of the model. In this phase, the datasets are disjoint, and the classes are balanced in both the training and test sets. If the first stage is reasonably successful, I resample a second, larger dataset (again with balanced classes) and train a model on that, which is then used to predict the importance rating of every article in the WikiProject. In both stages I use 10-fold cross-validation to estimate model accuracy and choose an appropriate minimum node size and forest size.
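The first stage can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the features, labels, and class sizes are synthetic stand-ins, and I use scikit-learn's `GradientBoostingClassifier` with its `resample` utility for the class balancing step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import resample

# Stand-in for the real article features and importance labels
# (four classes mirroring Top/High/Mid/Low).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=42)

def balanced_sample(X, y, n_per_class, random_state=42):
    """Resample each class to exactly n_per_class items (up- or downsampling)."""
    parts = []
    for label in np.unique(y):
        mask = y == label
        parts.append(resample(X[mask], y[mask],
                              n_samples=n_per_class,
                              replace=mask.sum() < n_per_class,
                              random_state=random_state))
    Xs = np.vstack([p[0] for p in parts])
    ys = np.concatenate([p[1] for p in parts])
    return Xs, ys

# Stage 1: disjoint, class-balanced training and test sets.
X_bal, y_bal = balanced_sample(X, y, n_per_class=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bal, y_bal, test_size=0.25, stratify=y_bal, random_state=42)

clf = GradientBoostingClassifier(random_state=42)
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=10).mean()
clf.fit(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print(f"10-fold CV accuracy: {cv_acc:.2%}, held-out accuracy: {test_acc:.2%}")
```

Because both sets are balanced, the held-out accuracy is directly comparable across classes and not inflated by the dominant Low-importance class.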

For WikiProject China, first-stage performance showed promise, with an overall accuracy of 54.25% on a test set of 400 articles (100 per class). Accuracy on Top-importance articles is reasonably high at 75%, although there is some confusion between Top- and High-importance, as we saw previously in WikiProject Medicine. Accuracy for the other classes ranges from 43% to 56%, and, as in WPMED, there is considerable confusion between Mid- and Low-importance.

Given that WikiProject China has slightly more than 1,600 High-importance articles, I decided to make the second training set contain 6,400 articles. This meant generating 1,200 synthetic samples for Top-importance articles. Once that was done, the cross-validation suggested a minimum node size of 8, and a forest size of 2,213. Using those parameters, the model predicts the articles in WikiProject China as follows:
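The parameter search in the second stage can be sketched as a cross-validated grid search. The parameter names here are scikit-learn analogues I've chosen for illustration: `min_samples_leaf` standing in for minimum node size and `n_estimators` for forest size; the grids, data, and model class are all stand-ins, not the actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in features and labels for the balanced second training set.
X_train, y_train = make_classification(n_samples=800, n_features=10,
                                       n_informative=6, n_classes=4,
                                       random_state=42)

# Illustrative grids over "minimum node size" and "forest size".
param_grid = {"min_samples_leaf": [4, 8, 16],
              "n_estimators": [50, 100]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# The refit best model then predicts importance for every article
# in the project (stand-in feature matrix here).
X_all, _ = make_classification(n_samples=500, n_features=10, n_informative=6,
                               n_classes=4, random_state=7)
predicted = search.predict(X_all)
```

`GridSearchCV` refits the best parameter combination on the full training set, so `search.predict` applies that final model to the whole project.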

Predicted importance by true importance (rows are true ratings, columns are predictions):

         Top   High    Mid    Low  Accuracy
Top      329     71      9      4    79.66%
High     387    763    278    256    45.31%
Mid      470  2,387  3,690  3,085    38.31%
Low      305  2,255  3,241  7,756    57.21%
Average                              49.58%
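The accuracy column above can be recomputed directly from the confusion counts, which also makes clear that the "Average" row is the overall accuracy across all 25,286 articles rather than a macro-average of the four per-class figures:

```python
import numpy as np

# Confusion matrix from the table above: rows are true ratings,
# columns are predicted ratings, both ordered Top, High, Mid, Low.
cm = np.array([
    [329,    71,     9,     4],   # Top
    [387,   763,   278,   256],   # High
    [470,  2387,  3690,  3085],   # Mid
    [305,  2255,  3241,  7756],   # Low
])

per_class = cm.diagonal() / cm.sum(axis=1)  # e.g. Top: 329/413 = 79.66%
overall = cm.trace() / cm.sum()             # 12,538/25,286 = 49.58%
```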

Overall performance is somewhat lower than what we have seen previously in WikiProject Medicine. This is partly due to the Low-importance class not being as easy to predict as it was in WPMED. We can also see that the other classes span the whole spectrum to some degree; note, for instance, that more than two thousand Low-importance articles are predicted to be High-importance. As with WikiProject Medicine, performance is limited by the underlying data, but we might still have a useful solution.