Research talk:Automated classification of article importance/Work log/2017-05-08

From Meta, a Wikimedia project coordination wiki

Monday, May 8, 2017

Today my goal is to finish modelling the candidate WikiProjects by building models for WikiProject Africa and WikiProject Politics. Once that is done, I'll write introductions to the various work lists we've generated, and adapt the introduction to the rerating candidates from WikiProject Medicine. Then I'll put those somewhere on enwiki and compose messages to the various projects' assessment pages.

WikiProject Politics

The distribution of articles per importance rating for WikiProject Politics is as follows:

Rating N articles
Top 111
High 1,107
Mid 4,068
Low 18,469

The number of Top-importance articles limits the size of our test set, so I decided to set aside 30 articles for the test set and use 80 for the training set. We have had reasonable success with a fairly large proportion of synthetic samples, so adding 800 synthetic samples should be okay, giving us 880 articles per class.
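The split and the resulting set sizes can be checked with a short sketch. This is only an illustration of the arithmetic, not the code used for the models: the function name `split_class`, the placeholder article IDs, and the fixed seed are all assumptions made for the example.

```python
import random


def split_class(article_ids, n_test, n_train, seed=0):
    """Shuffle one class and return disjoint (test, train) lists."""
    if n_test + n_train > len(article_ids):
        raise ValueError("not enough articles for the requested split")
    rng = random.Random(seed)
    shuffled = list(article_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:n_test + n_train]


# Placeholder IDs standing in for the 111 Top-importance articles.
top_ids = [f"top-{i}" for i in range(111)]
test_top, train_top = split_class(top_ids, n_test=30, n_train=80)

n_synthetic = 800  # synthetic Top-importance samples added to the training set
per_class_training_size = len(train_top) + n_synthetic
test_set_size = 4 * len(test_top)  # 30 articles from each of the four classes

print(per_class_training_size, test_set_size)
```

With 30 articles held out per class, the four classes together give the 120-article test set.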

Performance on the 120-article test set was about the same as for other projects, so I created a larger training set with 100 Top-importance articles, 1,000 synthetic samples, and 1,100 articles from each of the three other classes, for a total of 4,400 articles. Using 10-fold cross-validation, I found that a minimum node size of 64 gave the lowest error, and that I should use 5,088 trees for predictions. With this setup, we get the following performance across the complete dataset:

Rows are the project's ratings, columns are the predicted ratings.

Rating     Top     High     Mid      Low       Accuracy
Top        83      8        16       3         75.45%
High       221     439      271      175       39.69%
Mid        586     936      1,733    812       42.61%
Low        476     1,250    2,880    13,859    75.06%
Average                                        67.85%
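The accuracy column can be recomputed from the counts in the table. The sketch below assumes that each row holds the project's rating and each column the predicted rating, which is consistent with the per-row percentages; the function names are my own. Under that reading, the "Average" figure matches overall accuracy (all correct predictions divided by all articles) rather than the mean of the four per-class values.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Confusion matrix for WikiProject Politics, copied from the table above.
# Rows: project rating. Columns: predicted rating.
CONFUSION = [
    [83, 8, 16, 3],
    [221, 439, 271, 175],
    [586, 936, 1733, 812],
    [476, 1250, 2880, 13859],
]


def per_class_accuracy(matrix):
    """Share of each row that falls on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions divided by the total number of articles."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


class_acc = per_class_accuracy(CONFUSION)
overall = overall_accuracy(CONFUSION)

for label, acc in zip(LABELS, class_acc):
    print(f"{label}: {acc:.2%}")
print(f"Overall: {overall:.2%}")
```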

Overall performance is on par with what we have seen in other projects. Top- and Low-importance appear to be the easier classes to predict, while performance on the other two classes is considerably lower. About 16% of the High-importance articles (175 of 1,106) are predicted to be Low-importance, and about 23% of the Mid-importance articles (936 of 4,067) are predicted to be High-importance; I wonder whether this will be picked up later during discussions. Inspecting some of the predicted reratings also suggests that they are not as clearly related to number of views and inlinks as we have seen in other projects, and I am curious to see how that affects things too.
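The 10-fold cross-validation used above to choose the minimum node size can be sketched in plain Python. This is a generic illustration, not the tuning code behind the reported numbers. The helper names, the candidate values, and the toy error function are all hypothetical; `fit_and_score` stands in for training a model on the training folds and returning its error on the held-out fold.

```python
import random
from statistics import mean


def k_fold_indices(n_items, k=10, seed=0):
    """Partition range(n_items) into k shuffled folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(n_items, fit_and_score, k=10, seed=0):
    """Mean held-out error over k folds.

    fit_and_score(train_idx, test_idx) is supplied by the caller: it should
    train on train_idx and return the error measured on test_idx.
    """
    folds = k_fold_indices(n_items, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(fit_and_score(train_idx, test_idx))
    return mean(errors)


def pick_min_node_size(candidates, n_items, make_scorer, k=10):
    """Return the candidate with the lowest cross-validated error."""
    results = {
        size: cross_validated_error(n_items, make_scorer(size), k)
        for size in candidates
    }
    return min(results, key=results.get), results


def toy_scorer(min_node_size):
    """Made-up error curve with its minimum at 64, only to exercise the code."""
    def fit_and_score(train_idx, test_idx):
        return abs(min_node_size - 64) / 100
    return fit_and_score


folds = k_fold_indices(4400)
best, results = pick_min_node_size([4, 16, 64, 256], 4400, toy_scorer)
print(best)
```

In a real run, `toy_scorer` would be replaced by a function that fits the classifier with the given minimum node size and reports its error on the held-out fold.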

WikiProject Africa

The distribution of articles per importance rating for WikiProject Africa is as follows:

Rating N articles
Top 2,264
High 1,249
Mid 4,054
Low 25,835

Whereas we previously had a low number of Top-importance articles, we now have a large project with fewer limitations on dataset size. Here it is the number of High-importance articles that limits dataset size, but with almost 1,250 articles that is not much of a limitation compared to what we have seen previously. We first split the dataset into separate training and test sets, and found classifier performance to be on par with what we have seen previously. While doing this, we found several articles that were tagged multiple times by the project and given two different importance ratings, and I created a work list table for those as well.
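Finding articles that carry more than one importance rating amounts to grouping the project's tags by article and keeping the groups with more than one distinct rating. The sketch below assumes the tags are available as (title, rating) pairs; the function name and the example titles are placeholders, not articles from the actual work list.

```python
from collections import defaultdict


def conflicting_ratings(tags):
    """Return {title: sorted ratings} for titles with more than one distinct rating."""
    ratings = defaultdict(set)
    for title, rating in tags:
        ratings[title].add(rating)
    return {title: sorted(r) for title, r in ratings.items() if len(r) > 1}


# Hypothetical input: one (title, rating) pair per project tag on a talk page.
example_tags = [
    ("Example article A", "High"),
    ("Example article A", "Mid"),
    ("Example article B", "Low"),
    ("Example article B", "Low"),  # tagged twice, but with the same rating
    ("Example article C", "Top"),
]

conflicts = conflicting_ratings(example_tags)
print(conflicts)
```

Articles tagged twice with the same rating are left out, since they do not need a human decision about which rating is correct.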

I chose to sample 1,240 articles from each category for the final training set. Using 10-fold cross-validation as before, I found that a minimum node size of 4 had the best performance, using 4,543 trees for the predictions. This resulted in the following confusion matrix for the full dataset:

Rows are the project's ratings, columns are the predicted ratings.

Rating     Top      High     Mid      Low       Accuracy
Top        1,672    360      160      72        73.9%
High       440      422      155      232       33.8%
Mid        921      882      958      1,293     23.6%
Low        1,807    3,104    3,802    17,122    66.3%
Average                                         60.4%

Overall performance is comparable to other projects we have modelled. We see fairly strong performance for Top-importance articles, and good performance on Low-importance articles as well. High- and Mid-importance articles are not predicted as well. There is some indication that the High-importance articles look like Top-importance, which we've encountered in other projects too. It is somewhat worrying that predictions of Mid-importance articles are spread out across the board, but this is also a result of the input data, as the project's importance ratings do not seem to map closely to either number of views or inlinks.
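Dividing each row of the confusion matrix by its total shows where each class's predictions end up, which makes both observations above concrete. The sketch assumes that rows are the project's ratings and columns the predictions; the function name is my own. For example, 440 of the 1,249 High-importance articles (about 35%) are predicted as Top-importance, slightly more than the 422 predicted correctly, and no single prediction accounts for more than about a third of the Mid-importance articles.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Confusion matrix for WikiProject Africa, copied from the table above.
# Rows: project rating. Columns: predicted rating.
AFRICA = [
    [1672, 360, 160, 72],
    [440, 422, 155, 232],
    [921, 882, 958, 1293],
    [1807, 3104, 3802, 17122],
]


def row_shares(matrix):
    """Convert each row of counts into shares of that row's total."""
    return [[cell / sum(row) for cell in row] for row in matrix]


shares = row_shares(AFRICA)

for label, row in zip(LABELS, shares):
    cells = ", ".join(f"{name} {share:.1%}" for name, share in zip(LABELS, row))
    print(f"{label}: {cells}")
```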