Research talk:Automated classification of article importance/Work log/2017-04-07
Friday, April 7, 2017
Today I plan to wrap up the WPMED classifier training using the data gathered on Low-importance articles, get in touch with WPMED, and start looking at other WikiProjects.
WPMED classifier performance
We start out by adding a binary variable, labelling whether an article is an instance of a Low-importance item. Then we randomly choose 160 articles, 40 from each importance class, to be our test dataset as before. Lastly, we generate 200 synthetic samples of Top-importance articles, and combine them with the remaining 50 Top-importance articles and 250 articles from each of the other three classes to create a 1,000-article training dataset.
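The composition of the test and training sets can be sketched as follows. This is a minimal illustration rather than the code used for this log: the `articles` layout, the helper names, the toy class sizes, and the interpolation-based `synthesize` step (a SMOTE-like stand-in, since this entry does not say how the synthetic samples were generated) are all assumptions.

```python
import random

CLASSES = ["Top", "High", "Mid", "Low"]

def synthesize(points, n, rng):
    """Create n synthetic points by interpolating between random pairs of
    real points (an assumed SMOTE-like stand-in, not the log's method)."""
    out = []
    for _ in range(n):
        a, b = rng.sample(points, 2)
        t = rng.random()
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out

def split_dataset(articles, n_test=40, n_train=250, n_synthetic=200, seed=42):
    """articles: dict mapping class name -> list of feature vectors
    (hypothetical layout). Returns (train, test) dicts keyed by class."""
    rng = random.Random(seed)
    train, test = {}, {}
    for cls in CLASSES:
        pool = articles[cls][:]
        rng.shuffle(pool)
        test[cls] = pool[:n_test]
        rest = pool[n_test:]
        if cls == "Top":
            # Few real Top-importance articles remain, so pad with synthetic ones.
            train[cls] = rest + synthesize(rest, n_synthetic, rng)
        else:
            train[cls] = rest[:n_train]
    return train, test

# Toy data: 90 Top-importance articles (40 test + 50 remaining) and an
# arbitrary 400 in each other class, with three features per article.
_rng = random.Random(0)
articles = {
    cls: [[_rng.random() for _ in range(3)]
          for _ in range(90 if cls == "Top" else 400)]
    for cls in CLASSES
}
train, test = split_dataset(articles)
train_size = sum(len(v) for v in train.values())
test_size = sum(len(v) for v in test.values())
print(train_size, test_size)
```

With these sizes the training set holds 50 + 200 = 250 Top-importance items plus 3 × 250 from the other classes, i.e. 1,000 in total, and the test set holds 4 × 40 = 160.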
Benchmark SVM classifier
We use the SVM classifier from March 17 as our benchmark. It uses three predictors: (log) number of views per day over the past four weeks, (log) number of incoming Wikilinks from all articles in enwiki, and the proportion of incoming links that are from articles in WPMED. When we run it on our current test dataset we find that it has an overall accuracy of 60.62%, with the following confusion matrix (rows are true rating, columns are predicted rating):
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 34 | 6 | 0 | 0 | 85.00% |
High | 12 | 23 | 5 | 0 | 57.50% |
Mid | 2 | 9 | 19 | 10 | 47.50% |
Low | 2 | 9 | 8 | 21 | 52.50% |
Average | | | | | 60.62% |
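The accuracy figures in the benchmark table follow directly from the confusion matrix. A minimal sketch of that arithmetic (the function names are illustrative, not from the original analysis):

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Rows are true ratings, columns are predicted ratings (benchmark SVM).
benchmark = [
    [34, 6, 0, 0],
    [12, 23, 5, 0],
    [2, 9, 19, 10],
    [2, 9, 8, 21],
]

def per_class_accuracy(matrix):
    """Share of each true class predicted correctly: diagonal over row sum."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Share of all test articles predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

for label, acc in zip(LABELS, per_class_accuracy(benchmark)):
    print(f"{label}: {acc:.2%}")
print(f"Overall: {overall_accuracy(benchmark):.4f}")
```

Because every class contributes exactly 40 test articles, the overall accuracy (97 of 160 correct) equals the mean of the four per-class accuracies, which is why the last table row is labelled "Average".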
Adding the binary variable
We add the binary variable that labels articles that should be Low-importance, tune the SVM classifier and run it. It results in an overall accuracy of 58.12%, with the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 31 | 8 | 1 | 0 | 77.50% |
High | 14 | 18 | 8 | 0 | 45.00% |
Mid | 2 | 8 | 22 | 8 | 55.00% |
Low | 0 | 2 | 16 | 22 | 55.00% |
Average | | | | | 58.12% |
We are particularly interested in whether the classifier predicts articles that should be Low-importance as something else. It turns out that none of those articles are misclassified. This is a strong improvement over the benchmark, which incorrectly labels five articles: one Top-importance (Sanofi), three High-importance (JAMA_(journal), Lucy_Finch, and Marshall_M._Parks), and one Mid-importance (Royal_College_of_Physicians_and_Surgeons_of_Canada). This seems to suggest that the classifier is handling the categorical variable correctly. We can also see that the variable creates a somewhat stronger distinction between Top/High-importance and Mid/Low-importance. Articles that are Mid- or Low-importance are much less likely to be labelled High- or Top-importance.
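The claim that Mid- and Low-importance articles are less often labelled High- or Top-importance can be checked against the two confusion matrices. A small sketch (the helper name is illustrative):

```python
# Rows are true ratings, columns are predicted ratings: Top, High, Mid, Low.
benchmark = [
    [34, 6, 0, 0],
    [12, 23, 5, 0],
    [2, 9, 19, 10],
    [2, 9, 8, 21],
]
with_binary_variable = [
    [31, 8, 1, 0],
    [14, 18, 8, 0],
    [2, 8, 22, 8],
    [0, 2, 16, 22],
]

def lower_predicted_higher(matrix):
    """Count Mid/Low articles (rows 2, 3) predicted as Top/High (columns 0, 1)."""
    return sum(matrix[r][c] for r in (2, 3) for c in (0, 1))

print(lower_predicted_higher(benchmark))             # 22
print(lower_predicted_higher(with_binary_variable))  # 12
```

Out of the 80 Mid- and Low-importance test articles, the count drops from 22 to 12 once the binary variable is added.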
Inspecting some of the misclassified articles between Top- and High-importance suggests that those labels might be somewhat arbitrary. Do we really need both of those classes, particularly if one is only allowed to contain about a hundred articles? Why are Alternative_medicine, Autism, Bulimia_nervosa, and Chlamydia_infection all High-importance instead of Top-importance? Would it be meaningful to demote Allergy and Surgery to High-importance?
We also tested the Random Forest classifier to see how it performs compared to the SVM on this type of dataset. First we tune the forest size and minimum node size using 10-fold cross-validation, finding that 801 trees and a node size of 2 give the best performance. Training a classifier on the full training set and evaluating it on the test set results in the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 26 | 12 | 2 | 0 | 65.00% |
High | 10 | 19 | 9 | 2 | 47.50% |
Mid | 1 | 7 | 23 | 9 | 57.50% |
Low | 0 | 2 | 12 | 26 | 65.00% |
Average | | | | | 58.75% |
Inspecting the classifications, we again find that it correctly predicts all Low-importance articles that should be Low-importance. When it comes to Top-importance articles that it predicts to be Mid-importance, it adds Stomach cancer, whereas the SVM only predicts Major trauma as Mid-importance. However, it only predicts a single Mid-importance article as Top-importance (Ketamine). The two Low-importance articles that it predicts to be High-importance are the same as for the SVM.
Rerating Low-importance articles
One issue with our dataset is that we have a number of articles that we strongly believe should be rated Low-importance, but currently have a different rating. What happens to the model if we assert that they should all be Low-importance, in other words rerate them in our dataset? We create a new variable for our dataset and adjust it accordingly, then retrain our models. We first run the SVM trained on the old dataset on the new test set, to get an idea of its overall performance. It results in the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 33 | 7 | 0 | 0 | 82.50% |
High | 12 | 18 | 7 | 3 | 45.00% |
Mid | 1 | 12 | 19 | 8 | 47.50% |
Low | 0 | 3 | 13 | 24 | 60.00% |
Average | | | | | 58.75% |
Overall accuracy is slightly lower; it is the accuracy on High-importance articles that suffers, while performance on Low-importance articles is slightly better. However, can we do better if we train an SVM classifier on this new dataset? Arguably we should, since the classes have changed and thereby their definitions. That is also what happens:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 29 | 10 | 1 | 0 | 72.50% |
High | 11 | 16 | 10 | 3 | 40.00% |
Mid | 0 | 11 | 24 | 5 | 60.00% |
Low | 0 | 2 | 5 | 33 | 82.50% |
Average | | | | | 63.75% |
We see here that accuracy for Top- and High-importance articles changes for the worse. That might be because we have now labelled some articles as Low-importance that by our predictors look like they should be higher importance. However, the model does not predict any of those articles as anything but Low-importance. In other words, even though we do not use the binary variable in our model, those articles are now predicted to be Low-importance.
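The rerating step behind this dataset amounts to overriding the rating of every flagged article. A minimal sketch, assuming each article is a record with hypothetical `rating` and `should_be_low` fields (the titles below are placeholders, not real articles):

```python
def rerate(articles):
    """Return copies of the records with a new `rerated` field: flagged
    articles are forced to Low, all others keep their existing rating."""
    out = []
    for article in articles:
        record = dict(article)
        record["rerated"] = "Low" if article["should_be_low"] else article["rating"]
        out.append(record)
    return out

# Placeholder records; field names and values are assumptions for illustration.
sample = [
    {"title": "Article A", "rating": "Top", "should_be_low": True},
    {"title": "Article B", "rating": "High", "should_be_low": False},
    {"title": "Article C", "rating": "Low", "should_be_low": True},
]
rerated = rerate(sample)
for record in rerated:
    print(record["title"], record["rating"], "->", record["rerated"])
```

Keeping the original `rating` alongside the new `rerated` field makes it possible to train and evaluate against either set of labels.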
The model's incorrect predictions are now perhaps somewhat more interesting. There's one Top-importance article predicted to be Mid-importance, Long-term effects of alcohol consumption. That article has neither particularly many inlinks nor particularly many views for its rating. No Mid-importance articles are predicted to be Top-importance, whereas the two Low-importance articles now predicted as Top- or High-importance are List of common misconceptions and Nabilone. The first has lots of views, the second has lots of inlinks. Nabilone actually has more inlinks from WPMED than the previously mentioned alcohol consumption article has in total.
The question is, does it now benefit us to add the binary variable defining if an article should be Low-importance? Will that split up the dataset in a meaningful way for our classifier? We test it again and get a confusion matrix as follows:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 30 | 9 | 1 | 0 | 75.00% |
High | 10 | 19 | 11 | 0 | 47.50% |
Mid | 0 | 11 | 26 | 3 | 65.00% |
Low | 0 | 2 | 9 | 29 | 72.50% |
Average | | | | | 65.00% |
Overall performance is slightly up, driven by improved accuracy for High- and Mid-importance articles. The ability to correctly predict Low-importance articles is somewhat lower, although it does not incorrectly label any article that has the "should be Low-importance" flag set. When it comes to misclassifications, the ones that are rather far away are the exact same as for the other benchmark. The confusion between Top- and High-importance shows many of the same characteristics as before; it is not obvious why these should be different classes. We also looked at Low-importance articles predicted to be Mid-importance, and there it is again not obvious whether they should be or not. Some of the articles have quite a number of views and inlinks (e.g. Apex beat, DNA vaccination, and Mayo scissors), but the others are low on views and/or inlinks, in some cases very much so.
Doubling training set size
While we earlier did not see improved performance with a larger training set, it might be that we can benefit from it in this case since we have the added binary variable. The underlying assumption is that the larger training set might enable more accurate predictions since we will have more examples of articles with this binary variable set. We will test two approaches: a larger training set with an unstratified sample of Low-importance articles, and one with a stratified sample of Low-importance articles such that half of them have the binary variable set while the other half do not.
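The difference between the two sampling approaches can be sketched as follows. The record layout, the `should_be_low` field name, and the toy pool are assumptions; the sample size of 500 matches the 250 + 250 split used for the stratified sample later in this entry.

```python
import random

def sample_low(low_articles, n, stratified, seed=0):
    """Draw n Low-importance articles for the training set.

    Unstratified: a plain random sample, so the share of flagged articles
    follows whatever it happens to be in the pool.
    Stratified: half of the sample has the flag set, the other half does not.
    """
    rng = random.Random(seed)
    if not stratified:
        return rng.sample(low_articles, n)
    flagged = [a for a in low_articles if a["should_be_low"]]
    unflagged = [a for a in low_articles if not a["should_be_low"]]
    half = n // 2
    return rng.sample(flagged, half) + rng.sample(unflagged, n - half)

# Toy pool: 1,000 Low-importance articles, 300 of them flagged (made-up sizes).
pool = [{"id": i, "should_be_low": i < 300} for i in range(1000)]

unstratified = sample_low(pool, 500, stratified=False)
stratified = sample_low(pool, 500, stratified=True)
print(sum(a["should_be_low"] for a in unstratified), "flagged in unstratified sample")
print(sum(a["should_be_low"] for a in stratified), "flagged in stratified sample")
```

Note that `random.sample` raises `ValueError` if a stratum has fewer articles than requested, which is a useful guard when the flagged group is small.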
Training a classifier on the unstratified larger training set and using the original article ratings gives us the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 30 | 9 | 1 | 0 | 75.00% |
High | 10 | 20 | 8 | 2 | 50.00% |
Mid | 1 | 12 | 22 | 5 | 55.00% |
Low | 0 | 3 | 4 | 33 | 82.50% |
Average | | | | | 65.62% |
The performance of this classifier is very close to what we saw previously after the ratings were corrected, but overall it is slightly stronger. Accuracy for Mid-importance articles is slightly lower (55% vs 60%), performance on Top-importance articles is slightly higher (75% vs 72.5%), and performance on Low-importance articles is unchanged. However, performance on High-importance articles is much better, with an increase from 40% to 50% accuracy. This suggests that we can gain performance by using a larger sample.
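The per-class comparison above can be reproduced from the two confusion matrices involved. A small sketch (variable and function names are illustrative):

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Rows are true ratings, columns are predicted ratings.
rerated_1000 = [  # SVM trained on the rerated 1,000-article training set
    [29, 10, 1, 0],
    [11, 16, 10, 3],
    [0, 11, 24, 5],
    [0, 2, 5, 33],
]
larger_unstratified = [  # SVM trained on the larger unstratified training set
    [30, 9, 1, 0],
    [10, 20, 8, 2],
    [1, 12, 22, 5],
    [0, 3, 4, 33],
]

def per_class_accuracy(matrix):
    """Share of each true class predicted correctly: diagonal over row sum."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def accuracy_deltas(before, after):
    """Change in per-class accuracy, in percentage points."""
    return [
        100 * (a - b)
        for b, a in zip(per_class_accuracy(before), per_class_accuracy(after))
    ]

deltas = accuracy_deltas(rerated_1000, larger_unstratified)
for label, delta in zip(LABELS, deltas):
    print(f"{label}: {delta:+.1f} points")
```

The changes are +2.5 points for Top, +10 for High, -5 for Mid, and 0 for Low. With 40 test articles per class, each percentage-point step of 2.5 corresponds to a single article.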
If we now add back the binary variable for Low-importance, thus again hopefully slicing the dataset into two parts to enable us to predict importance for the core medicine articles, we get the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 32 | 8 | 0 | 0 | 80.00% |
High | 10 | 21 | 8 | 1 | 52.50% |
Mid | 1 | 13 | 21 | 5 | 52.50% |
Low | 0 | 3 | 8 | 29 | 72.50% |
Average | | | | | 64.38% |
We gain accuracy for the higher-importance articles, and lose accuracy for the lower-importance ones. Overall there is little change. The question is whether this is a useful tradeoff. However, we again want to experiment with rerating articles, and run the same approach as before, seeing what would happen if everything that is supposed to be Low-importance is actually rated Low-importance.
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 28 | 9 | 2 | 1 | 70.00% |
High | 10 | 17 | 11 | 2 | 42.50% |
Mid | 1 | 12 | 22 | 5 | 55.00% |
Low | 1 | 2 | 5 | 32 | 80.00% |
Average | | | | | 61.88% |
Overall, performance is down by quite a bit, and that is driven by articles that are not Low-importance. Where we before had no articles in the extremes (e.g. Low-importance predicted to be Top-importance), we now have two examples of these articles. What other interesting misclassification examples are there? First of all, one of the articles that should be Low-importance, Adrian Kantrowitz, is now predicted to be High-importance. The article has a fair number of inlinks, but not that many views. One Top-importance article is predicted to be Low-importance, and that is Suicide, which has lots of views and inlinks. Its misclassification makes very little sense. Similarly, we see that a Low-importance article is predicted to be Top-importance, namely DNA vaccination, and its views and inlinks do not support that predicted importance rating. It seems clear that the classifier is now not functioning the way we wanted it to. We add in the binary variable to see if that helps, resulting in the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 27 | 10 | 2 | 1 | 67.50% |
High | 9 | 18 | 12 | 1 | 45.00% |
Mid | 1 | 12 | 23 | 4 | 57.50% |
Low | 0 | 1 | 7 | 32 | 80.00% |
Average | | | | | 62.50% |
The classifier now correctly predicts all the articles that "should be" Low-importance, and we also see that Low-importance articles are not predicted to be higher importance in the same way as before. However, we still see some challenges, and inspect the misclassifications as we've done before. The article about suicide is once again predicted to be Low-importance, for some odd reason. When it comes to the confusion between Top- and High-importance, that appears to behave in much the same way as before.
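The "extreme" misclassifications discussed for the larger training set (Top-importance predicted as Low-importance, or the reverse) can be counted from the confusion matrices by treating the four ratings as an ordered scale. A minimal sketch, with illustrative names:

```python
# Rows are true ratings, columns are predicted ratings: Top, High, Mid, Low.
original_with_flag = [  # larger set, original ratings, binary variable added
    [32, 8, 0, 0],
    [10, 21, 8, 1],
    [1, 13, 21, 5],
    [0, 3, 8, 29],
]
rerated_without_flag = [  # larger set, rerated, no binary variable
    [28, 9, 2, 1],
    [10, 17, 11, 2],
    [1, 12, 22, 5],
    [1, 2, 5, 32],
]
rerated_with_flag = [  # larger set, rerated, binary variable added
    [27, 10, 2, 1],
    [9, 18, 12, 1],
    [1, 12, 23, 4],
    [0, 1, 7, 32],
]

def distance_counts(matrix):
    """Map ordinal distance |true - predicted| to the number of articles."""
    counts = {}
    for true_idx, row in enumerate(matrix):
        for pred_idx, n in enumerate(row):
            d = abs(true_idx - pred_idx)
            counts[d] = counts.get(d, 0) + n
    return counts

for name, matrix in [
    ("original, with flag", original_with_flag),
    ("rerated, without flag", rerated_without_flag),
    ("rerated, with flag", rerated_with_flag),
]:
    print(name, distance_counts(matrix))
```

A distance of 3 is a Top/Low swap: there are none with the original ratings, two after rerating without the binary variable, and one once the variable is added back.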
Lastly, we test a stratified sample, where we sample 250 Low-importance articles with the binary variable set, and 250 where it is not set. We then again test both a model without the binary variable, and one with it, to see if there is some improvement. First without, where we get the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 30 | 10 | 0 | 0 | 75.00% |
High | 9 | 20 | 10 | 1 | 50.00% |
Mid | 1 | 12 | 24 | 3 | 60.00% |
Low | 0 | 0 | 7 | 33 | 82.50% |
Average | | | | | 66.88% |
Overall performance on this test set is comparable to what we saw earlier. We can see that the classes appear more distinct, particularly towards the edges. There are no Low-importance articles predicted to be High- or Top-importance, and no Top-importance articles are predicted to be Mid- or Low-importance. A single Mid-importance article, Hydrochlorothiazide, is predicted to be Top-importance, probably due to its reasonably large number of inlinks (many of which are not from WPMED), and high number of views (1,123 per day on average). The confusion between Top- and High-importance appears to behave in much the same way as before. The Mid-importance articles that are predicted to be Low-importance seem reasonable, and the same can be said for the other way around. With the binary variable added, we get the following confusion matrix:
True / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 28 | 12 | 0 | 0 | 70.00% |
High | 9 | 21 | 10 | 0 | 52.50% |
Mid | 1 | 12 | 25 | 2 | 62.50% |
Low | 0 | 0 | 15 | 25 | 62.50% |
Average | | | | | 61.88% |
What happened here? Well, there are no major differences when it comes to the extreme ends of the confusion matrix. The confusion between Top- and High-importance (and vice versa) seems to behave just as before. However, we see quite a lot more confusion between Mid- and Low-importance, although none of the "should be Low-importance" articles are misclassified. Instead, what happens is that many more Low-importance articles are predicted to be Mid-importance. Some of these might actually be Mid-importance. In other words, it seems like we have finally reached a point where we are pushing against the accuracy of WPMED's ratings. The next step should therefore be to start discussing these with them; it might be that we do an actual experiment.