Research talk:Automated classification of article importance/Work log/2017-04-10

From Meta, a Wikimedia project coordination wiki

Monday, April 10, 2017

Today I plan to wrap up the WPMED classifier training by testing the classifiers that incorporate the clickstream data, figure out a plan for evaluation with Aaron, and lastly work on the issue of WikiProject/category overlap.

Clickstream classifier

We reuse the clickstream dataset from March 28 and first tune a benchmark SVM classifier using the "local inlink proportion" from Kamps and Koolen as before. On the test set it performs as follows:

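| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 32 | 8 | 0 | 0 | 80.00% |
| High | 9 | 19 | 8 | 4 | 47.50% |
| Mid | 2 | 9 | 22 | 7 | 55.00% |
| Low | 1 | 4 | 8 | 27 | 67.50% |
| Average | | | | | 62.50% |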
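The accuracy column in these confusion matrices can be reproduced with a few lines of Python. This is a minimal sketch, assuming that rows hold the actual (rated) class and columns the predicted class; that reading is inferred from every row summing to 40 articles. The function name `per_class_accuracy` is my own and not part of the project's code.

```python
def per_class_accuracy(matrix, labels):
    """Return {label: accuracy}, where accuracy = diagonal cell / row total."""
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        # Guard against an empty class so we never divide by zero.
        result[label] = matrix[i][i] / row_total if row_total else 0.0
    return result


labels = ["Top", "High", "Mid", "Low"]

# Benchmark confusion matrix from the table above
# (assumed layout: rows = actual class, columns = predicted class).
benchmark = [
    [32, 8, 0, 0],
    [9, 19, 8, 4],
    [2, 9, 22, 7],
    [1, 4, 8, 27],
]

accuracy = per_class_accuracy(benchmark, labels)
# The "Average" row is the unweighted mean of the per-class accuracies.
average = sum(accuracy.values()) / len(accuracy)

for label in labels:
    print(f"{label}: {accuracy[label]:.2%}")
print(f"Average: {average:.2%}")
```

Because the test set has the same number of articles in each class, the unweighted mean of the per-class accuracies equals the overall share of correctly classified articles.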

These results are somewhat better than what I reported on March 28, perhaps because I ran more iterations when tuning the gamma and cost parameters; the values I found were quite different from the ones I used then. Compared to the March 28 results, this classifier is worse on Top-importance articles (80% vs 85%) and High-importance articles (47.5% vs 57.5%), but scores higher on the other two classes (Mid-importance: 55% vs 47.5%; Low-importance: 67.5% vs 52.5%).
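The log does not say how gamma and cost were tuned. One common approach is an exhaustive grid search over candidate values, sketched below in library-agnostic form. Everything here is illustrative: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for "train an SVM with these parameters and return its cross-validated accuracy".

```python
import itertools
import math


def grid_search(score_fn, gammas, costs):
    """Try every (gamma, cost) pair and return (best_score, gamma, cost)."""
    best = None
    for gamma, cost in itertools.product(gammas, costs):
        score = score_fn(gamma, cost)
        if best is None or score > best[0]:
            best = (score, gamma, cost)
    return best


def toy_score(gamma, cost):
    # Placeholder objective (an assumption, not real classifier output):
    # it peaks at gamma = 0.01 and cost = 10. In practice this function
    # would train the SVM and return its cross-validation accuracy.
    return 1.0 / (1.0 + abs(math.log10(gamma) + 2) + abs(math.log10(cost) - 1))


# Candidate values on a logarithmic scale, as is usual for these parameters.
gammas = [10.0 ** e for e in range(-4, 1)]
costs = [10.0 ** e for e in range(-1, 4)]

best_score, best_gamma, best_cost = grid_search(toy_score, gammas, costs)
print(f"best gamma={best_gamma}, cost={best_cost}, score={best_score:.3f}")
```

A coarse logarithmic grid like this is often followed by a finer grid around the best pair, which may be what "more iterations" amounts to here.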

Our previous investigations into clickstream performance suggested that the clickstream variables should be added in pairs, as each variable might degrade performance on its own while the pair together enhances it. We therefore add the two global proportion variables first: one for the proportion of views coming from other articles, and one for the proportion of (global) inlinks that were used. This results in the following confusion matrix:

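| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 23 | 15 | 2 | 0 | 57.50% |
| High | 6 | 17 | 14 | 3 | 42.50% |
| Mid | 1 | 9 | 18 | 12 | 45.00% |
| Low | 0 | 6 | 8 | 26 | 65.00% |
| Average | | | | | 52.50% |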
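For reference, the two global proportion variables used for this classifier could be computed along the following lines. This is a sketch under my own assumptions about the inputs: the function name, the argument names, and the example counts are all hypothetical, and the exact definitions used in the project may differ.

```python
def global_clickstream_proportions(views_from_articles, total_views,
                                   inlinks_used, total_inlinks):
    """Return (proportion of views from other articles,
               proportion of inlinks that were actually used)."""
    # Guard against articles with no views or no inlinks.
    prop_views = views_from_articles / total_views if total_views else 0.0
    prop_links = inlinks_used / total_inlinks if total_inlinks else 0.0
    return prop_views, prop_links


# Hypothetical counts for a single article, for illustration only.
prop_views, prop_links = global_clickstream_proportions(
    views_from_articles=1200,  # views arriving via links from other articles
    total_views=4000,          # all views in the same period
    inlinks_used=45,           # inlinks that show up in the clickstream data
    total_inlinks=150,         # all inlinks to the article
)
print(f"views from articles: {prop_views:.2f}, inlinks used: {prop_links:.2f}")
```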

Overall accuracy has dropped a lot, mainly due to the huge drop in accuracy for Top-importance articles. This is very different from what we saw when we previously tested the clickstream variables. This is something to keep in mind as we move along: we might be seeing large swings in performance simply because our test set is so small.
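To get a feel for how much of the swing could be explained by test set size alone, one option is to put a confidence interval around each overall accuracy. The sketch below uses the Wilson score interval for a binomial proportion; this is a technique I am bringing in for illustration, not something used elsewhere in this log. The counts are the diagonal sums of the two confusion matrices above (100 and 84 correct out of 160 test articles).

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    ) / denom
    return centre - half_width, centre + half_width


# Correct predictions out of 160 test articles, taken from the matrices above.
bench_low, bench_high = wilson_interval(100, 160)    # benchmark, 62.5%
global_low, global_high = wilson_interval(84, 160)   # global proportions, 52.5%

print(f"benchmark:          {bench_low:.3f} to {bench_high:.3f}")
print(f"global proportions: {global_low:.3f} to {global_high:.3f}")
```

With 160 test articles each interval extends roughly 7 to 8 percentage points on either side of the observed accuracy, and the two intervals overlap. That is consistent with the worry that the test set is too small to separate these classifiers reliably.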

In previous testing we saw no improvement in performance when adding the project-specific clickstream variables, and we therefore do not add them before we add the binary variable for Low-importance articles. We will instead add them afterwards to see whether performance is affected in that case. Adding the binary variable for articles that ought to be Low-importance gives us the following confusion matrix:

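| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 29 | 10 | 1 | 0 | 72.50% |
| High | 9 | 20 | 8 | 3 | 50.00% |
| Mid | 2 | 9 | 22 | 7 | 55.00% |
| Low | 1 | 5 | 6 | 28 | 70.00% |
| Average | | | | | 61.88% |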

Overall performance is comparable to the benchmark. The classifier is slightly better on High- and Low-importance articles, and worse on Top-importance articles.
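As a rough illustration of the binary Low-importance variable, the sketch below flags an article when it belongs to at least one category from a chosen set. This assumes the variable is derived from category membership, which is how I read the relabelling discussion in this log but is not spelled out in this entry. The function name and the category names are hypothetical placeholders; the real selection of categories still needs confirmation from WPMED.

```python
def low_importance_flag(article_categories, low_importance_categories):
    """Return 1 if the article is in any of the selected categories, else 0."""
    return int(any(cat in low_importance_categories
                   for cat in article_categories))


# Hypothetical category names, for illustration only.
LOW_CATEGORIES = {"Example category A", "Example category B"}

flag_yes = low_importance_flag(["Example category A", "Other"], LOW_CATEGORIES)
flag_no = low_importance_flag(["Other"], LOW_CATEGORIES)
print(flag_yes, flag_no)
```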

Lastly, we add the two project-specific clickstream variables for reference, resulting in the following confusion matrix:

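| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 16 | 21 | 2 | 1 | 40.00% |
| High | 6 | 21 | 8 | 5 | 52.50% |
| Mid | 2 | 10 | 14 | 14 | 35.00% |
| Low | 0 | 6 | 8 | 26 | 65.00% |
| Average | | | | | 48.12% |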

Well, that's disappointing, but not unexpected. Adding the two variables did not increase performance earlier either, so it's not terribly surprising that it didn't work this time around.

Doubling training set size

One thing we have experimented with is generating a larger number of synthetic samples, which allows us to sample a larger number of articles from the actual categories. We also found that rerating the articles that ought to be Low-importance seemed to be beneficial, although we need confirmation from WPMED on our selection of categories for doing that relabelling. Based on those results, we'll generate a larger training data set using the relabelled ratings, a stratified sample of Low-importance articles, and then introduce the clickstream and binary variables as before.

Note that because we use the relabelled ratings, we have to resample our test set, and therefore cannot compare performance directly with the previous results. That is also why we train a new benchmark classifier, which performs as follows:

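| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 24 | 13 | 3 | 0 | 60.00% |
| High | 8 | 15 | 15 | 2 | 37.50% |
| Mid | 2 | 13 | 17 | 8 | 42.50% |
| Low | 0 | 6 | 10 | 24 | 60.00% |
| Average | | | | | 50.00% |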

This benchmark has lower accuracy across the board compared to what we saw previously. There are several reasons why that might happen, and the size of the training and test datasets is again a likely cause. However, we have previously seen large differences in performance, and while we might not reach the same accuracy as before, it is the relative performance that is more important. We saw from our earlier test results that performance in this setting is most likely good; the question now is whether there is room for improvement through the clickstream data.

We first add the global clickstream variables as before, giving us the following confusion matrix:

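| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 22 | 18 | 0 | 0 | 55.00% |
| High | 1 | 22 | 16 | 1 | 55.00% |
| Mid | 1 | 10 | 21 | 8 | 52.50% |
| Low | 0 | 2 | 12 | 26 | 65.00% |
| Average | | | | | 56.88% |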

The classifier is still struggling with the Top-importance articles, but is doing better on the other three classes (most notably High-importance, 55% vs 37.5%), leading to a solid boost in overall performance. We next add the binary variable for Low-importance articles and get the following confusion matrix:

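| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 16 | 23 | 1 | 0 | 40.00% |
| High | 4 | 18 | 16 | 2 | 45.00% |
| Mid | 1 | 11 | 24 | 4 | 60.00% |
| Low | 0 | 2 | 12 | 26 | 65.00% |
| Average | | | | | 52.50% |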

This drop in overall performance is similar to what we saw when testing without the clickstream data. It is surprising that the performance on Top-importance articles is so low. Inspecting the Top-importance rated articles, we see that they all have a large number of views and inlinks. It might be that, as we discussed previously, we're pushing against the boundaries of the WPMED dataset, where Top- and High-importance might not be as distinct as they could be.

Lastly, we test with adding the last pair of project-specific clickstream variables:

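| Actual / Predicted | Top | High | Mid | Low | Accuracy |
|---|---|---|---|---|---|
| Top | 19 | 20 | 1 | 0 | 47.50% |
| High | 1 | 25 | 13 | 1 | 62.50% |
| Mid | 1 | 11 | 24 | 4 | 60.00% |
| Low | 0 | 2 | 11 | 27 | 67.50% |
| Average | | | | | 59.38% |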

Now the results of adding the two variables are reversed from before. This might be because of the increased size of the training dataset, although it might also largely be a story about our test set. I have been inspecting the cross-validation results from training the classifier, and the performance gain there is not as significant. Both with and without the variables the cross-validation performance is above 70%, with a gain of 0.3% when adding the last two variables. While in this case we have a classifier that performs well on our test set, we might also want to inspect the cross-validation results for all our classifiers in order to get some more signal, partly because those datasets are much larger (although they do include synthetic samples for Top-importance articles).

Cross-validation results

Checking the accuracy using 10-fold cross-validation on the training dataset gives us the following table:

| Dataset | Benchmark | Global props | Low-importance | Project props |
|---|---|---|---|---|
| 1,000 articles | 64.7% | 63.5% | 63.5% | 62.5% |
| 2,000 articles | 67.4% | 68.4% | 70.7% | 70.7% |

Here we see indications that the larger dataset provides a better fit, and that adding the project-specific clickstream variables does not provide any additional useful information. We are, in other words, at the point where feedback from WPMED is helpful.
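For readers unfamiliar with the procedure, 10-fold cross-validation as used for the table above can be sketched in plain Python as follows. This is a generic illustration and not the project's training code: `kfold_indices`, `cross_val_accuracy`, and the `majority_class` stand-in classifier are hypothetical names, and the toy data is made up so the example runs on its own.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_val_accuracy(features, labels, fit_predict, k=10, seed=0):
    """Pooled accuracy over all folds: every sample is tested exactly once."""
    correct = 0
    for train_idx, test_idx in kfold_indices(len(labels), k, seed):
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(labels)


def majority_class(train_x, train_y, test_x):
    # Stand-in for a real classifier: always predict the most common label.
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)


# Made-up toy data: 70 "Low" and 30 "Top" articles with a dummy feature.
toy_labels = ["Low"] * 70 + ["Top"] * 30
toy_features = [[0.0] for _ in toy_labels]

cv_accuracy = cross_val_accuracy(toy_features, toy_labels, majority_class)
print(f"10-fold CV accuracy of the majority-class baseline: {cv_accuracy:.2f}")
```

In a real run `fit_predict` would train the SVM on the training folds and predict the held-out fold. Note that when the training data contains synthetic samples, as it does here for Top-importance articles, those samples also end up in the held-out folds unless they are generated inside each fold.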