Research talk:Automated classification of article importance/Work log/2017-03-10

From Meta, a Wikimedia project coordination wiki

Friday, March 10, 2017

Today I will investigate how different approaches to measuring article views affect classifier performance. The current approach is to use four weeks of data and average across it. I will study whether averaging over a longer timespan changes performance, and also how splitting the data into shorter periods affects it.

I wrote a Python script that grabs up to 84 days of view data from the page view API, and then wrote a second script to process these views for each article in the dataset. The latter then writes out a myriad of summary statistics (e.g. various means and standard deviations) that we can test with our classifiers.
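As a rough illustration of the processing step, the sketch below computes means and standard deviations over the full 84 days, over 28-day periods, and over the four most recent weeks. It is not the script used here: the function names and the output layout are hypothetical, and it assumes a complete series of 84 daily counts ordered oldest to newest.

```python
from statistics import mean, stdev


def window_stats(daily_views, window):
    """Split a series into consecutive windows of `window` days and
    return one (mean, standard deviation) pair per window."""
    if len(daily_views) % window != 0:
        raise ValueError("series length must be a multiple of the window")
    stats = []
    for start in range(0, len(daily_views), window):
        chunk = daily_views[start:start + window]
        stats.append((mean(chunk), stdev(chunk)))
    return stats


def summarise(daily_views):
    """Summary statistics for one article (hypothetical layout).

    Assumes exactly 84 daily view counts, oldest first.
    """
    if len(daily_views) != 84:
        raise ValueError("expected 84 days of view data")
    return {
        "full_84": window_stats(daily_views, 84)[0],
        "periods_28": window_stats(daily_views, 28),
        "last_4_weeks": window_stats(daily_views[-28:], 7),
    }
```

Articles with fewer than 84 days of data would need separate handling, which this sketch leaves out.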

Our benchmark is the set of SVM results from earlier in the week, with an overall accuracy of 50.56%. I chose the Random Forest classifier for testing because it doesn't require any additional tuning, meaning it's fast and easy to evaluate. Once I have candidate models, I run them through the full SVM tuning process to evaluate them against the benchmark.

Averages using various timespans

For averages we test several combinations of time spans: the full 84 days; three separate 28-day periods; four separate one-week periods; one 28-day period and four separate one-week periods; both 28-day periods and four separate one-week periods; and the two most recent one-week periods. None of these gives higher performance than what we saw previously using just a 28-day average. What we see instead is that individual classes may improve or worsen slightly. The confusion matrix for a tuned SVM classifier using all 84 days is a good example of this type of result:

        Top  High  Mid  Low
Top     198   118   63   21
High    101   187   77   25
Mid      22   109  130  139
Low       3    25   81  291

If we compare this to the benchmark, we see that the classifier is slightly better at predicting High- and Low-importance articles (187 vs 177; 291 vs 286 correct predictions) but slightly less accurate on Top- and Mid-importance articles (198 vs 206; 130 vs 140 correct predictions).
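The per-class comparison can be reproduced with a few lines of Python. This is a minimal sketch: the matrix values and the benchmark counts are the ones quoted in this log, while the variable and function names are made up for illustration. It only looks at the diagonal (correct predictions), because the High row as printed sums to 390 while the other rows sum to 400, so an overall accuracy computed from these rows would be unreliable.

```python
CLASSES = ["Top", "High", "Mid", "Low"]

# Confusion matrix for the tuned SVM using all 84 days, as printed above.
all_84_days = [
    [198, 118, 63, 21],
    [101, 187, 77, 25],
    [22, 109, 130, 139],
    [3, 25, 81, 291],
]

# Correct predictions per class for the benchmark, as quoted in the text.
benchmark_correct = {"Top": 206, "High": 177, "Mid": 140, "Low": 286}


def correct_per_class(matrix):
    """Return the diagonal of a confusion matrix keyed by class name."""
    return {name: matrix[i][i] for i, name in enumerate(CLASSES)}


def change_vs_benchmark(matrix, benchmark):
    """Difference in correct predictions per class (candidate - benchmark)."""
    candidate = correct_per_class(matrix)
    return {name: candidate[name] - benchmark[name] for name in CLASSES}


changes = change_vs_benchmark(all_84_days, benchmark_correct)
for name in CLASSES:
    print(f"{name}: {changes[name]:+d}")
print("Net change:", sum(changes.values()))
```

The gains on High and Low are offset by the losses on Top and Mid, which is why the net change in correct predictions is close to zero.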

Standard deviations

Next we explore using standard deviations in addition to, or as substitutes for, the averages. The idea is to capture variation in popularity rather than popularity itself, in order to control for articles in a specific class that have large variations in popularity (e.g. because of a sudden change in interest). However, it turns out that standard deviations are strongly correlated with mean popularity (correlations of roughly 0.94–0.95). As a result, we gain very little information by using them, and classifier performance is similar to that shown previously.
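To illustrate the kind of check involved, here is a small sketch that computes the Pearson correlation between per-article means and standard deviations. The three view series are invented toy data, not values from the dataset, and the log does not state exactly how the correlation was computed (for example, whether views were transformed first), so treat this as one plausible way to do it.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: one week of daily views for three articles
# at very different popularity levels.
articles = {
    "a": [10, 12, 9, 11, 8, 10, 10],
    "b": [100, 130, 90, 110, 85, 95, 90],
    "c": [1000, 1400, 900, 1100, 800, 950, 850],
}

means = [mean(views) for views in articles.values()]
sds = [stdev(views) for views in articles.values()]
r = pearson(means, sds)
print(f"correlation between mean and standard deviation: {r:.3f}")
```

When the spread of daily views scales with the level of views, as in this toy data, the standard deviation carries little information beyond the mean.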

Lastly we experiment with labelling large shifts in popularity, using a categorical variable instead of standard deviations. Since we found that standard deviations, which arguably capture variation, were largely correlated with overall popularity, we instead want to capture whether an article's popularity seems out of the ordinary and give that a label. This approach is similar to the one used to identify trending articles in our ICWSM paper[1], although here we seek simpler approaches than the rather complex ARIMA models used in the paper. Another concern with the ARIMA models is that we trained them using 8+ weeks of view data for many very popular articles, while the current dataset contains many articles with very low popularity, meaning these models might not behave the way we expect.

In this case we calculate the mean popularity of each article for the most recent week, and then calculate a 99% confidence interval for the popularity across the four or eight weeks preceding the most recent one. If an article's popularity during the last week is above the upper limit of the 99% CI, we label it "+"; if it is below the lower limit, we label it "-"; otherwise we label it "0". We then experiment with whether adding this label, together with combinations of popularity averages, improves classifier performance.
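The labelling rule can be sketched as follows. The log does not say how the confidence interval was computed, so this example assumes a normal-approximation interval for the mean of the daily views in the baseline period; the function name and parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def label_shift(daily_views, baseline_days=28, level=0.99):
    """Label the most recent week as "+", "-" or "0".

    daily_views: daily counts, oldest first; the last 7 entries are the
    most recent week and the `baseline_days` before them the baseline.
    Assumption: the CI is a normal-approximation interval for the mean
    of the baseline's daily views.
    """
    if len(daily_views) < baseline_days + 7:
        raise ValueError("not enough days of view data")
    recent = daily_views[-7:]
    baseline = daily_views[-(7 + baseline_days):-7]

    # Two-sided critical value, about 2.576 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(baseline)
    half_width = z * stdev(baseline) / sqrt(len(baseline))

    recent_mean = mean(recent)
    if recent_mean > centre + half_width:
        return "+"
    if recent_mean < centre - half_width:
        return "-"
    return "0"
```

Passing `baseline_days=56` gives the eight-week variant. For articles with very few views, a normal approximation is a crude choice, so the labels for those articles should be read with caution.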

The results of this approach are similar to those seen previously. We might improve performance for one or two classes, but decrease performance on other classes. Overall accuracy does not change. In conclusion, we find that we might as well use the average popularity over four weeks as a signal of popularity.

References

  1. Morten Warncke-Wang, Vivek Ranjan, Loren Terveen, and Brent Hecht (2015). "Misalignment Between Supply and Demand of Quality Content in Peer Production Communities" (PDF). Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM).