Research talk:Autoconfirmed article creation trial/Work log/2018-02-22

Thursday, February 22, 2018

Today I'll revisit H18 and H19 based on our improved data gathering, then move on to H20, H21, and H22.

H18 and H19, revisited

We've further improved the data gathering of deletions and therefore revisit H18 and H19 to see whether this affects our results. The improvements come in two main areas: identifying deletions of redirects, and identifying usage of Special:Nuke to mass delete pages. We use a regex match on "redirect" to filter out all redirects from non-CSD deletions, and we also identify usage of R2, R3, and X1 in CSDs so we can ignore those as well. For mass deletions, we check for them before labelling a deletion as "other", and if one is found we count it as G5. Using this new data, we get the results described below.
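The labelling order described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the regular expressions, the "mass delet" marker for Special:Nuke, and the function name classify_deletion are all assumptions, and PROD and AfD detection is left out for brevity.

```python
import re

# Hypothetical patterns; the actual expressions used in the analysis are not shown in this log.
CSD_PATTERN = re.compile(r"\b([GARUX]\d{1,2})\b")
REDIRECT_PATTERN = re.compile(r"redirect", re.IGNORECASE)
MASS_DELETE_PATTERN = re.compile(r"mass delet", re.IGNORECASE)  # assumed Special:Nuke marker
IGNORED_CSD = {"R2", "R3", "X1"}  # redirect-related criteria we ignore


def classify_deletion(comment):
    """Return a reason label, or None if the deletion should be left out."""
    csd = CSD_PATTERN.search(comment)
    if csd:
        code = csd.group(1)
        return None if code in IGNORED_CSD else code
    # Non-CSD deletions: drop anything that mentions a redirect.
    if REDIRECT_PATTERN.search(comment):
        return None
    # Check for mass deletions before falling back to "other".
    if MASS_DELETE_PATTERN.search(comment):
        return "G5"
    return "other"


print(classify_deletion("Mass deletion of pages added by Example"))
```

The key design point is the ordering: the mass-deletion check sits directly ahead of the "other" fallback, so those deletions end up counted as G5 rather than inflating "other".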

H18: The reasons for deleting articles will remain stable.

Our previous analysis for the Main namespace was done on January 19 and January 22. We'll quickly repeat that analysis, first updating our graphs of number of deletions in the Main namespace, both since 2009 and since 2017.

There do not appear to be substantial differences between the updated versions and what we saw previously. Next, we investigate changes in the overall deletion rate, comparing the first two months of ACTRIAL with similar time periods of the years 2012–2016. Examining the distributions, we find that they are right-skewed but that a log-transformation resolves the issue. The skewness is also seen in the summary statistics for daily number of deletions: the pre-ACTRIAL mean is 390 articles/day, while the median is 367; during ACTRIAL the mean is 155.6 and the median 140. We expect a drop this large to be statistically significant, and the tests confirm it (t-test: t=22.49, df=76.65, p << 0.0001; Mann-Whitney U test: W=18256, p << 0.001).
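As an illustration of the comparison above, here is a minimal sketch of a two-sample t statistic on log-transformed daily counts. The non-integer degrees of freedom we report suggest Welch's unequal-variance test, so that is what the sketch assumes. The function name welch_t and the daily counts are made up for illustration and are not our data.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of each mean
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up daily deletion counts, for illustration only.
pre_trial = [367, 402, 351, 498, 330, 389, 612, 344]
trial = [140, 131, 178, 122, 205, 149]

# Log-transform first to reduce the right skew, then compare the means.
t, df = welch_t([math.log(x) for x in pre_trial], [math.log(x) for x in trial])
print(f"t={t:.2f}, df={df:.2f}")
```

The sketch stops at the statistic and degrees of freedom; turning those into a p-value requires the t distribution, which is not in the Python standard library.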

Next we update our table of changes in reasons for deletions:

Reason	Deletions/day pre-ACTRIAL	Deletions/day ACTRIAL	Delta deletions/day	Delta deletions/day (%)	Proportion pre-ACTRIAL (%)	Proportion ACTRIAL (%)	Delta proportion (%)
G1	3.10	0.38	−2.72	−87.8	0.79	0.24	−0.55
G2	6.87	1.03	−5.84	−85.0	1.76	0.66	−1.10
G3	18.61	2.03	−16.58	−89.1	4.77	1.31	−3.46
G4	4.06	3.51	−0.55	−13.5	1.04	2.25	1.21
G5	21.41	19.61	−1.80	−8.4	5.49	12.60	7.11
G7	17.70	7.82	−9.88	−55.8	4.54	5.03	0.49
G9	0.00	0.00	0.00	0.0	0.00	0.00	0.00
G10	6.24	0.51	−5.73	−91.9	1.60	0.33	−1.27
G11	47.99	12.77	−35.22	−73.4	12.31	8.21	−4.10
G12	16.76	4.59	−12.17	−72.6	4.30	2.95	−1.35
G13	0.01	0.03	0.02	200.0	0.00	0.02	0.02
A1	11.03	1.54	−9.49	−86.0	2.83	0.99	−1.84
A2	0.68	0.18	−0.50	−73.6	0.17	0.12	−0.05
A3	11.98	2.20	−9.78	−81.7	3.07	1.41	−1.66
A5	0.06	0.00	−0.06	−100.0	0.01	0.00	−0.01
A7	113.44	18.00	−95.44	−84.1	29.09	11.57	−17.52
A9	1.52	0.34	−1.18	−77.4	0.39	0.22	−0.17
A10	8.25	2.21	−6.04	−73.2	2.12	1.42	−0.70
A11	3.88	0.44	−3.44	−88.6	1.00	0.28	−0.72
PROD	32.23	19.80	−12.43	−38.6	8.26	12.73	4.47
AfD	44.80	41.75	−3.05	−6.8	11.49	26.83	15.36
Other	19.37	16.85	−2.52	−13.0	4.97	10.83	5.86

Given how we changed our data gathering, the changes in deletion rates show up for G5, PROD, AfD, and "other"; all other rows are unaltered (apart from changes in rounding). Because we filter redirects differently, the proportions have changed as well, mostly in minor ways. When it comes to changes during ACTRIAL, sorting the list on "delta deletions/day" results in a similar list, except that "other" no longer shows a large increase. In total, this means that our overall conclusions are unchanged: there's a significant reduction in the number of deletions in the Main namespace, which is mainly driven by a reduction in CSDs, and H18 is not supported.
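For readers who want to check the table columns, the sketch below recomputes the A7 row. It assumes that the proportions are each reason's daily rate divided by the overall mean daily deletions for the period (390 pre-ACTRIAL and 155.6 during ACTRIAL, as reported above). The log does not state this explicitly, but the A7 figures are consistent with it. The function name deletion_row is hypothetical.

```python
def deletion_row(pre_rate, trial_rate, pre_total, trial_total):
    """Recompute the derived columns of the table for one deletion reason."""
    delta = trial_rate - pre_rate
    # Guard against reasons with no pre-trial deletions (e.g. G9).
    delta_pct = 100 * delta / pre_rate if pre_rate else 0.0
    prop_pre = 100 * pre_rate / pre_total
    prop_trial = 100 * trial_rate / trial_total
    return {
        "delta": delta,
        "delta_pct": delta_pct,
        "prop_pre": prop_pre,
        "prop_trial": prop_trial,
        "delta_prop": prop_trial - prop_pre,
    }


# A7 row: 113.44 deletions/day pre-ACTRIAL, 18.00 during ACTRIAL.
row = deletion_row(113.44, 18.00, 390.0, 155.6)
print({name: round(value, 2) for name, value in row.items()})
```

Note that "Delta deletions/day (%)" is relative to the pre-ACTRIAL rate, while "Delta proportion (%)" is a difference in percentage points.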

H19: The reasons for deleting non-article pages will change towards those previously used for deletion of articles created by non-autoconfirmed users.

We analyzed the User and Draft namespaces on January 19 and January 23. These analyses are also updated using the new data.

User namespace deletions

We update our graphs showing the number of daily deletions in the User namespace:

While we did not see any major changes in the Main namespace graph, the historic User deletion graph does change: the updated version no longer contains several peaks found previously. This is most likely because we now detect redirects and remove them from the analysis.

We investigate whether the rate of deletions in the User namespace has changed during ACTRIAL. As before, we compare the first two months of ACTRIAL with similar time periods in 2012–2016. As with the Main namespace, the deletion data is right-skewed and we perform a log-transform, after which the data is closer to a Normal distribution. The skewness is also seen in the summary statistics: pre-ACTRIAL mean is 147.8 deletions/day, median is 136; ACTRIAL mean is 194.1 deletions/day and median is 150. While there appears to be a difference, the evidence for it is borderline: the t-test falls just below the conventional 0.05 threshold while the Mann-Whitney U test falls just above it, so we do not regard the difference as clearly statistically significant (t-test on log-transformed data: t=-2.05, df=72.61, p=0.044; Mann-Whitney U test: W=7864.5, p=0.057).
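The half-integer W above points to ties in the data. A minimal sketch of how the Mann-Whitney statistic handles them is shown below, assuming the common convention (used by R's wilcox.test) that W counts the pairs where a value from the first sample exceeds one from the second, with ties counting one half. The function name and the sample values are made up for illustration.

```python
def mann_whitney_u(sample_a, sample_b):
    """Count pairs (a, b) with a > b; each tie contributes 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up daily counts; the shared value 150 produces a tie.
pre_trial = [136, 150, 120]
trial = [150, 194, 140]
print(mann_whitney_u(pre_trial, trial))
```

This only computes the statistic. The p-value additionally requires the null distribution of U (or a Normal approximation with a tie correction), which is omitted here.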

Lastly, we update our table with data on how usage of deletion reasons has changed, again comparing the first two months of ACTRIAL with similar time periods of 2012–2016:

Reason	Deletions/day pre-ACTRIAL	Deletions/day ACTRIAL	Delta deletions/day	Delta deletions/day (%)	Proportion pre-ACTRIAL (%)	Proportion ACTRIAL (%)	Delta proportion (%)
Other	20.86	51.59	30.73	147.3	13.14	26.58	13.44
G2	1.24	1.43	0.19	15.0	0.78	0.73	−0.05
G3	4.17	4.52	0.35	8.4	2.63	2.33	−0.30
G5	5.31	4.41	−0.90	−17.0	3.35	2.27	−1.08
G7	4.34	3.02	−1.32	−30.5	2.73	1.55	−1.18
G10	2.14	2.69	0.55	25.5	1.35	1.39	0.04
G11	50.31	53.08	2.78	5.5	31.69	27.35	−4.34
G12	3.25	2.92	−0.33	−10.1	2.05	1.50	−0.55
G13	20.37	13.41	−6.96	−34.2	12.83	6.91	−5.92
U1	24.77	16.59	−8.18	−33.0	15.61	8.55	−7.06
U2	1.48	1.39	−0.09	−5.5	0.93	0.72	−0.21
U5	20.48	39.05	18.57	90.7	12.90	20.12	7.22

Compared with the earlier version of this table, the change to our data gathering method results in a lower rate of "other" deletions and a higher rate of G5. Apart from those, there are few changes to the proportions.

Draft namespace deletions

Lastly, we update our analysis of changes in the Draft namespace. First an update to our graphs showing daily number of deletions, both historically and more recent:

There are not a lot of changes in the deletion graphs. One peak in January 2018 disappears, which again shows how our redirect detection improves the dataset. Next, we look at the overall change in rate of deletions per day. Because the Draft namespace was created in late 2013, we use data for similar time periods in 2015 and 2016 to compare against ACTRIAL. During those two years, the deletion rates in the Draft namespace were relatively stable.

We find that the data is skewed in a similar way to the other two namespaces we have analyzed, and log-transform it to correct this. The summary statistics also reflect this skewness: pre-ACTRIAL mean 100.4 deletions/day, median 86.5; ACTRIAL mean 124.2 deletions/day, median 111.0. The difference is statistically significant (t-test: t=-3.72, df=193.48, p < 0.001; Mann-Whitney U test: W=4521, p=0.003). Again we note that the increase in rate of deletion is much smaller than the increase in number of pages created in the Draft namespace per day during ACTRIAL.

Lastly, we calculate changes in rate of deletion per reason and create our overview table:

Reason	Deletions/day pre-ACTRIAL	Deletions/day ACTRIAL	Delta deletions/day	Delta deletions/day (%)	Proportion pre-ACTRIAL (%)	Proportion ACTRIAL (%)	Delta proportion (%)
G2	2.32	3.05	0.73	31.4	2.54	2.42	−0.12
G3	2.48	4.07	1.59	63.7	2.72	3.23	0.51
G5	1.01	3.82	2.81	278.9	1.10	3.03	1.93
G7	2.84	4.18	1.34	47.0	3.11	3.32	0.21
G10	1.05	1.25	0.20	18.8	1.15	0.99	−0.16
G11	11.29	19.59	8.30	73.6	12.34	15.55	3.21
G12	3.81	6.80	2.99	78.5	4.17	5.40	1.23
G13	54.14	62.79	8.65	16.0	59.18	49.84	−9.34
AfD	0.88	0.13	−0.75	−85.2	0.97	0.10	−0.87
Other	11.65	20.30	8.65	74.2	12.73	16.11	3.38

Again the differences are found in G5, AfD, and "other", which is to be expected since that's where we have changes in our data gathering process. While the numbers and proportions have changed slightly, there are no very large changes in the table. Perhaps the most notable one is that AfD usage is now down, whereas it was previously shown as increasing.

The new data gathering has improved the quality of our data, but we do not find any changes in the general trends and subsequent conclusions based on them.

H20: The quality of articles entering the NPP queue will increase.

Our initial analysis of data related to H20 was done on January 8 and January 15, where we looked at creations by accounts that were up to 30 days old. H20 does not concern itself with account age, and we will therefore not filter based on account age. The hypothesis concerns the New Page Patrol queue, and we will therefore utilize our dataset of non-autopatrolled article creations in the Main namespace.

In this analysis, we choose not to consider pages moved into the Main namespace. One reason is that we expect several of them to have gone through AfC, thereby meeting some threshold of quality. Secondly, if we were to count moves we would also need to look at the user rights of the person moving the page, and perhaps also whether they patrolled the page after moving it. This process would be further complicated by the introduction of the patroller right in late 2016, in the sense that we might have to consider that a significant change. Prior to the introduction, an AfC reviewer could move a page and mark it as reviewed themselves, while afterwards that might not be possible. Lastly, as we found in our analysis of H9's related measure, most articles are created directly in the Main namespace. We see an analysis of pages moved into the Main namespace as an interesting opportunity for future research, but leave them out of this analysis.

The first thing to note about articles is that quite a lot of them get deleted, and some of them get deleted in such a way that we cannot retrieve them. This can be seen as an indicator of quality issues, e.g. that the content of the article was copied from somewhere else. Our first order of business is therefore to calculate the proportion of all pages in our dataset that cannot be retrieved. We have 3,053,757 non-autopatrolled article creations in our dataset spanning January 1, 2009 to November 30, 2017. Of those, 1,911,434 (62.6%) were retrievable. This indicates that about 37% of new article creations have severe quality issues, leading to them being permanently deleted.

How has this proportion developed over time, and has it changed during ACTRIAL? We plot the proportion per day, both starting from January 2009 to see the longer term trends, and from 2016 onwards to get a better look at the more recent years and the effect of ACTRIAL.

In the first graph, we can see that the proportion of deleted article creations had a couple of peaks in 2010 and 2011. During those times, the proportion typically moves between 40% and 50%. There is then a gradual reduction until mid-2012, after which the proportion holds steady with an average of about 35%. We can also see a clear reduction in the proportion in late 2017, which is easier to see in the 2016–2017 plot, where the dotted line shows the start of ACTRIAL. The proportion is generally stable with a mean slightly above 35% prior to the introduction of ACTRIAL, after which it drops distinctly to an average of about 15%.

We want to measure this drop and therefore calculate the mean daily proportion of permanently deleted article creations during the first two months of ACTRIAL, and compare it with the same measurement for the years 2012–2016. Based on the historical graph, the proportion was relatively stable during all of those five years. The distribution of this measurement appears to be roughly Normal, particularly the 2012–2016 values as that is a larger dataset, and the data appears to have little skewness. This is also reflected in how the mean and median are close to each other: pre-ACTRIAL mean 35.08%, median 34.88%; ACTRIAL mean 15.49%, median 15.21%. We therefore use a t-test to establish the statistical significance of this drop, finding that it is statistically significant (t=49.8, df=125, p << 0.001).

Given the stability of the proportion across time and the severity of the reduction during ACTRIAL, we see little reason to also verify this using a forecasting model. Because we see permanent deletion as an indicator of severe quality issues with the articles, this large reduction during ACTRIAL suggests an increase in quality of the created articles, thus supporting H20.

The next quality indicator we will look at is using ORES' draft quality model to determine whether an article has issues with how it is written. ORES' draft quality model can flag articles that are written as spam, attack pages, or appear to be vandalism. In this case, we do not look at specific flags for issues, but instead measure whether the model thinks the article is free of flaws (an "OK" label). Based on our earlier analysis of precision and recall in the draft quality model (found in our January 11 work log), we require the draft quality model to be at least 66.4% confident. We then calculate the daily proportion of articles that were retrievable that also were predicted to be "OK".
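The thresholding and daily aggregation described above can be sketched as follows. This is a minimal illustration that assumes each retrievable article comes with a creation date and the model's probability for the "OK" class, and that "at least 66.4% confident" means that probability is 0.664 or higher. The input shape and the function name daily_ok_proportion are hypothetical.

```python
from collections import defaultdict

OK_THRESHOLD = 0.664  # confidence cut-off quoted in the text


def daily_ok_proportion(articles):
    """articles: iterable of (creation_date, ok_probability) for retrievable articles."""
    totals = defaultdict(int)
    ok_counts = defaultdict(int)
    for day, ok_probability in articles:
        totals[day] += 1
        # Only count the article as "OK" if the model is confident enough.
        if ok_probability >= OK_THRESHOLD:
            ok_counts[day] += 1
    return {day: ok_counts[day] / totals[day] for day in totals}


# Made-up scores, for illustration only.
sample = [
    ("2017-09-15", 0.91),
    ("2017-09-15", 0.40),
    ("2017-09-15", 0.70),
    ("2017-09-16", 0.664),
]
proportions = daily_ok_proportion(sample)
```

Articles that fall below the threshold are simply counted as not "OK"; the sketch does not distinguish between spam, attack, and vandalism predictions, in line with the approach above.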

In order to make the variation in the data easier to see, we decided not to plot the proportion of "OK" articles but instead its complement, the proportion of articles not labelled "OK". As we did for the proportion of permanently deleted articles, we make one plot that starts in January 2009, and one that starts in January 2016:

Comparing the 2009–2017 graph with the previous one showing the proportion of permanently deleted articles, we can see similar trends in the earlier years with peaks in 2010 and 2011, followed by a reduction and stabilization in mid-2012. From then on, we see the trend stay in the 15–20% range. There is an interesting change in July 2017 with a sharp increase, followed by a sharp decrease once ACTRIAL starts. The increase in July is not seen in previous years, and further study might allow us to understand more about why this occurs. For now, we will not look into it and instead focus on examining if the higher proportion of articles labelled "OK" during ACTRIAL is unexpected.

We first compare the average proportion of articles predicted to be "OK" during the first two months of ACTRIAL with similar time periods of the five preceding years. The distribution of this proportion is found to be roughly Normal, meaning that we do not need to transform the data. This is also reflected in the summary statistics, where the mean and median are close to each other (pre-ACTRIAL mean 82.39%, median 82.37%; ACTRIAL mean 89.05%, median 89.06%). This difference in means is found to be statistically significant (t-test: t=-20.17, df=105.26, p << 0.001).

Due to the variation in the data and the odd change in proportion prior to ACTRIAL, we are also interested in studying this from a time series perspective. Given the increase from July until ACTRIAL starts, we adopt the approach used in several other hypotheses and calculate the proportion on a bimonthly basis (twice a month, giving 24 periods per year). This provides us with more stability in the measurement and removes challenges with making forecasts on a daily basis (e.g. holidays, or breaking news events).
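A minimal sketch of the bimonthly aggregation follows. It assumes that each month is split into days 1–15 and day 16 onwards, and that the proportion is computed from pooled counts rather than by averaging daily proportions. Neither detail is spelled out in this log, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def half_month_key(day):
    """Map a date to (year, month, half); half is 1 for days 1-15, else 2."""
    return (day.year, day.month, 1 if day.day <= 15 else 2)


def half_month_proportions(daily_counts):
    """daily_counts: iterable of (date, ok_count, total_count) with total_count > 0."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for day, ok_count, total_count in daily_counts:
        key = half_month_key(day)
        ok[key] += ok_count
        total[key] += total_count
    # Pool the counts within each half-month before dividing.
    return {key: ok[key] / total[key] for key in sorted(total)}


# Made-up counts, for illustration only.
counts = [
    (date(2017, 9, 14), 80, 100),
    (date(2017, 9, 15), 90, 100),
    (date(2017, 9, 16), 45, 50),
]
series = half_month_proportions(counts)
```

Pooling counts means busy days weigh more than quiet ones, which dampens the influence of days with very few creations.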

Calculating the proportion on a bimonthly basis results in the following graph:

The graph is generally similar to the previous one, but perhaps makes the increase and subsequent decrease in 2017 easier to see. We next investigate stationarity and seasonality in this time series. Given that there are shifts in the mean over time, we expect it to be non-stationary and find that to be the case. Secondly, we find strong seasonal regularity, indicating that our forecasting model should take the yearly cycle into consideration.

Using R's auto.arima, we find a candidate ARIMA(0,1,1)(1,0,0)[24] model. Checking the ACF and PACF of this model does raise some concerns as it has spikes at certain lags. We find that adding I(1) and switching from AR(1) to MA(1) in the seasonal component greatly improves model fit. Using an iterative approach, we find that adding complexity to the model does not provide a substantial benefit. We therefore choose an ARIMA(1,1,0)(0,1,1)[24] model and use it to forecast the first two months of ACTRIAL, resulting in the following graph:
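To make the two "I(1)" terms in the chosen model concrete, the sketch below applies one seasonal difference at lag 24 and one first difference to a series, which is what the d=1 and D=1 parts of ARIMA(1,1,0)(0,1,1)[24] do before the AR and MA terms are fitted. This is a plain-Python illustration on a made-up series, not the R model fit itself, and the function names are hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def arima_differencing(series, season_length=24):
    """Apply one seasonal difference (D=1) followed by one first difference (d=1)."""
    return difference(difference(series, lag=season_length), lag=1)


# Made-up series: a linear trend plus a pattern that repeats every 24 periods.
made_up = [2 * t + (t % 24) ** 2 for t in range(72)]
stationary = arima_differencing(made_up)
print(len(stationary), set(stationary))
```

The seasonal difference removes the repeating yearly pattern and the first difference removes what is left of the trend, leaving a series the remaining model terms can treat as stationary.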

We can see that the forecasting model takes the decrease from July 2017 onwards into account and forecasts a continuation of it. The true values (shown in red) show the strong increase in the proportion once ACTRIAL starts, and they are far from the forecast. This further establishes the increase during ACTRIAL as significant. Because we see the proportion of articles labelled as "OK" as an indicator of article quality, finding that the proportion increases also supports H20.

So far we've looked at the extent to which created articles are permanently deleted, and whether those that were not deleted were labelled "OK" by the ORES draft quality model. What remains is to examine the quality of the articles that are "OK". To do that, we use ORES' WP 1.0 article quality model. The model predicts which of the English Wikipedia's quality assessment classes an article would belong to. In order to get a more fine-grained assessment, we calculate a weighted sum across all classes for a prediction, the approach used by Halfaker in his 2017 research paper "Interpolating Quality Dynamics in Wikipedia and Demonstrating the Keilana Effect". Using that approach, a weighted sum of 1.0 means the article is Start-class quality, 2.0 is C-class, and so on.
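The weighted sum can be sketched as below. The weights for Start (1) and C (2) come from the description above. Placing Stub at 0 and continuing with B, GA, and FA at 3, 4, and 5 is our assumption based on the order of the assessment classes. The function name and the example probabilities are made up.

```python
# Assumed class weights: Stub=0, Start=1, C=2, B=3, GA=4, FA=5.
CLASS_WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}


def weighted_quality(probabilities):
    """Collapse per-class probabilities into a single score between 0 and 5."""
    return sum(CLASS_WEIGHTS[label] * p for label, p in probabilities.items())


# Made-up prediction: mostly Stub, with some probability on Start and C.
example = {"Stub": 0.5, "Start": 0.25, "C": 0.25, "B": 0.0, "GA": 0.0, "FA": 0.0}
print(weighted_quality(example))  # 0.75
```

A score of 0.75 reads as "between Stub and Start, closer to Start", which is the kind of fine-grained distinction a single predicted class cannot express.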

We calculate the average weighted sum of articles labelled as "OK" by the draft quality model on a daily basis and plot it across time. The graphs look as follows:

Looking at the average score across time, the most striking feature is that the trend is generally increasing. There are some changes to this trend in 2016 and 2017, indicating that it is not continuing. The graph showing 2016 and 2017 indicates that the trend could be flat, with a dip in September and October. Lastly, we see a decrease in July of 2017, followed by another slowly increasing trend. Note that there does not appear to be any particular change around the introduction of ACTRIAL.

Because of the slowly increasing trend and the changes in 2017, we do not want to compare the daily average during ACTRIAL against similar periods in previous years. Instead, we again apply our time series analysis approach and calculate a bimonthly average. Plotted across time, this average looks as follows:

We can again see the increasing trend across time, and how it changes in 2016. Secondly, we appear to have a bit of a decrease occurring in the early fall of recent years. Lastly, there is a distinct drop in July 2017, followed again by a slow increase.

Examining the stationarity and seasonality of the time series, we find that it is non-stationary, but that it is unclear whether it has a seasonal component. We will therefore examine various models with and without seasonal components in order to determine the best fit. First, we apply R's auto.arima function and its best fit is an ARIMA(1,1,2)(1,0,0)[24] model. Iterating through other candidate models, we find that none perform as well and therefore choose to go with the auto-selected one. Using it to forecast the first two months of ACTRIAL results in the following graph:

We can see in the graph above that the forecast for the first two months of ACTRIAL is very close to the true average. This indicates that ACTRIAL has not had an effect on the quality of articles that are flagged as "OK" by the draft quality model. In other words, it suggests that ACTRIAL's effect on quality is related to blatantly non-encyclopedic articles (spam, vandalism, etc.), because we do not see the trial have an effect on general article quality.

In summary, we find that H20 is partially supported because there has been a decrease in our two indicators suggesting quality issues: the proportion of permanently deleted creations, and the proportion of creations not flagged as "OK" by the draft quality model. At the same time, we note that ACTRIAL has had no effect on the quality of article creations flagged as "OK" by the draft quality model.