Research:New page reviewer impact analysis/Number of re-reviews

From Meta, a Wikimedia project coordination wiki

How has the introduction of the New Page Reviewer right affected the number of re-reviews that pages get each month?

The number of pages that need to be reviewed again is an important indicator of the quality of the reviews pages receive each month. Changes in this metric could therefore be informative about the effect of the New Page Reviewer right.

In general, any new page on Wikipedia follows the review model shown:

General new page review for Wikipedia

A new page may receive one or more reviews, or a deletion tag, depending on its content. A page may also receive two or more reviews in quick succession, or a deletion review may be reverted, depending on the quality of the earlier review. To get the true number of re-reviews, we first extract the data as follows:

Getting data

USE enwiki_p;
SELECT EXTRACT(YEAR FROM DATE_FORMAT(log_timestamp, '%Y%m%d%H%i%s')) AS `year`,
       EXTRACT(MONTH FROM DATE_FORMAT(log_timestamp, '%Y%m%d%H%i%s')) AS `month`,
       log_title, log_page, log_action, log_timestamp
FROM logging_logindex
WHERE log_type = 'pagetriage-curation'
  AND log_timestamp BETWEEN 20151001000000 AND 20170731000000
-- order by timestamp within each month so entries for a page stay in sequence
ORDER BY `year` ASC, `month` ASC, log_timestamp ASC;

Typical entries look like:

year month log_title log_page log_action log_timestamp
2015 10 Alexey_Severtsev 47981347 reviewed 20151001000237
2015 10 Pratap_pur_chhataura 47981016 reviewed 20151001000341
2015 10 Sabina_Ddumba 47980851 reviewed 20151001000514
2015 10 2015_Finlandia_Trophy 47987451 delete 20151001005242
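
The year and month columns can also be derived on the client side from log_timestamp, which MediaWiki stores as a 14-digit YYYYMMDDHHMMSS value. The following is a minimal sketch, assuming the rows have been loaded into pandas; the helper name add_year_month is hypothetical, and the sample rows are copied from the entries above:

```python
import pandas as pd

def add_year_month(df):
    """Return a copy of df with year and month derived from log_timestamp."""
    out = df.copy()
    # MediaWiki timestamps are 14-digit YYYYMMDDHHMMSS values
    ts = pd.to_datetime(out['log_timestamp'].astype(str), format='%Y%m%d%H%M%S')
    out['year'] = ts.dt.year
    out['month'] = ts.dt.month
    return out

# Two of the typical entries shown above
sample = pd.DataFrame({
    'log_title': ['Alexey_Severtsev', '2015_Finlandia_Trophy'],
    'log_page': [47981347, 47987451],
    'log_action': ['reviewed', 'delete'],
    'log_timestamp': [20151001000237, 20151001005242],
})
parsed = add_year_month(sample)
print(parsed[['log_title', 'year', 'month']])
```

Doing this in pandas rather than SQL keeps the query simpler, at the cost of parsing every timestamp after download.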

Working on the data

The query extracts page curation log entries between the given timestamps, where log_action may be reviewed, delete or tag. Inspecting the data led to the following assumptions for identifying potential re-reviews:

  • A tag entry is always immediately preceded by a review entry, because tagging is essentially a review. For analytics purposes, the tag entries can therefore be ignored.
  • Some pages have only review entries, with no tagging involved; these are counted as valid reviews.
  • Consecutive entries with the same log_page but different log_action values are likely to belong to the same session, so they are counted only once.

These assumptions are incorporated in the code below, which parses the dataset and generates the monthly counts:

dataset parsing
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

page_rereviewsset = 'quarry-20777-re-reviews-of-new-pages-run196734.tsv'
df = pd.read_csv(page_rereviewsset, delimiter='\t')
# get the years to iterate over
years = df['year'].unique()
page_rereviews = np.array([])
# aggregate the data for each month
for y in years:
    df_tmp = df[df['year'] == y]
    # get the unique months in this year
    months = df_tmp['month'].unique()
    for m in months:
        page_rereviews = np.append(page_rereviews, 0)
        # select this month's rows from the year subset
        reviews_per_month = df_tmp[df_tmp['month'] == m]
        prev_id = 0
        for index, row in reviews_per_month.iterrows():
            page_id = row['log_page']
            # Consecutive entries for the same page are likely from the same
            # session, so count an entry only when the page id changes
            if prev_id != page_id:
                page_rereviews[-1] = page_rereviews[-1] + 1
            prev_id = page_id

# Generate year-months for the x-axis ('MS' = month start)
months = pd.date_range('2015-11', periods=page_rereviews.shape[0], freq='MS')
# Store the aggregate data in wikitable format
with open('page_rereviews.wiki', 'w') as f:
    for i, m in enumerate(months):
        f.write('|-\n|{:%Y-%m}\n|{}\n'.format(m, page_rereviews[i]))

plt.figure()
plt.plot(months, page_rereviews, label="re-reviews", c='orange')
plt.ylabel('Re-reviews per month')
plt.xlabel('Months')
# apply the date format to the x-axis tick labels
xfmt = mdates.DateFormatter('%Y-%m')
plt.gca().xaxis.set_major_formatter(xfmt)
# mark the month in which the user right was implemented
npp_date = pd.Timestamp('2016-11-01')
plt.axvline(npp_date, color='b', linestyle='dashed', linewidth=2, label="NPP right implementation")
plt.text(npp_date, plt.gca().get_ylim()[1] + 10, 'NPP user right implementation', ha='center', va='center')
# draw the legend once every labelled artist exists
plt.legend()
plt.show()
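
The row-by-row counting can also be expressed without an explicit loop. The sketch below is an alternative formulation, not the code used to produce the results: it assumes the rows are already ordered by log_timestamp, count_sessions_per_month is a hypothetical helper name, and the toy data is invented for illustration. A row is counted whenever its log_page differs from the previous row's within the same month, which mirrors the prev_id check.

```python
import pandas as pd

def count_sessions_per_month(df):
    """Count runs of consecutive rows sharing a log_page, per (year, month)."""
    # Previous row's page id within the same month; NaN for the first row
    prev_page = df.groupby(['year', 'month'])['log_page'].shift()
    # A row starts a new run when its page differs from the previous row's
    starts_run = df['log_page'] != prev_page
    return starts_run.groupby([df['year'], df['month']]).sum()

# Invented toy rows, ordered by time
toy = pd.DataFrame({
    'year':       [2015, 2015, 2015, 2015, 2015, 2015],
    'month':      [10, 10, 10, 10, 11, 11],
    'log_page':   [1, 1, 2, 1, 1, 3],
    'log_action': ['reviewed', 'tag', 'reviewed', 'reviewed', 'reviewed', 'delete'],
})
counts = count_sessions_per_month(toy)
print(counts)
```

In the toy data, page 1 is reviewed and tagged in one session, then reviewed again after page 2, so October yields three counted entries and November two.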

Results

The plot for the average monthly re-reviews of new pages is shown:

Average monthly new page re-reviews

Some conclusions that could be drawn are:

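  • The number of re-reviews is highly erratic, ranging from 7,559 (2016-07) to 17,611 (2016-05) per month.
  • The number of re-reviews spiked just after the New Page Reviewer right was implemented (16,933 in 2016-11, up from 9,827 in 2016-10), and afterwards remained somewhat higher than in the months immediately before it (10,938 to 16,883 per month, against 7,559 to 9,827 for 2016-07 to 2016-10).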


Dataset

Year-Month Average page re-reviews
2015-11 15400.0
2015-12 15406.0
2016-01 14319.0
2016-02 15908.0
2016-03 14637.0
2016-04 16236.0
2016-05 17611.0
2016-06 14948.0
2016-07 7559.0
2016-08 8957.0
2016-09 8200.0
2016-10 9827.0
2016-11 16933.0
2016-12 10938.0
2017-01 11512.0
2017-02 12234.0
2017-03 16883.0
2017-04 11951.0
2017-05 14200.0
2017-06 15714.0
2017-07 13851.0
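
To put the comparison around the rights implementation in numbers, the monthly figures above can be averaged on either side of it. This is a minimal sketch over the published table; treating 2016-11 as the first month under the new right follows the dashed line in the plot and is an assumption of this sketch.

```python
from statistics import mean

# Monthly figures copied from the table above
monthly = {
    '2015-11': 15400, '2015-12': 15406, '2016-01': 14319, '2016-02': 15908,
    '2016-03': 14637, '2016-04': 16236, '2016-05': 17611, '2016-06': 14948,
    '2016-07': 7559,  '2016-08': 8957,  '2016-09': 8200,  '2016-10': 9827,
    '2016-11': 16933, '2016-12': 10938, '2017-01': 11512, '2017-02': 12234,
    '2017-03': 16883, '2017-04': 11951, '2017-05': 14200, '2017-06': 15714,
    '2017-07': 13851,
}

cutoff = '2016-11'  # assumed first month under the new right
# YYYY-MM strings sort chronologically, so plain string comparison works
before = [v for k, v in monthly.items() if k < cutoff]
after = [v for k, v in monthly.items() if k >= cutoff]

print('mean before {}: {:.0f}'.format(cutoff, mean(before)))
print('mean from {}: {:.0f}'.format(cutoff, mean(after)))
```

On these figures the mean is about 13,251 for the twelve months before and about 13,802 for the nine months from 2016-11, so much of the visible contrast comes from the low 2016-07 to 2016-10 period.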