Research:New page reviewer impact analysis/Number of re-reviews
How has the implementation of the New Page Reviewer Right impacted re-reviews that pages get monthly?
The number of pages that need to be reviewed again is an important indicator of the quality of the reviews that pages receive each month. Changes in this metric could therefore be informative about the effect of the New Page Reviewer right.
In general, any new page on Wikipedia follows the review model shown:
A new page may receive one or more reviews, or a deletion tag, depending on its content. A page may also receive two or more reviews in quick succession, or a deletion tag may be reverted, depending on the quality of the earlier review. To count re-reviews accurately, we first extract the data as follows:
Getting data
use enwiki_p;
SELECT EXTRACT(YEAR FROM DATE_FORMAT(log_timestamp,'%Y%m%d%H%i%s')) AS `year`,
EXTRACT(MONTH FROM DATE_FORMAT(log_timestamp,'%Y%m%d%H%i%s')) AS `month`,
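log_title, log_page, log_action, log_timestamp FROM logging_logindex WHERE log_type='pagetriage-curation'
AND log_timestamp BETWEEN 20151001000000 AND 20170731000000
-- log_timestamp breaks ties so that entries within a month stay in chronological order
ORDER BY `year` ASC, `month` ASC, log_timestamp ASC;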
Typical entries look like:
year | month | log_title | log_page | log_action | log_timestamp |
---|---|---|---|---|---|
2015 | 10 | Alexey_Severtsev | 47981347 | reviewed | 20151001000237 |
2015 | 10 | Pratap_pur_chhataura | 47981016 | reviewed | 20151001000341 |
2015 | 10 | Sabina_Ddumba | 47980851 | reviewed | 20151001000514 |
2015 | 10 | 2015_Finlandia_Trophy | 47987451 | delete | 20151001005242 |
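The `log_timestamp` values are 14-digit `YYYYMMDDHHMMSS` numbers, so the year and month could also be derived after export rather than in SQL. A minimal sketch, assuming the query result has been loaded into a pandas DataFrame; the helper name `add_year_month` is hypothetical and the sample rows are copied from the table above:

```python
import pandas as pd

def add_year_month(df):
    """Return a copy of df with integer year and month columns derived from log_timestamp."""
    out = df.copy()
    # Parse the 14-digit YYYYMMDDHHMMSS values into datetimes
    ts = pd.to_datetime(out['log_timestamp'].astype(str), format='%Y%m%d%H%M%S')
    out['year'] = ts.dt.year
    out['month'] = ts.dt.month
    return out

# Two rows taken from the sample output above
sample = pd.DataFrame({
    'log_title': ['Alexey_Severtsev', '2015_Finlandia_Trophy'],
    'log_page': [47981347, 47987451],
    'log_action': ['reviewed', 'delete'],
    'log_timestamp': [20151001000237, 20151001005242],
})
result = add_year_month(sample)
print(result[['log_title', 'year', 'month']])
```

Deriving the columns in pandas keeps the SQL shorter, at the cost of parsing every timestamp again on each run.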
Working on the data
This extracts review logs between the given timestamps, where log_action is one of reviewed, delete or tag. Based on observations of the data, the following assumptions were made for identifying potential re-reviews:
- A tag entry is always immediately preceded by a review entry, since tagging is essentially a review. For analytics purposes, we can therefore ignore the tag entries.
- Some entries are of review type only, with no tagging involved; we take these as valid reviews.
- Consecutive entries with the same log_page but a different log_action likely belong to the same session, so they are counted only once.
The above observations are incorporated in the code below which parses the dataset and generates observations:
dataset parsing
|
---|
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

page_rereviewsset = 'quarry-20777-re-reviews-of-new-pages-run196734.tsv'
df = pd.read_csv(page_rereviewsset, delimiter='\t')

# Years present in the data, in order of appearance
years = df['year'].unique()
page_rereviews = np.array([])

# Aggregate the data for each month
for y in years:
    df_tmp = df[df['year'] == y]
    # Unique months in this year
    months = df_tmp['month'].unique()
    for m in months:
        page_rereviews = np.append(page_rereviews, 0)
        # Build the mask from df_tmp itself (the original used a mask built from df)
        reviews_per_month = df_tmp[df_tmp['month'] == m]
        prev_id = 0
        for index, row in reviews_per_month.iterrows():
            page_id = row['log_page']
            # Consecutive entries for the same page likely belong to one
            # review session, so count only when the page id changes
            if prev_id != page_id:
                page_rereviews[-1] = page_rereviews[-1] + 1
            prev_id = page_id

# Month labels for the x-axis. The start month is hard-coded and must match
# the first month present in the TSV.
months = pd.date_range('2015-11', periods=page_rereviews.shape[0], freq='MS')

# Store the aggregate data in wikitable format
with open('page_rereviews.wiki', 'w') as f:
    for i, m in enumerate(months):
        f.write('|-\n|{:%Y-%m}\n|{}\n'.format(m, page_rereviews[i]))

# Use a real timestamp (not a string) so the marker lands on the date axis
npp_start = pd.Timestamp('2016-11-01')
plt.figure()
plt.plot(months, page_rereviews, label='re-reviews', c='orange')
plt.ylabel('Re-reviews per month')
plt.xlabel('Months')
plt.axvline(npp_start, color='b', linestyle='dashed', linewidth=2, label='NPP right implementation')
plt.text(npp_start, plt.gca().get_ylim()[1] + 10, 'NPP user right implementation', ha='center', va='center')
# Build the legend after all labelled artists have been added
plt.legend()
plt.show()
|
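The row-by-row loop above can be slow on a log with several hundred thousand entries. The same counting rule (count a row only when its `log_page` differs from the previous row in the same month) could be expressed with vectorized pandas operations. This is a sketch rather than the code used for the analysis; the function name `count_review_sessions` and the toy data are hypothetical:

```python
import pandas as pd

def count_review_sessions(df):
    """Count, per (year, month), the rows whose log_page differs from the previous row of that month."""
    # Previous page id within each month; the first row of a month gets NaN
    prev_page = df.groupby(['year', 'month'], sort=False)['log_page'].shift()
    # A row starts a new run when its page id differs from the previous one
    # (a comparison against NaN is always "different", so first rows count)
    starts_new_run = df['log_page'] != prev_page
    return starts_new_run.groupby([df['year'], df['month']], sort=False).sum()

# Toy data (hypothetical page ids), already in chronological order
toy = pd.DataFrame({
    'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015],
    'month': [10, 10, 10, 10, 11, 11, 11],
    'log_page': [1, 1, 2, 1, 1, 3, 3],
})
counts = count_review_sessions(toy)
print(counts)
```

Like the loop, this relies on the rows within each month being in chronological order, so that entries from one session sit next to each other.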
Results
The plot of monthly re-reviews of new pages is shown:
Some conclusions that could be drawn are:
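- The number of re-reviews is highly erratic, ranging from 7,559 (July 2016) to 17,611 (May 2016).
- The number of re-reviews spiked in November 2016, at the time of the New Page Reviewer right implementation (16,933, up from 9,827 in October 2016). After that it stayed above the July to October 2016 levels, although not above the levels seen before July 2016.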
Dataset
Year-Month | Page re-reviews per month |
---|---|
2015-11 | 15400.0 |
2015-12 | 15406.0 |
2016-01 | 14319.0 |
2016-02 | 15908.0 |
2016-03 | 14637.0 |
2016-04 | 16236.0 |
2016-05 | 17611.0 |
2016-06 | 14948.0 |
2016-07 | 7559.0 |
2016-08 | 8957.0 |
2016-09 | 8200.0 |
2016-10 | 9827.0 |
2016-11 | 16933.0 |
2016-12 | 10938.0 |
2017-01 | 11512.0 |
2017-02 | 12234.0 |
2017-03 | 16883.0 |
2017-04 | 11951.0 |
2017-05 | 14200.0 |
2017-06 | 15714.0 |
2017-07 | 13851.0 |
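One simple way to summarize the table is to compare the mean monthly count before and after the change. The sketch below assumes November 2016 as the cutoff, matching the marker on the plot; the values are copied from the table above and the variable names are illustrative:

```python
import pandas as pd

# Monthly values copied from the table above
counts = pd.Series(
    [15400, 15406, 14319, 15908, 14637, 16236, 17611, 14948, 7559, 8957,
     8200, 9827, 16933, 10938, 11512, 12234, 16883, 11951, 14200, 15714, 13851],
    index=pd.period_range('2015-11', '2017-07', freq='M'),
)

# Assumption: the right took effect in November 2016, as marked on the plot
cutoff = pd.Period('2016-11', freq='M')
before = counts[counts.index < cutoff]
after = counts[counts.index >= cutoff]

print('months before/after:', len(before), len(after))
print('mean before: {:.0f}'.format(before.mean()))
print('mean after:  {:.0f}'.format(after.mean()))
```

A comparison of two means hides the dip between July and October 2016, so it should be read alongside the per-month series rather than in place of it.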