Research talk:Surviving new editor/Work log/2014-02-05
Wednesday, Feb. 5th
editSo the first thing I want to do is to get a sense for how a few different survival periods (the time between and </math>t_3</math>) affect the measurement of "surviving" new editors.
- 2 to 6 months - Used in R:First edit session and R:The Rise and Decline
- 1 year + 1 month - Used in [The Editor Trends Study]
I'd also like to throw a few simple measures out there:
- 3 to 4 weeks - Measures short term (month) retention assuming that editors will tend to edit in the last week of the month
- 1 to 2 months - Measures medium term retention
- 1 to 2 years - Measures long term retention
One thing you'll notice right away is that there are more editors who appear in the 1 to 2 year survival period than survive in the 3 to 4 weeks period. At first this seems counter-intuitive. However, the 3 to 4 weeks survival period only considers edits that occur within a one week timespan while 1 to 2 years considers any edit that takes place within a whole year. However the large difference here points toward an important problem. If we chose 3 to 4 weeks as a survival period, we'd be missing many editors who will not edit during that time, but would edit again within a year.
Really, the key story here though is related to the similar trends seen in all metrics. The figure below plots factor difference between the surviving proportion for each "survival period" as compared to "2 to 6 months". I chose "2 to 6 months" since it lands between other potential metrics in proportion. A horizontal line at 1 would represent a perfect correlation.
There are a few things to pick out here:
- "1 to 2 years" and "1 year + 1 month" seem to be measuring something different than "2 to 6 months" before 2007.
- Other than this deviation, the proportional relationship between "2 to 6 months" and other survival periods holds relatively constant.
- About half the number of editors that appear in the "3 to 4 weeks" survival period will appear in "2 to 6 months"
I'm really not sure that this analysis gives us a clear winner or allows us to pick an arbitrary survival period and be done with it. Next, I'll extract some edit counts for all months for a users first year so that I can do some more nuanced comparisons.
SQL for monthly activity
|
---|
SELECT
DATABASE() AS wiki_db,
user_id,
SUM(revisions_month_1) AS revisions_month_1,
SUM(revisions_month_2) AS revisions_month_2,
SUM(revisions_month_3) AS revisions_month_3,
SUM(revisions_month_4) AS revisions_month_4,
SUM(revisions_month_5) AS revisions_month_5,
SUM(revisions_month_6) AS revisions_month_6,
SUM(revisions_month_7) AS revisions_month_7,
SUM(revisions_month_8) AS revisions_month_8,
SUM(revisions_month_9) AS revisions_month_9,
SUM(revisions_month_10) AS revisions_month_10,
SUM(revisions_month_11) AS revisions_month_11,
SUM(revisions_year_1) AS revisions_year_1,
SUM(revisions_year_2) AS revisions_year_2
FROM (
(SELECT
user_id,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*1+1 AND 30*2
) AS revisions_month_1,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*2+1 AND 30*3
) AS revisions_month_2,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*3+1 AND 30*4
) AS revisions_month_3,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*4+1 AND 30*5
) AS revisions_month_4,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*5+1 AND 30*6
) AS revisions_month_5,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*6+1 AND 30*7
) AS revisions_month_6,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*7+1 AND 30*8
) AS revisions_month_7,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*8+1 AND 30*9
) AS revisions_month_8,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*9+1 AND 30*10
) AS revisions_month_9,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*10+1 AND 30*11
) AS revisions_month_10,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*11+1 AND 365
) AS revisions_month_11,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 365*1+1 AND 365*2
) AS revisions_year_1,
SUM(
rev_timestamp IS NOT NULL AND
DATEDIFF(rev_timestamp, user_registration) BETWEEN 365*2+1 AND 365*3
) AS revisions_year_2
FROM user
LEFT JOIN revision ON
rev_user = user_id AND
rev_timestamp >= DATE_FORMAT(
DATE_ADD(user_registration, INTERVAL 31 DAY),
"%Y%m%d%H%i%S"
)
GROUP BY 1)
UNION
(SELECT
user_id,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*1+1 AND 30*2
) AS revisions_month_1,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*2+1 AND 30*3
) AS revisions_month_2,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*3+1 AND 30*4
) AS revisions_month_3,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*4+1 AND 30*5
) AS revisions_month_4,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*5+1 AND 30*6
) AS revisions_month_5,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*6+1 AND 30*7
) AS revisions_month_6,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*7+1 AND 30*8
) AS revisions_month_7,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*8+1 AND 30*9
) AS revisions_month_8,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*9+1 AND 30*10
) AS revisions_month_9,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*10+1 AND 30*11
) AS revisions_month_10,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*11+1 AND 365
) AS revisions_month_11,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 365*1+1 AND 365*2
) AS revisions_year_1,
SUM(
ar_timestamp IS NOT NULL AND
DATEDIFF(ar_timestamp, user_registration) BETWEEN 365*2+1 AND 365*3
) AS revisions_year_2
FROM user
LEFT JOIN archive ON
ar_user = user_id AND
ar_timestamp >= DATE_FORMAT(
DATE_ADD(user_registration, INTERVAL 31 DAY),
"%Y%m%d%H%i%S"
)
GROUP BY 1)
) user_span_revisions
GROUP BY 1,2;
|