Research talk:Wikipedia article creation/Work log/Friday, January 17th
Friday, January 17th
editToday I'm working on the last piece in the puzzle of newcomer page creations -- figuring out the scale at which all newcomers enter the page creation process.
So first things first, I need to generate a funnel for observing how many registered editors --> R:New editors --> New page creators --> New article creators --> New draft article creators.
SELECT
DATE(CONCAT(LEFT(user_registration, 6), "01")) AS registration_month,
COUNT(*) AS registered_users, /* registered users */
SUM(revision_stats.day_revisions > 0) AS new_editors,
SUM(page_stats.pages > 0) AS new_page_creators, /* users who started a any page in first 30 days */
SUM(page_stats.article_pages > 0) AS new_article_creators, /* users who started an article page in first 30 days */
SUM(page_stats.draft_pages > 0) AS new_draft_creators /* users who started a draft article page in first 30 days */
FROM (
SELECT
user_id,
user_registration,
SUM(day_revisions) AS day_revisions
FROM (
(
SELECT
user_id,
user_registration,
SUM(rev_id IS NOT NULL) AS day_revisions
FROM user
LEFT JOIN revision ON
rev_user = user_id AND
rev_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 1 DAY), "%Y%m%d%H%i%S")
WHERE
user_registration > "2008"
GROUP BY 1
)
UNION
(
SELECT
user_id,
user_registration,
SUM(ar_id IS NOT NULL) AS day_revisions
FROM user
LEFT JOIN archive ON
ar_user_text = user_name AND
ar_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 1 DAY), "%Y%m%d%H%i%S")
WHERE
user_registration > "2008"
GROUP BY 1,2
)
) AS revision_archive_stats
GROUP BY 1,2
) AS revision_stats
LEFT JOIN (
SELECT
user_id,
SUM(c.user_id IS NOT NULL) AS pages,
SUM(ap.page_title IS NOT NULL) AS article_pages,
SUM(ap.original_namespace != 0) AS draft_pages
FROM user
LEFT JOIN staging.nov13_dewiki_creation c USING (user_id)
LEFT JOIN staging.nov13_dewiki_article_page ap USING (page_id)
WHERE
user_registration > "2008" AND
c.rev_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 30 DAY), "%Y%m%d%H%i%S")
GROUP BY 1
) AS page_stats USING (user_id)
GROUP BY 1
I cheated a little bit. I got the above query wrong several times before I got it right.
Here's the results. First, the monthly proportion of users that make it to each part of the funnel. Since Enwiki is super big and takes forever to run, I'll start off with German.
~ 30% of newly registered users will make an edit within 24h. ~40% of these new editors will create some page. ~60% of them will have created an article (or a draft article). And depending on where you look, about %3 of those users will have created a draft (non ns=0 page that became an article). There's a concerning dip in the number of draft creators that has me thinking that there might be some anomaly in my data.
Check out the raw numbers:
The number of editors who create drafts each months drops from 30 to 0-2 in the middle of 2011. My best guess is that there was a substantial change in the way that page moves were recorded at that point. I'll have to pick through the data to see if I can figure out what's going on. --Halfak (WMF) (talk) 23:56, 17 January 2014 (UTC)