Research talk:Wikipedia article creation/Work log/Friday, December 6th
Friday, December 6th
editI'm working on incorporating AFC drafts into my analysis today. I have this table that I generated from "move revisions" that occurred in the English Wikipedia. This should help me figure out which pages were originally AFC drafts and which drafts were intended to be articles.
> select rev_comment from nov13_move_revision where rev_comment LIKE "%Wikipedia_talk:%" limit 3; +--------------------------------------------------------------------------------------------------------------------------------------------------------------+ | rev_comment | +--------------------------------------------------------------------------------------------------------------------------------------------------------------+ | moved [[Wikipedia talk:Articles for creation/!PAUS3]] to [[!PAUS3]]: Created via [[WP:AFC|Articles for Creation]] ([[WP:WPAFC|you can help!]]) | | moved [[User:Natgenex/$1,000 genome]] to [[Wikipedia talk:Articles for creation/$1,000 genome]]: Moving to [[Wikipedia:AFC|AFC]] space (bot) | | moved [[Wikipedia talk:Articles for creation/$1,000 genome]] to [[$1,000 genome]]: Created via [[WP:AFC|Articles for Creation]] ([[WP:WPAFC|you can help!]]) | +--------------------------------------------------------------------------------------------------------------------------------------------------------------+ 3 rows in set (0.03 sec)
It looks like the first thing that I should do is process this dataset to extract the page move titles and namespaces. Then I can use the resulting table to answer the questions:
- Which pages were ever in the main namespace?
- Which pages are/were an AFC draft?
One of the things I'm worried about is determining which drafts are rejected, but have yet to be deleted. I should be able to use categorylinks
to find that. I'll leave that problem for later. --Halfak (WMF) (talk) 16:29, 6 December 2013 (UTC)
> select count(*) from nov13_page where page_namespace = 5 AND page_title LIKE "Articles_for_creation/%"; +----------+ | count(*) | +----------+ | 184645 | +----------+ 1 row in set (0.27 sec)
Almost 200,000 AFC articles. I wonder how many of them have been deleted?
> select archived, count(*) from nov13_page where page_namespace = 5 AND page_title LIKE "Articles_for_creation/%" group by 1; +----------+----------+ | archived | count(*) | +----------+----------+ | 0 | 100195 | | 1 | 84450 | +----------+----------+ 2 rows in set (5 min 20.23 sec)
A little less than half. OK. The CSD criteria for G13 states that drafts that haven't been editing in two months should be deleted.
> select archived OR DATEDIFF("20131113010101", last_revision) >= 60, count(*) from nov13_page where page_namespace = 5 AND page_title LIKE "Articles_for_creation/%" group by 1; +-------------------------------------------------------------+----------+ | archived OR DATEDIFF("20131113010101", last_revision) >= 60 | count(*) | +-------------------------------------------------------------+----------+ | 0 | 13803 | | 1 | 170842 | +-------------------------------------------------------------+----------+ 2 rows in set (1.61 sec)
OK. 13.8k articles still alive that might still be good.
> select LEFT(first_revision, 4) AS year_created, count(*) from nov13_page where (NOT archived AND DATEDIFF("20131113010101", last_revision) <= 60) AND page_namespace = 5 AND page_title LIKE "Articles_for_creation/%" group by 1; +--------------+----------+ | year_created | count(*) | +--------------+----------+ | 2006 | 2 | | 2007 | 2 | | 2008 | 3 | | 2009 | 24 | | 2010 | 36 | | 2011 | 557 | | 2012 | 2140 | | 2013 | 11178 | +--------------+----------+ 8 rows in set (0.92 sec)
There are some old drafts in there! At least most of the drafts were started in 2013. It looks like I'll need to gather categories for this set of pages. I'm a little off temporally for finding drafts that have been rejected, but that shouldn't affect my analysis much.
OK. I think that I've got it. Here's the query to produce the table I need:
SELECT
page_id,
page_namespace,
page_title,
archived,
DATEDIFF("20131113010101", last_revision) >= 6*30 AS should_be_archived,
declined.cl_from IS NOT NULL AS declined_submission
FROM (
SELECT * FROM halfak.nov13_page
WHERE page_namespace = 5 AND
page_title LIKE "Articles_for_creation/%"
) AS afc_page
LEFT JOIN categorylinks declined ON
cl_from = page_id AND
cl_to LIKE "AfC_submissions_declined%";
I've got it running now. --Halfak (WMF) (talk) 22:47, 6 December 2013 (UTC)
I made a mistake in my notes above. The G13 cutoff is 6 months, not 60 days. I've updated my query above to account for the difference. --Halfak (WMF) (talk) 23:17, 6 December 2013 (UTC)