Research talk:Wikipedia article creation/Work log/Friday, December 27th

Friday, December 27th

edit

Today, I'm babysitting my analysis of dewiki since part of my pipeline broke down over the holiday. It was due to a set of duplicate key values that I had to clean up. In the meantime, I'm working on building a dashboard for drafts.

I want a dataset that provides daily observations of the number of drafts, created, published (moved to main) and unpublished (deleted or moved out of main). This is hard to do since, as I've covered previously, moves and deletions are not covered well in the logging table.

Given these issues, I might be simply creating a dashboard of the number of pages currently in the Draft namespace. First, let me find out if I can get draft creations from the recentchanges table. I'll need to know if the rc_namespace changes when a page gets moved.

> select rc_title, rc_namespace, rc_params from recentchanges where rc_log_type = "move" and rc_log_action = "move" order by rc_id desc limit 10;
+--------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------+
| rc_title                                         | rc_namespace | rc_params                                                                                                |
+--------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------+
| Asuka_120%_Burning_Fest                          |            1 | a:2:{s:9:"4::target";s:29:"Talk:Asuka 120% BURNING Fest.";s:10:"5::noredir";s:1:"0";}                    |
| Asuka_120%_Burning_Fest                          |            0 | a:2:{s:9:"4::target";s:24:"Asuka 120% BURNING Fest.";s:10:"5::noredir";s:1:"0";}                         |
| AEK_Athens_B.C._season_2013-2014                 |            1 | a:2:{s:9:"4::target";s:37:"Talk:2013–14 AEK Athens B.C. season";s:10:"5::noredir";s:1:"0";}              |
| AEK_Athens_B.C._season_2013-2014                 |            0 | a:2:{s:9:"4::target";s:32:"2013–14 AEK Athens B.C. season";s:10:"5::noredir";s:1:"0";}                   |
| PrimalBlueWolf/Command:_Modern_Air_Naval_Warfare |            2 | a:2:{s:9:"4::target";s:48:"PrimalBlueWolf/Command: Modern Air Naval Warfare";s:10:"5::noredir";s:1:"0";} |
| Home_Free_Vocal_Band                             |            0 | a:2:{s:9:"4::target";s:16:"Home Free (band)";s:10:"5::noredir";s:1:"0";}                                 |
| Articles_for_creation/Arthur_Miles_Moss          |            5 | a:2:{s:9:"4::target";s:17:"Arthur Miles-Moss";s:10:"5::noredir";s:1:"0";}                                |
| Havelländisches_Luch                             |            0 | a:2:{s:9:"4::target";s:14:"Havelland Luch";s:10:"5::noredir";s:1:"0";}                                   |
| KNRQ-FM                                          |            1 | a:2:{s:9:"4::target";s:9:"Talk:KLVP";s:10:"5::noredir";s:1:"0";}                                         |
| Judicial_Branch_of_Colombia                      |            1 | a:2:{s:9:"4::target";s:26:"Talk:Judiciary of Colombia";s:10:"5::noredir";s:1:"0";}                       |
+--------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------+
10 rows in set (0.05 sec)

OK. Now I pick one and look for the creating revision. I'd prefer a page that crossed namespace.

> select * from page where page_namespace = 0 and page_title = "Arthur_Miles-Moss";
+----------+----------------+-------------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+
| page_id  | page_namespace | page_title        | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len |
+----------+----------------+-------------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+
| 41172977 |              0 | Arthur_Miles-Moss |                   |            0 |                0 |           0 | 0.526640257948 | 20131227174052 | NULL               |   587942780 |     3961 |
+----------+----------------+-------------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+
1 row in set (0.03 sec)
> select min(rev_timestamp) from revision where rev_page = 41172977;
+--------------------+
| min(rev_timestamp) |
+--------------------+
| 20131124012813     |
+--------------------+
1 row in set (0.03 sec)

Darn. I just missed the recentchanges cutoff.

> select min(rc_timestamp) from recentchanges;
+-------------------+
| min(rc_timestamp) |
+-------------------+
| 20131127175925    |
+-------------------+
1 row in set (0.04 sec)

I ran through a few before I found one.

> select min(rev_timestamp) from revision where rev_page = 41460310;
+--------------------+
| min(rev_timestamp) |
+--------------------+
| 20131225225510     |
+--------------------+
1 row in set (0.03 sec)

Now to look for the record of the page creation in the recentchanges table.

> select min(rev_id) from revision where rev_page = 41460310;
+-------------+
| min(rev_id) |
+-------------+
|   587675815 |
+-------------+
1 row in set (0.04 sec)

> select * from recentchanges where rc_this_oldid = 587675815;
+-----------+----------------+-------------+----------+--------------+--------------+-----------------+------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+----------------+-------------------+--------------+----------------+------------+------------+------------+----------+-------------+---------------+-----------+
| rc_id     | rc_timestamp   | rc_cur_time | rc_user  | rc_user_text | rc_namespace | rc_title        | rc_comment | rc_minor | rc_bot | rc_new | rc_cur_id | rc_this_oldid | rc_last_oldid | rc_type | rc_source | rc_moved_to_ns | rc_moved_to_title | rc_patrolled | rc_ip          | rc_old_len | rc_new_len | rc_deleted | rc_logid | rc_log_type | rc_log_action | rc_params |
+-----------+----------------+-------------+----------+--------------+--------------+-----------------+------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+----------------+-------------------+--------------+----------------+------------+------------+------------+----------+-------------+---------------+-----------+
| 625836506 | 20131225225510 |             | 16302367 | MarkTraceur  |          118 | Big_Data_(band) | First text |        0 |      0 |      1 |  41460310 |     587675815 |             0 |       1 | mw.new    |              0 |                   |            0 | <snip> |          0 |        957 |          0 |        0 | NULL        |               |           |
+-----------+----------------+-------------+----------+--------------+--------------+-----------------+------------+----------+--------+--------+-----------+---------------+---------------+---------+-----------+----------------+-------------------+--------------+----------------+------------+------------+------------+----------+-------------+---------------+-----------+
1 row in set (0.04 sec)

Cool! So the recentchanges table does not get it's rc_namespace updated on page move, so I can use it to look for draft creations.  :) --Halfak (WMF) (talk) 18:24, 27 December 2013 (UTC)Reply


Something is wrong :/

> SELECT
    ->     DATE(rc_timestamp) AS day_created,
    ->     COUNT(*) AS created
    -> FROM recentchanges
    -> WHERE rc_type = 1
    -> AND rc_namespace = 118 /* Drafts */
    -> GROUP BY 1;
+-------------+---------+
| day_created | created |
+-------------+---------+
| 2013-12-18  |       9 |
| 2013-12-20  |      29 |
| 2013-12-22  |       1 |
| 2013-12-23  |       7 |
| 2013-12-25  |       2 |
| 2013-12-26  |       1 |
+-------------+---------+
6 rows in set (0.04 sec)

I should see a lot more page creations than that.

select count(*) from page where page_namespace = 118;
+----------+
| count(*) |
+----------+
|      115 |
+----------+
1 row in set (0.04 sec)

Oh! Of course. Those are moves. Let's find an example to be sure.

> select page_id, page_title, page_namespace from page left join recentchanges ON page_title = rc_title and page_namespace = rc_namespace AND rc_type = 1 WHERE page_namespace = 118 AND rc_id IS NULL LIMIT 5;
+----------+--------------------------------------------------+----------------+
| page_id  | page_title                                       | page_namespace |
+----------+--------------------------------------------------+----------------+
| 41369619 | Adam_Jury                                        |            118 |
| 41416174 | Ahead_Learning_Systems                           |            118 |
| 41435807 | Beth_Sotelo                                      |            118 |
| 31482973 | Bramalea_Limited                                 |            118 |
| 31623099 | Brampton_Outstanding_Business_Achievement_Awards |            118 |
+----------+--------------------------------------------------+----------------+
5 rows in set (0.04 sec)

... so I did a bunch of digging to get to this point, but here is the revision that shows "Adam_Jury" was moved to drafts from elsewhere:

> select rev_comment from revision where rev_id = 587441533;
+---------------------------------------------------------------------------------------------------------------------------------+
| rev_comment                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------+
| BOZ moved page [[User:BOZ/Adam Jury]] to [[Draft:Adam Jury]] without leaving a redirect: will release this one to [[WP:DRAFTS]] |
+---------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)

It seems that most "drafts" get there by being moved in. I'm going to want to handle this too. *sigh* Fun story is that I can't do it. The log table does not have a query-able field that will tell me where a page was move to. I guess that I can guess a bit though. If log_params contains "%Draft:%", it probably means a page was moved into the draft namespace.

> SELECT
    ->     DATE(log_timestamp) AS day_moved,
    ->     COUNT(DISTINCT log_title) AS moved
    -> FROM logging
    -> WHERE log_type = "move"
    -> AND log_action IN ("move", "move_redir")
    -> AND log_namespace != 118
    -> AND log_params LIKE "%Draft:%"
    -> GROUP BY 1;
+------------+-------+
| day_moved  | moved |
+------------+-------+
| 2008-04-11 |     9 |
| 2010-01-10 |     1 |
| 2011-04-10 |     1 |
| 2013-12-18 |    10 |
| 2013-12-19 |     2 |
| 2013-12-20 |     3 |
| 2013-12-21 |     4 |
| 2013-12-22 |    17 |
| 2013-12-23 |     8 |
| 2013-12-24 |    20 |
| 2013-12-26 |     2 |
| 2013-12-27 |     2 |
+------------+-------+
12 rows in set (18.18 sec)

Well... that's messy. Note that there were some "draft moves" that occurred before the draft namespace existed in 2008, 2010 and 2011. I'll just ignore those I think and put a timestamp bound. --Halfak (WMF) (talk) 18:58, 27 December 2013 (UTC)Reply


OK. Now to bring it all together. Here's what I've got.

CREATE TEMPORARY TABLE halfak.daily_drafts_created
SELECT
    DATE(rc_timestamp) AS day_date,
    COUNT(*) AS created
FROM recentchanges
WHERE rc_type = 1
AND rc_namespace = 118 /* Drafts */
GROUP BY 1;

CREATE TEMPORARY TABLE halfak.daily_drafts_moved
SELECT
    DATE(log_timestamp) AS day_date,
    COUNT(DISTINCT log_title) AS moved
FROM logging
WHERE log_type = "move"
AND log_action IN ("move", "move_redir")
AND log_namespace != 118
AND log_params LIKE '%Draft:%'
AND log_timestamp > '20131211'
GROUP BY 1;

CREATE TEMPORARY TABLE halfak.daily_drafts_deleted
SELECT
    DATE(log_timestamp) AS day_date,
    COUNT(*) AS deleted
FROM logging 
WHERE log_type = 'delete'
AND log_action = 'delete'
AND log_namespace = 118
GROUP BY 1;

CREATE TEMPORARY TABLE halfak.daily_drafts_published
SELECT
    DATE(log_timestamp) AS day_date,
    COUNT(*) AS published
FROM logging 
WHERE log_type = 'move'
AND log_action = 'move'
AND log_namespace = 118
AND log_params NOT LIKE '%Draft:%'
GROUP BY 1;

SELECT
    day_date,
    IFNULL(created, 0) AS created,
    IFNULL(moved, 0) AS move,
    IFNULL(published, 0) AS published,
    IFNULL(deleted, 0) AS deleted
FROM halfak.daily_drafts_created
LEFT JOIN halfak.daily_drafts_moved USING (day_date)
LEFT JOIN halfak.daily_drafts_deleted USING (day_date)
LEFT JOIN halfak.daily_drafts_published USING (day_date);

Now to set up a script to cron this out as a public dataset. --Halfak (WMF) (talk) 18:59, 27 December 2013 (UTC)Reply


Should be available here: http://stat1001.wikimedia.org/public-datasets/enwiki/drafts/daily_drafts.tsv. Enjoy. :) --Halfak (WMF) (talk) 19:31, 27 December 2013 (UTC)Reply

Return to "Wikipedia article creation/Work log/Friday, December 27th" page.