Research talk:Wikipedia article creation/Work log/Wednesday, November 13th

Wednesday, November 13th

edit

I'm sitting down to work on this late today due to meetings and email. Here's my plan

  1. Clean up my understand of page curation action
  2. Plot a timeseries of page creations, deletions and curation actions
  3. Study the AfC workflow and devise a data extraction plan

Page curation actions

edit
> select distinct log_action from logging where log_type = "pagetriage-curation" and log_timestamp >= "201311";
+------------+
| log_action |
+------------+
| reviewed   |
| tag        |
| delete     |
| unreviewed |
+------------+
4 rows in set (0.06 sec)

"reviewed" and "unreviewed" are straightfoward. It appears that we get a "tag" action when a user adds a set of cleanup templates to the page. But does "delete" mean delete?

Let's get a page curation "delete" action

> select log_page, log_namespace, log_title from logging where log_type="pagetriage-curation" and log_action = "delete" limit 1;
+----------+---------------+------------+
| log_page | log_namespace | log_title  |
+----------+---------------+------------+
| 36943599 |             0 | Ryan_Water |
+----------+---------------+------------+
1 row in set (0.12 sec)


OK. Now let's see what type of delete actions have affected this page.

> select log_timestamp, log_action, log_type, log_page, log_namespace, log_title from logging where log_action = "delete" and log_namespace = 0 and log_title = "Ryan_Water";
+----------------+------------+---------------------+----------+---------------+------------+
| log_timestamp  | log_action | log_type            | log_page | log_namespace | log_title  |
+----------------+------------+---------------------+----------+---------------+------------+
| 20120907071509 | delete     | pagetriage-curation | 36943599 |             0 | Ryan_Water |
| 20120907071509 | delete     | pagetriage-deletion | 36943599 |             0 | Ryan_Water |
| 20120907071645 | delete     | delete              |        0 |             0 | Ryan_Water |
+----------------+------------+---------------------+----------+---------------+------------+
3 rows in set (0.04 sec)

Ok... So it looks like "pagetriage-curation" with "delete" isn't really a delete. I bet it adds a Prod or CSD tag.

> select ar_rev_id, ar_timestamp, ar_comment from archive where ar_title = "Ryan_Water" and ar_namespace = 0;
+-----------+----------------+-----------------------------------------------------------------------------+
| ar_rev_id | ar_timestamp   | ar_comment                                                                  |
+-----------+----------------+-----------------------------------------------------------------------------+
| 511186784 | 20120907071335 | [[WP:AES|←]]Created page with 'Ryan water is your GOD'                      |
| 511186883 | 20120907071508 | Nominated page for deletion using [[Wikipedia:Page Curation|Page Curation]] |
+-----------+----------------+-----------------------------------------------------------------------------+
2 rows in set (0.24 sec)

Deletion event lacks page_id

edit

OK. So upon review, it looks like the actual deletion event does not store the page_id. This breaks down a *huge* set of assumptions. I'm going to have go back and revisit my main queries.

This leaves me with a couple of options.

  1. generate the deletion timeseries based on which pages are currently in the archive table
    • This sucks because you can't tell when the pages were deleted
  2. generate raw counts of creations, deletions and restorations
    • This sucks because pages can be deleted and restored several times

I think I'll do both.

The following should get me the number of pages created per day with a count of the number of pages that are currently archived.

SELECT
    LEFT(creation.rev_timestamp, 8) AS date,
    page_namespace,
    COUNT(*) AS pages,
    SUM(archived) AS archived
FROM halfak.nov13_page AS page
INNER JOIN halfak.nov13_creation AS creation USING (page_id)
GROUP BY 1,2;

Yikes. I just went to run this and saw that there are zero non-deleted pages in my table. It turns out that the second part of my page gathering query -- the part that gets pages that haven't been deleted -- failed to run. I just kicked that off again and sequenced it with a query to regenerate the creations table too.


Articles for Creation workflow

edit

While I'm waiting for those queries to finish, I pinged TheOriginalSoni to come chat.

Draft submissions
Currently pending submissions
Accepted submissions
  • All ns=0 pages corresponding to ns=1 pages in en:Category:Accepted_AfC_submissions
    • Time accepted == Timestamp of first log_type="move" and log_action="move" log event where page was placed in ns=0
Decline submissions
  • All pages in subcategories of en:Category:Declined_AfC_submissions (note, will have to enumerate)
    • Time declined == ???
  • All deleted pages with ns=5 (Wikipedia_talk) and page_title prefixed by "Articles_for_Creation/"
    • Time declined == Timestamp of first log_type="delete" and log_action="delete" log event where page with exact title was deleted.

This approach misses:

Accepted submissions
  • All pages where the Template containing en:Category:Accepted_AfC_submissions was removed.
    • Not sure how often this might happen, but there's no good way to look for the category historically.
Declined submissions
  • User sandbox pages that have been deleted won't have "Articles_for_Creation".
    • These types of pages should be an ignorable minority.
Return to "Wikipedia article creation/Work log/Wednesday, November 13th" page.