Research:Automated classification of draft quality
This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.
The average time between edits when an editor is "in-session" is 7 minutes, but the median time until a new article is tagged for deletion is 2 minutes. The most common reason for deletion tagging (in English Wikipedia) is A7: "No indication of importance". It seems likely that newcomer article creators are *adding* a credible assertion of importance in a second edit, and that this edit is blocked by an edit conflict with the deletion tag. Research suggests that this early, negative feedback is one of the leading predictors that a newcomer will stop editing Wikipedia entirely.
Methods
The reason we need to review new page creations so quickly is to get rid of spam and egregious vandalism. Most other types of potentially undesirable new articles would not cause damage if they were left alone for a little while: long enough for the creator to finish their initial sequence of edits. Using a machine learning classifier, we can split the feed of newly created pages into two review backlogs: one for fast review of spam and egregious vandalism, and another for slower review of all other new articles.
The ORES service would be a great place to build and host such a model, and the Research:Revision scoring as a service project team would be interested in providing support and advice.
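The two-backlog split described above can be sketched as follows. This is only an illustration under stated assumptions: `split_feed`, `toy_classifier`, and the page dictionaries are hypothetical names, and the keyword rule stands in for a trained model that this draft does not yet specify.

```python
# Labels assumed to need fast review; the names follow the labeled dataset in this draft.
FAST_REVIEW_LABELS = {"spam", "vandalism", "attack"}


def split_feed(pages, classify):
    """Split new pages into (fast_backlog, slow_backlog) using a classifier.

    `classify` is any callable that maps a page to a label string.
    """
    fast, slow = [], []
    for page in pages:
        if classify(page) in FAST_REVIEW_LABELS:
            fast.append(page)
        else:
            slow.append(page)
    return fast, slow


def toy_classifier(page):
    """Hypothetical stand-in for a trained model: a single keyword rule."""
    return "spam" if "buy now" in page["text"].lower() else "OK"


pages = [
    {"title": "A", "text": "Buy now! Cheap pills"},
    {"title": "B", "text": "A village in Peru."},
]
fast, slow = split_feed(pages, toy_classifier)
```

Pages that land in the slow backlog are not exempt from review; they are simply reviewed after the creator has had time to finish their initial sequence of edits.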
Labeled data
The labeling query (see below) was run for each month between Aug. 2015 and Aug. 2016 to acquire a dataset with 907,415 observations:
- 881,159 "OK"
- 26,256 otherwise
  - 6,506 "vandalism"
  - 2,451 "attack"
  - 17,704 "spam"
The full dataset can be downloaded from the github repository: https://github.com/wiki-ai/draftquality/tree/master/datasets
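The counts above show a heavily imbalanced dataset. A minimal sketch of how the class shares could be computed from the two top-level counts is given below. The inverse-frequency weights are an assumption about how a classifier might be trained on such data; the draft itself does not specify any weighting scheme.

```python
# Top-level counts reported for the Aug. 2015 to Aug. 2016 dataset.
counts = {"OK": 881_159, "otherwise": 26_256}

total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Assumed inverse-frequency weights: total / (number_of_classes * class_count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.4f}, weight={weights[label]:.2f}")
```

With shares this skewed, a model that always predicts "OK" would look accurate while catching none of the pages that need fast review, so evaluation should not rely on accuracy alone.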
Labeling query
SELECT
page_title,
rev_id,
rev_timestamp AS creation_timestamp,
FALSE AS archived,
"OK" AS draft_quality
FROM revision
INNER JOIN page ON
rev_page = page_id
WHERE
rev_timestamp BETWEEN @start AND @end AND
rev_parent_id = 0 AND
page_namespace = 0
UNION ALL
SELECT
ar_title AS page_title,
ar_rev_id AS rev_id,
ar_timestamp AS creation_timestamp,
TRUE AS archived,
-- Map the speedy-deletion criterion in the log comment to a label
IF(log_comment REGEXP "WP:CSD#G3\\|", "vandalism",
IF(log_comment REGEXP "WP:CSD#G10\\|", "attack",
IF(log_comment REGEXP "WP:CSD#G11\\|", "spam", "OK"))) AS draft_quality
FROM archive
LEFT JOIN logging speedy_delete ON
log_namespace = ar_namespace AND
log_title = ar_title AND
log_type = "delete" AND
log_action = "delete" AND
log_comment LIKE "[[WP:CSD#%" AND
log_comment REGEXP "WP:CSD#(G3|G10|G11)\\|" AND
log_timestamp > ar_timestamp
WHERE
ar_timestamp BETWEEN @start AND @end AND
log_timestamp BETWEEN @start AND @end AND
ar_parent_id = 0 AND
ar_namespace = 0
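The nested `IF` expression in the query maps a deletion log comment to a label. The same mapping can be sketched in Python, for example to relabel rows offline. The function name `label_from_comment` and the sample comments are illustrative assumptions; only the three patterns and the label names come from the query.

```python
import re

# Patterns are checked in the same order as the nested IF in the query.
# The trailing "\|" matches the literal pipe in links such as [[WP:CSD#G3|G3]].
CRITERIA = [
    (re.compile(r"WP:CSD#G3\|"), "vandalism"),
    (re.compile(r"WP:CSD#G10\|"), "attack"),
    (re.compile(r"WP:CSD#G11\|"), "spam"),
]


def label_from_comment(log_comment):
    """Return the draft_quality label implied by a deletion log comment."""
    for pattern, label in CRITERIA:
        if pattern.search(log_comment):
            return label
    return "OK"


# Illustrative comments, not taken from the dataset.
examples = [
    "[[WP:CSD#G3|G3]]: Vandalism",
    "[[WP:CSD#G10|G10]]: Attack page",
    "[[WP:CSD#G11|G11]]: Unambiguous advertising",
    "[[WP:CSD#A7|A7]]: No indication of importance",
]
labels = [label_from_comment(c) for c in examples]
```

Requiring the pipe directly after the criterion code keeps a shorter code from matching a longer one that shares its prefix, for example a pattern for G1 matching a G10 or G11 comment.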
Results
TBA