Research:Communication to New Editors 2004-2011
These datasets and analyses are mostly test cases for the rest of the qualitative work for the 2011 Summer of Research, but do suggest some interesting trends nonetheless. Our basic methodologies are described below.
Assessing quality of the first edits made by new editors, 2004 and 2011
How many contributions by new editors are made in good faith and are worth retaining or improving? Are most edits by newbies vandalism or spam, or are they made primarily in good faith?
We selected a randomized sample of first edits by contributors who joined in April 2004 and in April 2011, derived via a simple SQL query run against the Toolserver. We then analyzed these edits by hand, ranking each first edit on a 1-5 scale, with one being pure vandalism and five being a well-referenced content addition indistinguishable from the edit of an experienced contributor. We also noted when the first edit was not a mainspace contribution, and whether or not it was vandalism.
Results
Results are described at: "How much do new editors actually improve Wikipedia?" We'll publish the totals data soon, but the actual samples will not be distributed to avoid calling out individual editors by name.
- April 2004 sample
- April 2011 sample
The type and tone of user talk page edits directed at new editors within their first 30 days
As a follow-up to the previous experiment, which gave us an idea of how many new editors made valuable contributions according to Wikipedia standards, we wanted to look at how these good faith contributors were being communicated with on their user talk pages early on.
We prepared another random sample of several hundred edits made to user talk pages of new registered users on English Wikipedia from 2004 through 2011. These edits were made by other contributors within 30 days of a new person’s first edit.
The sample was gathered using the Toolserver, and the following query is an example of how the 2008 set was gathered. (If you want to run it on different years, simply change the timestamps.)
SQL query to get the sample |
---|
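USE enwiki_p;

-- Outer query: revisions made by other users to each sampled user's talk page
-- within 30 days of that user's first edit.
SELECT su.user_name, r.rev_id
FROM (
    -- Subquery: users who registered in February 2008 and first edited
    -- within 7 days of registering; t is each user's earliest edit timestamp.
    SELECT u.user_id, u.user_name, u.user_registration, MIN(r.rev_timestamp) AS t
    FROM user u
    INNER JOIN revision r ON u.user_id = r.rev_user
    INNER JOIN page p ON r.rev_page = p.page_id
    WHERE u.user_registration BETWEEN '20080201000000' AND '20080301000000'
      AND u.user_id BETWEEN 6335000 AND 6565000
      AND UNIX_TIMESTAMP(r.rev_timestamp) - UNIX_TIMESTAMP(u.user_registration) < (60*60*24*7)
    GROUP BY u.user_id, u.user_name, u.user_registration  -- one row per user, so MIN() is per user
    LIMIT 500
) su
INNER JOIN page p ON su.user_name = p.page_title
INNER JOIN revision r ON r.rev_page = p.page_id AND r.rev_user != su.user_id
WHERE p.page_namespace = 3  -- namespace 3 is User talk
  AND UNIX_TIMESTAMP(r.rev_timestamp) - UNIX_TIMESTAMP(su.t) < (60*60*24*30);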
|
The complete list of classification possibilities is below. Where applicable, we noted multiple items per edit. For example, if the edit was the addition of a warning template, we marked "Template" and "Tip, correction, or warning", and then assigned a tone depending on the contents of the template used.
- Content discussion and/or debate: any edit whose purpose was primarily to discuss or debate the content of encyclopedia articles.
- Template: any edit that was a template.
- Welcome: any edit that was obviously intended to welcome a new editor, either in template form or personalized.
- Tip, correction, or warning: any tip about future editing, correction about errors in past editing procedure or technique, or any warning to cease editing in violation of a policy or guideline.
- Invitation: any invitation to edit a specific page or subject, such as a WikiProject invitation or a suggestion about an interesting topic.
- Praise: any form of praise, from personalized text to barnstars.
- Vandalism: any edit to the user talk page that was purely vandalism.
- Socializing: any edit that did not discuss the project directly, but instead consisted of socializing.
- Minor: any minor change in formatting, grammar, spelling, etc.
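One way to record such multi-label codings is sketched below in Python. It is an illustration only, not the tooling used for this study: the `Category`, `CodedEdit`, and `is_negative_template` names, the `tone` field, and its "negative" value are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional


class Category(Flag):
    """Classification labels from the list above; one edit may carry several."""
    CONTENT_DISCUSSION = auto()
    TEMPLATE = auto()
    WELCOME = auto()
    TIP_CORRECTION_WARNING = auto()
    INVITATION = auto()
    PRAISE = auto()
    VANDALISM = auto()
    SOCIALIZING = auto()
    MINOR = auto()


@dataclass
class CodedEdit:
    rev_id: int                 # revision ID as returned by the sampling query
    categories: Category        # one or more labels combined with |
    tone: Optional[str] = None  # hypothetical tone label, e.g. "negative"


def is_negative_template(edit: CodedEdit) -> bool:
    """True if the edit was coded as a template with a negative tone."""
    return Category.TEMPLATE in edit.categories and edit.tone == "negative"


# The warning-template example from the text: two labels plus a tone.
warning = CodedEdit(
    rev_id=0,  # placeholder value, not a real revision
    categories=Category.TEMPLATE | Category.TIP_CORRECTION_WARNING,
    tone="negative",
)
print(is_negative_template(warning))
```

Using a flag-style enum keeps the "multiple items per edit" rule explicit, since labels combine with `|` and can be checked individually with `in`.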
Results
Results are described at: "The Rise of Warnings to New Editors on English Wikipedia". The totals for the two items compared are below, but the actual samples will not be distributed to avoid calling out individual editors by name.
Year | Edits that included praise | Edits that added a template with a negative tone | Total number of edits analyzed |
---|---|---|---|
2004 | 36 | 0 | 251 |
2005 | 23 | 0 | 223 |
2006 | 26 | 11 | 243 |
2007 | 5 | 24 | 347 |
2008 | 7 | 33 | 235 |
2009 | 13 | 36 | 176 |
2010 | 3 | 50 | 209 |
2011 | 6 | 84 | 244 |
Year | Edits that included praise | Edits adding a template with a negative tone |
---|---|---|
2004 | 14.34% | 0.00% |
2005 | 10.31% | 0.00% |
2006 | 10.70% | 4.53% |
2007 | 1.44% | 6.92% |
2008 | 2.98% | 14.04% |
2009 | 7.39% | 20.45% |
2010 | 1.44% | 23.92% |
2011 | 2.46% | 34.43% |
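The percentages above follow directly from the counts in the first table. A minimal Python sketch of that arithmetic is shown here; the `counts`, `share`, and `shares` names are our own, and the numbers are copied from the totals table.

```python
# year -> (edits with praise, edits adding a negative-tone template, total edits analyzed)
counts = {
    2004: (36, 0, 251),
    2005: (23, 0, 223),
    2006: (26, 11, 243),
    2007: (5, 24, 347),
    2008: (7, 33, 235),
    2009: (13, 36, 176),
    2010: (3, 50, 209),
    2011: (6, 84, 244),
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


# year -> (praise %, negative template %)
shares = {
    year: (share(praise, total), share(negative, total))
    for year, (praise, negative, total) in counts.items()
}

for year, (praise_pct, negative_pct) in shares.items():
    print(f"{year} | {praise_pct:.2f}% | {negative_pct:.2f}% |")
```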
The totals calculated as a percent of the whole (in the sample) resulted in the following chart:
Analysis of the amount of new editors participating in good faith, 2004-2011
As a follow-up to our sampling of the quality of first edits by newbies in 2004 and 2011, we made a similar qualitative assessment of whether new editors were, overall, participating in good faith. This is another attempt to understand how positively the influx of new contributors affects Wikipedia.
Partially working from the dataset of our previous analysis of user talk edits, we classified a random sample of new editors who arrived each February from 2004 through 2011. We then simply noted whether each editor had an overall pattern of good faith participation through their edit history, or whether they were clearly a vandal or spammer. We did so by correlating warnings and blocks from the community with a manual look at the composition of their edits. Practically speaking, duplicate accounts (i.e. sockpuppets) are quite difficult to identify, so only those that were blocked for the abuse of multiple accounts were counted as sockpuppets for our analysis.
Results
Percent of good faith, spammer, vandal, and sockpuppet editors 2004-2011, based on the sample. (Actual samples with analysis will not be distributed to avoid calling out individual editors by name.) See blog post.
Year | Good faith editors | Vandals | Spammers | Sockpuppets | Total sampled |
---|---|---|---|---|---|
2004 | 189 | 0 | 1 | 0 | 190 |
2005 | 135 | 2 | 2 | 3 | 142 |
2006 | 193 | 11 | 2 | 2 | 208 |
2007 | 118 | 19 | 2 | 8 | 147 |
2008 | 143 | 25 | 4 | 6 | 178 |
2009 | 107 | 40 | 5 | 10 | 162 |
2010 | 118 | 34 | 7 | 14 | 173 |
2011 | 121 | 26 | 8 | 8 | 163 |
Total | | | | | 1,363 |
Year | Good faith | Vandalism | Spam | Sockpuppets |
---|---|---|---|---|
2004 | 99.47% | 0.00% | 0.53% | 0.00% |
2005 | 95.07% | 1.41% | 1.41% | 2.11% |
2006 | 92.79% | 5.29% | 0.96% | 0.96% |
2007 | 80.27% | 12.93% | 1.36% | 5.44% |
2008 | 80.34% | 14.04% | 2.25% | 3.37% |
2009 | 66.05% | 24.69% | 3.09% | 6.17% |
2010 | 68.21% | 19.65% | 4.05% | 8.09% |
2011 | 74.23% | 15.95% | 4.91% | 4.91% |
After gathering the number and percent of each type of contributor within our sample, we next used statistics (from stats.wikimedia.org) to extrapolate the trends to the actual number of new editors each February. For example, if about 74% of new editors in our 2011 sample were acting in good faith, and there were 7,820 new English Wikipedians in February 2011, then about 5,805 of those editors are likely to be participating in good faith (if the trends from the sample are correct).
Year | February new editors |
---|---|
2004 | 683 |
2005 | 1,417 |
2006 | 7,659 |
2007 | 13,828 |
2008 | 10,812 |
2009 | 9,551 |
2010 | 8,492 |
2011 | 7,820 |
Year | Good faith | Vandalism | Spam | Sockpuppets |
---|---|---|---|---|
2004 | 679 | 0 | 4 | 0 |
2005 | 1,347 | 19 | 20 | 30 |
2006 | 7,107 | 405 | 74 | 74 |
2007 | 11,100 | 1,787 | 188 | 753 |
2008 | 8,686 | 1,519 | 243 | 364 |
2009 | 6,308 | 2,358 | 295 | 590 |
2010 | 5,792 | 1,669 | 344 | 687 |
2011 | 5,805 | 1,247 | 384 | 384 |
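The extrapolation described above can be sketched in Python as follows. This is an illustration under our own assumptions rather than the calculation actually used: the `sample`, `new_editors`, and `extrapolate` names are introduced here, and the sketch scales by unrounded sample shares, so an individual cell can differ by one from the table above.

```python
# year -> (good faith, vandals, spammers, sockpuppets) counted in the sample
sample = {
    2004: (189, 0, 1, 0),
    2005: (135, 2, 2, 3),
    2006: (193, 11, 2, 2),
    2007: (118, 19, 2, 8),
    2008: (143, 25, 4, 6),
    2009: (107, 40, 5, 10),
    2010: (118, 34, 7, 14),
    2011: (121, 26, 8, 8),
}

# year -> new editors that February (figures from the table above)
new_editors = {
    2004: 683,
    2005: 1417,
    2006: 7659,
    2007: 13828,
    2008: 10812,
    2009: 9551,
    2010: 8492,
    2011: 7820,
}


def extrapolate(year: int) -> tuple:
    """Scale each category's sample share up to that February's new editors."""
    counts = sample[year]
    total = sum(counts)
    return tuple(round(new_editors[year] * count / total) for count in counts)


for year in sample:
    good_faith, vandals, spammers, sockpuppets = extrapolate(year)
    print(f"{year} | {good_faith:,} | {vandals:,} | {spammers:,} | {sockpuppets:,} |")
```

For 2011, for instance, 121 of 163 sampled editors were acting in good faith, and 7,820 × 121 / 163 rounds to 5,805.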