Research:Wikipedia article creation
Code repository: https://github.com/halfak/Wikipedia-article-creation-research
The process of creating articles is becoming increasingly difficult for new users due to increasingly restrictive criteria[1] and the speed at which their articles are tagged and deleted[2]. This trend is concerning because new users tend to leave the wiki when their work is deleted.
The English Wikipedia Articles for Creation WikiProject has recently adjusted its process in order to encourage new editors to create draft articles outside of the usual article space. However, it's unclear whether such initiatives succeed in improving the success rate of articles created by new editors or in improving their retention. In this study, we'll discuss our analysis of newcomer-created articles in the most active Wikipedia projects and answer questions about how different workflows affect the success rate of articles.
Related work
Research has established that the number of active editors in the English Wikipedia has entered a decline and that this decline is the result of decreased retention of new users[3]. Subsequent research by Halfaker et al. has shown evidence that this decline is not due to the quality of newcomers, but rather the increasing complexity newcomers must manage in order to successfully contribute and the negative reactions they receive[4]. One of the key factors in Halfaker et al.'s model predicting the retention of new editors was whether they created articles that were quickly deleted. Related work by User:Mr.Z-man confirmed that new editors who created articles that were deleted are less likely to continue to contribute[5]. Research performed in parallel found that the rate at which newly created articles are deleted has risen sharply in recent years[1] and the speed at which new articles are tagged and deleted has increased dramatically[2].
Given that several other large Wikipedias exhibit trends similar to the English Wikipedia (e.g. German[6]) and the recent interest in creating native draft functionality in the English Wikipedia, we have set out to better understand the nature of newcomer article creation. Specifically, we sought to understand how drafts have affected the success rate of newcomers' articles.
Research questions
For this analysis, we focused on the top 10 Wikipedias by daily number of articles created[7]: English, Spanish, Russian, German, French, Italian, Polish, Chinese, Japanese and Portuguese. For the sake of simplicity, we'll focus our in-depth analyses on the English and German Wikipedias, but we will compare general statistics across all 10 wikis listed previously.
RQ 1: At what scale do new editors create articles?
- How many newcomers create articles?
- How many articles are created by newcomers?
RQ 2: How successful are new editors in creating articles in Wikipedia?
- What is the success rate of articles created by new editors? ...of more experienced editors? ...of IP editors?
- How many articles are created indirectly, as drafts? How does the success rate differ for these draft articles?
- How has AfC affected the success rate of new editors in English Wikipedia?
Methods
Data was gathered using the page, revision, archive and logging tables. Data was originally extracted on Nov 5th, 2013, so all subsequent queries were bound by this date for fairness across wikis.
General terminology
On Wikipedia, there is a great deal of complexity and nuance to the jargon for various kinds of wiki pages and page creation processes. The following definitions are terms we will use in our research analyses:
- Page
- a page is any wiki page, i.e. in any namespace. If we are referring to pages within a particular namespace only, we will either say article (see below) or name the namespace specifically, e.g. "userspace page" or "user talk page"
- Article
- an article is a page in the main namespace, aka namespace zero
- Draft
- a draft is any page originally created outside of the main namespace that is intended to be an article.
- Articles for Creation, or AFC
- Articles for Creation is a project on English Wikipedia originally intended to allow IP editors to request the creation of articles by registered users. However, recently the project has become a sort of pre-review and mentoring space for both IP editors and new editors.
Assumptions and notes
- Page creation
- We assumed that the timestamp of the first revision to a page is the time of creation
- Page creator
- We assumed that the user who saved the first revision to a page is the page's creator
- Newcomer tenure was determined by comparing the timestamp of page creation with the user_registration timestamp for all users.
- Note that user_registration represents something different for Research:Attached users, so they are held aside in newcomer analyses.
- Page deletion
- We assume that the timestamp of the last revision to a page is the time of deletion
- We chose this approximation due to bug #26122, which does not allow for easily associating deletion events with the page that was deleted.
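The three assumptions above can be sketched in a few lines of Python. This is only an illustration, not the study's actual extraction code: the `Revision` record, its field names, and the `archived` flag (standing in for "this row came from the archive table") are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Revision:
    """Hypothetical revision record; not the actual MediaWiki schema."""
    page_id: int
    user_id: Optional[int]  # None stands for an IP edit (assumed convention)
    timestamp: str          # 'YYYYMMDDHHMMSS' strings sort chronologically
    archived: bool          # True if the row came from the archive table


def summarize_page(revisions: Iterable[Revision]) -> Dict[str, object]:
    """Apply the creation, creator and deletion assumptions to one page."""
    revs = sorted(revisions, key=lambda r: r.timestamp)
    if not revs:
        raise ValueError("a page needs at least one revision")
    first, last = revs[0], revs[-1]
    # Assumption: a page counts as deleted when all of its revisions are archived.
    deleted = last.timestamp if all(r.archived for r in revs) else None
    return {
        "created": first.timestamp,  # first revision = time of creation
        "creator": first.user_id,    # author of first revision = page creator
        "deleted": deleted,          # last revision ~ time of deletion
    }
```

Because the deletion time is approximated by the last revision, a page that sat untouched for a while before being deleted will appear to have been deleted earlier than it really was.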
Datasets
To perform this analysis, we generated two datasets. For most of our analysis, we focused on the top 2 Wikipedias by article count: English & German.
English & German Wikipedia
In this dataset, we sought to build as complete a picture of article creation as possible. In order to do this, we needed to be able to distinguish articles that started as drafts from articles that were created directly in the main namespace. The problem is that it is difficult to programmatically tell the difference between an unpublished draft and other kinds of non-namespace zero pages. For example, many users start drafts in their "user sandbox" and later move those pages to main (e.g. en:User:'DesoHaa/sandbox was moved to en:Fredrick_Kúmókụn_Adédeji_Haastrup on May 27th, 2013), but much "user sandbox" usage is unrelated to article drafting.
One exception is the drafts created via Articles for Creation (AFC) in the English Wikipedia. These pages are originally created in the Wikipedia Talk namespace as sub-pages of en:Wikipedia_talk:Article_for_creation. We assume that all such sub-pages represent drafts that are intended to eventually become articles. This allows us to make a clear distinction between AFC pages that have been published and those that have not.
However useful this is for our analysis of AfC specifically, we sought to be able to compare English & German Wikipedia fairly. In order to do this, we invented the concept of an "article page" -- a page that was, at some point, visible in the main namespace -- under the assumption that any page that is moved to the main namespace was probably intended to be an "article" of some sort.
- Page move
- A page is moved when its namespace or title is changed. Draft articles tend to be "published" by being moved into the main namespace from another namespace.
- Note that we extract page move information from structured comments that appear in the revision history of pages due to a bug (#57084) which does not allow for easily associating move events with the page that was moved.
- English --
rev_comment RLIKE '.*moved .*\\[\\[([^\]]+)\\]\\] to \\[\\[([^\]]+)\\]\\].*:.*'
- German --
rev_comment RLIKE ".*(hat „|verschob „|verschob Seite |verschob die Seite )\\[\\[([^\]]+)\\]\\](“)? nach („)?\\[\\[([^\]]+)\\]\\](“)?(.*)"
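To illustrate how the English move pattern can be applied outside of SQL, here is a rough Python sketch. The regular expression is a hand translation of the English `RLIKE` pattern above; the function names, the sample comment in the usage note, and the small namespace prefix table are assumptions for illustration only, and real titles that contain a colon in the main namespace would need more careful handling.

```python
import re
from typing import Optional, Tuple

# Hand translation of the English RLIKE pattern (SQL string escaping removed).
MOVE_RE = re.compile(r"moved .*\[\[([^\]]+)\]\] to \[\[([^\]]+)\]\].*:")

# Only the two draft namespaces discussed in this study; anything else is
# treated as the main namespace (0), which is a simplification.
NAMESPACE_PREFIXES = {"User": 2, "Wikipedia talk": 5}


def parse_move(comment: str) -> Optional[Tuple[str, str]]:
    """Return (source title, target title) for a move comment, else None."""
    match = MOVE_RE.search(comment)
    if match is None:
        return None
    return match.group(1), match.group(2)


def namespace_of(title: str) -> int:
    """Guess a namespace id from a title prefix (hypothetical helper)."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        return 0
    return NAMESPACE_PREFIXES.get(prefix, 0)
```

For a made-up comment such as `moved [[User:Example/sandbox]] to [[Example article]]: ready for mainspace`, `parse_move` returns the two titles, and comparing `namespace_of` for each tells us whether the move "published" a draft into namespace zero.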
- Article page
- A page that appears in the main namespace at some point. This includes pages that were originally created as articles and those that were moved to namespace zero.
- Original namespace
- Many article pages were created directly in the main namespace. However, others were created in "userspace" (namespace 2) and Articles for Creation (namespace 5). By capturing the originating namespace, we hope to get a sense for how different workflows affect the survival of articles once they are moved ("published") to the main namespace.
- Publication date
- An article is "published" when it first appears in the main namespace. This matches the creation date for pages that were initially created as articles, but instead matches the date a page was moved into namespace zero for pages that were created in other namespaces.
- Unpublication date
- An article is "unpublished" when it is first deleted or moved out of namespace zero. For simplicity, we did not account for undeletions and moves back to the main namespace. Once a page is unpublished, it's gone. Regretfully, the logging table does not track an appropriate identifier for deleted pages (see bug #26122), so we considered the last revision to an archived page to represent the approximate time-of-deletion.
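The publication and unpublication definitions above can be sketched as a single pass over a page's history. This is a minimal illustration under assumed inputs: each event is a hypothetical `(timestamp, namespace)` pair giving the namespace the page is in after the event, with `None` meaning the page was deleted.

```python
from typing import Iterable, Optional, Tuple

# (timestamp, namespace after the event); namespace None means "deleted".
Event = Tuple[str, Optional[int]]


def publication_window(events: Iterable[Event]) -> Tuple[Optional[str], Optional[str]]:
    """Return (publication date, unpublication date) for one page.

    Publication is the first time the page is in namespace 0. Unpublication
    is the first later event that deletes it or moves it out of namespace 0.
    Later undeletions or moves back to main are ignored, as in the study.
    """
    published = None
    for timestamp, namespace in sorted(events, key=lambda e: e[0]):
        if published is None:
            if namespace == 0:
                published = timestamp
        elif namespace != 0:
            return published, timestamp
    return published, None
```

A page created in userspace, moved to main, deleted, and later restored would yield the move date and the deletion date; the restoration is deliberately dropped to match the "once a page is unpublished, it's gone" simplification.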
The other top 8 Wikipedias
In order to be able to generalize our conclusions we sought to get a representative sample from our larger projects. For this dataset, we extracted general page creation & deletion data from the top 10 wikis by size (English, German, Italian, Spanish, Russian, Portuguese, Polish, Japanese, Chinese and French). Due to the way that the analytics slaves which host non-English/German wikis operate, we were unable to efficiently extract move information. So, rather than examining article pages, we instead used all pages that appeared in the main namespace as of Nov. 13th, 2013 as our set of "articles". While this means that we are not able to observe the effects of different workflows on article survival, we are still able to reflect on how creation and deletion rates change over time for those articles that appear in the main namespace.
- Article
- A page that appeared in the main namespace as of Nov. 13th, 2013. (This includes deleted pages.)
- Creation date
- The timestamp of page creation
- Deletion date
- The timestamp of page deletion
Note that, when we compare German and English Wikipedias to the other 8, we revert to these simpler definitions of "article", creation date and deletion date.
Article creator classes
In order to observe newcomer article creation, we split newly registered users into three groups based on their tenure at the time of article creation:
- -day
- Newcomers who registered less than a day before saving the first revision of a page.
- day-week
- Newcomers who registered between 24 hours and 7 days before saving the first revision of a page
- week-month
- Newcomers who registered between 7 and 30 days before saving the first revision of a page
Note that, when we refer to "newcomers" without specifying which class, we mean the aggregate of the above newcomer classes (i.e. all newcomers with less than 30 days of tenure).
We also examined other groups for comparison
- month-
- Wikipedians who registered more than a month before saving the first revision of a page
- anons (IP editors)
- Editors who created pages "anonymously" -- not through a registered account. Note that the English Wikipedia and a handful of other wikis disallowed direct article creation from IP addresses.
- autocreated
- Attached editors are users with global accounts who came from another wiki (e.g. meta). Since they are automatically registered (hence "autocreated") when they visit the wiki of interest, registration dates don't generally reflect their actual tenure across Wikipedia. For this reason, these users were analyzed separately.
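The creator classes above amount to a simple bucketing of tenure at the time of page creation. The following Python sketch shows one way to express it; the function name, the keyword flags, and the handling of exact boundary values (for example, exactly 24 hours falling into "day-week") are our assumptions, and "month" is taken to mean 30 days as in the newcomer definition.

```python
from datetime import datetime, timedelta
from typing import Optional


def creator_class(
    registered: Optional[datetime],
    created: datetime,
    *,
    anonymous: bool = False,
    autocreated: bool = False,
) -> str:
    """Bucket a page creator by tenure at the time the page was created."""
    if anonymous:
        return "anons"
    if autocreated:
        # Registration dates of attached users don't reflect real tenure.
        return "autocreated"
    if registered is None:
        raise ValueError("registered users need a registration timestamp")
    tenure = created - registered
    if tenure < timedelta(days=1):
        return "-day"
    if tenure < timedelta(days=7):
        return "day-week"
    if tenure < timedelta(days=30):
        return "week-month"
    return "month-"
```

The first three return values together make up the aggregate "newcomers" group used throughout the analysis.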
Successful articles
In order to examine RQ 2, we sought to formalize "successful" article creation. Preferably, we'd like all newly created articles to meet minimum guidelines for inclusion in an encyclopedia. If an effective review process is in place, new articles below such a threshold are deleted in a reasonable amount of time and articles above the threshold remain. In such a system, any article that survives a certain amount of time can be assumed to be a successful article creation. In order to identify how long a reasonable amount of time might be, we performed an analysis of the time between creation and deletion (last revision timestamp). Figure #Article lifetime shows a strong cluster between one minute and one hour. While some deletions take more than one year, 87.3% of deletions occurred within one month. Based on these observations, we operationalize a reasonable amount of time as 30 days.
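The 30-day operationalization can be written down directly. The sketch below is illustrative only; the function names and the `(published, unpublished)` pair representation are hypothetical, and treating an article unpublished at exactly 30 days as a survivor is our assumption.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

SURVIVAL_WINDOW = timedelta(days=30)


def survived(
    published: datetime,
    unpublished: Optional[datetime],
    window: timedelta = SURVIVAL_WINDOW,
) -> bool:
    """True if the article stayed in the main namespace for at least `window`."""
    return unpublished is None or unpublished - published >= window


def survival_rate(
    articles: Iterable[Tuple[datetime, Optional[datetime]]]
) -> Optional[float]:
    """Share of (published, unpublished) pairs that survived; None if empty."""
    pairs = list(articles)
    if not pairs:
        return None
    return sum(survived(p, u) for p, u in pairs) / len(pairs)
```

Note that this sketch does not handle articles published fewer than 30 days before the data was extracted; those have not yet had a full window in which to be deleted and would need to be excluded or treated as censored.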
Results
Since the archive table only contains page_id values for revisions of pages deleted after 2007, we'll focus our attention on 2008 through 2013.
RQ 1: At what scale do new editors create articles?
How many newcomers create articles?
To get a sense for how many newcomers create articles and to identify changes over time, we built a monthly timeseries of the number of newly_registered_users, new editors, new page creators, new article page publishers and new draft article publishers. These editor classes represent a "funnel" as users approach creating articles and eventually draft articles.
- new page creators = new editors who create a page within 1 month of registration
- new article publishers = new editors who create an article page within 1 month of registration
- new draft article publishers = new editors who create a draft (original namespace != 0) article page within 1 month of registration.
Figures #Relative funnel proportions (enwiki) and #Relative funnel proportions (dewiki) show how the relative proportion of newcomers reaching each step in the funnel has changed over time. Both English and German Wikipedia have experienced a slow decline in the proportion of newly_registered_users who make at least one edit (new editors). However, while the proportion of new editors who create pages (page creators) has been declining for German Wikipedia (55% in Jan. 2008 to less than 40% in Oct. 2013), the proportion has been holding steady for English Wikipedia at about 40-45%.
Another difference can be observed in the proportion of new page creators who publish an article page. Both the English and German Wikipedia fluctuate around 60-65% between 2008 and mid 2011, but in the English Wikipedia the percentage of such editors drops to about 32% thereafter. Note that the timing of this switch corresponds to the time that newcomers began to be directed to en:Wikipedia:Articles for creation. This transition is explored more closely in #How has AfC affected the success rate of new article creators in English Wikipedia?
The table below summarizes statistics for the English and German Wikipedias for the most recent full month in the dataset: October, 2013.
group | English users | English relative % | English absolute % | German users | German relative % | German absolute % |
---|---|---|---|---|---|---|
Newly registered users | 157008 | 100 | 100 | 9633 | 100 | 100 |
New editors | 48163 | 30.6 | 30.6 | 4019 | 41.7 | 41.7 |
New page creators | 18808 | 39.1 | 12.0 | 1575 | 39.2 | 16.4 |
New article publishers | 6045 | 32.1 | 3.9 | 952 | 60.4 | 9.9 |
New draft publishers | 118 | 2.0 | 0.0 | 0 | 0 | 0.0 |
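The two percentage columns can be derived from the user counts alone. The sketch below assumes, based on the figures in the table, that "relative %" is the share of the previous funnel step and "absolute %" is the share of newly registered users; the function name is hypothetical. It rounds to one decimal place, so a few values can differ slightly from the published table (for example, 48163 / 157008 rounds to 30.7 rather than 30.6).

```python
from typing import List, Sequence, Tuple


def funnel_percentages(
    steps: Sequence[Tuple[str, int]]
) -> List[Tuple[str, int, float, float]]:
    """Return (name, users, relative %, absolute %) for each funnel step.

    relative % = users / users at the previous step
    absolute % = users / users at the first step
    """
    if not steps:
        return []
    total = steps[0][1]
    previous = total
    rows = []
    for name, users in steps:
        relative = 100.0 * users / previous if previous else 0.0
        absolute = 100.0 * users / total if total else 0.0
        rows.append((name, users, round(relative, 1), round(absolute, 1)))
        previous = users
    return rows


# User counts for English Wikipedia, October 2013, taken from the table above.
ENGLISH_OCT_2013 = [
    ("Newly registered users", 157008),
    ("New editors", 48163),
    ("New page creators", 18808),
    ("New article publishers", 6045),
    ("New draft publishers", 118),
]
```

Calling `funnel_percentages(ENGLISH_OCT_2013)` reproduces the shape of the English columns; the guard against a zero denominator matters for the German draft publisher row, where the count is 0.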
How many articles are created by newcomers?
Next we sought to explore how many articles newcomers were responsible for. Figure #Article created by experience plots the proportion of articles created in October, 2013 by each creator class for each of the observed wikis.
In all wikis, Wikipedians with more than a month of tenure create the majority of articles. On the low end is Italian, with "month-" Wikipedians creating 54% of new articles. On the high end are English and German, with "month-" Wikipedians creating 83.6% and 79% of articles respectively.
In all non-English wikis, anonymous editors (aka IP editors) represent the next largest class of article creators. Their article creation proportion ranges from 14.5% in German Wikipedia to 36.2% in Italian Wikipedia.
The next largest group of article creators are newcomers with less than 1 day of tenure on Wikipedia. In all the observed wikis, a higher proportion of articles are created by newcomers in their first day than those created by newcomers in the rest of their first week/month combined.
RQ 2: How successful are new editors in creating articles in Wikipedia?
What is the success rate of articles created by new editors?
In order to address this research question, we first sought to get a general sense of the success rate of newcomers in creating articles across wikis and how that success rate has changed over time. Figure #Article survival by experience summarizes the success rate of articles created in October, 2013 by editors from each of the 6 article creator classes across the set of languages.
We were surprised to find that, in wikis that allow article creation by anonymous editors, their survival rate is substantially higher than that of recently registered new editors. In most cases, anons were twice as likely to create an article that would stick than newcomers with less than a day of tenure. One notable exception is Polish Wikipedia, which has the lowest observed survival rate of anon-created articles (22.2%) and the highest observed survival rate of articles created by "-day" newcomers (58.6%).
In general, the rate of article survival is much higher for all editors in Japanese and Polish Wikipedias. It's unclear to us why this is the case. All other wikis seem to follow a similar pattern whereby newcomers with less than a day of tenure have the lowest article survival rate.
Next, we looked at how article survival rate has been changing over time for different article creator classes. The figures below plot the monthly survival rate for each class of editor with linear models overlaid to aid with visualizing trends.
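As a rough illustration of the kind of linear trend overlaid on these plots, the sketch below fits an ordinary least-squares line to a monthly series in plain Python. It is not the plotting code used for the figures; the function name and the input layout (month index and survival rate as parallel sequences) are assumptions.

```python
from typing import Sequence, Tuple


def fit_trend(months: Sequence[float], rates: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of rate = slope * month + intercept."""
    n = len(months)
    if n < 2 or n != len(rates):
        raise ValueError("need two or more points and equal-length inputs")
    mean_x = sum(months) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    if sxx == 0:
        raise ValueError("months must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A negative slope for a creator class corresponds to a declining survival rate over the observed months, which is the pattern discussed for most wikis below.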
In general, the survival rate of newcomers' articles has been decreasing -- even in Polish and Japanese, where the survival rate of newcomer articles is very high. English and German represent exceptional cases where it appears that the success rate of newcomer-created articles is rising for all three newcomer classes. For English, this could be explained by the introduction of Articles for creation. We'll discuss this possibility more in #How has AfC affected the success rate of new article creators in English Wikipedia? For German Wikipedia, we don't have any likely explanations to put forward.
How does the success rate differ for draft articles?
Next we looked to our dataset of article pages for English and German Wikipedia to examine the survival rate of articles that were created in another namespace (drafts) and moved to the main namespace later. The Article page survival figures (enwiki & dewiki) below plot the survival rate of articles over time by their original namespaces, for the three classes of newcomers and experienced Wikipedians.
- 0 = Main (article) namespace
- 2 = User namespace
- 5 = Wikipedia_talk (Project_talk) namespace (used by Articles for creation on English Wikipedia)
As expected, newcomers with the least tenure ("-day") have the most divergent survival rates for direct to main (origin = 0) article creations and drafts. In the English Wikipedia, direct article creations survive about 25% of the time while articles that start in userspace (origin = 2) and Articles for creation (origin = 5) survive about 96% of the time once published. Similar, but less substantial differences in the survival rate exist for these newest of newcomers in the German Wikipedia. There, direct article creations by "-day" newcomers survive about 20% of the time while articles that start in userspace survive about 80% of the time.
This is where the similarities between English and German Wikipedia with regards to draft survival seem to disappear. While the survival rate of both draft types remains high for English Wikipedia through all editor classes, in German, userspace drafts created by slightly more experienced newcomers ("day-week" & "week-month") and by experienced editors have a surprisingly low survival rate -- lower than even direct article creations. This differing trend could reflect the different ways in which userspace drafts are used in German vs. English Wikipedia. However, one thing is clear: more examination is necessary to explain this difference.
The tables below present summary statistics for October, 2013, the most recent complete month in the dataset.
English Wikipedia (Oct. 2013)
origin | creator_tenure | authors | articles | surviving | survival % |
---|---|---|---|---|---|
0 | -day | 5142 | 6221 | 1511 | 24.3 |
0 | day-week | 759 | 1415 | 788 | 55.7 |
0 | week-month | 750 | 1320 | 805 | 61.0 |
0 | month- | 5775 | 37033 | 35051 | 94.6 |
2 | -day | 69 | 70 | 67 | 95.7 |
2 | day-week | 22 | 23 | 22 | 95.7 |
2 | week-month | 40 | 53 | 45 | 84.9 |
2 | month- | 219 | 457 | 387 | 84.7 |
5 | -day | 96 | 96 | 92 | 95.8 |
5 | day-week | 29 | 32 | 32 | 100.0 |
5 | week-month | 29 | 31 | 31 | 100.0 |
5 | month- | 176 | 529 | 512 | 96.8 |
German Wikipedia (Oct. 2013)
origin | creator_tenure | authors | articles | surviving | survival % |
---|---|---|---|---|---|
0 | -day | 802 | 967 | 190 | 19.6 |
0 | day-week | 111 | 163 | 83 | 50.9 |
0 | week-month | 101 | 171 | 113 | 66.1 |
0 | month- | 1615 | 15183 | 14216 | 93.6 |
2 | -day | 35 | 39 | 31 | 79.5 |
2 | day-week | 23 | 26 | 9 | 34.6 |
2 | week-month | 21 | 30 | 11 | 36.7 |
2 | month- | 129 | 352 | 154 | 43.8 |
How has AfC affected the success rate of new article creators in English Wikipedia?
Finally, we sought to get a sense for how Articles for creation (AfC) affects the work of new editors. We showed in #How does the success rate differ for draft articles? that articles created through AfC are most likely to survive for all three newcomer classes as well as experienced Wikipedians. However, we also observed in #How many newcomers create articles? that the proportion of new page creators whose pages get published in the main namespace declined sharply around the time that new article creators began being directed to AfC.
Given the high survival rate of AfC published articles (~96%), it could be that AfC is merely acting as an effective filter, where articles that won't survive are just not published in the first place. In order to test this hypothesis, we removed all article pages that did not survive at least 30 days from the dataset and recomputed the same proportion. Figure #Surviving article per new page creator plots this proportion with loess fits for before and after newcomers were directed toward AfC. Despite filtering for surviving articles, it looks like the trend remains. Figure #AfC drafts created per month shows the corresponding rise in the number of AfC drafts created by newcomers during this time period.
This analysis shows that about half as many good articles have been published by newcomers since newcomers started being directed to AfC when creating articles. However convincing, this analysis merely demonstrates a temporal correlation. More analysis will be necessary to assert with confidence that AfC is causing the decline in successful newcomer-created articles.
Summary
In this study, we examined article creation across the top 10 language Wikipedias by number of articles and performed a focused analysis of draft article creation in English and German Wikipedias.
We found some strong regularities in success rate of articles depending on the experience level of the article's creator; namely, that the more experience an editor has, the more likely their articles are to survive. We were surprised to find that articles created by anonymous editors (where such creations are possible) are more likely to survive than articles created by newcomers who recently registered an account. This result suggests that it's time to review English Wikipedia's policy against anonymous article creation.
We also found that, in general, the rate of survival for newcomer articles has been decreasing over time with two notable exceptions. In the English Wikipedia, the survival of newcomer articles decreased steadily until the introduction of Articles for creation, a space for creating article drafts and receiving review before publishing. In German Wikipedia, the survival of newcomer articles has been rising steadily since 2008, but we see no evidence of a comparable switch toward a draft & review process in that wiki.
Finally, we found correlation-based evidence that directing new article creators to AfC has resulted in a dramatic decline in the creation of good new articles by newcomers. We also showed that the drafts published via AfC are extremely likely to survive. More work is necessary to identify what factor may be limiting AfC's success to this smaller proportion of articles.
Notes
- [1] Lam, S. T. K., & Riedl, J. (2009, May). Is Wikipedia growing a longer tail? In Proceedings of the ACM 2009 international conference on Supporting group work (pp. 105-114). ACM.
- [2] Research:The Speed of Speedy Deletions
- [3] Suh, B., Convertino, G., Chi, E. H., & Pirolli, P. (2009, October). The singularity is not near: slowing growth of Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (p. 8). ACM.
- [4] Halfaker, A., Geiger, R. S., Morgan, J. T., & Riedl, J. (2013). The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline. American Behavioral Scientist, 57(5), 664-688.
- [5] https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2011-04-04/Editor_retention
- [6] http://stats.wikimedia.org/EN/ChartsWikipediaDE.htm
- [7] New articles per day from stats.wikimedia.org