The main goal of this project is to develop a robust and extensible software tool for in-depth quantitative analysis of the whole Wikipedia project. The project is currently developed by José Felipe Ortega (GlimmerPhoenix) at the Libresoft Group of Universidad Rey Juan Carlos.

The tool currently includes a set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version. Its functionality covers:

  • Downloading the 7zip database dump of the target language version.
  • Decompressing the dump and reconstructing the database on local storage (see the sketch after this list).
  • Creating additional database tables with useful statistics and quantitative information.
  • Generating graphics and data files with quantitative results, organized in a per-language directory structure.
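
As an illustration, the download and decompression steps could look roughly like the following Python sketch. This is not the actual WikiXRay code: the dump file name pattern on dumps.wikimedia.org and the use of the external 7za binary are assumptions.

  # Minimal sketch of the download/decompression step (not WikiXRay's actual code).
  # The dump file name pattern and the external 7za binary are assumptions.
  import os
  import subprocess
  import urllib.request

  def fetch_and_unpack_dump(lang, date, dest_dir="dumps"):
      """Download a 7zip full-history dump for `lang` and unpack it locally."""
      filename = f"{lang}wiki-{date}-pages-meta-history.xml.7z"
      url = f"https://dumps.wikimedia.org/{lang}wiki/{date}/{filename}"
      os.makedirs(dest_dir, exist_ok=True)
      local_path = os.path.join(dest_dir, filename)

      # Download the compressed dump.
      urllib.request.urlretrieve(url, local_path)

      # Decompress with the external 7za tool (must be installed on the system).
      subprocess.run(["7za", "x", "-o" + dest_dir, local_path], check=True)
      return local_path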

Some of these capabilities still require setting parameters manually in a common configuration file, while others run automatically.
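
For illustration only, reading such a configuration file with Python's standard configparser might look like the sketch below; the file name wikixray.conf and the option names are hypothetical, not the tool's actual parameters.

  # Hypothetical example of reading a common configuration file; the file name
  # and option names are illustrative only, not WikiXRay's actual parameters.
  import configparser

  config = configparser.ConfigParser()
  config.read("wikixray.conf")

  language = config.get("dump", "language", fallback="en")
  dump_date = config.get("dump", "date", fallback="latest")
  db_user = config.get("database", "user", fallback="wikixray")
  db_passwd = config.get("database", "password", fallback="")

  print(f"Analysing {language}wiki dump from {dump_date} as DB user {db_user}")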

The source code is publicly available under the GNU GPL license and can be found in the LibreSoft tools git repository or, alternatively, in the WikiXRay project on Gitorious.

Please note that this software is still at a very early development stage (pre-alpha level). Any useful contributions are of course welcome; please contact the project administrators first.

Below, we summarize some of the most relevant results obtained so far. These results come from a quantitative analysis focused on the top 10 Wikipedias (ranked by their total number of articles).

Python Parser

WikiXRay comes with a Python parser to process compressed XML dumps of Wikipedia. More information is available at WikiXRay/Python parser.
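
A minimal sketch of how a compressed dump can be streamed from Python is shown below. It is not the WikiXRay parser itself: it assumes a bzip2-compressed pages dump (7z dumps would need to be piped through an external tool), and the file name in the usage example is hypothetical.

  # Minimal sketch of streaming a compressed Wikipedia XML dump (not the actual
  # WikiXRay parser). Assumes a bzip2-compressed dump so that the standard
  # library can read it directly.
  import bz2
  import xml.etree.ElementTree as ET

  def iter_pages(dump_path):
      """Yield (title, text size in bytes) for every <page> in the dump."""
      with bz2.open(dump_path, "rb") as stream:
          title, size = None, 0
          for event, elem in ET.iterparse(stream, events=("end",)):
              tag = elem.tag.rsplit("}", 1)[-1]   # strip the XML namespace
              if tag == "title":
                  title = elem.text
              elif tag == "text":
                  size = len((elem.text or "").encode("utf-8"))
              elif tag == "page":
                  yield title, size
                  elem.clear()                    # keep memory usage bounded

  # Hypothetical usage on a pages-articles dump:
  sizes = [s for _, s in iter_pages("eswiki-latest-pages-articles.xml.bz2")]
  print(len(sizes), "pages parsed")

Note that for full-history dumps each <page> element contains many <revision> elements; the sketch above only keeps the size of the last <text> element seen for each page.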

Article size distribution (HISTOGRAMS)

Regarding article size, one of the most interesting aspects to study is its distribution in each language. We can plot the histogram of the article size distribution for each language to evaluate the most frequent size intervals.
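
A sketch of how such a histogram could be produced is given below, assuming that sizes is a list of article sizes in bytes (for instance, collected with the parser sketched above). Binning the logarithm of the size is an assumption made here because sizes span several orders of magnitude; the original figures may use a different binning.

  # Per-language histogram of article sizes (sketch; log10 binning is an assumption).
  import numpy as np
  import matplotlib.pyplot as plt

  def plot_size_histogram(sizes, lang):
      log_sizes = np.log10(np.asarray(sizes, dtype=float) + 1)  # +1 avoids log(0)
      plt.hist(log_sizes, bins=60, color="steelblue", edgecolor="black")
      plt.xlabel("log10(article size in bytes)")
      plt.ylabel("Number of articles")
      plt.title(f"Article size distribution ({lang})")
      plt.savefig(f"size_histogram_{lang}.png")
      plt.close()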

As we can see, one common feature is that in each histogram we can identify two different populations. The lower half corresponds to the population of tiny articles and redirects, most probably composed of stubs. The upper half corresponds to the population of standard articles.

We can also explore the statistical distribution exhibited by each half (the stubs/redirects population and the standard population). This can be done by working out the empirical cumulative distribution function (ECDF) for each half of every histogram. The results show normal distributions.

We can appreciate the normality by comparing the ECDF of each population, for each language, with the ECDF of a perfectly normal distribution (red line).
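
The comparison described above could be sketched as follows, again on log-transformed sizes. The threshold used to split the two populations is a hypothetical parameter chosen by inspecting the histogram, and the normal reference curve is fitted here with the sample mean and standard deviation; the original analysis may proceed differently.

  # ECDF of each population compared with the CDF of a fitted normal distribution.
  # The split threshold is a hypothetical, manually chosen parameter.
  import numpy as np
  import matplotlib.pyplot as plt
  from scipy import stats

  def compare_ecdf_with_normal(log_sizes, threshold, lang):
      data = np.sort(np.asarray(log_sizes, dtype=float))
      for label, pop in (("stubs/redirects", data[data < threshold]),
                         ("standard articles", data[data >= threshold])):
          if len(pop) < 2:
              continue
          ecdf_y = np.arange(1, len(pop) + 1) / len(pop)       # empirical CDF
          mu, sigma = pop.mean(), pop.std(ddof=1)
          plt.step(pop, ecdf_y, where="post", label=f"ECDF ({label})")
          plt.plot(pop, stats.norm.cdf(pop, mu, sigma), "r-",
                   label=f"normal CDF ({label})")
      plt.xlabel("log10(article size in bytes)")
      plt.ylabel("Cumulative probability")
      plt.title(f"ECDF vs. fitted normal ({lang})")
      plt.legend()
      plt.savefig(f"ecdf_normal_{lang}.png")
      plt.close()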


Share of bot edits

Bots have a significant effect on the number of edits received by Wikipedia. The following figures present the percentage of edits from logged-in users, and the percentage of all edits (including those from anonymous users), that were generated by bots in the top-ten Wikipedias.
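
As a simple illustration, the share of bot edits can be computed from a list of revisions as sketched below. The input format (an iterable of (username, is_anonymous) records) and the externally supplied set of bot account names are assumptions; WikiXRay itself works on the local MediaWiki database.

  # Sketch: bot share of logged-in edits and of all edits (toy data format).
  def bot_edit_share(revisions, bot_accounts):
      total = logged = bot = 0
      for username, is_anonymous in revisions:
          total += 1
          if not is_anonymous:
              logged += 1
              if username in bot_accounts:
                  bot += 1
      share_of_logged = 100.0 * bot / logged if logged else 0.0
      share_of_all = 100.0 * bot / total if total else 0.0
      return share_of_logged, share_of_all

  # Toy example: 2 of 3 logged-in edits and 2 of 4 total edits are made by a bot.
  revs = [("RobotA", False), ("Alice", False), ("127.0.0.1", True), ("RobotA", False)]
  print(bot_edit_share(revs, bot_accounts={"RobotA"}))  # (66.66..., 50.0)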

Correlation between relevant parameters

The following graphics show the relationship between the number of different authors who have contributed to a given article and the article's size (in bytes).
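
A sketch of this correlation analysis is given below, assuming two aligned sequences: the number of distinct authors per article and the article size in bytes. Pearson correlation on log-transformed values and Spearman rank correlation are shown as common choices; the original figures may use a different estimator.

  # Correlation between the number of distinct authors and article size (sketch).
  import numpy as np
  from scipy import stats

  def author_size_correlation(n_authors, sizes_bytes):
      authors = np.asarray(n_authors, dtype=float)
      sizes = np.asarray(sizes_bytes, dtype=float)
      pearson_r, pearson_p = stats.pearsonr(np.log10(authors), np.log10(sizes + 1))
      spearman_r, spearman_p = stats.spearmanr(authors, sizes)
      return {"pearson_log": (pearson_r, pearson_p),
              "spearman": (spearman_r, spearman_p)}

  # Toy example with four articles:
  print(author_size_correlation([1, 3, 10, 40], [512, 2048, 9000, 60000]))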

References

  • Public repositories for research dumps: Rediris mirror, ftp (announce)
  • PhD thesis "Wikipedia: A quantitative analysis", PDF (CC-BY-SA 3.0 license), 2009.