Learning patterns/Tips for reading project codes from pageviews data files

A learning pattern for wiki design
Problem: The pageviews data files use their own project codes, in several formats.
Solution: Use the following tips.
Creator: Akeron
Created on: 15:50, 18 January 2017 (UTC)
Status: DRAFT

What problem does this solve?

The pageviews data files identify Wikimedia projects with their own codes, which come in several formats. If you need to read those files for hundreds of projects, you have to handle the common cases and their exceptions in order to find the actual project URL or the name of the database replica.

What is the solution?

Project codes list

The documentation (Wikistats pageview files, or the one inside the data files) is incomplete and sometimes inaccurate:

  • Wiktionary is d (not k)
  • Wikivoyage is voy (not o or wo)
  • Wikidata is wd
  • MediaWiki is w
  • Foundationwiki is f

I later found more up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews.

I use the following PHP array to map each pageview code to its project group:

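// Pageview project code => project group.
// ($project_groups is a variable name chosen here for illustration.)
$project_groups = array(
    'b'   => 'wikibooks',
    'd'   => 'wiktionary',
    'n'   => 'wikinews',
    'q'   => 'wikiquote',
    's'   => 'wikisource',
    'v'   => 'wikiversity',
    'voy' => 'wikivoyage',
    'z'   => 'wikipedia',
    'wd'  => 'wikidata',
    'w'   => 'mediawiki',
    'f'   => 'foundation',
);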

Special projects

I also use a special projects array for projects that have no language code. In the Labs site table, their site_language is always en. A non-empty value means that the site_group field in the site table differs from the key used in the pageviews data.

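// Special project key in the pageviews data => site_group override.
// An empty string means site_group is identical to the key.
// ($special_projects is a variable name chosen here for illustration.)
$special_projects = array(
    'commons'   => '',
    'meta'      => '',
    'incubator' => '',
    'species'   => '',
    'zero'      => '',
    'outreach'  => '',
    'nostalgia' => '',
    'ten'       => '',
    'wg-en'     => '',
    'beta'      => 'betawikiversity',
    'quality'   => '',
    'strategy'  => '',
    'usability' => '',
    'test'      => '',
    'test2'     => '',
);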

Project codes formats

The data also use a traffic type code: m for mobile, zero for Wikipedia Zero.

The data files use several formats; the keys are separated by dots:

  • [language] (in older files): es = eswiki (desktop)
  • [language].[group/project]: de.s = dewikisource (desktop)
  • [language].[traffic type]: de.m = dewiki (mobile)
  • [language].[traffic type].[group/project]: fr.zero.v = frwikiversity (zero)

For *.wikimedia.org special projects:

  • [project].m: meta.m = metawiki (desktop)
  • [project].[traffic type].m: meta.m.m = metawiki (mobile)

For other special projects:

  • ["www" or traffic type].[project]: www.wd = wikidata (desktop), zero.s = sourceswiki (zero)

Tips:

  • No traffic type means desktop.
  • If there is no [group/project] key, the project is Wikipedia (z).
  • If the second key is m or zero, it is the traffic type; the project is the third key, or Wikipedia if there is none. A trailing m can also mark a special *.wikimedia.org site (for example meta.m); use the special projects list to detect that case.
  • If the first key is m, zero or www, it is the traffic type for some special projects; www is probably desktop only.
  • Older files contain an mw code (e.g. es.mw, fr.mw) with very few hits; you can ignore it.
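
The formats and tips above can be combined into a small parser. The following is a minimal PHP sketch, not a reference implementation: the function name parse_project_code, the variable names and the shape of the returned array are assumptions made for this example, and only the formats listed on this page are handled. Codes that do not match (such as es.mw) return null.

```php
<?php
// Pageview project code => project group (same mapping as earlier on this page).
$groups = array(
    'b' => 'wikibooks', 'd' => 'wiktionary', 'n' => 'wikinews', 'q' => 'wikiquote',
    's' => 'wikisource', 'v' => 'wikiversity', 'voy' => 'wikivoyage',
    'z' => 'wikipedia', 'wd' => 'wikidata', 'w' => 'mediawiki', 'f' => 'foundation',
);
// Keys of the special projects array.
$special = array(
    'commons', 'meta', 'incubator', 'species', 'zero', 'outreach', 'nostalgia',
    'ten', 'wg-en', 'beta', 'quality', 'strategy', 'usability', 'test', 'test2',
);

// Hypothetical helper: split a project code into language, special project,
// group and traffic type. Returns null when the code is not recognised.
function parse_project_code($code, array $groups, array $special)
{
    $traffic = array('m' => 'mobile', 'zero' => 'zero');
    $keys = explode('.', strtolower(trim($code)));
    $n = count($keys);
    $first = $keys[0];
    if ($first === '') {
        return null;
    }
    $result = array('language' => null, 'project' => null, 'group' => null, 'traffic' => 'desktop');

    // Other special projects: www.wd, m.wd, zero.s (first key is www or a traffic type).
    if ($n === 2 && ($first === 'www' || isset($traffic[$first])) && isset($groups[$keys[1]])) {
        $result['group'] = $groups[$keys[1]];
        if ($first !== 'www') {
            $result['traffic'] = $traffic[$first];
        }
        return $result;
    }

    // *.wikimedia.org special projects: meta.m, meta.m.m, meta.zero.m.
    if (in_array($first, $special, true) && $n >= 2 && $keys[$n - 1] === 'm') {
        if ($n > 3 || ($n === 3 && !isset($traffic[$keys[1]]))) {
            return null;
        }
        $result['project'] = $first;
        if ($n === 3) {
            $result['traffic'] = $traffic[$keys[1]];
        }
        return $result;
    }

    // Language projects: es, de.s, de.m, fr.zero.v.
    $rest = array_slice($keys, 1);
    if ($rest && isset($traffic[$rest[0]])) {
        $result['traffic'] = $traffic[array_shift($rest)];
    }
    $group = $rest ? array_shift($rest) : 'z'; // no group key means Wikipedia
    if ($rest || !isset($groups[$group])) {
        return null; // unknown layout or group, e.g. es.mw
    }
    $result['language'] = $first;
    $result['group'] = $groups[$group];
    return $result;
}
```

For example, parse_project_code('fr.zero.v', $groups, $special) yields language fr, group wikiversity and traffic zero. The special-project checks run before the language check because the same key (m or zero) can be a traffic type or part of a special project code, depending on its position.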

Sample project codes from the pageviews data file pagecounts-2017-01-01.bz2, with the total number of rows for each code.
French projects:

fr.b 3127
fr.d 69553
fr.m 788866
fr.m.b 2306
fr.m.d 73131
fr.m.n 520
fr.m.q 1128
fr.m.s 7476
fr.m.v 2232
fr.m.voy 507
fr.n 976
fr.q 2280
fr.s 30344
fr.v 3769
fr.voy 1151
fr.z 1130938
fr.zero 75286
fr.zero.b 360
fr.zero.d 3728
fr.zero.n 66
fr.zero.q 96
fr.zero.s 441
fr.zero.v 399
fr.zero.voy 49

Classic special projects:

commons.m 1350704
commons.m.m 352237
commons.zero.m 35679
meta.m 12029
meta.m.m 5701
meta.zero.m 1117
outreach.m 848
outreach.m.m 163
outreach.zero.m 22
species.m 11516
species.m.m 6520
species.zero.m 149

Other special projects:

www.f 1818
www.s 2067
www.w 13826
www.wd 97219
m.f 857
m.s 1055
m.w 4337
m.wd 44163
zero.f 477
zero.s 141
zero.w 1164
zero.wd 1126

It can be useful to link the pageview code to the corresponding site_global_key in the Labs site table (which also contains the URLs needed to link to articles). The site_global_key value is also used as the database name.

Sample columns from site table for French sites:

+-----------------+-----------+-------------+-------------+---------------+
| site_global_key | site_type | site_group  | site_source | site_language |
+-----------------+-----------+-------------+-------------+---------------+
| frwiki          | mediawiki | wikipedia   | local       | fr            |
| frwiktionary    | mediawiki | wiktionary  | local       | fr            |
| frwikibooks     | mediawiki | wikibooks   | local       | fr            |
| frwikinews      | mediawiki | wikinews    | local       | fr            |
| frwikiquote     | mediawiki | wikiquote   | local       | fr            |
| frwikisource    | mediawiki | wikisource  | local       | fr            |
| frwikiversity   | mediawiki | wikiversity | local       | fr            |
| frwikivoyage    | mediawiki | wikivoyage  | local       | fr            |
+-----------------+-----------+-------------+-------------+---------------+

You should be able to find the global key from the group array and the language; for special projects the language is always en. There are some exceptions:

  • International Wikisource uses sources, not wikisource, for site_group.
  • beta becomes betawikiversity in site_group.
  • The be-tarask language uses be-x-old for site_language.
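
The lookup and its exceptions can be sketched as follows. This is only an illustration under assumptions: the function name site_lookup and its parameters are hypothetical, the table name sites in the query string is assumed, and the function only returns the site_group and site_language values to search for rather than querying a database.

```php
<?php
// Hypothetical helper: return the (site_group, site_language) pair to look up
// in the site table, applying the exceptions listed above.
//   $language: language key from the pageview code, or null if there is none
//   $group:    project group (e.g. 'wiktionary'), or null for a special project
//   $project:  special project key (e.g. 'meta', 'beta'), or null
function site_lookup($language, $group, $project = null)
{
    // Special projects such as commons or meta: site_language is always en.
    if ($project !== null) {
        $overrides = array('beta' => 'betawikiversity');
        $site_group = isset($overrides[$project]) ? $overrides[$project] : $project;
        return array('site_group' => $site_group, 'site_language' => 'en');
    }

    // Projects without a language (www.wd, zero.s, ...): also en.
    if ($language === null) {
        // International Wikisource uses "sources" as site_group.
        $site_group = ($group === 'wikisource') ? 'sources' : $group;
        return array('site_group' => $site_group, 'site_language' => 'en');
    }

    // be-tarask is stored as be-x-old in site_language.
    $site_language = ($language === 'be-tarask') ? 'be-x-old' : $language;
    return array('site_group' => $group, 'site_language' => $site_language);
}

// Query to run with the returned pair (table name "sites" is an assumption).
$sql = 'SELECT site_global_key FROM sites WHERE site_group = ? AND site_language = ?';
```

With the sample table above, site_lookup('fr', 'wiktionary') gives the pair (wiktionary, fr), which matches the frwiktionary row.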

Things to consider

  • Some older pageviews files contain uppercase letters in project codes; always convert codes to lowercase when reading.
  • Older files sometimes contain private projects such as arbcom-en.z.
  • Some values are not filtered: for example, the file for 2017-01-01 contains a line starting with which_of_the_following_nicknames.z. Checking whether the project exists in the Labs site table is a good way to filter such lines out.
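
As a rough illustration of the first and last points, the sketch below counts rows per project code, lowercasing each code and skipping codes that are not known. It rests on assumptions: the function name count_rows_per_code is hypothetical, fields are assumed to be space-separated with the project code first, the sample lines are made up, and $known_codes stands in for the set of codes whose project was found in the site table.

```php
<?php
// Hypothetical helper: count rows per project code.
// Assumption: fields are space-separated and the first field is the project code.
function count_rows_per_code(array $lines, array $known_codes)
{
    $known = array_flip($known_codes); // code => index, for fast isset() checks
    $totals = array();
    foreach ($lines as $line) {
        $fields = explode(' ', trim($line), 2);
        $code = strtolower($fields[0]); // older files may contain uppercase
        if (!isset($known[$code])) {
            continue; // unknown or unfiltered value
        }
        $totals[$code] = isset($totals[$code]) ? $totals[$code] + 1 : 1;
    }
    return $totals;
}

// Made-up sample lines, for illustration only.
$lines = array(
    'fr.z Paris 10 1000',
    'FR.z Lyon 3 300',
    'which_of_the_following_nicknames.z x 1 10',
);
$totals = count_rows_per_code($lines, array('fr.z', 'fr.d'));
```

Here both fr.z and FR.z are counted under fr.z, and the unfiltered line is dropped because its code is not in the known list.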
