Learning patterns/Tips for reading project codes from pageviews data files
What problem does this solve?
The pageviews data files use their own Wikimedia project codes, in several formats. If you need to read those files for hundreds of projects, you have to handle most cases and exceptions in order to find the actual project URL or the name of the database replica.
What is the solution?
Project codes list
The documentation (Wikistats pageview files, or the notes inside the data files) is incomplete and sometimes inaccurate:
- Wiktionary is d (not k)
- Wikivoyage is voy (not o or wo)
- Wikidata is wd
- Mediawiki is w
- Foundationwiki is f
Later, I found more up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews.
I use the following PHP array to map each pageview code to its project group:
array(
'b'=>'wikibooks',
'd'=>'wiktionary',
'n'=>'wikinews',
'q'=>'wikiquote',
's'=>'wikisource',
'v'=>'wikiversity',
'voy'=>'wikivoyage',
'z'=>'wikipedia',
'wd'=>'wikidata',
'w'=>'mediawiki',
'f'=>'foundation',
);
Special projects
I also use a special projects array for the projects that don't have a language code. In the Labs sites table, their site_language is always en. A non-empty value means that the site_group field in the site table differs from the key used in the pageviews data.
array(
'commons'=>'',
'meta'=>'',
'incubator'=>'',
'species'=>'',
'zero'=>'',
'outreach'=>'',
'nostalgia'=>'',
'ten'=>'',
'wg-en'=>'',
'beta'=>'betawikiversity',
'quality'=>'',
'strategy'=>'',
'usability'=>'',
'test'=>'',
'test2'=>'',
);
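The empty-value convention of this array can be read as "the site_group is the same as the key". A minimal sketch of that lookup follows; the function name special_site_group() is hypothetical and the array is abridged to three entries.

```php
<?php
// Sketch only: special_site_group() is a hypothetical helper name.
// Abridged copy of the special projects array above.
$specialProjects = [
    'commons' => '',
    'meta'    => '',
    'beta'    => 'betawikiversity',
];

/**
 * Return the site_group to look for in the site table,
 * or null when $key is not a special project.
 */
function special_site_group(array $specialProjects, string $key): ?string
{
    if (!array_key_exists($key, $specialProjects)) {
        return null;
    }
    // An empty value means site_group is identical to the pageviews key.
    return $specialProjects[$key] !== '' ? $specialProjects[$key] : $key;
}
```

For example, special_site_group($specialProjects, 'beta') gives betawikiversity, while 'meta' maps to itself.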
Project codes formats
The data also uses a traffic type code: m for mobile, zero for Wikipedia Zero.
The data files use several formats; the keys are separated by a dot:
- [language] (in older files)
  e.g. es = eswiki (desktop)
- [language].[group/project]
  e.g. de.s = dewikisource (desktop)
- [language].[traffic type]
  e.g. de.m = dewiki (mobile)
- [language].[traffic type].[group/project]
  e.g. fr.zero.v = frwikiversity (zero)
For *.wikimedia.org special projects:
- [project].m
  e.g. meta.m = metawiki (desktop)
- [project].[traffic type].m
  e.g. meta.m.m = metawiki (mobile)
For other special projects:
- ["www" or traffic type].[project]
  e.g. www.wd = wikidata (desktop), zero.s = sourceswiki (zero)
Tips:
- No value for traffic type means desktop.
- If there is no [group/project], it is Wikipedia (z).
- If the second key is m or zero, it is the traffic type; the project is the third key, or Wikipedia if there is none. The m can also denote a special *.wikimedia.org site; you can use the special projects list to detect it.
- If the first key is m, zero or www, it is the traffic type for some special projects; www is probably for desktop only.
- In older files there is a mw code (e.g. es.mw, fr.mw) with very few hits; you can ignore it.
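The formats and tips above can be combined into a single parsing routine. The following is only a sketch under stated assumptions: the function name parse_project_code() and the shape of its return value are hypothetical, and the two lookup lists are copied from the arrays earlier on this page. Codes that match none of the formats (such as es.mw) return null.

```php
<?php
// Sketch only: parse_project_code() and its return shape are hypothetical.
function parse_project_code(string $code): ?array
{
    // Group codes, copied from the mapping array earlier on this page.
    $groups = [
        'b' => 'wikibooks', 'd' => 'wiktionary', 'n' => 'wikinews',
        'q' => 'wikiquote', 's' => 'wikisource', 'v' => 'wikiversity',
        'voy' => 'wikivoyage', 'z' => 'wikipedia', 'wd' => 'wikidata',
        'w' => 'mediawiki', 'f' => 'foundation',
    ];
    // Keys of the special projects array (*.wikimedia.org sites).
    $special = ['commons', 'meta', 'incubator', 'species', 'zero', 'outreach',
        'nostalgia', 'ten', 'wg-en', 'beta', 'quality', 'strategy',
        'usability', 'test', 'test2'];
    $traffic = ['m' => 'mobile', 'zero' => 'zero'];

    // Older files can contain uppercase, so normalise first.
    $keys = explode('.', strtolower(trim($code)));
    $n = count($keys);
    $first = $keys[0];
    if ($first === '') {
        return null;
    }

    // [project].m and [project].[traffic type].m (meta.m, meta.m.m, meta.zero.m)
    if (in_array($first, $special, true) && $keys[$n - 1] === 'm') {
        if ($n === 2) {
            return ['language' => null, 'project' => $first, 'traffic' => 'desktop'];
        }
        if ($n === 3 && isset($traffic[$keys[1]])) {
            return ['language' => null, 'project' => $first, 'traffic' => $traffic[$keys[1]]];
        }
        return null;
    }

    // ["www" or traffic type].[project] (www.wd, m.f, zero.s)
    if ($first === 'www' || isset($traffic[$first])) {
        if ($n === 2 && isset($groups[$keys[1]])) {
            return ['language' => null, 'project' => $groups[$keys[1]],
                'traffic' => $traffic[$first] ?? 'desktop'];
        }
        return null;
    }

    // Everything else is treated as starting with a language code.
    if ($n === 1) {
        return ['language' => $first, 'project' => 'wikipedia', 'traffic' => 'desktop'];
    }
    if ($n === 2 && isset($traffic[$keys[1]])) {
        return ['language' => $first, 'project' => 'wikipedia', 'traffic' => $traffic[$keys[1]]];
    }
    if ($n === 2 && isset($groups[$keys[1]])) {
        return ['language' => $first, 'project' => $groups[$keys[1]], 'traffic' => 'desktop'];
    }
    if ($n === 3 && isset($traffic[$keys[1]]) && isset($groups[$keys[2]])) {
        return ['language' => $first, 'project' => $groups[$keys[2]], 'traffic' => $traffic[$keys[1]]];
    }
    return null; // unknown format, e.g. es.mw
}
```

The sketch does not check that the first key is a real language code, so a line such as which_of_the_following_nicknames.z still parses; that check is left to a later lookup against the list of known sites.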
Sample project codes from the pageviews data file pagecounts-2017-01-01.bz2, with the total number of rows for each code:
French projects:
fr.b 3127
fr.d 69553
fr.m 788866
fr.m.b 2306
fr.m.d 73131
fr.m.n 520
fr.m.q 1128
fr.m.s 7476
fr.m.v 2232
fr.m.voy 507
fr.n 976
fr.q 2280
fr.s 30344
fr.v 3769
fr.voy 1151
fr.z 1130938
fr.zero 75286
fr.zero.b 360
fr.zero.d 3728
fr.zero.n 66
fr.zero.q 96
fr.zero.s 441
fr.zero.v 399
fr.zero.voy 49
Classic special projects:
commons.m 1350704
commons.m.m 352237
commons.zero.m 35679
meta.m 12029
meta.m.m 5701
meta.zero.m 1117
outreach.m 848
outreach.m.m 163
outreach.zero.m 22
species.m 11516
species.m.m 6520
species.zero.m 149
Other special projects:
www.f 1818
www.s 2067
www.w 13826
www.wd 97219
m.f 857
m.s 1055
m.w 4337
m.wd 44163
zero.f 477
zero.s 141
zero.w 1164
zero.wd 1126
Link to Labs sites table
It can be useful to link the pageview code to the corresponding site_global_key from the Labs sites table (which also contains the URL needed to link to articles). The site_global_key value is also used for database names.
Sample columns from site table for French sites:
+-----------------+-----------+-------------+-------------+---------------+
| site_global_key | site_type | site_group  | site_source | site_language |
+-----------------+-----------+-------------+-------------+---------------+
| frwiki          | mediawiki | wikipedia   | local       | fr            |
| frwiktionary    | mediawiki | wiktionary  | local       | fr            |
| frwikibooks     | mediawiki | wikibooks   | local       | fr            |
| frwikinews      | mediawiki | wikinews    | local       | fr            |
| frwikiquote     | mediawiki | wikiquote   | local       | fr            |
| frwikisource    | mediawiki | wikisource  | local       | fr            |
| frwikiversity   | mediawiki | wikiversity | local       | fr            |
| frwikivoyage    | mediawiki | wikivoyage  | local       | fr            |
+-----------------+-----------+-------------+-------------+---------------+
You should be able to find the global key from the group array and the language; for special projects the language is always en. There are some exceptions:
- International Wikisource uses sources, not wikisource, for site_group.
- beta becomes betawikiversity in site_group.
- Language be-tarask uses be-x-old for site_language.
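The lookup and its three exceptions could be sketched as below. Both function names are hypothetical, and the in-memory $sites array (a few rows copied from the sample table above) stands in for a real query on the site table.

```php
<?php
// Sketch only: site_lookup_fields() and find_global_key() are hypothetical helpers.

/**
 * Translate a pageviews project into the site_group and site_language values
 * to search for. $language is null for projects without a language code.
 */
function site_lookup_fields(?string $language, string $project): array
{
    if ($language === null) {
        // Exceptions for special projects; their site_language is always "en".
        $groupOverrides = ['wikisource' => 'sources', 'beta' => 'betawikiversity'];
        return [
            'site_group'    => $groupOverrides[$project] ?? $project,
            'site_language' => 'en',
        ];
    }
    // Exception for the language code itself.
    $languageAliases = ['be-tarask' => 'be-x-old'];
    return [
        'site_group'    => $project,
        'site_language' => $languageAliases[$language] ?? $language,
    ];
}

/** Return the site_global_key of the first matching row, or null if none matches. */
function find_global_key(array $sites, string $group, string $language): ?string
{
    foreach ($sites as $row) {
        if ($row['site_group'] === $group && $row['site_language'] === $language) {
            return $row['site_global_key'];
        }
    }
    return null;
}

// Rows copied from the sample table; a real run would read the site table instead.
$sites = [
    ['site_global_key' => 'frwiki', 'site_group' => 'wikipedia', 'site_language' => 'fr'],
    ['site_global_key' => 'frwikisource', 'site_group' => 'wikisource', 'site_language' => 'fr'],
    ['site_global_key' => 'frwikivoyage', 'site_group' => 'wikivoyage', 'site_language' => 'fr'],
];

$fields = site_lookup_fields('fr', 'wikisource');
$globalKey = find_global_key($sites, $fields['site_group'], $fields['site_language']);
```

A null result from find_global_key() means no site matched, which is also a convenient signal for discarding unknown or unfiltered project codes.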
Things to consider
- Some older pageviews files contain uppercase letters in project codes; always convert codes to lowercase when reading.
- Older files sometimes contain private projects such as arbcom-en.z.
- Some values are not filtered; for example, the file for 2017-01-01 contains a line starting with which_of_the_following_nicknames.z. Checking whether the project exists in the Labs sites table is a good way to filter them out.
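These precautions can be applied while reading the file. The sketch below lowercases each code, drops codes that a caller-supplied check rejects, and counts the remaining rows per code. The function name count_rows_by_code() and the $isKnown callback are hypothetical. Only the first space-separated field of each line is read, and the other fields in the sample lines are made-up placeholders.

```php
<?php
// Sketch only: count_rows_by_code() and the $isKnown callback are hypothetical.

/**
 * Count rows per project code. Codes are lowercased, and codes rejected by
 * $isKnown (for example because no matching site exists) are skipped.
 */
function count_rows_by_code(iterable $lines, callable $isKnown): array
{
    $counts = [];
    foreach ($lines as $line) {
        // The project code is the first space-separated field of the line.
        $code = strtolower(explode(' ', trim($line), 2)[0]);
        if ($code === '' || !$isKnown($code)) {
            continue;
        }
        $counts[$code] = ($counts[$code] ?? 0) + 1;
    }
    return $counts;
}

// Stand-in for a lookup in the Labs sites table: accept only known languages.
$knownLanguages = ['fr', 'es'];
$isKnown = function (string $code) use ($knownLanguages): bool {
    return in_array(explode('.', $code)[0], $knownLanguages, true);
};

// Made-up sample lines; everything after the code is a placeholder.
$lines = [
    'fr.z Paris 10 0',
    'FR.Z Lyon 3 0',
    'which_of_the_following_nicknames.z x 1 0',
    'arbcom-en.z Main_Page 1 0',
];

$counts = count_rows_by_code($lines, $isKnown);
```

Because the function accepts any iterable, the lines can come from a generator that streams the decompressed file rather than from an array held in memory.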
When to use
- To compute pageview statistics for many projects using all the data. If you only need some pages or the most viewed pages, use the Analytics/PageviewAPI instead.
- Used by http://wikiscan.org.
Endorsements
See also
- https://dumps.wikimedia.org/other/analytics/ (Analytics Datasets)
- https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews (the more up-to-date documentation)
- https://meta.wikimedia.org/wiki/Research:Page_view