Research:Prioritization of Wikipedia Articles/Importance/WikiProjects

Prior research has examined how WikiProjects on English Wikipedia assess article importance and how "predictable" these assessments are from basic features of the article, such as how many other articles link to it or the pageviews it receives. This page has two goals:

  • Present a simple approach for data collection around importance assessments
  • Provide some basic insights into how "contextual" an importance assessment is -- i.e. how much does it depend on the particular WikiProject

Code: https://github.com/geohci/wiki-prioritization/tree/master/wikiproject_importance

Gathering data

Past research gives an excellent overview of the particulars of the article importance scale and notes some of the challenges of gathering this data. Luckily, this importance data is now collated in a much more structured format in the page_assessments table for English Wikipedia (and a few other languages). As such, it is a much more straightforward query to gather importance assessments for all articles in a given wiki:

Get Wikipedia articles tagged with WikiProject templates
  -- One row per (article, WikiProject) importance assessment
  SELECT
    pa.pa_page_id AS article_pid,
    pap.pap_project_title AS wp_template,
    p.page_latest AS article_revid,
    p.page_title AS title,
    ptalk.page_id AS talk_pid,
    ptalk.page_latest AS talk_revid,
    pa.pa_importance AS importance
  FROM page_assessments pa
  -- Look up the WikiProject behind each assessment
  INNER JOIN page_assessments_projects pap
    ON (pa.pa_project_id = pap.pap_project_id)
  -- Keep only non-redirect pages in the article namespace (0)
  INNER JOIN page p
    ON (pa.pa_page_id = p.page_id
        AND p.page_namespace = 0
        AND p.page_is_redirect = 0)
  -- Attach the matching talk page (namespace 1, same title)
  INNER JOIN page ptalk
    ON (p.page_title = ptalk.page_title
        AND ptalk.page_namespace = 1)

The raw assessments then have to be standardized, because the table does not restrict which values are accepted. Specifically, I do the following with the data:

  • I drop the following values:
    • Unknown, NA, na
  • And map the rest to four categories:
    • Low: Low, low, Bottom, Related
    • Mid: Mid, mid
    • High: High, high
    • Top: Top, top
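
The mapping above can be sketched in Python as follows. This is a minimal illustration rather than the code from the linked repository: the names `DROPPED`, `RAW_TO_LEVEL`, and `standardize_importance` are hypothetical, and the handling of missing values, surrounding whitespace, and labels not listed above (all dropped or trimmed here) is an assumption.

```python
from typing import Optional

# Raw labels that are dropped entirely (from the list above).
DROPPED = {"Unknown", "NA", "na"}

# Raw labels mapped to the four standardized categories (from the list above).
RAW_TO_LEVEL = {
    "Low": "Low", "low": "Low", "Bottom": "Low", "Related": "Low",
    "Mid": "Mid", "mid": "Mid",
    "High": "High", "high": "High",
    "Top": "Top", "top": "Top",
}


def standardize_importance(raw: Optional[str]) -> Optional[str]:
    """Return Low/Mid/High/Top, or None if the raw value should be dropped."""
    if raw is None:
        return None  # assumption: missing values are dropped
    value = raw.strip()  # assumption: surrounding whitespace is not meaningful
    if value in DROPPED:
        return None
    # Assumption: any label not listed above is also dropped.
    return RAW_TO_LEVEL.get(value)


print(standardize_importance("Bottom"))   # Low
print(standardize_importance("Unknown"))  # None
```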

Results

For English Wikipedia, we get the following counts of how many times each assessment level appeared (articles can have multiple assessments):

  • Low: 3,687,536 assessments (79%)
  • Mid: 751,592 assessments (16%)
  • High: 174,025 assessments (4%)
  • Top: 40,411 assessments (1%)
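
As a quick check, the percentages above can be recomputed from the raw counts. The snippet below is only a sketch of that arithmetic; the variable names are illustrative.

```python
# Assessment counts as reported in the list above.
counts = {"Low": 3_687_536, "Mid": 751_592, "High": 174_025, "Top": 40_411}

total = sum(counts.values())  # total number of importance assessments
shares = {level: n / total for level, n in counts.items()}

for level, share in shares.items():
    # Print each level's share of all assessments, rounded to a whole percent.
    print(f"{level}: {share:.0%}")
```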

As one can see, the large majority of assessments are Low importance, and Top-importance assessments are quite rare.

Importance by Topic

Below is data on the distribution of importance assessments per article on English Wikipedia, split by ORES taxonomy topic. Note that an article's topics are based on which WikiProjects have tagged it, not on topics predicted by a classification model. Many articles fall under multiple topics and are counted under each. A single WikiProject can only contribute a single assessment per article, but multiple WikiProjects might tag an article and provide assessments (example). The columns are:

  • # articles: number of articles that are part of the topic on English Wikipedia (the rest of the values are proportions of this number)
  • no assess.: proportion of articles that don't have a single importance assessment
  • single assess.: proportion of articles with a single importance assessment
  • mult assess.: proportion of articles with more than one importance assessment
  • agreed: multiple assessments but they all were the same level -- e.g., Low and Low
  • adjacent: multiple assessments but they were adjacent levels -- e.g., Low and Mid
  • two steps: multiple assessments that were two steps away -- e.g., Low and High
  • full: multiple assessments that were the full range -- i.e. Low and Top
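
One way to assign an article to the categories above is to rank the four levels and take the gap between the highest and lowest assessment. The sketch below assumes the standardized Low/Mid/High/Top labels and, for articles with three or more assessments, assumes the category is determined by the max-min range; `LEVEL_RANK`, `SPREAD_LABELS`, and `assessment_category` are hypothetical names, not taken from the linked code.

```python
from typing import List

# Ordinal position of each standardized importance level.
LEVEL_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}

# Gap between highest and lowest level -> column name used in the table.
SPREAD_LABELS = {0: "agreed", 1: "adjacent", 2: "two steps", 3: "full"}


def assessment_category(levels: List[str]) -> str:
    """Classify one article from its list of standardized assessments."""
    if not levels:
        return "no assess."
    if len(levels) == 1:
        return "single assess."
    ranks = [LEVEL_RANK[level] for level in levels]
    # Assumption: with 3+ assessments, only the overall range matters.
    return SPREAD_LABELS[max(ranks) - min(ranks)]


print(assessment_category(["Low", "Mid"]))         # adjacent
print(assessment_category(["Low", "Low", "Top"]))  # full
```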

From this table, we can see that topics like Society, Philosophy/Religion, and History are most likely to show a wider range of importance assessments, but this seems to arise from articles in those topics being more likely to be tagged by multiple WikiProjects. It is also very rare that articles tagged with multiple WikiProjects have consistent assessments of importance -- they are most likely to have adjacent assessments such as Mid and High. Finally, it's clear that articles tagged as Low importance (79% of assessments) are very unlikely to be tagged by multiple WikiProjects -- i.e. they tend to have a narrow scope -- because otherwise we would expect much greater agreement between WikiProjects tagging the same article.

Topic # articles no assess. single assess. mult assess. agreed adjacent two steps full
History and Society.Society 64148 0.185758 0.565302 0.24894 0.000452 0.170434 0.058724 0.01933
Culture.Philosophy and religion 158108 0.09546 0.696429 0.208111 0.000829 0.150935 0.042578 0.013769
Geography.Regions.Africa.Central Africa 11458 0.266539 0.533165 0.200297 0.000175 0.155263 0.034561 0.010298
History and Society.History 173010 0.170921 0.629796 0.199283 0.000329 0.148043 0.040298 0.010612
STEM.Medicine & Health 76955 0.107738 0.69846 0.193802 0.000195 0.148554 0.038022 0.00703
Culture.Media.Software 23391 0.278098 0.530033 0.191869 0.001069 0.141422 0.043265 0.006113
Geography.Regions.Africa.Eastern Africa 35776 0.266547 0.545114 0.188339 -- 0.136432 0.041564 0.010342
Culture.Visual arts.Architecture 167657 0.076853 0.735901 0.187245 0.004754 0.148446 0.030556 0.003489
STEM.Physics 19164 0.013671 0.802233 0.184095 0.015028 0.126905 0.035796 0.006366
STEM.Mathematics 23688 0.067671 0.74886 0.183468 0.078141 0.07717 0.023092 0.005066
STEM.Libraries & Information 9965 0.080983 0.738083 0.180933 0.000803 0.138083 0.033718 0.008329
Culture.Visual arts.Visual arts* 280611 0.158746 0.694171 0.147083 0.007387 0.112248 0.02409 0.003357
History and Society.Education 98776 0.151282 0.701982 0.146736 0.001002 0.107395 0.032467 0.005872
Geography.Regions.Africa.Africa* 147027 0.278507 0.575037 0.146456 0.000109 0.111102 0.027886 0.007359
Geography.Regions.Africa.Southern Africa 25755 0.226286 0.627412 0.146302 0.000078 0.123199 0.01821 0.004815
STEM.Earth and environment 80403 0.077049 0.78063 0.142321 0.000211 0.107222 0.029912 0.004975
Geography.Geographical 313358 0.073446 0.785226 0.141327 0.000109 0.116123 0.021123 0.003973
Geography.Regions.Africa.Northern Africa 30895 0.210714 0.650688 0.138598 0.000356 0.103447 0.0268 0.007995
Culture.Literature 203361 0.183934 0.678193 0.137873 0.006402 0.107449 0.019812 0.004209
Geography.Regions.Asia.Central Asia 11776 0.115489 0.746858 0.137653 -- 0.096637 0.033033 0.007982
Geography.Regions.Asia.South Asia 241528 0.087133 0.77827 0.134597 0.000248 0.106712 0.024001 0.003635
Culture.Visual arts.Comics and Anime 37093 0.018818 0.847222 0.133961 0.03405 0.081093 0.015879 0.002939
Culture.Biography.Women 263244 0.211568 0.660893 0.127539 0.000874 0.107334 0.016578 0.002754
Culture.Visual arts.Fashion 12653 0.257251 0.61685 0.125899 0.000079 0.096183 0.025844 0.003794
Geography.Regions.Asia.North Asia 85780 0.059827 0.818163 0.12201 0.000455 0.096258 0.021042 0.004255
Culture.Media.Books 60291 0.226137 0.656682 0.117182 0.001028 0.097295 0.016553 0.002305
Geography.Regions.Oceania 241103 0.040215 0.84276 0.117025 0.000174 0.099119 0.015782 0.001949
Geography.Regions.Europe.Northern Europe 429847 0.136248 0.748145 0.115606 0.000384 0.095529 0.016859 0.002834
Geography.Regions.Africa.Western Africa 36403 0.358899 0.526495 0.114606 0.000027 0.082933 0.024861 0.006785
STEM.Space 33298 0.043997 0.843474 0.112529 0.003814 0.083458 0.021293 0.003964
History and Society.Military and warfare 213470 0.365293 0.525994 0.108713 0.000309 0.082316 0.021802 0.004286
Culture.Media.Television 116103 0.2425 0.649303 0.108197 0.003988 0.078473 0.022015 0.003721
Geography.Regions.Asia.Asia* 785879 0.13687 0.761822 0.101308 0.000328 0.081384 0.01684 0.002756
Culture.Internet culture 48879 0.055648 0.843859 0.100493 0.007181 0.069969 0.020029 0.003314
STEM.Chemistry 29487 0.131109 0.769661 0.09923 0.000441 0.074677 0.020958 0.003154
Geography.Regions.Asia.Southeast Asia 92044 0.140009 0.762385 0.097605 0.000261 0.080364 0.014776 0.002205
Geography.Regions.Americas.South America 102947 0.142996 0.759478 0.097526 0.000204 0.079439 0.01526 0.002623
Geography.Regions.Asia.East Asia 180427 0.146369 0.757276 0.096355 0.000571 0.078985 0.014787 0.002012
Culture.Performing arts 41368 0.294068 0.610689 0.095243 0.000266 0.073414 0.017356 0.004206
Geography.Regions.Europe.Europe* 1197367 0.174059 0.731506 0.094435 0.000408 0.078784 0.012933 0.00231
STEM.Engineering 84612 0.410261 0.496797 0.092942 0.000461 0.073595 0.016109 0.002777
Geography.Regions.Europe.Eastern Europe 267187 0.134206 0.775464 0.09033 0.000389 0.077137 0.010816 0.001987
Culture.Sports 933564 0.18111 0.729999 0.088891 0.001055 0.074605 0.011495 0.001735
History and Society.Transportation 223236 0.315558 0.595921 0.088521 0.000287 0.074316 0.012435 0.001483
Geography.Regions.Europe.Western Europe 304552 0.209015 0.706014 0.084971 0.000581 0.0712 0.011253 0.001937
Geography.Regions.Europe.Southern Europe 209630 0.24216 0.673978 0.083862 0.000234 0.066641 0.013748 0.003239
Culture.Biography.Biography* 1838864 0.339359 0.579076 0.081565 0.001282 0.065869 0.012143 0.00227
STEM.STEM* 880243 0.107787 0.811999 0.080214 0.002273 0.061126 0.014377 0.002438
All Articles 5890819 0.292323 0.631763 0.075915 0.000895 0.060812 0.012164 0.002044
Culture.Media.Video games 37133 0.004632 0.927288 0.06808 0.009183 0.045539 0.011607 0.00175
Culture.Media.Media* 924129 0.366187 0.567583 0.06623 0.001308 0.05088 0.011902 0.002139
Geography.Regions.Asia.West Asia 182848 0.22406 0.710175 0.065765 0.000175 0.051261 0.011583 0.002745
Culture.Media.Radio 29169 0.219685 0.726696 0.053619 0.000137 0.041962 0.009634 0.001886
Culture.Media.Music 389662 0.431751 0.518726 0.049522 0.000252 0.038626 0.00879 0.001855
Culture.Media.Films 242294 0.429751 0.523731 0.046518 0.001284 0.035527 0.008052 0.001655
Culture.Linguistics 98242 0.745934 0.214043 0.040024 0.000234 0.028002 0.010047 0.001741
STEM.Biology 462863 0.013555 0.94666 0.039785 0.000065 0.032489 0.00619 0.001041
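
Because topic names can contain spaces, the rows above are easiest to parse by splitting the eight numeric fields off the right-hand side. The following sketch does that and checks that the agreed, adjacent, two steps, and full columns add up to the mult assess. column for one row. The column keys and `parse_row` are hypothetical names, and treating "--" as a missing value that counts as zero in the sum is an assumption.

```python
from typing import Dict

# Hypothetical keys for the eight numeric columns, in table order.
COLUMNS = ["n_articles", "no_assess", "single_assess", "mult_assess",
           "agreed", "adjacent", "two_steps", "full"]


def parse_row(line: str) -> Dict[str, object]:
    """Parse one whitespace-separated table row into a dict."""
    # Split from the right so spaces inside the topic name are preserved.
    parts = line.rsplit(maxsplit=len(COLUMNS))
    row: Dict[str, object] = {"topic": parts[0]}
    for name, raw in zip(COLUMNS, parts[1:]):
        if raw == "--":
            row[name] = None  # assumption: "--" marks a missing value
        elif name == "n_articles":
            row[name] = int(raw)
        else:
            row[name] = float(raw)
    return row


row = parse_row("History and Society.Society 64148 0.185758 0.565302 "
                "0.24894 0.000452 0.170434 0.058724 0.01933")
# Missing values count as zero when summing the breakdown columns.
breakdown = sum(row[c] or 0.0 for c in ["agreed", "adjacent", "two_steps", "full"])
print(row["topic"], row["mult_assess"], round(breakdown, 6))
```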