Research:Knowledge Gaps Index/Measurement/Content

Content Gap Metrics

Each row lists the dimension's facet, the knowledge gap (e.g. gender), how the gap is measured, whether the measurement is currently implemented, and how often it is made.

| Facet | Gap | Metric | Currently measured? | Frequency |
|-------|-----|--------|---------------------|-----------|
| Representation | Gender | Time series of content gap metrics over gender mappings | Yes, datasets / notebook available | Monthly |
| Representation | Age | Time series of content gap metrics over time mappings | Yes, datasets available | Monthly |
| Representation | Geography | Time series of content gap metrics over geographic mappings | Yes, datasets / notebook available | Monthly |
| Representation | Language | More research is planned to measure this gap | In progress, repo | Monthly (ideal) |
| Representation | Socio-economic Status | More research is planned to measure this gap | No | Monthly (ideal) |
| Representation | Cultural Background | More research is planned to measure this gap | No | Monthly (ideal) |
| Representation | Topics for Impact | More research is planned to measure this gap | No | Monthly (ideal) |
| Representation | Sexual Orientation | Time series of content gap metrics over sexual orientation mappings | Yes, datasets available | Monthly |
| Interaction | Readability | In progress: follow along our research on multilingual readability | No | Monthly (ideal) |
| Interaction | Structured Data | In progress: follow along our research on Wikidata item quality | No | Monthly (ideal) |
| Interaction | Multimedia | Time series of content gap metrics over multimedia mappings | Yes, datasets available | Monthly |

Metrics for Aggregation


As shown by previous research, content coverage (how well Wikimedia project content addresses a particular topic) can be described in different ways. In the content gap metrics, we operationalize two dimensions of content coverage:

  • Selection: whether the content is present or not
  • Extent: how much content the topic has overall, i.e. its quality

The gap in each wiki is described according to seven metrics:

  • article_created: the number of articles created in each category, which reflects the selection dimension of the content gap
  • pageviews_sum: the total number of pageviews for each category, which reflects selection from the readers' perspective
  • pageviews_mean: the mean number of pageviews for each category (see above)
  • revision_count: the total number of edits for each category, which reflects selection from the editors' perspective
  • quality_score: the average article quality score for each category, which reflects the extent dimension of the content gap
  • standard_quality_count: the number of articles in the category that satisfy the Standard Quality Criteria
  • standard_quality: the average of the standard quality score; as the standard quality is binary, this is the ratio of articles that satisfy the Standard Quality Criteria

For the article quality metrics, which are content based, the quality is calculated from the last revision to an article in a given month. If an article was not edited in a given month, the score from the previous month is used for aggregation.
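The per-category aggregation of these metrics can be sketched in Python. This is a minimal illustration, not the actual pipeline; the record keys are hypothetical, not the dataset schema:

```python
def content_gap_metrics(articles):
    """Aggregate per-article records into the per-category metrics.

    `articles` is a list of dicts with hypothetical keys:
    'pageviews', 'revisions', 'quality_score', and 'standard_quality'
    (binary: 1 if the article meets the Standard Quality Criteria).
    """
    n = len(articles)
    pageviews = [a["pageviews"] for a in articles]
    standard = [a["standard_quality"] for a in articles]
    return {
        "article_created": n,
        "pageviews_sum": sum(pageviews),
        "pageviews_mean": sum(pageviews) / n if n else 0.0,
        "revision_count": sum(a["revisions"] for a in articles),
        "quality_score": sum(a["quality_score"] for a in articles) / n if n else 0.0,
        "standard_quality_count": sum(standard),
        # standard_quality is binary per article, so its mean is the
        # ratio of articles satisfying the Standard Quality Criteria
        "standard_quality": sum(standard) / n if n else 0.0,
    }
```

Note how standard_quality_count is a plain sum while standard_quality is the mean of the same binary scores, matching the two bullets above.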

Standard Quality Criteria


An article is of standard+ quality if it meets at least 5 of the 6 following criteria:

  • It is at least 8 kB in size
  • It has at least 1 category
  • It is divided into at least 7 sections
  • It is illustrated with 1 or more images
  • It has at least 4 references
  • It has 2 or more intra-wiki links
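A minimal sketch of this check, assuming 8 kB = 8192 bytes and hypothetical field names:

```python
def is_standard_quality(article):
    """Return True if the article meets at least 5 of the 6
    Standard Quality Criteria.

    `article` is a dict with hypothetical keys mirroring the criteria.
    """
    criteria = [
        article["size_bytes"] >= 8 * 1024,  # at least 8 kB long (assumed 8192 bytes)
        article["categories"] >= 1,         # at least 1 category
        article["sections"] >= 7,           # at least 7 sections
        article["images"] >= 1,             # 1 or more images
        article["references"] >= 4,         # at least 4 references
        article["wikilinks"] >= 2,          # 2 or more intra-wiki links
    ]
    return sum(criteria) >= 5
```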

Aggregation levels


The metrics are computed at multiple aggregation levels.

Category level


The most granular dataset is at the level of the content gap category and per wiki. The all_wikis version is aggregated across all wikis.

  • by category: [ wiki_db, content_gap, category, time_bucket ]; for datasets, see the individual content gap datasets table above.
  • by category across all wikis: [ content_gap, category, time_bucket ], dataset

Content gap level


The metrics are aggregated across all categories of each content gap, per wiki. The all_wikis version is aggregated across all wikis.

  • by content gap: [ wiki_db, content_gap, time_bucket ], dataset
  • by content gap across all wikis: [ content_gap, time_bucket ], dataset
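The two aggregation levels can be illustrated with plain Python, using hypothetical rows and a single sum-based metric:

```python
from collections import defaultdict

# Hypothetical rows at the most granular level:
# (wiki_db, content_gap, category, time_bucket, pageviews_sum)
rows = [
    ("enwiki", "gender", "women", "2023-01", 100),
    ("enwiki", "gender", "men",   "2023-01", 300),
    ("frwiki", "gender", "women", "2023-01", 50),
]

# Content gap level, per wiki: sum across the gap's categories.
by_gap = defaultdict(int)
for wiki, gap, category, bucket, pv in rows:
    by_gap[(wiki, gap, bucket)] += pv

# Content gap level across all wikis: drop wiki_db from the key as well.
all_wikis = defaultdict(int)
for wiki, gap, category, bucket, pv in rows:
    all_wikis[(gap, bucket)] += pv
```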

Note that for the mean-based metrics (e.g. pageviews_mean, quality_score, standard_quality), 're-aggregating' a dataset (e.g. from [ content_gap, category, time_bucket ] to [ content_gap, time_bucket ]) will not yield the same results, because the mean of category-level means is not the mean over all underlying articles. For the count/sum based metrics, the numbers are identical.
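A small numeric example of why mean-based metrics do not re-aggregate while sums do, using hypothetical pageview counts for two categories of one content gap:

```python
# Per-article pageviews for two categories of one content gap.
cat_a = [10, 20, 30]  # three articles
cat_b = [100]         # one article

mean_a = sum(cat_a) / len(cat_a)  # category-level mean: 20.0
mean_b = sum(cat_b) / len(cat_b)  # category-level mean: 100.0

# Naively averaging the category-level means:
naive_mean = (mean_a + mean_b) / 2  # 60.0
# Correct mean over all underlying articles:
true_mean = sum(cat_a + cat_b) / (len(cat_a) + len(cat_b))  # 40.0

# Sum-based metrics re-aggregate exactly:
total = sum(cat_a) + sum(cat_b)  # 160, same as summing all articles at once
```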