Research:Knowledge Gaps Index/Measurement/Content

Content Gap Metrics

FACET	GAP	METRIC	CURRENTLY MEASURED?	FREQUENCY
The dimension's facet	The knowledge gap, e.g. gender	How do we measure the gap?	Actual implementation of the measurements	How often do we make the measurements?
Representation	Gender	Time series of content gap metrics over gender mappings	Yes, datasets / notebook available	Monthly
	Age	Time series of content gap metrics over time mappings	Yes, datasets available	Monthly
	Geography	Time series of content gap metrics over geographic mappings	Yes, datasets / notebook available	Monthly
	Language	More research is planned to measure this gap.	In progress, repo	Monthly (ideal)
	Socio-economic Status	More research is planned to measure this gap.	No	Monthly (ideal)
	Cultural Background	More research is planned to measure this gap.	No	Monthly (ideal)
	Topics for Impact	More research is planned to measure this gap.	No	Monthly (ideal)
	Sexual Orientation	Time series of content gap metrics over sexual orientation mappings	Yes, datasets available	Monthly
Interaction	Readability	Work in progress: follow our research on multilingual readability	No	Monthly (ideal)
	Structured Data	Work in progress: follow our research on Wikidata item quality	No	Monthly (ideal)
	Multimedia	Time series of content gap metrics over multimedia mappings	Yes, datasets available	Monthly

Metrics for Aggregation

As shown by previous research, content coverage (how well Wikimedia project content addresses a particular topic) can be described in different ways. The content gap metrics operationalize two dimensions of content coverage:

Selection: whether the content is present or not
Extent: how much content the topic has overall, i.e. its quality

The gaps in each wiki are described according to seven different metrics:

article_created: number of articles created for each category, which reflects the selection of the content gap
pageviews_sum: total number of pageviews for each category, which reflects the selection of the content gap from the readers' perspective
pageviews_mean: mean number of pageviews for each category, which likewise reflects selection from the readers' perspective
revision_count: total number of edits for each category, which reflects the selection of the content gap from the editors' perspective
quality_score: average article quality score for each category, which reflects the extent of the content gap.
standard_quality_count: number of articles in the category that satisfy the Standard Quality Criteria
standard_quality: the average of the standard quality score. As standard quality is binary, this is the ratio of articles that satisfy the Standard Quality Criteria

For the article quality metrics, which are content-based, the last revision to an article in a given month is used to calculate the quality. If an article was not edited in a given month, the score from the previous month is used for aggregation.
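
The rule for content-based quality metrics (use the last revision of the month, otherwise reuse the previous month's score) can be illustrated with a minimal sketch. The function name `monthly_quality`, the `(timestamp, score)` input format, and the sample scores are assumptions made for illustration; they are not taken from the actual pipeline.

```python
def monthly_quality(revisions, months):
    """Return one quality score per month for a single article.

    revisions: list of (ISO timestamp, quality score) tuples (hypothetical format).
    months: ordered list of "YYYY-MM" month labels to fill.
    """
    last_in_month = {}
    # ISO timestamps sort chronologically as strings, so later
    # revisions overwrite earlier ones within the same month.
    for timestamp, score in sorted(revisions):
        last_in_month[timestamp[:7]] = score

    filled = {}
    previous = None  # stays None until the article has a first revision
    for month in months:
        # Use this month's last revision if there is one,
        # otherwise carry the previous month's score forward.
        previous = last_in_month.get(month, previous)
        filled[month] = previous
    return filled


# Made-up revisions for one article: two edits in January, one in March.
revisions = [
    ("2023-01-03T10:00:00", 0.30),
    ("2023-01-20T08:30:00", 0.42),
    ("2023-03-11T14:00:00", 0.57),
]
months = ["2023-01", "2023-02", "2023-03", "2023-04"]
scores = monthly_quality(revisions, months)
```

In this sketch February and April have no edits, so they reuse the January and March scores respectively.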

Standard Quality Criteria

An article is of standard+ quality if it meets at least 5 of the 6 following criteria:

It is at least 8 kB in size
It has at least 1 category
It has at least 7 sections
It is illustrated with 1 or more images
It has at least 4 references
It has 2 or more intra-wiki links
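
As a rough illustration, the "at least 5 of 6" rule and the two derived metrics (standard_quality_count and standard_quality) could be computed as follows. The `ArticleFeatures` class, its field names, the sample articles, and the reading of 8 kB as 8000 bytes are all assumptions for this sketch, not the pipeline's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class ArticleFeatures:
    # Hypothetical per-article features; field names are illustrative only.
    size_bytes: int
    categories: int
    sections: int
    images: int
    references: int
    wikilinks: int


def meets_standard_quality(article, required=5):
    """Return True if the article satisfies at least `required` of the 6 criteria."""
    checks = [
        article.size_bytes >= 8000,  # assumption: 8 kB taken as 8000 bytes
        article.categories >= 1,
        article.sections >= 7,
        article.images >= 1,
        article.references >= 4,
        article.wikilinks >= 2,
    ]
    return sum(checks) >= required


# Made-up articles belonging to one category.
articles = [
    ArticleFeatures(12000, 3, 8, 2, 10, 25),  # meets all 6 criteria
    ArticleFeatures(9000, 1, 7, 0, 4, 2),     # meets 5 (no image)
    ArticleFeatures(3000, 1, 2, 0, 1, 5),     # meets only 2
]
flags = [meets_standard_quality(a) for a in articles]

standard_quality_count = sum(flags)                  # number of qualifying articles
standard_quality = standard_quality_count / len(flags)  # share of qualifying articles
```

Because the per-article flag is binary, the mean of the flags is the share of articles in the category that meet the criteria.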

Aggregation levels

The metrics are computed at multiple aggregation levels.

Category level

The most granular dataset is at the level of the content gap category and per wiki. The all_wikis version is aggregated across all wikis.

by category: [ wiki_db, content_gap, category, time_bucket ]; for the datasets, see the table of individual content gap datasets above.
by category across all wikis: [ content_gap, category, time_bucket ], dataset

Content gap level

The metrics are aggregated across all categories of the content gaps and per wiki. The all_wikis version is aggregated across all wikis.

by content gap: [ wiki_db, content_gap, time_bucket ], dataset
by content gap across all wikis: [ content_gap, time_bucket ], dataset
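
The aggregation levels listed above differ only in which key columns are kept. A minimal sketch for a count-based metric, assuming rows shaped like the most granular dataset; the `aggregate` helper, the category labels, and the numbers are made up for illustration:

```python
from collections import defaultdict


def aggregate(rows, keys, metric="article_created"):
    """Sum a count-based metric over all rows that share the given key columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[metric]
    return dict(totals)


# Made-up rows at the most granular level:
# [ wiki_db, content_gap, category, time_bucket ]
rows = [
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "female",
     "time_bucket": "2023-01", "article_created": 120},
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "male",
     "time_bucket": "2023-01", "article_created": 480},
    {"wiki_db": "frwiki", "content_gap": "gender", "category": "female",
     "time_bucket": "2023-01", "article_created": 60},
    {"wiki_db": "frwiki", "content_gap": "gender", "category": "male",
     "time_bucket": "2023-01", "article_created": 200},
]

# by category across all wikis
by_category_all_wikis = aggregate(rows, ["content_gap", "category", "time_bucket"])
# by content gap, per wiki
by_content_gap = aggregate(rows, ["wiki_db", "content_gap", "time_bucket"])
# by content gap across all wikis
by_content_gap_all_wikis = aggregate(rows, ["content_gap", "time_bucket"])
```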

Note that for metrics that are mean values (e.g. pageviews_mean, quality_score, standard_quality), 're-aggregating' a dataset (e.g. from [ content_gap, category, time_bucket ] to [ content_gap, time_bucket ]) will not yield the same results as the published aggregate, whereas for the count- and sum-based metrics the numbers are identical.
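
A small worked example shows why averaging category-level means does not reproduce the content-gap-level mean. It assumes that pageviews_mean is the mean number of pageviews per article, and the two categories and their numbers are made up:

```python
# Made-up category-level rows for one content gap and one time bucket.
categories = [
    {"category": "A", "article_created": 10, "pageviews_sum": 1000},
    {"category": "B", "article_created": 90, "pageviews_sum": 900},
]

# Category-level means (assumed here to be pageviews per article).
for row in categories:
    row["pageviews_mean"] = row["pageviews_sum"] / row["article_created"]

# Count- and sum-based metrics re-aggregate exactly by summation.
total_articles = sum(row["article_created"] for row in categories)
total_pageviews = sum(row["pageviews_sum"] for row in categories)

# Mean computed directly at the content gap level.
true_mean = total_pageviews / total_articles

# Naive re-aggregation: the unweighted mean of the category means.
mean_of_means = sum(row["pageviews_mean"] for row in categories) / len(categories)

# Weighting each category mean by its article count recovers the true mean.
weighted_mean = (
    sum(row["pageviews_mean"] * row["article_created"] for row in categories)
    / total_articles
)

print(true_mean, mean_of_means, weighted_mean)
```

Here the small category A dominates the unweighted mean of means, while the mean computed at the content gap level is driven by the much larger category B.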