Research:Knowledge Gaps Index/Datasets
The knowledge gaps project defines readership metrics, contributorship metrics and content gap metrics. This page aims to describe the datasets generated for the various categories, including providing links to download the data and descriptions of the data format.
Readership
Reader Survey
The schema for the gender gap survey:
category
: "Man", "Woman", "Genderdiverse", "I prefer not to say", "Skipped"
percent
: percentage of responses in that category
MOE
: margin of error
wiki_db
: enwiki, itwiki, etc.
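As a minimal sketch of how these fields relate, the snippet below turns one survey row into an interval estimate. The sample values are invented for illustration, and the sketch assumes that `percent` and `MOE` are both expressed in percentage points, which the schema does not state explicitly. `SurveyRow` and `interval` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SurveyRow:
    wiki_db: str
    category: str
    percent: float  # assumed: percentage points
    moe: float      # assumed: margin of error, in percentage points


def interval(row: SurveyRow) -> tuple[float, float]:
    """Return (low, high) bounds, clipped to the valid 0-100 range."""
    low = max(0.0, row.percent - row.moe)
    high = min(100.0, row.percent + row.moe)
    return low, high


# Hypothetical row, not real survey data.
row = SurveyRow(wiki_db="enwiki", category="Woman", percent=25.0, moe=2.5)
print(interval(row))
```

Two categories whose intervals overlap should not be read as meaningfully different on the basis of this survey alone.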
Contributorship
Content
The pipeline architecture is described here. The datasets are documented both for the publicly available datasets, which can be downloaded here, and for internal usage on the data engineering infrastructure. Where applicable, a link to a notebook with example usage of the data is included.
Content Gap Metrics
The schema of the content gap metric datasets is:
wiki_db
: enwiki, itwiki, etc.
category
: the underlying categories for each gap, for example "men", "women", "Europe", etc.
time_bucket
: the time bucket, with monthly granularity (e.g. 2020-02)
metrics
: contains the values for the aggregated metrics:
- article_created: number of articles created (in the time bucket)
- pageviews_sum: number of pageviews
- pageviews_mean: average number of pageviews
- revision_count: number of edits
- quality_score: average article quality score
- standard_quality: percentage of articles that are above a standard quality threshold
- standard_quality_count: number of articles that are above a standard quality threshold
quantiles
: contains the 5th, 25th, 50th (median), 75th, and 95th percentiles:
- article_created: quantiles for the number of articles created (in the time bucket)
- pageviews: quantiles for the number of pageviews
- revision_count: quantiles for the number of edits
- quality_score: quantiles for the average article quality score
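To make the nesting concrete, here is a hedged sketch of what a single record could look like once loaded into Python. All values are invented, and the layout of `quantiles` (one list per metric, ordered from the 5th to the 95th percentile) is an assumption, not something the schema confirms. The `quantile` helper is hypothetical.

```python
# Percentile levels named in the schema, in the order assumed for each list.
QUANTILE_LEVELS = (5, 25, 50, 75, 95)

# Hypothetical record: field names follow the schema, values are invented.
record = {
    "wiki_db": "enwiki",
    "category": "women",
    "time_bucket": "2020-02",
    "metrics": {
        "article_created": 120,
        "pageviews_sum": 48000,
        "pageviews_mean": 400.0,
        "revision_count": 950,
        "quality_score": 0.5,
        "standard_quality": 25.0,
        "standard_quality_count": 30,
    },
    # Assumed layout: one value per percentile, ordered as QUANTILE_LEVELS.
    "quantiles": {
        "article_created": [0, 0, 0, 0, 1],
        "pageviews": [2, 15, 60, 240, 1800],
        "revision_count": [0, 0, 1, 3, 12],
        "quality_score": [0.1, 0.25, 0.5, 0.7, 0.9],
    },
}


def quantile(rec, metric, level):
    """Look up one percentile (e.g. 50 for the median) of a metric."""
    values = rec["quantiles"][metric]
    return values[QUANTILE_LEVELS.index(level)]


median_pageviews = quantile(record, "pageviews", 50)
print(median_pageviews, record["metrics"]["pageviews_mean"])
```

Comparing the mean with the median in this way is one reason to keep both: a mean far above the median suggests that a few heavily viewed articles dominate the category.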
Access the data
The content gap metrics datasets are published for each new MediaWiki snapshot that becomes available. Each snapshot contains the full history, so you will most likely want to use the most recent one. The default file format is Apache Parquet.
The content gap metrics are available both publicly and within the WMF data infrastructure:
- The datasets are available for download here. Refer to this notebook for examples of how to load the data.
- Within the WMF data infrastructure, the content gap metrics are available on Hive in the content_gap_metrics database, as documented on DataHub (Wikimedia SSO required). Refer to this notebook for examples of how to load the data.
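Since each snapshot carries the full history, picking the newest one is usually the first step. The sketch below assumes snapshot folders are named `YYYY-MM` (for example `2023-01`), which is not stated on this page. `latest_snapshot` and `load_metrics` are hypothetical helpers, while `pandas.read_parquet` is the standard pandas reader and needs `pyarrow` or `fastparquet` installed.

```python
import re

# Assumed snapshot naming, e.g. "2023-01".
SNAPSHOT_PATTERN = re.compile(r"^\d{4}-\d{2}$")


def latest_snapshot(names):
    """Return the most recent snapshot name, ignoring entries that do not match."""
    snapshots = [n for n in names if SNAPSHOT_PATTERN.match(n)]
    if not snapshots:
        raise ValueError("no snapshot directories found")
    # Zero-padded YYYY-MM strings sort chronologically as plain strings.
    return max(snapshots)


def load_metrics(path):
    """Load one Parquet file into a pandas DataFrame."""
    import pandas as pd  # imported lazily so latest_snapshot works without pandas

    return pd.read_parquet(path)


print(latest_snapshot(["2022-11", "2023-01", "2022-12", "csv"]))
```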
Aggregation levels
The content gap metrics are available at the following aggregation levels (more background on aggregation levels is available here):
- by category: metrics for each category of each gap for each wiki (highest granularity)
- by content gap: metrics for each content gap for each wiki (aggregated across all categories of a gap)
- by category for all wikis: metrics for each category of each gap (aggregated across all wikis)
- by content gap for all wikis: metrics for each content gap (aggregated across all categories of a gap, and across all wikis)
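The four levels differ only in which columns are kept as grouping keys. The sketch below illustrates this on invented article-level rows. The `content_gap` column name is an assumption, and the simple sum presumes that each article falls into a single category per gap; the actual pipeline may handle articles with several categories differently.

```python
from collections import defaultdict

# Grouping keys per aggregation level ("content_gap" is an assumed column name).
LEVELS = {
    "by category": ("wiki_db", "content_gap", "category"),
    "by content gap": ("wiki_db", "content_gap"),
    "by category for all wikis": ("content_gap", "category"),
    "by content gap for all wikis": ("content_gap",),
}

# Hypothetical article-level rows, one per article and category.
rows = [
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "women",
     "page_id": 1, "pageviews": 100},
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "men",
     "page_id": 2, "pageviews": 300},
    {"wiki_db": "itwiki", "content_gap": "gender", "category": "women",
     "page_id": 1, "pageviews": 50},
]


def aggregate(rows, level):
    """Sum pageviews over the grouping keys of the given aggregation level."""
    keys = LEVELS[level]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["pageviews"]
    return dict(totals)


for level in LEVELS:
    print(level, aggregate(rows, level))
```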
CSV format
A simplified version of the metrics is also stored in the csv folder. The csv files are only available for the most recent snapshot. The columns are:
wiki_db
: enwiki, itwiki, etc.
category
: the underlying categories for each gap, for example "men", "women", "Europe", etc.
time_bucket
: the time bucket at which the metric is recorded, with monthly granularity (e.g. 2020-02)
- [metric value columns], which contain the measurements for the following: article_count_value; article_created_value; pageviews_sum_value; pageviews_mean_value; standard_quality_value; standard_quality_count_value; quality_score_value; revision_count_value. An explanation of the relevant metrics:
article_created_value
: number of articles created for each category in the time bucket
pageviews_sum_value
: total number of pageviews for each category in the time bucket
pageviews_mean_value
: mean number of pageviews for each category in the time bucket
revision_count_value
: total number of edits for each category in the time bucket
quality_score_value
: average article quality score for each category in the time bucket
standard_quality_value
: percentage of articles in the category that are above a standard quality threshold in the time bucket
- [total columns], which contain the totals across all categories for the following: article_count_total; article_created_total; pageviews_sum_total; pageviews_mean_total; standard_quality_total; standard_quality_count_total; quality_score_total; revision_count_total. An explanation of the relevant metrics:
article_created_total
: number of articles created across all categories in the time bucket
pageviews_sum_total
: total number of pageviews across all categories in the time bucket
pageviews_mean_total
: mean number of pageviews across all categories in the time bucket
revision_count_total
: total number of edits across all categories in the time bucket
quality_score_total
: average article quality score across all categories in the time bucket
standard_quality_total
: percentage of articles across all categories that are above a standard quality threshold in the time bucket
The _value columns are equivalent to the metrics in the "by category" dataset, while the _total columns are equivalent to the "by content gap" dataset (i.e. the total refers to "across all categories", not "across all wikis").
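Because each row carries both the category value and the total across categories, a category's share can be computed without joining two datasets. The following is a minimal sketch using only the standard library. The CSV excerpt is invented, and `category_shares` is a hypothetical helper.

```python
import csv
import io

# Hypothetical excerpt: real files contain more columns and rows.
SAMPLE = """wiki_db,category,time_bucket,article_created_value,article_created_total
enwiki,women,2020-02,30,120
enwiki,men,2020-02,90,120
"""


def category_shares(csv_text, metric):
    """Map (wiki_db, category, time_bucket) to value / total for one metric."""
    shares = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row[f"{metric}_value"])
        total = float(row[f"{metric}_total"])
        key = (row["wiki_db"], row["category"], row["time_bucket"])
        # A zero total leaves the share undefined rather than dividing by zero.
        shares[key] = value / total if total else None
    return shares


shares = category_shares(SAMPLE, "article_created")
print(shares)
```

A ratio like this is only meaningful for additive metrics such as counts and sums; for means, percentages, and quality scores, the _value and _total columns are better compared side by side.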
Content Gap features
The content gap features dataset connects articles with information (aka features) about the various content gaps. For example, the article about Angkor Wat is labelled as being about Cambodia for the geography gap, associated with the 12th century for the time gap, and marked as illustrated with multimedia.
The schema of the dataset is documented on DataHub; the features for the various content gaps are described here.
The content gap features dataset is used as input for the aggregation of the content gap metrics themselves. It is useful for:
- doing custom metrics aggregations. For an example, see this notebook computing metrics for the intersection between the gender and the geography gap.
- creating lists of articles filtered by criteria based on content gaps. See this notebook for an example constructing a list of articles about women associated with France and the 1970s.
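The list-building use case amounts to filtering feature rows on several gaps at once. The sketch below shows the idea on invented rows. The keys `gender`, `geography`, and `time`, and the representation of each as a list of labels, are hypothetical; the real column names are those documented in the schema.

```python
# Hypothetical feature rows: keys and label formats are illustrative only.
articles = [
    {"wiki_db": "enwiki", "page_id": 101,
     "gender": ["women"], "geography": ["France"], "time": ["1970s"]},
    {"wiki_db": "enwiki", "page_id": 102,
     "gender": ["women"], "geography": ["Mexico"], "time": ["1930s"]},
    {"wiki_db": "enwiki", "page_id": 103,
     "gender": ["men"], "geography": ["France"], "time": ["1970s"]},
]


def matches(article, criteria):
    """True if the article carries the wanted label for every gap in criteria."""
    return all(label in article.get(gap, []) for gap, label in criteria.items())


criteria = {"gender": "women", "geography": "France", "time": "1970s"}
selected = [a["page_id"] for a in articles if matches(a, criteria)]
print(selected)
```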
Metric features
In order to provide insights into content gaps over time, the knowledge gap pipeline aggregates commonly used metrics for analyzing Wikipedia content and editor activity into a metric features dataset.
wiki_db
: enwiki, itwiki, etc.
page_id
: the page id of the article
time_bucket
: the time bucket, with monthly granularity (e.g. 2020-02)
article_created
: boolean of whether the article was created in the time_bucket
pageviews
: number of pageviews in the time_bucket
quality_score
: article quality score of the last revision to that article in the time_bucket
page_revision_count
: number of edits in the time_bucket
The metric features are associated with a particular Wikipedia article in a particular wiki. For example, the Frida Kahlo article exists in 152 projects, and the above metrics are calculated for each of them individually. The metrics are calculated with a monthly granularity, which in turn enables the analysis of trends over time.
See this notebook for more details on the schema and example usage.
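As a small illustration of the monthly granularity, the sketch below builds a pageview trend from rows shaped like the schema above. The rows are invented, and `monthly_pageviews` is a hypothetical helper rather than part of the pipeline.

```python
from collections import defaultdict

# Hypothetical rows following the metric features schema; values are invented.
rows = [
    {"wiki_db": "enwiki", "page_id": 7, "time_bucket": "2020-01",
     "article_created": True, "pageviews": 40,
     "quality_score": 0.25, "page_revision_count": 3},
    {"wiki_db": "enwiki", "page_id": 7, "time_bucket": "2020-02",
     "article_created": False, "pageviews": 90,
     "quality_score": 0.5, "page_revision_count": 5},
    {"wiki_db": "itwiki", "page_id": 12, "time_bucket": "2020-02",
     "article_created": True, "pageviews": 10,
     "quality_score": 0.25, "page_revision_count": 1},
]


def monthly_pageviews(rows):
    """Total pageviews per time bucket, in chronological order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["time_bucket"]] += row["pageviews"]
    # YYYY-MM strings sort chronologically, so a plain sort is sufficient.
    return dict(sorted(totals.items()))


trend = monthly_pageviews(rows)
print(trend)
```

Note that `page_id` is only unique within a wiki, so any per-article grouping should use the pair of `wiki_db` and `page_id` as its key.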