Research:Language-Agnostic Topic Classification/Countries

Tracked in Phabricator:
Task T366273

This page documents a planned research project.
Information may be incomplete and change before the project starts.


This project is an effort to build a model for taking any given Wikipedia article and inferring which (0 to many) countries are relevant to that article. For example, taking the article for en:Akwaeke Emezi and inferring Nigeria and the United States of America. The intent is for the model's inference to be a two-stage process: 1) extraction of high-confidence countries based on a curated set of Wikidata properties -- e.g., country of citizenship; 2) inference of additional, lower-confidence countries based on what articles (and associated countries) are linked to by the article itself. This second stage is designed to cover cases where the Wikidata item is under-developed or the relationship between the article's topic and country is not one that is currently modeled well on Wikidata -- e.g., the region where a given species is endemic.
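The two-stage process above can be sketched as follows. This is an illustrative assumption of how the stages might compose, not the project's actual code; the property IDs are real Wikidata properties (P27 = country of citizenship, P17 = country, P495 = country of origin), but the curated set, data structures, and threshold are hypothetical.

```python
from collections import Counter

# Hypothetical curated set of country-bearing Wikidata properties.
CURATED_PROPERTIES = {"P27", "P17", "P495"}

def infer_countries(claims, outlink_countries, min_share=0.15):
    """Return (high_confidence, low_confidence) country sets.

    claims: dict mapping Wikidata property ID -> list of country names.
    outlink_countries: list of country sets, one per linked article.
    """
    # Stage 1: direct, high-confidence countries from curated properties.
    high = {c for p in CURATED_PROPERTIES for c in claims.get(p, [])}
    # Stage 2: additional countries that a large share of outlinks point to.
    counts = Counter(c for linked in outlink_countries for c in linked)
    total = max(len(outlink_countries), 1)
    low = {c for c, n in counts.items() if n / total >= min_share} - high
    return high, low
```

For the en:Akwaeke Emezi example, stage 1 alone would surface Nigeria and the United States of America via the country of citizenship claims, and stage 2 would only add countries not already found.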

Current status

  • 2024-05-30 updated page to reflect thinking for FY24-25 hypothesis (WE 2.1.1).
  • Up next: determine the approach for evaluating the model.

Test it out

Background

The ORES topic taxonomy was an initiative to establish a set of high-level topics (~60) that captured many of the ways the Wikipedia community delineates content and that could be used to train a machine-learning model to infer these topics for any given article. It forms the backbone of the topic infrastructure that has been built to understand content dynamics and to power filters, such as those in recommender systems, that editors can use to more quickly identify content they might want to edit.

One facet of the taxonomy that was too coarse to be useful for most editors was the regional topics -- e.g., Western Europe, Central Asia, etc. While these topics are useful for high-level statistics about article or pageview distributions, they do not support other use-cases such as helping editors find content relevant to their region (which is generally a country or even smaller subdivision) or more fine-grained analyses of content gaps. The goal of this project is to develop a new model that can replace this aspect of the original taxonomy with country-level predictions (and any aggregation to larger regions can still be easily applied).

What is a country?

While many regions are clearly countries and have widely-recognized borders and sovereignty, other regions that we might think of as countries are disputed or officially part of a larger region. Choosing a set of "countries" is an inherently political act. The point of this classifier is to support editors who wish to find and edit content relevant to their region, as well as analyses of geographic trends, while still mapping well to other geographic data such as pageviews or grant regions. The model currently uses this list.

Content Gap Metrics

The Knowledge Gaps Index is an attempt to quantify certain important gaps on the Wikimedia projects so as to provide insight into movement trends and ways in which the diversity of the projects could be strengthened. It considers countries an important aspect of content gaps and splits this between a "geographic" gap that covers content with a set of latitude-longitude coordinates on Wikidata (primarily places and events) and a "cultural" gap that covers other content with a less physical relationship to a country (people, political movements, sports teams, etc.) via a separate set of Wikidata properties. This model combines these two approaches (coordinate-based geolocation and cultural Wikidata properties) to form a first-pass, high-confidence prediction of countries for a given Wikipedia article.
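The coordinate-based ("geographic") half of this first pass amounts to a point-in-polygon test against country borders. A minimal pure-Python sketch using the even-odd ray-casting rule is below; a real pipeline would use actual border polygons (e.g., via a geometry library), and the toy square "country" here is purely illustrative.

```python
def point_in_polygon(lon, lat, polygon):
    """Even-odd ray-casting test: is (lon, lat) inside the polygon,
    given as a list of (lon, lat) vertices?"""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does the edge (j -> i) cross the horizontal ray at this latitude?
        if (yi > lat) != (yj > lat):
            # Longitude where the edge crosses the point's latitude.
            if lon < (xj - xi) * (lat - yi) / (yj - yi) + xi:
                inside = not inside
        j = i
    return inside

def geolocate(lon, lat, country_polygons):
    """Map a coordinate to the set of countries whose borders contain it."""
    return {name for name, poly in country_polygons.items()
            if point_in_polygon(lon, lat, poly)}

# Toy example: a square "country" from (0, 0) to (10, 10).
borders = {"Squareland": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]}
```

Articles whose Wikidata items carry a coordinate location can then be assigned countries this way, with the cultural Wikidata properties covering the rest.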

Earlier attempts

Given the success of the language-agnostic topic classification model based on an article's outlinks, that same approach was initially tried, but with the 64 topics replaced by one or more of 193 countries based on entities identified as sovereign states in Wikidata, with a small amount of manual cleaning. Groundtruth data was based purely on an article's associated Wikidata item and was the union of coordinate location (geolocated to the same set of 193 countries by checking simple containment within each country's borders), place of birth, country, country of citizenship, and country of origin. Only direct matches were used, so a place of birth property that references a city (such as Cambridge for Douglas Adams) would not be mapped to the United Kingdom (though in Douglas Adams' case, the country of citizenship property served to still include the United Kingdom).
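The groundtruth construction described above could be assembled roughly as follows. The property IDs are real Wikidata properties (P19 = place of birth, P17 = country, P27 = country of citizenship, P495 = country of origin, P625 = coordinate location), but the item dictionaries and the injected geolocate() helper are assumptions for this sketch.

```python
# Direct country-valued properties used as groundtruth signals.
GROUNDTRUTH_PROPERTIES = {
    "P19": "place of birth",
    "P17": "country",
    "P27": "country of citizenship",
    "P495": "country of origin",
}

def groundtruth_countries(item, sovereign_states, geolocate=None):
    """Union of (a) claim values that are themselves one of the sovereign
    states (direct matches only) and (b) geolocated P625 coordinates.

    item: dict mapping property ID -> list of claim values.
    """
    labels = set()
    for prop in GROUNDTRUTH_PROPERTIES:
        for value in item.get(prop, []):
            if value in sovereign_states:  # a city like "Cambridge" is skipped
                labels.add(value)
    if geolocate and "P625" in item:
        for lat, lon in item["P625"]:
            labels |= geolocate(lat, lon)
    return labels
```

For the Douglas Adams case, P19 "Cambridge" fails the direct-match check and contributes nothing, while P27 "United Kingdom" passes and is kept.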

This model performed well statistically -- i.e. relatively high precision and recall for most countries -- but had a number of drawbacks that suggested it could still be improved substantially. Most notably, the model struggled to handle articles that relate to many geographies -- e.g., World War II -- spreading its confidence so thinly across many countries that it generally made no confident predictions at all. This could potentially be handled by lowering the prediction threshold but, in practice, I think this would introduce other issues related to false positives.

This led me to think that graph-based approaches might show more promise here than I would have expected for the broader topic taxonomy. For example, many topics do not show simple homophily-type relationships -- e.g., an article that links to many articles about people is not itself clearly an article about a person. I would expect clearer relationships with geography, though -- i.e. an article that links repeatedly to content about a particular region is almost certainly relevant to that region. Thus, a careful aggregation of all of the countries (by way of articles) that a given article links to should provide a high-confidence classifier for identifying additional relevant countries for an article.
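One possible form that "careful aggregation" could take is sketched below: each linked article splits a unit of weight among its associated countries (so link-heavy, many-country articles are diluted), and only countries whose normalized score clears a threshold are kept. The weighting scheme and threshold value are my own illustrative assumptions, not the project's chosen design.

```python
from collections import Counter

def countries_from_links(outlink_countries, min_score=0.25):
    """Score countries by their weighted share across an article's outlinks.

    outlink_countries: one set of countries per linked article.
    Returns {country: score} for countries at or above min_score.
    """
    scores = Counter()
    for linked in outlink_countries:
        for country in linked:
            scores[country] += 1 / len(linked)  # split weight among countries
    total = max(len(outlink_countries), 1)
    return {c: s / total for c, s in scores.items() if s / total >= min_score}

# An article whose outlinks are dominated by France-related content:
links = [{"France"}, {"France"}, {"France", "Germany"}, set(), {"France"}]
# France: (1 + 1 + 0.5 + 1) / 5 = 0.7 -> kept
# Germany: 0.5 / 5 = 0.1 -> below threshold, dropped
```

A nice property of this rule is that an incidental link (here, Germany) does not trigger a prediction, while a consistent regional signal (France) does.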

After further experimentation, I am less certain that this will actually solve the many-geography problem (as with World War II), but it is still a far simpler and easier-to-maintain approach than a trained model. Namely, a learned model would have a limited, static vocabulary that would need to be re-trained regularly, whereas an approach that just relies on knowing which countries are associated with each linked article can be easily updated as new articles are created and content is edited. Further, the gaps in the Wikidata-based groundtruth are highly topically biased -- i.e. the missing countries are not missing at random but are highly concentrated in certain topical areas like flora/fauna. This would make it difficult for a model to learn to fill in these gaps because the right patterns would not necessarily be present in the training data.