Machine learning models/Production/add-a-link model
This model generates suggestions for new links to be added to articles. Specifically, each suggestion for a new link contains the anchor-text and the page-title of the target article. The intended use case of the model is to generate link recommendations at scale for the add-a-link structured task.
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Martin Gerlach, Kevin Bazira, Djellel Difallah, Tisza Gergő, Kosta Harlan, Rita Ho, and Marshall Miller |
Model owner(s) | MGerlach (WMF) |
Model interface | https://api.wikimedia.org/wiki/Link_Recommendation_API |
Publications | paper and arXiv |
Code | mwaddlink |
Uses PII | No |
In production? | Yes |
Which projects? | add-a-link structured tasks (most Wikipedias) |
This model uses wikitext and existing links to recommend potential new links to add to an article. | |
Motivation
Contributing to Wikipedia requires familiarity with the MediaWiki platform not only on a technical level (e.g. editing tools) but also with its intricate system of policies and guidelines. These barriers significantly hinder the retention of new editors (so-called newcomers), which is a key mechanism for maintaining or increasing the number of active contributors and thus for ensuring the functioning of an open collaboration system such as Wikipedia[1]. Different interactive tools have been introduced to lower these barriers, such as the visual editor, which aims to reduce the technical hurdles of editing by providing a “what you see is what you get” interface.
Another promising approach to retaining newcomers is the Structured Tasks framework developed by Wikimedia’s Growth Team. This approach builds on earlier successes with suggesting easy edits (such as adding an image description), which are believed to lead to positive editing experiences and, in turn, a higher probability that editors continue participating. Structured tasks aim to generalize this workflow by breaking down an editing process into steps that are easily understood by newcomers, easy to use on mobile devices, and guided by algorithms.
The task of adding links has been identified as an ideal candidate for this process: i) adding links is a frequent type of work and is considered an attractive task by users[2]; ii) it is well-defined; and iii) it can be considered low-risk with respect to vandalism or other negative outcomes.
After completion of the "Add a link" experiment analysis, we can conclude that the add-a-link structured task improves outcomes for newcomers over both a control group that did not have access to the Growth features and a group that had the unstructured "add links" task, particularly when it comes to constructive (non-reverted) edits. The most important points are:
- Newcomers who get the Add a Link structured task are more likely to be activated (i.e. make a constructive first article edit).
- They are also more likely to be retained (i.e. come back and make another constructive article edit on a different day).
- The feature also increases edit volume (i.e. the number of constructive edits made across the first couple of weeks), while at the same time improving edit quality (i.e. reducing the likelihood that the newcomer's edits are reverted).
Users and uses
Ethical considerations, caveats, and recommendations
- The performance varies across languages. This has at least two main reasons. First, parsing some languages is challenging; for example, standard word tokenization does not work for Japanese because it relies on whitespace to separate tokens, which Japanese text does not use. Second, for some languages we have very little training data because the corresponding Wikipedias are small in terms of the number of articles. We implemented a backtesting evaluation to make sure that each deployed model passes a minimum level of quality.
Model
Performance
The model was evaluated offline (test data, see below) and manually by editors for an initial set of 6 languages.
Project | Offline precision (test data) | Offline recall (test data) | Manual precision (editors)
---|---|---|---
arwiki | 0.754 | 0.349 | 0.92
bnwiki | 0.743 | 0.297 | 0.75
cswiki | 0.778 | 0.430 | 0.70
enwiki | 0.832 | 0.457 | 0.78
frwiki | 0.823 | 0.464 | 0.82
viwiki | 0.903 | 0.656 | 0.73
For all other languages (301 in total), the model was evaluated only offline with the test data. In practice, we required a precision of roughly 0.7–0.75 or higher so that the majority of suggestions would be true positives. As a result, we discarded the models for 23 languages. For details, see the report.
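As an illustration of these backtesting metrics: precision and recall can be computed by hiding the existing links of test articles and comparing the model's suggestions against them. The following is a minimal sketch with made-up data, not the actual evaluation code:

def precision_recall(suggested, ground_truth):
    # suggested / ground_truth: sets of (anchor-text, target page-title) pairs.
    true_positives = len(suggested & ground_truth)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Toy example: two of three hidden links are recovered, and one extra
# suggestion is made, giving precision = recall = 2/3.
suggested = {("science fiction", "Science fiction"),
             ("planet", "Planet"),
             ("orbit", "Orbit")}
hidden = {("science fiction", "Science fiction"),
          ("planet", "Planet"),
          ("solar system", "Solar System")}
print(precision_recall(suggested, hidden))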
Implementation
The model works in three stages to identify links in an article:
- Mention detection: we parse the text of the article and identify n-grams of tokens that are not yet linked as potential anchor texts.
- Link generation: for a given anchor-text, we generate potential link candidates for the target page-title from an anchor dictionary. The anchor dictionary stores the anchor-text and the target page-title of all already existing links in the corresponding Wikipedia. The same anchor-text can yield more than one link candidate. We only generate link candidates for an anchor-text if that link has been used at least once before.
- Link disambiguation: for each candidate link consisting of the triplet (source page-title, target page-title, anchor text) we predict a probability via a binary classification task. In practice, we use XGBoost’s gradient boosting trees[3]. As model input we use the following features:
- ngram: the number of words in the anchor (based on simple tokenization)
- frequency: count of the anchor-link pair in the anchor-dictionary
- ambiguity: how many different candidate links exist for an anchor in the anchor-dictionary
- kurtosis: the kurtosis of the shape of the distribution of candidate-links for a given anchor in the anchor-dictionary
- Levenshtein-distance: the Levenshtein distance between the anchor and the link. This measures how similar the two strings are. Roughly speaking, it corresponds to the number of single-character edits needed to transform one string into the other; e.g. the Levenshtein distance between “kitten” and “sitting” is 3.
- w2v-distance: similarity between the article (source page) and the link (target page) based on the content of the pages. This is obtained from wikipedia2vec[4]. Similar to the concept of word embeddings, we map each article to a vector in an (abstract) 50-dimensional space in which articles with similar content are located close to each other. Thus, given two articles (say, the source article and a possible link target), we can look up their vectors and estimate their similarity by calculating the distance between them (more specifically, the cosine similarity). The rationale is that a link is more likely to be appropriate if the target article is similar to the source article. A sketch of this similarity lookup is shown after this list.
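For illustration, the w2v-distance feature could be computed with the wikipedia2vec toolkit as in the sketch below. The model file name is a placeholder for a pretrained embedding file; this is a sketch, not the exact feature code:

import numpy as np
from wikipedia2vec import Wikipedia2Vec

# Placeholder file name for a pretrained wikipedia2vec model.
wiki2vec = Wikipedia2Vec.load("enwiki_50d.pkl")

def w2v_similarity(source_title, target_title):
    # Cosine similarity between the two article (entity) embeddings.
    u = wiki2vec.get_entity_vector(source_title)
    v = wiki2vec.get_entity_vector(target_title)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(w2v_similarity("Earth", "Planet"))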
More details can be found here. A minimal sketch of all three stages follows.
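The sketch below walks through the three stages for a single article. It is an illustration under simplifying assumptions, not the mwaddlink implementation: the toy anchor dictionary, the whitespace tokenization, and the reduced feature set (kurtosis and w2v-distance are omitted; see the wikipedia2vec sketch above) are stand-ins, and the classifier is assumed to be a trained xgboost.XGBClassifier.

# Illustrative sketch of the three-stage pipeline; not the mwaddlink code.
# Toy anchor dictionary: anchor-text -> {candidate target page-title: count}.
ANCHOR_DICT = {
    "science fiction": {"Science fiction": 120, "Science fiction film": 4},
}

def candidate_anchors(tokens, max_n=3):
    # Stage 1 (mention detection): n-grams of not-yet-linked tokens.
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def levenshtein(a, b):
    # Edit distance, e.g. levenshtein("kitten", "sitting") == 3.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def features(anchor, target):
    # Stage 3 inputs (kurtosis and w2v-distance omitted for brevity).
    candidates = ANCHOR_DICT[anchor]
    return [
        len(anchor.split()),                  # ngram: words in the anchor
        candidates[target],                   # frequency of the anchor-link pair
        len(candidates),                      # ambiguity: number of candidates
        levenshtein(anchor, target.lower()),  # similarity to the (lowercased) title
    ]

def recommend(text, model, threshold=0.5):
    tokens = text.split()  # real tokenization is language-dependent
    suggestions = []
    for anchor in candidate_anchors(tokens):
        # Stage 2 (link generation): look up candidates in the dictionary.
        for target in ANCHOR_DICT.get(anchor, {}):
            # Stage 3 (link disambiguation): probability from the classifier.
            p = model.predict_proba([features(anchor, target)])[0][1]
            if p >= threshold:
                suggestions.append((anchor, target, p))
    return suggestions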
The API returns a response of the following form:

{
  links: <Array of link objects>,
  links_count: <Number of recommendations>,
  page_title: <Article title>,
  pageid: <Article identifier>,
  revid: <Revision identifier>
}
The link object looks like:
{
  context_after: <Characters immediately succeeding the link text; may include spaces, punctuation, and partial words>,
  context_before: <Characters immediately preceding the link text; may include spaces, punctuation, and partial words>,
  link_index: <0-based index of the link recommendation within all link recommendations, sorted by wikitext offset>,
  link_target: <Article title that should be linked to>,
  link_text: <Phrase to link in the article text>,
  match_index: <0-based index of the link anchor within the list of matches when searching for the phrase to link within simple wikitext (top-level wikitext that is not part of any other wikitext construct)>,
  score: <Probability score that the link should be added>,
  wikitext_offset: <Character offset describing where the anchor begins>
}
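To illustrate how link_text and match_index identify the anchor, the sketch below inserts a recommendation into wikitext. It is a deliberate simplification: a real consumer must count only matches in simple wikitext, i.e. skip occurrences inside templates, existing links, and other constructs.

import re

def apply_recommendation(wikitext, rec):
    # Wrap the match_index-th occurrence of link_text in a wikilink.
    # Simplified: occurrences inside existing wikitext constructs are
    # not skipped here, unlike in the real format.
    for i, m in enumerate(re.finditer(re.escape(rec["link_text"]), wikitext)):
        if i == rec["match_index"]:
            link = "[[{}|{}]]".format(rec["link_target"], rec["link_text"])
            return wikitext[:m.start()] + link + wikitext[m.end():]
    return wikitext

rec = {"link_text": "science fiction",
       "link_target": "Science fiction",
       "match_index": 0}
print(apply_recommendation("... especially in science fiction to ...", rec))
# ... especially in [[Science fiction|science fiction]] to ...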
Example request:

GET /service/linkrecommendation/v1/linkrecommendations/wikipedia/en/Earth

Output:
{
"links": [
{
"context_after": " to distin",
"context_before": "cially in ",
"link_index": 0,
"link_target": "Science fiction",
"link_text": "science fiction",
"match_index": 0,
"score": 0.5104129910469055,
"wikitext_offset": 13852
}
],
"links_count": 1,
"page_title": "Earth",
"pageid": 9228,
"revid": 1013965654
}
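The same request can be issued programmatically. A minimal example using the Python requests library (the User-Agent string is an arbitrary placeholder):

import requests

URL = ("https://api.wikimedia.org/service/linkrecommendation/v1/"
       "linkrecommendations/wikipedia/en/Earth")

response = requests.get(URL, headers={"User-Agent": "model-card-example/0.1"})
response.raise_for_status()
data = response.json()

print(data["page_title"], data["links_count"])
for link in data["links"]:
    # Each link object carries the anchor text, the suggested target,
    # and the model's probability score.
    print("{!r} -> {!r} (score={:.2f})".format(
        link["link_text"], link["link_target"], link["score"]))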
Data
Licenses
- Code: GNU General Public License v3.0
- Model: MIT License
Citation
Cite this model as[5]:
@INPROCEEDINGS{gerlach2021multilingual,
title = "Multilingual Entity Linking System for Wikipedia with a {Machine-in-the-Loop} Approach",
booktitle = "Proceedings of the 30th {ACM} International Conference on Information \& Knowledge Management",
author = "Gerlach, Martin and Miller, Marshall and Ho, Rita and Harlan, Kosta and Difallah, Djellel",
publisher = "Association for Computing Machinery",
pages = "3818--3827",
series = "CIKM '21",
year = 2021,
address = "New York, NY, USA",
location = "Virtual Event, Queensland, Australia",
doi = "10.1145/3459637.3481939"
}
- ↑ Halfaker, A., Geiger, R. S., Morgan, J. T., & Riedl, J. (2013). The rise and decline of an open collaboration system. The American Behavioral Scientist, 57(5), 664–688. https://doi.org/10.1177/0002764212469365
- ↑ Cosley, D., Frankowski, D., Terveen, L., & Riedl, J. (2007). SuggestBot: using intelligent task routing to help people find work in wikipedia. Proceedings of the 12th International Conference on Intelligent User Interfaces, 32–41. https://doi.org/10.1145/1216295.1216309
- ↑ Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
- ↑ Yamada, I., Asai, A., Sakuma, J., Shindo, H., Takeda, H., Takefuji, Y., & Matsumoto, Y. (2020). Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 23–30. https://doi.org/10.18653/v1/2020.emnlp-demos.4
- ↑ Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939