Research:Automated classification of article importance/Potential sources of signal
What sources can help determine the importance of an article? Here we document the ones we use and the ones we are considering. We discuss why some of these should or should not be included in our models.
Indegree
In Wikipedia, the indegree of an article is the number of links pointing to that article. These links might be filtered, e.g. we might only count links from other Wikipedia articles. The purpose of linking from one article to another varies. The linked article might explain a concept (e.g. philosophy links to the Greek language article because the word "philosophy" is of Greek origin), or it might contain more information about a specific subject (e.g. the article about philosophy briefly describes Western philosophy and links to Western philosophy for those who wish to learn more). Taken together, we could say that the links signify "topical relevance", as argued by Kamps and Koolen.[1]
Indegree should provide us with some signal of importance because articles that are linked to from many other articles cover a topic that is relevant to a larger proportion of the encyclopedia. Say that two articles X and Y both explain their respective topics. If X is linked to from a handful of articles while Y is linked to by thousands of articles, then arguably Y is explaining a topic that is more important to cover well.
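As a minimal sketch of the measure described above, the following counts distinct inbound links per article from a list of (source, target) pairs. The input format, the `indegree` function name and the optional `sources` filter are assumptions made for illustration; they are not part of the project's actual pipeline.

```python
from collections import Counter


def indegree(links, sources=None):
    """Count distinct inbound links per target article.

    links: iterable of (source_title, target_title) pairs (hypothetical format).
    sources: optional set of titles; links from any other page are ignored,
             e.g. to count only links coming from other articles.
    """
    unique = set()
    for src, dst in links:
        if src == dst:
            continue  # a page linking to itself says nothing about relevance
        if sources is not None and src not in sources:
            continue  # filtered out, e.g. a link from a talk page
        unique.add((src, dst))  # repeated links from one page count once
    return Counter(dst for _, dst in unique)


links = [
    ("Philosophy", "Greek language"),
    ("Philosophy", "Western philosophy"),
    ("Plato", "Western philosophy"),
    ("Plato", "Western philosophy"),       # duplicate link
    ("Talk:Plato", "Western philosophy"),  # not an article
]
articles = {"Philosophy", "Plato"}
counts = indegree(links, sources=articles)
```

With the filter applied, "Western philosophy" has an indegree of 2 and "Greek language" an indegree of 1; without it, the talk-page link raises the first count to 3.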
Refinements to indegree
In their 2008 paper,[2] Kamps and Koolen use a modified measure of "local indegree". Because they study query performance, "local indegree" refers to links between articles that are relevant to a given query. This type of measure might be useful to us in a local context (e.g. within WikiProjects).[supp 1]
We might want to modify indegree because not all links behave the same way in an article. A recent paper by Dimitrov et al.[3] suggests that links towards the top of an article are used much more frequently. One way to identify a link's position in an article is to parse all pages and note where the links are. We could also use the clickstream dataset to identify actual page transitions, thereby discarding links that are not actually used. This should provide us with a more accurate measure of indegree, but it requires careful consideration in relation to measures of popularity.
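The clickstream-based refinement could be sketched as follows. The sketch assumes the clickstream data has already been loaded as (prev, curr, type, n) tuples, where a type of "link" marks a transition that followed a wikilink and n is the number of observed transitions; this layout, the function name and the `min_clicks` threshold are assumptions for illustration.

```python
from collections import defaultdict


def used_indegree(rows, min_clicks=1):
    """Indegree that counts only links readers actually followed.

    rows: iterable of (prev, curr, link_type, n) tuples (assumed layout).
    min_clicks: hypothetical threshold; links followed fewer times are dropped.
    """
    sources = defaultdict(set)
    for prev, curr, link_type, n in rows:
        if link_type == "link" and n >= min_clicks:
            sources[curr].add(prev)
    return {title: len(srcs) for title, srcs in sources.items()}


rows = [
    ("Philosophy", "Western philosophy", "link", 120),
    ("Plato", "Western philosophy", "link", 3),
    ("other-search", "Western philosophy", "external", 900),
    ("Philosophy", "Greek language", "link", 8),
]
all_used = used_indegree(rows)
well_used = used_indegree(rows, min_clicks=10)
```

Raising `min_clicks` makes the measure depend more strongly on traffic, which is one reason it has to be considered together with popularity rather than as an independent signal.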
Popularity
In Wikipedia, the popularity of an article can be defined as the number of views of an article based on the WMF's view statistics.
Popularity should provide us with some signal of importance because generally articles that have more readers should be more important to the encyclopedia. A topic that regularly sees thousands of visitors should be more complete (have an article of higher quality) than a topic with a handful of visitors. In the latter case a shorter introduction to the topic might be sufficient.
Refinements to popularity
We pointed earlier to Kamps and Koolen's measure of local indegree. By using the clickstream dataset, we could modify popularity to give us an understanding of whether readers find an article through searches or by following links. This could provide us with information about whether a given page is externally popular, or whether it is providing important information related to other articles.
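One way to express this split is as the share of an article's incoming traffic that arrives from search versus from links in other articles. The sketch below assumes clickstream rows of the form (prev, curr, type, n), with search traffic recorded under a prev value of "other-search"; both the layout and the function name are assumptions for illustration.

```python
def traffic_shares(rows, title):
    """Split an article's incoming traffic into search, link and other shares.

    rows: iterable of (prev, curr, link_type, n) tuples (assumed layout).
    Returns fractions that sum to 1, or all zeros if no traffic was seen.
    """
    totals = {"search": 0, "link": 0, "other": 0}
    for prev, curr, link_type, n in rows:
        if curr != title:
            continue
        if prev == "other-search":
            totals["search"] += n  # reader arrived from a search engine
        elif link_type == "link":
            totals["link"] += n    # reader followed a link from another article
        else:
            totals["other"] += n   # everything else, e.g. no referrer
    total = sum(totals.values())
    if total == 0:
        return {key: 0.0 for key in totals}
    return {key: value / total for key, value in totals.items()}


rows = [
    ("other-search", "Western philosophy", "external", 60),
    ("Philosophy", "Western philosophy", "link", 30),
    ("other-empty", "Western philosophy", "external", 10),
    ("Philosophy", "Greek language", "link", 8),
]
shares = traffic_shares(rows, "Western philosophy")
```

A high search share would suggest the page is externally popular, while a high link share would suggest it mainly supports other articles.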
Wider importance
Article importance can also be judged using sources external to a given Wikipedia edition. This might be in the form of external lists of important topics (e.g. a list of biographies). It might also be in the form of how many Wikipedia languages have covered a given topic (measuring its universality). Some of these sources might be limited to a certain domain (e.g. biographies) or they may not (as in the Wikipedia example).
A measure of importance on a wider scale can provide signal by acting as an external source of input. Wikipedia's labels of importance arguably judge encyclopedic importance, which one might argue is independent of measures like popularity.
As mentioned, one way of measuring importance could be to use the number of Wikipedia language editions having an article about a certain topic. There are two research projects at the WMF that are using translation to increase content, thereby changing the number of languages articles are in. This can be seen as introducing a bias in the data, and it potentially establishes a feedback loop between these projects. Because of these issues, we do not see this measure as currently providing useful signal for input into a model of article importance.
Network centrality
Articles in Wikipedia create a network through the wikilinks that connect them. Network analysis is a field that concerns itself with understanding these networks. Identifying central nodes in the network is a key part of this analysis because these nodes are regarded as the most important (e.g. in a network of people, the central nodes can have a larger influence on the rest of the network).
Network analysis theory provides us with several measures of network centrality; see Boldi and Vigna[4] for an overview. Of these, PageRank is perhaps the most commonly used approach, particularly for measuring importance in Wikipedia (see Research:Studies of Importance for more coverage).
While PageRank should provide us with some signal of importance, there are several reasons why one might not want to use it as part of a prediction model. First of all, it is costly to calculate and maintain. Secondly, most applications of PageRank to Wikipedia use default parameters in order to identify central nodes in the network using few iterations. These parameters assume a certain user behaviour (PageRank models the "random surfer") and we have reasons to believe these assumptions don't hold in Wikipedia.[5] If Gleich et al.'s results hold today then pageviews do not propagate, meaning importance is mainly dependent on an article's indegree, which we already proposed to use.[supp 2]
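To illustrate how the parameter choice affects the outcome, here is a minimal power-iteration sketch of PageRank on a toy graph. It is a textbook-style implementation written for this page, not the code used in any of the cited studies; the `damping` argument plays the role of the decay parameter discussed in the notes, and the graph and the values 0.3 and 0.85 are made up for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank.

    out_links: dict mapping each page to a list of distinct pages it links to.
    damping: probability that the random surfer follows a link instead of
             jumping to a random page.
    """
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank


# Toy graph: C has two inbound links, D has one (from C).
graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
low = pagerank(graph, damping=0.3)
high = pagerank(graph, damping=0.85)
```

In this toy graph, C ranks above D when `damping` is 0.3, matching the indegree ordering, whereas with 0.85 D overtakes C because it inherits most of C's rank. This is the sense in which a low parameter value makes PageRank behave much like indegree.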
Article characteristics
There are several characteristics of an article that might provide signal, for instance:
- Article age
- Article quality
- Number of contributors
Many characteristics of articles, for example all of the examples above other than quality itself, tend to correlate with article quality. The articles that were written early on tend to have higher quality, have amassed a larger number of contributors and edits (partly because of their age), and so on. As a result, this is mainly about whether we think that article quality is a signal for article importance. One could argue that this is the case: higher quality articles are more important to the encyclopedia. At the same time, if that is the case, then importance is no longer a characteristic of the topic (for which there is an article), but a characteristic of the article itself. The causality of the relationship between importance and quality changes: an article becomes important (partly) because it is high quality, not because the topic itself is important.
The various other sources of signal that we have listed all aim to discover characteristics that are external to the article. We are looking for sources of signal about the topic's importance, not the article itself. Those characteristics are harder to change from within Wikipedia and, as we have discussed above, we are less interested in using them when they can be changed from within.
A whole, or sum of its parts?
The approaches we have discussed so far are all about the topic or the article, but there might also be a possibility of gaining signal by altering the approach we use for building our classifier that predicts importance across an entire edition. Our initial approach was to train a classifier on unanimous ratings by multiple WikiProjects, and our initial results show that it predicts correctly about half the time. We have also built a classifier for WikiProject Medicine that fares much better on articles within that project. So, instead of building a single classifier (a "whole") for the entire English edition, we might get better results if we build classifiers for individual WikiProjects and combine them, thus creating a hybrid classifier ("summing its parts"). One reason why this might work better is that we can in those cases use subject-specific sources of signal, for example by utilizing knowledge from the information retrieval literature.
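The text above does not specify how per-project classifiers would be combined, so the following is only one possible sketch. It assumes each classifier is a callable that maps an article to an importance label, uses a majority vote among the WikiProjects an article belongs to, and falls back to the edition-wide classifier when no project-specific one exists. The function name, the voting rule and the toy models are all hypothetical.

```python
from collections import Counter


def hybrid_predict(article, projects, project_models, global_model):
    """Predict importance from project-specific classifiers where available.

    projects: WikiProjects the article belongs to.
    project_models: dict mapping a project name to a classifier callable.
    global_model: edition-wide classifier used as a fallback.
    """
    votes = [project_models[p](article) for p in projects if p in project_models]
    if not votes:
        return global_model(article)
    # Majority vote; on a tie the label that was voted first wins.
    return Counter(votes).most_common(1)[0][0]


# Toy stand-ins for trained classifiers (hypothetical).
project_models = {"Medicine": lambda article: "High"}
global_model = lambda article: "Low"

in_project = hybrid_predict("Aspirin", ["Medicine", "Chemistry"],
                            project_models, global_model)
no_project = hybrid_predict("Plato", ["Philosophy"],
                            project_models, global_model)
```

Other combination rules, such as weighting each project's vote by how well its classifier performs, would fit the same structure.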
Notes
- ↑ We have tested this in the context of WikiProject Medicine, where it significantly improved performance of a model with global and local indegree as well as popularity.
- ↑ PageRank uses a decay parameter, which in a Wikipedia dataset would largely determine how much an article's PageRank is affected by the PageRank of distant relatives (articles linking to it through a chain). If the decay parameter is set low, as Gleich et al. argue, then an article's PageRank is largely determined by the PageRank of articles that link directly to it.
References
- ↑ Kamps, Jaap; Koolen, Marijn (February 2009). "Is Wikipedia link structure different?". Proceedings of the Second ACM International Conference on Web Search and Data Mining. pp. 232–241. ISBN 9781605583907. doi:10.1145/1498759.1498831.
- ↑ Kamps, Jaap; Koolen, Marijn (2008). "The Importance of Link Evidence in Wikipedia". Advances in Information Retrieval. Springer Berlin Heidelberg. pp. 270–282. ISBN 9783540786450. doi:10.1007/978-3-540-78646-7_26.
- ↑ Dimitrov, D.; Singer, P.; Lemmerich, F.; Strohmaier, M. (2016). "Visual Positions of Links and Clicks on Wikipedia". Proceedings of the 25th International Conference Companion on World Wide Web (WWW '16 Companion). International World Wide Web Conferences Steering Committee. pp. 27–28.
- ↑ Boldi, Paolo; Vigna, Sebastiano (2014). "Axioms for centrality" (PDF). Internet Mathematics 10 (3–4): 222–262.
- ↑ Gleich, David F.; Constantine, Paul G.; Flaxman, Abraham D.; Gunawardana, Asela (April 2010). "Tracking the Random Surfer: Empirically Measured Teleportation Parameters in PageRank". Proceedings of the 19th international conference on World wide web. ISBN 9781605587998. doi:10.1145/1772690.1772730.