Research:Studies of Importance
There have been many large encyclopedias during human history. The fourth edition of the Encyclopaedia Britannica, for instance, contained some 32 million words across its 32,000 pages. It was, however, small compared to the French Encyclopédie méthodique and the German Grosses vollständiges Universal-Lexicon aller Wissenschafften und Künste, both of which contained about 120 million words across 120,000 pages.[1]
All of these are dwarfed by Wikipedia. In March 2014, the German edition alone contained some 979 million words,[2] with the French edition close behind at 950 million words. The English edition was on its way towards two billion words in 2010,[2] and a print edition art project in 2015 covered 7,473 volumes.[3] It is therefore reasonable to conclude that Wikipedia is, by far, the largest encyclopedia in history.
Given the enormous size of Wikipedia, it is worth asking whether it should simply contain everything imaginable, or whether some things are not important enough to warrant an encyclopedic article. In the English Wikipedia, the question of whether something is worthy of inclusion is described in the notability guideline,[4] which describes some of the processes Wikipedia contributors use to determine whether something deserves its own article. Many other language editions have similar guidelines. The issue of whether the encyclopedia contains or covers a specific topic is often referred to in the literature as completeness or comprehensiveness.
This literature review investigates a related topic: importance. Given two Wikipedia articles, should one be completed before the other? If so, how do we determine which of the two has priority? We will look at how researchers have covered this topic across several fields, from computer and information science to history and philosophy.
There are topics related to importance that this literature review will not cover. We have already mentioned completeness and its related topics of comprehensiveness and notability. While some of the cited literature might concern itself with these topics, this review focuses on how those works determine importance. Article quality will also not be covered extensively. While it would be reasonable to expect more important articles to be of higher quality, we see quality as orthogonal to importance. This is discussed in more detail in the terminology section, where we define the terms used in this review, as one of our findings is that terminology usage in the literature is not always consistent.
The main finding of this literature review is that the dominant approach to ranking article importance is based on the link structure between articles; specifically the PageRank algorithm that catapulted Google towards dominance in web search. Other approaches tend to be used on a one-off basis. Reviewing the literature also reveals opportunities for future work. For instance, we know little about how readers and contributors to Wikipedia understand and interpret the notion of article importance, whether contributors use measures of importance in their work, and how different ranking strategies compare.
Methodology
Several sources of research papers were used to gather an overview of the literature:
- The Wikimedia Research Newsletter[5]
- WikiPapers[6]
- Keyword searches on Google Scholar[7] (e.g. “wikipedia article importance”)
The references used in the papers found through the listed methods were also examined in order to discover research not easily found through other means. Similarly, Google Scholar's function to show papers that cite a given paper was used to find more recent work in the field as well as highly-cited work that could be influential.
Terminology
We would like to propose a set of definitions for relevant terminology before we dive into the literature. The usage of terms can differ slightly between research areas, and it can also differ within a single paper. We therefore propose these definitions with the aim of making usage more consistent for researchers moving forward. The terms are typically defined in the context of Wikipedia, but should generalise to other areas.
Importance
We define importance as the relative urgency with which an article needs to be completed, with more important articles needing to be completed sooner. In other words, given two articles X and Y, X has higher importance if it should be completed sooner than Y.
This definition disregards the question of whether a topic (in the form of an article) is worthy of being included in the encyclopedia at all, which as discussed in the introduction is covered by studies of notability, coverage, completeness, and comprehensiveness.[8][9][10][11][12]
Reputation and Authoritativeness
Reputation and authoritativeness play a part in studies of importance through the analysis of the structure of the link network between web pages and sites. A common assumption is that a link from one page (A) to another (B) means that A has a positive opinion about B. With regard to reputation, this means that B gets a slightly higher reputation because of said link. Similarly for authoritativeness, B is regarded as having increased authority on a specific subject because A links to it. It is then possible to identify sites with high reputation or authority, for instance by looking at which ones have a high number of sites/pages pointing to them (this is called indegree and will be discussed in more detail later; Google's PageRank is a more sophisticated variation of this approach).
In Wikipedia, reputation/authoritativeness works a bit differently, and we will therefore not define them in the context of articles. Kamps and Koolen argue that "the authoritativeness of individual pages is essentially the same, and the value of link evidence is primarily to signal topical relevance" due to Wikipedia being a single domain.[13]
Relevance
We take a system-oriented perspective on relevance and define it as the strength of the relationship between an article and a given query.
The reason for taking this perspective is that most of the literature described later in this review takes the same perspective. It comes from the information retrieval field, where one of the standard methodologies is to compare algorithms on their ability to retrieve "relevant" documents given a specific query.
We acknowledge that there are other perspectives on relevance and how it functions. One example is the discussion in the Journal of the American Society for Information Science and Technology, where relevance and theories of information science have been thoroughly debated (e.g. Hjørland, 2010[14] and Cole, 2011[15]). We see that discussion as important both in general and in relation to Wikipedia. As we have not found much work that seeks a deeper understanding of relevance when people visit Wikipedia, further studies in this area are welcome.
Quality
We define article quality as the content quality of a given article.
Article quality in Wikipedia is a thoroughly studied subject, and Wikipedia's own quality assessment scale[16] has been found to correspond to traditional notions of encyclopedic quality.[17] For an overview of much of the research on article quality in Wikipedia, see Mesgari et al, 2015.[18]
Popularity
We define popularity as the reader interest in a topic, as reflected by the number of requests for its corresponding article logged by the Wikimedia Foundation.
The Wikimedia Foundation has made datasets of Wikipedia article views available for many years[19] and more recently also through an API.[20] Some of the research on importance in Wikipedia equates popularity with importance and uses these datasets to measure said popularity.
Popularity can also be defined through interest in domains outside of Wikipedia. For example, a topic might be very popular in the media, or there can be a list of topics in a book or on a website that many consider authoritative. Both of these are relevant to importance on Wikipedia, but the former likely manifests itself through article views, and the latter defines importance through a claim of authority, which we covered previously.
Overview
Over 50 papers were reviewed in order to create this literature review. Not all of them turned out to concern article importance in a sense relevant to this review, nor to provide relevant background information. The table below provides an overview of the approaches to determining article importance that we identified and how many papers relate to each approach. Some papers cover multiple approaches and are therefore counted in several categories.
| Overarching Approach | Algorithm/Approach | No. of papers |
|---|---|---|
| Link network | PageRank | 14 |
| Link network | CheiRank and 2DRank | 6 |
| Link network | HITS | 2 |
| Link network | Hybrid | 2 |
| Link network | Other | 3 |
| Other networks | Miscellaneous | 4 |
| Non network-based | TF/IDF | 3 |
| Non network-based | Readership/Popularity | 3 |
| Non network-based | Miscellaneous | 7 |
Link Network-based Approaches
A large portion of the research literature studies article importance based on wikilinks, the links that go between Wikipedia articles. These links can, for example, point to subjects that explain something mentioned in an article (e.g. the article on philosophy describes how the word originates from Greek and subsequently links to the article on the Greek language), or to related subjects (e.g. the article on philosophy links to an article on Western philosophy for a more thorough description of that topic).
Importance can then be defined through these relationships. If an article is linked to by many other articles (known as having a high "indegree"), it might describe a topic that is needed in order to understand those other articles, suggesting it is an important topic. We will see research that uses this particular method to define importance. Most of the literature uses a method based on recursion,[21] where an article is important if it is linked to by other important articles, and PageRank is the most commonly used algorithm.
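To make the indegree idea concrete, here is a minimal sketch in Python, assuming the link network is available as a list of (source, target) article pairs; the article names are purely illustrative.

```python
from collections import Counter

# Hypothetical link network as (source article, target article) pairs.
links = [
    ("Philosophy", "Greek language"),
    ("Philosophy", "Western philosophy"),
    ("Logic", "Greek language"),
    ("Mathematics", "Greek language"),
]

# Indegree: the number of links pointing to a given article.
indegree = Counter(target for _source, target in links)

# Rank articles by indegree as a crude importance ordering.
for article, count in indegree.most_common():
    print(article, count)
```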
PageRank
PageRank is the ranking algorithm behind Google's search engine. It aims to model the "random surfer": someone who visits a page and, with some probability, follows one of the links found on that page, or otherwise ignores the links and instead uses a bookmark to go to a completely different page. The algorithm is described in more detail in (Page, Brin, Motwani, and Winograd, 1998)[22] and (Brin and Page, 2012),[23] and it is also commonly summarised in papers that use it.
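As an illustration of the random-surfer model, the following is a minimal PageRank sketch using power iteration on a toy graph; the graph, damping factor, and iteration count are illustrative assumptions rather than the setup of any cited paper.

```python
# Minimal PageRank via power iteration on a toy link graph.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],  # a dangling page with no outgoing links
}

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        # Every page keeps a base amount of rank (the "teleportation" term).
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling pages spread their rank evenly across all pages.
                for target in graph:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))
```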
Two research papers study PageRank in the context of Wikipedia in order to learn more about both Wikipedia and the algorithm. Gleich et al (2010)[24] investigate whether people who visit Wikipedia tend to click on links in articles in similar proportion to visitors of other websites. They find that on Wikipedia, visitors follow links to other articles much less often than elsewhere on the web, and propose ways to alter the PageRank algorithm to account for this. Ermann et al (2013)[25] study the properties of Wikipedia's Google matrix, a standardised form of the matrix that encodes the link relationships between all articles in Wikipedia.
Bellomi and Bonato (2005)[26] apply PageRank to the English Wikipedia edition in order to understand more about the high-level structure of Wikipedia, and to gain insight into its content and potentially hidden cultural biases. Most of their results are comparisons between PageRank and HITS (for more information on HITS, see its own section), but they also study the 300 articles with the highest PageRank and find a bias towards religious topics (e.g. Christianity, Roman Catholic Church).
Ganjisaffar, Javanmardi, and Lopes (2009)[27] use PageRank to measure article relevance, and compare its performance against a text-similarity approach (through Apache Lucene) and against counting the number of contributors to an article. They note that links on Wikipedia do not function as "votes for authoritativeness", unlike what is generally assumed for links on web pages. A combination of the three approaches is found to have the highest performance. They also find that the number of contributors to an article, which is intended as a proxy for article quality, performs slightly better than PageRank.
Zhirov, Zhirov, and Shepelyansky (2010)[28] study two-dimensional ranking of Wikipedia articles, with PageRank as one of the dimensions (the other dimension is CheiRank, and they combine to form 2DRank; both are covered in a separate section below). They study how articles in the English edition are ranked by PageRank both overall and within several categories: countries, universities, biographies, physicists, Nobel laureate physicists, chess players, and companies listed in the Dow Jones index. For many of these, the rankings from PageRank and its variants correspond closely to the various reference rankings.
Eom et al (2013)[29] study how article rankings change over time. They gather snapshots of the English edition every two years from 2003 to 2011 and apply PageRank (as well as CheiRank and 2DRank) to each dataset individually. For each year, they study how the ranking of different categories of articles changes.
Shuai et al (2013)[30] compare academic and Wikipedia rankings of selected research papers, authors, and topics. The ACM Digital Library is used to measure academic rankings, which are based on the frequency of occurrence of a term, frequency of citation, and PageRank for a selected set of articles. On the Wikipedia side, a dataset of the English edition is used, and the rankings are based on editing frequency, PageRank, and article weight (with the latter always set to 1). They find that Wikipedia favours influential work and focuses on trending topics rather than more classical or traditional ones.
Eom and Shepelyansky (2013)[31] study the cultural differences between nine Wikipedia editions: English, French, German, Italian, Spanish, Dutch, Russian, Hungarian, and Korean. For each of them they apply PageRank, CheiRank, and 2DRank, and examine the 30 most highly ranked individuals. They find that each language edition favours local people; for example, the top-ranked people in the Korean edition are either Korean, Japanese, or Chinese. They also find that PageRank favours politicians.
Hanada, Cristo, and Pimentel (2013)[32] study the relationship between link analysis metrics (indegree, outdegree, and PageRank) and Wikipedia article quality and importance. Their Wikipedia dataset is from the Brazilian edition and contains articles that have both a quality and an importance rating. They also utilise a Brazilian dataset of websites with which they compare the Wikipedia results. Correlational analysis is used to measure the relationship. For article quality and popularity, they find only a moderate correlation with the link analysis metrics, while the correlation with importance is categorised as "weak".
Eom et al (2015)[33] expand on the previous papers by Aragón et al (2012, described in further detail below) and Eom and Shepelyansky (2013, described above). The analysis of cultural differences is extended to 24 Wikipedia editions. As in the 2013 paper, three algorithms are used to rank individuals: PageRank, CheiRank, and 2DRank. Rankings for each edition as well as an overall global ranking are produced. Overall, they find a skew towards Western, modern, and male historical figures. There is also a tendency towards local preference, in that each Wikipedia edition favours historical figures born in countries speaking that edition's language. Lastly, they identify a set of global historical figures: those who are highly ranked in most Wikipedia editions.
Gloor et al (2015)[34] study four Wikipedia editions (English, Chinese, Japanese, and German) in order to learn about differences in their world views across history. From each of the four editions they gather a dataset of biographies. They then build link networks between people constrained by lifespan, meaning that person X links to person Y only if both X and Y were alive at the same time. This approach leads to the creation of 4,900 networks across the four editions. PageRank is then applied to these networks in order to identify the most important people in each network, and the paper describes these networks in more detail.
Siddiqui (2015)[35] uses Wikipedia data to rank rock guitarists. The English edition is used to create a dataset of rock guitarists based on their Wikipedia pages. Information about who influenced a specific guitarist is extracted and used to build a network of these influences. PageRank is then applied to this network in order to identify the most important guitarists. The resulting ranking is compared to several other rankings, made for instance by magazines like Rolling Stone and Guitar World. Applying PageRank to the influence network reveals a few highly influential guitarists that appear to be overlooked by the other lists.
Lages et al (2016)[36] apply PageRank (as well as CheiRank and 2DRank) to the link networks of 24 Wikipedia editions in order to create the "Wikipedia Ranking of World Universities". This ranking is then compared to the Academic Ranking of World Universities, a ranking started by Shanghai Jiao Tong University in 2003 (which is why it is also known as the "Shanghai Ranking"). They find a large degree of overlap between the top 100 universities in the two rankings, both in terms of which universities are listed and their individual positions.
Thalhammer and Rettinger (2016)[37] compare several different PageRank-based analyses for Wikipedia. They are particularly interested in how different types of links, for example those found in templates or just those found in an article's lede, affect the rankings. In addition to comparing these different strategies for extracting links, they compare the rankings from several different datasets (e.g. DBpedia[38] and the Open Wikipedia Ranking Project[39]) and ranking methods (e.g. PageRank, page views). Their main findings are that links in templates negatively affect PageRank scores and that links found early in an article are more likely to be clicked.
CheiRank and 2DRank
Five of the reviewed papers utilise CheiRank and 2DRank: Zhirov, Zhirov, and Shepelyansky, 2010[28]; Eom et al, 2013[29]; Eom and Shepelyansky, 2013[31]; Eom et al, 2015[33]; and Lages et al, 2016.[36] Ermann et al, 2015[40] provides an overview of the research using CheiRank and 2DRank, both in the context of Wikipedia and elsewhere.
CheiRank functions similarly to PageRank, except that the flow of ranking scores moves in the opposite direction: whereas in PageRank scores flow along the outgoing links from a page, in CheiRank they flow along the incoming links. The result is that the rankings are based on a page's "communicative power" rather than on a notion of authoritativeness or reputation, as is the case with PageRank. The resulting rankings are consequently very different; CheiRank will, for example, rank list articles highly. The two approaches can be compared to the hubs and authorities in HITS, an algorithm covered in more detail in the next section.
2DRank is a two-dimensional, rank-based combination of PageRank and CheiRank. Zhirov, Zhirov, and Shepelyansky (2010) perform the first investigation into this type of ranking, and note that, for countries for example, it appears to prioritize those with a higher degree of historical connectivity (e.g. Egypt) or tourism (e.g. Thailand and Malaysia), compared to PageRank, which is interpreted as prioritizing more authoritative countries (e.g. the United States and the United Kingdom). This ranking is then utilized in other papers to investigate cultural differences between Wikipedia language editions (e.g. Eom et al, 2015).
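A simplified sketch of the relationship between the two ranks, using the networkx library on a toy graph: CheiRank is computed here as PageRank on the reversed graph, and the two rank positions are then simply summed. The actual 2DRank procedure of Zhirov, Zhirov, and Shepelyansky (2010) combines the ranks in a more involved way, so this is only an illustration.

```python
import networkx as nx

# Toy directed link graph; an edge X -> Y means article X links to article Y.
G = nx.DiGraph([
    ("List of philosophers", "Plato"),
    ("List of philosophers", "Aristotle"),
    ("Aristotle", "Plato"),
    ("Plato", "Philosophy"),
    ("Philosophy", "Plato"),
])

pr = nx.pagerank(G)              # PageRank on the original graph
chei = nx.pagerank(G.reverse())  # CheiRank: PageRank on the reversed graph

# Simplified two-dimensional combination: order articles by the sum of their
# positions in the two rankings.
pr_pos = {a: i for i, a in enumerate(sorted(pr, key=pr.get, reverse=True))}
chei_pos = {a: i for i, a in enumerate(sorted(chei, key=chei.get, reverse=True))}
print(sorted(G.nodes, key=lambda a: pr_pos[a] + chei_pos[a]))
```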
HITS
HITS is a commonly used algorithm for identifying particular nodes in a directed graph, or, in the case of Wikipedia, particular pages based on the links between them. HITS was first described by Kleinberg (1999).[41] Whereas PageRank calculates a single score for each page, HITS calculates two: a hub score and an authority score. A hub is a page that links to many pages with high authority scores, and an authority is a page that is linked to by many hubs. As with PageRank, these definitions are recursive and the calculations are similar.
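As a brief illustration, the following sketch applies the HITS implementation from the networkx library to a toy link graph; the article names are invented for the example.

```python
import networkx as nx

# Toy directed graph: an edge X -> Y means article X links to article Y.
G = nx.DiGraph([
    ("Overview article", "Plato"),
    ("Overview article", "Aristotle"),
    ("Survey article", "Plato"),
    ("Survey article", "Aristotle"),
    ("Plato", "Philosophy"),
])

# HITS returns two scores per node: hubs (pages linking to good authorities)
# and authorities (pages linked to by good hubs).
hubs, authorities = nx.hits(G)
print("Top hub:", max(hubs, key=hubs.get))
print("Top authority:", max(authorities, key=authorities.get))
```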
HITS was first explored in the context of Wikipedia by Bellomi and Bonato (2005)[26] in a presentation at the 2005 Wikimania conference. As previously mentioned, they also applied PageRank to a dataset of English Wikipedia articles. They focus only on the authority side of HITS, as that is comparable to PageRank. Examining the 300 articles with the highest authority score, they find that the ranking focuses on "space" and "time": the former in the form of "political geographical denominations", and the latter in the form of "both time spans and landmark historical events".
Kamps and Koolen (2009)[13] use HITS in their study of query performance on the Wikipedia XML Corpus used at INEX.[42] They incorporate the hub and authority scores in order to understand whether link information improves performance. The results show that the approach fails, and they attribute this to there being very little difference between hubs and authorities in Wikipedia's link structure.
Hybrid Approaches
Kamps and Koolen apply two hybrid approaches in their 2008 and 2009 papers. In their 2008 paper[43] they find that a combination of global and local indegree improves performance compared to their baseline. They use the global indegree to normalise the local indegree in a way similar to TF/IDF,[44] meaning that articles with a high local indegree will be less influential if they also have a high global indegree.
In their 2009 work,[13] the hybrid approach combines three methods: a mixture language model, local and global indegree, and article length. The mixture language model is built from the article's text, link anchor text (the text used to link to the given article), and the article's title. Local indegree is used both in its raw form and as its logarithm, and global indegree is used to dampen the effect of high local indegree, as in their 2008 paper.
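The flavour of this normalisation can be sketched as follows, under assumed indegree counts and a TF/IDF-style weighting; the exact weighting schemes used by Kamps and Koolen differ from this illustration.

```python
import math

# Hypothetical indegree counts: local_indegree counts links from articles
# retrieved for a query, global_indegree counts links from the whole wiki.
local_indegree = {"Philosophy": 12, "Plato": 8, "Greece": 9}
global_indegree = {"Philosophy": 2500, "Plato": 900, "Greece": 15000}
total_articles = 5_000_000  # illustrative collection size

def link_score(article):
    # TF/IDF-style weighting: a high local indegree counts for less when the
    # article is also linked from a large share of the whole collection.
    idf_like = math.log(total_articles / (1 + global_indegree[article]))
    return local_indegree[article] * idf_like

for article in sorted(local_indegree, key=link_score, reverse=True):
    print(article, round(link_score(article), 2))
```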
Other Approaches
Hwang et al (2010)[45] introduce BinRank, an approach for quickly computing a PageRank-like score for articles given a specific query. This allows for improved performance since the link-based ranking score is localised to the given query. BinRank is based on ObjectRank,[46] which in turn is similar to Personalized PageRank.[47] All of these are alike in that they aim to improve on PageRank by moving from a global score to one that is localised to a given query (e.g. articles that are relevant to someone searching for "Barack Obama"), thus providing better search results.
Aragón et al (2012)[48] study the differences between networks of biographies in 15 Wikipedia editions. They use a starting set of biographies from the English Wikipedia and identify the most central people using betweenness centrality.[49] They find that these tend to be political leaders, revolutionaries, famous musicians, writers, and actors; in other words, people generally regarded as highly influential.
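For readers unfamiliar with the measure, here is a small sketch of betweenness centrality using the networkx library on an invented network of biography articles.

```python
import networkx as nx

# Invented network of biography articles; edges are links between biographies.
G = nx.DiGraph([
    ("Voltaire", "Rousseau"),
    ("Rousseau", "Robespierre"),
    ("Robespierre", "Napoleon"),
    ("Voltaire", "Frederick the Great"),
    ("Frederick the Great", "Napoleon"),
])

# Betweenness centrality: how often a node lies on shortest paths between
# pairs of other nodes; central figures bridge otherwise separate parts.
centrality = nx.betweenness_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(person, round(score, 3))
```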
As mentioned previously, Hanada, Cristo, and Pimentel (2013)[32] study the relationship between link analysis metrics and Wikipedia article quality and importance. In addition to applying PageRank, they also test two straightforward measures of connectedness: indegree and outdegree. In the context of Wikipedia, indegree is the number of links pointing from other articles to a specific article, and outdegree is the number of links pointing from that article to other articles. For both of these they find only weak or moderate correlations, similar to what they found for PageRank.
Other Network-based Approaches
Several papers utilize network-based approaches on datasets other than the network of links between articles. Korfiatis, Poulos, and Bokos (2006)[50] build a "social network graph" between articles and contributors (i.e. there is an edge between contributor X and article Y if X edited Y), but also between contributors (there is a directional edge from contributor X to contributor Z if X edited article Y after Z). Edges between contributors are weighted by how much of the content was retained after the edit. They then apply the "centrality index" measure (Freeman 1978[51]; Sabidussi 1966[52]) to measure the authoritativeness of a contributor in this social network graph, and use this to identify the most central contributors to various philosophy-related articles.
Athenikos and Lin (2009)[53] study English Wikipedia articles about philosophers. They build a network consisting of both simple links between the articles and connections based on who influenced whom (e.g. Immanuel Kant was influenced by, for example, Plato, Aristotle, Hume, and Smith, meaning there are "influenced by" links from Kant to all four of those). They then analyse which philosophers are the most central in this network, using network measurements like indegree as well as a link reduction algorithm they call "strongest link path".
Maniu et al (2011)[54] build a network of trust between Wikipedia contributors based on various actions they take (e.g. deleting a contributor's contribution is a sign of lack of trust in that contributor). They investigate whether these links can be predicted using logistic regression and achieve good performance. Lastly, they use summary measures of an article's trust network to predict the quality and importance of a limited set of English Wikipedia articles about politics, again achieving good performance.
Warncke-Wang et al (2012)[55] use inter-language links as a way to identify the most important articles across all Wikipedia editions. Inter-language links connect articles about the same topic in different languages. They use the simple measure of counting the number of languages in which an article is found, and first categorise the 823 articles found in over half of all languages. They then list the 20 articles found in the most languages and discuss how some of these are unexpected (e.g. the article about the German town of Uetersen).
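As a small illustration of this measure, the following sketch counts, for each topic, the number of language editions covering it; the data are invented for the example.

```python
from collections import Counter

# Hypothetical inter-language link data: for each topic, the set of language
# editions that have an article about it.
languages_per_topic = {
    "Earth": {"en", "de", "fr", "es", "ru", "ja"},
    "Uetersen": {"en", "de", "fr", "nl"},
    "Philosophy": {"en", "de", "fr", "es", "ru"},
}

# Rank topics by the number of language editions covering them.
ranking = Counter({topic: len(langs) for topic, langs in languages_per_topic.items()})
for topic, count in ranking.most_common():
    print(topic, count)
```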
Other Approaches
TF/IDF
Term frequency/inverse document frequency[44] (abbreviated TF/IDF) is a commonly used approach to identifying key terms (e.g. words) in documents. Term frequency refers to how often a term occurs in a document, while inverse document frequency down-weights terms that occur in many documents across the collection. Multiplying the two balances out the scores of common terms, i.e. the most used words in a given language.
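A minimal sketch of the TF/IDF calculation on a toy corpus, assuming the common formulation where term frequency is multiplied by the logarithm of the inverse document frequency; individual papers and libraries use slightly different variants.

```python
import math
from collections import Counter

# A tiny toy corpus; each document is a list of words.
docs = [
    "philosophy is the study of general and fundamental questions".split(),
    "greek philosophy influenced western philosophy".split(),
    "the study of language".split(),
]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)             # term frequency
    df = sum(1 for d in docs if term in d)         # document frequency
    idf = math.log(len(docs) / df) if df else 0.0  # inverse document frequency
    return tf * idf

# "philosophy" is frequent in the second document but damped because it also
# appears in the first; very common words such as "the" score low.
print(tf_idf("philosophy", docs[1], docs))
print(tf_idf("the", docs[0], docs))
```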
Zaragoza et al (2007)[56] apply a variant of TF/IDF to entities in Wikipedia in order to understand how it affects performance when ranking entity queries. They find that this approach provides strong performance compared to several other approaches (one of which is based on an "Entity Containment Graph", which will be covered in more detail below).
The previously reviewed paper by Ganjisaffar, Javanmardi, and Lopes (2009)[27] compared the performance of text similarity, PageRank, and number of contributors for ranking Wikipedia articles. They used the Apache Lucene[57] search engine to calculate text similarity. By default, Lucene relies heavily on TF/IDF for queries, and their paper does not suggest they deviated from this. They found that a combination of all three approaches has the best performance.
Schwarzer et al (2016)[58] studied recommendations of Wikipedia articles, using three competing approaches to finding articles to recommend: co-citation, co-citation proximity analysis (CPA),[59] and "MoreLikeThis".[60] The latter is Apache Lucene's built-in method for finding similar documents and relies heavily on TF/IDF. They found that CPA and MoreLikeThis have different strengths, and suggested that a hybrid approach might combine these strengths for further improvements in performance.
Readership/Popularity
How popular a topic is can of course also be used as a way to determine importance. Rosenzweig (2006)[9] discussed how Wikipedia's articles generally reflect a popular, rather than an academic, interest in history; in other words, he suggests that whether an article exists, or is of high quality, is mainly determined by the number of people interested in that topic.
Warncke-Wang et al (2015)[61] rank articles by popularity as measured by article views. Their datasets use the publicly available view dumps from the Wikimedia Foundation. While their paper focuses on whether articles that are more popular are also of higher quality, the underlying assumption is that importance follows in lockstep with popularity.
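While that paper uses the view dumps, the pageviews API mentioned earlier[20] can serve the same purpose. Below is a rough sketch of ranking a handful of articles by recent view counts; the article list, date range, and request details are illustrative, and the current API documentation should be checked before relying on them.

```python
import requests

# Sketch of ranking articles by page views via the Wikimedia pageviews API.
ARTICLES = ["Philosophy", "Plato", "Uetersen"]
URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/{article}/daily/2017010100/2017013100")

def total_views(article):
    resp = requests.get(URL.format(article=article),
                        headers={"User-Agent": "importance-review-example"})
    resp.raise_for_status()
    return sum(item["views"] for item in resp.json().get("items", []))

# Rank the articles by their total views in the chosen period.
views = {article: total_views(article) for article in ARTICLES}
for article, count in sorted(views.items(), key=lambda kv: -kv[1]):
    print(article, count)
```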
Wulczyn et al (2016)[62] study how to successfully recommend articles for translation. Articles are recommended by ranking them according to predicted popularity and filtering them based on a contributor's interests. Their approach to predicting popularity does not seek to predict an exact number of article views; instead, they find that predicting an article's popularity rank performs better. Again, the underlying assumption is that more popular articles are more important.
Miscellaneous
As we saw for the network-based approaches, there is a myriad of ways of measuring importance and ranking articles that are each applied by only a single paper.
Zaragoza et al (2007)[56] used TF/IDF as one of their approaches when ranking entities, but they also use a graph-based approach called the "Entity Containment Graph". This graph has two types of nodes: entities, and the passages in which entities occur. They then apply various network measurements to the entities (e.g. indegree) and compare their performance for ranking entities. As described above in the TF/IDF section, the performance of their TF/IDF variant is slightly better than that of these network-based approaches.
In addition to PageRank and TF/IDF, which we have already covered, Ganjisaffar, Javanmardi, and Lopes (2009)[27] used the number of contributors to an article as a method. As the number of contributors has been found to correlate with article quality, it can be regarded as a proxy for quality. They found that this measure performed slightly better than PageRank, but that a combination of all three approaches has the highest performance.
Royal and Kapila (2008)[63] use several external datasets in their study of what Wikipedia covers. They use the number of words as a proxy for quality/coverage and measure the correlation between it and their external datasets. This is similar to Rosenzweig (2006),[9] who compared Wikipedia to the American National Biography Online, although Royal and Kapila's approach uses several datasets and a clearer methodology. They find that Wikipedia has a recency effect: it covers more recent events in greater detail, and more common/popular terms have greater coverage.
Warncke-Wang et al (2012)[55] ranked articles by how many Wikipedia language editions they appear in, but also proposed the amount of content across all languages as a simple measure of importance. While counting the number of languages resulted in some oddities in the list of top 20 articles, measuring the amount of content showed no sign of these.
In their comparison of academic citations in Wikipedia and the ACM Digital Library, Shuai et al (2013)[30] used editing frequency (number of edits) as one of their approaches. They found that edit frequency and PageRank both performed worse than simply giving all articles equal weight. As mentioned in the PageRank section, their main findings were that Wikipedia favours influential work and focuses on trending topics rather than more classical or traditional topics.
As we covered previously, Schwarzer et al (2016)[58] compared a TF/IDF approach with Co-citation Proximity Analysis[59] (CPA) for recommending Wikipedia articles. With co-citation, two articles are similar if both are linked from the same article. Proximity analysis adds to this in that two articles are more similar if these links are close to each other in the linking article, and less similar if they are further apart. As mentioned previously, they find that CPA has different strengths than the TF/IDF approach, and suggest that combining the two could further improve performance.
Lewoniewski, Węcel, and Abramowicz (2016)[64] study whether it is possible to apply machine learning to predict article importance. They use the article importance ratings commonly found on article talk pages as their dataset, gathered from four language editions: English, French, Polish, and Russian. 85 different parameters (e.g. article length, number of links, popularity) are used as inputs to their model. Top-importance (the highest rating) and Low-importance (the lowest rating) are the two ratings predicted with the highest performance; both are predicted correctly about two thirds of the time.
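As an illustration of this kind of setup, here is a minimal sketch using a random forest classifier on invented data; the features and ratings are hypothetical and far simpler than the 85 parameters used in the paper.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy illustration of the general setup: a few hypothetical per-article
# features (length in words, number of links, monthly page views) and the
# talk-page importance rating. This is not the 85-parameter model used by
# Lewoniewski, Węcel, and Abramowicz.
X = [
    [12000, 450, 300000],  # long, well linked, very popular
    [800, 15, 1200],       # short, few links, little traffic
    [5000, 120, 40000],
    [300, 5, 200],
]
y = ["Top", "Low", "Mid", "Low"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict([[9000, 380, 150000]]))  # predicted importance rating
```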
Conclusion
This literature review has covered much of the research literature on Wikipedia article importance. While the focus has been on the definitions of article importance that have been used and the methods applied to measure it and rank articles, we have also situated this research in a broader context, for instance how Wikipedia compares in size and coverage to other encyclopedias.
By reviewing the literature we found that key terms are often defined differently or used interchangeably. Distinguishing between importance, relevance, and authoritativeness is challenging, and further studies into how these terms are best defined in the context of Wikipedia are welcome. In this review we defined these terms explicitly, particularly article importance, in order to make their meaning clear to the reader.
We have found that the main approach in the literature is based on the links between articles, as this allows established network analysis methods from computer science to be applied. PageRank, the recursive algorithm for identifying authoritative pages on the web, is clearly the most frequently applied method. However, studies of Wikipedia's link structure suggest that it should be used with caution, and further research into how PageRank behaves on Wikipedia would be useful. We have also seen several other approaches being used, mainly from information retrieval.
References
- ↑ Jeff Loveland (2012). "Why Encyclopedias Got Bigger . . . and Smaller.". Information & Culture: A Journal of History 42 (2): 233–254. doi:10.1353/lac.2012.0012.
- ↑ a b Zachte, Erik. "Wikipedia Statistics – Words". Retrieved Feb 17, 2017.
- ↑ Wikipedia contributors. "Print Wikipedia". Retrieved Feb 17, 2017.
- ↑ Wikipedia contributors. "Wikipedia:Notability". Retrieved Feb 17, 2017.
- ↑ WikiPapers. "WikiPapers". Retrieved Feb 17, 2017.
- ↑ Google Inc. "Google Scholar homepage". Retrieved Feb 17, 2017.
- ↑ Wallace, D.; Van Fleet, C. (2005). "From the Editors: The Democratization of Information? Wikipedia as a Reference Resource". Reference & User Services Quarterly 45 (2): 100–103.
- ↑ a b c Rosenzweig, Roy (June 2006). "Can History be Open Source? Wikipedia and the Future of the Past". Journal of American History 93 (1): 117–146.
- ↑ Lam, Shyong (Tony) K.; Riedl, John (2009). "Is Wikipedia growing a longer tail?" (PDF). Proceedings of the international conference on Supporting group work.
- ↑ Taraborelli, Dario; Ciampaglia, Giovanni Luca (2010). Beyond notability. Collective deliberation on content inclusion in Wikipedia. Self-Adaptive and Self-Organizing Systems Workshop.
- ↑ Reagle, Joseph; Rhue, Lauren. "Gender Bias in Wikipedia and Britannica". International Journal of Communication 5: 1138–1158.
- ↑ a b c Kamps, Jaap; Koolen, Marijn (February 2009). "Is Wikipedia link structure different?". Proceedings of the Second ACM International Conference on Web Search and Data Mining. pp. 232–241. ISBN 9781605583907. doi:10.1145/1498759.1498831.
- ↑ Hjørland, Birger (2010). "The foundation of the concept of relevance". Journal of the American Society for Information Science and Technology 61 (2): 217–237.
- ↑ Cole, Charles (2011). "A theory of information need for information retrieval that connects information to knowledge". Journal of the American Society for Information Science and Technology 62 (7): 1216–1231.
- ↑ Wikipedia contributors. "Wikipedia:WikiProject assessment". Retrieved Feb 17, 2017.
- ↑ Stvilia, Besiki; Twidale, Michael B.; Smith, Linda C.; Gasser, Les (2008). "Information quality work organization in Wikipedia". Journal of the American Society for Information Science and Technology 59 (6): 983–1001.
- ↑ Mesgari, Mostafa; Okoli, Chitu; Mehdi, Mohamad; Nielsen, Finn Årup; Lanamäki, Arto (2015). ""The sum of all human knowledge": A systematic review of scholarly research on the content of Wikipedia". Journal of the Association for Information Science and Technology 66 (2): 219–245.
- ↑ Wikimedia Foundation. "Wikimedia Downloads: Analytics Datasets". Retrieved Feb 20, 2017.
- ↑ Dan Andreescu, Wikimedia Foundation (Dec 14, 2015). "Making our page view data easily accessible". Retrieved Feb 20, 2017.
- ↑ Wikipedia contributors. "Recursion - Wikipedia". Retrieved Feb 20, 2017.
- ↑ Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry (January 1998). "The PageRank citation ranking: bringing order to the web" (PDF).
- ↑ Brin, Sergey; Page, Lawrence (December 2012). "Reprint of: The anatomy of a large-scale hypertextual web search engine". Computer Networks 56 (18).
- ↑ Gleich, David F.; Constantine, Paul G.; Flaxman, Abraham D.; Gunawardana, Asela (April 2010). "Tracking the Random Surfer: Empirically Measured Teleportation Parameters in PageRank". Proceedings of the 19th international conference on World wide web. ISBN 9781605587998. doi:10.1145/1772690.1772730.
- ↑ Ermann, Leonardo; Frahm, Klaus M.; Shepelyansky, Dima L. (May 2013). "Spectral properties of Google matrix of Wikipedia and other networks". The European Physical Journal B 86 (193). doi:10.1140/epjb/e2013-31090-8.
- ↑ a b Bellomi, F.; Bonato, R. (2005). "Network analysis for Wikipedia". Proceedings of Wikimania 2005.
- ↑ a b c Ganjisaffar, Yasser; Javanmardi, Sara; Lopes, Cristina (June 2009). "Review-Based Ranking of Wikipedia Articles". International Conference on Computational Aspects of Social Networks. doi:10.1109/CASoN.2009.14.
- ↑ a b A. O. Zhirov, O. V. Zhirov, D. L. Shepelyansky (October 2010). "Two-dimensional ranking of Wikipedia articles". The European Physical Journal B 77 (4). doi:10.1140/epjb/e2010-10500-7.
- ↑ a b Eom, Young-Ho; Frahm, Klaus M.; Benczúr, András; Shepelyansky, Dima L. (December 2013). "Time evolution of Wikipedia network ranking". The European Physical Journal B 84 (492). doi:10.1140/epjb/e2013-40432-5.
- ↑ a b Shuai, Xin; Jiang, Zhuoren; Liu, Xiaozhong; Bollen, Johan (July 2013). "A comparative study of academic and Wikipedia ranking". Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries. pp. 25–28. ISBN 9781450320771. doi:10.1145/2467696.2467746.
- ↑ a b Eom, Young-Ho; Shepelyansky, Dima L. (October 2013). "Highlighting Entanglement of Cultures via Ranking of Multilingual Wikipedia Articles". PLOS ONE 8 (10). doi:10.1371/journal.pone.0074554.
- ↑ a b Hanada, Raíza; Cristo, Marco; Pimentel, Maria da Graça Campos (November 2013). "How do metrics of link analysis correlate to quality, relevance and popularity in wikipedia?". Proceedings of the 19th Brazilian symposium on Multimedia and the web. pp. 105–112. ISBN 9781450325592. doi:10.1145/2526188.2526198.
- ↑ a b Young-Ho Eom, Pablo Aragón, David Laniado, Andreas Kaltenbrunner, Sebastiano Vigna, Dima L. Shepelyansky (March 2015). "Interactions of Cultures and Top People of Wikipedia from Ranking of 24 Language Editions". PLOS ONE 10 (3). doi:10.1371/journal.pone.0114825.
- ↑ Peter A. Gloor, Joao Marcos, Patrick M. de Boer, Hauke Fuehres, Wei Lo, Keiichi Nemoto (July 2015). "Cultural Anthropology through the Lens of Wikipedia: Historical Leader Networks, Gender Bias, and News-based Sentiment". arXiv Preprint.
- ↑ Siddiqui, Muazzam A. (November 2015). "Mining Wikipedia to Rank Rock Guitarists" (PDF). International Journal of Intelligent Systems and Applications 12: 50–56.
- ↑ a b Lages, José; Patt, Antoine; Shepelyansky, Dima L. (2016). "Wikipedia Ranking of World Universities". The European Physical Journal B 89 (3): 1–12.
- ↑ Thalhammer, Andreas; Rettinger, Achim (2016). PageRank on Wikipedia: Towards General Importance Scores for Entities. Semantic Web Conference. pp. 227–240.
- ↑ Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z (2007). "DBpedia: A Nucleus for a Web of Open Data". The Semantic Web. pp. 722–735.
- ↑ Laboratory for Web Algorithmics of the Università degli Studi di Milano. "The Open Wikipedia Ranking 2016". Retrieved Feb 20, 2017.
- ↑ Ermann, Leonardo; Frahm, Klaus M.; Shepelyansky, Dima L. (2015). "Google matrix analysis of directed networks". Reviews of Modern Physics 87 (4).
- ↑ Kleinberg, Jon M. (1999). "Authoritative sources in a hyperlinked environment". Journal of the ACM 46 (5): 604–632.
- ↑ Denoyer, Ludovic; Gallinari, Patrick (June 2006). "The Wikipedia XML Corpus". SIGIR Forum 40 (1): 64–69.
- ↑ Kamps, Jaap; Koolen, Marijn (2008). "The Importance of Link Evidence in Wikipedia". Advances in Information Retrieval. Springer Berlin Heidelberg. pp. 270–282. ISBN 9783540786450. doi:10.1007/978-3-540-78646-7_26.
- ↑ a b Wikipedia contributors. "tf–idf - Wikipedia". Retrieved Feb 20, 2017.
- ↑ Hwang, Heasoo; Balmin, Andrey; Reinwald, Berthold; Nijkamp, Erik (Aug 2010). "BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs". IEEE Transactions on Knowledge and Data Engineering 22 (8): 1176–1190. doi:10.1109/TKDE.2010.85.
- ↑ Balmin, Andrey; Hristidis, Vagelis; Papakonstantinou, Yannis (2004). "ObjectRank: Authority-Based Keyword Search in Databases". Proceedings of the Thirtieth international conference on Very large data bases. pp. 564–575.
- ↑ Jeh, Glen; Widom, Jennifer (May 2003). "Scaling Personalized Web Search". Proceedings of the 12th international conference on World Wide Web. pp. 271–279. doi:10.1145/775152.775191.
- ↑ Aragon, Pablo; Laniado, David; Kaltenbrunner, Andreas; Volkovich, Yana (August 2012). "Biographical social networks on Wikipedia: a cross-cultural study of links that made history" (PDF). Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. ISBN 9781450316057. doi:10.1145/2462932.2462958.
- ↑ Freeman, Linton C (1977). "A set of measures of centrality based on betweenness". Sociometry: 35–41.
- ↑ Korfiatis, Nikolaos Th.; Poulos, Marios; Bokos, George (2006). "Evaluating authoritative sources using social networks: an insight from Wikipedia". Online Information Review 30 (3): 252–262. doi:10.1108/14684520610675780.
- ↑ Freeman, Linton C. (1978). "Centrality in social networks: Conceptual clarification". Social networks 1 (3): 215–239.
- ↑ Sabidussi, Gert (1966). "The centrality index of a graph". Psychometrika 31 (4): 581–603.
- ↑ Athenikos, Sofia J.; Lin, Xia (July 2009). "The WikiPhil Portal: Visualizing Meaningful Philosophical Connections". Journal of the Chicago Colloquium on Digital Humanities and Computer Science 1 (1).
- ↑ Maniu, Silviu; Abdessalem, Talel; Cautis, Bogdan (April 2011). "Casting a web of trust over Wikipedia: an interaction-based approach". Proceedings of the 20th international conference companion on World Wide Web. ISBN 9781450306379. doi:10.1145/1963192.1963237.
- ↑ a b Warncke-Wang, Morten; Uduwage, Anuradha; Dong, Zhenhua; Riedl, John (August 2012). "In Search of the Ur-Wikipedia: Universality, Similarity, and Translation in the Wikipedia Inter-language Link Network" (PDF). Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. ISBN 9781450316057. doi:10.1145/2462932.2462959.
- ↑ a b Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, and Giuseppe Attardi (November 2007). "Ranking very many typed entities on Wikipedia". Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. pp. 1015–1018. ISBN 9781595938039. doi:10.1145/1321440.1321599.
- ↑ The Apache Software Foundation. "Apache Lucene". Retrieved Feb 20, 2017.
- ↑ a b Malte Schwarzer, Moritz Schubotz, Norman Meuschke, Corinna Breitinger, Volker Markl, Bela Gipp (June 2016). Evaluating Link-based Recommendations for Wikipedia (PDF). Joint Conference on Digital Libraries. ISBN 9781450342292. doi:10.1145/2910896.2910908.
- ↑ a b Gipp, Bela; Beel, Joeran (2009). "Citation Proximity Analysis (CPA) – A new approach for identifying related work based on Co-Citation Analysis". In Birger Larsen and Jacqueline Leta, eds. Proceedings of the 12th International Conference on Scientometrics and Informetrics 2. pp. 571–575.
- ↑ Apache Software Foundation. "MoreLikeThis (Lucene 3.0.3 API)". Retrieved Feb 20, 2017.
- ↑ Warncke-Wang, Morten; Ranjan, Vivek; Terveen, Loren; Hecht, Brent (May 2015). "Misalignment Between Supply and Demand of Quality Content in Peer Production Communities". Proceedings of the Ninth International AAAI Conference on Web and Social Media.
- ↑ Wulczyn, Ellery; West, Robert; Zia, Leila; Leskovec, Jure (April 2016). "Growing Wikipedia Across Languages via Recommendation" (PDF). Proceedings of the 25th International Conference on World Wide Web. pp. 975–985. ISBN 9781450341431. doi:10.1145/2872427.2883077.
- ↑ Royal, Cindy; Kapila, Deepina (April 2008). "What's on Wikipedia, and What's Not . . . ?". Social Science Computer Review 27 (1). doi:10.1177/0894439308321890.
- ↑ Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold (October 2016). "Quality and Importance of Wikipedia Articles in Different Languages". In Giedre Dregvaite, Robertas Damasevicius (eds.). Information and Software Technologies. Communications in Computer and Information Science. Springer International Publishing. pp. 613–624. ISBN 9783319462547. doi:10.1007/978-3-319-46254-7_50.