Research:Analyzing sources on Wikipedia
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
Sources, particularly reliable sources, are key to Wikipedia. They are the primary mechanism for ensuring verifiability and therefore for maintaining knowledge integrity and removing misinformation.[1][2] They also present a major barrier to expanding coverage of marginalized communities,[3] and many knowledge gaps on Wikipedia arise in part from a lack of reliable sources.[4] The specific sources that underpin an article can also determine whose point of view is represented[5][6] -- a particularly important question when considering the role that Wikipedia plays with respect to digital colonialism.[7][8]
Despite the important role of sources on Wikipedia and the many community discussions and concerns about them, sources are generally understudied. A large factor in this is likely their inaccessibility to large-scale data analysis. This project will work to devise methods to overcome some of these challenges:
- There is no standard format for references: common approaches to adding references include bare ref tags, citation templates, and shortened footnote templates. While extracting ref tags from Wikipedia is relatively straightforward, the content within them can still be unstructured and difficult to parse. Citation and shortened footnote templates bring useful structure, but there are many types of them, and both template names and parameter names vary across languages. Any quantitative research on Wikipedia references therefore generally comes with high start-up costs to extract the desired information. The mwrefs Python library helps with some of this -- it extracts ref tags, any associated citation template, external links, and identifiers (DOI, ISBN, PubMed, arXiv). See task T374554 for some potential steps to better align data from citation templates across languages.
- The citation only tells you so much: while you can extract certain information directly (does this reference have a DOI? is there a URL suggesting that the source is digitized? etc.), many interesting aspects of a reference (source country, language, open-access or paywall, etc.) can only be determined by consulting external databases or scraping content from external websites.
- References are not tracked in any dedicated tables: analyzing references requires going directly to the wikitext (or parsed HTML) of an article and doing the above extraction. There are no logging tables or links tables that provide an easy entry point for analysis, as there are for studying images, links, categories, etc. on Wikipedia. The externallinks table comes closest, but external links can also appear outside of references.
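As a rough illustration of the start-up cost described above, the following is a minimal sketch of pulling ref tags, template names, URLs, and DOIs out of raw wikitext using only Python's standard library. It does not use the mwrefs API; the regular expressions, the `extract_references` function, and the sample wikitext are all assumptions made for illustration. Real wikitext includes nested templates, shortened footnotes, and localized template names that this sketch does not handle.

```python
import re

# Matches <ref ...>content</ref>. Self-closing reuse tags such as
# <ref name="x" /> are deliberately skipped because they carry no content.
# Limitation: attribute values containing "/" are not matched.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# The template name is the text between "{{" and the first "|" or "}}".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")
URL_RE = re.compile(r"https?://[^\s|\]}<]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|}\]<]+")


def extract_references(wikitext):
    """Return one dict per <ref>...</ref> pair found in the wikitext."""
    refs = []
    for match in REF_RE.finditer(wikitext):
        content = match.group(1).strip()
        refs.append({
            "content": content,
            "templates": [name.lower() for name in TEMPLATE_RE.findall(content)],
            "urls": URL_RE.findall(content),
            "dois": DOI_RE.findall(content),
        })
    return refs


# Hypothetical wikitext: one citation template, one bare link, one reuse tag.
sample = (
    'Fact one.<ref>{{cite journal |title=A |doi=10.1145/2491055.2491064}}</ref> '
    'Fact two.<ref name="site">[https://example.org/page Example page]</ref> '
    'Fact one again.<ref name="site" />'
)
refs = extract_references(sample)
for ref in refs:
    print(ref["templates"], ref["urls"], ref["dois"])
```

Even in this small sample, the two references need different handling: the first exposes structured parameters through a template, while the second is just a bracketed link with free text.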
Characterizing sources
This is a non-exhaustive approach to characterizing sources in ways that are useful to patrollers or readers looking to assess the verifiability/reliability of content. Editors have already devised ways to assess and label many of these characteristics via templates, user-scripts, and tools, but the lack of structure around citations makes it difficult to keep these assessments standardized and up-to-date. I have identified at least three main aspects to this, as described below.
Source-level metadata
There are many features specific to a source that are useful when evaluating it, even ignoring the context in which it appears:
- Medium: is the source a newspaper or book or website etc.?
- Availability/accessibility: is the source available online or only as a physical artifact? If it is available online, is the URL still live, archived, and/or no longer available? What is the depth of the URL -- e.g., a link to a specific article or a general domain that is likely to change? Is the source behind a paywall? Is it transcribed into text and therefore easily searchable or just a scan of content? More accessible sources are not necessarily better sources, but they often make verifiability easier.
- Recency: when was the source created? Newer sources are not necessarily better but older sources for content areas that are fast evolving can miss important details.
- Level of source: is the source primary, secondary, or tertiary? None of these is explicitly disallowed, but secondary sources are preferred.
- Reliability: is the source considered reliable on that wiki? Examples of some of these discussions: en:Perennial sources.
- Geoprovenance: what region or culture is associated with a particular source? This is generally operationalized at the country level. While it cannot tell you much about an individual source on its own, it is much more useful for understanding the broader set of sources being considered (as noted below). A prototype API for English Wikipedia analyzes an article's sources and their geographic distribution; it is largely a replication and extension of Sen et al.[9] and can be found here: https://geo-provenance.wmcloud.org/api/v1/geo-provenance?lang=en&title=Climate_change
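Some of the availability features above can be read directly off a reference's URL. The sketch below derives the host, the depth of the URL, and whether the link points at a web archive. The `url_features` function and the `ARCHIVE_HOSTS` set are hypothetical names, and the list of archive hosts is illustrative rather than complete. Checking whether a URL is still live or paywalled would need network requests or external data and is out of scope here.

```python
from urllib.parse import urlparse

# Illustrative, incomplete set of web-archive hosts (an assumption).
ARCHIVE_HOSTS = {"web.archive.org", "archive.today", "archive.ph"}


def url_features(url):
    """Derive simple availability-related features from a URL string."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Depth counts non-empty path segments: 0 means a bare domain,
    # larger values suggest a link to a specific page.
    segments = [segment for segment in parsed.path.split("/") if segment]
    return {
        "host": host,
        "depth": len(segments),
        "is_archive": host in ARCHIVE_HOSTS,
        "has_query": bool(parsed.query),
    }


print(url_features("https://www.example.org/news/2024/story.html"))
print(url_features("https://example.org/"))
print(url_features("https://web.archive.org/web/2020/https://example.org/a"))
```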
Relationship between source and article content
Additional features about the appropriateness of a source only become clear when viewing it in the context of the content that it is supposed to support:
- Correctness: does the source in fact support the statement that it is supposed to verify? Tools such as verify or Side[10] can help with this.
- Language: is the source in the local language, or would it require translation to be accessible to the reader?
Relationship between source and article history
Understanding how a source came to be present in the current state of an article can help in assessing whether it warrants further evaluation:
- Source provenance: who added the source, akin to what Who Wrote That? provides for article content? Understanding who added a given source and when can help moderators understand its context.
- Stability/acceptance: how controversial is the source? Highly controversial sources might also be associated with previous discussions on talk pages or elsewhere about the appropriateness of the source.
The Reference Risk model is a good example of how these types of signals can be used.
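The provenance and stability signals above can be approximated by walking an article's revision history. The following is a minimal sketch, assuming the revisions are already available oldest-first as dictionaries with hypothetical `rev_id`, `user`, and `wikitext` fields. It uses plain substring matching on a URL, which is naive, and it treats repeated removal and re-addition as a crude proxy for contestation. Both are my assumptions and are not how the Reference Risk model is defined.

```python
def source_history(revisions, url):
    """Summarize when a URL entered and left an article.

    `revisions` must be ordered oldest first.
    """
    first_added = None
    additions = 0
    removals = 0
    present = False
    for rev in revisions:
        now_present = url in rev["wikitext"]
        if now_present and not present:
            additions += 1
            if first_added is None:
                first_added = {"rev_id": rev["rev_id"], "user": rev["user"]}
        elif present and not now_present:
            removals += 1
        present = now_present
    return {
        "first_added": first_added,
        "additions": additions,
        "removals": removals,
        "present_now": present,
    }


# Hypothetical four-revision history: added, removed, then re-added.
history = [
    {"rev_id": 1, "user": "Alice", "wikitext": "Intro."},
    {"rev_id": 2, "user": "Alice", "wikitext": "Intro.<ref>https://example.org/a</ref>"},
    {"rev_id": 3, "user": "Bob", "wikitext": "Intro."},
    {"rev_id": 4, "user": "Carol", "wikitext": "Intro.<ref>https://example.org/a</ref>"},
]
summary = source_history(history, "https://example.org/a")
print(summary)
```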
Relationship between source and other sources
Viewing the source in relation to all other sources in an article or language edition or project can reveal even more information about the overall state of Wikipedia and gaps in what knowledge is represented:
- Diversity: how similar is the source to others being used to support content? This can have strong implications for presenting a neutral point of view. Diversity has many aspects including the geoprovenance, medium, recency, level, publisher, and aspects related to the authors.
- Frequency: how often and where else does this source appear on the Wikimedia projects? A single instance of a source does not make it bad -- many are hyperspecific to a topic -- but understanding usage can help both in assessing the reliability of a source and, conversely, in reassessing its usage if its reliability is thrown into question. Tools like citation finder or Global Search can help with this.
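One simple way to quantify frequency and one narrow aspect of diversity is to count how often each host appears among an article's reference URLs and compute the Shannon entropy of that distribution. This is a sketch under my own assumptions: the function names are hypothetical, the host is only a rough stand-in for the publisher, and entropy is one of many possible diversity measures, not one proposed by this project.

```python
from collections import Counter
from math import log2
from urllib.parse import urlparse


def host_counts(urls):
    """Count how many reference URLs point at each host."""
    hosts = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            hosts.append(host)
    return Counter(hosts)


def shannon_entropy(counts):
    """Entropy in bits: 0 when every URL shares one host, higher when spread out."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in counts.values() if n)


# Hypothetical reference URLs for a single article.
urls = [
    "https://www.a.example/story-1",
    "https://a.example/story-2",
    "https://b.example/report",
    "https://c.example/paper",
]
counts = host_counts(urls)
print(counts.most_common(), shannon_entropy(counts))
```

The same counts, aggregated across articles or language editions, give the frequency view: which hosts are cited most often and where.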
Existing resources for assessing reliability
The reliability of a source is tracked in various ways around the Wikimedia projects. From a technical standpoint, sources on Wikipedia are essentially presumed allowable unless challenged as not meeting reliable source guidelines. In other words, there is no extensive set of sources deemed reliable that you must (or can) pull from. There are a few lists of sources deemed generally unreliable or spam that you can't use, but many sources fall into a grey area: appropriate for certain information but not for others. For example, even though there are approximately 45,000 links on English Wikipedia to tweets as of January 2025, and even a template for citing tweets, this does not make X/Twitter a reliable source for most things. Here is a list of some of the resources available for judging whether a source is likely to be accepted as reliable on Wikipedia (see task T276857 for even more examples and context):
- Sources that match URLs found on the Spam blacklist (spam isn't the same as an unreliable source but obviously there are overlaps).
- Sources that have been deemed generally unreliable by editors:
  - The best-known example is the Perennial Sources List. The sources and judgments are stored in a semi-structured table, but there are parsers that extract the judgments, such as wiki_references.[11]
  - There are also other lists, such as the one used by the Unreliable/Predatory Source Detector user script.
- Finally, there are a few lists of sources that are considered reliable in specific domains to help guide editors -- see the CiteHighlighter user script for some examples.
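Matching reference URLs against a blocklist of this kind can be sketched as follows. This assumes, based on my understanding of the Spam blacklist format, that each line is a regular-expression fragment matched against the host portion of a URL and that `#` starts a comment. The entries below are hypothetical, the function names are my own, and the combined pattern only approximates how MediaWiki assembles these fragments.

```python
import re


def compile_blocklist(text):
    """Combine one-per-line regex fragments into a single compiled pattern.

    Limitation: everything after "#" is treated as a comment, so fragments
    that themselves contain "#" are not supported.
    """
    fragments = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            fragments.append(line)
    if not fragments:
        return None
    # Allow any subdomain prefix before the fragment.
    combined = r"https?://+[a-z0-9_\-.]*(?:" + "|".join(fragments) + ")"
    return re.compile(combined, re.IGNORECASE)


def is_blocked(url, pattern):
    """True if the URL matches any blocklist fragment."""
    return pattern is not None and pattern.search(url) is not None


# Hypothetical blocklist entries, not taken from the real list.
blocklist_text = r"""
# hypothetical entries
\bexample-spam\.com\b
cheap-pills\.example  # trailing comment
"""
pattern = compile_blocklist(blocklist_text)
print(is_blocked("https://www.example-spam.com/offer", pattern))
print(is_blocked("https://example.org/page", pattern))
```

As noted in the list above, a match only says the URL is treated as spam; it is not by itself a judgment about reliability.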
The stability of reliability
There are a few caveats to be aware of when making judgments related to source reliability:
- Source reliability is often topic-specific – e.g., editors have raised concerns about Fox News for politics/science topics but generally trust it for other topics.
- Source reliability is often not a binary label for a given URL domain – e.g., many newspapers, such as The Guardian, host their opinion/blog section under the same URL domain as their investigative/explanatory journalism despite applying much greater editorial oversight to the latter.
- Source reliability can be language-specific – e.g., sources that are deemed unreliable by one language edition are not necessarily viewed the same way by other language editions.
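These caveats suggest that a reliability lookup needs to be keyed by more than the domain. The sketch below shows one possible data structure in which each rule is scoped by language edition, host, optional path prefix, and optional topic, and the most specific matching rule wins. The `Rule` class, the `lookup` function, the example host, and the labels are all hypothetical and do not reflect real community judgments. Letting the path prefix take precedence over the topic is a design choice of this sketch, not an established convention.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Rule:
    lang: str                     # language edition the judgment applies to
    host: str
    label: str
    path_prefix: str = "/"        # narrows the rule to a section of the site
    topic: Optional[str] = None   # None means the rule applies to any topic


# Hypothetical rules for an invented news site.
RULES = [
    Rule("en", "news.example", "generally reliable"),
    Rule("en", "news.example", "no consensus", topic="politics"),
    Rule("en", "news.example", "opinion: attribute in text", path_prefix="/opinion/"),
]


def lookup(url, lang, topic=None, rules=RULES):
    """Return the label of the most specific matching rule, or "unknown"."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path or "/"
    best, best_score = None, (-1, -1)
    for rule in rules:
        if rule.lang != lang or rule.host != host:
            continue
        if not path.startswith(rule.path_prefix):
            continue
        if rule.topic is not None and rule.topic != topic:
            continue
        # Longer path prefixes win first, then topic-specific rules.
        score = (len(rule.path_prefix), int(rule.topic is not None))
        if score > best_score:
            best, best_score = rule, score
    return best.label if best else "unknown"


print(lookup("https://news.example/world/story", "en"))
print(lookup("https://news.example/world/story", "en", topic="politics"))
print(lookup("https://news.example/opinion/column", "en", topic="politics"))
print(lookup("https://news.example/world/story", "fr"))
```

Returning "unknown" for a language edition with no rules, rather than borrowing another edition's judgment, reflects the point above that reliability assessments do not automatically transfer across languages.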
See Also
- Shared Citations proposal
- Models for detecting sentences needing citations: Identification of Unsourced Statements
- Analysis of scholarly sources on Wikipedia: Scholarly article citations in Wikipedia
- Analysis of paywall issues with scholarly sources: Towards Modeling Citation Quality
References
- Cohen, Noam (2021-09-07). "One Woman’s Mission to Rewrite Nazi History on Wikipedia". Wired. Retrieved 2022-03-30.
- Grabowski, Jan; Klein, Shira (2023-02-09). "Wikipedia’s Intentional Distortion of the History of the Holocaust". The Journal of Holocaust Research 0 (0): 1–58. ISSN 2578-5648. doi:10.1080/25785648.2023.2168939.
- Berson, Amber; Sengul-Jones, Monika; Tamani, Melissa (June 2021). "Unreliable Guidelines: Reliable Sources and Marginalized Communities in French, English and Spanish Wikipedias" (PDF). Art + Feminism. Retrieved 2022-03-30.
- "Wikipedia is a mirror of the world’s gender biases". Wikimedia Foundation (in en-US). 2018-10-18. Retrieved 2022-03-30.
- Luyt, Brendan; Tan, Daniel (2010). "Improving Wikipedia's credibility: References and citations in a sample of history articles". Journal of the American Society for Information Science and Technology: n/a–n/a. ISSN 1532-2882. doi:10.1002/asi.21304.
- Ford, Heather; Sen, Shilad; Musicant, David R.; Miller, Nathaniel (2013-08-05). "Getting to the source: where does Wikipedia get its information from?". Proceedings of the 9th International Symposium on Open Collaboration. WikiSym '13 (New York, NY, USA: Association for Computing Machinery): 1–10. ISBN 978-1-4503-1852-5. doi:10.1145/2491055.2491064.
- Duncan, Alexandra (2020). "Towards an activist research: is Wikipedia the problem or the solution?" (PDF). Art Libraries Journal. ISSN 0307-4722. Retrieved 2022-03-30.
- "Decolonizing the Internet". Whose Knowledge (in en-US). Retrieved 2022-03-30.
- Sen, Shilad W.; Ford, Heather; Musicant, David R.; Graham, Mark; Keyes, Os; Hecht, Brent (2015-04-18). "Barriers to the Localness of Volunteered Geographic Information". Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. CHI '15 (New York, NY, USA: Association for Computing Machinery): 197–206. ISBN 978-1-4503-3145-6. doi:10.1145/2702123.2702170.
- Petroni, Fabio; Broscheit, Samuel; Piktus, Aleksandra; Lewis, Patrick; Izacard, Gautier; Hosseini, Lucas; Dwivedi-Yu, Jane; Lomeli, Maria; Schick, Timo (2023-10-19). "Improving Wikipedia verifiability with AI". Nature Machine Intelligence 5 (10): 1142–1148. ISSN 2522-5839. doi:10.1038/s42256-023-00726-1.
- Baigutanova, Aitolkyn; Saez-Trumper, Diego; Redi, Miriam; Cha, Meeyoung; Aragón, Pablo (2023-10-21). "A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia". Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. CIKM '23 (New York, NY, USA: Association for Computing Machinery): 3743–3747. ISBN 979-8-4007-0124-5. doi:10.1145/3583780.3615254.