WikiCredCon 2025/Schedule/Sawood Alam and Michael Nelson
Session Title:
Robust Links and Link Rot (Session)
Presenter(s):
Name & Pronouns | Username | Email | Affiliation |
---|---|---|---|
Sawood Alam (he/him) | User:Ibnesayeed | sawood@archive.org | Internet Archive |
Michael Nelson (he/him) | User:Phonedudemln | mln@cs.odu.edu | Old Dominion University |
Harsha V. Madhyastha (he/him) | User:HarshaMadhyastha | madhyast usc edu | University of Southern California |
Abstract:
Unlike traditional scholarly publications, web pages and other online resources often suffer from content drift and link rot over time. Consequently, any reference citing such a resource loses credibility when resolving it leads to an error page or to content that has become unrelated to the context. The underlying problem is the lack of a way to express the temporal dimension when citing a web resource. While some references do express the intended date in the text, the HTML anchor element has no standard way to encode this information in a machine-readable manner.
RobustLinks is a proposed standard to bring this capability to anchor elements in the form of HTML5 data-* attributes. We introduce the "data-originalurl", "data-versiondate", and "data-versionurl" attributes to express, respectively, the original URL of the referenced resource, the date (or datetime) of the intended state or version of the resource, and optionally the URLs of one or more known-good archived versions that preserve the resource in the intended state. Currently, user agents do not interpret these attributes in any special way, but JavaScript can be used to leverage them in the interim.
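As an illustration, the following minimal Python sketch renders an anchor element carrying the three attributes and then reads them back with the standard-library HTML parser. The helper names (`robust_link`, `RobustLinkParser`), the example.com and archive.example URLs, and the choice to separate multiple archived URLs with spaces are assumptions made for this example, not details taken from the proposal itself.

```python
from html import escape
from html.parser import HTMLParser


def robust_link(href, text, original_url, version_date, version_urls=()):
    """Render an <a> element decorated with the three data-* attributes."""
    attrs = [
        ("href", href),
        ("data-originalurl", original_url),
        ("data-versiondate", version_date),
    ]
    if version_urls:
        # Assumption for this sketch: several archived URLs are space-separated.
        attrs.append(("data-versionurl", " ".join(version_urls)))
    rendered = " ".join(f'{name}="{escape(value)}"' for name, value in attrs)
    return f"<a {rendered}>{escape(text)}</a>"


class RobustLinkParser(HTMLParser):
    """Collect anchors that declare an original URL and a version date."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        found = dict(attrs)
        if "data-originalurl" in found and "data-versiondate" in found:
            self.links.append({
                "href": found.get("href"),
                "original_url": found["data-originalurl"],
                "version_date": found["data-versiondate"],
                # The archived-version attribute is optional.
                "version_urls": (found.get("data-versionurl") or "").split(),
            })


# Hypothetical URLs, used only to exercise the helpers above.
markup = robust_link(
    href="https://example.com/report",
    text="the report",
    original_url="https://example.com/report",
    version_date="2025-01-15",
    version_urls=["https://archive.example/20250115/https://example.com/report"],
)
parser = RobustLinkParser()
parser.feed(markup)
print(markup)
print(parser.links)
```

A consumer of such markup could, for instance, fall back to one of the `version_urls` when the original URL no longer resolves; that policy is left out of the sketch.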
For many broken web links, no archived copies exist. Even if a copy exists, it often approximates the original page poorly. For example, any functionality that requires the client browser to communicate with the page's backend servers will not work, and even the latest copy will be missing updates made to the page's content after that copy was captured.
We observe that broken links are often merely the result of website reorganizations; the linked page still exists on the same site, albeit at a different URL. Therefore, given a broken link, our system FABLE attempts to find the linked page's new URL by learning and exploiting the patterns by which the old URLs of other pages on the same site were transformed into their new URLs. We have been working with the Internet Archive and the Wikipedia user community to use FABLE to patch broken external links on Wikipedia that have been marked as "permanently dead."
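The general idea of learning a URL transformation from known old/new pairs can be sketched with a deliberately simple toy in Python. This is not FABLE's actual algorithm: it only infers a path-prefix rewrite, and the function names (`learn_prefix_rule`, `apply_rule`) and example.org URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def _segments(url):
    """Split a URL path into its non-empty segments."""
    return [s for s in urlsplit(url).path.split("/") if s]


def learn_prefix_rule(pairs):
    """Return the most common (old_prefix, new_prefix) rewrite, or None."""
    rules = Counter()
    for old_url, new_url in pairs:
        old, new = _segments(old_url), _segments(new_url)
        # Count how many trailing path segments the two URLs share.
        shared = 0
        while shared < min(len(old), len(new)) and old[-1 - shared] == new[-1 - shared]:
            shared += 1
        if shared == 0:
            continue  # No common suffix, so no prefix rule can be inferred.
        # Whatever precedes the shared suffix is the part that changed.
        rule = (tuple(old[:len(old) - shared]), tuple(new[:len(new) - shared]))
        rules[rule] += 1
    return rules.most_common(1)[0][0] if rules else None


def apply_rule(rule, broken_url):
    """Propose a candidate new URL, or None if the rule does not apply."""
    old_prefix, new_prefix = rule
    parts = urlsplit(broken_url)
    segs = _segments(broken_url)
    if tuple(segs[:len(old_prefix)]) != old_prefix:
        return None
    new_path = "/" + "/".join(new_prefix + tuple(segs[len(old_prefix):]))
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, ""))


# Hypothetical pairs of old and new URLs observed on the same site.
known_pairs = [
    ("https://example.org/blog/posts/a.html", "https://example.org/articles/a.html"),
    ("https://example.org/blog/posts/b.html", "https://example.org/articles/b.html"),
]
rule = learn_prefix_rule(known_pairs)
candidate = apply_rule(rule, "https://example.org/blog/posts/c.html")
print(rule)
print(candidate)
```

The output of `apply_rule` is only a candidate: any such guess would still need to be checked, for example by fetching it and comparing the content with what the citation expects.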