Community Tech/Migrate dead external links to archives

Tracked in Phabricator:
Task T120433 resolved

The project to migrate dead links to the Wayback Machine aims to replace dead external links with links to archived copies in the Internet Archive's Wayback Machine. This project page describes the process and provides a place to discuss the tool.

For the Community Tech team, this project is supporting the work of User:Cyberpower678, the creator of Cyberbot II. Cyberbot is currently running on article pages on English Wikipedia, looking at links that have been tagged as dead links. Cyberbot queries the Internet Archive, replaces the dead external link with a link to the archived version, and then posts a message on the article talk page asking for contributor review.

Community Tech's support for the project includes creating a centralized logging interface and building a module for advanced dead link detection, so that Cyberbot can go through the whole wiki and identify which links need to be fixed.

Rationale

We cite a lot of sources available on the web. Over time, many of those sources disappear from the webpages where they were hosted, either moved or removed, resulting in dead links. This is a big problem and very difficult to keep track of manually. To make sure we don't lose our sources, we need an automated process to move dead links to the Internet Archive's Wayback Machine, or flag them for contributor review. This was the most popular suggestion in the 2015 Community Wishlist Survey, with 111 supporting votes.

Technical discussion and background

Status

This is a slightly less technical overview. For more details, please see our meeting notes and the Phabricator task.

October 3, 2016

Work is ongoing to deploy IABot to Swedish Wikipedia (T136142). The code is ready for testing, but it still needs local configuration and localization.

July 20, 2016

The dead link detection has been approved for normal use and IABot's switches have been flipped. However, some operational bugs cropped up that had been masked while the switches were off. Cyberpower is working to fix them quickly. Upgrades to the bot's intelligence have also been made. As Ryan mentioned, it's almost sentient now.

June 21, 2016

Cyberpower says in T136728 that the false positive rate is down to 0.2%, way better than the goal. This should be enough to get approval for the bot...

June 8, 2016

Cyberbot is working through the trials required to get approval for phase 2 of the project -- running through all the external links on English Wikipedia, not just the ones marked with the dead link template. Community Tech added the code to skip paywalled sites, and we're helping to track down some of the false positives -- live links that Cyberbot is marking as dead. Right now, the false positive rate is around 8%. The approval process hasn't established a set limit on false positives, but we're hoping to get it down to 1%, if possible.

May 9, 2016

We're still testing and tweaking Cyberbot's dead link detection; there have been some false positives that we're helping to track down and fix. We will also work on having the bot skip paywalled sites, not counting them as dead.

April 24, 2016

We have been working with Cyberpower678 on dead link detection -- testing links in articles to make sure that they're still alive. The bot now checks for HTTP 404 error messages, and if it finds that the link is dead, then it checks again in a few days. When a link returns 404s after three checks, Cyberbot marks the link as dead and replaces it with an archive link.
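The check-and-recheck rule described above can be sketched in a few lines of Python. This is only an illustration, not Cyberbot's actual code: the names `LinkState`, `is_due` and `record_check` are hypothetical, the three-day recheck interval is an assumed value for "a few days", and resetting the counter on any non-404 response is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RECHECK_INTERVAL = timedelta(days=3)  # assumption: stands in for "a few days"
FAILURES_BEFORE_DEAD = 3              # from the text: dead after three failed checks


@dataclass
class LinkState:
    """Tracks consecutive failed checks for one URL (hypothetical structure)."""
    url: str
    consecutive_failures: int = 0
    last_checked: Optional[datetime] = None
    dead: bool = False


def is_due(state: LinkState, now: datetime) -> bool:
    """A link is due if it is not yet marked dead and was never checked,
    or if the recheck interval has passed since the last check."""
    if state.dead:
        return False
    return state.last_checked is None or now - state.last_checked >= RECHECK_INTERVAL


def record_check(state: LinkState, status_code: int, now: datetime) -> LinkState:
    """Update the state with the HTTP status code returned by one check."""
    state.last_checked = now
    if status_code == 404:
        state.consecutive_failures += 1
    else:
        # Assumption: any non-404 response resets the failure count.
        state.consecutive_failures = 0
    state.dead = state.consecutive_failures >= FAILURES_BEFORE_DEAD
    return state
```

Requiring several failures spread over days, rather than a single 404, keeps a temporary outage from being treated as a dead link.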

April 6, 2016

Cyberbot II has now worked through all of the pages marked with the Dead link template. There are still more than 60,000 pages in the category of articles with dead links; these are links for which the Internet Archive doesn't have a suitable archive. We're starting a discussion on Template talk:Dead link, suggesting that we add an extra parameter to the template that can mark these as unfixable, so that we don't have to keep rechecking the same unsalvageable links.

March 28, 2016

The centralized logging interface is finished, and Cyberbot now uses the logging API. Documentation is on Fixing dead links/Deadlink logging app.

March 27, 2016

Cyberbot II is currently being discussed on Wikipedia:Bots/Requests for approval/Cyberbot II 5a. We're working with Cyberpower678 on detecting dead links that aren't marked with the dead link template.

March 14, 2016

User:Green Cardamom has created a new bot called WaybackMedic. It fixes a bug in the Internet Archive API that returned false positives for Cyberbot to use as replacements for dead links. According to Green Cardamom, there are tens of thousands of archive links that are pointing to the wrong archive. WaybackMedic is cleaning up those links (example) -- and also cleaning up some formatting bugs from early in Cyberbot's career (example).

WaybackMedic is currently in bot trials.

March 3, 2016

 
Dead links logging interface

Work on the Centralized logging interface on Tool Labs for all dead links bots is going well. There's a first draft version up on Tool Labs: Deadlinks, and documentation here: Fixing dead links/Deadlink logging app. We'll be asking for stakeholder input next week.

Coming up soon: we'll work on an output API from the logging interface, showing the last time that a particular page was checked, and the last article that a particular bot processed. This will help a bot that crashed or paused to pick up where it left off, and it'll help multiple bots running on the same wiki to avoid checking the same pages that another bot recently processed. (T128685)
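To make the two uses of the output API concrete, here is a rough Python sketch of how a bot might resume after a crash and skip pages that another bot checked recently. It uses an in-memory stand-in for the log; `DeadLinkLog`, its methods and `pages_to_process` are hypothetical names for illustration and do not describe the real logging API.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


class DeadLinkLog:
    """In-memory stand-in for the centralized log (hypothetical interface)."""

    def __init__(self) -> None:
        # (wiki, page) -> time of the most recent check by any bot
        self._last_check: Dict[Tuple[str, str], datetime] = {}
        # (wiki, bot) -> last page that bot processed
        self._last_page: Dict[Tuple[str, str], str] = {}

    def record(self, wiki: str, page: str, bot: str, when: datetime) -> None:
        self._last_check[(wiki, page)] = when
        self._last_page[(wiki, bot)] = page

    def last_checked(self, wiki: str, page: str) -> Optional[datetime]:
        return self._last_check.get((wiki, page))

    def last_page_for(self, wiki: str, bot: str) -> Optional[str]:
        return self._last_page.get((wiki, bot))


def pages_to_process(log: DeadLinkLog, wiki: str, bot: str, pages: List[str],
                     now: datetime, min_age: timedelta) -> List[str]:
    """Resume after the bot's last logged page, then skip any page that
    was checked (by any bot) more recently than min_age ago."""
    last = log.last_page_for(wiki, bot)
    start = pages.index(last) + 1 if last in pages else 0
    todo = []
    for page in pages[start:]:
        checked = log.last_checked(wiki, page)
        if checked is None or now - checked >= min_age:
            todo.append(page)
    return todo
```

The sketch assumes each bot walks pages in a stable order, so that "the last page processed" is enough to find the resume point.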

February 7, 2016

We've defined goals for what we're expecting to do this quarter on this project, and a goal for later on this year.

Goals for this quarter (until end of March):

  • Centralized logging interface on Tool Labs for all dead links bots -- This can be used to track what pages have been checked. It'll be useful for individual bots so they don't go over the same page multiple times, and especially useful if there are multiple bots running on the same wiki. The log will include the name of the wiki, the archive source, the number of links fixed, and the number of notifications posted. (This should accommodate bots from DE, FR and others.) Several tickets: Create a centralized logging API (T126363), Create a web interface (T126364), Documentation (T126365).
  • Investigation of advanced dead link detection -- Investigate and plan for adding advanced dead link detection as a module for Cyberbot, detecting other kinds of dead links besides 4XX and 5XX error codes. This may involve adapting Internet Archive's code. See T125181 and T127749.
  • Documentation and code review for Cyberbot -- Documenting Cyberbot's code, in preparation for helping other developers create bots on other wikis. Documentation has started at InternetArchiveBot, code review in T122227.
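As an illustration of the log contents listed in the first goal, one log record could be modeled roughly as follows. The first four fields come from the description above; `page_title` and `bot` are assumptions added so that a record can be tied to a page and an agent, and the class name and JSON layout are hypothetical rather than the real schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LogEntry:
    """One record in the dead links log (hypothetical schema)."""
    wiki: str                  # name of the wiki, e.g. "enwiki"
    archive_source: str        # archive service used for replacements
    links_fixed: int           # number of links fixed on the page
    notifications_posted: int  # number of notifications posted
    page_title: str            # assumption: not listed in the goal text
    bot: str                   # assumption: not listed in the goal text

    def __post_init__(self) -> None:
        # Counts cannot be negative.
        if self.links_fixed < 0 or self.notifications_posted < 0:
            raise ValueError("counts must be non-negative")

    def to_json(self) -> str:
        """Serialize the entry as a JSON object with sorted keys."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the wiki and archive source in each record is what would let bots from different wikis, using different archive services, share one log.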

For later on this year:

  • Our big goal for this project is to help bot-writers on many different language Wikipedias to create their own dead link archive bots. Each community has its own templates, policies, approach and preferred archive service, and it's not scalable for our team / Cyberpower / Internet Archive to create bots for every language Wikipedia. We want to provide the tools that bot writers can use -- modular code that includes APIs and advanced dead link detection, documentation of the existing code, and a centralized logging interface that all bots can use.

February 2, 2016

One of the tasks we'll be working on with Cyberpower678 is a centralized logging interface for tracking and reporting dead link fixes. The log will be kept on Tool Labs. This will help keep track of which pages have had their dead links fixed, when, and by what agent/bot. This will facilitate 3 things:

  • If a bot dies, it can pick up where it left off
  • It will help prevent bots from doing redundant work
  • It will provide a centralized (and hopefully comprehensive) reporting interface for the Internet Archive and other archive providers

The tracking ticket for the logging interface is T125610.

January 20, 2016

  • We're working with Cyberpower678 and the Internet Archive to define more exactly what the goals for the project are.
  • We're looking at different ways of approaching the problem, because the best solution may involve a few different tools or processes working together. There are three bots (English, Spanish, French) we need to compare, and we need the solution to work for different languages and different projects (not just Wikipedia).
  • We're investigating automated ways to test links to see if they're dead or not. Legoktm has a proposal and Niharika is working on an algorithm.
  • We need to limit the number of requests we send to the Internet Archive's API.
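One simple way to limit the request rate mentioned in the last point is to enforce a minimum interval between calls. The Python sketch below shows that idea only; `MinIntervalLimiter` is a hypothetical helper, the interval value is up to the caller, and nothing here reflects an actual limit set by the Internet Archive.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforces a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval_s
        return delay
```

A bot would call `wait()` immediately before each API request. Caching earlier answers for the same URL would reduce the number of requests further, but is outside this sketch.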

Timeline

It's too early to say anything yet, but when we have a good estimate, we'll put it here.

Initial Community Tech team assessment

Support: High. Dead reference links hurt our projects' reliability and verifiability, and connecting dead links with an archive supports the usefulness of our content. There were some dissents in the voting phase, pointing out that it's better when humans find the appropriate alternative links, rather than a bot that might not choose the right one.
Impact: High. Improving the quality of citations helps readers as well as contributors. There are some bots currently running on English, French and Spanish Wikipedias. We want to help build solutions that can be adapted to every language.
Feasibility: High. Cyberbot II is currently active on English Wikipedia, and Elvisor on Spanish Wikipedia. Cyberpower678's work on Cyberbot is being supported by The Wikipedia Library and the Internet Archive. There is obviously good work being done here, and we can figure out how to best support it, and help it to scale globally.
Risk: Low. Cyberbot II is running on English Wikipedia, with no major issues encountered. It may be challenging to integrate with other wikis’ citation templates.