Toolserver/Reports
This page is for discussion and development of a collection of scripts for creating reports from database dumps. The collection is distributed from http://tools.wikimedia.de/~beland/
Infrastructure status
- We would like to get a CVS or Subversion repository set up on the toolserver, to facilitate contributions from a number of people. The toolserver admins would like to synchronize ssh and Subversion passwords, but ssh logins currently don't use passwords, only ssh keys. There have recently been problems with the toolserver admins being non-responsive to requests, so I am giving up on waiting for this to happen. -- Beland 17:35, 2 April 2006 (UTC)
- I would like there to be a version of the scripts deployed on the Toolserver so that they can be run as a cron job, and Toolserver users can expand and improve them as desired. -- Beland 17:46, 2 April 2006 (UTC)
- If you would like to make a contribution, you can post an update to the web, or e-mail me using the "e-mail this user" link from one of my user pages. I will try to post updates to the toolserver page in a timely fashion. -- Beland 17:46, 2 April 2006 (UTC)
Script status
- I no longer have enough hard drive space on my laptop to run these scripts, and my desktop machine does not have enough RAM. (I would recommend at least 10 GB of storage and 700 MB of RAM.) Reducing both of those requirements might be useful in general, as would faster execution. -- Beland 23:00, 2 April 2006 (UTC)
- Some scripts are currently broken. There are some notes in auto-run.sh, but what I've been doing is running the scripts one at a time, in the sequence they appear in that file, and checking the output to make sure that it is valid and non-empty. -- Beland 23:00, 2 April 2006 (UTC)
- It would be nice if the scripts had a better dependency mechanism. Right now, there is a single central script, auto-run.sh, which runs the scripts in a fixed order that builds each dependency before it is needed. It would be nice if one could simply run the script that outputs the desired report, and have the scripts that build its input files run automatically. On the other hand, if the toolserver has enough RAM and hard drive space, and the scripts are tidied up, they could be run as a cron job, and there would be less need to worry about dependencies (unless contributors want to generate certain reports for their own purposes).
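The dependency mechanism wished for above could be sketched roughly as follows. This is only an illustration in Python, not how the existing scripts work: the target names, the `DEPENDS_ON` table, and the `build` callback are all hypothetical placeholders. A real version would have to invoke the actual scripts and check that each output file is valid and non-empty.

```python
from typing import Callable, Dict, List, Set

# Hypothetical dependency table: each target lists the targets it needs first.
# None of these names come from the real script collection.
DEPENDS_ON: Dict[str, List[str]] = {
    "page-list": [],
    "link-table": ["page-list"],
    "most-wanted-report": ["link-table"],
    "most-referenced-report": ["link-table", "page-list"],
}


def build_order(target: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the targets needed for `target`, prerequisites first."""
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(name: str) -> None:
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"dependency cycle at {name!r}")
        if name not in depends_on:
            raise KeyError(f"unknown target {name!r}")
        in_progress.add(name)
        for prerequisite in depends_on[name]:
            visit(prerequisite)
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    visit(target)
    return order


def run(target: str, build: Callable[[str], None],
        depends_on: Dict[str, List[str]] = DEPENDS_ON) -> List[str]:
    """Build `target` and everything it needs, each exactly once."""
    order = build_order(target, depends_on)
    for name in order:
        build(name)
    return order
```

With this table, `run("most-wanted-report", build)` would call `build` on "page-list", then "link-table", then the report itself. The `build` callback is where a real implementation might start the corresponding script (for example with `subprocess.run`) and refuse to continue if the output is empty.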
Volunteers
- en:User:Beland is the originator of the project
- en:User:A beautiful mind recently expressed interest in helping out
Update requests
- I have gotten several requests from Wikipedia:Most wanted articles to re-generate that page, so that should probably be given priority. -- Beland 23:00, 2 April 2006 (UTC)
- en:Wikipedia:Most-referenced articles has had several requests for an update, and a request that distinct sublists be created for particular topics: United States, Countries of the world, Cities, Political parties, Decades, Dates of the calendar year, etc. This is possible by looking at the category membership of articles, and percolating those memberships up the category tree. -- Beland 00:03, 16 August 2006 (UTC)
- The last update was in 2011; we could use a new report. --Melody Lavender 08:18, 10 November 2014 (UTC)
- en:Wikipedia:Maintenance#Reports has the refresh date for all important reports, and needs to be updated on a rotating basis.
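Percolating category memberships up the category tree, as proposed for the topical sublists of Most-referenced articles, might look something like the sketch below. The data is made up for illustration, and `PARENTS`, `ARTICLE_CATEGORIES`, `ancestor_categories`, and `articles_under` are hypothetical names. Real input would have to be extracted from the category links in a database dump.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

# Made-up sample data: category -> its parent categories.
PARENTS: Dict[str, List[str]] = {
    "Cities in Michigan": ["Cities in the United States", "Michigan"],
    "Cities in the United States": ["Cities", "United States"],
    "Michigan": ["United States"],
    "Cities": [],
    "United States": [],
}

# Made-up sample data: article -> categories it is placed in directly.
ARTICLE_CATEGORIES: Dict[str, List[str]] = {
    "Detroit": ["Cities in Michigan"],
    "Lansing": ["Cities in Michigan"],
    "1990s": ["Decades"],
}


def ancestor_categories(direct: Iterable[str],
                        parents: Dict[str, List[str]]) -> Set[str]:
    """Return the direct categories plus every category above them."""
    seen: Set[str] = set()
    queue = deque(direct)
    while queue:
        category = queue.popleft()
        if category in seen:
            continue  # also guards against loops in the category graph
        seen.add(category)
        queue.extend(parents.get(category, []))
    return seen


def articles_under(topic: str,
                   article_categories: Dict[str, List[str]],
                   parents: Dict[str, List[str]]) -> List[str]:
    """List articles whose percolated memberships include `topic`."""
    return sorted(
        article
        for article, direct in article_categories.items()
        if topic in ancestor_categories(direct, parents)
    )
```

Here "Detroit" is only directly in "Cities in Michigan", but after percolation it also counts as a member of "Cities" and "United States", so it would land in both sublists. The `seen` set keeps the walk from running forever if the category graph contains a loop.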
Known bugs
- auto-categorize4.pl: A link from Columbus, Michigan to "Columbus_Township, St._Clair_County, Michigan" was interpreted as being a link to a county with the name "Columbus Township, St. Clair", so Columbus, Michigan was added to "Category:Columbus Township, St. Clair County". We don't create categories for townships, only counties. -- Beland
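One way to avoid this kind of misreading is to accept a link as a county link only when the county name contains no comma. The sketch below illustrates that idea in Python. It is not taken from auto-categorize4.pl, whose actual matching logic is not shown here, and `COUNTY_LINK` and `parse_county_link` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Accept "<Name> County, <State>" only when neither part contains a comma,
# so a leading "Columbus Township, " cannot be absorbed into the county name.
COUNTY_LINK = re.compile(r"^(?P<county>[^,]+ County), (?P<state>[^,]+)$")


def parse_county_link(link_target: str) -> Optional[Tuple[str, str]]:
    """Return (county, state) for a plain county link, otherwise None."""
    title = link_target.replace("_", " ")
    match = COUNTY_LINK.match(title)
    if match is None:
        return None
    return match.group("county"), match.group("state")
```

Under this rule "St._Clair_County, Michigan" is recognized as a county, while "Columbus_Township, St._Clair_County, Michigan" is rejected because it has three comma-separated parts.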
Replacement tools
- en:User:Bluemoose/DataBaseSearchTool can search a database dump for regular expression matches. This is a possible replacement for the "bad links" report.
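The general idea of searching a dump for a regular expression can be sketched in a few lines of Python. This does not describe how DataBaseSearchTool itself works. `find_matches` is a hypothetical helper, and the sample pattern (links with an empty target) is only a guess at what a "bad links" check might look for.

```python
import re
from typing import Iterable, Iterator, Tuple


def find_matches(lines: Iterable[str],
                 pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (line number, line) for every line the pattern matches."""
    regex = re.compile(pattern)
    for line_number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield line_number, line.rstrip("\n")


# Hypothetical "bad link" pattern: "[[" followed directly by "|" or "]]".
EMPTY_LINK_TARGET = r"\[\[\s*(\||\]\])"
```

Because the function takes any iterable of lines and yields results one at a time, an open file object can be passed in directly, so the dump never has to fit in RAM.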
See also
- en:User:Pearle is an open-source bot which these scripts occasionally make use of.
- en:Wikipedia:Bot requests, if you are interested in Perl programming
- Summer of Code 2006 - More project ideas