Data dumps/2006 notes


Clusters

The wikis hosted in our Korean cluster will have a separate host, at http://download-yaseo.wikimedia.org/

Reporting

The backup runner script will generate some pretty HTML pages showing status as each file completes, so it should be easier to see what's done, what's in progress, and what failed.

I'm about to code up this part, shouldn't be too hard I hope. :)
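
As a rough idea of what such a status page could involve, here is a minimal sketch in Python. The function name `render_status`, the three status strings, and the HTML structure are all assumptions for illustration; the actual backup runner may report things quite differently.

```python
from html import escape

def render_status(dbname, date, items):
    """Build a small HTML status page.

    items is a list of (filename, status) pairs; the status strings used
    here ("done", "in-progress", "failed") are hypothetical.
    """
    rows = "\n".join(
        f'<li class="{escape(status)}">{escape(name)}: {escape(status)}</li>'
        for name, status in items
    )
    title = f"{escape(dbname)} dump {escape(date)}"
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<ul>\n{rows}\n</ul></body></html>\n"
    )

if __name__ == "__main__":
    # Example database name, date, and files are made up.
    print(render_status("testwiki", "20060101", [
        ("testwiki-20060101-all-titles-in-ns0.gz", "done"),
        ("testwiki-20060101-abstract.xml.gz", "in-progress"),
    ]))
```

Rewriting the page each time a file completes would keep it current without any server-side logic.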

File layout

This basic layout of file generation is complete in the script:

  • public/
    • dbname/
      • YYYYMMDD/
        • dbname-YYYYMMDD-all-titles-in-ns0.gz
          list of page names for BBC
        • dbname-YYYYMMDD-table.gz
          SQL table dumps
        • dbname-YYYYMMDD-pages-type.xml.bz2
        • dbname-YYYYMMDD-pages-type.xml.7z
          XML page text dumps
        • dbname-YYYYMMDD-abstract.xml.gz
          page extracts for Yahoo
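
The naming scheme above can be sketched as a small path builder. This is only an illustration in Python, not the runner script itself; `dump_path`, its parameters, and the example database name and date are assumptions.

```python
from pathlib import PurePosixPath

def dump_path(dbname, date, kind, ext, root="public"):
    """Return root/dbname/YYYYMMDD/dbname-YYYYMMDD-kind.ext.

    kind is the middle part of the name (for example "all-titles-in-ns0",
    "abstract", or a table name); ext is the suffix without a leading dot.
    """
    filename = f"{dbname}-{date}-{kind}.{ext}"
    return PurePosixPath(root) / dbname / date / filename

if __name__ == "__main__":
    # "testwiki" and "20060101" are made-up example values.
    print(dump_path("testwiki", "20060101", "all-titles-in-ns0", "gz"))
    print(dump_path("testwiki", "20060101", "abstract", "xml.gz"))
```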

Static URLs

There will probably also be a directory of symbolic links, giving each file a static URL that always points to its latest version. It will likely look like this:

  • public/
    • dbname/
      • latest/
        • dbname-all-titles-in-ns0.gz
          list of page names for BBC
        • dbname-table.gz
          SQL table dumps
        • dbname-pages-type.xml.bz2
        • dbname-pages-type.xml.7z
          XML page text dumps
        • dbname-abstract.xml.gz
          page extracts for Yahoo
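
One way the latest/ links might be refreshed after a run is sketched below in Python. The helper `link_latest`, the relative link targets, and the replace-via-temporary-link step are assumptions, not a description of the actual script.

```python
import os
from pathlib import Path

def link_latest(db_dir, dbname, date):
    """Point db_dir/latest/dbname-REST at ../DATE/dbname-DATE-REST.

    Returns the names of the links created or updated.
    """
    dated = Path(db_dir) / date
    latest = Path(db_dir) / "latest"
    latest.mkdir(parents=True, exist_ok=True)
    prefix = f"{dbname}-{date}-"
    made = []
    for f in sorted(dated.iterdir()):
        if not f.name.startswith(prefix):
            continue  # skip anything that does not follow the naming scheme
        link = latest / f"{dbname}-{f.name[len(prefix):]}"
        tmp = link.with_name(link.name + ".tmp")
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()  # clear a leftover from an interrupted run
        # Relative target, so the tree can be moved or served from any root.
        os.symlink(os.path.join("..", date, f.name), tmp)
        # Rename over the old link so readers never see a missing file.
        os.replace(tmp, link)
        made.append(link.name)
    return made
```

Calling `link_latest("public/testwiki", "testwiki", "20060101")` (example values) would leave public/testwiki/latest/testwiki-abstract.xml.gz pointing at the dated abstract file.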

Images/uploads

Not yet included; this may change in the near future.