Data dumps/Misc dumps format

The format of the XML/SQL dumps is documented here. The Wikidata entity dump formats are documented here for JSON and here for RDF.

The format of the other dumps produced by Wikimedia is described below.

Category dumps

These are produced in RDF format. For each category, the following information is provided:

  • Category title
  • Whether the category is hidden
  • Number of pages in the category, excluding subcategories and files
  • Number of subcategories
  • URLs of the categories to which this category belongs
Sample excerpt
<https://en.wikipedia.org/wiki/Category:0-6-0PT_locomotives> a mediawiki:Category ;
	rdfs:label "0-6-0PT locomotives" ;
	mediawiki:pages "13"^^xsd:integer ;
	mediawiki:subcategories "0"^^xsd:integer .
...
<https://en.wikipedia.org/wiki/Category:0-6-0PT_locomotives> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:0-6-0T_locomotives>,
		<https://en.wikipedia.org/wiki/Category:0-6-0_locomotives>,
		<https://en.wikipedia.org/wiki/Category:Commons_category_link_is_on_Wikidata>,
		<https://en.wikipedia.org/wiki/Category:Tank_locomotives>,
		<https://en.wikipedia.org/wiki/Category:Whyte_notation> 
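
As a rough illustration of reading these files, the sketch below pulls the label and the two counts out of Turtle text shaped like the excerpt above. It is a line-based regular-expression scan, not an RDF parser, and it assumes one predicate per line as in the excerpt; a real consumer would more likely use an RDF library. The helper name category_counts is made up for this example.

```python
import re

# Patterns assume the layout of the excerpt: one predicate per line.
SUBJECT_RE = re.compile(r'^<([^>]+)>\s+a\s+mediawiki:Category\b')
LABEL_RE = re.compile(r'rdfs:label\s+"((?:[^"\\]|\\.)*)"')
COUNT_RE = re.compile(r'mediawiki:(pages|subcategories)\s+"(\d+)"\^\^xsd:integer')


def category_counts(turtle_text):
    """Map category URL -> {'label', 'pages', 'subcategories'} (hypothetical helper)."""
    result = {}
    current = None
    for line in turtle_text.splitlines():
        subject = SUBJECT_RE.match(line)
        if subject:
            current = result.setdefault(subject.group(1), {})
        if current is None:
            continue  # not inside a category description
        label = LABEL_RE.search(line)
        if label:
            current["label"] = label.group(1)
        for name, value in COUNT_RE.findall(line):
            current[name] = int(value)
        if line.rstrip().endswith("."):
            current = None  # a trailing "." closes the description


    return result


SAMPLE = '''<https://en.wikipedia.org/wiki/Category:0-6-0PT_locomotives> a mediawiki:Category ;
\trdfs:label "0-6-0PT locomotives" ;
\tmediawiki:pages "13"^^xsd:integer ;
\tmediawiki:subcategories "0"^^xsd:integer .
'''
counts = category_counts(SAMPLE)
```

The mediawiki:isInCategory statements are skipped by this sketch because their subject line does not declare a mediawiki:Category.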

Cirrus search index dumps

For more information, see the extension documentation on the search schema.


These files contain importable Cirrus search indexes in JSON format. For each entry, the following is provided:

type: one of 'page' or 'namespace'

For namespaces:

id: namespace number
<wikiname>: name of specific wiki
<wikitype>: name of wiki type (wikipedia, wikivoyage, and so on)

For pages:

auxiliary_text: thumbnail captions, tables and a few other things that are searchable but not part of the primary page content
category: list of categories to which this page belongs
content_model: whether the page content is wikitext, JSON and so on
coordinates: geographical coordinates provided via the parser function '#coordinates', if present
create_timestamp: date and time the page was first created
defaultsort: the sort key for sorting the page in categories which contain it, if set
display_title: the value for the DISPLAYTITLE magic word, if set
external_link: list of links made in this page that lead outside of the wiki projects
heading: list of entries in the page content surrounded by == (so HTML h2 headers)
incoming_links: number of pages that link to this page
language: content language of the page
namespace: number of the namespace of this page
namespace_text: name of the namespace of this page
opening_text: text before the first heading (h2 through h6, i.e. == through ======)
outgoing_link: list of links made in this page that lead to other pages on wiki projects
redirect: namespace and title of pages which redirect to this page, if any
source_text: raw text of the revision
text: everything but the opening text and auxiliary text (after wikitext expansion)
text_bytes: length of revision content, in bytes
template: list of templates included by this page
timestamp: timestamp of current revision
title: title of page
version: current revision id
version_type: always 'external' when set
wiki: name of specific wiki
wikibase_item: the Q number of the page's item on Wikidata (how this value is obtained is not documented here)


Sample excerpts (simplewikibooks, enwiki)
{"index":{"_type":"namespace","_id":"4"}}
{"name":["wikibooks"],"wiki":"simplewikibooks"}
...
{"index":{"_type":"page","_id":"29012876"}}
{"template":[],"redirect":[],"wikibase_item":"Q8801599","heading":[],"source_text":"[[Category:Protected areas of Maine by county|Penobscot]]\n[[Category:Geography of Penobscot County, Maine]]\n[[Category:Tourist attractions in Penobscot County, Maine]]","version_type":"external","opening_text":null,"wiki":"enwiki","coordinates":[],"auxiliary_text":[],"language":"en","title":"Protected areas of Penobscot County, Maine","version":755291624,"external_link":[],"namespace_text":"Category","namespace":14,"text_bytes":167,"incoming_links":1,"text":"","category":["Protected areas of Maine by county","Geography of Penobscot County, Maine","Tourist attractions in Penobscot County, Maine"],"defaultsort":false,"outgoing_link":[],"timestamp":"2016-12-17T05:52:40Z","content_model":"wikitext","create_timestamp":"2010-10-01T04:20:42Z"}
{"index":{"_type":"page","_id":"57554132"}}
{"version":843749848,"wiki":"enwiki","namespace":14,"namespace_text":"Category","title":"Turkish aerobic gymnasts","timestamp":"2018-05-31T06:08:09Z","category":["Turkish gymnasts","Aerobic gymnasts"],"external_link":[],"outgoing_link":["Portal:Gymnastics"],"template":["Template:Portal","Module:Portal","Module:Portal\/images\/g"],"text":"Gymnastics portal","source_text":"{{Portal|Gymnastics}}\n\n[[Category:Turkish gymnasts|Aerobic]]\n[[Category:Aerobic gymnasts]]","text_bytes":90,"content_model":"wikitext","coordinates":[],"language":"en","heading":[],"opening_text":null,"auxiliary_text":[],"defaultsort":false,"redirect":[],"incoming_links":1,"create_timestamp":"2018-05-31T06:08:09Z","wikibase_item":"Q55963883"}
{"index":{"_type":"page","_id":"9772184"}}
{"template":[],"redirect":[],"heading":[],"source_text":"hello","version_type":"external","opening_text":null,"wiki":"enwiki","coordinates":[],"auxiliary_text":[],"language":"en","title":"Copywrong~enwiki","version":657578431,"external_link":[],"namespace_text":"User","namespace":2,"text_bytes":5,"incoming_links":0,"text":"hello","category":[],"defaultsort":false,"outgoing_link":[],"timestamp":"2015-04-21T15:02:03Z","content_model":"wikitext","create_timestamp":"2007-02-28T15:46:39Z"}

Content translation dumps

The content translation dumps are provided in three formats: JSON with HTML, JSON with text, and TMX with text. 'Text' in this context means that any HTML markup has been stripped out; see the file excerpts below for an example.

For each entry the following are included, with field names varying according to format:

  • the language of the source text (the text to be translated)
  • the target language
  • the source text itself
  • the machine translation of the source text, and the machine translation engine used
  • the target text (the human translation)


For more information, see the extension documentation on published translations.

Sample excerpt (from cx-corpora._2el.html.json.gz)
    {
        "id": "629016/17",
        "sourceLanguage": "ba",
        "targetLanguage": "el",
        "source": {
            "content": "<section rel=\"cx:Section\" id=\"cxSourceSection17\" data-mw-cx-source=\"undefined\"><h2 id=\"4a7502fbf361633e09fc38c09d6c5b\"><span data-segmentid=\"96\" class=\"cx-segment\">Һылтанмалар</span></h2>\n</section>"
        },
        "mt": {
            "engine": "Yandex",
            "content": "<section rel=\"cx:Section\" id=\"cxTargetSection17\" data-mw-cx-source=\"Yandex\"><h2 id=\"4a7502fbf361633e09fc38c09d6c5b\"><span data-segmentid=\"96\" class=\"cx-segment\">Σύνδεσμος</span></h2></section>"
        },
        "target": {
            "content": "<section rel=\"cx:Section\" id=\"cxTargetSection17\" data-mw-cx-source=\"Yandex\"><h2 id=\"4a7502fbf361633e09fc38c09d6c5b\"><span data-segmentid=\"96\" class=\"cx-segment\">Εξωτερικοί σύνδεσμοι</span></h2></section>"
        }
    },
...
    {
        "id": "629016/8",
        "sourceLanguage": "ba",
        "targetLanguage": "el",
        "source": {
            "content": "<section rel=\"cx:Section\" id=\"cxSourceSection8\" data-mw-cx-source=\"undefined\"><ul id=\"mwGg\"><li id=\"mwGw\"><span data-segmentid=\"79\" class=\"cx-segment\">Башҡортостандың атҡаҙанған мәҙәниәт хеҙмәткәре (1993).</span></li></ul>\n\n</section>"
        },
        "mt": {
            "engine": "Yandex",
            "content": "<section rel=\"cx:Section\" id=\"cxTargetSection8\" data-mw-cx-source=\"Yandex\"><ul id=\"mwGg\"><li id=\"mwGw\"><span data-segmentid=\"79\" class=\"cx-segment\">Τιμήθηκε ο εργαζόμενος πολιτισμού Χαλκίδα (1993).</span></li></ul></section>"
        },
        "target": {
            "content": "<section rel=\"cx:Section\" id=\"cxTargetSection8\" data-mw-cx-source=\"Yandex\"><ul id=\"mwGg\"><li id=\"mwGw\"><span data-segmentid=\"79\" class=\"cx-segment\">Τιμημένος εργαζόμενος του πολιτισμού, Δημοκρατία του Μπασκορτοστάν (1993).</span></li></ul></section>"
        }
    },
Sample excerpt from text JSON file
    {
        "id": "501270/mwCA",
        "sourceLanguage": "ar",
        "targetLanguage": "el",
        "source": {
            "content": "المعتمديّة هي تقسيم إداري يستخدم في تونس. ويمثل المستوى الثاني للتقسيم الإداري بالجمهورية التونسية، حيث ترجع المعتمدية بالنظر إلى الولاية كما تنقسم إلى بلديات (مدن) ثم إلى عمادات (مناطق) ثم إلى مجالس قروية[1]."
        },
        "mt": {
            "engine": "Yandex",
            "content": "Σατέν κλωστές είναι η διοικητική διαίρεση που χρησιμοποιούνται στην Τυνησία. Και το δεύτερο επίπεδο, η διοικητική διαίρεση της Δημοκρατίας της Τυνησίας, όπου το σατέν κλωστές για τα μέλη είναι επίσης χωρίζεται σε δήμους (πόλεις) και, στη συνέχεια, να την κοσμητεία της (την περίπτωση) και, στη συνέχεια, να τα συμβούλια χωριό[1]."
        },
        "target": {
            "content": "Η μουταμιντίγια είναι όρος διοικητικής διαίρεσης που χρησιμοποιούνται στην Τυνησία. Ανήκει στο δεύτερο επίπεδο της διοικητικής διαίρεσης της Δημοκρατίας της Τυνησίας, όπου τα μουταμιντίγια των κυβερνείων επίσης χωρίζονται σε δήμους (πόλεις) και, στη συνέχεια, σε ιμαντάτ και στη συνέχεια σε συμβούλια χωριών[1]."
        }
    },
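
One possible use of the JSON files is to find segments where the translator changed the machine translation. The sketch below assumes that a file parses as a JSON array of objects shaped like the excerpts above, and that "mt" may be missing or null when no engine was used; both points are assumptions, not documented behaviour. The helper name, ids and text values in the sample are invented for the example.

```python
import json


def edited_after_mt(entries):
    """Return ids of entries whose human target differs from the machine translation."""
    changed = []
    for entry in entries:
        mt = entry.get("mt") or {}          # assumed to be absent or null without MT
        target = entry.get("target") or {}
        mt_text = mt.get("content")
        if mt_text is not None and mt_text != target.get("content"):
            changed.append(entry["id"])
    return changed


# Made-up entries using the field names from the excerpts.
sample_entries = json.loads("""[
  {"id": "1/a", "sourceLanguage": "ar", "targetLanguage": "el",
   "source": {"content": "source one"},
   "mt": {"engine": "Yandex", "content": "machine one"},
   "target": {"content": "human one"}},
  {"id": "1/b", "sourceLanguage": "ar", "targetLanguage": "el",
   "source": {"content": "source two"},
   "mt": {"engine": "Yandex", "content": "same text"},
   "target": {"content": "same text"}},
  {"id": "1/c", "sourceLanguage": "ar", "targetLanguage": "el",
   "source": {"content": "source three"},
   "mt": null,
   "target": {"content": "human three"}}
]""")
changed_ids = edited_after_mt(sample_entries)
```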

Sample excerpt from TMX-formatted file
    <tu srclang="ar">
      <tuv xml:lang="ar">
        <prop type="origin">source</prop>
        <seg>مراجع</seg>
      </tuv>
      <tuv xml:lang="el">
        <prop type="origin">mt</prop>
        <seg>Αναφορές</seg>
      </tuv>
      <tuv xml:lang="el">
        <prop type="origin">user</prop>
        <seg>Παραπομπές</seg>
      </tuv>
    </tu>
    <tu srclang="ar">
      <tuv xml:lang="ar">
        <prop type="origin">source</prop>
        <seg>معتمدية صيادة، ولاية المنستير</seg>
      </tuv>
      <tuv xml:lang="el">
        <prop type="origin">mt</prop>
        <seg>Γραφείο Επιτρόπου κυνηγός, η εντολή του Monastir</seg>
      </tuv>
      <tuv xml:lang="el">
        <prop type="origin">user</prop>
        <seg>Έδρα του μουταμιντίγια, στο κυβερνείο Μοναστίρ</seg>
      </tuv>
    </tu>
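
The TMX excerpt can be read with a standard XML parser. The sketch below groups the variants of each translation unit by their "origin" property (source, mt, user). It assumes the tu elements carry no XML namespace, as in the excerpt, and the body wrapper in the sample is added only to make the fragment well-formed; the helper name is made up for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def translation_units(root):
    """Yield one dict per <tu>, keyed by the 'origin' prop of each <tuv>."""
    for tu in root.iter("tu"):
        unit = {"srclang": tu.get("srclang")}
        for tuv in tu.findall("tuv"):
            origin = tuv.findtext("prop[@type='origin']")
            unit[origin] = {"lang": tuv.get(XML_LANG), "text": tuv.findtext("seg")}
        yield unit


sample_tmx = """<body>
  <tu srclang="ar">
    <tuv xml:lang="ar"><prop type="origin">source</prop><seg>مراجع</seg></tuv>
    <tuv xml:lang="el"><prop type="origin">mt</prop><seg>Αναφορές</seg></tuv>
    <tuv xml:lang="el"><prop type="origin">user</prop><seg>Παραπομπές</seg></tuv>
  </tu>
</body>"""
units = list(translation_units(ET.fromstring(sample_tmx)))
```

For large files, xml.etree.ElementTree.iterparse would avoid loading the whole document at once.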

Image info dumps

These files come in pairs. The -local- file contains the name and upload date and time of each file uploaded locally to the wiki. The -remote- file lists the files uploaded to Commons that are used on the local wiki; this information is retrieved from the MediaWiki globalimagelinks table.

The first line of each file lists the field name(s); the -local- file lists img_name and img_timestamp while the -remote- file lists gil_to.

Timestamps are in YYYYMMDDHHMMSS format. File names are written as they are stored in the database, so, for example, spaces appear as underscores.

Sample excerpt from orwiki-20190519-local-wikiqueries.gz:

Berhampur-university_logo.png	20120222071753
MKCG_Medical_college_logo_1.svg	20120222111540
SCB_Medical_college_logo.svg	20120224093008
VSS_medical_college_logo.svg	20120227075907
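
A minimal sketch of reading the -local- format follows. It assumes the two fields are separated by a tab, as in the excerpt, and treats the timestamps as UTC, which is how MediaWiki normally stores them; the function name is made up for this example.

```python
from datetime import datetime, timezone


def parse_local_rows(lines):
    """Yield (file_name, upload_time) pairs from -local- file lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or fields == ["img_name", "img_timestamp"]:
            continue  # skip the header line and blank or malformed lines
        name, stamp = fields
        # YYYYMMDDHHMMSS, assumed to be UTC
        uploaded = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        yield name, uploaded


sample_lines = [
    "img_name\timg_timestamp\n",
    "Berhampur-university_logo.png\t20120222071753\n",
    "SCB_Medical_college_logo.svg\t20120224093008\n",
]
rows = list(parse_local_rows(sample_lines))
```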

Media and article title dumps

Each of these files consists of the line 'page_title' as the first line, followed by a list of titles of pages, in alphabetical order. The media titles dump lists titles of all pages in the File: namespace (6), and the page titles dump lists titles of all pages in the main (0) namespace. Titles are dumped as they are found in the database, so spaces have been converted to underscores.

Sample excerpt (from media titles dump):

page_title
!!!_(Chk_Chk_Chk)_-_One_Girl_One_Boy_cover_art.jpg
!!!_-_!!!_album_cover.jpg
!!e!VBQQ!mM_$(KGrHqEOKi8E03iU,-u!BNP3+G6Mqw_1.jpg
!0_Trombones_Like_2_Pianos.jpg
!Haunu.ogg
!Hero_(album).jpg
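
To turn the database form of a title back into the form shown on the wiki, the underscores can be replaced with spaces. The sketch below does this while checking for the 'page_title' header line; the function name and the optional namespace prefix argument are assumptions made for this example.

```python
def display_titles(lines, namespace_prefix=""):
    """Yield titles with underscores turned into spaces, skipping the header."""
    lines = iter(lines)
    header = next(lines, "").strip()
    if header != "page_title":
        raise ValueError("expected 'page_title' as the first line")
    for line in lines:
        title = line.rstrip("\n")
        if title:
            yield namespace_prefix + title.replace("_", " ")


sample_titles = ["page_title\n", "!Haunu.ogg\n", "!Hero_(album).jpg\n"]
media_pages = list(display_titles(sample_titles, namespace_prefix="File:"))
```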

Short url dumps

Each line of these files has the format short-url|full-url, where the short URL https://w.wiki/short-url-here redirects to the full URL, a link to a page on one of our wiki projects.

Sample excerpt:

L|https://en.wikipedia.org/wiki/LGBT
M|https://en.wikipedia.org/wiki/MediaWiki
N|https://en.wikipedia.org/wiki/NetHack
P|https://en.wikipedia.org/wiki/Jean-Luc_Picard
Q|https://www.wikidata.org/wiki/Help:Items
R|https://en.wikipedia.org/wiki/Dennis_Ritchie
S|https://sv.wikipedia.org/wiki/Stockholm
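
The lines can be split on the first '|' only, since a full URL might itself contain that character. The sketch below builds a mapping from complete short URLs to full URLs; it assumes that short codes never contain '|', and the function name is made up for this example.

```python
def parse_short_urls(lines, base="https://w.wiki/"):
    """Map complete short URLs to the full URLs they redirect to."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        code, separator, full_url = line.partition("|")  # split on the first '|' only
        if not separator or not code:
            continue  # skip blank or malformed lines
        mapping[base + code] = full_url
    return mapping


sample_short = [
    "M|https://en.wikipedia.org/wiki/MediaWiki\n",
    "Q|https://www.wikidata.org/wiki/Help:Items\n",
    "\n",
]
short_map = parse_short_urls(sample_short)
```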