Datasets
This page lists places that host Wikimedia datasets, along with tools for working with them.
You can also store tabular and map data using Commons Datasets and use it on all wikis from Lua and Graphs.
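As a rough illustration of how tabular data of this kind might be consumed outside a wiki, the sketch below parses a small hand-written JSON document shaped like a Commons tabular-data page (column definitions under `schema.fields`, rows under `data`) into dictionaries. The sample content and the helper name `rows_as_dicts` are assumptions made for this example, not taken from a real page.

```python
import json

# Hand-written sample shaped like a Commons tabular-data page.
# The field names and values are made up for illustration.
SAMPLE_TAB = """
{
  "license": "CC0-1.0",
  "description": {"en": "Example table"},
  "schema": {
    "fields": [
      {"name": "year", "type": "number"},
      {"name": "label", "type": "string"}
    ]
  },
  "data": [
    [2019, "first"],
    [2020, "second"]
  ]
}
"""


def rows_as_dicts(tab_json):
    """Pair each row in "data" with the column names from "schema.fields"."""
    doc = json.loads(tab_json)
    names = [field["name"] for field in doc["schema"]["fields"]]
    return [dict(zip(names, row)) for row in doc["data"]]


if __name__ == "__main__":
    for row in rows_as_dicts(SAMPLE_TAB):
        print(row)
```

On the wikis themselves the same data would normally be read from Lua or Graphs, as noted above; this Python version only shows the shape of the data.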
List
Dataset Description | URL | Last Updated |
---|---|---|
Official Wikipedia database dumps | [1] | Present |
Parsoid exposes the semantics of content in fully rendered HTML+RDFa, and is available for various languages and projects: enwiki, frwiki, ..., frwiktionary, dewikibooks, ... The prefix pattern is the Wikimedia database name. Users include VE, Flow, Kiwix and Google. Parsoid also supports converting (possibly modified) HTML back to wikitext without introducing dirty diffs. | [2] | Dead |
Taxobox - Wikipedia Infoboxes with Taxonomic information on Animal Species | [3] | Dead |
Wikipedia³ is a conversion of the English Wikipedia into RDF. It's a monthly updated dataset containing around 47 million triples | [4] | Dead |
DBpedia: Facts extracted from Wikipedia infoboxes and link structure in RDF format (Auer et al., 2007) | [5] | 2019 |
Multiple data sets (English Wikipedia articles that have been transformed into XML) | [6] | Dead |
An alphabetical list of film articles (or sections within articles about films). It includes made-for-television films. | [7] | Dead |
Using the Wikipedia page-to-page link database | [8] | Dead |
Wikipedia: Lists of common misspellings/For machines | [9] | Dead |
Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. | [10] | Dead |
Wikipedia XML Data | [11] | 2015 |
Wikipedia Page Traffic Statistics (up to November 2015) | [12] | 2015 |
Complete Wikipedia edit history (up to January 2008) | [13] | 2008 |
Wikitech-l page counters | [14] | 2016 |
MusicBrainz Database | [15] | Dead |
Datasets of network extracted from User Talk pages | [16] | 2011 |
Wikipedia Statistics | [17] | Present |
List of articles created in the last month/week/day with the most users contributing to the article within the same period | [18] | Dead |
Wikipedia Taxonomy: automatically generated from the network of categories in Wikipedia (RDF Schema format) (Ponzetto and Strube, 2007 a–c; Zirn et al., 2008) | [19] | Dead |
Semantic Wikipedia: A snapshot of Wikipedia automatically annotated with named entity tags (Zaragoza et al., 2007) | [20] | Dead |
Cyc to Wikipedia mappings: 50,000 automatically created mappings from Cyc terms to Wikipedia articles (Medelyan and Legg, 2008) | [21] | Dead |
Topic indexed documents: A set of 20 Computer Science technical reports indexed with Wikipedia articles as topics. Fifteen teams of two senior CS undergraduates independently assigned topics from Wikipedia to each report (Medelyan et al., 2008) | [22] | Dead |
Wikipedia Page Traffic API | [23] | Present |
Articles published using the Content Translation tool. Both detailed lists and summary statistics are available. | [24] | 2022 |
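For the page traffic API entry in the table above, a request URL can be assembled along the following lines. This is a minimal sketch that assumes the entry refers to the Wikimedia REST pageviews "per-article" endpoint; the base URL, the default parameter values, and the helper name `per_article_url` are assumptions for illustration. The code only builds the URL and makes no network request.

```python
from urllib.parse import quote

# Assumed base URL of the Wikimedia REST pageviews endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; dates are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title, granularity, start, end])


if __name__ == "__main__":
    print(per_article_url("en.wikipedia", "Albert Einstein", "20151001", "20151031"))
```

Percent-encoding the title with `safe=""` matters for titles that contain a slash, which would otherwise be read as a path separator.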
Tools to extract data from Wikipedia:
This table might be migrated to the Knowledge Extraction Wikipedia article.
Tool | Description | URL | Last Updated |
---|---|---|---|
Wikilytics | Extracting the dumps into a NoSQL database | [26] | 2017 |
Wikipedia2text | Extracting Text from Wikipedia | [27] | 2008 |
Traffic Statistics | Wikipedia article traffic statistics | [28] | Dead |
Wikipedia to Plain text | Generating a Plain Text Corpus from Wikipedia | [29] | 2009 |
DBpedia Extraction Framework | The DBpedia software that produces RDF data from over 90 language editions of Wikipedia and Wiktionary (highly configurable for other MediaWikis also). | [30] [31] | 2019 |
Wikiteam | Tools for archiving wikis including Wikipedia | github | 2019 |
History Flow | History flow is a tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors | [32] | Dead |
WikiXRay | This tool includes a set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version | [33] | 2012 |
StatMediaWiki | StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages including tables and graphics that can help to analyze the wiki status and development, or a CSV file for custom processing. | [34] | Dead |
Java Wikipedia Library (JWPL) | An open-source, Java-based application programming interface that provides access to all information contained in Wikipedia | [35] | 2016 |
Wikokit | Wiktionary parser and visual interface | github | 2019 |
wiki-network | Python scripts for parsing Wikipedia dumps with different goals | github | 2012 |
Pywikipediabot | Python Wikipedia robot framework | [36] | 2019 |
WikiRelate | API for computing semantic relatedness using Wikipedia (Strube and Ponzetto, 2006) | [37] | 2006 |
WikiPrep | A Perl tool for preprocessing Wikipedia XML dumps (Gabrilovich and Markovitch, 2007) | [38] | 2014 |
W.H.A.T. Wikipedia Hybrid Analysis Tool | An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface to analyze the behavior of authors | [39] | 2013 |
QuALiM | A question answering system. Given a question in natural language, it returns relevant passages from Wikipedia (Kaisser, 2008) | [40] | 2008 |
Koru | A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles. Supports automatic and interactive query expansion (Milne et al., 2007) | [41] | 2007 |
Wikipedia Thesaurus | A large-scale association thesaurus containing 78M associations (Nakayama et al., 2007a, 2008) | [42] | Dead |
Wikipedia English–Japanese dictionary | A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations (Erdmann et al., 2008) | [43] | Dead |
Wikify | Automatically annotates any text with links to Wikipedia articles (Mihalcea and Csomai, 2007) | [44] | Dead |
Wikifier | Automatically annotates any text with links to Wikipedia articles describing named entities | [45] | Dead |
Wikipedia Cultural Diversity Observatory | Creates a dataset named Cultural Context Content (CCC) for each language edition with the articles that relate to its cultural context (geography, people, traditions, history, companies, etc.). | [46] github | 2019 |
Time-series graph of Wikipedia | Wikipedia web network stored in Neo4J database. Pagecounts data stored in Apache Cassandra database. Deployment scripts and instructions use corresponding Wikimedia dumps. | github [47] | 2020 |
Basic Python parsing of dumps | A guide to parsing Wikipedia dumps in Python | blog script | 2017 |
Wiki Dump Reader | A Python package to extract text from Wikipedia dumps | [48] | 2019 |
MediaWiki Parser from Hell | A Python library to parse MediaWiki wikicode. | docs github | 2020 |
Mediawiki Utilities | A collection of utilities for interfacing with MediaWiki | mediawiki github | 2020 |
qwikidata | A Python utility for interacting with Wikidata | github | 2020 |
Namespace Database | A Python utility | github | 2020 |
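Several of the tools above parse Wikipedia XML dumps. As a minimal sketch of the general idea, using only the Python standard library rather than any of the listed packages, the code below streams over a MediaWiki XML export and yields the title and wikitext of main-namespace pages. The inline sample, the `export-0.10` namespace version, and the helper names are assumptions for illustration; real dumps are much larger and may use a different schema version.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of a MediaWiki XML export.
# The namespace version (export-0.10) is an assumption; real dumps may differ.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <ns>0</ns>
    <revision><text>Hello [[world]].</text></revision>
  </page>
  <page>
    <title>Talk:Example</title>
    <ns>1</ns>
    <revision><text>Discussion.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, wikitext) for main-namespace pages, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child for child in elem}
        if fields["ns"].text == "0":
            text = next(e.text for e in fields["revision"]
                        if local_name(e.tag) == "text")
            yield fields["title"].text, text or ""
        elem.clear()  # release the page's subtree once it has been handled


if __name__ == "__main__":
    for title, text in iter_articles(io.BytesIO(SAMPLE_DUMP)):
        print(title, len(text))
```

Because `iter_articles` accepts any binary file-like object, a compressed dump could be passed in through `bz2.open(path, "rb")` instead of the in-memory sample.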