This is an evolving draft of a potential blogpost (series) on datasets, AI, and Wikimedia.

From Dumps to Datasets: Recommendations for a More Purposeful Provision of Wikimedia Data for AI

It is well-documented that Wikimedia data – especially Wikipedia – has been essential to the progress of AI – especially language modeling – over the past several years. The BERT language model, introduced in 2018 and often considered the first modern LLM, used English Wikipedia for the majority of its training data. Even today, with language models like GPT-4 and their corresponding training datasets several orders of magnitude larger, Wikipedia remains one of the top sources of data.

Though this usage of Wikimedia data for AI has been beneficial in directing attention to the Wikimedia projects, it has largely been incidental to Wikimedia's mission. The dumps have been made available since at least 2005, and while researchers were considered an expected end-user, the AI community in particular has not been viewed as a key stakeholder. As the future of the Wikimedia projects begins to feel more intertwined with the future of generative AI, however, I believe it is worth taking a deeper look at how this data is and could be used for AI – that is, understanding how we might be more purposeful in what we provide.

I argue that this new class of generative AI models lays bare two key deficiencies in the data that prevent the Wikimedia communities from fully benefiting from these new technologies: 1) a few key data gaps – most notably imagery from Wikimedia Commons – that reduce how well these models represent the Wikimedia projects; and 2) a lack of Wikimedia-specific benchmark datasets that would make clear what we would like to see from AI trained on this valuable Wikimedia data and would help Wikimedians identify which models are best suited to their tasks.

Some history...or why generative AI is new to Wikimedia but AI is not

Wikimedia is no stranger to AI. Since at least 2010 with the introduction of User:ClueBot_NG, basic machine learning has been used in impactful ways on the projects (in that case, reverting vandalism). The 2010s saw an expansion of models provided to help support important on-wiki curation tasks, most notably the ORES infrastructure and corresponding models for edit quality, article quality, and article topic. These supervised-learning models were trained to support clearly-scoped tasks and were sufficiently small that they could be easily trained from scratch on Wikimedia data.

During this time, we also saw the rise of a new type of model on the Wikimedia projects – those used for richer language-understanding and -manipulation tasks. Machine translation on Wikipedia is the foremost example, but other prominent examples include OCR on Wikisource and, more recently, models like article description recommendation. While these models could support some of Wikimedia's more complex needs and vastly speed up content creation, they were too large for the Wikimedia community to train fully from scratch. Existing open-source models were often a good start, but Wikimedia's needs frequently exceed them, most notably in its diversity of languages. A successful solution for closing these gaps has been providing Wikimedia data specific to these tasks so that outside developers can improve their tooling in ways that directly support the Wikimedia projects. For example, translation data from Wikimedia was packaged to help developers fine-tune their models, leading to advances like NLLB-200 that dramatically expanded translation support on Wikipedia. On Wikisource, a partnership with Transkribus used Wikimedians' transcriptions to enable new OCR models for poorly-supported documents like hand-written Balinese palm-leaf manuscripts.

The new generation of generative AI models, as typified by ChatGPT, is dramatically different from the above. These models potentially enable a wide range of tasks, from writing articles to building SPARQL queries to new ways of interacting with content. They can support these diverse needs because they are not trained for any specific task. Their vast size, however, precludes any direct fine-tuning to meet Wikimedia needs. Instead, after a generic pre-training stage in which these models learn to predict the next word in a sentence, they undergo a stylistic instruction-tuning stage in which they are trained to respond appropriately to a wide variety of tasks. This leaves two indirect pathways to nudge developers toward training models that are beneficial for the Wikimedia projects: 1) providing more high-quality data, and 2) establishing benchmarks that evaluate how well these models perform on important Wikimedia tasks and guide developers toward the improvements we expect to see.

Dumps

The Wikimedia dumps typify the long-standing approach to providing data publicly. They are not backups, nor are they always consistent or complete, but they are very good representations of the Wikimedia projects at particular moments in time. These dumps have been invaluable to the Wikimedia research and AI communities, and a whole host of helper libraries has sprung up to help people work with this data.
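As an illustration of what working with the raw dumps involves, here is a minimal sketch that streams through a pages-articles XML dump using only the Python standard library; the dump filename is an assumption, and in practice most people reach for one of the helper libraries instead.

```python
# Minimal sketch: stream pages out of a pages-articles XML dump using only the
# Python standard library. The dump filename is an illustrative assumption.
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # assumed local dump file

def localname(tag: str) -> str:
    """Strip the XML namespace, which varies across dump schema versions."""
    return tag.rsplit("}", 1)[-1]

with bz2.open(DUMP_PATH, "rb") as f:
    title = None
    for _, elem in ET.iterparse(f, events=("end",)):
        name = localname(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            wikitext = elem.text or ""
            print(title, len(wikitext))  # the wikitext still needs further cleanup
        elif name == "page":
            elem.clear()  # release memory as we stream through the dump
```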

The gem of these dumps is the twice-monthly snapshots of the current wikitext of every article in every language edition of Wikipedia. Wikipedia text is invaluable to natural language processing (NLP) for a variety of reasons. It is generally long-form (lots of context to learn from), "well-written", and has minimal bias (at least compared to the rest of the internet), thanks both to its neutral point-of-view policy and to the constant efforts of Wikipedians to close knowledge gaps. The content is trustworthy and its sources reliable thanks to the careful work of the editor community. And the content is highly multilingual, making it especially important for training models in languages that are digitally underrepresented.

For a long time, these dumps were still a messy resource that NLP folks had to further process, removing the various wikitext syntax (templates, categories, etc.) to leave just the plaintext or natural-language content of an article. Different researchers took slightly different approaches, from a simple Perl script to a more fully-fledged Python library that also expands some templates. Only recently has some standardization appeared (positive foreshadowing), with Hugging Face providing already-processed plaintext snapshots of Wikipedia; this has become one of their most popular datasets, along with subsets such as this 2016 dataset of text from just good and featured articles.
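To give a sense of what that wikitext-to-plaintext step looks like, here is a minimal sketch using the mwparserfromhell library; the snippet of wikitext is purely illustrative, and real pipelines handle many more edge cases (template expansion, tables, references, and so on).

```python
# Minimal sketch: strip wikitext syntax down to plaintext with mwparserfromhell.
# The example wikitext is illustrative; real articles require more cleanup.
import mwparserfromhell

raw_wikitext = """
{{Infobox spacecraft|name=Sojourner}}
'''Sojourner''' is a [[rover (space exploration)|rover]] that landed on [[Mars]] in 1997.
[[Category:Mars rovers]]
"""

wikicode = mwparserfromhell.parse(raw_wikitext)
plaintext = wikicode.strip_code().strip()  # drops templates, link syntax, etc.
print(plaintext)
```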

The dumps are not complete though. Notably, none of the imagery that is found on Wikipedia (as well as the additional imagery on Wikimedia Commons) is available via the dumps. Downloading this large source of freely-licensed imagery requires gathering the images one-by-one via APIs, which can easily take months or result in throttling if the requests are too frequent. Large one-off subsets of Commons have been released in the past – most notably the Wikipedia-based Image Text (WIT) dataset, which could be compiled by Google given their unique infrastructure and was further parlayed into an image-text modeling challenge – but these remain one-offs and usually focus on just the subset of imagery that appears on Wikipedia. This large gap in the dumps means that Wikimedia is not nearly as well represented within the computer vision community or the resulting image and multimodal models; datasets like ImageNet or LAION, which are scraped from the broader web, are far more common.
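For context on why gathering imagery is so slow, here is a minimal sketch of the one-by-one retrieval described above, using the standard MediaWiki API on Commons; the file titles, sleep interval, and User-Agent string are illustrative assumptions rather than a recommended crawling setup.

```python
# Minimal sketch: look up image URLs on Commons one file at a time via the
# MediaWiki API, throttling between requests. File titles are illustrative.
import time
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "image-fetch-sketch/0.1 (research use)"}  # assumed UA

file_titles = ["File:Example.jpg"]  # hypothetical list of Commons file pages

for file_title in file_titles:
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url",
        "titles": file_title,
    }
    response = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    pages = response.json()["query"]["pages"]
    for page in pages.values():
        for info in page.get("imageinfo", []):
            print(file_title, info["url"])  # a real pipeline would download from this URL
    time.sleep(1)  # throttle to stay well under API rate limits
```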

Datasets

Datasets are opinionated dumps – they are designed to anticipate a need and nudge end-users towards particular uses of the data. For example, Hugging Face's plaintext Wikipedia dataset takes the Wikipedia content dumps, filters them down to articles, and strips out the various syntax in order to present text that is designed to be consumed for pre-training of language models. If the dataset were intended for another use, such as indexing for retrieval in a RAG pipeline, it might instead have been processed to retain infobox or tabular data (which can contain useful facts that are not found in the article text). This pre-processing can help lower the barrier for researchers to use the data. Benchmarks take this a step further by explicitly laying out a task that models are expected to perform and providing some level of supervised data to be used in fine-tuning or evaluating a model. Datasets and benchmarks allow us to encode our expectations for AI models that use Wikimedia content.
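As a concrete example of how low this barrier has become, here is a minimal sketch of loading one of these plaintext snapshots with the Hugging Face datasets library; the exact dataset name and snapshot configuration shown here are assumptions and may differ from the current releases.

```python
# Minimal sketch: stream a plaintext Wikipedia snapshot from Hugging Face.
# The dataset name and snapshot config are assumptions; check the hub for
# the current releases.
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

for article in wiki.take(3):
    print(article["title"])
    print(article["text"][:200], "...")
```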

Long before (super)alignment was a buzzword within generative AI spaces, Wikipedians had established core content policies to guide how editors should generate content. While simple in principle, these policies are ones that many generative AI models struggle with. They encode the biases found in the internet data they are trained on (unless you ask really nicely to maintain a neutral point of view). They hallucinate (verrry original research). And they are very poor at citing their sources (or hallucinate again when asked).

Progress is being made in these areas, but part of the problem is that the core benchmarks used for generative AI models almost all test reader-oriented skills like question answering. Much less common are benchmarks that test the ability of a model to edit – i.e., the tasks of most relevance to Wikimedians. This reduces the incentive for developers to train models for these tasks and leaves editors with little means of directly comparing the available models to determine which are best suited to their needs.

A recent positive example in this space is the FreshWiki dataset, which collects high-quality articles on English Wikipedia that were created after a given date. This temporal air gap ensures that a model that is evaluated on the dataset is not merely memorizing data that it has already seen but is instead demonstrating its ability to create a well-balanced article from a set of provided sources. Creating new articles from scratch, however, is just a small fraction of the work required to grow and maintain Wikipedia. Still needed are benchmarks that test the ability of models to update articles with a new source (example), understand and summarize diffs (example), and detect policy violations in Wikimedia content (example).
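To make the temporal air-gap idea concrete, here is a minimal sketch that keeps only articles whose first revision postdates an assumed model training cutoff; the candidate titles and cutoff date are illustrative, and FreshWiki itself uses a more involved selection process.

```python
# Minimal sketch: build a temporal split by keeping only pages created after an
# assumed model training cutoff. Candidate titles and cutoff are illustrative.
from datetime import datetime, timezone
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "temporal-split-sketch/0.1 (research use)"}
TRAINING_CUTOFF = datetime(2023, 4, 1, tzinfo=timezone.utc)  # assumed cutoff

def first_revision_timestamp(title: str) -> datetime:
    """Fetch the timestamp of a page's earliest revision via the MediaWiki API."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "timestamp",
        "rvlimit": 1,
        "rvdir": "newer",  # oldest revision first
        "titles": title,
    }
    data = requests.get(API_URL, params=params, headers=HEADERS, timeout=30).json()
    page = next(iter(data["query"]["pages"].values()))
    timestamp = page["revisions"][0]["timestamp"]  # e.g. "2023-06-15T12:34:56Z"
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

candidate_titles = ["Example article"]  # hypothetical candidate pool
fresh = [t for t in candidate_titles if first_revision_timestamp(t) > TRAINING_CUTOFF]
print(fresh)
```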

Postscript

Datasets that help the generative AI field move forward in ways that benefit Wikimedians are important. We must be careful, however, not to lose sight of some of the more "basic" technologies that still lag behind the needs of the Wikimedia projects in important ways. OCR models still fail for many languages, greatly hindering the ability of the Wikisource community to digitize new knowledge from underrepresented languages. The general consensus on chat agents is that they require some form of information retrieval (RAG) to have the relevant information to answer questions correctly and with an explicit source. Keyword-based search is still an important part of this pipeline, which means good named-entity recognition (NER) is necessary to convert natural-language questions into effective search queries on Wikipedia. Machine translation has made great strides in recent years, especially in terms of new open-source models, enabling exciting new opportunities on the Wikimedia projects. Vandalism-detection models are also advancing and are key to ensuring that AI-supported content generation does not overwhelm the capacity of editors to patrol this new content. These technologies form the foundation upon which generative AI models can be made useful.
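As an illustration of that NER-to-search step, here is a minimal sketch that extracts entities from a question with spaCy and uses them as a keyword query against the Wikipedia search API; the spaCy model, example question, and User-Agent are assumptions, not a production retrieval setup.

```python
# Minimal sketch: turn a natural-language question into a keyword search on
# Wikipedia by extracting named entities. The spaCy model and question are
# illustrative assumptions.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model with NER
question = "When did the Cassini spacecraft arrive at Saturn?"

entities = [ent.text for ent in nlp(question).ents]
query = " ".join(entities) or question  # fall back to the full question

params = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": query,
    "srlimit": 3,
}
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params=params,
    headers={"User-Agent": "rag-search-sketch/0.1 (research use)"},
    timeout=30,
)
for result in resp.json()["query"]["search"]:
    print(result["title"])  # candidate pages to retrieve and ground the answer
```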