Abstract Wikipedia/Updates/2022-06-07
Communities will create (at least) two different types of articles using Abstract Wikipedia: on the one hand, we will have highly standardised articles based entirely on Wikidata; and on the other hand, we will have bespoke, hand-crafted content, assembled sentence by sentence. Today we will discuss the first type, and we will discuss the second type in an upcoming newsletter (Abstract Wikipedia/Updates/2022-06-21).
Articles of the first type can be created very quickly and will likely constitute the vast majority of articles for a long time to come. For that we can use models, i.e. texts with variables. Put differently, a model is a text with gaps that get filled from another source, such as a list, along the lines of the mad libs game. A model can be created once for a specific type of item and then used for every single item of that type that has enough data in Wikidata. The resulting articles are similar to many bot-created articles that already exist in various Wikipedias.
For example, in many languages, bots were used to create or maintain the articles about individual years (such as the articles about 1313, 1428, or 1697, each of which is available in more than a hundred languages). In English Wikipedia, many articles for US cities were created by a bot based on the US census, and later updated after the 2010 census. Lsjbot by Sverker Johansson is a well-known example of a bot that has created millions of articles about locations or species across a few languages, such as Swedish, Waray-Waray, or Cebuano. Comparable activities, although not as prolific, have been going on in quite a few other languages.
How do these approaches work? Assume you have a dataset such as the following list of countries:
| Country | Continent | Capital | Population |
|---|---|---|---|
| Jordan | Asia | Amman | 10428241 |
| Nicaragua | Central America | Managua | 5142098 |
| Kyrgyzstan | Asia | Bishkek | 6201500 |
| Laos | Asia | Vientiane | 6858160 |
| Lebanon | Asia | Beirut | 6100075 |
Now we can create a model that can generate a complete text from this data, such as:
“<Country> is a country in <Continent> with a population of <Population>. The capital of <Country> is <Capital>.”
With this text and the above dataset, we would have created the following five proto-articles (references not shown for simplicity):
Jordan is a country in Asia with a population of 10,428,241. The capital of Jordan is Amman.
Nicaragua is a country in Central America with a population of 5,142,098. The capital of Nicaragua is Managua.
Kyrgyzstan is a country in Asia with a population of 6,201,500. The capital of Kyrgyzstan is Bishkek.
Laos is a country in Asia with a population of 6,858,160. The capital of Laos is Vientiane.
Lebanon is a country in Asia with a population of 6,100,075. The capital of Lebanon is Beirut.
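The mad-libs-style approach above can be sketched in a few lines of Python. The dataset and model text are taken from the tables above; the function and variable names are just for illustration, not part of any actual Abstract Wikipedia implementation:

```python
# Dataset from the table above: one dictionary per country.
countries = [
    {"Country": "Jordan", "Continent": "Asia", "Capital": "Amman", "Population": 10428241},
    {"Country": "Nicaragua", "Continent": "Central America", "Capital": "Managua", "Population": 5142098},
    {"Country": "Kyrgyzstan", "Continent": "Asia", "Capital": "Bishkek", "Population": 6201500},
    {"Country": "Laos", "Continent": "Asia", "Capital": "Vientiane", "Population": 6858160},
    {"Country": "Lebanon", "Continent": "Asia", "Capital": "Beirut", "Population": 6100075},
]

# The model: a text with gaps that get filled from the dataset.
# "{Population:,}" also inserts the thousands separators shown above.
MODEL = ("{Country} is a country in {Continent} with a population of "
         "{Population:,}. The capital of {Country} is {Capital}.")

def generate_article(row):
    """Fill the gaps of the model with one row of data."""
    return MODEL.format(**row)

for row in countries:
    print(generate_article(row))
```

Running this over the five rows produces exactly the five proto-articles listed above; creating an article for a new country is then just a matter of adding a row of data.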
Classical textbooks on the topic, such as “Building natural language generation systems”, call this method “mail merge” (even though it is used for more than mail). A model is combined with a dataset, often from a spreadsheet or a database. This has been used for decades to create bulk mailings and other bulk content, and is a form of mass customisation. The methods have become increasingly sophisticated over time and can address more questions: how to deal with missing or optional information? How to adapt parts of the text to the data, e.g. using plurals, grammatical gender, or noun classes where appropriate? The bots mentioned above, which created millions of articles in various languages on Wikipedia, have mostly worked along these lines.
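As a small sketch of one of those refinements, handling missing or optional information, a model can be built from conditional sentences rather than a single fixed text. This is an illustrative assumption about how such a bot might work, not the method of any particular bot, and the function name is invented:

```python
def render(row):
    """Generate a proto-article, skipping sentences whose data is missing."""
    sentences = []
    if row.get("Continent"):
        sentences.append(f"{row['Country']} is a country in {row['Continent']}.")
    else:
        sentences.append(f"{row['Country']} is a country.")
    if row.get("Population"):
        sentences.append(f"It has a population of {row['Population']:,}.")
    if row.get("Capital"):
        sentences.append(f"Its capital is {row['Capital']}.")
    return " ".join(sentences)

# A row with no capital recorded: that sentence is simply omitted.
print(render({"Country": "Vatican City", "Continent": "Europe", "Population": 825}))
# → Vatican City is a country in Europe. It has a population of 825.
```

Adapting the text grammatically, e.g. for plurals, gender, or noun classes, works along similar lines but requires language-specific rules, which is part of what makes the methods increasingly complex.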
For a great example of how far the model approach can be pushed, consider Magnus Manske’s Reasonator, which, based on the data in Wikidata, creates the following automatic description in English for Douglas Adams:
Douglas Adams was a British playwright, screenwriter, novelist, children's writer, science fiction writer, comedian, and writer. He was born on March 11, 1952 in Cambridge to Christopher Douglas Adams and Janet Adams. He studied at St John's College from 1971 until 1974 and Brentwood School from 1959 until 1970. His field of work included science fiction, comedy, satire, and science fiction. He was a member of Groucho Club and Footlights. He worked for The Digital Village from 1996 and for BBC. He married Jane Belson on November 25, 1991 (married until on May 11, 2001 ), Jane Belson on November 25, 1991 (married until on May 11, 2001 ), and Jane Belson on November 25, 1991 (married until on May 11, 2001 ). His children include Polly Adams, Polly Adams, and Polly Adams. He died of myocardial infarction on May 11, 2001 in Santa Barbara. He was buried at Highgate Cemetery.
If we were to say that this is merely better than nothing, I think we would undersell the achievement of Reasonator. The above text, together with the appealing display of the structured data in Reasonator, provides more comprehensive access to knowledge than many of the individual language Wikipedias do for Douglas Adams. For comparison, check out the articles in Azerbaijani, Urdu, Malayalam, Korean, or Danish. At the same time, it exhibits errors that most contributors wouldn’t know how to fix (such as the repetition of the names of the children, or the spaces inside the brackets).
The ArticlePlaceholder project has partially fulfilled the role of filling content gaps, but its developers have intentionally shied away from making the results look too much like an article. The placeholder pages display structured data from Wikidata within the context of a language Wikipedia. For example, here is the generated page about Triceratops in Haitian Creole.
One large disadvantage of using bots to create articles in Wikipedia has been that this content was mostly controlled by a very small subset of the community — often a single person. Many of the bots and datasets have not been open sourced in a way that someone else could easily come in, make a change, and re-run the bot. (Reasonator avoids this issue, because the text is generated dynamically and is not incorporated into the actual Wikipedia article.)
With Wikifunctions and Wikidata, we will be able to give control over all these steps to the wider community. Both the models and the data will be edited on wiki, with all the usual advantages of a wiki: there is a clear history, everyone can edit through the Web, people can discuss, etc. The data used to populate the models will be maintained in Wikidata, and the models themselves in Wikifunctions. This will allow us to collaborate on the texts, unleash the creativity of the community, spot and correct errors and edge cases together, and gradually extend the types of items and the coverage per type.
In a follow-up essay, we will discuss a different approach to creating abstract content, where the content is not the result of a model based on the type of the described item, but rather a manually constructed article, built up sentence by sentence.
Development update from the week of May 27:
- The team had a session at the Hackathon, which was well attended (about 30 people). Thanks to everyone for attending and for your questions and comments!
- We also had follow-up meetings with User:Mahir256, to improve alignment on the NLG stream.
- Below is the brief weekly summary highlighting the status of each workstream
- Performance:
- Observability document drafted.
- Updated Helm charts for getting function-* services in staging.
- Completed performance metrics design and shared for review
- NLG:
- Scoped out necessary changes to Wikifunctions post-launch
- Metadata:
- Started recording and passing up some function-evaluator timing metrics to the orchestrator
- Experience:
- WikiLambda (PHP) layer has been migrated to the new format of typed lists
- Improved the mobile experience of the function view page
- Transitioned the Tabs component to use Codex's, thanks to the Design Systems Team.
- Design: Carried out end-to-end user flow testing in Bangla.
(Apologies for this update being late. We plan to send out another update this week)