Module:Sandbox/AbstractWikipedia
This module is rated as alpha. It is ready for third party input, and may be used on a few pages to see if problems arise, but should be watched. Suggestions for new features or changes in their input and output mechanisms are welcome.
This module, created by user:AGutman-WMF, is a prototype implementation of Abstract Wikipedia's template language in Scribunto.
For an overview of the logic of the entire system, see the overview documentation page.
The current module handles the high-level verbalization of templates and abstract content. It makes use of several sub-modules, which can be divided into distinct categories:
Core code
This is the core code needed for the system to run. While it may need maintenance, it generally does not need contributions by the template-language users. If the code is to be ported to Wikifunctions, this code can probably live in the back-end code of the Wikifunctions Orchestrator or Evaluator:
- Module:Sandbox/AbstractWikipedia/TemplateEvaluator is the module which evaluates a template to a lexeme list (a.k.a. lemma tree) representation. It makes use of:
  - Module:Sandbox/AbstractWikipedia/TemplateParser, which parses a template (given as a string) to an abstract syntax tree representation.
- Module:Sandbox/AbstractWikipedia/Lexemes gives the definition of the lexeme object and its internal representation. In particular, it contains the algorithm to prune lexeme forms according to grammatical constraints. This module makes use of:
  - Module:Sandbox/AbstractWikipedia/UnifiableFeatures, which provides a union-find data structure to represent unifiable linguistic features across lexemes.
- Module:Sandbox/AbstractWikipedia/Wikidata provides helper functions that give the other modules easier access to Wikidata items and lexemes.
- The current module runs the entire NLG pipeline. It relies on the other modules to do this.
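The union-find idea behind unifiable features can be illustrated with a minimal sketch. It is not the code of the UnifiableFeatures module: the UnionFind table, its methods and the feature names are all hypothetical, and the only assumption is that unifying two features should fail when both already carry different values.

```lua
-- Minimal union-find (disjoint-set) sketch. All names are hypothetical and
-- are not taken from Module:Sandbox/AbstractWikipedia/UnifiableFeatures.
local UnionFind = {}
UnionFind.__index = UnionFind

function UnionFind.new()
    return setmetatable({ parent = {}, value = {} }, UnionFind)
end

-- Registers a feature node, optionally with a known value (e.g. "plural").
function UnionFind:add(node, value)
    if self.parent[node] == nil then
        self.parent[node] = node
        self.value[node] = value
    end
end

-- Returns the representative of the node's set, compressing the path.
function UnionFind:find(node)
    local root = node
    while self.parent[root] ~= root do
        root = self.parent[root]
    end
    while self.parent[node] ~= root do
        local next_node = self.parent[node]
        self.parent[node] = root
        node = next_node
    end
    return root
end

-- Unifies two feature nodes. Returns false when both sets already carry
-- different values, i.e. when the features clash.
function UnionFind:unify(a, b)
    local ra, rb = self:find(a), self:find(b)
    if ra == rb then return true end
    local va, vb = self.value[ra], self.value[rb]
    if va ~= nil and vb ~= nil and va ~= vb then
        return false
    end
    self.parent[rb] = ra
    if va == nil then self.value[ra] = vb end
    return true
end

-- Returns the value shared by the node's set, or nil if still unknown.
function UnionFind:valueOf(node)
    return self.value[self:find(node)]
end

local uf = UnionFind.new()
uf:add("noun.number", "plural")
uf:add("adjective.number")            -- value not yet known
uf:add("verb.number", "singular")
```

In this sketch, unifying noun.number with adjective.number makes the adjective share the value plural, whereas a later attempt to unify it with verb.number is rejected because the values clash.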
User-modifiable code & data
The system relies on contributions from users in various ways, both in terms of code and data. When ported to Wikifunctions, these would probably live in the user-visible (and user-modifiable) part of Wikifunctions.
Code
- Module:Sandbox/AbstractWikipedia/Functions is a repository of functions callable by the template language, which can be extended by the user. In Wikifunctions terms, these would be functions defined within Wikifunctions by the user. In particular, language-specific modules can be added as sub-modules, such as Module:Sandbox/AbstractWikipedia/Functions/en for English, or Module:Sandbox/AbstractWikipedia/Functions/he for Hebrew. Any functions defined there override the (language-agnostic) functions defined in the main module.
- Module:Sandbox/AbstractWikipedia/Relations is a repository of specialized functions which correspond to grammatical relations that can be asserted on the slots of the template language. Here too, language-specific modules can be added as sub-modules, such as Module:Sandbox/AbstractWikipedia/Relations/he.
- Module:Sandbox/AbstractWikipedia/Phonotactics contains sub-modules identified by the language code, each of which should have a language-specific applyPhonotactics function, applying phonotactic rules. These sub-modules correspond to the Phonotactics module identified in the architecture document. As implementations, see for example Module:Sandbox/AbstractWikipedia/Phonotactics/en or Module:Sandbox/AbstractWikipedia/Phonotactics/he.
- Module:Sandbox/AbstractWikipedia/Constructors is a module which fetches or creates abstract content for certain types of items using Wikidata properties. It allows filling in for items where no curated abstract content has been created.
- Module:Sandbox/AbstractWikipedia/TextAssembler corresponds to the last stage of the architecture, assembling the output text of the pipeline while adjusting punctuation, spacing and capitalization. This is done in this module in the function constructText, which is intended to be language-agnostic. However, some data tables within the module allow adjusting the realization behavior of specific punctuation marks.
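As a rough illustration of what a language-specific Phonotactics sub-module might look like, here is a sketch of an English a/an rule. The real lexeme objects are defined in the Lexemes module and their structure is not shown here, so the plain text field is an assumption made for illustration, and spacing elements between words are ignored. The rule is spelling-based and would mishandle words such as "hour" or "user".

```lua
-- Hypothetical shape of a Phonotactics/<lang> sub-module. The `text` field
-- is an assumed stand-in for the real lexeme representation.
local p = {}

-- Rewrites the English article "a" to "an" before a word starting with a
-- vowel letter. The list is modified in place, mirroring how the main
-- module calls applyPhonotactics(lexemes) without using a return value.
function p.applyPhonotactics(lexemes)
    for i = 1, #lexemes - 1 do
        local current, following = lexemes[i], lexemes[i + 1]
        if current.text == "a" and following.text:match("^[AEIOUaeiou]") then
            current.text = "an"
        end
    end
end

local lexemes = { { text = "a" }, { text = "American" }, { text = "politician" } }
p.applyPhonotactics(lexemes)
```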
Data
- Module:Sandbox/AbstractWikipedia/AbstractContent is a repository of manually-curated abstract content for specific items.
- Module:Sandbox/AbstractWikipedia/Renderers is the repository for language-specific renderers of constructors. The last element of the module name should correspond to a language code, e.g. Module:Sandbox/AbstractWikipedia/Renderers/he or Module:Sandbox/AbstractWikipedia/Renderers/en.
- Module:Sandbox/AbstractWikipedia/GrammaticalFeatures provides tables which map the Q-ids of Wikidata grammatical features and categories to their internal representation, and which define a canonical ordering of these features (necessary for the lexeme form selection algorithm).
Usage
The module can be used in two different modes:
Write content about a Wikidata item
Using the function content, the system will attempt to write some content about the given item.
The write-up can either be based upon the manually-curated abstract content or be constructed on the fly from existing Wikidata properties. Either way, in order for the abstract content to be realized, appropriate renderers for the realization language must already be defined.
The content function takes two arguments:
- The first argument is the Q-id of the item.
- The second argument is the required language, given as a language code (e.g. he). If omitted, the content language of the project will be used.
Example
{{#invoke:Sandbox/AbstractWikipedia|content|Q937|en}}
Albert Einstein was a German physicist. He was born 14 March 1879 in Ulm and died 18 April 1955 in Princeton.
{{#invoke:Sandbox/AbstractWikipedia|content|Q6279|en}}
Joe Biden is a American politician. He was born 20 November 1942 in Scranton.
Direct Template Realization
Using the function render, one can realize a specific template passed as an argument; any further arguments are made available to that template.
- The first argument is the template itself, using a subset of the template syntax described in the Template Language for Wikifunctions proposal (see the limitations under Notes below).
- The second argument is the language of rendering, given as a language code (e.g. en). If omitted, the content language of the project will be used.
- Any following named arguments are arguments to the given template, which can be evaluated using interpolation. In general, the value of these arguments is passed on as plain text, but when an interpolation is evaluated in the scope of a slot (and not as a function argument), it is handled specially:
  - If the value has the form of an L-id or a Q-id, the relevant function (Lexeme or Label/Person) will be invoked with this value.
  - If the value contains slot syntax (i.e. anything surrounded by { }), the text will be evaluated as a subtemplate, which itself has access to all arguments of the template.
  - Otherwise, the text will be passed on to the TemplateText function.
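The three-way handling of interpolation values in a slot can be sketched as a small classifier. This is only an illustration: classifySlotValue is a hypothetical helper, the exact matching rules of the prototype (for instance how whitespace is treated) are assumptions, and the sketch merely names the function that would handle the value rather than calling the real Lexeme, Label, Person or TemplateText implementations.

```lua
-- Hypothetical classifier for interpolation values evaluated in a slot.
local function classifySlotValue(value)
    if value:match("^L%d+$") then
        return "Lexeme"
    elseif value:match("^Q%d+$") then
        return "Label"          -- or Person, when the item is a human being
    elseif value:match("{.*}") then
        return "subtemplate"    -- contains slot syntax
    else
        return "TemplateText"   -- plain text
    end
end
```

Under these assumptions, the value L7 would be routed to Lexeme, Q937 to Label (or Person), a value containing a slot such as {root:Lexeme(noun)} would be evaluated as a subtemplate, and anything else would be treated as template text.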
Examples
{{#invoke:Sandbox/AbstractWikipedia|render|{nummod:Cardinal(num)} {root:Lexeme(noun)}|en|num=5|noun=L7}}
will render 5 cats while
{{#invoke:Sandbox/AbstractWikipedia|render|{nummod:Cardinal(num)} {root:Lexeme(noun)}|en|num=1|noun=L1122}}
will render 1 dog.
See more examples in User:AGutman-WMF/Template Examples.
Notes
This prototype differs from the Template Language for Wikifunctions proposal in several respects, and elaborates on points which were not completely specified there:
- The functions callable within the templates are limited to those defined in Module:Sandbox/AbstractWikipedia/Functions and its submodules. Similarly, all relation functions must be defined in Module:Sandbox/AbstractWikipedia/Relations and its submodules.
- The implementation of the language-specific function dispatch is different from what is stated in the proposal: instead of having language-code suffixes on function names, the prototype simply loads the relevant language-specific implementations from the appropriate submodule, e.g. Module:Sandbox/AbstractWikipedia/Functions/en for English. The language-agnostic functions are still available (if not overridden) thanks to Lua's metatable mechanism. One could use the same mechanism to define longer chains of language inheritance.
- Subtemplates can be defined as functions using the evaluateTemplate call. See for instance the implementation of QuantifiedNoun in the /Functions module. These subtemplates have access both to their own arguments and to the global arguments passed to the top-level template.
- Alternatively, subtemplates can be used as expansions of interpolation arguments, as explained above. These subtemplates only have access to the global arguments.
- L-ids and Q-ids have special semantics, in that when they appear within a slot (e.g. {L123}) they expand to the appropriate Lexeme or Label invocation (the latter calls the Person function if the Q-id refers to a human being). This also happens if they are passed as interpolation arguments.
- Numbers given within a slot (e.g. {5}) are expanded to the Cardinal invocation. This, however, doesn't happen for numeric interpolation arguments.
- The phonotactics and the spacing/capitalization module haven't been implemented yet.
- Spans of spaces are preserved by the parser, and are considered special elements of text (spacing elements).
- Punctuation is marked specially as punctuation, but there is currently no special treatment of it. To treat punctuation as simple text, one can enclose it in a textual slot (e.g. {"."}), but this doesn't work for the colon and the } symbol, due to limitations of the parser.
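The metatable-based fallback mentioned in the notes above can be sketched as follows. The tables and function bodies are invented stand-ins rather than the contents of the real Functions modules; only the use of setmetatable with an __index table mirrors what the main module does.

```lua
-- Stand-ins for the language-agnostic and English function tables; the
-- function bodies are invented for illustration.
local default_functions = {
    Cardinal = function(n) return tostring(n) end,
    Greeting = function() return "(no greeting defined)" end,
}
local english_functions = {
    Greeting = function() return "hello" end,
}
-- Keys missing from english_functions are looked up in default_functions.
setmetatable(english_functions, { __index = default_functions })

-- A longer chain: a hypothetical regional variant inheriting from English.
local british_functions = {
    Greeting = function() return "good day" end,
}
setmetatable(british_functions, { __index = english_functions })
```

Here british_functions.Cardinal is not defined in the variant or in the English table, so the lookup falls through both levels to the language-agnostic default, while Greeting is resolved at the most specific level that defines it.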
local p = {}
-- This is the main module for the template evaluation to be invoked from
-- content pages
local evaluator = require("Module:Sandbox/AbstractWikipedia/TemplateEvaluator")
local c = require("Module:Sandbox/AbstractWikipedia/Constructors")
local default_functions = require("Module:Sandbox/AbstractWikipedia/Functions")
local default_relations = require("Module:Sandbox/AbstractWikipedia/Relations")
local t = require("Module:Sandbox/AbstractWikipedia/TextAssembler")
-- global variables (populated below)
-- Note that in Wikifunctions the functions, relations and renderers should be
-- available globally as Wikifunctions functions. Thus, the only global variable
-- needed is the realization language variable, and even that is just for
-- convenience.
functions = {}
relations = {}
renderers = {}
language = ''
applyPhonotactics = function (lexemes) end -- do nothing by default
-- Initializes the above global variables, given an optional language code
-- If no language code is given, defaults to the Wiki's content language
local function initialize ( lang )
    if lang then
        language = lang -- global variable
    else -- default to content language
        language = mw.getContentLanguage():getCode()
        mw.log("Using language "..language)
    end
    -- Initialize language-specific functions and relations
    local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Functions/"..language)
    functions = status and module or {}
    setmetatable(functions, { __index = default_functions } )
    local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Relations/"..language)
    relations = status and module or {}
    setmetatable(relations, { __index = default_relations } )
    local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Renderers/"..language)
    renderers = status and module or {}
    -- There are currently no default renderers; to be added if appropriate
    local status, module = pcall ( require , "Module:Sandbox/AbstractWikipedia/Phonotactics/"..language)
    if status then
        applyPhonotactics = module.applyPhonotactics
    else
        mw.log("No phonotactics module found for language "..language)
    end
end
-- This function flattens the lexeme tree structure to a flat list result
local function flatten ( lexemes, result )
    for _, lexeme_list in ipairs(lexemes) do
        if lexeme_list.root then
            -- A nested lexeme tree: recurse into it
            flatten(lexeme_list, result)
        else -- It is a single lexeme
            table.insert(result, lexeme_list)
        end
    end
end
-- This function filters the forms of the lexemes to be consistent with their
-- features (morphosyntactic constraints)
local function applyConstraints ( lexemes )
    for _, lexeme in ipairs(lexemes) do
        lexeme.filterForms()
        lexeme.sortForms() -- To ensure that canonical forms are preferred
    end
end
local function realizeTemplate(template, language_code, args)
    initialize(language_code)
    -- Pipeline: evaluate the template, flatten the resulting tree, prune
    -- forms, apply phonotactics, then assemble the final text
    local lexeme_tree = evaluator.evaluateTemplate(template, args)
    local lexemes = {}
    flatten(lexeme_tree, lexemes)
    applyConstraints(lexemes)
    applyPhonotactics(lexemes)
    return t.constructText(lexemes)
end
-- API function to render a template
function p.render ( frame )
    -- frame.args[1] is the template, frame.args[2] is the optional language code
    return realizeTemplate(frame.args[1], frame.args[2], frame.args)
end
-- API function to write content about a q_id
function p.content ( frame )
    local q_id = frame.args[1] or error "First argument should be Q-id"
    local lang = frame.args[2] -- fallback to content language
    local outline = c.Constructors(q_id)
    local content = ''
    for _, constructor in ipairs(outline) do
        local args = { ["main"] = constructor }
        local result = realizeTemplate("{root:main}", lang, args)
        if #result > 0 then
            if #content > 0 then
                -- Add spacing between sentences: possibly language dependent
                content = content .. ' ' .. result
            else
                content = result
            end
        end
    end
    return content
end
return p