Abstract Wikipedia/Template Language for Wikifunctions/Scribunto-based implementation

This is an overview documentation page of a Scribunto-based NLG system, based upon the proposal to create an NLG template language to use in the Abstract Wikipedia project. The Scribunto-based system has two goals: first, to act as a prototype implementation which can inform the future implementation of a similar system in Wikifunctions itself, and second, to function as a stand-alone NLG system based on Wikidata lexemes within the Scribunto environment.

Demo of the system

This page consists of two distinct parts:

  1. Description of the code flow;
  2. Instructions to contributors who may want to write new abstract content or renderers for specific languages.

Code flow: from Q-id to verbalized content

The various modules used in the system have their own documentation pages. The aim of this section is to give an overview of how the components work together.

The flow of the program closely follows the proposed NLG architecture for Abstract Wikipedia.

[Flowchart of the proposed NLG architecture, showing the pipeline from abstract content through the templatic renderer to the realized text]

From Q-id to templates

The entry point of content verbalization is the content function, which takes (in a frame object) a Q-id as well as an optional realization language code (if the language code is not given, the wiki's default language for the user is used instead).

Content retrieval

The content function first retrieves the abstract content (i.e. instantiated constructors) for the given Q-id using the Constructors module. This content can be one of two types:

  • Manually-curated abstract content for the given Q-id, stored in the Abstract Content module.
  • Abstract content generated dynamically (by code in Constructors module) based upon Wikidata properties of the Wikidata item referred to by the Q-id.

These two modes of content retrieval have been discussed in the model articles update.

The retrieved content takes the form of a list of instantiated constructors, which should correspond to the outline of the generated article.

Template selection

For each constructor in the content list, an appropriate template has to be selected and realized. This is done, within the content function, by realizing a "wrapper template" of the form {root:main}, while passing to the template realization function a single template argument, named main, whose content is the constructor itself. The template realization logic (detailed below) interprets this as a template with a single slot asking for the interpolation (i.e. verbalization) of the argument main. Since interpolating an argument which is a constructor requires fetching an appropriate template for it, this amounts to expanding the wrapper template with the constructor-specific template (for the given realization language).
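
As a rough sketch, the selection loop amounts to something like the following (the names follow this page, but the exact signatures are assumptions, not the actual code):

 -- Sketch only: realize each constructor through the {root:main} wrapper
 -- template, so that template selection happens inside the realizer.
 for _, constructor in ipairs(constructors) do
   local text = realizeTemplate("{root:main}", { main = constructor })
   -- "text" is now the verbalization of this single constructor
 end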

Template realization

The main flow of the template realization is done within the realizeTemplate function. This function can also be invoked directly from a Wiki-page using the render function (documented here). This is useful for debugging purposes, or if an ad-hoc template needs to be realized.

The template realization consists of several steps, which closely follow the generic realization algorithm specified for the template language (the corresponding phases of that algorithm are indicated below in parentheses).

Initialization

The initialize function sets up some global variables, to be used throughout the realization pipeline. These are the following:

  • The language variable holds the language code of the realization language. It is either provided by the user, or is set to the project's content language.
  • The functions variable is a table of all functions accessible to the templates. Each entry points to a language-specific implementation or, in its absence, to a language-agnostic one.
  • The relations variable is a table of all relation functions accessible to the templates. Each entry points to a language-specific implementation or, in its absence, to a language-agnostic one.
  • The renderers variable is a table to language-specific renderers of constructors.
  • The applyPhonotactics variable points to a language-specific implementation of the phonotactics module.

Note that in Wikifunctions, all these elements (with the exception of the realization language code) should be accessible globally as Wikifunctions functions.
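
For illustration, the fallback from a language-specific to a language-agnostic implementation could be pictured as follows (a sketch only; the metatable approach is illustrative, not necessarily the actual initialization code):

 local generic = require("Module:Sandbox/AbstractWikipedia/Functions")
 local ok, specific = pcall(require, "Module:Sandbox/AbstractWikipedia/Functions/" .. language)
 -- use the language-specific table if the module exists, with generic fallback
 functions = ok and setmetatable(specific, { __index = generic }) or generic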

Template parsing and evaluation (phases 1 & 2)

The evaluateTemplate function (part of the TemplateEvaluator module) is responsible for the execution of the first squared box in the above flowchart ("Templatic Renderer"). Its output is a rooted lexeme list: an ordered list of elements, each being either a lexeme or a rooted lexeme list, with an additional root field identifying one of these elements as the root. In the above flowchart this data structure is named lemma tree. Note that effectively it is a shallow tree where all nodes are children of the root node (this tree is not necessarily the same as the tree specified by the dependency relations of the template, though they share the same root).

This phase itself consists of three parts:

  1. The TemplateParser module's parse function is called to transform the template, given as a string, to two lists: a structured list representation of its elements, as well as a list representation of the dependency relations to apply. The parser also returns the index of the root element, to be stored in the rooted lexeme list representation. See the module's documentation for the exact return format.
  2. The evaluateElements function traverses the element list and evaluates every template element into a lexeme or rooted lexeme list representation (a simplified dispatch sketch follows this list). The evaluation is done in accordance with the type of each element:
    • Textual elements (be they normal text, spacing characters or punctuation signs) are evaluated using the TemplateText function.
    • Numbers (within slots) are evaluated using the Cardinal function.
    • Lexeme identifiers (within slots), a.k.a. L-ids, are evaluated using the Lexeme function.
    • Item identifiers (within slots), a.k.a. Q-ids, are evaluated using the Label function.
    • Interpolations of arguments are evaluated (in the evaluateInterpolation function) by fetching the argument and then evaluating it using the same logic as above (e.g. if the argument is an L-id it will be evaluated using the Lexeme function). Special logic is applied in the following cases:
      • If the argument evaluates to a sub-template (i.e. a string containing { } ), that subtemplate is recursively evaluated using the evaluateTemplate function (returning a rooted lexeme list).
      • If the argument evaluates to a list of variant templates (a list of tables containing a template field), one appropriate template is selected according to its preconditions (by the selectTemplate function) and evaluated.
      • If the argument evaluates to a Constructor (i.e. a table containing a _predicate field), the corresponding template is fetched from the appropriate language-specific Renderers module (in accordance with given preconditions) and evaluated (this happens in the evaluateConstructor function).
    • Functions are evaluated (in the evaluateFunction function) by invoking the corresponding function with the given arguments. Functions and interpolations passed as arguments to the function are evaluated recursively (in evaluateFunctionArgument), but note that other argument types are simply passed as string arguments to the invoked function without any further processing.
  3. The relations specified in the template are applied (in applyRelations): basically, any relation which corresponds to a known relation function defined in the Relations module (or a language-specific implementation thereof) is applied to the relevant elements (or to their roots, if the elements have been evaluated to a rooted lexeme list). These relation functions unify grammatical features of their two arguments according to the specified grammatical function, effectively propagating and sharing grammatical features among the lexemes, in accordance with the overall linguistic structure.
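
A heavily simplified sketch of the per-element dispatch in step 2 (the element representation and the helper signatures are assumptions; only the function names are taken from this page):

 local function evaluateElement(element, args)
   if element.kind == "text" then            -- text, spacing or punctuation
     return functions.TemplateText(element.value)
   elseif element.kind == "number" then      -- a number within a slot
     return functions.Cardinal(element.value)
   elseif element.kind == "lexeme" then      -- an L-id
     return functions.Lexeme(element.value)
   elseif element.kind == "item" then        -- a Q-id
     return functions.Label(element.value)
   elseif element.kind == "interpolation" then
     return evaluateInterpolation(element, args)
   elseif element.kind == "function" then
     return evaluateFunction(element, args)
   end
 end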

Application of morphosyntactic constraints (phase 3, using lenient pruning)

The output of the last phase is a tree of lexemes, where each lexeme is associated with a set of grammatical features, possibly originating from another lexeme. These features can now act as constraints to prune the list of forms of each lexeme, so that only forms obeying the constraints are retained. At this stage the tree structure is no longer needed, so it is flattened into a simple list of lexemes.

The pruning of the lexemes happens in the applyConstraints function, by invoking the filterForms method of the lexemes module. The filtering algorithm works as follows:

  • Iterate over all forms of the lexeme. For each form:
    • Iterate over the grammatical constraints associated with the lexeme. For each grammatical constraint category:
      • If the form has a feature of the same category:
        • Discard the form if the form's feature is not unifiable with the constraint's feature (and move on to the next form).

In the prototype implementation, any feature is only unifiable with itself or with the empty feature. In a full-fledged implementation, a linguistic hierarchy of features may be desirable, allowing a feature to be unifiable with any sub- or super-feature of it. The above algorithm only ensures that the features of categories shared between the form and the lexeme's constraints are unifiable (i.e. compatible), allowing for additional features on the forms, not mentioned by the constraints. A stricter pruning algorithm could ensure that the form's features are a subset of the constraints imposed on the lexeme, but since we do not have full control over the features of each form (imported from Wikidata), the algorithm above is preferable.
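
A minimal Lua sketch of this pruning step, under the prototype's notion of unifiability (the data layout of lexemes, forms, and constraints is assumed for illustration):

 -- In the prototype a feature is unifiable only with itself or with an
 -- absent (empty) feature.
 local function unifiable(a, b)
   return a == nil or b == nil or a == b
 end

 local function filterForms(lexeme)
   local kept = {}
   for _, form in ipairs(lexeme.forms) do
     local ok = true
     for category, constraint in pairs(lexeme.constraints) do
       if not unifiable(form.features[category], constraint) then
         ok = false   -- the form's feature clashes with the constraint
         break
       end
     end
     if ok then kept[#kept + 1] = form end
   end
   lexeme.forms = kept   -- only forms compatible with the constraints remain
 end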

Note that this pruning algorithm doesn't work well with unary features, i.e. features that can either appear or not on a form, but don't stand in opposition to any other feature (e.g. the contraction feature, which may mark contracted English verbs). Such features will not be affected by this pruning algorithm, whether they appear in the constraints or on the forms (since the algorithm assumes that the lack of a feature is compatible with any feature). If support for such features is required, an additional step should be added to verify that such features are either present or absent both in the constraints and in the form's features.

Ideally, the pruning algorithm should keep exactly one form of the lexeme. In practice, several forms may persist. To consistently output the least marked form in such cases, the forms are ordered according to a canonical order in the sortForms function. This function sorts the forms according to a lexicographic order over the categories and features, as specified in the cannonical_order table. For instance, if no person feature is specified for a verbal lexeme, the various person inflections of the lexeme will survive the pruning process, but the sorting will ensure that a third-person form appears first (if present among the original forms of the lexeme).

The output of this stage is a list of lexemes, where each lexeme has a pruned list of forms compatible with the grammatical constraints imposed on it. Unless further modifications are made in the rest of the pipeline, the first form in the list of pruned forms is the one which will be used for the rendering of the NLG text. (If no forms survive the pruning process, the lemma of the lexeme is used as a fallback.) Note that requesting the string form of a lexeme (using the built-in function tostring) yields this first form (or possibly the lemma), as the function has been redefined within the lexeme module.
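
One way to picture this in Lua is a __tostring metamethod on the lexeme table (whether the module does exactly this, and the field names used here, are assumptions for illustration):

 local lexemeMeta = {
   __tostring = function(lexeme)
     local form = lexeme.forms[1]                         -- first surviving form
     return form and form.representation or lexeme.lemma  -- lemma as fallback
   end,
 }
 -- a toy lexeme, purely for illustration:
 local lex = setmetatable({ lemma = "be", forms = {} }, lexemeMeta)
 local s = tostring(lex)  --> "be", since no form survived pruning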

Application of phonotactic rules (phase 4)

At this stage, language-specific phonotactic and orthographic rules need to be applied. In general, forms can undergo phonotactic alteration depending on their neighboring forms (ignoring spacing and empty forms), so this process requires a linear traversal of the forms and the adaptation of those that need to change. The phonotactic variant of each form may already be stored in the lexeme's list of forms, in which case the right form has to be promoted to the first position in the form list, or there may be special rules altering the existing first form of the list (making use of the lexeme module's helper function replaceByForm).

Concretely, the application of the phonotactics is done with a function called applyPhonotactics. This function is tied (in the initialization phase) to a language-specific implementation, which can be found in a Module:Sandbox/AbstractWikipedia/Phonotactics/xx module (where xx stands for the language code). If such a function is not defined, a default no-op function is applied.

As examples, we can look at the English and Hebrew implementations of this:

  • The English implementation scans the lexemes for the indefinite article a (identified by its lemma and part-of-speech). If such a lexeme is found, the following lexeme (ignoring spacing and empty forms) is inspected, and if it starts with a vowel, the article's form is replaced by the form an (a simplified sketch of this rule follows the list). Note that in the current implementation, a simple list of regular expressions is used to determine whether a form starts with a vowel. Ideally, this information should be stored in and fetched from Wikidata for each lexeme.
  • The Hebrew implementation takes care of certain orthographic and phonotactic alternations happening after Hebrew proclitics. It scans the list of lexemes and if a proclitic, identified by its lemma, is found, the following lexemes may be altered in the following ways:
    • Spaces following proclitics are removed.
    • The definite article (identified by its part-of-speech) is removed following certain proclitics.
    • If the proclitic is followed by a number spelled out by digits, a hyphen is added, in accordance with Hebrew's orthographic conventions.
    • If a proclitic is followed by a word starting with the letter Vav, that letter is doubled, in accordance with Hebrew's writing rules of unvocalized text.
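
For illustration, the English a/an rule could be sketched as follows (the lexeme layout and the vowel test are simplified stand-ins for the module's actual logic):

 local p = {}

 -- crude vowel test; the actual module uses a list of regular expressions
 local function startsWithVowel(text)
   return text:find("^[aeiouAEIOU]") ~= nil
 end

 function p.applyPhonotactics(lexemes)
   for i, lexeme in ipairs(lexemes) do
     if lexeme.lemma == "a" and lexeme.pos == "article" then
       -- inspect the next non-empty, non-spacing lexeme
       for j = i + 1, #lexemes do
         local nextText = tostring(lexemes[j])
         if nextText ~= "" and not nextText:find("^%s+$") then
           if startsWithVowel(nextText) then
             lexeme.forms = { { representation = "an" } }  -- promote "an"
           end
           break
         end
       end
     end
   end
   return lexemes
 end

 return p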

The output of this stage is a list of lexemes, where each lexeme should have as its first form a form compatible with the morphosyntactic and phonotactic constraints imposed on it.

Construction of final text (phase 5)

The construction of the rendered text is done in the constructText function. Basically, this function concatenates the string representation of all the lexemes passed over from the previous stage. Recall that the string representation of a lexeme is normally the first form in its list of forms, which should correspond to the phonotactic and morphosyntactic constraints. While doing so, the function also takes care of special rendering of spacing and punctuation, and applies necessary capitalization. Currently, the following rules are implemented to this effect:

  • If consecutive spacing lexemes (i.e. lexemes which contain just whitespace) are encountered, only the last one is retained (a sketch of this rule follows the list). This is necessary in order to avoid multiple consecutive spaces which may otherwise arise around slots which have evaluated to an empty string.
  • If consecutive trailing punctuation signs are encountered (as defined in the trailing_punctuation table), only the punctuation mark with the highest priority is retained (e.g. a dot suppresses a comma).
  • Trailing punctuation marks also suppress any preceding spaces.
  • The first word of the realized text, as well as any word which follows a punctuation mark triggering capitalization (as defined in the capitalization table), is capitalized. This is done using the ucfirst function, which honors any special capitalization rules of the realization language.
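
As an illustration, the first rule can be sketched as follows (assuming, for simplicity, that the lexemes have already been converted to strings):

 local function collapseSpacing(tokens)
   local out, lastWasSpacing = {}, false
   for _, token in ipairs(tokens) do
     local isSpacing = token:match("^%s+$") ~= nil
     if isSpacing and lastWasSpacing then
       out[#out] = token   -- keep only the last of consecutive spacing tokens
     else
       out[#out + 1] = token
     end
     lastWasSpacing = isSpacing
   end
   return out
 end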

The purpose of the above rules is to allow template authors to write templates as naturally as possible, without giving too much thought to where they should include spaces or punctuation marks. In most circumstances, these heuristics preserve only the necessary spaces and punctuation marks. (Note that if punctuation or spacing needs to be escaped, so as not to be handled by this function, the template author can wrap it in a TemplateText function invocation, which assigns it the text lexeme type, thus avoiding any further processing of this kind.)

The output of this stage is a string of text corresponding to a single template or constructor.

Rendering a full article

As the abstract content of a Q-id may contain several constructors, the last phase of the rendering consists of stitching together the renderings of the individual constructors. This is currently done quite simply by concatenating the output strings from the last phase, adding a space between them when necessary. Note that constructors which haven't been associated with any renderer for a given language are simply omitted from the realization. This allows partial realization of content for such languages, instead of failing the entire realization.

Instructions to contributors

Writing new Abstract Content

You can create new abstract content for any item by editing the AbstractContent module. The abstract content for each item is an entry in the content table, keyed by the item's Q-id.

The entry is itself a table, consisting of a list of constructors to be rendered in the given order. Each constructor is a table whose fields give the content of the constructor. There are currently no firm guidelines on how to design constructors, but in general they should be as language-agnostic as possible. Typically, the field values should be strings, such as Q-ids, but there are no hard restrictions, as fields may also contain numeric or boolean values. In particular, field values may be sub-constructors, i.e. tables.

The only requirement for constructors is that they contain a _predicate field. That field should contain a string which identifies the constructor type. When the constructor is rendered, a corresponding renderer with the same name is looked up.
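
For illustration, a minimal hypothetical entry might look like this (the Q-id, the constructor type, and the field names are all invented; only the _predicate field is actually required):

 content["Q12345"] = {       -- placeholder Q-id, for illustration only
   {
     _predicate = "Person",  -- selects the renderer named "Person"
     name = "Q12345",        -- field values are typically strings such as Q-ids...
     active = true,          -- ...but numbers and booleans are also allowed
   },
 }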

Note that it is also possible to write Lua code, within the Constructors module, which automatically creates constructors for certain types of items. The logic to select the right type of constructor should happen in the Constructors function. The return value should be a list of constructors (possibly consisting of a single one), as specified above.

Creating new renderers

Whether you add renderers to a language which already has some defined, or you create a new renderers module for a new language, the renderers should live in a module named Module:Sandbox/AbstractWikipedia/Renderers/xx, where xx stands for the appropriate language code. The language-neutral Module:Sandbox/AbstractWikipedia/Renderers module defines some general utility functions for evaluating renderers.

The language-specific renderer modules export a table (customarily named p), of which each field is a renderer for a specific constructor type with which it shares the name (e.g. the Person renderer is intended to verbalize a Person constructor). Each renderer consists of a set of template-groups (explained below), of which one has to be named main. When a renderer is used (typically because a constructor of the corresponding type has to be verbalized), it is the main template-group which is verbalized. The other template-groups can be used as sub-template interpolations within the main templates and among themselves.

A template-group is basically a list of variant templates. When the template-group is evaluated, the first template variant in the list whose preconditions hold is selected and verbalized. If no such template variant is found, an empty template, resulting in an empty verbalization, is returned.

Each variant template is a table with three fields (a hypothetical example follows the list):

  • The mandatory template field contains the actual template to verbalize, using the template language syntax (with some additions explained below).
  • An optional validity field lists the fields of the constructor which must be present in order for this variant to be realizable. Note that all templates in a renderer have access to the realized constructor's fields (usable as interpolation arguments).
  • An optional roles field lists the possible grammatical roles this variant fits. Essentially, it is a condition on the dependency label used together with the invocation of the sub-template or the corresponding constructor. Only if that label fits one of those listed in the roles list will the template be evaluated. An empty string "" can match the absence of a label.
  • In the future, an additional conditions field may be added to allow evaluating arbitrary conditional expressions on the constructor's fields.
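
A hypothetical renderer might therefore look as follows (the constructor fields, slot labels, and templates are all invented for illustration; only the general shape follows the description above):

 p.Person = {
   main = {
     { template = "{root:name} {mod:Occupation}",  -- uses the group below
       validity = { "name", "occupation" } },
     { template = "{root:name}",                   -- fallback variant
       validity = { "name" } },
   },
   Occupation = {
     { template = "{root:occupation}",
       roles = { "mod" } },  -- fits only under the "mod" dependency label
   },
 }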

Differences from the canonical template syntax

As mentioned above, the implementation allows a slightly extended syntax in comparison to the one specified in the template language syntax document. These extensions include the following (a short illustration follows the list):

  • L-ids (an L followed by digits) and Q-ids (a Q followed by digits) can be used without being quoted. When they serve as arguments to a function, they are converted to strings. When they appear on their own within slots (either as literals or as interpolated arguments), they act as shorthand notations for the invocation of the template functions Lexeme(L-id) and Label(Q-id), respectively.
  • Similarly, numbers, appearing on their own in slots (either as literals or as interpolated arguments), act as a shorthand for the invocation of the Cardinal(number) template function.
  • Interpolation arguments can contain strings which are interpretable as templates (as they contain a slot: some text enclosed by { }). In that case, the interpolation argument is evaluated as a sub-template (which has access to the same arguments as the invoking template).
  • Similarly, interpolation arguments may contain template-groups (as defined above). This happens in particular within the definition of renderers. In this case, the interpolation consists of selecting the appropriate template variant and evaluating its template.
  • The conditional-functions extension has not been implemented in this prototype.
  • Since the templates are defined within Lua code, comments can be added following them using the standard Lua comment syntax.
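
To illustrate the shorthand notations (the ids are placeholders, not references to actual entities):

 -- each template on the left behaves like the explicit one on the right
 local t1 = "{root:L99}"  -- same as "{root:Lexeme(L99)}"
 local t2 = "{root:Q99}"  -- same as "{root:Label(Q99)}"
 local t3 = "{root:3}"    -- same as "{root:Cardinal(3)}"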

Implementing new relation functions

When creating content for a new language, it is likely that new dependency relations need to be defined, or existing definitions amended. For a dependency relation used in the templates to have an effect on the verbalization, it must have a corresponding relation function defined. Such a function can either be defined for all languages in Module:Sandbox/AbstractWikipedia/Relations, or for a specific language in Module:Sandbox/AbstractWikipedia/Relations/xx, where xx stands for the language code.

In general, it is advisable to define relations in a language-independent way as much as possible. For instance, we know that many languages have subject-verb agreement, but the exact features which agree may differ from language to language. Instead of defining a language-specific implementation for each combination of features, we may define a language-independent subj relation function, which enforces agreement on the person, number, and gender features and assigns the subject the nominative case. Such a function would work even for languages which exhibit only a subset of these agreement features, or which don't have case morphology, as the irrelevant operations become no-ops. In some cases, however, a language-specific implementation is needed; for instance, if we would like to assign ergative case to a subject, this would require a separate, language-specific implementation of the relation function. (It would also be possible to group languages into a hierarchy in which similar languages share implementations, but this has not been done in the prototype.)

For the NLG pipeline to work correctly, the relation functions should always be defined with two input arguments, source and target, corresponding to the slots on which the relation operates. Moreover, the body of the function should in principle only use four operations, to ensure that the order of application of relations is immaterial (the l prefix refers to the Lexemes module):

  • verifyPos(slot, part-of-speech): this allows sanity-checking that the relevant slot is a lexeme with a part-of-speech unifiable with the given one. However, more often than not, this check is too restrictive in practice, so it can be avoided. In principle, the parts-of-speech could be arranged in a type-hierarchy which would allow for a more flexible type checking, but this hasn't yet been implemented in the prototype.
  • l.unifyFeatures(category-to-unify, source, target): unifies the features of the given category of the source and target slots.
  • l.unifyFeatures(source-category-to-unify, source, target, target-category-to-unify): unifies the feature of source category with the feature of the target category.
  • l.unifyWithFeature(category, slot, feature): Unifies the feature given in the slot's category with the passed-in feature. If the slot doesn't have such a category yet, this amounts to assigning the new feature to the given slot's category.

Note that any of these functions may fail if the given features are not unifiable, leading to a failure of the entire realization (with an appropriate error message).
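
Putting these operations together, the language-independent subj relation mentioned earlier might be sketched as follows (the category and feature names, as well as the module path, are assumptions for illustration):

 local l = require("Module:Sandbox/AbstractWikipedia/Lexemes")  -- path assumed
 local p = {}

 function p.subj(source, target)
   -- verifyPos(target, "verb")  -- optional sanity check, often too restrictive
   l.unifyFeatures("person", source, target)         -- person agreement
   l.unifyFeatures("number", source, target)         -- number agreement
   l.unifyFeatures("gender", source, target)         -- gender agreement
   l.unifyWithFeature("case", source, "nominative")  -- subjects get nominative
 end

 return p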

Implementing new functions to use within templates

A key part of the template language is the ability to invoke functions, taking a variable number of arguments, within slots. Language-independent implementations of these functions are defined within the Module:Sandbox/AbstractWikipedia/Functions module, while language-specific implementations are defined in Module:Sandbox/AbstractWikipedia/Functions/xx, where xx stands for the language code. The functions are defined as any other exported Scribunto module functions.

Functions which are invoked at the slot level (in contrast to those invoked as arguments to other functions) must return a single lexeme or a list of lexemes. To easily construct the lexeme data type, the Lexemes module comes in handy (imported as l). Many of the defined functions extract data from Wikidata; for this purpose they can use the Wikidata helper module (imported as wd).

In general, language-specific implementations are needed when a function needs to access a language-specific Wikidata lexeme, or model some language-specific phenomena. Whenever possible, it is better to write language-independent functions. Note that these can serve as fallback implementations, which may then be overridden by a language-specific implementation.

Some of the functions defined are required by the system in order to work properly: these are the TemplateText, Lexeme, Label and Cardinal functions, which are invoked implicitly for certain elements of the template language, as explained above.

Writing functions as sub-templates

In general, writing functions in these modules requires some knowledge of programming in Lua. However, it is possible, and advisable, to define functions which are simply evaluations of sub-templates, where the function arguments are tied to template arguments. As an example, this is the way the QuantifiedNoun function is defined, repeated here:

 -- Realize a sub-template, binding the function's arguments to the
 -- interpolation arguments "num" and "noun" of that sub-template.
 function p.QuantifiedNoun(num, noun)
   return te.evaluateTemplate("{nummod:Cardinal(num)} {root:noun}", { num = num, noun = noun })
 end

The sub-template to use is given as the first argument to the evaluateTemplate function (the te prefix stands for the TemplateEvaluator module, imported within the Functions module). This sub-template follows the normal template-language syntax, including the use of slots, relations, invocations of functions and interpolation arguments. The template thus defined has, however, access only to the interpolation arguments which are defined within the table passed as the second argument to evaluateTemplate (namely { num = num, noun = noun } in the example). As you can see, these are simply bindings of the function arguments to names of interpolation arguments (typically having the same name, but this is not required). By using this kind of function definition, you don't really need to know how to program in Lua, as you can define your functions mostly using the template-language syntax.
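
Such a function can then be invoked from any template slot just like a hand-written one; for instance, a (hypothetical) slot {root:QuantifiedNoun(num, noun)} would verbalize a quantified noun phrase such as "three books", provided the num and noun arguments are available at the invocation site.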