Abstract Wikipedia/Natural language generation system architecture proposal

Proposed by Ariel Gutman

This document describes a proposed architecture for a natural language generation (NLG) system for Abstract Wikipedia. When designing the architecture of an NLG system, the following considerations should be taken into account:

  1. Modularity: the system should be modular, so that various aspects of NLG (e.g. morphosyntactic and phonotactic rules) can be modified independently.
  2. Lexicality: the system should be able both to retrieve lexical data (kept separate from code) and to rely on productive language rules to generate such data on the fly (e.g. forming English plurals with -s).
  3. Recursivity: due to the compositional and recursive nature of most languages,[1] an effective NLG system would have to be recursive itself.
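The lexicality requirement can be illustrated with a minimal Python sketch: stored lexical data is consulted first, and a productive rule is used only as a fallback. The LEXICON table and the function names are hypothetical stand-ins (in the proposal, the stored data would come from Wikidata lexemes), and the plural rule is deliberately rough.

```python
# Hypothetical in-memory stand-in for lexical data kept outside the code
# (in the proposal this role is played by Wikidata lexemes).
LEXICON = {
    ("child", "plural"): "children",
    ("mouse", "plural"): "mice",
}


def regular_plural(lemma: str) -> str:
    """Rough productive rule for English plurals (illustrative only)."""
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    return lemma + "s"


def plural(lemma: str) -> str:
    """Prefer stored lexical data; fall back to the productive rule."""
    stored = LEXICON.get((lemma, "plural"))
    return stored if stored is not None else regular_plural(lemma)


print(plural("child"), plural("year"), plural("box"))
```

Keeping the irregular forms in data and the regular pattern in a rule means that contributors can correct a single lexeme without touching any code.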

In the context of Abstract Wikipedia, another constraint should be considered:

  1. Extensibility: the system should be extensible both by linguists and technical contributors, and by non-technical, non-expert contributors, each working on different parts of the system.

Given the above constraints, it seems reasonable to assume that a single Wikifunctions (=WF) function cannot effectively capture the complexity of a modular NLG system. Instead, multiple functions are needed, each responsible for a different step in the NLG pipeline.

In the current design, individual WF functions cannot:

  • Invoke other WF functions.
  • Fetch data from external sources, such as Wikidata.
  • Alter any global state of the system.

To overcome these limitations, this document proposes an NLG architecture to be run by the WF Orchestrator, which is not subject to any of them. In addition, to enable the participation of non-technical contributors, it proposes the creation of an in-house templating language, which could be run by a dedicated WF evaluator.

An alternative approach would be to lift the design limitations of the WF evaluators, in order to encapsulate the entire NLG architecture in a single WF function (which could then invoke other WF functions). While this approach would change some aspects of the implementation of the system (for example, the orchestration of the pipeline would be editable by WF contributors), the conceptual architecture would remain mostly the same.

At the end of the document, a brief comparison with other suggested approaches is given.

Architecture

As explained above, the complete NLG pipeline cannot be encapsulated in a single Wikifunctions (=WF) function. It should instead be run by the WF orchestrator, which would be able to fetch data from external sources (in particular Wikidata), invoke different WF functions (defined by contributors), and maintain the necessary state while doing so. The envisioned architecture is presented in the following diagram, where dark-blue shapes are elements that would be created by contributors to Wikifunctions (rectangles) or Wikidata (rounded rectangles), while light-blue elements represent functions or data that live inside the WF orchestrator and are therefore not directly open to community contribution.

 

Let us detail the steps:

  1. Given a constructor type, a specific renderer is selected,[2] and the data contained in the given constructor is passed to the renderer as its function arguments.
  2. The renderer is essentially a template: a combination of static text and slots that can be filled with the renderer's arguments, Wikidata lexemes, or the output of other renderers. Templates are relatively easy to understand and write, so authoring renderers should be accessible to non-technical contributors.
  3. The output of the renderer is a dependency syntax tree (using, for instance, the Universal Dependencies (UD) or Surface-Syntactic Universal Dependencies (SUD) formalisms)[3] in which the nodes are uninflected lexemes (identified by their lemmas), augmented with some morphological constraints. In practice, the tree does not need to be fully specified; in particular, static text does not necessarily need to be part of the tree.
  4. Relying on a language-specific grammar specification, the morphological constraints, coupled with the structure of the syntactic tree, allow the lemmas to be inflected, either according to the lexical data present in Wikidata or using the inflectional tables of the grammar specification. The output of this step is a linear sequence of text, minimally annotated with part-of-speech information (i.e. whether a word represents a noun, a verb, a preposition, etc.).
  5. At this step, phonotactic constraints are applied to handle language-specific sandhi phenomena. These can include the selection of contextual forms (e.g. in English a/an) or contraction/crasis of adjacent forms (e.g. French de + le = du).
  6. As a final clean-up step, spacing, capitalization and punctuation may need to be adjusted in order to render the final text to be stored in a Wikipedia article. This step can be modeled in a language-agnostic way, by using (language-dependent) annotations from the previous steps.
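The last two steps can be illustrated with a minimal Python sketch operating on part-of-speech-annotated tokens. The Token class and all function names are hypothetical, the part-of-speech labels merely follow UD conventions, and the a/an rule uses spelling as a crude proxy for pronunciation (so it would be wrong for words such as "hour").

```python
from dataclasses import dataclass

VOWEL_LETTERS = frozenset("aeiou")


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech annotation carried over from step 4


def select_article_en(tokens):
    """Step 5 (English): choose a/an by looking at the following word."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.pos == "DET" and tok.text == "a" and i + 1 < len(tokens):
            if tokens[i + 1].text[:1].lower() in VOWEL_LETTERS:
                tok = Token("an", tok.pos)
        out.append(tok)
    return out


def contract_fr(tokens):
    """Step 5 (French): contract the preposition de + article le into du."""
    out = []
    for tok in tokens:
        if out and out[-1].text == "de" and tok.text == "le" and tok.pos == "DET":
            out[-1] = Token("du", "ADP")
        else:
            out.append(tok)
    return out


def clean_up(tokens):
    """Step 6: join tokens, attach punctuation, capitalize the first letter."""
    text = ""
    for tok in tokens:
        if tok.pos == "PUNCT" or not text:
            text += tok.text
        else:
            text += " " + tok.text
    return text[:1].upper() + text[1:]


sentence = [Token("she", "PRON"), Token("is", "AUX"), Token("a", "DET"),
            Token("activist", "NOUN"), Token(".", "PUNCT")]
print(clean_up(select_article_en(sentence)))
```

Note that clean_up itself contains no language-specific logic: it only relies on the annotations produced earlier, which is what allows this step to be shared across languages.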

In the above architecture, there are three components which need to be curated by community members:

  1. Templatic renderers - these make up the bulk of the needed work, as every constructor needs one templatic renderer per language (though re-use of renderers for parts of sentences is possible). Note that the term Renderer is used here in a narrower sense than in Architecture for a Multilingual Wikipedia. In the latter, the term Renderer refers to an end-to-end data-to-text function, while here we use the term Renderer to refer to a specific component of the NLG pipeline, namely a template. This is no coincidence, since in the above architecture, the other parts of the pipeline are relatively fixed, and don’t need constant curation by community members.
  2. Grammar specifications - these would have to specify the relevant morphological features needed for each language, their hierarchy and how these manifest themselves via dependency relations. These specifications may either be stored as data in Wikidata, or as functions in Wikifunctions (to be decided). It is probable that the creation and curation of these grammars will require substantial linguistic and technical knowledge, but since they are created once per (human) language, this is deemed acceptable.
  3. Wikidata lexemes - these will be curated as they are today, but it would be important that the features they use are in line with the grammar specifications of each language.

Structure of templates

Since the bulk of needed work by community contributors would be the creation of templatic renderers, it is important to make this task as easy as possible, and in particular avoid requiring any coding experience.

Similar to the Composition “language” in Wikifunctions, we can develop an in-house templating language.[4] The templating language should allow specifying a linguistic tree (with UD annotations) over three types of arguments:[5]

  • Static text
  • Terminal functions fetching lemmas from Wikidata, or creating lemmas on the fly from other arguments (e.g. numbers[6]).
  • Other renderers

The templating language will have a dedicated evaluator module, called by the WF orchestrator. The latter will be responsible for passing the output through the various modules of the NLG pipeline outlined above.

Example

Let’s assume we have a simple Constructor conveying the age of a person:[7]

Age(
  Entity: Malala Yousafzai (Q32732)
  Age_in_years: 24
)

To render such a Constructor in English, we will use a templatic notation similar to the following (being a Z14/Implementation type):

{
  "type": "implementation",
  "implements": "Age_renderer_en",
  "template": {
    "parts": [
      {
        "role": "subject",  # grammatical subject
        "type": "function call",
        "function": "Resolve_Lexeme",
        "lexeme": {
          "reference": "Entity"
        }
      },
      {
        "role": "root",  # root of the clause
        "type": "function call",
        "function": "Resolve_Lexeme",
        "lexeme": {
          "value": "be"  # replace with L-id
        }
      },
      {
        "role": "num",  # numerical modifier
        "of": 4,  # Part 4 (“year”)
        "type": "function call",
        "function": "Cardinal_number",
        "number": {
          "reference": "Age_in_years"
        }
      },
      {
        "role": "npadvmod",
        "of": 5,  # Part 5 (“old”)
        "type": "function call",
        "function": "Resolve_Lexeme",
        "lexeme": {
          "value": "year"  # replace with L-id
        }
      },
      {
        "role": "acomp",
        "type": "string",
        "value": "old"
      }
    ]
  }
}

Some of the syntactic roles (npadvmod, acomp) have in fact no agreement effect, so one can leave them out.
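To make the intended result of this template concrete, here is a minimal Python sketch that renders the Age constructor in English. It hard-codes the two agreement effects present in the template (the root verb agrees with the subject, and "year" agrees with the numeral). The FORMS table and the function name are hypothetical stand-ins for Wikidata lexeme forms and for the pipeline described above, and the numeral is rendered as digits rather than through a Cardinal_number function.

```python
# Hypothetical inflection table standing in for Wikidata lexeme forms.
FORMS = {
    ("be", "singular"): "is",
    ("be", "plural"): "are",
    ("year", "singular"): "year",
    ("year", "plural"): "years",
}


def render_age_en(entity_label: str, age_in_years: int) -> str:
    """Render the Age constructor in English, following the template parts."""
    # Parts 1-2: the subject is a single person, so the root verb "be"
    # takes its (third-person) singular present form.
    verb = FORMS[("be", "singular")]
    # Parts 3-4: the numeral determines the grammatical number of "year".
    number = "singular" if age_in_years == 1 else "plural"
    noun = FORMS[("year", number)]
    # Part 5 is static text; word order is fixed by the template itself.
    return f"{entity_label} {verb} {age_in_years} {noun} old."


print(render_age_en("Malala Yousafzai", 24))
```

In the proposed architecture these agreement decisions would not be written by the template author; they would follow from the dependency roles and the grammar specification.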

Structure of grammar

The grammar needs to include the following information:

  1. Which parts of speech the language has.
  2. Which grammatical features are appropriate for each part of speech.
  3. (Possibly) a type hierarchy of the features.
  4. How grammatical relations (i.e. dependency relations) interact with grammatical features and parts of speech.

Note that the first points can be inferred from the Wikidata lexemes available for a given language, but it would be useful to make them explicit as part of a grammar definition, which would also enforce/validate the Wikidata lexeme definitions.[8] One could write such a validator per language as a WF function, which would then run on the Wikidata lexemes to mark if they are correctly annotated according to the language's schema.
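Such a validator could look roughly like the following Python sketch. The schema contents, the feature names, and the function name are all assumptions made for illustration; a real schema would be derived from the grammar specification of each language.

```python
# Hypothetical schema: the features each part of speech is expected to carry.
SCHEMA_EN = {
    "verb": {"person", "number", "tense"},
    "noun": {"number"},
}


def validate_form(pos: str, features: dict, schema: dict = SCHEMA_EN) -> list:
    """Return a list of problems found in the annotation of one lexeme form."""
    expected = schema.get(pos)
    if expected is None:
        return [f"unknown part of speech: {pos}"]
    present = set(features)
    problems = [f"missing feature: {name}" for name in sorted(expected - present)]
    problems += [f"unexpected feature: {name}" for name in sorted(present - expected)]
    return problems


# A verb form annotated without tense is flagged.
print(validate_form("verb", {"person": "third", "number": "singular"}))
```

Returning a list of problems, rather than a single boolean, would let the validator report to lexeme curators exactly what needs fixing.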

As for the grammatical relations, these can be encoded either as data in Wikidata or as functions in WF. Since dependency relations can be implemented as unification of the grammatical features of their nodes, one could implement each relation as a Composition WF function, using the Unify operator as a built-in function. For instance, a "subj" relation for English would be implemented as follows (using shorthand notation):

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    Unify(noun.number, verb.number);
    Unify(noun.person, verb.person);
    Unify(noun.case, NOMINATIVE);

Note that, implemented like this, the subj function is not a pure function, since it modifies its input arguments (and in fact the return value is not used, unless a unification error occurs). To keep things simple, this special behavior would need to be supported by the function evaluator.

One may want to bundle together features which get unified together. For instance, if we observe that number and person are often unified together, we may define a sub-function such as the following:

agr(left, right):
    Unify(left.number, right.number);
    Unify(left.person, right.person);

Then we can redefine the subj relation above as follows:

subj_en(noun, verb): 
    Unify(noun.pos, NOUN);  # Validate types
    Unify(verb.pos, VERB);
    agr(noun, verb);
    Unify(noun.case, NOMINATIVE);
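The shorthand above can be made executable with a minimal Python sketch of side-effecting unification. This is only an illustration under assumptions: the Var class, the node helper, and the small union-find scheme are hypothetical and are not part of Wikifunctions; only the names subj_en, agr and Unify come from the notation above.

```python
from types import SimpleNamespace


class UnificationError(Exception):
    pass


class Var:
    """A feature slot: unbound (value None) or bound, possibly linked to another."""

    def __init__(self, value=None):
        self.value = value
        self.parent = self

    def find(self):
        node = self
        while node.parent is not node:
            node = node.parent
        return node


def unify(a, b):
    """Unify two feature slots, or a slot and a constant, in place."""
    ra = a.find() if isinstance(a, Var) else Var(a)
    rb = b.find() if isinstance(b, Var) else Var(b)
    if ra is rb:
        return
    if ra.value is not None and rb.value is not None and ra.value != rb.value:
        raise UnificationError(f"cannot unify {ra.value!r} with {rb.value!r}")
    ra.value = ra.value if ra.value is not None else rb.value
    rb.parent = ra  # from now on both slots share one value


def node(**known):
    """Build a tree node whose unspecified features are unbound slots."""
    names = ("pos", "number", "person", "case")
    return SimpleNamespace(**{n: Var(known.get(n)) for n in names})


def agr(left, right):
    unify(left.number, right.number)
    unify(left.person, right.person)


def subj_en(noun, verb):
    unify(noun.pos, "NOUN")  # validate types
    unify(verb.pos, "VERB")
    agr(noun, verb)
    unify(noun.case, "NOMINATIVE")


noun = node(pos="NOUN", number="SINGULAR", person=3)
verb = node(pos="VERB")
subj_en(noun, verb)
print(verb.number.find().value, verb.person.find().value)
```

As the text notes, subj_en returns nothing useful: its effect is to bind the verb's number and person slots to those of the noun, or to raise an error when the features clash.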

Modular design of grammars and renderers

It is often the case that languages from the same language family exhibit some grammatical and structural similarities. One may take advantage of this phenomenon by defining a hierarchy of languages and language-families,[9] and allow the NLG system to use dynamic dispatch to the most concrete implementation of a (sub)-renderer or a (sub)-relation.

Other approaches

To date, I'm aware of two other systems that have been proposed to handle the NLG of Abstract Wikipedia.

  1. Grammatical Framework (GF) is an established functional programming language intended to support multilingual natural language generation and understanding (see newsletter description). It has a thriving community of computer scientists, linguists and other enthusiasts who contribute to it.
  2. Ninai/Udiron is a Python-based NLG system built by community member Mahir Morshed, which uses lexeme data from Wikidata and combines them using UD trees. The system has been built with the Abstract Wikipedia project in mind. Some interesting examples of constructors and how they are rendered can be found in the Ninai demonstrations.

While the two systems are different, they can be contrasted with the proposal outlined in this document along similar axes:

  • Both systems are geared toward converting relatively abstract & compositional semantic representations into grammatical structure and then text.
  • They require mastering some programming skills, be it a domain-specific language (GF) or a general programming language (Python).
  • The ordering of the words in the output text is determined by the entire NLG pipeline (e.g. adding a Question operator could change the word order in English).
  • Insofar as the grammar definitions are correct, the output is guaranteed to be grammatical.

The proposal outlined in this document, on the other hand, is specifically intended to make it as easy as possible for people without prior technical knowledge to make contributions. This implies the following:

  • It can work with concrete, non-compositional semantic representations (such as the Age example above). This, however, does not exclude handling more abstract representations.
  • At the entry level, almost no programming skills are required to write templatic renderers. Knowledge of linguistics (in particular dependency annotations) can be useful to achieve grammatical output, and is necessary in order to write the grammar specifications themselves.
  • The ordering of words is determined by the templates themselves, and is not changed later in the pipeline.
  • Output can be ungrammatical, if a template has not been designed correctly.

Footnotes

  1. The question of whether recursion exists in all languages has been hotly debated in recent years.
  2. It may be useful to allow a constructor to be rendered either nominally (e.g. “the marriage of Marie to Pierre”) or verbally (“Marie married Pierre”). In that case, more than one renderer per constructor would be needed.
  3. The SUD formalism is simpler and possibly better suited to NLG tasks. Osborne & Gerdes (2019) provide a discussion of the shortcomings of UD. See also https://surfacesyntacticud.github.io/conversions/ for a comparison of the two formalisms. In either case, we may have to extend the set of dependency relations to capture some patterns needed for NLG, such as pronominal co-reference.
  4. The templating language could be designed to be "syntactic sugar" above the Composition language, and thus it could probably be run by the same evaluator as the Composition language.
  5. See the poster "Using Dependency Grammars in guiding Natural Language Generation" (A. Gutman, A. Ivanov, J. Kirchner, 2019) as well as the corresponding working paper.
  6. One can use Unicode’s Common Locale Data Repository (CLDR) library to render cardinals and ordinals in different languages, as well as other data types such as dates.
  7. In practice the age should probably be calculated from the birthdate, but for the sake of example, it is specified in the constructor. We may moreover envisage a dynamic constructor in which part of the data is calculated on the fly.
  8. Currently there is no consistency in the annotation of lexemes, even within a single language. For example, the form "has" is annotated as "singular, third-person, simple present" while the form "is" is annotated as "third-person singular, indicative present".
  9. Depending on the needed granularity, one may use the existing hierarchical codes as defined in the ISO 639-5 standard, or alternatively rely on the existing language-hierarchy defined in MediaWiki.