Multilingual Wikidata
Multilingual Wikidata is a set of standards and, eventually, functionality for supporting multilingual content within a Wikidata dataset.
Multilingual Datasets
A multilingual Wikidata dataset is any dataset with translatable content, with translatable here meaning language-specific but language-indifferent with respect to the overall entity or model. In particular, datasets that concern themselves with such things as an attribute's original language or the processes by which translation or transliteration occurs should usually not use multilingual attributes; by definition, a translatable attribute is one that is expressed in a particular language, but for which any particular linguistic expression is equivalent to all others.
Consider a simple model for animals, for example:
> DESC animal;
COLUMN        TYPE                                 DESC
------------------------------------------------------------------------------------------
species_name  VARCHAR2(50) NOT NULL                Species name in the Linnaean taxonomy
common_name   VARCHAR2(50) NOT NULL TRANSLATABLE   Animal common name
The species name is a language-specific attribute, but not a multilingual or translatable one, since all names in this taxonomy must be in Latin. The common name, however, is both language-specific and translatable: it does not matter in the model which common name is used when referring to a particular type of animal, nor is there a concept of an original language of a common name from which other common names might derive.
For other types of datasets, however, such concerns are important. For example, in cataloging there is the concept of a parallel title, which is an equivalent of the original title in another language. The process by which a parallel title is assigned is a subject of concern for catalogers. In the case of movies, a film has one original title and is then assigned new titles as it is released in different linguistic markets; these new titles often differ considerably from a direct translation, in order to market the film as effectively as possible.
Database-level Implementation
Every multilingual table in a Wikidata dataset, defined as a table containing at least one translatable column, follows the pattern below during implementation. A base table is created containing the entity primary key and all non-translatable attributes. A second table, with the string _ML suffixed to the base table name, is created containing the primary key and all translatable attributes. Every _ML table also has the following columns:
COLUMN        TYPE                  DESC
------------------------------------------------------------------------------------------
language_id   INT(15)               Language of the translated content; foreign key to Language
primary_lang  TINYINT(1) NOT NULL   Whether content in this language should be considered
                                    "primary" or somehow take precedence over other
                                    translations. For example, if the content is originally
                                    in this language. Primary language content should be
                                    given preference when choosing an expression to do
                                    additional translations from. Only one language should
                                    be tagged as primary.
The combination of base entity primary key and language id will always be unique in the _ML table.
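The base/_ML split can be sketched for the animal example as follows. This is only a minimal illustration using Python's built-in sqlite3 module, not a description of any actual Wikidata schema: the key name animal_id, the language table layout, the SQLite column types, and the helper common_name() are all assumptions, as is the idea of falling back to the primary-language row when no translation exists for the requested language. Only the _ML suffix and the language_id and primary_lang columns come from the description above.

```python
import sqlite3

# Hypothetical schema. Assumed names: animal_id, language, code.
# Taken from the text: the _ML suffix, language_id, primary_lang.
SCHEMA = """
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    code        TEXT NOT NULL UNIQUE
);
CREATE TABLE animal (
    animal_id    INTEGER PRIMARY KEY,
    species_name TEXT NOT NULL            -- non-translatable: stays in the base table
);
CREATE TABLE animal_ML (
    animal_id    INTEGER NOT NULL REFERENCES animal(animal_id),
    language_id  INTEGER NOT NULL REFERENCES language(language_id),
    primary_lang INTEGER NOT NULL DEFAULT 0 CHECK (primary_lang IN (0, 1)),
    common_name  TEXT NOT NULL,           -- translatable attribute
    PRIMARY KEY (animal_id, language_id)  -- one row per entity and language
);
-- Enforce "only one language tagged as primary" per entity.
CREATE UNIQUE INDEX animal_ML_one_primary
    ON animal_ML (animal_id) WHERE primary_lang = 1;
"""


def common_name(conn, animal_id, lang_code):
    """Return the common name in lang_code, or the primary-language
    name if no row exists for that language (an assumed policy)."""
    row = conn.execute(
        """
        SELECT ml.common_name
        FROM animal_ML AS ml
        JOIN language AS l ON l.language_id = ml.language_id
        WHERE ml.animal_id = ? AND (l.code = ? OR ml.primary_lang = 1)
        ORDER BY (l.code = ?) DESC
        LIMIT 1
        """,
        (animal_id, lang_code, lang_code),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO language VALUES (?, ?)", [(1, "en"), (2, "de")])
conn.execute("INSERT INTO animal VALUES (1, 'Canis lupus')")
conn.executemany(
    "INSERT INTO animal_ML VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "wolf"), (1, 2, 0, "Wolf")],
)

print(common_name(conn, 1, "de"))  # row for the requested language
print(common_name(conn, 1, "fr"))  # no French row: primary-language row is used
```

The composite primary key expresses the uniqueness rule directly, so a second German row for the same animal is rejected by the database rather than by application code.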
Wikidata Enhancements
As part of future enhancements, Wikidata should support multilingualism in its data definition UIs through the TRANSLATABLE column/attribute modifier flag. If any entity attribute is flagged as translatable, Wikidata should automatically create a _ML table for the entity in the dataset's underlying SQL DDL.
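One way such automatic generation might work is sketched below. Nothing here reflects an existing Wikidata feature: the Attribute class, the generate_ddl() function, and the animal_id key name are hypothetical, and the Language foreign-key constraint is omitted for brevity. The sketch only shows the routing rule described above, where translatable attributes go to the _ML table and everything else stays in the base table.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """Hypothetical attribute definition as a data definition UI might record it."""
    name: str
    sql_type: str
    translatable: bool = False  # stands in for the TRANSLATABLE modifier flag


def generate_ddl(entity, pk, attributes):
    """Return CREATE TABLE statements: the base table, plus an _ML table
    only if at least one attribute is flagged translatable."""
    base_cols = [f"{pk} INT NOT NULL PRIMARY KEY"]
    ml_cols = [
        f"{pk} INT NOT NULL",
        "language_id INT(15) NOT NULL",
        "primary_lang TINYINT(1) NOT NULL",
    ]
    has_translatable = False
    for attr in attributes:
        column = f"{attr.name} {attr.sql_type}"
        if attr.translatable:
            ml_cols.append(column)
            has_translatable = True
        else:
            base_cols.append(column)

    def create(table, columns):
        body = ",\n    ".join(columns)
        return f"CREATE TABLE {table} (\n    {body}\n);"

    statements = [create(entity, base_cols)]
    if has_translatable:
        # Base key plus language id is unique in the _ML table.
        ml_cols.append(f"PRIMARY KEY ({pk}, language_id)")
        statements.append(create(f"{entity}_ML", ml_cols))
    return statements


animal = [
    Attribute("species_name", "VARCHAR(50) NOT NULL"),
    Attribute("common_name", "VARCHAR(50) NOT NULL", translatable=True),
]
for statement in generate_ddl("animal", "animal_id", animal):
    print(statement)
```

An entity with no translatable attributes yields a single statement, so existing monolingual tables would be unaffected by the flag.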