Mingua: A Machine Translation Interlingua

Help brainstorm: I'm starting to develop a computer-readable Interlingua to be used in translating among many languages. I welcome collaboration on this -- see below.

The vision

Here's my vision. Wiktionary is a tremendous resource: it already has many sets of synonyms across many languages. Building on it, I would like to see:

  • Wiktionary users encouraged to add more information about syntax, in a regular format that a computer program can read.
  • A relatively simple translation program that draws on the complex information in Wiktionary, which is constantly being updated.
  • Disambiguated input (see the section below).
  • People using the translation program as an aid in translating articles (at first), and updating the Wiktionary information as they go whenever they see the machine translation getting something wrong. Over time it should get better and better at translating directly. I would like such a program to be accessible on Wikimedia and to use the most up-to-date Wiktionary information at any time, so that people can add information and immediately see how much better the machine translation looks.

We can begin immediately if the programming resources are available: the first program can just list the synonyms, translating one word at a time. Later this program can be replaced by one that actually parses the language.
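
A minimal sketch of what that first word-at-a-time program might look like. The `SYNONYMS` table is a hand-written stand-in for data that would really come from Wiktionary; its contents, its layout and the function name `translate_word_by_word` are assumptions made only for illustration.

```python
# Toy stand-in for synonym data that would come from Wiktionary.
# The entries and the table layout are illustrative assumptions.
SYNONYMS = {
    ("en", "fr"): {
        "the": ["le", "la"],
        "cat": ["chat"],
        "sleeps": ["dort"],
    },
}


def translate_word_by_word(text, source, target):
    """Return (word, candidates) pairs, one per word of the input.

    No parsing is attempted. A word with no entry is returned
    unchanged, so gaps in the dictionary are easy to spot.
    """
    table = SYNONYMS.get((source, target), {})
    result = []
    for word in text.lower().split():
        result.append((word, table.get(word, [word])))
    return result


if __name__ == "__main__":
    for word, candidates in translate_word_by_word("The cat sleeps", "en", "fr"):
        print(word, "->", " / ".join(candidates))
```

Listing every candidate rather than picking one keeps the program simple and leaves the choice to the human translator, which matches the "aid in translating" role described above.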

The name "Mingua"

I just made up the name "Mingua" for this language, which I'm developing here on Wikipedia and which I hope others will help develop, too. "Mingua" is short for "Machine Interlingua".

I hope Mingua won't be centred on English and won't even be centred on Indo-European languages but will have elements common to all or almost all languages. I was therefore pleased to hear that apparently in Chinese it sounds like "Ming-hua" which means "Shining speech". It's more important for the structure of the language to reflect all languages than the name ("a rose by any other name...") but I was pleased to hear this about the name.

Disambiguated input

As User:Sloyment suggested, disambiguated input can be used in the machine translations.

Machine translation may someday be very good. It may someday be able to translate several paragraphs or several pages with no silly mistakes. But it will never be completely free of problems with ambiguity. Even humans will always have problems with ambiguity.

Consider this sentence:

"I put it in the dollhouse in the livingroom."

Does it look clear? What does it mean? It could correspond to any of these situations:

  1. There's more than one dollhouse, and the one I'm talking about is in the livingroom.
  2. There's only one little dollhouse, and I've been carrying it around with me. I was in the livingroom when I put the thing into it. (Compare "I put it in my pocket in the livingroom.")
  3. I put the thing into the tiny livingroom which is inside the dollhouse.

Arguably, the third interpretation is possible only if there is a comma after "dollhouse", but the other two are both possible interpretations.

The first time someone tries to machine-translate a text, this ambiguity could be pointed out, and the person could edit the original Wikipedia article to make it unambiguous.

One advantage is that the article may become easier for humans to read, too.

Ways to disambiguate:

  1. I put it in the dollhouse which is in the livingroom.
     I put it in (the dollhouse in the livingroom).
  2. I put it in the dollhouse when I was in the livingroom.
     I (put it in the dollhouse) in the livingroom.
  3. I put it in the dollhouse's little livingroom.
     I put it ((in the dollhouse) (in the livingroom)). [Clear parsing but not clear meaning.]

The versions with parentheses and other disambiguation marks for machine translation could be stored on a separate Wikipedia for that purpose. Or, as I would prefer, they could be stored on the regular Wikipedia, using symbols that show up only when you edit the page, not when you display it: for example, perhaps {{(}} for an opening bracket. I would prefer to use ordinary parentheses; when someone wants real parentheses in the article, they would have to put backslashes before them or something similar.
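
To make the ordinary-parentheses option concrete, here is a rough sketch of how a program might read such marks. It assumes the backslash convention mentioned above (a backslash before a parenthesis makes it a literal character); the function name `parse_marked` and the nested-list output are my own illustrative choices, not a settled format.

```python
def parse_marked(text):
    """Parse disambiguation parentheses into nested lists of words.

    Unescaped parentheses are grouping marks. A backslash before any
    character keeps that character as literal text, so real
    parentheses can still appear in the article.
    """
    root = []
    stack = [root]      # stack[-1] is the group currently being filled
    word = ""

    def flush():
        nonlocal word
        if word:
            stack[-1].append(word)
            word = ""

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            word += text[i + 1]     # keep the escaped character literally
            i += 2
            continue
        if ch == "(":
            flush()
            group = []
            stack[-1].append(group)
            stack.append(group)
        elif ch == ")":
            flush()
            if len(stack) == 1:
                raise ValueError("unmatched closing mark")
            stack.pop()
        elif ch.isspace():
            flush()
        else:
            word += ch
        i += 1
    flush()
    if len(stack) != 1:
        raise ValueError("unmatched opening mark")
    return root


if __name__ == "__main__":
    print(parse_marked("I put it in (the dollhouse in the livingroom)."))
    print(parse_marked("I (put it in the dollhouse) in the livingroom."))
```

The sketch only recovers the grouping; deciding what each group means (which dollhouse, where the speaker was) is still left to the translation program.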

As the machine translator improves, fewer such disambiguation marks would be needed -- but it would always need some (either disambiguation marks or rewording), even if only in rare cases.

Types of info needed in Wiktionary

The page about the English word "put" should note that this verb normally requires two arguments in addition to the subject: "I put something somewhere", not just "I put something". The page does say "To place something somewhere", which could perhaps be interpreted as meaning that both arguments are required. One more bit of information is also needed: that the second argument is interpreted as a beginning-state. Compare "I put the books on the table" with "I put the books onto the table", where the word "to" indicates a beginning-state. When synonyms are given in other languages, it may also need to be indicated which arguments correspond to which. For example, suppose a language doesn't have a verb exactly like "to cause" but has a verb like "results from", where "A causes B" means the same as "B results from A". The entry needs to say so, rather than just listing "results from" as a synonym for "to cause".
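
One possible way to record which arguments correspond to which is sketched below. The `Translation` record, its `arg_order` field and the helper functions are hypothetical; nothing like this exists in Wiktionary today, and a real entry format would have to be agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """A translation of a verb plus how its arguments line up.

    arg_order[i] is the position, in the source verb's argument list,
    of the argument that fills slot i of the target verb.
    """
    target: str
    arg_order: tuple


# Hypothetical entry: "A causes B" corresponds to "B results from A",
# so the two arguments swap places.
CAUSE_TO_RESULTS_FROM = Translation(target="results from", arg_order=(1, 0))


def apply_translation(entry, args):
    """Return the target verb and the source arguments in target order."""
    return entry.target, tuple(args[i] for i in entry.arg_order)


def render_infix(verb, args):
    """Render a two-argument verb as 'first verb second'."""
    first, second = args
    return f"{first} {verb} {second}"


if __name__ == "__main__":
    verb, args = apply_translation(CAUSE_TO_RESULTS_FROM, ("A", "B"))
    print(render_infix(verb, args))
```

A plain synonym list would correspond to `arg_order=(0, 1)`; the point of the extra field is that the reversed case can be stated explicitly instead of being left for the program to guess.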

Mingua may look something like this:

"X puts Y Z" (or, equivalently, "X moves Y to Z") becomes:

declarative present v2(X, v20(Y, beginning-state Z)),

or in more detail:

declarative present v2(agent:X; patient:v20(patient:Y; locative: beginning-state Z))

where v2 is the verb "to cause" and v20 is the verb "to move". (I'm just beginning and may change which verb gets which number.)
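
As a rough illustration of how a program might hold such an expression in memory and write it out in either notation, here is a small sketch. The `Clause` class and `to_text` function are assumptions for illustration only, "beginning-state Z" is treated as a plain string, and the spacing after colons is normalised, so the detailed output differs slightly in spacing from the line above.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """A Mingua verb with its ordered (role, value) arguments."""
    verb: str
    args: list


def to_text(node, detailed=False):
    """Serialise a Clause (or a plain string) in short or detailed notation."""
    if isinstance(node, str):
        return node
    if detailed:
        parts = [f"{role}:{to_text(value, True)}" for role, value in node.args]
        return f"{node.verb}({'; '.join(parts)})"
    parts = [to_text(value) for _, value in node.args]
    return f"{node.verb}({', '.join(parts)})"


# "X puts Y Z": v2 is "to cause", v20 is "to move" (numbering as in the text).
move = Clause("v20", [("patient", "Y"), ("locative", "beginning-state Z")])
put = Clause("v2", [("agent", "X"), ("patient", move)])


if __name__ == "__main__":
    print("declarative present", to_text(put))
    print("declarative present", to_text(put, detailed=True))
```

Keeping the roles alongside the values means the short form can always be derived from the detailed one, but not the other way round.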

Sorry it looks complicated, but it will normally be read by a computer.

Mingua will have nouns, verbs and other words, and hopefully there will be Wikipedia pages linking the Mingua words to words in other languages.

Mingua verbs will have agents, patients and other arguments, but they will not have subjects and objects. They will be neither active nor passive. Their arguments will simply have a specific order.

For example, suppose v30 is the verb "to give". It will normally have 3 arguments: the agent (the person doing the giving), the patient (the thing being given) and another argument (the person being given to). So "X gives Y to Z" could look like this:

v30(X,Y,Z)

This could be rendered in English as "X gives Y to Z" or just as correctly could be rendered in English as "Y was given to Z by X". Or perhaps "Y was donated to Z by X", and other possibilities. Even "Z was given Y by X." The verb in Mingua is neither active nor passive. It has no surface structure, only deep structure.
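
The idea that one Mingua expression maps to several surface forms can be sketched with simple templates. The `ENGLISH_TEMPLATES` table and `render_all` function are hypothetical; a real generator would need proper English grammar, including tense, rather than fixed strings.

```python
# Hypothetical English surface templates for the Mingua verb v30 ("to give").
# Slots: {0} = agent, {1} = patient (the thing given), {2} = recipient.
ENGLISH_TEMPLATES = {
    "v30": [
        "{0} gives {1} to {2}",
        "{1} was given to {2} by {0}",
        "{2} was given {1} by {0}",
    ],
}


def render_all(verb, args):
    """Return every English rendering listed for this verb."""
    return [template.format(*args) for template in ENGLISH_TEMPLATES[verb]]


if __name__ == "__main__":
    for sentence in render_all("v30", ("X", "Y", "Z")):
        print(sentence)
```

The argument order in `v30(X,Y,Z)` never changes; only the templates decide which argument ends up as the English subject.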

The verb "to rain" would have no required arguments:

"rain"()

In English it gets an impersonal pronoun "it" as a subject; in Mingua it needs no subject. The "it" in English (and similar words in some other languages) doesn't represent anything in deep structure.

Collaboration wanted

If you want to collaborate, let me know, and perhaps this can be moved to a project page which multiple users can edit. Also let me know about other similar projects. I'm already aware of R. Morneau's copyrighted work on an Interlingua with words like "kopumba", the Machine Translation Project page listing a number of resources (see link on my user page), and the Apertium project for translating between similar languages.