Talk:Community Wishlist/Wishes/Wikipedia Machine Translation Project
This page is for discussions related to the Community Wishlist/Wishes/Wikipedia Machine Translation Project page.
MinT for reader
FYI, the WMF Language team is developing a similar feature, "MinT for Wiki Readers", that allows readers to generate automatic translations (using the MinT service). See also the recent update of this project. SCP-2000 05:02, 1 August 2024 (UTC)
- Very interesting, thanks for this info – I'll look further into it but there are many differences. Three of the larger differences from my proposal are that I'm proposing that these sites are not generated dynamically but are available as static and often indexed websites, that errors are correctable and the MT improvable via specifying rules or errors, and that the most advanced machine translation tools are used; I doubt that MinT is nearly as good as DeepL, for example, but I could be wrong.
- Possibly MinT could be used for what is proposed here and I will look into this further and maybe make a talk page post there. Prototyperspective (talk) 11:09, 1 August 2024 (UTC)
- Thanks @Prototyperspective for this detailed and documented wish, and to @SCP-2000 for sharing MinT for Readers.
I wanted to share a bit more context on the impact MinT has had: https://diff.wikimedia.org/2023/11/20/unlocking-the-worlds-languages-in-wikipedia-a-look-into-mints-impact-so-far/ and share some feedback from users https://diff.wikimedia.org/2023/11/21/what-users-think-about-the-machine-in-translation-mint-service/
I wanted to flag that our processes align with a lot of what you're suggesting: 1. Leverage existing wiki content to populate a language version of the same content, via machine translation. 2. Allow editors to "fix" or amend machine translation to ensure the translation is sensible for the local wiki. JWheeler-WMF (talk) 12:22, 2 August 2024 (UTC)
Keeping translations in sync
In UKWP, I use a respective version of en:Template:Translated page with version and insertversion parameters, specifying a source's and a translation's revisions respectively. I keep a list of articles I've translated in uk:User:Olexa Riznyk/Переклади, and use the uk:User:Olexa Riznyk/translations status.js script, which represents this list as a sortable table with columns #, Article, Importance, Source language, Translation age in days, Number of days the translation is outdated, Number of source changes since translation, Source size change since translation. I use it to decide when to update translations. When I update a translation, I update version and insertversion as well, of course. Something similar could theoretically be used per category, per project etc. for their respective articles that have en:Template:Translated page on their Talk pages. Meanwhile, using insertversion makes it easy to see what changes have been made to a target article after a translation, and this is also valuable, sometimes contributing to updates of a source article. --Olexa Riznyk (talk) 14:40, 2 August 2024 (UTC)
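The columns of the status table described above can be sketched roughly as follows. This is a minimal Python sketch, not the actual translations status.js script: the Revision record, the translation_status function, and the exact meaning given to each column (for example, that "translation age" is counted from the translated source revision) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Revision:
    """One revision of the source article (hypothetical record)."""
    rev_id: int
    day: date
    size: int  # page size in bytes after this revision


def translation_status(source_history, translated_rev_id, today):
    """Compute staleness figures for one translated article.

    source_history: revisions of the source article, oldest first.
    translated_rev_id: the source revision recorded in the `version` parameter.
    """
    idx = next(i for i, r in enumerate(source_history)
               if r.rev_id == translated_rev_id)
    base = source_history[idx]
    later = source_history[idx + 1:]  # source edits made after the translation
    return {
        # Assumption: age is counted from the translated source revision.
        "translation_age_days": (today - base.day).days,
        # Assumption: outdated since the first later source edit.
        "days_outdated": (today - later[0].day).days if later else 0,
        "source_changes": len(later),
        "source_size_change": (later[-1].size - base.size) if later else 0,
    }
```

Sorting a list of such dictionaries by days_outdated or source_changes would give the same kind of prioritisation the table is used for.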
wikitext markup
GPT-4 (in ChatGPT+), with some prompt engineering, deals with wikitext markup pretty well. You could look through the edit history of uk:User:Olexa Riznyk/Чернетка/Конструювання підказок, where I was adding the next section translated by GPT-4 as is, and then fixing the translation. I was considering creating custom actions for the GPT to pull interwiki data, but I'm actually pretty happy with it suggesting wikilinks naïvely, as it often suggests missing redirects. But I'm thinking about implementing actions anyway. --Olexa Riznyk (talk) 14:51, 2 August 2024 (UTC)
Whether to store raw machine translations, and when to involve humans
I don't share the fear that some external actor would create a machine-translated Wikipedia. What for, if online translation is built into modern browsers? I also use the built-in translation in Chrome to read articles in Wikipedia languages I don't know. I cannot imagine that I, as a user, would need a stored raw machine-translated Wikipedia.
I don't know the situation in other Wikipedia languages, but in Ukrainian it is pretty common to translate articles from other languages rather than to write them from scratch, at least in domains I am interested in. Wikipedia is a community project, and it doesn't make sense to stay isolated within the capacity of one's own language community only, not utilizing the capacity of others.
Of course, Wikipedia infrastructure for translating content and for maintaining (updating) translations could be much better. I gave up using Wikimedia's translation infrastructure: it was too glitchy, too inconvenient, it handled wikitext markup terribly, and it didn't learn in context at all (from my corrections of previous sections' translations). I use ChatGPT Plus with some prompt engineering, and get better performance. ChatGPT Plus provides a possibility to create actions for getting data from the web, like interwiki data (although I do not use this yet; I'm pretty satisfied with "guessed" wikilinks, as they sometimes highlight a lack of some redirects).
As for me, the ideal would be a translation system based on some generative LLM with RAG for enriching its context with the "closest" source-translation pairs from updates of translations of the article's previous sections, as well as from translations of other articles, collected through usage of Template:Translated page, including its version and insertversion properties. This should make updating translations very convenient and pretty robust, as the previous translation pair of the same article would be used in the LLM's context as well.
However, there is an additional danger when machine translation becomes "near perfect": that humans will tend to save its output without checking it thoroughly. Some LLMs are capable of outputting their certainties for parts of their output. It would be good for a translation system to provide its human users with these certainties, so that they don't miss important parts needing their verification, and so that they better understand their responsibility for what they accept.
Usage of Template:Translated page with version and insertversion properties (or something like that) creates a possibility for automatic tracking of how outdated existing translations are, and how much they need attention to be updated (how big or old out-of-sync source or target changes are). This could be shown per user's watchlist, per category, per wikiproject, whatever.
--Olexa Riznyk (talk) 17:45, 4 August 2024 (UTC)
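The retrieval step of the RAG idea in the comment above can be sketched roughly as follows. This is a minimal Python sketch under loud assumptions: plain string similarity from the standard library's difflib stands in for whatever embedding-based retrieval a real system would use, and the function names, the memory format, and the prompt wording are all hypothetical. No LLM is called; the sketch only assembles the context.

```python
from difflib import SequenceMatcher


def closest_pairs(new_source, memory, k=2):
    """Return the k stored pairs whose source text is most similar to new_source.

    memory: list of (source_text, corrected_translation) tuples, e.g. collected
    from earlier sections of the same article or from other translated articles.
    Similarity here is difflib's ratio, a stand-in for embedding similarity.
    """
    ranked = sorted(
        memory,
        key=lambda pair: SequenceMatcher(None, new_source, pair[0]).ratio(),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(new_source, memory, k=2):
    """Assemble a few-shot prompt from the closest pairs (hypothetical wording)."""
    parts = ["Translate the wikitext, keeping the markup intact."]
    for source_text, translation in closest_pairs(new_source, memory, k):
        parts.append(f"Source: {source_text}\nTranslation: {translation}")
    parts.append(f"Source: {new_source}\nTranslation:")
    return "\n\n".join(parts)
```

When an article is being updated, its own previous source-translation pair would normally rank highest, which is the robustness property described in the comment.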
- First of all thanks for your insights above, I don't think I will be able to respond to them but they could be useful for this proposed project to see how you used AI tools for related tasks.
- It's not necessarily a fear, it could also be the case that against odds it turns out to be better than if WMF was to be implementing it, e.g. quicker development (e.g. due to larger share of resources being spent on tech development) and maybe it's another open source organization. More likely it would simply be best if Wikimedians in standard ways built this.
- Here's what for: the mentioned error correction and the findability. People don't search the Web in Norwegian, notice they don't find what they looked for, turn to the English Wikipedia article and have it machine translated through some extension they have installed. People don't do that, and I don't worry much about hypotheticals of what they could do. The issue starts with the fact that the starting point is not Wikipedia but a Web search engine, and these days the content or answers to their queries find the searcher, not the other way around. The Wikipedia/MLWP article needs to find the user to whom it is relevant.
- Moreover, people equate their native language Wikipedia with Wikipedia – if there is a Wikipedia article related to their query but the info they're looking for is not there and/or it is very short/low-quality then people would consider that a case closed for Wikipedia and look for other sources in their language rather than go to some other language WP and let it get dynamically translated.
- Furthermore, there are many more advantages beyond the error correction, such as including a video or image in the corresponding language rather than the source article's.
- Translating articles often is good and all, but have you read the section "The problems addressed here"? Translations get out of sync as they don't change when the source article changes, translating requires a lot of time compared to an adjustable MT system and is not done at scale, and most importantly: Ukrainian Wikipedia is also much smaller than ENWP, that is, it has less comprehensive, less sourced and fewer articles, just like any other Wikipedia. The largest ones are German and Spanish Wikipedia, both of which I've edited, and I've seen this in many cases even when not looking at statistics like the one included in the proposal.
- Great point about clarifying uncertainties – I thought maybe two MT tools could be used and then the diffs be used in a similar way, but it would be better if the MT tools themselves would clarify which parts may be uncertain, such as due to ambiguous wording like "light", which can be translated to totally different words such as "Licht" and "leicht" in German or "ligero" in Spanish. All such things only become possible once there is a machine translation and correction system. Another potential thing like that would be to highlight things as needing review when one translated article gets a 'post MT error correction', e.g. in the next iteration considering this correction and treating similar translations with greater 'uncertainty', so that the uncertainty is high enough for the part to be flagged as needing human checking. In this previous example, the sentence could be something like "…narrow columns spanned by light ceilings" and a user error-corrected one language from "bright ceilings" to "lightweight ceilings" – then the other language translations are also adjusted, and in the next iteration, when another article has a similar sentence, e.g. connecting pillars/columns to ceilings/roofs, it has larger uncertainty for it to not mean "bright" and/or greater likelihood of it to mean "lightweight".
- The problem of articles being out of sync is not mainly about tracking how out of sync they are. For example, the target article is also changed by other editors in the meantime, and it's very resource-intensive and cumbersome to get it in sync again. Prototyperspective (talk) 22:06, 4 August 2024 (UTC)
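The "two MT tools, then use the diffs" idea from the reply above can be sketched roughly as follows. This is a minimal Python sketch assuming the two engines' outputs are already available as plain strings; the function name is hypothetical, and word-level disagreement is only a crude proxy for the per-phrase certainty an MT system might report itself.

```python
from difflib import SequenceMatcher


def disagreements(mt_a, mt_b):
    """Return (phrase_from_a, phrase_from_b) pairs where two MT outputs differ.

    Both outputs are split on whitespace and aligned word by word; every
    non-matching stretch is reported as a candidate for human review.
    """
    words_a, words_b = mt_a.split(), mt_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    flagged = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((" ".join(words_a[a_start:a_end]),
                            " ".join(words_b[b_start:b_end])))
    return flagged
```

For the "light ceilings" example, two German outputs that differ only in rendering "light" as "hellen" versus "leichten" would yield a single flagged pair, while identical outputs would yield an empty list.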
Existing diffs of MT-adjustments
Isaac (WMF) provided some relevant info in some phab issue where I mentioned this proposal: MinT apparently has some dataset of post-MT-adjustments, see "in [11]" here (info here). In "Translated Sections:" there are three values: source content, machine translated content, final published content. This may be useful for testing or for building systems that make use of such diffs. It's however very different from the adjustments-diffs part of this proposal, for the following reasons (maybe I got one or two a bit wrong):
- it only has the first adjusted version (by the editor who originally publishes the first version of the section) but here contents would continue getting adjusted indefinitely
- it only works with a very small subset of articles that have been translated (even just those translated using some particular tool) – this doesn't work with millions of already-available static articles.
- each individual adjustment is not e.g. tagged/classed for why it was changed (probably not important), and it is a full section diff, not diffs where one issue is fixed at a time; relevant to this is that in what's proposed here, when contributors fix issues in one article it often also affects many other articles which have the same issues (by either fixing them directly as well with a tag for the diff or by flagging the phrase as needing review)
- the part about learning from these diffs to flag or adjust / correct other phrases seems missing – this is just a dataset and afaik it's not used to improve the MinT translation model(s) either
- people may add or alter content instead of just adjusting or correcting (in MTWP semantic flaws would need to be corrected in the source article)
- it seems like many Wikipedias don't use the tools whose use would be required for this dataset so the dataset misses many languages
Note that this proposal is much broader than learning from adjustments-diffs which is just one component of it.
Moreover, I asked people at MinT whether specifying 'low certainty/confidence of correct translation' for phrases is possible which is one part of this proposal and suggested that the project proposed here could become a successor project to "MinT for wiki readers" at some point (e.g. once MTs reached a certain baseline quality). Prototyperspective (talk) 16:52, 11 November 2024 (UTC) added 2 more points. --Prototyperspective (talk) 11:57, 16 November 2024 (UTC)