Grants:Project/Swahili machine translation research
Project idea
What is the problem you're trying to solve?
What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.
The Wikipedia and Wikimedia communities in Tanzania and Kenya do not have access to language tools that would increase the quality and efficiency of translating health articles from English to Swahili. This impedes their ability to populate Swahili Wikipedia with high-quality health articles from English Wikipedia.
What is your solution?
For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.
The first step in solving this problem is to test, on a small scale, the value of two language tools that are readily available for commercially supported languages: can we take what we have learned with those language pairs and apply it to a 'marginalized' language group? The first is an editing tool applied to the source content: does applying this tool help the language editors? Specifically, we have a repository of 12,000 simplified health terms, developed with the support of WikiMedical Project and medical students from University of California, San Francisco. The terms are available in an editing tool. If this repository is used to edit vetted health articles from English Wikipedia, does it improve the Swahili editors' ability to translate more articles at a higher quality? Does it also allow the community to expand to include less-skilled editors?
The second tool is a health-specific machine translation engine. Currently, machine translation (MT) is not a useful tool for Swahili. While some major developers have created Swahili MT, it has not been maintained or improved, and the evidence suggests that it is more of a burden to the translator than an aid. However, with the advent of neural machine translation, we can improve 'generic' engines fairly quickly, especially if we focus on a particular domain (or topic). In this case, the solution is to use the health articles translated into Swahili by the community to build a domain-specific health engine. More parallel health data from the public domain will be added to the engine to make it stronger. The research will test whether this domain-specific engine helps the Swahili Wikipedia community in terms of translation speed, quality, and community participation.
Project goals
What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.
Determine whether plain English editing of source content helps the Swahili Wikipedia community with accuracy of translation and ability to expand the community.
Determine whether a health-specific machine translation for English to Swahili can be more useful to the Wikipedia community than current off-the-shelf Swahili engines, and can help with speed of translation.
Project impact
How will you know if you have met your goals?
For each of your goals, we’d like you to answer the following questions:
- During your project, what will you do to achieve this goal? (These are your outputs.)
- Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)
For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.
Overall, to achieve both goals and to build a sustainable improvement for the Swahili Wikipedia community, we will work directly with members of the community to gauge interest, provide advice on the translation process, and guide the overall research project. We aim to meet with a small group of the community at the beginning, middle, and end of the project, and to keep the broader community informed throughout the process. We also aim to share the research findings with the broader community of editors focused on marginalized languages, to help determine whether language tools can help other communities as well.
Goal One: Determine whether plain English editing of source content helps the Swahili Wikipedia community with accuracy of translation and ability to expand the community.
The first goal will be achieved through a research project in which translation accuracy (level of quality) is measured using vetted health articles from English Wikipedia. A baseline test will be run with established community members, and its results will be compared against a second test run after a plain English term database has been applied to the source articles. In parallel, less-established community members will be asked to take part in the test as well, to see whether this process allows them to become more involved editors. The testing will also include focus group discussions to evaluate the process and the experience before and after the database is applied.
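As a minimal sketch of what "applying" the plain English term database to a source article could look like, the example below substitutes technical terms with simpler equivalents before translation. The three entries in `TERMBASE` and the function name `apply_termbase` are illustrative assumptions only; they are not taken from the actual 12,000-term repository or its editing tool.

```python
import re

# Hypothetical sample entries (keys must be lowercase); the real
# repository and its format are not shown in this proposal.
TERMBASE = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "renal": "kidney",
}


def apply_termbase(text, termbase):
    """Replace technical terms with their plain-English equivalents.

    Longer terms are tried first so multi-word entries win over shorter
    entries they might contain. Matching is case-insensitive and limited
    to whole words. Returns the edited text and the list of terms that
    were replaced, in order of appearance.
    """
    terms = sorted(termbase, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    replaced = []

    def substitute(match):
        original = match.group(0)
        plain = termbase[original.lower()]
        replaced.append(original.lower())
        # Keep a leading capital, e.g. at the start of a sentence.
        if original[0].isupper():
            plain = plain[0].upper() + plain[1:]
        return plain

    return pattern.sub(substitute, text), replaced


edited, replaced_terms = apply_termbase(
    "Hypertension can lead to renal failure.", TERMBASE
)
print(edited)
print(replaced_terms)
```

Returning the list of replaced terms alongside the edited text would let the researchers record how heavily each test article was simplified, which may be useful when comparing the baseline and post-termbase translations.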
Goal Two: Determine whether a health-specific machine translation for English to Swahili can be more useful to the Wikipedia community than current off-the-shelf Swahili engines, and can help with speed of translation.
The second goal will be achieved by building a health-specific machine translation engine for English to Swahili. This engine will build upon existing engines that have low accuracy rates but can be improved with training. The aim is a training set of 15 million words, with a strong bias toward health content. We will then work with the community to test this engine, working again with established community members (to test the speed of the translation process) and with less-established community members (to test their ability to participate more in editing). Metrics will include the BLEU score of the engine (an automatic measure of how closely its output matches reference human translations), speed tests, and feedback from community members in focus group discussions.
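To make the BLEU metric concrete, the sketch below computes a simplified corpus-level BLEU score on a 0 to 100 scale. It assumes whitespace tokenization, one reference translation per segment, n-grams up to length 4, and no smoothing; the function names are our own. It is an illustration of the idea, not the evaluation setup the project will necessarily use, and in practice an established implementation such as sacreBLEU would likely be preferred.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per segment.

    No smoothing is applied, so the score is 0 whenever any n-gram
    order has no matches at all.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(
        math.log(m / t) for m, t in zip(matches, totals)
    ) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    if hyp_len >= ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Placeholder English segments stand in for engine output and
# reference translations.
score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(round(score, 2))
```

Because the score depends on tokenization and on the choice of reference translations, the test set and scoring settings would need to be fixed before comparing the new engine against existing off-the-shelf engines.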
A final report on the two pieces of research will include an overview of the research, the results, and recommendations.
Do you have any goals around participation or content?
Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.
This research project is intended to evaluate whether language tools, made available to a community that works in a 'marginalized' language (that is, a language that has not been the focus of technology development by major developers), can increase participation among community members and increase the amount of content those members add to Swahili Wikipedia. The researchers will test directly with Swahili Wikipedia community members. Additionally, a small group of community members will be asked to participate in an advisory group throughout the research project.
Specific metrics to be achieved:
- Edited health articles from English Wikipedia, using plain English health terms: at least 50 articles.
- Machine translation engine trained with English Wikipedia health articles translated into Swahili: at least 200 articles included in the training of the engine; at least 10 million words total to build the engine.
- Higher level of accuracy of the engine: achieve a BLEU score above 55.
- Testing translation of the plain English health terms: work with at least 10 established community members and 3 less-established community members.
- Testing translation speed with and without the machine translation engine: work with at least 10 established community members and test at least 20 articles.
Project plan
Activities
Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow up with people that are involved with your project?
1) Work with WikiProject Medicine to identify 20 articles for testing the plain English termbase. These articles also will form the base of the health-specific translations to train the engine.
2) Meet with the Swahili Wikipedia community in Tanzania and Kenya; form an advisory group from the community to guide the research, help ensure transparency, and be involved in the testing.
3) Conduct an initial screen of the articles using the plain English termbase.
4) Work with a subset of the community to test translations of the plain English articles. Include a majority of experienced editors, but also at least two to three less experienced editors. Conduct a simple survey and a focus group discussion after the translation process to assess results.
5) Meanwhile, continuously train the engine with Swahili content. This includes using the health-specific Wikipedia articles, translated in parallel from English to Swahili. But it must also include much more open content in order to build a usable engine. TWB will contribute at least 500,000 words that are already available, and it will begin immediately to translate more readily and openly available health and humanitarian information from English to Swahili.
6) Create a specialized machine translation engine in Swahili. This will take at least nine months. TWB will work with established baseline engines and technologists specialized in MT development in order to train an engine using established and new content/data. The engine will be openly available for all Wikipedia community members, ideally added to Content Translation.
7) Test translation using the machine translation engine. This will take place after nine months of training. Again, we will work with the Swahili Wikipedia community for this testing, looking at both speed for more experienced community members and accuracy for less experienced members. The testing metrics will be supplemented with focus group discussions on the experience of working with the MT, including issues and ways to improve the experience.
8) Reporting on the research as well as the overall process will complete the project. This transparent reporting will be provided to the community and also to other language communities.
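The speed test in step 7 could be summarized along the following lines. This is a sketch under stated assumptions: each timed session is recorded as a condition ("baseline" for translating without MT, "mt" for translating with the engine), the number of source words translated, and the minutes taken. The function names and the sample figures are made up for illustration and are not project data.

```python
from statistics import mean


def words_per_hour(word_count, minutes):
    """Translation speed for one timed session."""
    return word_count / (minutes / 60)


def summarize_speed(sessions):
    """Average speed per condition and the relative change with MT.

    `sessions` is a list of (condition, word_count, minutes) tuples,
    where condition is "baseline" (no MT) or "mt" (with the engine).
    """
    rates = {"baseline": [], "mt": []}
    for condition, word_count, minutes in sessions:
        rates[condition].append(words_per_hour(word_count, minutes))
    baseline = mean(rates["baseline"])
    with_mt = mean(rates["mt"])
    return {
        "baseline_wph": baseline,
        "mt_wph": with_mt,
        "change_pct": 100 * (with_mt - baseline) / baseline,
    }


# Made-up sessions for illustration only.
sample_sessions = [
    ("baseline", 500, 60),
    ("baseline", 300, 60),
    ("mt", 600, 60),
    ("mt", 600, 60),
]
print(summarize_speed(sample_sessions))
```

A simple average like this ignores differences between editors and articles; having the same editors work in both conditions on comparable articles would make the comparison more meaningful.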
Budget
How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!
- Meetings in Tanzania and Kenya with community, advisory group, and TWB (3 total): $8,000
- Development of open machine translation, English to Swahili: $40,000
- Server space: $10,000
- Project management: $25,000
- Research expertise (design and administration of tests and focus groups): $10,000
- Reporting (full report development); end-of-research reporting back to community and to other communities: $7,000
- Total: $100,000
Community engagement
Community input and participation help make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?
We are specifically targeting the Swahili Wikipedia community because Swahili is a major language in East Africa, spoken by millions of people as a first, second, or third language. As a lingua franca, it is a critical language for business in the region. Additionally, a large number of Swahili speakers are technology-savvy, and Wikipedia can be accessed on mobile phones, which are the main technology in use across large parts of Swahili-speaking East Africa. Finally, Swahili was Translators without Borders' (TWB) first major language focus in Africa; TWB has trained community members in Kenya both in translation and in Wikipedia editing. As this is a research project, we wanted to focus on one language community that we know well.
Despite its importance as a lingua franca, Swahili is marginalized in terms of language technology. Unlike European languages, many of which are spoken by far fewer people, Swahili receives little attention from developers of technology tools.
For all of these reasons, Swahili is a major language in TWB's language equality project, called Gamayun. Translators without Borders’ Gamayun: The Language Equality Initiative is intended to apply cutting-edge language technologies to improve rapid communications in languages that are currently marginalized. Gamayun does not replace human translation or face-to-face interaction. Instead, it expands information access and complements human translation and face-to-face interaction, providing as much information as possible, rapidly, and increasing the ability to receive information and feedback from community members. Ultimately, the goal is to shift control of communications to the community, allowing community members to share their needs and access information in the form (text or voice) and language they know best.
Get involved
Participants
Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
The overall project will be managed by TWB Kenya, based in Nairobi. TWB has a small office and a manager based in Nairobi, as well as strong connections with several translators trained in health and in Wikipedia editing. These translators will help in building the large Swahili content set for the engine. The manager of the office (Paul Warambo) is a trained Swahili translator and has been with TWB since 2012. Also, a board member for TWB Kenya (Iribe Mwangi) is a Swahili linguist; he will advise on the quality of the dataset.
The development of the dataset will be overseen by TWB's head of translation (Stella Paris). She oversees all translation work at TWB and is currently involved in the development of the Levantine Arabic dataset.
The research portion of the project will be led by TWB's measurement and evaluation manager (Eric DeLuca). He has conducted focus group discussions on language and tested community comprehension in Nigeria, Bangladesh, and Greece. He also is involved with a current test of Levantine Arabic.
The project management, logistics, project design and reporting will be led by TWB's Gamayun Project Manager (Grace Tang). Grace was TWB's first Words of Relief manager, based in Nairobi, and was instrumental in building the first Swahili engine with Microsoft. She is now shaping TWB's approach to language equality through machine translation.
TWB will work with an expert in specialized machine translation engine development for the training of the engine. There are a number of partners to consider, but currently TWB is working with Prompsit, a small group in Spain, to create a specialized engine for Levantine Arabic. Care will be given to ensure quality and openness of the engine.
TWB will work directly with WikiProject Medicine, which has been a partner with TWB since 2011.
Community notification
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc.
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Medicine
Endorsements
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).