Grants:IEG/Proofreading semiautomatically the Catalan Wikipedia with LanguageTool
This project is funded by an Individual Engagement Grant
Project idea
What is the problem you're trying to solve?
Proofreading large amounts of text is a daunting task, but it is very much needed to improve the quality of some Wikipedia pages.
What is your solution?
The objective can be achieved by combining technological tools with linguistic knowledge and intuition in order to minimize (but not eliminate) human supervision. The process involves the following steps:
- Analyze the whole Catalan Wikipedia using the open-source proofreading program LanguageTool.
- Filter and sort the results of the LanguageTool analysis.
- Supervise the filtered and sorted results. This is the non-automatic part of the process.
- Apply the selected results in the corresponding Wikipedia articles with the help of a bot.
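The filter-and-sort step above can be illustrated with a minimal sketch. It is not the project's actual code: the `Match` fields, the rule identifiers (`SPELLING`, `STYLE`), and the sample corrections are all hypothetical and chosen only for illustration. The idea shown is to discard results from rules considered too noisy and to rank the remaining corrections by frequency, so that the human reviewer sees the most common ones first.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One analysis result. Field names are assumptions, not LanguageTool's API."""
    article: str
    rule_id: str
    error: str
    suggestion: str


def filter_and_sort(matches, ignored_rules):
    """Drop matches from ignored rules, then rank corrections by frequency."""
    kept = [m for m in matches if m.rule_id not in ignored_rules]
    counts = Counter((m.error, m.suggestion) for m in kept)
    # Most frequent (error, suggestion) pairs first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data.
matches = [
    Match("Barcelona", "SPELLING", "tambe", "també"),
    Match("Girona", "SPELLING", "tambe", "també"),
    Match("Lleida", "STYLE", "degut a", "a causa de"),
    Match("Tarragona", "SPELLING", "adeu", "adéu"),
]
ranked = filter_and_sort(matches, ignored_rules={"STYLE"})
for (error, suggestion), count in ranked:
    print(f"{count:>3}  {error} -> {suggestion}")
```

Grouping by correction rather than by article means that a single supervision decision can cover every occurrence of the same error, which is one plausible way to keep the non-automatic part small.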
Project goals
The process described above has been tested with a certain degree of success over the last two years. With the knowledge acquired, we now want to achieve these goals:
- To make the process faster, so that a Wikipedia the size of the Catalan one can be proofread in significantly less time.
- To complete the proofreading of the whole Catalan Wikipedia.
- To rewrite and document the code so it can be used by other people in other languages.
Project plan
Activities
- Significantly improve the filtering and selection of the LanguageTool analysis results. This includes minimizing wikitext parsing problems and filtering out sentences in other languages (such as quotations, titles, and bibliography entries) or in non-standard language (archaic or dialectal). These improvements can be made before, during, or after the analysis.
- Create auxiliary tools for the non-automatic supervision: blacklists, etc.
- Do the non-automatic supervision of the results. This will be used to evaluate the success of the previous filtering steps. The LanguageTool rules can also be updated and improved when necessary.
- Document the code so it can be used by other people in other languages.
- Test the process in at least one more language besides Catalan. Announcements will be made to reach potential collaborators willing to take the lead in their own Wikipedias.
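Two of the activities above, reducing wikitext parsing problems before the analysis and supporting supervision with blacklists, can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions cover a few common wikitext constructs (references, templates, internal links, bold and italic markup) and are not a full parser, and both function names and the blacklist format (a set of lowercase words that must never be corrected) are hypothetical.

```python
import re


def strip_wikitext(text):
    """Remove a few common wikitext constructs before proofreading (not a full parser)."""
    text = re.sub(r"<ref[^>]*/>", "", text)  # self-closing references
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove innermost templates repeatedly so that nested ones disappear too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic markers
    return re.sub(r"\s+", " ", text).strip()


def apply_blacklist(candidates, blacklist):
    """Keep only candidate errors that are not in the (lowercase) blacklist."""
    return [c for c in candidates if c.lower() not in blacklist]


sample = (
    "'''Barcelona''' és la [[ciutat|capital]] de [[Catalunya]]."
    "<ref>Font</ref> {{Infotaula|nom={{lang|ca|x}}}}"
)
print(strip_wikitext(sample))
print(apply_blacklist(["Tambe", "adeu"], blacklist={"tambe"}))
```

Cleaning the text before the analysis keeps markup from being reported as errors, while the blacklist is applied afterwards to suppress known false positives that the reviewer has already rejected.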
Budget
- Project development: 3,000 EUR (for six 40-hour work weeks)
- Total Budget: 3,000 EUR
(I dropped a 250 EUR budget allocation for server infrastructure costs, following the committee's recommendation to use WMFlabs instead. If WMFlabs turns out not to be a feasible substitute, I will need this allocation.)
Community engagement
We'll survey our target community at the start and at the end of the period.
Sustainability
The developed code will be available for continued use in Catalan Wikipedia, and it will be easily adaptable to other languages. Collaborators will be needed in order to use it in other Wikipedias.
Measures of success
- Success can be measured by the number of edits made to Catalan Wikipedia articles. It will be on the order of hundreds of thousands. As a rough estimate, I'll make at least 400,000 edits.
- A test (without edits) is carried out in another language (preferably one with good support in LanguageTool). Edits should be made only by active and trusted members of the target Wikipedia.
- Code and documentation are available on GitHub. Documentation is posted in the Final Report for the IEG, and announcements are made to reach other Wikipedias.
Get involved
Participants
Jaume Ortolà (has made more than 600,000 spelling and grammar corrections to Catalan Wikipedia using the bot Langtoolbot; maintainer of Catalan language support in the LanguageTool grammar checker; maintainer of several dictionaries and tools for the Catalan language).
Community Notification
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.
Endorsements
Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal.)