Grants:Project/Anti-vandal bots for Vietnamese Wikipedia

status: proposed
Anti-vandal bots for Vietnamese Wikipedia
summary: In this project, we will develop several methods that apply deep learning networks to counter vandalism on Vietnamese Wikipedia.
target: Vietnamese Wikipedia
type of grant: tools and software
amount: 15000 USD
type of applicant: individual
grantee: Alphama
contact: alphamawikipedia@gmail.com
created on: 10:31, 16 June 2021 (UTC)

Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

Vietnamese Wikipedia is a language edition with more than 1 million articles but only about 100 active editors working daily. As a result, the community is sometimes overloaded and cannot keep vandalism under control. In this project, we will develop approaches that apply deep learning networks to detect vandalism and revert the offending edits. We will start with simple vandalism and then move on to detecting more sophisticated cases.

What is your solution?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

Deep learning (DL), now one of the prominent approaches in natural language processing, applies neural networks with millions of units, which improves performance compared with classical approaches. With the rapid development of DL, the scientific community now has many methods and pre-trained models that have proved to work well on the required tasks. The Wikimedia movement's future plans aim to apply technology to content development, so we believe Vietnamese Wikipedia will need to integrate DL into content patrolling sooner or later.

For each article revision, we feed its content together with its editor information (IP address, username) into the classification model and obtain a category label (vandal, spam, unknown, promotion, etc.) along with a confidence score for that label. Depending on the output, the system decides whether or not to revert the edit. We will build on text classification models available on huggingface.co. We will also test the input with LSTM and seq2seq networks to see which method is best for detecting and classifying vandalism.
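
The decision step described above could look roughly like the following Python sketch. It assumes the classifier returns a label and a confidence score between 0 and 1. The `Prediction` type, the `decide_action` function, the 0.9 threshold, and the three action names are hypothetical illustrations, not part of the proposal or of any existing bot.

```python
from dataclasses import dataclass

# Labels taken from the proposal; which of them should trigger a revert
# is an assumption made for this sketch.
REVERT_LABELS = {"vandal", "spam", "promotion"}


@dataclass
class Prediction:
    label: str    # category predicted by the classifier
    score: float  # confidence of that label, between 0 and 1


def decide_action(prediction: Prediction, revert_threshold: float = 0.9) -> str:
    """Map a classifier output to one of: 'revert', 'flag_for_review', 'keep'."""
    if prediction.label in REVERT_LABELS:
        if prediction.score >= revert_threshold:
            # Confident enough to revert automatically.
            return "revert"
        # Suspicious but uncertain: leave the decision to a human patroller.
        return "flag_for_review"
    # Any other label (including 'unknown') is left untouched.
    return "keep"
```

Sending low-confidence cases to human review rather than reverting them is one way to limit false reverts while the model is still being evaluated.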

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

In this project, we set three goals:

  • Improve community awareness of vandalism and the types of vandalism
  • Help counter vandalism in the content and reduce the human effort spent on patrolling
  • Provide a helpful approach and materials that show the community how to build bots, and raise technology awareness for similar tasks.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

We would like to gather data and community opinions about vandalism on Vietnamese Wikipedia and synthesize them in the form of Wikipedia guidelines/articles. This step helps us understand the community's perspective on vandalism categories and vandalism behaviours. In the future, new editors can refer to our guidelines to continue developing the content or related tasks.

Our technologies will be shared publicly and broadly with the community so that others can reuse and extend the tools/bots to counter vandalism. We hope to be pioneers in opening a new chapter in the development of Vietnamese Wikipedia.

We measure the results by three means:

  • The opinions of the community, gathered through discussions or a survey with at least 20 active editors.
  • We aim for more than 1000 vandalism edits to be identified and reverted within several months by reputable editors using the tool.
  • The model's F1 score and accuracy should each be at least 80%.
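
As a minimal sketch of how the last target could be checked, the following Python code computes precision, recall, F1 and accuracy for a binary vandalism/not-vandalism labelling and compares them with the 80% target. The function names and the 0/1 label encoding are assumptions made for illustration; a multi-class evaluation would need per-class or averaged scores instead.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1 and accuracy for 0/1 labels (1 = vandalism)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # vandalism caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good edit flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # vandalism missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good edit kept
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def meets_target(metrics, target=0.80):
    """True when both F1 and accuracy reach the target from the proposal."""
    return metrics["f1"] >= target and metrics["accuracy"] >= target
```

Because vandalism is usually a small share of all edits, accuracy alone can look high even for a weak model, which is why the sketch checks F1 alongside it.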

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.

We would like to use this project as motivation to revive the Vietnam Wikimedians User Group, which has been inactive due to COVID-19 and the hiatus of key members.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

  • Gather vandalism patterns: we will collect vandalism examples in a pre-defined form (contributor information, article information, vandalism revision, ...).
  • Train deep learning models on the dataset and evaluate the results on the dev set and train set, comparing them with human evaluation.
  • Optimize the trained model and integrate it into pywiki bots for use on Vietnamese Wikipedia.
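
The pre-defined form mentioned in the first activity could be represented as one record per revision, stored as JSON Lines. The sketch below is only an illustration: the `VandalismSample` class and all of its field names are hypothetical, chosen to mirror the three groups of information named above (contributor, article, revision).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class VandalismSample:
    # Revision information
    revision_id: int
    added_text: str
    # Article information
    page_title: str
    # Contributor information
    editor: str          # username, or IP address for anonymous edits
    is_anonymous: bool
    # Human-assigned category, e.g. "vandal", "spam", "promotion", "unknown"
    label: str


def to_jsonl(samples):
    """Serialize samples as JSON Lines, keeping Vietnamese text readable."""
    return "\n".join(
        json.dumps(asdict(s), ensure_ascii=False) for s in samples
    )


def from_jsonl(text):
    """Parse JSON Lines back into VandalismSample records."""
    return [
        VandalismSample(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]
```

A line-per-record format keeps the corpus easy to append to from a bot and easy to review by hand.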

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

Here is the budget for each task:

  • Survey and discussions about vandalism and related topics: 1200 USD
  • Create articles about vandalism on Vietnamese Wikipedia: 300 USD
  • Gather vandalism patterns (automatically by bots and by humans, organized as a corpus): 1000 USD
  • Research anti-vandalism methods (literature review, scholar consultation, testing available approaches): 1500 USD
  • Develop and build models: 5000 USD
  • Run and test models over the dataset (model optimization, model comparison, human evaluation methods, automatic evaluation methods): 5000 USD
  • Integrate into pywiki bots: 1000 USD

In total, our budget is 15000 USD.

Community engagement

Community input and participation helps make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?

The community will take part in the discussions, surveys and human evaluations about vandalism and types of vandalism. In this project, we plan to notify the community through Vietnamese Wikipedia and by email. We also hope to attract young editors who work in fields related to the project.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Here are the members of this project. We would like to name the members with whom we believe we can cooperate best:

  • Alphama (talk · contribs) - I am a sysop of viwiki. My interests are Natural Language Generation, Knowledge Bases, Sentiment Analysis and related fields.

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).