Research:Improving multilingual support for link recommendation model for add-a-link task

Tracked in Phabricator: Task T342526

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


In a previous project, we developed a machine-learning model to recommend new links for articles[1]; see Research:Link_recommendation_model_for_add-a-link_structured_task.

The model is used for the add-a-link structured task. The aim of this task is to provide suggested edits (in this case, adding links) to newcomer editors in order to break down editing into simpler and more well-defined tasks. The hypothesis is that this leads to a more positive editing experience for newcomers and that, as a result, they will keep contributing in the long run. In fact, the experimental analysis showed that newcomers are more likely to be retained with this feature, and that the volume and quality of their edits increase. As of now, the model is deployed to approximately 100 Wikipedia languages.

However, we have found that the model currently does not work well for all languages. After training the model for 301 Wikipedia languages, we identified 23 languages for which the model did not pass the backtesting evaluation. This means that the model's performance does not meet a minimum quality standard in terms of the accuracy of the recommended links. Detailed results: Research:Improving multilingual support for link recommendation model for add-a-link task/Results round-1

In this project, we want to improve the multilingual support of the model. This means we want to increase the number of languages for which the model passes the backtesting evaluation such that it can be deployed to the respective Wikipedias.

Methods


We will pursue two different approaches to improving multilingual support.

Improving the model for individual languages.

We will try to fix the existing model for individual languages. From the previous experiments, in which we trained the model for 301 languages, we gathered information about potential improvements for individual languages (T309263). The two most promising approaches are:

  • Unicode decode error when running wikipedia2vec to create article embeddings as features. This appeared in fywiki and zhwiki (T325521) and has been documented in the respective GitHub repository, where a fix has also been proposed; however, it has not been merged yet. The idea would be to implement (or adapt if necessary) the proposed fix.
  • Word-tokenization. Many of the languages that failed the backtesting evaluation do not use whitespace to separate tokens (such as Japanese). The current model relies on whitespace to identify tokens when generating candidate anchors for links. Thus, improving word-tokenization for non-whitespace-delimited languages should improve the performance of the models in these languages. We recently developed mwtokenizer, a package for tokenization in (almost) all languages in Wikipedia. The idea would be to integrate mwtokenizer into the tokenization pipeline (a short illustration follows this list).
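
To make the problem concrete, here is a minimal, self-contained illustration (the example sentences are arbitrary):

    # Why whitespace tokenization fails for non-whitespace languages:
    # a Japanese sentence contains no spaces, so str.split() yields a single
    # token and therefore no usable anchor candidates for links.
    en = "Wikipedia is a free encyclopedia that anyone can edit."
    ja = "ウィキペディアは誰でも編集できるフリー百科事典です。"

    print(en.split())  # ['Wikipedia', 'is', 'a', ...] -- plenty of anchor candidates
    print(ja.split())  # ['ウィキペディアは...です。'] -- one token, nothing to link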

Developing a language-agnostic model.

Even if we can fix the model for all languages above, the current model architecture has several limitations. Most importantly, we currently need to train a separate model for each language. This brings challenges for deploying the model to all languages, because we would need to train and run 300 or more different models.

In order to simplify the maintenance work, ideally we would like to develop a single language-agnostic model. We will explore different approaches to developing such a model while ensuring the accuracy of the recommendations. Among others, we will use the language-agnostic revert-risk model as an inspiration, where such an approach has been implemented and deployed successfully.


Results


Improving mwtokenizer


We hypothesize that we can improve language support for the add-a-link model by improving the tokenization for languages that do not use whitespace to separate words, such as Japanese.

As a first step, we worked on the newly developed mwtokenizer package (as part of Research:NLP Tools for Wikimedia Content), a library to improve tokenization across Wikipedia languages, so that it can be integrated into the add-a-link model. Specifically, we resolved several crucial issues (phab:T346798), such as fixing the regex for sentence-tokenization in non-whitespace languages.

As a result, we released a new version (v0.2.0) of the mwtokenizer package which contains these improvements.
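
As a usage illustration of the released package, the sketch below tokenizes a Japanese text into sentences and words. The constructor argument and method names follow the mwtokenizer README as we understand it; treat the exact interface as an assumption and consult the package documentation.

    # Hedged sketch of using mwtokenizer for sentence and word tokenization.
    # The interface below (Tokenizer, language_code, sentence_tokenize,
    # word_tokenize) is assumed from the package README; verify before reuse.
    from mwtokenizer.tokenizer import Tokenizer

    tokenizer = Tokenizer(language_code="ja")
    text = "ウィキペディアは誰でも編集できる。フリー百科事典です。"

    for sentence in tokenizer.sentence_tokenize(text):
        print(list(tokenizer.word_tokenize(sentence)))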

Improving the model for individual languages


Some of the major changes made to improve the performance of the existing language-dependent models (phab:T347696) are:

  • Replacing nltk and manual tokenization with mwtokenizer. This enabled effective sentence and word tokenization of non-whitespace languages and thus improved performance (Merge Request).
  • Fixing a Unicode error that was preventing a few models from running successfully (Merge Request); an illustrative sketch of the defensive-reading pattern follows this list.
  • Fixing a regex that was causing links detected by the model not to be placed appropriately in the output string for non-whitespace languages (Merge Request).
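
The actual fixes are in the linked merge requests; purely as an illustration of the general defensive-reading pattern that avoids a hard UnicodeDecodeError when loading a word2vec-style embeddings text file, here is a minimal sketch (the file path and format are hypothetical):

    # Hypothetical sketch: load a text embedding file while tolerating malformed
    # byte sequences instead of crashing with UnicodeDecodeError.
    def load_embeddings(path):
        embeddings = {}
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) < 3:
                    continue  # skip the "vocab_size dim" header line
                try:
                    embeddings[parts[0]] = [float(v) for v in parts[1:]]
                except ValueError:
                    continue  # skip malformed lines
        return embeddings

    vectors = load_embeddings("embeddings/fywiki.txt")  # hypothetical path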

Having resolved the major errors, we can now run all languages without error and have improved performance in many of the non-whitespace languages using the improved mwtokenizer. Below are the current results for the languages that did not pass backtesting before. Previous results can be found here: Results round-1.

Table showing change in performance for languages that did not pass backtesting earlier.
wiki | precision (before) | precision (after) | recall (before) | recall (after) | comments | passes backtesting
aswiki | 0.57 | 0.68 | 0.16 | 0.28 | improvement! | borderline (precision is below 75%)
bowiki | 0 | 0.98 | 0 | 0.62 | improvement! | True
diqwiki | 0.4 | 0.88 | 0.9 | 0.49 | recall dropped | True
dvwiki | 0.67 | 0.88 | 0.02 | 0.49 | improvement! | True
dzwiki | - | 1.0 | - | 0.23 | improvement! | True
fywiki | error | 0.82 | error | 0.459 | improvement! | True
ganwiki | 0.67 | 0.82 | 0.01 | 0.296 | improvement! | True
hywwiki | 0.74 | 0.75 | 0.19 | 0.30 | similar results | True
jawiki | 0.32 | 0.82 | 0.01 | 0.35 | improvement! | True
krcwiki | 0.65 | 0.78 | 0.2 | 0.35 | slight improvement | True
mnwwiki | 0 | 0.97 | 0 | 0.68 | improvement! | True
mywiki | 0.63 | 0.95 | 0.06 | 0.82 | improvement! | True
piwiki | 0 | 0 | 0 | nan | only 13 sentences | False
shnwiki | 0.5 | 0.99 | 0.02 | 0.88 | improvement! | True
snwiki | 0.64 | 0.69 | 0.16 | 0.18 | similar results | borderline (precision is below 75%, recall is close to 20%)
szywiki | 0.65 | 0.79 | 0.32 | 0.48 | slight improvement | True
tiwiki | 0.54 | 0.796 | 0.5 | 0.48 | slight improvement | True
urwiki | 0.62 | 0.86 | 0.23 | 0.54 | improvement! | True
wuuwiki | 0 | 0.68 | 0 | 0.36 | improvement! | borderline (precision is below 75%)
zhwiki | - | 0.78 | - | 0.47 | improvement! | True
zh_classicalwiki | 0 | 1.0 | 0 | 0.0001 | improvement, low recall | False
zh_yuewiki | 0.48 | 0.31 | 0 | 0.0006 | low recall | False
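
For reference, the precision and recall figures above follow the standard definitions, computed per wiki over the links recommended on the held-out backtesting articles; a minimal sketch:

    # Standard precision/recall over recommended links in the backtesting set.
    def precision_recall(true_positives, false_positives, false_negatives):
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return precision, recall

    # Per the comments in the table, passing requires precision of roughly 75%
    # or more and recall that is not far below roughly 20%.
    # Counts below are illustrative, chosen to reproduce the aswiki ratios.
    print(precision_recall(68, 32, 172))  # -> (0.68, ~0.28)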

The following table shows the current performance of some languages that had passed backtesting earlier. We make this comparison to ensure the new changes do not degrade performance.

Table showing change in performance for some languages that passed backtesting earlier.
wiki | precision (before) | precision (after) | recall (before) | recall (after) | comments
arwiki | 0.75 | 0.82 | 0.37 | 0.36 | improvement
bnwiki | 0.75 | 0.725 | 0.3 | 0.38 | similar results
cswiki | 0.78 | 0.80 | 0.44 | 0.45 | similar results
dewiki | 0.8 | 0.83 | 0.48 | 0.48 | similar results
frwiki | 0.815 | 0.82 | 0.459 | 0.50 | similar results
simplewiki | 0.79 | 0.79 | 0.45 | 0.43 | similar results
viwiki | 0.89 | 0.91 | 0.65 | 0.67 | similar results

Exploratory work for language-agnostic model


Currently, we train a model for each language wiki, and each model is served independently. This creates deployment strain and is not easy to manage in the long run. The main goal is to develop a single model that supports all (or as many as possible) languages in order to decrease the maintenance cost. Alternatively, we could develop a few models, each covering a set of compatible languages.

First, we need to ensure that multiple languages can be trained and served using a single model. To test this hypothesis, we performed some exploratory work on language-agnostic models (phab:T354659). Some of the important changes that were made:

  • Removed the dependency on Wikipedia2Vec by using outlink embeddings. These embeddings were created in-house using Wikipedia links. (Merge Request)
  • Added a grid-search module to select the best possible model. (Merge Request)
  • Added a feature called `wiki_db` that names a wiki (e.g. enwiki, bnwiki). This should ideally help the model when combining multiple languages. (Merge Request)
  • Combined the training data of multiple languages, trained a single model, ran evaluation on each language, and compared performance with single-language models (a rough sketch of this setup follows the list). (Merge Request)
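
As a rough sketch of the combined-training setup (not the production code): per-wiki feature tables are concatenated, the wiki_db column is one-hot encoded alongside the other features, and one classifier is trained on everything. File paths, column names, and the choice of classifier below are placeholders.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    # Hypothetical per-wiki feature tables sharing one schema; "label" marks
    # whether a candidate anchor/link pair is an actual link in the article.
    frames = [pd.read_parquet(f"features/{wiki}.parquet") for wiki in ("bnwiki", "jawiki")]
    data = pd.concat(frames, ignore_index=True)

    # One-hot encode the wiki identifier so a single model can condition on it.
    X = pd.get_dummies(data.drop(columns=["label"]), columns=["wiki_db"])
    y = data["label"]

    # Stand-in classifier; the production model and hyperparameters differ.
    model = GradientBoostingClassifier().fit(X, y)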

To create a language-agnostic model:

  • We first combined the training data of two unrelated languages, and performance did not drop much. This motivated us to scale the experiment to 11 and then to roughly 50 languages. We trained two models on two sets of wikis: one set had 52 central languages from fallback chains, and the other had 44 randomly selected wikis. For each set, we trained a model on all of its languages and evaluated it on each individual language wiki. The performance comparison between the language-agnostic models and the single-language models can be found here: main_v2 and sec_v2. The performance of the language-agnostic model for both sets of languages is comparable to the single-language versions. This suggests we can, in principle, select any set of wikis, perform combined training, and expect very good results.
  • We then extended the experiment and trained a model on all (317) language wikis with a cap of 100k samples per language (see the capping sketch after this list). The evaluations can be found here: all_wikis_baseline. Similar to before, some languages show a drop in performance, but many languages perform almost on par with the single-language models. Specifically, 14% of the languages had a >=10% drop in precision, while the rest were close to the precision of the single-language models. We then increased the cap to 1 million samples per language. Evaluation here: all_wikis_baseline_1M. The performance remains extremely close to the 100k-sample experiment, with a slight decrease in precision in 4 languages and a slight increase in 4 others.
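
A minimal sketch of the per-language cap, assuming the combined training data sits in a pandas DataFrame with a wiki_db column (the real pipeline works on much larger Spark tables):

    import pandas as pd

    def cap_per_wiki(data: pd.DataFrame, cap: int = 100_000, seed: int = 42) -> pd.DataFrame:
        # Keep at most `cap` rows per wiki; smaller wikis are kept in full.
        return (
            data.groupby("wiki_db", group_keys=False)
                .apply(lambda g: g.sample(n=min(len(g), cap), random_state=seed))
                .reset_index(drop=True)
        )

    # capped = cap_per_wiki(combined_training_data, cap=100_000)  # or 1_000_000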

Takeaways: Based on our experiments, we confirm that it is indeed possible to combine languages, even randomly, and expect performance very close to the single-language models. How many models to train, what languages should be trained together, and how many samples to choose are all questions that need more experiments to answer and will mostly depend on the memory and time constraints of training the model(s).

Building the Pipeline


Since the language-agnostic model(s) require mixing and matching across 300+ languages, manually building and testing all models becomes hard to keep track of. We therefore built an Airflow pipeline that can automatically run on various subsets of language Wikipedias to collect data, create training and testing datasets, train models, and perform evaluation. Furthermore, since randomly sampled languages trained together in a language-agnostic setting give relatively good performance, we adopt each shard as the set of wikis to train one model on. These shards were created with wiki size and memory in mind, so we can directly use these groups of wikis to run pipelines that take a similar amount of time per wiki and thereby create a balanced set of language-agnostic add-a-link models.
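
Conceptually, the shard grouping is just a small configuration that the pipeline iterates over, one language-agnostic model per shard; a hypothetical sketch (shard membership and the run_pipeline entry point are placeholders):

    # Hypothetical shard-to-wikis mapping; the real grouping follows the
    # production database shards and was chosen with wiki size and memory in mind.
    SHARD_WIKIS = {
        "shard6": ["frwiki", "jawiki", "ruwiki"],  # placeholder members
        "shard7": ["aswiki", "bowiki", "dvwiki"],  # placeholder members
    }

    def run_pipeline(wikis):
        """Placeholder for: collect data, build train/test sets, train, evaluate."""

    for shard, wikis in SHARD_WIKIS.items():
        run_pipeline(wikis)  # one language-agnostic add-a-link model per shard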

The production code has two components so far:

Note that the code is in the mwaddlink branches of both repositories and is pending merge. Some more work needs to be done before the branches can be fully merged.

  • The Research Datasets repo contains the code that airflow will run. Multiple projects are housed in src/research_datasets. Follow the instructions in the README to set up the repo.
  • src/research_datasets/mwaddlink/__main__.py is the entry point to our pipeline; it contains function calls to all necessary components of the pipeline (a simplified sketch follows this list).
  • All projects use the same conda environment, created from the dependencies listed in the pyproject.toml file. The conda environment also houses the code in this repository. We then put this conda env where airflow can access it so that airflow can run our code.
    • Dev setup: Package the conda env so airflow can access it: pip uninstall -y research-datasets && pip install . && conda pack --ignore-editable-packages --ignore-missing-files -o conda_env.tgz -f && hdfs dfs -put -f conda_env.tgz && hdfs dfs -chmod 755 conda_env.tgz. If both the .tgz file and the airflow dev instance are on the same stat machine, you can link airflow directly to the .tgz; there is no need to put it in hdfs.
    • Prod setup: When the branch is merged, the gitlab CI/CD pipeline will automatically create a packaged .tgz file that can be linked to airflow. Manual packaging will not be necessary.
  • With "mwaddlink.py" = "research_datasets.mwaddlink.__main__:entry_point" in pyproject.toml (which will be replaced by Commands), mwaddlink.py will be a file created under <conda_env>/bin and can be called directly by airflow as an executable.
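
For orientation, a simplified, hypothetical sketch of what such an entry point can look like; the real __main__.py wires up more steps with different signatures, and the step functions below are placeholders (only generate_anchor_dictionary and generate_training_dataset are named elsewhere on this page):

    import argparse

    # Placeholder pipeline steps; the real functions live elsewhere in the package.
    def generate_anchor_dictionary(wikis, snapshot): ...
    def generate_training_dataset(wikis, snapshot): ...
    def train_model(wikis): ...
    def evaluate_model(wikis): ...

    def entry_point():
        parser = argparse.ArgumentParser(description="add-a-link training pipeline")
        parser.add_argument("--wikis", nargs="+", required=True, help="wikis in the shard")
        parser.add_argument("--snapshot", required=True, help="e.g. a monthly snapshot")
        args = parser.parse_args()

        generate_anchor_dictionary(args.wikis, args.snapshot)
        generate_training_dataset(args.wikis, args.snapshot)
        train_model(args.wikis)
        evaluate_model(args.wikis)

    if __name__ == "__main__":
        entry_point()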

Todos:

  • The research-datasets repo is growing and changing. The code in src/research_datasets/mwaddlink/__main__.py will be replaced by Commands, which means we can call all functions directly without having one main entry point. These changes are already made; we just need to remove the entry point from pyproject.toml, call the functions directly from airflow, and fix any errors that come up. See src/research_datasets/__main__.py --help for the list of available commands to call from airflow.
  • Currently this code runs successfully start to finish for small wikis (tested for shard 7 wikis). For large wikis (shard 1, shard 6: enwiki, frwiki, jawiki, ruwiki) the generate_training_dataset script fails with a memory error. This error cannot be solved by simply increasing memory. Some suggestions that can be tested are provided in this slack thread.
  • Some changes are pending before the MR can be merged; see the comments in the MR. A first pass was done at refactoring generate_anchor_dictionary. The broad changes required are:
    • Refactoring to break up large functions.
    • Adding typing
    • Adding unit tests. We can start by adding tests for the utils from here.
    • Changing the way snapshots are chosen (MR Comment)
    • The wikipedia embeddings are currently stored in hdfs at /tmp/research/mwaddlink/embeddings. They were copied over from /home/isaacj/topic_model/link-rec/embeddings as one-time generated embeddings. An airflow pipeline to generate these embeddings is in progress. Once the pipeline is set up, we can link directly to the newly generated embeddings instead of referring to the static ones.
    • Making sure all randomizations are reproducible (e.g. the pyspark .limit() in here); see the sketch after this list.
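
For the reproducibility item, a minimal PySpark sketch of replacing a bare .limit() (whose result depends on partition ordering) with a seeded sample or a deterministic ordering; the DataFrame and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Placeholder DataFrame standing in for the extracted training sentences.
    sentences_df = spark.createDataFrame(
        [(1, 0, "a sentence"), (1, 1, "another sentence"), (2, 0, "yet another")],
        ["page_id", "sentence_index", "text"],
    )

    # Non-deterministic: which rows survive depends on partition ordering.
    # sample = sentences_df.limit(2)

    # More reproducible alternatives:
    sample_a = sentences_df.sample(fraction=0.5, seed=42)                  # seeded sample
    sample_b = sentences_df.orderBy("page_id", "sentence_index").limit(2)  # stable order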

To set up the airflow repo (instructions from here):

  • Clone airflow-dags. Move to the mwaddlink branch.
  • Create a directory where the analytics-privatedata user can create the airflow configuration directory, e.g. sudo -u analytics-privatedata mkdir /tmp/mwaddlink [In a stat machine]
  • Start the airflow dev instance using the analytics-privatedata user, with an unused port, and specifying the created config directory: sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/mwaddlink -p 8989 research [In stat machine]
  • The dev instance will be accessible at the configured port and uses the research DAGs from the active branch of the airflow-dags repo, so make sure the mwaddlink branch is checked out.
  • Running the airflow instance as the analytics-privatedata user is necessary because the airflow DAGs submit spark jobs using skein-based yarn containers, and for the kerberos credentials to propagate properly we need a user with a kerberos keytab (like analytics-privatedata) instead of a kerberos credential cache like normal users have.
  • To see the airflow UI, tunnel port 8989 from the stat machine to your local machine. In a local terminal: ssh stat1008.eqiad.wmnet -N -L 8989:stat1008.eqiad.wmnet:8989.
  • Alternatively, if you use VS Code with the Remote-SSH extension, spinning up the airflow instance will automatically tunnel port 8989 to your local machine; you should see a pop-up at the bottom right of the screen.
  • You should then be able to access airflow UI at http://localhost:8989/
  • Note: Since everyone using airflow dev acts as the analytics-privatedata user, make sure no one else is using the same stat machine, and definitely not the same port. Otherwise, your airflow instance may fail to start and you may simply connect to someone else's instance, causing confusion. To double-check that you are connected to your own instance, go to airflow UI → Admin → Configurations and check that "dags_folder" is set to /srv/home/<your_name>/airflow-dags/research/dags. If it is not your name, your UI might be pointing to someone else's airflow dev setup.

Understanding the DAG:

  • Airflow will link to your conda env. If airflow runs on the same stat machine as your research-datasets repo, and hence your conda env, you can link to the env directly. Otherwise, move it to hdfs and link to the hdfs file like so: dag_config.hadoop_name_node + <conda_path>.
  • Once set up, the airflow UI will list the mwaddlink DAG (a stripped-down DAG sketch follows this list). We can trigger the DAG manually for testing purposes. Eventually the DAG will be given a schedule to run the pipeline(s) automatically at regular intervals. Logs can be found in /tmp/mwaddlink/airflow/logs (note that the /tmp/mwaddlink path was used to create the dev instance).
  • Every time you make changes to the research-datasets code, you need to re-pack the conda env so airflow picks up the latest code.
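
For orientation, here is a stripped-down sketch of what a DAG along these lines looks like with plain Airflow operators; the actual mwaddlink DAG in airflow-dags uses the repository's own factories and Spark/Skein wrappers, and the command, flags, and paths below are illustrative only:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical, simplified DAG; the real DAG uses wmf-specific helpers.
    with DAG(
        dag_id="mwaddlink_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # trigger manually while under development
        catchup=False,
    ) as dag:
        train_shard = BashOperator(
            task_id="train_shard7",
            # mwaddlink.py is the executable created under <conda_env>/bin;
            # the flags are illustrative placeholders.
            bash_command="conda_env/bin/mwaddlink.py --wikis aswiki bowiki --snapshot 2024-01",
        )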

Todos:

  • Some changes will be required when Commands are used in research_datasets. This will change the entry points or script calls in the airflow code.
  • Currently the pipeline is set up to run a single shard. We either need to create a DAG that runs ALL wikis (i.e., all shards), or find a way to change variables through the UI to run individual shards manually.
  • Run tests (re-create fixtures) and create the MR. Since there are many wikis, we may want to avoid generating fixtures for all of them.

More details and learnings about setting up code in research_datasets and airflow_dags can be found in this Google Doc.

Results


The pipeline currently runs successfully for small to medium wikis. Some memory errors (listed in the todos) arise for the larger wikis and will require additional code maneuvering. The pipeline was run on shard 7: all data were collected for the 10 wikis in shard 7, a language-agnostic model was trained on these wikis together, and evaluations were run against the trained model. Below is a comparison of this model's performance with the earlier single-language evaluations (the precision and recall columns in all_wikis_baseline). The performance of the language-agnostic model is very close to, and in some wikis better than, the language-dependent models, despite being trained on multiple languages at once.

Comparison of performance of the language-agnostic (LA) vs language-dependent (LD) models in some wikis.
Wiki | LD Precision | LD Recall | LA Precision | LA Recall
eswiki | 0.83 | 0.49 | 0.79 | 0.53
huwiki | 0.89 | 0.43 | 0.82 | 0.43
hewiki | 0.76 | 0.26 | 0.72 | 0.29
ukwiki | 0.85 | 0.52 | 0.82 | 0.54
arwiki | 0.82 | 0.35 | 0.85 | 0.46
cawiki | 0.85 | 0.48 | 0.78 | 0.41
viwiki | 0.88 | 0.59 | 0.96 | 0.81
fawiki | 0.86 | 0.51 | 0.85 | 0.61
rowiki | 0.88 | 0.52 | 0.88 | 0.58
kowiki | 0.73 | 0.23 | 0.56 | 0.16

Resources


t.b.a.

References

  1. Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939