
Research:Improving multilingual support for link recommendation model for add-a-link task

Tracked in Phabricator:
Task T342526
This page documents a completed research project.


In a previous project, we developed a machine-learning model to recommend new links for articles[1] (see Research:Link_recommendation_model_for_add-a-link_structured_task).

The model is used for the add-a-link structured task. The aim of this task is to provide suggested edits (in this case, adding links) to newcomer editors, breaking editing down into simpler and more well-defined tasks. The hypothesis is that this leads to a more positive editing experience for newcomers and that, as a result, they will keep contributing in the long run. In fact, the experimental analysis showed that newcomers are more likely to be retained with this feature, and that the volume and quality of their edits increase. As of now, the model is deployed to approximately 100 Wikipedia languages.

However, we have found that the model currently does not work well for all languages. After training the model for 301 Wikipedia languages, we identified 23 languages for which the model did not pass the backtesting evaluation. This means that we think the model's performance does not meet a minimum quality standard in terms of the accuracy of the recommended links. Detailed results: Research:Improving multilingual support for link recommendation model for add-a-link task/Results round-1

In this project, we want to improve the multilingual support of the model. This means we want to increase the number of languages for which the model passes the backtesting evaluation such that it can be deployed to the respective Wikipedias.

Methods


We will pursue two different approaches to improving multilingual support.

Improving the model for individual languages.

We will try to fix the existing model for individual languages. From the previous experiments in which we trained the model for 301 languages, we gathered some information about potential improvements for individual languages (T309263). The two most promising approaches are:

  • Unicode decode error. Running wikipedia2vec to create article embeddings as features raised a Unicode decode error in fywiki and zhwiki (T325521). This has been documented in the respective GitHub repository, where a fix has also been proposed; however, it has not been merged yet. The idea would be to implement (or adapt if necessary) the proposed fix.
  • Word-tokenization. Many of the languages that failed the backtesting evaluation do not use whitespace to separate tokens (for example, Japanese). The current model relies on whitespace to identify tokens in order to generate candidate anchors for links. Thus, improving word-tokenization for non-whitespace-delimited languages should improve the performance of the models in these languages. We recently developed mwtokenizer, a package for tokenization in (almost) all languages in Wikipedia. The idea would be to integrate mwtokenizer into the tokenization pipeline (see the sketch after this list).
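
To make this concrete, below is a minimal sketch of how mwtokenizer could slot into anchor-candidate generation. The Tokenizer interface shown follows the package's documented usage but should be treated as an assumption (check the mwtokenizer README for exact signatures), and the n-gram candidate step is a simplification of the real pipeline.

    # Sketch: anchor-candidate generation backed by mwtokenizer.
    # The Tokenizer API here is assumed from the package docs; verify before use.
    from mwtokenizer.tokenizer import Tokenizer

    def candidate_anchors(text, language_code, max_ngram=3):
        """Yield n-grams of word tokens as candidate link anchors."""
        tokenizer = Tokenizer(language_code=language_code)
        for sentence in tokenizer.sentence_tokenize(text):
            # word_tokenize handles non-whitespace-delimited scripts (e.g. ja, zh),
            # which plain whitespace splitting cannot.
            tokens = [t for t in tokenizer.word_tokenize(sentence) if t.strip()]
            for n in range(1, max_ngram + 1):
                for i in range(len(tokens) - n + 1):
                    # Joining with "" suits Japanese; whitespace languages
                    # would re-join with " " instead.
                    yield "".join(tokens[i:i + n])

    # Japanese example: no whitespace between tokens.
    print(list(candidate_anchors("ウィキペディアは百科事典です。", "ja")))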

Developing a language-agnostic model.

Even if we can fix the model for all of the languages above, the current model architecture has several limitations. Most importantly, we currently need to train a separate model for each language. This makes deployment across all languages challenging, because we need to train and run more than 300 different models.

In order to simplify the maintenance work, we would ideally like to develop a single language-agnostic model. We will explore different approaches to developing such a model while ensuring the accuracy of the recommendations. Among others, we will use the language-agnostic revert-risk model as an inspiration, since such an approach has been implemented and deployed there with success.


Results


Improving mwtokenizer


We hypothesize that we can improve language support for the add-a-link model by improving tokenization for languages that do not use whitespace to separate words, such as Japanese.

As a first step, we worked on the newly developed mwtokenizer package (part of Research:NLP Tools for Wikimedia Content), a library to improve tokenization across Wikipedia languages, so that it can be integrated into the add-a-link model. Specifically, we resolved several crucial issues (phab:T346798), such as fixing the regex for sentence-tokenization in non-whitespace languages.

As a result, we released a new version (v0.2.0) of the mwtokenizer package which contains these improvements.
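
To give a sense of the kind of fix involved: a sentence splitter keyed only to ASCII punctuation finds no boundaries in Japanese text, because CJK sentences end in fullwidth terminators and carry no trailing space. The illustration below is not the actual mwtokenizer regex, just a minimal before/after:

    import re

    # Before: ASCII-only splitter that requires trailing whitespace;
    # it finds no sentence boundary in Japanese text.
    ascii_split = re.compile(r"(?<=[.!?])\s+")

    # After: include fullwidth CJK terminators and drop the \s+ requirement
    # (non-whitespace languages have no space after the terminator).
    cjk_split = re.compile(r"(?<=[.!?。！？])")

    text = "ウィキペディアは百科事典です。誰でも編集できます。"
    print(ascii_split.split(text))                   # one unsplit "sentence"
    print([s for s in cjk_split.split(text) if s])   # two sentences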

Improving the model for individual languages


Some of the major changes made to improve the performance of the existing language-dependent models (phab:T347696):

  • Replacing nltk and manual tokenization with mwtokenizer. This enabled effective sentence and word tokenization for non-whitespace languages and thus improved performance (Merge Request).
  • Fixing a Unicode error that was preventing a few models from running successfully (Merge Request).
  • Fixing a regex that was causing links detected by the model not to be placed appropriately in the output string for non-whitespace languages (Merge Request).

Having resolved the major errors, we can now run all the languages without error, and the improved mwtokenizer has improved performance in many of the non-whitespace languages. Below are the current results for the languages that did not pass backtesting before. Previous results can be found here: Results round-1.
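
For reading the tables below: a wiki passes backtesting when the precision and recall of its recommended links, measured against held-out existing links, meet minimum thresholds. The comments column suggests thresholds of roughly 75% precision and 20% recall; treat the exact values as an assumption. A minimal sketch:

    def passes_backtesting(precision, recall,
                           min_precision=0.75, min_recall=0.20):
        """Pass criterion as implied by the comments column; the exact
        thresholds are an assumption, not confirmed configuration."""
        return precision >= min_precision and recall >= min_recall

    print(passes_backtesting(0.82, 0.35))  # jawiki after the fixes -> True
    print(passes_backtesting(0.68, 0.28))  # aswiki -> False (precision below 75%)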

Table showing change in performance for languages that did not pass backtesting earlier.
wiki             | previous precision | precision | previous recall | recall | comments                | passes backtesting
aswiki           | 0.57               | 0.68      | 0.16            | 0.28   | improvement!            | borderline (precision is below 75%)
bowiki           | 0                  | 0.98      | 0               | 0.62   | improvement!            | True
diqwiki          | 0.4                | 0.88      | 0.9             | 0.49   | recall dropped          | True
dvwiki           | 0.67               | 0.88      | 0.02            | 0.49   | improvement!            | True
dzwiki           | -                  | 1.0       | -               | 0.23   | improvement!            | True
fywiki           | error              | 0.82      | error           | 0.459  | improvement!            | True
ganwiki          | 0.67               | 0.82      | 0.01            | 0.296  | improvement!            | True
hywwiki          | 0.74               | 0.75      | 0.19            | 0.30   | similar results         | True
jawiki           | 0.32               | 0.82      | 0.01            | 0.35   | improvement!            | True
krcwiki          | 0.65               | 0.78      | 0.2             | 0.35   | slight improvement      | True
mnwwiki          | 0                  | 0.97      | 0               | 0.68   | improvement!            | True
mywiki           | 0.63               | 0.95      | 0.06            | 0.82   | improvement!            | True
piwiki           | 0                  | 0         | 0               | nan    | only 13 sentences       | False
shnwiki          | 0.5                | 0.99      | 0.02            | 0.88   | improvement!            | True
snwiki           | 0.64               | 0.69      | 0.16            | 0.18   | similar results         | borderline (precision is below 75%, recall is close to 20%)
szywiki          | 0.65               | 0.79      | 0.32            | 0.48   | slight improvement      | True
tiwiki           | 0.54               | 0.796     | 0.5             | 0.48   | slight improvement      | True
urwiki           | 0.62               | 0.86      | 0.23            | 0.54   | improvement!            | True
wuuwiki          | 0                  | 0.68      | 0               | 0.36   | improvement!            | borderline (precision is below 75%)
zhwiki           | -                  | 0.78      | -               | 0.47   | improvement!            | True
zh_classicalwiki | 0                  | 1.0       | 0               | 0.0001 | improvement, low recall | False
zh_yuewiki       | 0.48               | 0.31      | 0               | 0.0006 | low recall              | False

The following table shows the current performance of some languages that had passed backtesting earlier. We make this comparison to ensure the new changes do not degrade performance.

Table showing change in performance for some languages that passed backtesting earlier.
wiki       | previous precision | precision | previous recall | recall | comments
arwiki     | 0.75               | 0.82      | 0.37            | 0.36   | improvement
bnwiki     | 0.75               | 0.725     | 0.3             | 0.38   | similar results
cswiki     | 0.78               | 0.80      | 0.44            | 0.45   | similar results
dewiki     | 0.8                | 0.83      | 0.48            | 0.48   | similar results
frwiki     | 0.815              | 0.82      | 0.459           | 0.50   | similar results
simplewiki | 0.79               | 0.79      | 0.45            | 0.43   | similar results
viwiki     | 0.89               | 0.91      | 0.65            | 0.67   | similar results

Exploratory work for language-agnostic model


Currently, we train a model for each language wiki, and each model is served independently. This creates deployment strain and is not easy to manage in the long run. The main goal is to develop a single model that supports all (or as many as possible) languages in order to decrease the maintenance cost. Alternatively, we could develop a few models, each covering a set of compatible languages.

First, we need to ensure languages can be trained and served using a single model. To test this hypothesis, we performed some exploratory work on language-agnostic models (phab:T354659). Some of the important changes we made were:

  • Removing the dependency on Wikipedia2Vec by using outlink embeddings. These embeddings were created in-house using Wikipedia links (Merge Request). See the sketch after this list.
  • Adding a grid-search module to select the best possible model (Merge Request).
  • Adding a feature called `wiki_db` that names a wiki (e.g. enwiki, bnwiki). This should ideally help the model when combining multiple languages (Merge Request).
  • Combining the training data of multiple languages, training a single model, running the evaluation for each language, and comparing performance with single-language models (Merge Request).
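
As a sketch of the outlink-embedding idea from the first item above: one standard way to build such embeddings is to treat each article's sequence of outgoing links as a "sentence" and train a word2vec-style model over those sequences, removing the Wikipedia2Vec dependency. The corpus below is a hypothetical toy example, and gensim is shown only as one possible implementation:

    from gensim.models import Word2Vec

    # Each "sentence" is the sequence of outgoing links of one article
    # (hypothetical toy corpus; real data would come from Wikipedia dumps).
    outlink_corpus = [
        ["Q42", "Q5", "Q36180"],     # article A's outlinks
        ["Q5", "Q36180", "Q11573"],  # article B's outlinks
        ["Q42", "Q11573"],           # article C's outlinks
    ]

    model = Word2Vec(sentences=outlink_corpus, vector_size=50,
                     window=5, min_count=1, sg=1, epochs=10)
    vector = model.wv["Q42"]  # article embedding, usable as a model feature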

To create a language-agnostic model:

  • We first combined the training data of two unrelated languages, and performance did not drop much. This motivated us to scale the experiment to 11 and then to ~50 languages. We trained two models on two sets of ~50 languages: one set had 52 central languages from fallback chains, and the other had 44 randomly selected wikis. We trained a model on all languages in each set and evaluated it on each individual language wiki. The performance comparison of the language-agnostic model and the single-language models can be found here: main_v2 and sec_v2. The performance of the language-agnostic model on both sets of languages is comparable to the single-language versions. This shows that we can, in principle, select any set of wikis, perform combined training, and expect very good results.
  • We then extended the experiment and trained a model with all (317) language wikis, with a cap of 100k samples per language (see the sketch after this list). The evaluations can be found here: all_wikis_baseline. Similar to before, some languages show some drop in performance, but many of the languages perform almost on par with single-language models. Specifically, 14% of the languages had a >=10% drop in precision, while the rest were close to the precision of the single-language models. We then increased the cap to 1 million samples per language (evaluation here: all_wikis_baseline_1M). The performance remains extremely close to the 100k-samples experiment, with a slight decrease in precision in 4 languages and a slight increase in 4 others.
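
A minimal sketch of the combined-training setup described in the list above, assuming the per-wiki training data is available as pandas DataFrames; the names and the capping strategy are illustrative:

    import pandas as pd

    def build_combined_training_set(per_wiki_data, cap_per_wiki=100_000):
        """Concatenate per-wiki training data, capping samples per language
        and tagging each row with its source wiki (the wiki_db feature)."""
        parts = []
        for wiki_db, df in per_wiki_data.items():
            if len(df) > cap_per_wiki:
                df = df.sample(cap_per_wiki, random_state=0)
            parts.append(df.assign(wiki_db=wiki_db))
        return pd.concat(parts, ignore_index=True)

    # combined = build_combined_training_set({"enwiki": en_df, "bnwiki": bn_df})
    # A single model is then trained on `combined` and evaluated per wiki_db.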

Takeaways: Based on our experiments, we confirm that it is indeed possible to combine languages, even randomly, and expect performance very close to that of the single-language models. How many models to train, which languages should be trained together, and how many samples to choose are all questions that need more experiments to answer, and they will mostly depend on the memory and time constraints of training the model(s).

Building the Pipeline


We built an Airflow pipeline that can automatically run on various subsets of language Wikipedias to collect data, create training and testing datasets, train models, and perform evaluation (a skeletal sketch follows). We can either train models for each individual language (as was done before) or train a combined model for all or a subset of languages (in order to reduce the overall number of models). For the latter, since we found that randomly sampled languages trained together in a language-agnostic setting give relatively good performance, we can adopt each shard as a set of wikis to train a model on. These shards were created with wiki size and memory in mind, so we can directly use these groups of wikis to run pipelines that take a similar time to process each wiki, and thereby create a balanced set of language-agnostic add-a-link models.
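
For orientation, here is a skeletal version of what such a DAG could look like; the task names and callables are hypothetical stand-ins, not the actual research-datasets code:

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    import pendulum

    # Hypothetical stand-ins for the real pipeline steps.
    def collect_data(**context): ...
    def create_datasets(**context): ...
    def train_model(**context): ...
    def evaluate_model(**context): ...

    with DAG(
        dag_id="mwaddlink_training",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule=None,  # triggered manually while under development
    ) as dag:
        collect = PythonOperator(task_id="collect_data", python_callable=collect_data)
        datasets = PythonOperator(task_id="create_datasets", python_callable=create_datasets)
        train = PythonOperator(task_id="train_model", python_callable=train_model)
        evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)

        # Linear for clarity; independent steps (e.g. anchor dictionary vs.
        # wikidata properties, cf. the TODOs) could run in parallel.
        collect >> datasets >> train >> evaluate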

The production code has two components so far:

  • The Research Datasets repo contains the code that Airflow will run. Multiple projects are all housed in src/research_datasets.
  • Follow the instructions in the README to set up the virtual environment and dependencies. As of 2025-02, these instructions were (please check before proceeding):
    • git clone https://gitlab.wikimedia.org/repos/research/research-datasets.git
    • conda env create -f conda-environment.yaml
    • pipx ensurepath (only once per system)
    • pipx install uv (only once per system)
    • uv pip install -e .[dev]
  • The addalink notebook shows how to execute the individual steps of the training pipeline. This can be used to test and debug the code in the pipeline. The code was successfully tested for different wikis (including enwiki as the "largest" wiki).
  • When making changes to the code: commit, then push to a different branch (e.g. addalink) and create a merge request so it can be reviewed by Research Engineering before being merged into the main branch. This will trigger several pipelines that automatically check the changes (see this example for MR64): Linting, Typechecking, Testing (see the links for how to run these locally for debugging). These checks need to pass successfully before the MR can be merged.

TODOs

  • Update mwaddlink's README
  • generate_backtesting_eval: currently the evaluation code recreates the dictionaries and executes the previously existing code. Ideally, we would like to update it to use dataframes/parquet files instead of dictionaries.
  • Code could use some cleanup
    • the names are not appropriate anymore, e.g. anchor_dictionary/filter_dict_anchor is not actually a dictionary
    • the "generate" prefix is meaningless, remove it from files/methods
    • inconsistent "output" and "directory" references, used interchangeably
    • generate_backtesting_data needs a new name, it is also generating training data
  • Adapt format of output files for inference
Setting up Airflow

To set up the airflow repo, follow the instructions from the Knowledge Gaps repo:

  • Clone airflow-dags: git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
  • Check out the research-dags branch
  • Starting airflow dev instance (manually):
    • Create a directory where the analytics-privatedata user can create the airflow configuration directory, e.g. sudo -u analytics-privatedata mkdir /tmp/mwaddlink [In a stat machine]
    • Start the airflow dev instance using the analytics-privatedata user, with an unused port, and specifying the created config directory: sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/mwaddlink -p 8989 research [In stat machine]
    • The dev instance will be accessible at the configured port, and uses the research DAGs of the active branch of the airflow-dags repo.
    • Running the airflow instance as the analytics-privatedata user is necessary because the airflow DAGs submit Spark jobs using [skein]-based YARN containers, and for the Kerberos credentials to propagate properly we need a user with a Kerberos keytab (like analytics-privatedata) instead of a Kerberos credential cache like normal users.
    • To see the airflow UI, tunnel port 8989 from the stat machine to your local machine. In a local terminal: ssh stat1008.eqiad.wmnet -N -L 8989:stat1008.eqiad.wmnet:8989.
  • Starting airflow dev instance via VSCode:
    • Alternatively, if you use VSCode with the Remote-SSH extension to connect, spinning up the airflow instance will automatically tunnel port 8989 to local. You should see a pop-up at the bottom right of the screen and should then be able to access the airflow UI at http://localhost:8989/. You can also run this script from the research-datasets repo to start up the airflow instance.
  • Note: Since everyone using airflow dev acts as the analytics-privatedata user, make sure no one else is using the same stat machine, and definitely not the same port; otherwise your airflow instance will fail to start and you may simply connect to someone else's instance, causing confusion. To double-check that you are connected to your own instance, go to the airflow UI → Admin → Configurations → check that "dags_folder" is set to /srv/home/<your_name>/airflow-dags/research/dags. If it isn't your name, your UI might be pointing to someone else's airflow dev setup.
  • Running the dag
    • Go to Admin → Variables: set all the arguments (e.g. for which wikis to run the training pipeline) → Save
    • After that, you can trigger the DAG.
  • After running the DAG, you might need to change the permissions of the output files to access them from your own user account:
    • sudo -u analytics-privatedata hdfs dfs -chmod -R 755 /tmp/research/mwaddlink/

Understanding the DAG:

  • Airflow will link to your conda env. If airflow is on the same stat machine as your research-datasets repo, and hence your conda env, you can link to the env directly. Otherwise, move it to HDFS and link to the HDFS file like so: dag_config.hadoop_name_node + <conda_path>.
  • Once set up, the airflow UI will list the mwaddlink DAG. We can manually trigger the DAG for testing purposes. Eventually the DAG will be set up with a schedule to automatically run the pipeline(s) at regular intervals. Logs can be found in /tmp/mwaddlink/airflow/logs (note that the /tmp/mwaddlink path was used to create the dev instance).
  • Every time you make changes to the research-datasets code, you need to re-pack it so that airflow can pick up the latest code.

TODOs:

  • generate_anchor_dictionary and generate_wdproperties can run in parallel in the DAG

Some initial learnings about setting up code in research_datasets and airflow_dags can be found in this Google Doc.

Results


The pipeline currently runs successfully for medium to small wikis. There are some memory errors (listed in the TODOs) that arise for the larger wikis and will require some additional code maneuvering. The pipeline was run on shard 7: all data were collected for the 10 wikis in shard 7, a language-agnostic model was trained on these wikis together, and evaluations were run against the trained model. Below is the comparison of this model's performance with the single-language model evaluations from before (precision and recall columns in all_wikis_baseline). The performance of the language-agnostic model is very close to that of the language-dependent models, and in some wikis (e.g. arwiki, viwiki, fawiki) it is even better, despite the model being trained on multiple languages at once.

Comparison of performance of the language-agnostic (LA) vs language-dependent (LD) models in some wikis.
Wiki   | LD Precision | LD Recall | LA Precision | LA Recall
eswiki | 0.83         | 0.49      | 0.79         | 0.53
huwiki | 0.89         | 0.43      | 0.82         | 0.43
hewiki | 0.76         | 0.26      | 0.72         | 0.29
ukwiki | 0.85         | 0.52      | 0.82         | 0.54
arwiki | 0.82         | 0.35      | 0.85         | 0.46
cawiki | 0.85         | 0.48      | 0.78         | 0.41
viwiki | 0.88         | 0.59      | 0.96         | 0.81
fawiki | 0.86         | 0.51      | 0.85         | 0.61
rowiki | 0.88         | 0.52      | 0.88         | 0.58
kowiki | 0.73         | 0.23      | 0.56         | 0.16

Resources


References

  1. Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939