Research:Expanding Wikipedia articles across languages/Inter language approach/Section Alignment at Large Scale
Following our previous work on Cross-lingual Section Alignment, we have expanded the language coverage and updated the ML pipelines to compute section alignments across 205 languages. This page describes the implementation and performance of the new algorithm.
System description
Features
- Title similarity: The cosine similarity between the vector representations of two section titles.
- Link similarity (sum): Sum of the link similarity over all section pairs with the same source and target sections. Link similarity is defined here as the Jaccard index between the two sets of section links, represented as Wikidata items, from each section (see the sketch after this list).
- Link similarity (mean): Mean link similarity for all section pairs with the same source and target sections.
- Edit distance: The Levenshtein distance between two section titles.
- Normalized co-occurrence count: The number of times two section titles co-occur across all articles, normalized by the maximum number of times the source section co-occurs with any target section.
- Source count: Total number of times the source section occurs across all articles.
- Target count: Total number of times the target section occurs across all articles.
- Source position (relative to the top)
- Target position (relative to the top)
- Source position (relative to the bottom)
- Target position (relative to the bottom)
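The link-similarity features build on a per-pair Jaccard index over the Wikidata items linked from each section. A minimal sketch of that index (the function name and example items are illustrative, not taken from the pipeline):

```python
def link_jaccard(source_links, target_links):
    """Jaccard index between the sets of Wikidata items linked from a
    source section and a target section."""
    source_set, target_set = set(source_links), set(target_links)
    if not source_set and not target_set:
        return 0.0  # no links on either side: treat similarity as zero
    return len(source_set & target_set) / len(source_set | target_set)


# Example: the two sections share one Wikidata item out of three distinct ones.
print(link_jaccard(["Q42", "Q5"], ["Q42", "Q146"]))  # ~0.33
```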
Feature extraction
The above features are extracted in two phases. During the first phase, the latest revisions of all articles in a language are read from the wikitext_current Hive table. Each revision is then parsed to extract all level-2 headings, the list of links found under each heading (represented as Wikidata items), the relative position of each heading in the article, and the total number of times each heading occurs across all articles in that language.
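A minimal sketch of the per-article part of this phase, using the mwparserfromhell library to pull level-2 headings, their positions and their links from a revision's wikitext; mapping link titles to Wikidata items is assumed to happen in a separate step and is not shown:

```python
import mwparserfromhell


def extract_sections(wikitext):
    """Return (heading, position, links) tuples for the level-2 sections of
    one article revision. Link titles still need to be mapped to Wikidata
    items separately."""
    code = mwparserfromhell.parse(wikitext)
    sections = code.get_sections(levels=[2], include_headings=True)
    results = []
    for position, section in enumerate(sections):
        heading = section.filter_headings()[0].title.strip_code().strip()
        links = [str(link.title) for link in section.filter_wikilinks()]
        results.append((heading, position, links))
    return results
```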
In the second phase, all articles in the source language and its target language(s) are aligned using their Wikidata ID. Then, for each Wikidata item, every possible combination (pair) of headings in the source language and headings in the target language is generated. Finally, using the data computed during the first phase, the remaining features are calculated for each pair.
Doing this in two phases decouples the features that depend on the target language from those that do not. The latter do not have to be re-calculated every time the former are computed for a new target language, which results in a significant speed-up across the board.
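As a sketch of the pair-generation step: for every article present in both languages (matched by Wikidata ID), the candidate pairs are the cross product of its source headings and target headings (function and example names are illustrative):

```python
from itertools import product


def candidate_pairs(source_headings, target_headings):
    """Yield every (source heading, target heading) combination for one
    article that exists in both languages."""
    yield from product(source_headings, target_headings)


# Example: 3 source headings x 2 target headings -> 6 candidate pairs.
pairs = list(candidate_pairs(["History", "Geography", "Economy"],
                             ["Historia", "Geografía"]))
print(len(pairs))  # 6
```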
Models
Natural Language Processing
In order to calculate title similarity, section titles first need to be encoded into vectors. Since the similarity between two vectors is determined using cosine similarity, similar sections should produce similar vector representations even when they belong to different languages. To generate these representations, a cross-lingual sentence embedding model, LaBSE, is used. LaBSE maps sentences from different languages into a shared 768-dimensional vector space, eliminating the need to align embeddings before they can be compared.
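A sketch of the title-similarity feature, assuming the sentence-transformers distribution of LaBSE; the production pipeline may load the model differently:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# LaBSE maps titles from different languages into the same 768-dimensional space.
model = SentenceTransformer("sentence-transformers/LaBSE")


def title_similarity(source_title, target_title):
    """Cosine similarity between the LaBSE embeddings of two section titles."""
    # With normalized embeddings, the dot product equals the cosine similarity.
    embeddings = model.encode([source_title, target_title],
                              normalize_embeddings=True)
    return float(np.dot(embeddings[0], embeddings[1]))


print(title_similarity("History", "Historia"))
```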
Machine Learning
To generate a single similarity score for each section pair based on its features, a gradient boosting classifier is used. This classifier produces a score between 0 and 1, which is the probability that the pair's target is an accurate translation of the source. The targets for each source are then ranked according to this probability.
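A minimal sketch of scoring and ranking section pairs, using scikit-learn's GradientBoostingClassifier as a stand-in for whichever gradient boosting implementation the pipeline actually uses; DataFrame and column names are illustrative:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["title_similarity", "link_similarity_sum", "link_similarity_mean",
            "edit_distance", "normalized_cooccurrence",
            "source_count", "target_count"]


def rank_targets(train_df: pd.DataFrame, pairs_df: pd.DataFrame) -> pd.DataFrame:
    """Fit a gradient boosting classifier on labelled section pairs and rank
    the candidate targets of every source section by predicted probability."""
    model = GradientBoostingClassifier()
    model.fit(train_df[FEATURES], train_df["label"])

    scored = pairs_df.copy()
    # predict_proba returns [P(False), P(True)]; keep the positive-class column.
    scored["probability"] = model.predict_proba(scored[FEATURES])[:, 1]
    scored["rank"] = (scored
                      .groupby(["source", "source_language", "target_language"])
                      ["probability"]
                      .rank(ascending=False, method="first")
                      .astype(int))
    return scored
```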
Training data
The training data is generated by combining the ground truth with the extracted features. The ground truth consists of crowdsourced section translations in six languages: Arabic, English, French, Japanese, Spanish and Russian. These section pairs are joined with the data generated by the feature extraction pipeline on source, target, source language and target language. This yields a dataset with both positive and negative examples, unlike the ground truth, which contains only positive ones. The resulting dataset, however, is imbalanced, with a disproportionately high number of negative examples, and any machine learning task involving it has to address that.
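A sketch of how the ground truth could be joined with the extracted features to obtain labelled training data; column names are illustrative and the real pipeline may differ:

```python
import pandas as pd

JOIN_KEYS = ["source", "target", "source_language", "target_language"]


def build_training_data(features_df: pd.DataFrame,
                        ground_truth_df: pd.DataFrame) -> pd.DataFrame:
    """Label every extracted section pair: True if it appears in the
    crowdsourced ground truth, False otherwise. The result is heavily
    imbalanced towards negatives, which has to be handled downstream."""
    positives = ground_truth_df[JOIN_KEYS].drop_duplicates().assign(label=True)
    train_df = features_df.merge(positives, on=JOIN_KEYS, how="left")
    train_df["label"] = train_df["label"].fillna(False).astype(bool)
    return train_df
```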
Testing data
Test data is generated by joining the section translations made with the Content Translation tool (CXT) with the extracted features on source, target, source language and target language. The translations from CXT are labelled 'True' and the rest of the pairs 'False'. Precision on this dataset is measured by counting the number of sources for which the CXT translation was among the top n targets.
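A sketch of that precision measurement, where a source counts as a hit if any of its CXT-labelled targets is ranked within the top n; column names are illustrative:

```python
import pandas as pd


def precision_at_n(scored_df: pd.DataFrame, n: int = 5) -> float:
    """Fraction of (source, source language, target language) groups for which
    at least one CXT-labelled target falls within the top n ranks."""
    keys = ["source", "source_language", "target_language"]
    hits = (scored_df
            .groupby(keys)
            .apply(lambda g: bool((g["label"] & (g["rank"] <= n)).any())))
    return float(hits.mean())
```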
Limitations
LaBSE, the embedding model used to generate vector representations for section titles, currently supports 109 languages[1]. This means that sections from unsupported languages might not be correctly encoded.
Output description
The generated section alignments for each language are available for download in the form of SQLite databases.
Output schema
col_name | data_type | comment |
---|---|---|
source | text | source section title |
target | text | target section title |
source_language | text | wikipedia where the source section comes from |
target_language | text | wikipedia where the target section comes from |
probability | real | probability of the target being the source's translation |
rank | integer | target's rank according to probability |
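With this schema, the downloaded databases can be queried directly. The sketch below assumes a hypothetical file name and table name, so check the downloaded database for the actual ones:

```python
import sqlite3

# Hypothetical file name; each language's alignments ship as a separate database.
conn = sqlite3.connect("section_alignment_en.sqlite")

query = """
SELECT target, target_language, probability, rank
FROM alignments            -- table name is an assumption; check the database
WHERE source = ? AND source_language = ? AND rank <= 5
ORDER BY target_language, rank
"""
rows = conn.execute(query, ("History", "enwiki")).fetchall()
for target, target_language, probability, rank in rows:
    print(f"{target_language}: {target} (p={probability:.3f}, rank={rank})")
conn.close()
```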
Performance
The following results include the top 100 language pairs by number of section pairs tested. Precision here denotes the fraction of source sections in the test data for which the CXT translation was among the top 5 ranked target sections. Note that any source section occurring more than once per (source language, target language) pair in the CXT dataset was counted as one pair and tested by checking whether any of the corresponding targets ended up among the top 5.
Source language | Target language | Precision @ 5 | Pairs tested |
---|---|---|---|
enwiki | eswiki | 0.970 | 12988 |
enwiki | frwiki | 0.939 | 9165 |
enwiki | arwiki | 0.937 | 8456 |
enwiki | viwiki | 0.946 | 6054 |
ruwiki | ukwiki | 0.986 | 5980 |
ruwiki | bawiki | 0.919 | 5382 |
enwiki | jawiki | 0.906 | 5328 |
enwiki | zhwiki | 0.915 | 5153 |
enwiki | itwiki | 0.941 | 5039 |
enwiki | ukwiki | 0.944 | 4934 |
enwiki | ptwiki | 0.964 | 4691 |
enwiki | trwiki | 0.953 | 4246 |
enwiki | ruwiki | 0.912 | 4110 |
enwiki | hewiki | 0.925 | 4062 |
enwiki | idwiki | 0.973 | 3495 |
enwiki | fawiki | 0.946 | 3402 |
enwiki | rowiki | 0.964 | 3048 |
enwiki | bnwiki | 0.962 | 2832 |
enwiki | tawiki | 0.963 | 2707 |
enwiki | elwiki | 0.946 | 2685 |
enwiki | cawiki | 0.940 | 2604 |
eswiki | cawiki | 0.971 | 2296 |
frwiki | ocwiki | 0.989 | 2094 |
enwiki | dewiki | 0.876 | 1884 |
enwiki | pawiki | 0.982 | 1781 |
enwiki | mlwiki | 0.952 | 1632 |
enwiki | cswiki | 0.917 | 1466 |
enwiki | kowiki | 0.905 | 1375 |
enwiki | mkwiki | 0.966 | 1308 |
enwiki | srwiki | 0.928 | 1212 |
enwiki | sqwiki | 0.971 | 1178 |
enwiki | nlwiki | 0.925 | 1176 |
enwiki | mswiki | 0.957 | 1174 |
enwiki | afwiki | 0.977 | 1089 |
enwiki | huwiki | 0.897 | 1041 |
dewiki | frwiki | 0.852 | 1026 |
frwiki | eswiki | 0.920 | 995 |
ruwiki | hywiki | 0.959 | 991 |
frwiki | enwiki | 0.918 | 922 |
dewiki | enwiki | 0.895 | 893 |
enwiki | urwiki | 0.948 | 828 |
enwiki | plwiki | 0.891 | 824 |
enwiki | tewiki | 0.953 | 813 |
eswiki | enwiki | 0.913 | 797 |
ukwiki | ruwiki | 0.958 | 754 |
jawiki | zhwiki | 0.856 | 750 |
enwiki | fiwiki | 0.888 | 732 |
enwiki | thwiki | 0.920 | 679 |
enwiki | hiwiki | 0.938 | 659 |
enwiki | dawiki | 0.933 | 658 |
frwiki | itwiki | 0.921 | 648 |
eswiki | euwiki | 0.946 | 635 |
enwiki | slwiki | 0.959 | 631 |
dewiki | itwiki | 0.872 | 626 |
enwiki | cywiki | 0.955 | 616 |
ruwiki | hewiki | 0.874 | 595 |
ruwiki | enwiki | 0.906 | 595 |
enwiki | tlwiki | 0.939 | 594 |
eswiki | glwiki | 0.927 | 587 |
enwiki | orwiki | 0.926 | 582 |
enwiki | svwiki | 0.930 | 568 |
enwiki | kawiki | 0.952 | 568 |
enwiki | bgwiki | 0.929 | 564 |
ruwiki | bewiki | 0.978 | 544 |
enwiki | hywiki | 0.918 | 538 |
enwiki | mywiki | 0.929 | 535 |
eswiki | frwiki | 0.882 | 534 |
enwiki | guwiki | 0.958 | 524 |
frwiki | cawiki | 0.922 | 523 |
enwiki | knwiki | 0.965 | 510 |
enwiki | glwiki | 0.901 | 506 |
dewiki | nlwiki | 0.876 | 499 |
ruwiki | ttwiki | 0.950 | 497 |
cawiki | eswiki | 0.961 | 491 |
enwiki | hawiki | 0.924 | 487 |
eswiki | ptwiki | 0.960 | 475 |
dewiki | eswiki | 0.870 | 453 |
enwiki | ckbwiki | 0.642 | 450 |
frwiki | arwiki | 0.824 | 449 |
plwiki | ukwiki | 0.918 | 426 |
itwiki | frwiki | 0.903 | 423 |
zhwiki | enwiki | 0.899 | 414 |
enwiki | siwiki | 0.951 | 412 |
enwiki | euwiki | 0.926 | 404 |
enwiki | hrwiki | 0.948 | 400 |
itwiki | enwiki | 0.932 | 385 |
ruwiki | tgwiki | 0.916 | 382 |
enwiki | jvwiki | 0.866 | 372 |
itwiki | eswiki | 0.923 | 364 |
enwiki | eowiki | 0.893 | 355 |
enwiki | etwiki | 0.915 | 354 |
dewiki | ukwiki | 0.852 | 352 |
jawiki | kowiki | 0.937 | 350 |
ptwiki | enwiki | 0.935 | 336 |
ruwiki | kkwiki | 0.955 | 332 |
frwiki | ptwiki | 0.927 | 329 |
enwiki | gawiki | 0.966 | 323 |
enwiki | mrwiki | 0.944 | 322 |
ruwiki | sahwiki | 0.729 | 321 |
enwiki | bswiki | 0.974 | 312 |
Code & Data
- Code: https://gitlab.wikimedia.org/mnz/section-alignment
- Output data can be found here: https://analytics.wikimedia.org/published/datasets/one-off/section_alignment/
References
- ↑ Feng, Fangxiaoyu, et al. "Language-agnostic BERT sentence embedding." arXiv preprint arXiv:2007.01852 (2020).