
Research:Expanding Wikipedia articles across languages/Inter language approach/Section Alignment at Large Scale

This page documents a completed research project.



Following our previous work on Cross-lingual Section Alignment, we have expanded the language coverage and updated the ML pipelines to compute section alignment across 205 languages. This page describes the implementation and performance of the new algorithm.

System description


Features

  • Title similarity: The cosine similarity between the vector representations of two section titles.
  • Link similarity (sum): The sum of link similarity over all section pairs with the same source and target sections. Link similarity is defined here as the Jaccard index between the two sets of links, represented as Wikidata items, found under each section (see the sketch after this list).
  • Link similarity (mean): The mean link similarity over all section pairs with the same source and target sections.
  • Edit distance: The Levenshtein distance between two section titles.
  • Normalized co-occurrence count: The number of times two section titles co-occur across all articles, normalized by the maximum number of times the source section co-occurs with any target section.
  • Source count: The total number of times the source section occurs across all articles.
  • Target count: The total number of times the target section occurs across all articles.
  • Source position (relative to the top)
  • Target position (relative to the top)
  • Source position (relative to the bottom)
  • Target position (relative to the bottom)
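
Three of these features (link similarity, edit distance, and normalized co-occurrence) can be made concrete with a short sketch. The function and variable names below are illustrative, not the pipeline's actual code:

def link_similarity(source_links: set, target_links: set) -> float:
    """Jaccard index between the sets of Wikidata items linked from two sections."""
    if not source_links and not target_links:
        return 0.0
    return len(source_links & target_links) / len(source_links | target_links)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two section titles (simple dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_cooccurrence(counts: dict, source: str, target: str) -> float:
    """Co-occurrence count of (source, target), normalized by the source's maximum."""
    max_count = max((c for (s, _), c in counts.items() if s == source), default=1)
    return counts.get((source, target), 0) / max_count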

Feature extraction


The above features are extracted in two phases. During the first phase, the latest revisions of all articles in a language are read from the wikitext_current Hive table. Each revision is then parsed to extract all level-2 headings, the list of links found under each heading (represented as Wikidata items), the relative position of each heading in the article, and the total number of times each heading occurs across all articles in that language.
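
A minimal sketch of the per-revision parsing in this first phase, using the mwparserfromhell library (the mapping from link titles to Wikidata items is assumed to happen in a separate lookup step):

import mwparserfromhell

def parse_revision(wikitext: str) -> list:
    """Extract level-2 headings, their links, and their relative positions."""
    code = mwparserfromhell.parse(wikitext)
    sections = code.get_sections(levels=[2], include_headings=True)
    parsed = []
    for position, section in enumerate(sections):
        heading = str(section.filter_headings()[0].title).strip()
        links = {str(link.title) for link in section.filter_wikilinks()}
        parsed.append({
            "heading": heading,
            "links": links,  # resolved to Wikidata items in a later step
            "position_top": position,
            "position_bottom": len(sections) - 1 - position,
        })
    return parsed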

In the second phase, all articles in the source language and its target language(s) are aligned using their Wikidata ID. Then, for each Wikidata item, every possible pair of a heading in the source language and a heading in the target language is generated. Finally, using the data retrieved during the first phase, the remaining features are calculated for each pair.

Doing this in two phases decouples the features that depend on the target language from those that do not. The target-independent features therefore do not have to be re-calculated every time a new target language is added, which results in a significant speed-up across the board.
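
A sketch of the pair generation in the second phase, assuming the per-language output of the first phase has been keyed by Wikidata ID:

from itertools import product

def generate_pairs(source_sections: dict, target_sections: dict):
    """Both arguments map a Wikidata ID to the list of headings in that article."""
    for wikidata_id in source_sections.keys() & target_sections.keys():
        for src, tgt in product(source_sections[wikidata_id],
                                target_sections[wikidata_id]):
            yield wikidata_id, src, tgt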

Models


Natural Language Processing


To calculate title similarity, section titles first need to be encoded as vectors. Since the similarity between two vectors is measured by cosine similarity, similar sections should produce similar vector representations even when their titles are in different languages. To generate these representations, a cross-lingual sentence embedding model, LaBSE, is used. LaBSE maps text from 109 languages into a shared 768-dimensional vector space, eliminating the need to align embeddings before they can be compared.
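
A minimal sketch of title encoding and cosine similarity using the sentence-transformers package and the published LaBSE checkpoint:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def title_similarity(source_title: str, target_title: str) -> float:
    """Cosine similarity between the LaBSE embeddings of two section titles."""
    a, b = model.encode([source_title, target_title])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(title_similarity("History", "Historia"))  # e.g. an en/es section pair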

Machine Learning


To generate a single similarity score for each section pair based on its features, a gradient boosting classifier is used. The classifier outputs a score between 0 and 1: the probability that the pair's target is an accurate translation of its source. The targets for each source are then ranked according to this probability.
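
A sketch of scoring and ranking with scikit-learn's gradient boosting classifier; the feature and column names are illustrative, and the actual pipeline may use a different implementation or hyperparameters:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = [
    "title_similarity", "link_similarity_sum", "link_similarity_mean",
    "edit_distance", "norm_cooccurrence", "source_count", "target_count",
    "source_pos_top", "target_pos_top", "source_pos_bottom", "target_pos_bottom",
]

def score_and_rank(train: pd.DataFrame, pairs: pd.DataFrame) -> pd.DataFrame:
    clf = GradientBoostingClassifier()
    clf.fit(train[FEATURES], train["label"])
    pairs = pairs.copy()
    # Probability that the target is a translation of the source.
    pairs["probability"] = clf.predict_proba(pairs[FEATURES])[:, 1]
    # Rank candidate targets per source section and language pair.
    pairs["rank"] = (pairs.groupby(["source", "source_language", "target_language"])
                          ["probability"]
                          .rank(ascending=False, method="first")
                          .astype(int))
    return pairs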

Training data


The training data is generated by combining the ground truth with the extracted features. The ground truth consists of crowdsourced section translations in six languages: Arabic, English, French, Japanese, Spanish, and Russian. These section pairs are joined with the output of the feature extraction pipeline on source, target, source language, and target language. This yields a dataset with both positive and negative examples, unlike the ground truth, which contains only positive ones. The resulting dataset, however, is imbalanced, with a disproportionately high number of negative examples, and any machine learning task using it has to address that.
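
A sketch of the labelling join, assuming both inputs are pandas DataFrames that share the four join keys; pairs produced by feature extraction but absent from the ground truth become the negative examples:

import pandas as pd

def build_training_set(features: pd.DataFrame, ground_truth: pd.DataFrame) -> pd.DataFrame:
    keys = ["source", "target", "source_language", "target_language"]
    # Ground-truth pairs are the positive examples.
    truth = ground_truth[keys].drop_duplicates().assign(label=True)
    data = features.merge(truth, on=keys, how="left")
    # Every pair not found in the ground truth is treated as negative.
    data["label"] = data["label"].fillna(False).astype(bool)
    return data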

Testing data


Test data is generated by joining the section translations made with the Content Translation tool (CXT) with the extracted features on source, target, source language, and target language. The translations from CXT are labelled 'True' and the remaining pairs 'False'. Precision on this dataset is measured as the fraction of sources for which the CXT translation was among the top n targets.
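
A sketch of this precision-at-n measurement; it assumes a DataFrame of ranked pairs with a boolean label column, counts duplicate sources once per language pair, and scores a source as correct if any of its true targets lands in the top n:

import pandas as pd

def precision_at_n(ranked: pd.DataFrame, n: int = 5) -> float:
    keys = ["source", "source_language", "target_language"]
    # Sources whose CXT translation appears among the top-n ranked targets.
    hits = ranked[(ranked["rank"] <= n) & ranked["label"]][keys].drop_duplicates()
    # All sources that have at least one CXT translation in the test data.
    total = ranked[ranked["label"]][keys].drop_duplicates()
    return len(hits) / len(total)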

Limitations


LaBSE, the embedding model used to generate vector representations for section titles, currently supports 109 languages[1]. This means that sections from unsupported languages might not be correctly encoded.

Output description


The generated section alignments for each language are available for download as SQLite databases.

Output schema

col_name         data_type  comment
source           text       source section title
target           text       target section title
source_language  text       Wikipedia the source section comes from
target_language  text       Wikipedia the target section comes from
probability      real       probability of the target being the source's translation
rank             integer    target's rank according to probability
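
For example, the top-ranked alignments can be read with Python's built-in sqlite3 module. The file and table names below are assumptions, since they are not specified on this page:

import sqlite3

conn = sqlite3.connect("section_alignment.sqlite")  # hypothetical filename
rows = conn.execute(
    """
    SELECT source, target, probability
    FROM alignments  -- table name assumed
    WHERE source_language = 'en' AND target_language = 'es' AND rank = 1
    ORDER BY probability DESC
    LIMIT 10
    """
).fetchall()
for source, target, probability in rows:
    print(f"{source!r} -> {target!r} ({probability:.3f})")
conn.close()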

Performance


The following results cover the top 100 language pairs by number of section pairs tested. Precision here denotes the probability that the CXT translation of a source section was among its top 5 aligned targets. Note that any source section occurring more than once per (source language, target language) pair in the CXT dataset was counted as one pair and scored as correct if any of its corresponding targets ended up among the top 5.

Source language Target language Precision @ 5 Pairs tested
enwiki eswiki 0.970 12988
enwiki frwiki 0.939 9165
enwiki arwiki 0.937 8456
enwiki viwiki 0.946 6054
ruwiki ukwiki 0.986 5980
ruwiki bawiki 0.919 5382
enwiki jawiki 0.906 5328
enwiki zhwiki 0.915 5153
enwiki itwiki 0.941 5039
enwiki ukwiki 0.944 4934
enwiki ptwiki 0.964 4691
enwiki trwiki 0.953 4246
enwiki ruwiki 0.912 4110
enwiki hewiki 0.925 4062
enwiki idwiki 0.973 3495
enwiki fawiki 0.946 3402
enwiki rowiki 0.964 3048
enwiki bnwiki 0.962 2832
enwiki tawiki 0.963 2707
enwiki elwiki 0.946 2685
enwiki cawiki 0.940 2604
eswiki cawiki 0.971 2296
frwiki ocwiki 0.989 2094
enwiki dewiki 0.876 1884
enwiki pawiki 0.982 1781
enwiki mlwiki 0.952 1632
enwiki cswiki 0.917 1466
enwiki kowiki 0.905 1375
enwiki mkwiki 0.966 1308
enwiki srwiki 0.928 1212
enwiki sqwiki 0.971 1178
enwiki nlwiki 0.925 1176
enwiki mswiki 0.957 1174
enwiki afwiki 0.977 1089
enwiki huwiki 0.897 1041
dewiki frwiki 0.852 1026
frwiki eswiki 0.920 995
ruwiki hywiki 0.959 991
frwiki enwiki 0.918 922
dewiki enwiki 0.895 893
enwiki urwiki 0.948 828
enwiki plwiki 0.891 824
enwiki tewiki 0.953 813
eswiki enwiki 0.913 797
ukwiki ruwiki 0.958 754
jawiki zhwiki 0.856 750
enwiki fiwiki 0.888 732
enwiki thwiki 0.920 679
enwiki hiwiki 0.938 659
enwiki dawiki 0.933 658
frwiki itwiki 0.921 648
eswiki euwiki 0.946 635
enwiki slwiki 0.959 631
dewiki itwiki 0.872 626
enwiki cywiki 0.955 616
ruwiki hewiki 0.874 595
ruwiki enwiki 0.906 595
enwiki tlwiki 0.939 594
eswiki glwiki 0.927 587
enwiki orwiki 0.926 582
enwiki svwiki 0.930 568
enwiki kawiki 0.952 568
enwiki bgwiki 0.929 564
ruwiki bewiki 0.978 544
enwiki hywiki 0.918 538
enwiki mywiki 0.929 535
eswiki frwiki 0.882 534
enwiki guwiki 0.958 524
frwiki cawiki 0.922 523
enwiki knwiki 0.965 510
enwiki glwiki 0.901 506
dewiki nlwiki 0.876 499
ruwiki ttwiki 0.950 497
cawiki eswiki 0.961 491
enwiki hawiki 0.924 487
eswiki ptwiki 0.960 475
dewiki eswiki 0.870 453
enwiki ckbwiki 0.642 450
frwiki arwiki 0.824 449
plwiki ukwiki 0.918 426
itwiki frwiki 0.903 423
zhwiki enwiki 0.899 414
enwiki siwiki 0.951 412
enwiki euwiki 0.926 404
enwiki hrwiki 0.948 400
itwiki enwiki 0.932 385
ruwiki tgwiki 0.916 382
enwiki jvwiki 0.866 372
itwiki eswiki 0.923 364
enwiki eowiki 0.893 355
enwiki etwiki 0.915 354
dewiki ukwiki 0.852 352
jawiki kowiki 0.937 350
ptwiki enwiki 0.935 336
ruwiki kkwiki 0.955 332
frwiki ptwiki 0.927 329
enwiki gawiki 0.966 323
enwiki mrwiki 0.944 322
ruwiki sahwiki 0.729 321
enwiki bswiki 0.974 312

Code & Data


References

  1. Feng, Fangxiaoyu, et al. "Language-agnostic BERT sentence embedding." arXiv preprint arXiv:2007.01852 (2020).