Machine learning models/Proposed/Article descriptions
This model card page currently has draft status: it is a piece of model documentation that is still being written. Once the model card is completed, this template should be removed.
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Marija Šakota, Maxime Peyrard, and Robert West |
Model owner(s) | Isaac (WMF) |
Model interface | https://ml-article-descriptions.toolforge.org/ |
Publications | arXiv |
Code | model and API |
Uses PII | No |
In production? | No |
Which projects? | Android |
This model uses existing wikitext and article descriptions in other languages to recommend potential Wikidata article descriptions for Wikipedia articles in 25 languages.
How can we concisely explain what a Wikipedia article is about? To help users understand this, each Wikipedia page should be annotated with a short description indicating the article's topic. In practice, a large fraction of articles are missing a short description. This problem is particularly striking for low-resource languages, such as Kazakh or Lithuanian, but it also affects high-resource languages, such as English or German.
This model generates short descriptions for a given article in 25 languages. It was built on top of mBART[1] and uses an article's first paragraphs and existing short descriptions in other languages to suggest a missing short description. For example, a suggested short description for Beer could be "Alcoholic drink made from fermented cereal grains".
The model was trained on 100 thousand samples. Each sample includes the first paragraphs of a Wikipedia article and its existing short descriptions, in up to 25 chosen languages. This data originated from the editing activity of Wikipedia and Wikidata editors and was collected in an automated fashion.
The model is deployed on CloudVPS and can currently be accessed publicly through a Toolforge interface. It can be used to help editors by suggesting possible short descriptions for a chosen Wikipedia article.
Motivation
Short descriptions, along with article titles, greatly facilitate navigation across the Wikipedia universe. Their main purpose is to help users disambiguate search results and briefly summarise an article's topic. They are also useful for knowledge production and management, as editors rely on them for organising their work. Annotating articles with short descriptions makes browsing more intuitive and efficient. Since many of these descriptions are currently missing, an automated approach could assist human editors in creating them and reach satisfactory coverage faster.
The current model works on almost any Wikipedia article in any of the 25 chosen languages: ['en', 'de', 'nl', 'es', 'it', 'ru', 'fr', 'zh', 'ar', 'vi', 'ja', 'fi', 'ko', 'tr', 'ro', 'cs', 'et', 'lt', 'kk', 'lv', 'hi', 'ne', 'my', 'si', 'gu']
Users and uses
Intended uses:
- assisting Wikipedia editors in creating Wikipedia short descriptions

Out-of-scope uses:
- making predictions on language editions of Wikipedia that are not in the listed 25 languages, or on other Wikimedia projects (Wiktionary, Wikinews, Wikidata, etc.)
- making predictions on Wikipedia articles without a lead paragraph or any existing descriptions in other languages
- auto-generating edits without an editor in the loop
- overwriting existing valid descriptions
Ethical considerations, caveats, and recommendations
Model
Performance
Implementation
[edit]Main model was built on top of mBART[2], while mBERT[3] was used to encode the existing short descriptions before providing them to the model. Between the encoder and the decoder of mBART, an additional attention-style block was added to fuse article representations into a single embedding. Representations of existing descriptions are averaged to produce a single embedding, which is then concatenated to the article representation. These are sent to the decoder of mBART. mBART encoder and mBERT were frozen during finetuning, leaving only the decoder and the added attention-style block to be tuned.
- Learning rate: 3e-5
- Epochs: 5
- Maximum input length: 512
- Vocab size: 250027
- Number of encoder attention layers: 12
- Number of decoder attention layers: 12
- Number of attention heads: 16
- Length of encoder embedding: 1024
- Number of parameters: 720M
- Number of trainable parameters: 215M
- Model size on disk: 2.95GB
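A minimal sketch of this setup, assuming the Hugging Face transformers library; class and variable names here are illustrative, not the released implementation (see the code repository linked above for that):

import torch
import torch.nn as nn
from transformers import BertModel, MBartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

class DescartesSketch(nn.Module):
    # Illustrative only: fuses frozen mBART paragraph encodings with
    # averaged frozen mBERT description embeddings, then decodes with mBART.
    def __init__(self):
        super().__init__()
        self.mbart = MBartForConditionalGeneration.from_pretrained(
            "facebook/mbart-large-cc25")
        self.mbert = BertModel.from_pretrained("bert-base-multilingual-cased")
        d = self.mbart.config.d_model                       # 1024
        # attention-style block fusing per-language paragraph encodings
        self.fuse = nn.MultiheadAttention(d, num_heads=16, batch_first=True)
        self.fuse_query = nn.Parameter(torch.randn(1, 1, d))
        # project mBERT's 768-dim description embeddings up to mBART's width
        self.desc_proj = nn.Linear(self.mbert.config.hidden_size, d)
        # freeze the mBART encoder and mBERT; only the decoder, the fusion
        # block, and the projection remain trainable
        for p in self.mbart.get_encoder().parameters():
            p.requires_grad = False
        for p in self.mbert.parameters():
            p.requires_grad = False

    def forward(self, paragraph_inputs, description_inputs, decoder_input_ids):
        # one mean-pooled encoding per language's first paragraph
        encoder = self.mbart.get_encoder()
        paras = torch.stack(
            [encoder(**x).last_hidden_state.mean(dim=1)
             for x in paragraph_inputs], dim=1)             # (batch, langs, d)
        # fuse the per-language encodings into a single article embedding
        article, _ = self.fuse(
            self.fuse_query.expand(paras.size(0), -1, -1), paras, paras)
        # average the mBERT embeddings of the existing descriptions
        descs = torch.stack(
            [self.mbert(**x).pooler_output for x in description_inputs],
            dim=1).mean(dim=1, keepdim=True)                # (batch, 1, 768)
        fused = torch.cat([article, self.desc_proj(descs)], dim=1)
        # hand the fused representation to the (trainable) mBART decoder
        return self.mbart(
            encoder_outputs=BaseModelOutput(last_hidden_state=fused),
            decoder_input_ids=decoder_input_ids)

Freezing both encoders keeps most parameters fixed, consistent with the figures above (roughly 215M trainable out of 720M total).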
The API returns a JSON response with the following schema:

{
  lang: <language code string>,
  title: <page title string>,
  blp: <page is biography of living person? boolean>,
  prediction: [
    <recommended description string>,
    ... (up to # of beams requested)
  ]
}
Input
GET /article?lang=en&title=Frida_Kahlo&num_beams=2
Output
{
"lang": "en",
"title": "Frida_Kahlo",
"blp": False,
"num_beams": 2,
"groundtruth": "Mexican painter (1907-1954)",
"latency": {
"wikidata-info (s)": 0.15179753303527832,
"total network (s)": 0.5065276622772217,
"total (s)": 4.562705039978027
},
"features": {
"descriptions": {
"ar": "رسامة مكسيكية",
...
"vi": "họa sĩ México"
},
"first-paragraphs": {
"ar": "فريدا كالو \u200f رسامة شهيرة ولدت في أحد ضواحي كويوكان، المكسيك في 06 يوليو، 1907 وتوفيت في 13 يوليو، 1954 في نفس المدينة.",
...
"vi": "Frida Kahlo de Rivera là một họa sĩ người Mexico, người đã vẽ nhiều bức chân dung, chân dung tự họa và các tác phẩm lấy cảm hứng từ thiên nhiên và các hiện vật của Mexico. Lấy cảm hứng từ văn hóa đại chúng của đất nước, cô đã sử dụng một phong cách nghệ thuật dân gian ngây thơ để khám phá các câu hỏi về bản sắc, chủ nghĩa hậu thuộc địa, giới tính, giai cấp và chủng tộc trong xã hội Mexico. Những bức tranh của cô thường có yếu tố tự truyện mạnh mẽ và hiện thực pha trộn với tưởng tượng. Ngoài việc thuộc về phong trào Mexicayotl sau cách mạng, tìm cách xác định một bản sắc Mexico, Kahlo đã được mô tả như một nhà siêu thực hoặc hiện thực ma thuật."
}
},
"prediction": [
"Mexican painter",
"Mexican artist"
]
}
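For illustration, the same request can be issued from Python. This is a minimal sketch assuming the requests library and the Toolforge endpoint shown above; since the model is not in production, the endpoint may change:

import requests

API_URL = "https://ml-article-descriptions.toolforge.org/article"

# mirror the GET request shown above
params = {"lang": "en", "title": "Frida_Kahlo", "num_beams": 2}
response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()
result = response.json()

# predictions are suggestions for a human editor, not edits to auto-apply
for description in result["prediction"]:
    print(description)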
Data
The training data for this model consists of 100 thousand Wikipedia articles across 25 language editions. Each article is represented by its lead paragraph and short description, where they exist. The data originated from the editing activity of Wikipedia and Wikidata editors and was collected in an automated fashion. The validation and test data each consist of 10 thousand Wikipedia articles collected in the same manner.
- The split was done by Wikidata ID, maintaining the natural distribution of the languages present; all 25 languages appear in the input of the training, validation, and test sets (a sketch of such a split follows this list).
- In practice this means that, in the training data, the most commonly present languages were English (77%), German (45%), and Dutch (41%), while the least common were Gujarati (0.7%) and Sinhala (0.7%).
- The validation and test data were collected with the same pipeline and approach as the training data.
- All 25 languages are present in their inputs with a similar distribution of appearances.
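As a minimal sketch of splitting by Wikidata ID (the field name "qid" and the exact procedure are assumptions for illustration, not taken from the released code), keying the split on the Wikidata item ensures all language versions of an article land in the same split:

import random

def split_by_wikidata_id(samples, seed=0, n_val=10_000, n_test=10_000):
    # group-aware split: every sample sharing a Wikidata ID goes to the
    # same bucket, so no article leaks between train, validation, and test
    qids = sorted({s["qid"] for s in samples})
    random.Random(seed).shuffle(qids)
    val_qids = set(qids[:n_val])
    test_qids = set(qids[n_val:n_val + n_test])
    splits = {"train": [], "validation": [], "test": []}
    for s in samples:
        if s["qid"] in val_qids:
            splits["validation"].append(s)
        elif s["qid"] in test_qids:
            splits["test"].append(s)
        else:
            splits["train"].append(s)
    return splits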
Licenses
Citation
Cite this model as:
@article{sakota2022descartes,
title={Descartes: Generating short descriptions of {W}ikipedia articles},
author={Sakota, Marija and Peyrard, Maxime and West, Robert},
journal={arXiv preprint arXiv:2205.10012},
year={2022},
doi={10.1145/3543507.3583220},
url={https://arxiv.org/abs/2205.10012},
}