Requests for comment/Large-scale errors at Malagasy Wiktionary
Audit of Malagasy Wiktionary
Written by Metaknowledge, with help from Surjection, AryamanA, Erutuon, and Smashhoof, along with input from a fluent speaker of Malagasy who wishes to remain anonymous.
Bot-Jagwar is a bot account run by Jagwar. At mg.wikt, it has made 22,828,226 edits (and counting), catapulting mg.wikt to be the second-biggest Wiktionary, with a total of 6,103,961 entries (and counting). (Note that as bot edits are continuing, all these numbers will be outdated.) Jagwar has a secondary bot account, Bot-Jagwar II, which has only made 6,976 edits. Another major bot contributing to mg.wikt, making the exact same type of edit, is Ikotobaity, with 2,456,748 edits (run by Lohataona until 2017; now inactive). These three bots have created 6,076,769 new mainspace pages (and counting), which is 99.23% of all mainspace pages on mg.wikt. (Jagwar also ran bot edits on his main account, so the true number of bot-created entries is about 50,000 higher.)
In this blog post, he details the history of his bot and mg.wikt. He uses NLP and automated translation in order to generate new entries, without any human intervention or oversight. To quote Jagwar himself: "But as time passes a lot of pages get created, and even with a lot rate of error, you end up with thousands of pages of potentially wrong information." (emphasis not mine) So he knows these entries are wrong, but simply doesn't care.
The reason that no action has been taken at mg.wikt is that Jagwar is the sole admin who has made edits, and there is no active editing community. Jagwar himself has only made 6 edits in the last 90 days, of which only 3 were in mainspace. Even an editing community of the size of the biggest Wiktionary, en.wikt, would not be able to clean up after these bots by hand.
Problems with non-Malagasy entries on mg.wikt
Of the 4,953,779 (and counting) non-Malagasy entries on mg.wikt, the vast majority were created by these bots based on automatic translation from other Wiktionaries, chiefly en.wikt and fr.wikt. These translations can be wanting in various ways. Some of them have nearly correct definitions, but are missing important lexicographical information that makes the entry as a whole misleading, e.g. mg:wikt:nigger is translated as mainty, which is an adjective that simply means "black" — this is obviously problematic coverage of a highly offensive word. Others are incorrect because only one part of the entry is translated, e.g. mg:wikt:cirugía plástica (Spanish for "plastic surgery") is translated as fandidiana, which just means "surgery". Still others are incorrect because the entry was parsed incorrectly, e.g. mg:wikt:match#Espaniola (Spanish for "match", as in a sporting match) is translated as mahaleo, afokasoka, which is nonsensical — the first word means "to be equal (to), to match" and the second "match [device used to light a fire]". Here the bot was trying to hedge its bets by giving multiple, mutually exclusive interpretations of what English "match" could mean, and yet both are incorrect! Many others are not wildly wrong, but still useless, e.g. mg:wikt:duniani (Swahili locative form meaning "in/on the world") is translated as giloby, which means "globe".
Inflected forms of words in non-Malagasy languages were bot-created for various languages, including Spanish. Many of these are basically correct in their content, but the presentation is misleading at best; at mg:wikt:afilan, two definitions are given, but one points to the suffix -ar rather than the word itself, and the other uses the English word "default" in the definition, inaccurately. However, a significant portion of non-lemma entries seem to be incorrect, due to bizarre bot errors, e.g. mg:wikt:consorcíate, which tries to link to an obviously incorrect entry "sense=affirmative". There are 24,953 entries linking to "formal=n", 17,847 entries linking to "formal=y", 23,337 entries linking to "person=1", and many thousands more with similar errors.
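Link targets such as "sense=affirmative" or "formal=n" look like template parameters that leaked into the page as wikilinks. A minimal sketch of how such links might be counted in raw wikitext is given below. The regular expression, the function name `count_leaked_params`, and the sample text are illustrative assumptions; this is not the method used to obtain the counts quoted above.

```python
import re
from collections import Counter

# Hypothetical pattern: a wikilink whose target looks like "name=value",
# e.g. [[sense=affirmative]] or [[formal=n|label]].
LEAKED_PARAM_LINK = re.compile(r"\[\[\s*([a-z]+=[^\]\|\s]+)\s*(?:\|[^\]]*)?\]\]")

def count_leaked_params(wikitext: str) -> Counter:
    """Count link targets that look like leaked template parameters."""
    return Counter(m.group(1) for m in LEAKED_PARAM_LINK.finditer(wikitext))

# Made-up sample text, not copied from any real entry.
sample = "# [[sense=affirmative]] [[formal=n]] [[person=1]] ny [[consorciar]] [[formal=n|x]]"
counts = count_leaked_params(sample)
for target, n in counts.most_common():
    print(target, n)
```

Ordinary links such as `[[consorciar]]` are ignored because their target contains no `=`.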
Some entries were not created based on other Wiktionaries, but seemingly based on dictionary entries, causing bizarre errors like mg:wikt:singing traditional sakalava accompany the drum, which is claimed to be a word in French (!).
When an entry on another Wiktionary is deleted, renamed, or corrected, the copy of it made on mg.wikt is never modified, leading to yet another source of error, although likely a much smaller one. For example, the Kinyarwanda section on mg:wikt:bogobogo is incorrect, because it is based on fr:wikt:bogobogo, which was deleted earlier this year, but the bot had already created an entry on mg.wikt.
Relatively few of these entries are marked in any way for the reader to beware. Of the entries so marked, there are 406,725 entries marked as translated from en.wikt, 107,307 entries marked as translated from fr.wikt, and 119,294 entries from other Wiktionaries. The reason for this appears to be that this categorisation for entries needing to be verified, which is accompanied by a template that warns the reader that the entry has been translated, is a recent addition, as older translated entries lack it.
Quantifying the rate of error
Only a careful inspection can reveal the extent of errors, which is not possible for all the millions of entries on mg.wikt. I assessed a random subsample of 100 pages with at least one non-Malagasy lemma entry. The full list of entries with their assessments, including details on any problems, is at Small wiki audit/Malagasy Wiktionary/100. I found that 49/100 were essentially unusable, as they had serious errors or omissions. A further 29/100 were only partially usable, due to significant omissions that did not rise to the level of being outright errors. Only 22/100 appear to be fully correct and usable, of which 2 are uncertain and included to be generous. Assuming this is a representative subsample, as there is no reason not to do so, this suggests that around half of all non-Malagasy lemma entries are incorrect, and only around a fifth are fully usable (and even many of these have minor errors!). This kind of consistently low quality would be grounds for blocking if done by a human editor on any Wiktionary.
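To give a sense of how much uncertainty a 100-page sample carries, the following sketch computes approximate 95% Wilson score intervals for the three proportions reported above. It assumes the sample was drawn at random, as stated; the choice of the Wilson interval and the function name `wilson_interval` are illustrative and not part of the original audit.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Counts taken from the audit of 100 sampled pages.
audit = {"unusable": 49, "partially usable": 29, "fully usable": 22}
intervals = {label: wilson_interval(k, 100) for label, k in audit.items()}
for label, (lo, hi) in intervals.items():
    print(f"{label}: {lo:.1%} to {hi:.1%}")
```

Under these assumptions the intervals span roughly eight to ten percentage points on either side of each estimate, so "around half" and "around a fifth" are consistent with the sample without being precise figures.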
Problems with Malagasy entries on mg.wikt
There are 41,902 entries categorised as lacking any definition, most of which seem to be Malagasy entries, and around 30,000 of which are the result of the definitions being removed due to copyright violation many years ago. Although there are 1,150,182 Malagasy entries in total, most of these are inflected forms, which can generally be safely created by bots. These definitionless entries are not strictly speaking incorrect, but a definition is the most central function of a dictionary, so these entries fail to be a useful part of the dictionary as a whole.
Additionally, there are 6,319 effectively definitionless Malagasy entries not counted in the table below, like mg:wikt:matoanteny, where the word to be defined is given as the definition, instead of giving an actual definition, or mg:wikt:mahaketrona, where the definition is blank. Some cases, like mg:wikt:tamboho, have two identically duplicated Malagasy sections, each of which simply gives the word to be defined (listed twice) as the definition. This kind of entry is not even categorised as needing a definition, but is equally useless as a dictionary entry, and the duplication of section reflects the bots' inability to follow basic Wiktionary formatting.
The bot-added translation sections in Malagasy entries are also largely incorrect. For example, mg:wikt:ny#Malagasy means "the", but among the translations given are "so, her, him, them" for English, "Internet" for Afrikaans, "Herodianus, sarawakensis, bogotensis, beijingensis, herous, colon, parasceve" for Latin, "orchestra, banana, ataraxia" for Romanian, and many, many more examples of absurd mistranslation on that one entry alone.
A fluent Malagasy speaker was consulted in order to assess the correctness and grammaticality of the Malagasy used in definitions. He concurred with the basic problems identified here, and stated that some Malagasy entries, like mg:wikt:ady fom-pananana, are defined with incomplete sentences. In regard to both Malagasy and non-Malagasy entries, he said that they are "hit or miss on whether the information is useful or not", without assessing the accuracy of the information. In addition to these content issues, some bot-created Malagasy entries, like mg:wikt:navadika, may have correct content but are so misformatted that they are hardly recognisable as Wiktionary entries.
Quantifying the rate of definitionless entries
There are at least 47,379 definitionless Malagasy entries in total (along with 24,626 definitionless non-Malagasy entries). This total and the table below do not include about 1,423 Malagasy entries of a type shown by mg:wikt:ambaratonga, where the definitions are circular and therefore the dictionary provides synonyms, but the entries themselves are effectively definitionless.
Part of Speech | Entries | No Definition | Notes |
---|---|---|---|
Nouns | 71,757 | 8,036 | |
Verbs | 31,347 | 4,989 | |
Phrases | 6,637 | 2,843 | |
Adjectives | 4,146 | 687 | |
Proper nouns | 1,187 | 3 | All of these are defined as "name of person", "name of place", etc. |
Roots | 304 | 0 | All defined as root forms from verbs. |
Adverbs | 14 | 0 | |
Infixes | 3 | 0 | |
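The share of definitionless entries per part of speech can be worked out directly from the table. The sketch below does only that arithmetic; the figures are copied from the table, and the variable names are illustrative.

```python
# (entries, entries with no definition), copied from the table above
table = {
    "Nouns": (71_757, 8_036),
    "Verbs": (31_347, 4_989),
    "Phrases": (6_637, 2_843),
    "Adjectives": (4_146, 687),
    "Proper nouns": (1_187, 3),
    "Roots": (304, 0),
    "Adverbs": (14, 0),
    "Infixes": (3, 0),
}

# Fraction of entries without a definition, per part of speech
rates = {pos: missing / total for pos, (total, missing) in table.items()}
total_entries = sum(total for total, _ in table.values())
total_missing = sum(missing for _, missing in table.values())

for pos, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pos}: {rate:.1%}")
print(f"All rows: {total_missing:,} of {total_entries:,}")
```

Phrases have by far the highest share of definitionless entries. The rows sum to 16,558 definitionless entries out of 115,395, which is fewer than the 47,379 quoted above; the source does not say which entries fall outside the table.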
Recommendations
So far, no external action has been taken because despite discussions, Jagwar continues to run his bot without consequences. To quote him, "But this mass-adding content, especially in language I didn’t speak at all, seemed to annoy people that have decided to discuss about the case on MetaWiki forum. No concluding results was given, and things were as they were before." We need to change this.
I strongly recommend that all non-Malagasy entries created on mg.wikt by Bot-Jagwar, Bot-Jagwar II, Ikotobaity, and Jagwar's bot run under his own account be deleted, and all the translation sections in Malagasy entries be removed. I further strongly recommend that the owners of these bots, Jagwar and Lohataona, be warned not to use them to create more entries at any Wiktionary ever again, or else the bots will be globally blocked.
I weakly recommend that all definitionless Malagasy entries on mg.wikt created by these bots be deleted. This is not actively harming the dictionary in the same way as incorrect content, but it is lowering the signal-to-noise ratio and usefulness of the dictionary.
Further work: Problems at other Wiktionaries
Jagwar ran his bot at some other language Wiktionaries, in some cases using the same automated translations and producing questionable content that those Wiktionaries have not checked.
- 218,156 edits at chr.wikt from 2012 to 2014, almost all unedited by humans. These populate the category chr:wikt:Category:Entry to be checked, which currently contains 185,434 entries. There are no active editors at chr.wikt.
- 127,389 edits at ku.wikt from 2012 to 2013, almost all unedited by humans. The Malagasy entries here include a large number of verbs that are simply defined as the present tense of that very verb, thus lacking any actual definition, e.g. ku:wikt:mivoendre. Although not incorrect, these are essentially undefined entries.
Edits on other Wiktionaries were primarily adding Malagasy lemmas (at fr.wikt and at en.wikt, where he used his main account to run bot edits) or adding interwiki links, so no major harm seems to have been done elsewhere. However, editors at those Wiktionaries should still be advised to look over his edits, as they still contain frequent errors in definitions, part of speech assignment, and more.
Previous discussions
See:
- Talk:Wikimedia News/archive 2#Bot-dominated_wikis
- wikt:Wiktionary:Votes/bt-2012-07/User:Bot-Jagwar for bot status
- mg:wikt:User talk:Jagwar (and its archives)
Poll and discussion
Poll
Support for strong recommendation
That all non-Malagasy entries created on mg.wikt by Bot-Jagwar, Bot-Jagwar II, Ikotobaity, and Jagwar's bot run under his own account be deleted, and all the translation sections in Malagasy entries be removed.
- Support. Jagwar's laissez-faire attitude and quantity-over-quality approach bring down the Wiktionary project as a whole. We won't be able to enlist the Malagasy speakers needed to clean up this giant mess, so let's delete the junk. Ultimateria (talk) 01:06, 17 September 2020 (UTC)
- Thanks to Metaknowledge for writing this up. I strongly Support all three recommendations proposed in the audit. The Malagasy Wiktionary is not usable as a dictionary in its current form as a result of the flaws described on the project page. Many entries are so misleading or confusing that they are more like dada artworks than dictionary entries.
One rather general glitch is that several Malagasy translations of foreign-language words for "bird", such as bird, Vogel (the Dutch entry shouldn't be on this page; it is also present on the correct page vogel), pájaro, ave, birdo, avis, ucelo, böd, are often given as a string including numerous much more specific bird names. An exception is oiseau. This suggests the French word was translated independently from the other languages, probably with the other languages having been machine-translated via English, and that the step of translating from English "bird" to Malagasy contained a bug. Also mind that none of these had been cleaned up, despite Jagwar listing en-4, es-2 and eo-1 in his Babel box. This is woefully inadequate quality control. That said, more recent translations of "bird" that have been translated from the English Wiktionary do not have this problem. (Ironically, those are the ones with a warning label.) Lingo Bingo Dingo (talk) 06:59, 17 September 2020 (UTC)
- Support PUC (talk) 11:10, 17 September 2020 (UTC)
- Support Fay Freak (talk) 13:12, 17 September 2020 (UTC)
- Support Noé (talk) 14:33, 17 September 2020 (UTC)
- Support RexSueciae (talk) 14:45, 17 September 2020 (UTC)
- Support Darmo (talk) 16:18, 17 September 2020 (UTC)
- Support --Vahagn Petrosyan (talk) 17:08, 17 September 2020 (UTC)
- Support as proposer. Metaknowledge (talk) 17:35, 17 September 2020 (UTC)
- Support - I commend the audit report analysis and support this primary recommended action. AllyD (talk) 17:46, 17 September 2020 (UTC)
- Support — surjection ⟨??⟩ 17:51, 17 September 2020 (UTC)
- Support Zoozaz1 (talk) 18:09, 17 September 2020 (UTC)
- Support AryamanA (talk) 18:12, 17 September 2020 (UTC)
- Support The poor quality of Malagasy Wiktionary makes it useless, and honestly embarrassing. Smashhoof (talk) 18:15, 17 September 2020 (UTC)
- Support Kritixilithos (talk) 18:16, 17 September 2020 (UTC)
- Support Fenakhay (talk) 18:17, 17 September 2020 (UTC)
- Support Let's take out the garbage. Having nothing is better than having the wrong content. Eiríkr Útlendi │ Tala við mig 18:55, 17 September 2020 (UTC)
- Support the audit and recommendation 1 (as a one-off action only); for recommendation 2, we still need to discuss the details of enforcement. --GZWDer (talk) 19:26, 17 September 2020 (UTC)
- Support Lagrium (talk) 22:11, 17 September 2020 (UTC)
- Support Justinrleung (talk) 23:40, 17 September 2020 (UTC)
- Support This wiki should be re-written to be more useful for local Malagasy peoples, not just a machine translation playground. --Liuxinyu970226 (talk) 00:12, 18 September 2020 (UTC)
- Support — actually that is still not the maximal recommendation. The bots' edits in other wiktionaries should be purged, too. --Janwo (talk) 01:18, 18 September 2020 (UTC)
- Support I run one of the most active bots on en.wikt, and have a Ph.D. in computational linguistics, and I know for sure that what Bot-Jagwar is trying to do is simply not possible with current technology, particularly for a low-resource language like Malagasy. The likes of Google and Amazon have large numbers of employees working full time on machine translation, and whenever possible make use of huge bilingual or multilingual corpora plus tons of handcoded rules and enormous amounts of processing power. Thousands upon thousands of research papers have been written on machine translation, and you can see the results e.g. in Google Translate, which does a pretty good job translating between English and certain other major languages (e.g. French, Spanish, German, Chinese), and an OK job on some other major languages (e.g. Russian), but still a bad job on many others (e.g. Hindi, despite it being one of the top 5 most spoken languages), not to mention smaller languages like Malagasy. If Google can't do a good job, how can a single person in their spare time do so? For this reason I don't use any machine learning in any of my bot scripts, and I strongly advise against doing so. Jagwar evidently completely misunderstands the limits of current machine learning technology. Given the mass of junk thereby created, and the impossibility of manually reviewing it, there seems no alternative but deleting it all. Benwing2 (talk) 02:29, 18 September 2020 (UTC)
- @Benwing2: Who said anything about machine learning? The way the bot actually seems to work is by building a relational database of translations based on content from other Wiktionaries. Whenever a new entry is created on another Wiktionary, the bot reads it, tries to find the relevant Malagasy translation using its database, and then creates a corresponding page on mg.wiktionary. It also has capabilities for importing external wordlists from dictionaries, for creating non-lemma forms of words, and so on. There's no fancy "machine learning" here. PiRSquared17 (talk) 16:52, 19 September 2020 (UTC)
- Yes, even I was surprised by that. I use nltk, but nowhere in the blog post have I mentioned machine learning, except for another linguistic project that is currently not used on Wikibolana. Jagwar grrr... 20:45, 19 September 2020 (UTC)
- Apologies if I got this wrong. I was going by Metaknowledge's statement that the bot uses NLP and automated translation. If you are using NLTK you are using machine learning, although it depends on exactly what part of NLTK is being used. Regardless, I have looked into the Malagasy Wiktionary previously, before the poll came out, and found the results lacking, at the very least. You cannot create 6,000,000 automated entries using any method, ML or not, and expect them to be anything more than a pile of junk unless you have manual review of them, of which there is almost none. There was formerly a bot (Tbot) that produced some automated translation entries on the English Wiktionary, but there, manual review was actually feasible, and happened. What Jagwar should have done, if they have their heart set on using a bot, is produce a much smaller number of entries and recruit people to review them manually. As it is, the dictionary is simply not reliable, which makes it worse than useless, and it erodes trust in the entire Wiktionary project. Benwing2 (talk) 21:02, 19 September 2020 (UTC)
- Of the 6M entries, less than 500k are lemmata entries, the rest are form-of entries which, parsing errors aside, can be safely created and fixed by bots. I would have followed the recommendations myself and allowed a do-over, if not for the soft ban for me (and a fellow contributor now inactive for reasons irrelevant here) also in the strong recommendations. Jagwar grrr... 21:20, 19 September 2020 (UTC)
- I agree that form-of entries can generally be created by bots, **if** done very carefully. I have done this myself, in fact, although it is not easy. My scripts to generate Russian form-of entries are by themselves about 5,000 lines of Python code, and require non-trivial manual oversight. You do not seem to have taken so much care, e.g. your entry for wikt:mg:Palucéennes uses raw text and direct Wiki formatting rather than templates, and your entry for wikt:mg:Pacific silver firs doesn't link to the singular wikt:mg:Pacific silver fir, but just sticks the text in directly. Your entry for wikt:mg:Palatini has the text "endriky ny lazaina andehilahy ploraly ny teny" which Google translate renders as "plural form of masculine plural". I don't know if the cause of the ungrammatical nature of this text is Google translate's fault or yours, but in any case it's incorrect: this is actually the nominative masculine plural and the genitive masculine and neuter singular. These are just the first three examples I have looked at, and all of them have problems indicating a generally sloppy approach. As for a future ban, User:Metaknowledge wrote "Nobody is denying you a future chance to make it right: nobody is proposing that you or your bot be blocked." What this means, in my eyes, is that you have to demonstrate through discussion and consensus that your bot will do the right thing before you can implement a do-over. Essentially, you've lost the trust of the community, and it's up to you to do the hard work to regain it. Benwing2 (talk) 22:28, 19 September 2020 (UTC)
- I agree with you that the form-of entries must be tightly monitored. I wrote enwikt verb form parsers and unit tested them within my time budget, with some languages' templates being more difficult to parse than others. However, your use of Google Translate to decipher "endriky ny lazaina andehilahy ploraly ny teny" did not work. 'Lazaina' in this context means 'subject'. The definition therein is therefore correct (yes, the genitive one is missing, but does that take anything from it?), thanks. As for the ban, yes, no one has proposed to block me just yet. If it's only my bot being banned from ns:0, it just means that I'm back to the good old manual edit again, something I can still work with as of today. Jagwar grrr... 21:44, 21 September 2020 (UTC)
- Support --Minorax (talk) 03:43, 18 September 2020 (UTC)
- Support --Holder (talk) 08:36, 18 September 2020 (UTC)
- Support --Lambiam 10:34, 18 September 2020 (UTC)
- Support Allahverdi Verdizade (talk) 12:02, 18 September 2020 (UTC)
- Support — Inqilābī (talk) 14:39, 18 September 2020 (UTC)
- Support I'm not really a fan of blowing things up and starting from scratch, but the egregious scale of this issue, especially the concerns raised by two established members of the enwikt community, leave us with little choice but this. ミラP 04:32, 20 September 2020 (UTC)
- Support Sometimes, there are invalid creations made on FR which are automatically recreated on MGWIKT. When these are deleted on the French project, they aren't deleted on the Malagasy Wiktionary.
- Also, this bot can create entries before the projects it follows do! When a translation is added, the page is created on mg.wikt before the relevant page is created on your own Wiktionary! (Example: translation added, page is created on the Malagasy Wiktionary and remains unchanged today, page is finally created on FR.) Otourly (talk) 04:43, 21 September 2020 (UTC)
- Support Thank you for this exhaustive audit work. Pamputt (talk) 06:16, 21 September 2020 (UTC)
- Support --Udo T. (talk) 12:37, 21 September 2020 (UTC)
- Support – Em-mustapha User | talk 05:26, 22 September 2020 (UTC)
- Support --Betterknower (talk) 21:37, 22 September 2020 (UTC)
- Support As Otourly indicated, when a false/fake/wrong entry is created on FR, it appears on MGWIKT. There are many other cases in which I see problems (for instance, extinct languages). This is a bit like what happened to the Scots Wikipedia: a huge quantity of creations, without any supervision. Treehill (talk) 08:22, 24 September 2020 (UTC)
- Support I always wondered how Malagasy Wiktionary had so many entries. Now I know. This is a huge issue and, in my opinion, this is the only way to adequately deal with it. MSG17 (talk) 23:17, 24 September 2020 (UTC)
- Support Prof.(Mrs.) Aniebiet I. Ntui 10:56, 25 September 2020 (UTC)
- Support. Long overdue. I'm glad something is being done about this. By the way, there are similar problems at the Cherokee Wiktionary, with the added problem that Jagwar apparently doesn't even speak Cherokee. Mx. Granger (talk) 05:45, 26 September 2020 (UTC)
- Support Chris Troutman (talk) 20:01, 26 September 2020 (UTC)
- Support. Malagasy Wiktionary is not only unusable as a dictionary, as is shown in this audit, but also is in danger of never really getting any active community. I believe that not many people would like to work in a project where almost 100% of entries are in urgent need to be reviewed and rewritten by a human to be usable. And there’s millions of them. That’s highly demotivating to say the least. Meiræ 20:05, 26 September 2020 (UTC)
- Support - Wiktionary is not a place to post poor machine translated definitions en masse in order to boost numbers. We host these projects for readers, not those that edit them. This is of no use to native speakers at all. --IWI (talk) 02:53, 29 September 2020 (UTC)
- Support -ArdiPras95 (talk) 12:44, 11 October 2020 (UTC)
- Support - Thank you for taking the initiative on this. If a community of Malagasy speakers ever does form, they shouldn't inherit this enormous burden -- they should be free to start building from the ground up.__Gamren (talk) 14:34, 16 October 2020 (UTC)
- Support With the hope that such a mess would not repeat again, and that will mark a turning point in taking at heart our common mission in collecting knowledge. --Sannita - not just another it.wiki sysop 18:54, 20 December 2020 (UTC)
Opposition to strong recommendation
- Oppose, I feel growing the Malagasy Wiktionary community ought to be our priority --MarcoSwart (talk) 10:28, 25 September 2020 (UTC)
Support for weak recommendation
That all definitionless Malagasy entries on mg.wikt created by these bots be deleted
- Support as proposer. Metaknowledge (talk) 17:35, 17 September 2020 (UTC)
- Support. The current number of definitionless entries is not justifiable. Lingo Bingo Dingo (talk) 17:41, 17 September 2020 (UTC)
- Support. There's really no good way to fix this short of deleting and stopping the automated processes. I wish some other Malagasy speakers were available for input, but unfortunately we do not have them. AryamanA (talk) 17:44, 17 September 2020 (UTC)
- Support - I commend the audit report analysis and support this secondary tidy-up recommendation. AllyD (talk) 17:46, 17 September 2020 (UTC)
- Support — surjection ⟨??⟩ 17:51, 17 September 2020 (UTC)
- Support Zoozaz1 (talk) 18:09, 17 September 2020 (UTC)
- Support There's no reason to keep definitionless entries around. Smashhoof (talk) 18:15, 17 September 2020 (UTC)
- Support Kritixilithos (talk) 18:16, 17 September 2020 (UTC)
- Support Fenakhay (talk) 18:21, 17 September 2020 (UTC)
- Support Ditto the above comments. Baffling to me that such an obviously unhelpful situation has been allowed to persist for so long. Kudos to Metaknowledge for putting this together. Eiríkr Útlendi │ Tala við mig 18:55, 17 September 2020 (UTC)
- Support I don't see the point of keeping definitionless entries unless they are quickly being defined by users (which does not seem to be happening here, or ever). Human-potato hybrid (talk) 23:28, 17 September 2020 (UTC)
- Support Justinrleung (talk) 23:40, 17 September 2020 (UTC)
- Support --Vahagn Petrosyan (talk) 04:46, 18 September 2020 (UTC)
- Support — Inqilābī (talk) 14:41, 18 September 2020 (UTC)
- Support - but with a modicum of gratitude to @Jagwar: for forcing all of us to focus on the legitimate question of how to go about using bots and machine learning properly and how not to do so. As time goes on and technology proceeds this issue will only become more pressing and we need to develop better answers and strategies in this area. Jcwf (talk) 19:40, 20 September 2020 (UTC)
- Support I think it is better to clean everything so that it is easier to restart. Pamputt (talk) 06:17, 21 September 2020 (UTC)
- Support --Udo T. (talk) 12:37, 21 September 2020 (UTC)
- Support as per Smashhoof. MSG17 (talk) 23:18, 24 September 2020 (UTC)
- Support Treehill (talk) 13:27, 25 September 2020 (UTC)
- Support Mx. Granger (talk) 05:45, 26 September 2020 (UTC)
- Support We should have never allowed bots to make pages, at all. Chris Troutman (talk) 19:59, 26 September 2020 (UTC)
- Support. And why were they created in the first place? If you do not have definitions for the words that you are putting in, then you are not creating a dictionary. Maybe a database of some sort, but not a dictionary. Meiræ 20:21, 26 September 2020 (UTC)
- Support As per my comment above. --Sannita - not just another it.wiki sysop 18:55, 20 December 2020 (UTC)
Opposition to weak recommendation
- Oppose, I feel growing the Malagasy Wiktionary community ought to be our priority --MarcoSwart (talk) 10:28, 25 September 2020 (UTC)
Comments
I am not sure that such a sweeping proposal should be made outside of RFC. --Rschen7754 18:12, 17 September 2020 (UTC)
- Rschen7754 My issue is that RFCs do not always lead to outcomes. Do you want to fix known problems, or do you want to create more bureaucratic hoops to jump through? Jagwar knows that bureaucracy is MetaWiki's weakness; look at what he said: "But this mass-adding content, especially in language I didn’t speak at all, seemed to annoy people that have decided to discuss about the case on MetaWiki forum. No concluding results was given, and things were as they were before." Metaknowledge (talk) 18:20, 17 September 2020 (UTC)
- I get that - I really do. But I am also concerned about being fair to everyone involved and providing enough visibility to the discussion and I am not sure that such a discussion on a fairly newly created community forum does that. --Rschen7754 18:24, 17 September 2020 (UTC)
- I just put this discussion on the meta main page announcements and on the Malagasy Wiktionary for more visibility, so hopefully that alleviates some of your concerns. Zoozaz1 (talk) 18:27, 17 September 2020 (UTC)
- Thanks, Zoozaz. Rschen, who are we not being fair to? Jagwar is the only active editor at mg.wikt, and he has been pinged. Reading Requests for comment/How to improve RfC Process should be evidence enough for how bad an idea an RFC would be. It's an RFC with no end or solution in sight, and its raison d'être is to solve the problem of RFCs having no end or solution in sight! Metaknowledge (talk) 18:29, 17 September 2020 (UTC)
- Even if all we get back is the sound of crickets, it is general procedure that we post at that wiki to make sure that all potential editors see the discussion. My concern also is with stewards/global sysops accepting the outcome since they would be required to do the deletions. --Rschen7754 18:33, 17 September 2020 (UTC)
- @Zoozaz1: Could you post this somewhere visible on mg.wikipedia too? Also, I agree with Rschen7754 that an RfC would be the usual venue to discuss solutions, but it's also true that RfCs often go nowhere. Maybe this discussion could be moved to an RfC subpage if people think it really matters. PiRSquared17 (talk) 18:33, 17 September 2020 (UTC)
- I've posted a notice on Malagasy Wikipedia here. Zoozaz1 (talk) 18:42, 17 September 2020 (UTC)
- Would anyone object to me or another GS/steward putting a bilingual notice on mg:wikt:MediaWiki:Sitenotice saying "There is a request for comments that proposes the deletion of many entries" / "Il y a un appel à commentaires qui propose de supprimer beaucoup de pages." (I'd put it in Malagasy too, but I don't speak it.) PiRSquared17 (talk) 21:46, 17 September 2020 (UTC)
- This is basically an RFC already, I think. It just is on a different subpage than usual. --MF-W 18:46, 17 September 2020 (UTC)
Polls are evil. At this point it's probably not helpful if more support votes get piled on. I think what would be best now is for User:Jagwar to make a statement. --MF-W 18:52, 17 September 2020 (UTC)
- Wiktionaries are often run by poll, so this is what we're accustomed to. I think that essay is silly, and you shouldn't be surprised that Wiktionary culture is evident here. In any case, I'd like a statement from Jagwar as well, but you can't will him into talking if he doesn't want to. Metaknowledge (talk) 19:05, 17 September 2020 (UTC)
- His last edits were on 7 September, so maybe he hasn't seen it yet. --MF-W 19:19, 17 September 2020 (UTC)
- MF-Warburg Jagwar made a statement below. What do you think? Metaknowledge (talk) 07:00, 18 September 2020 (UTC)
For obvious reasons, I won't cast a vote.
I won't cast a vote, but maybe I should, though I'm not sure where to put it. To oppose would make me a target, probably ostracise me for being stubborn (if that is not already the case), and get me harassed; to support would make me at best someone reasonable but would effectively get me barred from using my bot ad vitam aeternam, effective on all Wiktionaries and on all namespaces (which I don't want, and find unfair). If we still literally follow the recommendations, any infringement would lead to a forced retirement from all Wikimedia projects. The only reason we get to this point is that people find it annoying to have one Wiktionary with 6M bot-created entries with a significant percentage of "low-quality content", so that an audit has been made. So in the end, the numbers actually matter.
Machine translation is hard, there is no denying it. I can assure you that all errors, by bot or manual, were made in good faith. The cumulative time spent developing the bot scripts that currently run can be counted in years. That was not full-time work, I confess, but it's still years nonetheless, of trial and error. Keeping it working with the ever-evolving formatting on Wiktionaries takes time, now a scarce resource for me. But I do it because I want to. I do it within my means. Even before the use of bots to massively create entries, the emails I was getting had nothing constructive, and I'm not even mentioning this audit.
As there is no interest, neither for me nor the community, to run a counter-audit to try to counter all of Metaknowledge's statements made on the main page, I take full responsibility for all the laissez-faire at the Malagasy Wiktionary for all the previous years. It sometimes makes sense to start over, as I did in 2011 with the deletion of half a million Volapük verb forms. Maybe I was wrong, as I was developing the first version of the automatic translation engine at about the same time. Here you are talking about wiping 4.9M pages off the Malagasy Wiktionary, rendering it even more useless than you think it is, while also now denying me any future chance to make it right. Manual edits could be an option, but health problems also would follow.
Jagwar grrr... 22:38, 17 September 2020 (UTC)
- Thank you for your statement. Nobody is denying you a future chance to make it right: nobody is proposing that you or your bot be blocked. In fact, that is what I would like the best. If you truly want to make it right, you'll stop running your bot right now, and start thinking about how you can make good-quality entries instead. Metaknowledge (talk) 23:00, 17 September 2020 (UTC)
- What should I understand from this recommendation of yours, then? "I further strongly recommend that the owners of these bots, Jagwar and Lohataona, be warned not to use them to create more entries at any Wiktionary ever again, or else the bots will be globally blocked."
- I've basically given up my bot flag on pretty much any other Wiktionary to focus on mg.wiktionary. I can't speak for Lohataona but the last recommendation sounds like a serious threat to me. I think about how I can make good-quality entries pretty much every day. I've already stopped filling translation sections in search of a better solution, blacklisted a huge bunch of languages for translation, tried to parse the etymology section, added examples (in French and English) and tried to add transcription and IPA (Bulgarian, Russian, Japanese, Korean, Vietnamese), with more or less success. Jagwar grrr... 23:23, 17 September 2020 (UTC)
- Have you considered that good-quality entries can't be reliably created by bot with your methods? Moreover, there is no automated way to assess how bad your bot's entries are, but my survey suggests that about half of the lemma entries are wrong or unusable. If you find this threatening, you should reconsider how much you value quantity over quality. Metaknowledge (talk) 23:57, 17 September 2020 (UTC)
- These bot edits are, to put it bluntly, useless. Bad translations don't benefit anyone. I taught translation at university level, and I know enough about automated translation to say that it is pretentious to think that a single person can create a translation bot (even less so for languages they do not speak fluently) that produces reliable output which can be published without being passed through a human's brain for checking. Therefore I suggest that any more runs of these bots should only be allowed if there is proof of (1) more sensible (correct) output and (2) better quality control by someone with adequate skills in both the source and the target languages. --Janwo (talk) 01:18, 18 September 2020 (UTC)
- I don't agree with your assertion that deleting these entries will make the Wiktionary "even more useless"; it's better to have a small number of correct translations than millions of incorrect ones. In fact, having millions of incorrect translations actually makes the project almost entirely useless to a reader, when compared to having a wiki with only a few entries that are correct. Quality should go over quantity. --IWI (talk) 03:21, 29 September 2020 (UTC)
Case study: Ukrainian coverage at Malagasy Wiktionary
- @Jagwar: Honestly I wanted to give you a try. As a native speaker of Ukrainian and an advanced speaker of French (there are many Malagasy-French dictionaries, I used mostly http://motmalgache.org/) I tried to check the most basic Ukrainian words: there are no Malagasy<>Ukrainian dictionaries so perhaps you made something at least somewhat useful? The result was a complete disaster:
- The first word I chose was mg:wikt:є. It is a singular present form of the verb to be (I am a user, this is a page). The translation was mg:wikt:varavarana which is a door or a window opening (I don't even know how you can get it). The reverse translation брама, двері is however more or less correct (both mean a door opening), and these two entries are translated more or less correctly.
- The second word was mg:wikt:іти (to go). The translation was mg:wikt:tsena, which normally means market. I checked French Wiktionary, and it said marche (a march) instead of marché (a market). Obviously the French page was also created by your bot, making it useless. Same for English en:wikt:tsena where you even created the page manually, and still wrong. This error seemingly moved to smaller Wiktionaries as they all seem to think that tsena means some alignment of people marching and not a market. Again, the reverse translation in mg:wikt:tsena базар, ринок is very accurate. These two words are however not translated correctly: mg:wikt:базар is a tsena but not a fihaonana (this word is not really used for a meeting place), mg:wikt:ринок is completely wrong (the word tsuka does not seem to exist in Malagasy at all).
- The third one was the verb to eat, which is normally mg:wikt:їсти but there are two extra pages mg:wikt:ї́сти (a useless stress but with a Cherokee interwiki) and mg:wikt:істи (misspelled). Somehow they have three different translations: the correct one is misakafo in mg:wikt:ї́сти, the one in mg:wikt:істи is incorrect (it cannot be used as a noun like fihinanana but only as a verb), and the entry with the correct spelling mg:wikt:їсти gives various versions, some of them are right and some are wrong: it is clearly not manao ladina an-tany (no notion of ground, should be опускатися instead), not milaoka (no notion of what is being eaten) and it is not exactly miatatra, mibosibosika or mifotampotana (no notion of eating ravenously, should be жерти instead). Overall translations are more or less ok but not fully usable.
- I decided to go to the end of the alphabet with mg:wikt:я (I): it can be aho, iaho or izaho but not je (probably a trace of French translation).
- For the word egg there are two entries again: mg:wikt:яйце and mg:wikt:яйце́ (again with an extra stress). The first one is correct: it is indeed atody, but the second one is completely wrong, it cannot be baolina (this word is never used for a ball, although it is used as an equivalent of the English slang word balls for male sexual organs).
- Finally, I went for як (how, or a yak as an animal). No evidence of the second meaning which is fine, but the pronoun is wrong: it is not hoatrinona as it never means how much (скільки is used instead)
- Unfortunately this is not something we can recommend to our readers. Too many misspellings, too many translations that are completely wrong, and anyone trying to use it to speak some sort of Ukrainian will in reality speak gibberish. Even if machine translations have errors, this seems to be a very poor implementation: firstly, errors are accumulated at every step of the process (a French marché became marche and this error spread to all translations), secondly, there was no cross-check (I cannot understand how mg:wikt:tsena has correct Malagasy>Ukrainian translations, but at the same time Ukrainian>Malagasy translations are wrong), thirdly, there were no checks for misspellings (too frequently there are two or more entries for the same word). All of these could have greatly reduced the problem.
- The worst thing is that these errors migrated to other Wiktionaries: there seem to be many similarly wrong articles in the Cherokee Wiktionary (which probably needs cleanup as well) and Malagasy entries in the French and English Wiktionaries seem to be also wrong (pinging @Noé: as an action in French Wiktionary is probably needed as well). I understand that you have good intentions but the result is disastrous, and honestly we cannot continue like nothing happened. Unless you come up with a clear strategy of fixing it (we are talking of deletion of thousands of misspellings, and fixing perhaps a million of wrong translations) these pages simply cannot stay — NickK (talk) 11:38, 2 October 2020 (UTC)
- Hi, NickK,
- I think this exploration is better in a subthread. Feel free to merge it again if you prefer. For actions in French Wiktionary, I prefer to wait for the Malagasy Wiktionary to be cleaned by bot first, and then, without any haste, we will discuss any issue that may appear in French Wiktionary. Noé (talk) 15:23, 2 October 2020 (UTC)
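The misspelling check mentioned in this case study (duplicate entries that differ only by a stress mark, such as яйце and яйце́) could be approximated automatically. The following is only a minimal sketch, not the method used by any of the bots or by the audit: it assumes a plain list of page titles is available, assumes the stress mark is encoded as the combining acute accent U+0301, and the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Assumption: dictionary-style stress is written with the combining acute accent.
COMBINING_ACUTE = "\u0301"


def strip_stress(title):
    """Return the title without combining acute accents, in NFC form."""
    return unicodedata.normalize("NFC", title.replace(COMBINING_ACUTE, ""))


def group_stress_duplicates(titles):
    """Group titles that become identical once stress marks are removed.

    Only groups with more than one member are returned, keyed by the
    unstressed spelling; members keep their input order.
    """
    groups = defaultdict(list)
    for title in titles:
        groups[strip_stress(title)].append(title)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    sample = ["яйце", "яйце\u0301", "їсти", "ї\u0301сти", "істи", "як"]
    for key, members in group_stress_duplicates(sample).items():
        print(key, "<-", members)
```

A check like this only finds stress-mark duplicates; a genuine misspelling such as істи for їсти differs in a base letter and would still need a reference word list or human review.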
Case study: Volapük coverage at Malagasy Wiktionary
First of all I want to thank Metaknowledge again for the extensive and detailed audit, which I think gives an excellent general description of the problems at the Malagasy Wiktionary, and for the recommendations, all of which I support. Now I'd like to specifically discuss the coverage of Volapük on the Malagasy edition, as it is the second largest language on this Wiktionary edition. Volapük is a constructed language and one of the oldest artificial international auxiliary languages. It has fewer than 100 fluent speakers. It was first published in 1879 by Johann Martin Schleyer and a reform by Arie de Jong, enacted by Albert Sleumer in 1931, introduced a modified version, Volapük Nulik. The original version became known as Volapük Rigik and fell out of use. Volapük coverage on editions of Wiktionary usually focuses on Volapük Nulik.
- Malagasy (over 1,150,000 entries) and Volapük (almost 1,143,000 entries) are by far the largest languages and the only ones with more than 1 million entries each. Each constitutes more than one sixth of the Wiktionary in terms of entries. But the Volapük entries include nearly 1,087,000 conjugated forms, contrasting with only 15,000 noun lemmas and 6,000 verb lemmas. Some other rather large languages are Latin (almost 611,000 entries), Spanish (more than 465,000), Italian (409,000) and Russian (almost 347,000). What these languages have in common, apart from that they are European (four are Indo-European, one is an atypical Euroclone), is that they all have rich verbal morphologies. We have already seen the potential of such morphologically rich languages with the Volapük verb forms that make up 95% of all Volapük entries on mg.wikt. Take note of what Jagwar wrote on his blog in 2013: "In 2011, I got mad: after discovering the astonishing easiness of Volapük, I wrote a script to upload the word forms of that language. At full speed – i.e around 50,000 edits per day – three weeks were required to make the Malagasy Wiktionary the third biggest Wiktionary of the world." I strongly suspect that these other large foreign languages have also been used to pad the Malagasy Wiktionary with low-effort non-lemma entries, though I believe that there was never any intention to make it the largest edition of Wiktionary.
- One problem that you will encounter if you try to mass-create machine-translated Volapük lemmas is the absence of online translation tools for Volapük, so it will be difficult to add definitions. There are indeed a lot of undefined Volapük lemmas on the Malagasy Wiktionary; what is curious about them is that they do often have a German definition and sometimes a translation table. These entries were copied from the Volapük Wiktionary, which by the way has few definitions in Volapük, and ultimately derive from Arie de Jong's Wörterbuch der Weltsprache. I suspect there may be more than 10,000 of them. Using an imprecise search method, there appear to be around 14,676 entries copied from Volapük Wiktionary (not all actually have German translations); this very closely matches the 14,678 entries that need definition (alternative query with the same result); one entry excluded in the first search but included in the second is nedib. If that figure is about correct, more than half of the Volapük lemmas may not have a Malagasy definition but either a German definition or no definition at all. The copying from the Volapük Wiktionary has been executed in a very crude way: if the Volapük Wiktionary indicates that a lemma was absent in the original edition but present in a subsequent edition of the Wörterbuch der Weltsprache, the Malagasy Wiktionary only shows "N.D."; if the relevant section on the Volapük Wiktionary entry is empty, the bot still copied the empty section to the Malagasy version. (I am not sure about the copyright status of those entries, though I note that the 2012 Evertype edition does acknowledge De Jong's estate; De Jong died in 1957.) Anyway, what this effectively means is that the Malagasy Wiktionary contains what presumably is the third-largest Volapük-German translating dictionary in history. How this is ever going to be useful to a Malagasy speaker who is not proficient in German is something I don't know.
- Jagwar subsequently used his bot to create non-lemma forms. In some cases these may have been taken from another version of Wiktionary or an external source, but in many cases it seems he has used his own understanding of Volapük grammar to have them automatically generated, because there are strange errors and anomalies. One is that he has let his bot create thousands of active present-tense forms beginning with a-; while this may not strictly be an error if one wants to be charitable, in practice these verb forms are almost never used. On the blog it is stated that there once were more erroneous forms, because many conjugated forms were in fact derived from nouns: "But its repercussion on article count was not visible due to the mass deletion of Volapük language entries. Why this mass deletion? Because many entries seemed to be wrong as they are not conjugation of verbs, but nouns (-.-‘), so the decision is taken to delete them all to re-create them later, with a better quality if possible." While the deletion is commendable and demonstrates a modicum of quality control, it is also evidence that the bot operator had insufficient command of the language for the mass creation of entries without introducing many errors.
- What is certainly a serious error is that sometimes the results are a mixture of Volapük Rigik and Volapük Nulik. Abadereigonöx is an example of this, the verb reigön was not used in Volapük Rigik, while the -öx imperative was abolished in Arie de Jong's reform. In some cases there is a hyphen between the pronominal ending and the modal suffix that does not belong there.
- There are also Volapük lemmas on the Malagasy Wiktionary that do have definitions, probably based on the French and English Wiktionaries; these definitions suffer from the same flaws that Metaknowledge described above. So ab, "but", is translated as "maize, Indian corn", perhaps due to an error involving French maïs and mais; and fut, actually "foot", is rendered as "to scratch the ground, to paw the ground". This kind of content only belongs in a parody dictionary.
To reiterate, Volapük is a very large language on the Malagasy Wiktionary, but this is due to a huge number of non-lemma entries that were used to arbitrarily inflate this Wiktionary, as there are relatively few lemma entries and a substantial number of these lemmas may not have any definition in Malagasy. There are besides serious errors in both lemma and non-lemma entries, including errors in translation and morphological errors. These are grounds to support the deletion of non-Malagasy bot-created entries and the deletion of definitionless entries, including those in foreign languages, as a remedy for the current problems. I also support the warning against adding more bot-created entries of the types we have seen, to prevent the users in question from recreating this situation. Finally, I would like to raise the subject of whether more oversight is needed for the mass creation of bot-generated entries in small Wiktionaries, in particular when they involve machine-translation; normally vibrant communities can regulate and police bot edits, but currently there is no supervision if the communities are too small to function in this way. Lingo Bingo Dingo (talk) 17:48, 20 September 2020 (UTC)
- Thank you, @Lingo Bingo Dingo, for your additions. You raise some important points that should probably also be discussed on a general level, not only with respect to Malagasy Wiktionary. These large-scale bot "translations" are doing no good and should be discouraged.--Janwo (talk) 01:23, 21 September 2020 (UTC)
- Maybe a good solution would not involve the massive import of bot-generated content as plain articles, but only as complementary results, properly tagged and visibly identified as being generated. This could be a good goal for Abstract Wikipedia (but not before next year), which will generate some automated responses that will remain identifiable as such by readers, not to be confused with normally edited (and vetted) wiki articles or dictionary entries. I see no interest in externally generated bot content imported "as is". This bot should rather be part of the wikis' internal features, ruled and tweaked by the normal community. Such tools can only be supplemental "helpers", not plain content. We need to clearly identify the source of these generations just like all other content made by anyone, and should not need to look at the article history to decipher it. People who want to look for this generated content should be guided differently, with other tools/links/namespaces (possibly templates, modules, or references to external tools or databases, including Wikidata, whose imports in other wikis are clearly identified or vetted more scrupulously and more precisely in dedicated boxes or sections). Don't mix everything at the same level. verdy_p (talk) 16:43, 16 October 2020 (UTC)
Next steps
@MF-Warburg, PiRSquared17, Holder, and Martin Urbanec: It is now time to determine our next steps. This page demonstrates a clear consensus in favour of carrying out the recommendations of the audit. Jagwar seems to accept this consensus to some degree, and said above that if he cannot use his bot, "I'm back to the good old manual edit again, something I can still work with as of today", but he is still running his bot as we speak, creating more low-quality and incorrect entries.
Obviously, the first step is that Jagwar needs to stop running his bot. Secondly, someone needs to clean up the mess. Surjection, an admin and bureaucrat at en.wikt who generated many of the lists used in the audit, has volunteered to do the mass deletions. A steward could make him an admin at mg.wikt for a short period of time (say, a month). Alternatively, Surjection could simply provide a full list of entries needing to be deleted to a global sysop, who could run a deletion script in his stead, but we would need a GS to volunteer for this job. How would you like to proceed? Metaknowledge (talk) 23:19, 21 September 2020 (UTC)
- I'm not familiar with the handling of such deletion scripts. --Holder (talk) 04:28, 22 September 2020 (UTC)
- It's not just deletion that needs to be handled, but also some sections from pages (not all) have to be deleted, including bot-created non-Malagasy sections from entries (which may also have a Malagasy entry in some cases) as well as translation sections. I would however wager that deleting entire pages is probably 95% of the job already. — surjection ⟨??⟩ 07:06, 22 September 2020 (UTC)
- You're right, I should've mentioned that part, but it doesn't require any special user rights, so you could still take on that part of the task regardless of who carries out the page deletions. Metaknowledge (talk) 15:28, 22 September 2020 (UTC)
- It certainly wouldn't be technically difficult to use something like pywikibot's delete.py or Twinkle's batch delete to remove all the pages at once, assuming that lists of pages to be deleted already exist. I'd like to leave how to proceed up to stewards and/or LangCom members, who should probably be the ones to close this RfC. PiRSquared17 (talk) 18:53, 23 September 2020 (UTC)
@Jon Kolbert, MusikAnimal, RadiX, and Trijnstel: Again, let me present the quote from Jagwar that I presented in the audit: "But this mass-adding content, especially in language I didn’t speak at all, seemed to annoy people that have decided to discuss about the case on MetaWiki forum. No concluding results was given, and things were as they were before." It doesn't have to be the same way this time. I have pinged a few stewards so we can have a productive conversation about solutions. Metaknowledge (talk) 17:09, 23 September 2020 (UTC)
- This topic is an interesting subject as the new Abstract Wikipedia project will attempt to create automatically translated content. We should learn the lessons from automated translation, and should have an audit of the rules that this bot used and why they are insufficient, because this is exactly what we'll need for the new project (starting next year after the initial development of the Wiki of functions, which will implement the architecture and repository needed to host the translation rules, inside what are currently named renderers). So even if this bot is stopped now, given the huge number of submissions that were made in MG.WP (and some other Wikipedias for minority languages, which are exactly in the goals of Abstract Wikipedia, some of them having accepted such data), we should ask the bot's owners to release their code as open source, so we can know the rules that were implemented. We will then avoid the same errors, and we can also learn why even the content generated by Abstract Wikipedia should be auditable and fixable in each wiki: we DO NEED a visible distinction for these edits, so they can be corrected at any time by manual edits. These automated submissions should be considered like "stubs" and marked visibly, so we can make the distinction easily and allow local overrides that a bot would not rewrite blindly, i.e. a way to make both coexist peacefully without causing massive trouble for regular users.
- So just banning or blocking will not be sufficient; we must learn from it, and we need the cooperation of @Jagwar: (if he refuses, then we'll have no other reasonable choice than to block his bot account(s) and only allow him to act as a regular user with his personal account, and then find a way to clean up the garbage, which is already a tremendous task for regular users: can we really automate these deletions/reverts safely?). verdy_p (talk) 20:15, 2 October 2020 (UTC)
- I think that the lesson is that automated translation can't pull as much weight as we would like, and at least for now, human oversight is necessary every time prose is presented. But that would imply in turn that Abstract Wikipedia is a bad idea, so I doubt such a lesson would be well received by the people who most need to learn from it. Metaknowledge (talk) 22:16, 2 October 2020 (UTC)
- @Verdy p: The bot is already open-source and Abstract Wikipedia is not about parsing human language and translating it. Abstract Wikipedia is about reading from an article specification, from the beginning written in machine-readable format, and then by using a set of rules generating an article. The only "translation" is from that format to human language. -- Guherto (talk) 16:11, 6 October 2020 (UTC)
- I agree that a normal translation is more complex because it involves parsing the source language into an abstract form. But Abstract Wikipedia is there to create and host that abstract form and then use it to generate the actual language (the second part of translation, which also involves semantic analysis in order to "project" the abstract content into the target language), and this is not easy at all. The project scope is supposed to allow translating a complete article, at least paragraph by paragraph and not necessarily sentence by sentence, as some source sentences can only be formalized into several abstract sentences, and a single abstract sentence may still require creating multiple separate sentences. The various sentences will also need to avoid repetitions, and there are common things like pronouns, adverbs or particles, or sometimes only a match of grammatical gender/type/case, needed to derive terms and agglutinate them correctly: this is really a translation even if the source language is abstract.
And after all your bot for Wiktionary was also supposed to create translated definitions. It had more errors because of a lack of contextual analysis of the source language, notably when that context is very difficult to determine, as in English, where there are very few hints to guess the abstract content correctly (same form for nouns and for verbs in any grammatical tense, omissions of many semantic prepositions, lack of marking for subjects). It is in fact more difficult to guess from English than from French, which provides many more hints, and it is well known that English is a very ambiguous language that permits lots of interpretations. That is why common English is replaced by a more formal jargon in legal texts that even native English speakers can't understand, and why so many English texts have to redefine the concepts they will use further down, even if that meaning is the most common one in legal contexts; a bot cannot easily guess that context, which may only be inferred from external texts not seen by the translating engine. verdy_p (talk) 16:26, 6 October 2020 (UTC)
- Okay, but to clear up any confusion, the bot isn't mine. -- Guherto (talk) 20:50, 9 October 2020 (UTC)
- I don't know why you think I was speaking about you. I just commented on the general topic, because this is a larger problem that is highly related to an important project approved and publicly announced by the WMF, that will use automatic text generators (half of the part of an automatic translator, which also requires a scanner to infer an abstract content, but that was already using abstract data inside the source Wiktionaries coming from their extensive use of templates). I know that the problematic bot discussed here was operated by user Jagwar (cited since the start of this page), and I've not implied there was another bot. verdy_p (talk) 14:16, 10 October 2020 (UTC)
Seems more like a prosecutorial opinion than an audit
I am not sure how many of you have undertaken investigations or audits in the real world, but this seems to
- Where is your scope?
- Where is your report based on the criteria for undertaking an audit?
- Where is your summary of findings?
Then generally an audit report is written and given to the body that is audited and their comment sought. Has that happened in this case? Has a summary of the report been written and given to the local community? Have the findings been translated into the local language?
With the feedback and the cooperation of the community some recommendations can be written, with corrective actions, a time frame and who is responsible. I don't see any of that here. Am I missing what is an audit?
To me this more looks like a w:kangaroo court, some "gotcha" and outside imposition of others opinion. — billinghurst sDrewth 00:15, 25 September 2020 (UTC)
- I have done some further reading of the process for this audit, and I cannot be more depressed. The self-appointed auditor went into this process with preconceived opinions of what was the problem and what needed doing, what level of procedural fairness exists to get a fair audit report? This report should be struck. — billinghurst sDrewth 00:39, 25 September 2020 (UTC)
- This overlaps with what you wrote, but my main concern is that no alternative courses of action were ever considered. Maybe mass deletion of entries is inevitable as part of any solution to the mg.wikt issue, but that was just taken for granted. Even if mass deletion is the way to go, could a more limited subset of entries be identified as likely to be flawed? Etc. This would have involved engaging with the local community and global community in developing proposals for actions, instead of the authors of the report coming up with two specific ideas and then starting a poll on them. PiRSquared17 (talk) 01:35, 25 September 2020 (UTC)
- @Billinghurst and PiRSquared17: I see that you have some problems with my approach, which might be related to the use of the word "audit". I never intended for this to mirror the kind of audit made in corporate or governmental contexts; you can substitute a term like "quality assessment" instead. I am unsure as to whether you read the report in its entirety, as you seem to have missed that Jagwar is the sole active member of the community, and he has already weighed in above. In fact, you have not engaged with any of the facts I presented, but rather your interpretation of the "procedural fairness". Do you care about the problems at mg.wikt? Metaknowledge (talk) 04:59, 25 September 2020 (UTC)
- So you are confirming that it is a prosecutorial process where what has been proposed has been re-imagined by you? Those have previously been RFCs where one is able to present your case. If you are going to pick a small component of a wiki, then say that and take it through an RFC, do not hide it behind a supposed neutral audit process.
Otherwise, duty of care and due process are now absent in how we do things around here now? I have addressed my various responses to the various components in the respective places. — billinghurst sDrewth 05:28, 25 September 2020 (UTC)
- @Billinghurst: I have never claimed to be neutral; we are all human, and true neutrality is impossible. However, I do claim that I have represented the facts faithfully, and you have not challenged the correctness of any fact I have presented. I ask you again: do you care about the problems at mg.wikt? Metaknowledge (talk) 06:08, 25 September 2020 (UTC)
- @Metaknowledge: Your case belongs in a standard RFC, not (mis)represented as an audit. Not presented under the banner of these small wiki audits. Please take it to that normal process. — billinghurst sDrewth 10:17, 25 September 2020 (UTC)
- @Billinghurst: Why are you incapable of answering my question? I know you care about Meta bureaucracy, and I'm willing to jump through whatever hoops you'd like. But I can't tell whether following your instructions will merely be a waste of my time, and it will be determined by whether or not the community on Meta cares about small wikis like mg.wikt. I ask you for the third time: do you care about the problems at mg.wikt? Metaknowledge (talk) 16:57, 25 September 2020 (UTC)
- This is not about what you call "meta bureaucracy". This is about a fair and equitable process for all WMF communities, be they big or small. It is also about a process for those who hold global roles, and an authority to act.
This analysis of yours belongs in an RFC, so please take it there. — billinghurst sDrewth 23:27, 25 September 2020 (UTC)
- @PiRSquared17: It is true that alternative proposals were not raised before Marco Swart's plan, although I privately did consider whether less severe measures could achieve what is necessary. However, I did and do not believe these are going to be adequate. First of all, there is no editing community to speak of at the Malagasy Wiktionary, so manual review at a large scale is impracticable. Both errors in form (as explained by Otourly and Treehill for the French Wiktionary, and by me for Volapük at the Malagasy Wiktionary) and errors in translation have been created by the bot; errors of the latter type are created routinely. Neither type of error seems predictable enough that a bot could identify them without an external reference. And if I look at the Malagasy Wiktionary's Volapük entries, which I consider the elephant in the room, then I simply do not think any partial solution will be sufficient. What kind of reference work could a bot use to evaluate the quality of Volapük-to-Malagasy translations? How many of the Volapükists in the world are going to learn Malagasy, or how many Malagasy speakers will learn Volapük? The Volapük lemmas without Malagasy definitions and the incorrect forms will have to be deleted anyway. As far as I can tell, only the reversal of the bot edits in Volapük or the complete deletion of Volapük from the Wiktionary is going to remove all the bot errors; so reverting the bot's edits seems potentially less drastic or maybe as drastic as the other option. I can understand it if users who are specialised in Latin, Spanish, Italian, Russian, German, Finnish or whatever other language that is on mg.wiktionary feel that similarly drastic measures are needed in relation to their specialism.
- The creation of the poll was a spontaneous development that originally did not involve the main author and co-authors of the evaluation. As you can see, the first voters in the poll were not authors (though speaking for myself I did comment on the piece when it was still in Metaknowledge's userspace). Only the section for the weak recommendation was created by an author. Lingo Bingo Dingo (talk) 19:36, 26 September 2020 (UTC)
- (Edit conflict.) @Billinghurst and PiRSquared17: (Disclosure: I contributed a few figures to the audit, and am an English Wiktionary admin.) I am kind of shocked at these comments, since it makes me wonder if the evidence at Small wiki audit/Malagasy Wiktionary has been fully considered by you all; Billinghurst, especially, you seem to have more problems with the audit's authorship or process rather than its substance. You can see for yourself the quality of the bot-generated entries on mg.wikt. The author of the audit manually went through 100 randomly selected entries and assessed their quality, which is a great deal of volunteer effort to collect evidence, and that evidence itself cannot be questioned even if the next steps based on them are debatable. And the option for not deleting entries is clearly given in the vote.
- Look, I really don't think you understand the scale of this issue. Imagine if the English Wikipedia had entries translated from the Spanish Wikipedia entirely by a terrible translation software that just did word-for-word substitution. That could in no way be a good resource for any of us, and a mass deletion is a reasonable path to suggest to fix that kind of problem. You all are talking about bureaucratic procedure and the legitimacy of this audit. The thing is, Wikimedia projects are democratic. Someone had to take the initiative to volunteer their time to run this audit, someone had to propose potential solutions, someone had to take on the role of auditor. We have no rules for small wiki audits and their procedure thus far as they are a new initiative, and I think this is a laudable way to do one in the absence of any previous guidelines or precedent. You all speak of "outside imposition of others opinion" but the Malagasy editing community on Wiktionary is one active editor, and that editor is responsible for the bot that is the subject of the audit, so naturally there has to be some outside intervention.
- Finally, both of you are suggesting ways this could be different. "Has a summary of the report been written and given to the local community? Have the findings been translated into the local language?" "This report should be struck." "could a more limited subset of entries be identified as likely to be flawed?" The fact is, neither of you are willing to step up and do those things, and for such a tiny editing community as in Malagasy, I doubt anyone else is. Sort of like how the Scots Wikipedia has languished in its terrible state forever, until it hit headlines. This audit is a start, and it is not perfect, but it is suggesting a way forward that could fix some serious problems. I will note that a native anonymous Malagasy speaker did work with Metaknowledge; perhaps they could take on the task of translating the audit since I think that is a valid concern. But this audit was sorely needed, and we need to work constructively to figure out what to do next rather than delegitimizing the whole thing over some baseless bureaucratic objections that do not actually deal with the substance of the issue. AryamanA (talk) 05:47, 25 September 2020 (UTC)
- @AryamanA: Don't shoot messengers. This is not an audit. Take it out of the audit space, and present it as an analysis through an RFC. Don't keep it here.
I am one of the few people that you should NOT be using the "why don't you do it better" argument. That is a weak argument when your primary argument fails and you should not fall back upon it. I have done my bit around here for years; you are going to lose based on that challenge. — billinghurst sDrewth 05:56, 25 September 2020 (UTC)
- @Billinghurst: I think once you start framing it as an ego issue of win/lose, there's probably nothing constructive we can get out of this conversation. AryamanA (talk) 15:56, 25 September 2020 (UTC)
- I have read the report and I agree that the issues highlighted are real and serious. The work that went into making the quality assessment and detailing all the ways in which word definitions were flawed is incredible. My only issue is that I think there could be some procedural improvements here, and I'd like to establish good precedent for future small wiki audits. PiRSquared17 (talk) 12:36, 25 September 2020 (UTC)
- @PiRSquared17: I can agree with that. It seems the idea of the Small Wiki Audit was a bit rushed with the whole Scots controversy, and we should certainly establish a standard procedure for future audits so they're up to bureaucratic snuff. AryamanA (talk) 15:56, 25 September 2020 (UTC)
Comment An RFC was opened, though it was done as a transclusion. I have moved the pages into the RFC subpages. I have included the previous sections within {{quotation}} onto the main page of this RFC. I have updated links pointing to the discussion. — billinghurst sDrewth 01:00, 26 September 2020 (UTC)
- @Billinghurst: In my perception you come across rather pushy and condescending here. You criticize the choice of words ("audit") and the choice of actions in harsh words and you do not leave much time to react but quickly make the changes (moving etc.) that you see fit without waiting for the input of those who undertook or supported this "audit". Do you think that is fair procedure? --✍ Janwo Disk./de:wp 09:40, 26 September 2020 (UTC)
- @Janwo: The proponent sought advice from stewards, and opened it as an RFC and transcluded the page into that RFC. It meant an RFC was going to be sitting in "audit" space with the discussion in main and talk namespace. RFCs have the practice of being main namespace conversations, and in the RFC page hierarchy. Following that choice of direction by the proponent, I tidied and aligned the page into that hierarchy so it aligns with all other RFCs, and fixed all internal links to the page. I have changed no content discussion; the only change that I made was to bring the pertinent talk page conversation to the main namespace per existing convention.
Yes, I have criticised the process undertaken here. With regard to the audience, it didn't seem that the audience here was averse to expressing harsh criticism.
The conversation with the audit space would have had little standing and given stewards and global sysops no ability to act within existing policies or practices. I undertook no part in the content argument, and continue to act without conflicted interest on the matter. — billinghurst sDrewth 11:36, 26 September 2020 (UTC)
Growing the Malagasy Wiktionary community ought to be the priority
The "small" in the title refers to the number of active editors on Malagasy wiktionary. This suggests that this small number could be the cause of some problems. So for any remedy it would be of paramount importance to take an approach that will increase the number of active editors on Malagasy Wiktionary. Just applying a kind of surgical approach from the outside will probably result in unintended adverse results.
In my view the investigation into the content of Malagasy Wiktionary is valuable labor done in good faith to improve the project. Unfortunately if we don't involve more Malagasy-speaking people in improving the Wiktionary the recommendations will bring little improvement. Ironically, I would consider Jagwar's edits valuable labor done in good faith to improve the project too, but unfortunately if he doesn't involve more Malagasy-speaking people in improving the Wiktionary his work does not really improve the wiktionary either.
Of course I understand that running a bot to create or remove pages is easier than recruiting new editors. But sometimes the hard way is the better one. My tentative proposal would be:
- Appoint 2 or 3 Wikimedians having experience with Wiktionary, community building and Madagascar to recruit a group of at least 3 native-speaking Malagasians able and willing to work for at least a year on improving Malagasy Wiktionary.
- If they don't succeed in achieving this within 3 months, we should not just consider the 2 proposals above but the viability of Malagasy Wiktionary in general.
- If they do succeed in finding a dedicated group of editors, and creating a real community, this would be the forum to discuss the best way to improve Malagasy Wiktionary. Adopting the proposals above completely or partly would be up to this community. --MarcoSwart (talk) 10:21, 25 September 2020 (UTC)
- You have set incredibly high standards: I can't think of anyone in the Wiktionary community, let alone two or three people, who meets your standards to be appointed. I certainly don't! In fact, they should also be fluent in Malagasy, in order to be a mentor to new editors, so the real standards are even higher. And then these impossibly qualified people are given an absurdly difficult task, that of recruiting Wikipedians who will have the weight of an entire project on their shoulders. I appreciate your attempt to present a novel solution, because nobody else has done that, but this simply isn't feasible. Metaknowledge (talk) 16:51, 25 September 2020 (UTC)
- Recruiting editors to add to a small wiki is one thing; recruiting editors to clean up millions of pages never seen by humans is another. Especially when those pages are in Finnish, Mongolian, Yucatec Maya, etc. Who on earth could possibly be qualified to (and interested in) cleaning up this garbage? Ultimateria (talk) 17:24, 25 September 2020 (UTC)
- Adding to what has already been said, it must be noted that Madagascar is a poor country with most inhabitants living below the poverty line, that Malagasy has a modest speaker community of around 25 million speakers and that its 'dialects' are rather divergent. There are going to be few native or fluent speakers of Malagasy who have the leisure, skills or equipment at their disposal to edit a Wiktionary. Fixing a mess of this order of magnitude is going to take one to three five-year plans if you have a community with the size and activity of the English Wiktionary. I do not see how this is feasible for a Wiktionary with effectively no community. Lingo Bingo Dingo (talk) 18:14, 25 September 2020 (UTC)
@Metaknowledge, Ultimateria, and Lingo Bingo Dingo:, please allow me to give some clarifications in a single answer. You may say I'm a dreamer, but in my view Wiktionary should aim to serve even smaller languages than Malagasy. I view finding ways to offset disadvantages due to differences in wealth and history as part of the goals we share.
It is reasonable to develop standards based upon centuries of lexicographic experience and available solutions. But simply applying them in situations where those resources are lacking might be counterproductive. Engaging people to take care of their own interests is a lot harder in the short term, but yields better results in the long run.
You seem to assume that if we find a group of Malagasy editors they would not adopt the two recommendations made, while you are presenting even more arguments why they should. That's missing the entire point of my proposal. I'm not arguing the recommendations are inherently wrong. By ignoring the possibility of strengthening a community and only removing the consequences of a community being weak or non-existent we're only suppressing symptoms. A quick fix, but it doesn't bring us closer to the stated goal of creating open-content dictionaries in every language.
The requirements for the recruiters appear feasible to me, the hardest part probably being the Madagascar experience. A solution could be to have a group of 3 in which each member covers at least 2 requirements in a way that for each requirement there are at least 2 people meeting it.
Fluency in Malagasy is not required from the recruiters. Recruited editors would need to have access to computers and it is highly unlikely that in that case they wouldn't already speak English or French. It is important to involve at least two people who do have experience of Madagascar. Many things we may take for granted like "leisure" or "volunteer" may not be as obvious in Madagascar.
The editors recruited will probably be just a small group. They will have to figure out a best way forward, taking this into account. They might want to follow up on the recommendations and ask for some technical help implementing them. They might have other ideas too or instead and ask for some technical help implementing those. The important thing is to create a group for whom mg.wiktionary is a going concern, not a one-off nuisance.
It would show a lack of realism to ignore the possibility that we won't succeed in finding new editors and I haven't done so in my proposal. In all, it would take about 4 months more (or just 1 if we were unable to find suitable recruiters).--MarcoSwart (talk) 17:32, 26 September 2020 (UTC)
- I would counter that all of your efforts have been an utter waste of time and do a disservice to Malagasy speakers and the Wikipedia movement. Chris Troutman (talk) 19:58, 26 September 2020 (UTC)
- @Chris troutman: This aggressive comment confused me; judging by the edit history, it seems you intended this toward Jagwar rather than Marco? Metaknowledge (talk) 22:40, 26 September 2020 (UTC)
- @Metaknowledge: Yes, and I was too sloppy in my writing. Jagwar's edits did the disservice. I don't think Marco's well-meaning plan is worthwhile. Chris Troutman (talk) 01:14, 27 September 2020 (UTC)
- @Chris troutman: For Wiktionary, maybe. For Wikipedia, you're wrong. Jagwar grrr... 14:08, 27 September 2020 (UTC)
- Marco, I think your plan has only one virtue, which is that a local community could decide the fate of the incorrect entries. This is outweighed by its many flaws, which mainly relate to how infeasible it is to even recruit recruiters, let alone for them to recruit editors. And what could be more demotivating than to join a project where your contributions would be drops of water in an ocean of incorrect and unusable content? We might have more success trying to implement your plan if the bad entries are deleted first, so it's clear what work is left to be done. Metaknowledge (talk) 22:40, 26 September 2020 (UTC)
Thanks
Thank you, @MarcoSwart:, I deeply appreciate your proposal to fix all of this more constructively.
In the last 10 years, I have tried to attract editors on the Malagasy Wikipedia, taking in their remarks and their criticism whenever they give some. I also attempted to give it more visibility on the internet and set a much lower bar of entry by being much more "inclusionist" than anyone in the English Wikipedia would care to be, and taking care of all the formatting needed in order to make the articles readable, even with minimal encyclopedic content. I can say today that the goal is more or less achieved with 2 people having joined the Malagasy Wikipedia and having been involved with it for 3 years. That being done, I had stopped using bots for mass-creation of articles there until all of the "1,000 articles that every Wikipedia should have" are created. So in short, that serves two goals: to cross the 100,000 milestone organically and to cover the basic subjects before 100,000. The equivalent for Wiktionary was covered more than eight years ago, at least for French and English. The recommendations given in this RfC -- if ever implemented -- will unfortunately wipe them all off, but I'm not afraid to start over, with dictionary sources, this time.
What wouldn't I give to decide about this radical cleanup among native Malagasy speakers? The thing is, despite having 25 million speakers, the barrier of entry is extremely high for the average Malagasy. I explained some of that at Tell_us_about_Malagasy_Wikipedia#Comments if you care to read. With those parameters taken into account, I estimate that currently 1M people can "read" Wikipedia as readily as people in advanced economies. Currently, 90% of Malagasy read Wikipedia in French. If less than 1/10,000 would care to edit (extrapolated fr.wikipedia active users over speakers of French), that gives a pool of approximately 100 editors. If for 1 Wiktionary editor there could be 50 Wikipedia editors (frwikt/frwiki active users), then fewer than 10 potential users could be willing to contribute. Count me in, I am one of them. Lohataona was there at some point, but is now inactive.
The above being said, it does not mean that the mg.wiktionary is non-viable. On the contrary. Furthermore, the lack of activity is no grounds for closure. The lack of content is. Even with 4.9M pages wiped off, 1.1M still stand. On the other hand, the sheer number of speakers makes it a totally legitimate project to have, even more with the numbers estimated above, which can only increase with time as the barriers mentioned in the link become less unsurmountable.
Jagwar grrr... 16:05, 27 September 2020 (UTC)
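For readers who want to retrace the estimate above, here is a minimal sketch of the arithmetic. The function and parameter names are illustrative only, and the inputs are the figures quoted in the comment (1M readers, at most 1 editor per 10,000 readers, 1 Wiktionary editor per 50 Wikipedia editors); nothing here is measured data.

```python
def estimate_editor_pool(readers, readers_per_editor, wikipedia_per_wiktionary):
    """Return (Wikipedia editor pool, Wiktionary editor pool) as upper bounds.

    All three inputs are the rough ratios quoted in the comment above,
    not measured values.
    """
    # "less than 1/10,000 would care to edit" -> at most one editor per 10,000 readers
    wikipedia_editors = readers // readers_per_editor
    # "for 1 Wiktionary editor there could be 50 Wikipedia editors"
    wiktionary_editors = wikipedia_editors // wikipedia_per_wiktionary
    return wikipedia_editors, wiktionary_editors


wikipedia_pool, wiktionary_pool = estimate_editor_pool(1_000_000, 10_000, 50)
print(wikipedia_pool, wiktionary_pool)
```

With these inputs the Wikipedia pool comes out at 100 and the Wiktionary pool at 2, which is consistent with the "less than 10" figure given above.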
Cherokee Wiktionary
As noted elsewhere, the Cherokee Wiktionary has similar problems on a slightly smaller scale: about 200,000 entries, mostly created by Bot-Jagwar. The wiki has no community around to deal with this mass of error-filled entries (wikt:chr:Special:RecentChanges has nothing going on). I'd suggest that the same measures be taken for this wiki as for the Malagasy Wiktionary. Mx. Granger (talk) 05:56, 26 September 2020 (UTC)
- I agree, but I wanted to focus on one wiki at a time. I know chr.wikt has no active editors, but my hope is that they have administrators that could be contacted and who could solve this problem on their own without requiring any outside intervention. Metaknowledge (talk) 17:08, 26 September 2020 (UTC)
- Fair enough, I'm happy to focus on the Malagasy Wiktionary for now and worry about chr.wikt another time. Mx. Granger (talk) 16:38, 27 September 2020 (UTC)
- @Mx. Granger: So chr:wikt:Special:ListUsers/sysop doesn't show any real users. This implies that the problem is worse than I thought: chr.wikt is completely abandoned, and we'll need global sysops to step in. I don't know whether this requires a separate RFC, as it is a different wiki, but the exact same bot causing the exact same problems. Perhaps @Rschen7754, MF-Warburg, and Billinghurst: could weigh in on this procedural question. Metaknowledge (talk) 22:25, 27 September 2020 (UTC)
- Indeed, when Jagwar announced his intention to add these bot-created entries in 2013, the local community was already long gone, and he was seemingly never given local approval to run his bot. Metaknowledge (talk) 22:31, 27 September 2020 (UTC)
- If you really want to discuss on this matter, maybe we should rename this page Jagwar grrr... 16:23, 27 September 2020 (UTC)
- I am not certain that the discussion about Cherokee wiktionary belongs on this talk page. Seems that there is a systemic issue with small wikis having no community to manage editing and bot(s) editing these small wiktionaries without an adequate consensus. It seems that this lies back with a few solutions. The LangCom needs to address independent wikis rather than hosting on incubator. A bot operator who is having their edits queried should be voluntarily stopping those edits. Both are worthy of their own discussions if there is no quick and easy path to resolution. — billinghurst sDrewth 00:09, 28 September 2020 (UTC)
- I have requested temporary sysop rights on chr.wikt (message) to clean up. This is on the assumption that there is significant community consensus to clean up mg.wikt (which has now been mostly done), which should also apply to chr.wikt as the circumstances seem to be exactly the same (or even worse, as Jagwar, the bot's operator, does speak Malagasy but not Cherokee). — surjection ⟨??⟩ 20:38, 20 February 2021 (UTC)
User conduct
I'm still coming up to speed on this matter - but I find it bizarre that the proposals made only focus on content and not conduct. In fact there are some serious questions that haven't been asked:
- If these bots are so controversial why do they still retain the bot flag?
- If these bots are so controversial shouldn't they be blocked?
- Should Jagwar really retain adminship?
- Should Jagwar be blocked?
Ultimately this RFC is powerless if all the pages are deleted and the bots start making them again. --Rschen7754 06:15, 27 September 2020 (UTC)
- The other issue I have is consistency. We just failed to globally ban a user for machine translations across many wikis at Requests for comment/Global ban for Eric abiog, which brought to light that some wikis do accept machine translations. That being said, there was human intervention in that case and no bots used, and it was mostly at established wikis. Just presenting a counterargument here - not saying that I fully agree with that line of thinking but I think this needs to be addressed. --Rschen7754 06:22, 27 September 2020 (UTC)
- @Rschen7754: 1. Some of your questions are already answered in my report. The bots still retain the bot flag at mg.wikt because there is no community to rein them in. As for the RFC's power to enforce this, the strong recommendation already says that if the bots create such entries again, they will be blocked. 2. The questions you raise related to conduct are further proposals that seem geared toward punishing Jagwar, rather than fixing the mess. I oppose blocking him, and I neither support nor oppose removing his adminship. He is capable of making good entries by hand, and using his admin tools to prevent vandalism, and I want to encourage him to do these things. 3. As for consistency, I don't see how that example is relevant. This isn't about machine translations, which seem to be the rationale in that case. The problem is with incorrect and unusable entries, and it would be no different if Jagwar had trained an army of monkeys to create the entries instead of writing code: the outcome is an awful wiki. Metaknowledge (talk) 07:31, 27 September 2020 (UTC)
- User:Jagwar, are you maybe willing to just delete all the wrong pages yourself? Seems like then everything would be fine and dandy. --MF-W 17:26, 27 September 2020 (UTC)
- Sure, allow me some time to do it. Jagwar grrr... 17:30, 27 September 2020 (UTC)
- @Jagwar: Thank you. The first step is to stop running your bot, which will prove that you intend this in good faith. Then Surjection can provide you with a list of all the entries to be deleted. Metaknowledge (talk) 19:03, 27 September 2020 (UTC)
- @Metaknowledge: Currently, the bot is already stopped. Waiting for Surjection's list. Jagwar grrr... 19:15, 27 September 2020 (UTC)
- It's not as simple as deleting a bunch of pages. Some might have Malagasy sections or non-Malagasy sections added by actual users. While there will be a lot of pages that will simply get deleted, deleting pages outright won't be enough. The first step is to find all L2 sections that have been added by the aforementioned bots. Then, to find the pages that should be deleted completely, it is simply a matter of finding all pages where every section falls under the aforementioned conditions. Generating the entire list is going to take a while. — surjection ⟨??⟩ 19:48, 27 September 2020 (UTC)
- Here's one list to get you started. The list is too large to fit into a wiki page or a paste bin, so I'm uploading it as a (compressed) file. It includes all mainspace entries that were created either by Bot-Jagwar, Bot-Jagwar II or Ikotobaity, have only ever been edited by bots (based on a small and probably incomplete list) and that do not contain the Malagasy L2 (=mg=). Only entries created before September 2020 (except for some entries from September 1) are included. As far as I can tell, they match the conditions proposed and should be fully deleted. This list is however incomplete, and even deleting every page listed on it is just the first step in this entire cleanup process. In other words, this list is very much incomplete. (Yes, the raw list is around 60 megabytes, and has some 4.8 million pages listed on it) — surjection ⟨??⟩ 21:46, 27 September 2020 (UTC)
- Going through that list alone is going to take months. That leaves us plenty of time to do the rest in a surgical manner. Jagwar grrr... 17:49, 28 September 2020 (UTC)
- @Jagwar: It really shouldn't take months. Do you need Surjection's help? Metaknowledge (talk) 04:24, 29 September 2020 (UTC)
- @Metaknowledge: If you're really that much in a hurry, then yes. Otherwise, I'll be averaging at 50,000 page deletes per day. Jagwar grrr... 13:29, 29 September 2020 (UTC)
- @Jagwar: I'm currently working on a second list. One of the (strong) recommendations was that the translation sections be deleted from every Malagasy entry (except for maybe the few exceptions that were not populated by bots). How are you planning to approach that task? — surjection ⟨??⟩ 18:04, 18 October 2020 (UTC)
- @Surjection: I don't think it's possible to discriminate human and bot translations on a line-by-line basis. For now I'll just clear the {{-dika-}} section altogether and recreate them later. Jagwar grrr... 19:17, 18 October 2020 (UTC)
- Blocks are not punishments for past behaviour and if the user says they will stop doing what they were doing, I see absolutely no reason to block. Blocks are to protect projects from damage, or protect the users in them. As for the bot, it should only be blocked if it starts creating such machine translations again, which I am assuming will not happen. --IWI (talk) 03:06, 29 September 2020 (UTC)
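The selection criteria surjection gives above for the first list can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that produced the list: the `Page` record, its fields, and the `known_bots` parameter are hypothetical, and only the three bot names, the `{{=mg=}}` heading and the September 2020 cutoff come from the discussion.

```python
from dataclasses import dataclass
from datetime import date

# The three page-creating bots named in this RFC.
CREATOR_BOTS = frozenset({"Bot-Jagwar", "Bot-Jagwar II", "Ikotobaity"})
MALAGASY_L2 = "{{=mg=}}"      # Malagasy level-2 heading template
CUTOFF = date(2020, 9, 1)     # the list covers entries created before September 2020


@dataclass(frozen=True)
class Page:
    """Hypothetical record of one mainspace page and its edit history."""
    title: str
    creator: str
    editors: frozenset        # every account that ever edited the page
    l2_headings: tuple        # level-2 (language) headings on the page
    created: date


def fully_deletable(page, known_bots=CREATOR_BOTS):
    """True if the page meets all conditions described for list 1."""
    return (
        page.creator in CREATOR_BOTS                # created by one of the bots
        and page.editors <= known_bots              # only ever edited by bots
        and MALAGASY_L2 not in page.l2_headings     # no Malagasy section
        and page.created < CUTOFF                   # created before the cutoff
    )


def days_to_delete(total_pages, deletions_per_day):
    """Whole days needed at a constant deletion rate (ceiling division)."""
    return -(-total_pages // deletions_per_day)
```

As a rough check on the timescale discussed above, `days_to_delete(4_800_000, 50_000)` gives 96, i.e. a little over three months at the stated rate.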
Translations
Hello all,
I added the translation options and made a French translation as the story was unclear for a colleague. The last issue of French Wiktionary's Actualités echoed the story unfolding here, and I invited Jagwar to share his views in the next issue. I think this whole story should be a good opportunity to discuss several interesting issues such as small communities support, documentation of underdescribed languages, NLP assistance for language description, interwiktionary cooperation. There is clearly no war between our communities and we will improve the content over time. If needed, feel free to point me to some paragraphs here that should be translatable as well. You can translate the audit (or whatever you want to call it) to your own language or to Malagasy if you can. Veloma Noé (talk) 15:31, 2 October 2020 (UTC)
- @Noé: Done: Malagasy section added and used for deletion summary. Jagwar grrr... 19:18, 18 October 2020 (UTC)
Cleanup
@Jagwar: I've finished up with the second list, which is in two parts.
The list does contain:
- (some, but not all) entries imported from other Wiktionary editions (which were improved by or via bots); most have not been improved since
- possibly false positives (I tried to get rid of them). If a bot has automatically fixed formatting issues, those L2s might get included.
but does not contain:
- possible bot entries created under the main account. There are probably many of them, but Jagwar might better remember when he ran the bot under his own account
- accelerated entries, which should probably be checked over too in some way
Since both lists are again fairly large, I've chosen to use a file host for them. List 2A contains the pages to be deleted entirely (~122k), while list 2B contains L2s (sections identified by level 2 headings) to be deleted (~3.5k pages in total affected), in the following format: every line is one page/article. The line begins with the title, and then for every L2 that is to be deleted, a tab followed by that heading (which is usually a language code template, such as {{=ln=}}). So for example, the first line has mbala {{=ln=}} {{=mnh=}} (with the whitespace being tab characters), which means that from mbala, the headings {{=ln=}} and {{=mnh=}} should be deleted, while the rest should stay. The list should in either case only contain headings added by bots, except for any Malagasy entries which should be excluded.
If you need any help from me (I have a bot that I can set up to perform this kind of cleanup work) for any part of this cleanup process, please let me know. There should probably also be some discussion into what seems to be entries created not strictly by bots, but by using an acceleration mechanism. If mg.wikt had a community, these would best be handled by cleaning them up, but I'm not sure what the best approach would be with them in this case. — surjection ⟨??⟩ 16:29, 31 October 2020 (UTC)
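The list 2B format described above (title, then one tab-separated heading per L2 to delete) could be read and applied along these lines. This is only a sketch: the function names are made up for illustration, and it assumes that every L2 section starts with a line consisting solely of a language code template such as `{{=ln=}}` and runs until the next such line, which may not hold for every page on mg.wikt.

```python
import re

# Assumption: an L2 heading is a line made up only of a template like {{=ln=}}.
L2_HEADING = re.compile(r"^\{\{=[^{}=]+=\}\}\s*$")


def parse_list_2b_line(line):
    """Split one list 2B line into (title, [headings to delete])."""
    title, *headings = line.rstrip("\n").split("\t")
    return title, headings


def remove_l2_sections(wikitext, headings_to_delete):
    """Drop each listed L2 section (heading line plus body), keep the rest."""
    doomed = set(headings_to_delete)
    kept = []
    skipping = False
    for line in wikitext.splitlines():
        if L2_HEADING.match(line):
            # A new L2 section starts here; decide whether to drop it.
            skipping = line.strip() in doomed
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


title, headings = parse_list_2b_line("mbala\t{{=ln=}}\t{{=mnh=}}\n")
print(title, headings)
```

A page left with no sections after this step would presumably belong on list 2A instead, and any cleanup script should refuse to touch `{{=mg=}}` sections, since Malagasy entries are meant to be excluded from the list.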
- Thanks, I'll take it from here. Jagwar grrr... 11:26, 1 November 2020 (UTC)
- @Jagwar: It seems there are still some entries in list #1 that were not deleted, such as wikt:mg:aedicula (line #1942260 on list 1). It's probably not the only exception, so I suggest you go through the list(s) again (perhaps somehow excluding the entries created since). — surjection ⟨??⟩ 20:25, 20 February 2021 (UTC)
- Maybe they have been skipped due to errors, anyway I'll go through the list again and look into errors should they arise. —Jagwar • grrrr... / homewiki 18:13, 3 March 2021 (UTC)
- @Jagwar: Have you done this yet? There are still bad entries that should have been deleted. If you need assistance, Surjection can help. Metaknowledge (talk) 23:52, 20 March 2021 (UTC)
- @Metaknowledge: Yes I have. Should you find any bad entry left, please make a list and I'll look into it. —Jagwar • grrrr... / homewiki 13:20, 21 March 2021 (UTC)
- @Jagwar: Here's an example: mg:wikt:Citations:nope. Metaknowledge (talk) 18:42, 22 March 2021 (UTC)
- @Metaknowledge: Done. —Jagwar • grrrr... / homewiki 20:22, 22 March 2021 (UTC)
- @Jagwar: Sorry, I didn't mean just deleting that one, but figuring out how it got passed over. On a somewhat unrelated note, were low-quality entries like mg:wikt:племя created by hand? Metaknowledge (talk) 05:09, 23 March 2021 (UTC)
- @Metaknowledge: Surjection provided me with the list you can find above. After additional checks either the section or the whole page is deleted. For some reason, the page you mentioned above wasn't listed. mg:wikt:племя was created by hand using a list and a template you can find at mg:wikt:Wiktionary:Fanamboarana takelaka —Jagwar • grrrr... / homewiki 17:48, 23 March 2021 (UTC)
Mass creation of pages by bots has resumed
I just noticed today that Bot-Jagwar has started creating new pages again. What was the use of deleting so many pages previously, only for the bots to create pages again? I do not know the Malagasy language, so I request that someone check whether the new pages are valid. AVSmalnad77 talk 05:57, 13 February 2021 (UTC)
- @AVSmalnad77 and Jagwar: The most recent contributions are inflected forms of Latin nouns, and seem to be fine. But I share your concern that going back to mass creation of entries by bot is likely to lead to the same errors that precipitated this in the first place. Jagwar, what kinds of entries are you creating, what is your source, and how are you ensuring that you're not repeating the same mistakes? Metaknowledge (talk) 19:54, 20 February 2021 (UTC)
- I am mostly creating inflected forms after creating the lemmata manually. I am mostly using enwikt as a source for both. —Jagwar • grrrr... / homewiki 18:16, 3 March 2021 (UTC)
- @AVSmalnad77 and Metaknowledge: It seems this bot is used to create entries from other Wiktionaries, including frwikt. @Jagwar: what guarantees are there that this will not be a second waste of energy? For instance, the word mg:wikt:pré-terrier was created only one minute after I published the French version. So is it a human-made translation? Otourly (talk) 17:57, 19 October 2023 (UTC)
- The guarantees? I may not have money or securities to give here, but I'm more cautious this time. mg:wikt:pré-terrier was created by bot shortly after the French one, using a machine translation model published by Meta and fine-tuned by myself, and its output was further curated to rule out bad translations as much as possible. —Jagwar • grrrr... / homewiki 20:23, 19 October 2023 (UTC)
- @Jagwar: Could your bot at least avoid unpatrolled pages (like mg:wikt:crucimorphe)? These creations are often not kept. Otourly (talk) 17:56, 23 October 2023 (UTC)
See also https://en.wiktionary.org/wiki/Wiktionary:Tea_room/2021/December#What's_going_on_here%3F --Fytcha (talk) 19:55, 21 December 2021 (UTC)
- @Jagwar: Here is what I can observe about this new activity:
- Errors are still there, e.g. mg:wikt:ⴰⵛⵎⴰⵣ (only in French?), and unpatrolled pages on fr (and other wikis?) are recreated on mg within a minute. (Some francophone patrollers kindly do double the work.) I obviously cannot verify every entry, since I speak neither Malagasy nor auto-translated Malagasy. Some recent entries are so specific that even professional translators could have trouble translating them into their native language. An AI or a bot couldn't pass this test.
- This bot created [1] 1.9 million pages in 2023 alone! What is the goal?
- Only three years after the previous scandal, nothing has changed, and it is being done on an even larger scale.
- What shall we do?
- Otourly (talk) 16:45, 7 November 2023 (UTC)