User:Alecmconroy/Language study

From Meta, a Wikimedia project coordination wiki
  • Start with the entire global user database
Select all from users
  • Remove users who are 'inactive'
where user_touched is sufficiently recent
  • Remove users who contribute to only one project
  • The remaining users are those who are active on more than one project. For any two projects, count how much their editing communities overlap.
  • Except for meta and commons, each project has a native language. For any two languages, how much do their editing communities overlap?
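
The counting step above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the study: the function name `project_overlap`, the input shape (a mapping from user to the set of projects they are active on), and the sample data are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def project_overlap(user_projects):
    """Count, for every pair of projects, the users active on both.

    `user_projects` is a hypothetical mapping from a user name to the set
    of project codes that user is active on. Users with only one project
    produce no pairs, which mirrors the "remove single-project users" step.
    """
    overlap = Counter()
    for projects in user_projects.values():
        # Sorting gives each pair a single canonical key, e.g. ("de", "en").
        for a, b in combinations(sorted(projects), 2):
            overlap[(a, b)] += 1
    return overlap


# Made-up sample data, for illustration only.
sample = {
    "user1": {"en", "de"},
    "user2": {"en", "de", "fr"},
    "user3": {"fr"},
    "user4": {"en", "fr"},
}
overlap = project_overlap(sample)
```

Mapping each project code to its language (and skipping meta and commons) would turn the same counts into language-to-language overlap.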

Method

  1. Data provided by Platonides [1]
  2. Data imported into a php associative array via custom script
  3. Data analysis script transformed the data into tables
  4. Data analysis script created visualization data file in the form of a .gdf file
  5. .gdf files were imported into Gephi, which creates visualizations of the data.
  6. Within Gephi, layout algorithms like ForceAtlas2 were used to find the ideal configuration of each visualization.
  7. Also computed eigenvector centrality to confirm that en is, in fact, the 'most central' by that measure (as we would expect from intuition, the visualizations, and the data tables).
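
Steps 4 and 7 can be illustrated with a small sketch. It is not the original PHP script: the helper names `to_gdf` and `eigenvector_centrality`, the column choices in the GDF header, and the sample weights are assumptions, and the centrality is computed here by plain power iteration rather than inside Gephi.

```python
def to_gdf(node_weights, edge_weights):
    """Return GDF text: a node table followed by an edge table."""
    lines = ["nodedef>name VARCHAR,weight DOUBLE"]
    for name in sorted(node_weights):
        lines.append(f"{name},{float(node_weights[name])}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for a, b in sorted(edge_weights):
        lines.append(f"{a},{b},{float(edge_weights[(a, b)])}")
    return "\n".join(lines) + "\n"


def eigenvector_centrality(edge_weights, iterations=200):
    """Power iteration on the weighted, undirected adjacency matrix."""
    nodes = sorted({n for pair in edge_weights for n in pair})
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Starting from the previous scores means we iterate on A + I,
        # which has the same eigenvectors as A but does not oscillate
        # on bipartite graphs.
        nxt = dict(score)
        for (a, b), w in edge_weights.items():
            nxt[a] += w * score[b]
            nxt[b] += w * score[a]
        top = max(nxt.values())
        score = {n: v / top for n, v in nxt.items()}
    return score


# Made-up numbers, for illustration only.
users = {"en": 1000, "de": 300, "fr": 250, "ja": 200}
edges = {("de", "en"): 50, ("en", "fr"): 40, ("de", "fr"): 10, ("en", "ja"): 5}

gdf_text = to_gdf(users, edges)
centrality = eigenvector_centrality(edges)
most_central = max(centrality, key=centrality.get)
```

The scores are scaled so that the most central node has a score of 1; only the ranking matters for the check described in step 7.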

Quick conclusions

  1. Language barriers exist, but they are not as great as I imagined. The lack of a coherent global community cannot be explained by linguistic barriers alone-- overlap with en is quite substantial.
  2. en repeatedly exhibited 'special' properties even in analyses where I did not a priori tell the algorithm to treat en as special. By a large number of unbiased measures, it was demonstrably the 'most central' language.

Note and Missing data


[2]

  • need worldwide population of each language
  • need wikimedia population of each language
  • need conditional probability tables for knowing one language given another.
  • Gauge autonomous translation difficulty for En->languages.
  • Identify Languages where a majority of speakers also speak En.
  • The UN languages? Arabic, Mandarin, En, Fr, Russian, Spanish. Of these,

Nations, sortable by non-"English Speakers"


The English-Speaking World


Nations with less than 15% of the population unable to communicate in English. Interestingly, these numbers challenge our existing preconceptions:

  • Canada has long been considered a stronghold of the Anglosphere, but only 85% of its population actually know en.
  • Many nations without a history of English as a mother tongue nevertheless have fluency rates comparable to those of Canada. Israel has only slightly less than Canada, while Denmark, The Netherlands, Sweden, and Norway all have rates higher than Canada.
  • When multiproject editing patterns become available, I predict that we will see these nations editing very actively both in common spaces and in their mother-tongue spaces. After the En speakers, these nations probably represent those that are 'most tied in' to the movement.
  • Assuming the accuracy of this data, nations at the Canada/Israel level or above may not present a national challenge. (although we don't organize by nation, so this exercise is mostly to pass the time until they announce the results. :)

Our Wikipedia Languages


Source: active Wikipedia editors (more than 5 edits) per project in May 2010.

  • It is each project's duty to help send messages to and from those projects its members can directly communicate with. Each project needs to have a plan for sharing translations between it and the global community.
  • If any project prefers to receive messages in a language other than its own or English, please let the global community know which language(s) would also work best for your community.
  • Can you estimate what percentage of your contributors speak both English and your project language? Hard numbers are ideal, but even a general consensus helps.
  • Are there any other languages, other than your project's own, that your editor community is familiar with?

Tentative thoughts about language priorities.

(this work has already been done by smarter people than me and the results are very similar)
  • Our hub lang is En. (And we should always remind people that this is purely pragmatic.)
  • Our five next-largest projects are: de, fr, ru, es, and ja. These six languages give us about 3/4 of our current editors, though this is already out of date and will change with time.
  • zh and ar probably need to be included in the 'core' just for sheer common sense-- both are large, diverse, face linguistic barriers, and are underrepresented in WM.
  • This list is not necessarily exhaustive.

Misc

  • Find the "obscure language enthusiasts" and ask them to help projects communicate.
  • For 'official'/'important' translations, consider recruiting the best editors from the Simple En to 'pre-translate' / proofread En statements for brevity, simplicity, clarity.

Strategies for communicating with en speakers


There are two basic strategies for translation to/from all internet languages.

  • "Direct Translation Strategy" relies upon links to en.
  • "Indirect Translation Strategy" relies upon links to a project that is itself strongly linked to en.

Visualizing our languages


'Truest' visualizations


The truth is that our projects are very densely connected. If you place each project on the outside of a circle and connect them with thin gray lines, the inside of the circle is so covered with lines that it appears essentially as a solid color. In small numbers of active users, at least, our languages do connect to each other.

Forceatlas2 all edges

  • If we give some languages more 'weight' than others, then some 'fall' to the center, while others are pushed outward. Those closest to the center are, in this visualization at least, more 'central' according to ForceAtlas2.
  • While this image of a fully-interconnected set of languages is inspiring, it is not particularly useful in developing an intercommunication strategy.

Thus, while nearly all projects are connected, not all connections are equally strong. In some cases we have many users who are active in both languages, in others just a few. Some of the very weak connections therefore need to be dropped, so that we see only the connections with the largest numbers of bilingual users. At the same time, let's not forget that those 'weak' connections do in fact exist, even though we aren't showing them. All visualizations from here on omit most of the connections we actually have.
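
One way to drop the weak connections is sketched below. The exact rule used for the visualizations is not stated here, so the rule in this sketch (keep an edge if it is among the few strongest for at least one of its two projects) is an assumption, as are the name `strongest_edges`, the `per_node` parameter, and the sample weights.

```python
def strongest_edges(edge_weights, per_node=2):
    """Keep an edge if it is among the `per_node` strongest edges
    of at least one of its endpoints (an assumed pruning rule)."""
    by_node = {}
    for (a, b), w in edge_weights.items():
        by_node.setdefault(a, []).append((w, (a, b)))
        by_node.setdefault(b, []).append((w, (a, b)))
    kept = {}
    for candidates in by_node.values():
        # Strongest first; ties are broken by the edge key for determinism.
        candidates.sort(key=lambda item: (-item[0], item[1]))
        for w, pair in candidates[:per_node]:
            kept[pair] = w
    return kept


# Made-up overlap counts, for illustration only.
all_edges = {
    ("de", "en"): 50,
    ("en", "fr"): 40,
    ("de", "fr"): 10,
    ("en", "ja"): 5,
    ("fr", "ja"): 1,
}
kept_edges = strongest_edges(all_edges, per_node=1)
```

With `per_node=1`, the de-fr and fr-ja edges disappear from the picture even though they exist in the data, which is exactly the caveat made above.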

Visualization 2-- 'ForceAtlas2' showing top few edges

  • Give each language a weight based on its active users, kinda like a star's mass. Some stars are big, some are small-- some languages have many active users, some are still young. Since the point is for everyone to be able to intercommunicate, languages with more active speakers carry 'more weight', since if you can speak that language, there are more people to talk to, and thus more opportunities for inter-language communication.
  • Let each interconnection between languages act like a "gravity" pulling the two language "stars" closer together. The stronger the interconnection, the stronger the 'pull'. Thus, closely connected languages will 'tend' to be close to each other.
  • Let languages that are not closely linked tend to push each other apart, 'repelling' each other like magnets of the same charge. Thus, languages that are not closely linked will 'tend' to be far apart from each other.
  • If you 'pretend' all this is so, throw a random assortment of language "stars" onto a blank "universe" and let time run. After a while, you get what you see here:
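
The star analogy can be tried out on a toy graph. The sketch below is a deliberately simplified force-directed layout, not ForceAtlas2 itself: every pair of nodes repels with strength 1/distance, every edge attracts with strength weight × distance, and node 'mass' is left out. The function name `force_layout`, its parameters, and the sample weights are assumptions made for illustration.

```python
import math
import random


def force_layout(nodes, edge_weights, steps=1000, rate=0.02, max_step=0.1, seed=0):
    """Toy force-directed layout: all pairs repel, all edges attract."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair pushes apart with magnitude 1 / distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist_sq = dx * dx + dy * dy or 1e-12
                force[a][0] += dx / dist_sq
                force[a][1] += dy / dist_sq
                force[b][0] -= dx / dist_sq
                force[b][1] -= dy / dist_sq
        # Attraction: each edge pulls its endpoints together with
        # magnitude weight * distance.
        for (a, b), w in edge_weights.items():
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += w * dx
            force[a][1] += w * dy
            force[b][0] -= w * dx
            force[b][1] -= w * dy
        # Move each node a small step along its net force, capped at max_step.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = rate
            if rate * size > max_step:
                scale = max_step / size
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return pos


def distance(pos, a, b):
    """Euclidean distance between two laid-out nodes."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


# Made-up weights: en-de is a strong connection, en-ja a weak one.
toy_nodes = ["en", "de", "ja"]
toy_edges = {("de", "en"): 10.0, ("en", "ja"): 1.0}
layout = force_layout(toy_nodes, toy_edges)
```

In this toy run the strongly connected pair (en, de) settles closer together than the weakly connected pair (en, ja), which is the behaviour the analogy describes.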


Thus, we see:

  • Large projects that are closely connected to others tend to fall to the 'heart' of this 'nebula', while less-connected projects orbit the outer rings of this central galaxy of languages. Languages that are closely connected tend to attract each other, and thus tend to be closer to each other. Languages that are less connected tend to lie farther apart from each other.
  • Most languages are in a central cloud of densely-interconnected languages. En is at the center of this main "nebula" that contains most of our projects' total mass.
  • English and Simple English are tightly bound together, like a binary star system. This makes sense.
  • In 'orbit' around the two english projects lies a dense "asteroid belt" of densely packed languages. Most world languages, and those closest to them, lie in this belt. pt, es, fr, it, pl, de and nl to name just a few.
  • Surrounding this 'asteroid belt' is a less-dense but still-central "ring" of languages which are 'a little less' tied to the mass of the projects: still connected, just not to the extent of the languages in the belt. ru, zh, ar, ko, ja, to name just a very few.
  • In this ring, we see a lot of closely related language pairs. ko, ja, and zh are near each other. es and cs, ru and uk. Again, to name just the ones my eye noticed.
  • Around the edges of the central cloud, we see some "hub" languages that are connected to lots of smaller projects: eu, hu, ka, az, bg, oc.
  • Outside of this central cloud, we see "hub" languages that serve as the centers of their own clusters/systems. These languages are strongly connected to the central cloud, but many of the languages they are connected to are not strongly connected to any central cloud languages. These projects may be frequented by language enthusiasts. These languages might be ideally suited to recruit communicators to serve as conduits for those smaller projects in their orbit. Examples include wuu, kn, uz, vo (Volapük), gv (Manx), gd (Gaelic).

Visualization Three: Connectedness by favorite second language to EN


Keep only one connection per project-- its strongest connection. Drop all others. Languages directly connected to EN are blue. Languages connected to blue nodes are colored orange or yellow-- yellow if they have children, orange if they do not. Languages connected to a yellow language are colored red. These languages are, by this method, 'most distant' from en.
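
The coloring rule above can be sketched as follows. The function names, the "unreached" label for projects whose chain of strongest connections never arrives at en, and the choice to color anything deeper than a yellow language's children red are assumptions; the sample weights are made up.

```python
def strongest_partner(edge_weights):
    """Map each project to the partner it shares the most users with."""
    best = {}
    for (a, b), w in edge_weights.items():
        for node, other in ((a, b), (b, a)):
            if node not in best or w > best[node][0]:
                best[node] = (w, other)
    return {node: other for node, (_, other) in best.items()}


def color_by_distance(edge_weights, hub="en"):
    """Color each project by how many strongest-connection hops separate it from the hub."""
    parent = strongest_partner(edge_weights)
    parent.pop(hub, None)  # the hub is the root and has no parent
    has_children = set(parent.values())

    def depth(node):
        seen = set()
        hops = 0
        while node != hub:
            if node in seen:
                return None  # a loop of mutual favorites that never reaches the hub
            seen.add(node)
            node = parent[node]
            hops += 1
        return hops

    colors = {}
    for node in parent:
        hops = depth(node)
        if hops is None:
            colors[node] = "unreached"  # assumed label, not in the original scheme
        elif hops == 1:
            colors[node] = "blue"
        elif hops == 2:
            colors[node] = "yellow" if node in has_children else "orange"
        else:
            colors[node] = "red"  # assumption: anything deeper is also red
    return colors


# Made-up overlap counts, for illustration only.
tree_edges = {
    ("de", "en"): 50,
    ("en", "fr"): 40,
    ("de", "nl"): 8,
    ("en", "nl"): 3,
    ("fy", "nl"): 2,
    ("de", "hsb"): 1,
}
colors = color_by_distance(tree_edges)
```

In this toy data nl is connected to en, but its strongest connection is de, so it becomes a yellow node rather than a blue one.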

  • Languages in blue prefer to speak in en over other languages. These projects should tend to have strong translation communities available, if we could mobilize them.
  • Languages in orange or yellow prefer to speak in a blue language over en. If these languages have trouble communicating with en, their blue favorite-second-language would be where they likely turn for indirect translation.
  • Languages in red prefer to speak in a yellow language. Red languages are where potentially 'isolated' projects will be, the ones that may pose the greatest communication difficulties. (But being red alone does not automatically mean that any such difficulties actually exist, of course.)
  • The connections of small red languages do not seem to necessarily mirror the language commonalities of their real-world populations (again, just my impression). I think we should be a little suspicious whether the language connections of small red languages will remain the same as those languages gain active users. Presumably, an infusion of active users would tend to make this data match real-world linguistic classifications, which this visualization does not preserve. This may well be due to a biasing effect of the high requirement for active users and the short duration (1 month) of the time window studied. I predict that over the foreseeable future this Erdős-number-style visualization will approach one in which all languages are connected directly to the language we currently call en-- that is, all languages becoming blue languages whose favorite second language is en.
  • The en<-->de connection is, in absolute terms, a special one: the strongest connection, between our two most populous projects. Attempts to build a common lexicon (Wikilish?) should begin at that connection.
  • ru, de, and zh all jump out as important 'hubs' or 'branches' in this scheme-- that is, they connect us to many languages that are not themselves most-directly connected to en.

Addendum: a fourth visualization-- 'binary tree'


Starting with en, pick two 'child' languages-- the two strongest second-languages of en. Link to them.

For each of those, pick their two strongest second languages (ignoring those languages already connected to en elsewhere in the graph). Keep doing this, so that each language has two "children" it is the 'parent' of. (Parent and child are mathematical terms in this case, completely unrelated to any real-world meaning.)
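
A greedy version of this construction might look like the sketch below. Visiting languages in breadth-first order is an assumption (the order is not specified here, and it affects which parent claims a contested child); the name `binary_tree` and the sample weights are also made up for illustration.

```python
from collections import deque


def binary_tree(edge_weights, root="en"):
    """Greedily give each language its two strongest not-yet-placed partners."""
    neighbors = {}
    for (a, b), w in edge_weights.items():
        neighbors.setdefault(a, []).append((w, b))
        neighbors.setdefault(b, []).append((w, a))

    children = {}
    placed = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Strongest connections first; ties broken alphabetically.
        ranked = sorted(neighbors.get(node, []), key=lambda item: (-item[0], item[1]))
        picked = [other for _, other in ranked if other not in placed][:2]
        children[node] = picked
        placed.update(picked)
        queue.extend(picked)
    return children


# Made-up overlap counts, for illustration only.
bt_edges = {
    ("de", "en"): 50,
    ("en", "fr"): 40,
    ("en", "es"): 30,
    ("es", "fr"): 12,
    ("de", "fr"): 10,
    ("de", "nl"): 8,
    ("ca", "es"): 6,
}
children = binary_tree(bt_edges)
```

Here es is the third-strongest partner of en, so it cannot be a child of en and ends up under fr instead, which shows how arbitrary the resulting placements can be.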

  • This style of connections, called a binary tree, is a very bad way to model our densely-interconnected projects. It ignores most of our connections, and arbitrarily imagines that each project can communicate with, at most, three other languages. This is not at all true-- each language can communicate with as many languages as it has bilingual speakers-- thus the maximum of 3 connections is a very unusual and arbitrary one. Thus, I do not currently know of an application that requires such a layout.
  • That said, for the inner parts of the tree, a few meaningful patterns do emerge. en<-->de, ru->uk, es->ca. Mostly though, this kind of a structure isn't very useful for modeling our project.
  • We could improve upon this by trying to create a 'globally optimum' binary tree-- right now, node relationships are assigned rather arbitrarily. Since a binary tree is of no known current use to us, I didn't bother.

Tentative Conclusion and improvements

  • Contrary to my earlier concerns which prompted this study, the use of en as a central language does seem objectively defensible. Real-world populations, readership population, and our less active editors may all have dramatically different language preferences than our active users. But among active users, practically all languages have their strongest second-language ties to en.
  • While this may be a bias in the data caused by the high standard for "active user", it may also be fairly comprehensible. The Wikimedia Movement started in en, and being able to communicate with the existing movement, in some limited or indirect way, is a reality-imposed barrier to joining the movement.
  • In future, try to get a larger dataset that comes closer to getting all users.
  • In future, directly ask users for their language proficiencies, so we don't have to infer it.
  • In future, create a smallest-space matrix, factor analysis, other such stuff.
  • In future, do a similar analysis but use percentage-connection as a weight instead of strength-of-connection.
  • In future, do an edge layout looking just at links to en and run ForceAtlas2 on it. With no shared edges, projects' locations will depend entirely upon their relationship to en. This will produce a map of 'distance from en'. Do the same thing with just links from en/de, and so on.
  • If requested, do a 'distance from any given project' visualization, of the sort done for en.
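
As a rough illustration of the percentage-connection idea in the list above, one possible reading is: for a pair of projects, divide the number of shared active users by the number of active users of one of the two projects. That reading, the function name `percentage_connection`, and the sample numbers are all assumptions.

```python
def percentage_connection(overlap, active_users):
    """For each ordered pair (a, b), the share of a's active users
    who are also active on b (an assumed definition)."""
    share = {}
    for (a, b), shared in overlap.items():
        share[(a, b)] = shared / active_users[a]
        share[(b, a)] = shared / active_users[b]
    return share


# Made-up numbers, for illustration only.
pair_overlap = {("de", "en"): 50, ("en", "fy"): 4}
active = {"en": 1000, "de": 200, "fy": 5}
share = percentage_connection(pair_overlap, active)
```

Unlike the raw counts, this weight is directional: in the toy data most fy users are also active on en, while only a tiny share of en users are active on fy, so a small project's ties can look strong from its own side.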