
Needs assessment for documentation and revitalization of Indic languages using Wikimedia projects

From Meta, a Wikimedia project coordination wiki

The executive summary of this report can be read here.

The research project ‘Needs Assessment for the documentation and revitalization of Indic languages using Wikimedia projects’ was conducted to understand what Indic languages need for their digitization. This research was funded as a part of the Movement Strategy Implementation Grants. Visit the respective pages on Meta-Wiki for more information on the Grant proposal and the Grant report.

Introduction


Before beginning the digitization of a language, it is essential to understand the needs of that particular language. As native speakers would be the backbone of open source language digitization, we conducted a survey and interviews to understand those needs. To understand the challenges that native language speakers perceive in the development of their language, the survey designed for them asked: “In your opinion, what are the challenges to your language on online platforms?” Of the 108 respondents, 106 answered this question. The breakdown of the responses is given below:

  • Need for awareness towards the language: 9
  • Not seen as a language in its own right: 3
  • None: 10 (response of dominant language speakers)
  • Lack of good content and/or information: 33
  • Lack of resources (like fonts, Google voice recognition, scripts not encoded in Unicode, digital literacy, presence on the internet): 24
  • Lack of pride within native language speakers towards their language, lack of interest in mother tongue, and dominating effect of larger languages: 18
  • Lack of material: 5
  • Lack of vocabulary: 1

* unclear responses have been omitted.
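As a rough check on the breakdown above, the counts can be tallied and expressed as shares of the 106 responses. The following is a minimal Python sketch and not part of the original analysis: the shortened labels and all variable and function names are illustrative, and it assumes that each response was coded into exactly one category.

```python
# Counts copied from the breakdown above. The shortened labels are
# illustrative, not the report's own category names.
CHALLENGE_COUNTS = {
    "Need for awareness": 9,
    "Not seen as a language in its own right": 3,
    "None": 10,
    "Lack of good content and/or information": 33,
    "Lack of resources": 24,
    "Lack of pride or interest; dominance of larger languages": 18,
    "Lack of material": 5,
    "Lack of vocabulary": 1,
}

# Number of responses to the question, as reported in the text.
TOTAL_RESPONSES = 106


def percentage_shares(counts, total):
    """Return each category's share of `total` in percent, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


listed_total = sum(CHALLENGE_COUNTS.values())
shares = percentage_shares(CHALLENGE_COUNTS, TOTAL_RESPONSES)

# Print categories from most to least frequent.
for label, n in sorted(CHALLENGE_COUNTS.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {n} ({shares[label]}%)")

print(f"Listed: {listed_total} of {TOTAL_RESPONSES}")
```

The listed counts sum to 103, which leaves 3 of the 106 responses unaccounted for. Under the one-category-per-response assumption, these would be the unclear responses that were omitted.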

The trend in the above responses demonstrates the needs of languages at the start of digitization. Our research indicates that communities can conduct activities and bring content online to remedy the specific problems of their language or to promote it in general.

Wikimedia projects can be used to digitize linguistic and cultural content systematically once current practices have been analyzed. As with any other platform, using Wikimedia platforms for digitization brings both challenges and opportunities. These aspects were discussed with the interviewees, and related questions were asked in the surveys. The results are based on a thorough analysis of these responses, and the recommendations derived from them can guide the next steps.

Questions that will be answered via this research:

  1. What is the state of linguistic awareness among various language speakers?
  2. What is the state of awareness about Wikimedia projects?
  3. What are the gaps and opportunities with regards to utilization of Wikimedia projects for language digitization and inclusion?
  4. What needs to change in order to include more stakeholders for language digitization via Wikimedia projects?

Methods and data:


For this research, we collected opinions via two methods: surveys and interviews. There were three categories of respondents: indigenous/native language speakers, Wikimedians, and language experts. We received 139 survey responses overall from two categories: indigenous language speakers and Wikimedians. There were 31 responses from Wikimedians. The gender ratio was more skewed among Wikimedians than in the general survey. This might be because fewer women are Wikipedia editors.

Survey for Wikimedians:

Two separate surveys were designed, one for non-Wikimedians and one for Wikimedians. Total responses to the survey for Wikimedians: 32. There were 29 male participants and 3 female participants. 15 of these participants were Bangladeshi Wikimedians; a volunteer helped share the survey with them. Several of them also participated in the online workshops.

Breakdown of languages known by the survey takers: Konkani, Hindi, Kashmiri, Punjabi, Odia, Portuguese, Brazilian Portuguese, Hebrew & Yiddish, Nepali, Dotyali (Doteli), and Darchuleli. The platforms used to share the surveys were: multiple mails on mailing lists (India, Bangladesh, Sri Lanka), Telegram groups, Twitter, and sharing by acquaintances.

Survey for indigenous/native language speakers:

Total responses in the general survey: 108. Of these, 48 were female, 60 male, and 1 preferred not to state gender.

The languages of the participants varied and included Hindi, Punjabi, Pahadi, Odia, Bengali, Malayalam, Angika, Assamese, Bhojpuri, Dogri, Tobago Creole, Aymara, Awadhi, Haryanvi Ahirwati, Braj, Magahi, Telugu, Mising, Bishnupuriya, Bengali-Rarhi dialect, Maithili, Pnar, Khasi, Pangal dialect of Manipuri, Bodo, Marwari, Himachali, Kangdi, Sirmauri Pahadi, and Gaddiyali. Several of these languages are currently under-resourced. The platforms used to share the surveys were: multiple mails on mailing lists (India, Bangladesh, Sri Lanka), Telegram groups, Twitter, and sharing by acquaintances in personal messages and WhatsApp groups of linguistics scholars.

We interviewed a total of 15 people: 3 female and 12 male. There were overlaps between the three categories of indigenous/native language speakers, Wikimedians, and language experts. Most of the Wikimedians are also language activists, and the native/indigenous language speakers are also linguistics scholars. The native language speakers had Bodo, Haryanvi-Ahirwati, Braj, and Gojri as their native languages. 26% of them were engaged in digital activism, 45% shared content and news related to their language, 27% created content in their language, and 67% spoke, wrote, and shared their language on digital platforms.

This research was carried out in an explanatory manner using the qualitative method of semi-structured interviews. Secondary research, i.e. analysis of already available data, was also included. The data analysis is both inductive and deductive, with deductive analysis appearing mostly in the conclusion and recommendations section. The unit of analysis was the interviews of 15 people and the survey responses from 139, spanning the three categories mentioned above. The interviews were conducted from October to December 2022. Purposive sampling and snowball sampling were used to choose the interview participants: initially, we spoke to participants we were already familiar with, and those participants then suggested other prospective participants.

The interviews were conducted online via Zoom calls, except one that was conducted via email. The surveys were conducted via Google Forms. All of the interviews were recorded with consent from the interviewees. The interviews were semi-structured, and some questions were added based on the conversation. The duration varied according to the input of the interviewees; the average interview lasted 43 minutes.

The analysis proceeded in stages. The first stage was transcribing the recorded interviews. After transcription, keywords were coded using color coding, and 10 codes were created overall. A spreadsheet file was then created as the codebook. Spreadsheet files were also created for the survey responses, which helped organize the data precisely and observe parallels, if any. Notes taken during the interviews were also used, and existing literature has been used to strengthen arguments.

Results and discussion:


This report is a product of the research project ‘Needs Assessment for documentation and revitalization of Indic languages using Wikimedia projects’ and discusses the results of that investigation. This section is divided into two subsections: A. Language activism and digitization, and B. Wikimedia projects and language digitization.

The first section: ‘Language activism and digitization’ discusses language activism and digital preservation of linguistic and cultural content in general. It explains the strategies for digitization of linguistic content: current practices, requirements, generalizations, misconceptions, and next steps. It is divided into six subsections: Introduction, Utilizing the internet for language digitization, Relationship between documentation and revitalization, Oral Culture, Challenges, and Opportunities.

The second section: ‘Wikimedia projects and language digitization’ expands on the topic of language digitization and the use of Wikimedia platforms for that purpose. It deals with the current practices, challenges, and opportunities of innovative usage of the Wikimedia projects for language digitization. It is divided into the following subsections: Introduction, Challenges, Social needs and challenges, Technical needs and challenges, General challenges and solutions, Opportunities, and Wikimedia platforms recommended for language digitization.

Language activism & digitization:


This theme combines two themes: digitization and language activism. Here, digitization refers to uploading and making language and cultural content available online. Language activism refers to activism for the sake of preserving the diversity of languages.

Mobile phones and the internet are becoming accessible to a large number of people in India and Bangladesh, so digitization of languages and culture has become more of a possibility as well. How many languages is the internet available in? Research by Whose Knowledge demonstrates the dominance of a handful of languages on the internet. This indicates that the internet is ready to be made linguistically inclusive through the involvement of diverse language speakers. People can utilize the internet to digitize their language in various ways, and Wikimedia platforms are one of them.

Language experts on utilizing the internet for language digitization:


Language digitization can take various forms. An Angika language activist advises people to bring their language online; this would be a form of language activism, as it would also increase awareness about the language. He says: “Mobile phones have reached even in remote villages. It is only natural that native languages and cultures should be digitized and made available to the public. It would also spread awareness about the language.”

Eddie Avila, who works with Global Voices as the Director of Rising Voices, an initiative that helps support communities seeking to leverage the internet to meet their self-determined needs, points out a misconception that prevents many people from digitizing their language. He says:

Helping to facilitate networks can encourage people to discover their own capacity to contribute to language documentation. Some people may have misconceptions about requirements for language digitization. They think that they need professional video cameras or professional audio tools. Toolkits, projects, resources, and activities can make things more accessible and more appealing to those wanting to contribute to preserving their language and culture.

Several participants from all three categories of interviewees and survey respondents say that creating awareness of the importance of language diversity is important. This response suggests that under-resourced languages are not being given enough attention.

The purpose for which a language is being brought online affects which platform it should be made available on. Daniel Bögre Udell, founder of Wikitongues, advises assessing who you want to digitize and popularize your language for: outsiders or your own community. In the latter case, put it online where your community is online. In the former case, use big international platforms such as YouTube and TikTok. Social media can be a powerful tool for spreading awareness about your language. Note that big international platforms might also be where your community is online. For educational content, Wikimedia tools can be utilized.

An important question to answer before going into the details of implementation is: “What comes under language digitization?” Amir Aharoni from Language Diversity Hub says that, broadly defined, anything happening in a language is digitization of that language, whether writing Wikipedia articles in it or stories on Wikisource. With such a large pool of options, where should one begin? Eddie says that there is often a tendency to want to get involved in all types of activities or to be present on all platforms, but because time and resources are limited, people might want to set priorities by figuring out what is important for them and their community. It is not necessary to reinvent the wheel; they can begin by identifying examples of others doing similar work that can be adapted to their own context.

Relationship between documentation and revitalization of languages:


This section discusses two themes: documentation and revitalization. Documentation refers to the collection of various types of data in a given field of knowledge. Here, it specifically refers to documentation of languages. Revitalization refers to efforts towards putting life into something that is on the verge of death. In the context of languages, it refers to creating more users and speakers of a language, especially among young people.

  • A resounding no is the answer of the majority of native speaker interviewees when asked about the need for revitalization: they believe that it is not required because their language is not dead. On this matter, language expert Daniel suggests that evaluating the health of the language is important to determine what kind of revitalization is needed. Also, according to the experts, documentation and revitalization are interrelated, and documentation is the primary step in the preservation of a language. Eddie says: “Documentation is a part of revitalization. Activities like language documentation, archiving, and other preservation activities can coincide with activity like social media activism for the language or other types of activities for revitalization.”
  • Documentation of language and revitalization of language are often interrelated activities. The documented data can be utilized later to revitalize the language. Also, efforts to document the language create awareness towards it in the language community and other concerned parties, therefore, that also contributes to the possible revitalization of the language.
  • Different people can contribute to language digitization differently. Subhashish, a founding member of O Foundation and a documentary filmmaker who has documented languages such as Karbi, Achhami, Baleswari Odia, Ho, and Kusunda, tells us:

Written documentation in an endangered language means it could be used by a linguist/ professionally trained individual. A citizen archivist could capture the orality of the language- folklore, folk songs, contemporary aspects of the language. It can be used later by linguists for the analysis of the language.

Oral Culture:


Although the digitization of language and culture can take various forms, there can be a tendency to lean towards its textual aspects. While this seems justified and conventional, from the point of view of inclusion it can exclude languages, cultures, and communities that do not have a lot of written literature and are mainly used orally.

Why is recording oral culture important?

Oral literature is tied with the way of life of various communities. A Bodo speaker says that folk songs were an intrinsic part of his culture, but they are disappearing as times are changing.

Folk songs and folk narratives that were very much related to our culture 50 or 100 years ago are fast disappearing now. We can start recording them in video and audio format to preserve the oral tradition.

He also stresses the need to document the present form of the language, since languages are dynamic and would not be the same in, say, a decade. Notably, around 70% of non-Wikimedian survey respondents named recording folk songs and folktales as their preferred method of contributing to their language digitally. Other interviewees, Punjabi speaker Sumanpreet and Wikimedian Kundan Amitabh, also advocate documenting the current form of a language, since languages change with time.

Linguist Bidisha Bhattacharjee in her essay ‘Role of Oral Tradition to Save Language and Cultural Endangerment’ in ‘Linguistics and Language Sciences’ states: The oral tradition is a rich source of preservation of cultural heritage and it reflects through the linguistic expression and linguistic variety of people. Bidisha advises that NGOs, academic bodies, or institutions working to save endangered linguistic communities and preserve linguistic and cultural diversity can focus on documenting the different forms of oral tradition of these communities, to better understand their value, their dynamicity, and the unique properties of their language and culture. Ruth Finnegan in ‘Literacy and Orality’ has classified oral traditions into the following categories: oral literature, generalized historical knowledge, and memoir/personal recollections. Collections of prose, poems, folk tales, and songs associated with different rituals, celebrations, festivals, and professions, capturing different moods, give a detailed and in-depth understanding of the society, the community, the people, and their lifestyle.

A Braj speaker tells us about the varieties of Braj folk songs sung in rural areas, such as Suddas, Languriya, and Aalha along with Rasiya, Malhaar, and Faag, which have almost no representation on digital platforms. He also mentions that as people migrate to urban areas, these forms of song are being lost, since they are no longer practiced. This highlights that languages are bound to cultural knowledge as well as linguistic knowledge. As Iftekhar, a Bangladeshi Wikimedian interviewee, states: “A language may be well documented by documenting only its grammar and vocabulary, but it may die away if its oral culture is not documented and passed down to the generations.”

To the question of how native speakers can contribute to the digitization of their language, Olushola Olaniyan, President of Wikimedia User Group Nigeria, and Daniel say that they can provide the verbal form of their language. Daniel expands on this: native speakers can contribute to digitizing the orality of a language either through audio/video recordings or through vocabulary/lexicon elicitation by speaking or providing speech. Olushola says that indigenous speakers might not be literate, but they can speak the language and they own it. For language digitization, they can aid in its audio-visual documentation.

Subhashish says this about the anthropological importance of oral culture and history:

Literature and history are often the voices of the dominant people. I think oral history enables parallel documentation of what's happening in society. Because at the end of the day, folklore or folk songs are the language of the people, they are the voices of the people. That is mostly not documented well in written history or written literature.

Some of the interesting survey responses mention the importance of content being available in a variety of mediums for under-resourced languages and communities: online videos and audio with subtitles that can be understood by large populations, resources for tribal communities, and guidance for interested people.

A native Mising speaker says: “There are not many people who can read and write the language properly... If online videos or audios could be made available, it would be very beneficial for them.”

A Haryanvi speaker underlines the need for awareness. She names “People having less information on how to digitize their language on online platforms” as a major problem for her language.

Which type of content should be prioritized: oral or textual?

This is a decision that should depend on the language, says Subhashish. For an endangered language, both written and oral aspects should be preserved.

Written documentation in an endangered language means it could be used by a linguist/professionally trained individual. A citizen archivist could capture the orality of the language: folklore, folk songs, contemporary aspects of the language. It can be used by linguists for the analysis of the language.

Just as documentation and revitalization are interrelated, the oral and written forms of a language are interrelated, meaning that the orality of the language cannot be sidelined.

Capturing the orality of a language is especially recommended for languages where not a lot of literature (text) exists. Olushola shares that most languages are only for oral engagement and cannot easily be documented in textual form. This is also relevant for newly emerging languages, as they face the problem of a lack of references on Wikipedia, as described by Olushola and by Iyumu, a Paiwanese Wikimedian and a member of Language Diversity Hub. This issue is also highlighted by a survey respondent, who says that there is: “inadequacy of relevant references to secure and improve our local Wikipedia articles, lack of quality volunteers to contribute on digital platforms to promote language.”

Challenges:

Cultural context:

Within digital preservation of language, culture-specific challenges might exist in the language communities. For instance, community members might be shy. Iyumu mentions this challenge in the Paiwan-speaking community and explains that it is a class issue: in Paiwan, class hierarchy matters a lot, so not everyone is forthcoming to speak and be assertive. Olushola mentions the issue of gender and permissions. Although the same challenges might not exist in different language communities, there would be specific challenges that insiders would be aware of and possibly adept at dealing with. For instance, Haryanvi and Ahirwati speaker Priyanka mentions that people might not agree to provide their input to documentation. She adds that this is less likely when the person is from the community itself. Including people from the community itself also matches Wikimedia’s spirit of volunteerism and open knowledge, where people contribute to their own culture and knowledge system.

On the field:

The most common problems faced during active language digitization, as experienced in the Oral Culture and Language Documentation workshops conducted during this project and as shared by language experts, are slow internet and problems with storage.

Problems with websites:

There are several platforms to upload media files (Internet Archive, Lingua Libre, and Wikimedia Commons), but one has to be careful with sensitive content. Cloud hosting can be expensive and not secure enough. Slow internet might also be an issue for certain platforms.

Opportunities:

Various Platforms to upload linguistic-cultural content:

There are various platforms that can be utilized to preserve linguistic content. Subhashish expands on this theme with two points about social media. A. Social media interests many people, and many people these days become internet users because of it. One can think in reverse about how to use its features to document languages; however, these platforms are not built for language documentation. B. Content moderation or content takedown on social media would ruin the archives if documentation were done primarily there. So, it is advisable to consider alternate platforms such as Internet Archive, Wikimedia Commons, and Lingua Libre, among others. These platforms have certain limitations as well, such as the uploaded content being public to everyone and the website being tough to use for people with slow internet.

Immediate steps:
  • Presence of native speakers: in the digital preservation of a language, the physical presence of the native language speaker matters. The interviewees state that the regular presence of the native language speaker in the geographical location, and therefore their familiarity with the language and culture, is desirable. They can provide the verbal form of their language, as mentioned earlier. If the person initiating the digital preservation is an outsider, Subhashish recommends that they “be a catalyst rather than being savior”.
  • Preservation of lexemes of the present form of the language: advised by Pramod Rathor, a Braj speaker; Sagar, a Bodo speaker and Wikimedian; and a Punjabi speaker and journalist.
  • Preservation of the oral culture of the language: documenting folk songs, Ghazals, Suddas etc. is advised by three native language speaker interviewees.
  • Policies should be explained to interested people: policies might be tough to understand for outsiders or need to be explained to newcomers.

Wikimedia projects & language digitization:


Wikimedia projects as platforms for digitization of linguistic and cultural content: an introduction

This section discusses two themes: linguistic inclusivity on Wikimedia projects and policies, and inclusion of indigenous language speakers for open knowledge. As the aim of this project is to understand how to make Wikimedia platforms easier to use for varying languages and cultures, this section deals with the forms of cultural knowledge that can be digitized, the need for innovation in the way we look at language digitization (textual versus oral), and the potential and current challenges faced in undertaking such work.

It discusses the usage of Wikimedia projects as sites for digitization of languages.

Challenges:


Beginning with the challenges, we will move towards solutions in the forthcoming sections. This section describes two types of challenges: A. Technical needs and challenges, and B. Social needs and challenges.

Technical needs and challenges
  • Entry level barriers: Several Wikimedia platforms have a complex pathway that a volunteer has to navigate before being able to contribute. For instance, the interface of the Wikimedia Incubator is quite different from that of a Wikipedia, and uploading media files, especially video, to Wikimedia Commons is a complex procedure for beginners. Daniel, Olushola, and Iyumu raise the issue of Wikimedia projects other than Wikipedia being tough to navigate for beginners.
  • Wikimedia technology and community practices unfriendly to oral culture content: This is a problem listed by language experts and Wikimedians in association with using Wikimedia platforms for language digitization. The volunteer community might not be familiar with different cultures, resulting in takedown/objection to oral content. The policies might have been created by people who lack contextual knowledge of majorly oral languages. This demonstrates favoritism to text over orality.

A Wikimedian survey respondent also mentioned the problem of conservatism against uploading music. Asked about utilizing Wikimedia platforms for digitization of culture, the respondent wrote: “By being more open. We've had very negative experiences while trying to contribute Konkani music. The dominant voices which sideline such attempts need to be kept in check.” To avoid removal of such content, one would have to take the extra step of keeping email notifications on and replying to queries quickly.

  • Need for awareness: awareness of Wikimedia is also required, as interested people might not know of Wikimedia projects as sites for language digitization. Sumandeep, a Punjabi-speaking journalist who has researched female Wikimedia volunteers, advises that one can begin by raising awareness in schools and colleges.

This need is reflected in the analysis of survey responses as well. While 82 responses overall indicated knowledge of Wikipedia, in answer to the related question “What do you know about Wikimedia projects other than Wikipedia?”, only 12 mentioned any of the Wikimedia sister projects, i.e. only around 11%. This indicates the need for increasing awareness about Wikimedia sister projects.
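The text does not state which base the figure of around 11% is computed on. A quick calculation, sketched here in Python, compares two candidate bases. It assumes that the candidates are the 108 general-survey respondents and the 82 responses indicating knowledge of Wikipedia; the function and constant names are illustrative.

```python
def percent(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Figures taken from the text; the choice of bases is an assumption.
SISTER_PROJECT_MENTIONS = 12
GENERAL_SURVEY_RESPONDENTS = 108
KNOW_WIKIPEDIA = 82

share_of_all = percent(SISTER_PROJECT_MENTIONS, GENERAL_SURVEY_RESPONDENTS)
share_of_aware = percent(SISTER_PROJECT_MENTIONS, KNOW_WIKIPEDIA)

print(f"Share of all general-survey respondents: {share_of_all}%")
print(f"Share of those who know Wikipedia: {share_of_aware}%")
```

Under these assumptions, 12 of 108 is about 11.1%, which matches the reported figure, while 12 of 82 would be about 14.6%.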

Social needs and challenges:

While the above-mentioned challenges deal directly with Wikimedia platforms, the social challenges below concern the changes in social outlook needed to create a more linguistically and culturally inclusive Wikimedia.

  • Need for mentorship: this is mentioned by language experts as well as by indigenous/native language speakers. A Gojri speaker affirms this need and lists mentorship as one of the requirements for effectively including young people in the cause of languages. There are several options and complexities within the otherwise not so complex task of language documentation, and this can cause confusion and difficulty for newcomers. Mentorship improves efficiency and saves time and effort, since a mentor can help directly or point to an appropriate source. The mentor does not have to be from the same language either, as Eddie Avila says: “Even if there are not existing activists in one’s language, opportunities for cross-linguistic, cross-regional mentorship between activists from another language can guide and inspire interested individuals.”

Mentorship is especially recommended for Wikimedia newcomers. Eddie says:

In terms of Wikimedia projects, policies might be tough to understand for those not familiar with the platform, but the mentoring model, especially from those from the same language community, can help remove some of these barriers to understanding. We have seen examples of how communities are adapting Wikimedia projects based on their own local context and approach to knowledge sharing.

The need for mentorship was also evident during the workshops conducted as part of the project.

  • Finding ways to encourage and inspire language speakers to get involved: there might be many challenges with regard to connectivity, access to equipment, and basic skills/digital literacy. One can work towards making these things accessible by providing equipment and internet and by guiding interested people towards grants. Language experts advise that easily understood information should be provided through a medium appropriate for the audience, such as videos or short clips, and in their language as well. They say that seeing toolkits and resources makes a big difference in inspiring activists from similar communities/backgrounds.
  • Need to acknowledge and tackle conservatism/Wikipedia centrism: Wikimedians stated during the interviews that Wikipedia should be the platform to begin with; however, they did not provide explicit reasoning for it other than convention. Socially, Wikipedia has brand value. A language having its own Wikipedia edition provides validation to the language and becomes newsworthy, as in an example here. As the experts above mention, an important question to contemplate is what your language really needs currently: validation or relevant content. The appropriate Wikimedia platform for the digitization of linguistic and cultural content can be selected after this consideration.
General challenges and solutions:
  • Need for motivation among possible volunteers: In a financially unstable country, the basic requirements of young people have to be kept in mind, since one cannot expect contributions from people who are struggling with basic necessities. Some form of reward, such as certificates or acknowledgement, is required. Language experts and Wikimedians also suggest assistance such as equipment and internet support.
  • Regular meetups: According to Wikimedians and language experts, regular meetups and editathons help keep Wikimedia contributors motivated.
  • Need for consent and explanation of licenses: Informing the people involved about the consequences of uploading data is also an ethical requirement.
  • Savior mentality needs to be avoided: It is important to “be a catalyst rather than being savior”, as recommended by Subhashish.
  • The need to have a community hinders individual contributions: Certain Wikimedia projects remain relevant without constant updates and contributions. For example, a volunteer can contribute several hundred words to the Wiktionary of their language, and it would be useful. Such platforms do not need contributions from several people to be useful. Wikipedia and Wiktionary, for instance, require an incubator phase, whereas Commons and Wikidata do not.
  • Problem of references in Wikipedia: One survey response cites “inadequacy of relevant references to secure and improve our local Wikipedia articles, lack of quality volunteers to contribute on digital platforms to promote language.” References are also an issue for a Wikipedia in a new language. If a language already has other content on Wikimedia platforms, it is easier to run Wikipedia pages with references. Interviewees Olushola and Iyumu talk about this issue as well. A respondent advises that Wikipedia and Wikiquote be launched in endangered languages for their revival: “Wikimedia projects are helpful in a number of ways but that would require some active people to participate in the game. WMF projects such as Wikipedia, Wikiquote etc. could be launched in endangered languages to help the languages revive.” However, the caveat that this would require active participation from a number of people is notable.
Opportunities:

This section breaks down responses from the survey and interviews in order to evaluate which Wikimedia projects are best suited for language digitization.

Wikimedia platforms recommended for language digitization

  • Wikidata: Wikimedians and language experts familiar with Wikimedia projects suggest Wikidata, since it has a low entry barrier and can be useful to individual contributors.
  • Wikipedia: Two Wikimedians suggest Wikipedia as the primary project, since it is the most widely used, as discussed earlier. Daniel, Olushola, and Iyumu mention that Wikimedia projects other than Wikipedia are tough for beginners to navigate. What solutions can be applied without impeding or prolonging the task of digitizing languages? Creating tools within the span of small projects or workshops would be tough, so how can people navigate these platforms while also completing their tasks efficiently?
  • Commons: Language experts familiar with Wikimedia platforms and Wikimedia volunteers suggest Commons, as it can be linked to other Wikimedia projects as well. It is also recommended for capturing the orality of all languages, and it is especially relevant for languages that are used mostly orally and have little literature available.

Subhashish, who is also a veteran Wikimedian, has these suggestions. Wikidata is useful for its simplicity and low entry barrier, but it is text-based. Lexemes are being created and used in other projects, which builds a corpus for future machine translation projects. Adding a layer of audio data makes it even more valuable. Commons is being used to document languages, and its content is used on other wiki projects; for example, recordings from the Voice intro project are used in biographical articles on Wikipedia. For revitalization on wiki projects, incubator projects can be created.

Survey responses to the question: “How can the Wikimedia projects aid in documentation of languages?”

Only about 40% of the Wikimedians mentioned that it’s possible to utilize sister projects of Wikimedia for digitization of oral culture.

Survey takers have diverse responses to the question: “How can the Wikimedia projects aid in documentation of languages?” and various Wikimedia platforms were suggested in the survey result. Below is the breakdown of the recommendations:

One survey taker advises technical and financial assistance to volunteers, another advises working locally rather than remotely, and yet another advises meetings and research. Two responses suggest training for interested individuals. Three respondents suggest Wikisource. Other platforms suggested include Wiktionary, Commons, Wikidata, Wikipedia, Wikiquote, and Wikibooks.

One of the survey takers says: “I don't know (which) project. Registering with needy populations, researching and rescuing the traditions and customs of their language, making them available on social networks in the community family network. Wikimedia with its interested affiliates.” A trilingual Wikimedian suggests: “We in the Wikimedia projects can help in transcribing songs, ballads and poems preserved orally, by making the transcriptions public and easily accessible for future researchers.”

Following is the breakdown of recommendations in response to the question: “How can the Wikimedia projects aid in revitalization of languages?”

Wikisource-7, Commons-3, Wikipedia-2, folk songs and idioms-1, Promoting engaging cultural events and researching local people's languages-1.

Another survey taker suggests inclusion of native speakers: “Wikimedia projects should develop a separate strategy for endangered languages, such as planned work on endangered language revival with the help of native speakers and user groups.”

These responses are in contrast with the responses from Wikimedians in the in-depth interviews. Two Wikimedians emphasized contributing to Wikipedia over other Wikimedia projects. This demonstrates that Wikipedia has brand value and recognition. However, that is not the only response from Wikimedians. Another Wikimedian interviewee, Iftekhar from the Bangladesh Wikimedian community, says: “Via Wikimedia platforms, I can see documentation of a language through articles concerning the language (Wikipedia) and its grammar (Wikibooks), vocabulary (Wiktionary), literature (Wikisource), and courses (Wikiversity). But there is a problem with audio-visual documentation. Wikimedia Commons can not perform the job properly, because it treats all kinds of audio and video as the same and stores them somewhat uncategorically. It would be better if Commons had a separate section for language documentation or a different project altogether.”

In conclusion, Wikimedia projects have a lot of potential as sites for the preservation of linguistic and cultural content, and this is clear to language experts as well as Wikimedians.

Conclusion

This research clarifies that Wikimedia projects are excellent sites for hosting a variety of content from different languages and cultures. However, there is considerable room for improvement in the systematic utilization of these resources to facilitate contributions from underrepresented language communities.

Space for improvement and creating awareness within Wikimedia projects:

It would not be ideal to utilize Wikimedia projects without a proper plan. Different projects have different uses along with their pros and cons; these need to be considered before they are used. For example, as mentioned by an interviewee, Commons is the primary site for uploading audio-visual content; however, the content might be difficult to find if it is uploaded without categorization.

Need for creating awareness for sister projects of Wikipedia:

As the results of the general survey show, barely 11% of participants were aware of any Wikimedia projects other than Wikipedia. If Wikimedia projects are to be made more inclusive and diverse in terms of content, then potential contributors need to be aware of their existence.

Need to address text-centrism:

Due to the brand value of Wikipedia, language enthusiasts usually promote it over other Wikimedia projects. Recognizing the potential of other projects and the innovative ways in which they can be used would be useful. Since Wikimedia platforms are sites not only of preservation but also of popularization, deliberating over which aspect to focus on would be helpful in choosing the suitable platform.

Aid and mentorship to interested individuals:

As mentioned earlier, the policies of Wikimedia projects are difficult for a newcomer to understand intuitively. Hence, mentoring becomes an important tool for onboarding newcomers to the various projects and platforms. Inter-language and inter-community mentorship is also a solution for language communities that do not currently have contributors. Other forms of aid can include providing equipment and internet support, as per the requirements of the individual or the group.

Recommendations


The various Wikimedia platforms are highly fertile ground for the digital preservation of languages in a number of ways. However, applying the following recommendations would help improve their utilization in the Indic context.

Technology

  1. Improving software and tools: Resources should be allocated to improve the infrastructure of Wikimedia projects. There are websites other than Wikimedia projects, such as Living Dictionaries, that are used for language preservation and revitalization. Perhaps some of their intuitive designs can be replicated within Wikimedia projects. This aspect needs special attention, since the majority of languages are oral and have few or no academic and literary texts. Representing the diversity of languages therefore requires that a Wikimedia platform, say Commons, is capable not only of hosting such content but also of enabling its easy discovery. There are several Wikimedia projects that can be used to host the oral and textual aspects of any language, e.g. lexemes can be added on Wikidata, folk songs can be uploaded on Commons, and transcriptions of audio-visual content can be added on Wikisource. However, the process for doing these is not intuitive. Uploading a video on Commons requires that it be in WebM format, so its online conversion is required. If one wants to transcribe or create subtitles for an audio or video file on Commons, they would either have to use another website or do it via a cumbersome manual process. To conclude, Wikimedia projects need investment so that they can be used easily for the documentation and revitalization of languages.

Wikimedia platforms

  1. Moving out of text-centrism and using Wikimedia projects innovatively: The text-centrism of Wikimedia has to be recognized. Wikipedia is recognized by many people, but only a handful are aware of the sister projects. The sister projects are valuable platforms for the digitization of textual and non-textual forms of languages. This is especially important for communities with predominantly oral languages. Collecting audio-visual content in addition to textual content helps in the linguistic and cultural representation of under-resourced and underrepresented languages and communities.
  2. Promoting contribution where the presence of an established community is not a prerequisite: Wikimedia has come to be associated strongly with Wikipedia. Creating a Wikipedia in one's language is usually the course that language activism takes. However, do we want to restrict the languages that can be hosted by Wikimedia by confining them to Wikipedia? For example, the Angika Wikipedia incubator was created a decade ago, yet it has still not become an active Wikipedia. Other platforms, namely Commons and Wikidata, do not require incubators. This raises the question of whether all languages need a Wikipedia. Research indicates that the most-read articles in a language concern local content rather than global content in the local language. It would make sense to recognize projects that can support local cultural content such as media and lexemes.
  3. Small communities and individual contributors can focus on aspects that do not require constant updates: A Wikipedia article on, say, a country or a living individual is a source of information that becomes outdated if it is not regularly updated. However, not all Wikimedia projects require constant updates and contributions to remain relevant. A volunteer can singlehandedly contribute several hundred words to the Wiktionary of their language or create lexemes on Wikidata; information of this type does not need to be updated regularly to remain relevant. This approach is useful in cases where a community does not exist, and it would be encouraging to individual contributors. Wikipedia and Wiktionary require an incubator phase, which requires a number of active contributors before they become full-fledged projects, whereas Commons and Wikidata do not require an incubator phase. Lexemes, oral culture and oral history videos, and transcriptions of oral culture do not need to be updated; they will remain relevant and useful indefinitely.
  4. Inclusion of contributors from diverse cultural and linguistic communities: This calls for social opportunities and involvement from the ground up. Revitalization of languages is bound up with the digitization of language content. This means that any digital activity beneficial to the language is a positive step. Whether it is a collective movement to increase the number of speakers of the language or the creation of social media content in the language, there is no single correct way to do it. A first step towards a more inclusive internet would be to conduct in-depth research on the current state of languages on Wikimedia projects. How many languages are being supported in various forms by Wikimedia projects today? What steps should be taken in order to improve and include more languages in the coming years? These can be some of the guiding questions. Inspiration can be taken from the State of the Internet's Languages report 2022.

Oral culture

  1. Citizen archivists have to be promoted to create oral culture content: This research has established the importance of oral cultural and linguistic content; the next step is to set the creation of such content in motion. Citizen archivists from the same community or region can capture the orality of the language well. The 1947 Partition Archive has successfully trained individuals and collected more than 10,000 oral histories. A training course similar to the Reading Wikipedia in the Classroom training of trainers might be useful as well.
  2. Creation of oral culture content relevant for given languages: As mentioned in the Oral Culture section above, oral culture such as folk songs is disappearing fast. Oral narratives are carriers of culture and need to be recorded and uploaded. Oral history can range from historical events to the personal or social history of a speaker. Folk songs around local customs and culture, such as suddas, langurias, and crop harvest songs, can be recorded to capture the language, its lexemes, and the culture. The Oral Culture Transcription Toolkit provides guidance on utilizing Commons and Wikisource for documenting oral culture content and for its representation in textual form. However, major improvements in software, as mentioned above, will help make the process easier and involve less hopping between platforms.

Support

  1. Providing needed support to interested individuals: There are certain avenues for supporting individuals and communities with internet, equipment, and mentorship. The Hardware Donation program, CIS-A2K requests, and Wikimedia Community Funds are some of the routes that individuals and communities can take to get internet or equipment support. The recently launched peer-learning initiative Let's Connect can be a great avenue as well.