Grants:Project/Rapid/AlmaMaK/OER tool and documentation

From Meta, a Wikimedia project coordination wiki
status: funded
Reusing OER resources on Wikimedia tool and documentation
A tool and documentation to systematically survey, assess and integrate large amounts of open license text, data and images from external sources and integrate it into Wikimedia.
target: English Wikipedia, Commons, Wikidata, Wikisource
start date: 1st June
end date: 1st October 2019
budget (local currency): 1,778 EUR
budget (USD): $2000
grant type: Individual (but being hosted by UNESCO)
grantee: AlmaMaK


Project Goal

Briefly explain what are you trying to accomplish with this project, or what do you expect will change as a result of this grant. Example goals include, "recruit new editors", "add high quality content", or "train existing editors on a specific skill".

English Wikipedia comprises 5.8 million articles, while it is estimated that Wikipedia could cover over 100 million subjects. As a result, a huge amount of knowledge is not yet represented in Wikipedia. Since in-person training workshops have retention rates below 1%, training experts to share their knowledge by adding information directly to Wikipedia does not appear to be scalable. However, there is a huge amount of open license content written by experts available online, e.g.:

  • There are over 13,000 Open Access journals; PLOS, the largest journal in the world, is open license.
  • The Met Museum website has 100,000s of object descriptions available under an open license.
  • Internet Archive holds millions of open license documents.
  • Project Gutenberg holds over 58,000 copyright-free books.
  • The UNESCO website will soon have 1.5 million pages available under CC BY-SA.
  • The UNESCO OER recommendation should vastly increase the amount of open license content available produced by governments.

Much of the information needed by Wikipedia is available and under an open license. Information on the Sustainable Development Goals, for instance, is often being produced by experts (including the UN), written for a general audience. It is, however, not easily accessible. This is because it is often fragmented across hundreds of low traffic websites, in long UN reports, not available in the reader’s language, not available in accessible formats or not online. A study by the World Bank showed that almost one third of their publications had never been viewed and only 13% had been viewed more than 250 times.

Over the past 3 years, English Wikipedia, in cooperation with UNESCO, has developed an easy process to reuse text from open license sources and a metrics tool to measure the usage of text from different sources. As a result, UNESCO publication text appears in Wikipedia articles that are seen over 4 million times per month. This demonstrates the potential of OERs as sources of text for Wikipedia. We are now starting to expand this work to other UN agencies, including FAO and WIPO. Text from one 28-page FAO publication has been used to improve 10 Wikipedia articles related to fishing, which are viewed over 80,000 times per month.

In order to extract the full value of open license resources and increase the quality and quantity of expert-written content on Wikipedia, a tool is needed to systematically survey, assess and integrate large amounts of open license text, data and images from external sources and integrate it into Wikimedia projects.

A structured process will have additional benefits and encourage the production of additional tools:

  • Since it will introduce their work to a new and more diverse audience, the tool will encourage organisations to use open licenses by establishing Wikipedia as a forum to share their text. Wikipedia is not yet able to capitalise on this text because no structured process exists to assess what resources are available.
  • Given that the UN requires a very high quality of documentation, the tool will ensure that the documentation added to Wikipedia will be of good quality and used by a wide audience.
  • The tool will encourage others to create further tools to facilitate the extraction and processing of open license content from external sources.  

Project Plan

My Background

I hold two Bachelor's degrees from the University of Leipzig, Germany, in Anthropology and Linguistics. As a student assistant, I carried out research for the Chair of Anthropology, copy-edited texts and compiled an academic booklet from Master's students' term papers. Sponsored by a Carlo Schmid Programme fellowship (October 2018-March 2019) through the German Academic Exchange Service (DAAD), I worked in the Diversity of Cultural Expressions Section at UNESCO. There, I strengthened my analytical skills by, for instance, carrying out research on the implementation of the 1980 Recommendation concerning the Status of the Artist. Under the guidance of the Wikimedian in Residence at UNESCO, John Cummings, I then used Wikipedia as a tool to share my findings with a broader community. I wrote two Wikipedia articles, on “Artistic Freedom” and on the Section’s Global Report “Re|Shaping Cultural Policies”. Additionally, I integrated important data (abbreviations used by UNESCO, graphs and figures produced by UNESCO, etc.) into Wikidata. I also participated in the Wiki4Women event to increase the presence of Wikipedia articles on women.

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing?

This grant is needed because this project will take a very significant amount of time and needs to be completed in the coming months. Crucially, the project intends to provide a platform for open license text from UNESCO and other organisations and to incentivise other organisations, both in and out of the UN system, to release a significant amount of content.

John Cummings, Wikimedian in Residence at UNESCO, will act as my mentor and, if requested by WMF, as my supervisor. Navino Evans (working with John in 2019) will offer technical advice and help with technical implementation. John and Navino are funded by another grant, not by this request. UNESCO is providing free office space and staff time to collaborate on the project, including that of Ian Denison, Chief of the UNESCO Publications Unit, and publication staff from different sectors.

The main activities for this grant are:

  • The development of a tool for the systematic survey, assessment and integration of large amounts of open license material from external sources into Wikimedia requires thorough research. This research will be carried out under the guidance of the Wikimedian in Residence at UNESCO, John Cummings, and in collaboration with the UNESCO staff responsible for publications and web content. It will include working with the community, Wikimedians in Residence and other groups to understand what the community wants and what might be technically possible. This will include building on the work done on Help:Adding_open_license_text_to_Wikipedia on which text is suitable and what adaptation is needed to make it suitable for Wikipedia. I am able to work in English, German, Portuguese, French, Italian and Spanish, allowing me to communicate with multiple language communities and to provide the tool in many languages. We are starting with English Wikipedia because it has the most existing infrastructure for using open license text, as well as the largest audience and number of volunteers.
  • We also want to build case studies within the UN to show value to other UN agencies and other organisations, encouraging them to release their content under Open Access. As a case study, the tool will be applied to high quality content and data developed by UNESCO. UNESCO has a very large amount of content: 1,500 open license publications and 1.5 million web pages under an open license. This large collection, which is comparable in size to those of other large organisations, will help surface many structural issues that working with smaller collections could not. The tool will be usable by other producers of large amounts of high quality open license content, so that they can contribute sustainably to Wikimedia.

How will you let others in your community know about your project (please provide links to where relevant communities have been notified of your proposal, and to any other relevant community discussions)? Why are you targeting a specific audience?

We have notified groups of people working with cultural organisations as they are the most likely to use what will be created. This process will open up a new way of sharing large amounts of knowledge for these organisations on Wikimedia projects.

What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

  • A process to systematically survey, assess and integrate large amounts of open license text, data and images from external sources and integrate it into Wikimedia.  
  • Excellent quality instructions integrated into Wikipedia and linked from many other pages.
  • Case studies showing the value of open licensing and of importing content systematically into Wikimedia.

I will work with John Cummings, Wikimedian in Residence and Navino Evans, who is working with him in the role of the project’s technical lead. John and Navino will promote this work at Wikimedia events and other opportunities.

Impact

How will you know if the project is successful and you've met your goals? Please include the following targets and feel free to add more specific to your project:

We will have created the opportunity for Wikimedia volunteers to systematically analyse and import content from open license sources into Wikimedia projects for English Wikipedia (and possibly other languages), Commons, Wikidata and Wikisource. The process will be known about and used.

We will have increased the number of ways knowledge can come into and be represented by Wikipedia and provide a new avenue for partnerships and collaborations with external organisations.

I will produce case studies for content developed by UN agencies that has gone through the process for others to understand, copy and improve upon (the project will not include paid production of articles).

  1. Number of total participants
  2. Number of articles created or improved (if applicable)
  3. Number of photos uploaded to Wikimedia Commons (if applicable)
  4. Number of photos used on Wikimedia projects (if applicable)

Resources

What resources do you have? Include information on who is organizing the project, what they will do, and if you will receive support from anywhere else (in-kind donations or additional funding).

| No. | Category | Item description | Unit cost (euros) | Number of units for 4 months | Total cost in Euros | Total cost in US Dollars | Notes | Source of funding |
|---|---|---|---|---|---|---|---|---|
| 1 | Equipment | Macbook Air | 1100 | 1 | 1100 | 1230 | | Applicant |
| 2 | Office space | Desk at UNESCO + office services | 500 | 4 | 2,000 | 2,261 | Standard UNESCO costing for desk space | UNESCO |
| 3 | Staff time | Time of UNESCO staff | 40 | 20 | 800 | 904 | | UNESCO |
| 4 | Staff time | John Cummings' time | 25.64 | 20 | 256.40 | 289.85 | | Wikimedia Foundation project grant |

What resources do you need? For your funding request, list bullet points for each expense:

| No. | Category | Item description | Unit cost (euros) | Number of units for 4 months | Total cost in Euros | Total cost in US Dollars | Notes | Source of funding |
|---|---|---|---|---|---|---|---|---|
| 1 | Project time | Project time for me | 8.89 | 200 | 1778 | 2000 | | This grant |

Endorsements

  • Support This tool is really useful and the team behind implementing the tool is also strong. Pratik.pks (talk) 11:16, 21 April 2019 (UTC)
  • Support Makes Wikipedia better 46.11.7.150 19:34, 21 April 2019 (UTC)
  • Support --Isaac (talk) 21:51, 21 April 2019 (UTC)
  • Support Ircpresident (talk) 06:41, 22 April 2019 (UTC)
  • Support --Ipigott (talk) 11:16, 22 April 2019 (UTC)
  • Support Emjackson42 (talk) 13:45, 22 April 2019 (UTC)
  • I don't oppose but a note of caution is needed. Previous additions of blocks of open license text to WP articles by John Cummings have met quite a lot of opposition, and often been removed. Additions need to be done carefully. Johnbod (talk) 15:30, 22 April 2019 (UTC)
    • For reference, most of this resistance was due to misunderstandings about the copyright of the content, caused by third parties falsely claiming copyright on UNESCO content. To my knowledge, only around 5 pieces of content were removed for other reasons, out of around 250 articles with content added from UNESCO. John Cummings (talk) 18:11, 22 April 2019 (UTC)
      • That number seems surprisingly low to me - I remember there also being concerns about slabs of mandarin committee-written (and then translated) text overwhelming articles, or creating new ones. Johnbod (talk) 03:06, 23 April 2019 (UTC)
        • Hi John, I'm now really confused, we haven't done anything with translated text. John Cummings (talk) 09:59, 23 April 2019 (UTC)
          • I'm not saying you've translated it, but I was assuming much of the text was originally drafted in other languages - it read like it. Johnbod (talk) 01:38, 27 April 2019 (UTC)
  • Support I use the last generation process for this at OTRS. I understand the proposal here - it is for an easier and better process for releasing text. Where OTRS works well for images, it is problematic for text. There are even new OTRS release processes in development and working this idea from another perspective is good for all kinds of media, including text, images, data, and the rest. Blue Rasberry (talk) 16:02, 22 April 2019 (UTC)
  • I've not reviewed the impact measures and the budget in detail, but I agree that importing freely licensed text to seed (existing) articles is a high impact activity. As evidenced by some comment above, it's also a time-intensive and tedious work which requires a lot of focus and is easy to get wrong if you have too little time for it, so there's less need for advice on how to do it than work to actually do it systematically (including mapping what articles would benefit the most from what sources and devising sensible monitoring for progress). Nemo 17:18, 22 April 2019 (UTC)
  • I want to support this project so I can update my licenses regarding privacy 173.123.116.67 09:17, 23 April 2019 (UTC)
  • Support May Hachem93 (talk) 10:13, 23 April 2019 (UTC)
  • Support A good proposal to leverage already-existing open information T.Shafee(Evo﹠Evo)talk 11:16, 23 April 2019 (UTC)
  • Neutral An interesting proposal. I'd like to see something in the application that provides for addressing copy editing and wikifying in the process; with such volumes there is the potential for significant cleanup, copy editing, categorization, wikifying, and poor terminology, especially with Indigenous knowledge. While the various language communities are good at these, a significant workload dump is going to ruffle feathers. I take it on good faith that the process will weed out unencyclopedic material before it gets transferred. Gnangarra (talk) 11:06, 24 April 2019 (UTC)
Hi @Gnangarra:, thanks very much for your comment, agreed this needs to be addressed. We have discussed it but didn't make it clear in the application, I've added a line in the proposal about it. I've previously done some work on this in the documentation on Help:Adding_open_license_text_to_Wikipedia but I think there needs to be some thought on how to transform the text for individual publications as well as general guidance (which I'm sure can be improved). As an example UNESCO recently made all 7000+ pages of General History of Africa available under CC BY-SA, which will be a great resource but some of it was written over 30 years ago so will most probably need some adaptation beyond the guidance given in the help page. Can you think of any examples we could use as case studies to help highlight issues? Thanks, John Cummings (talk) 12:12, 24 April 2019 (UTC)
The process very much needs someone to take responsibility to address issues and bring data up to date. What we experience in Australia is the very worst of Colonial knowledge being used by government to push other agendas over the last 20 years, with much of that being dumped in all kinds of documents; see en:History wars for an overview. These types of problems become instant end game for collaborations with Indigenous communities; see en:Yagan for the way in which Indigenous people are treated, which drove away 5 contributors I had just spent a month working with. Sadly the final straw in that was one editor applied to present at Montreal about the issues of Indigenous knowledge use to colonial source to advance racial inequity. The kind of cultural knowledge dumps that don't come from the cultural communities but rather from "European" documentation cause problems for those working on outreach to these communities. As I said, care needs to be taken with the dumping of content, and the proposal really needs to have more about sense checking, fact checking, and treating Indigenous knowledge with respect. Gnangarra (talk) 13:00, 24 April 2019 (UTC)
Thanks very much for the explanation @Gnangarra:, we'd really love to work with you on this to reduce the possibility of this kind of thing happening again. It feels like what is really needed is to create some kind of mechanism to better assess and apply rules on Neutral Point of View for imported text. Currently text is being imported in a kind of haphazard way that is invisible to most contributors. I think there are a lot of similarities with the issue we have on Wikidata of no central register of imported datasets, so there is no community discussion (including subject matter expertise) of how/if they should be imported and no way to help each other with importing data. This is something I've partially addressed by creating Wikidata:Dataset Imports. Thanks again, John Cummings (talk) 15:58, 24 April 2019 (UTC)
  • Support I support these efforts as this is the right direction, and John has the skills and experience to deliver on his plan. --Rosiestep (talk) 14:44, 24 April 2019 (UTC)
  • Support Freely licensed text is an underused resource and this project could help remedy that. Rachel Helps (BYU) (talk) 16:49, 24 April 2019 (UTC)
  • Support Great project, and a very practical solution to the challenge facing us regarding editor hours and scope of information to be added to Wikipedia. Smirkybec (talk) 17:31, 24 April 2019 (UTC)
  • Support The community needs tools to better reuse available Open Access text. Impact on GLAM projects may be really relevant. --Marcok (talk) 02:17, 25 April 2019 (UTC)
  • Support I think this is a great idea and I have no doubt that John could implement it successfully. Nevborg (talk) 13:32, 25 April 2019 (UTC)
  • Support I hope it expands its reach to include public domain text available in Philippine government & non-profit sites. --Exec8 (talk) 02:42, 26 April 2019 (UTC)
  • Support Battleofalma (talk) 14:23, 29 April 2019 (UTC)
  • Support A way to identify good quality content and tailor it to Wikipedia? Yes please. Richard Nevell (talk) 17:42, 29 April 2019 (UTC)
  • Support Hoping it will work. The people involved in this make me think it's gonna work, and it's gonna be worth it. Sannita - not just another it.wiki sysop 14:40, 1 May 2019 (UTC)
  • Support This is a long overdue project that could immeasurably add to the knowledge on Wikipedia. Completely endorse. Islahaddow (talk) 10:59, 2 May 2019 (UTC)
  • There were some concerns on the en.wp Village Pump that the freely licensed text may not be suitable for importation to Wikipedia on POV/OR/other concerns. That said, it was pointed out that much of the content is suitable for Wikisource and would be welcome there (something that I agree with). The proposal should be modified to make contributions to Wikisource, allowing Wikimedia as a whole to extract maximum value from the imported content. MER-C (talk) 14:49, 3 May 2019 (UTC)
    • @MER-C: we have now added Wikisource to the submission as a possible 'home' for the content. John Cummings (talk) 21:50, 3 May 2019 (UTC)
  • Oppose I'm not sure it's entirely clear what we are going to end up with, but it appears some sort of automated, or semi-automated, process for creating Wikipedia articles from UNESCO and other documents is the aim. This might work for Wikisource, but good Wikipedia articles cannot be created by algorithms. A human author is needed to write a proper encyclopaedia article. If the UNESCO material is already in a suitable form then we really don't need any special process. With copy-paste plus a boilerplate reference hundreds of articles per hour could be turned out. But I bet the problem is that the material is not actually directly usable in this way. Sure, the UNESCO material might make good references, but it still needs an editor to knock it into shape.
Several early attempts on Wikipedia to mass produce articles from external databases had results that weren't pretty. Examples of this were astronomical objects and geographical locations. These projects certainly created a huge number of articles, but mostly poor quality stubs. Many of these have subsequently been deleted, sometimes en masse, as non-notable, or in some cases inaccurate. In short, I fear this project may create more in the way of problems for future editors than production of good material. I might change my mind if some specific examples of UNESCO pages turned into articles were provided that could have been achieved with an automated process, but at the moment, I'm not seeing that as likely. SpinningSpark 09:31, 5 May 2019 (UTC)
@Spinningspark:, I think there has been a misunderstanding, to be clear, we are not creating an automated or semi-automated process to add text to Wikipedia (including creating articles) at all and no part of the process will use an algorithm. We want to create a collaborative process for assessing open license text and recording that assessment due to the volume of content that is available (see the General History of Africa example above). This will build on the work I have previously done to provide a process to help people use existing open license text and to adapt it to be suitable to include in Wikipedia. Please let us know which sentences in the proposal gave you the impression that any part of the process would be automated and we will try to make it more clear. John Cummings (talk) 16:25, 5 May 2019 (UTC)
John Cummings, I'd request you reply to the concerns raised in the thread you started on en-wiki and then ignored. Per the comments of myself and others there, in my view your essay which you link above demonstrates such a complete misunderstanding of Wikipedia that I'm strongly inclined to delete it to ensure there's no risk of good-faith users stumbling across it, trying to follow its instructions, and getting themselves blocked for disruption. Iridescent (talk) 07:01, 7 May 2019 (UTC)
Hi @Iridescent:, I stopped visiting the page after this message. I think that discussion of a Wikipedia page is off topic for the endorsement section of the grant proposal so I've replied to you on the talk page. Thanks John Cummings (talk) 11:29, 8 May 2019 (UTC)
John Cummings, you asked "Please let us know which sentences in the proposal...". I believe the sentence you are looking for is this one: "a tool is needed to [] integrate large amounts of [content] and integrate it into Wikimedia projects". Your redundant emphasis on "integration" naturally leads people to share your focus on that as your purpose for the tool. However, I believe the primary concerns are unrelated to uncertainty about possible automated integration. However, I'll address that as an independent comment. Alsee (talk) 10:12, 7 May 2019 (UTC)
  • Comment: This proposal has, thus far, received a less than favorable reception at English Wikipedia Village pump. I believe I can explain why. Commons expects and desires full and raw inclusion of any acceptably licensed materials with a credible educational purpose. Wikisource expects and desires full and raw inclusion of any appropriate and acceptably licensed materials. Most Wikidata edits are bots, generally inhaling selected raw content. However Wikipedia is purpose-crafted content, encyclopedia articles. Wikipedia content is subject to a large number of restrictive policies on acceptability. The concern is that externally produced content will rarely take a form fit for presentation as "encyclopedic coverage", and will even more rarely fit under our definition of appropriate encyclopedic coverage. I'm sure there's no objection to having resources like this available for editors to pull-from, but there is concern that encouraging people to insert bulk-content is likely to end badly. Anyone who starts editing with the intent of adding external content onto Wikipedia can easily end up in a block. For Wikipedia purposes a paragraph pretty much qualifies as "bulk content", and adding content like that to multiple pages will draw a lot of attention. It can work if someone knows what they're doing and they found some highly suitable content. Alsee (talk) 10:12, 7 May 2019 (UTC)
Hi @Alsee:
Thanks very much for your thoughts.
As you say, Wikipedia is only one of the possible avenues for external content, and I agree that a large amount of the content will not be suitable for Wikipedia. However, it is my experience that content written externally is sometimes suitable for inclusion in Wikipedia with light to moderate alteration, especially if it comes from a high quality source. Many other publishers are creating well referenced explanations of a subject with suitable sources under an open license. We want to create a place that will allow contributors to work together to understand what text is suitable. I think this will reduce the possibility of things going 'wrong'.
Thanks again
John Cummings (talk) 11:32, 8 May 2019 (UTC)
  • Support I'd be glad of such a tool/process, and can see a number of useful applications for GLAM/Education outreach. Lirazelf (talk) 09:13, 8 May 2019 (UTC)
  • Support A very well designed and promising proposal! Robbru (talk) 13:09, 14 May 2019 (UTC)
  • Support Chicagohil (talk) 21:19, 16 May 2019 (UTC)

Participants

  • Volunteer Reviewing 46.11.7.150 19:33, 21 April 2019 (UTC)