Grants talk:Project/Enrichment of multilingual scientific/technical/medical terms of Wikitionary
Project Grant proposal submissions due 30 November!
Thanks for drafting your Project Grant proposal. As a reminder, proposals are due on November 30th by the end of the day in your local time. In order for this submission to be reviewed for eligibility, it must be formally proposed. When you have completed filling out the infobox and have fully responded to the questions on your draft, please change status=draft to status=proposed to formally submit your grant proposal. This can be found in the Probox template found on your grant proposal page. Importantly, proposals that are submitted after the deadline will not be eligible for review during this round. If you're having any difficulty or encounter any unexpected issues when changing the proposal status, please feel free to e-mail me at cschillingwikimedia.org or contact me on my talk page. Thanks, I JethroBT (WMF) (talk) 23:20, 27 November 2018 (UTC)
Eligibility comments
Hi Marco Ciaramella, I had a few comments concerning eligibility for the proposal:
- The review and deployment of new extensions on Wikimedia Projects can take a very long time. The documentation for extension developers on MediaWiki reports that it can take longer than two years, and the scope for Project Grants is limited to one year. If new extension development is the proposed approach for this project, I am afraid this proposal will not be eligible for review because it's unlikely it will be completed in the time frame.
- If there is an existing extension you wish to improve that can incorporate some of this functionality, or a different approach you would like to take to conduct this project, please let me know as soon as possible, as I will be finishing making eligibility decisions tomorrow.
Thanks, I JethroBT (WMF) (talk) 19:45, 10 December 2018 (UTC)
- Dear @I JethroBT (WMF):
- 1. my project proposal is 9 months long, not 2 years;
- 2. I proposed 3 activities (1. algorithm development, 2. algorithmic vocabulary extension of English Wiktionary by using Wikipedia, 3. manual Wiktionary vocabulary extension for Italian when this information cannot be inferred from Wikipedia). The Wiktionary extension is for STM (scientific, technical, medical) terms "only". The specific evaluation of the number of terms and of the quality of terms will be specified in the algorithm development phase. My guess is that there will be thousands of new or extended entries in English and in Italian, but I would prefer to provide more specific details only when the automatic procedure works, i.e. by month #3.
- 3. About the algorithm, I will document/evaluate/share the pseudo-code at the general level, i.e. what kind of fields in Wikipedia could be reused for Wiktionary and how. This is what I include in the project, to enable others to carry out similar work. My intention is not to include in this project, a) the code developed b) the detailed pseudocode, as the entity extraction used and so. I consider the detailed pseudocode as an IntelliSemantic internal background.
- 4. The activity includes of course paper publication work, for assessment and result diffusion and future documentation.
- I hope that these clarifications help to make this proposal eligible; otherwise we can edit some parts of the proposal according to the clarifications themselves.
- Cheers,
- Marco --Marco Ciaramella (talk) 12:50, 11 December 2018 (UTC)
- @Marco Ciaramella: My apologies, I misunderstood the use of the term "extension" in the context of this proposal, so my concerns about the proposal length were based on an incorrect assumption. The only remaining concern I have is related to work on algorithms and code. The following kinds of work are eligible:
- Production of code, research, materials that are published and released as free and open-source. Licensing should be compatible with current Wikimedia and MediaWiki practices
- ...whereas the following is not:
- Production of code, research, materials that are created on a closed source platform or published in such a way that access is not freely available
- While you noted above that you will be able to provide pseudocode, you would not be able to include in this project "a) the code developed b) the detailed pseudocode, as the entity extraction used and so." This suggests there are some important aspects of the funded work that would not be able to be released under an open source license, most importantly, the code developed. I'm afraid this kind of work is not eligible for funding, but please clarify if I have misunderstood. I JethroBT (WMF) (talk) 14:57, 11 December 2018 (UTC)
- @I JethroBT (WMF):
- I now understand your point: probably the "production of code" is misleading in this context, since my main interest is in the openness of the research results (including the algorithm developed and a published paper issued on an open repository). With this approach, anyone can reproduce the results (e.g. implement a component suitable for the Wikimedia environment, and then keep it updated): I think that this approach is more flexible for Wikimedia too, which can have its own preferences about the deployment platform. Hence, I will not include the "coding" activity as a funded activity, but I would maintain the "research" (e.g. algorithm specification to produce the code and results obtained) and "publications" activities.
- Cheers,
- --Marco Ciaramella (talk) 16:27, 11 December 2018 (UTC)
Eligibility confirmed, round 2 2018
We've confirmed your proposal is eligible for round 2 2018 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through January 2, 2019.
The Project Grant committee's formal review for round 2 2018 will occur January 3-January 28, 2019. Grantees will be announced March 1, 2019. See the schedule for more details.
Questions? Contact us.--I JethroBT (WMF) (talk) 16:38, 11 December 2018 (UTC)
Comments of Ruslik0
Thanks for this proposal but I have a few comments/questions:
- Have you tried to engage with the Wiktionary community before and what is their opinion about the project?
- Is Wikipedia a suitable source of medical terms? The quality of articles varies considerably and you can extract plainly nonsensical definitions from some articles.
- Which language Wikipedias are you going to mine?
- How do you ensure that dictionary definitions in different languages do not contradict each other?
- Which programming language are you going to use? What licence will you use? Will the source code be open?
Ruslik (talk) 18:58, 12 January 2019 (UTC)
Replies to Ruslik0's comments
- Hello, @Ruslik0:
- The project is actually under discussion at this page.
- Yes, it is; I am also a user in this sense. The first project step is to extract some relevant terms according to some known techniques of Natural Language Processing (this task is commonly referred to as "term extraction"). Linking the extracted terms to plain definitions would be assessed with the most suitable techniques, including what is referenced on Wikipedia and the quality ranking category of each page (see: Wikipedia:WikiProject Wikipedia/Assessment).
- English, Italian. But another task would be to assess a multi-language process suitable for targeting other languages.
- The intention is to semi-supervise the generated English results, while Italian entries will be added manually to avoid possible contradictions, also in line with Wiktionary:CFI.
- We would prototype the algorithm, which will be openly available in the form of detailed pseudocode in an open publication. It is not the intention of my project to release/maintain any piece of code; that could be a further task.
- In conclusion, the deliverables of the project would be: 1. to add new entries (some thousands) in two languages (English, Italian), 2. to assess a multi-language process that could be implemented as a future step.
- --Marco Ciaramella (talk) 15:21, 18 January 2019 (UTC)
Comments from bluerasberry
Please only reply briefly, and trust that I will ask more if I want to know more.
- The proposal says, "it is not possible infer in Wikitionary the Etymology and the Pronunciation fields, since these fields are not supported by Wikipedia". This is correct, but we can do that in Wikidata in the d:Wikidata:Lexicographical data project. For pronunciation the mw:Wikispeech project plans to take information from that lexicographical project. This proposal makes no mention of Wikidata, the Lexicographical collection or project, or Wikispeech. To what extent do you have any familiarity of those projects? What is your impression of the extent of overlap in what you are proposing versus what those projects are doing?
- What collaboration do you expect to have with the Wikimedia community during this project?
- I looked you up at global user contributions. It looks like your home project is Italian Wikipedia, and you have about 100 edits at about 10 other projects, so you seem to know your way around. The project you are proposing would affect several languages and several Wikimedia projects. How would you coordinate communication to stakeholders in this project?
Blue Rasberry (talk) 14:46, 22 January 2019 (UTC)
Reply to @Bluerasberry:
- Yes, I am aware of both projects and I have a background in speech technology and ontology-building. The integration of results from both projects can (must) be part of the extension algorithm, since each "term" in Wiktionary consists basically of a page containing different fields (grammatical features, also referred to as Part-of-Speech, pronunciation, etc.): such a "term" must be linked to a corresponding Q-term. However, Wikispeech does not presently support the Italian language and therefore the Wikispeech project could not be involved for this part.
- My information needs would consist mainly of technical feedback about the API usage of projects related to Wiktionary.
- The typical final users for such new terms could be persons active in research in different subjects, for personal or professional reasons. Technically I myself can be a final user, also for personal use and as a long-term Wikipedia contributor (I have been an active contributor since the birth of the Italian-speaking Wikipedia).
- Cheers,
- --Marco Ciaramella (talk) 08:49, 25 January 2019 (UTC)
- @Marco Ciaramella: Thanks, these answers all make sense for this project. Blue Rasberry (talk) 11:47, 25 January 2019 (UTC)
Summary
This is a research project which will enable a scalable, documented and well-assessed process to extend and enrich Wiktionary for STM multilingual terms. This project will reuse as far as possible the human contributions of different Wikimedia projects: that could positively impact the use of Wiktionary for research purposes, with particular attention to the non-English languages. Main topics are: Natural Language Processing (text summarization) and Open data (Open science).
In summary:
- This project will integrate in Wiktionary different kinds of information already available in Wikipedia, Wikidata, MediaWiki, but not yet fully used in Wiktionary (hence it will reuse Wikipedia resources as far as possible).
- This project will develop a human-supervised, but automatic algorithm to perform this information enrichment (hence scalable and extensible to other languages).
- The algorithm will also identify fields which could require human contribution (I will carry out the assessment of this contribution for the Italian language).
- The finally validated algorithm will be openly available as pseudocode for possible software implementation.
- The final documentation will result in a published research paper under an open publishing license, where I will assess the project results.
--Marco Ciaramella (talk) 21:11, 27 January 2019 (UTC)
Aggregated feedback from the committee for Enrichment of multilingual scientific/technical/medical terms of Wikitionary
Scoring rubric | Score |
---|---|
(A) Impact potential | 6.0 |
(B) Community engagement | 5.0 |
(C) Ability to execute | 6.3 |
(D) Measures of success | 3.3 |
Additional comments from the Committee:
Opportunity to respond to committee comments in the next week
The Project Grants Committee has conducted a preliminary assessment of your proposal. Based on their initial review, a majority of committee reviewers have not recommended your proposal for funding. You can read more about their reasons for this decision in their comments above. Before the committee finalizes this decision, they would like to provide you with an opportunity to respond to their comments.
Next steps:
- Aggregated committee comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback carefully and post any responses, clarifications or questions on this talk page by 5pm UTC on Tuesday, May 11, 2021. If you make any revisions to your proposal based on committee feedback, we recommend that you also summarize the changes on your talkpage.
- The committee will review any additional feedback you post on your talkpage before making a final funding decision. A decision will be announced Thursday, May 27, 2021.
@Marco Ciaramella: Please see note above about the opportunity to respond to committee comments before they finalize a decision on your proposal. Please let me know if you have any questions. With thanks, I JethroBT (WMF) (talk) 03:53, 7 February 2019 (UTC)
Answer to comments
Analysis of the last comments
Topic number | Kind of topic | Topic | Comment(s) involved |
---|---|---|---|
1 | Technical | Platform choice and use | 1, 2, 4, 12, 13 |
2 | Technical | Community involvement | 5, 6, 9, 10, 11, 14 |
3 | Technical | Project evaluation of results (what, how) | 5, 6 |
4 | Technical | Reuse after the project | 3 |
5 | Management | Control of project results | 7, 13 |
6 | Management | Skills available | 7 |
7 | Management | Working context and grant use | 8, 14 |
Answer Summary
- I am proposing a research project about a class of terms (STM) characterized by specific features: for example, most of these terms are neologisms, which are typically not well covered in lexicons. In this case, data from Wikimedia projects such as Wikipedia and Wikidata, if properly exploited, can fulfill this requirement.
- I fully agree that Wikidata will be used as the backbone of the project and that the proposed project will enrich Wikidata content as well as Wiktionary. In any case, a lexicon like Wiktionary remains a significant resource for final users, independently of the internal architectural implementation.
- To characterize the status of multilingual STM coverage and quality at the beginning of the project and to measure the project achievements I will use as test-sets suitable text samples of STM literature for a sample of topics and languages: I think that these evaluations are user-oriented and pragmatic.
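As an illustration only, the before/after coverage measurement described above could be sketched as follows in Python. This is a minimal sketch and not part of the proposal: the function name `term_coverage`, the sample terms and the lexicon lists are all hypothetical, and coverage is assumed here to mean the share of distinct test-set terms that already have a lexicon entry.

```python
def term_coverage(sample_terms, lexicon):
    """Return the fraction of distinct sample terms found in the lexicon.

    Matching is case-insensitive. Both arguments are plain iterables of
    strings; in practice they would come from an STM text sample and from
    a Wiktionary or Wikidata export (assumption, not specified in the thread).
    """
    known = {term.lower() for term in lexicon}
    terms = {term.lower() for term in sample_terms}
    if not terms:
        return 0.0
    return len(terms & known) / len(terms)


# Hypothetical test set of STM terms taken from a text sample.
sample = ["exon", "Intron", "spliceosome", "ribozyme"]

# Hypothetical lexicon contents before and after the enrichment step.
before = term_coverage(sample, ["exon", "intron"])
after = term_coverage(sample, ["exon", "intron", "spliceosome"])

print(f"coverage before: {before:.2f}, after: {after:.2f}")
```

Comparing the two values per topic and per language would give one simple, user-oriented indicator of what the enrichment achieved; a quality measure would need a separate, human-reviewed check.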
Answers by topic
- (Platform choice and use) Taking also into account your suggestions, I came to the conclusion that the proposed project has to enrich and extend Wikidata as well as Wiktionary and that this issue could also be mentioned in the project title/description. Hence, a) I do agree that Wikidata has to be used as an infrastructure backbone of the project; b) but I think that it is useful to get the best from the contents of both Wiktionary and Wikidata, since they have different coverage across languages for STM terms; c) finally I do think that a lexicon like Wiktionary should not be overlooked and that it will remain for quite a long time a significant resource for the final users, independently of the Wikimedia architecture.
- (Community involvement) What to do: through the project page, I will: 1) report to the community the set of terms under current analysis to be added to Wiktionary and Wikidata, according also to the project evaluation status (see also the next point); 2) collect explicit feedback from the community about the project. How to do: a new specific task of the project will be added to manage the community involvement activities described here.
- (Project Evaluation and Results) What to do: increasing the quantity and quality of Wiktionary and Wikidata entries for STM multilingual terms, not only once, but as a method which can be reused after the project. How to do: a specific task of the project (see point 5.) will be devoted: a) to measuring the quantity and quality of terms on Wiktionary and Wikidata for sets of documents in a specific domain (STM) and languages (en, it) before and after the application of the algorithms; b) to comparing possible variants and/or tunings of these algorithms to identify the most suitable/efficient.
- (Reuse after the project) The deliverables will be: a) an analysis of the coverage and quality of STM terms in different languages for Wiktionary and Wikidata; b) algorithm(s) suitable to improve the coverage and quality of multilingual STM terms, also with human supervision; c) the pseudocode of the final solution, which can be engineered after the project also by others and/or Wikimedia; d) a peer-reviewed paper for dissemination use.
- (Control of project results) I have documented experience in project management. Besides that, a monthly project review will be shared.
- (Skills available) I am involved in Natural language processing and Linked Open Data research and I have already been involved in projects dealing with the processing of science literature-related terms in an international context (see my Google scholar summary). Moreover, I am an active contributor in the Italian chapter of Wikipedia since its foundation (2005).
- (Working context and Grant) If I am selected for this research grant, I will be put on partial leave from IntelliSemantic for 60% of my time to follow this project, with a corresponding reduction of my personal salary and of company-related expenses, according to the Italian law 53/2000. Hence, the grant is intended to fund my activity for an average of 24 hours per week, for 9 months of elapsed time, during which I will not be paid by (and not working for) my company. I will typically devote Friday, Monday and Tuesday of every week to this project. The signed agreement with my company will be included in the project documents. Any additional external support from that company, when needed at my discretion, can be donated for free to the project.
Note: the main project page was last reviewed after these comments on February 14, 2019. --Marco Ciaramella (talk) 18:16, 14 February 2019 (UTC)
Round 2 2018 decision
This project has not been selected for a Project Grant at this time.
We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding. This was a very competitive round with many good ideas, not all of which could be funded in spite of many merits. We appreciate your participation, and we hope you'll continue to stay engaged in the Wikimedia context.
Next steps: Applicants whose proposals are declined are welcome to consider resubmitting your application again in the future. You are welcome to request a consultation with staff to review any concerns with your proposal that contributed to a decline decision, and help you determine whether resubmission makes sense for your proposal.
Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees and we'd appreciate it if you could help us spread the word to strong candidates--you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.