Grants:Project/Jeblad/Better support for imported data in wikitext

From Meta, a Wikimedia project coordination wiki
status: not selected
Better support for imported data in wikitext
summary: Verify whether it is possible to interact with an NLP engine from wikitext, in a way that is obvious to the casual editor.
target: All MediaWiki-based projects
amount: USD 12,900
grantee: Jeblad
contact: jeblad@gmail.com
created on: 12:14, 18 September 2016 (UTC)


Project idea

What is the problem you're trying to solve?

Explain the problem that you are trying to solve with this project or the opportunity you’re taking advantage of. What is the issue you want to address? You can update and add to this later.

Assume there is some data on an external site, which could be Wikidata but could also be something like a census bureau, and you want to use that external data in a text on, for example, Wikipedia, in such a way that it is updated as new data becomes available. To do so we must identify the actual data, for example population, and a source for the data, for example an entity on Wikidata, and manually keep track of when updates are necessary.

What is your solution?

If you think of your project as an experiment in solving the problem you just described, what is the particular solution you're aiming to test? You will provide details of your plan below, but explain your main idea here.

Today we do so for data from Wikidata by using the property parser function, but doing so adds complexity and redundancy to the text. Usually the entity is already identified elsewhere in the text, yet we must state it again each time the data is included, and that redundancy must be maintained. By using an engine for natural language processing (NLP) we can reuse whatever the editor has already stated in the text, and also adapt the rest of the text to reflect whatever the predicate states about the material value (the object).

If the material value (the object) changes in some minor way, like a change in population from 3001 to 3002 for a given municipality, it might not have an impact on the text. If the change has a major impact on the text, as when a previously uninhabited municipality suddenly has inhabitants, or a person changes gender, then the text must be adjusted to keep agreement (concordance) between its different parts, adding to the overall complexity.
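The difference between a minor and a major change can be illustrated with a minimal sketch. It is written in Python purely for illustration; the function name `describe_population` and the exact wording of the sentences are assumptions, not part of the proposal.

```python
def describe_population(name: str, population: int) -> str:
    """Render a population sentence whose wording agrees with the value.

    Hypothetical helper: a minor change (3001 -> 3002) only swaps the number,
    while a major change (0 -> 1, or 1 -> 2) changes the surrounding words.
    """
    if population < 0:
        raise ValueError("population cannot be negative")
    if population == 0:
        # Major change: the whole predicate is different.
        return f"{name} is uninhabited."
    if population == 1:
        # Singular agreement between number and noun.
        return f"{name} has 1 inhabitant."
    # Plural agreement; only the number varies between nearby values.
    return f"{name} has {population} inhabitants."


if __name__ == "__main__":
    for value in (0, 1, 3001, 3002):
        print(describe_population("The municipality", value))
```

Only the last branch can be handled by plain value substitution; the other two are the cases where the text around the value has to change as well.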

It might be postulated that a service could be made that creates a tagged (annotated) structure representing the text. This structure could then be turned into a Lua table and provided to a Lua module. There should be several such loadable structures, serving specific purposes. From wikicode we could then access specific annotated entries in this structure, run the associated Lua code, and we could also traverse to other parts of the same annotated text and change the annotation if necessary.

The Lua table holding the annotated text should be a clean table (like the present ones loaded through mw.loadData()) so that we avoid security issues. We should only be able to say how the text should be annotated, to avoid security issues in the service itself. It should, however, be possible to edit the assigned tags. In the wikitext itself we mark words and phrases that we want to change depending on the annotations.

Those markers in the text will then fire specific Lua code depending on the surrounding annotated text. This is a slight simplification as we will run into problems if there are multiple markers inside a single sentence, but perhaps it is possible to work around the problem. (Note that this will only be an experimental first version!)
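The idea of a read-only annotated structure whose markers fire code depending on the surrounding annotations could be sketched roughly as follows. This is a Python illustration only; in the proposal the structure would be a Lua table and the handlers Lua code. The token layout, the tag names, and the names `ANNOTATED`, `HANDLERS` and `materialize` are all hypothetical.

```python
from types import MappingProxyType

# Hypothetical annotated sentence, as an NLP service might return it.
# Each token is wrapped in a read-only mapping, mimicking a "clean" table.
ANNOTATED = tuple(
    MappingProxyType(token)
    for token in (
        {"text": "The", "tag": "DET"},
        {"text": "lake", "tag": "NOUN"},
        {"text": "is", "tag": "AUX"},
        {"text": "{area}", "tag": "MARKER", "name": "area"},
        {"text": "in", "tag": "ADP"},
        {"text": "area", "tag": "NOUN"},
    )
)


def render_area(value):
    """Hypothetical handler: render an area value with its unit."""
    return f"{value} km²"


# Handlers are keyed by (marker name, tag of the preceding token), so the
# code that fires depends on the surrounding annotated text.
HANDLERS = {("area", "AUX"): render_area}


def materialize(tokens, values):
    """Replace each marker token by the output of its matching handler."""
    out = []
    for index, token in enumerate(tokens):
        if token["tag"] != "MARKER":
            out.append(token["text"])
            continue
        previous_tag = tokens[index - 1]["tag"] if index > 0 else None
        handler = HANDLERS.get((token["name"], previous_tag))
        if handler is None or token["name"] not in values:
            out.append(token["text"])  # no rule applies: leave the marker as is
        else:
            out.append(handler(values[token["name"]]))
    return " ".join(out)


if __name__ == "__main__":
    print(materialize(ANNOTATED, {"area": 369}))
```

The sketch handles one marker per sentence; several markers in one sentence that influence each other are exactly the complication the text warns about.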

Clarifications and limitations

When an NLP engine annotates a text, it tries to give the text some interpretation, hopefully an optimal one. Usually there are multiple possible interpretations, and the NLP engine returns just one of them. In this experiment that assumed optimal interpretation will be taken for granted and used as the hash key when looking up the methods for valuation. This will not give optimal results, because the chosen interpretation is not checked against the possible instantiations. What should be done is to match the possible interpretations against the possible instantiations over the type, and then reject those instantiations that are impossible given the actual values. That should give far better results, but it is also much heavier, going from O(1) for hashing to O(MN) for linear search, where M and N are the numbers of interpretations and instantiations. Both M and N are small, but the difference in load can be significant.
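The trade-off between the two lookup strategies can be made concrete with a small sketch. It is Python for illustration only, and the type names, the `Instantiation` record and both lookup functions are assumptions introduced here, not an existing API.

```python
from typing import Callable, List, NamedTuple, Optional


class Instantiation(NamedTuple):
    """Hypothetical rule: how to render a value of a given type."""
    type_name: str
    accepts: Callable[[float], bool]   # is this actual value possible?
    render: Callable[[float], str]


INSTANTIATIONS = [
    Instantiation("quantity:area", lambda v: v > 0, lambda v: f"{v:g} km²"),
    Instantiation("quantity:length", lambda v: v > 0, lambda v: f"{v:g} km"),
]

# Index used by the O(1) strategy.
BY_TYPE = {inst.type_name: inst for inst in INSTANTIATIONS}


def lookup_hashed(best_interpretation: str, value: float) -> Optional[str]:
    """O(1): trust the single interpretation returned by the NLP engine."""
    inst = BY_TYPE.get(best_interpretation)
    return inst.render(value) if inst else None


def lookup_matched(interpretations: List[str], value: float) -> Optional[str]:
    """O(MN): try every interpretation against every instantiation and
    reject instantiations that are impossible for the actual value."""
    for interpretation in interpretations:      # M interpretations
        for inst in INSTANTIATIONS:             # N instantiations
            if inst.type_name == interpretation and inst.accepts(value):
                return inst.render(value)
    return None


if __name__ == "__main__":
    ranked = ["quantity:volume", "quantity:area"]  # engine's guesses, best first
    print(lookup_hashed(ranked[0], 369))   # None: the top guess has no rule
    print(lookup_matched(ranked, 369))     # falls through to the area rule
```

When the engine's top guess has no usable rule, the hashed lookup simply fails, while the matched lookup recovers by trying the remaining interpretations.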

The test of the integration is really nothing more than to see whether a simplified notation like "The lake is {area} in area" is sufficient, or whether additional annotation is necessary. It is not about creating all kinds of additional gadgets to make this work. It is, however, possible to color a failed predicate and to add a tracking category. This could make it easier to understand where and why the replacement rules fail.
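A minimal sketch of the failure handling, assuming markers of the form {name}, might look like this. The Python is illustrative only; the category name, the use of an "error" CSS class for coloring, and the function `replace_markers` are assumptions.

```python
import re

MARKER = re.compile(r"\{(\w+)\}")

# Hypothetical tracking category; the real name would be chosen on-wiki.
TRACKING_CATEGORY = "[[Category:Pages with failed predicates]]"


def replace_markers(wikitext: str, values: dict) -> str:
    """Replace {name} markers; color the ones that fail and track the page."""
    failed = False

    def substitute(match):
        nonlocal failed
        name = match.group(1)
        if name in values:
            return values[name]
        failed = True
        # Assumed styling: wrap the untouched marker so it stands out.
        return f'<span class="error">{match.group(0)}</span>'

    result = MARKER.sub(substitute, wikitext)
    if failed:
        result += "\n" + TRACKING_CATEGORY
    return result


if __name__ == "__main__":
    print(replace_markers("The lake is {area} in area.", {"area": "369 km²"}))
    print(replace_markers("The lake is {area} in area.", {}))
```

Keeping the failed marker visible in the output, rather than dropping it, is what lets an editor see where a rule did not fire.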

It will probably become obvious pretty fast that a tool is necessary to identify working and/or failing rules. That could be a special page that lists the marked predicates with some context. Creation of such a page is not part of the experiment.

There is further background at Grants:Project/Jeblad/Better support for imported data in wikitext/Motivation.

Project goals

Explain what are you trying to accomplish with this project, or what do you expect will change as a result of this grant.

Make an example integration between data provided by Wikidata and the other Wikimedia projects, by using natural language processing, to facilitate reuse of data from Wikidata in (slightly more) natural language.

The goal is to be able to update a wikitext like the following. The lake is Mjøsa, and the data is available at d:Q212209.

At its widest, near [[Hamar]], the lake is 15 km wide. It is 365 km² in area and its volume is estimated at 56 km³; normally its surface is 123 metres above sea level, and its greatest depth is 468 metres.

Example 1.

In example 1 the area is not quite right; it is more accurately calculated as 369.4533±0.0001 km². Is it wrong to use an approximation of 365 km²? Probably not. Would it be better to use 369 km²? Probably. Would such small updates create a lot of work? Definitely. If we could instead write something like example 2, we could keep articles more up to date, even on small projects with less manpower.

At its widest, near [[Hamar]], the lake is {wide} wide. It is {area} in area and its volume is estimated at {volum}; normally its surface is {altitude} above sea level, and its greatest depth is {depth}.

Example 2.
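As a rough illustration of what materializing Example 2 involves, the sketch below rounds the precise area to a display value and fills in the markers. It is Python for illustration only. The helper `format_quantity` is hypothetical, and all values other than the area are simply copied from Example 1 rather than fetched from Wikidata.

```python
EXAMPLE_2 = (
    "At its widest, near [[Hamar]], the lake is {wide} wide. "
    "It is {area} in area and its volume is estimated at {volum}; "
    "normally its surface is {altitude} above sea level, "
    "and its greatest depth is {depth}."
)


def format_quantity(amount: float, unit: str, decimals: int = 0) -> str:
    """Round a quantity for display in running text (hypothetical helper)."""
    rounded = round(amount, decimals)
    if decimals <= 0:
        return f"{int(rounded)} {unit}"
    return f"{rounded:.{decimals}f} {unit}"


# Assumed values: the area is the precise figure quoted in the text,
# the rest are taken from Example 1.
VALUES = {
    "wide": format_quantity(15, "km"),
    "area": format_quantity(369.4533, "km²"),
    "volum": format_quantity(56, "km³"),
    "altitude": format_quantity(123, "metres"),
    "depth": format_quantity(468, "metres"),
}

text = EXAMPLE_2.format(**VALUES)

if __name__ == "__main__":
    print(text)
```

How many decimals to show is an editorial choice; here it is a parameter, so the same source value can be rendered as 369 km² or 369.5 km².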

In this specific example the text is from the same article as the connected item, which makes the problem somewhat simple. It is possible to just use the current parser function, and while the text becomes less readable it is still comprehensible for most technically inclined editors. If we instead try to identify the last notable work by Dario Fo and use the current parser function, we have a maintenance problem, because the name of the work can create disagreeing concordance on gender, and the wikitext also becomes much less readable.

Note that this project will mostly consist of experiments on how to use natural language processing in live wikitext. It will probably not be a final solution, and it will most likely create more questions than answers.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

  • Integrate a well-supported NLP engine into a production-like environment (Vagrant with necessary adaptations)
  • Create and test a way to tag text and replace (materialize) special markup that needs implicit filtering and escaping
  • If time allows, test alternate approaches to post-editing tags on words and phrases (most likely this has to go in a phase two)
  • If time allows, test methods to inject simple canned text (references)

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

The estimated workload is about 3 full-time person-months for an experienced developer; or 6 person-months at 50 %. This workload estimation is based on the main developer's previous experience with similar projects.

Budget breakdown

Item | Description | Commitment | Person-months | Cost
Main developer | Developing and releasing proposed code | Part time (50 %) | 6 | USD 12,900
There is no co-funding
Total | | | | USD 12,900

The item costs are computed as follows: the main developer's gross salary (including 35 % Norwegian income tax) is estimated from the pay given to similar projects using Norwegian standard salaries,[1] the current exchange rate of 1 NOK = 0.124 USD, and a quarter of a year's full-time work.

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

This will only be a research project to see if the solution is feasible.

Sustainability

What do you expect will happen to your project after the grant ends? How might the project be continued or grown in new ways afterwards?

This project is only meant to check the feasibility of using an NLP engine inside a Wikipedia-like environment, and it is highly likely that any code must be adapted and/or refined. If it works out, then new projects (iterations) must be defined. The obvious next step would be to continue on the idea sketched at m:Grants:Project/Jeblad/Better support for imported data in wikitext/Motivation.

Some possible continuations

  • It would probably be necessary at some point to add more languages to the chosen NLP engine. Stanford CoreNLP has algorithms for Arabic, Chinese, French, German, and Spanish text, but those are only a few of the 200+ languages of the Wikimedia projects.
  • Some structures can perhaps be recognized as specific types of discourse, such as examples, arguments, or perhaps opinions. Arguments should be valid and have references (evidence) for all premises and the conclusion. If opinions are found they must be neutral.
  • Define a strategy for how NLG can be used to create larger chunks of text within an article.

Measures of success

How will you know if the project is successful and you've met your goals? Please include specific, measurable targets here.

In order of increasing difficulty:

  • The primary objective is to take a few articles from Wikipedia in some core languages (supported by the chosen NLP engine) that parse correctly, export them to a Labs instance, and process them there, with the chosen NLP engine providing some core values and/or adaptations from an external source (most likely Wikidata).
  • The second objective of the experiment is to verify that the correct code snippet is chosen given a context for the abstract value (that is, the subject is known), and to replace the marker text with its material value.
  • The third objective is to identify the referred subject, but without any ranking mechanism in place.

The first objective is mostly an engineering problem: can we do it, and how? It is also a question of which NLP engine in fact solves the problem, and how much processing power is necessary.

The second objective is a consequence of completing the first, but it also extends it to actually using the annotations to identify the correct code snippet. It seems likely that this is possible if the primary objective succeeds. (Lookup of code snippets is a hashing problem.)

The third objective is the hard one, and it is not likely that this will have a good solution by the end of this project. It is solvable, but it probably needs additional work. (Lookup of the subject is a graph-search problem.)
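To illustrate why the third objective is a graph-search problem, here is a minimal sketch of an unranked search for the referred subject. It is Python for illustration only; the link graph, the property sets and the function `find_subject` are entirely hypothetical and say nothing about how Wikidata or the NLP engine would actually expose this information.

```python
from collections import deque

# Hypothetical graph: which entities are mentioned from which.
LINKS = {
    "Mjøsa": ["Hamar", "Norway"],
    "Hamar": ["Norway"],
    "Norway": [],
}

# Hypothetical set of properties available for each entity.
PROPERTIES = {
    "Mjøsa": {"area", "depth"},
    "Hamar": {"population"},
    "Norway": {"population", "area"},
}


def find_subject(start: str, wanted: str):
    """Breadth-first search for the nearest entity that has the property.

    There is no ranking: the first entity found wins, even if a more
    distant one would be the better referent.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        entity = queue.popleft()
        if wanted in PROPERTIES.get(entity, set()):
            return entity
        for neighbour in LINKS.get(entity, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    print(find_subject("Mjøsa", "area"))        # the article's own item
    print(find_subject("Mjøsa", "population"))  # first linked entity that fits
```

Without ranking, the search returns whichever candidate is reached first, which is why a good solution probably needs work beyond this project.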

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

The project will be primarily carried out by myself, while other team members, partners, volunteers etc. are participating by commenting or advising. It will hopefully be part of a small research project on use of integrated NLP on a wiki, but the initial part is just to make a working baseline. And yes, it would definitely be interesting if more people joined!

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • This would help making the imported data a tad less cryptic for us pedestrians Petter Bøckman (talk) 14:54, 2 November 2016 (UTC)
  • Seems resonable Pmt (talk) 21:19, 2 November 2016 (UTC)

References
