User:Jeblad/Autogenerated text

From Meta, a Wikimedia project coordination wiki

This is a description of a proposed solution for autogenerated text in Wikipedia, using Wikidata as a repository for data. Producing readable text is also called natural language generation or language production.

Structure

Content determination

This phase decides what should go into the text: primarily which properties are referenced, but also which additional statements can be inferred from the existing statements.

The first run-through includes those sets that are referred to, and then excludes those that refer to properties that aren't available. Extending the text by using qualifiers will neither include nor exclude main statements, but it can change which phrases are used.
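
The include-then-exclude pass can be sketched as follows. This is only one possible reading of the description above, written in Python for illustration; the phrase records, the `refers` field and the sample values are hypothetical and not part of the proposal.

```python
def determine_content(phrases, item_properties):
    """Keep the phrases whose referenced properties all exist on the item."""
    available = set(item_properties)
    # Pass 1: include every phrase that refers to at least one available property.
    included = [p for p in phrases if available & set(p["refers"])]
    # Pass 2: exclude phrases that also refer to a property the item lacks.
    return [p for p in included if set(p["refers"]) <= available]


# Hypothetical phrase definitions keyed by Wikidata property identifiers.
phrases = [
    {"id": "born", "refers": ["P569"]},
    {"id": "born-in", "refers": ["P569", "P19"]},
    {"id": "died-in", "refers": ["P570", "P19"]},
]
selected = determine_content(phrases, {"P569": "1 May 1900", "P19": "Oslo"})
print([p["id"] for p in selected])
```

Here the "died-in" phrase survives the first pass because one of its properties is present, and is dropped in the second pass because the other is missing.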

Content structuring

Using the topics from the phrases that survived the inclusion and exclusion at the content determination phase, the remaining phrases are clustered to form paragraphs. The internal organization will be structured according to role (forgot that), preference, and the pair symbol-meets.

Referring expressions

Aggregation

Collect similar or related sentences into higher-level constructs.

Lexicalisation

Linguistic realisation

Text insertion rules

Alt. 1

A rough sketch of the structure. This would go on a page for each of the properties. During a call to build content for an item, all pages with definitions for the properties will be collected and merged.

[
  {
    'type' : 'content', // should perhaps be "fragment"
    'means' : [ 'sym1' ],
    'symbols' : { 'foo' : '{{property:...}}', 'bar' : 'Constant string' },
    'text' : 'Whatever string with {{{foo}}} and {{{bar}}}',
    'rank' : 42
  },
  {
    'type' : 'paragraph',
    'means' : [ 'sym1', 'bar' ],
  },
  {
    'type' : 'section',
    'means' : [ 'sym1', 'bar' ],
    'text' : 'Some text',
  },
]

The previous data will be used to build a tree, where

  • type defines the type (class) of the node
  • means defines which symbols must exist
  • symbols defines which new symbols to create
  • text is the actual string generated
  • rank is the sort order

In means it should be possible to specify non-existence. This is necessary to reject one statement due to symbols injected by another statement.
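
One way to express non-existence is to mark an entry in means as negated. The sketch below assumes a "!" prefix for that purpose; the prefix and the function name are invented for this illustration, since the notes above do not fix a syntax.

```python
def means_satisfied(means, symbols):
    """True when every plain entry exists and every '!' entry is absent."""
    for entry in means:
        if entry.startswith("!"):
            # Negated entry: reject if another statement injected the symbol.
            if entry[1:] in symbols:
                return False
        elif entry not in symbols:
            return False
    return True


print(means_satisfied(["sym1", "!bar"], {"sym1"}))
print(means_satisfied(["sym1", "!bar"], {"sym1", "bar"}))
```

The first call accepts the statement; the second rejects it because "bar" has been injected by some other statement.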

A first cleanup phase removes all statements that are expected to fail during materialization. Those are the ones that refer to properties that do not exist for the item.

The structure is built as a tree, and the tree is flattened only just before output. The relation to the symbol value must be kept in the text, as we must do some text transforms later on.

First the content level statements are built, iteratively until all of them are stable. They are stable when no more symbols can be injected. The complexity of this process is O(n²).
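
The stabilization of the content level can be read as a fixed-point iteration. The following is a minimal sketch under that assumption; the record layout mirrors the means and symbols fields from the sketch above, while the function name and sample data are hypothetical.

```python
def stabilize(fragments, symbols):
    """Activate fragments until no further symbols can be injected."""
    symbols = set(symbols)
    active = []
    pending = list(fragments)
    changed = True
    while changed:
        changed = False
        for fragment in list(pending):
            # A fragment becomes active once all symbols in its means exist.
            if set(fragment["means"]) <= symbols:
                pending.remove(fragment)
                active.append(fragment)
                symbols |= set(fragment["symbols"])
                changed = True
    return active, symbols


fragments = [
    {"means": ["bar"], "symbols": ["baz"], "text": "second"},
    {"means": ["sym1"], "symbols": ["foo", "bar"], "text": "first"},
    {"means": ["missing"], "symbols": [], "text": "never"},
]
active, symbols = stabilize(fragments, {"sym1"})
print([f["text"] for f in active], sorted(symbols))
```

Each pass scans at most n pending fragments and at least one fragment must become active for another pass to run, so there are at most n passes, which is in line with the quadratic bound.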

Second, the paragraph level is built, iteratively, until all paragraphs are stable. Then content is assigned to each of them. After each assignment, try to stabilize the paragraphs again, and if they change, stabilize the content again. The complexity is now O(n⁴).

Third put each of the unassigned content fragments in its own paragraph.
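
The assignment step, including the fallback for unassigned fragments, might look like the sketch below. It assumes that a fragment belongs to a paragraph when the symbols it creates cover the paragraph's means, which is only one possible interpretation, and it leaves out the re-stabilization loop entirely.

```python
def assign_to_paragraphs(paragraphs, fragments):
    """Attach each fragment to the first matching paragraph, else to its own."""
    result = [{"means": p["means"], "content": []} for p in paragraphs]
    declared = list(result)  # only declared paragraphs are matching candidates
    for fragment in fragments:
        provided = set(fragment["symbols"])
        for paragraph in declared:
            if set(paragraph["means"]) <= provided:
                paragraph["content"].append(fragment)
                break
        else:
            # No declared paragraph matched: give the fragment its own paragraph.
            result.append({"means": [], "content": [fragment]})
    return result


paragraphs = [{"means": ["sym1", "bar"]}]
fragments = [
    {"symbols": ["sym1", "bar"], "text": "assigned"},
    {"symbols": ["qux"], "text": "orphan"},
]
layout = assign_to_paragraphs(paragraphs, fragments)
print([[f["text"] for f in p["content"]] for p in layout])
```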

Fourth, the section level is built, iteratively, until all sections are stable. Then paragraphs are assigned to each of them. After each assignment of paragraphs, try to stabilize the sections again, and if they change, stabilize the paragraphs again, recursively. The complexity is now O(n⁶), so we really need some efficient way to stop the iteration at each recursion level.

At this point we can run the text transform to get the grammar right. What to do with phrases instead of terms… In a first version we could skip this, but then the text would be ugly…

We can perhaps keep a chain of symbols from the previous lower levels and check for intersection with changes in higher levels. If the intersection does not change, then nothing will change at the lower level. That could speed up change detection somewhat.
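
A minimal sketch of that check, assuming the chain is simply the set of symbols a lower level depends on and that a change is reported as a set of symbol names. Both assumptions, and the function name, are made up for the example.

```python
def needs_restabilizing(lower_level_symbols, changed_symbols):
    """A lower level is revisited only if it uses a symbol that changed."""
    return bool(set(lower_level_symbols) & set(changed_symbols))


content_chain = {"sym1", "foo", "bar"}
print(needs_restabilizing(content_chain, {"baz"}))         # unrelated change
print(needs_restabilizing(content_chain, {"bar", "baz"}))  # shared symbol
```

A set intersection is cheap compared with rerunning the stabilization, so a test like this could serve as the early exit at each recursion level.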

There should probably be a default section acting as section 0, or the lead (ingress) section. It would be a text-less section.

Alt. 2

This extends SimpleNLG by defining bindings as JSON structures.

[
  {
    /* this operates on Q(Mary) as default (external example) */
    'means' : [ 'sym1' ],
    'symbols' : { 'owner' : '{{#property:name}}', 'pet' : '{{#property:pet}}' },
    '': [
      {
        'type' : 'subject',
        'text' : '$owner'
      },
      {
        'type' : 'verb',
        'text' : 'chase'
      },
      {
        'type' : 'object',
        'text' : '$pet'
      }
    ]
  }
]

A more compact form

[
  {
    /* this operates on Q(Mary) as default (external example) */
    'means' : [ 'sym1' ],
    'symbols' : [
      { 'type': 'subject', 'text': '{{#property:name}}' },
      { 'type': 'verb', 'text': 'chase' },
      { 'type': 'object', 'text': 'the {{#property:pet}}' },
    ]
  }
]

An even more compact form

[
  {
    /* this operates on Q(Mary) as default (external example) */
    'means' : [ 'sym1' ],
    'clause' : [ '{{#property:name}}', 'chase', 'the {{#property:pet}}' ]
  }
]
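
To show how such a clause could be turned into text, here is a deliberately naive sketch in Python. It does not use SimpleNLG; the placeholder handling, the function names and the sample values are assumptions made for this example, and appending "s" only works for regular verbs such as "chase".

```python
import re

# Matches placeholders of the form {{#property:key}} used in the sketches above.
PLACEHOLDER = re.compile(r"\{\{#property:([^}]+)\}\}")


def fill(text, values):
    """Replace each {{#property:key}} placeholder with its value."""
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], text)


def realise_clause(clause, values):
    """Realise [subject, verb, object] in the third person singular present."""
    subject, verb, obj = (fill(part, values) for part in clause)
    return f"{subject} {verb}s {obj}."


sentence = realise_clause(
    ["{{#property:name}}", "chase", "the {{#property:pet}}"],
    {"name": "Mary", "pet": "monkey"},  # hypothetical property values
)
print(sentence)  # Mary chases the monkey.
```

A real realiser would take care of the inflection itself, which is the reason for building on SimpleNLG in the first place.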

A more advanced form, switching to negated future tense and adding a complement

[
  {
    /* this operates on Q(Mary) as default (external example) */
    'means' : [ 'sym1' ],
    'phrase' : {
      'subject': [
        { 'type': 'noun phrase', 'property': 'name' },
        { 'type': 'noun phrase', 'text': [ 'her', { 'label': 'giraffe' } ] } /* implicit label of an item? */
      ],
      'verb': { 'type': 'noun phrase', 'action': 'chase', 'feature': { 'tense': 'future', 'negated': true } },
      'object': [
        { 'type': 'noun phrase', 'determiner': 'the', 'property': 'pet' },
        { 'type': 'noun phrase', 'text': 'George' }, /* this could refer to an item and a property, but we don't want an item here */
        { 'type': 'noun phrase', 'text': 'Martha' } /* this could refer to an item and a property, but we don't want an item here */
      ],
      'complement': [ 'very quickly', 'despite her exhaustion' ]
    }
  }
]
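
The features in the advanced form could drive a realiser roughly as sketched below. This is a hand-rolled approximation of what a library such as SimpleNLG would do, not its API; the function names are hypothetical, the noun phrases are assumed to be already resolved to strings, and only the present and future tenses are handled.

```python
def coordinate(parts):
    """Join phrases as 'a', 'a and b' or 'a, b and c'."""
    if len(parts) < 2:
        return "".join(parts)
    return ", ".join(parts[:-1]) + " and " + parts[-1]


def realise_verb(action, feature):
    """Build a verb group from a base form and a feature dictionary."""
    negated = feature.get("negated", False)
    tense = feature.get("tense", "present")
    if tense == "future":
        return f"will not {action}" if negated else f"will {action}"
    if tense == "present":
        # A plural subject is assumed, so the base form is used unchanged.
        return f"do not {action}" if negated else action
    raise ValueError(f"unsupported tense: {tense}")


def realise(subjects, action, feature, objects, complements):
    """Assemble subject, verb group, object and complements into a sentence."""
    words = [coordinate(subjects), realise_verb(action, feature), coordinate(objects)]
    return " ".join(words + complements) + "."


sentence = realise(
    ["Mary", "her giraffe"],
    "chase",
    {"tense": "future", "negated": True},
    ["the monkey", "George", "Martha"],
    ["very quickly", "despite her exhaustion"],
)
print(sentence)
```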

A phrase must (?) use a symbol in a subject position before it can be used in an object position. This will create automatic rules. If not satisfied, link out to a description.

A means would be a directive to only add a sentence after the given symbol is defined. Perhaps it is easier to use a cluster feature than a means feature. A cluster (perhaps theme, or is this the SPL type?) would be a hint to the planner to put sentences with the same marker in close proximity, possibly inside the same section or paragraph.

Some of the clauses must be given a name, so they can be extended on local projects. The general versions of the clauses are centrally defined, while they can be locally refined. Perhaps the names can be reused as page names in the system messages space (i.e. the MediaWiki namespace). The name will be somewhat similar to the macro name in SPL.

Alt. 3

If following the definition of a template as a tuple ⟨S, E, C, T⟩, where

  • S is the syntactic tree (our definition of the surface model of this property, gaps in this will be filled with data from the item)
  • E is additional syntactic trees (typically implicit references to qualifiers, but also explicit references to other properties)
  • C is a condition for applicability (usually whether some definition has previously been given, will change through the text)
  • T is a set of topics (implicitly given by whether the property is in fact used in the item)

Slot fillers for the gaps will come from an additional type of slot filler (surface realiser) for the snak. This realiser must know how to do inflection.

The property from Alt. 2 should be passed through an express function. This would be an object like ExpressItem with methods for things like makeProperName, makePronoun and makeReferringExpr. Such express functions lead to an additional level below sentences.
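
A minimal sketch of such an object is given below. The method names follow the ones mentioned above, adapted to Python naming; everything else (the constructor arguments, the pronoun table, and the rule "name first, pronoun afterwards") is an assumption made for illustration.

```python
class ExpressItem:
    """Hypothetical wrapper that decides how an item is mentioned in text."""

    PRONOUNS = {"female": "she", "male": "he"}

    def __init__(self, label, gender=None):
        self.label = label
        self.gender = gender
        self.mentioned = False

    def make_proper_name(self):
        return self.label

    def make_pronoun(self):
        # Fall back to a neutral pronoun when the gender is unknown.
        return self.PRONOUNS.get(self.gender, "they")

    def make_referring_expr(self):
        """Use the proper name on first mention and a pronoun afterwards."""
        if self.mentioned:
            return self.make_pronoun()
        self.mentioned = True
        return self.make_proper_name()


mary = ExpressItem("Mary", gender="female")
first, second = mary.make_referring_expr(), mary.make_referring_expr()
print(first, second)
```

Keeping the mention state on the object is what places this level below sentences: the same item yields different surface forms depending on what has already been said.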

There should probably be a preference that can be used for prioritization among phrases. If several phrases (templates) have the same preference value, then the one used will be picked at random. When a phrase is used, its preference is decremented to avoid reusing the same phrase again unless there is only one available.
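
The selection rule can be sketched as follows. The `preference` field name, the function name and the sample phrases are hypothetical; the random generator is passed in so that the choice can be made reproducible if needed.

```python
import random


def pick_phrase(phrases, rng=random):
    """Pick a phrase with the highest preference, breaking ties at random."""
    best = max(p["preference"] for p in phrases)
    candidates = [p for p in phrases if p["preference"] == best]
    chosen = rng.choice(candidates)
    chosen["preference"] -= 1  # make immediate reuse less likely
    return chosen


phrases = [
    {"id": "a", "preference": 2},
    {"id": "b", "preference": 1},
]
first = pick_phrase(phrases)["id"]
later = {pick_phrase(phrases)["id"], pick_phrase(phrases)["id"]}
print(first, sorted(later))
```

After the first pick both phrases are tied, so the second pick is random and the third goes to whichever phrase was not just used. With a single phrase in the list, that phrase is always returned regardless of its preference value.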

There should be a topic to control how phrases are clustered together. This will be somewhat counter-intuitive together with means. Perhaps only topics should be used; they seem simpler and more intuitive.

The base condition that must be satisfied is that all referred properties must be available. The first run-through includes those sets that are referred to, and then excludes those that refer to properties that aren't available. What about qualifiers? Other conditions could apply…

The additional syntactic trees would be the ones describing the qualifiers. Those will then be used the same way as ordinary statements, but only for each single statement. (How to add the references?)

Image insertion rules

Images can be inserted when Wikibase is up and running on Commons, but that needs identity resolution running on partial data, which is quite difficult. If simplified to images that are actually in use on some projects, it is much easier. It is probably necessary to know why or how the image is used.

Parser function

A parser function can be made that filters on the same symbols that control the build of the text. It will not only filter out some content fragments but also paragraphs and section titles.

The filtering works by first selecting sections, then checking whether unselected paragraphs can be selected, and finally whether any still unselected content fragments can be selected.
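
The top-down selection can be sketched as a single traversal of the tree. The nested layout (`paragraphs`, `fragments`) and the matching rule (any overlap between a node's means and the requested symbols) are assumptions made for this example; section titles are left out for brevity.

```python
def matches(node, wanted):
    """A node is selected when its means overlap the requested symbols."""
    return bool(set(node["means"]) & wanted)


def filter_tree(sections, wanted):
    """Select whole sections first, then paragraphs, then single fragments."""
    wanted = set(wanted)
    selected = []
    for section in sections:
        take_section = matches(section, wanted)
        for paragraph in section["paragraphs"]:
            take_paragraph = take_section or matches(paragraph, wanted)
            for fragment in paragraph["fragments"]:
                if take_paragraph or matches(fragment, wanted):
                    selected.append(fragment["text"])
    return selected


sections = [
    {"means": ["bio"], "paragraphs": [
        {"means": ["birth"], "fragments": [{"means": ["birth"], "text": "A"}]},
    ]},
    {"means": ["work"], "paragraphs": [
        {"means": ["award"], "fragments": [{"means": ["award"], "text": "B"}]},
        {"means": ["job"], "fragments": [
            {"means": ["job"], "text": "C"},
            {"means": ["prize"], "text": "D"},
        ]},
    ]},
]
print(filter_tree(sections, {"bio", "award", "prize"}))
```

Fragment A comes in through its section, B through its paragraph, and D on its own, while C is filtered out.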

See also

Ramblings
