Abstract Wikipedia/Wikidata Abstract Representation

From Meta, a Wikimedia project coordination wiki

This is a proposal, still under discussion, for the abstract representation of natural language in the narrow context of constructors and the wider context of the Wikidata Abstract Representation and Natural Language Generation project.

In this proposal, we adopt as common ground the contents of the article Architecture for a Multilingual Wikipedia by Denny Vrandečić.[1]

Initial author & co-authors: Kutz Arrieta, Maria Keet, James Forrester, Ariel Gutman, Cory Massaro, Arthur Lorenzi.

Constructor Unit

Constructors contain an abstract representation of an informationally-complete predicate statement. They would typically correspond to a single renderer, but this is not a hard requirement.

There could also be super-constructors that contain constructors within them, e.g. corresponding to an outline of a paragraph or an entire article. Nesting of constructors is generally permitted. Also, it is conceivable that one constructor might be verbalized in more than one sentence to improve readability (as is also the case in the example in the Common Ground section). Depending on the syntactic context, a constructor may also be verbalized as a nominal phrase (e.g. "San Francisco being the 4th largest city..." etc.), or have several possible sentence-like realizations (e.g. in German there is a difference in structure for stand-alone or embedded sentences). All of this would happen at rendering time.

The number of constructors will grow as more "language" and languages are represented. In the future they could be automatically or semi-automatically generated by a parser.

We are assuming compositionality / combinatoriality. Constructors will be combined into larger units and, in the process, undergo formal changes to some of their elements and position assignments.

Constructor Structure

As a guiding principle, the initial system can make inoffensive mistakes to feed an iterative correction process involving contributors.

Constructors shouldn't include syntactic, morphological, lexical or any other language-specific information. However, constructors live in an ecosystem of components that influence each other, so they may have to be adapted accordingly.

Predicate (= name/type of the constructor)

This is the more "verbal" part of the content, often known (in most linguistic formalisms) as a Verb Phrase. This is the nucleus of the statement. It will be rendered as a statement whose nucleus is a verb.

At the core, we are dealing with a relation (in the logical sense). The specific verbalization will depend on the type of constructor and the specific language or locale.

In the example mentioned in the Common Ground section, "Ranking" is the predicate, and the actual representation is an instance of that relation.

A Constructor type, e.g. a Ranking constructor, can exist regardless of the individual city (or entity), so it can be used for any city or entity for which such data is available. [add reference to CoSMo]
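The distinction between a constructor type and its instances can be sketched in Python as follows. This is only an illustration: the class names, the slot names (subject, rank, object) and the label strings standing in for QIDs are assumptions made for this sketch and are not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

# A slot value is an entity reference (e.g. a QID), a literal,
# or a nested constructor (nesting is generally permitted).
SlotValue = Union[str, int, "Constructor"]


@dataclass(frozen=True)
class ConstructorType:
    """Hypothetical type-level definition: a predicate name plus its slot names."""
    predicate: str
    slots: Tuple[str, ...]

    def instantiate(self, **values: SlotValue) -> "Constructor":
        # Reject slot names that the type does not declare.
        unknown = set(values) - set(self.slots)
        if unknown:
            raise ValueError(f"unknown slots for {self.predicate}: {sorted(unknown)}")
        return Constructor(self, dict(values))


@dataclass
class Constructor:
    """Hypothetical instance: one concrete statement built from a type."""
    ctype: ConstructorType
    values: Dict[str, SlotValue] = field(default_factory=dict)


# The type exists independently of any particular city or entity.
RANKING = ConstructorType("Ranking", ("subject", "rank", "object"))

# An instance for one entity; labels are used here in place of QIDs.
san_francisco = RANKING.instantiate(subject="San Francisco", rank=4, object="city")
```

The same RANKING type can be instantiated for any entity for which ranking data is available, which is the reuse the paragraph above describes.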

Semantic roles & slots

The Abstract Representation will send signals to the renderers about which lexical entities (be they named entities, e.g. QIDs, or not) are to be verbalized and what processes need to happen in the syntactic layer. These signals may be attached both to the slots (arguments), as semantic roles, and to the predicates (formality, speech acts, etc.).

The position put forward initially in this document is to explicitly use semantic roles (such as agent or patient, though terminology may differ) in the constructor. These will then serve as signals needed for renderers to choose the appropriate lexemes and generate the correct language output.

In this case, the slots associated with every predicate will be labeled for semantic roles, be they primary or secondary.

Semantic signals will be sent down the pipeline and should be resolved in post-processing.

There are several semantic frameworks / approaches to semantic role encoding we could use here. To mention only some:

  • Semantic Macroroles, such as ACTOR and UNDERGOER (sometimes named AGENT and PATIENT). ACTOR can be used with all predicates while UNDERGOER can be used with all transitive predicates.
  • More exhaustive frameworks for semantic roles, allowing for more versatility in the name of the role depending on the predicate. Such roles include notions such as AGENT, PATIENT, EXPERIENCER, RECIPIENT, etc.
  • Predicate-specific semantic roles, akin to FrameNet. For instance, FrameNet defines the “ingestion” predicate with roles such as Ingestor, Ingestible and Instrument. Note that the “ingestion” predicate may correspond to different verbs (even in the same language), such as “eating”, “drinking”, “devouring”, etc.
  • Ad-hoc arguments could be defined for various verbal predicates (possibly related to specific QIDs). E.g. for the “eating” predicate (which could be linked to Q213449) we may define arguments such as eater, food and utensil, as needed by the realization. While this is similar to the previous option, it is explicitly more ad hoc and doesn’t attempt to generalize over multiple similar predicates.

Defining semantic roles also allows us to differentiate between predicates that are superficially identical, for example:

ORDER1
order [someone {patient} ] [to do X {object:verbal} ] [next week {time} ] [to fix Y {cause/purpose:verbal} ]
ORDER2
order [a sandwich {object} ] [in a restaurant {location/source} ] [to go {modality} ]

etc.
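As a rough sketch, the two ORDER predicates above could be encoded as slot lists labeled with semantic roles, so that their role signatures tell them apart. The slot names and the split into primary and secondary roles are assumptions made for this illustration; only the role labels are taken from the example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slot:
    name: str             # hypothetical slot name
    role: str             # semantic role label, copied from the ORDER example
    primary: bool = True  # assumption: which roles count as primary


@dataclass(frozen=True)
class Predicate:
    name: str
    slots: Tuple[Slot, ...]

    def role_signature(self) -> Tuple[str, ...]:
        """Roles of the primary slots, in declaration order."""
        return tuple(s.role for s in self.slots if s.primary)


ORDER1 = Predicate("ORDER1", (
    Slot("addressee", "patient"),
    Slot("action", "object:verbal"),
    Slot("when", "time", primary=False),
    Slot("purpose", "cause/purpose:verbal", primary=False),
))

ORDER2 = Predicate("ORDER2", (
    Slot("item", "object"),
    Slot("where", "location/source", primary=False),
    Slot("how", "modality", primary=False),
))

# The predicates share a surface verb but differ in their role signatures.
assert ORDER1.role_signature() != ORDER2.role_signature()
```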

Grammatical cases are the morphological realization (that is, the rendering) of primary semantic roles, typically through markers such as suffixes. They are the syntactic counterparts of primary semantic roles. Rendering natural language correctly requires this information because:

  • In languages with a rich marked case system this information is necessary to select the correct form of the lexemes that fulfill the roles, and sometimes to select the appropriate form of the verb, as some of these roles might be represented in the form of the verb itself.
  • In languages with a not so rich case system, this information will determine the position of the slots in statements and/or will be rendered via prepositional phrases.
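The two situations in the list above can be illustrated with a small sketch. Both tables are invented for this illustration and describe no actual language: one maps roles to case labels, the other maps roles to a linear position and an optional preposition.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative only: role-to-case table for an imaginary case-marking language.
CASE_BY_ROLE: Dict[str, str] = {
    "agent": "ergative",
    "patient": "absolutive",
    "recipient": "dative",
}

# Illustrative only: position and optional preposition for an imaginary
# language that relies on word order and prepositional phrases.
POSITION_BY_ROLE: Dict[str, Tuple[int, Optional[str]]] = {
    "agent": (0, None),
    "patient": (2, None),
    "recipient": (3, "to"),
}


def realize_with_case(fillers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each filler with the case form the renderer would have to select."""
    return [(lemma, CASE_BY_ROLE[role]) for role, lemma in fillers.items()]


def realize_with_position(verb: str, fillers: Dict[str, str]) -> str:
    """Order the fillers around the verb (position 1) and add prepositions."""
    parts = [(1, verb)]
    for role, lemma in fillers.items():
        position, preposition = POSITION_BY_ROLE[role]
        parts.append((position, f"{preposition} {lemma}" if preposition else lemma))
    return " ".join(text for _, text in sorted(parts))


fillers = {"agent": "the teacher", "patient": "a book", "recipient": "the student"}
print(realize_with_case(fillers))
print(realize_with_position("gives", fillers))
```

In both cases the constructor carries the same role labels; only the renderer's table differs.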

“Dependencies” or Secondary Roles

Prepositional phrases, adverbial phrases, adjuncts, postposition/phrase groups will be assigned a semantic role feature. These will be rendered in different ways, depending on the language.

We are making a distinction between primary and secondary semantic roles. Primary roles directly relate to the predicate and, thus, are part of the constructor definition. Secondary roles, in theory, could be added to any constructor.

Semantic-Pragmatic features

Some of these features (such as animacy, humanness, shape - for Japanese quantifiers, for example) should be represented in the lexical representation (i.e. Wikidata lexicographical data). This information will allow the renderers to:

  • match semantic roles with the appropriate lexemes beyond purely syntactic considerations;
  • apply the appropriate agreement (or concord) between lexemes.

In a second phase of the project:

  • Discourse-level pragmatic features, such as STATEMENT, WH-QUESTION, OPEN_ENDED QUESTION, etc. should be included in the Abstract Representation, in the constructor, in order to signal the syntactic layer.
  • Formality and other such signals should also be included and attached to the predicate itself. Formality signals should probably also live in the templates and lexemes.

We should also include here slots containing expressions such as “as it stands”, “as far as we know”, etc., which are sentence-level modifiers with an indirect but meaningful impact on the semantics of the statement. They should have roles assigned to them.

External Reference(s) of the Constructor

As mentioned above, constructors don't include any language-specific information, such as morphology, syntax, lexicon, content realization, etc. but need to keep these external components in mind.

Content

Content points to the entities (QIDs or lexemes) the statement refers to. Content essentially points to the lexemes and QIDs that fill out the slots in the statements.

Syntactic Layer

This layer is concerned with aspects such as the grammatical features of the lexeme to be used (in the case of Wikidata, lexical entities), the order in which components/slots will be rendered and additional modifications those renderers might undergo.

This layer is compositional: it will therefore also include the rules needed to combine, or not, the available constructors.

This layer includes signals to make the appropriate selection of lexemes as per morphosyntactic and phonotactic rules. They are rendered by different functions and at different times.

The template language, if used, will connect to the abstract representation and the composition syntax so that it can be processed by the Wikifunctions Orchestrator component of the overall architecture.

Morphological Layer

Right now, morphology in Wikidata is attached to the lexical entities. Therefore we are not positing a morphological layer. Morphology features will have to be included in the lexemes' representation.

Lexemes should include semantic features and any other feature the abstract representation might encode. We should not require an exact match on the features, i.e. over-specify features that are not needed. For example, many nouns or entities are able to fulfill multiple roles. We should have some generic feature that allows these items to bypass restrictions from the Abstract Representation. In some languages the same lexeme will be apt to fulfill practically any role. But, let's say, if a lexeme is in the ergative case form, it can only fulfill the role of Agent. If that lexeme has both the features ergative + singular and absolutive + plural, it can fulfill the roles of agent, subject and object.
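The ergative/absolutive example above can be sketched as a lookup from the grammatical features of a form to the roles it may fill. The feature-to-role table follows the example in the previous paragraph; the function names and the generic `any_role` feature are hypothetical.

```python
from typing import FrozenSet, Iterable, Set

ALL_ROLES: Set[str] = {"agent", "subject", "object"}

# Taken from the example above: an ergative form can only be an agent,
# while an absolutive form can be a subject or an object.
ROLES_BY_CASE = {
    "ergative": {"agent"},
    "absolutive": {"subject", "object"},
}

# Hypothetical generic feature that bypasses role restrictions altogether.
ANY_ROLE = "any_role"


def roles_for_form(readings: Iterable[FrozenSet[str]]) -> Set[str]:
    """Collect the roles licensed by any reading of a single surface form.

    Each reading is a set of features, e.g. {"ergative", "singular"}.
    Features without an entry in ROLES_BY_CASE (such as number) are ignored,
    so a form is never rejected for features nobody asked about.
    """
    roles: Set[str] = set()
    for reading in readings:
        if ANY_ROLE in reading:
            return set(ALL_ROLES)
        for feature in reading:
            roles |= ROLES_BY_CASE.get(feature, set())
    return roles


def can_fill(readings: Iterable[FrozenSet[str]], role: str) -> bool:
    return role in roles_for_form(readings)


ergative_only = [frozenset({"ergative", "singular"})]
syncretic = [frozenset({"ergative", "singular"}), frozenset({"absolutive", "plural"})]
```

Here `roles_for_form(ergative_only)` yields only the agent role, while the syncretic form can fill all three roles, as in the example above.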

Prepositions, particles and postpositions (and possibly other non-nominal categories) need to encode their attachment restrictions. Here is an example:

role: THEME/ABOUT/SUBJECT
English: about {prep} (restrictions: noun phrase / nominalized verb)
German: von {prep} (restrictions: ……/ dative case)
Basque: bidez {postposition} {instrumental case} (restrictions: noun phrase / nominalized verb + genitive case)

We need to reach alignment on where and how rules should be stored. They could be stored as functions or declaratively. Present consensus seems to be to store declaratively as much as possible.
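A minimal sketch of what declarative storage of such attachment restrictions might look like, next to a small generic function that interprets it. The field names are hypothetical, the German entry is omitted because its restriction is left open in the example above, and reading the Basque restriction as "the complement must be in the genitive case" is an assumption.

```python
from typing import Dict, List, Optional

# Declarative data: one record per adposition, copied from the example above.
ADPOSITIONS: List[Dict[str, object]] = [
    {
        "role": "THEME/ABOUT/SUBJECT",
        "language": "English",
        "form": "about",
        "category": "prep",
        "own_case": None,
        "attaches_to": ("noun phrase", "nominalized verb"),
        "required_case": None,
    },
    {
        "role": "THEME/ABOUT/SUBJECT",
        "language": "Basque",
        "form": "bidez",
        "category": "postposition",
        "own_case": "instrumental",
        "attaches_to": ("noun phrase", "nominalized verb"),
        "required_case": "genitive",  # assumption: applies to the complement
    },
]


def find_adpositions(role: str, language: str) -> List[Dict[str, object]]:
    """Look up the adpositions that can express a role in a language."""
    return [e for e in ADPOSITIONS if e["role"] == role and e["language"] == language]


def may_attach(entry: Dict[str, object], complement_type: str,
               complement_case: Optional[str] = None) -> bool:
    """Check a candidate complement against the declared restrictions."""
    if complement_type not in entry["attaches_to"]:
        return False
    required = entry["required_case"]
    return required is None or complement_case == required
```

The rules themselves stay in the data; only the interpreter is a function, which is one way to read "store declaratively as much as possible".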

Putting the weight of the morphology on the lexemes is, probably, the correct way to go, as Abstract Wikipedia will be based on crowd contributions, but this will add complexity in the lexemes and, possibly, in the orchestrator.

Building a morphological layer could be used to generate and overgenerate (to be cleaned up by speakers) different forms of lexemes, when introducing data for a language. This could be part of a toolkit to facilitate the addition of new languages. [add link to Wikidata article-workshop]

Proposals

Discussions

Defective Input in Constructors

We should probably separate main signals from secondary signals. Plurality could be one of the secondary signals. But we should differentiate semantic plural from morphological plural:

  • Morphological plural is a language-specific feature which is apparent in the language (due to inflection, or agreement). Not all languages have morphological plural, so we could choose to not mark this feature in the constructor. But if we want constructors to be abstract enough and reusable enough for many languages, we may choose to mark plurality in all cases and let the renderers decide what to ignore.
  • Semantic plural is the idea that a set contains a plurality of items (e.g. the items could be associated with a quantifier, such as “many” or “some”). This could be marked in the Constructor. Whether it is realized as a plural noun, plural verbal agreement or not at all in a given language, is up to the renderer of the given language.

We could have sub-constructors that deal with plurality. Design issues like these need to be sorted out.
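One way to picture the division of labour is the sketch below: the constructor marks semantic plurality only, and each renderer decides what to do with it. The data structure, the tiny lexicons and the choice of example languages are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SlotFiller:
    concept: str                   # hypothetical concept key
    semantic_plural: bool = False  # marked in the constructor


# Illustrative lexicon for a language with a morphological plural.
EN_FORMS: Dict[str, Dict[str, str]] = {"book": {"singular": "book", "plural": "books"}}

# Illustrative lexicon for a language where the noun has a single form.
JA_FORMS: Dict[str, Dict[str, str]] = {"book": {"base": "hon"}}


def render_en(filler: SlotFiller) -> str:
    """This renderer realizes semantic plurality as a plural noun form."""
    forms = EN_FORMS[filler.concept]
    return forms["plural" if filler.semantic_plural else "singular"]


def render_ja(filler: SlotFiller) -> str:
    """This renderer ignores the plurality signal on the noun."""
    return JA_FORMS[filler.concept]["base"]


many_books = SlotFiller("book", semantic_plural=True)
```

The same `many_books` filler is passed to both renderers; the plurality signal is used by one and ignored by the other.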

Coming up with a set of semantic roles is the sensible way to do it, but we should keep in mind that users might not be able to assign semantic roles correctly. We may want to build more abstract constructors and go to the lexical module and check the verb predicate structure. Another option would be to have an API and documentation that facilitates the task as well as a vetting process.

The representation is a wrapper; alongside it there is a set of entities and information, and it is up to the renderer to realize them. We should find cases where we cannot recover the information.

If the constructor is attached to a renderer, the roles would be implicit in the renderer. But the decision on whether a constructor is attached to a renderer has not been made. In such a case, the user will include an example and the roles would be implicit in the example. But relying on users' examples could be risky.

It's going to be difficult to introspectively decide. We need a mechanism to recover from the things we have not included: is it the fault of the data or of the content?

The community would build classes of constructors, like birth. See A Hierarchical Unification of LIRICS and VerbNet Semantic Roles, figure 1,[2] and A Hierarchy with, of, and for Preposition Supersenses, figure 1.[3] Roles like Actor would be needed, else we can't fully verify the correct allocation of things in the template.

Different Types of Roles

Does anything prevent multiple different kinds of roles from coexisting? Do we want to prevent a situation where multiple constructors exist for the same act?

For example, a FrameNet-style representation for eating might look like

EAT ( role:EATER, role:THING_EATEN )

whereas a representation with macro-roles might look like

EAT ( role:AGENT, role:OBJECT )

This is probably an undesirable situation, but perhaps inevitable when the community starts creating their own constructors.

One could easily add a hierarchy of domain-specific roles to manage them. For example, SubRole(Eater, Agent) works just as well as SubRole(Predator, Agent), and so on. Mandatory naming by users may be a bit too much, but populating the roles at the back end with values from the likes of FrameNet or VerbNet may be of use.
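A role hierarchy of this kind could be sketched as a simple parent table. The table below contains only the SubRole pairs suggested in this section (with THING_EATEN placed under OBJECT, following the two EAT representations above); the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# SubRole(child, parent) pairs; everything else is treated as a top-level role.
SUB_ROLE: Dict[str, str] = {
    "EATER": "AGENT",
    "PREDATOR": "AGENT",
    "THING_EATEN": "OBJECT",
}


def ancestors(role: str) -> List[str]:
    """Return the role followed by its chain of parent roles."""
    chain = [role]
    while chain[-1] in SUB_ROLE:
        parent = SUB_ROLE[chain[-1]]
        if parent in chain:
            raise ValueError(f"cycle in role hierarchy at {parent}")
        chain.append(parent)
    return chain


def is_subrole(role: str, ancestor: str) -> bool:
    """True if `role` equals `ancestor` or is a descendant of it."""
    return ancestor in ancestors(role)


def to_macro_roles(roles: Tuple[str, ...]) -> Tuple[str, ...]:
    """Map predicate-specific roles to their most general ancestors."""
    return tuple(ancestors(r)[-1] for r in roles)


# The two EAT representations above become comparable after generalization.
assert to_macro_roles(("EATER", "THING_EATEN")) == ("AGENT", "OBJECT")
```

With such a table, constructors written with predicate-specific roles and constructors written with macro-roles could at least be recognized as describing the same act.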

Alternative proposals

TBD

Examples

These are a few random examples for testing this proposal.

Note that the abstract representations provided for these examples are to be understood as representation proposals. The authors of this document have not reached a consensus on these representations, nor on the proposal itself. Note also that these representations make no reference to semantic roles, which runs counter to one of the basic tenets of this proposal.

  • Edith Eger is the youngest daughter of Lajos and Ilona Elefánt, Hungarian Jews in an area which was, at the time of her birth, in Czechoslovakia. Her father was a tailor.
Child [ person = Edith Eger, father = Lajos Elefánt, mother = Ilona Elefánt, rank = <last> ]
->
child_renderer_en: "{Person} {Copula} the {Rank} {Lexeme(child)} of {father} and {mother}."

and, e.g., a “type-level” constructor like

Child [ person = <a person>, father = <a person>, mother = <a person>, rank = <rank value> ]

so we can fetch data for all children and their parents and render that in sentences for each [child,father,mother] combination there is in the data?
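As a rough illustration of this idea, the sketch below fills the English template once per data record. The fixed realizations of {Copula} and {Lexeme(child)}, the mapping from rank values to words, and the second record are placeholders invented for the sketch; a real renderer would have to select these forms from lexical data.

```python
import re
from typing import Dict, List

TEMPLATE = "{Person} {Copula} the {Rank} {Lexeme(child)} of {father} and {mother}."

# Placeholder realizations; a real renderer would choose these per context.
FIXED = {"copula": "is", "lexeme(child)": "child"}

# Hypothetical verbalization of rank values.
RANK_WORDS = {"<last>": "youngest", "<first>": "oldest"}


def render(template: str, record: Dict[str, str]) -> str:
    """Fill each {placeholder} from the fixed table or from the record."""
    def fill(match: "re.Match[str]") -> str:
        key = match.group(1).lower()  # slot names are matched case-insensitively
        if key in FIXED:
            return FIXED[key]
        value = record[key]
        return RANK_WORDS.get(value, value)
    return re.sub(r"\{([^{}]+)\}", fill, template)


records: List[Dict[str, str]] = [
    {"person": "Edith Eger", "father": "Lajos Elefánt",
     "mother": "Ilona Elefánt", "rank": "<last>"},
    # Invented placeholder record, only to show iteration over the data.
    {"person": "Person A", "father": "Person B",
     "mother": "Person C", "rank": "<first>"},
]

sentences = [render(TEMPLATE, r) for r in records]
```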

Another loose end to be put on a list somewhere, perhaps as part of the examples, perhaps not: how to link the constructor to the template, and how to make sure that the right elements of the constructor match the right elements of the template. For the Child constructor above we have, e.g.,

-> child_renderer_en_1: "{Person} {Copula} the {Rank} {Lexeme(child)} of {father} and {mother}."

or

-> child_renderer_en_2: "{father} and {mother} had {Person} as the {Rank} {Lexeme(child)}."

but not

child_renderer_en_wrong: "{father} {Copula} the {Rank} {Lexeme(child)} of {Person} and {mother}."

Should this be ignored in an initial phase, in the hope that it is done sensibly, and validated in a phase 2? Can this, or does this have to, be baked into either the constructor or the template language somehow? Should it even? For example, the constructor could state that person is the subject, however that is realized in the text, so that child_renderer_en_wrong can be flagged as wrong when it uses a specification from UD, SUD or the like, such as {subj:father}, because there is, say, subj:person in the constructor.
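A phase-2 validation of this kind might be sketched as follows. It assumes, hypothetically, that a template may annotate a placeholder with a relation, as in {subj:father}, and that the constructor declares which slot is the subject; neither convention has been agreed upon.

```python
import re
from typing import List, Optional, Tuple

# Matches {name} and {rel:name}; `rel` is a lowercase relation label such as "subj".
PLACEHOLDER = re.compile(r"\{(?:(?P<rel>[a-z]+):)?(?P<name>[^{}]+)\}")


def placeholders(template: str) -> List[Tuple[Optional[str], str]]:
    """Return (relation, name) pairs for every placeholder in the template."""
    return [(m.group("rel"), m.group("name")) for m in PLACEHOLDER.finditer(template)]


def validate(template: str, slots: Tuple[str, ...], subject_slot: str) -> List[str]:
    """List the problems found; an empty list means no problem was detected."""
    found = placeholders(template)
    names = {name.lower() for _, name in found}
    problems = [f"missing slot: {slot}" for slot in slots if slot.lower() not in names]
    for rel, name in found:
        if rel == "subj" and name.lower() != subject_slot.lower():
            problems.append(
                f"subject mismatch: template marks {name}, constructor declares {subject_slot}"
            )
    return problems


CHILD_SLOTS = ("person", "father", "mother", "rank")

ok = "{subj:Person} {Copula} the {Rank} {Lexeme(child)} of {father} and {mother}."
wrong = "{subj:father} {Copula} the {Rank} {Lexeme(child)} of {Person} and {mother}."

print(validate(ok, CHILD_SLOTS, "person"))
print(validate(wrong, CHILD_SLOTS, "person"))
```

Note that this check only catches templates that carry an annotation. An unannotated template passes, and a template such as child_renderer_en_2, whose syntactic subject is the parents, would need a more careful notion of "subject" than this sketch provides.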

-> child_renderer_en: "{Person} {Copula} the {Rank} {Lexeme(child)} of {father} and {mother}."
Ethnicity [ person = Edith Eger, ethnicities = < Jewish, Hungarian > ]
Location [ person = Edith Eger, place = Czechoslovakia, time = Birth [ person = Edith Eger ] ]
Birth [ person = Edith Eger, place = Czechoslovakia ]
Profession [ person = Lajos, profession = tailor ]

Note: the following example exhibits interesting linguistic phenomena, such as a nominalized -ing (or gerund), impersonal passive construction, the quantifier both and loads of coordination.

  • Cooking is done both by people in their own dwellings and by professional cooks and chefs in restaurants and other food establishments.
Conjunct [ Cook [ chef = people, location = <home> ], Cook [ chef = professional cook, location = restaurants ] ]
  • Invasion of Privacy (album)

This would and should be a QID. But we should think about how other nominalised structures, more or less lexicalized, should be represented.

No abstract representation has been proposed for the following examples:

  • Gravity has an infinite range, although its effects become weaker as objects get farther away.
  • Don't Starve is a survival video game developed by the Canadian indie video game developer Klei Entertainment.
  • In linguistics, a noun phrase, or nominal (phrase), is a phrase that has a noun or pronoun as its head or performs the same grammatical function as a noun.

The following example has been used in the Scribunto-based NLG system:

Rendered text (English):
Marie Curie was a Polish chemist and physicist. She was born 7 November 1867 in Warsaw and died 4 July 1934 in Passy. Marie Curie was a pioneer in the research of radioactivity. She researched radium. She was awarded the Nobel Prize in Physics in 1903, together with Pierre Curie and Henri Becquerel, for pioneering the research of radioactivity. She was awarded the Nobel Prize in Chemistry in 1911, for the discovery of polonium and radium. Marie Curie was the first woman to win the Nobel Prize.
Abstract content (proposal):[Note 1]
Person {
    person = "Q7186",
    birth = Birth {
        date = Date {
            day = "7",
            month = "11",
            year = "1867",
        },
        place = "Q270",
    },
    death = Death {
        date = Date {
            day = "4",
            month = "7",
            year = "1934",
        },
        place = "Q388949",
    },
    origin = "Q36", -- Poland. Should be a list including France
    occupation = List {
        _predicate = "List",
        first = "Q593644", -- Chemist
        second = "Q169470", -- Physicist
    }
},
Pioneer {
    person = "Q7186",
    in_what = Research {
        research_field = "Q11448", -- Radioactivity
    },
},
Research {
    person = "Q7186",
    pronominalize = true, -- This should possibly be done by a post-processor
    research_field = "Q1128",
},
AwardedPrize {
    person = "Q7186",
    pronominalize = true,
    prize = "Q38104", -- Nobel Prize in Physics
    date = Date {
        year = "1903",
    },
    with = List {
        first = "Q37463",
        second = "Q41269",
    },
    reason = Pioneer {
        person = "Q7186",
        in_what = Research {
            research_field = "Q11448", -- Radioactivity
        },
    },
},
AwardedPrize {
    person = "Q7186",
    pronominalize = true,
    prize = "Q44585", -- Nobel Prize in Chemistry
    date = Date {
        year = "1911",
    },
    reason = Discovery {
        discovery = List {
            first = "Q979",
            second = "Q1128"
        }
    }
},
Rank {
    rank = 1,
    person = "Q7186",
    reference_group = "Q467", -- woman
    activity = AwardedPrize {
        prize = "Q7191", -- Nobel Prize
    }
}

Footnotes

  1. "Person" should possibly be named "PersonIntroduction" as it is intended to be a constructor which introduces a person in the first sentence of an article.

References

  1. Vrandečić, Denny (2020). "Architecture for a Multilingual Wikipedia". arXiv:2004.04733. 
  2. Bonial, Claire; Corvey, William; Palmer, Martha; Petukhova, Volha V.; Bunt, Harry (2011). "A Hierarchical Unification of LIRICS and VerbNet Semantic Roles". 2011 IEEE Fifth International Conference on Semantic Computing (IEEE). doi:10.1109/ICSC.2011.57. Retrieved 2022-11-24. 
  3. Schneider, Nathan; Srikumar, Vivek; Hwang, Jena D.; Palmer, Martha (2015). "A Hierarchy with, of, and for Preposition Supersenses" (PDF). Proceedings of LAW IX - The 9th Linguistic Annotation Workshop (Association for Computational Linguistics). Retrieved 2022-11-24.