User:Strobilomyces/MultilingualWikipediaExample
Multilingual Wikipedia Examples
Introduction
On the Abstract Wikipedia/Examples page, when explaining the architecture of Multilingual Wikipedia it says "The look of the Constructors and their rendering results will be entirely up to the communities". But this presupposes a difficult process of establishing the scheme under which the constructors and such functions will operate. I want to understand for myself how this might work and below I try to develop two detailed examples, partly based on Denny's explanation here. The devil is in the detail, and so it is important to work out any examples down to a low level of detail. Here I am concerned with defining clearly how Multilingual Wikipedia might work, not with making the interface user-friendly.
For the moment I am trying to ensure that the example abstract text should work at least for English, French, German and Spanish (so automatically the same general schema should work for a lot of related languages).
At present the development team rightly gives priority to the Wikifunctions project, but there is also much work which could be done on Multilingual Wikipedia before Wikifunctions is needed here (it would be possible to determine the architecture and the form of the abstract text, make a list of principal constructors with specifications, find what new Wikidata items will be needed and what lexeme information will be required for various classes of languages, and so on).
A very simple example of abstract text
General
A first example is as follows:
Marie Curie saw her friend.
and to express that I am proposing the following abstract text where each constructor line starts with a label, followed by the constructor name and its arguments. One line has the extra keyword "sentenceHead" to mark it as the root of the grammatical tree for a sentence. The indentation is only cosmetic, and does not affect the result.
- sentenceHead::ExampleText: TransitiveHumanAction(verbChoice:see, tense:encyclopedic preterite, subject:MarieCurie, object:ObjectPhrase)
- MarieCurie: NounPartInWD(item:Q7186 "Marie Curie")
- ObjectPhrase: PossessiveNounPart(possessiveType:pronoun, possessed:Friend, possesser:ref MarieCurie)
- Friend: NounPartInWD(item:Q17297777 "friend", number:singular, gender:M)
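As a rough illustration of this line format, the following Python sketch parses constructor lines into a dictionary indexed by label. An argument marked with "ref" is kept as a reference to another labelled line. The dictionary layout, the function names and the decision to accept either ":" or "=" between an argument name and its value are assumptions of the sketch, not part of the proposal.

```python
import re

# One constructor line: optional "sentenceHead::", a label, a constructor
# name and a parenthesised argument list.
LINE_RE = re.compile(r"^(?:(sentenceHead)::)?(\w+):\s*(\w+)\((.*)\)$")


def parse_line(line):
    """Parse one constructor line into a dict (layout assumed for this sketch)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a constructor line: {line!r}")
    head, label, constructor, arg_text = match.groups()
    args = {}
    # Naive split: good enough while no argument value contains a comma.
    for part in arg_text.split(","):
        name, value = (s.strip() for s in re.split(r"[:=]", part, maxsplit=1))
        if value.startswith("ref "):
            args[name] = {"ref": value[4:].strip()}  # binding to another label
        else:
            # Drop the quoted human-readable gloss after an item number.
            args[name] = re.sub(r'\s*"[^"]*"$', "", value)
    return {"label": label, "constructor": constructor,
            "is_head": head is not None, "args": args}


def parse_abstract_text(text):
    """Index all constructor lines by label, for later reference."""
    lines = [parse_line(raw) for raw in text.splitlines() if raw.strip()]
    return {line["label"]: line for line in lines}


EXAMPLE = '''
sentenceHead::ExampleText: TransitiveHumanAction(verbChoice:see, tense:encyclopedic preterite, subject:MarieCurie, object:ObjectPhrase)
MarieCurie: NounPartInWD(item:Q7186 "Marie Curie")
ObjectPhrase: PossessiveNounPart(possessiveType:pronoun, possessed:Friend, possesser:ref MarieCurie)
Friend: NounPartInWD(item:Q17297777 "friend", number:singular, gender:M)
'''

parsed = parse_abstract_text(EXAMPLE)
print(parsed["ObjectPhrase"]["args"]["possesser"])  # {'ref': 'MarieCurie'}
```

The comma split would need replacing by a proper tokenizer as soon as list-valued arguments are allowed.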
I think all proposals for abstract text have the form of a tree of function calls, as here. It is tempting to think that the implementation could take the form of functions calling each other in just that structure, with the lower-level functions providing strings or grammatical tree-structures to be put together by the higher-level functions, to generate the eventual rendered text. But it cannot work so simply as that (in one pass), because information to select the grammatical forms has to be communicated from arbitrary parts of the grammatical tree to any other part.
In this proposal, each constructor line in the abstract text will be implemented with two corresponding functions, an "expand" function and a "render" function. The expand function will generate the information needed to find the correct lexical and pronominal forms, etc., in a parse tree data structure having the same tree structure as that implied by the abstract text. Examples of such properties to be added in the parse tree would be gender and grammatical number; the actual list of properties would depend on the language (but note that even English needs gender for pronouns). At each node of the expand tree each expand function will call the corresponding functions of the lower-level constructors to build up the tree. The suggested parse tree for ExampleText is shown below.
There would be very many constructors available, and they should allow many optional arguments (for qualifying phrases etc.) but these would be omitted if not needed.
"Bindings" tend to be a big area of difficulty in natural-language processing. An example of a "binding" is where a pronoun refers to a particular antecedent, for instance in the sentence "Bill washed his face", the pronoun "his" is probably bound to "Bill", but it might also be bound to someone else. I think that here all trouble can be avoided by having a rule that in the abstract text, all possible bindings must be specified explicitly. I propose that in the abstract text, bindings can only be to constructor lines (which have labels) and bindings are to be specified by giving the label name marked with "ref" to the constructor argument where it is used.
I thought that verbs would have items in Wikidata, of which the item numbers would appear in the abstract text to allow the lexemes in each language to be found. Such WD entries are not there currently. Verbs are complicated, for instance in English or German verbs can take different prepositions or cases, or change their meaning according to associated prepositions/particles (in phrasal verbs like "give up"). The alternative to having verbs in WD is that the constructors will have to "know" all the verbs which they use and there could be roughly as many constructors as verbs. That method is what I am assuming here. But I suggest that there could be a general constructor "TransitiveHumanAction" which would construct simple sentence parts with a wide choice of verbs which would be selected through the verbChoice parameter by keywords - for instance "eat/drink/hit/miss/see/hear/...". The "see" choice would select verb "see" in English, "voir" in French etc. and the associated syntactical knowledge would be contained in the expand and render functions. For instance there would be a table linking each verb choice to the lexeme "L" code of the verb in the given language; such a table could be stored in a WD item associated with the constructor.
Typically encyclopedic text does not use the first and second grammatical persons (I, you or we). For simplicity I assume here that only the third person is allowed in the abstract text. Later the proposal could be extended to cover other grammatical persons. Also I only consider active voice and indicative mood here.
Grammatical case is important in German, but it is also needed for pronouns in the other languages.
In the original text the gender of Marie Curie's friend is ambiguous in many languages (for instance in Spanish we can choose "amigo" or "amiga"). For the sake of simplicity I assume that we know that it was a man and I specify the gender in the abstract text. But it should also be possible to have "gender"="Unknown", and the target languages should find a solution for this.
Normally users should generate and read the abstract text through a program which could take advantage of Wikifunctions' multilanguage features. In that way the abstract text could appear in the user's selected language, for instance a gender parameter would use a Wikifunctions multivalue item where the name "gender" and the values "M", "F" and "Unknown" would all be visible in the selected language. Since I don't want to enter into that complication yet, the names in my abstract text are in English (but this abstract text should allow rendered text in any language to be generated). Incidentally the user interface program to read and generate the abstract text will (for checking) no doubt require additional function(s) for each constructor line type and language.
Implementation of the very simple example - expand phase
I suggest that the expand functions would have names like "<constructor>_expand_<language>", for instance "TransitiveHumanAction_expand_en". The corresponding rendering functions, which will be called in the second pass to generate the output, should be like "<constructor>_render_<language>". There will be an overall function which takes the whole abstract text as input and gives the corresponding language text as output; that might be called Render_Abstract_Text_en in English.
The parse tree resulting from the "expand" pass could be something like the following tree structure, which is in a vaguely JSON-like format. In practice, for convenience, more information would be annotated into this tree, but that is an internal implementation detail.
- {label:ExampleText, constructor:TransitiveHumanAction, verbChoice:see, tense:encyclopedic preterite, number:singular, gender:F,
- {role:subject, label:MarieCurie, constructor:NounPartInWD, item:Q7186, number:singular, gender:F, case:nominative},
- {role:object, label:ObjectPhrase, constructor:PossessiveNounPart, possessiveType:pronoun,
- {role:possessed, label:Friend, constructor:NounPartInWD, item:Q17297777, number:singular, gender:M, case:accusative},
- {role:possessing, ref:MarieCurie, number:singular, gender:F}
- }
- }
How will the constructor functions for this text work?
The overall function to render abstract text, Render_Abstract_Text_<language>, could proceed in the following stages.
- Read the abstract text and put it into an associative array or "dictionary" (indexed by the label of each constructor line) for later reference.
- Create the parse tree by finding the root constructor ("TransitiveHumanAction") and calling TransitiveHumanAction_expand_<language> to create it. TransitiveHumanAction_expand_<language> will "know" that the sentence root needs the subject and the object as children and call the relevant child expand functions to fill in relevant properties and dependent nodes.
- Traverse the parse tree and fill in those properties which are defined by binding references (this should be possible without knowledge dependent on particular constructors).
- Call the top level render function, TransitiveHumanAction_render_<language>, to obtain the rendered text. That function will call the subsidiary render functions in turn to build up the target text, using the context information which is available in the parse tree.
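The stages listed above might be wired together roughly as in the following Python sketch. It assumes the first stage has already produced a dictionary of constructor lines indexed by label, and it looks up the per-constructor functions by the "<constructor>_expand_<language>" and "<constructor>_render_<language>" names. The two toy constructors "Leaf" and "Pair", the node layout and the function signatures are invented purely to exercise the driver; they are not proposed constructors.

```python
def render_abstract_text(lines, language, functions):
    """Run the expand, binding and render stages over parsed constructor lines."""

    def expand(label):
        line = lines[label]
        node = {"label": label, "constructor": line["constructor"], "children": []}
        functions[f"{line['constructor']}_expand_{language}"](node, line["args"], expand)
        return node

    def walk(node):
        yield node
        for child in node["children"]:
            yield from walk(child)

    def render(node):
        return functions[f"{node['constructor']}_render_{language}"](node, render)

    # Build the parse tree, starting from the line marked as sentence head.
    head = next(line for line in lines.values() if line["is_head"])
    tree = expand(head["label"])
    # Fill in properties defined by binding references; this step needs no
    # knowledge of particular constructors.
    by_label = {n["label"]: n for n in walk(tree) if "label" in n}
    for node in walk(tree):
        if "ref" in node:
            target = by_label[node["ref"]]
            node["number"], node["gender"] = target["number"], target["gender"]
    # Render from the root; each render function calls `render` on its children.
    return render(tree)


# Toy constructors (hypothetical), only to show the control flow.
def Leaf_expand_en(node, args, expand):
    node.update(text=args["text"], number=args["number"], gender=args["gender"])


def Leaf_render_en(node, render):
    return node["text"]


def Pair_expand_en(node, args, expand):
    node["children"] = [expand(args["first"]),
                        {"ref": args["bound_to"], "children": []}]


def Pair_render_en(node, render):
    first, bound = node["children"]
    pronoun = {"F": "she", "M": "he"}[bound["gender"]]
    return f"{render(first)} ({pronoun})"


FUNCTIONS = {f.__name__: f for f in
             (Leaf_expand_en, Leaf_render_en, Pair_expand_en, Pair_render_en)}
LINES = {
    "Top": {"label": "Top", "constructor": "Pair", "is_head": True,
            "args": {"first": "Person", "bound_to": "Person"}},
    "Person": {"label": "Person", "constructor": "Leaf", "is_head": False,
               "args": {"text": "Marie Curie", "number": "singular", "gender": "F"}},
}
print(render_abstract_text(LINES, "en", FUNCTIONS))  # Marie Curie (she)
```

The point of the sketch is only the ordering: the pronoun can be chosen in the render stage because the binding was resolved after the whole tree had been expanded.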
TransitiveHumanAction_expand_<language> "knows" that the subject should be nominative case, but before subsequent expansion it will not know whether it is singular or plural, nor what gender that item has. The case, the grammatical number, and the gender need to be propagated through the tree so that they are available at every corresponding node, and that is essentially the function of the expand pass. Naturally there will be other such properties needed, such as "definite/indefinite article type", which is not covered in this initial example. The list will grow when more languages are taken into account. Some properties propagate downwards from the root (the case) and others upwards from the leaves (number and gender); each expand function "knows" which of its child properties have to go downwards and which upwards.
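The two directions of propagation can be illustrated with a small Python sketch. Here a per-constructor rule table says which case each child role receives and from which child the number and gender are copied upwards. The rule table, the "inherit" marker and the node layout are assumptions made for illustration, and leaf nodes are taken to have their number and gender already filled in.

```python
# Per constructor: the case handed down to each child role ("inherit" means
# the parent's own case) and the role whose number and gender go upwards.
RULES = {
    "TransitiveHumanAction": {
        "case_for_role": {"subject": "nominative", "object": "accusative"},
        "agrees_with": "subject",
    },
    "PossessiveNounPart": {
        "case_for_role": {"possessed": "inherit"},
        "agrees_with": "possessed",
    },
}


def expand(node, case=None):
    """Annotate `node` in place: case flows down, number and gender flow up."""
    if case is not None:
        node["case"] = case
    rules = RULES.get(node["constructor"])
    if rules is None:  # leaf constructor: nothing further to propagate
        return node
    for role, child in node["children"].items():
        rule = rules["case_for_role"].get(role)
        child_case = case if rule == "inherit" else rule
        if "constructor" in child:  # reference nodes are left for later
            expand(child, child_case)
    source = node["children"][rules["agrees_with"]]
    node["number"], node["gender"] = source["number"], source["gender"]
    return node


tree = {
    "constructor": "TransitiveHumanAction",
    "children": {
        "subject": {"constructor": "NounPartInWD",
                    "number": "singular", "gender": "F"},
        "object": {
            "constructor": "PossessiveNounPart",
            "children": {
                "possessed": {"constructor": "NounPartInWD",
                              "number": "singular", "gender": "M"},
                "possessing": {"ref": "MarieCurie"},
            },
        },
    },
}
expand(tree)
print(tree["children"]["object"]["children"]["possessed"]["case"])  # accusative
print(tree["gender"])  # F
```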
In German the gender and number could actually be combined into one field with values "M singular"/"F singular"/"N singular"/"plural". But for other languages there would have to be separate gender and number properties in the abstract text.
TransitiveHumanAction_expand_<language> would add the "MarieCurie" node to the parse tree with standard basic information like the "subject" role, the label, the child constructor name "NounPartInWD" and also the "nominative" case property. Then it would call the child NounPartInWD_expand_<language> function with a pointer to the relevant part of the current parse tree. That function would use the Q7186 item claims to complete the properties of the parse tree node. Since for the Marie Curie item property P31 "instance of" equals Q5 "human", it will set number=singular and since property P21 "sex or gender" equals Q6581072 "female", it will set gender="F".
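A minimal Python sketch of this step is given below. The CLAIMS dictionary is a hand-written stand-in for the item claims (a real implementation would have to query Wikidata), and the function signature and node layout are assumptions. The sketch also assumes that values given explicitly in the abstract text take precedence over anything derived from claims.

```python
# Stand-in for Wikidata claims, reduced to one value per property.
CLAIMS = {"Q7186": {"P31": "Q5", "P21": "Q6581072"}}
# P21 "sex or gender" values mapped to the gender codes used in the parse tree.
GENDER_OF = {"Q6581072": "F", "Q6581097": "M"}


def NounPartInWD_expand_en(node, claims=CLAIMS):
    """Complete number and gender on a parse tree node from item claims."""
    item_claims = claims.get(node["item"], {})
    # An instance of Q5 "human" denotes a single person.
    if "number" not in node and item_claims.get("P31") == "Q5":
        node["number"] = "singular"
    if "gender" not in node and item_claims.get("P21") in GENDER_OF:
        node["gender"] = GENDER_OF[item_claims["P21"]]
    return node


marie = NounPartInWD_expand_en({"label": "MarieCurie", "item": "Q7186",
                                "case": "nominative"})
print(marie["number"], marie["gender"])  # singular F
```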
Next consider the object of the sentence. TransitiveHumanAction_expand_<language> will add the "ObjectPhrase" node as a child of TransitiveHumanAction with basic information. I suppose this constructor would have a "possessiveType" parameter which could have had the value "prepositional" (leading to "friend of Marie Curie" in English) or "genitive" (leading to "Marie Curie's friend" in English and defaulting to the same as "prepositional" in French). But in this case the parameter is "pronoun", meaning that a possessive pronoun would be used in place of the possessor.
When PossessiveNounPart_expand_<language> is called it will create the Possessed and Possesser child nodes in the parse tree with basic properties. It will "know" that the case propagates down to the PossessedPart child node "Friend", but not to the PossesserPart child node "MarieCurie".
For the "possessed" branch the function will create the "Friend" node, attach the basic properties, and call NounPartInWD_expand_<language> on the new node. In general the latter function could search item Q17297777 in Wikidata to find the number and gender. But since this is a common noun the number cannot be found from WD; instead it had to be specified in the abstract text. And whilst for most common nouns the gender could be found from the WD item, here the gender was specified in the abstract text - that was because the target language may require the gender to be chosen in the case of a friend (example: "amigo" or "amiga" in Spanish). So actually NounPartInWD_expand_<language> does not need to do anything here.
For the possessing parse tree branch, PossessiveNounPart_expand_<language> will not create a further new node. The "ref MarieCurie" property implies that the number and gender properties should just be copied from the bound "MarieCurie" node, but a referenced node like that may not have been expanded yet. I suggest that such referenced properties should be left as references at this stage and filled in subsequently by Render_Abstract_Text_<language> after the rest of the expand phase is finished.
The possessed item number and gender will get propagated up to the object child node "ObjectPhrase" of TransitiveHumanAction. The knowledge about the verb part is held inside function TransitiveHumanAction_render_<language> and there is no need for action at the TransitiveHumanAction_expand_<language> stage.
Implementation of the very simple example - render phase
After the recursive expand phase, Render_Abstract_Text_<language> should propagate the properties corresponding to bindings (reference nodes) so that each node will have its number, gender and case properties (and perhaps others).
Next TransitiveHumanAction_render_<language> will be called to generate the rendered string recursively. The plan to be adopted in this case in English would probably be to render the subject subtree to get a string for the subject, calculate the verb string to get the verb part, render the object subtree to get a string for the object and concatenate the three strings separated by spaces and terminated with a period to get the final result. In fact for this simple case the same strategy would also work well for French, Spanish and German.
Then NounPartInWD_render_<language> would be called to generate the string for the name of Marie Curie, which should be derived from WD item Q7186. First the P31 "instance of" property will be consulted to determine the type of processing needed; in this case the item is a human. In general even for humans as target, it is not a trivial task to determine how the name will be rendered, and a very pertinent piece by Denny is available here. When I read that page, which applies only to a very tiny part of the Multilingual Wikipedia project, it seems to me that however the project is finally implemented, the team will have a lot of work to occupy themselves. If a name includes a title (Madame Curie), other properties such as case may have to be taken into account. A simple (but dangerous) expedient is to take the language label of the item, and that works nicely in this case, giving "Marie Curie".
The verb string needs to be generated for verbChoice = "see". Somewhere there should be a table giving the lexeme code for each possible verb choice which is allowed. This table might be in a WD entry associated with the constructor TransitiveHumanAction. For the "see" option in English it would be L185.
The tense of the verb is another big issue. Languages have various tenses which are selected in various ways, and so abstract text choices will not just correspond to the tenses of a particular language. I suggest that there could be various descriptive abstract text tense names expressing the general context, and the various languages could make choices accordingly. Here the name of the abstract tense is "encyclopedic preterite" and in English the simple past could be chosen, in French the past historic (passé simple), in German the preterite, and so on.
Encyclopedic text is typically in the third person and I currently assume that only the third person is allowed. Also I only allow the active voice and the indicative mood.
TransitiveHumanAction_render_<language> will "know" that the grammatical number and gender for the verb come from the subject role child. Then for these languages the appropriate lexeme form can be found based on the known parameters (singular, feminine, chosen tense, third person as always, active as always, indicative as always). For English the result is "saw".
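The verb selection could be sketched in Python as follows. The FORMS table is a hand-written stand-in for the forms which would really come from the lexeme of each verb, and the tense names used as keys are only labels chosen for this sketch. Gender is left out of the key because none of the forms shown varies by gender.

```python
# Abstract tense name -> tense chosen by each language.
TENSE = {
    "encyclopedic preterite": {
        "en": "simple past", "fr": "past historic", "de": "preterite",
    },
}

# Stand-in for lexeme forms: (verbChoice, language) -> (tense, person, number).
FORMS = {
    ("see", "en"): {("simple past", 3, "singular"): "saw",
                    ("simple past", 3, "plural"): "saw"},
    ("see", "fr"): {("past historic", 3, "singular"): "vit",
                    ("past historic", 3, "plural"): "virent"},
    ("see", "de"): {("preterite", 3, "singular"): "sah",
                    ("preterite", 3, "plural"): "sahen"},
}


def verb_form(verb_choice, abstract_tense, number, language):
    """Return the third-person, active, indicative form for the subject's number."""
    tense = TENSE[abstract_tense][language]
    return FORMS[(verb_choice, language)][(tense, 3, number)]


for lang in ("en", "fr", "de"):
    print(lang, verb_form("see", "encyclopedic preterite", "singular", lang))
# en saw
# fr vit
# de sah
```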
Then PossessiveNounPart_render_<language> would be called to render the object part. Here with possessiveType=pronoun the form of the output will be a possessive pronoun, then a space, then the result of evaluating the possessed part "Friend" - that works for all these languages. In German the third person possessive pronoun depends on the gender/number of the possesser, the gender/number of the possessed part, and the case of the possessed part (here it is "ihren"). In English it just depends on the gender and number of the possesser part (here it is "her"). I think it is not too difficult for PossessiveNounPart_render_<language> to work that out through look-up tables in the software (and that there is no point in putting this information in WD).
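Such look-up tables might look like the following Python sketch for English and German, covering third-person forms only. The function names and the dictionary layout of the nodes are assumptions of the sketch.

```python
EN_SINGULAR = {"M": "his", "F": "her", "N": "its"}


def possessive_pronoun_en(possessor):
    """English: depends only on the possessor's number and gender."""
    if possessor["number"] == "plural":
        return "their"
    return EN_SINGULAR[possessor["gender"]]


# German endings, chosen by the case, gender and number of the possessed noun.
DE_ENDINGS = {
    "nominative": {"M": "", "F": "e", "N": "", "plural": "e"},
    "accusative": {"M": "en", "F": "e", "N": "", "plural": "e"},
    "dative": {"M": "em", "F": "er", "N": "em", "plural": "en"},
    "genitive": {"M": "es", "F": "er", "N": "es", "plural": "er"},
}


def possessive_pronoun_de(possessor, possessed):
    """German: the stem follows the possessor, the ending the possessed noun."""
    feminine_or_plural = (possessor["number"] == "plural"
                          or possessor["gender"] == "F")
    stem = "ihr" if feminine_or_plural else "sein"
    column = "plural" if possessed["number"] == "plural" else possessed["gender"]
    return stem + DE_ENDINGS[possessed["case"]][column]


marie = {"number": "singular", "gender": "F"}
friend = {"number": "singular", "gender": "M", "case": "accusative"}
print(possessive_pronoun_en(marie))          # her
print(possessive_pronoun_de(marie, friend))  # ihren
```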
Then PossessiveNounPart_render_<language> would call NounPartInWD_render_<language> to render the Possessed part, "Friend" based on item Q17297777 with singular number and gender=M. But there is a problem here and strangely enough the Spanish WD lexeme entry currently implements a different solution to the French and German entries. The Spanish WD lexeme information counts "amigo" and "amiga" as gender forms of the same lexeme (L39702), whereas the French "ami" and "amie" have different lexemes (L9093 and L11973) and the German "Freund" and "Freundin" are similar to the French. In the French system perhaps the two forms need to be linked to two different items, which does not fit well with my current proposal. Surely other German adjective/nouns such as "Deutscher"/"Deutsche" need to be treated as one lexeme? The solution in general is not clear to me, but in any case I assume that NounPartInWD_render_<language> will find a suitable text - in the easy English case it is "friend".
When the child strings are put together by TransitiveHumanAction_render_<language>, I think that the results would be as follows:
language | rendered text |
---|---|
en | Marie Curie saw her friend. |
es | Marie Curie vio su amigo. |
fr | Marie Curie vit son ami. |
de | Marie Curie sah ihren Freund. |
Slightly more extensive example
General
The sentence which inspired my next abstract text is from the second paragraph of w:en:Castle, after a bit of simplification. It is as follows:
European-style castles originated in the 9th and 10th centuries, after the fall of the Carolingian Empire resulted in its territory being divided up.
and I am proposing the following abstract text to express that.
- sentenceHead::BiggerExample: ComeIntoBeing(verbChoice:arise, tense:encyclopedic preterite, what:CastleNounPart, qualifyingPhrase:TimeandCausePhrase)
- CastleNounPart: QualifiedCommonNounPart(articleType:generalClass, number:plural, listOfAdjectiveParts:[CastleAdjective], noun:CastleNoun)
- CastleAdjective: StyleOf(European)
- European: GetAdjectiveFromWDNoun(item:Q46 "Europe")
- CastleNoun: NounPartInWD(item:Q23413 "Castle")
- TimeandCausePhrase: PhraseCombination(comboType:and, phraseList:[TimePhrase, CausePhrase])
- TimePhrase: TimeIntervalSpecification(relationType:during, intervalType:listOfPeriods, periodType:century, list:[9, 10])
- CausePhrase: TimeClause(relationType:after, complementType:EventClause, complement:Event)
- Event: ResultedInClause(tense: encyclopedic preterite, causer:CauseSubject, result:CauseResult)
- CauseSubject: PossessiveNounPart(possessiveType:prepositional, possessed:Fall, possesser:Empire)
- Fall: NounPartInWD(articleType:definite, number:singular, item:Q3042783 "societal collapse")
- Empire: NounPartInWD(articleType:definite, number:singular, item:Q31929 "Carolingian Empire")
- CauseResult: PossessiveNounPart(possessiveType:prepositional, possessed:Division, possesser:TerritoryNP)
- Division: NounPartInWD(articleType:definite, number:singular, item:Q518554 "partition")
- TerritoryNP: PossessiveNounPart(possessiveType:pronoun, possessed:Territory, possesser:ref Empire)
- Territory: NounPartInWD(number:singular, item:Q183366 "territory")
As in the very simple example above, when Render_Abstract_Text_<language> is called on the abstract text, it will first apply the functions of the form <constructor>_expand_<language> to create a parse tree where each node is annotated with the relevant properties "grammatical number", "gender", "article type", "grammatical type", etc. which will be needed during the rendering phase.
Bigger example - render phase
Now consider the action of the rendering functions. The strategy of ComeIntoBeing_render_en will be to evaluate the "what" noun phrase as a string, evaluate the verb part (this function "knows" the relevant verbs for coming into being and how to use them with prepositions etc.), evaluate the optional qualifyingPhrase, and append the three strings separated by spaces and terminated by a period. In practice most of the qualifyingPhrase processing should be done by a standard function which would be shared by many constructors.
Function QualifiedCommonNounPart_render_<language> is meant to put a possible article, a list of adjective parts and a noun part together into a noun phrase. With some languages the adjective parts may come after the noun.
Statements about a general class such as "castles" need a definite article in such languages as French and Spanish, but not in English. I suggest article type "generalClass" for this type of noun phrase. The English constructor will then know that here no article is needed, but other languages will put the definite article.
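How one abstract article type could lead to different articles per language might be sketched in Python as follows. The tables are hand-written for illustration and cover only English, Spanish and French; details such as French elision ("l'") and Spanish "el" before some feminine nouns are deliberately ignored, and the function name is an assumption.

```python
# Definite articles by (number, gender); English ignores both.
DEFINITE = {
    "es": {("singular", "M"): "el", ("singular", "F"): "la",
           ("plural", "M"): "los", ("plural", "F"): "las"},
    "fr": {("singular", "M"): "le", ("singular", "F"): "la",
           ("plural", "M"): "les", ("plural", "F"): "les"},
}

# Whether a statement about a general class takes the definite article.
GENERAL_CLASS_USES_DEFINITE = {"en": False, "es": True, "fr": True}


def article(article_type, language, number, gender):
    """Return the article for a noun phrase, or "" when none is used."""
    if article_type == "generalClass":
        if not GENERAL_CLASS_USES_DEFINITE[language]:
            return ""
        article_type = "definite"
    if article_type == "definite":
        if language == "en":
            return "the"
        return DEFINITE[language][(number, gender)]
    raise NotImplementedError(article_type)


print(repr(article("generalClass", "en", "plural", "N")))  # ''
print(article("generalClass", "es", "plural", "M"))        # los
print(article("generalClass", "fr", "plural", "M"))        # les
```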
The listOfAdjectiveParts has one member, "CastleAdjective", for which the English model is "European-style". The word "European" is a "demonym", that is an adjective derived from a place, in this case WD item Q46 "Europe". The property P1549 "Demonym" is available (allowing the basic word text to be looked up for a given language and sometimes gender), but it currently gives the word for a person, which can be different from the adjective (for instance in German we would have person: "Europäer" and adjective: "europäisch"). So I think the demonym is not applicable here. I don't think adjectives are currently covered well by the system, but I notice that the English lexeme L1349 "happy" links to Q8 "happiness" through property P5137 "item for this sense". So I propose that each adjective lexeme needs to have property P5137 (on one of its senses), pointing to the item of the corresponding noun in WD. Then GetAdjectiveFromWDNoun will be able to get the lexeme of the adjective for the noun. In the various languages it is European/europeo/européen/europäisch.
There is a complication here; the original English text said "European-style" and I suppose that there will be a constructor "StyleOf" which will make a given adjective or noun into a phrase meaning "of the given style". In other languages the best strategy may be to turn "x-style noun" into "noun of type x", for instance "castles of type European". Now the word European will have to agree with "type", not with the castles. I will assume that all this can be handled.
In the first pass the gender of the castles will have been found (by finding the lexeme linked to Q23413 "Castle" through property P5137 "item for this sense") and it will have been propagated to the adjective part on the parse tree. In a fashion similar to that described above for the very simple example, QualifiedCommonNounPart_render_<language> can generate the text "European-style castles" (for English), or "Los castillos de tipo europeo" (for Spanish).
I propose that the ComeIntoBeing constructor will support various verbs relevant to coming into being (such as "arise", "originate", "emerge", "be born", etc.) and that one of them will be selected by the verbChoice parameter. The choices are not general verbs, they are a limited set of options which should be understood by all the ComeIntoBeing language functions. In the example I suppose that "originate" was not an available option for some reason and that instead the "arise" option is selected. The number (plural) and the gender of the subject are known, so the appropriate form can be found from the lexeme. In English the result would be "arose".
I am assuming that phrases can be combined using the constructor PhraseCombination. Here there are two, TimePhrase and CausePhrase, which can just be concatenated in all these languages. Punctuation is another issue. The render function for TimePhrase should work out whether its generated string will start with a comma, and the render function for CausePhrase similarly.
I imagine that TimeIntervalSpecification could generate many different temporal phrases which specify when something happened (durations, specific dates, etc.). Here with relation type "during" and given a list of centuries, it could generate "during the ninth and the tenth centuries". The function will "know" that in English centuries use ordinal numbers and require a definite article. Another example with parameters (relationType:during, intervalType:range, periodType:year, start:1953, end:1967) might generate "during the period from 1953 to 1967". This sort of function needs to be extended until all commonly used time expressions are covered.
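A very small English-only Python sketch of such a function is shown below, covering just the two parameter combinations mentioned. The function name and signature are assumptions; the "list" argument is called `periods` here to avoid shadowing the Python built-in, and the centuries are written with numerals ("9th") rather than spelled out, which is one possible choice among several.

```python
def ordinal_en(n):
    """English ordinal numeral with suffix, e.g. 9 -> '9th', 21 -> '21st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def TimeIntervalSpecification_render_en(relation_type, interval_type, period_type,
                                        periods=None, start=None, end=None):
    """Render a temporal phrase; only two 'during' variants are implemented."""
    if relation_type != "during":
        raise NotImplementedError(relation_type)
    if interval_type == "listOfPeriods" and period_type == "century":
        names = [ordinal_en(n) for n in periods]
        if len(names) == 1:
            return f"during the {names[0]} century"
        joined = ", ".join(names[:-1]) + " and " + names[-1]
        return f"during the {joined} centuries"
    if interval_type == "range" and period_type == "year":
        return f"during the period from {start} to {end}"
    raise NotImplementedError((interval_type, period_type))


print(TimeIntervalSpecification_render_en(
    "during", "listOfPeriods", "century", periods=[9, 10]))
# during the 9th and 10th centuries
print(TimeIntervalSpecification_render_en(
    "during", "range", "year", start=1953, end=1967))
# during the period from 1953 to 1967
```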
The TimeClause constructor is intended in English to take the form ", after" + a clause which specifies an event. The event, implemented with the ResultedInClause constructor, is just a subject-verb-object clause in English (in German the order would be subject-object-verb, such as "nachdem der Zusammenbruch ... die Aufteilung ... als Folge hatte"). The subject of the ResultedInClause event is the fall of the Carolingian Empire.
The Carolingian Empire has a WD item, Q31929, of which the language label should show the text in the nominative, but unfortunately that is not what is needed. Instead I think we need a lexeme for "Carolingian" and a lexeme for "Empire". The item Q31929 would only work if there were lexemes for the whole two-word phrase in each language, but I suppose that that is not practical (though in German the phrase is only one word, "Karolingerreich", so perhaps it might have a lexeme). But it seems a waste to provide all the lexemes for every possible compound word like this when the forms can all be derived from the components. I propose that lexemes will only exist for the component words and that the lexemes will be linked to the item through property links of type P5137 "item for this sense", with qualifiers to indicate the grammatical role and order number of each lexeme in the phrase. So in this case the lexeme "Carolingian" will be linked to Q31929 with role qualifier "adjective" and word order qualifier "1", whilst the lexeme "Empire" will be linked with role qualifier "noun" and word order qualifier "2". In German the lexeme "Karolinger" might be linked to Q31929 with role qualifier "prefix" and word order qualifier "1", whilst the lexeme "Reich" might be linked with role qualifier "basenoun" and word order qualifier "2", finally to give "des Karolingerreichs" (as it is genitive). The phrase "des karolingischen Reichs" might have been better anyway. The NounPartInWD constructor needs to work all this out to give "the Carolingian Empire" in English.
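The assembly of a name from several linked lexemes might be sketched in Python as below. The LINKS table is a hand-written stand-in for the proposed qualified P5137 links (which do not exist in this form), the item number is the one used in the abstract text above, and only the uninflected base form is produced; articles and case endings such as the genitive "-s" would still have to be added by the caller.

```python
# Stand-in for the proposed links: (item, language) -> linked lexemes with
# their role and word-order qualifiers.
LINKS = {
    ("Q31929", "en"): [
        {"lemma": "Empire", "role": "noun", "order": 2},
        {"lemma": "Carolingian", "role": "adjective", "order": 1},
    ],
    ("Q31929", "de"): [
        {"lemma": "Karolinger", "role": "prefix", "order": 1},
        {"lemma": "Reich", "role": "basenoun", "order": 2},
    ],
}


def assemble_name(item, language, links=LINKS):
    """Join the linked lexemes in word order; a prefix is glued to the next word."""
    words = []
    glue = False
    for part in sorted(links[(item, language)], key=lambda p: p["order"]):
        if glue:
            # Part of a compound: attach in lower case to the previous word.
            words[-1] += part["lemma"].lower()
        else:
            words.append(part["lemma"])
        glue = part["role"] == "prefix"
    return " ".join(words)


print(assemble_name("Q31929", "en"))  # Carolingian Empire
print(assemble_name("Q31929", "de"))  # Karolingerreich
```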
I also have a problem with the noun part "Fall" which I identify as Q3042783 "societal collapse" in WD. The problem is that rendering the phrase as "the societal collapse of the Carolingian Empire" is not very satisfactory; the word "societal" should be omitted here. But in other contexts item Q3042783 should indeed be rendered as "societal collapse" and I am not sure what the solution is.
The result part of the ResultedInClause construction can now be evaluated. In the abstract text for simplicity I replaced the rather difficult gerund "its territory being divided up" by a noun phrase "the partition of its territory" (the word "division" is more ambiguous than "partition" and I needed to find a WD item for it). This phrase uses PossessiveNounPart twice and it uses a reference to the "Empire" constructor to specify the possessor for the possessive pronoun. In this way it should be possible to derive the cause result "the partition of its territory".
I assume that the ResultedInClause constructor "knows" a suitable verb phrase to use to express that one thing leads to another. In the original English text that was "result in", but now I suppose that ResultedInClause happens to be implemented using "lead to". The constructor will have the lexeme for the verb "lead" and will find the third-person singular preterite form "led". It will then insert the word "to" and then the result noun phrase, giving "the fall of the Carolingian Empire led to the partition of its territory".
When the rendered text is assembled I suggest that (with a bit of optimism) the following results could be obtained.
language | rendered text |
---|---|
en | European-style castles arose during the 9th and 10th centuries, after the societal collapse of the Carolingian Empire led to the partition of its territory. |
es | Los castillos de tipo europeo se originaron durante los siglos IX y X, después de que el colapso social del Imperio carolingio condujese a la partición de su territorio. |
fr | Les châteaux de style européen apparurent aux 9e et 10e siècles après que l'effondrement de l'Empire Carolingien eut conduit au partage de son territoire. |
de | Die Burgen europäischer Art entstanden während der 9en und 10en Jahrhunderte, nachdem der gesellschaftliche Zusammenbruch des karolingischen Reichs die Aufteilung dessen Territoriums verursacht hatte. |
Problems and lessons
- The abstract text can't be processed in one pass in practice, so each constructor will require at least two corresponding functions to implement it in each language. I suggest that each constructor will have an "expand" function which in the first pass will build up a parse tree to contain the relevant information, and a "render" function which in the second pass will render the text using the appropriate grammatical information which will then be available.
- Initially I thought that entries in WD would be used for the verbs, so that one constructor could support numerous different verbs. Such WD entries don't exist at present and phrasal verbs which have associated prepositions or particles which determine their meaning would cause complications in English and German. Now I think that the constructors have to contain this verbal knowledge; for instance perhaps a constructor for a clause will have a reference to the lexeme of each verb which it can use, as well as information about the types of objects or prepositional phrase(s) which it can take.
- A solution for adjectives is needed; the abstract text must give a language-independent specifier for the adjective. I suggest that one of the senses of the lexeme should be linked through WD property P5137 "item for this sense" to the WD item for the noun corresponding to the given adjective. Then the lexeme of the adjective can be found.
- It is difficult to know how to handle a multi-word expression like "Carolingian Empire" which has its own item, Q31929, in WD, unless the name "Carolingian Empire" is actually a lexeme in each language. But surely we should only have lexemes for the words "Carolingian" and "Empire", not for every expression combining the words? Perhaps the Carolingian Empire item cannot be used in the abstract text, but instead the expression needs to be a noun phrase built up out of "Carolingian" and "Empire". On the other hand in German the whole phrase corresponds to a single word: "Karolingerreich". Another example is Q274151 "fried egg", which corresponds to the single word "Spiegelei" in German. Many species of birds have single-word names in German, but note that also the egg of each species has a single-word designation, and similarly with the beak or wing etc. I think there cannot be items in WD for all those combinations. For me the only consistent solution to this problem is to allow one item in WD to correspond to multiple lexemes, using qualifiers to specify an order number and an indication of the grammatical role. So in English there would be no lexeme for "Carolingian Empire", but the lexeme for "Carolingian" would point to Q31929 with qualifiers "word order" = 1 and "role" = "adjective" and the lexeme for "Empire" would point to Q31929 with qualifiers "word order" = 2 and "role" = "noun". This is a significant issue which would need rules to be formulated.
- The best WD item which I could find to express the fall of an empire was Q3042783 "societal collapse". But when we want to refer to "the collapse of the Carolingian Empire" it is not very satisfactory to have as rendered result "the societal collapse of the Carolingian Empire"; the word "societal" should be omitted in this case. But in other contexts the full phrase "societal collapse" should be used. I am not sure how to deal with this.
- A policy is needed for dealing with cases where a noun has gender variants (such as amigo/amiga in Spanish, ami/amie in French and Freund/Freundin in German). In the WD lexeme part at present in Spanish the words are different gender forms of one lexeme whereas in French and German the masculine and feminine forms belong to different lexemes (and so I think they need to be linked to different WD items). This decision is likely to affect the abstract text.
- The abstract text should specify an "article type" for the sake of those languages which have definite and indefinite articles. But this cannot just be definite/indefinite/none as the usage varies between languages. As an example, take "birds have feathers" (English), "los pájaros tienen plumas" (Spanish) and "les oiseaux ont des plumes" (French). I suggest the value "generalClass" for the article of "birds" in the abstract text; then each language could pick a different solution.
- So far the proposal is oriented to a small set of Indo-European languages. It will be necessary to work out how to extend the set, presumably by providing more types of syntactical information in the abstract text. Incidentally, I think it might be useful to have sub-projects which would render text only for particular groups of languages such as "Slavic" or "Finno-Ugric" ones. The "real" Multilingual Wikipedia project (which would aim to cover all languages) would move much more slowly than the sub-projects.
Conclusion
My conclusion is that such a system is possible in principle (if the quality of the generated text does not have to be too high) but it would be very complicated to implement. For people to be able to contribute casually, I think that a very detailed specification of how it would work would be needed and there would have to be an enormous amount of co-ordination activity.