Modular Wiktionary

From Meta, a Wikimedia project coordination wiki

The Modular Wiktionary is a proposal for restructuring Wiktionary to make it more useful, more powerful, and easier to contribute to. It shares many of the goals of Ultimate Wiktionary, and can be thought of both as an implementation of the ultimate Wiktionary and as a similarly grand proposal for what Wiktionary could be.

Vision

We have a functioning, growing Wiktionary community. Any changes to the software must be geared toward building on the information already in Wiktionary and on the community working on it. The Modular Wiktionary seeks a step-wise refinement of the current Wiktionary. Each step should improve the functionality for readers and for writers and editors, without disrupting the Wiktionary communities by overwhelming them with new features.

In the end, this proposal will reduce the tremendous amount of information duplication in the current Wiktionary, giving each reader access to a vast amount of information while greatly increasing the ease with which editors can add information. The result will be a dictionary, a many-language-to-many-language translating dictionary, a thesaurus, and much more.

The updates to the "Main Entries" must come first; it is this update that makes the later ones possible. The additions of Concepts and of Word Forms or Paradigms are largely independent of each other, although each will benefit from the other. For ease of exposition, XML examples are provided below, but the same effect could be achieved in a relational database.

Main Entries

The Main Entries are roughly equivalent to the current Wiktionaries in each language, except the data is stored in a structured format (relational database or XML). This module is where language-specific information is stored, such as definitions in the same language, usage information, example sentences, etc. It also forms the link between the Concept Map and the Word Forms/Paradigms (see below). The structure for this particular module can probably be adapted from an existing XML or relational format, but a simple example is shown here:

<entry lang="en" text="bank" speechpart="noun">
   <sub-entry>
      <text>A financial institution where you can 
            save and borrow money.
      </text>
      <example>I have a savings account at the bank.</example>
      <etymology>...</etymology>
      <usage>...</usage>
      <translation lang="es">banco (1)</translation>
   </sub-entry>
   <sub-entry>
       <text>The sloping edges of a river or lake.</text>
       <example>I fished from the river bank.</example>
   </sub-entry>
   <case name="plural" text="banks" />
</entry>
<entry lang="en" text="ox" speechpart="noun">
   <sub-entry>
       <text>A large bovine beast of burden</text>
       <example>The ox pulled the plow for the farmer.</example>
   </sub-entry>
   <case name="plural" text="oxen" />
</entry>

What the user will see

This step could be achieved without much change using the proposed Wikidata project. As in the screenshot on that page, instead of a single text field whose formatting conventions they must memorize, editors will see multiple fields that prompt them for how to enter information on a word. For example, instead of entering ==English==, there will be a combo box where a user can pick from a list of languages or add one.

Readers will see entries assembled much as they are now.

How it helps

Because the information is separated out, new templates for words can be easily applied across all words without reediting the entire Wiktionary. Fine-grained searching becomes possible: readers can specify which languages they want to search for words in, easily finding all words in a particular language, all adverbs, etc.
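
As a rough sketch of the fine-grained searching described above, the following Python parses entries in the example format and filters them by language and part of speech. The wrapping `<entries>` root element, the `find_entries` helper, and the "quickly" sample entry are assumptions made for illustration; they are not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Entries in the example format from this page. The <entries> root
# element is an assumption, added so the fragment parses as one document.
SAMPLE = """
<entries>
  <entry lang="en" text="bank" speechpart="noun"/>
  <entry lang="en" text="ox" speechpart="noun"/>
  <entry lang="en" text="quickly" speechpart="adverb"/>
  <entry lang="es" text="banco" speechpart="noun"/>
</entries>
"""

def find_entries(root, lang=None, speechpart=None):
    """Return the text of every entry matching the given attributes."""
    matches = []
    for entry in root.iter("entry"):
        if lang is not None and entry.get("lang") != lang:
            continue
        if speechpart is not None and entry.get("speechpart") != speechpart:
            continue
        matches.append(entry.get("text"))
    return matches

root = ET.fromstring(SAMPLE)
english_nouns = find_entries(root, lang="en", speechpart="noun")
all_adverbs = find_entries(root, speechpart="adverb")
```

Because language and part of speech are stored as separate attributes rather than as formatting inside free text, each filter is a plain attribute comparison.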

While a more complete system for Word Forms (e.g., plural, past participle, etc.) is still to come below, a search function could be added at this point that would allow users who land on the page for "eaten" to be redirected to "eat" automatically, without anyone having to create redirects or stub entries for every single word form. Users who go to "leads" could be shown the entries for "lead" as a verb ("he leads the children") and "lead" as a noun ("this play has two leads"), but not "lead" as in the element, because that sense doesn't have a form "leads".
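
One way such a search could work is an index from every known form to the entries that list it. The sketch below assumes the forms come from the `<case>` elements of the Main Entries; the sample data, sense labels, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (headword, sense label, known forms).
# In the proposal these forms would come from the <case> elements.
ENTRIES = [
    ("eat", "verb: to consume food", ["eat", "eats", "eating", "ate", "eaten"]),
    ("lead", "verb: to guide", ["lead", "leads", "leading", "led"]),
    ("lead", "noun: a principal role", ["lead", "leads"]),
    ("lead", "noun: the metallic element", ["lead"]),
]

def build_form_index(entries):
    """Map every surface form to the (headword, sense) pairs that have it."""
    index = defaultdict(list)
    for headword, sense, forms in entries:
        for form in forms:
            index[form].append((headword, sense))
    return index

def lookup(index, query):
    """Return matching entries; an unknown form yields an empty list."""
    return index.get(query, [])

index = build_form_index(ENTRIES)
eaten_hits = lookup(index, "eaten")   # only the entry for "eat"
leads_hits = lookup(index, "leads")   # verb and role senses, not the element
```

The element sense of "lead" is excluded simply because "leads" is not among its forms, so no special-case rule is needed.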

Concept Map

Note: there's a similar concept in each of the Tables for Wiktionary-proposals.

The concept map is a language-independent ontology of all concepts that humans have words and definitions for. Each definition of a word (from the Main Entries section) maps to a "concept". Concepts can then be related to each other using relations. For example: direct antonym ("hot" to "cold"), sister ("chilly" to "cold"), parent ("raining" to "drizzling"), etc. (See WordNet for a more formal discussion of these relations.) If two definitions of words from different languages map to the same concept, that means they have the same meaning.

<concept id="1"/> <!--financial bank-->
<concept id="2"/> <!--river bank-->
<concept id="3"/> <!--oxen-->
<concept id="4"/> <!--institution-->
<concept id="5"/> <!--animal-->
<relation type="parent" concept1="4" concept2="1"/>
<relation type="parent" concept1="5" concept2="3"/>

The concept map reduces or eliminates the need for every word in every language to be defined in every language. Instead, every word must be defined in one language and linked to concepts that are defined in other languages. All Wiktionary databases will be combined in this step. Wiktionary sub-domains can be kept as a shortcut for choosing the language of the interface and definitions (as per the current setup); polyglot readers could view the Wiktionaries of many languages at once, if they so choose, using the interface language of their choice.

What the user will see

The reader will see the same thing, with perhaps a new "concept map" interface to view interrelated concepts.

For the editor, the interface for adding words will not change much, but a new interface will be needed for easy manipulation of concepts and their relations.

How it helps

For the reader, the biggest change will be that, due to their automated nature, translations into many languages and synonyms and antonyms will be more prevalent.

For the editor: if I have 10 words, from 10 different languages, all with the same meaning, instead of defining all 10 words in each of 10 Wiktionaries (100 entries), I merely relate all 10 words to the same concept, each with only a language-specific definition (10 entries). Now, anybody who speaks any of those languages has a definition of all 10 words. Anybody who can relate a word in an 11th language to the same concept automatically gets translations into 10 languages for the new word (plus an additional translation for each of the existing entries).
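
The effect can be sketched in a few lines of Python. The links from words to concept 1 (the financial "bank" concept from the example above) are illustrative sample data, and the `translations` function is a hypothetical helper, not part of the proposal.

```python
from collections import defaultdict

# Hypothetical links from (language, word) to a concept id. Concept 1 is
# the "financial bank" concept from the example on this page.
LINKS = [
    ("en", "bank", 1),
    ("es", "banco", 1),
    ("fr", "banque", 1),
]

def translations(links, lang, word):
    """Return {language: [words]} sharing a concept with the given word."""
    by_concept = defaultdict(list)
    for link_lang, link_word, concept in links:
        by_concept[concept].append((link_lang, link_word))
    concepts = [c for ln, w, c in links if (ln, w) == (lang, word)]
    result = defaultdict(list)
    for concept in concepts:
        for other_lang, other_word in by_concept[concept]:
            if other_lang != lang:
                result[other_lang].append(other_word)
    return dict(result)

before = translations(LINKS, "en", "bank")

# Linking one word in a new language gives it translations into every
# language already linked, and gives each existing word one more.
LINKS.append(("de", "Bank", 1))
after_de = translations(LINKS, "de", "Bank")
after_en = translations(LINKS, "en", "bank")
```

Nobody writes a translation explicitly here: each one falls out of two words pointing at the same concept, which is why a single new link updates every language at once.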

Beyond direct translation, the concept map has other uses. Suppose that somebody defines the word "foo" in a language you don't know using that same language. If they also relate that definition of "foo" to a parent concept, say, "flower", then you know that "foo" is a kind of a flower, even though you don't know what kind. The concept map also acts as a thesaurus.
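
A minimal sketch of following parent relations, using the concepts and relations from the example above. The `<map>` root element and the `label` attribute are assumptions added for readability, and the code assumes that `concept1` is the parent and `concept2` the child, as the example relations suggest.

```python
import xml.etree.ElementTree as ET

# Concepts and relations copied from the example on this page. The <map>
# root element and the "label" attribute are assumptions, added so the
# fragment parses and the output is readable.
CONCEPT_MAP = """
<map>
  <concept id="1" label="financial bank"/>
  <concept id="3" label="ox"/>
  <concept id="4" label="institution"/>
  <concept id="5" label="animal"/>
  <relation type="parent" concept1="4" concept2="1"/>
  <relation type="parent" concept1="5" concept2="3"/>
</map>
"""

def parents_of(root, concept_id):
    """Return labels of concepts related as parent to the given concept.

    Assumes concept1 is the parent and concept2 the child.
    """
    labels = {c.get("id"): c.get("label") for c in root.iter("concept")}
    return [
        labels[r.get("concept1")]
        for r in root.iter("relation")
        if r.get("type") == "parent" and r.get("concept2") == concept_id
    ]

root = ET.fromstring(CONCEPT_MAP)
ox_parents = parents_of(root, "3")
bank_parents = parents_of(root, "1")
```

A reader who cannot understand a word's definition could still be shown its parent concept in their own language, as in the "foo" and "flower" case.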

Word Forms/Paradigms

This part of the proposal is meant to address morphology. That is, it addresses the fact that in the English Wiktionary we need to have separate entries for "cats" and "cat"; "eat", "eating", "eats", "eaten", "ate", etc. In other languages, a single word may have more than a hundred spellings depending on its grammatical features. It takes a lot of work to make each entry, but if they aren't there, somebody who doesn't already know English and wants to know what "ate" means won't know to look it up at "eat".

Word Forms

Every language has certain features with possible values, and Word Forms have one of those possible values. For example, one feature of English nouns is "Number", and possible values for this feature are "Singular" and "Plural". In other languages, possible values for Number include "Dual" (2 items). In many languages, nouns have a "Gender" associated with them.

As part of starting a new Wiktionary, you would define which features and values are possible in the new language (as data). Main Entries (essentially equivalent to the definitions in the current Wiktionary entries) are then associated with multiple Word Forms for different feature values (singular: "foot", plural: "feet"). A search for any of the forms will return that entry.

<feature id="1" name="Part of Speech">
   <value id="1" name="Noun"/>
   <value id="7" name="Verb"/>
</feature>
<feature id="2" name="Number">
   <value id="2" name="Singular"/>
   <value id="3" name="Plural"/>
</feature>
<wordform id="1" text="bank">
   <value id="1"/><value id="2"/>
</wordform>
<wordform id="2" text="banks">
   <value id="1"/><value id="3"/>
</wordform>
<wordform id="3" text="ox">
   <value id="1"/><value id="2"/>
</wordform>
<wordform id="4" text="oxen">
   <value id="1"/><value id="3"/>
</wordform>

Note that this is similar to part of the proposal in Vortaro tables. It differs in that the Vortaro tables have language features and values built in to the table structure, whether or not each language has those features, but no place for defining new features (as data).
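
Because features and values are stored as data, looking up word forms needs no language-specific code. The sketch below mirrors the value ids and word forms from the XML example above as Python dictionaries; the `describe` and `forms_with` helpers are hypothetical names.

```python
# Feature values from the example on this page, keyed by value id.
VALUES = {1: "Noun", 2: "Singular", 3: "Plural", 7: "Verb"}

# Word forms from the example: text mapped to the set of value ids it has.
WORDFORMS = {
    "bank": {1, 2},
    "banks": {1, 3},
    "ox": {1, 2},
    "oxen": {1, 3},
}

def describe(text):
    """Return the sorted feature-value names attached to a word form."""
    return sorted(VALUES[v] for v in WORDFORMS[text])

def forms_with(*value_ids):
    """Return all word forms that carry every one of the given values."""
    wanted = set(value_ids)
    return sorted(t for t, vals in WORDFORMS.items() if wanted <= vals)

oxen_features = describe("oxen")
plural_nouns = forms_with(1, 3)
```

Supporting a language with a dual number or grammatical gender would mean adding entries to `VALUES`, not changing the functions.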

Paradigms

Paradigms are a slightly more sophisticated way of handling morphology. With a Paradigm, we define a standard set of changes that happen to a word when it has different features. For example, a standard Paradigm for English nouns could be "add s to make a plural" (cat, cats). Another, rarer paradigm for English nouns could be "add -en to make a plural" (ox, oxen). We can use Paradigms to prompt the user for likely values for different Word Forms, either caching each word form in a database or just storing the Paradigm.

<paradigm id="1" lang="en"> <!--Default Noun-->
   <value id="1"/> <!--Noun-->
   <wordform text="_"> <!--underscore inserts root form-->
      <value id="2"/> <!--Singular-->
   </wordform>
   <wordform text="_s">
      <value id="3"/> <!--Plural-->
   </wordform>
</paradigm>
<paradigm id="2" lang="en"> <!--Oxen-->
   <value id="1"/> <!--Noun-->
   <wordform text="_">
      <value id="2"/> <!--Singular-->
   </wordform>
   <wordform text="_en">
      <value id="3"/> <!--Plural-->
   </wordform>
</paradigm>
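
Applying a Paradigm amounts to substituting the root form for the underscore in each template. The Python sketch below does this for the two paradigms above; the `<paradigms>` root element, the `VALUE_NAMES` table, and the `expand` function are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The two paradigms from the example on this page, wrapped in an assumed
# <paradigms> root element so they parse as a single document.
PARADIGMS = """
<paradigms>
  <paradigm id="1" lang="en">
    <value id="1"/>
    <wordform text="_"><value id="2"/></wordform>
    <wordform text="_s"><value id="3"/></wordform>
  </paradigm>
  <paradigm id="2" lang="en">
    <value id="1"/>
    <wordform text="_"><value id="2"/></wordform>
    <wordform text="_en"><value id="3"/></wordform>
  </paradigm>
</paradigms>
"""

# Names for the value ids used inside the word forms above.
VALUE_NAMES = {"2": "Singular", "3": "Plural"}

def expand(root, paradigm_id, stem):
    """Generate {value name: form} by substituting the stem for '_'."""
    for paradigm in root.iter("paradigm"):
        if paradigm.get("id") == paradigm_id:
            forms = {}
            for wordform in paradigm.findall("wordform"):
                value_id = wordform.find("value").get("id")
                template = wordform.get("text")
                forms[VALUE_NAMES[value_id]] = template.replace("_", stem)
            return forms
    raise KeyError(f"no paradigm with id {paradigm_id}")

root = ET.fromstring(PARADIGMS)
cat_forms = expand(root, "1", "cat")
ox_forms = expand(root, "2", "ox")
```

The generated forms are best treated as suggestions for the editor to confirm, since an irregular word assigned to the wrong paradigm would produce a wrong form.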

Putting It All Together

Along the way, the Main Entries can be updated to include the progress in Concepts and Word Forms/Paradigms. The translations and cases in the Main Entries will no longer be needed, having been replaced by Concepts and Paradigms. (There is still some question as to whether the "definitions" of an entry need to be grouped under that entry, but this hierarchy has been retained in the example below for parallelism.)

<entry lang="en" text="bank">
   <sub-entry concept="1" paradigm="1">
      <text>A financial institution where you can 
            save and borrow money.
      </text>
      <example>I have a savings account at the bank.</example>
   </sub-entry>
   <sub-entry concept="2" paradigm="1">
       <text>The sloping edges of a river or lake.</text>
       <example>I fished from the river bank.</example>
   </sub-entry>
</entry>
<entry lang="en" text="ox">
   <sub-entry concept="3" paradigm="2">
       <text>A large bovine beast of burden</text>
       <example>The ox pulled the plow for the farmer.</example>
   </sub-entry>
</entry>
<entry lang="es" text="banco">
   <sub-entry concept="1" paradigm="250"/>
</entry>

For a corresponding view of what this might look like as relational database tables, check out: Image:Modular tables.JPG.
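
As a rough relational sketch of the combined example, the Python below loads the sub-entries into a single in-memory SQLite table and finds English definitions for the Spanish word "banco" through their shared concept. The `sub_entry` table and its columns are a hypothetical, flattened layout chosen for brevity; they are not taken from the linked image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-table layout holding the sub-entries from the example.
conn.executescript("""
CREATE TABLE sub_entry (
    lang TEXT, word TEXT, concept INTEGER, paradigm INTEGER, definition TEXT
);
INSERT INTO sub_entry VALUES
    ('en', 'bank', 1, 1,
     'A financial institution where you can save and borrow money.'),
    ('en', 'bank', 2, 1, 'The sloping edges of a river or lake.'),
    ('en', 'ox', 3, 2, 'A large bovine beast of burden'),
    ('es', 'banco', 1, 250, NULL);
""")

def definitions_for(word, word_lang, reader_lang):
    """Find definitions in reader_lang that share a concept with the word."""
    rows = conn.execute(
        """
        SELECT target.word, target.definition
        FROM sub_entry AS source
        JOIN sub_entry AS target ON target.concept = source.concept
        WHERE source.word = ? AND source.lang = ?
          AND target.lang = ? AND target.definition IS NOT NULL
        ORDER BY target.word
        """,
        (word, word_lang, reader_lang),
    )
    return rows.fetchall()

banco_in_english = definitions_for("banco", "es", "en")
```

The Spanish sub-entry carries no definition of its own, yet an English reader still gets one, and only for the financial sense, because the river-bank sense maps to a different concept.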

And Beyond

Further refinement of the Wiktionary software could happen: Concepts and Paradigms will almost surely require refinement with use. Phonetics, language varieties (i.e. dialects, sociolects, etc.), and etymology are all areas in which Wiktionary could probably improve. But we should not try to do everything at once. Even if we could do it technically, it would probably be quite a shock to all the Wiktionary editors.
