Abstract Wikipedia/Canonical and normal

From Meta, a Wikimedia project coordination wiki

We have recently had a few discussions about the canonical and normal forms of the JSON representation of ZObjects.

For most contributors and users of Wikifunctions, it should really not matter: the UX should effectively shield them from the JSON representation. But for those cases where it matters, here is a bit of background.

Many different JSON objects can describe the same ZObject. Two forms are well defined: the normal form and the canonical form. These are the forms the system itself uses. Contributors may sometimes write something in between, and all parts of the system should accept it.

Normalization should return the same normal form for every well-formed representation of a given ZObject. Likewise, canonicalization should return the same canonical form for every well-formed representation. Most places therefore should not care what they receive, as long as it is well-formed: they can simply canonicalize or normalize the input and then work with it in the form they prefer.

Internally, we store the canonical form in the database and preferably use the canonical form when displaying a ZObject to the user. The normal form is used in almost all other cases, particularly when processing the data in the orchestrator or in the front end.

Finally, there is so-called labelization. This simply replaces every ZID or key reference with its label in the given language. It is usually a lossy step: an un-labelizer cannot always recover the original identifiers. Labelization is purely for output purposes. In the examples that follow, the labelized version is shown first, followed by the raw version.

Here's a function call in normal form:

Human-readable form:

{
  "type": {
    "type": "Reference",
    "reference ID": "Function call"
  },
  "function": {
    "type": "Reference",
    "reference ID": "Head"
  },
  "list": {
    "type": "List",
    "head": {
      "type": "String",
      "value": "a"
    },
    "tail": {
      "type": "List"
    }
  }
}

Normal form:

{
  "Z1K1": {
    "Z1K1": "Z9",
    "Z9K1": "Z7"
  },
  "Z7K1": {
    "Z1K1": "Z9",
    "Z9K1": "Z811"
  },
  "Z811K1": {
    "Z1K1": "Z10",
    "Z10K1": {
      "Z1K1": "Z6",
      "Z6K1": "a"
    },
    "Z10K2": {
      "Z1K1": "Z10"
    }
  }
}
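
As a rough illustration of labelization, the following Python sketch replaces ZIDs and key references with labels. The `LABELS` table and the `labelize` function are hypothetical names made up for this example, not part of Wikifunctions, and the table covers only the identifiers used in the example above. The sketch makes no attempt to go back from labels to ZIDs.

```python
# Hypothetical English label table, limited to the identifiers used above.
LABELS = {
    "Z1K1": "type",
    "Z6": "String",
    "Z6K1": "value",
    "Z7": "Function call",
    "Z7K1": "function",
    "Z9": "Reference",
    "Z9K1": "reference ID",
    "Z10": "List",
    "Z10K1": "head",
    "Z10K2": "tail",
    "Z811": "Head",
    "Z811K1": "list",
}


def labelize(z, labels=LABELS):
    """Replace known ZIDs and key references in z with their labels."""
    if isinstance(z, str):
        # Unknown strings (such as the literal "a") pass through unchanged.
        return labels.get(z, z)
    if isinstance(z, list):
        return [labelize(item, labels) for item in z]
    # The payload of an explicit string (Z6K1) is data, so it is left alone.
    return {
        labels.get(key, key): value if key == "Z6K1" else labelize(value, labels)
        for key, value in z.items()
    }


normal_call = {
    "Z1K1": {"Z1K1": "Z9", "Z9K1": "Z7"},
    "Z7K1": {"Z1K1": "Z9", "Z9K1": "Z811"},
    "Z811K1": {
        "Z1K1": "Z10",
        "Z10K1": {"Z1K1": "Z6", "Z6K1": "a"},
        "Z10K2": {"Z1K1": "Z10"},
    },
}

labelized_call = labelize(normal_call)
```

Because the function only renames identifiers, it works on any form of the JSON, whether long, short, or something in between.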

That same function call in canonical form:

Human-readable form:

{
  "type": "Function call",
  "function": "Head",
  "list": [ "a" ]
}

Canonical form:

{
  "Z1K1": "Z7",
  "Z7K1": "Z811",
  "Z811K1": [ "a" ]
}
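
To make the relationship between the two forms concrete, here is a minimal Python sketch of canonicalization. It covers only references (Z9), strings (Z6), and lists (Z10) as they appear in the example above. The function names and the regular expression for ZIDs are assumptions made for illustration, not the actual Wikifunctions implementation. One design choice in the sketch: a string whose content looks like a ZID stays in its explicit form, so that it is not mistaken for a reference.

```python
import re

# Assumed shape of ZIDs and key references, such as "Z7" or "Z811K1".
ZID = re.compile(r"^Z[1-9]\d*(K[1-9]\d*)?$")


def type_of(z):
    """Return the type ZID of a ZObject dict, whether Z1K1 is short or long."""
    t = z.get("Z1K1")
    return t.get("Z9K1") if isinstance(t, dict) else t


def canonicalize(z):
    """Collapse a ZObject (in any mix of forms) into the short form."""
    if isinstance(z, str):
        return z
    if isinstance(z, list):
        return [canonicalize(item) for item in z]
    kind = type_of(z)
    if kind == "Z9":
        # A reference collapses to its bare ZID.
        return z["Z9K1"]
    if kind == "Z6":
        value = z["Z6K1"]
        # Keep ZID-like strings explicit so they are not read as references.
        return {"Z1K1": "Z6", "Z6K1": value} if ZID.match(value) else value
    if kind == "Z10":
        # Walk the head/tail chain and collect the elements into an array.
        items = []
        while isinstance(z, dict) and "Z10K1" in z:
            items.append(canonicalize(z["Z10K1"]))
            z = z.get("Z10K2", {})
        if isinstance(z, list):
            # The tail was already written in short form.
            items.extend(canonicalize(z))
        return items
    return {key: canonicalize(value) for key, value in z.items()}


normal_call = {
    "Z1K1": {"Z1K1": "Z9", "Z9K1": "Z7"},
    "Z7K1": {"Z1K1": "Z9", "Z9K1": "Z811"},
    "Z811K1": {
        "Z1K1": "Z10",
        "Z10K1": {"Z1K1": "Z6", "Z6K1": "a"},
        "Z10K2": {"Z1K1": "Z10"},
    },
}

canonical_call = canonicalize(normal_call)
```

Applying `canonicalize` to its own output leaves it unchanged, which is the property that lets callers canonicalize whatever they receive without first checking which form it is in.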

The same function call in what most software developers would consider a reasonably readable form:

Head(["a"])

This calls the function Head with a single argument: a list whose only element is the string "a". Head returns the first element of a list, so the result of that function call would be, in normal form:

Human-readable form:

{
  "type": "String",
  "value": "a"
}

Normal form:

{
  "Z1K1": "Z6",
  "Z6K1": "a"
}

or, in canonical form, simply:

"a"
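
The expansion from the short form to the long form can be sketched in Python as follows. This is an illustration under assumptions, not the actual Wikifunctions code: the `normalize` and `type_of` names and the ZID pattern are hypothetical, and the sketch follows the layout of the examples above, in which the `Z1K1` of a Z6, Z9, or Z10 object stays a bare string.

```python
import re

# Assumed shape of ZIDs and key references, such as "Z7" or "Z811K1".
ZID = re.compile(r"^Z[1-9]\d*(K[1-9]\d*)?$")


def type_of(z):
    """Return the type ZID of a ZObject dict, whether Z1K1 is short or long."""
    t = z.get("Z1K1")
    return t.get("Z9K1") if isinstance(t, dict) else t


def normalize(z):
    """Expand a ZObject (in any mix of forms) into the long form."""
    if isinstance(z, str):
        if ZID.match(z):
            # A bare ZID is shorthand for a reference.
            return {"Z1K1": "Z9", "Z9K1": z}
        # Any other bare string is shorthand for a string object.
        return {"Z1K1": "Z6", "Z6K1": z}
    if isinstance(z, list):
        # Build the head/tail chain from the back, ending in an empty list.
        result = {"Z1K1": "Z10"}
        for item in reversed(z):
            result = {"Z1K1": "Z10", "Z10K1": normalize(item), "Z10K2": result}
        return result
    kind = type_of(z)
    if kind == "Z6":
        return {"Z1K1": "Z6", "Z6K1": z["Z6K1"]}
    if kind == "Z9":
        return {"Z1K1": "Z9", "Z9K1": z["Z9K1"]}
    if kind == "Z10":
        result = {"Z1K1": "Z10"}
        for key in ("Z10K1", "Z10K2"):
            if key in z:
                result[key] = normalize(z[key])
        return result
    return {key: normalize(value) for key, value in z.items()}


canonical_call = {"Z1K1": "Z7", "Z7K1": "Z811", "Z811K1": ["a"]}

normal_result = normalize("a")
normal_call = normalize(canonical_call)
```

Here `normal_result` is the two-key string object shown above, and `normal_call` has the same structure as the long form of the Head call. Normalizing input that is already in the long form returns it unchanged.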

Because we sometimes get confused by the terminology, we also call the normal form the “long” or “explicit” form. The term “normal” is inspired by the concept of database normalization. The canonical form we also call the “short” or “compact” form, as it is usually much shorter. Canonicalization is the usual term for this step in computer science. Note that in computer science canonicalization and normalization are often used as synonyms, which is why the terms “short” and “long” form seem to lead to less confusion.