Jump to content

Abstract Wikipedia/Typed lists

From Meta, a Wikimedia project coordination wiki

Five options how to deal with typed lists.

Situation description

[edit]

One of the nice things about using JSON is that, in canonical form, we used the Array notation, which made lists much more compact to represent.

The list of five strings, which in the canonical form looks like this:

[ "a", "b", "c", "d", "e" ]

Looks like this in the normal form:

{
  "Z1K1": {
    "Z1K1": "Z9",
    "Z9K1": "Z10"
  },
  "Z10K1": {
    "Z1K1": "Z6",
    "Z6K1": "a"
  },
  "Z10K2": {
    "Z1K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z10"
    },
    "Z10K1": {
      "Z1K1": "Z6",
      "Z6K1": "b"
    },
    "Z10K2": {
      "Z1K1": {
        "Z1K1": "Z9",
        "Z9K1": "Z10"
      },
      "Z10K1": {
        "Z1K1": "Z6",
        "Z6K1": "c"
      },
      "Z10K2": {
        "Z1K1": {
          "Z1K1": "Z9",
          "Z9K1": "Z10"
        },
        "Z10K1": {
          "Z1K1": "Z6",
          "Z6K1": "d"
        },
        "Z10K2": {
          "Z1K1": {
            "Z1K1": "Z9",
            "Z9K1": "Z10"
          },
          "Z10K1": {
            "Z1K1": "Z6",
            "Z6K1": "e"
          },
          "Z10K2": {
            "Z1K1": {
              "Z1K1": "Z9",
              "Z9K1": "Z10"
            }
          }
        }
      }
    }
  }
}

This makes the JSON not only easier to read, but also requires fewer bytes when storing.

Now, we have introduced typed lists. We have a function, Z881, which creates a list, whose elements are all of a specific type. So, for example, if in the above case, we had a typed list of strings, this is how the normal form of that looks like (more or less):

{
  "Z1K1": {
    "Z1K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z7"
    },
    "Z7K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z881"
    },
    "Z881K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z6"
    }
  },
  "K1": {
    "Z1K1": "Z6",
    "Z6K1": "a"
  },
  "K2": {
    "Z1K1": {
      "Z1K1": {
        "Z1K1": "Z9",
        "Z9K1": "Z7"
      },
      "Z7K1": {
        "Z1K1": "Z9",
        "Z9K1": "Z881"
      },
      "Z881K1": {
        "Z1K1": "Z9",
        "Z9K1": "Z6"
      }
    },
    "K1": {
      "Z1K1": "Z6",
      "Z6K1": "b"
    },
    "K2": {
      "Z1K1": {
        "Z1K1": {
          "Z1K1": "Z9",
          "Z9K1": "Z7"
        },
        "Z7K1": {
          "Z1K1": "Z9",
          "Z9K1": "Z881"
        },
        "Z881K1": {
          "Z1K1": "Z9",
          "Z9K1": "Z6"
        }
      },
      "K1": {
        "Z1K1": "Z6",
        "Z6K1": "c"
      },
      "K2": {
        "Z1K1": {
          "Z1K1": {
            "Z1K1": "Z9",
            "Z9K1": "Z7"
          },
          "Z7K1": {
            "Z1K1": "Z9",
            "Z9K1": "Z881"
          },
          "Z881K1": {
            "Z1K1": "Z9",
            "Z9K1": "Z6"
          }
        },
        "K1": {
          "Z1K1": "Z6",
          "Z6K1": "d"
        },
        "K2": {
          "Z1K1": {
            "Z1K1": {
              "Z1K1": "Z9",
              "Z9K1": "Z7"
            },
            "Z7K1": {
              "Z1K1": "Z9",
              "Z9K1": "Z881"
            },
            "Z881K1": {
              "Z1K1": "Z9",
              "Z9K1": "Z6"
            }
          },
          "K1": {
            "Z1K1": "Z6",
            "Z6K1": "e"
          },
          "K2": {
            "Z1K1": {
              "Z1K1": {
                "Z1K1": "Z9",
                "Z9K1": "Z7"
              },
              "Z7K1": {
                "Z1K1": "Z9",
                "Z9K1": "Z881"
              },
              "Z881K1": {
                "Z1K1": "Z9",
                "Z9K1": "Z6"
              }
            }
          }
        }
      }
    }
  }
}

Which wouldn’t be too terrible, but the canonical version of that looks almost as bad:

{
  "Z1K1": {
    "Z1K1": "Z7",
    "Z7K1": "Z881",
    "Z881K1": "Z6"
  },
  "K1": "a",
  "K2": {
    "Z1K1": {
      "Z1K1": "Z7",
      "Z7K1": "Z881",
      "Z881K1": "Z6"
    },
    "K1": "b",
    "K2": {
      "Z1K1": {
        "Z1K1": "Z7",
        "Z7K1": "Z881",
        "Z881K1": "Z6"
      },
      "K1": "c",
      "K2": {
        "Z1K1": {
          "Z1K1": "Z7",
          "Z7K1": "Z881",
          "Z881K1": "Z6"
        },
        "K1": "d",
        "K2": {
          "Z1K1": {
            "Z1K1": "Z7",
            "Z7K1": "Z881",
            "Z881K1": "Z6"
          },
          "K1": "e",
          "K2": {
            "Z1K1": {
              "Z1K1": "Z7",
              "Z7K1": "Z881",
              "Z881K1": "Z6"
            }
          }
        }
      }
    }
  }
}

And that is a bit painful. (Also, it is about 750 bytes versus 25 bytes otherwise.)

The question is, can we somehow regain how succinct JSON arrays are, without too much pain at this point?

Options

[edit]

Autoguess the type, live with ambiguity

[edit]

Denny’s original idea was that everytime we see a list such as:

[ "a", "b", "c", "d", "e" ]

in canonical form, i.e. a list whose elements are all of the same type, we assume in normalization that this is actually a typed list, not simply an untyped list.

It has the advantage that it is very short. Seriously, it's really neat to write by hand and read.

It has a number of disadvantages, the last one being very severe:

  1. Sometimes a list which is meant to be untyped will become typed just due to normalization and canonicalization, two operations which are meant to never change the meaning of an object. And whereas that is true, it should very rarely lead to issues, because anything that takes an untyped list could also accept a typed list, since a typed list is a strict subtype of an untyped list. That would need some special coding.
  2. But this leads to more type hierarchy, which is in general a difficult thing.
  3. On the other hand, when we have an empty typed list, and canonicalize that, there is nothing that tells us for what type that typed list is. We lose information. That should rarely lead to practical issues, because we could say “well, even if I expect a typed list of a specific type, if I get an empty list, who cares, I could assume it is an empty list of the type I expect”. That would need some special coding.
  4. The front end would need some way to escape something that looks like a typed list, but actually isn’t. We must be able to say “well, it thinks that this is a typed list, but really I want this to be an untyped list”.
  5. Most importantly, we would need to evaluate every single element of a list in order to figure out what type the list is. That means canonicalization and normalization suddenly require evaluation. Yikes. And also it can be very expensive to figure out what the type of the list is if the calculations happen to be expensive. And what with calculations that return an Error?

All in all, this seems like a very unstable and expensive solution, although it has by far the nicest serialization outcome.

By now, I think this Option would likely be a mistake. But I didn’t raise this whole issue sooner because I didn’t realize that this would be a mistake, and that we could just go on with this solution. Apologies!

Cory: This is what we currently do in Javascript code. The logic is complex and obviously can't recurse into functions, but at the end of the day, there's a heuristic that tells us (with perfect precision but necessarily imperfect recall) whether two types are "the same"; then, if all elements of a list have the same type, that becomes the Z881K1 in normal form. In all other cases (or when the list is empty), the type is just Z1.

Introduce new syntax for this one case

[edit]

We can introduce a completely new syntax for typed lists. Simone suggested the following in chat:

{
  "Z1K1": {
    "Z1K1": "Z7",
    "Z7K1": "Z881",
    "Z881K1": "Z6"
  },
  "*": [ "a", "b", "c", "d", "e" ]
}

This wouldn’t even be a well-formed object according to our rules, but it is a special case in the syntax for typed lists. The idea is that this syntax means the same as the typed list above.

(Note: Simone didn’t suggest exactly that, but his suggestion wasn’t valid JSON so I adapted it).

Advantage: it remains very compact. It is unambiguous. Work wise, Denny initially believed it would be just a change in the canonicalizer / normalizer, but it turns out this would be extremely labor-intensive.

Disadvantage: we need to special case this in the canonicalizer / normalizer, and it really doesn’t look like anything else. This will potentially be quite a bit of coding work.

Cory: This will be a massive amount of coding work. We assume everywhere that keys have a specific shape. Supporting "*" has repercussions throughout the code base and will doubtless surface many, many subtle bugs.

We could just use "K1" instead of "*" if we're going to special-case this, but this special case also breaks assumptions we have everywhere. In particular, it introduces a really bizarre non-isomorphism between the keys of a canonical form and the keys of a normal form. Given that we do allow some level of mixed forms, I worry that doing this will open the door to a lot of ambiguity and quirky behavior.

Geno: Totally agree with Cory here.

Change what it means to be a typed list

[edit]

Currently Z881 results in the following type:

{
  "Z1K1": "Z4",
  "Z4K1": {
    "Z1K1": "Z7",
    "Z7K1": "Z881",
    "Z881K1": "Z6"
  },
  "Z4K2": [
    {
      "Z1K1": "Z3",
      "Z3K1": "Z6",
      "Z3K2": {
        "Z1K1": "Z6",
        "Z6K1": "K1"
      },
      "Z3K3": {
        "Z1K1": "Z12",
        "Z12K1": []
      }
    },
    {
      "Z1K1": "Z3",
      "Z3K1": {
        "Z1K1": "Z7",
        "Z7K1": "Z881",
        "Z881K1": "Z6"
      },
      "Z3K2": {
        "Z1K1": "Z6",
        "Z6K1": "K2"
      },
      "Z3K3": {
        "Z1K1": "Z12",
        "Z12K1": []
      }
    }
  ],
  "Z4K3": "Z110"
}

I.e. Z881(Z6) results in a type that has two keys, one of type String, and the other of type Z881(Z6). This is what leads to the K1/K2 above.

What if, instead, the Z881(Z6) would result in the following type:

{
  "Z1K1": "Z4",
  "Z4K1": {
    "Z1K1": "Z7",
    "Z7K1": "Z881",
    "Z881K1": "Z6"
  },
  "Z4K2": [
    {
      "Z1K1": "Z3",
      "Z3K1": "Z10",
      "Z3K2": {
        "Z1K1": "Z6",
        "Z6K1": "K1"
      },
      "Z3K3": {
        "Z1K1": "Z12",
        "Z12K1": []
      }
    }
  ],
  "Z4K3": {
    "Z1K1": "Z7",
    "Z7K1": "Z891",
    "Z891K1": "Z6"
  }
}

This is a type with a single key that takes a standard untyped list, Z10. The subtle difference is that the validator in this case would be a new function that takes the list in K1 and checks for each element in the list that is indeed of that key, or else it raises a validation error.

We would need to implement Z891, which is a function taking an argument T of type Type, and returns a validator, i.e. a function that takes a list and returns it unaltered if all elements in the list are of type T, or else returns a validation error saying which elements failed the validation.

In this case, the actual list would look like this:

{
  "Z1K1": {
    "Z1K1": "Z7",
    "Z7K1": "Z881",
    "Z881K1": "Z6"
  },
  "K1": [ "a", "b", "c", "d", "e" ]
}

That is very similar to Option 2, but does not require any special Syntax handling. The array in K1 is a Z10/untyped list.

Advantage: it works cleanly with the existing function model and serialization.

Disadvantages:

That’s something we really should have realized a few months ago. It feels like there is a lot of work that has been done which can now be thrown out, and this might be potentially frustrating and demotivating to choose this option.

Also, the keys do not really have the type - it is just a validation step that ensures that the keys are indeed of that type. A UX needs to either understand and implement this pattern in order to support a neat UX, or it will let the user choose every type manually and then tell them afterwards that they chose poorly, because validation doesn’t pass (The WikiLambda frontend does that already, so I am not worried about this particular frontend).

The validation might be expensive (but by far not as expensive as in case of Option 1, because we don’t have to evaluate elements which are function calls, we can simply look up their return type. But the problem is with functions that return a Z1, this might be expensive or impossible, so we would cut our losses on that edge case).

Quite a few changes in the orchestrator and the frontend here.

Cory: I'm less upset about this one. It solves one of the issues with Option 2 (no longer breaks isomorphism between canonical/normal form). The semantics for the Z4 are kind of consistent with the rest of our function model, but not really. What we're saying here is "K1 says it's an X, but we magically decide that it's actually an array of X in this special case." if we're going to go this direction and support array syntax, we should just add an "array of" operator to Z4s, i.e. a way to say that "K1 is an array of elements of type X." Then, though, we have to ask ourselves why we're not just directly using arrays everywhere in the first place.

Leave as it is

[edit]

Well, yeah, painful, but we live with it.

Omit Z1K1 on Non-Top-Level Objects

[edit]

Since the Z4 returned by Z881 tells us everything we need to know about both K1 and K2, we can just omit Z1K1 on non-top-level objects. So a list of strings could just look like this:

{
  "Z1K1": {
    "Z1K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z7"
    },
    "Z7K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z881"
    },
    "Z881K1": {
      "Z1K1": "Z9",
      "Z9K1": "Z6"
    }
  },
  "K1": {
    "Z1K1": "Z6",
    "Z6K1": "a"
  },
  "K2": {
    "K1": {
      "Z1K1": "Z6",
      "Z6K1": "b"
    },
    "K2": {
      "K1": {
        "Z1K1": "Z6",
        "Z6K1": "c"
      },
      "K2": {
        "K1": {
          "Z1K1": "Z6",
          "Z6K1": "d"
        },
        "K2": {
          "K1": {
            "Z1K1": "Z6",
            "Z6K1": "e"
          },
          "K2": {
          }
        }
      }
    }
  }
}

A savings of ~80 lines over the object in Situation Description.

The major change here would be to add a single rule to the orchestrator: if Z1K1 is absent from a ZObject, look at its immediate parent's Z4 and infer the type of each key from Z4K2. The recursive calls to Z881 will be pretty trivial if we either 1) do caching or 2) trust that the result of a Z7 is always the same as a Z4 which has that Z7 as its identity (so we can immediately compare the recursive Z7s with the identity of the top-level Z4).

Geno: I think this option makes me worried for the same reason as Option 2, it requires a change in our understanding of the function model. Currently every sub-branch (the value of every key) must be a valid ZObject itself, which allows us to validate every level of a ZObject independently. This would break that assumption, which probably means big changes all around.

Cory: That's very true. It would mean a lot of back-and-forth calls to the orchestrator, at the very least, for the frontend to do any validation or work with types at all.

Denny: I feel like Option 5 is orthogonal to the other options, i.e. it might be valuable to adopt Option 5 independently of the array issue at hand here, as it would potentially save a lot of bytes and make the content potentially more readable. I assume it would apply only to the canonical format, not to the normalized one, and the normal one would still have the full type everywhere (which would make the example here even shorter! - I wrote it up below)

This doesn't address the issue of using arrays or not though, so these two seem orthogonal.

I kinda like the idea of dropping the Z1K1s when it is clear, but I also see Geno's points. So if we restrict it to: - only in the canonical version - only if the expected type is not a Z1 This might work.

{
  "Z1K1": {
    "Z1K1": "Z7",
    "Z7K1": "Z881",
    "Z881K1": "Z6"
  },
  "K1": "a",
  "K2": {
    "K1": "b",
    "K2": {
      "K1": "c",
      "K2": {
        "K1": "d",
        "K2": {
          "K1": "e",
          "K2": {}
        }
      }
    }
  }
}

Have the first element of a list be the type

[edit]

Let the canonical form of a type list as above be

[ "Z6", "a", "b", "c", "d", "e"]

Given that we have to treat array special anyway, this would be a rather compact representation.

An untyped list would accordingly be:

[ "Z1", "a", "b", "c", "d", "e"]

All JSON arrays would always have at least one element, the type. An untyped empty list would look like:

[ "Z1" ]

Denny first thought that, sure, this solution is very compact, but it is a bit too much magic, or a bit too confusing. Particularly, empty lists don't look empty. But after a few days of mulling it over Denny realized that this solution has a number of pretty big advantages:

  • It doesn't introduce new syntax. Everything that is wellformed under the new syntax was also wellformed under the old syntax. So no new syntax needed, as in Option 2 above.
  • No need to guess anything. This solution is completely lossless. The disadvantages of Option 1 and Option 3 are gone.
  • The representation is as compact as it can be without Option 2.
  • The normal form stays exactly the same. So everything that uses the normal form doesn't have to change anything. This is an advantage over Option 3.
  • The canonical form changes. Everywhere we store canonical form, or any code that deals with the canonical form, needs to be touched. But that's true for any of the above solutions as well. That's the same for all solutions but Option 4.
  • People who have gotten used to read
    { "Z1K1": "Z10" }
    
    can also get used to recognize
    [ "Z20" ]
    
    as an empty list of testers pretty quickly.
  • No change in the data model needed at all. It is 'only' a change in the canonicalizer and normalizer.

We could even go a step further and say "In addition to the above rule,

[]

and

[ "Z1" ]

are the same, in order to allow for really empty arrays. Or not.

In short, Denny is changing his preference from Option 3 to this one, Option 6 (but tends to not prefer the "step further", let's call that Option 6b, although he won't be opposed to it either).

(Suggested by Benjamin Degenhart via Telegram)