Abstract Wikipedia/Semantics of validation in Wikifunctions
This page is currently a draft. More information pertaining to this may be available on the talk page.
Validation is currently underspecified. What does it mean for an object to be valid? Intuitively, we have an idea of what that means for values (i.e. results of evaluation that have been fully resolved); a value is valid when:
- The fields of the object match the description of the keys in the type.
- The fields are themselves valid (which makes validity a recursive definition).
- The custom validator (if any) called on the object does not return an error.
However, this is not sufficient and leaves a lot of ambiguities:
- Which Z4/Type are we referring to? Is it the Z1K1/type of the object? Is it the Z4K2/keys of the parent object? The Z8K1/arguments of a function? The Z8K2/return type of a function? Probably all of the above, but how do we avoid validating the same things over and over again?
- The implementation of the orchestrator currently uses a lazy evaluation strategy. This means objects are not fully evaluated until the very end. Fields can be unresolved during the computation, which means that they will have type Z7/Function call, Z9/Reference, or Z18/Argument reference, which does not match the expected type. What does it mean for these partially evaluated objects to be valid? Ignoring such fields would mean we might be accepting invalid inputs.
This document attempts to propose potential solutions to this problem.
Concepts
There are multiple operations involved in evaluation. All of them are built around a central operation called resolve(), which takes an object and, if it is a Z7/Function call, Z9/Reference, or Z18/Argument reference, evaluates it accordingly. It keeps doing so as long as possible. Once it is done, the result is an object that is no longer a Z7/Function call, Z9/Reference, or Z18/Argument reference. We call this a resolved object.
From that, the following operations are defined:
- resolveKey(), which takes an object and a field and resolves the subobject in that field. We call this a resolved field.
- resolveFully(), which takes an object and fully resolves it and all of its subfields. In the current lazy strategy, this corresponds roughly to calling resolve() on the object and then calling resolveKey() on each of its fields (except some special fields such as Z1K1/type, Z14K2/composition, and identity fields). We call this a fully resolved object.
The result of a call to the orchestrator is hence the result of a single call to resolveFully(). (Note that this is currently broken: T317230.)
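To make this more concrete, here is a minimal sketch of the three operations under the current lazy strategy, in TypeScript. The ZObject shape, the Scope type, and helpers such as evaluateOneStep() are illustrative assumptions, not the orchestrator's actual API.

```typescript
// Simplified ZObject representation: either a terminal string or a map of keys.
type ZObject = string | { [key: string]: ZObject };
type Scope = Map<string, ZObject>;

// An object still needs resolving if it is a Z7/Function call,
// Z9/Reference, or Z18/Argument reference (simplified check).
function isUnresolved(obj: ZObject): boolean {
  if (typeof obj === 'string') return false;
  const type = obj['Z1K1'];
  return type === 'Z7' || type === 'Z9' || type === 'Z18';
}

// Placeholder for one evaluation step (call the function, dereference, ...).
declare function evaluateOneStep(obj: ZObject, scope: Scope): ZObject;

// Fields that are not resolved eagerly: Z1K1/type, Z14K2/composition, identity fields.
const SPECIAL_FIELDS = new Set(['Z1K1', 'Z14K2']);

// resolve(): evaluate until the object is no longer a call/reference.
function resolve(obj: ZObject, scope: Scope): ZObject {
  while (isUnresolved(obj)) {
    obj = evaluateOneStep(obj, scope);
  }
  return obj; // a "resolved object"
}

// resolveKey(): resolve the subobject stored under one field ("resolved field").
function resolveKey(obj: { [key: string]: ZObject }, key: string, scope: Scope): void {
  obj[key] = resolve(obj[key], scope);
}

// resolveFully(): resolve the object, then recursively resolve its fields.
function resolveFully(obj: ZObject, scope: Scope): ZObject {
  obj = resolve(obj, scope);
  if (typeof obj === 'string') return obj;
  for (const key of Object.keys(obj)) {
    if (SPECIAL_FIELDS.has(key)) continue;
    obj[key] = resolveFully(obj[key], scope);
  }
  return obj; // a "fully resolved object"
}
```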
Eager evaluation
If we use eager evaluation instead of lazy evaluation, the problem becomes simpler because we would be dealing with values (i.e. fully resolved objects). Note that in this setting, the definitions of the operations are switched around:
- resolveKey() fully resolves the corresponding field,
- fullyResolve() calls resolveKey() on each of the fields before calling resolve() on the current object (again except special fields such as Z1K1/type, Z14K2/composition, and identity fields).
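A sketch of the eager variants, reusing the types and helpers from the previous sketch (the names remain assumptions), could look like this:

```typescript
// Eager resolveKey(): resolving a field means fully resolving it.
function resolveKeyEager(obj: { [key: string]: ZObject }, key: string, scope: Scope): void {
  obj[key] = fullyResolve(obj[key], scope);
}

// Eager fullyResolve(): resolve the (non-special) fields first,
// then resolve the object itself.
function fullyResolve(obj: ZObject, scope: Scope): ZObject {
  if (typeof obj !== 'string') {
    for (const key of Object.keys(obj)) {
      if (!SPECIAL_FIELDS.has(key)) resolveKeyEager(obj, key, scope);
    }
  }
  return resolve(obj, scope);
}
```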
We define two mutually recursive properties:
- A value is valid w.r.t. a Z4/Type when:
  - Each of its fields is declared in the Z4K2/keys of the Z4/Type.
  - Each of its fields is valid w.r.t. the Z3K1/value type in the corresponding Z3/Key.
  - Each of its fields is self-valid (see next definition).
  - The custom Z4K3/validator (if any) does not return an error when called on the value.
- A value is self-valid when:
  - Its own Z1K1/type is valid w.r.t. Z4/Type.
  - It is valid w.r.t. its own Z1K1/type.

(Note that we don’t require all Z4K2/keys to be present, because some types omit them, such as lists (where the fields are only set when the list is not empty).)
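These two properties could be sketched as a pair of mutually recursive checks. In the sketch below, keyDeclarations(), runCustomValidator(), and the Z4 constant are placeholders for whatever the orchestrator actually provides, and terminal strings are treated as trivially valid for simplicity.

```typescript
interface ZKeyDecl { key: string; valueType: ZObject } // one Z3/Key with its Z3K1/value type
declare function keyDeclarations(type: ZObject): ZKeyDecl[];                      // Z4K2/keys
declare function runCustomValidator(type: ZObject, value: ZObject): Error | null; // Z4K3/validator
declare const Z4: ZObject; // the built-in Z4/Type

// A value is valid w.r.t. a given type.
function isValid(value: ZObject, type: ZObject): boolean {
  if (typeof value === 'string') return true; // simplification for terminal strings
  const decls = keyDeclarations(type);
  for (const [key, field] of Object.entries(value)) {
    if (key === 'Z1K1') continue; // the type field is handled by isSelfValid
    const decl = decls.find(d => d.key === key);
    if (!decl) return false;                           // field must be declared in Z4K2/keys
    if (!isValid(field, decl.valueType)) return false; // field matches its Z3K1/value type
    if (!isSelfValid(field)) return false;             // field is self-valid
  }
  return runCustomValidator(type, value) === null;     // custom Z4K3/validator accepts it
}

// A value is self-valid.
function isSelfValid(value: ZObject): boolean {
  if (typeof value === 'string') return true;
  const ownType = value['Z1K1'];
  return isValid(ownType, Z4)   // its own type is a valid Z4/Type
    && isValid(value, ownType); // and it is valid w.r.t. its own type
}
```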
It then becomes somewhat clearer how validation should happen:
- Whenever we fullyResolve() an object, we check that it is self-valid.
- Whenever we resolveKey() a field, we check that the subobject is valid w.r.t. the Z3K1/value type in the corresponding Z4K2/key.
- Whenever a function is called on an argument, that argument is fully resolved, so we check that the argument is valid w.r.t. the Z8K1/arguments of the function.
- Whenever a function returns, the returned object is fully resolved, so we check that the return value is valid w.r.t. the Z8K2/return type.
We could even extend the definition of the valid property to objects where each key has been fully resolved, so that we validate each step in the resolve() operation (i.e. every time we resolve a function call, check that the function call object is valid, every time we resolve an argument reference, check that the argument reference object is valid, etc.).
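As a rough illustration of the function-call and return checks above, a call under eager evaluation might hook in the checks as follows; argumentDeclarations(), returnType(), and runImplementation() are hypothetical helpers, not the orchestrator's real API.

```typescript
declare function argumentDeclarations(fn: ZObject): ZKeyDecl[]; // Z8K1/arguments
declare function returnType(fn: ZObject): ZObject;              // Z8K2/return type
declare function runImplementation(fn: ZObject, args: Map<string, ZObject>): ZObject;

function callFunction(fn: ZObject, args: Map<string, ZObject>, scope: Scope): ZObject {
  // Arguments are fully resolved before the call and checked against Z8K1/arguments.
  for (const decl of argumentDeclarations(fn)) {
    const arg = fullyResolve(args.get(decl.key)!, scope);
    if (!isValid(arg, decl.valueType)) throw new Error(`invalid argument ${decl.key}`);
    args.set(decl.key, arg);
  }
  // The returned object is fully resolved and checked against Z8K2/return type.
  const result = fullyResolve(runImplementation(fn, args), scope);
  if (!isValid(result, returnType(fn))) throw new Error('invalid return value');
  return result;
}
```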
Pros:
- Might be simpler than lazy evaluation.
Cons:
- Some special care should be taken around special fields to avoid evaluating/validating them (e.g. Z14K2/composition), as well as to avoid infinite recursion (e.g. we should accept that Z4 is a valid type). We will not go into detail about this here.
- Might require bigger changes to move away from the current lazy evaluation strategy.
Performance considerations
As one may imagine, doing all these validations naively can be extremely inefficient. Notice that when we pass a value to a function, the value will have already been validated, but we validate it again w.r.t. the expected argument type. Even worse, when we check that an object is valid, we validate each field twice: once against the expected field type (the Z3K1/value type) and once against the field’s own Z1K1/type. Since this happens at every level, this can easily lead to an exponential blowup!
To avoid that, we can make a few optimizations.
Tagging self-valid values
The first optimization would be to tag objects as self-valid the first time they are fully resolved and checked to be self-valid. Equivalently, we do not check that subfields are self-valid, because we know, by the way we structured the orchestration, that we will have already checked that. This eliminates the expensive call to check that each field is self-valid. However, there is still a quadratic factor from the fact that each layer needs to validate all the layers below it.
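One possible way to implement the tag, assuming objects are not mutated after being tagged, is to record already-checked objects in a WeakSet (this is only an illustrative choice, reusing the earlier sketches):

```typescript
const selfValidTag = new WeakSet<object>();

function isSelfValidTagged(value: ZObject): boolean {
  if (typeof value === 'string') return true;
  if (selfValidTag.has(value)) return true; // already checked once, skip the work
  const ok = isSelfValid(value);
  if (ok) selfValidTag.add(value);          // tag so we never re-check this object
  return ok;
}
```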
Using type comparison
A further potential improvement would be to use some form of type comparison. Suppose we have a way to compare that two types are “the same”, which we call type equivalence. Then we can modify the definitions to the following:
- A value is valid w.r.t. a Z4/Type when:
  - It is self-valid.
  - Its Z1K1/type is equivalent to the given type.
- A value is self-valid when:
  - Its own Z1K1/type is valid w.r.t. Z4/Type.
  - Each of its fields is declared in the Z4K2/keys of the Z1K1/type.
  - Each of its fields is valid w.r.t. the Z3K1/value type in the corresponding Z3/Key.
  - The custom Z4K3/validator (if any) does not return an error when called on the value.
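With a (hypothetical) typeEquivalent() operation, these modified definitions could be sketched as follows, reusing the helpers from the earlier sketches; treating bare strings as Z6/String is an assumption for brevity.

```typescript
declare function typeEquivalent(a: ZObject, b: ZObject, scope: Scope): boolean;

// Valid w.r.t. a type: self-valid, and own type equivalent to the given type.
function isValidWithComparison(value: ZObject, type: ZObject, scope: Scope): boolean {
  const ownType = typeof value === 'string' ? 'Z6' : value['Z1K1']; // strings as Z6/String
  return isSelfValidWithComparison(value, scope) && typeEquivalent(ownType, type, scope);
}

// Self-valid: fields are checked against the object's own type only.
function isSelfValidWithComparison(value: ZObject, scope: Scope): boolean {
  if (typeof value === 'string') return true;
  const ownType = value['Z1K1'];
  if (!isValid(ownType, Z4)) return false;            // own type is a valid Z4/Type
  const decls = keyDeclarations(ownType);
  for (const [key, field] of Object.entries(value)) {
    if (key === 'Z1K1') continue;
    const decl = decls.find(d => d.key === key);
    if (!decl) return false;                          // declared in Z4K2/keys
    if (!isValidWithComparison(field, decl.valueType, scope)) return false;
  }
  return runCustomValidator(ownType, value) === null; // custom Z4K3/validator
}
```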
This is a bit similar to type conversion in systems with dependent types (i.e. with arbitrary computations in types):
$$\frac{\Gamma \vdash t : A \qquad A \equiv B}{\Gamma \vdash t : B}$$

(Read: if the term $t$ checks against type $A$, and $A$ is equivalent to another type $B$, then $t$ also checks against $B$.)
Defining such a type comparison operation is not trivial. Syntactic equality is not enough, and neither are identity fields, because they may contain free variables that are not resolved. For example, take the map function, which takes a type T, a list of elements of type T, and a function from T to U, and returns a list of elements of type U. If we then instantiate this function on lists of strings and pass the result to another function that expects a list of strings, we need to be able to identify that a list of U (in the environment where U maps to String) is the same as a list of String.
In general, systems that allow computations in types compare types by evaluating them and comparing their normal (i.e. fully evaluated) forms. Allowing arbitrary computations makes this undecidable. In practice, systems can restrict the kinds of computations that are allowed in types to make this decidable (e.g. System F, the Calculus of Constructions, etc.). This is not the case in Wikifunctions, so we have to accept that validation may loop infinitely and/or use heuristics that are “good enough” for practical use.
Moreover, this operation has a cost, so it is not completely clear whether it would be better than the quadratic performance using only self-valid tags (although on big objects such as long lists it would probably be). This would need to be verified by data in practice.
Implications on duck typing
Resorting to type comparison in the context of Wikifunctions has some implications for the semantics of validation, in particular for duck typing. Indeed, if an object expects a field to have some type O_1, but that field has another type O_2 that has the same fields as O_1 but a different validator, it is undecidable to check that O_2 is equivalent to O_1, so we need to do one of the following:
- Fail validation because the types are not exactly the same, at the risk of rejecting something that would have passed otherwise.
- Ignore O_2 and succeed the validation, at the risk of accepting something that would have been rejected otherwise.
Alternatively, maybe it would be possible to use type comparison for the fields only and still run the custom validator on every validation.
Caching intermediary operations
An alternative to having a type comparison operation is to cache all operations in the orchestrator, so that when we validate an object w.r.t. a type a second time, we immediately return the cached result. However, this is also not trivial:
- Again, we need to take free variables and scopes into account when hashing and equating types.
- It is not clear that the objects will be validated in the same scope every time, and therefore whether this method would lead to enough of a performance gain.
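A sketch of such a cache, keyed on object identity and reusing the earlier illustrative helpers; note that this deliberately glosses over the free-variable and scope problem mentioned above:

```typescript
const validationCache = new WeakMap<object, Map<object, boolean>>();

function isValidCached(value: ZObject, type: ZObject, scope: Scope): boolean {
  if (typeof value === 'string' || typeof type === 'string') {
    return isValidWithComparison(value, type, scope); // cheap cases: no caching
  }
  let byType = validationCache.get(value);
  if (!byType) { byType = new Map(); validationCache.set(value, byType); }
  const cached = byType.get(type);
  if (cached !== undefined) return cached;            // validated before: reuse the result
  const result = isValidWithComparison(value, type, scope);
  byType.set(type, result);
  return result;
}
```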
Implications on JSON schema validation
We note that with the proposed solutions, JSON schema validation becomes redundant (it was even incomplete, since it accepts unresolved fields unconditionally) and can be completely removed. We could also expose validation as another endpoint of the orchestrator service so that it can be used in other places that make use of validation, e.g. the frontend.
Lazy validation
If we want to keep a lazy evaluation strategy, we could attempt to do validation lazily. We make use of the same type comparison capability described above and give the following definition of validity:
- A resolved object is valid w.r.t. a Z4/Type when:
  - It is self-valid.
  - Its Z1K1/type is equivalent to the given type.
- A resolved object is self-valid when:
  - Its own Z1K1/type resolves to an object that is valid w.r.t. Z4/Type.
  - Each of its other fields is declared in the Z4K2/keys of the Z1K1/type.
  - Whenever one of its fields is resolved, that field is valid w.r.t. the Z3K1/value type in the corresponding Z3/Key.
  - The custom Z4K3/validator (if any) does not return an error when called on the value.
This is how we then perform validation:
- Whenever we resolve() an object, we check that it is self-valid, by resolving its Z1K1/type and checking that each of its other fields is declared in the type.
- Whenever we resolveKey() a field, if the key is Z1K1/type, we check that the resulting subobject is valid w.r.t. Z4/Type; otherwise, we check that the field is valid w.r.t. the Z3K1/value type in the corresponding Z3/Key.
- Whenever a function is called, we check that the given arguments are declared in the Z8K1/arguments of the function, and whenever an argument is used, that argument is resolved, and we check that the resulting object is valid w.r.t. the Z8K1/arguments of the function.
- Whenever a function returns, the returned object is resolved, and we check that the resulting object is valid w.r.t. the Z8K2/return type.
The key observation is that we can only validate an object w.r.t. another type once it is resolved, so we cannot check that subobjects match the expected types of the fields until each subobject is resolved.
In particular, it could be the case that some fields are invalid and we never check them. However, as long as they are not accessed and hence not involved in the final result, the computation will succeed (this is an expected behavior for lazy evaluation).
Also, notice that running the custom validator might resolve some subfields if it accesses them. This is fine because resolving a field is a cached operation.
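As a rough illustration, the per-field check could hook into a lazy resolveKey() along these lines, reusing the earlier hypothetical helpers (and ignoring, for brevity, that the Z1K1/type of the parent may itself still be unresolved):

```typescript
function resolveKeyLazy(obj: { [key: string]: ZObject }, key: string, scope: Scope): void {
  const resolved = resolve(obj[key], scope);
  if (key === 'Z1K1') {
    // The type field must resolve to an object that is valid w.r.t. Z4/Type.
    if (!isValidWithComparison(resolved, Z4, scope)) throw new Error('invalid type field');
  } else {
    // Other fields are checked against the Z3K1/value type of the corresponding Z3/Key.
    const decl = keyDeclarations(obj['Z1K1']).find(d => d.key === key);
    if (!decl || !isValidWithComparison(resolved, decl.valueType, scope)) {
      throw new Error(`field ${key} failed validation`);
    }
  }
  obj[key] = resolved;
}
```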
Pros:
- Closer to the current implementation
Cons:
- Requires type comparison
The need for type comparison
Would it be possible to achieve lazy validation without type comparison? It might be, but it’s quite a bit more complicated. Whenever we check that an object is valid w.r.t. a type, we “push” that type into the object (so an object will need to keep track of multiple types). Whenever we resolve a field of that object, we check it against the field declarations of each of those types.
Static type checking
Another approach would be to attempt some sort of static type checking that ensures that objects will be valid once evaluated. We would need to develop new rules for that, and it would add a whole new layer to the orchestrator. Some observations:
- We would still need to compute types dynamically during type checking, since we allow arbitrary computations in types. This is typical of dependently typed systems, but here we also have no guarantee that the computation will terminate.
- It would not be possible to statically check that the result of custom validators would pass, so those are definitely out of the picture. We will still need to run the validators at runtime.
- We would probably need more type annotations on functions, i.e. having generic function types.
- We also need to restrict the possibility of adding or removing fields dynamically from objects.
With this in mind, it might be possible to devise rules that ensure that an object will have some set of fields, similar to polymorphic variants in OCaml. However, given that we would still need to compute types dynamically and run the validators at runtime, the added complexity seems to outweigh the benefits.
Other solutions?
- Do not check fields, only run custom validators? The current description of the function model seems to imply that we would allow fields that are not declared, so maybe that was the original intent (that’s not how the current implementation works, though).
- Mix of eager and lazy evaluation? Do not evaluate function arguments before the function call, but once an argument is accessed, resolve it fully. Not sure what the benefits are but maybe it makes things easier because we would only validate objects when they are fully resolved. Would still need to think about how to treat lazy fields (identity, type, etc.).
| | Eager evaluation without type comparison | Lazy evaluation |
|---|---|---|
| Type comparison | Not required, but maybe good to have? | Required; not sure if it's possible without it? |
| Valid tag | Probably required for acceptable performance | Maybe not required |
| Removes JSON schema | Yes, with type comparison | Yes? |
| Correctness confidence | … | … |
| … | … | … |