Wikidata/Notes/Tainting and extracting

This note describes a tainting and extraction scheme that might (or might not) be interesting for Wikidata. It makes some assumptions about the page structure that is common to some sites for national statistics, like the one for Statistics Norway.

Note that tainting of the page is nothing more than tagging it for manual update. Note also that the description is a rewrite of a solution that was made for an ordinary wiki-like site and not for Wikidata as such.

Assumptions

1. The site has a common page with a stable link for some kind of statistics or group of statistics.
2. The page has identifiable links to specific pages of interest, which might point to further pages.
3. The final page has a dtg (date-time group), and the content will not change more often than this timestamp.
4. The final page is made for presentation and as such can be navigated with XPath statements.
5. Extracted text can be further split with regex.

Parser functions

The overall function can be implemented as a single function with additional context switching, or as a small set of functions with slightly different functionality. If the functions do not extend beyond point 3 they will only do tainting; if they extend to point 4 and beyond they will also do extraction.

One directive must be present, either as an anonymous argument or as an argument named src. It points to the external stable web page. One further directive may be present, either as an anonymous argument or as an argument named dst. It points to the page that should be tainted or updated. If it is not present, the target is assumed to be the present page.

At this point the parser function will send a HEAD request if a previous timestamp is known, compare the timestamps, and send a GET request if they differ. If there is no previous timestamp, a GET request is sent unconditionally. When the reply comes in, the page is turned into a digest and compared to the previous digest, if one is known.
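The request sequence above can be sketched roughly as follows. This is only an illustration: the names check_source and FetchResult, the choice of SHA-256 as digest, and the injected head and get callables are assumptions made for the sketch, not part of the described scheme.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical transport callables, injected so the sketch does not depend
# on a particular HTTP library:
#   head(url) -> timestamp of the page (e.g. Last-Modified), or None
#   get(url)  -> (timestamp or None, body as bytes)
Head = Callable[[str], Optional[str]]
Get = Callable[[str], Tuple[Optional[str], bytes]]


@dataclass
class FetchResult:
    stamp: Optional[str]   # timestamp seen on this check
    digest: Optional[str]  # digest of the body, or the previous digest
    fetched: bool          # True if a GET was actually sent


def check_source(url: str, head: Head, get: Get,
                 prev_stamp: Optional[str] = None,
                 prev_digest: Optional[str] = None) -> FetchResult:
    """HEAD first when a previous timestamp is known, GET only if needed."""
    if prev_stamp is not None and head(url) == prev_stamp:
        # Timestamp unchanged: no GET, the previous digest still stands.
        return FetchResult(prev_stamp, prev_digest, False)
    # No previous timestamp, or the timestamp differs: fetch the page.
    stamp, body = get(url)
    return FetchResult(stamp, hashlib.sha256(body).hexdigest(), True)
```

The caller would then compare the returned digest with the stored one and keep the new timestamp and digest for the next check.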

What happens next depends on whether further directives exist. If there are none, the implicitly or explicitly given page is tainted when no previous digest is known or when the new digest differs from the previous one, and it is not tainted when the digests match. Note that the page is never untainted. If there are further directives, the tainting is held in a pending state, and the final outcome depends on the additional directives.
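A minimal sketch of this decision is given below. It assumes that a digest that differs from the stored one counts as a change, just like a missing digest; the Taint enum and the decide function are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Optional


class Taint(Enum):
    CLEAN = "clean"
    PENDING = "pending"
    TAINTED = "tainted"


def decide(prev_digest: Optional[str], new_digest: str,
           has_more_directives: bool,
           current: Taint = Taint.CLEAN) -> Taint:
    """Return the taint state after one page has been checked."""
    if current is Taint.TAINTED:
        # A page is never untainted by this function.
        return Taint.TAINTED
    # Assumption: an unknown or differing previous digest means "changed".
    changed = prev_digest is None or prev_digest != new_digest
    if not changed:
        return current
    # With further directives the outcome is only pending for now.
    return Taint.PENDING if has_more_directives else Taint.TAINTED
```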

An additional directive is follow. If it holds an XPath statement, the statement tries to identify a link or, if that fails, something that can be turned into text, which is then scanned for a valid link. If the statement does not look like an XPath statement, it is assumed to be a regex identifying the link title or, if that fails, it is used to search the link text.

This implements points 1, 2 and 3

{{TAINT:src=url|follow=[xpath|regex]}}
{{TAINT:dst=page|src=url|follow=[xpath|regex]}}
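How a single follow directive might be resolved can be sketched as below. The sketch rests on several assumptions that are not part of the note: the page is well-formed XHTML without namespaces, only the XPath subset supported by Python's xml.etree.ElementTree is used, and a statement counts as XPath when it starts with "./" or "//". The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urljoin

# Rough pattern for an absolute link inside flattened text (assumption).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def looks_like_xpath(stmt: str) -> bool:
    # Heuristic only; a real implementation would need something stricter.
    return stmt.startswith(("./", "//"))


def follow(page: str, base_url: str, stmt: str) -> Optional[str]:
    """Resolve one follow directive to the URL of the next page."""
    root = ET.fromstring(page)
    if looks_like_xpath(stmt):
        # ElementTree wants relative paths, so "//a" becomes ".//a".
        path = "." + stmt if stmt.startswith("//") else stmt
        for el in root.findall(path):
            href = el.get("href")
            if href:
                return urljoin(base_url, href)
            # Not a link: flatten the element to text and scan for a link.
            match = URL_RE.search("".join(el.itertext()))
            if match:
                return match.group(0)
        return None
    pattern = re.compile(stmt)
    links = [a for a in root.iter("a") if a.get("href")]
    # First try the regex against the link title ...
    for a in links:
        title = a.get("title")
        if title is not None and pattern.search(title):
            return urljoin(base_url, a.get("href"))
    # ... and if that fails, search the link text.
    for a in links:
        if pattern.search("".join(a.itertext())):
            return urljoin(base_url, a.get("href"))
    return None
```

Real statistics pages are rarely well-formed XML, so an actual implementation would probably need a tolerant HTML parser in front of this step.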

The same process as for the first landing page is repeated for the new page. Several follow directives can be given, and each new one traverses further along the chain. Each page might set the tainting in a pending state: even if the final page is identical to a previous version, the context might have changed, and with it the interpretation of the final page.

When the chain of directives is exhausted, any pending taint state is transferred to the implicitly or explicitly given page.
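The traversal of the chain with a pending taint state might look roughly like this. The function walk_chain, the injected fetch and resolve callables, and the digest store are hypothetical; treating a link that cannot be resolved as a reason to taint is also an assumption of the sketch.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple


def walk_chain(src: str, follows: List[str],
               fetch: Callable[[str], str],
               resolve: Callable[[str, str], Optional[str]],
               digests: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Walk from src along the follow directives.

    Returns (final URL, taint flag). digests maps URL -> last known digest
    and is updated in place.
    """
    pending = False
    url = src
    for step in range(len(follows) + 1):
        body = fetch(url)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digests.get(url) != digest:
            # Unknown or changed page anywhere in the chain sets pending.
            pending = True
        digests[url] = digest
        if step == len(follows):
            break  # chain exhausted, url is the final page
        next_url = resolve(body, follows[step])
        if next_url is None:
            # Assumption: a broken chain should be looked at manually.
            return None, True
        url = next_url
    # The pending state is transferred to the target page.
    return url, pending
```

Because every page in the chain can set the pending state, a change on an intermediate page taints the target even when the final page is byte-for-byte identical to the previous version.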

If additional directives are available, the parser function(s) will not only do tainting but also extract data from the final page. If the final state implies tainting, and extraction is possible and does not fail, the implicitly or explicitly given internal page will be updated as necessary and will not be tainted. It is safe to assume that Wikidata will need some kind of tainting scheme, but it is not safe to assume that extraction will be implemented, or that it is even a wanted feature.

The additional directive for navigating the final page is extract, where either an XPath statement or a regex is used to identify an element and its children. It should be possible to run multiple extract directives on the same page.

This implements points 1, 2, 3 and 4

{{TAINT:src=url|follow=[xpath|regex]|extract=[xpath|regex]}}
{{TAINT:dst=page|src=url|follow=[xpath|regex]|extract=[xpath|regex]}}

Note that an XPath statement might produce siblings, which are pushed into a list, and that groups from a regex do the same. After the extract is done, each resulting item in the list is transformed to flat text before further processing.
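The list building and flattening can be illustrated with the following sketch. It assumes a well-formed page, the XPath subset of Python's xml.etree.ElementTree, the same "./" or "//" heuristic for recognising XPath, and a crude tag-stripping regex for flattening regex groups; extract and flatten are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

TAG_RE = re.compile(r"<[^>]+>")


def flatten(text: str) -> str:
    """Drop markup and collapse whitespace to get flat text."""
    return " ".join(TAG_RE.sub("", text).split())


def extract(page: str, stmt: str) -> List[str]:
    """Run one extract directive and return a list of flat text items."""
    if stmt.startswith(("./", "//")):
        path = "." + stmt if stmt.startswith("//") else stmt
        root = ET.fromstring(page)
        # Every matching element (siblings included) becomes one list item.
        return [flatten("".join(el.itertext())) for el in root.findall(path)]
    items: List[str] = []
    for match in re.finditer(stmt, page):
        # Each group becomes an item; without groups the whole match is used.
        groups = match.groups() or (match.group(0),)
        items.extend(flatten(g) for g in groups if g is not None)
    return items
```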

In some cases the list might contain spurious elements. To get rid of them an optional last directive can be added: a filter pass using only regex directives. A list item must be hit by at least one include directive and by none of the exclude directives.

This implements points 1, 2, 3, 4 and 5

{{TAINT:src=url|follow=[xpath|regex]|extract=[xpath|regex]|include=regex|exclude=regex}}
{{TAINT:dst=page|src=url|follow=[xpath|regex]|extract=[xpath|regex]|include=regex|exclude=regex}}
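The filter pass is simple enough to sketch directly. The function name filter_items is hypothetical, and the sketch assumes that when no include directive is given at all, every item counts as included; the note does not say what should happen in that case.

```python
import re
from typing import Iterable, List


def filter_items(items: Iterable[str],
                 include: Iterable[str] = (),
                 exclude: Iterable[str] = ()) -> List[str]:
    """Keep items hit by at least one include regex and no exclude regex."""
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]
    kept = []
    for item in items:
        # Assumption: an empty include list lets every item through.
        included = not inc or any(p.search(item) for p in inc)
        excluded = any(p.search(item) for p in exc)
        if included and not excluded:
            kept.append(item)
    return kept
```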

The final list can be made available for use as parameters in templates or as arguments for system messages. There is a slight problem here, as a missing parameter or argument should lead to tainting of the implicitly or explicitly given internal page.
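One way to detect the missing-parameter case is to bind the list to the expected parameter names and report whether anything is left unbound. The function bind_parameters and the parameter names in the usage below are hypothetical, and how the result is handed to a template is left open.

```python
from typing import Dict, List, Sequence, Tuple


def bind_parameters(names: Sequence[str],
                    items: Sequence[str]) -> Tuple[Dict[str, str], bool]:
    """Pair expected parameter names with extracted items, in order.

    Returns (parameters, taint), where taint is True if at least one
    expected parameter did not get a value.
    """
    params = dict(zip(names, items))
    missing: List[str] = [n for n in names if n not in params]
    return params, bool(missing)
```

For example, bind_parameters(["city", "population"], ["Oslo"]) leaves population unbound, so the caller would taint the target page instead of updating it.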

Simplification

The parser function takes one or two anonymous arguments. If only one anonymous argument is provided, it is the external stable web page, and the page to be tainted is the present internal page as given by {{FULLPAGENAME}}. If two anonymous arguments are provided and the second is identified as a URL, the second is the external stable web page and the first explicitly gives the page to be tainted, if necessary with a namespace prefix.

This implements points 1 and 3

{{TAINT:url}}
{{TAINT:page|url}}

If additional anonymous arguments are given they are follow statements.

If the parser function is named EXTRACT, the last additional anonymous argument is an extract statement and any additional arguments before it are follow statements.

Usually the form will be one of the following

{{TAINT:url|xpath 1|…|xpath N}}
{{TAINT:page|url|xpath 1|…|xpath N}}
{{EXTRACT:url|xpath 1|…|xpath N|xpath}}
{{EXTRACT:page|url|xpath 1|…|xpath N|xpath}}
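The way the anonymous arguments could be sorted into roles is sketched below. The Call tuple, the parse_call function and the simple "starts with http:// or https://" test for a URL are assumptions made for the illustration; a dst of None stands for the present page.

```python
import re
from typing import List, NamedTuple, Optional, Sequence

# Assumption: anything starting with http:// or https:// is "a URL".
URL_RE = re.compile(r"^https?://", re.IGNORECASE)


class Call(NamedTuple):
    dst: Optional[str]      # page to taint or update, None = present page
    src: str                # external stable web page
    follow: List[str]       # follow statements, in order
    extract: Optional[str]  # extract statement, only for EXTRACT


def parse_call(name: str, args: Sequence[str]) -> Call:
    """Assign roles to the anonymous arguments of TAINT or EXTRACT."""
    if not args:
        raise ValueError("at least the source URL is required")
    if len(args) >= 2 and URL_RE.match(args[1]):
        # Second argument is a URL, so the first one names the target page.
        dst, src, rest = args[0], args[1], list(args[2:])
    else:
        dst, src, rest = None, args[0], list(args[1:])
    extract = None
    if name == "EXTRACT" and rest:
        # The last additional argument is the extract statement.
        extract = rest.pop()
    return Call(dst, src, rest, extract)
```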

XSLT

It is possible to do much the same with XSLT, and in the more complex cases that might turn out to be simpler. Usually the process up to identification of the final source page will be the same, and then a transform will be done on this page. Children of the document constructed by the transform will then be the entries in the list of extracted items.

An example of a possible layout for a tag function is something like

<extract src="foo">
    <follow regex="bar" />
    <stylesheet ..stuff..>
        ..more stuff..
    </stylesheet>
</extract>