Web2Cit/Docs/Core

From Meta, a Wikimedia project coordination wiki

Web2Cit Core is a JavaScript library that implements the Web2Cit translation features described in the How Web2Cit works section of our Basics documentation.

Installation

Web2Cit core is available as an npm package and can be installed in any npm project with:

npm install web2cit

The package includes TypeScript .d.ts declaration files which provide type information for TypeScript projects.

Basic usage

The Web2Cit server imports the Web2Cit core library to expose some of its capabilities as a web service. See our Server documentation for more information.

To use Web2Cit core in a JavaScript (or TypeScript) project, briefly:

  1. Import Web2Cit.
  2. Begin by creating a Domain object for the domain you want Web2Cit to return translation results for.
  3. Use the fetchAndLoadConfigs instance method (of the Domain class) to fetch and load configuration files from the Web2Cit storage.
  4. Use the translate instance method to get translation results for one or more target paths.
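The order of steps 3 and 4 above can be sketched as follows. To stay self-contained, the sketch does not import the real library: DomainLike is a hypothetical stand-in interface that declares only the two methods named above, with simplified signatures, and FakeDomain is a toy object used for illustration. In a real project you would import Domain from web2cit and construct it as described in the sections below.

```typescript
// Hypothetical stand-in for the parts of web2cit's Domain class used here.
// The signatures are simplified assumptions, not the library's actual types.
interface DomainLike {
  fetchAndLoadConfigs(): Promise<void>;
  translate(paths: string | string[]): Promise<unknown[]>;
}

// Steps 3 and 4: load the configuration first, then translate.
async function translatePaths(
  domain: DomainLike,
  paths: string[]
): Promise<unknown[]> {
  await domain.fetchAndLoadConfigs();
  return domain.translate(paths);
}

// Toy domain that records the order of calls, for illustration only.
class FakeDomain implements DomainLike {
  calls: string[] = [];

  async fetchAndLoadConfigs(): Promise<void> {
    this.calls.push("fetchAndLoadConfigs");
  }

  async translate(paths: string | string[]): Promise<unknown[]> {
    this.calls.push("translate");
    const list = Array.isArray(paths) ? paths : [paths];
    // One placeholder output per requested path.
    return list.map((path) => ({ target: { path } }));
  }
}
```

Passing the domain in as a parameter keeps the calling code independent of how the Domain object was created, which also makes it easy to exercise with a fake like the one above.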

Usage examples

In this section we explain some basic usage of Web2Cit core through examples from the Web2Cit server's codebase, in the order in which they appear in the server's code. At the time of writing, the Web2Cit server is at version 1.1, which uses Web2Cit core v2.

Check the Web2Cit integrated editor's source code for other usage examples (for example, those related to editing domain configuration), and the core library's source code for the full API (see the Development section).

Importing from the library

import { Domain, Webpage } from "web2cit";

As explained in our Basics documentation, Web2Cit translation happens on a per-domain basis. Accordingly, the Domain class is usually the starting point of anything related to Web2Cit (see T302588 for a proposal to use a top-level Web2Cit class instead).

The Webpage class provides features for fetching and caching target-specific HTML and Citoid responses (see below).

Webpage objects

const target = new Webpage(url);
domainName = target.domain;
targetPath = target.path;

In the particular case quoted here, the Webpage constructor is used to validate the target URL that has been passed to the server.

However, Webpage objects implement additional features, mainly fetching and caching HTML responses from the target server, and the corresponding Citoid responses, as discussed below.
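To make the validation step more concrete, here is a rough approximation using the standard URL class. It is not the library's Webpage implementation: splitTargetUrl and TargetParts are hypothetical names, and treating the path as the pathname plus the query string is an assumption made for this sketch.

```typescript
// Approximation of the validation and splitting described above, using the
// standard URL class. This is NOT the library's Webpage implementation.
interface TargetParts {
  domain: string; // hostname, e.g. "www.example.com"
  path: string; // assumed here to be the pathname plus the query string
}

function splitTargetUrl(url: string): TargetParts {
  // The URL constructor throws a TypeError for strings it cannot parse.
  const parsed = new URL(url);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported scheme: ${parsed.protocol}`);
  }
  return { domain: parsed.hostname, path: parsed.pathname + parsed.search };
}

const parts = splitTargetUrl("https://www.example.com/path/to/target.html?id=1");
// parts.domain is "www.example.com"
// parts.path is "/path/to/target.html?id=1"
```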

The Domain object

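import { JSDOM } from "jsdom";

// Outside the browser, jsdom provides the Window object needed for XPath operations.
const window = new JSDOM().window;
domain = new Domain(domainName, window, {
  userAgentPrefix,
});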

The Domain constructor creates a Domain object, which is the base for most Web2Cit operations. The constructor takes the following positional arguments:

  1. domain (string): the domain or hostname (e.g., www.example.com). Schemes and ports are currently not supported; https is assumed (see T315020).
  2. windowContext (Window): a Window object that will be used for XPath operations.
  3. options (DomainOptions): a series of optional parameters, listed below.

DomainOptions accepts the following properties, all optional:

  • templates? (TemplateDefinition[]): an optional list of template definitions, to initialize the Domain object with.
  • patterns? (PatternDefinition[]): an optional list of pattern definitions, to initialize the Domain object with.
  • tests? (TestDefinition[]): an optional list of test definitions, to initialize the Domain object with.
  • fallbackTemplate? (FallbackTemplateDefinition): the fallback template to use when no applicable template is found for a given target. Defaults to a fallback template included in the core library.
  • catchallPattern? (boolean): whether a catch-all pattern should be used for paths not matching any URL path pattern. Defaults to true.
  • forceRequiredFields? (FieldName[]): a list of translation fields that are mandatory (e.g., templates not including them will be ignored). Defaults to an array defined in the core library's config file.
  • userAgentHeaderName? (string): an alternative name for the User-Agent header, for browser-based applications where the User-Agent header may be overwritten by the browser (e.g., Api-User-Agent; see Wikimedia's User-Agent policy).
  • userAgentPrefix? (string): a prefix identifying the library consumer, to be added to the user agent of outgoing requests from the library.
  • originFetch? (typeof fetch): a custom fetch function to use for the domain's same-origin requests.
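As an illustration of how a consumer might assemble these options, the sketch below mirrors only the scalar options from the list above. DomainOptionsSketch and withDocumentedDefaults are hypothetical names; the real DomainOptions type ships with the library's declaration files, and the library applies its own defaults internally.

```typescript
// Partial, hypothetical mirror of the scalar options listed above.
// The real DomainOptions type ships with the library's .d.ts files.
interface DomainOptionsSketch {
  catchallPattern?: boolean;
  userAgentHeaderName?: string;
  userAgentPrefix?: string;
}

// Make the one scalar default documented above explicit
// (catchallPattern defaults to true).
function withDocumentedDefaults(
  options: DomainOptionsSketch = {}
): DomainOptionsSketch & { catchallPattern: boolean } {
  return { ...options, catchallPattern: options.catchallPattern ?? true };
}

// Example: a browser-based consumer that cannot set User-Agent directly.
const browserOptions = withDocumentedDefaults({
  userAgentHeaderName: "Api-User-Agent",
  userAgentPrefix: "my-tool/0.1", // hypothetical prefix
});
```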

Changing storage location

let storageRoot = domain.templates.storage.root;
storageRoot = `User:${user}/` + storageRoot;
domain.templates.storage.root = storageRoot;
domain.patterns.storage.root = storageRoot;
domain.tests.storage.root = storageRoot;

By default, the core library fetches configuration files from the main Web2Cit storage on Meta-Wiki at Web2Cit/data/. The code snippet above shows how the Web2Cit server changes this default location to support using configuration files from a user's sandbox storage (see Server documentation).

Note this is a hack until passing custom storage locations to the Domain constructor is supported (see T306553).

domain.templates, domain.patterns and domain.tests refer to DomainConfiguration objects. Check the source code for other properties that may be used to further customize the storage location, such as the MediaWiki instance (set to https://meta.wikimedia.org by default), or the corresponding file names.
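The string manipulation behind the sandbox location can be isolated in a small helper. This is only a sketch: sandboxStorageRoot, HasStorageRoot and useSandbox are hypothetical names, and the assumption that the default root string is exactly "Web2Cit/data/" is taken from the description above rather than from the source code.

```typescript
// Hypothetical helper mirroring the prefixing done in the snippet above.
function sandboxStorageRoot(user: string, defaultRoot: string): string {
  return `User:${user}/${defaultRoot}`;
}

// Minimal shape shared by the three configuration objects, as used above.
interface HasStorageRoot {
  storage: { root: string };
}

// Point every given configuration object at the user's sandbox storage.
function useSandbox(user: string, configs: HasStorageRoot[]): void {
  for (const config of configs) {
    config.storage.root = sandboxStorageRoot(user, config.storage.root);
  }
}

// Assuming the default root is "Web2Cit/data/", as stated above:
const sandboxRoot = sandboxStorageRoot("Example", "Web2Cit/data/");
// sandboxRoot is "User:Example/Web2Cit/data/"
```

Unlike the server snippet, which computes one root and assigns it to all three objects, this helper prefixes each object's own root; the two are equivalent as long as the three objects start from the same root.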

Fetching and loading configuration

The code below fetches the configuration files from the storage repository and loads them (i.e., parses template/pattern/test definitions and sets them as the current configuration):

await domain.fetchAndLoadConfigs(options.tests);
configsFetched = true;
targetPaths = domain.getPaths();

The Domain object's getPaths() method returns all target paths that have been configured in translation templates or tests. In the server snippet above, it is used when no specific target has been specified (i.e., domain parameter set, path unset; see Server documentation), to get translation results for all configured paths.

The Domain's Webpage factory

if (options.citoid) {
  for (const targetPath of validTargetPaths) {
    const target = domain.webpages.getWebpage(targetPath);
    // Make the citoid cache fetch its data
    // regardless of whether it is needed or not by one of the translation procedures.
    target.cache.citoid.getData();
  }
}

As explained in the Server documentation, the citoid parameter controls whether we want to prepend the raw citation as returned by Citoid. To prevent fetching the Citoid response for a target twice (once to get the raw citation requested by the citoid parameter, and once for any Citoid selection step that may be needed further down during translation) we make use of both the Domain's Webpage factory and the Webpage object capabilities.

First, domain.webpages refers to a Webpage factory, which stores Webpage objects previously created via the getWebpage method and returns them if they are requested again, instead of creating new ones. The server uses it here, but it is also widely used within the core code itself.

Second, as mentioned above, Webpage objects handle fetching and caching of the HTML and Citoid responses for a target. Here, the getData method of the citoid cache fetches the Citoid data and caches it, so that it doesn't have to be fetched again if it is needed later in the process.

For example, note how these functions may be called again in lines 400 and 415 down below in the source code. They will return the same Webpage object and prevent Citoid from being called again (if already called).
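The combination of a factory that reuses objects and a cache that fetches only once can be illustrated with a toy model. All class names below (OnceCache, PageSketch, PageFactorySketch) are hypothetical, and the real Webpage class has a richer cache structure; the sketch only shows why calling getWebpage and getData repeatedly does not trigger a second request.

```typescript
// Toy cache: the fetcher runs at most once; later calls reuse the same promise.
class OnceCache<T> {
  private pending?: Promise<T>;
  private readonly fetcher: () => Promise<T>;

  constructor(fetcher: () => Promise<T>) {
    this.fetcher = fetcher;
  }

  getData(): Promise<T> {
    if (this.pending === undefined) {
      this.pending = this.fetcher();
    }
    return this.pending;
  }
}

// Toy page object holding a single Citoid-like cache.
class PageSketch {
  readonly path: string;
  readonly citoid: OnceCache<string>;

  constructor(path: string, fetchCitoid: (path: string) => Promise<string>) {
    this.path = path;
    this.citoid = new OnceCache(() => fetchCitoid(path));
  }
}

// Toy factory: returns the stored object for a path instead of creating a new one.
class PageFactorySketch {
  private readonly pages = new Map<string, PageSketch>();
  private readonly fetchCitoid: (path: string) => Promise<string>;

  constructor(fetchCitoid: (path: string) => Promise<string>) {
    this.fetchCitoid = fetchCitoid;
  }

  getWebpage(path: string): PageSketch {
    let page = this.pages.get(path);
    if (page === undefined) {
      page = new PageSketch(path, this.fetchCitoid);
      this.pages.set(path, page);
    }
    return page;
  }
}
```

Caching the promise rather than the resolved value means that two callers asking for the data at the same time still share a single request.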

Finally, translation

So far, we have prepared the Domain object to run its most important function: the translate function.

const targetOutputs = await domain.translate(validTargetPaths, {
  // if debug enabled, return non-applicable template outputs
  onlyApplicable: options.debug ? false : true,
});

The translate method accepts the following parameters:

  • paths (string | string[]): one or more paths, corresponding to webpages to be translated.
  • options object (TranslateOptions):
    • allTemplates? (boolean; default: false): if true, translation won't stop on the first applicable template, but rather all candidate templates will be tried.
    • onlyApplicable? (boolean; default: true): if true, only results for applicable templates will be returned.
    • fillWithCitoid? (boolean; default: false): if true, undefined citation fields will be populated with values from the Citoid response (pending implementation; see T302019).
    • forceTemplatePaths? (string[]): a list of paths corresponding to translation templates to try, instead of using the list of candidate templates from the same URL path pattern group.
    • forcePattern? (string): used to force a specific URL path pattern translation group. This option will be ignored if the forceTemplatePaths option has been set.

The return value is a promise that resolves to an array of TargetOutput objects, one per translation target.
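The interplay between forceTemplatePaths and forcePattern can be expressed as a small decision function. This is a sketch of the documented precedence only, not the library's code: TranslateOptionsSketch, TemplateSource and templateSource are hypothetical names introduced for illustration.

```typescript
// Hypothetical mirror of the options listed above.
interface TranslateOptionsSketch {
  allTemplates?: boolean;
  onlyApplicable?: boolean;
  fillWithCitoid?: boolean;
  forceTemplatePaths?: string[];
  forcePattern?: string;
}

// Where the candidate templates for a target would come from.
type TemplateSource =
  | { kind: "forcedTemplates"; paths: string[] }
  | { kind: "forcedPattern"; pattern: string }
  | { kind: "matchedPattern" };

// Documented precedence: forcePattern is ignored if forceTemplatePaths is set.
function templateSource(options: TranslateOptionsSketch = {}): TemplateSource {
  if (options.forceTemplatePaths !== undefined) {
    return { kind: "forcedTemplates", paths: options.forceTemplatePaths };
  }
  if (options.forcePattern !== undefined) {
    return { kind: "forcedPattern", pattern: options.forcePattern };
  }
  // Otherwise, templates come from the pattern group matching the target path.
  return { kind: "matchedPattern" };
}
```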

TargetOutput objects

Each TargetOutput object corresponds to a specific translation target, and includes:

  • domain: an object with information about the domain to which the target belongs, including:
    • name (string): the hostname of the domain. For example, www.example.com.
    • definitions: an object with information about the domain configuration files used, including:
      • patterns:
        • revid (number | undefined): the revision ID of the patterns.json configuration file used from the Web2Cit storage; undefined if no valid file found.
      • templates: idem patterns, for templates.json file.
      • tests: idem patterns, for tests.json file.
  • target: an object with information about the translation target, including:
    • path (string): the path to the target webpage, within the current domain. For example, /path/to/target.html
    • caches:
      • http?.timestamp (string): if the target webpage's HTML was fetched, a timestamp indicating when.
      • citoid?.timestamp (string): if the Citoid response for the target webpage was fetched, a timestamp indicating when.
  • translation:
    • pattern (string | undefined): the URL path pattern defining the translation subgroup to which the translation target belongs. If the translate method was called with the forceTemplatePaths option set, this value will be undefined.
    • error? (Error): if an error was thrown during target translation, it will be included here.
    • outputs (TranslationResult[]): an array of translation results, one per translation template tried (see below).

Each TranslationResult object corresponds to a specific translation template tried, and includes:

  • template: an object with information about the translation template tried, including:
    • applicable (boolean | undefined): whether the translation template is applicable for the target webpage. Note that if the translate method was called with options onlyApplicable=false and allTemplates=false, the output may include untried templates. In these cases, the applicable value will be undefined.
    • path (string | undefined): the translation template's path; undefined if corresponding to the fallback template.
    • fields (FieldInfo[]): an array of FieldInfo objects, each with detailed information about a specific template field (see below).
  • citation (WebToCitCitation | undefined): a citation in Citoid's mediawiki-basefields format (see Citoid API documentation), except that the source field may be Web2Cit in addition to Zotero. Citation will be undefined if the template turned out to be non-applicable for the translation target.
  • timestamp (string): timestamp when template output was returned.
  • scores:
    • fields (TestFieldOutput[]): an array of TestFieldOutput objects, each with detailed information about a specific test field (see below).

Each FieldInfo object provides detailed information about a specific template field, and includes:

  • name (string): the name of the template field, corresponding to one of the translation fields.
  • required (boolean): whether the template field was marked as required on the translation template.
  • procedures: an array of objects with information about specific translation procedures, each including:
    • selections: an array of objects with information about specific selection steps, each including:
      • selection type and config properties, as defined in the Selection objects section of the Storage documentation.
      • output (string[]): an array of string values corresponding to the output of the selection step.
    • transformations: an array of objects with information about specific transformation steps, each including:
      • transformation type, config and itemwise properties, as defined in the Transformation objects section of the Storage documentation.
      • output (string[]): an array of string values corresponding to the output of the transformation step.
    • output (string[]): the output of the translation procedure, which corresponds to the output of the last transformation step, or the combined output of all selection steps if no transformation steps defined.
  • output (string[]): the output of the template field, which corresponds to the combined output of all translation procedures.
  • valid (boolean): whether the template field's output is valid, according to the corresponding translation field's validation pattern.
  • applicable (boolean): whether the template field is applicable for the translation target; false if required=true and valid=false.
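The relationship between the required, valid and applicable flags can be written as a one-line rule. The list above only states when applicable is false; that it is true in every other case is an assumption of this sketch, and FieldFlags and isFieldApplicable are hypothetical names.

```typescript
// Hypothetical reduction of FieldInfo to the two flags that decide applicability.
interface FieldFlags {
  required: boolean;
  valid: boolean;
}

// Documented: applicable is false if required=true and valid=false.
// Assumed here: applicable is true in every other case.
function isFieldApplicable(field: FieldFlags): boolean {
  return !(field.required && !field.valid);
}
```

Under this reading, an optional field with an invalid output does not make the field non-applicable; only a required field with an invalid output does.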

Each TestFieldOutput object provides translation test information about a specific test field, and includes:

  • fieldname (string): the name of the test field, corresponding to one of the translation fields.
  • goal (string[]): the expected translation output for the given translation field.
  • score (number): the test score that results from comparing the expected output above vs the template output for the same field (see the corresponding section in the Tests documentation).

Note that Web2Cit core's output interfaces may be normalized in the future; see T302431.
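To show how a consumer might walk this structure, the sketch below declares a partial mirror of the shapes described above and extracts the citation of the first applicable template. The interface and function names are hypothetical, only a subset of the documented properties is included, and the citation is typed loosely because its exact type is defined by the library.

```typescript
// Partial, hypothetical mirror of the output shapes described above.
interface TranslationResultSketch {
  template: { applicable: boolean | undefined; path: string | undefined };
  citation: Record<string, unknown> | undefined;
}

interface TargetOutputSketch {
  target: { path: string };
  translation: {
    pattern: string | undefined;
    error?: Error;
    outputs: TranslationResultSketch[];
  };
}

// Return the citation of the first applicable template, or undefined if the
// translation failed or no template was applicable.
function firstCitation(
  output: TargetOutputSketch
): Record<string, unknown> | undefined {
  if (output.translation.error !== undefined) return undefined;
  const result = output.translation.outputs.find(
    (candidate) => candidate.template.applicable === true
  );
  return result?.citation;
}
```

Comparing applicable against true explicitly skips both non-applicable templates and untried ones, whose applicable value is undefined.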

Detailed information

This section provides detailed information about some parts of the library.

Domain configuration objects

The DomainConfiguration abstract class is extended by the PatternConfiguration, TemplateConfiguration and TestConfiguration subclasses.

The objects of these classes are configured with a domain name and with storage configurations, and they "know" how to fetch the corresponding configurations from the storage, for which they have a series of methods.

In addition, domain configuration subclasses implement specific parse and loadConfiguration instance methods that "know" how to parse revision content into configuration values.

Configuration objects have a private values property holding an array of configuration values (templates, patterns or tests), either added manually, or loaded from a revision.
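The division of labor between the abstract class and its subclasses might look roughly like the toy model below. Names ending in Sketch are hypothetical, the real classes also handle storage and revisions, and the shape assumed for a pattern entry (an object with a pattern string) is an assumption made for illustration; see the Storage documentation for the actual file formats.

```typescript
// Toy version of the class hierarchy described above.
abstract class ConfigurationSketch<T> {
  // Configuration values, either added manually or loaded from a revision.
  private values: T[] = [];

  // Subclasses "know" how to turn revision content into configuration values.
  abstract parse(content: string): T[];

  // Parse the content and set the result as the current configuration.
  loadConfiguration(content: string): void {
    this.values = this.parse(content);
  }

  // Return a copy so callers cannot mutate the private array.
  get(): T[] {
    return [...this.values];
  }
}

// Assumed shape of one pattern entry, for illustration only.
interface PatternSketch {
  pattern: string;
}

class PatternConfigurationSketch extends ConfigurationSketch<PatternSketch> {
  parse(content: string): PatternSketch[] {
    const data: unknown = JSON.parse(content);
    if (!Array.isArray(data)) {
      throw new Error("Expected a JSON array of patterns");
    }
    return data.map((item) => {
      if (typeof item?.pattern !== "string") {
        throw new Error("Each pattern entry needs a string 'pattern' property");
      }
      return { pattern: item.pattern as string };
    });
  }
}
```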

Redirects

Domain configuration objects' fetch methods follow MediaWiki redirects (see T304772). This is useful for domain aliases; for example if www.example.com is an alias of example.com, configuration files of the former may be redirected to those of the latter. See the Domain aliases section of the Editing documentation for further information.

Fetch wrapper

Web2Cit core uses a fetch wrapper to set custom user agents on outgoing requests.

The wrapper may also be used to supply custom fetch functions (via the Domain constructor's originFetch option; see above) for specific origins, a feature needed to circumvent CORS restrictions in the Web2Cit integrated editor.
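A wrapper of this kind could be sketched as follows. This is not the library's implementation: FetchLike is a deliberately minimal fetch-like signature so that the sketch needs no DOM typings, and WrapperOptions and wrapFetch are hypothetical names. The sketch assumes absolute URLs.

```typescript
// Minimal fetch-like signature so the sketch needs no DOM or Node typings.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> }
) => Promise<unknown>;

interface WrapperOptions {
  userAgent: string;
  userAgentHeaderName?: string; // e.g. "Api-User-Agent" in browsers
  origin?: string; // origin served by originFetch, e.g. "https://www.example.com"
  originFetch?: FetchLike;
}

function wrapFetch(baseFetch: FetchLike, options: WrapperOptions): FetchLike {
  const headerName = options.userAgentHeaderName ?? "User-Agent";
  const custom = options.originFetch;
  return (url, init = {}) => {
    // Add the user agent header, keeping any headers the caller passed in.
    const headers = { ...(init.headers ?? {}), [headerName]: options.userAgent };
    // Route requests for the configured origin to the custom fetch function.
    const sameOrigin =
      options.origin !== undefined && new URL(url).origin === options.origin;
    const chosen = sameOrigin && custom !== undefined ? custom : baseFetch;
    return chosen(url, { ...init, headers });
  };
}
```

Keeping the routing decision inside the wrapper means the rest of the code can call a single fetch function without knowing whether a request goes out directly or through a consumer-provided function.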

JSON schema files

These JSON schema files indicate the shape of the Web2Cit configuration files saved to the Web2Cit storage repository on Meta-Wiki. They are used by Web2Cit server to create custom JSON editor forms for specific configuration file types (see the Editing documentation).

Ideally, these files should be generated automatically from the TypeScript types (or vice versa), as described in T308347.

These files are currently served via the GitHub mirror due to Wikimedia GitLab's CORS restrictions (see T305700). See T318352 for a proposal to serve them from the Web2Cit server instead.

Development

Source code

The project's source code is hosted on Wikimedia's GitLab and mirrored to GitHub.

Source code is written in TypeScript, a JavaScript superset with type checking support.

Development environment

This project uses npm for installing and managing dependencies. To set up the development environment:

  1. Clone the git repository, or your fork of it. For example: git clone https://gitlab.wikimedia.org/diegodlh/w2c-core.git.
  2. cd into the cloned repository and run npm install. This will:
    1. install the required dependencies;
    2. configure Husky to automatically lint and format staged files before committing, using ESLint and Prettier, respectively; and
    3. use TypeScript's tsc to build the source code to ./dist/ (see Building section below).

Software design

Video recording of an introduction to Web2Cit core library architecture

Building

TypeScript code is compiled into JavaScript using TypeScript's tsc compiler, whose behavior can be controlled via the ./tsconfig.json file.

To build the library, run npm run build. Alternatively, you may run npm run build:watch to automatically rebuild upon changes.

Publishing a new version

  1. Run git clean -xdf to clean the git repository. Make sure to stash any uncommitted changes before doing so.
  2. Run npm install (this is needed to install genversion, which is needed below).
  3. Run npm version --no-git-tag-version with the corresponding newversion argument (e.g., npm version ... patch). This will:
    1. update package.json and package-lock.json with the increased version number;
    2. run genversion to update ./src/version.ts and stage it.
  4. Move changes from the "Unreleased" section of the changelog to the new version's section.
  5. Stage all changes and commit with commit message "Bump vX.Y.Z".
  6. Tag as "vX.Y.Z" using git tag.
  7. Publish to npm using npm publish. You may use --tag next to tag the new npm version as next, instead of the default tag latest.
  8. If published successfully, run git push and git push --tags.

Automatic tests

Some automatic tests have been defined using the Jest framework.

To run them, simply run npm run test.

To run individual tests, run npm run test followed by a test name pattern. For example, npm run test selection should run only the tests defined in the ./src/templates/selection.test.ts file.

Debugging tests

If you use Microsoft's Visual Studio Code editor, the repository's .vscode/launch.json file will automatically configure test debugging for you.

To debug tests, open the Run and Debug panel, choose the "Debug Jest Tests" option (provided by the launch.json file) from the drop-down, and hit Start Debugging. Test execution will pause at any breakpoints you define, in both test and regular source code files.

If you want to debug specific tests, modify the launch.json file by adding the test name pattern (as used to run specific tests, above) as an argument to the args array of the Debug Jest Tests configuration.