Web2Cit/Docs/Server

From Meta, a Wikimedia project coordination wiki

The Web2Cit server is the part of the Web2Cit ecosystem that makes the functionalities of the Web2Cit core available via a web service for consumption from other parts of the ecosystem, such as the Web2Cit user script and the Web2Cit monitor, as well as from projects relying on Zotero translators, such as Zotero browser connectors or ZoteroBib.

It returns translation results for one or more target URLs using the corresponding domain configurations defined by Web2Cit collaborators, as available from the Web2Cit storage repository on Meta-Wiki.

How to use

Enter a target URL on the Web2Cit server's homepage and click "Extract" to get the translation summary page.

The simplest way to use the translation server is to go to its home page (https://web2cit.toolforge.org/) and enter the target URL you want translation results for. Alternatively, go directly to https://web2cit.toolforge.org/<YourTargetURL>. This is an alias of https://web2cit.toolforge.org/translate?tests=true&url=<YourTargetURL> (see URL query string parameters below).

If you have the Web2Cit user script installed, the translation summary for a target webpage may also be opened from Wikipedia, by clicking the "Web2Cit" link that appears beneath the citation results on the "Add a citation" dialog.

Note that the translation summary page also includes the translation results as embedded metadata. Therefore, you can use its URL in Wikipedia's unmodified automatic citation generator (i.e., without the Web2Cit user script installed), or with other tools relying on Zotero translators, such as Zotero browser connectors and ZoteroBib.

Sandbox configurations

Configuration files may be saved to a personal sandbox storage to experiment with them without affecting all Web2Cit users (see the Editing documentation to learn how to do this using the JSON editor).

To instruct the Web2Cit server to use configuration files from your personal sandbox, on the translation summary page enter your Wikimedia username in the field next to "Switch to sandbox configuration:" and click on "Switch". This will use configuration files under User:<YourUserName>/Web2Cit/data/ on Meta-Wiki.
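As a rough sketch of where configuration files live, the helper below builds a storage path for a given domain. It is an inference, not the server's actual code: the reversed-domain layout is assumed from the single example path (Web2Cit/data/com/example/www/patterns.json) given in the JSON response format section, taken here to correspond to www.example.com, and the function name configPath is hypothetical.

```typescript
// Hypothetical helper; the reversed-domain layout is inferred from the
// example path "Web2Cit/data/com/example/www/patterns.json" in this document.
type ConfigFile = "patterns.json" | "templates.json" | "tests.json";

function configPath(
  domain: string,
  file: ConfigFile,
  sandboxUser?: string
): string {
  // "www.example.com" -> "com/example/www"
  const reversed = domain.split(".").reverse().join("/");
  // Sandbox configurations sit under the user's own page on Meta-Wiki.
  const root =
    sandboxUser !== undefined
      ? "User:" + sandboxUser + "/Web2Cit/data/"
      : "Web2Cit/data/";
  return root + reversed + "/" + file;
}
```

For example, configPath("www.example.com", "templates.json", "YourUserName") would point at the sandbox copy of the templates file under that assumption.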

Alternatively, you may just go to https://web2cit.toolforge.org/sandbox/<YourUserName>/<YourTargetURL>.[Note 1] This is an alias of https://web2cit.toolforge.org/translate?tests=true&sandbox=<YourUserName>&url=<YourTargetURL> (see URL query string parameters section below).

Debugging

Additional debugging information can be included for each translation target, to help understand why a translation did not work as expected.

To do so, click "Enable debugging" at the bottom of the translation summary page.

Alternatively, you may go to https://web2cit.toolforge.org/debug/<YourTargetURL>.[Note 1] This is an alias of https://web2cit.toolforge.org/translate?tests=true&debug=true&url=<YourTargetURL> (see URL query string parameters section below).
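The way the sandbox and debug aliases map onto /translate parameters can be sketched as below. This is only an illustration of the alias scheme described in this document (including the combined forms mentioned in Note 1), not the server's actual routing code; expandAlias and AliasParams are hypothetical names.

```typescript
// Hypothetical helper for illustration; not taken from the w2c-server source.
interface AliasParams {
  url: string;
  sandbox: string | undefined;
  debug: boolean;
}

// Takes the part of an alias URL that follows "https://web2cit.toolforge.org/"
// and returns the equivalent /translate parameters.
function expandAlias(aliasPath: string): AliasParams {
  let rest = aliasPath;
  let sandbox: string | undefined = undefined;
  let debug = false;
  let matched = true;
  // The "debug/" and "sandbox/<user>/" prefixes may appear in either order.
  while (matched) {
    matched = false;
    if (!debug && rest.indexOf("debug/") === 0) {
      debug = true;
      rest = rest.slice("debug/".length);
      matched = true;
    } else if (sandbox === undefined && rest.indexOf("sandbox/") === 0) {
      const afterPrefix = rest.slice("sandbox/".length);
      const slash = afterPrefix.indexOf("/");
      if (slash > 0) {
        sandbox = afterPrefix.slice(0, slash);
        rest = afterPrefix.slice(slash + 1);
        matched = true;
      }
    }
  }
  // Whatever remains is treated as the target URL.
  return { url: rest, sandbox: sandbox, debug: debug };
}
```

Both expandAlias("sandbox/YourUserName/debug/<YourTargetURL>") and expandAlias("debug/sandbox/YourUserName/<YourTargetURL>") yield the same parameters, which mirrors the equivalence stated in Note 1.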

Debugging information

The debugging information includes:

  • config: The revision ID of the configuration files (patterns.json and templates.json) used from Meta, or whether they are missing or corrupt (in which case the catch-all pattern and fallback template will be used by default).
  • pattern: The URL path pattern group to which the target webpage belongs (with ** being the catch-all pattern), and which in turn determines which translation templates have been tried.
  • templates: A list of translation templates tried, until the first one applicable for the target webpage.

For each template:

  • path: the path to the template webpage on which it is based (or undefined for the fallback template);
  • applicable: whether the template is applicable for the target webpage or not (note that outputs from non-applicable templates are not included in the results);
  • fields: a list of template fields (see below).

For each template field:

  • name: the field's name (itemType, title, etc; see Fields documentation);
  • isArray: whether the field expects a single value (false) or a list of values (true);
  • pattern: the pattern that all output values must follow for the field output to be valid;
  • required: whether it is a required field or not (note that the output of all required fields must be valid for the template to be applicable);
  • procedures: a list of translation procedures (see below);
  • output: the field output, which is a combination of all procedure outputs;
  • valid: whether the field output is valid or not (again, note that the output of all required fields must be valid for the template to be applicable);
  • applicable: whether the template field is applicable for the target webpage (i.e., either not required, or required AND valid).
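The applicability rules above can be sketched as follows. This is a minimal illustration rather than the Web2Cit core implementation: the FieldDebug interface and both function names are hypothetical, and only the required and valid flags described in the list are modelled.

```typescript
// Hypothetical minimal shape; only the flags needed for the rule are included.
interface FieldDebug {
  name: string;
  required: boolean;
  valid: boolean;
}

// A field is applicable if it is not required, or if it is required and valid.
function isFieldApplicable(field: FieldDebug): boolean {
  return !field.required || field.valid;
}

// A template is applicable only if the output of all required fields is valid,
// i.e. if every one of its fields is applicable.
function isTemplateApplicable(fields: FieldDebug[]): boolean {
  return fields.every(isFieldApplicable);
}
```

An invalid optional field therefore does not make a template inapplicable, whereas a single invalid required field does.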

For each translation procedure:

  • selection: A selection object, including:
    • steps: a list of selection steps (see below);
    • output: the overall selection output, which is a combination of all selection step outputs.
  • transformation: A transformation object, including:
    • steps: a list of transformation steps (see below);
    • output: the overall transformation output, which is the output of the last transformation step (or the selection output, if there are no transformation steps).
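The rule for the overall transformation output can be written as a short function. This is a sketch under assumptions: step outputs are taken to be arrays of strings (as field outputs are in the JSON response format), and StepDebug and transformationOutput are hypothetical names.

```typescript
// Hypothetical minimal shape of a step in the debugging information.
interface StepDebug {
  type: string;
  out: string[];
}

// The transformation output is the output of the last transformation step,
// or the selection output if there are no transformation steps.
function transformationOutput(
  selectionOutput: string[],
  steps: StepDebug[]
): string[] {
  const last = steps[steps.length - 1];
  return last !== undefined ? last.out : selectionOutput;
}
```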

For each selection or transformation step:

  • type: the step type (fixed, citoid, etc; see step types and configurations in Templates documentation);
  • config: the step configuration;
  • itemwise: in case of transformation steps, whether it was applied to each item of the input independently (true), or to the entire input as a whole (false);
  • out: the step output.

Note that, when parsing configuration files, Web2Cit ignores invalid definitions. These ignored elements will not show up in the debugging information. For example:

  • multiple translation templates for the same path, or templates missing one or more mandatory fields;
  • template fields with invalid field names;
  • procedures without either selection or transformation arrays (empty arrays are OK);
  • selection or transformation steps of an invalid type or with an invalid configuration.

URL query string parameters

The Web2Cit server provides a /translate endpoint which takes a series of URL query string parameters. The URLs indicated in the previous sections are just shortcuts or aliases to this endpoint with specific combinations of parameter values.

Either the url or the domain parameter is mandatory. The others are optional or have a default value:

  • url (string): the target URL for which a translation result is required.
  • domain (string): the domain for which translation results must be returned. This is ignored if the url parameter was provided.
  • path (string | undefined): the path in the domain given for which a translation result must be returned. If undefined, it will return translation results for all paths for which there is a template or a test case configured (as used by the Web2Cit monitor). This parameter is ignored if the url parameter was provided.
  • citoid (boolean; default: false): prepends the citation returned by Citoid to the array of citations returned for each target URL (see T307393 for support with HTML or JSON response formats).
  • debug (boolean; default: false): includes debugging information for each translation target. Note this is not supported with MediaWiki response format.
  • format (html | json | mediawiki; default: html): the format in which translation results must be returned (see Response formats section below).
  • sandbox (string): an optional parameter indicating the username whose sandbox storage will be used to fetch configurations from (see Sandbox configurations section).
  • tests? (boolean): whether translation tests should be used.
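As an illustration of how these parameters combine, the sketch below builds a /translate URL from a typed description of the request. The parameter names and the "either url or domain" rule come from the list above; the helper itself (buildTranslateUrl, TargetSelector, TranslateOptions) is hypothetical and is not part of any Web2Cit package.

```typescript
// Either a full target URL, or a domain with an optional path.
type TargetSelector =
  | { url: string }
  | { domain: string; path?: string };

interface TranslateOptions {
  citoid?: boolean;
  debug?: boolean;
  format?: "html" | "json" | "mediawiki";
  sandbox?: string;
  tests?: boolean;
}

// Hypothetical helper: assembles the query string for the /translate endpoint.
function buildTranslateUrl(
  target: TargetSelector,
  options: TranslateOptions = {},
  base: string = "https://web2cit.toolforge.org"
): string {
  // Debugging is documented as unsupported with the MediaWiki format.
  if (options.debug === true && options.format === "mediawiki") {
    throw new Error("debug is not supported with the mediawiki format");
  }
  const pairs: [string, string][] = [];
  if ("url" in target) {
    pairs.push(["url", target.url]);
  } else {
    pairs.push(["domain", target.domain]);
    if (target.path !== undefined) pairs.push(["path", target.path]);
  }
  if (options.citoid !== undefined) pairs.push(["citoid", String(options.citoid)]);
  if (options.debug !== undefined) pairs.push(["debug", String(options.debug)]);
  if (options.format !== undefined) pairs.push(["format", options.format]);
  if (options.sandbox !== undefined) pairs.push(["sandbox", options.sandbox]);
  if (options.tests !== undefined) pairs.push(["tests", String(options.tests)]);
  const query = pairs
    .map(([key, value]) => key + "=" + encodeURIComponent(value))
    .join("&");
  return base + "/translate?" + query;
}
```

Modelling the target as a union type makes it impossible to pass both url and domain, which matches the note that domain and path are ignored when url is provided.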

Response formats

Translation results may be returned in one of three available formats, as requested via the format URL query string parameter:

HTML

This is the default format. It returns a translation summary web page including translation results, grouped by translation target and URL path pattern group.

Translation results list a series of translation fields, each including the translation output (as returned from the applicable translation template), the expected output (as defined in the translation test), and the test score.

A dash (-) indicates an empty output, and n/a in the expected output or score columns indicates that an expected output has not been defined.

In addition, the page includes embedded metadata using property names from the http://www.zotero.org/namespaces/export# vocabulary. These can be directly interpreted by Zotero's Embedded Metadata translator, and hence can be used with Wikipedia's unmodified automatic citation generator (i.e., without the Web2Cit user script installed), or with other tools relying on Zotero translators, such as Zotero browser connectors and ZoteroBib.

MediaWiki

The MediaWiki format is a JSON response mimicking the mediawiki-basefields response format from the Citoid API, for consumption from the Web2Cit user script.

JSON

This is the most complete return format, and is used by the Web2Cit monitor:

Version 1.0:
  • info:
    • apiVersion: a string indicating the version number of Web2Cit server.
    • config?: an object with information about the configuration files used (undefined in case of fetching error), including:
      • patterns:
        • path: path to the patterns.json configuration file on the MediaWiki instance used by the Web2Cit storage. For example, Web2Cit/data/com/example/www/patterns.json.
        • revid: revision ID of the patterns.json configuration file used; undefined if file does not exist or is corrupt.
      • templates: same structure as patterns, for templates.json.
      • tests: same structure as patterns, for tests.json.
  • data?: an object with translation data (undefined in case of overall translation error), including:
    • targets: an array of objects, one per target path requested, each including:
      • path: the path requested for translation.
      • href?: the full URL to the target webpage; will be undefined if the target path is invalid.
      • pattern?: the URL path pattern group to which the target path belongs; will be undefined if the target path is invalid.
      • results: an array of translation results, one per translation template. Normally, only one translation result would be returned,[Note 2] corresponding to the first applicable template, or none if no applicable template has been found (see T317448, though). Each translation result includes:
        • template: an object with information about the translation template used, including:
          • path?: the path to the webpage on which the template is based; undefined if it is the fallback template.
          • label?: the (optional) fancy name given to the template.
        • fields: an array of translation field objects, each including:
          • name: the name of the translation field.
          • output: an array of strings representing the template field output; if a template field has not been defined for this translation field, an empty array will be returned;
          • test: an array of strings representing the expected output (as defined in the translation test); will be undefined if a test field has not been defined for this translation field.
          • score: the test score resulting from the comparison between the translation and the expected outputs; undefined if an expected output is not available.
        • score: the average test score across all translation fields; undefined if no translation score is defined for any field.
      • score: the average test score across all translation results (from the same target); undefined if no translation score defined for any result.
      • error?: if an error is thrown during target translation, it will be included here.
      • debug?: if the server has been called with the debug option (see URL query string parameters section above), this will include an object with detailed information for debugging. Check the DebugJson type on the ./src/types.ts file of the w2c-server repository to find out more about it.
    • score: the average translation score across translation targets; undefined if no translation score is defined for any target.
  • error?: if a general error affecting translation of all targets occurred, the Error object thrown will be included here.
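The score properties described above are averages that become undefined when no underlying score exists. A minimal sketch of that aggregation follows. It assumes that undefined scores are simply left out of the average (the description above does not say how partially scored sets are handled), and the names FieldResult, averageDefined and resultScore are hypothetical.

```typescript
// Hypothetical minimal shape of a translation field in the JSON response.
interface FieldResult {
  name: string;
  output: string[];
  test: string[] | undefined;
  score: number | undefined;
}

// Averages the defined scores; returns undefined if none is defined.
// Assumption: undefined scores are excluded rather than counted as zero.
function averageDefined(scores: (number | undefined)[]): number | undefined {
  const defined = scores.filter((s): s is number => s !== undefined);
  if (defined.length === 0) return undefined;
  return defined.reduce((sum, s) => sum + s, 0) / defined.length;
}

// Score of one translation result: average across its translation fields.
function resultScore(fields: FieldResult[]): number | undefined {
  return averageDefined(fields.map((f) => f.score));
}
```

The same averageDefined helper could be applied one level up, to the scores of all results of a target, and again to the scores of all targets.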

Errors

General errors

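These will be included in a root error property of the JSON or MediaWiki response formats, or returned as plain text on the HTML response format. They include the Invalid query, No target, Invalid target and Invalid domain errors, and the unsupported debug or test mode for MediaWiki format error, all of which are listed under Response codes below.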

Target errors

Target-specific errors will be included in an error property under the corresponding targets array object of the JSON response format, or as HTML text under the corresponding target section of the HTML response format:

  • Invalid path error: the target's path is not a valid path. For example: https://web2cit.toolforge.org/translate?domain=www.example.com&path=abc.
  • NoApplicableTemplateError: no applicable template has been found for the translation target. This would happen when not even the fallback template is applicable; see for example T313236.
  • Any target translation error included in one of the target outputs returned by the translate method of Web2Cit core's Domain object (see the Core documentation).

These errors will not show on the MediaWiki response format, which comprises an array of citations, and targets throwing an error during translation will not return a citation.

Response codes

Web2Cit server currently responds with HTTP response status code 200, except:

  • status code 400 (bad request):
    • on Invalid query error (see Errors section),
    • on unsupported debug or test mode for MediaWiki format error (see Errors section),
    • on No target error (see Errors section),
    • on Invalid target error (see Errors section),
    • on Invalid domain error (see Errors section);
  • status code 404 (not found):
    • if no citation has been returned, either because no target paths have been specified, or because no applicable template has been found for any target.
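A client consuming the server could branch on these status codes roughly as follows. This is a sketch only: the outcome labels and the classifyStatus function are hypothetical, and any status code not listed above is simply flagged as unexpected.

```typescript
// Hypothetical outcome labels for the status codes documented above.
type TranslateOutcome = "ok" | "bad-request" | "no-citation" | "unexpected";

function classifyStatus(statusCode: number): TranslateOutcome {
  switch (statusCode) {
    case 200:
      // Normal response.
      return "ok";
    case 400:
      // Invalid query, no target, invalid target or domain, or unsupported
      // debug or test mode for the MediaWiki format.
      return "bad-request";
    case 404:
      // No citation returned: no target paths specified, or no applicable
      // template found for any target.
      return "no-citation";
    default:
      return "unexpected";
  }
}
```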

Translation

Web2Cit server's Home and HTML-format translation summary pages are translated collaboratively on translatewiki.net.

Help

If you need help or would like to report a bug, suggest a feature, etc, you can leave a comment in this page's discussion page, or create a task in Phabricator, with the web2cit-server project tag.

Beta testers

There is an additional server instance running at https://w2c-beta.toolforge.org/. New versions of the Web2Cit server may be available here for public testing, until they are ready for deployment to the production server.

Technical information

The Web2Cit server's code is available under a GNU GPL v3 license on a Wikimedia GitLab repository: https://gitlab.wikimedia.org/diegodlh/w2c-server.

The code is written in TypeScript and built with tsc. Package management is done with npm.

We use the Express framework to set up the server.

It runs from a Toolforge tool account, available at https://web2cit.toolforge.org/.

Development

You will need git, node, npm and nvm.

  1. Clone the w2c-server git repository, or your fork of it: git clone https://gitlab.wikimedia.org/diegodlh/w2c-server.git.
  2. Change to the repository's directory: cd w2c-server.
  3. To make sure you are using the same version of Node as the one running on Toolforge, run nvm use. This will switch to the Node version indicated in the file ./.nvmrc (if that version is not installed yet, run nvm install first).
  4. Install the required dependencies by running npm install. The repository is configured (via the engine-strict=true option in ./.npmrc) to enforce Node and npm minimum versions indicated in the package.json file. If you are having trouble with the default version of npm installed with nvm, install a more recent version with nvm install --latest-npm and run npm install again.
  5. To start the development server, run npm run dev. This should build the project and serve the app from the ./dist/ directory to http://localhost:3000/.

Web2Cit core usage

As mentioned in the introduction, the Web2Cit server exposes the functionalities of the Web2Cit core via a web service. For this reason, the Web2Cit core (npm package web2cit) is one of its dependencies.

To better understand how Web2Cit server makes use of this library, you may check the brief explanation included in the Core documentation to showcase Web2Cit core capabilities.

Using local Web2Cit core

If you want the Web2Cit server to use your local build of Web2Cit core, follow these steps:

  1. On the Web2Cit core directory:
    1. run npm link. This will create a global symlink for the Web2Cit core dependency.
    2. run npm run build. Make sure you run this again after any changes made to the Web2Cit core source code.
  2. On the Web2Cit server directory:
    1. run npm link web2cit. This will make the Web2Cit server use the global symlink for Web2Cit core instead of the package downloaded from npm.
    2. run npm install. You may need to run this again after some changes made to the Web2Cit core source code.

Debug with Visual Studio Code

The repository includes a .vscode/launch.json debug configuration file for the Visual Studio Code editor, with custom settings to ensure breakpoints can be set onto source files from the web2cit module (configuration/outFiles), even if a local build is being used (--preserve-symlinks as runtime argument).

To start debugging:

  1. Build Web2Cit server using npm run build as explained above.
  2. On Visual Studio Code, open the Run and Debug pane and start debugging using the Launch Program custom configuration. This will run node dist/app.js and attach the debugger to it. You should see a server is listening on 3000! message on the debug console.
  3. Set breakpoints where you want the program to stop for debugging purposes.

Automatic tests

Automatic tests of the Web2Cit server have not been implemented yet. See T305564.

However, in the meantime, it may be worth considering the Web2Cit monitor (which uses translation tests defined by Web2Cit collaborators) as a way to semi-automatically check that changes made to the server's source code do not result in unexpected side effects:

  1. Download and install the Web2Cit monitor locally.
  2. Run the monitor with --all and --log arguments, to run checks for all configured domains, and (importantly!) to write results locally, respectively.
  3. Move the result files to a separate directory, to avoid overwriting them below.
  4. Change the monitor's source code to use the server build that you want to test, and run it again with the same arguments.
  5. Finally, use a diff tool to compare both sets of result files and identify unexpected differences between them.

This test procedure is a temporary workaround and not a replacement for proper automatic tests. For example, tests would be limited to the set of server functions used by the monitor, and to the features relied upon by the collaboratively defined translation tests. In addition, it involves fetching data from third-party web servers, whose responses may also change upon repeated requests.

Document changes

Use the changelog to document changes, as described here. Keep changes under the "Unreleased" section, until a new version is ready to be deployed (see below).

Deployment

To deploy the server to production, repeat steps 1-3 above on your host. Read below for the special case of Toolforge.

Running from Toolforge

This section describes how the Web2Cit server is set up to run from the Toolforge servers. If you want to run the Web2Cit server locally or on a private host, you won't need this.

The Web2Cit server is running from the web2cit Toolforge account. The following steps were followed to set up and run the web server:

  1. Login to Toolforge: ssh login.toolforge.org. Note that you must have Toolforge access to do this. Follow the steps here if you don't.
  2. Become the web2cit tool account: become web2cit. This account was created by following the steps here. Note that there is an alternative w2c-beta account for tests.
  3. Clone the git repository.
  4. Edit the service.template file so that webservice commands below use the following predefined arguments:
    • backend: kubernetes
    • type: node16
  5. Open a webservice shell to make sure you are using the right version of node and npm to install dependencies and build: webservice shell
    1. Change to the w2c-server directory and run npm install.
    2. Run npm run build. This will compile the source code to the w2c-server/dist directory.
    3. Run exit to quit from the webservice shell.
  6. Run webservice start. By convention, this will run npm start from www/js (symlinked to w2c-server/dist) to start the web server.

Note that the web2cit account also hosts the (pre-alpha) Web2Cit integrated editor. Its files are statically served from www/static (symlinked to w2c-editor/build) via https://tools-static.wmflabs.org/web2cit/. See the integrated editor documentation for the details.

Toolforge logs

To check the logs of the Kubernetes container initialized by webservice:

  1. Run kubectl get pods to find the name of the container's parent pod.
  2. Run kubectl logs <pod_name> to see the logs of the container's current instantiation. Because containers will be restarted automatically upon failure, to see the logs of the container's previous instantiation (e.g. after a crash) add --previous at the end of the command.

Versioning

Consider creating a new version before deployment. To do so:

  1. Move changes from the "Unreleased" section at the top of the changelog file to the new version's section.
  2. Run npm version --no-git-tag-version with the corresponding version increment argument (e.g., prerelease). Use --no-git-tag-version to skip the automatic commit.
  3. Stage CHANGELOG.md, package.json and package-lock.json changes.
  4. Commit as "Bump vX.Y.Z" and tag as "vX.Y.Z".
  5. Push commit and tag.

Home and result pages

The home and HTML-format result pages are created using React, and server-side rendered using renderToStaticMarkup.

i18n

Home and HTML-format result pages are internationalized using i18next.

Translated messages are located under locales/.

See T317044 for discussion around collaboratively translating this via Translatewiki.

JSON editor

In addition to returning Web2Cit translation results at the root (/) and /translate endpoints, the Web2Cit server currently serves the JSON configuration file editor statically at /edit.html.

As described in the JSON editor documentation, this JSON editor uses some URL query parameters, including a link to a JSON schema file, to render a JSON editing wizard.

The source code for this JSON editor is currently part of the Web2Cit server repository, but it is prepared to be split into a separate general-purpose MediaWiki JSON editor project if desired (see T306837).

Linting and formatting

We use ESLint as linter, with TypeScript support via typescript-eslint.

Prettier is used as formatter, and eslint-config-prettier is used to disable ESLint rules that may conflict with it.

lint-staged runs ESLint and Prettier via a Husky pre-commit hook.

Other dependencies

JSDOM is used to create the window object that is needed to create Web2Cit core's root Domain object. This is currently used to provide XPath functionality.

Notes

  1. Sandbox and debugging aliases can be combined as https://web2cit.toolforge.org/sandbox/<YourUserName>/debug/<YourTargetURL> or https://web2cit.toolforge.org/debug/sandbox/<YourUserName>/<YourTargetURL>.
  2. More than one translation result may be supported if the server ever supports returning results for all applicable templates, a function supported by Web2Cit core via the allTemplates option of the Domain's translate method. Also, see T307393 for another case where more than one translation result could be returned.