Web2Cit/Docs/Server
The Web2Cit server is the part of the Web2Cit ecosystem that makes the functionalities of the Web2Cit core available via a web service for consumption from other parts of the ecosystem, such as the Web2Cit user script and the Web2Cit monitor, as well as from projects relying on Zotero translators, such as Zotero browser connectors or ZoteroBib.
It returns translation results for one or more target URLs using the corresponding domain configurations defined by Web2Cit collaborators, as available from the Web2Cit storage repository on Meta-Wiki.
How to use
The simplest way to use the translation server is to go to its home page (https://web2cit.toolforge.org/) and enter the target URL for which you would like translation results. Alternatively, go directly to https://web2cit.toolforge.org/<YourTargetURL>. This is an alias of https://web2cit.toolforge.org/translate?tests=true&url=<YourTargetURL> (see URL query string parameters below).
If you have the Web2Cit user script installed, the translation summary for a target webpage may also be opened from Wikipedia, by clicking the "Web2Cit" link that appears beneath the citation results on the "Add a citation" dialog.
Note that the translation summary returned also includes the translation results as embedded metadata. Therefore, you can use this URL in Wikipedia's unmodified automatic citation generator (i.e., without the Web2Cit user script installed), or with other tools relying on Zotero translators, such as Zotero browser connectors and ZoteroBib.
Sandbox configurations
Configuration files may be saved to a personal sandbox storage to experiment with them without affecting all Web2Cit users (see the Editing documentation to learn how to do this using the JSON editor).
To instruct the Web2Cit server to use configuration files from your personal sandbox, enter your Wikimedia username in the field next to "Switch to sandbox configuration:" on the translation summary page and click "Switch". The server will then use the configuration files under User:<YourUserName>/Web2Cit/data/ on Meta-Wiki.
Alternatively, you may go directly to https://web2cit.toolforge.org/sandbox/<YourUserName>/<YourTargetURL>.[Note 1] This is an alias of https://web2cit.toolforge.org/translate?tests=true&sandbox=<YourUserName>&url=<YourTargetURL> (see URL query string parameters section below).
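To make the storage layout concrete, the following is a minimal sketch of how the Meta-Wiki page title of a configuration file might be derived. It assumes that the domain labels are reversed into path segments, which is inferred from the example path Web2Cit/data/com/example/www/patterns.json given in the JSON response format section; the function name configTitle and the username ExampleUser are hypothetical, and this is not the server's actual code.

```typescript
// Hypothetical helper, not part of Web2Cit: builds the Meta-Wiki page title
// of a configuration file for a given domain.
type ConfigFile = "patterns.json" | "templates.json" | "tests.json";

function configTitle(
  domain: string,
  file: ConfigFile,
  sandboxUser?: string
): string {
  // Assumption: domain labels are reversed into path segments, as suggested
  // by the documented example Web2Cit/data/com/example/www/patterns.json.
  const reversed = domain.toLowerCase().split(".").reverse().join("/");
  // Sandbox files live under the user's own page prefix.
  const root = sandboxUser
    ? `User:${sandboxUser}/Web2Cit/data/`
    : "Web2Cit/data/";
  return `${root}${reversed}/${file}`;
}

console.log(configTitle("www.example.com", "patterns.json"));
console.log(configTitle("www.example.com", "tests.json", "ExampleUser"));
```

With a sandbox username, the same relative path is simply placed under that user's page prefix, which is why switching to a sandbox does not require any change to the configuration files themselves.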
Debugging
Additional debugging information may be included for each translation target, which may help you understand why translation did not work as expected.
To do so, click "Enable debugging" at the bottom of the translation summary page.
Alternatively, you may go to https://web2cit.toolforge.org/debug/<YourTargetURL>.[Note 1] This is an alias of https://web2cit.toolforge.org/translate?tests=true&debug=true&url=<YourTargetURL> (see URL query string parameters section below).
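The shortcut URLs can be thought of as path prefixes that expand into /translate parameters. The sketch below illustrates that mapping, including the combined sandbox and debug forms mentioned in Note 1. The names TranslateQuery and expandAlias are made up for this example, and the parsing is deliberately simplified (it ignores URL decoding and any query string in the target URL); the server's real routing may differ.

```typescript
// Illustrative only: the equivalent /translate parameters of a shortcut path.
interface TranslateQuery {
  url: string;
  tests: boolean;
  debug?: boolean;
  sandbox?: string;
}

function expandAlias(path: string): TranslateQuery {
  let rest = path.replace(/^\//, "");
  const query: TranslateQuery = { url: "", tests: true };
  // The debug and sandbox prefixes may come in either order,
  // so keep consuming prefixes until none matches.
  for (;;) {
    if (rest.startsWith("debug/")) {
      query.debug = true;
      rest = rest.slice("debug/".length);
    } else if (rest.startsWith("sandbox/")) {
      const afterPrefix = rest.slice("sandbox/".length);
      const slash = afterPrefix.indexOf("/");
      if (slash === -1) throw new Error("No target");
      query.sandbox = afterPrefix.slice(0, slash);
      rest = afterPrefix.slice(slash + 1);
    } else {
      break;
    }
  }
  // Whatever remains is the target URL; an empty remainder means it was omitted.
  if (rest === "") throw new Error("No target");
  query.url = rest;
  return query;
}

console.log(expandAlias("/debug/https://example.com/some/page"));
console.log(expandAlias("/sandbox/ExampleUser/debug/https://example.com/"));
```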
Debugging information
The debugging information includes details for each template and, within it, for each template field, each translation procedure, and each selection or transformation step.
Note that, when parsing configuration files, Web2Cit ignores invalid definitions. These ignored elements will not show up in the debugging information.
URL query string parameters
The Web2Cit server provides a /translate endpoint which takes a series of URL query string parameters. The URLs indicated in the previous sections are just shortcuts or aliases to this endpoint with specific combinations of parameter values.
One of the url or domain parameters is mandatory. The others are optional or have a default value:
- url (string): the target URL for which a translation result is required.
- domain (string): the domain for which translation results must be returned. This is ignored if the url parameter was provided.
- path (string | undefined): the path in the given domain for which a translation result must be returned. If undefined, it will return translation results for all paths for which there is a template or a test case configured (as used by the Web2Cit monitor). This parameter is ignored if the url parameter was provided.
- citoid (boolean; default: false): prepends the citation returned by Citoid to the array of citations returned for each target URL (see T307393 for support with HTML or JSON response formats).
- debug (boolean; default: false): includes debugging information for each translation target. Note this is not supported with MediaWiki response format.
- format (html | json | mediawiki; default: html): the format in which translation results must be returned (see Response formats section below).
- sandbox (string): an optional parameter indicating the username whose sandbox storage will be used to fetch configurations from (see Sandbox configurations section).
- tests? (boolean): whether translation tests should be used.
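As an illustration of how these parameters combine, here is a small client-side sketch that assembles a /translate request URL. The TranslateParams interface and the buildTranslateUrl function are hypothetical names introduced for this example; the two checks only mirror the rules stated in this documentation (url or domain is required, and debug and tests are not supported with the MediaWiki format).

```typescript
// Hypothetical client-side helper; parameter names follow the list above.
interface TranslateParams {
  url?: string;
  domain?: string;
  path?: string;
  citoid?: boolean;
  debug?: boolean;
  format?: "html" | "json" | "mediawiki";
  sandbox?: string;
  tests?: boolean;
}

function buildTranslateUrl(
  params: TranslateParams,
  base = "https://web2cit.toolforge.org"
): string {
  if (!params.url && !params.domain) {
    throw new Error("Either url or domain is required");
  }
  if (params.format === "mediawiki" && (params.debug || params.tests)) {
    throw new Error("debug and tests are not supported with the mediawiki format");
  }
  const keys: (keyof TranslateParams)[] = [
    "url", "domain", "path", "citoid", "debug", "format", "sandbox", "tests",
  ];
  const query = new URLSearchParams();
  for (const key of keys) {
    const value = params[key];
    // Only parameters that were explicitly given are sent.
    if (value !== undefined) query.set(key, String(value));
  }
  return `${base}/translate?${query.toString()}`;
}

console.log(buildTranslateUrl({ domain: "www.example.com", path: "/", format: "json" }));
```

URLSearchParams takes care of percent-encoding the target URL, which matters because the target usually contains characters such as ":" and "/" that should not be left raw inside a query string value.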
Response formats
Translation results may be returned in one of three available formats, as requested via the format URL query string parameter:
HTML
This is the default format. It returns a translation summary web page including translation results, grouped by translation target and URL path pattern group.
Translation results list a series of translation fields, each including the translation output (as returned from the applicable translation template), the expected output (as defined in the translation test), and the test score.
A dash (-) indicates an empty output, and n/a in the expected output or score columns indicates that an expected output has not been defined.
In addition, the page includes embedded metadata using property names from the http://www.zotero.org/namespaces/export# vocabulary. These can be directly interpreted by Zotero's Embedded Metadata translator, and hence can be used with Wikipedia's unmodified automatic citation generator (i.e., without the Web2Cit user script installed), or with other tools relying on Zotero translators, such as Zotero browser connectors and ZoteroBib.
MediaWiki
The MediaWiki format is a JSON response mimicking the mediawiki-basefields response format from the Citoid API, for consumption from the Web2Cit user script.
JSON
This is the most complete return format, and is used by the Web2Cit monitor:
Version 1.0:
- info:
  - apiVersion: a string indicating the version number of Web2Cit server.
  - config?: an object with information about the configuration files used (undefined in case of fetching error), including:
    - patterns:
      - path: path to the patterns.json configuration file on the MediaWiki instance used by the Web2Cit storage. For example, Web2Cit/data/com/example/www/patterns.json.
      - revid: revision ID of the patterns.json configuration file used; undefined if the file does not exist or is corrupt.
    - templates: same as patterns, for templates.json.
    - tests: same as patterns, for tests.json.
- data?: an object with translation data (undefined in case of overall translation error), including:
  - targets: an array of objects, one per target path requested, each including:
    - path: the path requested for translation.
    - href?: the full URL to the target webpage; will be undefined if the target path is invalid.
    - pattern?: the URL path pattern group to which the target path belongs; will be undefined if the target path is invalid.
    - results: an array of translation results, one per translation template. Normally, only one translation result would be returned,[Note 2] corresponding to the first applicable template, or none if no applicable template has been found (see T317448, though). Each translation result includes:
      - template: an object with information about the translation template used, including:
        - path?: the path to the webpage on which the template is based; undefined if it is the fallback template.
        - label?: the (optional) fancy name given to the template.
      - fields: an array of translation field objects, each including:
        - name: the name of the translation field.
        - output: an array of strings representing the template field output; if a template field has not been defined for this translation field, an empty array will be returned.
        - test: an array of strings representing the expected output (as defined in the translation test); will be undefined if a test field has not been defined for this translation field.
        - score: the test score resulting from the comparison between the translation and the expected outputs; undefined if an expected output is not available.
      - score: the average test score across all translation fields; undefined if no translation score is defined for any field.
    - score: the average test score across all translation results (from the same target); undefined if no translation score is defined for any result.
    - error?: if an error is thrown during target translation, it will be included here.
    - debug?: if the server has been called with the debug option (see URL query string parameters section above), this will include an object with detailed information for debugging. Check the DebugJson type in the ./src/types.ts file of the w2c-server repository to find out more about it.
  - score: the average translation score across translation targets; undefined if no translation score is defined for any target.
- error?: if a general error affecting translation of all targets occurred, the Error object thrown will be included here.
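For consumers of the JSON format, the structure described above could be typed roughly as follows. This is a reconstruction from the prose, not a copy of the types in the w2c-server repository: the interface names are invented, the numeric type of revid is an assumption, and debug and error are left as unknown. The averageScore helper shows one plausible reading of the score rules, assuming that undefined scores are skipped rather than counted as zero.

```typescript
// Type names are hypothetical; property names follow the description above.
interface FieldResult {
  name: string;
  output: string[];
  test?: string[];
  score?: number;
}

interface TemplateResult {
  template: { path?: string; label?: string };
  fields: FieldResult[];
  score?: number;
}

interface TargetResult {
  path: string;
  href?: string;
  pattern?: string;
  results: TemplateResult[];
  score?: number;
  error?: unknown;
  debug?: unknown; // see the DebugJson type in the server repository
}

interface ConfigFileInfo {
  path: string;
  revid?: number; // assumption: revision IDs are numeric
}

interface TranslateResponse {
  info: {
    apiVersion: string;
    config?: {
      patterns: ConfigFileInfo;
      templates: ConfigFileInfo;
      tests: ConfigFileInfo;
    };
  };
  data?: { targets: TargetResult[]; score?: number };
  error?: unknown;
}

// Averages the defined scores; returns undefined when none is defined.
// Assumption: undefined scores are skipped, not treated as zero.
function averageScore(scores: (number | undefined)[]): number | undefined {
  const defined = scores.filter((s): s is number => s !== undefined);
  if (defined.length === 0) return undefined;
  return defined.reduce((sum, s) => sum + s, 0) / defined.length;
}

console.log(averageScore([1, 0.5, undefined])); // 0.75
```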
Errors
General errors
These will be included in a root error property of the JSON or MediaWiki response formats, or returned as plain text in the HTML response format:
- Invalid query: the URL query string is malformed. For example: https://web2cit.toolforge.org/translate?abc.
- Unsupported debug or test modes: the MediaWiki format was requested with the debug or tests parameters, which this format does not support, set to true. For example: https://web2cit.toolforge.org/translate?domain=www.example.com&format=mediawiki&debug=true.
- No target: one of the URL shortcuts was used and the target URL was omitted. For example: https://web2cit.toolforge.org/debug/.
- Invalid target: a URL target has been provided, but the URL is invalid. For example: https://web2cit.toolforge.org/translate?format=json&url=abc.
- Invalid domain: a domain target has been provided, but it is invalid. For example: https://web2cit.toolforge.org/translate?domain=abc.
Target errors
Target-specific errors will be included in an error property under the corresponding targets array object of the JSON response format, or as HTML text under the corresponding target section of the HTML response format:
- Invalid path error: the target's path is not a valid path. For example: https://web2cit.toolforge.org/translate?domain=www.example.com&path=abc.
- NoApplicableTemplateError: no applicable template has been found for the translation target. This would happen when not even the fallback template is applicable; see for example T313236.
- Any target translation error included in one of the target outputs returned by the translate method of Web2Cit core's Domain object (see the Core documentation).
These errors will not show in the MediaWiki response format, which comprises an array of citations; targets throwing an error during translation will not return a citation.
Response codes
Web2Cit server currently responds with HTTP response status code 200, except:
- status code 400 (bad request):
  - on Invalid query error (see Errors section),
  - on unsupported debug or test mode for MediaWiki format error (see Errors section),
  - on No target error (see Errors section),
  - on Invalid target error (see Errors section),
  - on Invalid domain error (see Errors section);
- status code 404 (not found):
  - if no citation has been returned, either because no target paths have been specified, or because no applicable template has been found for any target.
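The rules above amount to a short decision procedure, sketched below. The GeneralError labels and the expectedStatus function are hypothetical names chosen for this illustration; they are not identifiers from the server's source code.

```typescript
// Hypothetical labels for the general errors listed in the Errors section.
type GeneralError =
  | "InvalidQuery"
  | "UnsupportedMode"
  | "NoTarget"
  | "InvalidTarget"
  | "InvalidDomain";

// Returns the HTTP status code expected under the rules described above.
function expectedStatus(
  generalError: GeneralError | undefined,
  citationCount: number
): number {
  // Any general error is reported as a bad request.
  if (generalError !== undefined) return 400;
  // No general error, but nothing could be translated.
  if (citationCount === 0) return 404;
  return 200;
}

console.log(expectedStatus("InvalidDomain", 0)); // 400
console.log(expectedStatus(undefined, 0)); // 404
console.log(expectedStatus(undefined, 2)); // 200
```

Note that target-specific errors do not change the status code on their own: a response is still 200 as long as at least one citation is returned.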
Translation
Web2Cit server's Home and HTML-format translation summary pages are translated collaboratively on translatewiki.net.
Help
If you need help or would like to report a bug, suggest a feature, etc., you can leave a comment on this page's discussion page, or create a task in Phabricator with the web2cit-server project tag.
Beta testers
There is an additional server instance running at https://w2c-beta.toolforge.org/. New versions of the Web2Cit server may be available here for public testing, until they are ready for deployment to the production server.
Technical information
The Web2Cit server's code is available under a GNU GPL v3 license on a Wikimedia GitLab repository: https://gitlab.wikimedia.org/diegodlh/w2c-server.
The code is written in TypeScript and built with tsc. Package management is done with npm.
We use the Express framework to set up the server.
It runs from a Toolforge tool account, available at https://web2cit.toolforge.org/.
Development
You will need git, node, npm and nvm.
- Clone the w2c-server git repository, or your fork of it: git clone https://gitlab.wikimedia.org/diegodlh/w2c-server.git.
- Change to the repository's directory: cd w2c-server.
- To make sure you are using the same version of Node as the one running on Toolforge, run nvm use. This will switch to the Node version indicated in the ./.nvmrc file; if that version is not installed yet, run nvm install first.
- Install the required dependencies by running npm install. The repository is configured (via the engine-strict=true option in ./.npmrc) to enforce the Node and npm minimum versions indicated in the package.json file. If you are having trouble with the default version of npm installed with nvm, install a more recent version with nvm install --latest-npm and run npm install again.
- To start the development server, run npm run dev. This should build the project and serve the app from the ./dist/ directory at http://localhost:3000/.
Web2Cit core usage
As mentioned in the introduction, the Web2Cit server exposes the functionalities of the Web2Cit core via a web service. For this reason, the Web2Cit core (npm package web2cit) is one of its dependencies.
To better understand how Web2Cit server makes use of this library, you may check the brief explanation included in the Core documentation to showcase Web2Cit core capabilities.
Using local Web2Cit core
If you want the Web2Cit server to use your local build of Web2Cit core, follow these steps:
- On the Web2Cit core directory:
  - run npm link. This will create a global symlink for the Web2Cit core dependency.
  - run npm run build. Make sure you run this again after any changes made to the Web2Cit core source code.
- On the Web2Cit server directory:
  - run npm link web2cit. This will make the Web2Cit server use the global symlink for Web2Cit core instead of the package downloaded from npm.
  - run npm install. You may need to run this again after some changes made to the Web2Cit core source code.
Debug with Visual Studio Code
The repository includes a .vscode/launch.json debug configuration file for the Visual Studio Code editor, with custom settings to ensure breakpoints can be set on source files from the web2cit module (configuration/outFiles), even if a local build is being used (--preserve-symlinks as runtime argument).
To start debugging:
- Build Web2Cit server using npm run build as explained above.
- In Visual Studio Code, open the Run and Debug pane and start debugging using the Launch Program custom configuration. This will run node dist/app.js and attach the debugger to it. You should see a "server is listening on 3000!" message on the debug console.
- Set breakpoints where you want the program to stop for debugging purposes.
Automatic tests
Automatic tests of the Web2Cit server have not been implemented yet. See T305564.
In the meantime, it may be worth considering the Web2Cit monitor (which uses translation tests defined by Web2Cit collaborators) as a way to semi-automatically check that changes made to the server's source code do not result in unexpected side effects:
- Download and install the Web2Cit monitor locally.
- Run the monitor with the --all and --log arguments, to run checks for all configured domains and (importantly!) to write results locally, respectively.
- Move the result files to a separate directory, to avoid overwriting them in the next step.
- Change the monitor's source code to use the server build you want to test, and run it again with the same arguments.
- Finally, use a diff tool to compare both sets of result files and identify unexpected differences between them.
This test procedure is a temporary workaround and not a replacement for proper automatic tests. For example, tests would be limited to the set of server functions used by the monitor, and to the features relied upon by the collaboratively defined translation tests. In addition, it involves fetching data from third-party web servers, whose responses may also change upon repeated requests.
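The final comparison step can be done with any diff tool. As a rough sketch of what that comparison involves, the function below compares two sets of result files that have already been loaded into memory as file name to content maps. The format of the monitor's result files is not described here, so contents are treated as opaque strings; ResultSet, Comparison and compareResults are hypothetical names for this example.

```typescript
// Hypothetical in-memory representation: file name -> file content.
type ResultSet = Record<string, string>;

interface Comparison {
  added: string[];   // files only present in the second run
  removed: string[]; // files only present in the first run
  changed: string[]; // files present in both runs with different content
}

function compareResults(before: ResultSet, after: ResultSet): Comparison {
  const added: string[] = [];
  const removed: string[] = [];
  const changed: string[] = [];
  for (const name of Object.keys(after)) {
    if (!(name in before)) added.push(name);
    else if (before[name] !== after[name]) changed.push(name);
  }
  for (const name of Object.keys(before)) {
    if (!(name in after)) removed.push(name);
  }
  return { added: added.sort(), removed: removed.sort(), changed: changed.sort() };
}

const report = compareResults(
  { "a.example.com": "score 1", "b.example.com": "score 0.5" },
  { "a.example.com": "score 1", "b.example.com": "score 0.4", "c.example.com": "score 1" }
);
console.log(report);
```

Because target webpages may change between runs, a file listed as changed is a prompt for manual inspection rather than proof of a regression in the server.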
Document changes
Use the changelog to document changes, as described here. Keep changes under the "Unreleased" section, until a new version is ready to be deployed (see below).
Deployment
To deploy the server to production, repeat steps 1-3 above on your host. Read below for the special case of Toolforge.
Running from Toolforge
This section describes how Web2Cit server is set up to run from the Toolforge servers. If you want to run Web2Cit server locally or on a private host, you won't need this.
The Web2Cit server runs from the web2cit Toolforge account. The following steps were followed to set up and run the web server:
- Log in to Toolforge: ssh login.toolforge.org. Note that you must have Toolforge access to do this. Follow the steps here if you don't.
- Become the web2cit tool account: become web2cit. This account was created by following the steps here. Note that there is an alternative w2c-beta account for tests.
- Clone the git repository.
- Edit the service.template file so that the webservice commands below use the following predefined arguments:
  - backend: kubernetes
  - type: node16
- Open a webservice shell to make sure you are using the right version of node and npm to install dependencies and build: webservice shell
  - Change to the w2c-server directory and run npm install.
  - Run npm run build. This will compile the source code to the w2c-server/dist directory.
  - Run exit to quit from the webservice shell.
- Run webservice start. By convention, this will run npm start from www/js (symlinked to w2c-server/dist) to start the web server.
Note that the web2cit account also hosts the (pre-alpha) Web2Cit integrated editor. Its files are statically served from www/static (symlinked to w2c-editor/build) via https://tools-static.wmflabs.org/web2cit/. See the integrated editor documentation for the details.
Toolforge logs
To check the logs of the Kubernetes container initialized by webservice:
- Run kubectl get pods to find the name of the container's parent pod.
- Run kubectl logs <pod_name> to see the logs of the container's current instantiation. Because containers are restarted automatically upon failure, add --previous at the end of the command to see the logs of the container's previous instantiation (e.g. after a crash).
Versioning
Consider creating a new version before deployment. To do so:
- Move changes from the "Unreleased" section at the top of the changelog file to the new version's section.
- Run npm version --no-git-tag-version with the corresponding version increment argument (e.g., prerelease). The --no-git-tag-version option skips the automatic commit and tag.
- Stage the CHANGELOG.md, package.json and package-lock.json changes.
- Commit as "Bump vX.Y.Z" and tag as "vX.Y.Z".
- Push commit and tag.
Home and result pages
The home and HTML-format result pages are created using React, and server-side rendered using renderToStaticMarkup.
i18n
Home and HTML-format result pages are internationalized using i18next.
Translated messages are located under locales/.
See T317044 for discussion around collaboratively translating this via Translatewiki.
JSON editor
In addition to returning Web2Cit translation results at the root (/) and /translate endpoints, the Web2Cit server currently serves the JSON configuration file editor statically at /edit.html.
As described in the JSON editor documentation, this JSON editor uses some URL query parameters, including a link to a JSON schema file, to render a JSON editing wizard.
The source code for this JSON editor is currently part of the Web2Cit server repository, but it is prepared to be split into a separate general-purpose MediaWiki JSON editor project if desired (see T306837).
Linting and formatting
We use ESLint as linter, with TypeScript support via typescript-eslint.
Prettier is used as formatter, and eslint-config-prettier is used to disable ESLint rules that may conflict with it.
lint-staged runs ESLint and Prettier via a Husky pre-commit hook.
Other dependencies
JSDOM is used to create the window object that is needed to create Web2Cit core's root Domain object. This is currently used to provide XPath functionality.
Notes
- Note 1: Sandbox and debugging aliases can be combined as https://web2cit.toolforge.org/sandbox/<YourUserName>/debug/<YourTargetURL> or https://web2cit.toolforge.org/debug/sandbox/<YourUserName>/<YourTargetURL>.
- Note 2: More than one translation result may be returned if the server ever supports returning results for all applicable templates, a function supported by Web2Cit core via the allTemplates option of the Domain's translate method. Also, see T307393 for another case where more than one translation result could be returned.