Community Wishlist Survey 2020/Archive/Structured, plain text Wikisource exports


Structured, plain text Wikisource exports

Declined: proposer did not respond

Problem:

One of the main goals of Wikisource is to produce transcription text that can be widely shared and reused in other contexts. Wikisource has a lot of promise for allied organizations, such as GLAM institutions, that also work on preserving and providing access to textual works. Currently, however, Wikisource is vastly underutilized by the GLAM sector, and one of the main reasons is the lack of interoperability.

It is actually quite difficult to make completed texts on Wikisource useful to the outside world because of the use of wiki formatting and templates in transcriptions, as well as non-transcription text on Wikisource pages, which are not easy to strip out in any programmatic way—and certainly not in any built-in way easily available to reusers, such as an API request.

For example, here is the first paragraph of the US Declaration of Independence:

WHEN in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature’s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation.

But here is how Wikisource transcribes it:

{{dropinitial|W}}HEN in the cour{{ls}}e of human Events, it becomes nece{{ls}}{{ls}}ary for one People to di{{ls}}{{ls}}olve the Political Bands which have connected them with another, and to a{{ls}}{{ls}}ume among the Powers of the Earth, the {{ls}}eparate and equal Station to which the Laws of Nature and of Nature’s God entitle them, a decent Re{{ls}}pect to the Opinions of Mankind requires that they {{ls}}hould declare the cau{{ls}}es which impel them to the Separation.

Here is how that looks in an API response (spoiler: terrifying).
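To make the gap concrete, here is a minimal sketch of the kind of ad hoc cleanup a reuser currently has to write. It handles only the two templates that appear in the paragraph above (`{{ls}}` and `{{dropinitial}}`). The function name, the `keep_long_s` option, and the template-to-text mapping are assumptions for illustration, not something Wikisource provides, and real transcriptions use many more templates than this.

```python
import re

# Hypothetical mapping for illustration only: just the two templates seen in
# the example paragraph are handled. Real pages use many more templates.
LONG_S = re.compile(r"\{\{ls\}\}")
DROP_INITIAL = re.compile(r"\{\{dropinitial\|([^{}|]*)\}\}")


def strip_example_templates(wikitext: str, keep_long_s: bool = False) -> str:
    """Replace {{ls}} and {{dropinitial|X}} with plain characters."""
    # {{dropinitial|W}} becomes the bare letter "W".
    text = DROP_INITIAL.sub(r"\1", wikitext)
    # {{ls}} becomes "s", or the long s character if the caller wants it kept.
    return LONG_S.sub("ſ" if keep_long_s else "s", text)


sample = (
    "{{dropinitial|W}}HEN in the cour{{ls}}e of human Events, "
    "it becomes nece{{ls}}{{ls}}ary"
)
print(strip_example_templates(sample))
print(strip_example_templates(sample, keep_long_s=True))
```

A hand-maintained mapping like this breaks whenever a new template is used, which is why a supported, built-in method would be preferable.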

Who would benefit:

This would primarily benefit Wikisource's potential reusers, specifically stakeholders seeking to use Wikisource as a platform for crowdsourcing their content and then ingest transcription data back into their dataset.
It would also benefit all Wikisource editing communities more directly by encouraging increased institutional partnerships and new users.
It might help Wikisource in other ways, like improving the search index, and paving the way for future storage of transcriptions as structured data statements (directly on Wikidata or as local statements like Structured Data on Commons).

Proposed solution:

A successful outcome would involve a documented, core functionality that allows a user to easily access a transcription for a given work as plain text, and in a machine-readable context.

This could look like a method in the existing Wikisource API for requesting sanitized data, with a "translation" layer on the backend that can take the formatted Wikisource pages and deliver clean versions to the consumer. For example, a JSON array of pages if querying a mainspace or index namespace page, or, at the very least, the ability to get such data on a per-page basis by querying the page namespace. It would hopefully not be an external, standalone tool that might become unmaintained.
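As a rough illustration of the "JSON array of pages" idea, the sketch below builds one possible response shape. Every name in it (`PageText`, `build_response`, the `work`, `pages`, `page_title`, and `text` fields) and the placeholder titles are hypothetical; no such API method exists today, and the actual design would be up to whoever implements it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PageText:
    # Hypothetical fields: the Page-namespace title and its sanitized text.
    page_title: str
    text: str


def build_response(work: str, pages: list[PageText]) -> str:
    """Serialize a work and its per-page plain text as a JSON document."""
    payload = {"work": work, "pages": [asdict(p) for p in pages]}
    # ensure_ascii=False keeps characters such as the long s readable.
    return json.dumps(payload, ensure_ascii=False, indent=2)


pages = [
    PageText(
        page_title="Page:Example.jpg",  # placeholder title, not a real page
        text="WHEN in the course of human Events",
    )
]
response = build_response("Example work", pages)
print(response)
```

Keeping one entry per page would let an institution match each transcription back to the specific digital image it contributed.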

I think it could look something like Extension:TextExtracts, except that extension doesn't seem to work in Wikisource contexts and has a character limit.

More comments:
Phabricator tickets:
Proposer: Dominic (talk) 16:54, 23 October 2019 (UTC)

Discussion

While I do like the idea, it is worth considering the REST API in line with this proposal: Dominic's example in the REST API. See also https://en.wikisource.org/api/rest_v1/ --Snaevar (talk) 17:42, 23 October 2019 (UTC)

@Snaevar: Yes, that's a good point; I wasn't really aware of it. It looks like what we want could be similar in approach to the Wiktionary definition method, except without any HTML elements in the response. In general, though, I really like how this works: https://en.wiktionary.org/api/rest_v1/page/definition/apple. Dominic (talk) 19:49, 23 October 2019 (UTC)

@Dominic: I'm not sure if it's quite what you're after, but the Wikisource Export tool can export to plain text (for example, United States Declaration of Independence (Dunlap Broadside)). It does this by creating an epub of the work and using Calibre to convert that to plain text. Sam Wilson 23:29, 23 October 2019 (UTC)

@Samwilson: This is a useful example as well, but doesn't quite get at the use case I'm describing either. I think there are several export tools like this aimed at serving the reader of a text, but none that seem optimized for the use case of a downstream harvester attempting to consume Wikisource transcription data at scale. For example, if I were an institution that had contributed that digital image of the Declaration, and thousands of others, and then wanted to ingest the work produced by Wikisource back into the source dataset, I would want to be able to easily query for the text of a specific digital image in the form of structured data (when I say plain text, I mean the transcription doesn't have HTML elements or wiki markup, not that it is unstructured or TXT format). Dominic (talk) 21:14, 24 October 2019 (UTC)

@Dominic: It sounds like you might be talking about being able to export in TEI format, or some other non-presentational markup? This would indeed be terrific! At the moment, the closest we come is HTML, because wikitext only fully outputs to HTML. There have been experiments with using things like Pandoc to turn wiki HTML into Docbook or other structured formats, but because we don't have semantic markup for lots of elements (e.g. a quotation paragraph in a book is just indented, or a chapter title is just larger text) there's no way to actually portray these things in a real structured way in any format. For instance, the example paragraph you give above uses the long s, and an institution bringing the transcription back into their collection would want to retain that knowledge about the work, but plain text doesn't give it (that's a slightly shallow example, maybe, because that can just be represented with an actual ſ character – but lots of things can't be).

Maybe you could update the proposal to clarify what you mean by "as plain text, and in a machine-readable context", because it feels like these things might be in opposition: if it's machine readable, it might be text but it's not really plain text, if you see what I mean? We already have plain text, but it's not very useful! :-) Sam Wilson 22:36, 24 October 2019 (UTC)

Regarding the REST API: you still have to learn how to use it. What about something easy, done in a few clicks, for tech idiots like me? I don't have time to learn it; will a GLAM employee have time to learn it? Why would somebody learn something new if they can get what they need elsewhere without learning new things? Juandev (talk) 09:17, 4 November 2019 (UTC)

Dominic, thank you for submitting this proposal! Unfortunately, we are unable to take on this work, as we did not receive a response from you in regard to our question. We apologize for any disappointment, and thank you again for taking part in the survey! --IFried (WMF) (talk) 00:26, 20 November 2019 (UTC)