
BHL/Our outcomes/WiR/Status updates/2025-01-24

From Meta, a Wikimedia project coordination wiki

11 December 2024 - 24 January 2025


Hello, all,

I start this update by thanking everyone who took a look at the last report, especially those who gave feedback on the 6 tasks. It seems we are all aligned on refining the BHL Image Structured Data Model and transforming the metadata into 5-star Linked Open Data. That has thus been the focus of the last 2 weeks.

The #P180 shortcut enables a direct link to the "depicts" statements for this blue-bellied parrot illustration.

Bite-sized news

  • After a kind invitation by Fiona (WMF), on Jan 15 I presented at the Commons Community Call a bit of what we are doing in the BHL-Wiki Working Group. The community is aligned on showing support for Structured Data on Commons, which might mean more and better tools for SDC in the future and more support for the Commons Query Service.
  • At the request of Heidi Meudt on the WikiProject Biodiversity chat, the hasDepicts.js tool is now set to be "on" by default. It adds green or white balls under file thumbnails, showing whether or not they have depicts (P180) statements. This helps a bit when curating this information manually on Commons.
  • The BHL Arena app got good attention and feedback; I thank everyone who tried it! Good requests were raised by RichartLitt and dpriskorn on the GitHub repo and by Gio on Telegram. Hopefully I will find the time to implement them soon.
  • The Commons Impact Metrics Dashboard is now tracking metrics for the BHL subcategories for South America and Africa. As these are newly tracked by the Commons Impact Metrics system, there might be bugs.

PIWG connections


Last Wednesday, Jan 22, I attended the Persistent Identifiers Working Group (PIWG) to meet Nicole and others and get to know a bit more of the identifiers work. From the conversations, some tasks emerged that could be useful towards BHL's mission:

Task 1. Develop more metrics about reuse of BHL content (i.e. DOIs) on Wikipedia

Rod Page wrote a few queries in that direction, listed on Biodiversity Heritage Library/Our outcomes/Quarry. So far, they count about 7,680 uses of BHL DOIs plus 10,139 direct links to the BHL website. Hopefully I can extend some of these metrics, getting, for example, how many times these pages are visited per month.
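
One possible extension: the Wikimedia Pageviews REST API exposes per-article view counts, so the pages found by those queries could be fed into it. A minimal sketch of building such a request (the article name below is just an example, not one of the actual query results):

```python
from urllib.parse import quote

def pageviews_url(project, article, start, end):
    """Build a Wikimedia Pageviews REST API URL for monthly views of one article."""
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    # The API expects the article title percent-encoded, including spaces.
    return f"{base}/{project}/all-access/user/{quote(article, safe='')}/monthly/{start}/{end}"

url = pageviews_url("en.wikipedia.org", "Noctilio leporinus", "20250101", "20250131")
```

Looping this over the pages that cite BHL DOIs would give a monthly "eyeballs on BHL-backed content" figure.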

Task 2. Update citations on Wikipedia to link more clearly to BHL using the "#via" parameter

Citations currently show up as:

As you can see, BHL is hidden behind the links. The alternative makes the links to BHL much clearer, and could be used both for BHL DOIs and direct website links:
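
For illustration, a hedged sketch of what this change would look like in wikitext (the title, year, item URL and citation values are placeholders, not a real citation):

```wikitext
<!-- Before: BHL only appears inside the URL target -->
{{cite book |title=Example Title |year=1824 |url=https://www.biodiversitylibrary.org/item/12345}}

<!-- After: the |via= parameter makes the provider explicit in the rendered citation -->
{{cite book |title=Example Title |year=1824 |url=https://www.biodiversitylibrary.org/item/12345 |via=[[Biodiversity Heritage Library]]}}
```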

Task 3. Update the Mix'n'Match catalog for BHL Authors D

Deprecated ids still linger. Siobhan contacted Lucy Schrader, who kindly provided instructions on how to do it. If all goes well, this should be done shortly. For reference, see all BHL catalogs in Wikidata:WikiProject_BHL/Statistics.

If you have more tasks related to persistent identifiers in the BHL-Wiki interface, please let me know!

Structured Data Case Studies


After putting together a tutorial for editing the SDC of BHL images with OpenRefine, based on work by Siobhan and Sandra, I proceeded to do a few case studies to detect corner cases ahead of a scale-up. I am versioning the BHL Image Data Model, which is currently at v0.1.1.

Documentation for Each Case Study


Each case study corresponds to a different publication whose images have been loaded into Commons. I am keeping track of these changes on a master Google Spreadsheet, with some general notes, criteria and links. Each case study is getting its own Google Doc, at least for now:

This photo album served to test the tutorial. One complexity there was that it needed manual curation of the number of photos, done in OpenRefine. A particular detail detected was that some images for this album appear to have been digitized twice. Other little modelling challenges like this one are listed in the case study documents, for those interested.

A small category, where some files had the {{BHL}} template and some did not. We decided, from now on, to only parse files with the {{BHL}} template, which include more metadata. This template is used over 230,000 times on Commons, making it already a decently sized dataset.
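
As a rough illustration of that parsing step, here is a minimal sketch of extracting {{BHL}} template parameters from a file's wikitext. The parameter names in the example string are assumptions for illustration, not the template's real schema:

```python
import re

# Hypothetical wikitext from a Commons file page; "pageid" and "source"
# are illustrative parameter names, not the {{BHL}} template's exact fields.
wikitext = "{{BHL|pageid=12345|source=https://www.biodiversitylibrary.org/page/12345}}"

def parse_bhl_template(text):
    """Return the first {{BHL}} template's parameters as a dict, or None if absent."""
    m = re.search(r"\{\{BHL\s*\|([^{}]*)\}\}", text)
    if not m:
        return None
    params = {}
    for part in m.group(1).split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return params
```

Files without the template simply return None and get skipped, matching the "only parse files with {{BHL}}" decision.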

At this point, I moved from OpenRefine to a custom script using WikibaseIntegrator (GitHub repository: Reconciliation Bot). The main advantages are scalability and the opportunity to automate a few of the manual parts of the workflow.
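
For readers curious about what such a bot ultimately writes, here is a library-free sketch of a depicts (P180) statement in the Wikibase JSON claim format that tools like WikibaseIntegrator send to the API (the Q-id is a placeholder, not a real reconciliation result):

```python
import json

def depicts_claim(item_qid):
    """Build a depicts (P180) statement in the Wikibase JSON claim format."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": "P180",  # depicts
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": item_qid},
            },
        },
        "type": "statement",
        "rank": "normal",
    }

# A batch edit bundles one or more such claims for a MediaInfo entity.
payload = json.dumps({"claims": [depicts_claim("Q12345")]})
```

The library hides this structure behind datatype classes, but seeing the raw shape helps when debugging edits in the API sandbox.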

One of the plants drawn by T.Wild.

A nice set of plant illustrations, part of the famed contributions of von Martius and of the Prince of Wied in Brazil. The illustrator, "T. Wild", did not even have a Wikidata item. It seems to be common for illustrators, engravers and lithographers to be forgotten in history. Crediting them in metadata is, I think, a starting point to do them justice.

At this point I implemented EditGroups support, so batches of edits may be tracked, and a parser to get Sponsor and Collection based on the Bibliography IDs. The images in the category also had some structured data, added by the FlickypediaBackfillrBot, which led me to unearth bugs in the core of the WikibaseIntegrator package. Luckily, I was able to hand-fix the bugs, and now the bot runs with the custom package at github.com/lubianat/WikibaseIntegrator.
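
EditGroups works by spotting a shared batch hash inside edit summaries, so every edit of one batch can be tracked (and undone) together. A minimal sketch of generating such a summary; the exact link format and the tool code below are assumptions for illustration:

```python
import secrets

def editgroups_summary(base_summary, tool_code="BHLbot"):
    """Append an EditGroups batch marker to an edit summary.

    All edits in one batch must reuse the same random hash so the
    EditGroups tool can group them. The link target format here is an
    assumption; the real one depends on the tool's registration.
    """
    batch_hash = secrets.token_hex(10)  # generate once per batch, not per edit
    return f"{base_summary} ([[:toollabs:editgroups-commons/b/{tool_code}/{batch_hash}|details]])"
```

In the bot, the hash is generated once at startup and appended to every edit summary of the run.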

Max, the Prince of Wied, hunting with Joaquim Kuêk.

Another work by Max, the Prince of Wied, but this time I was unable to figure out the illustrator. When studying the work, I found this image of Max with a native "helper", Joaquim Kuêk, and was shocked. The dead macaw and the exploitation of indigenous people (Kuêk's skull was put in a museum after his death) hint at the ugly, untold parts of these beautiful books. In a way, images like this remind us of the importance of re-signifying the biodiversity treasure in BHL, and of making sure it has a positive and long-lasting impact for humankind and biodiversity. I think that is, ultimately, the mission behind this whole Wiki work.

Anyway, coming back to the structured data: some files in the category had, in the metadata, species names loaded from BHL OCR. I then wrote code to parse these as "depicts" statements. Some names detected by the BHL OCR don't seem to be used anymore, e.g. Noctilio dorsatus, which now seems to be called Noctilio leporinus. I am investigating technical ways to make these leaps automatically, maybe via GBIF.
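
One candidate route for those automatic leaps is GBIF's species-match endpoint, which resolves a (possibly outdated) name against the GBIF backbone taxonomy. A minimal sketch of building the request; only the URL construction is shown, since handling the response would depend on the actual payload:

```python
from urllib.parse import urlencode

def gbif_match_url(name):
    """Build a GBIF species-match API request for a possibly outdated name.

    The response includes the matched backbone taxon, from which an
    accepted name can be read when GBIF knows one.
    """
    return "https://api.gbif.org/v1/species/match?" + urlencode({"name": name})

url = gbif_match_url("Noctilio dorsatus")
```

A synonym-to-accepted-name mapping built this way would still deserve a manual spot check before feeding it into depicts statements.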

This jacana was coloured green in the book, an alien look for a familiar bird to Brazilian birders.

This 1648 book is outstanding for its historical importance. It was published in Latin, in Leiden, and reflects a lot of the natural history of Dutch Brazil. There was, recently, a 1.5-million-euro Horizon 2020 grant just for studying this book. There are many copies known and catalogued, and BHL has at least 5 different Title identifiers for it, all connected to the same Wikidata item.

That detail adds some complexity to the workflow, but luckily, the {{BHL}} template stores the exact Item where each image came from. The plates seem to have been colored by hand and not by the illustrator, so some images look funny, and very different from copy to copy.

A rare case of a public domain book about Brazilian biodiversity that was actually authored by Brazilians and edited in Brazil! The collection, about woody trees of São Paulo (where I live), has a mix of photos and illustrations, so adding the "instance of" (P31) property implies the need for manual curation. As the workflow is mostly automated by now, I modified Magnus Manske's sdc_tool to be able to add `instance of (P31)` statements too. The modified script is available at User:TiagoLubiana/sdc_tool.js.

Another technical detail is that, for this case, the PDF/DJVU of the book was not available on Commons. It was surprisingly straightforward to add it via the Internet Archive Upload tool (ia-upload.wmcloud.org/). I might make a little video tutorial for the upload of BHL scans to Commons using the tool.

And that was it so far. I will keep navigating these categories for the next few weeks, refining the model and the bot code. Let me know if there is any particular category that you would like me to study/add metadata to!

Tech adventures on Phabricator tickets


Navigating the Commons Structured Data waters with more intensity led me to find some bugs, inconsistencies and missing features, which in turn led to dialogues with the Wikimedia tech community via Phabricator, including:

  • T298672, a bug when new users try to revert Structured Data
  • T383584 and T304391, a feature request to be able to link, using SDC, two media files (e.g. an image and the scan it comes from). The tickets cover two different solutions: one a quick workaround, the other a larger Wikibase restructuring, which would enable better queries for this data
  • T384221, based on feedback from Siobhan on the BHL Arena app, a feature request for sharing links to Commons files that open directly in the "Structured Data" view

And that is basically it! Thank you for reading this update, and if you have questions or suggestions, don't hesitate to reach out.