Grants:Project/Hjfocs/soweego 2/Timeline
This project is funded by a Project Grant
Timeline for Hjfocs
Timeline | Date |
---|---|
Validator | July 2022 |
Feedback loop, data providers side | November 2022 |
Feedback loop, data users side | November 2022 |
New catalogs | November 2022 |
Overview
- Project start date: July 5, 2021
- Workboard: https://github.com/Wikidata/soweego/projects/2
- Codebase: https://github.com/Wikidata/soweego
Monthly updates
Please note that the following sections span one month, starting from the 5th day of the current one. For instance, July 2021 stands for July 5th to August 4th 2021.
July 2021
The very first task was the refinement of validation criteria, as proposed in Grants:Project/Hjfocs/soweego_2#How:_the_solution. We started the discussion with the community on the Wikidata chat and mailing list:
- d:Wikidata:Project_chat/Archive/2021/07#Item_validation_criteria;
- https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/FZD6DPIQAQ3LJMEUV5IA7ATY2RXA2UQ5/ .
While we think that consensus was reached on criterion 2, i.e., links validation, we plan to leave the discussion open until agreement on the automatic ranking actions is achieved.
Besides that, we started technical work with a focus on the validator component:
- refresh target catalog imports;
- use a URL blacklist;
- output catalog IDs;
- improve extraction of Wikidata identifiers from URLs.
With respect to links validation, we implemented a suggestion by Azertus raised during the discussion:
You could generate some statistics on the URLs that could be added in a second phase, like prevalence of domains. Based on that list, new properties could be proposed or domains could be whitelisted, etc.
The following sub-sections hold frequency statistics about URLs that could not be automatically converted to valid identifiers. We submitted them for community discussion at d:Wikidata:Project_chat#URLs_statistics_for_Discogs_(Q504063)_and_MusicBrainz_(Q14005).
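As an illustration, here is a minimal sketch of how such domain frequencies could be computed, assuming the unconverted URLs are available as a plain list of strings. The function name `domain_frequencies` and the sample URLs are hypothetical and are not taken from the soweego code base.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_frequencies(urls):
    """Count how often each Web domain occurs in an iterable of URLs."""
    counts = Counter()
    for url in urls:
        # urlsplit only fills netloc when a scheme or a leading '//' is present
        domain = urlsplit(url.strip()).netloc
        if domain:
            counts[domain] += 1
    return counts


# Hypothetical sample, for illustration only
sample = [
    "https://www.youtube.com/user/example",
    "http://www.youtube.com/user/another",
    "https://muzikum.eu/en/example",
    "not a url",
]
for domain, frequency in domain_frequencies(sample).most_common():
    print(domain, frequency)
```

Domains are counted as they appear, so `www.youtube.com` and `youtube.com` would be separate entries, which is consistent with how the tables in the sub-sections list them.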
Most frequent Web domains
This table displays Web domains that occur more than 1,000 times in Discogs (Q504063) and MusicBrainz (Q14005): some should actually map to known identifiers, some may be candidates for new Wikidata properties. See Grants:Project/Hjfocs/soweego_2/Stats for less frequent ones.
Domain | Frequency | Comment | Property candidate |
---|---|---|---|
isni.oclc.org | 54007 | d:Property:P213#P1793 regex contains spaces, URLs don't | Oppose |
www.youtube.com | 18395 | user namespace, not to be confused with channel (d:Property:P2397) | Support |
itunes.apple.com | 17468 | artist URLs, not to be confused with existing iTunes properties | Support |
www.musik-sammler.de | 13541 | | Support |
www.myspace.com | 12063 | add optional www to d:Property:P3265#P8966 | Oppose |
www.bbc.co.uk | 10596 | musical work review URLs, not to be confused with existing BBC properties | Support |
www.metal-archives.com | 7703 | URLs not matching d:Property:P1952#P1630: consider adding the new value https://www.metal-archives.com/bands/$1 | Oppose |
muzikum.eu | 5548 | | Support |
lyrics.wikia.com | 4153 | | Support |
www.generasia.com | 3360 | | Support |
plus.google.com | 2084 | obsolete URLs | Oppose |
nla.gov.au | 1962 | | Support |
www.reverbnation.com | 1811 | | Support |
musicmoz.org | 1747 | | Support |
www.45cat.com | 1632 | artist URLs, not to be confused with seven inches (d:Property:P9083) | Support |
web.archive.org | 1213 | can be used as d:Property:P1065 value, plus d:Property:P2960 and d:Property:P485 | Oppose |
www.purevolume.com | 1106 | | Support |
www.amazon.com | 1076 | URLs not matching d:Property:P6276#P1630: consider adding the new value https://www.amazon.com/-/e/$1 | Oppose |
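Several comments in the table point to URLs that do not match a property's formatter URL, or that differ only by an optional `www`. The sketch below shows one way a formatter URL with a `$1` placeholder could be turned into a tolerant regular expression. It is an assumption-laden illustration: the helper names are hypothetical, the example band URL is made up, and the identifier is assumed to be a single path segment.

```python
import re


def formatter_to_pattern(formatter_url):
    """Build a regex from a formatter URL, where '$1' marks the identifier."""
    if "$1" not in formatter_url:
        raise ValueError("formatter URL must contain the '$1' placeholder")
    prefix, _, suffix = formatter_url.partition("$1")
    # Drop the scheme and a leading 'www.' so both become optional in the pattern
    host_and_path = re.sub(r"^https?://(?:www\.)?", "", prefix)
    return re.compile(
        r"^https?://(?:www\.)?"
        + re.escape(host_and_path)
        + r"([^/?#]+)"  # assumption: the identifier is one path segment
        + re.escape(suffix)
        + r"/?$"
    )


def extract_identifier(url, formatter_url):
    """Return the identifier captured from the URL, or None if it does not match."""
    match = formatter_to_pattern(formatter_url).match(url)
    return match.group(1) if match else None


# Formatter value suggested in the table above; the band URL is hypothetical
FORMATTER = "https://www.metal-archives.com/bands/$1"
print(extract_identifier("http://metal-archives.com/bands/Example_Band/", FORMATTER))
```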
Discogs Band
Domain | Frequency | Examples |
---|---|---|
www.myspace.com | 7876 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.youtube.com | 2647 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.reverbnation.com | 541 | 1. URL, record; 2. URL, record; 3. URL, record; |
instagram.com | 487 | 1. URL, record; 2. URL, record; 3. URL, record; |
web.archive.org | 483 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.twitter.com | 363 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.metal-archives.com | 270 | 1. URL, record; 2. URL, record; 3. URL, record; |
facebook.com | 217 | 1. URL, record; 2. URL, record; 3. URL, record; |
plus.google.com | 181 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.soundcloud.com | 143 | 1. URL, record; 2. URL, record; 3. URL, record; |
bandzone.cz | 139 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.ProgArchives.com | 137 | 1. URL, record; 2. URL, record; 3. URL, record; |
itunes.apple.com | 129 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.AllMusic.com | 129 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.audioculture.co.nz | 121 | 1. URL, record; 2. URL, record; 3. URL, record; |
open.Spotify.com | 107 | 1. URL, record; 2. URL, record; 3. URL, record; |
de-de.facebook.com | 102 | 1. URL, record; 2. URL, record; 3. URL, record; |
Discogs Musician
Domain | Frequency | Examples |
---|---|---|
www.myspace.com | 4186 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.youtube.com | 1415 | 1. URL, record; 2. URL, record; 3. URL, record; |
repertoire.bmi.com | 644 | 1. URL, record; 2. URL, record; 3. URL, record; |
instagram.com | 406 | 1. URL, record; 2. URL, record; 3. URL, record; |
adp.library.ucsb.edu | 380 | 1. URL, record; 2. URL, record; 3. URL, record; |
films.discogs.com | 275 | 1. URL, record; 2. URL, record; 3. URL, record; |
web.archive.org | 268 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.twitter.com | 264 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.bach-cantatas.com | 202 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.ascap.com | 188 | 1. URL, record; 2. URL, record; 3. URL, record; |
musicianbio.org | 162 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.drummerworld.com | 147 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.soundcloud.com | 133 | 1. URL, record; 2. URL, record; 3. URL, record; |
plus.google.com | 130 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.famousbirthdays.com | 125 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.radioswissjazz.ch | 125 | 1. URL, record; 2. URL, record; 3. URL, record; |
MusicBrainz Band
Domain | Frequency | Examples |
---|---|---|
itunes.apple.com | 7218 | 1. URL, record; 2. URL, record; 3. URL, record; |
isni.oclc.org | 6825 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.youtube.com | 6757 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.metal-archives.com | 5024 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.musik-sammler.de | 3299 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.bbc.co.uk | 2513 | 1. URL, record; 2. URL, record; 3. URL, record; |
muzikum.eu | 2384 | 1. URL, record; 2. URL, record; 3. URL, record; |
lyrics.wikia.com | 2165 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.purevolume.com | 939 | 1. URL, record; 2. URL, record; 3. URL, record; |
musicmoz.org | 935 | 1. URL, record; 2. URL, record; 3. URL, record; |
d-nb.info | 905 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.45cat.com | 891 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.reverbnation.com | 826 | 1. URL, record; 2. URL, record; 3. URL, record; |
plus.google.com | 694 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.45worlds.com | 560 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.generasia.com | 536 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.spirit-of-metal.com | 533 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.amazon.com | 417 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.7digital.com | 332 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.progarchives.com | 269 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.spirit-of-rock.com | 230 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.pandora.com | 227 | 1. URL, record; 2. URL, record; 3. URL, record; |
store.cdbaby.com | 220 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.killfromtheheart.com | 192 | 1. URL, record; 2. URL, record; 3. URL, record; |
uk.7digital.com | 170 | 1. URL, record; 2. URL, record; 3. URL, record; |
web.archive.org | 153 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.livefans.jp | 129 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.zenial.nl | 123 | 1. URL, record; 2. URL, record; 3. URL, record; |
us.7digital.com | 121 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.sonymusic.co.jp | 120 | 1. URL, record; 2. URL, record; 3. URL, record; |
nla.gov.au | 109 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.onkyomusic.com | 108 | 1. URL, record; 2. URL, record; 3. URL, record; |
cafe.daum.net | 104 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.cdjapan.co.jp | 102 | 1. URL, record; 2. URL, record; 3. URL, record; |
MusicBrainz Musician
Domain | Frequency | Examples |
---|---|---|
isni.oclc.org | 47182 | 1. URL, record; 2. URL, record; 3. URL, record; |
itunes.apple.com | 10035 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.youtube.com | 7575 | 1. URL, record; 2. URL, record; 3. URL, record; |
muzikum.eu | 3157 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.bbc.co.uk | 2924 | 1. URL, record; 2. URL, record; 3. URL, record; |
nla.gov.au | 1852 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.musik-sammler.de | 1717 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.metal-archives.com | 1498 | 1. URL, record; 2. URL, record; 3. URL, record; |
lyrics.wikia.com | 1141 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.generasia.com | 1103 | 1. URL, record; 2. URL, record; 3. URL, record; |
plus.google.com | 1079 | 1. URL, record; 2. URL, record; 3. URL, record; |
ibdb.com | 869 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.rockabilly.nl | 759 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.45cat.com | 646 | 1. URL, record; 2. URL, record; 3. URL, record; |
anison.info | 558 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.amazon.com | 542 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.ibdb.com | 496 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.encyclopedisque.fr | 454 | 1. URL, record; 2. URL, record; 3. URL, record; |
musicmoz.org | 436 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.reverbnation.com | 365 | 1. URL, record; 2. URL, record; 3. URL, record; |
soundtrackcollector.com | 319 | 1. URL, record; 2. URL, record; 3. URL, record; |
store.cdbaby.com | 317 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.bach-cantatas.com | 298 | 1. URL, record; 2. URL, record; 3. URL, record; |
ocremix.org | 283 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.findagrave.com | 283 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.rocky-52.net | 251 | 1. URL, record; 2. URL, record; 3. URL, record; |
rcs-discography.com | 221 | 1. URL, record; 2. URL, record; 3. URL, record; |
utaitedb.net | 213 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.junodownload.com | 189 | 1. URL, record; 2. URL, record; 3. URL, record; |
pomus.net | 179 | 1. URL, record; 2. URL, record; 3. URL, record; |
web.archive.org | 179 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.worldcat.org | 175 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.7digital.com | 161 | 1. URL, record; 2. URL, record; 3. URL, record; |
stage48.net | 156 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.facebook.com | 149 | 1. URL, record; 2. URL, record; 3. URL, record; |
tower.jp | 144 | 1. URL, record; 2. URL, record; 3. URL, record; |
imusti.com | 136 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.audionetwork.com | 124 | 1. URL, record; 2. URL, record; 3. URL, record; |
anidb.net | 123 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.purevolume.com | 120 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.onkyomusic.com | 118 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.naxos.com | 118 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.qim.com | 116 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.directlyrics.com | 115 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.todotango.com | 112 | 1. URL, record; 2. URL, record; 3. URL, record; |
play.google.com | 110 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.sonymusic.co.jp | 109 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.classicalarchives.com | 106 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.cmt.com | 104 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.musicapopular.cl | 103 | 1. URL, record; 2. URL, record; 3. URL, record; |
operabase.com | 102 | 1. URL, record; 2. URL, record; 3. URL, record; |
MusicBrainz Musical Work
Domain | Frequency | Examples |
---|---|---|
www.musik-sammler.de | 8519 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.bbc.co.uk | 5096 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.generasia.com | 1705 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.metal-archives.com | 870 | 1. URL, record; 2. URL, record; 3. URL, record; |
lyrics.wikia.com | 842 | 1. URL, record; 2. URL, record; 3. URL, record; |
pitchfork.com | 574 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.nme.com | 414 | 1. URL, record; 2. URL, record; 3. URL, record; |
musicmoz.org | 376 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.spirit-of-metal.com | 356 | 1. URL, record; 2. URL, record; 3. URL, record; |
thesession.org | 277 | 1. URL, record; 2. URL, record; 3. URL, record; |
soundtrackcollector.com | 241 | 1. URL, record; 2. URL, record; 3. URL, record; |
exclaim.ca | 176 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.avclub.com | 164 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.progarchives.com | 158 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.inmusicwetrust.com | 155 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.angrymetalguy.com | 138 | 1. URL, record; 2. URL, record; 3. URL, record; |
stage48.net | 135 | 1. URL, record; 2. URL, record; 3. URL, record; |
web.archive.org | 130 | 1. URL, record; 2. URL, record; 3. URL, record; |
drownedinsound.com | 130 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.popmatters.com | 123 | 1. URL, record; 2. URL, record; 3. URL, record; |
www.rollingstone.com | 105 | 1. URL, record; 2. URL, record; 3. URL, record; |
August 2021
We consider that community discussion around validation criteria has reached a satisfactory level, with the latest updates on automatic ranking discussed here: d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_4.
Outreach
- The bot permission request for validation criterion 3 was approved: d:Wikidata:Requests_for_permissions/Bot/Soweego_bot_4;
- the project lead attended relevant sessions at Wikimania 2021;
- contributed a pull request that got merged into the Wikidata constraints violation checker tool: https://github.com/wmde/wikidata-constraints-violation-checker/pull/33.
Criterion 2: URLs
- Use d:Property:P2888 for URLs that cannot be converted into identifiers.
Criterion 3: biographical data
- Give priority to Wikidata in case of more precise date values;
- dump shared statements, to be used as references in Wikidata;
- resolve QIDs of place strings coming from target catalogs;
- dump Wikidata values not available in target catalogs.
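The first item above can be read as a comparison rule between a Wikidata date and a target catalog date. The sketch below is one possible interpretation, not the actual soweego logic: the `DateValue` class, the `compare_dates` function, and the returned labels are all hypothetical. It only relies on the Wikibase precision codes 9 (year), 10 (month), and 11 (day).

```python
from dataclasses import dataclass

# Wikibase time precision codes: 9 = year, 10 = month, 11 = day
YEAR, MONTH, DAY = 9, 10, 11


@dataclass(frozen=True)
class DateValue:
    year: int
    month: int = 1
    day: int = 1
    precision: int = DAY

    def truncated(self, precision):
        """Return only the date parts that are meaningful at the given precision."""
        parts = (self.year, self.month, self.day)
        return parts[: precision - YEAR + 1]


def compare_dates(wikidata, target):
    """Hypothetical rule: compare at the coarsest shared precision first."""
    coarsest = min(wikidata.precision, target.precision)
    if wikidata.truncated(coarsest) != target.truncated(coarsest):
        return "mismatch"
    if wikidata.precision > target.precision:
        # Same value, but Wikidata is more precise: Wikidata takes priority
        return "keep_wikidata"
    if wikidata.precision < target.precision:
        return "target_more_precise"
    return "match"


print(compare_dates(DateValue(1950, 3, 12, DAY), DateValue(1950, precision=YEAR)))
```

Under this reading, a catalog that only knows the year would not override a full Wikidata date that falls within that year.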
Feedback loops with data providers
We decided to make a very early first step towards this key project goal, by reigniting the discussion with catalog owners. More specifically, we:
- ran full validation of Discogs URLs;
- submitted rotten artist URLs to relevant team members at Discogs;
- ran full validation of MusicBrainz URLs;
- submitted rotten artist URLs to relevant team members at MusicBrainz.
We really look forward to enabling feedback loops with them.
September 2021
Outreach
- Talk given at the d:Wikidata:Events/Data_Quality_Days_2021;
- contributed to Cloud VPS documentation: wikitech:special:diff/1925476;
- submitted bug report to the Cloud VPS team: phab:T291168.
Criterion 2
The soweego bot keeps ingesting third-party identifiers. As a result, we are receiving community feedback on the project lead's talk page, mainly because of problematic MusicBrainz URLs. We devoted a considerable amount of time to addressing it regularly:
- d:Topic:Wg08drdjtghn2tj5;
- [1];
- deleted identifiers that are URL-encoded.
Criterion 3
As an outcome of d:Wikidata:Events/Data_Quality_Days_2021 discussions, we agreed that biographical data is delicate and should be reviewed before being ingested. The d:Wikidata:Mismatch_Finder tool seems the ideal candidate: it is in active development, and we contributed a sample real-world dataset coming from MusicBrainz validation.
Feedback loops with data providers
Discussion follow-ups with target catalog owners about rotten URLs:
- Discogs stated they currently don't have mechanisms to remove rotten URLs;
- they may leverage the dataset we submitted to notify their users about the issue, following a crowdsourced paradigm;
- MusicBrainz decided to start building their own URL checks, given the large amount of rotten ones we submitted;
- we pointed them to relevant pieces of the soweego code base that are in charge of such checks;
- in the MusicBrainz database, a specific field marks URLs as ended, and we should take it into account.
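The last point suggests skipping URLs that the catalog already marks as ended before checking whether they are rotten. A minimal sketch of that idea follows, assuming each record carries an `ended` flag and that liveness is decided by a caller-supplied function. `UrlRecord`, `find_rotten`, and the stub checker are hypothetical names and do not reflect the soweego code base or the MusicBrainz schema.

```python
from typing import Callable, Iterable, List, NamedTuple


class UrlRecord(NamedTuple):
    url: str
    ended: bool  # hypothetical flag mirroring the catalog's "ended" field


def find_rotten(records: Iterable[UrlRecord], is_alive: Callable[[str], bool]) -> List[str]:
    """Return URLs that fail the liveness check, ignoring those marked as ended."""
    rotten = []
    for record in records:
        if record.ended:
            # The catalog already knows this URL is gone: no need to report it
            continue
        if not is_alive(record.url):
            rotten.append(record.url)
    return rotten


# Stub checker for illustration; a real one would issue an HTTP request
def fake_is_alive(url: str) -> bool:
    return "dead" not in url


records = [
    UrlRecord("https://example.org/alive", ended=False),
    UrlRecord("https://example.org/dead", ended=False),
    UrlRecord("https://example.org/dead-but-ended", ended=True),
]
print(find_rotten(records, fake_is_alive))
```

Passing the checker as a parameter keeps the filtering logic testable without network access.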
Technical
- Replaced pip with Conda for dependency management;
- bumped Python and all dependencies to their latest versions;
- handled pywikibot timeouts caused by high lags of Wikidata Query Service servers;
- backed up the soweego VPS;
- deleted the Debian Stretch instance, to be deprecated soon;
- spawned a fresh Debian Bullseye one.
October 2021
Outreach
- Supported the creation of the new identifier property d:Property:P9965, see d:Wikidata:Property_proposal/musik-sammler.de_artist_ID;
- attended the d:Wikidata:WikidataCon_2021 conference.
Criterion 2
- The soweego bot completed the ingestion of third-party identifiers related to musicians and bands;
- we kept addressing community feedback:
- deleted wrong identifiers, as reported in [2];
- replaced percent-encoded identifiers with decoded ones;
- used pluses instead of whitespaces for d:Property:P3192 values;
- added d:Property:P9965 statements.
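Two of the fixes above are simple string normalizations. The sketch below illustrates them with the Python standard library, assuming identifiers are handled as plain strings. The function names and the sample values are hypothetical.

```python
from urllib.parse import unquote


def decode_identifier(identifier):
    """Replace percent-encoded sequences with the characters they stand for."""
    return unquote(identifier)


def spaces_to_pluses(identifier):
    """Use '+' in place of space characters."""
    return identifier.replace(" ", "+")


# Hypothetical values, for illustration only
print(decode_identifier("Mot%C3%B6rhead"))
print(spaces_to_pluses("Example Band"))
```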
Feedback loops with data providers
Updates from target catalog owners about rotten URLs:
- the database owner at Discogs stated that he can't directly remove the URLs in the dataset we provided;
- he needs to schedule development time to implement the removal automation;
- discussion with Discogs users is in progress.
Technical
- Removed pip requirements;
- bumped all versions of project dependencies;
- replaced Travis with pre-commit for continuous integration;
- stopped failing builds after pre-commit autofixes;
- refined the script for low-level claims deletion.
Extension request
Request #1
New start date
July 5, 2021
New end date
July 4, 2022
Rationale
The main grantee is currently involved in a third-party research project on a full-time basis. As a result, we would like to shift the actual start date of this project, to ensure that the whole team is fully engaged.
Approval
Noting here that the grant extension request to July 4, 2022 was approved by program officer, Mjohnson (WMF) in January 2021. -- JTud (WMF), Grants Administrator (talk) 22:26, 13 September 2021 (UTC)
Request #2
New end date
November 4, 2022
Rationale
Starting January 2022, the main grantee will be involved in a research project with a 40% commitment (where 100% stands for full time). As a result, we computed the additional time needed to ensure the planned commitment for soweego 2.
Side note: the new project has tight connections to Wikidata and will be carried out together with fellow Wikimedian Daniel Mietchen, so we foresee some overlap with soweego 2, and really look forward to mutual benefits.
Approval
Noting here that the grant extension request to November 4, 2022 has been approved. The new midpoint report is due by January 30, 2022 and new final report due date is December 4, 2022. -- JTud (WMF), Grants Administrator (talk) 22:26, 13 September 2021 (UTC)
Request #3
New end date
November 19, 2021
Rationale
Starting January 2022, the main grantee will join WMF as a full-time staff member. This request supersedes the previous one.
Acknowledgement
Thank you for the update, User:Hjfocs, and congratulations on your new role at the Foundation! I confirm that your project grant is officially 'closed' in our records. -- All the best, JTud (WMF), Grants Administrator (talk) 19:31, 10 December 2021 (UTC)