
Community Wishlist Survey 2019/Citations/Automatic web archive

From Meta, a Wikimedia project coordination wiki

Automatic web archive

  • Problem: Many people forget to archive websites when they cite them as sources. That makes things harder later, when a page has changed or its URL no longer works.
  • Who would benefit: Everyone who takes care of broken web links, and all readers, since the cited information stays documented
  • Proposed solution: Since we probably will not get our own archive, every external web link that is added should automatically be archived in the web archive [1]
  • More comments:

Discussion

@ZellmerLP: This sounds related to m:InternetArchiveBot? --AKlapper (WMF) (talk) 12:48, 30 October 2018 (UTC)

@AKlapper (WMF) and ZellmerLP: IABot just finds the archived copies; the archival of sources is done separately by a script that looks through recent changes on certain WMF wikis. I believe the relevant task for this request would be phab:T199193, since archival is already performed on most new URLs anyway. (As I noted there, a significant portion of the URLs currently not being archived correctly are likely unarchivable because of limitations of the Internet Archive, and not because the bot can't find the URLs fast enough.) Jc86035 (talk) 13:11, 30 October 2018 (UTC)
As noted at task T199193, instant archiving of new references is something we're already looking to work with InternetArchive to accomplish as part of the Knowledge Integrity program :) Samwalton9 (WMF) (talk) 13:15, 30 October 2018 (UTC)
@Samwalton9 (WMF): Is the WMF still planning to use the original idea of archiving sources instantaneously? I think it could be valuable, but it would be a little disappointing if it's simply decided that any page with a robots.txt or with dynamic content doesn't need to be properly archived, and to me it seems quite odd to ignore them in a plan called "Knowledge Integrity". That these are perhaps inherent limitations of the Internet Archive's software doesn't mean that it couldn't be done differently. (As stated in the task, I would think most URLs don't disappear between their addition to Wikipedia and their archival by IA a day or so later.) Jc86035 (talk) 13:43, 30 October 2018 (UTC)
Honestly we haven't really got into the details on this task yet - it's dependent on the citation event stream which is being worked on first. Your comment on that task is a really great overview of the limitations, however, and we'll make sure to take that into consideration when moving ahead with this. Samwalton9 (WMF) (talk) 13:50, 30 October 2018 (UTC)
@ZellmerLP: It appears to me that InternetArchive already does this as part of their work on IABot. Cyberpower678 might be able to provide more insights. -- NKohli (WMF) (talk) 21:47, 30 October 2018 (UTC)
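
The thread above separates two operations: looking up an archived copy that already exists, and requesting a new capture. As a rough illustration only, the Python sketch below builds the request URLs for both against the public Wayback Machine endpoints as I understand them (the availability lookup at archive.org/wayback/available and "Save Page Now" at web.archive.org/save/). The function names are hypothetical, the response shape is an assumption based on the public availability API, and none of this is how IABot or the WMF script actually works. It makes no network calls.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed public Wayback Machine endpoints (not taken from the discussion).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SAVE_ENDPOINT = "https://web.archive.org/save/"


def availability_query(url: str) -> str:
    """Build the lookup URL that asks whether a capture of `url` exists."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': url})}"


def save_request_url(url: str) -> str:
    """Build the URL that asks the archive to make a new capture of `url`."""
    return SAVE_ENDPOINT + url


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the archived URL from an availability response, or None.

    Assumes a response shaped like
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

A caller would fetch `availability_query(...)` first and only fall back to `save_request_url(...)` when `closest_snapshot(...)` returns None. As noted in the thread, a save request can still fail for pages that the Internet Archive cannot capture.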

I think this is the same as or very similar to Community Wishlist Survey 2016/Categories/Bots and gadgets#Automatic links to Internet Archive, which is about automatic archiving at the time the link is saved. IABot still does a fantastic job of doing this after the fact, so I wonder if that is sufficient. MusikAnimal (WMF) (talk) 22:42, 30 October 2018 (UTC)

Might Perma.cc be relevant here? Apparently it's run by some libraries and archives on request by anyone; it's specifically designed to prevent linkrot in academic texts. HLHJ (talk) 02:36, 31 October 2018 (UTC)
@HLHJ: I personally think it's nice but not large enough to have much of a noticeable effect. The Internet Archive is basically doing the same thing already, but on an industrial scale (the Wayback Machine has about five and a half orders of magnitude more captures than perma.cc). Jc86035 (talk) 13:14, 31 October 2018 (UTC)

We have a ticket open to do this in real time with mw:citoid at task T115224 (it was assigned to me with high priority), but it is currently stalled because the patch increased response time dramatically. This could potentially be revisited using the IABot service, which may be fast enough, though I haven't tried that yet. Citoid development is currently frozen due to deployment issues but will hopefully be unfrozen soon. Mvolz (WMF) (talk) 12:12, 6 November 2018 (UTC)

Voting