InternetArchiveBot/FAQ for sysadmins

From Meta, a Wikimedia project coordination wiki


This page contains a list of common questions asked by server admins about InternetArchiveBot.

Q: Why is the bot making requests to my site?

A: InternetArchiveBot is a heavily relied-upon tool on Wikipedia. The bot routinely checks Wikipedia articles and repairs, replaces, or removes broken links. To do this, it needs to request each URL to check whether it is working at all. It normally does this by sending HEAD requests to reduce the load on the server. In some cases it may fall back to a full GET request if the HEAD request fails.
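To illustrate the request pattern a server might see, here is a minimal sketch of a HEAD-first check with a GET fallback. It is not IABot's actual code: the names `check_url` and `urllib_fetch`, the 10-second timeout, and the rule that any status from 200 to 399 counts as alive are all assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable


def urllib_fetch(method: str, url: str, timeout: float = 10.0) -> int:
    """Send one request and return the HTTP status code (hypothetical helper)."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their status code.
        return err.code


def check_url(url: str, fetch: Callable[[str, str], int] = urllib_fetch) -> bool:
    """Try HEAD first, then GET; return True if either attempt looks alive."""
    for method in ("HEAD", "GET"):
        try:
            status = fetch(method, url)
        except OSError:
            # Connection errors and timeouts: move on to the next method.
            continue
        if 200 <= status < 400:  # assumed definition of "alive"
            return True
    return False
```

Passing `fetch` as a parameter keeps the fallback logic separate from the network call, so the sketch can be exercised with a stub instead of a live server.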

Q: There is a robots.txt on my site but InternetArchiveBot is ignoring it. Why is it not respecting robots.txt?

A: This is because InternetArchiveBot isn't actually crawling your site for its content, and the site's content is not being saved anywhere. The bot is simply assessing whether the URL is a working URL that delivers content. It only accesses that URL because it is used as a source on Wikipedia. You will note that IABot makes HEAD requests in most cases. If the source is removed from Wikipedia, or is found to be broken, IABot will stop making requests to it.

Q: Why is the bot making numerous requests at once?

A: InternetArchiveBot tests links on a per-article basis. That means it goes through articles one at a time and tests all the links found in each article. If your site is heavily used in a specific article, InternetArchiveBot will make requests to all of those URLs. IABot will, however, wait 1 second between requests going to the same site.
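The 1-second spacing can be pictured as a per-host throttle. The sketch below is an illustration only, not IABot's implementation: the class name `PerHostThrottle`, the injectable `clock` and `sleep` parameters, and the choice to treat each hostname as one "site" are assumptions.

```python
import time
from urllib.parse import urlsplit


class PerHostThrottle:
    """Space out requests so each host is contacted at most once per interval."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # hostname -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return the delay applied."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually sent (after any sleep).
        self._last[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` before each request delays only repeat requests to the same host; links to other hosts in the same article are not held back.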

Q: Is there a waiting period between these tests?

A: Yes. If the URL is deemed to be alive, the bot waits a minimum of 1 week before testing that specific URL again. If the URL is found to be dead, the bot tests it 2 more times, waiting at least 3 days between each test, before declaring it broken. If the bot finds the URL to be alive in one of the 2 remaining tests, the waiting period of 1 week is reinstated. Once the bot declares a link broken, it ceases further tests on the URL.
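The schedule above can be summarized as a small state machine. This is a sketch under stated assumptions rather than the bot's real logic: `LinkState` and `record_result` are hypothetical names, and resetting the failure count after a successful test is an assumption the answer does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ALIVE_RECHECK = timedelta(weeks=1)  # minimum wait after a successful test
DEAD_RECHECK = timedelta(days=3)    # minimum wait after a failed test
FAILURES_BEFORE_BROKEN = 3          # first failure plus 2 more tests


@dataclass
class LinkState:
    failures: int = 0
    broken: bool = False
    next_check: Optional[datetime] = None  # None means "no test scheduled"


def record_result(state: LinkState, alive: bool, now: datetime) -> LinkState:
    """Update the state after one test and schedule the next one."""
    if state.broken:
        return state  # broken links are no longer tested
    if alive:
        state.failures = 0  # assumption: a live result clears earlier failures
        state.next_check = now + ALIVE_RECHECK
    else:
        state.failures += 1
        if state.failures >= FAILURES_BEFORE_BROKEN:
            state.broken = True
            state.next_check = None
        else:
            state.next_check = now + DEAD_RECHECK
    return state
```

Under these assumptions, a URL that fails three consecutive tests is marked broken no sooner than 6 days after the first failure.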

Q: Can I block the bot?

A: You can, but it's not advisable. Blocking the bot may lead it to conclude that your entire site is broken, and your site will then be treated as such on Wikipedia. It is recommended that you contact User talk:InternetArchiveBot and request that your domain be whitelisted. Once the domain is whitelisted, the bot will stop making requests to it.

Q: What will happen if I block the bot?

A: InternetArchiveBot does its best to determine whether your site went down or the bot was blocked from accessing it. If it determines that it has been blocked, it will whitelist your domain and automatically cease making requests to it. If it can't detect that it has been blocked, IABot will eventually begin treating all URLs in your domain as broken and may replace or remove them from Wikipedia.

Q: What happens if a URL or a site has been deemed non-functional?

A: InternetArchiveBot will stop making requests to the URLs it deems broken, and begin to replace or remove those URLs on Wikipedia. If a URL is replaced, it is usually replaced with an archived copy captured by the Wayback Machine.

Q: We recently restructured our site, but InternetArchiveBot is still making requests to the old URLs. What do we do?

A: There are two options. The easiest option is to have all old URLs redirect, using 302 codes, to their correct new URLs. IABot follows redirects and will test the new URL automatically. The second option is to contact User talk:InternetArchiveBot and explain the new URL structure and how to convert old URLs to it.
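As a rough illustration of the first option, the following sketch answers both HEAD and GET requests for old paths with a 302 redirect. The `/articles/` and `/news/` prefixes, the function `new_location`, and the handler itself are hypothetical; in practice this rule would usually live in your web server's own redirect configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

# Hypothetical restructuring: everything under /articles/ moved to /news/.
OLD_PREFIX = "/articles/"
NEW_PREFIX = "/news/"


def new_location(path: str) -> Optional[str]:
    """Map an old path to its new location, or None if there is no mapping."""
    if path.startswith(OLD_PREFIX):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    return None


class RedirectHandler(BaseHTTPRequestHandler):
    def _redirect(self):
        target = new_location(self.path)
        if target is None:
            self.send_error(404)
            return
        self.send_response(302)
        self.send_header("Location", target)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # Answer HEAD as well as GET, since link checkers often send HEAD first.
    do_HEAD = _redirect
    do_GET = _redirect


def serve(port: int = 8000) -> None:
    """Run the redirect server locally (for experimentation only)."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

With this mapping, a request for `/articles/2019/foo.html` would be redirected to `/news/2019/foo.html`.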

Q: What do I do if I have more questions?

A: It is recommended you contact User talk:InternetArchiveBot and leave a message on their talk page. To do that, click on the New Section tab found in the top right of the page.