Contingency planning

Questions to be addressed

  1. What would happen in case of a crisis?
  2. What should happen in case of a crisis?
  3. Which types of crises are likely to cause trouble?
  4. What do you think will happen if Jimbo dies in a car accident today?
  5. What do you think SHOULD happen?
  6. How should the board change so that everything survives such a terrible event?

Obviously, the main crisis would occur if a group of terrorists broke into the colocation facility in Florida and destroyed all the servers, but I believe there are other situations that could be problematic.

I would like to see here a list of possible crisis types: what they could imply, and how we could best react to each, so as not to impair our wonderful project.

I invite you to list everything you can think of and to propose suggestions. Be creative :-)

Anthere 14:47, 26 Jan 2005 (UTC)


Technical contingency planning


The details below were written in response to the potential problems posed by Hurricane Charley, which was expected to hit Tampa, Florida, in August 2004.

Backups

  • Any backup tarballs containing sensitive data should not be world-readable (see the permissions sketch after this list)
  • Innodb data (100 GB): JeLuF
  • En images: Jeronim (rsynced the images home, to be current)
  • Non-en images: JeLuF
  • Private backups (83MB): Tim's uni machine, also Jamesday
  • /home/wikipedia: 150 MB uncompressed, available as a 30 MB backup at /var/backup.tar.bz2; an encrypted version is at backup.tar.bz2.x (stored by Tim Starling and Jamesday)
  • /home/wikipedia/backup/private: Jeronim has the user tables
  • /etc: Tim Starling
  • /usr/local from rabanus: Shaihulud
  • /usr/local: Shaihulud will make a tarball of this. /usr/local on zwinger is mostly Apache logs, which can be deleted.
  • /home/wikipedia/src is not backed up
  • Redundant backups have not been made for all files
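
As a minimal sketch of the first point above, a backup tarball can be created so that it is never world-readable, not even briefly while being written. The paths here are hypothetical, not the real backup locations:

 import os
 import tarfile

 # Hypothetical paths; the real backup locations are listed above.
 SOURCE_DIR = "/home/wikipedia/backup/private"
 DEST_TARBALL = "/var/backups/private.tar.bz2"

 # Create the tarball under a restrictive umask so the file is
 # owner-only from the moment it exists.
 old_umask = os.umask(0o077)
 try:
     with tarfile.open(DEST_TARBALL, "w:bz2") as tar:
         tar.add(SOURCE_DIR, arcname="private")
 finally:
     os.umask(old_umask)

 # Belt and braces: owner read/write only.
 os.chmod(DEST_TARBALL, 0o600)

Setting the umask first matters: a chmod after the fact would leave a window during which the tarball is world-readable.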

Colo


The colo staff appear confident that the building will not be affected by the hurricane. The facility is on two different power grids in downtown Tampa; however, both were deliberately taken down by the power company on the morning of 13 August. We are currently [sometime around 14 August] running on generators. The colo has diesel for 48 hours and a two-hour refill contract, which may not be worth much under these conditions.

Offsite slave database


Restoring everything from the above backups would take days. In future, all database content ought to be replicated into an offsite slave DB server, and the miscellaneous backups should be kept decompressed there.

Jimbo and Shaihulud have both mentioned running DB slaves at home.

Currently, the servers in Europe cannot take the entire load. However, we may be able to have a dedicated backup machine within the French server cluster once that is set up, with a regular schedule of backups to it.
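
As a rough sketch of how such an offsite slave's health could be watched, assuming the MySQLdb driver and made-up connection details (the 3600-second lag threshold is arbitrary):

 import MySQLdb          # third-party MySQL-python / mysqlclient driver
 import MySQLdb.cursors

 # Hypothetical connection details for the offsite slave.
 conn = MySQLdb.connect(host="slave.example.org", user="monitor",
                        passwd="secret")
 cur = conn.cursor(MySQLdb.cursors.DictCursor)
 cur.execute("SHOW SLAVE STATUS")
 status = cur.fetchone()

 if status is None:
     print("Replication is not configured on this server.")
 else:
     io_ok = status["Slave_IO_Running"] == "Yes"
     sql_ok = status["Slave_SQL_Running"] == "Yes"
     lag = status["Seconds_Behind_Master"]  # None if replication is broken
     if not (io_ok and sql_ok):
         print("Replication threads stopped; the slave needs attention.")
     elif lag is None or lag > 3600:
         print("Slave is badly lagged: %r seconds behind." % lag)
     else:
         print("Slave healthy, %d seconds behind master." % lag)

 conn.close()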

For short-term emergencies, the Oregon State University Open Source Lab can help and can provide three to four machines at short notice. Contact: scott At osuosl dotorg

Read-only requirements


The bare minimum required to run read-only is one database server, one web server, and one Squid: three to five machines in total, one of them fairly decent as a DB server. An IDE RAID with plenty of RAM would do for the database. If we made a static HTML dump, it should be easy to serve from three machines.
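
To illustrate how little software the static-dump case needs, here is a sketch using only the Python standard library; the dump path and port are hypothetical:

 import http.server
 import os

 # Hypothetical location of a static HTML dump of the wiki.
 DUMP_DIR = "/var/www/static-dump"
 PORT = 8080

 os.chdir(DUMP_DIR)

 # SimpleHTTPRequestHandler serves files from the current working
 # directory and nothing else: single-purpose and read-only, which is
 # all an emergency fallback needs.
 handler = http.server.SimpleHTTPRequestHandler
 with http.server.ThreadingHTTPServer(("", PORT), handler) as httpd:
     print("Serving static dump on port %d" % PORT)
     httpd.serve_forever()

A real deployment would still want Squid in front for caching, but the fallback itself needs nothing more than this.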

DNS

  • If the downtime were short, there might not be enough time to switch the DNS.
  • The DNS TTLs should be lowered; the TTL for *.wikipedia.org is 3600 seconds, and wikimedia.org's DNS is hosted at GoDaddy. (A TTL-checking sketch follows this list.)
  • Should we have an alternative secondary DNS?
    • One or more of the Paris servers could serve as secondary DNS.
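
As a quick way of auditing the TTLs currently being served, a sketch assuming the third-party dnspython package; the host list is illustrative:

 import dns.resolver  # third-party package: dnspython

 # Illustrative hostnames to audit.
 NAMES = ["wikipedia.org", "en.wikipedia.org", "wikimedia.org"]

 for name in NAMES:
     answer = dns.resolver.resolve(name, "A")
     # A caching resolver reports the *remaining* TTL, counting down
     # from the zone's configured value; query an authoritative
     # nameserver to see the configured figure itself.
     print("%s: A records %s, TTL %d seconds"
           % (name, [r.address for r in answer], answer.rrset.ttl))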