RecoverXenWorld

From Computer Laboratory System Administration

Procedure to follow after major outage

After a major outage (network fault, filer problems, etc.) the many services run under Xen may no longer be functioning correctly. The common fault is that the kernel finds that writes to a "local" disc time out, deems there to be a fault, and marks the file system as read-only. The services may then continue running, but are unable to log their use or accept changes (e.g. web servers, DNS, LDAP, etc.). The three stages to recovery are:

  1. reboot the failed machines
  2. check that they are "powered up" ("running" as far as XenServer is concerned)
  3. check that they are "running normally" (multi-user OS, not stuck in fsck or some such)

The first can be done by enumerating all Linux hosts (e.g. "cl-onserver --xe cl-vm-status run | grep -v ad.cl") and running "touch /tmp/XX || reboot -f -n" on each, e.g. over ssh from an omnipotent server, or by using "Force Reboot" in XenCenter.
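The ssh variant of this step can be sketched as a small loop. This is only an illustration: the host list file (one hostname per line), the function name reboot_if_readonly and the use of root@ logins are assumptions, not part of the local tooling. The remote command is the one quoted above, so a host is rebooted only if the test write fails.

```shell
#!/bin/sh
# Sketch: force-reboot every listed host whose filesystem refuses a write.
# $1 is a hypothetical file with one hostname per line.
# SSH can be overridden (e.g. SSH=echo) for a dry run.
reboot_if_readonly() {
    hosts_file=$1
    while read -r host; do
        # Skip blank lines in the host list.
        [ -n "$host" ] || continue
        # -n stops ssh from consuming the rest of the host list on stdin.
        # root@ is an assumption; reboot needs root on the target.
        ${SSH:-ssh} -n -o BatchMode=yes -o ConnectTimeout=10 "root@$host" \
            'touch /tmp/XX || reboot -f -n'
    done < "$hosts_file"
}
```

For a dry run, "SSH=echo reboot_if_readonly hosts.txt" prints the arguments that would be passed to ssh instead of connecting.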

The second can be done by looking for "0 1" entries in "cl-onserver --xe cl-vm-status" and starting those systems.

The third can be tested by attempting ssh to all Linux hosts.
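A minimal sketch of that ssh test follows. The host list file and the function name check_ssh are hypothetical, and a successful login that can run "true" is taken here as a rough proxy for "running normally"; it does not prove that the higher-level services are up.

```shell
#!/bin/sh
# Sketch: report which hosts accept an ssh login and can run a command.
# $1 is a hypothetical file with one hostname per line.
# SSH can be overridden for testing.
check_ssh() {
    hosts_file=$1
    while read -r host; do
        # Skip blank lines in the host list.
        [ -n "$host" ] || continue
        # BatchMode avoids hanging on a password prompt;
        # ConnectTimeout bounds the wait for a dead host.
        if ${SSH:-ssh} -n -o BatchMode=yes -o ConnectTimeout=10 \
               "$host" true >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done < "$hosts_file"
}
```

Hosts reported as FAIL are the ones to inspect on the console, since they may be stuck in fsck or wedged during startup.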

Order to start systems

Bring up key servers first, e.g.

  1. name servers, LDAP servers, KDCs
  2. the main Lab web server
  3. any real-time critical systems (e.g. collecting real time data)
  4. shared Lab servers such as slogin-serv*
  5. group servers
  6. test and development systems
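One way to apply this ordering mechanically is sketched below. It assumes a hand-maintained file of "tier vm-name" lines, where the tier number matches the list above; that file, its contents and the function name start_in_order are hypothetical. "xe vm-start" is the XenServer CLI command for starting a VM, but how xe is reached on this site (for example via cl-onserver --xe) is not covered here.

```shell
#!/bin/sh
# Sketch: start VMs in tier order (1 = name servers etc., 6 = test systems).
# $1 is a hypothetical file of "tier vm-name" lines.
# XE can be overridden (e.g. XE=echo) for a dry run.
start_in_order() {
    # Numeric, stable sort on the tier column keeps the file's order within a tier.
    sort -s -n -k1,1 "$1" | while read -r tier vm; do
        # Skip blank or malformed lines.
        [ -n "$vm" ] || continue
        echo "tier $tier: starting $vm"
        # </dev/null stops the command from consuming the rest of the list.
        ${XE:-xe} vm-start vm="$vm" </dev/null
    done
}
```

Running it with XE=echo first shows the order in which VMs would be started without touching the pool.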

Windows world

The most likely systems to be affected are the Exchange service and Verex. To restart Verex, log in to the server and restart the Verex service.

If Exchange is not working, the usual problem is that the store service has failed. Start the Microsoft Exchange Information Store service on both svr-win-mta1 and svr-win-mta0. Then, in the Exchange system manager, make sure the mailboxes are all mounted (expand server configuration, select mailbox, then click on the server in the top pane). If any are not, right-click on each un-mounted store and select mount.

If any machines are down, start them. The first to be started should be svr-win-db0; apart from that there are no ordering constraints.

Unix VMs

After a problem, the machine can be:

  1. running with a read-only FS
  2. trying to shut down
  3. shut down
  4. XenServer is unable to start it
  5. the OS wedges during startup
  6. fsck fails and it waits for user intervention
  7. the higher level applications fail to start as expected

The first step is to ensure that Xen reboots the VM. If the FS is read-only (ro), there is little need to do a clean shutdown, which may wedge waiting for an operation that will never complete. If logged in, run "touch /tmp.xx || reboot -f -n". If using XenCenter, use "Force Reboot".
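When logged in, the read-only state can also be inspected directly rather than inferred from a failed write. The sketch below lists mount points whose options in /proc/mounts include "ro"; the function name readonly_mounts is made up for illustration. Some pseudo file systems are legitimately mounted read-only, so the entry of interest is normally the root FS.

```shell
#!/bin/sh
# Sketch: print mount points currently mounted read-only.
# Reads a file in /proc/mounts format (default: /proc/mounts itself).
readonly_mounts() {
    awk '{
        # Field 4 is the comma-separated option list; field 2 is the mount point.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}
```

If "/" appears in the output of readonly_mounts, the forced reboot described above is the appropriate next step.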

Old systems may ask for a root PW before giving a single-user shell. At the shell prompt, type "fsck -y /dev/xvda1; exit", which will fix the FS and cause the system to reboot.
More recent systems may offer to Fix the FS, allow the user to Manually fix the FS, etc. If this happens, press "F".
BUG: some systems appear to ignore the "F". Until we find a fix, an expert is needed to fsck the FS from a dom0 or another domU.