RecoverXenWorld

From Computer Laboratory System Administration
Revision as of 14:32, 16 June 2014 by pb22 (talk | contribs) (→‎Unix VMs: Put in more instructions as to what to do)

Procedure to follow after major local network outage

After a major outage, many of the services that run under Xen may no longer be functioning correctly.

Windows world

The systems most likely to be affected are the Exchange service and Verex. To restart Verex, log in to the server and restart the Verex service.

If Exchange is not working, the usual problem is that the store service has failed. Start the Microsoft Exchange Information Store service on both svr-win-mta1 and svr-win-mta0. Then, in the Exchange system manager, make sure the mailboxes are all mounted (expand server configuration, select mailbox, then click on the server in the top pane). If any are not, right-click on each unmounted store and select mount.

If any machines are down, start them. The first to be started should be svr-win-db0; apart from that, there are no ordering constraints.

Unix VMs

After a problem, a VM can be in any of the following states:

  1. running with a read-only FS
  2. trying to shut down
  3. shut down
  4. unable to be started by XenServer
  5. wedged during OS startup
  6. waiting for user intervention after fsck fails
  7. running, but with higher-level applications that failed to start as expected
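
For states 3 and 4 it can help to see which VMs the host currently considers halted. The following is a minimal sketch only, assuming shell access to the XenServer host (dom0) and the standard xe CLI. The helper name halted_vm_uuids is hypothetical and not part of any existing local tooling.

```shell
#!/bin/sh
# Sketch only: list the UUIDs of halted VMs, one per line.
# Assumes it runs on the XenServer host, where the xe CLI is available.
# "halted_vm_uuids" is a hypothetical helper name.
halted_vm_uuids() {
    # --minimal prints the selected field as a comma-separated list.
    xe vm-list power-state=halted --minimal | tr ',' '\n'
}
```

Each UUID can then be passed to "xe vm-start uuid=<uuid>". Remember the ordering constraint from the Windows section before starting anything in bulk.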

The first step is to ensure that Xen reboots the VM. If the FS is read-only (ro), there is little point in attempting a clean shutdown, which may wedge waiting for an operation that will never complete. If you are logged in, run "touch /tmp.xx || reboot -f -n": the touch fails when the FS is read-only, and only then does the forced reboot run. If using XenCenter, use "Force Reboot".
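
The touch test above needs a writable root FS to succeed, so it leaves a stray file behind on healthy machines. As an alternative, the mount options can be inspected directly. This is a minimal sketch, assuming a Linux guest that exposes /proc/mounts. The function names root_mount_options and is_root_readonly are hypothetical.

```shell
#!/bin/sh
# Sketch only: detect a read-only root FS without writing to it.
# Both function names are hypothetical; the optional argument is a
# /proc/mounts-style file and defaults to /proc/mounts.

# Print the mount options of "/". The last matching entry wins,
# because later mounts shadow earlier ones (e.g. "rootfs").
root_mount_options() {
    awk '$2 == "/" { opts = $4 } END { if (opts != "") print opts }' \
        "${1:-/proc/mounts}"
}

# Succeed if "ro" appears as a whole option, so that
# "errors=remount-ro" on a read-write mount does not match.
is_root_readonly() {
    case ",$(root_mount_options "$1")," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

A root shell could then run "is_root_readonly && reboot -f -n", which forces a reboot only when the root FS is actually read-only.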

Old systems may ask for the root password before giving a single-user shell. At the shell prompt, type "fsck -y /dev/xvda1; exit", which will fix the FS and cause the system to reboot.
More recent systems may offer options such as Fix the FS or Manually fix the FS. If this happens, press "F".
BUG: some systems appear to ignore the "F". Until we find a fix, an expert is needed to fsck the FS from a dom0 or another domU.