We have been keeping a log of the times when the pool master reboots itself.
We predicted that the pool master would fence itself on 13 April at about 4:00pm (16:00) and it did!!! If you look at the log above, you will note that the last two reboots took place on the same day at about the same time (6th April and 30th April).
There is a known problem with XenServer where:
Log rotation runs every day at 4:00 AM. In a High Availability (HA) environment, XAPI can suddenly hang and reboot while log rotation is running. Sometimes this causes the host to reboot by self-fencing.
See: XAPI Suddenly Hangs and Reboots by Self-Fence when Log Rotation is Running
After investigation we have concluded that “Log Rotation” is *not* the cause of our problems. We are stuck for a solution. So, for the moment, we have disabled HA. This has stopped the server from fencing and we have not experienced any more problems since.
What will we try next? On the 14th May 2012 we plan to perform a XenServer upgrade from Citrix XenServer 5.6 FP1 to XenServer 5.6 SP2 to see if this will fix the problem.
Since upgrading to XenServer 5.6 SP2 we are no longer experiencing fencing.
I have the same environment with the same problem! But I’m on the SP2 version and the problem continues. I have 12 Cisco blades with XenServer SP2 and a Dell with 8 blades (the Dell blades are the only ones that restart).
I disabled the power management options in the BIOS as suggested by Citrix; however, nothing changed (you can try it to check whether it solves your problem):
In the BIOS menu, set the value for the C-states and Turbo Mode option to Disabled.
I will be opening a case with Citrix to identify what is causing the problem.
When we first experienced this problem over a year ago, we did the following:
(1) We edited the following file
/etc/sysconfig/unplug-vcpus on each Xenserver and set
How to Adjust Virtual CPU Count for Domain0 on XenServer 5.6 Feature Pack 1
(2) Hosts Become Unresponsive with XenServer 5.6 on Nehalem and Westmere CPUs
In the BIOS menu, set the value for the C-states option to Disabled.
In the BIOS menu, set the value for the Turbo Mode option to Disabled.
(3) Set ha-config:timeout=120 (using the command line!!)
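For step (3), the timeout is passed as a parameter when HA is enabled from the xe CLI. Here is a rough sketch, assuming HA is currently disabled and that you replace the placeholder with the UUID of your own heartbeat SR. The helper name ha_enable_cmd and the print-first approach are ours for illustration and are not part of XenServer:

```shell
#!/bin/sh
# Sketch only: builds the xe command that enables HA with a custom timeout.
# It prints the command instead of running it, so it can be reviewed first.
# SR_UUID is a placeholder (assumption); look up the real one with `xe sr-list`.
SR_UUID="${SR_UUID:-<heartbeat-sr-uuid>}"
HA_TIMEOUT="${HA_TIMEOUT:-120}"   # seconds, the value from step (3)

# Hypothetical helper: prints the enable command for a given SR and timeout.
ha_enable_cmd() {
    printf 'xe pool-ha-enable heartbeat-sr-uuids=%s ha-config:timeout=%s\n' "$1" "$2"
}

ha_enable_cmd "$SR_UUID" "$HA_TIMEOUT"
```

Once the printed command looks right, run it on the pool master. As far as we know, if HA is already enabled it has to be switched off first (xe pool-ha-disable) before it can be re-enabled with a different timeout.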
The above seemed to have fixed the problem for about a year. But now the problem has returned.
What are your HA timeout settings?