Briefly: we have three Dell R710 servers running XenServer Enterprise Edition 5.6 FP1, connected to a Dell PS4000E iSCSI SAN. XenServer High Availability (HA) is also enabled, but more on our set-up another day.
For a few weeks now we have been experiencing “random” reboots of our poolmaster. I believe the technical name for this is “fencing” (see the excellent post Customizing XenServer’s HA Pool Timeout Settings to learn more).
Previously we enabled HA using XenCenter and experienced these random reboots almost every week, sometimes twice in a day. After researching the topic, we began to enable HA from the command line instead:
xe pool-ha-enable heartbeat-sr-uuids=28b3e5c4-44de-8384-7c1a-25144f6f9396 ha-config:timeout=120
This was fine for several weeks, but now we are experiencing fencing again and I can’t explain why. Our poolmaster has rebooted (fenced) on the following dates:
28/Feb/12 13:54 XEN01
28/Feb/12 15:33 XEN03
07/Mar/12 11:37 XEN01
21/Mar/12 11:33 XEN01
What to do? Today we have increased the HA time-out by another 30 seconds:
xe pool-ha-enable heartbeat-sr-uuids=28b3e5c4-44de-8384-7c1a-25144f6f9396 ha-config:timeout=150
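For anyone repeating this, here is a rough sketch of how the change could be scripted. It assumes HA is already enabled and therefore has to be switched off with xe pool-ha-disable before it can be re-enabled with a different time-out; that is our understanding rather than something we have confirmed in the documentation. The function names and the APPLY variable are made up for this example. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: re-enable HA with a new time-out.
# Dry run by default; set APPLY=1 to actually execute the xe commands.

SR_UUID="28b3e5c4-44de-8384-7c1a-25144f6f9396"   # heartbeat SR used in this post

# Print the two xe commands, one per line.
# $1 = heartbeat SR UUID, $2 = time-out in seconds
ha_commands() {
    echo "xe pool-ha-disable"
    echo "xe pool-ha-enable heartbeat-sr-uuids=$1 ha-config:timeout=$2"
}

# Run (or just show) the commands in order.
set_ha_timeout() {
    ha_commands "$1" "$2" | while read -r cmd; do
        if [ "${APPLY:-0}" = "1" ]; then
            # stdin is redirected so the command cannot swallow the remaining lines
            $cmd </dev/null
        else
            echo "would run: $cmd"
        fi
    done
}

set_ha_timeout "$SR_UUID" 150
```

Note that the pool is unprotected between the two commands, so this is best done in a quiet period.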
We are going to monitor this carefully. So far, this is what we have done to try to resolve the “fencing” problem:
1. Management Interface – When enabling HA it is recommended to bond 2 NICs. The management interface is critical to HA operations, so it should be bonded and, if possible, connected to two switches for redundancy.
2. Enable HA using the command line – specifying a time-out of more than 30 seconds. We have tried 120 and have now increased this to 150. This process seems to be trial and error.
3. Check NTP settings – which we are going to do in the next post.