Phone call at 8 am
This morning I received a call from an operator telling me that the services were down.
The system is made up of:
* two Sun servers (X4450) with CentOS 5.X, Xen dom0,
* one DRBD resource for each service,
* a first cluster that controls DRBD and Xen,
* a second cluster that controls the services (over the first level).
The servers are connected using bonded (trunked) pairs of NICs, both to a layer 3 switch (HP ProCurve 2824) and to one another.
All the DRBD resources were in the correct state except one, which was in a stalled state.
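To spot the odd resource quickly, the connection states in /proc/drbd can be filtered. The following is only a sketch: the function name drbd_not_connected is my own, and it assumes the DRBD 8.x layout of /proc/drbd, where each device line starts with the minor number and carries a cs: field.

```shell
# Print "<minor> <connection-state>" for every DRBD device whose
# connection state is not "Connected".
# Reads /proc/drbd-style text on stdin (DRBD 8.x layout assumed).
drbd_not_connected() {
    awk '$1 ~ /^[0-9]+:$/ {
        minor = $1
        sub(/:$/, "", minor)              # "1:" -> "1"
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^cs:/) {
                cs = substr($i, 4)        # strip the "cs:" prefix
                if (cs != "Connected") print minor, cs
            }
        }
    }'
}

# Usage on a dom0 with the DRBD module loaded:
#   drbd_not_connected < /proc/drbd
```

An empty output means every device reports cs:Connected; note that a device in the middle of a resync is listed too, since its state is SyncSource or SyncTarget.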
I put all the services in cluster level-2 into ‘standby’ mode with:
# crm node standby <vm-domU-lev2>
Then I did the same for cluster level-1:
# crm node standby <vm-domU-lev1>
Then I tried to restart the ‘stalled’ DRBD resource manually. I followed the instructions for “manual split brain recovery”, although this case is different because the two servers have distinct roles:
dom0-a# /etc/init.d/drbd start
dom0-b# /etc/init.d/drbd start
Set the resource to primary on one side and secondary on the other:
dom0-a# drbdadm primary <resource>
dom0-b# drbdadm secondary <resource>
Disconnect the “wrong” side and reconnect it (to force a resync):
dom0-b# drbdadm disconnect <resource>
dom0-b# drbdadm -- --discard-my-data connect <resource>
We observed the sync process starting, but after a short time the resource went back into the ‘stalled’ state. I repeated the last steps a few times, always with the same result.
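Instead of staring at /proc/drbd by hand, the resync progress can be pulled out with two small helpers. This is a rough sketch with function names of my own choosing; it assumes the DRBD 8.3 progress output, which as far as I know prints a "sync'ed: N.N%" figure and the word "stalled" when the resync makes no progress.

```shell
# Print the resync percentage(s) found in /proc/drbd-style text on stdin.
sync_progress() {
    sed -n "s/.*sync'ed: *\([0-9.]*\)%.*/\1/p"
}

# Exit 0 if the text on stdin reports a stalled resync, non-zero otherwise.
sync_is_stalled() {
    grep -q 'stalled'
}

# Usage on the side that is resyncing:
#   sync_progress < /proc/drbd
#   sync_is_stalled < /proc/drbd && echo "resync is stalled"
```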
After 2 hours of testing I suspected that the switch had got into a confused state, so I forced a reboot of it. This action was decisive :)
After some investigation I discovered that in December 2012 HP released a firmware upgrade that fixes some problems with this switch …