sdoro Blog

First Day of 2013

Phone call at 8 am

This morning I received a call from an operator telling me that the services were down.

The system is made up of:

* two Sun servers (X4450) running CentOS 5.X as Xen dom0,
* one DRBD resource for each service,
* a first cluster that controls DRBD and Xen,
* a second cluster that controls the services (on top of the first level).

The servers are connected through trunked/bonded pairs of NICs, both to a layer 3 switch (HP ProCurve 2824) and directly to one another.
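With bonded NICs in the path, a look at the state of each slave interface is a cheap first check. The following is only a sketch: the function name `bond_slaves` and the default bond name `bond0` are assumptions (the post does not name the bond devices), and it relies on the usual layout of the Linux bonding driver status file.

```shell
# List every slave of a bond with its MII status and link failure count.
# Assumption: the bond is called bond0 unless another status file is given.
bond_slaves() {
    awk -F': ' '
        /^Slave Interface:/               { slave = $2 }
        /^MII Status:/ && slave           { status = $2 }
        /^Link Failure Count:/ && slave   { print slave, status, $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Calling `bond_slaves` on each dom0 prints one line per slave, for example the interface name followed by `up` or `down` and the failure counter. A healthy-looking result here does not rule out a problem inside the switch itself.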

All the DRBD resources were in the correct state except one, which was in a stalled state.
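A quick way to spot the odd resource is to filter `/proc/drbd` for devices whose connection state is not `Connected`. This is a minimal sketch that assumes the DRBD 8.x layout of that file (lines such as ` 0: cs:Connected ro:...`); the function name `drbd_not_connected` is made up for this example.

```shell
# Print "<minor> <connection state>" for every DRBD device that is
# not in the Connected state. Reads /proc/drbd unless a file is given.
drbd_not_connected() {
    awk '$1 ~ /^[0-9]+:$/ && $2 ~ /^cs:/ {
        sub(/:$/, "", $1)      # "1:" -> "1"
        sub(/^cs:/, "", $2)    # "cs:WFConnection" -> "WFConnection"
        if ($2 != "Connected") print $1, $2
    }' "${1:-/proc/drbd}"
}
```

An empty output means every device reports `Connected`; anything else (a device waiting for its peer or in the middle of a resync) is listed with its minor number.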

I put all the services in the level-2 cluster into ‘standby’ mode with:

    # crm node standby <vm-domU-lev2>

Then I did the same for the level-1 cluster:

    # crm node standby <vm-domU-lev1>

Then I tried to restart the ‘stalled’ DRBD resource manually. I followed the instructions for “manual split brain recovery”, although this case is different because the two servers have distinct roles. Start DRBD on both nodes:

    dom0-a# /etc/init.d/drbd start
    dom0-b# /etc/init.d/drbd start

Set the resource to primary on one node and secondary on the other:

    dom0-a# drbdadm primary <resource>
    dom0-b# drbdadm secondary <resource>

Disconnect the “wrong” side and reconnect it, discarding its data (to force a resync):

    dom0-b# drbdadm disconnect <resource>
    dom0-b# drbdadm -- --discard-my-data connect <resource>

The sync process started, but after a short time the resource went back into the ‘stalled’ state. I repeated the last steps a few times, always with the same result.
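Rather than watching the resync by eye, the stall can be detected with a one-line check. This sketch assumes that the kernel marks a resync that makes no progress with the word `stalled` in `/proc/drbd`; the function name `drbd_sync_stalled` is hypothetical.

```shell
# Succeed (exit status 0) if the status file contains the word "stalled".
# Reads /proc/drbd unless another file is given.
drbd_sync_stalled() {
    grep -qw 'stalled' "${1:-/proc/drbd}"
}
```

For example, `drbd_sync_stalled && echo "resync stalled"` can be run periodically on the node that is receiving the data.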

After 2 hours of testing I suspected that the switch itself was in a confused state, so I forced a reboot of it. This action was decisive :)

After some investigation I discovered that in December 2012 HP released a firmware upgrade that fixes some problems with this switch …