Disaster Recovery setup

I'm trying to set up a Red Hat cluster that can survive the loss of a whole site.

To illustrate this, see the diagram.

The nodes would serve database instances (not in a RAC fashion) or NFS services, each service running on one node at a time.

The principle would be to use Clustered LVM (CLVM) with LVM mirroring on top of it (though it does not yet support online resizing, and it still needs a 3rd device to carry the mirror metadata).

The main question about this kind of setup (with RHCS) concerns quorum: if one of the 2 sites goes down, the surviving site no longer holds the majority, so the whole cluster goes down.

We already manage a few hundred clusters this way using non-Red Hat clustering software (HP ServiceGuard), which handles the site-loss scenario by providing an "out-of-cluster" tie-breaker on a 3rd site, guaranteeing a split-brain-proof setup.
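To make the quorum concern concrete, here is a small sketch of the majority arithmetic. The node counts are hypothetical (2 nodes per site, 1 vote each), and the tie-breaker is modelled simply as one extra vote on a 3rd site, which is an assumption made for illustration and not a description of how RHCS or ServiceGuard implement it.

```shell
#!/bin/sh
# Majority quorum: more than half of the expected votes must be present.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Report whether a partition holding $2 votes out of $1 expected keeps quorum.
has_quorum() {
    if [ "$2" -ge "$(quorum "$1")" ]; then
        echo "quorate"
    else
        echo "inquorate"
    fi
}

# Hypothetical layout: 2 nodes per site, 1 vote each, no tie-breaker.
# One site lost: the survivors hold 2 of 4 votes.
has_quorum 4 2

# Same layout plus 1 tie-breaker vote hosted on a 3rd site (assumption).
# One site lost: 2 nodes + tie-breaker = 3 of 5 votes.
has_quorum 5 3
```

Without the tie-breaker the surviving site holds exactly half of the votes, which is not a majority. With it, the surviving site plus the 3rd site still form one.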


LVM2 over MDADM over DM-Multipath

As mentioned in my previous post, we faced several problems when mirroring directly with LVM2, among them:

  1. Unable to lvextend a mirrored LV without breaking the mirror

  2. Unable to keep a mirrored LV in sync across reboots without using a 3rd disk for the log

We finally decided to explore mdadm as a way around these problems.


The first question was whether to use DM-MP or MDADM for multipathing.

• DM-MP:

It supports a lot of FC disk arrays and could be used (we didn't test it) with iSCSI devices.

It supports real round-robin multipathing (multibus), allowing better performance.

• Multipathing with MDADM:

No round-robin multipathing option, only failover with manual fallback.

We decided to go with DM-MP.


As mentioned above, the mirroring will be made through mdadm.

The question was whether or not to use "fd Linux raid autodetect" partitions.

We first set things up using fd partitions, but after rebooting the servers we ran into a blocking problem.

As mdadm is started in rc.sysinit before DM-MP on RHEL 4 (and 5), we ended up with md raid1 arrays built on /dev/sdX devices rather than /dev/dm devices.

Instead of hacking the rc.sysinit script, we removed the fd partitions and created an /etc/init.d/mdadm script that starts mdadm after dm-multipath is loaded. The raid1 arrays are defined in the /etc/mdadm.conf file.

No "fd Linux RAID autodetect" partitions.

Volume management

LVM2 is used for volume management because it is more flexible than plain partitions.

Detailed setup


Our test server is dual FC connected to an HP EVA 8100 array. Both LUNs (VDISKs) come from the same disk array, but the goal in production is to present to the server one LUN from each of 2 different disk arrays located on 2 different sites.

Below are the contents of /etc/multipath.conf:

defaults {
        polling_interval        5
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        no_path_retry           fail
}

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
}

devices {
        device {
                vendor          "(HITACHI|HP)"
                product         "OPEN-.*"
                getuid_callout  "/sbin/scsi_id -g -u -s /block/%n"
        }
        device {
                vendor          "HP"
                product         "HSV2[10]0"
                getuid_callout  "/sbin/scsi_id -g -u -s /block/%n"
        }
}


Check that Multipath is working fine:

root@hostname# multipath -ll -v2

[size=5 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=8][active]
\_ 0:0:2:4 sdr 65:16 [active][ready]
\_ 0:0:3:4 sdt 65:48 [active][ready]
\_ 0:0:4:4 sdv 65:80 [active][ready]
\_ 0:0:5:4 sdx 65:112 [active][ready]
\_ 1:0:2:4 sdz 65:144 [active][ready]
\_ 1:0:3:4 sdab 65:176 [active][ready]
\_ 1:0:4:4 sdad 65:208 [active][ready]
\_ 1:0:5:4 sdaf 65:240 [active][ready]

[size=5 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=8][active]
\_ 0:0:2:3 sdq 65:0 [active][ready]
\_ 0:0:3:3 sds 65:32 [active][ready]
\_ 0:0:4:3 sdu 65:64 [active][ready]
\_ 0:0:5:3 sdw 65:96 [active][ready]
\_ 1:0:2:3 sdy 65:128 [active][ready]
\_ 1:0:3:3 sdaa 65:160 [active][ready]
\_ 1:0:4:3 sdac 65:192 [active][ready]
\_ 1:0:5:3 sdae 65:224 [active][ready]

RAID with mdadm

Create the md array using the disks' WWIDs (UUIDs) rather than the dm aliases, as the alias names could change upon reboot.

root@hostname# mdadm --create --verbose /dev/md1 --level=1 --raid-devices=2 /dev/mapper/3600508b40006baca0000c00002c80000 /dev/mapper/3600508b40006baca0000c00002c30000

Check that the mirror is OK:

root@hostname# mdadm --detail /dev/md1
        Version : 00.90.01
  Creation Time : Tue Aug 12 18:27:28 2008
     Raid Level : raid1
     Array Size : 5242816 (5.00 GiB 5.37 GB)
    Device Size : 5242816 (5.00 GiB 5.37 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Aug 12 18:31:44 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 6bb39d9f:7f66b358:9a9584f1:6b3e0114
         Events : 0.1

    Number   Major   Minor   RaidDevice   State
       0      253      3         0        active sync   /dev/dm-3
       1      253      4         1        active sync   /dev/dm-4

Put the raid1 array in /etc/mdadm.conf:
DEVICE /dev/mapper/*
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=6bb39d9f:7f66b358:9a9584f1:6b3e0114

Be careful to always use UUIDs rather than dm-X or sdX names, as these can change upon reboot.
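After a reboot it is worth checking that no array was assembled on raw /dev/sdX paths behind multipath's back (the problem we hit with the fd partitions). A minimal sketch, assuming the usual /proc/mdstat layout; the function name md_on_sd is made up for this example:

```shell
#!/bin/sh
# Print the name of every md array that has at least one plain sdX member
# instead of dm-X members.
# $1: file in /proc/mdstat format (defaults to /proc/mdstat itself).
md_on_sd() {
    awk '$2 == ":" {
        # Array lines look like: md1 : active raid1 dm-4[1] dm-3[0]
        for (i = 3; i <= NF; i++)
            if ($i ~ /^sd[a-z]+[0-9]*\[/) { print $1; break }
    }' "${1:-/proc/mdstat}"
}
```

Calling md_on_sd with no argument reads the live /proc/mdstat. No output means every array sits on dm devices, while any name printed points at an array to stop and reassemble on top of multipath.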

Volume management

The LVM part of the job is the usual one, except that we use /dev/md1 as the PV.

root@hostname# pvcreate /dev/md1

Then the VG:

root@hostname# vgcreate ORAvg /dev/md1

and then the LVs (lvcreate needs a size and a name; <size> below is a placeholder for the size you want):

root@hostname# lvcreate -L <size> -n lvoldata ORAvg

and finally the appropriate filesystem:

root@hostname# mke2fs -j /dev/ORAvg/lvoldata
root@hostname# tune2fs -c 0 /dev/ORAvg/lvoldata
root@hostname# tune2fs -i 0 /dev/ORAvg/lvoldata

Do not forget to create an /etc/init.d/mdadm script that is started after the /etc/init.d/multipath script.

For the start action, this script must run mdadm -A -s $DEVICE, where DEVICE is each of the arrays defined in /etc/mdadm.conf.

The stop action fails if run as is: LVM has to be stopped first, manually.
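Here is a minimal sketch of what such an /etc/init.d/mdadm script could look like. It is not the exact script we run: the chkconfig priorities are hypothetical (the start priority only has to be higher than that of your multipath script), and the helper name arrays is made up for this example.

```shell
#!/bin/sh
# chkconfig: 345 20 80    (hypothetical values; must start after multipath)
# description: assemble md arrays on top of dm-multipath devices

MDADM=${MDADM:-/sbin/mdadm}
CONF=${CONF:-/etc/mdadm.conf}

# Device names taken from the ARRAY lines of the config file.
arrays() {
    awk '$1 == "ARRAY" { print $2 }' "$CONF"
}

start() {
    for dev in $(arrays); do
        "$MDADM" -A -s "$dev" || return 1
    done
}

stop() {
    # The volume groups sitting on the arrays must be deactivated first
    # (vgchange -a n), otherwise mdadm refuses to stop the busy device.
    for dev in $(arrays); do
        "$MDADM" -S "$dev" || return 1
    done
}

# Dispatch only when an action is given, so the file can also be sourced.
if [ $# -gt 0 ]; then
    case "$1" in
        start) start ;;
        stop)  stop ;;
        *)     echo "Usage: $0 {start|stop}" >&2; exit 1 ;;
    esac
fi
```

Reading the array names from /etc/mdadm.conf keeps the script free of hard-coded devices, so adding a mirror only means adding an ARRAY line.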



As mentioned in the intro, a few hours of surfing later, some LVM questions remain unanswered.

We manage a few hundred Linux (RHEL4 and RHEL5) systems in a production environment, as well as HP-UX and Solaris machines, and we decided to review our Linux strategy by promoting it to mission-critical level.

The idea is to bring some critical central databases to the Linux platform.

Our experience with this kind of function being mainly on HP-UX, LVM is something we are used to working with.

We first assumed the LVM2 implementation was quite similar to the HP one, but we discovered some "unexpected features" ;-o.

The Mirror Question:

Let's assume we have a server dual connected to 2 FC disk arrays located on 2 different sites to be DR compliant. This server belongs to a cluster, with a backup node on the DR site.
This machine has its 2 internal system disks on a hardware RAID controller.
It has 2 FC disks from the arrays on which we are planning to install the application (a database).
Our idea was to create several LV's mirrored on these 2 FC disks, as we would have done on HP-UX.

Indeed, LVM2 allowed us to create mirrored LVs (lvconvert -m1 OraVG/lvoldata...), but there are some constraints:
  1. The metadata location: corelog or disk.
  • Disk: OK, but it is necessarily a third disk. The problem is that, if we want to remain DR compliant, this disk has to be on both arrays, which means it has to be mirrored. Where do we put its metadata? The never-ending question!
  • Corelog: The metadata is kept in memory, fine, but when the server reboots, the whole mirror is resynced as if it were being built for the first time. That would be OK for small filesystems, but large data filesystems would take a while to rebuild and generate some load.
Compared to HP-UX LVM, which holds the metadata in its own structures, for what reason (a good one, I guess) did the LVM2 development team implement the mirror feature this way?

  2. The Mapper Question (not very important, just for my knowledge): why is this mapper thing added to the device names when the /dev/vgname/lvolX devices exist? And by the way, why not address things directly that way rather than through /dev/mapper/vgname-lvolX?

Anyway, we did things differently: we used mdadm to mirror the FC LUNs, then created the PV on the resulting md device, and then created the LVs.

About mdadm, there is a multipath option, we didn't use it. DM-Multipath allowed us to be in a round-robin fashion.

It would have been nice to be able to use LVM2 directly on top of DM-Multipath, without having to insert the md layer in the middle.



Day 1, August 2nd 2008

Welcome to my blog.

After spending long hours surfing for solutions to various technical questions on topics like LVM2, I decided to create this humble blog so that people can share their ideas and concerns on such subjects.