Post by Mark Berryman
Post by Rich Jordan
We are looking at the possibility of putting VMS boxes in two
locations, with Integrity boxes running VSI VMS. This is the very
beginning of the research on the possibility of clustering those two
servers instead of just having them networked. Probably have to be
master/slave since only two nodes and no shared storage.
After reviewing the various cluster docs, they seem to be focused on
older technologies like SONET and DS3 using FDDI bridges (which would
allow shared storage). The prospect has a metropolitan area network
but I do not have any specs on that as yet.
Are there available docs relevant to running a distributed VMS cluster
over a metro area network or fast/big enough VPN tunnel? Or is that
just the straight cluster over IP configuration in the docs (which
we've never used) that we need to concentrate on?
First, I recommend you ignore the suggestions to add a 3rd node to your
cluster. In your situation, it is not really a viable answer.
There are configurations that will allow a member of a 2-node cluster to
automatically continue in the event that the other node fails. However,
if you lose the communication channel but both nodes stay up, the
cluster will partition and then you have to be really careful about how
you reform the cluster. Because of this, I tend not to recommend this
particular solution except in very specific circumstances.
(Circumstances where you can guarantee the correct node becomes the
shadow master when the cluster reforms and you haven't been writing
different data to each node).
As far as I can tell from your description, the only way clustering
would be a viable answer for you would be if you also did HBVS. In that
case, simply build a 2-node cluster with enough identical disks such
that all of the data you want present at the backup site can be placed
on a host-based shadow set. HBVS will then keep the data at both sites
in sync.
Failure modes in this scenario:
1. Loss of the communication channel. In this case, both nodes will
hang (for the duration of the cluster timeout parameters). More
specifically, each will freeze any process that attempts a write to
disk. As long as the communication channel comes back up before the
cluster times out, everything will resume automatically. If it doesn't,
both nodes should take a CLUEXIT bugcheck. Once the communication
channel is back up, you then bring each node back up as appropriate.
2. Loss of one node. In this case the other node will hang. Manual
intervention is required to get it going again (specifically, a couple
of commands at the console to reset quorum). At that point, everything
simply resumes on that node.
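As a sketch of that manual intervention (the exact keystrokes vary by
platform and console setup, so verify against the VSI docs before relying
on this), the quorum adjustment is typically done through the IPC facility
at the surviving node's console:

```dcl
! On the surviving node's console:
! Press Ctrl/P to interrupt, which presents the IPC> prompt
IPC> Q          ! recalculate quorum so the hung node can proceed
! Then press Ctrl/Z to exit IPC and resume normal operation
```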
The main reason for doing it this way is that recovery becomes a
human decision. In the event of any node or communication failure,
any surviving cluster members will simply stop until you tell them
what to do. The intent is to prevent the wrong node from becoming
the shadow master when (or if) the cluster is reformed.
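For context, the hang-rather-than-continue behavior described above falls
out of the quorum arithmetic. Hypothetical MODPARAMS.DAT entries for each
node in such a two-node, no-quorum-disk cluster might look like this
(values are illustrative, not a recommendation):

```dcl
! SYS$SYSTEM:MODPARAMS.DAT on each node
VOTES = 1               ! each node contributes one vote
EXPECTED_VOTES = 2      ! quorum = (EXPECTED_VOTES + 2) / 2 = 2
                        ! a lone survivor has 1 vote < quorum, so it hangs
RECNXINTERVAL = 60      ! seconds to wait for a lost member to reconnect
```

Run AUTOGEN after editing MODPARAMS.DAT so the values take effect.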
Since you are in contact with VSI, I have no doubt they will cover this
type of scenario with you. This is presented merely as an idea to
generate questions as part of your discussions.
This is all very interesting. If this was being done with virtual
machines, the only work the secondary would be doing is handling HBVS.
In the event of a failure, you could bring up a new primary with the
volumes from the secondary. Why pay for compute you aren't using?